Grep and Unicode

04/20/09

08:51:06 pm, by truewill
Categories: Tips, Windows, Tools, PowerShell

I really like grep. The “Containing text” option in Windows Search seldom seems to give me the results I want. Grep works, it’s fast, and it lets me use regular expressions.

I’m used to an old version of Borland’s Turbo GREP that shipped with Delphi. It’s old, it doesn’t work well when files have long lines, and it doesn’t support Unicode (particularly UTF-16). Microsoft’s SQL Server Management Studio has a nasty habit of saving SQL text files as UTF-16, so I don’t always find the saved query I’m looking for.

I found out that Windows XP (and up) has a utility called FINDSTR that acts much like grep (type “help findstr” in a command prompt for more info). Unfortunately, it doesn’t support Unicode either. See http://stackoverflow.com/questions/408079/findstr-or-grep-that-autodetects-chararacter-encoding-utf-16.
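To see why a byte-oriented search misses these files, here is a small Python sketch. It is only an illustration, not how FINDSTR or Turbo GREP work internally, and the sample query text is made up. It encodes the same string as UTF-8 and as UTF-16 (little-endian with a byte order mark, which is an assumption about how such files are typically saved on Windows) and then looks for an ASCII pattern in the raw bytes.

```python
import codecs

# Hypothetical sample text standing in for a saved query.
query = "SELECT name FROM customers"

utf8_bytes = query.encode("utf-8")
# Little-endian UTF-16 with an explicit byte order mark in front.
utf16_bytes = codecs.BOM_UTF16_LE + query.encode("utf-16-le")

# A tool that compares raw bytes effectively searches for this.
pattern = b"SELECT"

found_in_utf8 = pattern in utf8_bytes
found_in_utf16 = pattern in utf16_bytes

print("UTF-8 match: ", found_in_utf8)
print("UTF-16 match:", found_in_utf16)
print("First UTF-16 bytes:", utf16_bytes[:8])
```

In UTF-16 every ASCII character is followed by a zero byte, so the bytes for "SELECT" never appear next to each other and the plain byte search comes up empty.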

PowerShell comes to the rescue. Its Select-String cmdlet appears to support Unicode/UTF-16, at least if the byte order mark is present. See http://kevin-berridge.blogspot.com/2008/06/powershell-grep.html. I think the first comment on that post is half right; the issue is that “ls” in PowerShell is an alias for Get-ChildItem, which returns a collection of FileInfo and DirectoryInfo objects rather than text. In UNIX, the output of ls is just text, so that’s all grep can operate on. PowerShell pipes objects, so it is more powerful (albeit sometimes trickier).
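The byte order mark idea can be sketched outside PowerShell too. The Python below is a rough illustration and makes no claim about how PowerShell does it: it peeks at the first bytes of a file, picks a decoder based on the byte order mark, and then runs a regular expression over the decoded lines. The names detect_encoding and grep_file are made up for this example, files without a mark are assumed to be UTF-8, and UTF-32 is left out to keep it short.

```python
import codecs
import re
from pathlib import Path

# Byte order marks this sketch recognizes, with the Python codec to use.
# "utf-8-sig" and "utf-16" both consume the mark while decoding.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(raw, default="utf-8"):
    """Return a codec name based on the byte order mark, if there is one."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return default  # assumption: files without a mark are UTF-8


def grep_file(path, pattern):
    """Return (line number, line) pairs for lines matching the regex."""
    raw = Path(path).read_bytes()
    text = raw.decode(detect_encoding(raw), errors="replace")
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]
```

Calling grep_file on a path with a pattern such as r"SELECT" would then return the matching lines whether the file was saved as UTF-8 or as UTF-16 with a byte order mark. A UTF-16 file saved without the mark would still be misread, which matches the caveat above.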

Comments, Pingbacks:

Comment from: Kevin Berridge · http://kevin-berridge.blogspot.com
Your pingback finally prompted me to make a few minor tweaks to that post, as well as answer the first comment.

Thanks!
04/21/09 @ 06:59
Comment from: Ron James · http://www.chess.uk.com/stock-management-software
I think 'findstr' doesn't work well with non-English text in PowerShell. A friend of mine who is Chinese tried to use Chinese characters with 'findstr' like this: PS C:\> ${c:\test.txt}="’†•¶". 'findstr' did not find any of the Chinese characters. What do you think went wrong? How can we fix this?
02/11/10 @ 09:31

Development Central

Development Central is the blog of Bill Sorensen, a professional software developer. Much of this will relate to C#, .NET, and OOP in general.

Disclaimer
These postings are provided "AS IS" with no warranties and confer no rights.
