Re: Yet another unicode question: windows platform
- From: "Alexei A. Frounze" <alexfru@xxxxxxx>
- Date: Mon, 30 Oct 2006 19:49:48 -0800
tewilk@xxxxxxxxx wrote:
I'm still searching for the answer but I can't find it yet so maybe
someone can point me in the right direction. This is my problem...
I'm reading MS SQL Server errorlog file- sql7 and sql2000 are standard
ascii flat files that I can process with no problems. Then comes
SQL2005... now the file is in unicode format. After a lot of reading
(web and perl docs) I ran across my answer...
http://blogs.msdn.com/brettsh/archive/2006/06/07/620986.aspx , the
link talks about writing but I was able to use the same concept to
read the file (open errorlogFH,"<:raw:encoding(UTF16-LE):crlf:utf8",
"$TempErrorlog";) which worked fine. BTW... the information from this
link was more helpful than the perldocs but maybe it's just me :o
Now here is my issue that I'm sure is a no-brainer for someone out
there... prior to my open, how can I check the file? Is it plain text
so that I can use the standard open OR is it unicode so that I know
that I need to use the "encoding" method?
Also if anyone knows of some good information that has worked for them
as it pertains to unicode, please post so that I can check it out.
Well, in general, if you don't know the type of a file (ASCII, UTF8, UTF16LE/BE, UTF32LE/BE, some non-ASCII non-Unicode 8/16-bit encoding), you have to check it against every type you support, and if you find that the file contains, say, valid UTF8, then so be it.
A few hints... Unicode text files may begin with a so-called BOM (Byte Order Mark). Notepad usually (if not always) puts it at the beginning of a saved Unicode text file. It's a different sequence of bytes for UTF8, UTF16LE, UTF16BE, etc. If you find it, you can validate the rest of the file assuming the Unicode format that the BOM indicates. The Unicode standard describes the valid "code point" number ranges. If you find something outside these ranges, it's not Unicode or the file is corrupt.
To find out whether the file is plain ASCII, just check that all bytes in it are in the range 0...127. If a file doesn't look like ASCII or Unicode, it's either some other 8-bit or 16-bit encoding or it's corrupt. Btw, 7-bit ASCII is a subset of UTF8.
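The checks described above could be sketched in Perl roughly as follows. This is only a minimal sketch, not a complete detector: guess_encoding is a hypothetical helper name, it reads the whole file into memory, and it only looks for a BOM, plain ASCII, and BOM-less UTF8. It does not try to recognise BOM-less UTF16/UTF32 or legacy 8-bit code pages, and a UTF16LE file whose first character after the BOM is U+0000 would be misreported as UTF32LE.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not from the original post): guess a text file's
# encoding from its bytes. Returns a name usable in an ":encoding(...)"
# layer, 'ascii', or undef when none of the checks match.
sub guess_encoding {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };    # slurp the raw bytes
    close $fh;
    $bytes = '' unless defined $bytes;

    # Check the 4-byte BOMs before the 2-byte ones: the UTF-32LE BOM
    # (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    my @boms = (
        [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
        [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
        [ "\xEF\xBB\xBF",     'UTF-8'    ],
        [ "\xFE\xFF",         'UTF-16BE' ],
        [ "\xFF\xFE",         'UTF-16LE' ],
    );
    for my $bom (@boms) {
        my ($mark, $name) = @$bom;
        return $name if substr($bytes, 0, length $mark) eq $mark;
    }

    # No BOM: plain ASCII means every byte is in the range 0..127.
    return 'ascii' if $bytes !~ /[^\x00-\x7F]/;

    # Otherwise test whether the bytes decode as strict UTF-8;
    # decode() dies on malformed input when FB_CROAK is set.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 'UTF-8' : undef;
}
```

For a SQL2005 errorlog this would presumably return UTF-16LE, and the result can then be used to choose between a plain open and one with an ":encoding(...)" layer. Note that with an explicit-endian name such as UTF-16LE the BOM is not stripped on reading; it shows up as the character U+FEFF at the start of the first line.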
I highly suggest that you read the Unicode documentation on the Unicode website: http://www.unicode.org. Must-reads are the Unicode FAQ and "To the BMP and Beyond!" by Eric Muller, which must be somewhere on the net. I suggest that you start with the latter to get an overall idea of Unicode quickly. The ultimate source of information is the Unicode standard itself.
Alex