Re: Yet another unicode question: windows platform



tewilk@xxxxxxxxx wrote:
I'm still searching for the answer but I can't find it yet so maybe
someone can point me in the right direction. This is my problem...
I'm reading MS SQL Server errorlog files. The sql7 and sql2000 ones are standard
ascii flat files that I can process with no problems. Then comes
SQL2005... now the file is in unicode format. After a lot of reading
(web and perl docs) I ran across my answer...
http://blogs.msdn.com/brettsh/archive/2006/06/07/620986.aspx , the
link talks about writing but I was able to use the same concept to
read the file (open errorlogFH,"<:raw:encoding(UTF16-LE):crlf:utf8",
"$TempErrorlog";) which worked fine. BTW... the information from this
link was more helpful than the perldocs but maybe it's just me :o

Now here is my issue that I'm sure is a no-brainer for someone out
there... prior to my open, how can I check the file? Is it plain text
so that I can use the standard open OR is it unicode so that I know
that I need to use the "encoding" method?

Also if anyone knows of some good information that has worked for them
as it pertains to unicode, please post so that I can check it out.

Well, in general, if you don't know the type of file (ASCII, UTF-8, UTF-16LE/BE, UTF-32LE/BE, or some non-ASCII, non-Unicode 8/16-bit encoding), you have to check it against every type you support, and if you find that the file contains, say, valid UTF-8, then so be it. A few hints...

Unicode text files may begin with a so-called BOM (Byte Order Mark). Notepad usually (if not always) puts it at the beginning of a saved Unicode text file. It's a different sequence of bytes for each format: EF BB BF for UTF-8, FF FE for UTF-16LE, FE FF for UTF-16BE, FF FE 00 00 for UTF-32LE and 00 00 FE FF for UTF-32BE. If you find one, you can validate the rest of the file on the assumption that it uses the format the BOM indicates.

The Unicode standard describes the valid "code point" ranges (U+0000 through U+10FFFF, with the surrogate range U+D800 through U+DFFF never standing for a character on its own). If you find something outside these ranges, it's not Unicode or the file is corrupt.

To find out whether the file is plain ASCII, just check that all bytes in it are in the range 0...127. If a file doesn't look like ASCII or Unicode, it's either some other 8-bit or 16-bit encoding or it's corrupt. Btw, 7-bit ASCII is a subset of UTF-8.
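
To make the checks above concrete, here is a rough sketch of the detection order: look for a BOM, validate the rest against it, then fall back to the ASCII range test and a strict UTF-8 decode. It is written in Python rather than Perl purely for illustration. The names guess_encoding and read_text are made up for this example, and a file without a BOM that is neither ASCII nor valid UTF-8 is simply reported as unknown; the same logic should port to Perl without much trouble.

```python
import codecs

# BOM table. The UTF-32 entries come first because the UTF-32LE BOM
# (FF FE 00 00) starts with the same two bytes as the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data: bytes):
    """Return a codec name guessed from the raw bytes, or None if unknown."""
    for bom, name in BOMS:
        if data.startswith(bom):
            try:
                # Validate the rest of the data in the format the BOM claims.
                data[len(bom):].decode(name)
            except UnicodeDecodeError:
                continue  # does not validate; try the remaining candidates
            return name
    # No usable BOM: plain ASCII means every byte is in the range 0..127.
    if all(b < 128 for b in data):
        return "ascii"
    # A strict decode rejects malformed UTF-8 sequences.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"


def read_text(path):
    """Read a whole file as text using the guessed encoding."""
    with open(path, "rb") as fh:
        data = fh.read()
    name = guess_encoding(data)
    if name is None:
        raise ValueError("unrecognised encoding: " + path)
    text = data.decode(name)
    # These codecs leave a decoded BOM in place as U+FEFF; drop it.
    return text[1:] if text.startswith("\ufeff") else text
```

With this, guess_encoding(b"\xff\xfeh\x00i\x00") returns "utf-16-le" and guess_encoding(b"plain") returns "ascii". Note that a BOM-less UTF-16 file is not detected at all by this sketch; handling that would need some heuristic, such as counting NUL bytes in even or odd positions.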

I highly suggest that you read the Unicode documentation on the Unicode website: http://www.unicode.org. Must-reads are the Unicode FAQ and "To the BMP and Beyond!" by Eric Muller, which must be somewhere on the net. I suggest that you start with the latter to get an overall idea of Unicode quickly. The ultimate source of information is the Unicode standard itself.

Alex
