Re: Library function to detect UTF-8 streams without BOM




"marek jedlinski" <marekjed@xxxxxxxxxxxxxxxxx> wrote in message
news:ths4m3t6ljucpt12ne9638k5sc2bvtq6sp@xxxxxxxxxx

I've been testing several Unicode-capable shareware editors for
Windows (can't find one that's quite right for my work), and none
has any problems detecting BOM-less UTF-8, even in non-xml/html
files, where they cannot rely on the encoding specified in the file
itself.

Unless the encoding is specified by a BOM or explicitly (and accurately)
inside the content, it has to be determined by analyzing the bytes of the
content and guessing which encoding might be in use. A couple of MSDN blog
posts describe how Notepad tries to auto-detect the encoding, for instance:

Some files come up strange in Notepad
http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx

The Notepad file encoding problem, redux
http://blogs.msdn.com/oldnewthing/archive/2007/04/17/2158334.aspx
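
To illustrate the kind of guessing involved, here is a minimal sketch in
Python. It is not how Notepad or any particular editor does it. The function
name guess_encoding and its return labels are made up for this example. It
relies on the fact that text in a legacy 8-bit code page rarely happens to
form valid UTF-8 multi-byte sequences, so a buffer that decodes cleanly and
contains at least one non-ASCII byte is probably UTF-8.

```python
def guess_encoding(data: bytes) -> str:
    """Heuristic guess for a byte buffer (hypothetical helper).

    Returns one of 'utf-8-sig', 'ascii', 'utf-8', or 'unknown'.
    """
    # A UTF-8 BOM settles the question immediately.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"

    # Strict decoding rejects stray continuation bytes, overlong forms,
    # and truncated sequences.
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "unknown"

    # Pure 7-bit data is valid UTF-8 but also valid in most code pages,
    # so report it separately rather than claiming UTF-8.
    if all(b < 0x80 for b in data):
        return "ascii"

    return "utf-8"


if __name__ == "__main__":
    print(guess_encoding(b"plain text"))
    print(guess_encoding("za\u017c\u00f3\u0142\u0107".encode("utf-8")))
    print(guess_encoding("za\u017c\u00f3\u0142\u0107".encode("cp1250")))
```

This is still only a guess: a short file in some other encoding can pass the
check by accident, and a buffer that was cut off in the middle of a
multi-byte character will be reported as 'unknown' even if the full file is
UTF-8.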


Gambit

