Re: Library function to detect UTF-8 streams without BOM




"marek jedlinski" <marekjed@xxxxxxxxxxxxxxxxx> wrote in message
news:ths4m3t6ljucpt12ne9638k5sc2bvtq6sp@xxxxxxxxxx

> I've been testing several Unicode-capable shareware editors for
> Windows (can't find one that's quite right for my work), and none
> has any problems detecting BOM-less UTF-8, even in non-xml/html
> files, where they cannot rely on the encoding specified in the file
> itself.

If the encoding is not specified by a BOM or explicitly (and accurately)
inside the content, it has to be determined by analyzing the content and
guessing which encoding might be in use. Some blog posts on MSDN describe
how Notepad tries to auto-detect the encoding, for instance:

Some files come up strange in Notepad
http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx

The Notepad file encoding problem, redux
http://blogs.msdn.com/oldnewthing/archive/2007/04/17/2158334.aspx
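
As a rough illustration of that kind of guessing, here is a minimal sketch in
Python. It is not how Notepad or any particular editor does it; the function
name guess_encoding and its return labels are made up for this example. It
checks for a UTF-8 BOM first, then tests whether the bytes form valid UTF-8
by attempting a strict decode. Text in a legacy 8-bit code page rarely
happens to be valid UTF-8, which is why this simple test tends to work well
for BOM-less files.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: classify a byte buffer by simple inspection.

    Returns one of "utf-8-bom", "ascii", "utf-8", or "unknown".
    The labels are illustrative, not a standard.
    """
    # An explicit UTF-8 BOM (EF BB BF) settles the question immediately.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"

    # A strict decode raises on any byte sequence that is not valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"

    # Pure 7-bit data is valid UTF-8 too, but carries no evidence either way.
    if all(b < 0x80 for b in data):
        return "ascii"

    # Non-ASCII bytes were present and all of them formed valid sequences.
    return "utf-8"


if __name__ == "__main__":
    print(guess_encoding("héllo".encode("utf-8")))
    print(guess_encoding("héllo".encode("latin-1")))
```

Note that this assumes the whole buffer is available. When only the first
part of a stream is examined, a multi-byte sequence cut off at the end of
the buffer would make the strict decode fail, so a real implementation
would need to allow for a truncated final character.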


Gambit

