Re: Defacto standard string library



Richard Tobin wrote:
> In article <871vvi11vi.fsf@xxxxxxxxxxxxxxxxxxxx>,
> Phil Carmody <thefatphil_demunged@xxxxxxxxxxx> wrote:
>> Technically, you can't throw them away even if you do know the file is
>> UTF-8 (or UTF-16), because it's possible that there was no BOM and the
>> user content actually started with a ZWNBSP...
>
> Possible, but recommended against:
>
> http://unicode.org/faq/utf_bom.html#bom7
> <<<
> Q: I am using a protocol that has BOM at the start of text. How do I
> represent an initial ZWNBSP?
>
> A: Use U+2060 WORD JOINER instead. [MD]
>
> Or use the sequence twice: the first will be interpreted as a BOM, the
> second as a ZWNBSP. But why on earth would you want it anyway?

It is extremely unlikely, which is one of the reasons the ZWNBSP was chosen as the BOM. The particular code point for the ZWNBSP (U+FEFF) was chosen, IIRC, because the bytes of its UTF-16LE and UTF-16BE encodings (FF FE and FE FF) can never appear in valid UTF-8, so a leading BOM tells you which of the three UTFs is in use -- but it can't definitively tell you that the data isn't in some other encoding.
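
For what it's worth, the sniffing described above might look roughly like the sketch below. The names detect_bom and bom_kind are made up for illustration and are not from any standard library. It only covers the three UTFs discussed here; UTF-32 is deliberately ignored, so a UTF-32LE BOM (FF FE 00 00) would be misreported as UTF-16LE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; not part of any standard library. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16BE, BOM_UTF16LE };

/*
 * Compare the first bytes of buf against the three encodings of U+FEFF.
 * On return, *skip holds the number of bytes occupied by ONE leading BOM
 * (0 if none was found). Assumption: UTF-32 is out of scope, so
 * FF FE 00 00 is reported as UTF-16LE.
 */
enum bom_kind detect_bom(const unsigned char *buf, size_t len, size_t *skip)
{
    assert(buf != NULL || len == 0);
    assert(skip != NULL);

    *skip = 0;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *skip = 3;              /* U+FEFF encoded as UTF-8 */
        return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *skip = 2;              /* U+FEFF encoded as UTF-16BE */
        return BOM_UTF16BE;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *skip = 2;              /* U+FEFF encoded as UTF-16LE */
        return BOM_UTF16LE;
    }
    /* No BOM: this says nothing about which encoding is actually in use. */
    return BOM_NONE;
}
```

A caller would skip *skip bytes once and then decode the rest normally. Because only one BOM is consumed, a doubled sequence keeps its second copy as a real ZWNBSP, which matches the FAQ's suggestion quoted above.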

S