Re: Library function to detect UTF-8 streams without BOM




"Franz-Leo Chomse" <franz-leo.chomse@xxxxxxxx> wrote in message
news:3g05m392tko91h002uoteb281lcj9dldk1@xxxxxxxxxx

For XML files, UNICODE is the default character set,
any other one has to be declared.

If the XML declaration has no encoding declaration (the "encoding"
pseudo-attribute), and no encoding is specified externally (such as in a
MIME header), then the XML has to be encoded in either UTF-8 or UTF-16.
Which of the two is determined by the BOM: UTF-16 must begin with a BOM,
while for UTF-8 the BOM is optional. So if a BOM is present, either UTF-8
or UTF-16 can be used, but if a BOM is not present then UTF-8 must be used.
This is outlined in section 4.3.3 of the XML 1.0 spec.
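
The rule above can be sketched in a few lines of Python. This is only an
illustration, not a library function: the name guess_xml_encoding is made
up, and it assumes the caller has already established that there is no
encoding declaration and no external encoding information. It also does not
try to tell a UTF-32 BOM apart from a UTF-16 one.

```python
import codecs


def guess_xml_encoding(data: bytes) -> str:
    """Hypothetical helper: default encoding of an XML entity that has
    no encoding declaration and no externally supplied encoding."""
    # A UTF-8 BOM is optional, but if present it settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # UTF-16 entities must start with a BOM (either byte order).
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # No BOM: UTF-8 is the only encoding allowed, so just validate it.
    data.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8
    return "utf-8"
```

Anything that fails the final decode is neither BOM-marked nor valid UTF-8,
which for XML means the entity is in error rather than "some other
encoding".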


Gambit

