Re: How to detect text file encoding in Perl



On Sun, 21 May 2006, corff@xxxxxxxxxxxxxxxxxx wrote:

> Google is probably your friend. If not: <B>yte <O>rder <M>ark.

http://www.unicode.org/faq/utf_bom.html#BOM

> store your data as UTF-8, or your data _is_ UTF-8, you'll see that after
> storing the bytecount is two bytes more because the byte 0xff 0xef get
> prepended automatically,

The BOM is the Unicode character U+FEFF in whatever encoding the text
uses. No way is it 0xff 0xef. The various encoded byte patterns are
shown in that Unicode FAQ, and in utf-8 it's *three* bytes: 0xEF 0xBB
0xBF.
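
For anyone who would rather check the byte patterns than take them on
trust, here is a minimal sketch that encodes U+FEFF with the core Encode
module. The helper name bom_hex is made up for this illustration, and the
encoding names are the ones Encode documents.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the bytes of U+FEFF in the given encoding,
# returned as space-separated uppercase hex.
sub bom_hex {
    my ($encoding) = @_;
    my $bom   = "\x{FEFF}";
    my $bytes = encode($encoding, $bom);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# The explicit -BE/-LE names are used so that Encode does not add a
# BOM of its own in front of the character being encoded.
for my $encoding (qw(UTF-8 UTF-16BE UTF-16LE UTF-32BE UTF-32LE)) {
    printf "%-8s %s\n", $encoding, bom_hex($encoding);
}
```

None of the results is 0xff 0xef, and the utf-8 form is the only one
that is three bytes long.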

> in order to tell the software which byte order is to be expected.

"No, a BOM can be used as a signature no matter how the Unicode text
is transformed"

> This makes sense with UCS-2 Unicode (the "original" Unicode
> encoding)

Yes, but "UCS-2" is out of date:
http://www.unicode.org/faq/basic_q.html#23

The utf-16 encoding form is its present counterpart.

> but not with UTF-8 (8-bit transformation format of Unicode) because
> the characters encoded in UTF-8 are self-synchronizing and no
> information about byte order is needed.

Nevertheless, the Unicode FAQ points out that utf-8 can usefully
start with a BOM as an encoding signature.
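
Using the BOM as an encoding signature could look roughly like the
sketch below. It only compares the leading bytes of a buffer against the
known patterns. The name sniff_bom and the table layout are assumptions
for illustration, not anything taken from the FAQ or from a module.

```perl
use strict;
use warnings;

# BOM signatures, longest first, so that UTF-32LE (FF FE 00 00) is not
# mistaken for UTF-16LE (FF FE).
my @BOMS = (
    [ 'UTF-32BE' => "\x00\x00\xFE\xFF" ],
    [ 'UTF-32LE' => "\xFF\xFE\x00\x00" ],
    [ 'UTF-8'    => "\xEF\xBB\xBF"     ],
    [ 'UTF-16BE' => "\xFE\xFF"         ],
    [ 'UTF-16LE' => "\xFF\xFE"         ],
);

# Hypothetical helper: takes raw bytes (for example the first four
# bytes read from a filehandle in binmode) and returns the encoding
# name and the BOM length, or an empty list if no BOM is found.
sub sniff_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($name, $signature) = @$entry;
        if (substr($bytes, 0, length $signature) eq $signature) {
            return ($name, length $signature);
        }
    }
    return;
}

my ($name, $length) = sniff_bom("\xEF\xBB\xBFhello");
print "$name, BOM is $length bytes\n";
```

An empty result only means there is no signature. It says nothing about
which encoding the data is actually in. A utf-16 little-endian text that
starts with U+0000 right after its BOM would be reported here as
UTF-32LE, so this is a heuristic and not a proof.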

> In contrast, other programs behaving correctly frequently complain
> if the BOM appears where it simply doesn't belong.

Except that it is not inherently incorrect for it to appear at the
beginning of a utf-8 stream - but see the cited FAQ for details.
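
A consumer that wants to tolerate the signature instead of complaining
about it can drop one leading U+FEFF after decoding. This is a minimal
sketch under that assumption. The name decode_utf8_skip_bom is
hypothetical, and the decoding is done with the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode utf-8 bytes into a character string and
# remove a single U+FEFF if it is the very first character.
sub decode_utf8_skip_bom {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    $text =~ s/\A\x{FEFF}//;
    return $text;
}

# Input with and without the three-byte signature gives the same text.
print decode_utf8_skip_bom("\xEF\xBB\xBFabc"), "\n";
print decode_utf8_skip_bom("abc"), "\n";
```

Only the first character is touched, so a U+FEFF anywhere else in the
text is left alone.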

Seems to me you would have done well to read that FAQ yourself, before
putting misleading opinions on the record.

regards

--

Beware of negative easements.