Re: How to detect text file encoding in Perl



On Sun, 21 May 2006, corff@xxxxxxxxxxxxxxxxxx wrote:

> Google is probably your friend. If not: <B>yte <O>rder <M>ark.

http://www.unicode.org/faq/utf_bom.html#BOM

> store your data as UTF-8, or your data _is_ UTF-8, you'll see that after
> storing the bytecount is two bytes more because the byte 0xff 0xef get
> prepended automatically,

The BOM is the Unicode character U+FEFF, encoded in whichever encoding
form is in use. No way is it 0xff 0xef. The various encoded byte
patterns are shown in that Unicode FAQ, and in UTF-8 it's *three*
bytes (EF BB BF).
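
To make those byte patterns concrete, here is a small sketch (in Python
rather than Perl, purely for illustration; the helper name bom_bytes is
my own invention, not anything from the FAQ). It encodes the single
character U+FEFF in each encoding form and prints the resulting bytes.

```python
# Encode the single character U+FEFF in several encoding forms
# and show the bytes that result in each case.
BOM_CHAR = "\ufeff"

def bom_bytes(encoding):
    """Return the bytes of U+FEFF in the given encoding (hypothetical helper)."""
    return BOM_CHAR.encode(encoding)

for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    # bytes.hex() with a separator needs Python 3.8 or later.
    print(f"{name:10} {bom_bytes(name).hex(' ')}")
```

The UTF-8 line comes out as "ef bb bf", which is three bytes, and none
of the forms yields 0xff 0xef.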

> in order to tell the software which byte order is to be expected.

"No, a BOM can be used as a signature no matter how the Unicode text
is transformed"

> This makes sense with UCS-2 Unicode (the "original" Unicode
> encoding)

Yes, but "UCS-2" is out of date:
http://www.unicode.org/faq/basic_q.html#23

The UTF-16 encoding form is its present counterpart.
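
The practical difference, as I understand it, is that UTF-16 can reach
code points beyond U+FFFF by means of surrogate pairs, which UCS-2
could not. A rough sketch in Python (the sample character U+1D11E is
merely my choice of example):

```python
# A character outside the Basic Multilingual Plane:
# U+1D11E MUSICAL SYMBOL G CLEF.
clef = "\U0001D11E"

# UTF-16 represents it as two 16-bit code units (a surrogate pair).
encoded = clef.encode("utf-16-be")
units = [int.from_bytes(encoded[i:i + 2], "big")
         for i in range(0, len(encoded), 2)]

print([f"{u:04X}" for u in units])
```

This shows two code units, D834 and DD1E, for one character. A fixed
16-bit encoding has no way to express that code point at all.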

> but not with UTF-8 (8-bit transformation format of Unicode) because
> the characters encoded in UTF-8 are self-synchronizing and no
> information about byte order is needed.

Nevertheless, the Unicode FAQ points out that UTF-8 can usefully
start with a BOM as an encoding signature.
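
As a hedged illustration of using the BOM as a signature, here is a
minimal Python sketch that looks at the first bytes of some data. The
function name sniff_bom and the table layout are my own assumptions;
the BOM constants come from Python's standard codecs module. Finding no
BOM proves nothing about the encoding, so real code would need some
fallback.

```python
import codecs

# Longer signatures are checked first, because the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) when no BOM is found.

    Hypothetical helper for illustration only.
    """
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

For example, sniff_bom(b"\xef\xbb\xbfhello") gives ("utf-8", 3), and the
caller can then decode data[3:] as UTF-8. One known ambiguity: UTF-16-LE
text that begins with a BOM followed by U+0000 would be reported as
UTF-32-LE by this check.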

> In contrast, other programs behaving correctly frequently complain
> if the BOM appears where it simply doesn't belong.

Except that it is not inherently incorrect for a BOM to appear at the
beginning of a UTF-8 stream - but see the cited FAQ for details.

Seems to me you would have done well to read that FAQ yourself, before
putting misleading opinions on the record.

regards

--

Beware of negative easements.