Unicode BOM marks

From: Francis Girard (francis.girard_at_free.fr)
Date: 03/07/05


To: python-list@python.org
Date: Mon, 7 Mar 2005 20:24:42 +0100

Hi,

For the first time in my programming life, I have to deal with character
encoding. I have a question about BOMs (byte order marks).

If I understand correctly, when text is encoded as UTF-8, some systems add a
BOM at the beginning of the file (Windows?) and some don't (Linux?).
Therefore, the exact same text encoded in the same UTF-8 will result in two
different binary files of slightly different lengths.
Right?
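
To make the size difference concrete, here is a minimal sketch. It assumes a current Python 3, where the "utf-8-sig" codec writes the three-byte UTF-8 signature and plain "utf-8" does not; the sample string is only an illustration.

```python
import codecs

text = "héllo"  # arbitrary sample text

plain = text.encode("utf-8")        # no signature written
signed = text.encode("utf-8-sig")   # same text, prefixed with the UTF-8 signature

# The two byte strings differ only by the leading signature bytes.
print(len(plain), len(signed))
print(signed[:len(codecs.BOM_UTF8)] == codecs.BOM_UTF8)
print(signed[len(codecs.BOM_UTF8):] == plain)
```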

I guess that this leading BOM consists of special marker bytes that can never
be decoded as valid text.
Right?
(I really, really hope the answer is yes; otherwise we're in hell when moving
files from one platform to another, even with the same Unicode encoding.)
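
This guess can be checked directly. A small sketch, assuming a current Python 3: it decodes the UTF-8 signature bytes on their own with the plain "utf-8" codec and looks up the name of the resulting character. On the interpreters I know of, the bytes decode without error to the single code point U+FEFF, so they are not "undecodable" in the sense hoped for above.

```python
import codecs
import unicodedata

# Decode only the signature bytes, using the plain UTF-8 codec.
decoded = codecs.BOM_UTF8.decode("utf-8")

print(repr(decoded))              # the decoded string, shown escaped
print(len(decoded))               # number of code points produced
print(unicodedata.name(decoded))  # Unicode name of that code point
```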

I also guess that this leading BOM is silently ignored by any Unicode-aware
file stream reader that has already been told that the file follows the UTF-8
encoding standard.
Right?

If so, is that the case with the Python codecs decoders?
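
One way to find out is to try it. The sketch below assumes a current Python 3 and compares the plain "utf-8" codec with "utf-8-sig" (a codec added to Python after this message was written). As far as I can tell, the plain codec keeps the mark as a leading U+FEFF character, while "utf-8-sig" drops it if present and also accepts input without one.

```python
import codecs

data = codecs.BOM_UTF8 + "hello".encode("utf-8")

kept = data.decode("utf-8")          # signature survives as a leading character
stripped = data.decode("utf-8-sig")  # signature is removed while decoding
no_bom = b"hello".decode("utf-8-sig")  # input without a signature is accepted too

print(repr(kept))
print(repr(stripped))
print(repr(no_bom))
```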

In the Python documentation, I see these constants. The documentation is not
clear about which encodings these constants apply to. Here's my understanding:

BOM : UTF-8 only or UTF-8 and UTF-32 ?
BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_UTF8 : UTF-8 only
BOM_UTF16 : UTF-16 only
BOM_UTF16_BE : UTF-16 only
BOM_UTF16_LE : UTF-16 only
BOM_UTF32 : UTF-32 only
BOM_UTF32_BE : UTF-32 only
BOM_UTF32_LE : UTF-32 only
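
The values can also be inspected rather than guessed. A short sketch, assuming a current Python 3 where these constants are bytes objects. It prints each constant and then compares the unqualified names (BOM, BOM_BE, BOM_LE) against the UTF-16 ones, which is where I would expect them to match.

```python
import codecs

names = [
    "BOM", "BOM_BE", "BOM_LE",
    "BOM_UTF8",
    "BOM_UTF16", "BOM_UTF16_BE", "BOM_UTF16_LE",
    "BOM_UTF32", "BOM_UTF32_BE", "BOM_UTF32_LE",
]

# Look each constant up on the codecs module and show its raw bytes.
values = {name: getattr(codecs, name) for name in names}
for name in names:
    print(f"{name:13} {values[name]!r}")

# Compare the unqualified names with the UTF-16 constants.
print(values["BOM"] == values["BOM_UTF16"])
print(values["BOM_BE"] == values["BOM_UTF16_BE"])
print(values["BOM_LE"] == values["BOM_UTF16_LE"])
```

BOM_UTF16 and BOM_UTF32 without a suffix follow the byte order of the machine running the code, so their printed values can differ between platforms.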

Why would I need these constants if the codecs decoders can handle BOMs
without my help, given only the encoding?
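
One plausible use is the case where the encoding is not known in advance and has to be guessed from the first bytes of the data. The sketch below is only an illustration: sniff_bom and decode_with_bom are hypothetical helper names, not part of the codecs module, and the fallback to UTF-8 when no mark is found is an assumption. The UTF-32 marks are tested first because the little-endian UTF-32 mark begins with the same two bytes as the little-endian UTF-16 mark.

```python
import codecs

# Longest marks first, so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data, default="utf-8"):
    """Return (encoding, mark_length) guessed from a leading BOM.

    Hypothetical helper: falls back to `default` with length 0
    when the data does not start with any known mark.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return default, 0


def decode_with_bom(data, default="utf-8"):
    """Decode bytes, skipping a leading BOM if one is recognised."""
    encoding, skip = sniff_bom(data, default)
    return data[skip:].decode(encoding)


sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(sniff_bom(sample))
print(decode_with_bom(sample))
```

The guess is not foolproof: UTF-16 LE data whose first character after the mark is U+0000 would look like UTF-32 LE to this check.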

Thank you

Francis Girard
