Re: Unicode BOM marks

From: Martin v. Löwis (martin_at_v.loewis.de)
Date: 03/07/05


Date: Mon, 07 Mar 2005 21:54:04 +0100

Francis Girard wrote:
> If I understand well, into the UTF-8 unicode binary representation, some
> systems add at the beginning of the file a BOM mark (Windows?), some don't.
> (Linux?). Therefore, the exact same text encoded in the same UTF-8 will
> result in two different binary files, and of a slightly different length.
> Right ?

Mostly correct. I would prefer that people refer to the thing not as
"BOM" but as "UTF-8 signature", at least in the context of UTF-8, as
UTF-8 has no byte-order issues that a "byte order mark" would deal with.
(It is correct to call it "BOM" in the context of UTF-16 or UTF-32.)

Also, "some systems" is imprecise. It is not so much the operating
system that decides whether to add the UTF-8 signature, but rather
the application writing the file. Any high-quality tool will accept
the file with or without the signature, whether it is a tool on Windows
or a tool on Unix.

I personally would write my applications so that they put the signature
into files that cannot be concatenated meaningfully (since the
signature simplifies encoding auto-detection) and leave out the
signature from files which can be concatenated (as concatenating the
files will put the signature in the middle of a file).
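
The concatenation problem can be sketched as follows. This is only an
illustration in present-day Python 3; the helper name with_signature is
made up for the example and is not part of any library.

```python
import codecs

def with_signature(text):
    """Encode text as UTF-8 with a leading UTF-8 signature (hypothetical helper)."""
    return codecs.BOM_UTF8 + text.encode("utf-8")

part1 = with_signature("first\n")
part2 = with_signature("second\n")

# Joining the byte strings, as a tool like cat would, leaves the second
# signature in the middle of the result.
combined = part1 + part2
decoded = combined.decode("utf-8")

# The plain UTF-8 codec reports each signature as a U+FEFF character.
print(repr(decoded))
```

A reader that only strips a leading signature would still see the second
U+FEFF in the middle of the text.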

> I guess that this leading BOM mark are special marking bytes that can't be, in
> no way, decoded as valid text.
> Right ?

Wrong. The BOM mark decodes as U+FEFF:

>>> import codecs
>>> codecs.BOM_UTF8.decode("utf-8")
u'\ufeff'

This is what makes it a byte order mark: in UTF-16, you can tell the
byte order by checking whether the first code unit reads as FEFF or
FFFE. The code point U+FFFE is a noncharacter, which cannot be decoded
as valid text (although the Python codec will decode it as invalid text).
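
As a rough sketch of that check in present-day Python 3 (the function
name utf16_byte_order is hypothetical, and the sketch deliberately
ignores UTF-32, whose little-endian mark begins with the same two bytes):

```python
import codecs

def utf16_byte_order(data):
    """Return 'big', 'little', or None from the first two bytes (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return None  # no mark present; the byte order must be known some other way

sample_be = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
sample_le = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(utf16_byte_order(sample_be), utf16_byte_order(sample_le))
```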

> I also guess that this leading BOM mark is silently ignored by any unicode
> aware file stream reader to which we already indicated that the file follows
> the UTF-8 encoding standard.
> Right ?

No. It should eventually be ignored by the application, but whether the
stream reader special-cases it or not depends on application needs.

> If so, is it the case with the python codecs decoder ?

No; the Python UTF-8 codec is unaware of the UTF-8 signature. It reports
it to the application when it finds it, and it will never generate the
signature on its own. So processing the UTF-8 signature is left to the
application in Python.
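
One way an application might handle the signature itself, sketched in
present-day Python 3. The helper name decode_utf8_skip_signature is
invented for this example.

```python
import codecs

def decode_utf8_skip_signature(data):
    """Decode UTF-8 bytes, dropping a leading signature if present (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

signed = codecs.BOM_UTF8 + "text".encode("utf-8")
plain = "text".encode("utf-8")

# The plain "utf-8" codec hands the signature to the caller as U+FEFF ...
print(repr(signed.decode("utf-8")))
# ... while the helper yields the same text for both inputs.
print(decode_utf8_skip_signature(signed) == decode_utf8_skip_signature(plain))
```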

> In python documentation, I see these constants. The documentation is not clear
> to which encoding these constants apply. Here's my understanding :
>
> BOM : UTF-8 only or UTF-8 and UTF-32 ?

UTF-16.

> BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
> BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?

UTF-16.
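
A quick way to confirm this from the interpreter, sketched for
present-day Python 3 (the variable names are only for illustration):

```python
import codecs
import sys

# BOM_BE and BOM_LE are the UTF-16 marks for each byte order.
be_is_utf16 = codecs.BOM_BE == codecs.BOM_UTF16_BE == b"\xfe\xff"
le_is_utf16 = codecs.BOM_LE == codecs.BOM_UTF16_LE == b"\xff\xfe"

# BOM is the UTF-16 mark in the platform's native byte order.
native = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
bom_is_native = codecs.BOM == codecs.BOM_UTF16 == native

print(be_is_utf16, le_is_utf16, bom_is_native)
```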

> Why should I need these constants if codecs decoder can handle them without my
> help, only specifying the encoding ?

Well, because the codecs don't. It might be useful to add a
"utf-8-signature" codec some day, which generates the signature on
encoding, and removes it on decoding.
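
Later Python releases ship a codec with this behaviour under the name
"utf-8-sig". A minimal sketch of how it behaves in present-day Python 3:

```python
import codecs

# Encoding with "utf-8-sig" prepends the UTF-8 signature.
encoded = "text".encode("utf-8-sig")
print(encoded.startswith(codecs.BOM_UTF8))

# Decoding with "utf-8-sig" removes a leading signature if there is one,
# and accepts input without a signature as well.
from_signed = encoded.decode("utf-8-sig")
from_plain = b"text".decode("utf-8-sig")
print(from_signed == from_plain)
```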

Regards,
Martin


