Writing a UTF-8 file



Hello everybody,

Does anyone know how I can write UTF-8 files without
a BOM in Perl?

Whether or not I open files in utf8 mode (via the 2nd parameter of
open or via binmode), I always end up with
- A BOM "FF FE" (UTF-16LE afaik) at the start of the output file;
- An encoding with a minimum of 2 bytes per character.

I am reading strings from an external resource, so the following
is not 100% representative but has the same effect:

my $string_with_special_chars = "Château Müller\nGarçon";
# String contains entities acirc, uuml and ccedil.
open(my $fh, ">:utf8", "test.txt") or die "Cannot open test.txt: $!";
print $fh $string_with_special_chars;
close($fh) or die "Cannot close test.txt: $!";

I tried it on both Linux (Perl 5.8.6) and Windows (Perl 5.8.7).

Difference between utf8 mode and default mode:
The file created without explicit utf8 mode is readable in
Firefox (UTF-8 encoding). My hex editor shows that for all
characters the 2nd byte is 0x00.
The file opened with ">:utf8" shows hex C3 00 A2 00 for the
u umlaut, and is 6 bytes longer in total because of the 3 special chars.
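
To rule out the hex editor or any other tool in between, the bytes that Perl
itself writes can be read back in raw mode and dumped. This is only a minimal
sketch: the helper name written_bytes_as_hex is made up for illustration, and
the special character is written as an escape ("\x{E2}" for a circumflex) so
that the result does not depend on the encoding of the script file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original test case): write a string
# through the :utf8 layer, then read the file back as raw bytes and return
# those bytes as space-separated hex.
sub written_bytes_as_hex {
    my ($path, $text) = @_;

    open(my $out, ">:utf8", $path) or die "Cannot open $path: $!";
    print {$out} $text;
    close($out) or die "Cannot close $path: $!";

    open(my $in, "<:raw", $path) or die "Cannot open $path: $!";
    local $/;                       # slurp the whole file at once
    my $bytes = <$in>;
    close($in);

    return join " ", map { sprintf "%02X", ord } split //, $bytes;
}

# "\x{E2}" is a circumflex, given as an escape instead of a literal.
print written_bytes_as_hex("test.txt", "Ch\x{E2}teau"), "\n";
```

If this prints 43 68 C3 A2 74 65 61 75, the file on disk is plain UTF-8 with
no "FF FE" at the start and no 0x00 bytes between the characters.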

Where does the BOM 0xFF 0xFE come from?
Why does Perl add it?
Doesn't Perl write UTF-8 by default?
And why are there 2 or more bytes per character?

I have been puzzling over this for ages (ok, days).

Thank you for any hints.
MP
