Writing UTF-8 file under Windows



Happy New Year,

Whatever I try in order to write a UTF-8 file, I always end up with
UTF-16LE, with the "FF FE" BOM at the beginning and 2 bytes per character.

I am reading strings from an external resource and trying to write them
to files.

my $string_with_special_chars = "Château Müller\nGarçon";
open my $fh, ">:utf8", "test.txt" or die "Cannot open test.txt: $!";
print $fh $string_with_special_chars;
close $fh or die "Cannot close test.txt: $!";

I tried it on both Linux (Perl 5.8.6) and Windows (Perl 5.8.7).
(In case you cannot see it: the string contains the characters that
correspond to the HTML entities acirc, uuml and ccedil.)

Opening test.txt with my editor (Ultra-Edit) shows me the correct
string, but in hex view I see the "FF FE" BOM and it shows
2 bytes per character, e.g. 0x43 0x00 for the 'C' and
0xE7 0x00 for the ccedil.
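
An editor may convert a file while loading it, so its hex view does not
necessarily show what is on disk. The bytes can also be checked from Perl
itself. This is only a minimal sketch: the helper name hex_head and the
file name hex_head_test.txt are made up for illustration, and the sample
string uses an escape (\x{e2}) so that the encoding of the script file
plays no role.

```perl
use strict;
use warnings;

# Return the first $count bytes of a file as a hex string.
# The file is read in :raw mode so no decoding layer touches the bytes.
sub hex_head {
    my ($path, $count) = @_;
    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read $in, my $buf, $count;
    die "Cannot read $path: $!" unless defined $got;
    close $in;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $buf);
}

# Write a sample file through a UTF-8 output layer.
# \x{e2} is U+00E2 (a with circumflex).
my $path = 'hex_head_test.txt';    # hypothetical file name
open my $out, '>:encoding(UTF-8)', $path or die "Cannot open $path: $!";
print {$out} "Ch\x{e2}teau";
close $out or die "Cannot close $path: $!";

print hex_head($path, 8), "\n";    # 43 68 c3 a2 74 65 61 75
```

If the output starts with "ff fe", the BOM really is in the file. If it
starts with "43 68 c3 a2", the file on disk is UTF-8 and the 2-byte view
comes from whatever displays it.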

Normally I am reading data via LDAP, so 'use utf8' is not required.
If I add it here, I get:
Malformed UTF-8 character (unexpected non-continuation byte 0x74,
immediately after start byte 0xe2) at ./test.pl line 4.
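
The two bytes named in that message, 0xe2 followed by 0x74, are exactly
what the "ât" in "Château" looks like in ISO-8859-1. That would fit a
script file saved as Latin-1 rather than UTF-8, although I have not
verified this for my test.pl. A small sketch to compare the two encodings
(hex_bytes is a helper invented for this example):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Format a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $bytes);
}

# "\x{e2}t" is the "ât" from "Château", written with an escape so the
# encoding of this script file plays no role.
my $fragment = "\x{e2}t";

my $latin1 = hex_bytes(encode('ISO-8859-1', $fragment));
my $utf8   = hex_bytes(encode('UTF-8',      $fragment));

print "ISO-8859-1: $latin1\n";    # e2 74
print "UTF-8:      $utf8\n";      # c3 a2 74
```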

I tried to make sure my input strings are correctly decoded etc., but
to no avail.
As long as my strings stay within 7-bit ASCII everything is fine, but
beyond that Perl always thinks it has to write a BOM and encode in a
2-byte format.
Using Encode to write UTF-8 results in double encoding, or at least
in some unreadable characters.
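
For reference, this is how double encoding can arise. The sketch below is
an illustration, not my actual script: it encodes the same character twice
with Encode. Printing octets that are already UTF-8 through a ":utf8"
output layer has the same effect as the second call. The helper hex_bytes
is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Format a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $bytes);
}

my $char  = "\x{e2}";                  # U+00E2, one character
my $once  = encode('UTF-8', $char);    # correct: two octets
my $twice = encode('UTF-8', $once);    # each octet is treated as a character again

print "encoded once:  ", hex_bytes($once),  "\n";    # c3 a2
print "encoded twice: ", hex_bytes($twice), "\n";    # c3 83 c2 a2
```

So a string should be encoded either by Encode or by the output layer,
not by both.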

Where does the BOM come from?
Why does Perl add it?
Doesn't Perl write UTF-8 by default?

Thank you for any hints. This issue has already cost me days, and yes,
I have read a lot about Perl and Unicode.

Tony
