Re: encoding problem?



On Sep 27, 12:27 pm, braeds...@xxxxxxxxxxx wrote:
> I am trying to use perl on the command line to process text files in
> various ways, one of which is to decode html entities. As far as I can
> see, the following line should work
>
> perl -MHTML::Entities -p -e 'decode_entities($_)' <input.txt >output.txt
>
> it does indeed change the html entities, but not into the required
> characters, rather into pairs of unusual characters; and the command
> line returns this:
>
> Wide character in print, <> line 1.
>
> It seems to me it is something to do with internal character encoding
> being messed up but I can't work out how to control it.

Before you can control it you need to know what it is.

> The text files
> processed have MacOS character encoding which is required in the
> finished file,

What is "MacOS character encoding"?

> but perhaps I need to convert to UTF8 before processing
> and back again after?

Perl will do this automatically if you tell it the encoding of the
input and output.

> perl -MHTML::Entities -p -e 'decode_entities($_)' <input.txt >output.txt

I think you need something like

perl -MHTML::Entities -p -e 'BEGIN { binmode STDIN,
":encoding(whatever)"; binmode STDOUT, ":encoding(whatever)" }
decode_entities($_)'

Where "whatever" is the name Perl uses for that which you are calling
"MacOS character encoding".

For a list of supported encodings:

perldoc Encode::Supported
