Re: encoding problem?



On Sep 27, 12:27 pm, braeds...@xxxxxxxxxxx wrote:
I am trying to use Perl on the command line to process text files in
various ways, one of which is to decode HTML entities. As far as I can
see, the following line should work:

perl -MHTML::Entities -p -e 'decode_entities($_)' <input.txt >output.txt

It does indeed change the HTML entities, but not into the required
characters; instead it produces pairs of unusual characters, and the
command line returns this:

Wide character in print, <> line 1.

It seems to me it is something to do with internal character encoding
being messed up but I can't work out how to control it.

Before you can control it you need to know what it is.

The text files processed have MacOS character encoding, which is
required in the finished file,

What is "MacOS character encoding"?

but perhaps I need to convert to UTF-8 before processing
and back again after?

Perl will do this automatically if you tell it the encoding of the
input and output.

perl -MHTML::Entities -p -e 'decode_entities($_)' <input.txt

I think you need something like

perl -MHTML::Entities -p -e 'BEGIN { binmode STDIN,
":encoding(whatever)"; binmode STDOUT, ":encoding(whatever)" }
decode_entities($_)' <input.txt >output.txt

Where "whatever" is the name Perl uses for that which you are calling
"MacOS character encoding".
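
To show what those two binmode layers amount to, here is a minimal sketch that does the same conversion by hand with the core Encode module. It assumes, purely for illustration, that the files are MacRoman (you have not said what the encoding actually is), and it uses a one-entity substitution as a stand-in for decode_entities so that it runs without HTML::Entities. The names round_trip and $fake_decode are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# ASSUMPTION: MacRoman stands in for "whatever"; substitute the real name.
my $ENCODING = 'MacRoman';

# Hypothetical helper: decode bytes to characters, transform, encode back.
sub round_trip {
    my ($bytes, $transform) = @_;
    my $chars = decode($ENCODING, $bytes);    # input bytes -> Perl characters
    $chars = $transform->($chars);            # work on characters only
    return encode($ENCODING, $chars);         # characters -> output bytes
}

# Stand-in for decode_entities: handles only &mdash; (U+2014).
my $fake_decode = sub {
    my ($text) = @_;
    $text =~ s/&mdash;/\x{2014}/g;
    return $text;
};

my $out = round_trip('a&mdash;b', $fake_decode);

# Show the resulting bytes in hex. U+2014 is a wide character inside Perl,
# but it leaves the program as the single MacRoman byte D1.
print join(' ', map { sprintf('%02X', ord($_)) } split(//, $out)), "\n";
# prints: 61 D1 62
```

The "Wide character in print" warning appears when a character above 255 reaches a filehandle that has no encoding layer; encoding explicitly, or letting binmode do it, avoids that.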

For a list of supported encodings:

perldoc Encode::Supported
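
The same information can be queried from inside Perl. A small sketch, assuming the Encode module that ships with Perl; the filter on names beginning with "Mac" is only a guess at which ones matter here:

```perl
use strict;
use warnings;
use Encode;

# Every encoding name this Perl installation can load.
my @all = Encode->encodings(':all');

# ASSUMPTION: names starting with "Mac" are the candidates of interest.
my @mac = grep { /^Mac/ } @all;
print "$_\n" for @mac;

# find_encoding returns undef for a name Perl does not recognise,
# which is a quick way to check a name before using it in binmode.
print "unknown\n" unless defined Encode::find_encoding('whatever');
```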
