Re: encoding problem?



braedsjaa@xxxxxxxxxxx wrote:

I am trying to use perl on the command line to process text files in
various ways, one of which is to decode html entities. As far as I can
see, the following line should work

perl -MHTML::Entities -p -e 'decode_entities($_)' <input.txt >output.txt

It does indeed change the HTML entities, but not into the required
characters, rather into pairs of unusual characters, and the command
line returns this:

Wide character in print, <> line 1.

It seems to me it is something to do with internal character encoding
being messed up but I can't work out how to control it. The text files
processed have MacOS character encoding which is required in the
finished file, but perhaps I need to convert to UTF8 before processing
and back again after?

(I am seriously new to this - only started looking at Perl yesterday!)

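HTML entities name Unicode characters, a set of many thousands of
different characters that cannot all be encoded in a single byte.
decode_entities() returns a Perl character string, which Perl holds
internally as UTF-8. UTF-8 matches ASCII for the first 128 characters;
beyond that each character takes two or more bytes. Your output
filehandle has no encoding set, so print() warns about the wide
character and writes those raw UTF-8 bytes, which show up as pairs of
unusual characters when the file is read as MacOS text. Decoding the
input from the MacOS encoding before processing and encoding it back
afterwards, as you suggest, is the right approach.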

Rob
