Re: encoding problem?
- From: rob.dixon@xxxxxxx (Rob Dixon)
- Date: Thu, 27 Sep 2007 14:39:39 +0100
braedsjaa@xxxxxxxxxxx wrote:
I am trying to use perl on the command line to process text files in
various ways, one of which is to decode html entities. As far as I can
see, the following line should work
perl -MHTML::Entities -p -e 'decode_entities($_)' <input.txt >output.txt
it does indeed change the html entities, but not into the required
characters, rather into pairs of unusual characters; and the command
line returns this:
Wide character in print, <> line 1.
It seems to me it is something to do with internal character encoding
being messed up but I can't work out how to control it. The text files
processed have MacOS character encoding which is required in the
finished file, but perhaps I need to convert to UTF8 before processing
and back again after?
(I am seriously new to this - only started looking at Perl yesterday!)
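HTML entities name Unicode characters, a set of many thousands of
different characters that cannot all be encoded in a single data byte.
decode_entities() replaces each entity with the corresponding Unicode
character. When Perl prints a string containing characters above 255
without an output encoding layer, it writes the string as UTF-8 and
warns "Wide character in print". UTF-8 corresponds to ASCII for the
first 128 characters: beyond that each character uses two or more data
bytes, which is why you see pairs of unusual characters.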
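
One way to keep the file in its Mac encoding is to decode the input
bytes into Perl characters, decode the entities, and encode the result
back before printing. The following is only a minimal sketch: it
assumes the files are in the classic "MacRoman" encoding (the post says
only "MacOS character encoding"), and decode_line is a hypothetical
helper name, not part of HTML::Entities.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Assumption: the input and output files use MacRoman.
my $encoding = 'MacRoman';

# Hypothetical helper: takes a line of MacRoman bytes and returns
# MacRoman bytes with the HTML entities decoded.
sub decode_line {
    my ($bytes) = @_;
    my $text = decode($encoding, $bytes);  # bytes -> Perl characters
    decode_entities($text);                # void context: decodes in place
    return encode($encoding, $text);       # characters -> MacRoman bytes
}

# Act as a filter: read lines from STDIN, write converted lines to STDOUT.
while (my $line = <STDIN>) {
    print decode_line($line);
}
```

Saved as, say, decode.pl, it could be run as
perl decode.pl <input.txt >output.txt. With the default Encode
settings, an entity whose character does not exist in MacRoman comes
out as a substitution character rather than causing an error, so it may
be worth checking the output for unexpected "?" characters.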
Rob