Re: Intermittent Character Encoding Issues
From: Ben Morrow (usenet_at_morrow.me.uk)
Date: 11/04/03
- In reply to: David Murray-Rust: "Intermittent Character Encoding Issues"
Date: Tue, 4 Nov 2003 18:55:55 +0000 (UTC)
David Murray-Rust <dave@mo-seph.com> wrote:
> The first version of the bug was that after the line:
>
> $contentList = [ join '', @$contentList ] unless $separate;
>
> certain characters in the entries in @$contentList would be changed to
> two-byte versions. This only happened about 1% of the time this code
> was run. Changing the above line to be:
>
> unless( $separate )
> {
> my $tmp = "";
> foreach my $contentBit ( @$contentList )
> {
> $tmp .= $contentBit;
> }
> $contentList = [ $tmp ];
> }
>
> made the problem go away. In this case, the data comes directly from
> the mysql database. It has been verified that the string is encoded
> correctly up until that line, and wrongly afterwards.
How perl stores the data internally should be considered none of your
business. (It is in fact either iso8859-1 or utf8 on ASCII machines,
with a flag set on each scalar to say which. It is easier, however, to
regard a text string as being a sequence of Unicode characters, and not
worry about how they are represented.) However, it may be that how it
is stored in your mysql database is confusing perl, if the code you
are using to interface to the database doesn't correctly decode the
data into perl's own encoding. In particular, if you use iso8859-1 you
may get bitten far more irregularly than if you use other encodings,
since undecoded iso8859-1 octets look just like perl's own non-utf8
representation and so appear to work most of the time.
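To make the point about the internal flag concrete, here is a minimal sketch. The variable names are made up for illustration; utf8::upgrade and Encode::is_utf8 are the only calls it relies on. It shows one character held in both internal forms: the flag differs, but as far as perl's string operations are concerned the two scalars are the same text.

```perl
use strict;
use warnings;
use Encode qw/is_utf8/;

# One character, U+00A3 POUND SIGN, stored in perl's single-byte form.
my $latin = "\xA3";

# The same character with the internal representation switched to utf8.
# utf8::upgrade changes only the storage, never the characters.
my $upgraded = $latin;
utf8::upgrade($upgraded);

print "flags differ\n"        if is_utf8($upgraded) && !is_utf8($latin);
print "strings still equal\n" if $latin eq $upgraded;
print "lengths: ", length($latin), " and ", length($upgraded), "\n";
```

Trouble only starts when something outside perl's string operations (typically a module written in C) looks at the raw buffer without checking the flag.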
Decide on how you are going to encode text in the database: I
shall assume you wish to use iso8859-1. Now, every piece of textual
(as opposed to binary) data you write into the database should first
be converted from a sequence of characters into a sequence of octets,
using Encode::encode; and every piece of textual data you read from
the database should be converted from octets back into character data
using Encode::decode. So, in the example above, you would write:
    use Encode qw/decode/;
    my $tmp = "";
    foreach my $contentBit (@$contentList) {
        $tmp .= decode "iso8859-1", $contentBit;
    }
    $contentList = [ $tmp ];
(assuming you didn't decode it closer to where it was read from the
database).
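As a rough sketch of the whole round trip (decode on the way in, encode on the way out), assuming iso8859-1 in the database as above. The array @rows_from_db and its contents are invented stand-ins for whatever your database layer actually returns.

```perl
use strict;
use warnings;
use Encode qw/encode decode/;

# Hypothetical values as they might arrive from the database:
# plain iso-8859-1 octets, not yet decoded.
my @rows_from_db = ("Price: ", "\xA3", "5");

# Decode on the way in, so everything inside the program is characters.
my @chars = map { decode("iso-8859-1", $_) } @rows_from_db;
my $text  = join '', @chars;

# Encode on the way out, just before writing back to the database.
my $octets_for_db = encode("iso-8859-1", $text);

printf "%d characters in, %d octets out\n",
    length($text), length($octets_for_db);
```

Once everything between those two calls is character data, it should not matter whether you use join or a loop with .= to build the string.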
> In the second version of the bug, the line:
>
> return $return . $parent;
>
> resulted in a string being returned where all the pound signs in
> $return had been altered. If a different string to $parent is
> appended, there is no problem.
So what does $parent contain, which causes this problem? And what is
the result of
use Encode qw/is_utf8/;
warn is_utf8($parent) ?
"\$parent is chars internally" :
"\$parent is bytes internally";
?
> The current solution is:
>
> my $tmpParent = encode( "iso-8859-1", $parent );
> return $return . $tmpParent;
This is almost certainly Wrong, as $tmpParent will here be considered
to be a string of octets rather than a sequence of characters. The
Right Answer is to make sure $return is considered to be a sequence of
characters as well.
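Here is a sketch of what that might look like. I am assuming, purely for illustration, that $return starts life as iso-8859-1 octets from the database and $parent as UTF-8 octets from a CGI parameter; the sample values and the names $return_octets and $parent_octets are made up.

```perl
use strict;
use warnings;
use Encode qw/encode decode/;

# Assumed inputs (both hypothetical):
my $return_octets = "Cost \xA3" . "10 ";   # iso-8859-1, pound sign is 0xA3
my $parent_octets = "caf\xC3\xA9";         # UTF-8, e-acute is 0xC3 0xA9

# Decode each input with its own encoding, so both are character strings.
my $return = decode("iso-8859-1", $return_octets);
my $parent = decode("UTF-8",      $parent_octets);

# Concatenating characters with characters is safe.
my $result = $return . $parent;

# Encode once, at the point of output.
my $output = encode("iso-8859-1", $result);

printf "%d characters, %d octets after encoding\n",
    length($result), length($output);
```

The pound sign comes out as the single octet 0xA3 again, because the conversion to octets happens exactly once and with a known encoding.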
> In this case, there is data in $parent which comes via CGI, so I would
> be able to believe an explanation along the lines of "$parent is
> magically recognised as utf8, so when it is added to $return, $return
> is converted to utf8 octets before they are joined", but I would find
> this quite counter intuitive, since as I understand things perl uses
> its own internal representation for strings, and should only need to
> convert on the way in or out.
Yup. However, if the module you are using to talk to the database
and/or Apache hasn't been upgraded to 5.8 yet you will have to do
those conversions 'at the borders' by hand. Pushing an :encoding layer
onto your filehandles, perhaps with the 'open' pragma, may help
automate this; although you are using mod_perl, which relies on tied
filehandles: I don't know how well these play with PerlIO layers as
yet. You may want to write a custom 'print', 'readline' &c. that runs
all input through 'decode' and all output through 'encode'.
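For ordinary (untied) filehandles, an :encoding layer does the conversion for you. The sketch below uses in-memory files as stand-ins for a real source and sink, so $source and $sink are hypothetical; whether the same thing works with mod_perl's tied handles is exactly the part I can't vouch for.

```perl
use strict;
use warnings;

# Stand-in for an external source: iso-8859-1 octets in an in-memory file.
my $source = "\xA3" . "5\n";

# Reading through an :encoding layer decodes automatically.
open my $in, '<:encoding(iso-8859-1)', \$source or die "open: $!";
my $line = <$in>;
close $in;

# Writing through an :encoding layer encodes automatically.
my $sink = '';
open my $out, '>:encoding(iso-8859-1)', \$sink or die "open: $!";
print {$out} $line . "\x{A3}";
close $out;

printf "read %d characters, wrote %d octets\n",
    length($line), length($sink);
```

For a handle that is already open, binmode($fh, ':encoding(iso-8859-1)') pushes the same layer.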
Another thing to watch out for is that if any of your locale variables
(LANG, LC_ALL, etc.) match /utf-?8/i then perl 5.8.0 will assume all IO
will be in UTF8 until you disillusion it. This feature has been removed
in 5.8.1, though, so it shouldn't be affecting your problem if you are
running that version or later.
An alternative solution, if you can afford to treat all data as
'binary' rather than 'textual', is simply to put
use bytes;
at the top of every file :).
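A small sketch of what 'use bytes' changes, in case it helps you decide. The variable names are invented for the example. Under the pragma, length counts the octets of the internal representation rather than characters, so it only really fits when all of your data is octets to begin with.

```perl
use strict;
use warnings;

my $pound    = "\xA3";
my $upgraded = $pound;
utf8::upgrade($upgraded);          # same character, utf8 internal form

my $chars = length $upgraded;      # character semantics

my $octets;
{
    use bytes;                     # lexically scoped to this block
    $octets = length $upgraded;    # counts internal octets instead
}

print "characters: $chars, octets under 'use bytes': $octets\n";
```

Note that the pragma is lexically scoped, which is why it has to go at the top of every file rather than just the main script.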
Ben
--
Although few may originate a policy, we are all able to judge it.
- Pericles of Athens, c.430 B.C.
ben@morrow.me.uk