Re: XML::PARSER utf-8 and japanese characters
From: Ben Morrow (usenet_at_morrow.me.uk)
Date: 07/29/04
- Next message: Ben Morrow: "Re: still crabby about copy constuctor craziness"
- Previous message: corff_at_cis.fu-berlin.de: "Re: Bizarre PerlScript/WSH/UTF-8 problem"
- In reply to: Hemant Shah: "XML::PARSER utf-8 and japanese characters"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Thu, 29 Jul 2004 18:42:13 +0100
Quoth NoJunkMailshah@xnet.com:
> I am having problem writing Japanese characters.
>
> I am parsing an XML document that is in utf-8, actually it is a
> content.xml file from Open Office. It contains Japanese text along
> with english text. (english text and it's japanese translation).
>
> I want to write the the english and japanese text into individual
> files.
>
> Another process will read these individual files and insert the it
> into DB2 database which is also in utf-8.
>
> I am having problem writing japanese text to a file.
>
> I am running perl 5.8.3 on AIX 5.2.
That's a good start...
> Here are the code fragments from my script:
>
> use Encode;
> use encoding utf8, STDOUT => "utf8", STDIN => "utf8";
I would have explicitly binmoded the FHs, for clarity, but hey...
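For illustration, explicit binmode calls might look like the sketch below. It is only a sketch: the in-memory handle and the sample string are not from the original script, and it uses the strict `UTF-8` layer name rather than Perl's laxer `utf8`.

```perl
use strict;
use warnings;

# Explicit layers on the standard handles instead of relying on `use encoding`.
binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

# Demonstrate the effect on an in-memory handle so the bytes can be inspected.
# ($buffer and $out are illustrative names, not part of the quoted script.)
my $buffer = '';
open my $out, '>', \$buffer or die "Cannot open in-memory handle: $!";
binmode $out, ':encoding(UTF-8)';
print {$out} "\x{65E5}\x{672C}";    # two kanji, U+65E5 and U+672C
close $out;

# Each kanji should have been written as three bytes.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $buffer;
print "$hex\n";
```

A hex string like this is also a convenient way to compare what the script wrote against the bytes in content.xml.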
> use XML::Parser;
>
>
> $ContentParser = new XML::Parser(Handlers => {Start => \&HandleContentStart,
> End => \&HandleContentEnd,
> Default => \&DefaultContentHandler,
> Char => \&HandleContentChar});
>
> $ContentParser->parsefile ("content.xml", ProtocolEncoding => 'UTF-8');
>
>
>
> # In HandleContentChar() subroutine
> open (TEMPFILE, ">:encoding(utf8)", $TmpFile) ||
Use lexical filehandles.
Use low-precedence operators to avoid brackets.
open my $TEMPFILE, '>:encoding(utf8)', $TmpFile or die ...;
> die "Cannot open temporary file for write $TmpFile. $!";
>
> # Code to print XML tags
>
> print TEMPFILE "$JapaneseText";
Don't quote unnecessarily.
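Putting those three suggestions together, a minimal sketch might look like this. The temporary file and the sample text are hypothetical stand-ins for the values in the quoted script, and the read-back step is only there to check the round trip.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-ins for the variables in the quoted script.
my (undef, $TmpFile) = tempfile(UNLINK => 1);
my $JapaneseText = "\x{3053}\x{3093}\x{306B}\x{3061}\x{306F}";    # five hiragana

# Lexical handle, low-precedence `or`, and no quotes around the scalar.
open my $TEMPFILE, '>:encoding(UTF-8)', $TmpFile
    or die "Cannot open temporary file for write $TmpFile: $!";
print {$TEMPFILE} $JapaneseText;
close $TEMPFILE or die "Cannot close $TmpFile: $!";

# Read the file back through the same layer to confirm a clean round trip.
open my $in, '<:encoding(UTF-8)', $TmpFile or die "Cannot reopen $TmpFile: $!";
my $read_back = do { local $/; <$in> };
close $in;
print $read_back eq $JapaneseText ? "round trip ok\n" : "round trip FAILED\n";
```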
> # Code to print XML tags
>
> close(TEMPFILE);
>
> When I look at the Japanese text in content.xml file and $TmpFile (hex dump),
> they are different.
How are they different? Are they equivalent representations of the text
(I don't know if there are any non-canonical representations for
Japanese)? Can you give some examples of input and output text?
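One way to answer that question is to dump both strings as code points and compare them after normalization. The sketch below uses Unicode::Normalize (bundled with Perl 5.8); the sample strings are hypothetical, chosen because kana with a voicing mark can be stored either precomposed or as a base letter plus a combining mark.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Render a decoded string as a list of code points for side-by-side comparison.
sub codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $text;
}

# Hypothetical sample: katakana GA, precomposed, and as KA + combining voiced mark.
my $precomposed = "\x{30AC}";
my $decomposed  = "\x{30AB}\x{3099}";

print codepoints($precomposed), "\n";
print codepoints($decomposed),  "\n";

# The raw strings differ, but they are canonically equivalent.
my $same_raw = $precomposed eq $decomposed;
my $same_nfc = NFC($precomposed) eq NFC($decomposed);
print 'identical as stored: ', ($same_raw ? 'yes' : 'no'), "\n";
print 'equal after NFC:     ', ($same_nfc ? 'yes' : 'no'), "\n";
```

If the input and output differ only in this way, the hex dumps will not match even though the text is the same.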
> Also is there a way to split the Japanese text at unicode character
> boundary. I would like to store lines of 100 (single byte) characters or
> less per line. I do not have any problem with english and spanish text,
> but japanese characters are double byte,
No, they aren't. Most Japanese characters require 3 bytes in the UTF-8
encoding, and all accented Spanish characters will require at least two.
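The byte counts are easy to check with Encode. This is a small sketch; `utf8_bytes` is a hypothetical helper name and the sample characters are my own choice.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Number of bytes a single character occupies once encoded as UTF-8.
sub utf8_bytes {
    my ($char) = @_;
    return length encode_utf8($char);
}

# Label => character pairs; the labels keep the output plain ASCII.
my @samples = (
    [ 'LATIN SMALL LETTER A'            => 'a'        ],
    [ 'LATIN SMALL LETTER N WITH TILDE' => "\x{F1}"   ],
    [ 'HIRAGANA LETTER A'               => "\x{3042}" ],
    [ 'CJK IDEOGRAPH U+65E5'            => "\x{65E5}" ],
    [ 'HALFWIDTH KATAKANA LETTER KA'    => "\x{FF76}" ],
);

for my $sample (@samples) {
    my ($label, $char) = @$sample;
    printf "%-32s %d byte(s)\n", $label, utf8_bytes($char);
}
```

Note that even the "half-width" katakana take three bytes each, so display width and byte count are unrelated.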
> so I would like to split the line at 50 japanese characters.
What do you actually mean here? You claim not to mean 100 bytes/line,
but I suspect that might be what you actually want (if this is for some
program with a line-length limitation). Otherwise, do you mean 100
Unicode codepoints (100 complete utf8 sequences), 100 graphemes
(sequences like {LATIN SMALL LETTER A}{COMBINING ACUTE ACCENT}
which, while two Unicode codepoints, display as one character) or 100
(displayed) columns? The first two can be done by:
$string =~ s/(.{100})/$1\n/g; # CHARS (CODEPOINTS)
$string =~ s/(\X{100})/$1\n/g; # GRAPHEMES (COMBINING SEQUENCES)
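To see why the two regexes differ, here is a small sketch with a limit of 3 instead of 100. The sample string is hypothetical: three copies of 'a' followed by a combining acute accent.

```perl
use strict;
use warnings;

# Three graphemes, each made of two code points (base letter + combining accent).
my $string = "a\x{301}" x 3;

my $codepoints = length $string;
my $graphemes  = () = $string =~ /\X/g;
print "code points: $codepoints, graphemes: $graphemes\n";

# Splitting by code point can separate an accent from its base letter...
(my $by_codepoint = $string) =~ s/(.{3})/$1\n/g;
# ...whereas splitting by grapheme keeps each combining sequence together.
(my $by_grapheme = $string) =~ s/(\X{3})/$1\n/g;

my $orphaned_accent = $by_codepoint =~ /\n\x{301}/ ? 1 : 0;
print "code point split orphans an accent: ", ($orphaned_accent ? 'yes' : 'no'), "\n";
```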
'Bytes' and 'columns' are slightly harder, and I can't see an easy way
to do them with a regex:
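# BYTES
use Encode;
{
    my $newstring = '';
    my $width = 0;
    for (split //, $string) {
        # number of bytes this character occupies in UTF-8
        my $bytes = length Encode::encode_utf8($_);
        # start a new line if this character would overflow the current one
        if ($width + $bytes > 100) {
            $newstring .= "\n";
            $width = 0;
        }
        $width += $bytes;
        $newstring .= $_;
    }
    $string = $newstring;
}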
# COLUMNS (taking CJK full-width forms into account)
use Unicode::EastAsianWidth; # install from CPAN
{
    my $newstring = '';
    my $width = 0;
    for (split //, $string) {
        # non-printing characters take no columns; full-width ones take two
        my $cols = /\p{IsPrint}/ ? (/\p{InFullwidth}/ ? 2 : 1) : 0;
        # There is a bug here: it doesn't deal correctly with
        # printing-but-not-spacing characters (like combining accents).
        # start a new line if this character would overflow the current one
        if ($width + $cols > 100) {
            $newstring .= "\n";
            $width = 0;
        }
        $width += $cols;
        $newstring .= $_;
    }
    $string = $newstring;
}
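One possible way around the combining-accent problem is sketched below. It does not use Unicode::EastAsianWidth; it assumes a reasonably recent Perl whose built-in `\p{Ea=...}` (East_Asian_Width) property is available, treats nonspacing marks and control/format characters as zero width, and counts 'ambiguous width' characters as one column. `char_columns` and `wrap_columns` are hypothetical names.

```perl
use strict;
use warnings;

# Display columns for one character (a rough approximation, see assumptions above).
sub char_columns {
    my ($char) = @_;
    return 0 if $char =~ /\p{Mn}|\p{Me}|\p{Cc}|\p{Cf}/;    # zero-width characters
    return 2 if $char =~ /\p{Ea=W}|\p{Ea=F}/;              # Wide or Fullwidth
    return 1;
}

# Insert a newline before any character that would push the line past $limit columns.
sub wrap_columns {
    my ($text, $limit) = @_;
    my ($out, $width) = ('', 0);
    for my $char (split //, $text) {
        my $cols = char_columns($char);
        if ($width + $cols > $limit) {
            $out .= "\n";
            $width = 0;
        }
        $width += $cols;
        $out .= $char;
    }
    return $out;
}

# Three kanji (two columns each) followed by two ASCII letters, wrapped at 4 columns.
my $wrapped_cjk = wrap_columns("\x{65E5}\x{672C}\x{8A9E}ab", 4);
# Three accented letters built from combining sequences, wrapped at 2 columns.
my $wrapped_acc = wrap_columns("e\x{301}" x 3, 2);
print length($wrapped_cjk), " ", length($wrapped_acc), "\n";
```

Because a combining mark has zero width, it never triggers a line break on its own and so stays on the same line as its base letter.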
<none of the above tested>. You will need to read the docs for
Unicode::EastAsianWidth if you use it: I don't fully understand what it
says about 'ambiguous width' characters, knowing very little about CJK
writing.
Ben
-- 
If you put all the prophets,   |   You'd have so much more reason
Mystics and saints             |   Than ever was born
In one room together,          |   Out of all of the conflicts of time.
ben@morrow.me.uk                       The Levellers, 'Believers'