Re: XML::PARSER utf-8 and japanese characters

From: Ben Morrow (usenet_at_morrow.me.uk)
Date: 07/29/04


Date: Thu, 29 Jul 2004 18:42:13 +0100


Quoth NoJunkMailshah@xnet.com:
> I am having problem writing Japanese characters.
>
> I am parsing an XML document that is in utf-8, actually it is a
> content.xml file from Open Office. It contains Japanese text along
> with english text. (english text and it's japanese translation).
>
> I want to write the the english and japanese text into individual
> files.
>
> Another process will read these individual files and insert the it
> into DB2 database which is also in utf-8.
>
> I am having problem writing japanese text to a file.
>
> I am running perl 5.8.3 on AIX 5.2.

That's a good start...

> Here are the code fragments from my script:
>
> use Encode;
> use encoding utf8, STDOUT => "utf8", STDIN => "utf8";

I would have explicitly binmoded the FHs, for clarity, but hey...
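
A minimal sketch of what explicit binmode calls might look like, assuming the standard handles should carry UTF-8; the sample string is only an illustration and is not from the original script:

```perl
use strict;
use warnings;

# Set the encoding layer on each handle explicitly, so the intent is
# visible at the point of use rather than implied by a pragma.
binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

# U+65E5 U+672C U+8A9E, written as escapes so this source stays ASCII.
my $sample = "\x{65E5}\x{672C}\x{8A9E}";
print $sample, "\n";
```

With the layer in place, printing decoded text should not raise "Wide character" warnings.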

> use XML::Parser;
>
>
> $ContentParser = new XML::Parser(Handlers => {Start => \&HandleContentStart,
> End => \&HandleContentEnd,
> Default => \&DefaultContentHandler,
> Char => \&HandleContentChar});
>
> $ContentParser->parsefile ("content.xml", ProtocolEncoding => 'UTF-8');
>
>
>
> # In HandleContentChar() subroutine
> open (TEMPFILE, ">:encoding(utf8)", $TmpFile) ||

Use lexical filehandles.
Use low-precedence operators to avoid brackets.

open my $TEMPFILE, '>:encoding(utf8)', $TmpFile or die ...;

> die "Cannot open temporary file for write $TmpFile. $!";
>
> # Code to print XML tags
>
> print TEMPFILE "$JapaneseText";

Don't quote unnecessarily.
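
Put together, the three suggestions above might look like the following sketch. The helper name write_utf8_file and its arguments are hypothetical, not part of the original script, and the text passed in is assumed to be already decoded:

```perl
use strict;
use warnings;

# Hypothetical helper: write already-decoded text to a file as UTF-8.
sub write_utf8_file {
    my ($path, $text) = @_;

    # Lexical filehandle, low-precedence 'or' instead of brackets and '||'.
    open my $fh, '>:encoding(UTF-8)', $path
        or die "Cannot open $path for writing: $!";

    # Print the variable directly; no need to interpolate it into a string.
    print {$fh} $text;

    close $fh
        or die "Cannot close $path: $!";
    return;
}
```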

> # Code to print XML tags
>
> close(TEMPFILE);
>
> When I look at the Japanese text in content.xml file and $TmpFile (hex dump),
> they are different.

How are they different? Are they equivalent representations of the text
(I don't know if there are any non-canonical representations for
Japanese)? Can you give some examples of input and output text?
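
One way to check this, sketched under the assumption that both strings have already been decoded: list the code points of each and compare them after normalisation with Unicode::Normalize. Kana with a voiced mark are one case where two different code point sequences are canonically equivalent. The helper name codepoints is hypothetical:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: list the code points of a decoded string.
sub codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $text;
}

# HIRAGANA LETTER GA, precomposed, and as HIRAGANA LETTER KA followed by
# COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK.
my $precomposed = "\x{304C}";
my $decomposed  = "\x{304B}\x{3099}";

print codepoints($precomposed), "\n";
print codepoints($decomposed),  "\n";
print NFC($precomposed) eq NFC($decomposed) ? "equivalent\n" : "different\n";
```

If the two files differ only in this way, the hex dumps will not match even though the text is the same.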

> Also is there a way to split the Japanese text at unicode character
> boundary. I would like to store lines of 100 (single byte) characters or
> less per line. I do not have any problem with english and spanish text,
> but japanese characters are double byte,

No, they aren't. Most Japanese characters require 3 bytes in the UTF-8
encoding, and all accented Spanish characters will require at least two.
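
A quick way to see the byte counts, as a sketch: encode single characters and measure the result. The helper name utf8_length is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of bytes a decoded string occupies as UTF-8.
sub utf8_length {
    my ($text) = @_;
    return length(encode('UTF-8', $text));
}

printf "%d %d %d\n",
    utf8_length('a'),          # ASCII letter: 1 byte
    utf8_length("\x{F1}"),     # LATIN SMALL LETTER N WITH TILDE: 2 bytes
    utf8_length("\x{65E5}");   # a CJK ideograph: 3 bytes
```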

> so I would like to split the line at 50 japanese characters.

What do you actually mean here? You claim not to mean 100 bytes/line,
but I suspect that might be what you actually want (if this is for some
program with a line-length limitation). Otherwise, do you mean 100
Unicode codepoints (100 complete utf8 sequences), 100 graphemes
(sequences like {LATIN SMALL LETTER A}{COMBINING ACUTE ACCENT}
which, while two Unicode codepoints, display as one character) or 100
(displayed) columns? These can be done by:

$string =~ s/(.{100})/$1\n/g; # CHARS (CODEPOINTS)

$string =~ s/(\X{100})/$1\n/g; # GRAPHEMES (COMBINING SEQUENCES)
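
To see how the two substitutions differ, here is a small sketch that uses a width of 2 instead of 100 on a string built from combining sequences. The sample text and variable names are illustrative only:

```perl
use strict;
use warnings;

# "e" followed by COMBINING ACUTE ACCENT, three times:
# 6 code points, but only 3 graphemes.
my $text = "e\x{301}" x 3;

my $codepoints = () = $text =~ /./gs;
my $graphemes  = () = $text =~ /\X/g;

# Split after every 2 code points: each accented "e" gets its own line.
(my $by_codepoint = $text) =~ s/(.{2})/$1\n/gs;

# Split after every 2 graphemes: two accented "e"s fit on the first line.
(my $by_grapheme = $text) =~ s/(\X{2})/$1\n/g;

print "$codepoints code points, $graphemes graphemes\n";
```

Note that both substitutions also append a newline when the text length is an exact multiple of the width.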

; 'bytes' and 'columns' are slightly harder, and I can't see an easy way
to do them with a regex:

# BYTES

{
    my $newstring = '';
    my $width = 0;

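    for (split //, $string) {
        # Byte length of this character (assumes a decoded, UTF-8-flagged string).
        my $bytes = do { use bytes; length };
        # Break before a character that would push the line past 100 bytes.
        $width + $bytes > 100 and $newstring .= "\n", $width = 0;
        $newstring .= $_;
        $width += $bytes;
    }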
    
    $string = $newstring;
}

# COLUMNS (taking CJK full-width forms into account)

use Unicode::EastAsianWidth; # install from CPAN

{
    my $newstring = '';
    my $width = 0;

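    for (split //, $string) {
        # Non-printing characters take no columns; full-width CJK forms take two.
        my $cols = /\p{IsPrint}/ ? (/\p{InFullwidth}/ ? 2 : 1) : 0;
        # There is a bug here: it doesn't deal correctly with
        # printing-but-not-spacing characters (like combining accents).

        # Break before a character that would push the line past 100 columns.
        $width + $cols > 100 and $newstring .= "\n", $width = 0;
        $newstring .= $_;
        $width += $cols;
    }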
    
    $string = $newstring;
}

<none of the above tested>. You will need to read the docs for
Unicode::EastAsianWidth if you use it: I don't fully understand what it
says about 'ambiguous width' characters, knowing very little about CJK
writing.

Ben

-- 
   If you put all the prophets,   |   You'd have so much more reason
   Mystics and saints             |   Than ever was born
   In one room together,          |   Out of all of the conflicts of time.
ben@morrow.me.uk                                    The Levellers, 'Believers'

