Re: perl 5.6 multi byte

From: Mihai N. (nmihai_year_2000_at_yahoo.com)
Date: 11/30/03


Date: Sun, 30 Nov 2003 20:53:16 GMT


> Hey guys, I need to do some parsing on a file that includes Japanese
> Shift JIS and Chinese GB1312 and was wondering if someone could help
> me with some errors im getting.
Nobody answered her, so I will give it a try :)

> I am not entirely sure what pragmas i need to use, or really
> how to open a wide character file properly (is GB1312 and Japanese
> Shift JIS wide chars? Is that different from utf8?)
No special pragmas are needed with Perl 5.6.
GB1312 is in fact GB2312 and is used for Simplified Chinese.
Both GB2312 and ShiftJIS are double byte character sets (DBCS).
That does not mean they are wide char.
Some characters take one byte, some take two bytes.
This is why in many cases it is a problem to do search, search-replace, etc.
for bytes that can be half of a character.
For instance the backslash (0x5C) can be the second byte of several Japanese
characters. The same goes for other characters (in ShiftJIS the second byte
can be almost anything from 0x40 up).
And yes, they are very different from utf8.
DBCS can have 1 or 2 bytes, utf8 can have up to 4 (up to 6 in the original
definition).
DBCS cover one character set only (Simplified Chinese or Japanese, in this
case), utf8 covers the whole Unicode.
For DBCS it is not possible to tell which bytes are lead or trailing bytes
without help from the OS or without hard-coded tables. And the tables
differ from one DBCS charset to another. UTF8 is clear, no need for tables.
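
To make the lead/trailing byte problem concrete, here is a minimal sketch for
ShiftJIS only. The helper name split_sjis and the sample string are made up
for illustration, and the byte ranges are hard-coded (lead 0x81-0x9F and
0xE0-0xFC, trailing 0x40-0x7E and 0x80-0xFC). GB2312 would need its own
ranges. It works on plain bytes, so it should not need anything beyond 5.6.

```perl
use strict;
use warnings;

# Hypothetical helper: split a ShiftJIS byte string into characters,
# using hard-coded lead byte and trailing byte ranges.
sub split_sjis {
    my ($bytes) = @_;
    my @chars;
    # Try a two-byte character first; otherwise take a single byte.
    while ($bytes =~ /\G([\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]|.)/gs) {
        push @chars, $1;
    }
    return @chars;
}

# "A", then katakana SO (0x83 0x5C), then "B". No real backslash here.
my $sample = "A\x83\x5cB";

# Naive byte search: reports a backslash that is really a trailing byte.
my $naive = ($sample =~ /\\/) ? 1 : 0;

# Character-aware search: compare whole characters only.
my @chars = split_sjis($sample);
my $aware = (grep { $_ eq "\\" } @chars) ? 1 : 0;

print "characters: ", scalar(@chars), "\n";                  # 3
print "naive search finds backslash: $naive\n";              # 1
print "character-aware search finds backslash: $aware\n";    # 0
```

The same idea applies to search-replace: walk the string character by
character instead of matching raw bytes.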

> I have been
> trying to do research on multilingual support for perl 5.6, but it is
> highly confusing and I am positive I am missing something.
Main question: why 5.6? 5.8 has been out for a long time already, and it is
way better at handling this kind of problem.
It supports utf8, regular expressions on utf8, etc.
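
For instance, with 5.8 you can let an I/O layer do the decoding while you
read. This is only a sketch: it assumes Perl 5.8 with the Encode module that
ships with it, and the temporary file and its contents are made up so the
example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small ShiftJIS sample: katakana SO (0x83 0x5C) and a newline.
my ($fh, $path) = tempfile();
binmode $fh;
print {$fh} "\x83\x5c\x0a";
close $fh or die "Cannot close $path: $!";

# Read it back through an encoding layer (Perl 5.8 and later).
open(my $in, '<:encoding(shiftjis)', $path) or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# The two bytes are now one character, so a regex sees no backslash.
printf "%d character(s), first is U+%04X\n", length($line), ord($line);
print "contains backslash: ", ($line =~ /\\/ ? "yes" : "no"), "\n";

unlink $path;
```

For a GB2312 file the layer name would change (Encode knows it as
euc-cn), the rest stays the same.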

> My program
> is exiting early without having read the entire file (at least, it is
> only getting through about 10K of a 20K line file).
There is no reason for it to stop reading, whatever the encoding.
I suspect something else.
Tell us more about the OS and the data file (is there a risk of having
control characters in it?)
Does it always stop in the same place? Did you try deleting some lines from
the beginning of the file to see where it stops after that? Maybe there is
a certain line that stops it.
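
One way to check for control characters is to read the file in binary mode
and list the lines that contain them. A sketch only: find_control_bytes is a
made-up name and the sample data is invented. On DOS/Windows the C library
treats a Ctrl-Z byte (0x1A) as end of file in text mode, which is one way a
read loop can stop early. Whether that is what happens here is a guess.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the numbers of the lines that contain
# control bytes other than tab, LF and CR.
sub find_control_bytes {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    binmode $in;    # raw bytes, no text-mode translation
    my @hits;
    while (my $line = <$in>) {
        push @hits, $. if $line =~ /[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]/;
    }
    close $in;
    return @hits;
}

# Invented sample: line 2 contains a Ctrl-Z byte (0x1A).
my ($fh, $path) = tempfile();
binmode $fh;
print {$fh} "first line\n", "second\x1a line\n", "third line\n";
close $fh or die "Cannot close $path: $!";

my @hits = find_control_bytes($path);
print "lines with control bytes: @hits\n";    # 2

unlink $path;
```

If the reported line is close to where the program stops, that line is the
first thing to look at.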

> I've included a
> code snippet and stripped out any attempts at multi-byte compatibility
> I've attempted in the hopes that someone will spot what is obviously
> wrong with it.
Nothing obviously wrong.
Except there is no ; after "open IN, ..."
And $g_hLang is not defined, but used.

And you increment $i for each line you read, then compare it
against $g_nMaxFiles (again undefined) and exit.
Is this what you want? To exit after $g_nMaxFiles lines?
Maybe this is the problem. And it has nothing to do with the encoding.

-- 
Mihai
-------------------------
Replace _year_ with _ to get the real email

