Re: perl 5.6 multi byte
From: Mihai N. (nmihai_year_2000_at_yahoo.com)
Date: 11/30/03
Date: Sun, 30 Nov 2003 20:53:16 GMT
> Hey guys, I need to do some parsing on a file that includes Japanese
> Shift JIS and Chinese GB1312 and was wondering if someone could help
> me with some errors im getting.
Nobody answered her, so I will give it a try :)
> I am not entirely sure what pragmas i need to use, or really
> how to open a wide character file properly (is GB1312 and Japanese
> Shift JIS wide chars? Is that different from utf8?)
Nothing special with Perl 5.6.
GB1312 is in fact GB2312, which is used for Simplified Chinese.
Both GB2312 and Shift JIS are double-byte character sets (DBCS).
That does not mean they are wide chars.
Some characters take one byte, some take two.
This is why byte-level search, search-replace, etc. is often a problem:
a matching byte can be half of a character.
For instance, the backslash (0x5C) can be the second byte of several
Japanese characters. The same goes for other ASCII characters (in Shift
JIS the second byte can be almost anything from 0x40 up).
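To make the backslash problem concrete, here is a small sketch that works on plain bytes, so it should run on 5.6 as well. The helper name find_real_backslash is made up for this illustration, and the lead-byte ranges (0x81-0x9F, 0xE0-0xFC) are the commonly cited ones for Shift JIS; treat them as an assumption and check them against the table for your exact variant.

```perl
use strict;
use warnings;

# Shift JIS bytes for the katakana "so" (0x83 0x5C) followed by ASCII "A".
# The trail byte 0x5C has the same value as an ASCII backslash.
my $sjis = "\x83\x5C" . "A";

# Naive byte search: finds a "backslash" that is really half a character.
my $naive_hit = index($sjis, "\\");

# Lead-byte-aware scan (hypothetical helper, assumed lead-byte ranges).
# Returns the offset of the first real backslash, or -1 if there is none.
sub find_real_backslash {
    my ($bytes) = @_;
    my $pos = 0;
    while ($pos < length $bytes) {
        my $b = ord substr($bytes, $pos, 1);
        if (($b >= 0x81 && $b <= 0x9F) || ($b >= 0xE0 && $b <= 0xFC)) {
            $pos += 2;          # skip a whole two-byte character
            next;
        }
        return $pos if $b == 0x5C;
        $pos++;
    }
    return -1;
}

my $real_hit = find_real_backslash($sjis);
```

The naive search reports a hit inside the two-byte character, while the lead-byte-aware scan reports none. The same kind of walk is needed before any split, tr or s/// on raw Shift JIS bytes.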
And yes, they are very different from utf8.
A DBCS character takes 1 or 2 bytes; a utf8 character takes 1 to 4 (the
original utf8 design allowed up to 6).
A DBCS covers one character set only (Simplified Chinese or Japanese, in this
case), while utf8 covers the whole of Unicode.
For a DBCS it is not possible to tell which bytes are lead or trailing bytes
without help from the OS or hard-coded tables, and the tables differ from
one DBCS charset to another. In utf8 the byte itself tells you, no tables
needed.
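A rough sketch of why utf8 needs no tables: the high bits of each byte say what role it plays. The function name utf8_byte_kind is invented for this example, and it looks at the bit pattern only (it does not reject overlong forms such as a 0xC0 lead byte), so it is a simplification and not a validator.

```perl
use strict;
use warnings;

# Classify one byte of a utf8 stream from its high bits alone.
# Hypothetical helper; takes the byte value as a number (0..255).
sub utf8_byte_kind {
    my ($byte) = @_;
    return 'ascii' if $byte < 0x80;              # 0xxxxxxx
    return 'trail' if ($byte & 0xC0) == 0x80;    # 10xxxxxx
    return 'lead2' if ($byte & 0xE0) == 0xC0;    # 110xxxxx
    return 'lead3' if ($byte & 0xF0) == 0xE0;    # 1110xxxx
    return 'lead4' if ($byte & 0xF8) == 0xF0;    # 11110xxx
    return 'invalid';
}

# U+8868 is encoded in utf8 as the three bytes E8 A1 A8.
my @kinds = map { utf8_byte_kind($_) } (0xE8, 0xA1, 0xA8);
```

Note that no trail byte can ever fall in the ASCII range, so a byte search for a backslash in utf8 data cannot land in the middle of a character.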
> I have been
> trying to do research on multilingual support for perl 5.6, but it is
> highly confusing and I am positive I am missing something.
Main question: why 5.6? 5.8 has been out for a long time already, and it is
far better at handling this kind of problem.
It supports utf8, regular expressions on utf8, etc.
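For what it is worth, this is roughly how it could look on 5.8, as a sketch only. It assumes the Encode module that ships with 5.8 and its encoding names 'shiftjis' and 'euc-cn' (the usual byte encoding of GB2312). The sample byte strings are mine, not from the original data file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Shift JIS bytes for the katakana "so"; the second byte is 0x5C.
my $sjis_bytes = "\x83\x5C";
my $ja = decode('shiftjis', $sjis_bytes);    # now a character string

# GB2312 (EUC-CN) bytes for one Simplified Chinese character.
my $gb_bytes = "\xD6\xD0";
my $zh = decode('euc-cn', $gb_bytes);

# After decoding, length counts characters and regexes see characters,
# so the 0x5C trail byte no longer looks like a backslash.
my $ja_chars      = length $ja;
my $has_backslash = ($ja =~ /\\/) ? 1 : 0;
```

When reading a whole file, the same decoding can be attached to the handle with an I/O layer, along the lines of open(IN, '<:encoding(shiftjis)', $file), so every line arrives already decoded.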
> My program
> is exiting early without having read the entire file (at least, it is
> only getting through about 10K of a 20K line file).
There is no reason for it to stop reading, whatever the encoding.
I suspect something else.
Tell us more about the OS and the data file (is there a risk of control
characters in it?)
Does it always stop in the same place? Did you try deleting some lines from
the beginning of the file to see where it stops then? Maybe there is
a certain line that stops it.
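One way to check the control character theory, as a sketch: read the file with binmode and note the first line that holds a control byte. On Windows a 0x1A (Ctrl-Z) byte can be taken as end of file when the handle is in text mode, which would fit the symptom, but this is only a guess about the poster's setup. The function name scan_for_control_chars is made up for the example.

```perl
use strict;
use warnings;

# Read a file in binary mode. Return the total line count and the number
# of the first line holding a control byte other than TAB, LF or CR
# (0 if there is none). Hypothetical helper for diagnosis only.
sub scan_for_control_chars {
    my ($path) = @_;
    local *IN;
    open(IN, "< $path") or die "Cannot open $path: $!";
    binmode(IN);                 # no text-mode translation, no Ctrl-Z EOF
    my ($lines, $first_bad) = (0, 0);
    while (my $line = <IN>) {
        $lines++;
        if (!$first_bad && $line =~ /[\x00-\x08\x0B\x0C\x0E-\x1F]/) {
            $first_bad = $lines;
        }
    }
    close(IN);
    return ($lines, $first_bad);
}
```

If the line count with binmode matches the real file and the reported line is where the original script stopped, the control byte is the likely culprit. Neither Shift JIS nor GB2312 uses bytes below 0x20 as part of a two-byte character, so the check does not give false alarms on valid text.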
> I've included a
> code snippet and stripped out any attempts at multi-byte compatibility
> I've attempted in the hopes that someone will spot what is obviously
> wrong with it.
Nothing obviously wrong.
Except there is no ; after "open IN, ..."
And $g_hLang is used but not defined.
And you increment $i for each line you read, then compare it
against $g_nMaxFiles (again undefined) and exit.
Is this what you want? To exit after $g_nMaxFiles lines?
Maybe this is the problem, and it has nothing to do with the encoding.
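To show what I mean about the counter, here is a hypothetical reduction of such a loop (not the original code, and count_lines and its arguments are my own names). With "use strict" Perl refuses to compile a script that uses undeclared variables like $g_hLang or $g_nMaxFiles, so that kind of slip shows up at once.

```perl
use strict;
use warnings;

# Hypothetical reduction: process lines, stopping early only when a
# limit was actually passed in. Returns the number of lines processed.
sub count_lines {
    my ($lines_ref, $max_lines) = @_;
    my $i = 0;
    foreach my $line (@$lines_ref) {
        last if defined $max_lines && $i >= $max_lines;
        $i++;
    }
    return $i;
}

my @data    = map { "line $_\n" } 1 .. 20;
my $limited = count_lines(\@data, 10);   # stops halfway through
my $full    = count_lines(\@data);       # no limit, reads everything
```

A limit set somewhere else in the script would give exactly the "stops halfway" behaviour, whatever the encoding of the file.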
-- Mihai ------------------------- Replace _year_ with _ to get the real email