Re: change ISO8859-1 to GB2312 to UTF-8 to EBCDIC to Big5 to ...



On May 25, 7:02 pm, RedGrittyBrick <RedGrittyBr...@xxxxxxxxxxxxxxxxx>
wrote:
On 25/05/2010 09:48, moonhkt wrote:

Thank [you]. I am not testing [with] JDBC.

When you wrote "Our database is ISO8859-1 format with some GB2312 and
other non ISO8859-1 data." I got the impression that a DBMS was
involved. If you were using Hibernate or some other framework rather
than JDBC, the same principles would apply.

But [I] tried [converting a] GB2312 file to UTF-8, then [to] BIG5

BIG5! Another character set and encoding! I think that makes seven
you've mentioned in this thread! Any more?

10 TEST1    |测试1
11 TEST2    |测试2
13 TEST4    |测试4

[the program below] can conv[ert a file containing the above data] to UTF-8

When [it] conv[erts from] UTF-8 to BIG5, [it] can not [successfully convert
all characters]. Do you know why?

You are ignoring exceptions. Exceptions might be telling you something
you really need to know about. Don't ignore exceptions.
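
A minimal sketch of what not ignoring the exception can look like. The
class and method names are hypothetical and are not taken from the program
under discussion, and US-ASCII merely stands in for any target charset
that lacks the characters. A CharsetEncoder whose error action is
CodingErrorAction.REPORT throws a CharacterCodingException instead of
silently substituting something:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncode {
    // Encodes text in the target charset and throws, rather than
    // substituting, when a character has no mapping in that charset.
    public static byte[] encodeStrict(String text, Charset target)
            throws CharacterCodingException {
        CharsetEncoder encoder = target.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer out = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String text = "TEST1 |\u6d4b\u8bd51"; // the sample line "TEST1 |测试1"
        try {
            // Assumption: US-ASCII is used only because every JRE has it.
            encodeStrict(text, StandardCharsets.US_ASCII);
            System.out.println("converted without loss");
        } catch (CharacterCodingException e) {
            // Report the failure instead of swallowing it.
            System.out.println("conversion failed: " + e);
        }
    }
}
```

Swapping in Charset.forName("Big5") for the target should show whether the
same kind of failure occurs with the real data.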

I'm not familiar with GB2312 and Big5 but I expect that there are
characters in GB2312 that are not in Big5. It is almost certain.
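
One way to test that expectation rather than guess, sketched with
hypothetical names: ask a CharsetEncoder for each charset whether it can
encode each character. Whether "GB2312" and "Big5" are available depends
on the JRE, so the sketch checks Charset.isSupported first.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CanEncodeCheck {
    // Returns the characters of text that the named charset cannot encode.
    public static String unencodable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder missing = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (!encoder.canEncode(ch)) {
                missing.append(ch);
            }
        });
        return missing.toString();
    }

    public static void main(String[] args) {
        String simplified = "\u6d4b\u8bd5";  // 测试
        String traditional = "\u6e2c\u8a66"; // 測試
        for (String name : new String[] {"GB2312", "Big5"}) {
            if (!Charset.isSupported(name)) {
                continue; // charset not shipped with this JRE
            }
            System.out.println(name + " cannot encode, simplified: ["
                    + unencodable(simplified, name) + "] traditional: ["
                    + unencodable(traditional, name) + "]");
        }
    }
}
```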

GB2312 originated in the People's Republic of China, where simplified
Chinese characters were mandatory. I think this policy has been relaxed now.

I suspect Big5 originated in either the British colony of Hong Kong or
in the Republic of China (Taiwan/Formosa). In both these places,
Traditional Chinese characters were (and still are) used.

Whether the conversion from GB2312 to UTF-16 and then to Big5 can
convert a simplified character to a traditional counterpart is unknown
to me. Perhaps this causes conversion problems?

[I] Checked [the resulting file] with IE, the BIG5 code is [displayed as] "?"

You have to tell IE what encoding to use to display the file. That was
why I wrote HTML markup containing <meta charset="gb2312">. You can
probably force an encoding using a menu option in IE. You certainly can
in Firefox.

If IE does not have access to a font containing the required glyph, it
will display a placeholder character. I don't use IE much, so I'm not
certain which placeholder IE displays: a small box, a question mark
or something else.

If Java writes a character that is not present in the specified output
character set then I expect it might also substitute a placeholder
character.
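
For what it is worth, String.getBytes(Charset) is documented to replace
unmappable characters with the charset's default replacement bytes, so a
"?" may already be in the file before any browser opens it. A small
sketch with hypothetical names; US-ASCII is the fallback here only because
Big5 may not be present in every JRE:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Encodes text in the given charset and decodes it again, showing
    // what survives the trip through that charset.
    public static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        Charset target = Charset.isSupported("Big5")
                ? Charset.forName("Big5")
                : StandardCharsets.US_ASCII; // assumption: fallback for demo only
        // "TEST1 |测试1" from the sample data
        System.out.println(roundTrip("TEST1 |\u6d4b\u8bd51", target));
    }
}
```

If the printed line differs from the input, the substitution happened
during encoding, not in the viewer.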

Also Big5 is weird, apparently it doesn't exactly encode characters, it
encodes logograms or parts of graphical characters. It also has to be
paired with a single-byte character-set that isn't specified in the Big5
standard. Also there are variants of Big5. Lots of scope for encoding
issues. Maybe Java and IE disagree about Big5 variants?
<http://en.wikipedia.org/wiki/Big5>

P.S. IE6 is old and a security hazard, I'd upgrade.
--
RGB

Our ISO8859-1 database (Progress Database) has some Japanese, Korean,
Simplified Chinese and Traditional Chinese data. Those languages are
imported by a lookup function, e.g. when a user inputs "G" in particular,
the lookup program will get "Green" in the corresponding language
character set. Also, I checked another GB2312 database (Progress
Database): the encoding value of "测试" (in English, "TEST") is the same
as in the ISO8859-1 one. I checked with the unix tool
"od -ct x1 file_name".

The BIG5 conversion is just for testing how to change GB2312 to BIG5.
My boss asked me to check what the encoding value for "TEST" is in
GB2312 or BIG5. So I want to convert to BIG5 to check what the encoding
value is in BIG5.
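
A minimal sketch of that check done directly in Java, printing the bytes
in the same hex form as "od -t x1". The class and method names are
hypothetical, and the GB2312 and Big5 lines only run if the JRE provides
those charsets. Note that getBytes substitutes a replacement byte for any
character the charset lacks, so an unexpected 3f ("?") in the output
means the character could not be encoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingValues {
    // Returns the bytes of text in the given charset as space-separated hex.
    public static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String test = "\u6d4b\u8bd5"; // 测试
        System.out.println("UTF-8:  " + hex(test, StandardCharsets.UTF_8));
        for (String name : new String[] {"GB2312", "Big5"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + hex(test, Charset.forName(name)));
            }
        }
    }
}
```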

I will add the exceptions back.

Thanks a lot.


moonhkt