Re: [PHP] Re: 0x9f54



Man-wai Chang wrote:
> > On the other hand, I remember you said the type of that
> > column is char(2). Have you specified what encoding it's using?
> > Moreover, I hope you're not using a legacy encoding like Big5 or GB.
> > Use Unicode (UTF-8) if your database is a brand new one.
>
>
> Unfortunately, I am still using Big5. You need a longer field to store
> UTF-8 codes for the same Big5 string, right?

Yes. In Big5 every Chinese character is represented by two bytes,
whereas in UTF-8 every Chinese character takes three bytes (or, on rare
occasions, four bytes, for very rare characters such as some of those
used in ancient Chinese). This is because UTF-8 is designed to stay
8-bit compatible with old data-processing functions that expect ASCII.
In other words, a string of pure Chinese characters is 50% longer in
UTF-8 than in Big5 (that is, 150% of its size).
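
To make the size difference concrete, here is a small sketch in Python
(not PHP, simply because it is easy to run anywhere). The sample string
is my own choice of three common traditional characters; any text made
only of characters that Big5 covers should show the same ratio.

```python
# Compare the byte length of the same Chinese string in Big5 and UTF-8.
# The sample text is an arbitrary illustration, not from the thread.
text = "中文字"  # three common traditional Chinese characters

big5_bytes = text.encode("big5")    # two bytes per character
utf8_bytes = text.encode("utf-8")   # three bytes per character here

big5_len = len(big5_bytes)
utf8_len = len(utf8_bytes)
ratio = utf8_len / big5_len         # how much bigger the UTF-8 form is

print(big5_len, utf8_len, ratio)
```

So a column sized in bytes that holds one Big5 character (two bytes)
would need three bytes to hold the same character in UTF-8.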

	You could, of course, use UTF-16 as the base format for your
strings. In that case, almost every character is represented by 2 bytes,
be it a Western Latin character or an Eastern CJK character. OK, yes,
rare characters outside the Basic Multilingual Plane take 4 bytes (a
surrogate pair), but this is rare.
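
A quick way to check the UTF-16 sizes, again as a Python sketch. The
three sample characters are my own picks; U+20000 is used only as an
example of a character outside the Basic Multilingual Plane. The
"utf-16-le" codec is used so that no byte-order mark is added to the
count.

```python
# Byte length of single characters in UTF-16 (little-endian, no BOM).
latin = "A"               # Western Latin letter
cjk = "中"                # common CJK character in the BMP
rare = "\U00020000"       # CJK Extension B character, outside the BMP

latin_len = len(latin.encode("utf-16-le"))  # one 16-bit code unit
cjk_len = len(cjk.encode("utf-16-le"))      # one 16-bit code unit
rare_len = len(rare.encode("utf-16-le"))    # surrogate pair, two code units

print(latin_len, cjk_len, rare_len)
```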

	Anyway, you should look at the positive side of using Unicode
instead of the dinosaur encoding, sorry, I mean Big5 :p Hard drives
(and RAM) are getting really big nowadays, so string size should not
be the first criterion when choosing what encoding to use.

	Unicode is maintained by an international consortium and it can
support most languages in the world. For instance, using Big5, you
can't even represent the simplest of Western European characters, like
those in these words: español or français!! But you can represent them
using Unicode. Actually, the ability to represent (Western) European
characters might not interest you. But using Unicode, you can store
both traditional and simplified Chinese! And that, I'm sure, does
interest you. You can't do that in Big5, I'm 100% sure!
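
If you want to check this yourself, here is a rough Python sketch. The
helper name can_encode is made up for this example, and the result
depends on the Big5 mapping table shipped with Python's "big5" codec,
so treat it as an illustration rather than a definition of Big5.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Accented Western European letters are not available in Big5.
print(can_encode("español", "big5"), can_encode("español", "utf-8"))

# Traditional 漢 is in Big5; its simplified form 汉 is not.
print(can_encode("漢", "big5"), can_encode("汉", "big5"))

# UTF-8 can hold both forms in the same string.
print(can_encode("漢汉", "utf-8"))
```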

	Still not convinced? Well, Unicode even contains traditional
Chinese characters that Big5 doesn't support. For example, a friend of
mine has this character 驊 in his first name. This character isn't
supported in Big5 and, in the pre-Unicode period, he had to type (馬華)!
Very stupid! Another example: 氹 is quite a common character in
southern China, but it can't be found in Big5.

	So, think about using Unicode. It's 2007, so be a modern man!


