Re: i18n'ed Character Set in DBMS and tables

From: steve (me_at_me.com)
Date: 09/10/04


Date: Fri, 10 Sep 2004 07:03:57 +0800

On Wed, 8 Sep 2004 01:14:42 +0800, Albretch wrote
(in article <f8544ad2.0409070914.784648a0@posting.google.com>):

> "Mark Yudkin" <myudkinATcompuserveDOTcom@nospam.org> wrote in message
> news:<chjjtu$q1d$1@ngspool-d02.news.aol.com>...
>>> Or?
>
>> I18n is about processing, not about storage representation
> AM: Yeah, sure! But ultimately storage, representation, reencoding .
> . . are all part of the functional pipe of 'processing'. These are
> all naturally related as you serve the data to clients.
>
>> consider the problem of how to sort a table containing both German and
>> Swedish names - any choice you make will be totally wrong to the other
>> nationality, and hence you need to consider not the origin of the data but
>> the origin of the consumer!
> AM: This is what I think is exaclty wrong; you do not store in a
> table German and Swedish names together in the same column!
> Say someone hits your site and you get from the language User-Agent
> headers her browser is setting to preferably handle Hungarian then you
> have a large database with foreign names data from a wild array of
> nationalities/character sets (like the immigrant database at Ellis
> Island). Now, say these guys at Ellis Island for
> 'let-me-know-about-your-life' issues decide to include the original
> names in their langs. (an incredible lot of people's names where
> change to the 'gringo' 'Mary' and 'John')
> Would you design a monster table with all names and then set the
> column char set as Unicode or have different tables for the different
> charset+collation pairs + a primary/foreign Key design of the
> database, making your Hungarian user life faster/easier?
>
> If you read into daddy's E.F. Codd defined well-structured 'normal
> forms' of relations and 'normalization', you should not mix 'different
> type of data' in the same column (and the charset+collation makes
> these data differ, don't they?)
>
> And I do think all 'theories' are 'unpractical' until proven
> otherwise
>
>> . . . Getting hold of people with experience in other languages . . .
> AM: I do have 'experience in other languages', threee of them,
> although they are all western ones :-(
> I think this problem can be reduced to a simple mathematical one. You
> don't need to know may people to understand this problem.
>
>> You'll also learn that issues such as sorting (your "way slower" comment)
>> are in many ways the least of your problems.
> AM: . . . 'least of your problems' . . . when you will have to set
> indexes in these types of columns, since they are 'free text'
> searchable ones?
> For me, a developer, "way slower" means anything that would run more
> than 30% slower. Being ready/able to have three 'customer' instead of
> two makes a difference in business as well as in life. DBMS issues
> (and their related IO ones) is the number one performance issue in
> large DBMS-based software development

look it is very simple.

If your app is ONLY EVER going to use a language that can be encoded by
using ascii and single byte data, then that is fine.

if you are going to support multi languages then you MUST use some sort of
multi byte system 2-3 bytes per character, FOR ALL DATA.

so yes you have 1 multi byte enabled table. and you put all the *** in 1
table.

let's consider for a moment how you would handle your system of multi
tables.

say your application deals with single byte and muilti byte data, how for
example are you Going to :

1.spit & con-cat strings ? is it 1 byte or is it 2 bytes, or 3
2.how do you check the length of a string?, is it 200 bytes for a single line
of address data or is it 400, or 600.

3.how are you going to build your sql strings to select the different tables?
4. what happens if the user puts english & chinese characters together on the
same line of data?
( in china & asia we mix asian glifs & roman together)
which table does the resulting string go in?

are you going to check which country the user is in and make dangerous
assumptions?
( origin of the consumer?), say i'm in England, using Chinese windows?

or are you going to scan the string & see if you can stick it in a single
byte table or a multi byte table?

you are going to bury yourself in the logistics of trying to deal with
different encoding lengths.

using 1 table is NOT slower, tables can be partitioned ( well in oracle they
can)
so you could store the data in 1 table, then partition the table on a
"language column"

the only time i have found using multi tables a good idea, is when i have 2
clearly defined languages English & Chinese,.
When i have to store the same address in both forms ( 1 for the english
staff, 1 for the asian staff) AND not every English address has a Chinese
equivalent. ( or visa versa, depending on your point of view)

but i still have to have 2 sets of duplicate edits screens, ( luckily its is
only 10 data fields), and that is such a pain , i am trying to reduce it
back to 1 set of screens.

you need to get over your fixation with "how big the data is" & how slow is
my database going to be.

storage is just too cheap these days.
top end databases , are so optimized these days that they spend a large part
of their time "asleep"


Loading