Re: Unicode question



Hans-Peter Diettrich wrote:

I thought using UCS-4 chars would be good enough for that. Am I
missing something?

You're missing the existence and continued use of UTF-8 and UTF-16.

I am not ignoring them.

All that I am doing differently (it seems) is treating each and every
encoding as a necessary evil for external representation --i.e. something I
need (have) to use when talking to outside code.

Outside code (that expects different formats) is a reality I have to
live with.

What I cannot bear is having to assimilate (or be assimilated by)
outside code.

I want to use an extended string that's good enough to cover UCS-4
chars and is reference counted.

Since CG seems to be having a hard time finding a good name for this
sort of thing --since they have already consumed the 'widestring'
name--, they could call it *extendedstring* :)

I only want to be sure that a Unicode string can hold any
code point, but I don't worry or make assumptions about its
encoding.

Precisely. This is what I am after too.

How are you involved with such details in applications, apart from
text processors?

I have been doing a linguistic/morphology project for years. This is my
pet project. I plan to publish it in the same time frame as Duke Nukem
Forever ;P

http://www.3drealms.com/duke4/
http://en.wikipedia.org/wiki/Duke_Nukem_Forever

How should I know how to split e.g. Chinese strings into fields,
unless the strings contain known (CSV...) field separators? I would
only split such strings on the known separators (punctuators...),
and for this purpose a UTF-8 representation works just fine and fast.
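
A minimal sketch of why splitting on a known ASCII separator is safe on raw UTF-8 bytes: every byte of a multi-byte UTF-8 sequence has its high bit set, so a comma byte (0x2C) can only ever be a real comma. The helper name `split_fields_utf8` and the sample record are made up for illustration.

```python
def split_fields_utf8(raw: bytes, sep: bytes = b",") -> list:
    """Split UTF-8 bytes on a single ASCII separator, then decode each field."""
    if len(sep) != 1 or sep[0] >= 0x80:
        raise ValueError("separator must be a single ASCII byte")
    # Splitting happens on the encoded bytes; no decoding is needed to find
    # the separators because ASCII bytes never occur inside a multi-byte
    # UTF-8 sequence.
    return [field.decode("utf-8") for field in raw.split(sep)]


# Hypothetical CSV-like record with Chinese field values and ASCII commas.
record = "北京,上海,广州".encode("utf-8")
fields = split_fields_utf8(record)
print(fields)
```

Note that this only covers separators that are known in advance; it says nothing about where words begin or end inside a field.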

These sorts of issues are totally beyond the scope of today's Unicode
work.

ATM, all they are attempting is to handle case folding and text
normalization, and collations. IOW, mostly trivial issues.

What you're referring to is way too big. Each lang or dialect may have
different rules about how you separate a word/string.

We don't have all those rules yet.

Nor do we have good enough data: For that, we might need something like
8-byte chars --4 to represent each lang/dialect, and 4 to represent the
codepoint.

I doubt it will happen in my lifetime --wish it did though.

Since UTF-16 chars can be 2 or 4 bytes, one has to be always alert
about it --I can see a lot of bugs due to this. Whereas, UCS-4 is 4
bytes all the time. One less thing to watch out for.
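
To make the size difference concrete, here is a small sketch that encodes a few sample code points both ways. The function name `encoded_sizes` and the choice of samples are mine, not from the thread; the little-endian codecs are used only so that no byte order mark gets counted.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-16 byte count, UTF-32 byte count) for one code point."""
    # The "-le" codecs do not prepend a byte order mark.
    return len(ch.encode("utf-16-le")), len(ch.encode("utf-32-le"))


# The last sample (U+1D11E) lies outside the Basic Multilingual Plane,
# so UTF-16 needs a surrogate pair for it.
samples = ["A", "\u00e9", "\u4e2d", "\U0001D11E"]
for ch in samples:
    utf16, utf32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-16 = {utf16} bytes, UTF-32 = {utf32} bytes")
```

Only the last sample takes 4 bytes in UTF-16; all of them take 4 bytes in UTF-32 (UCS-4).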

See above, sometimes multiple characters must be treated as one unit.
Shouldn't an English "th" or "sh" have a code point of its own, so
that no program would ever try to split such an entity into two
characters? Every language can have such entities, regardless of
their number of inseparable characters or code points. Then accessing
or manipulating single characters (code points), in a string of an
unknown language, is as meaningful as manipulating the bytes of a
UCS-4 character - a very stupid idea :-(
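
A rough illustration of the point, using a combining sequence rather than "th" or "sh" (which have no code point of their own): indexing by code point can cut through something the reader sees as one character, even when every code point is a fixed 4 bytes. The variable names are only for this sketch.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))   # number of code points before normalization
print(len(composed))     # number of code points after normalization
print(decomposed[:1])    # slicing by code point silently drops the accent
```

Normalization helps in this particular case, but not every sequence has a precomposed form, so it does not remove the general problem.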

This is an interesting problem.

On the one hand, if you do not let a sequence of codepoints be treated
as a single glyph, you're up against a long-standing tradition and
calligraphic aesthetics.

On the other hand, if you do, you're opening up a Pandora's box
whereby every Tom, Dick and Harry would --one day-- want their
handwriting to be represented by such glyphs.

If it comes to this, the 8-byte char/codepoint I proposed <G> above
would be insufficient --we'd have to make it 12-byte <VBG>.

Yes. And no. UTF-16 is either 2 bytes or 4 bytes.

And the string consists of anything between 0 and 30 characters...

Yep.

Which is what we've all been doing since the beginning of time :P

Because we have considered only our native language, which in the
English case does not require more than 7 bits for every character.
Other cultures didn't know about a zero digit, or use glyphs for
words, or use distinct numbers for counting different items.

Or, we didn't know/care that other cultures have altogether
different shapes/forms of written communication.