Re: Unicode Support




Many thanks, Beth,

[about UTF-8] this was the information I missed.

| I know you prefer concise tables, so let's see if I can make a simple table
| for it:

| U-00000000 to U-0000007F: 0xxxxxxx

** U-00000080 to U-000000BF: 10xxxxxx also just a single byte?
OK, no, as mentioned under point 3 below.

| U-00000080 to U-000007FF: 110xxxxx 10xxxxxx
| U-00000800 to U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx

Let me check the latter:
xxxx'xxxxxx'xxxxxx gives 2^16 values, so would the range be 00800..107FF ??
I assume the 'overflowing' values aren't used then,
or indicate an error condition, as mentioned for the overlong forms.

| U-00010000 to U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

| *U-00200000 to U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
| *U-04000000 to U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
|
| [ * Technically, it is intended that no UNICODE character will ever go
| beyond 10FFFFh (a little over a million code points)...so, though these
| encodings are defined for UTF-8 (up to 2^31), you shouldn't ever actually
| see these in any UTF-8 file... ]
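
As a quick sanity check of the table, here is a minimal Python sketch of an encoder for the one- to four-byte forms. The name `utf8_encode` is made up for illustration, the five- and six-byte forms are left out because no Unicode character goes beyond 10FFFFh, and surrogates are not checked, so treat it as a sketch rather than a complete encoder.

```python
def utf8_encode(cp):
    """Encode one code point (0..0x10FFFF) following the table above.

    Hypothetical helper for illustration; does not reject surrogates.
    """
    if cp < 0:
        raise ValueError("negative code point")
    if cp <= 0x7F:
        return bytes([cp])                           # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),              # 110xxxxx
                      0x80 | (cp & 0x3F)])           # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),             # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),     # 10xxxxxx
                      0x80 | (cp & 0x3F)])           # 10xxxxxx
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),             # 11110xxx
                      0x80 | ((cp >> 12) & 0x3F),    # 10xxxxxx
                      0x80 | ((cp >> 6) & 0x3F),     # 10xxxxxx
                      0x80 | (cp & 0x3F)])           # 10xxxxxx
    raise ValueError("beyond the Unicode range")


print(utf8_encode(0x03A0).hex())   # Greek capital pi
```

Each branch just shifts the code point's bits into the "x" positions of the matching row, lowest six bits last.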

I see, the format is logically organised and easy to understand
[just count the set MSBs of the first byte to get the number of bytes
involved, and the valid start value of each group can be calculated in a
few lines]. The sync bits in every continuation byte (25% waste) may make
sense for serial transmission.
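
The "count the set MSBs" idea and the "start value of a group" calculation could look roughly like this in Python. The function names are hypothetical; the arithmetic only restates the table: an n-byte form (n >= 2) carries 5n+1 payload bits, and its first valid code point is the first value that no longer fits into the (n-1)-byte form. For the three-byte form that means 16 payload bits covering 0800h..FFFFh, with the values below 0800h being the overlong ones.

```python
def sequence_length(lead):
    """Bytes announced by a lead byte; 0 means a 10xxxxxx continuation byte."""
    if lead < 0x80:
        return 1                     # plain ASCII
    n, mask = 0, 0x80
    while lead & mask:               # count the leading 1 bits
        n += 1
        mask >>= 1
    if n == 1:
        return 0                     # continuation byte, not a lead byte
    if n > 6:
        raise ValueError("FEh/FFh never occur in UTF-8")
    return n


def payload_bits(n):
    """Number of 'x' bits in an n-byte sequence."""
    return 7 if n == 1 else 5 * n + 1


def first_code_point(n):
    """Smallest code point that genuinely needs n bytes."""
    return 0 if n == 1 else 1 << payload_bits(n - 1)


for n in range(1, 7):
    print(n, payload_bits(n), hex(first_code_point(n)))
```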


| Examples:
|
| So, for U-0041 (which is plain old ASCII "A" :), the byte is exactly the
| same as it would be in ASCII:
|
| 01000001 (or 41h)
|
| For U-03A0 (which is Greek letter "pi" :), the sequence would be:
|
| 11001110 10100000 (or CEh A0h)
|
| For U-2200 (which is the mathematical "for all" character: That is, an
| upside-down capital A :), the sequence would be:
|
| 11100010 10001000 10000000 (or E2h 88h 80h)
|
| ...so forth...I think you can work out the rest yourself...pick your
| UNICODE character then check what "range" it's in with the table
| above...place its binary digits into the spaces marked with the "x"...
|
| Points to note:
|
| 1. 7-bit ASCII characters are encoded in exactly the same way in UTF-8
|
| 2. Byte-based (no "endianness" worries; Does not require any "BOM" ("byte
| order mark") character at the start of the file or anything like that, as
| Microsoft advise should be in their plain text file UTF-16
| encoding...advice, in fact, that they were overstepping the mark to give
| because UNICODE themselves do not specify this as anything but
| "optional"...but, well, that's Microsoft for you, eh? ;)...
|
| 3. All non-ASCII characters use a multi-byte sequence
|
| 4. Each byte in that multi-byte sequence has the highest bit set (so, it's
| clear what is ASCII and what is non-ASCII and they shouldn't become
| confused)...
|
| 5. The first byte of the sequence has as many highest bits set as there are
| bytes in the entire sequence (e.g. "110xxxxx 10xxxxxx" starts with two set
| bits in the first byte, so the sequence is two bytes long :)...
|
| 6. After the first byte, all further bytes are of the form "10xxxxxx"
| (providing another extra 6 bits of "address range" per byte :)...
|
| 7. The bytes 0xFE and 0xFF are never used in UTF-8 at all...
|
| 8. The first byte of a non-ASCII character is in the range C0h to FDh,
| subsequent bytes in a multi-byte sequence are in the range 80h to BFh,
| ASCII characters are in the range 00h to 7Fh...you can use this for easy
| resynchronisation (if you start reading in the middle of a multi-byte
| sequence, you can _know_ that this is the case by what range the byte is in
| :)...
|
| 9. All UNICODE characters are available (ASCII bytes are still just one
| byte long, all 16-bit "BMP" characters one to three bytes long, all other
| defined UNICODE characters four bytes...being "variable-length" then size
| of files dependent on what characters encoded: If all ASCII, no different
| from ordinary ASCII file...if all "upper range" Chinese ideographs, then
| UTF-8 encoding is 4 bytes long (though, note that UTF-16 - with 16-bit per
| character, as Windows uses - is also 4 bytes long for those, so the only
| case where UTF-8 comes out bigger than UTF-16 is the BMP characters from
| U-0800 upwards, which take three bytes instead of two))...
|
| NOTE: You might notice that it is possible to create what are known as
| "overlong forms" for characters...for example, you could unnecessarily take
| two bytes to encode an ASCII character:
|
| 11000001 10000001 (or C1h 81h)
|
| Instead of the simpler:
|
| 01000001 (or 41h :)
|
| These "overlong forms" are _INVALID_ UTF-8...they should be rejected as
| errors, if encountered...note that these "overlong forms", as the name
| suggests, are _unnecessarily_ long, so rejecting these also ensures the
| shortest possible encoding of the character, as well as "normalising"
| everything so that "comparisons" are easy (i.e. there is only _one_ valid
| way to encode any particular character :)...
|
| In order to help detect "overlong forms" and reject them, here's another
| simple table...if any of these bit patterns are detected, you have an
| invalid "overlong form" (these are all invalid sequences in UTF-8):
|
| 1100000x (10xxxxxx)
| 11100000 100xxxxx (10xxxxxx)
| 11110000 1000xxxx (10xxxxxx 10xxxxxx)
| 11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)
| 11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx)
|
| [ Note that you can work out if it's "overlong" from the first or second
| bytes, irrespective of what the remaining bytes are (which I've placed in
| brackets)... ]
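
The overlong table translates almost one-to-one into mask-and-compare tests on the first two bytes. A minimal Python sketch, with `is_overlong` as a made-up name; it assumes the caller has already verified that the bytes form an otherwise well-shaped sequence:

```python
def is_overlong(first, second):
    """True if the first two bytes match one of the overlong patterns."""
    if (first & 0xFE) == 0xC0:                       # 1100000x
        return True
    if first == 0xE0 and (second & 0xE0) == 0x80:    # 11100000 100xxxxx
        return True
    if first == 0xF0 and (second & 0xF0) == 0x80:    # 11110000 1000xxxx
        return True
    if first == 0xF8 and (second & 0xF8) == 0x80:    # 11111000 10000xxx
        return True
    if first == 0xFC and (second & 0xFC) == 0x80:    # 11111100 100000xx
        return True
    return False


print(is_overlong(0xC1, 0x81))   # the two-byte "A" from the example
print(is_overlong(0xCE, 0xA0))   # the regular encoding of U-03A0
```

Python's own decoder rejects the same input: `bytes([0xC1, 0x81]).decode("utf-8")` raises `UnicodeDecodeError`.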

| As you can see, it's quite a nice "encoding" for UNICODE because it leaves
| ASCII alone...and can access any defined UNICODE character in up to four
| bytes (which is no worse in size than UTF-16 for the "worst case" (a file
| of only "upper range" Chinese ideographs and nothing else...which is
| probably not any file you or I would ever want to create, eh? ;) but will
| generally be smaller in most cases...right down to, as noted, being exactly
| the same size as ASCII, if the file is only 7-bit ASCII characters)...for
| easy parsing, all non-ASCII multi-byte sequences have their highest bit set
| for every byte in that sequence...and the first byte of that sequence is in
| a different "range" to the subsequent bytes, with the number of "high bits"
| set being equal to the number of bytes in the entire sequence...

Yes, got it. The implementation won't interfere with my text routines,
as UTF-8 has no 00 bytes in its multi-byte sequences.
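
A small Python check of that property, as a sketch only: every byte of a multi-byte sequence has its top bit set, so a 00 byte can only ever be the encoding of U-0000 itself and a zero-terminated scan still finds the right end. The helper names and the sample code points are arbitrary choices for illustration.

```python
def multibyte_bytes_all_high(cp):
    """True if cp encodes to several bytes and every one of them is >= 80h."""
    encoded = chr(cp).encode("utf-8")
    return len(encoded) > 1 and all(b >= 0x80 for b in encoded)


def zero_terminated_length(buf):
    """Index of the first 00 byte, like a strlen over a byte buffer."""
    return buf.index(0)


samples = [0x80, 0x3A0, 0x7FF, 0x800, 0x2200, 0xFFFD, 0x10000, 0x10FFFF]
print(all(multibyte_bytes_all_high(cp) for cp in samples))

text = "A\u03a0\u2200".encode("utf-8")
print(zero_terminated_length(text + b"\x00") == len(text))
```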

All I'd need to go for it is one additional conversion table
('my > 7f' -> UTF-8) and to insert a single instruction into my
text interpreter:

textout:
.... ;getchar
or al,al ;set flags from the character in al
jz NextFunction ;end of text or whatsoever
js UNIcode ;*** and there the story continues :)
call copychar ;display/print/file/buffer/..
inc ecx ;string cursor index
jmp textout
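
For readers who prefer a high-level view, the same dispatch can be sketched in Python. This is only an illustration of the control flow: what the UNIcode branch really does is not shown in the post, so here it is assumed to collect the lead byte plus all following 10xxxxxx bytes and decode them as one character.

```python
def text_out(buf):
    """Walk a zero-terminated byte buffer the way the textout loop does."""
    out = []
    i = 0
    while i < len(buf) and buf[i] != 0:        # jz NextFunction
        if buf[i] & 0x80:                      # js UNIcode (sign bit set)
            j = i + 1
            # assumed behaviour: swallow every continuation byte that follows
            while j < len(buf) and (buf[j] & 0xC0) == 0x80:
                j += 1
            out.append(buf[i:j].decode("utf-8"))   # raises on invalid input
            i = j
        else:
            out.append(chr(buf[i]))            # call copychar
            i += 1                             # inc ecx
    return "".join(out)


data = "A\u03a0\u2200B".encode("utf-8") + b"\x00ignored"
print(text_out(data) == "A\u03a0\u2200B")
```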


[..the why..]
Yes, a KESYS browser has been on my long-term schedule for a while ;)

[..]
| Also, there are some "special characters" that require "processing",
| yes...but the vast majority of characters are just characters...you just
| store them in strings...so, other than that the storage is bigger, what's
| the difference if it's 95 characters, 950 characters or 9500 characters?
| The only people who are going to have problems with this are the font
| designers, drawing them all...but then, it is allowable to create fonts
| that only covers certain "ranges"...and programmers can just use fonts from
| the professional font designers...

You know, doubling the size of all text is a very big increase,
especially for someone like me who fights every wasted bit. ;)
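
How big the increase is depends on the encoding. Assuming the doubling refers to 16-bit characters, a short Python comparison (using the built-in codecs, with "utf-16-le" chosen so that no BOM is counted) shows the sizes side by side; `encoded_sizes` is just an illustrative name.

```python
def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes for the given string."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for sample in ["plain ASCII", "\u03a0", "\u2200", "\U00020000"]:
    print(repr(sample), encoded_sizes(sample))
```

Pure ASCII text stays the same size in UTF-8 and doubles only in UTF-16; a BMP character from U-0800 upwards is the one case where UTF-8 needs a byte more.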

| Indeed, "one standard with too many characters" might be a problem...but
| consider "too many standards with not enough characters" instead:
| This is a worse fate ;)...

Agreed, I can imagine standard font support limited to 'basic needs',
like the few pages I printed out from my UNIcode4.0 file.


['Babylonian confusion'..]

| > Wouldn't it help a lot if the whole globe talks only one language?

| No, not in the slightest...I find that a terrible, horrible,...
| no...never...it's an awful idea...

:) Of course, local dialects and mother-tongue will remain alive anyway.

| What might be good is if the whole world had a shared _second language_,
| perhaps...and through this language people could always communicate...
| but the willing destruction of languages and culture and history?
| Absolutely NOT...

Everyone on this globe should know English. That's what I meant.

[side stories...] interesting.

[Druids...]
there are still more around than you may think ...
Understanding nature in a communicative way,
using roots and herbs instead of pharmaceutical products,
'control kids and pets' by charisma rather than by force,..


| > That's very true indeed*, isn't it? *)indead for the French :)

| Oh, no...I salute the French ...

This joke was just about how Rene usually types/spells 'indeed'.

| Let us not forget that Liberty is a French girl...

I've been there, a really b i g female :)

[Euro-'vision']
| Oh, one last thing: Could countries actually vote for the _music_ rather
| than the "block voting" for each other's neighbours in a "political" way?
| Oh, look, all the Eastern European countries are voting for each other!

What would you expect? These voters are human.

| ... I Love the Eurovision...it's brilliant! ;)

It makes me switch channels (if you've seen one, you've seen them all).
You may have seen the Austrian participant last year, we sent a Clown...

__
wolfgang

