Re: Unicode Support
- From: "wolfgang kern" <nowhere@xxxxxxxxxxx>
- Date: Sat, 23 Apr 2005 02:17:28 +0200
Many Thanks Beth,
[about UTF-8] this was the information I missed.
| I know you prefer concise tables, so let's see if I can make a simple table
| for it:
| U-00000000 to U-0000007F: 0xxxxxxx
** U-00000080 to U-000000BF: 10xxxxxx also just a single byte?
OK, no, as mentioned under point 3 below.
| U-00000080 to U-000007FF: 110xxxxx 10xxxxxx
| U-00000800 to U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
Let me check on the latter:
xxxx'xxxxxx'xxxxxx gives a range of 2^16, i.e. 00800..107FF ??
I assume the 'overflowing' values aren't used then,
or may indicate an error condition, as mentioned for the overlong forms.
| U-00010000 to U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
| *U-00200000 to U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
| *U-04000000 to U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
|
| [ * Technically, it is intended that no UNICODE character will ever go
| beyond 10FFFFh (17 planes of 64K code points each)...so, though these
| encodings are defined for UTF-8 (up to 2^31), you shouldn't ever actually
| see these in any UTF-8 file... ]
I see, the format is logically organised and easy to understand
[just count the set MSBs of the first byte to get the number of bytes involved,
and the valid start value of each group can be calculated in a few lines].
The sync bits in every byte (25% waste) may make sense for serial
transmission.
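
The "count the set MSBs" rule can be sketched in a few lines of Python. This is only an illustration of the table quoted above, not code from the thread; the name `sequence_length` is hypothetical, and returning 0 for bytes that cannot start a sequence is an assumed convention.

```python
def sequence_length(lead: int) -> int:
    """Return how many bytes the sequence starting with `lead` occupies.

    Follows the original 1- to 6-byte table; returns 0 (an assumed
    convention) for a byte that cannot start a sequence.
    """
    if lead < 0x80:              # 0xxxxxxx: plain ASCII, one byte
        return 1
    count = 0
    mask = 0x80
    while lead & mask:           # count the leading set bits
        count += 1
        mask >>= 1
    # One leading 1 is a continuation byte (10xxxxxx);
    # seven or eight leading 1s are 0xFE / 0xFF, never used in UTF-8.
    return count if 2 <= count <= 6 else 0
```

For example, `sequence_length(0xCE)` gives 2 and `sequence_length(0xE2)` gives 3, matching the "pi" and "for all" examples quoted further down.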
| Examples:
|
| So, for U-0041 (which is plain old ASCII "A" :), the byte is exactly the
| same as it would be in ASCII:
|
| 01000001 (or 41h)
|
| For U-03A0 (which is Greek letter "pi" :), the sequence would be:
|
| 11001110 10100000 (or CEh A0h)
|
| For U-2200 (which is the mathematical "for all" character: That is, an
| upside-down capital A :), the sequence would be:
|
| 11100010 10001000 10000000 (or E2h 88h 80h)
|
| ...so forth...I think you can work out the rest yourself...pick your
| UNICODE character then check what "range" it's in with the table
| above...place its binary digits into the spaces marked with the "x"...
|
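
The "place its binary digits into the x's" recipe can be sketched as follows. This is a minimal Python illustration under stated assumptions: the name `encode_utf8` is hypothetical, only the one- to four-byte forms are produced, the upper limit used is Unicode's 10FFFFh, and the check for surrogate code points (D800h to DFFFh) is left out for brevity.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point using the range table (1 to 4 bytes)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:                # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:               # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:             # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Feeding it the three examples gives `encode_utf8(0x41).hex()` = `41`, `encode_utf8(0x3A0).hex()` = `cea0` and `encode_utf8(0x2200).hex()` = `e28880`.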
| Points to note:
|
| 1. 7-bit ASCII characters are encoded in exactly the same way in UTF-8
|
| 2. Byte-based (no "endianness" worries); does not require any "BOM" ("byte
| order mark") character at the start of the file or anything like that, as
| Microsoft advise for their plain-text UTF-16 files...advice, in fact,
| that oversteps the mark, because UNICODE themselves specify this as
| nothing more than "optional"...but, well, that's Microsoft for you,
| eh? ;)...
|
| 3. All non-ASCII characters use a multi-byte sequence
|
| 4. Each byte in that multi-byte sequence has the highest bit set (so, it's
| clear what is ASCII and what is non-ASCII and they shouldn't become
| confused)...
|
| 5. The first byte of the sequence has as many highest bits set as there are
| bytes in the entire sequence (e.g. "110xxxxx 10xxxxxx" starts with two set
| bits in the first byte, so the sequence is two bytes long :)...
|
| 6. After the first byte, all further bytes are of the form "10xxxxxx"
| (providing another extra 6 bits of "address range" per byte :)...
|
| 7. The bytes 0xFE and 0xFF are never used in UTF-8 at all...
|
| 8. The first byte of a non-ASCII character is in the range C0h to FDh,
| subsequent bytes in a multi-byte sequence are in the range 80h to BFh,
| ASCII characters are in the range 00h to 7Fh...you can use this for easy
| resynchronisation (if you start reading in the middle of a multi-byte
| sequence, you can _know_ that this is the case by what range the byte is in
| :)...
|
| 9. All UNICODE characters are available (ASCII bytes are still just one
| byte long, all 16-bit "BMP" characters one to three bytes long, all other
| defined UNICODE characters four bytes)...being "variable-length", the size
| of a file depends on what characters it encodes: If all ASCII, no different
| from an ordinary ASCII file...if all characters beyond the BMP, then UTF-8
| needs 4 bytes each (though, note that UTF-16 - with 16-bit units, as
| Windows uses - also needs 4 bytes for these)...the one case where UTF-8
| comes out bigger is BMP characters from U-0800 upwards, such as most
| Chinese ideographs: 3 bytes in UTF-8 against 2 in UTF-16...
|
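
The resynchronisation idea from point 8 can be sketched like this. It is only an illustration in Python: the names `byte_kind` and `resync` are hypothetical, and the ranges are the ones given in point 8 (so C0h and C1h count as lead bytes here even though they can only begin overlong forms).

```python
def byte_kind(b: int) -> str:
    """Classify a single byte by the ranges from point 8."""
    if b <= 0x7F:
        return "ascii"          # 00h..7Fh
    if b <= 0xBF:
        return "continuation"   # 80h..BFh
    if b <= 0xFD:
        return "lead"           # C0h..FDh
    return "invalid"            # FEh, FFh never appear in UTF-8


def resync(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after pos."""
    while pos < len(data) and byte_kind(data[pos]) == "continuation":
        pos += 1                # skip the tail of a sequence we landed inside
    return pos
```

With `data = b"A\xce\xa0\xe2\x88\x80"`, starting at index 2 (the second byte of "pi") `resync(data, 2)` moves forward to index 3, the start of the next character.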
| NOTE: You might notice that it is possible to create what are known as
| "overlong forms" for characters...for example, you could unnecessarily take
| two bytes to encode an ASCII character:
|
| 11000001 10000001 (or C1h 81h)
|
| Instead of the simpler:
|
| 01000001 (or 41h :)
|
| These "overlong forms" are _INVALID_ UTF-8...they should be rejected as
| errors, if encountered...note that these "overlong forms", as the name
| suggests, are _unnecessarily_ long, so rejecting these also ensures the
| shortest possible encoding of the character, as well as "normalising"
| everything so that "comparisons" are easy (i.e. there is only _one_ valid
| way to encode any particular character :)...
|
| In order to help detect "overlong forms" and reject them, here's another
| simple table...if any of these bit patterns are detected, you have an
| invalid "overlong form" (these are all invalid sequences in UTF-8):
|
| 1100000x (10xxxxxx)
| 11100000 100xxxxx (10xxxxxx)
| 11110000 1000xxxx (10xxxxxx 10xxxxxx)
| 11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)
| 11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx)
|
| [ Note that you can work out if it's "overlong" from the first or second
| bytes, irrespective of what the remaining bytes are (which I've placed in
| brackets)... ]
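
The overlong table translates almost directly into bit masks. The following Python sketch assumes the input starts at a lead byte; the names `OVERLONG_SECOND_BYTE_MASKS` and `is_overlong` are hypothetical, and the function checks only for the overlong patterns, not for any other kind of invalid sequence.

```python
# Lead byte -> mask for the second byte. A sequence is overlong when
# (second & mask) == 0x80, i.e. the bits shown as 0 in the table are all 0.
OVERLONG_SECOND_BYTE_MASKS = {
    0xE0: 0xE0,   # 11100000 100xxxxx
    0xF0: 0xF0,   # 11110000 1000xxxx
    0xF8: 0xF8,   # 11111000 10000xxx
    0xFC: 0xFC,   # 11111100 100000xx
}


def is_overlong(seq: bytes) -> bool:
    """Return True if seq (starting at a lead byte) is an overlong form."""
    if not seq:
        return False
    lead = seq[0]
    if (lead & 0xFE) == 0xC0:             # 1100000x: always overlong
        return True
    mask = OVERLONG_SECOND_BYTE_MASKS.get(lead)
    if mask is None or len(seq) < 2:
        return False
    return (seq[1] & mask) == 0x80
```

As expected, `is_overlong(b"\xc1\x81")` is true for the two-byte "A" above, while `is_overlong(b"\xce\xa0")` is false.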
| As you can see, it's quite a nice "encoding" for UNICODE because it leaves
| ASCII alone...and can access any defined UNICODE character in up to four
| bytes (the same four bytes UTF-16 needs for characters beyond the BMP; only
| for BMP characters from U-0800 upwards, as in a file of nothing but Chinese
| ideographs, is it bigger, at three bytes against two...which is
| probably not any file you or I would ever want to create, eh? ;) but will
| generally be smaller for mostly-ASCII text...right down to, as noted, being
| exactly
| the same size as ASCII, if the file is only 7-bit ASCII characters)...for
| easy parsing, all non-ASCII multi-byte sequences have their highest bit set
| for every byte in that sequence...and the first byte of that sequence is in
| a different "range" to the subsequent bytes, with the number of "high bits"
| set being equal to the number of bytes in the entire sequence...
Yes, got it. The implementation won't interfere with my text routines,
as UTF-8 has no 00's in the extensions.
All I'd need to go for it is one additional conversion table
('my > 7f' -> UTF-8) and a single instruction inserted into my
text interpreter:
textout:
  ....             ;getchar
  or   al,al       ;set flags from the character
  jz   NextFunction ;end of text or whatsoever
  js   UNIcode     ;*** and there the story continues :)
  call copychar    ;display/print/file/buffer/..
  inc  ecx         ;string cursor index
  jmp  textout
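
One way the story could continue behind the UNIcode label is sketched below in Python rather than assembly. It is an assumption, not the actual KESYS code: the name `text_out` is hypothetical, the routine simply collects code points instead of displaying them, handles only two- to four-byte sequences, and does not reject overlong forms.

```python
def text_out(data: bytes) -> list:
    """Walk a zero-terminated UTF-8 string and return its code points."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0:                        # jz NextFunction: end of text
            break
        if b < 0x80:                      # sign bit clear: copy plain ASCII
            out.append(b)
            i += 1
            continue
        # js UNIcode: high bit set, so decode a multi-byte sequence
        if (b & 0xE0) == 0xC0:
            n, cp = 2, b & 0x1F
        elif (b & 0xF0) == 0xE0:
            n, cp = 3, b & 0x0F
        elif (b & 0xF8) == 0xF0:
            n, cp = 4, b & 0x07
        else:
            raise ValueError("byte cannot start a sequence")
        tail = data[i + 1:i + n]
        if len(tail) != n - 1 or any((t & 0xC0) != 0x80 for t in tail):
            raise ValueError("truncated or malformed sequence")
        for t in tail:                    # append 6 payload bits per byte
            cp = (cp << 6) | (t & 0x3F)
        out.append(cp)
        i += n
    return out
```

Calling `text_out(b"A\xce\xa0\xe2\x88\x80\x00")` yields the code points 41h, 3A0h and 2200h; the extra branch costs nothing for pure ASCII text, which is the point of the single inserted `js`.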
[..the why..]
Yes, a KESYS browser has been on my long-term schedule for a while ;)
[..]
| Also, there are some "special characters" that require "processing",
| yes...but the vast majority of characters are just characters...you just
| store them in strings...so, other than that the storage is bigger, what's
| the difference if it's 95 characters, 950 characters or 9500 characters?
| The only people who are going to have problems with this are the font
| designers, drawing them all...but then, it is allowable to create fonts
| that only covers certain "ranges"...and programmers can just use fonts from
| the professional font designers...
You know, doubling the size of all text is a very big increase,
especially for someone like me who fights every wasted bit. ;)
| Indeed, "one standard with too many characters" might be a problem...but
| consider "too many standards with not enough characters" instead:
| This is a worse fate ;)...
Agreed. I can imagine standard font support limited to 'basic needs',
like the few pages I printed out from my UNIcode 4.0 file.
['Babylonian confusion'..]
| > Wouldn't it help a lot if the whole globe talks only one language?
| No, not in the slightest...I find that a terrible, horrible,...
| no...never...it's an awful idea...
:) Of course, local dialects and mother-tongue will remain alive anyway.
| What might be good is if the whole world had a shared _second language_,
| perhaps...and through this language people could always communicate...
| but the willing destruction of languages and culture and history?
| Absolutely NOT...
Everyone on this globe should know English. That's what I meant.
[side stories...] interesting.
[Druids...]
there are still more around than you may think ...
Understanding nature in a communicative way,
using roots and herbs instead of pharmaceutical products,
'control kids and pets' by charisma rather than by force,..
| > That's very true indeed*, isn't it? *)indead for the French :)
| Oh, no...I salute the French ...
This joke was just about how Rene usually types/spells 'indeed'.
| Let us not forget that Liberty is a French girl...
I've been there, a really b i g female :)
[Euro-'vision']
| Oh, one last thing: Could countries actually vote for the _music_ rather
| than the "block voting" for each other's neighbours in a "political" way?
| Oh, look, all the Eastern European countries are voting for each other!
What would you expect, these voters are humans.
| ... I Love the Eurovision...it's brilliant! ;)
It makes me switch channels (if you've seen one, you've seen them all).
You may have seen the Austrian participant last year; we sent a clown...
__
wolfgang