Re: Enhanced Unicode support for "Go" tools
From: Beth (BethStone21_at_hotmail.NOSPICEDHAM.com)
Date: 05/19/04
- Next message: Betov: "Re: Enhanced Unicode support for "Go" tools"
- Previous message: The Wannabee: "Re: Enhanced Unicode support for "Go" tools"
- In reply to: The Wannabee: "Re: Enhanced Unicode support for "Go" tools"
- Next in thread: Evenbit: "Re: Enhanced Unicode support for "Go" tools"
- Reply: Evenbit: "Re: Enhanced Unicode support for "Go" tools"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Wed, 19 May 2004 14:39:19 +0100
The Wannabee wrote:
> Hi. Can someone explain why unicode is needed at all?
Well, I could _try_ to explain it...
> What I have "gotten" so far is that it makes possible using
> languages with more than the normal amount of chars. But does it
> have other important applications than this?
Right, you know ASCII? It defines a number for each of the
normal Latin characters like "A", "B", "Z" (in both cases, so
"a", "b", "z" too ;)...we all know how great and useful ASCII is
because, well, that's what we send to each other here all the
time...and HTML files are just ASCII files with some special
sequences in them...and "INI" files and shell scripts and batch
files are all just ASCII plain text files too...
The problem with ASCII, though - from an international
perspective - is what that "A" stands for: "American"...that is,
ASCII has an American - and, hence, also English - bias to
it...in strict 7-bit ASCII, you can't type things like the
accent characters used in French and other European
languages...in strict 7-bit ASCII, you don't have the British
currency symbol (hence why I say "American" bias and not just
"English" bias...though, in most cases - in fact, all but this
sole example, as far as I can recall off-hand - it doesn't make
any difference and could be more generally regarded as "English
language" bias ;)...
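To make the "number for each character" idea concrete, here's a minimal sketch in Python (my choice of language for illustration, nothing to do with the "Go" tools themselves). The helper names `is_ascii` and `try_ascii` are made up for this example; the non-ASCII characters are written as escapes so the listing itself stays plain ASCII.

```python
# Strict ASCII assigns each character a number in the range 0..127 (7 bits).

POUND = "\u00a3"      # the British currency symbol
E_ACUTE = "\u00e9"    # an accented "e", as used in French


def is_ascii(text):
    """Return True if every character has a 7-bit ASCII number."""
    return all(ord(ch) < 128 for ch in text)


def try_ascii(text):
    """Encode text as ASCII bytes, or return None if it does not fit."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None
```

With these, `ord("A")` is 65 and `ord("a")` is 97, `try_ascii("ca va?")` hands back the same characters as bytes, and `try_ascii(POUND)` gives `None` because there's simply no number for that symbol in 7-bit ASCII.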
And these are only the Latin-based languages...it has basically
no support at all for Cyrillic (used in Russian and other
languages in that area of the world...presumably now including
some of our recent Eastern European friends who've just joined
onto the EU :)...you know, the script with characters that look
like backwards "R" and "N" symbols to the rest of us Latin
scripters...and then they use Hebrew in places like
Israel...Arabic across many places in the Middle-East...Thailand
has its own script...and, of course, Chinese, Korean and
Japanese have their own writing style that even includes those
"ideographic" characters I was talking about before (where
there's one "symbol" that represents an entire word or concept
;)...Africa has lots of scripts too...
Plus, UNICODE isn't just about all the different alphabets out
there being used...it also includes lots of mathematical
symbols, dingbats, "box drawing characters", arrows and shapes,
etc....as well as some scripts included for "historical" and
"academic" purposes like Runic (not used anymore but was once
the original script used around these parts before the Romans
brought the Latin stuff we're all used to today across most of
Western Europe ;)...Ancient Egyptian hieroglyphics was also
proposed for addition last time I looked (something I'd reckon
almost certainly should get approval...though, Tolkien's scripts
have been rejected, I think...probably because one of them looks
almost identical to Runic...well, Tolkien was a linguist and
based his languages on an actual sound linguistic basis)...
Hence, UNICODE also tries to account for things like being able
to store mathematical formulas in their usual notations...or to
support Egyptologists sending documents with hieroglyphics in
them to one another (the Egyptian hieroglyphics, by the way,
actually serve a dual purpose in that they're both alphabetical and
ideographic at the same time...though, the ideographic part is
to the fore...but the alphabetical use of the same characters
was actually a major help in them being able to originally
"crack the code" and be able to start to read it again...no-one
alive actually remembers how it worked, you see, to teach it to
everyone so it had to be re-learnt and "cracked" like it was a
secret code in the war or something ;)...
The basic idea has evolved over time...but, in a nutshell, the
idea of UNICODE was just to make a kind of "big ASCII" that
would work just like ASCII does and have all the same benefits
but it was just, well, _bigger_...so that it could encode all
these other characters in different languages and other symbols
like mathematical symbols (it also includes, for example, the
standard universal symbols for "biohazard" and the plastic
recycling category symbols and so forth too...yeah, it even has
all the chess characters - king, queen, bishop, rook, pawn,
etc. - in black and white, so that you could theoretically "printf"
a chess game without using any graphics code, so long as you had
all the UNICODE support and a font with all the characters in it
;)...
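As a rough sketch of that chess idea (in Python rather than C's "printf", purely for brevity): the board layout, the letter codes and the `render` helper are all invented for this example, and whether the pieces actually show up depends on your terminal and font having the characters. The code points themselves (2654 to 265F hex) are the UNICODE chess symbols.

```python
# Map ASCII piece letters to the UNICODE chess symbols (U+2654..U+265F).
# Upper case letters stand for white pieces, lower case for black.
WHITE = dict(zip("KQRBNP", "\u2654\u2655\u2656\u2657\u2658\u2659"))
BLACK = dict(zip("kqrbnp", "\u265a\u265b\u265c\u265d\u265e\u265f"))
PIECES = {**WHITE, **BLACK}

# The starting position, one string per rank, "." for an empty square.
START = [
    "rnbqkbnr",
    "pppppppp",
    "........",
    "........",
    "........",
    "........",
    "PPPPPPPP",
    "RNBQKBNR",
]


def render(board):
    """Turn rows of piece letters into plain text rows of chess symbols."""
    return "\n".join(
        " ".join(PIECES.get(square, ".") for square in row) for row in board
    )
```

Calling `print(render(START))` should draw the opening position as ordinary text, no graphics code involved.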
It actually started off as two projects (an "ISO" version and
the "UNICODE" project :) but they realised it was kind of daft
to create _two_ "single universal standards" for this so they
merged their stuff together and it's generally called "UNICODE"
(though, it's still an "ISO" standard too, ISO/IEC 10646...just
that standard is now always kept so closely in synch with UNICODE
that the difference between them is highly pedantic ;)...
Before this happened, the solution to the problem of all the
different languages and scripts was "code pages" and different
character sets and "double-byte" stuff...the problem with this
kind of solution, though, was that if I loaded in a file written
with one "code page" into my editor (where my OS is fixed to,
say, the UK "code page" ;) then all the characters would come
out wrongly because the character set the file was written in is
not the same as the character set that the file is being viewed
with...this leads to lots of strange things happening...such as
looking at any DOS text file that uses "box drawing characters"
in Windows...Windows uses so-called "ANSI" which dropped IBM's
"box drawing characters" in order to fit in some of the accented
Latin characters, like that "c" with a "hook" hanging off the
bottom used in French, for example (which ain't in ASCII, so I
can only approximate writing "ca va?" here, as a simple example
;)...if you look at these files in Windows, the characters come
out all wrong and it looks kind of stupid...
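The DOS-versus-Windows mix-up can be reproduced in a few lines. This sketch assumes the DOS file was written under IBM code page 437 and that the Windows "ANSI" viewer is using code page 1252 (the usual Western European one); other code pages would garble things differently. The variable names are just for illustration.

```python
# Three box drawing characters: top-left corner, horizontal line,
# top-right corner (written as escapes to keep this listing ASCII).
BOX_TOP = "\u250c\u2500\u2510"

# The bytes a DOS program would write under IBM code page 437.
dos_bytes = BOX_TOP.encode("cp437")

# What a Windows "ANSI" (code page 1252) viewer makes of the same bytes:
# accented Latin letters and an upside-down question mark.
seen_in_windows = dos_bytes.decode("cp1252")

# Reading the bytes with the code page they were written in gets
# the box drawing characters back.
seen_in_dos = dos_bytes.decode("cp437")
```

The bytes never change; only the table used to interpret them does, which is exactly why the file "looks kind of stupid" in the wrong viewer.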
UNICODE, though, solves this because it's not a bunch of small
character sets but one _massive_ one instead that just happens
to include all the characters you can find in all the other
character sets (this was one of the objectives with the project
to "merge" all the common character sets used elsewhere into the
UNICODE characters so that _any_ kind of "code page" you might
have been using before should be fully convertible to
UNICODE...the basic idea being: "convert to UNICODE following
the rules we've defined, never have to convert your character
sets or care about 'code pages' or any such nonsense ever
again thereafter" ;)...
In a sense, it's not just "a" character set but it's attempting
to be _THE_ character set...one that includes everything all the
others ever did...one that also adds on loads of useful
extensions like the historical scripts, mathematical /
electronic symbols, a whole bunch of "dingbats", shape
characters (square, circle, diamond, box drawing, etc. ;), etc.,
etc....and the point with UNICODE is that it's a standard for
the entire world...so, if I write a document with Japanese Kanji
in it and send it to your machine, then it will still be
Japanese Kanji, no matter where you are in the world or what
your "locale" or "default code page" settings are...
Also, of course, because it supports all of these scripts and
characters in the same character set, then we have another
useful attribute that couldn't be done easily with other
character sets and "code pages"...you can put English right next
to Japanese right next to Arabic right next to Thai all in the
same document...absolutely excellent for some "Learn to speak
Japanese" text file, eh? And, yup, with UNICODE, it could just
be a plain text file and yet have English and Japanese in the
same file (English text explaining, Japanese text as "examples",
yeah? ;)...another useful thing that those "historical" and
"academic" people will like, as they can write a file in English
about their "theories" on Ancient Egyptian hieroglyphics and
actually include those hieroglyphics into the file itself...and,
in light of this, UNICODE also covers "bi-directional" text
too...that is, English reads left-to-right but Arabic reads
right-to-left...well, there are algorithms and "direction"
characters and so forth so that you can have both together and
it just "flips" the writing direction as is appropriate to that
script (but is stored in its "natural order" in the actual
file...something to maintain for stuff like sorting, searching,
spell checking, etc. ;)...
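A small sketch of the "natural order" point, using Python's standard `unicodedata` module: each character carries a direction class, and a display algorithm uses those classes to decide which way to draw the text, while the string itself stays in the order it was typed. The sample words and the `bidi_classes` helper are my own choices for illustration; this only looks up the classes, it does not implement the actual reordering algorithm.

```python
import unicodedata

ENGLISH = "hello"
ARABIC = "\u0645\u0631\u062d\u0628\u0627"   # an Arabic greeting, "marhaba"

# English and Arabic side by side in one string, stored in typing order.
mixed = ENGLISH + " " + ARABIC


def bidi_classes(text):
    """Return the bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]
```

Here `bidi_classes(ENGLISH)` is all "L" (left-to-right) and `bidi_classes(ARABIC)` is all "AL" (Arabic letter, right-to-left), yet `mixed[6]` is still the first Arabic letter as typed. Sorting and searching therefore work on the stored order and only the display "flips".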
UNICODE attempts to cover _the lot_...and, thus, the "point" of
UNICODE is to finally unite the whole entire thing into _one_
character set that everybody all round the world can use...to
get the "universality" that we have with ASCII but without the
American / English / Latin bias...
It started out 16-bits in size (this is why Windows UNICODE is
16-bits, as they jumped aboard early) but it was realised that
16-bits couldn't quite cover everything when all those Chinese /
Japanese / Korean ideographs ("Kanji", as the Japanese call them)
showed up along with everything else...so, it's now been extended
(strictly, to 21-bits, with code points running from 0 up to
10FFFF hex...but the simplest way is to just use 32-bits per
character, which is "UTF-32" ;)...those "UTF" things you hear
people referring to are just ways of encoding UNICODE...as 32-bits
per character all the time would be somewhat wasteful, there's
"UTF-16" where it's 16-bits per character but two special ranges
of 16-bit values (the "surrogates") are reserved so that a _pair_
of them can "escape" to the characters beyond the first 16-bits
(which is normally okay for most things as all the "main"
characters are in the first 16-bits - called the BMP ("basic
multilingual plane" ;) - which was the original UNICODE...the
stuff beyond that is more ideographs and more "obscure" stuff like
Ancient Egyptian and things most people don't want to be using)...
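For the curious, the UTF-16 "escape" works out as a bit of arithmetic on what are usually called surrogate pairs: a high surrogate in the range D800 to DBFF hex followed by a low surrogate in the range DC00 to DFFF hex. Here's a minimal sketch; the function name `to_utf16_units` is made up for this example, and in real code you'd normally let the language's own encoder do this.

```python
def to_utf16_units(code_point):
    """Return the 16-bit UTF-16 code units (as ints) for one code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable UNICODE code point")
    if code_point < 0x10000:
        return [code_point]             # inside the BMP: a single unit
    offset = code_point - 0x10000       # leaves a 20-bit number
    high = 0xD800 + (offset >> 10)      # top 10 bits: high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits: low surrogate
    return [high, low]
```

A BMP character such as "A" (41 hex) comes back as one unit, while the musical G clef at 1D11E hex, which lives beyond the BMP, comes back as the pair D834, DD1E.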
UTF-8 is cool in respect that it's still byte-orientated and
that all strict 7-bit ASCII characters are exactly the same
("ASCII compatibility"...a UTF-8 file with only ASCII characters
is _identical_ to an ASCII file ;)...it's only when the 8th bit
is set that the interpretation of the bytes differs and it
"escapes" into a variable-length sequence of two to four bytes
(and, unlike the original fixed 16-bit scheme that Windows perhaps
jumped onto too early, UTF-8 can cover the _entire_ range with
nothing bolted on afterwards :)...the only problem with UTF-8,
really, is that it has a smaller file size _only_ if you're not
typing in Chinese or Japanese or whatever...because those
characters typically take three bytes each in UTF-8, which is more
than you'd have if you'd just used UTF-16's "16-bits per
character"...otherwise, though, it's great in being a kind of
"extended ASCII" that's still byte-orientated (usually easy for
the more simple tools to be "upgraded" with little bother...like
I proposed we might do with LuxAsm to include the support
without too much bother in our own coding...well, at least, not
until the text editor has to be coded ;)...
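The size trade-off is easy to check for yourself. This is just a sketch with two sample strings of my own choosing and a made-up `sizes` helper; real documents mix scripts, markup and whitespace, so actual savings will vary.

```python
ENGLISH = "plain old ASCII text"
JAPANESE = "\u65e5\u672c\u8a9e"   # "nihongo": three Kanji characters


def sizes(text):
    """Return the encoded size of text in bytes under each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }
```

For the English sample, `sizes` reports 20, 40 and 80 bytes, and the UTF-8 bytes are identical to the ASCII ones. For the Japanese sample it reports 9, 6 and 12, so UTF-16 wins there, as described above.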
Most OSes have the support built-in now...Windows took the
approach of also supporting 16-bit "wide character" strings in
its API (hence, the "A" and "W" versions of any API that takes a
string ;)...the Linux approach was to mostly use UTF-8 for the
"ASCII compatibility" stuff (the support is in the kernel and
many distributions are increasingly adopting it as the
"default"...Red Hat has already gone fully behind this and
"locale charmap" (I think that's the command) should read
"UTF-8" as your default character set, no matter where you are
or what "locale" you have set ;)...Xterm and much of Linux have
been "upgraded" accordingly and the various UNICODE fonts
installed in distributions...
It might not seem like that much of a big deal but some of the
things it makes possible are impossible or at least really
bloody awkward without it...really, where things currently stand
is getting a growing awareness amongst programmers to start
using it...you know, to begin to not always assume that any
character string is automatically going to be ASCII and that
"number of bytes" and "number of characters" aren't necessarily
the same thing (you know, using "strlen" for both purposes...to
work out how many bytes to allocate, as well as using it to work
out how many columns to "tab" across for printing things out on
screen ;)...that is, the actual effort of getting everyone to
move over to this "one universal character set"...because, once
achieved, then a whole bunch of communication problems
vanish...such as the times, for example, when I've tried to post
some non-ASCII characters in my posts and they don't appear on
the newsgroup as I wrote them...if all posts were posted in
UTF-8 and all newsservers (and all the software tools they are
using ;) propagated it properly and so forth, then we could all
start posting actual mathematical formulas in the proper
notation and "dingbats" back and forth - imagine all those
"ASCII diagrams" but with proper "box drawing characters" and
"arrows" and everything - with no problems...and, in fact, our
OSes almost certainly support this already, as do most
newsreaders (well, maybe not Annie's text-based ones in
DOS...which is ironic when she's the one who makes the most use
of "ASCII diagrams" in every post...doubtless, though, the
diagrams would still be ASCII, even if she could post something
else...well, her surname's "ASCII", not UTF-8, after all
;)...the actual problem tends to be the software along the
"propagation" line that only understands ASCII and would mangle
anything that's non-ASCII...
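Going back to the "strlen" point, here's a rough sketch of why one number can't serve all three purposes once text stops being pure ASCII. The three helper names are invented for this example, and the column count is only an approximation: it treats East Asian wide characters as two columns and everything else as one, ignoring combining marks and other zero-width characters.

```python
import unicodedata


def byte_length(text):
    """Bytes needed to store text as UTF-8 (what you'd allocate)."""
    return len(text.encode("utf-8"))


def char_count(text):
    """Number of UNICODE code points in text."""
    return len(text)


def column_width(text):
    """Approximate screen columns: wide and fullwidth characters take two."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


# "naive" with a diaeresis on the "i", a space, then two Kanji.
sample = "na\u00efve \u65e5\u672c"
```

For plain "hello" all three functions return 5, which is why the single "strlen" habit worked for so long. For `sample` they return 13 bytes, 8 characters and 10 columns respectively.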
Beth :)