Re: Fast UTF-8 strlen function
- From: "Beth" <BethStone21@xxxxxxxxxxxxxxxxxxxxxxx>
- Date: Mon, 16 May 2005 13:35:07 GMT
Randy wrote:
> P.S., How does one enter "UTF-8" (non-ANSI) characters in a program
> like notepad without a special keyboard (or simply resorting to cut and
> paste)?
Well, as Chewy has pointed out, Microsoft have "extended" the old "ALT +
numpad" sequence to work with UNICODE characters for Windows (for those who
might not know, this same "trick" worked under DOS for typing in characters
by their character code...press down "ALT" and then tap in the value with
the numpad keys...leave go and the character would appear)...
As for which "ALT" key: the left one is plain "ALT" but the right one is
usually labelled "ALT GR" on many non-US keyboards (as I remember it, under
DOS, only the right-hand "ALT GR" actually worked for doing this...don't
know if they've changed that or not...if they just say "ALT" and not "ALT
GR" then they probably have)...I have heard people call this "alt
graphic"...whether it does or doesn't actually stand for "graphic", I have
not a clue...
This "trick" was useful for adding in the "box drawing characters" and such
into code...you could directly put the characters into the strings to be
printed, using this "trick" to be able to type them :)...
Although, that's pretty "lame", really...UNICODE characters are always
referenced in _hexadecimal_ and you have to tap it in, in decimal...
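For what it's worth, the hex-to-decimal step is trivial to script. A minimal sketch in Python (the name `alt_code_digits` is just made up for illustration, and whether a given program actually accepts the decimal value for a UNICODE character is a separate matter that this doesn't address):

```python
def alt_code_digits(code_point: str) -> str:
    """Turn a hex code point such as 'U+03A9' into its decimal digits."""
    text = code_point.strip().upper()
    if text.startswith("U+"):
        text = text[2:]          # drop the conventional "U+" prefix
    return str(int(text, 16))    # parse as hexadecimal, print as decimal


print(alt_code_digits("U+03A9"))  # GREEK CAPITAL LETTER OMEGA
print(alt_code_digits("2500"))    # BOX DRAWINGS LIGHT HORIZONTAL
```

The point is only that the number you read in a code chart and the number you tap on the numpad are the same value in two different bases.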
And, I just tried it, they have changed it...now it's the other way
around..."ALT GR" doesn't work but "ALT" does...ah, good old
Microsoft...very "consistent", I think not...
Anyway, the other means are, of course, using "character map"...but that's
more than a little awkward for long sequences...
The best means is to install a bunch of "virtual keyboards"...on XP, go to
"Control Panel"...then "Regional and Language Options" (that is, under the
"Classic View" categories, as XP has got that new "clueless newbie mode"
where the control panel icons are categorised differently...I find that
quite unhelpful and stick with "Classic View" myself ;) )...on the dialogue
box (US: dialog box) that appears, there are three tabs at the top...switch
to the "Languages" tab...
[ You'll note that there are a few checkboxes (UK: tickboxes) at the bottom
of the tab about "Supplemental language support"...these are probably worth
checking (UK: ticking), as they add on some extra files for right-to-left
support and files for "Far East" languages and such (left out of the
typical English language install to save space, probably...you might find
you need your "installation CD" to get the required files, so you might
need to dig that out :) )...though, this is just "useful" and isn't what I'm
directing you to... ]
At the top of the page, you'll see "text services and input
languages"...click the "details..." button to get up the dialogue box for
that...
On this next dialogue box, you can install "input languages" and "virtual
keyboards" for as many languages as you like...just click the "Add"
button...select the language and then select the keyboard (note that there
can be different keyboard layouts for some languages...even US English has
the "Dvorak" option...useful if you ever wanted to learn how to type on a
Dvorak layout keyboard :) )...you can add as many of these languages and
keyboards as you like to the box (you can also add a language twice but
using different keyboards...I have both QWERTY and Dvorak installed for
English...in fact, even though it's "UK English", the keyboard is "US
Dvorak"...there appears to be no "UK Dvorak" ;) )...
Important: At the top of this dialogue box is the "default" option...make
sure that's set to "US English" or whatever, lest you boot up Windows and
suddenly find you can only type in Greek...very annoying...the "default",
of course, is the language and keyboard it "defaults" to when you boot
up...
At the bottom, there are some other useful options under
"Preferences"...click on the "Key settings..." button to bring up a
dialogue box where you can set up "hotkeys" for "fast switching" keyboards
and languages (this option allows you to do things like make it
automatically jump to a different language by pressing "special sequences",
like holding both SHIFT keys down at the same time)...
Now, personally, I find this option is best to _SWITCH OFF_...as I once or
twice totally confused myself by accidentally tapping in the "hotkey
sequence" and then Greek appeared when I typed and I had no idea what was
going on...you could set this up for "fast switching" but, to be honest, if
you're not usually typing in multiple languages, the best setting for the
"hotkeys" is _OFF_, so that you don't suddenly end up switching keyboards
and languages in the middle of typing...confusing yourself completely...
Anyway, once you set that up (or switch it off...whatever), then "OK" the
box to get back to the other dialogue box and the most useful option of all
is to now select the "Language bar..." option (this is really what makes
things real easy for typing :)...
There is the checkbox "Show the language bar on the desktop"...turn this
on...
You'll find this "language bar" has now appeared (it's "ghostly"
semi-transparent when the mouse is not over it, so it doesn't "get in the
way" :)...now, on the right edge is a "minimise" button...I'd say it's best
to "minimise" it (and leave it that way)...the "language bar" then lands on
your "taskbar" (move it around by "unlocking" the taskbar and moving the
gripper and then re-locking the taskbar...I prefer to have it next to the
"Start" button myself on the left-hand side, rather than the default
right-hand side...but that's all a matter of "preference", of course :)...
Okay, it's a bit of an effort to get this set up...BUT, now that you've got
your little minimised "language bar" sitting on the taskbar, you can change
languages or keyboards whenever you feel like by clicking on the icons (the
language icon is a two-character abbreviation of the language's name...that
is, "EN" for English, "JP" for Japanese, "RU" for Russian, "EL" for Greek)...
Obviously, "EL" makes sense as the abbreviation for Greek _in Greek_, even
if it seems odd for English...the abbreviations are always Latin
characters...I think this is some "international standard" they are
following here...
The keyboard icon, meanwhile, can be clicked to switch between keyboards
(where applicable)...some languages - like Japanese - are more complicated
and the icons might change around...just mess around with it...the worst you
can end up doing is, of course, typing "weird characters"...and you can
always switch back by clicking the icons back to "English" again :)...
And you can access the "settings" of the "language bar" by clicking on the
lower "down arrow" button on the right-hand side, which brings up a menu
(where the "settings" option just brings back up that dialogue box where
you can set up the installed languages and keyboards :) )...
This is Microsoft's "international support"...and, to be fair, it's another
one of the things they aren't quite so bad at doing...once that "language
bar" is there, it's very easy to switch keyboards with a mouse click and
you can install a whole bunch of languages and keyboards from the dialogue
box (which might not only be useful for testing UNICODE stuff...it can also
do Dvorak layout, if that interests anyone, without any actual "change" to
the languages :)...
So, for UNICODE testing, you can just install a whole bunch of languages
and then switch languages...tap out any old crap on the keyboard, just to
see that the characters appear...switch languages, tap out some more
crap...well, you don't actually have to speak those languages just to see
that the right characters appear...
I suppose for good "testing" then the languages to choose would be
"English" (that's your "Latin" and your "default", obviously :) )...Russian
picks up the Cyrillic alphabet...Greek for the Greek alphabet...Hebrew
tests "right-to-left"...Thai is another interesting one (tests support for
the more "exotic" languages...or you could try "Hindi" or something
:) )...then a "Far East" language like Japanese or Chinese (which are the
"special" ones that cause the most problems generally because of those
"ideographic" characters that number in the tens of thousands...indeed,
Windows works in 16-bit units internally, so anything beyond the 16-bit
range, outside the "BMP", has to be handled as a "surrogate pair", and
support for that is, apparently, "limited"...so many of the rarer ideographs
may be missing...but the "most common" ones are all in the "BMP" anyway)...
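To make the "16-bit range" point a bit more concrete, here is a small sketch in Python (the helper names `utf16_units` and `outside_bmp` are invented for this example) that shows which characters fit in a single 16-bit unit and which fall outside the "BMP" and so need two units in UTF-16:

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit units needed to hold one character in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def outside_bmp(text: str) -> list:
    """Characters whose code point is above U+FFFF (beyond the BMP)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# U+65E5 is a common ideograph inside the BMP;
# U+20000 is the first ideograph of "CJK Extension B", outside it.
sample = "\u65e5\U00020000"
for ch in sample:
    print(f"U+{ord(ch):04X}: {utf16_units(ch)} UTF-16 unit(s), "
          f"{len(ch.encode('utf-8'))} UTF-8 byte(s)")
```

A test file that includes at least one character from `outside_bmp` territory is a cheap way to see whether a program copes with anything past the 16-bit range.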
Also, there are some already created "test pages" out there on the internet
too...
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/
I like the "runic poem" at the bottom...and a Beastie Boys rap in the IPA
phonetic alphabet...which is quite silly but quite cool at the same time
;)...
The files there are designed specifically for testing out UTF-8 programs
and test out different aspects (though some relate to "fonts" and
"alignment" which aren't much use for what you want and are for testing web
browsers and text editors :)...and I'm sure there are other examples of
such "test files" elsewhere on the 'Net too...
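If you'd rather roll your own quick test data than type it, something like the following Python sketch would do. The sample words and the name `utf8_length` are my own picks for illustration, and the counting method shown (count every byte that is not a "continuation byte" of the form 10xxxxxx) is one common way to do a UTF-8 "strlen", not necessarily the one discussed elsewhere in this thread:

```python
# One short sample per script mentioned above (chosen arbitrarily).
SAMPLES = {
    "Latin": "Hello",
    "Cyrillic": "Привет",
    "Greek": "Ελληνικά",
    "Hebrew": "שלום",
    "Thai": "ไทย",
    "Japanese": "日本語",
}


def utf8_length(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


for script, word in SAMPLES.items():
    encoded = word.encode("utf-8")
    print(f"{script}: {len(encoded)} bytes, {utf8_length(encoded)} characters")
```

Note that this counts code points in well-formed UTF-8 only; it makes no attempt to validate the input.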
> Also, when I save a file with notepad in UTF-8 format, it seems to
> stick some characters at the beginning of the file that trip up the
> compiler (BOM?)
> What's the deal there?
Oh my goodness...I just tried that out and had a look with a hex viewer at
it...and, yup, twisted old Microsoft are putting a "BOM" at the start of a
UTF-8 file!!! This just conclusively proves the point that Microsoft often
have no idea what they're doing...they are putting a "byte order mark" into
a UTF-8 file, which is _BYTE-BASED_ throughout (and NEVER once has any
word-sized entities nor has the slightest "endianness" concerns: This
attribute is one of the "good points" in the "careful design" of UTF-8, in
fact...that, like ASCII, it is "endianness neutral" because it only deals
with bytes)...
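The "endianness neutral" point is easy to see for yourself. A small Python sketch (the variable names are just for this example), using the euro sign, U+20AC:

```python
ch = "\u20ac"  # EURO SIGN, U+20AC

utf16_le = ch.encode("utf-16-le")  # 16-bit unit stored low byte first
utf16_be = ch.encode("utf-16-be")  # same unit stored high byte first
utf8 = ch.encode("utf-8")          # a fixed sequence of bytes, no "order"

print("UTF-16 LE:", utf16_le.hex())
print("UTF-16 BE:", utf16_be.hex())
print("UTF-8:    ", utf8.hex())
```

The two UTF-16 forms contain the same two bytes in opposite orders, which is the ambiguity a "BOM" exists to resolve; the UTF-8 form is one byte sequence regardless of the machine that wrote it.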
The "deal" here is that Microsoft are a bunch of idiots...yes, it's a
"BOM"...and, no, there is absolutely no good reason whatsoever to include a
"BOM" in a UTF-8 file...it is a total waste of space that serves no purpose
and just demonstrates that Microsoft really don't have a clue what they are
doing...there are no "byte order" concerns with UTF-8, so, clearly, a "byte
order mark" is utterly pointless...
It shouldn't need to be there...though, to be strictly fair, the standard
does tolerate this character at the start of a UTF-8 file as a "signature"
(it neither requires nor recommends it), so "bug" is perhaps a bit strong...
my guess is that Notepad pops in a "BOM" as the first character of its
UNICODE "buffer"...and then just "converts" that buffer into UTF-8 ("BOM"
and all)...a "simple" implementation that trips up any tool, like your
compiler, that expects plain UTF-8 with no "BOM" at the front...
You'll just have to "skip" over it...in UTF-8, the "BOM" comes out as the
three bytes EF BB BF (hex) at the very start of the file...UTF-8 requires
no "BOM"...this is just Microsoft's way of doing it...by definition of
being "byte-based", there is, of course, zero need to have a "BOM" anywhere
in any UTF-8 file to sort out byte order...so check for those three bytes
and carry on from the fourth...
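As a sketch of the "skip over it" step in Python: the "BOM" character U+FEFF encodes in UTF-8 as the three bytes EF BB BF, so it's enough to check for those at the very start of the data. The function name `strip_utf8_bom` is made up for this example; the `utf-8-sig` codec, on the other hand, is a standard Python codec that does the same job while decoding:

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


saved_by_notepad = UTF8_BOM + b"program test;"
print(strip_utf8_bom(saved_by_notepad))      # BOM removed
print(strip_utf8_bom(b"program test;"))      # unchanged, nothing to skip
print(saved_by_notepad.decode("utf-8-sig"))  # codec drops the BOM itself
```

Only a "BOM" at offset zero is skipped here; the same three bytes appearing later in the file are left alone.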
This character is entirely meaningless in this context...treat it exactly
as you would treat other "meaningless" characters...for example, ASCII was
not without its "meaningless" characters either...what do the control
characters "ENQ" (ASCII 5) or "ACK" (ASCII 6) mean in an ordinary text file
these days? These relate to a "context" that's meaningless unless you are,
in fact, communicating with "dumb terminals" and that kind of nonsense
(which no-one does anymore)...for a UTF-8 file, the "BOM" UNICODE character
is equally "meaningless" in that context...so, ignore it as you would
ignore getting an "ENQ" or "ACK" in the middle of a text file too...
Beth :)