Re: Unicode Support



websn...@xxxxxxxxx wrote:
>
> For example, there are letters which can take multiple accents. And
> you can often specify them as base-char + accent1 + accent2. The
> question is, if you flip the order of the accents, does that represent
> a different character or not? The unicode normalization algorithms say
> no for some cases, and yes in others.
>
Hi Paul,

That appears to be the hardest bit of all. Now that I'm looking more
closely at the actual encodings and character maps, I've also found
that some characters, particularly from the Latin set, have direct
equivalents. For example, Latin Small Letter A With Tilde can be
encoded either as U+00E3 or as U+0061 + U+0303 (Latin Small Letter A
followed by Combining Tilde), which makes comparison even harder.
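
To make the comparison problem concrete, here is a minimal sketch in
Python (my choice for illustration only, using the standard
unicodedata module). The variable names are made up for the example.
It shows that the two encodings of ã compare unequal as raw strings,
and compare equal once both are run through the same normalization
form:

```python
import unicodedata

# Two encodings of the same visible letter.
precomposed = "\u00e3"    # LATIN SMALL LETTER A WITH TILDE
decomposed = "a\u0303"    # LATIN SMALL LETTER A + COMBINING TILDE

# A plain code point comparison treats them as different strings.
raw_equal = precomposed == decomposed

# Normalizing both sides to the same form (NFC composes, NFD decomposes)
# makes the comparison succeed.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))

print(raw_equal, nfc_equal, nfd_equal)
print(len(precomposed), len(decomposed))
```

So a comparison routine for labels/identifiers would presumably need
to normalize both operands first, rather than compare them code point
by code point.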

It's stated that any editor conforming to Unicode 4.x must produce the
shortest form (which would negate the issue I raised). However, when
text is copied from other sources, the exact encoding MUST be
preserved. So if I have an editor which only conforms to version 1.x
and produces the long form (v1.x, IIRC, doesn't state which form to
produce to be conformant), and I copy its text over to another editor
which produces the short form, the second editor (to be conformant)
cannot and should not convert long form to short form, even though the
short form is considered correct.

So basically: if I copy ã encoded as U+0061, U+0303 into a text editor
that is conformant to Unicode v4.x, it MUST remain in the long form
(i.e. 2 code points), even though that particular encoding is not
technically correct (U+00E3 being the technically correct encoding).

That's if I'm reading Chapter 3 correctly. (If I'm wrong, please let me
know.)
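
On Paul's point about flipping the order of the accents, a small
sketch (again Python's standard unicodedata module, with example
sequences I picked myself) suggests where the "no for some cases, yes
in others" comes from. Marks with different canonical combining
classes get sorted into one order by normalization, so flipping them
makes no difference. Marks with the same class are left in the order
given, so flipping them yields a different sequence:

```python
import unicodedata

ACUTE = "\u0301"       # COMBINING ACUTE ACCENT, combining class 230
DIAERESIS = "\u0308"   # COMBINING DIAERESIS, combining class 230
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW, combining class 220

def nfd(text):
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)

# Different combining classes: normalization reorders the marks,
# so both input orders end up as the same sequence.
different_class_same = nfd("a" + ACUTE + DOT_BELOW) == nfd("a" + DOT_BELOW + ACUTE)

# Same combining class: the order is significant and is preserved,
# so the two inputs stay distinct after normalization.
same_class_same = nfd("a" + ACUTE + DIAERESIS) == nfd("a" + DIAERESIS + ACUTE)

print(unicodedata.combining(ACUTE), unicodedata.combining(DOT_BELOW))
print(different_class_same, same_class_same)
```

The first comparison comes out equal and the second does not, which
matches the behaviour Paul describes.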

Now my head hurts...

And I can see why there is resistance to allowing full Unicode support
for labels/identifiers.

I would just like to thank everyone that has replied and voiced a
constructive opinion on this topic.

I have to admit, supporting Unicode is a bit more work than I first
thought! :(

--

Darran (aka Chewy509) brought to you by Google Groups.
