Re: HTML in utf8 and perl

From: Alan J. Flavell (flavell_at_ph.gla.ac.uk)
Date: 02/28/04


Date: Sat, 28 Feb 2004 18:45:26 +0000

On Sat, 28 Feb 2004, Pawel Niewiadomski wrote:

> "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in

> > No, theoretically the second one should generate the Unicode character
> > which you specified. You're confusing Unicode values with their utf-8
> > encodings.
>
> That was the answer I was looking for. I didn't really quite understand
> the difference between the encoding of the character in utf8 and its
> value in Unicode.

Glad it helped.

Of course, now that you know the answer, it should easy to find it in
the documentation. :-}

The Unicode "code points" (the term used in the perluniintro) are
encoded in different ways (different bit-patterns) in utf-8, utf-16 or
indeed other applicable Unicode encodings, but they still represent
the same "code point". It just so happens that Perl chose internally
to represent characters by using utf-8 representation, but the ord()
values of the Unicode characters are still their code point values,
and, as you've now seen, the wide character constant is represented by
\x{...} using its code point value, the same as is tabulated in the
character code charts at the Unicode site,
http://www.unicode.org/charts/

all the best



Relevant Pages

  • Re: Unicode Support
    ... if two Unicode strings are the same? ... UTF-16 is basically telling everyone "ok we all got to start ... character, and will likely support *both* endians. ... UTF-8 encodings are also easy to learn to ...
    (alt.lang.asm)
  • Re: Entities in alt and title text
    ... (Unicode themselves are inconsistent as to what this space is called). ... concept could usefully be extended to describe other character sets, ... encodings, including ISO-8859-* encodings that don't even support ... So if you're going to use ’ as a numeric character reference, ...
    (comp.infosystems.www.authoring.html)
  • Re: iso_8859_1 mystery/tkinter
    ... this isn't about the "sign bit", it's about assumed encodings for byte ... In iso_8859_1 and unicode, the character with value 0xb0 is DEGREE SIGN. ... and then pass x to a Tkinter call, Tkinter treats it as a string encoded ...
    (comp.lang.python)
  • Re: [OT]Re: Unicode strings
    ... >> character set, even for legacy code. ... A long time ago I implemented a string class that supported both UTF-8 ... you can convert non-unicode encodings to unicode internally ...
    (alt.comp.lang.learn.c-cpp)
  • Re: VB - Ascii to Unicode and then Unicode to UTF-8 conversion (Very desperate!!)
    ... Latin together) then you have to use a Unicode column type. ... AscW returns the real Unicode character ... for Chinese characters, ... then the next thing to worry about is your CSV file. ...
    (microsoft.public.vb.general.discussion)