Re: More elegant UTF-8 encoder



* christian.bau wrote in comp.lang.c:
> What you are trying to do seems rather bizarre. If you want to encode
> Unicode in a 32 bit number, leave it unchanged. If you want to encode
> Unicode as a sequence of bytes, store it into a sequence of bytes.

Well, I have what you could call a regular expression engine based on
Janusz Brzozowski's notion of derivatives of regular expressions: given
a regular expression and a character, it computes a regular expression
matching the rest of the string. Currently it stores ranges of
characters using the Unicode scalar value and transcodes from UTF-8
to UTF-32.

For several reasons, I want to avoid transcoding to UTF-32, so I want to
change it so that, given a regex and an octet, it computes a new regex.
I am experimenting with possible solutions; one is to exploit

utf8toint(c1) < utf8toint(c2) <=> c1 < c2

which allows me to store the character ranges in their utf8toint-encoded
form. The derivative of a range with respect to an octet can then easily
be computed by intersecting the range with a new range spanning the
minimal and maximal utf8toint values possible given the octet(s) seen up
to that point (the current byte followed by n-1 octets of 0x80 for the
minimum and of 0xBF for the maximum, where n is the required length).
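
A minimal sketch of that bounds computation in C. The body of utf8toint
is not shown in this post, so the version here assumes it simply packs
the octets of one encoded character big-endian into an integer; the
helpers utf8_seq_len and octet_bounds are hypothetical names introduced
for illustration only. Padding with 0x80 and 0xBF slightly
over-approximates for lead bytes such as 0xE0 or 0xF4, which is harmless
as long as the stored ranges only contain valid encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length of the sequence announced by a UTF-8
   lead byte, or 0 if the byte cannot start a sequence. */
size_t utf8_seq_len(uint8_t lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* Assumed definition: pack the n octets of one encoded character
   big-endian into an integer, so that integer order matches code
   point order. */
uint32_t utf8toint(const uint8_t *s, size_t n)
{
    uint32_t v = 0;
    assert(n >= 1 && n <= 4);
    for (size_t i = 0; i < n; i++)
        v = (v << 8) | s[i];
    return v;
}

/* Minimal and maximal packed value of an n-octet sequence that starts
   with octet b: b followed by n-1 octets of 0x80 or 0xBF. */
void octet_bounds(uint8_t b, size_t n, uint32_t *lo, uint32_t *hi)
{
    assert(n >= 1 && n <= 4);
    *lo = b;
    *hi = b;
    for (size_t i = 1; i < n; i++) {
        *lo = (*lo << 8) | 0x80;
        *hi = (*hi << 8) | 0xBF;
    }
}
```

For a lead byte of 0xc2, utf8_seq_len gives 2 and octet_bounds yields
0xc280 and 0xc2bf.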

So a range [ U+0000 - U+00FF ] would be stored as [ 0x0000 - 0xc3bf ]
and if it sees e.g. a 0xc2 it would create a range [ 0xc280 - 0xc2bf ],
compute the intersection which is [ 0xc280 - 0xc2bf ] and drop the seen
byte, resulting in [ 0x80 - 0xbf ]; I can always tell, due to how UTF-8
byte patterns are organized, whether a given range is a partial range
and how many bytes are still needed to make a full character, though I
will be storing the remaining byte count for performance reasons.
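
The step in this example could be sketched in C roughly as follows. The
struct brange and the function derive are hypothetical names, not taken
from the actual engine, and the caller is assumed to track how many
octets (including the current one) are still needed, passed in as total.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical representation: an inclusive range of packed
   (utf8toint-style) values. */
struct brange {
    uint32_t lo;
    uint32_t hi;
};

/* Derivative of range r with respect to octet b, where `total` is the
   number of octets still needed including b. Returns false if no
   character in r can continue with b; otherwise stores the range for
   the remaining total-1 octets in *out. */
bool derive(struct brange r, uint8_t b, unsigned total, struct brange *out)
{
    uint32_t lo = b, hi = b, mask;

    assert(total >= 1 && total <= 4);

    /* Candidate range: b followed by total-1 octets of 0x80 / 0xBF. */
    for (unsigned i = 1; i < total; i++) {
        lo = (lo << 8) | 0x80;
        hi = (hi << 8) | 0xBF;
    }

    /* Intersect the candidate range with the stored range. */
    if (lo < r.lo) lo = r.lo;
    if (hi > r.hi) hi = r.hi;
    if (lo > hi)
        return false;

    /* Drop the octet just seen by masking off the top byte. */
    mask = (UINT32_C(1) << (8 * (total - 1))) - 1;
    out->lo = lo & mask;
    out->hi = hi & mask;
    return true;
}
```

With r = { 0x0000, 0xc3bf } and the octet 0xc2 (total 2), derive
produces the range [ 0x80 - 0xbf ] as in the example above, while the
octet 0xc4 yields no match.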

Obviously I could do something similar by partially decoding the UTF-8
octets and storing Unicode scalar value ranges in the derivative instead
or mix these approaches in some way, but that seemed more difficult to
me. Similarly, rewriting the regular expression upfront so it matches
on bytes rather than characters would be more difficult. So, while it
might be unusual, I don't think this is particularly bizarre.
--
Björn Höhrmann · mailto:bjoern@xxxxxxxxxxxx · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/