Re: Encoding bytes into UTF-8 string



"Robert Dodier" <robert.dodier@xxxxxxxxx> writes:
> I want to read bytes from a file containing UTF-8 characters and
> encode them into a string. Specifically, I have the byte offset of
> the beginning of the string and the number of bytes in the string
> (always a whole number of characters), so I am planning to seek
> to the beginning, read so-many bytes, then encode the result into
> a string.

You've got your concepts all wrong. That can't work.


The file doesn't contain characters. Not on a POSIX system. Files on
POSIX systems (including unix and MS-Windows) don't contain characters.
They contain only bytes.

These bytes may encode a string of characters using the UTF-8 encoding
of Unicode. But you'll have to read bytes, and then DECODE them into
characters.
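
Reading the raw bytes, at least, can be done in portable CL. Here is a
minimal sketch, assuming 8-bit stream elements correspond to the bytes of
the file (which is the usual case). READ-OCTETS-AT is a name made up for
this example, not a standard function.

```lisp
;; Sketch: read COUNT raw bytes starting at byte OFFSET of the file PATH.
;; READ-OCTETS-AT is a hypothetical helper name, not part of the standard.
(defun read-octets-at (path offset count)
  (with-open-file (in path :element-type '(unsigned-byte 8))
    ;; On a binary stream FILE-POSITION counts stream elements; with
    ;; 8-bit elements that is normally the byte offset.
    (unless (file-position in offset)
      (error "Cannot seek to offset ~D in ~A" offset path))
    (let* ((buffer (make-array count :element-type '(unsigned-byte 8)))
           (end (read-sequence buffer in)))
      ;; READ-SEQUENCE returns the index of the first element it did not
      ;; fill, so a short read at end of file gives a shorter vector.
      (subseq buffer 0 end))))
```

The result is a vector of integers between 0 and 255, not a string.
Turning it into characters is a separate decoding step.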


> I have browsed the CLHS, comp.lang.lisp archives, Seibel's PCL,
> and random web pages without coming up with a solution.
>
> One thing that seems promising is CODE-CHAR.

It's not promising at all. The standard gives absolutely no guarantee
about which character CODE-CHAR returns for a given code, so nothing
ties it to utf-8 or unicode.


> Can I make it recognize UTF-8 codes?

No.


> One last thing -- the solution needs to be CL; I'm not in a position
> to choose a Lisp implementation.

Ouch!
You'll have to implement the UTF-8 to unicode decoding yourself, and
then the mapping from unicode code points to characters.

Since you cannot choose a Lisp implementation, you can count only on
the standard characters:

#\NEWLINE #\SPACE
#\! #\" #\# #\$ #\% #\& #\' #\( #\) #\* #\+ #\, #\- #\. #\/
#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9 #\: #\; #\< #\= #\> #\?
#\@ #\A #\B #\C #\D #\E #\F #\G #\H #\I #\J #\K #\L #\M #\N #\O
#\P #\Q #\R #\S #\T #\U #\V #\W #\X #\Y #\Z #\[ #\\ #\] #\^ #\_
#\` #\a #\b #\c #\d #\e #\f #\g #\h #\i #\j #\k #\l #\m #\n #\o
#\p #\q #\r #\s #\t #\u #\v #\w #\x #\y #\z #\{ #\| #\} #\~

That's all. So in pure CL, independent of any implementation, you'll be
able to decode utf-8 to unicode code points, but you can map to
characters only the code points between 32 and 126, plus the newline.
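
The first half, bytes to code points, might look like the sketch below.
It is only an illustration: UTF-8-TO-CODE-POINTS is a made-up name, the
result is a list of integers rather than characters, and the validation
is minimal (overlong forms and surrogates are not rejected).

```lisp
;; Sketch: decode a vector of UTF-8 octets into a list of code points.
;; UTF-8-TO-CODE-POINTS is a hypothetical name, not a standard function.
(defun utf-8-to-code-points (octets)
  (let ((result '())
        (i 0)
        (n (length octets)))
    (loop while (< i n) do
      (let ((lead (aref octets i)))
        ;; EXTRA is the number of continuation bytes that follow LEAD,
        ;; CODE starts out as the payload bits of LEAD.
        (multiple-value-bind (extra code)
            (cond ((< lead #x80) (values 0 lead))
                  ((= (logand lead #xE0) #xC0) (values 1 (logand lead #x1F)))
                  ((= (logand lead #xF0) #xE0) (values 2 (logand lead #x0F)))
                  ((= (logand lead #xF8) #xF0) (values 3 (logand lead #x07)))
                  (t (error "Invalid UTF-8 lead byte ~D at index ~D" lead i)))
          (when (> (+ i extra 1) n)
            (error "Truncated UTF-8 sequence at index ~D" i))
          (loop for k from 1 to extra
                do (let ((byte (aref octets (+ i k))))
                     ;; Continuation bytes have the form 10xxxxxx.
                     (unless (= (logand byte #xC0) #x80)
                       (error "Invalid continuation byte ~D at index ~D"
                              byte (+ i k)))
                     (setf code (logior (ash code 6) (logand byte #x3F)))))
          (push code result)
          (incf i (1+ extra)))))
    (nreverse result)))
```

For example, the two bytes 195 169 decode to the single code point 233.
Whether the implementation has a character for 233 is another matter.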

Since in a utf-8 stream, the bytes less than 128 encode exactly the
ASCII code points (which include all these characters), and every other
code point is encoded as a sequence of bytes that are all 128 or
greater, you could actually skip the utf-8 decoding, just signaling an
error on any byte of 128 or greater.
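
A sketch of that shortcut follows. The names *PRINTABLE-ASCII* and
OCTETS-TO-STANDARD-STRING are made up for the example. It looks the
characters up in a literal table instead of calling CODE-CHAR, and it
assumes that byte 10 is the line terminator. Control bytes other than 10
are rejected too, since they are not standard characters.

```lisp
;; The 95 graphic standard characters plus space, in ASCII order 32..126.
;; *PRINTABLE-ASCII* is a hypothetical name used only in this sketch.
(defparameter *printable-ascii*
  " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~")

;; Sketch: map octets straight to standard characters, no UTF-8 decoding.
(defun octets-to-standard-string (octets)
  (map 'string
       (lambda (byte)
         (cond ((= byte 10) #\Newline)   ; assumes LF line terminators
               ((<= 32 byte 126) (char *printable-ascii* (- byte 32)))
               (t (error "Byte ~D does not encode a standard character"
                         byte))))
       octets))
```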


> Thanks in advance for any light you can shed on this question.

If I were you, I'd try to get permission to use either sbcl or clisp
(or both), open the file as a binary file with
:element-type '(unsigned-byte 8), seek to the _byte_ offset you're
given, read the bytes, then use #+sbcl sb-ext:octets-to-string or
#+clisp ext:convert-string-from-bytes to DECODE the bytes and get a
string of unicode characters.
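
Put together, that could look roughly like this. It is a sketch, not
tested advice for your setup: READ-UTF-8-STRING-AT is a made-up name, and
as far as I know the documented CLISP function is
EXT:CONVERT-STRING-FROM-BYTES, taking the byte vector and an encoding
such as CHARSET:UTF-8. Other implementations just get an error here.

```lisp
;; Sketch: read COUNT bytes at byte OFFSET of PATH and decode them as UTF-8.
;; READ-UTF-8-STRING-AT is a hypothetical name; only sbcl and clisp are
;; handled, through their own extensions.
(defun read-utf-8-string-at (path offset count)
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (file-position in offset)
    (let* ((octets (make-array count :element-type '(unsigned-byte 8)))
           (end (read-sequence octets in)))
      ;; END is less than COUNT if the file ended early.
      #+sbcl
      (sb-ext:octets-to-string octets :external-format :utf-8 :end end)
      #+clisp
      (ext:convert-string-from-bytes (subseq octets 0 end) charset:utf-8)
      #-(or sbcl clisp)
      (error "No octet decoder known for this implementation (~D bytes read)"
             end))))
```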

http://www.cliki.net/CloserLookAtCharacters


--
__Pascal Bourguignon__ http://www.informatimago.com/
In deep sleep hear sound,
Cat vomit hairball somewhere.
Will find in morning.