Re: Encoding bytes into UTF-8 string
- From: Pascal Bourguignon <pjb@xxxxxxxxxxxxxxxxx>
- Date: Fri, 24 Nov 2006 23:40:29 +0100
"Robert Dodier" <robert.dodier@xxxxxxxxx> writes:
> I want to read bytes from a file containing UTF-8 characters and
> encode them into a string. Specifically, I have the byte offset of
> the beginning of the string and the number of bytes in the string
> (always a whole number of characters), so I am planning to seek
> to the beginning, read so-many bytes, then encode the result into
> a string.
You've got your concepts all wrong. That can't work.
The file doesn't contain characters. Files on POSIX systems (unix, and
in practice MS-Windows too) don't contain characters. They contain
only bytes.
These bytes may encode a string of characters in the UTF-8 encoding of
Unicode. But you'll have to read bytes, then decode them.
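To make the "read bytes" part concrete, here is a minimal sketch in portable CL. READ-OCTETS is a hypothetical helper name, not a standard function, and it assumes the implementation can seek on a binary file stream.

```lisp
;; Sketch only: READ-OCTETS is a hypothetical helper, not part of CL.
(defun read-octets (path offset count)
  "Return up to COUNT bytes of the file PATH, starting at byte OFFSET."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    ;; On a stream of (unsigned-byte 8), FILE-POSITION counts bytes.
    (unless (file-position in offset)
      (error "Cannot seek to byte ~D in ~A" offset path))
    (let* ((buffer (make-array count :element-type '(unsigned-byte 8)))
           ;; READ-SEQUENCE returns the index of the first element
           ;; it did not fill, which is less than COUNT at end of file.
           (end (read-sequence buffer in)))
      (subseq buffer 0 end))))
```

The result is a vector of integers between 0 and 255, not a string. Turning it into characters is a separate decoding step.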
> I have browsed the CLHS, comp.lang.lisp archives, Seibel's PCL,
> and random web pages without coming up with a solution.
> One thing that seems promising is CODE-CHAR.
It's not promising at all. There's absolutely no guarantee of what
CODE-CHAR does, with respect to utf-8 or unicode.
> Can I make it recognize UTF-8 codes?
No.
> One last thing -- the solution needs to be CL; I'm not in a position
> to choose a Lisp implementation.
Ouch!
You'll have to implement the UTF-8 to Unicode decoding yourself, and
then the Unicode code point to character conversion.
Since you cannot choose a Lisp implementation, you can count only on
the standard characters:
#\NEWLINE #\SPACE
#\! #\" #\# #\$ #\% #\& #\' #\( #\) #\* #\+ #\, #\- #\. #\/
#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9 #\: #\; #\< #\= #\> #\?
#\@ #\A #\B #\C #\D #\E #\F #\G #\H #\I #\J #\K #\L #\M #\N #\O
#\P #\Q #\R #\S #\T #\U #\V #\W #\X #\Y #\Z #\[ #\\ #\] #\^ #\_
#\` #\a #\b #\c #\d #\e #\f #\g #\h #\i #\j #\k #\l #\m #\n #\o
#\p #\q #\r #\s #\t #\u #\v #\w #\x #\y #\z #\{ #\| #\} #\~
that's all. So in pure CL, independent of the implementation, you'll be
able to decode UTF-8 to Unicode code points, but you can convert to
characters only the code points between 32 and 126, plus the newline.
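The first half, UTF-8 bytes to Unicode code points, needs nothing but integer arithmetic, so it can be sketched in portable CL. UTF-8-TO-CODE-POINTS is a hypothetical name; the function returns integers, not characters, and it is an illustration rather than a validated decoder.

```lisp
;; Sketch only: UTF-8-TO-CODE-POINTS is a hypothetical helper.
(defun utf-8-to-code-points (octets)
  "Decode the vector OCTETS (UTF-8) into a list of Unicode code points."
  (let ((i 0) (n (length octets)) (result '()))
    (flet ((continuation ()
             ;; Fetch the next byte, which must look like #b10xxxxxx,
             ;; and return its six payload bits.
             (when (>= i n) (error "Truncated UTF-8 sequence"))
             (let ((b (aref octets i)))
               (unless (= (logand b #xC0) #x80)
                 (error "Invalid continuation byte ~D at index ~D" b i))
               (incf i)
               (logand b #x3F))))
      (loop while (< i n)
            do (let* ((lead (aref octets i))
                      ;; Number of continuation bytes announced by LEAD.
                      (extra (cond ((< lead #x80) 0)
                                   ((= (logand lead #xE0) #xC0) 1)
                                   ((= (logand lead #xF0) #xE0) 2)
                                   ((= (logand lead #xF8) #xF0) 3)
                                   (t (error "Invalid lead byte ~D at index ~D"
                                             lead i))))
                      (code (logand lead (aref #(#x7F #x1F #x0F #x07) extra))))
                 (incf i)
                 (loop repeat extra
                       do (setf code (logior (ash code 6) (continuation))))
                 ;; Reject overlong forms, surrogates and out-of-range values.
                 (when (or (< code (aref #(0 #x80 #x800 #x10000) extra))
                           (> code #x10FFFF)
                           (<= #xD800 code #xDFFF))
                   (error "Invalid code point ~X" code))
                 (push code result))))
    (nreverse result)))
```

For example, the bytes 72 195 169 decode to the code points 72 and 233. Whether 233 can then become a character is up to the implementation.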
Since in a UTF-8 stream each of these characters is encoded as a
single byte less than 128, and all the characters above code 127 are
encoded as sequences of bytes greater than 127, you could actually skip
the UTF-8 decoding, just signaling an error on any byte greater than
127 (and on any control code other than the newline).
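A sketch of that shortcut, again in portable CL. It uses a lookup string instead of CODE-CHAR, since the standard doesn't say which codes CODE-CHAR assigns to which characters. Both names below are hypothetical, and treating byte 10 as the newline is an assumption about the file.

```lisp
;; Sketch only: both names below are hypothetical helpers.
(defparameter *printable-standard-chars*
  " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
  "The 95 graphic standard characters, in ASCII order from code 32 to 126.")

(defun ascii-octets-to-string (octets)
  "Convert OCTETS to a string, accepting only byte 10 and bytes 32..126."
  (map 'string
       (lambda (b)
         (cond ((= b 10) #\Newline)   ; assumes LF line endings
               ((<= 32 b 126) (char *printable-standard-chars* (- b 32)))
               (t (error "Byte ~D has no standard character" b))))
       octets))
```

Any byte outside those ranges signals an error instead of producing a wrong character.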
> Thanks in advance for any light you can shed on this question.
If I were you, I'd try to get to use either sbcl or clisp (or both),
open the file as a binary file with :element-type '(unsigned-byte 8),
seek to the _byte_ offset you're given, read the bytes, then use
#+sbcl sb-ext:octets-to-string or #+clisp ext:convert-string-from-bytes
to DECODE the bytes and get a string of unicode characters.
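With read-time conditionals, that could look like the sketch below. DECODE-UTF-8 is a hypothetical wrapper name; the two calls it wraps are the sbcl and clisp functions named above, and on any other implementation it just signals an error.

```lisp
;; Sketch only: DECODE-UTF-8 is a hypothetical wrapper.
(defun decode-utf-8 (octets)
  "Decode a vector of (unsigned-byte 8) into a string of characters."
  (declare (ignorable octets))
  #+sbcl (sb-ext:octets-to-string octets :external-format :utf-8)
  #+clisp (ext:convert-string-from-bytes octets charset:utf-8)
  #-(or sbcl clisp) (error "No UTF-8 decoder known for this implementation."))
```

The argument should be a vector created with :element-type '(unsigned-byte 8), which is what READ-SEQUENCE fills when reading from a binary stream into such a buffer.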
http://www.cliki.net/CloserLookAtCharacters
--
__Pascal Bourguignon__ http://www.informatimago.com/
In deep sleep hear sound,
Cat vomit hairball somewhere.
Will find in morning.