Re: Using Japanese and English strings, encodings
- From: "Kaz Kylheku" <kkylheku@xxxxxxxxx>
- Date: 13 Apr 2006 10:23:22 -0700
drrobot wrote:
I've been working on a few small programs that use both Japanese and
English, and I keep wishing I could closely, reliably, and simply
couple two strings together (one in English and one in Japanese). In
the simplest case, I (naively?) dream of writing something like this:
(princ (eng "This is an english sentence.")
       (jap "これは日本語の文章です。"))
Hmm, "Kore-wa Nihongo-no bunshyou desu". Is that the right
translation? Does translation have to preserve truth value and
self-reference? Or is it acceptable to lie and just say "This is an
English sentence" ("これは英語の文章です"). :)
Maybe this should depend on the value of *PRINT-CIRCLE*. :)
和気裸白 (kazu ki-ra-haku :) has been learning some Japanese
lately. I made a nice little web-based program for drilling oneself on
Kanji meanings, using edict + CLISP + araneida. In a few months, I
memorized at least one meaning of 1200 of the beasties. Crazy!
Where ENG and JAP are little macros such that setting something to 'ENG
before the macro is evaluated will make it come out as:
(princ "This is an english sentence")
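The above syntax won't work because each macro call has to turn into
some kind of object; a macro cannot make itself vanish from the
argument list. So you are calling PRINC with two arguments, and the
second one is taken as PRINC's optional output stream.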
or in the case of a Japanese compile, the sentence tagged with JAP.
So far, I have been getting by with using *features* and #+jap or #+eng.
You're on the right track if you want it to work that way, because you
do have to do it at read-time. But there are pitfalls to doing it at
read time. It's hard to dynamically load in new language definitions,
for instance!
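To make the pitfall concrete, here is a minimal sketch of the read-time
approach. The function name GREETING is made up for illustration, and
I've used #-jap for the English branch so that the form still has a
body when no language feature is set at all:

```lisp
;; To get the Japanese build, this must be evaluated BEFORE the
;; DEFUN below is read:
;;   (pushnew :jap *features*)

(defun greeting ()
  ;; The reader keeps exactly one of these two strings and discards
  ;; the other; the compiler never sees the discarded one.
  #-jap "This is an English sentence."
  #+jap "これは日本語の文章です。")
```

Once the DEFUN has been read, pushing :JAP onto *FEATURES* has no
effect on GREETING. The other string is simply gone, which is why a
new language can't be loaded into a running image this way.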
It would be much better if you had a single LANG macro, as other
posters have pointed out. Moreover, that macro could translate to code
which does a dynamic lookup.
Really, you should have only the English string in the expression (or
whatever is the preferred language for coding). The translations would
be in an external catalog. E.g.
(lang "Settings")
LANG doesn't have to be a macro, just a function. Perhaps a memoized
one. What it will do is search for that string in the program's
database, and fetch the appropriate translation.
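A minimal sketch of that idea in portable Common Lisp, assuming the
catalog is just an EQUAL hash table living in the image. The names
*LANGUAGE*, *CATALOG* and ADD-TRANSLATION are invented for the
example; a real program would fill the table from an external file:

```lisp
(defvar *language* :eng
  "Hypothetical special variable naming the current output language.")

(defvar *catalog* (make-hash-table :test #'equal)
  "Maps (language . english-string) to the translated string.")

(defun add-translation (language english translated)
  "Record TRANSLATED as the LANGUAGE version of the string ENGLISH."
  (setf (gethash (cons language english) *catalog*) translated))

(defun lang (english)
  "Return the translation of ENGLISH for *LANGUAGE*.
Falls back to ENGLISH itself when the catalog has no entry."
  (values (gethash (cons *language* english) *catalog* english)))

;; One catalog entry, entered by hand for the example.
(add-translation :jap "Settings" "設定")
```

With that, (lang "Settings") gives back the English string by default,
and (let ((*language* :jap)) (lang "Settings")) gives the Japanese one.
Because the lookup happens at run time, switching languages or loading
more translations doesn't require re-reading any code.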
Note that CLISP is internationalized using some extensions that are
based on top of GNU gettext, and these are available for
internationalizing user programs. Are you after a portable solution or
not?
2. I ran into a little problem with the (very simple) CGI library I am
using. It keeps screwing up the EUC-JP encoding of any parameters I
pass the script. I traced the problem down to where the CGI script uses
CODE-CHAR to convert the correct EUC-JP encoded byte(s) into incorrect
UTF-8 byte(s). I think CLISP is using UTF-8 internally, despite my
command-line orders to use EUC-JP for everything. Everything else works
just fine.
CLISP has some special variables for overriding the encoding used for
streams. Internally, strings are sequences of Unicode characters (16
bits each, I think, or something like that), not bytes in any
particular encoding. The encoding only comes into play during I/O.
Code that reads octets from a socket and then uses CODE-CHAR bypasses
the encoding system in CLISP's streams, so of course it will break.
How do I convince CODE-CHAR to use the EUC-JP character set? Or,
skipping over CODE-CHAR entirely, how do I use WRITE-BYTE to just write
the raw byte to a character stream and ignore what sort of (multi-byte
encoded) character it is?
But characters are not necessarily bytes in Lisp. CLISP has functions
for converting between vectors of bytes and strings, through encodings:
(EXT:CONVERT-STRING-FROM-BYTES vector encoding &KEY :START :END)
(EXT:CONVERT-STRING-TO-BYTES string encoding &KEY :START :END)
The CGI script could maybe be hacked to use this instead of CODE-CHAR.
The encoding parameter comes from EXT:MAKE-ENCODING.
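Something along these lines might do as a replacement for the
CODE-CHAR loop. This is only a sketch and is CLISP-specific (EXT and
CHARSET are CLISP packages); the two function names are made up, and I
haven't seen the CGI library in question, so how the octets get
collected is an assumption:

```lisp
(defun octets-to-euc-jp-string (octets)
  "Decode OCTETS, a sequence of integers 0-255, as EUC-JP text."
  (ext:convert-string-from-bytes
   ;; Make sure we hand CLISP a proper octet vector.
   (coerce octets '(vector (unsigned-byte 8)))
   (ext:make-encoding :charset charset:euc-jp)))

(defun string-to-euc-jp-octets (string)
  "Encode STRING as a vector of EUC-JP octets."
  (ext:convert-string-to-bytes
   string
   (ext:make-encoding :charset charset:euc-jp)))

;; The six octets below are the EUC-JP encoding of 日本語.
(octets-to-euc-jp-string '(#xC6 #xFC #xCB #xDC #xB8 #xEC))
```

The point is to collect all of the octets of a parameter first and
decode them in one go, instead of turning each octet into a character
on its own.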
If you are dealing with encodings, it's probably best to code that into
your CLISP program rather than to try to globally override it, since
that program still sits in an environment full of ordinary data. E.g.
in my Kanji learning program, it's necessary to read EDICT. So what it
does is locally switch to EUC-JP to read that data. It is then parsed
with the help of CL-PPCRE and turned into a Lisp data structure which
is written out to disk, using UTF-8. The next time the program is run,
it looks for that "compiled" version which loads much faster just using
LOAD.
Here is the function that I use to load a text file in EUC-JP:
(defun read-euc-jp-file (name)
  (letf ((*default-file-encoding* (make-encoding :charset 'charset:euc-jp)))
    (with-open-file (f name :direction :input)
      (loop for line = (read-line f nil nil)
            while line
            collecting line))))
I use short identifiers for some of these CLISP extensions, because I
put this in my package definition:
(:import-from #:ext #:letf #:make-encoding)
(:import-from #:custom #:*default-file-encoding*)
Note that it's LETF not LET in the above, because
*DEFAULT-FILE-ENCODING* isn't an ordinary special variable. It's
unfortunately a symbol macro.