Re: Using Japanese and English strings, encodings



drrobot wrote:
I've been working on a few small programs that use both Japanese and
English, and I keep wishing I could closely, reliably, and simply
couple two strings together (one in English and one in Japanese). In
the simplest case, I (naively?) dream of writing something like this:

(princ (eng "This is an english sentence.")
       (jap "これは日本語の文章です。"))

Hmm, "Kore-wa Nihongo-no bunshou desu". Is that the right
translation? Does translation have to preserve truth value and
self-reference? Or is it acceptable to lie and just say "This is an
English sentence" ("これは英語の文章です"). :)

Maybe this should depend on the value of *PRINT-CIRCLE*. :)

和気裸白-san (kazu ki-ra-haku :) has been learning some Japanese
lately. I made a nice little web-based program for drilling oneself on
Kanji meanings, using edict + CLISP + araneida. In a few months, I
memorized at least one meaning of 1200 of the beasties. Crazy!

Where ENG and JAP are little macros such that setting something to 'ENG
before the macro is evaluated will make it come out as:

(princ "This is an english sentence")

The above syntax won't work because the reader will turn both macros
into some kind of object. So you are calling PRINC with two arguments.

or in the case of a Japanese compile, the sentence tagged with JAP.

So far, I have been getting by with using *features* and #+jap or #+eng.

You're on the right track if you want it to work that way, because you
do have to do it at read-time. But there are pitfalls to doing it at
read time. It's hard to dynamically load in new language definitions,
for instance!
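
To make the read-time point concrete, here is a small sketch in portable Common Lisp. READ-WITH-FEATURE and *SOURCE* are names invented for this illustration, and it assumes that neither :JAP nor :ENG is already present in *FEATURES*. It reads the same source text twice and shows that the reader has already thrown one string away before any code runs:

```lisp
;; Hypothetical helper: read one form from TEXT with FEATURE
;; temporarily pushed onto *FEATURES*.
(defun read-with-feature (feature text)
  (let ((*features* (cons feature *features*)))
    (read-from-string text)))

;; The same source text, tagged with #+jap and #+eng.
(defparameter *source*
  "(princ #+jap \"これは日本語の文章です。\" #+eng \"This is an English sentence.\")")

;; With :ENG present, the reader returns a PRINC form holding only the
;; English string; with :JAP present, only the Japanese one.
(read-with-feature :eng *source*)
(read-with-feature :jap *source*)
```

Since the unused string never makes it into the form, switching language means reading (and compiling) the source again, which is why new languages cannot simply be loaded into a running image this way.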

It would be much better if you had a single LANG macro, as other
posters have pointed out. Moreover, that macro could translate to code
which does a dynamic lookup.

Really, you should have only the English string in the expression (or
whatever is the preferred language for coding). The translations would
be in an external catalog. E.g.

(lang "Settings")

LANG doesn't have to be a macro, just a function. Perhaps a memoized
one. What it will do is search for that string in the program's
database, and fetch the appropriate translation.
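
A minimal sketch of such a lookup in portable Common Lisp follows. Only LANG comes from the discussion above; *LANGUAGE*, *CATALOG* and ADD-TRANSLATION are names made up for the example, and an in-memory hash table stands in for the external catalog:

```lisp
;; Hypothetical names throughout, except LANG itself.
(defvar *language* :eng
  "Current output language. :ENG means the source string is used as-is.")

(defvar *catalog* (make-hash-table :test #'equal)
  "Maps (language . english-string) to a translated string.")

(defun add-translation (language english translated)
  "Register TRANSLATED as the LANGUAGE version of ENGLISH."
  (setf (gethash (cons language english) *catalog*) translated))

(defun lang (english)
  "Return the translation of ENGLISH for *LANGUAGE*, or ENGLISH itself
if the catalog has no entry."
  (or (gethash (cons *language* english) *catalog*)
      english))

;; Usage: translations are data, so they can be added at run time.
(add-translation :jap "Settings" "設定")
(let ((*language* :jap))
  (princ (lang "Settings")))
```

Because *LANGUAGE* is an ordinary special variable, the language can be rebound per request or per thread, and a missing entry falls back to the English string instead of signalling an error.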

Note that CLISP is internationalized using some extensions that are
based on top of GNU gettext, and these are available for
internationalizing user programs. Are you after a portable solution or
not?

2. I ran into a little problem with the (very simple) CGI library I am
using. It keeps screwing up the EUC-JP encoding of any parameters I
pass the script. I traced the problem down to where the CGI script uses
CODE-CHAR to convert the correct EUC-JP encoded byte(s) into incorrect
UTF-8 byte(s). I think CLISP is using UTF-8 internally, despite my
command-line orders to use EUC-JP for everything. Everything else works
just fine.

CLISP has some special variables for overriding the encoding used for
streams. Internally, strings are 16 bit characters, I think. Or
something like that. The encoding comes into play during I/O.

Code that reads octets from a socket and then uses CODE-CHAR bypasses
the encoding system in CLISP's streams, so of course it will break.

How do I convince CODE-CHAR to use the EUC-JP character set? Or,
skipping over CODE-CHAR entirely, how do I use WRITE-BYTE to just write
the raw byte to a character stream and ignore what sort of (multi-byte
encoded) character it is?

But characters are not necessarily bytes in Lisp. CLISP has functions
for converting between vectors of bytes and strings, through encodings:

(EXT:CONVERT-STRING-FROM-BYTES vector encoding &KEY :START :END)
(EXT:CONVERT-STRING-TO-BYTES string encoding &KEY :START :END)

The CGI script could maybe be hacked to use this instead of CODE-CHAR.

The encoding parameter comes from EXT:MAKE-ENCODING.
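
As a rough sketch of what the replacement for the CODE-CHAR loop might look like: DECODE-OCTETS and ENCODE-STRING below are hypothetical wrapper names, not part of CLISP or of the CGI library, and the forms are guarded with #+clisp because the EXT functions exist only there. The idea is to collect the raw octets first and decode them in one go:

```lisp
;; CLISP only. Hypothetical wrappers around the two EXT functions.
#+clisp
(defun decode-octets (octets charset)
  "Decode the sequence OCTETS into a string using CHARSET,
for example 'charset:euc-jp."
  (ext:convert-string-from-bytes
   (coerce octets '(vector (unsigned-byte 8)))
   (ext:make-encoding :charset charset)))

#+clisp
(defun encode-string (string charset)
  "Encode STRING into a vector of octets using CHARSET."
  (ext:convert-string-to-bytes
   string
   (ext:make-encoding :charset charset)))

;; Round trip: string -> EUC-JP octets -> string.
#+clisp
(decode-octets (encode-string "日本語" 'charset:euc-jp)
               'charset:euc-jp)
```

Calling CODE-CHAR on each octet separately treats every byte as its own character code, so a two-byte EUC-JP character comes out as two unrelated characters. Decoding the whole octet vector avoids that.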

If you are dealing with encodings, it's probably best to code that into
your CLISP program rather than to try to globally override it, since
that program still sits in an environment full of ordinary data. E.g.
in my Kanji learning program, it's necessary to read EDICT. So what it
does is locally switch to EUC-JP to read that data. It is then parsed
with the help of CL-PPCRE and turned into a Lisp data structure which
is written out to disk, using UTF-8. The next time the program is run,
it looks for that "compiled" version which loads much faster just using
LOAD.

Here is the function that I use to load a text file in EUC-JP:

(defun read-euc-jp-file (name)
  (letf ((*default-file-encoding*
          (make-encoding :charset 'charset:euc-jp)))
    (with-open-file (f name :direction :input)
      (loop for line = (read-line f nil nil)
            while line
            collecting line))))

I use short identifiers for some of these CLISP extensions, because I
put this in my package definition:

(:import-from #:ext #:letf #:make-encoding)
(:import-from #:custom #:*default-file-encoding*)

Note that it's LETF not LET in the above, because
*DEFAULT-FILE-ENCODING* isn't an ordinary special variable. It's
unfortunately a symbol macro.
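
If you would rather not touch *DEFAULT-FILE-ENCODING* at all, a variant that should also work is to hand the encoding to the stream directly through :EXTERNAL-FORMAT. This is only a sketch, assuming CLISP (hence the #+clisp guard); READ-LINES-WITH-ENCODING is a name invented here:

```lisp
;; CLISP only. Hypothetical variant of READ-EUC-JP-FILE that passes the
;; encoding per stream instead of rebinding the global default.
#+clisp
(defun read-lines-with-encoding (name charset)
  "Return the lines of file NAME as a list of strings, decoded with
CHARSET, for example 'charset:euc-jp."
  (with-open-file (f name
                     :direction :input
                     :external-format (ext:make-encoding :charset charset))
    (loop for line = (read-line f nil nil)
          while line
          collect line)))
```

This keeps the encoding choice local to the one file that needs it, and no LETF is involved.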
