Re: utf-8, was Re: Three questions: UTF-8, DBM, hash of lists, ...

From: Alan J. Flavell (flavell_at_ph.gla.ac.uk)
Date: 01/15/05


Date: Sat, 15 Jan 2005 22:00:10 +0000

On Sat, 15 Jan 2005, Wes Groleau wrote:

> Welcome to Usenet.

Indeed. It seems from your response, and the rarity of responses from
other contributors, that you're in the position to offer us all a
valuable tutorial on the topic.

> I don't want to know what it does internally, as long as everything
> comes out UTF-8 and is decoded as such going in.

Fine, then we're pretty much up to speed already, and I'm sorry that I
misinterpreted your original posting.

> > Which is not to deny that there can also be situations where you'd
> > want to write unicode characters directly - but then you have to
> > be a lot more careful with how you edit and transfer your source
> > code. See
> > http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Effects-of-Character-Semantics
> > for more details.
>
> Yes, I read that. I'm trying to minimize the need for "being
> careful" about all those ten zillion details by specifying
> "everything is UTF-8."

Point made. If you're really in control of all that data then you're
in a much happier position than I've ever been ;-)

> I     1   STDIN is assumed to be in UTF-8
> O     2   STDOUT will be in UTF-8
> E     4   STDERR will be in UTF-8
> S     7   I + O + E
> i     8   UTF-8 is the default PerlIO layer for input streams
> o    16   UTF-8 is the default PerlIO layer for output streams
> D    24   i + o
>
> Seems to say -CSDA should handle all my IO

It does, doesn't it? Did I miss the specific problem you were having,
and your test case that demonstrated it?
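
For anyone following along, here is a minimal sketch of the same idea
expressed inside the script rather than on the command line. It assumes
that "use open qw(:std :utf8);" is a close enough stand-in for running
the script as "perl -CSD"; the sample text and the temporary file are
made up purely for illustration.

```perl
use strict;
use warnings;
use open qw(:std :utf8);        # rough in-script stand-in for -CSD (an assumption)
use File::Temp qw(tempfile);

# Illustrative sample text: c, a, f, e-acute, space, U+263A (6 characters).
my $text = "caf\x{e9} \x{263A}";

# Only borrow a safe temporary file name; reopen it ourselves so that
# the default layers set by "use open" apply to our own handles.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh or die "close: $!";

open(my $out, '>', $path) or die "open for write: $!";
print {$out} "$text\n";         # encoded to UTF-8 by the default output layer
close $out or die "close: $!";

open(my $in, '<', $path) or die "open for read: $!";
my $line = <$in>;               # decoded from UTF-8 by the default input layer
close $in;
chomp $line;

printf "characters read back: %d\n", length $line;
print "round trip ", ($line eq $text ? "ok" : "FAILED"), "\n";
```

If the layers are in effect, the string read back should compare equal
to the one written, character for character.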

> > > But
> > > another man page seemed to say that "use utf8;" covered
> > > something that -CSD did not, so I put that in, too.
> >
> > The perlunicode pod, for the version of Perl that you're using,
> > should be your "bible". Don't go tossing-in arbitrary bits and
> > pieces that
>
> I have 5.8.1 but no pod, so my 'elsewhere' is the man pages
> derived from the pod.

No disagreement there. More than one way to...read the documentation.

> > See what
> > http://www.perldoc.com/perl5.8.4/pod/perlunicode.html#Important-Caveats
> > says about "use utf8;".
>
> It says the same as my man page: that the pragma is needed
> to "enable UTF-8" in scripts.

Hmmm? At 5.8.4 (and I don't remember it being different in recent
versions before that) it says [this'll need monospace display, and go
sadly wrong with these newfangled usenet-ish interfaces, sorry]:

 As a compatibility measure, the use utf8 pragma must be explicitly
 included to enable recognition of UTF-8 in the Perl scripts
                                         ^^^^^^^^^^^^^^^^^^^
 themselves (in string or regular expression literals, or in
 ^^^^^^^^^^
 identifier names) on ASCII-based machines or to recognize UTF-EBCDIC
 on EBCDIC-based machines. These are the only times when an explicit
                                         ^^^^^^^^^^
 use utf8 is needed.
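
A small sketch of what that caveat means in practice, assuming the
script file itself is saved as UTF-8. The literal used here (U+263A, a
smiley) is just an illustrative choice; the point is only that the same
literal is counted as bytes without the pragma and as one character
with it.

```perl
use strict;
use warnings;

# NOTE: this file must be saved as UTF-8 for the comparison to make sense.

my $without;
{
    no utf8;                    # the default: source literals are read as bytes
    $without = length("☺");     # counts the UTF-8 bytes of the literal
}

my $with;
{
    use utf8;                   # source literals are decoded as UTF-8
    $with = length("☺");        # counts characters
}

print "length without use utf8: $without\n";
print "length with use utf8:    $with\n";
```

The pragma is lexically scoped, which is why the two blocks can sit
side by side in one file. It says nothing about I/O; that remains the
job of -C or of explicit layers.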

> However, 'man perlrun' says the -CSD handles the IO,

Indeed, and (fwiw) I don't see anything there about encoding of the
script's source code itself.

> and perlunicode says for script encoding, see encoding
> which says that UTF-8 already works in scripts.

It "works", yes, but as I understand it you have to ask for it. It
could be that if you call for locale-awareness with -CL, and your
locale specifies utf-8, it will come out in the wash; but I don't see
any harm in asking for it directly, if you're certain that you'll
always want it.

> So, things are a little unclear. I put in both,

Looks as if you're (a) right and (b) unlikely to cause any harm.

> was able to read UTF-8 text, put it in a DBM hash, and
> get it back out. That's good enough for now.
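
For the archive, a hedged sketch of one way to make the DBM round trip
explicit rather than incidental. DBM files store bytes, and the -C
switches only concern PerlIO streams, so this version encodes and
decodes at the DBM boundary using the filter hooks described in
perldbmfilter. SDBM_File, the database name and the sample data are all
assumptions made for illustration; the thread does not say which DBM
flavour was actually used.

```perl
use strict;
use warnings;
use Fcntl;                              # O_RDWR, O_CREAT
use SDBM_File;                          # assumed DBM flavour, ships with Perl
use Encode qw(encode_utf8 decode_utf8);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/demo";                 # hypothetical database name

my %db;
my $tie = tie(%db, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666)
    or die "Cannot tie $file: $!";

# Encode characters to UTF-8 bytes on the way in,
# decode UTF-8 bytes back to characters on the way out.
$tie->filter_store_key  (sub { $_ = encode_utf8($_) });
$tie->filter_store_value(sub { $_ = encode_utf8($_) });
$tie->filter_fetch_key  (sub { $_ = decode_utf8($_) });
$tie->filter_fetch_value(sub { $_ = decode_utf8($_) });

my $key   = "gr\x{fc}\x{df}e";          # illustrative key with two non-ASCII letters
my $value = "\x{263A} smile";           # illustrative value, 7 characters

$db{$key} = $value;
my $back         = $db{$key};
my ($stored_key) = keys %db;

undef $tie;                             # drop the extra reference before untie
untie %db;

printf "value characters: %d\n", length $back;
print "value round trip ", ($back eq $value ? "ok" : "FAILED"), "\n";
print "key round trip ",   ($stored_key eq $key ? "ok" : "FAILED"), "\n";
```

With the filters in place the rest of the program can treat keys and
values as ordinary character strings.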

Good luck


