Re: Character semantics for filenames (was: win32 reading wide filenames (unicode))



[A complimentary Cc of this posting was NOT [per weedlist] sent to
Peter J. Holzer
<hjp-usenet2@xxxxxx>], who wrote in article <rftr0e.28r.ln@xxxxxxxxxxx>:
Suppose you append $ext to the string; and suppose $ext happened to be
promoted to utf8 in some way.

Now the file name is stored (internally) in utf8 format. How do you pass
it to open()? For simplicity, you may suppose that all the chars in
the string are below \xFF...

In my mental model of perl strings (which may be totally wrong, of
course), it doesn't make any difference whether the string is internally
stored in UTF-8 or byte format if all chars are <= 0xFF.

It does not make any difference TO ANY OPERATION WHICH KNOWS WHAT TO
DO WITH CHARACTERS ABOVE "\xFF". And now we are discussing exactly
this: what should open() do with such characters?

AFTER this question is decided, ONLY THEN the operation on strings
with only 8-bit characters will become "transparent" to the internal
representation (and thus to the history of handling the string). Right
now, I suspect, open() works on the supplied byte stream AS IS,
disregarding the hints whether the byte stream is an 8-bit
representation or a utf8 representation...
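
A minimal sketch of the distinction, assuming only the core utf8::
functions and the "bytes" pragma (the variable names are made up for
illustration). Two strings that compare equal as characters can have
different internal buffers, and an operation that reads the buffer AS
IS would see the difference:

```perl
use strict;
use warnings;

# Two copies of the same one-character string, "\xE9".
my $bytes = "\xE9";
my $wide  = "\xE9";
utf8::upgrade($wide);    # same characters, now stored internally as UTF-8

# At the character level the two are indistinguishable.
print "equal as strings\n" if $bytes eq $wide;
printf "lengths: %d and %d\n", length($bytes), length($wide);

# The internal flag differs, and so does the internal buffer.
printf "utf8 flag: %d and %d\n",
    (utf8::is_utf8($bytes) ? 1 : 0), (utf8::is_utf8($wide) ? 1 : 0);
{
    use bytes;    # peek at the internal buffer; for illustration only
    printf "internal bytes: %d and %d\n", length($bytes), length($wide);
}

# One way to remove the ambiguity before a system call: force the byte
# form. With a true second argument, utf8::downgrade returns false
# (instead of dying) if the string has characters above 0xFF.
my $name = $wide;
utf8::downgrade($name, 1) or die "name has characters above 0xFF\n";
```

The downgrade step only covers the "all chars are below \xFF" case; it
says nothing about what open() should do with wider characters.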

thread "Converting codepages to UTF8" for an example). Perl should
do that automatically: filenames should be converted from the OS
encoding to perl strings by readdir and from perl strings to the OS
encoding

On legacy systems (with no special API to get wide-char listing) there
is no "OS encoding". The encoding of filenames is a property of a
directory; not of OS, and not of user environment.

Which legacy system has per-directory encoding?

As I said: any system with no special API to get wide-char listing.
Most probably this translates to "any system but Plan9, Win*, and a
short list of other stuff".

A filename is stored as a sequence of bytes without any
charset information.

Well, this is exactly what I meant by "per-directory". You need to know
the "intent of the creator" of this directory structure to understand
how these byte streams should be translated to character streams.

Any interpretation of these bytes as characters is
a function of the user's environment.

Nope, this is the function of "environment" at the moment of creation
of files, not at the moment of reading them.
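
To make this concrete, here is a small sketch using the Encode module.
The byte string and the two candidate encodings are arbitrary choices
for illustration; the point is only that the same stored bytes give
different characters depending on which encoding you assume the creator
used:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The raw bytes of a name, as readdir() would hand them back.
my $raw = "\xC0\xC1\xC2";

# Two hypothetical "intents of the creator" for the same three bytes.
my $as_latin1 = decode('iso-8859-1', $raw);
my $as_cp1251 = decode('cp1251',     $raw);

# Show the resulting code points for each interpretation.
for my $pair ([latin1 => $as_latin1], [cp1251 => $as_cp1251]) {
    my ($label, $chars) = @$pair;
    printf "%s: %s\n", $label,
        join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
}
# latin1: U+00C0 U+00C1 U+00C2
# cp1251: U+0410 U+0411 U+0412
```

Nothing in the bytes themselves says which reading is the right one.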

If the user changes the environment, the files (apparently) change
their names and may even become inaccessible.

There should be no name change (unless the encoding of a directory is
marked "kinda raw, not human-readable"). In most cases file names are
designed to be human-readable, thus they should contain characters,
not bytes.

There should be a way to inform opendir() of the encoding of the file
listing; there should be no difference with open() in this regard.
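
Until something like that exists, the idea can be approximated in user
code. The sketch below is NOT an existing Perl API: read_names() and
open_named() are hypothetical helpers, and the encoding argument is
whatever the caller believes the directory was created with. It assumes
$dir itself is a plain byte string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: list a directory, decoding each raw name with the
# encoding the caller declares for that directory.
sub read_names {
    my ($dir, $encoding) = @_;
    opendir(my $dh, $dir) or die "opendir $dir: $!";
    my @names = map  { decode($encoding, $_) }
                grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return @names;
}

# Hypothetical counterpart for open(): encode the character name back to
# bytes with the same declared encoding before handing it to the OS.
# $dir is assumed to be a byte string, so the path stays in byte form.
sub open_named {
    my ($dir, $name, $encoding, $mode) = @_;
    my $path = "$dir/" . encode($encoding, $name);
    open(my $fh, $mode, $path) or die "open $path: $!";
    return $fh;
}
```

With these, read_names($dir, 'cp1251') and open_named($dir, $name,
'cp1251', '<') would agree on what the names in $dir mean, regardless
of the user's current environment.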

Hope this helps,
Ilya