Re: Character semantics for filenames (was: win32 reading wide filenames (unicode))
- From: robic0
- Date: Mon, 03 Apr 2006 17:44:22 -0700
On Mon, 3 Apr 2006 23:18:52 +0000 (UTC), Ilya Zakharevich <nospam-abuse@xxxxxxxxx> wrote:
[A complimentary Cc of this posting was NOT [per weedlist] sent to
Peter J. Holzer
<hjp-usenet2@xxxxxx>], who wrote in article <rftr0e.28r.ln@xxxxxxxxxxx>:
Suppose you append $ext to the string; and suppose $ext happened to be
promoted to utf8 in some way.
Now the file name is stored (internally) in utf8 format. How do you pass
it to open()? For simplicity, you may suppose that all the chars in
the string are below \xFF...
In my mental model of perl strings (which may be totally wrong, of
course), it doesn't make any difference whether the string is internally
stored in UTF-8 or byte format if all chars are <= 0xFF.
It does not make any difference TO ANY OPERATION WHICH KNOWS WHAT TO
DO WITH CHARACTERS ABOVE "\xFF". And now we are discussing exactly
this: what should open() do with such characters?
AFTER this question is decided, ONLY THEN the operation on strings
with only 8-bit characters will become "transparent" to the internal
representation (thus to the history of handling the string). Right now, I
suspect, open() works on the supplied byte stream AS IS, disregarding
the hints whether the byte stream is an 8-bit representation, or utf8
representation...
Hints? Unicode inserts hints in strings. Everything is nothing without
translation, a priori knowledge of a templated form. Can it really be
assured that the byte stream will not be pure ASCII in a true UC stream?
That's what they bank on!
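To make the point about internal representation concrete, here is a minimal sketch (not from the original posts). It uses only core Perl (utf8::upgrade, utf8::is_utf8, bytes::length); the variable names are illustrative. It shows two strings that compare equal character by character, while their internal buffers hold different octets. Which of those buffers a system call ends up seeing is exactly the open() question raised above.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

# Two copies of the same four-character string, all code points <= 0xFF.
my $narrow = "caf\xE9";       # stored internally one octet per character
my $wide   = "caf\xE9";
utf8::upgrade($wide);         # same characters, now stored internally as UTF-8

# At the character level the two are indistinguishable.
my $same_chars  = ($narrow eq $wide) ? 1 : 0;
my $same_length = (length($narrow) == length($wide)) ? 1 : 0;

# Only the internal flag and the internal buffer length differ.
my $flag_narrow   = utf8::is_utf8($narrow) ? 1 : 0;
my $flag_wide     = utf8::is_utf8($wide)   ? 1 : 0;
my $octets_narrow = bytes::length($narrow);
my $octets_wide   = bytes::length($wide);

printf "equal as characters: %d, same character length: %d\n",
    $same_chars, $same_length;
printf "utf8 flag: narrow=%d wide=%d\n", $flag_narrow, $flag_wide;
printf "internal octets: narrow=%d wide=%d\n", $octets_narrow, $octets_wide;
```

Any operation that looks at the raw buffer instead of the characters can tell the two strings apart, even though eq cannot.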
Wrong, characters in Perl are 4 octets, not 1. Bytes are 1 octet.
thread "Converting codepages to UTF8" for an example). Perl should
do that automatically: filenames should be converted from the OS
encoding to perl strings by readdir and from perl strings to the OS
encoding
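Until something like that happens automatically, the conversion can be done by hand with the Encode module. The following is only a sketch: the choice of 'iso-8859-1' as the file system encoding is an assumption, and name_to_octets() and octets_to_name() are hypothetical helper names, not part of Perl.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: names on this file system were created as ISO-8859-1.
my $fs_encoding = 'iso-8859-1';

# Character string -> octets to hand to open(), opendir(), unlink(), ...
sub name_to_octets {
    my ($name) = @_;
    return encode($fs_encoding, $name);
}

# Octets as returned by readdir() -> character string.
sub octets_to_name {
    my ($octets) = @_;
    return decode($fs_encoding, $octets);
}

# A name whose characters are all <= 0xFF but which has been
# promoted to UTF-8 internally, as in the $ext example.
my $name = "r\xE9sum\xE9.txt";
utf8::upgrade($name);

my $octets     = name_to_octets($name);
my $round_trip = octets_to_name($octets);

printf "characters: %d, octets passed to the OS: %d\n",
    length($name), length($octets);
printf "round trip preserved the name: %s\n",
    ($round_trip eq $name) ? 'yes' : 'no';
```

With the explicit encode step, the octets handed to the OS no longer depend on how the string happens to be stored internally.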
On legacy systems (with no special API to get wide-char listing) there
is no "OS encoding". The encoding of filenames is a property of a
directory; not of OS, and not of user environment.
Which legacy system has per-directory encoding?
As I said: any system with no special API to get wide-char listing.
Most probably this translates to "any system but Plan9, Win*, and a
short list of other stuff".
A filename is stored as a sequence of bytes without any
charset information.
Well, this is exactly what I meant by "per-directory". You need to know
the "intent of the creator" of this directory structure to understand
how these byte streams should be translated to character streams.
Any interpretation of these bytes as characters is
a function of the user's environment.
Nope, this is the function of "environment" at the moment of creation
of files, not at the moment of reading them.
If the user changes the environment, the files (apparently) change
their names and may even become inaccessible.
There should be no name change (unless the encoding of a directory is
marked "kinda raw, not human-readable"). In most cases file names are
designed to be human-readable, thus they should contain characters,
not bytes.
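A small illustration of why the intended encoding matters (a sketch, not from the original posts): the same four octets decode to entirely different characters under ISO-8859-1 and KOI8-R. The octet values and the helper name codepoints_under() are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Four octets as they might sit in a directory entry.
my $raw = "\xC4\xC5\xCE\xD8";

# Hypothetical helper: decode the octets under a given encoding and
# list the resulting code points in U+XXXX notation.
sub codepoints_under {
    my ($encoding, $octets) = @_;
    my $chars = decode($encoding, $octets);
    return join ' ', map { sprintf 'U+%04X', ord } split //, $chars;
}

my $as_latin1 = codepoints_under('iso-8859-1', $raw);
my $as_koi8r  = codepoints_under('koi8-r',     $raw);

print "ISO-8859-1: $as_latin1\n";   # Latin letters with diacritics
print "KOI8-R:     $as_koi8r\n";    # Cyrillic letters
```

Nothing in the octets themselves says which reading the creator of the file intended.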
By default, Perl converts everything to UC, 4 octets, depending on
a utf-8 flag (jeez). That means all operations are in UC.
Except for open(). Why is that? Byte code ASCII can be written out
if utf-8 is off. Seems internally, Perl may (without asking) depend
on the underlying filesystem to create files. Dunno, dunno any of this
***. Learning though...
There should be a way to inform opendir() of the encoding of the file
listing; there should be no difference with open() in this regard.
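Perl's opendir() takes no encoding argument, so for now the caller has to supply that information. A minimal sketch of what such a wrapper might look like follows; read_dir_as() is a hypothetical helper, and the encoding passed to it is whatever the caller assumes the directory was created with.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list a directory and decode every entry from the
# encoding the caller believes the names were created in.
sub read_dir_as {
    my ($dir, $encoding) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory '$dir': $!";
    my @raw = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);
    return map {
        my $octets = $_;              # copy, so decode() cannot touch the original
        decode($encoding, $octets);
    } @raw;
}

# Usage: treat the names in the current directory as ISO-8859-1 (an assumption).
my @names = read_dir_as('.', 'iso-8859-1');
printf "%d entries decoded\n", scalar @names;
```

The same encoding would then have to be applied in the other direction before passing a name from this list back to open().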
Hope this helps,
Ilya