Re: UTF8 strings and filesystem access



In article <slrnfh46l5.6q4.hjp-usenet2@xxxxxxxxxxx>,
Peter J. Holzer <hjp-usenet2@xxxxxx> wrote:
On 2007-10-11 22:22, Gary E. Ansok <ansok@xxxxxxxxxxxxxxxxxx> wrote:
In article <slrnfgt1r0.o12.hjp-usenet2@xxxxxxxxxxx>,
Peter J. Holzer <hjp-usenet2@xxxxxx> wrote:
Quoth ansok@xxxxxxxxxxxxxxxxxx (Gary E. Ansok):

1) $dir is encoded internally in UTF8 (even if $dir doesn't
contain any non-ASCII characters)

Then why is it a wide string?

It's read in using XML::Simple from a config file that does not
contain any non-ASCII characters, or any encoding specification in
the XML prolog (though adding "encoding='ISO-8859-1'" didn't help).

Now that I've dug a little deeper, I think upgrading some of our
module versions may help avoid this problem -- a recent change to
XML::LibXML mentioned "strip-off UTF8 flag for consistent behavior
independent of document encoding".

You omitted an important piece here: The entry reads
"strip-off UTF8 flag with $node->toString($format,1) for consistent ..."
$node->toString returns a piece of XML, which always should be a series
of bytes, not characters. I haven't looked at the source code of
XML::Simple, but it probably uses $text->data or $node->nodeValue.

I've worked around the problem by switching from XML::LibXML to
XML::SAX::PurePerl as the underlying parser -- now, the string
read in from the configuration file no longer has the UTF8 flag
set, and the problem does not appear.
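
For anyone who has to stay with XML::LibXML, one alternative (not the workaround used here) might be to strip the flag from the value before building paths with it. This is only a minimal sketch using the core utf8:: functions: the helper name without_utf8_flag and the path are made up, utf8::upgrade() merely simulates what the parser hands back, and it assumes the string holds no character above 0xFF.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $str without the UTF8 flag.
# utf8::downgrade() works in place and dies if the string holds a
# character above 0xFF, so only the local copy is modified.
sub without_utf8_flag {
    my ($str) = @_;
    utf8::downgrade($str);
    return $str;
}

my $dir = '/tmp/data';    # made-up path standing in for the config value
utf8::upgrade($dir);      # simulate a parser result: flag on, pure ASCII content

my $clean = without_utf8_flag($dir);

printf "before: %s\n", utf8::is_utf8($dir)   ? 'flag on' : 'flag off';
printf "after:  %s\n", utf8::is_utf8($clean) ? 'flag on' : 'flag off';
```

The two strings still compare equal with eq; only the internal representation differs.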

I still think it's a bug that a string that can successfully opendir()
a directory, combined (including the appropriate separator) with a
file name read in by readdir(), does not result in a string that can
be used to open() or stat() the file. Especially since the path appears
correct when printed as part of an error message, and it's difficult
to diagnose the problem without resorting to something like Devel::Peek.
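
A rough way to see the effect without Devel::Peek, as a sketch only: the directory and file name below are invented, utf8::upgrade() stands in for whatever set the flag on $dir, and the claim that open() and stat() hand the internal bytes to the OS is my understanding of the mechanism rather than something verified here.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling the pragma

my $dir = '/tmp/data';       # made-up directory
utf8::upgrade($dir);         # UTF8 flag on, contents unchanged

# Made-up name: the bytes readdir() might return for a Latin-1 "cafe" with
# an accented e (0xE9), followed by ".txt".
my $name = "caf\xE9.txt";

my $flagged = "$dir/$name";  # concatenating with a flagged string upgrades the result
my $plain   = $flagged;
utf8::downgrade($plain);     # same characters, stored one byte each

print "equal as strings\n" if $flagged eq $plain;
printf "flagged: %d chars, %d bytes internally\n",
    length($flagged), bytes::length($flagged);
printf "plain:   %d chars, %d bytes internally\n",
    length($plain), bytes::length($plain);
```

The character counts match and eq reports the strings as equal, which would explain why the path looks right in an error message, while the differing internal byte counts show that the two are not stored the same way.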

Thanks for the assistance,
Gary Ansok