Re: problem with special characters in file command?



M. Strobel <sorry_no_mail_here@xxxxxxxxxxx> wrote:
Test session: (the file name is wrapped, there is a space in it, should
not be the problem)
-------------------------------------------------------------------------------
% set f [lindex [glob *.wav] 0]
2$wav@@4$27-06-2008@@6$00$06@@7$27-06-2008
08$45$34@@8$²²@@29$24@@91$!QBLFUX0M@@31$1@@28$0@@.wav

Those two non-ASCII characters probably cause the problem. It is
not a fundamental problem, though; it depends on several factors:

In a nutshell, I suppose your script is running in a Unicode
locale ([encoding system] == "utf-8"), but the uploaders aren't!

What happens is that the file *names* are probably iso8859-1
encoded, but [glob] "unicode"izes the names, so they appear
OK inside the script. But when you try to access them from
Tcl, they are converted to utf-8, resulting in byte sequences
that differ from the names on disk; thus [file readable ...] == 0.
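
To make the suspected round trip concrete, here is a minimal sketch
of what I think happens to the two "²" bytes. It is an assumption
that [glob] maps each stray byte to the Latin-1 character with the
same code; the explicit iso8859-1 decode below only models that
step, it is not what [glob] literally calls.

```tcl
# Two bytes 0xB2 0xB2, as an iso8859-1 client would store "²²" on disk.
set onDisk [binary format H* b2b2]

# Assumption: models how the name ends up as characters inside Tcl.
set inTcl [encoding convertfrom iso8859-1 $onDisk]

# What Tcl hands back to the OS when the system encoding is utf-8.
set sentBack [encoding convertto utf-8 $inTcl]

binary scan $onDisk   H* diskHex
binary scan $sentBack H* sentHex
puts "$diskHex -> $sentHex"   ;# prints: b2b2 -> c2b2c2b2
```

Four bytes go out where two are on disk, so the OS looks for a
file that does not exist.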

To verify this theory, do this:

from bourne shell:
ls *.wav | hd
and from tclsh:
set fd [open "|hd" w]
puts $fd [lindex [glob *.wav] 0]
close $fd

I bet they differ.

To work around it, there are a couple of ways:
-) run your script in an iso8859-1 (not -15) locale;
if you need unicode-awareness, do all conversions
manually.
-) write a filename-sanitizer that renames the files to
proper utf-8. I've written such a script, so let me
know if renaming these files is OK and you're interested.
-) set "encoding system iso8859-1" -- vaguely equivalent
to first suggestion, except for some subtle differences.
-) wrap access to the files in pairs of "encoding system iso8859-1"
and "encoding system $originalSavedEncoding".
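
The last option could be sketched roughly like this. The proc name
withSystemEncoding is made up for illustration, and the sketch
assumes Tcl 8.5 or later (for the options dictionary of [catch]).
Note that [encoding system] is a process-wide setting, so anything
else that runs while it is switched will see the temporary value.

```tcl
# Hypothetical helper: run a script with a temporary system encoding
# and restore the saved one afterwards, even if the script fails.
proc withSystemEncoding {enc script} {
    set saved [encoding system]
    encoding system $enc
    set code [catch {uplevel 1 $script} result options]
    encoding system $saved
    # Pass the script's result (or its error) on to the caller.
    dict incr options -level
    return -options $options $result
}

# Usage: glob and test the files while names are treated as iso8859-1.
set readable [withSystemEncoding iso8859-1 {
    set ok {}
    foreach f [glob -nocomplain *.wav] {
        lappend ok [file readable $f]
    }
    set ok
}]
```

Catching the error before restoring is the point of the wrapper:
a plain pair of "encoding system" calls would leave the process
in iso8859-1 if the file access in between throws.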

My script also uses some heuristics to try to tell such files
from files whose names really are utf-8. This would be useful if
some of the clients actually do happen to use unicode.
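
One such heuristic, as a rough sketch only (this is not the script
mentioned above, and looksLikeUtf8 is a made-up name): decode the
raw bytes of a name as utf-8, encode the result again, and see
whether the same bytes come back. It assumes you already have the
name as a byte string, one character per byte, for example from a
[glob] done while the system encoding is iso8859-1.

```tcl
# Hypothetical helper: 1 if the bytes survive a utf-8 round trip.
proc looksLikeUtf8 {bytes} {
    # Some Tcl versions raise an error on invalid input, others
    # substitute characters; catch covers the first case, the
    # comparison below covers the second.
    if {[catch {encoding convertfrom utf-8 $bytes} decoded]} {
        return 0
    }
    if {[catch {encoding convertto utf-8 $decoded} again]} {
        return 0
    }
    return [expr {$again eq $bytes}]
}

set latin1Name [binary format H* b2b2]   ;# "²²" as iso8859-1 bytes
set utf8Name   [binary format H* c2b2]   ;# "²" as utf-8 bytes
puts [list [looksLikeUtf8 $latin1Name] [looksLikeUtf8 $utf8Name]]
```

Pure ASCII names pass trivially, and an iso8859-1 name can happen
to be valid utf-8 as well, so this only narrows things down; it
cannot prove which encoding the uploader used.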
