Re: Converting codepages to UTF8



P wrote:

I have one file, which is in UTF8, which contains a set of strings.
I want to determine whether any of the strings matches any file name
in a specified directory.

Since there can be special characters in the file names (and in the
strings in the UTF8 file), sometimes I'll get false negatives, because
a simple eq on the strings in the UTF8 file and on the file names in
the directory won't match (due to the different encodings).

So I want to normalise the directory listing first (and this should be
dependent on the code page, because different users might be using
different code pages) and compare the resulting list to the list in
the UTF8 file. Does that make sense? :)

Yes, that is much clearer. I'll assume that you have Windows and maybe
Cygwin.


Have you read perllocale, perluniintro, perlunicode, perlebcdic?


Use the command:

for /f "tokens=4" %w in ('chcp') do dir >text.%w

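to create a file called "text.437" (if your chcp is 437)
containing the dir output for the current directory. This relies on
chcp printing a line like "Active code page: 437", where the number is
the fourth token; on a non-English Windows the token position may
differ. Inside a batch file, write %%w instead of %w.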


Under Cygwin, you can use the command:

iconv -f CP437 -t UTF-8 text.437 > text.utf8

to convert the file from CP437 to UTF-8.


But that second step can also be done with Perl.
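
As a minimal sketch of that second step in Perl, assuming the core Encode module and a byte string that really is in the given code page: the helper name codepage_to_utf8 is made up for this illustration, and "caf\x82" is just sample data (0x82 is e-acute in CP437).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of Encode): convert a byte string
# from the given code page to UTF-8 encoded bytes.
sub codepage_to_utf8 {
    my ($bytes, $codepage) = @_;
    # Decode the code-page bytes into Perl characters; croak on bytes
    # that are not valid in that code page instead of guessing.
    my $chars = decode($codepage, $bytes, Encode::FB_CROAK);
    # Encode the characters as UTF-8 bytes.
    return encode('UTF-8', $chars);
}

# Sample data: "caf" followed by byte 0x82 (e-acute in CP437).
my $utf8 = codepage_to_utf8("caf\x82", 'cp437');
```

For whole files the same conversion can be done with PerlIO layers, by opening the input with '<:encoding(cp437)' and the output with '>:encoding(UTF-8)' and copying line by line.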

An (almost) platform-independent way to list all available encodings:

perl -MEncode -e "print join $/, Encode->encodings(':all')" |more
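
To check a single code page rather than read the whole list, something like the following might do. It is only a sketch: the helper name encoding_for_codepage is invented here, and it assumes the number reported by chcp maps to an Encode name of the form "cpNNN", which holds for 437 but not necessarily for every code page.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: map a chcp number to the name Encode uses
# for that code page, or undef if Encode does not know it.
sub encoding_for_codepage {
    my ($number) = @_;
    # Assumption: Encode names the code page "cp" plus the number.
    my $enc = find_encoding("cp$number");
    return $enc ? $enc->name : undef;
}

my $name = encoding_for_codepage(437);
print defined $name ? "Encode knows it as $name\n" : "not supported\n";
```

A defined result means the name can be passed to decode() or used in an ':encoding(...)' layer.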

Now it is your turn to create some code and try to make it work.

--
Affijn, Ruud

"Gewoon is een tijger."
