Re: Converting codepages to UTF8



Dr.Ruud wrote:
P wrote:

I have one file, which is in UTF8, which contains a set
of strings. I want to determine whether any of the
strings matches any file name in a specified directory.

Since there can be special characters in the file names
(and in the strings in the UTF8 file), sometimes I'll
get false negatives, because a simple eq on the strings
in the UTF8 file and on the file names in the directory
won't match (due to the different encodings).

So I want to normalise the directory listing first (and
this should be dependent on the code page, because
different users might be using different code pages) and
compare the resulting list to the list in the UTF8 file.
Does that make sense? :)

Yes, that is much clearer. I'll assume that you have
Windows and maybe Cygwin.


Have you read perllocale, perluniintro, perlunicode,
perlebcdic?

Yes, I have, and while I consider myself slightly more
intelligent than a garden gnome, I must admit that these
issues concerning character encoding are beyond my abilities
of comprehension (at least at present).


Use the command:

for /f "tokens=4" %w in ('chcp') do dir >text.%w

to create a file called "text.437" (if your chcp is 437)
with the dir-output for the current directory.
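
The same lookup can be done from inside Perl. What follows is only a minimal sketch: encoding_from_chcp is a hypothetical helper name, and it takes the last run of digits in the chcp output instead of the fourth token, on the assumption that localized Windows versions word the message differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn the text printed by chcp into an Encode-style
# name such as "cp437". It picks the last run of digits, so it does not
# depend on the English wording "Active code page: 437".
sub encoding_from_chcp {
    my ($chcp_output) = @_;
    return unless defined $chcp_output;
    my ($number) = $chcp_output =~ /(\d+)\D*\z/;
    return defined $number ? "cp$number" : undef;
}

# chcp only exists on Windows, so the call is skipped elsewhere.
if ($^O eq 'MSWin32') {
    my $encoding = encoding_from_chcp(scalar `chcp`);
    print defined $encoding ? "$encoding\n" : "unknown\n";
}
```

The returned name is meant to be handed to Encode later; whether the console code page is also the one the file names arrive in is a separate question that this sketch does not answer.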


I assume this is a demonstration, rather than part of a
solution? Or are you saying I'll have to write a temporary
file in this way to solve my problem?


Under cygwin, you can use the command:

iconv -f CP437 -t UTF-8 text.437 > text.utf8

to convert the file from cp437 to utf8.


I don't have iconv.


But that second step can also be done with Perl.
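
As a rough illustration of that second step, the conversion could be sketched with PerlIO encoding layers. The helper name convert_file is hypothetical; the file names and the cp437 source encoding are simply taken from the commands above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Re-encode a text file, roughly what "iconv -f CP437 -t UTF-8" does.
# convert_file is a hypothetical helper; $from is any name Encode knows.
sub convert_file {
    my ($in_path, $out_path, $from) = @_;
    open my $in,  "<:encoding($from)", $in_path  or die "Can't read $in_path: $!";
    open my $out, '>:encoding(UTF-8)', $out_path or die "Can't write $out_path: $!";
    print {$out} $_ while <$in>;    # decoded on read, encoded on write
    close $out or die "Can't close $out_path: $!";
    close $in;
    return;
}

# Only runs when the file produced by the dir command is present.
convert_file('text.437', 'text.utf8', 'cp437') if -e 'text.437';
```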

(Almost) platform-independent way to see all available
encodings:

perl -MEncode -e "print join $/, Encode->encodings(':all')" |more


OK, this, and Mr King's reply tell me that Encode is capable
of doing this. I need 'cp437', 'cp850' and 'cp852'
(depending on which machine I'm using). For the rest of this
post I'll assume that I'll be using 'cp437'.
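
A quick way to confirm that a given Perl installation really knows those three code pages is Encode::find_encoding, which returns an encoding object, or undef for an unknown name. A small sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Report whether Encode can handle each code page named in the post.
for my $name (qw(cp437 cp850 cp852)) {
    my $enc = Encode::find_encoding($name);
    printf "%-6s %s\n", $name,
        $enc ? 'available as ' . $enc->name : 'not available';
}
```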


Now it is your turn to create some code and try to make it
work.


Here's the script (stripped for the purposes of this post)
*before* tackling the encoding issues:

----------
#!/usr/bin/perl
use warnings;
use strict;

opendir(DIR, '.') or die "Can't open input directory: $!";

my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);

while (<DATA>) {
    chomp;

    if ( exists $files{$_} ) {
        print "$_ matches.\n";
    }
    else {
        print "$_ doesn't match.\n";
    }
}

__DATA__
Đorđe Balašević
----------


A file named "Đorđe Balašević" *does* exist in the CWD, yet
when I run this script I get:

Ä?orÄ?e Bala-eviÄ? doesn't match.


So I tried the following fix:

----------
while (<DATA>) {
    chomp;

    my $key = decode('cp437', $_);

    if ( exists $files{$key} ) {
        print "$_ matches.\n";
    }
    else {
        print "$_ doesn't match.\n";
    }
}
----------


But this gives the same exact result. What am I doing wrong?
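
One possible reading of the attempted fix: the lines under __DATA__ are UTF-8, so decoding them as cp437 cannot produce the intended characters, and the keys of %files are never decoded at all. The sketch below decodes both sides into Perl character strings before comparing. matching_names is a hypothetical helper, and the first argument is an assumption: whichever encoding the bytes returned by readdir are actually in on the machine in question. Note also that cp437 has no Đ, š or ć, so of the three code pages mentioned, cp852 looks like the likelier candidate for this particular name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the UTF-8 lines that name an existing file.
# $fs_encoding is an assumption about the encoding of readdir's bytes.
sub matching_names {
    my ($fs_encoding, $file_names, $utf8_lines) = @_;

    # Decode the directory entries into character strings.
    my %on_disk;
    for my $bytes (@$file_names) {
        $on_disk{ decode($fs_encoding, $bytes) } = 1;
    }

    # Decode the wanted names as UTF-8 and look each one up.
    my @matches;
    for my $bytes (@$utf8_lines) {
        my $wanted = decode('UTF-8', $bytes);
        push @matches, $wanted if $on_disk{$wanted};
    }
    return @matches;
}
```

It would be called with the list from readdir and the chomped lines from DATA, both passed as array references. The returned strings are decoded characters, so they need encoding again before being printed to a console.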

--
Best regards,
Angela Druss
