Re: Converting codepages to UTF8
- From: "P" <szpara_ga@xxxxxxx>
- Date: 31 Mar 2006 02:16:41 -0800
Dr.Ruud wrote:
> P wrote:
>> I have one file, which is in UTF8, which contains a set
>> of strings. I want to determine whether any of the
>> strings matches any file name in a specified directory.
>> Since there can be special characters in the file names
>> (and in the strings in the UTF8 file), sometimes I'll
>> get false negatives, because a simple eq on the strings
>> in the UTF8 file and on the file names in the directory
>> won't match (due to the different encodings).
>> So I want to normalise the directory listing first (and
>> this should be dependent on the code page, because
>> different users might be using different code pages) and
>> compare the resulting list to the list in the UTF8 file.
>> Does that make sense? :)
>
> Yes, that is much clearer. I'll assume that you have
> Windows and maybe Cygwin.
>
> Have you read perllocale, perluniintro, perlunicode,
> perlebcdic?

Yes, I have, and while I consider myself slightly more
intelligent than a garden gnome, I must admit that these
issues concerning character encoding are beyond my abilities
of comprehension (at least at present).

> Use the command:
>
>   for /f "tokens=4" %w in ('chcp') do dir >text.%w
>
> to create a file called "text.437" (if your chcp is 437)
> with the dir-output for the current directory.

I assume this is a demonstration, rather than part of a
solution? Or are you saying I'll have to write a temporary
file in this way to solve my problem?
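
For reference, the code page number that the quoted `for /f` command takes from the output of `chcp` could also be extracted in Perl, without a temporary file. This is a minimal sketch: the helper name `encoding_from_chcp` is hypothetical, and it assumes the number is the last run of digits in the `chcp` output (for example "Active code page: 437").

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn the text printed by `chcp` into an Encode name.
# It takes the last run of digits, so "Active code page: 437" gives "cp437".
sub encoding_from_chcp {
    my ($chcp_output) = @_;
    my ($number) = $chcp_output =~ /(\d+)\D*\z/
        or die "No code page number found in: $chcp_output";
    return "cp$number";
}

# On Windows the text would come from running the command, e.g.
#   my $encoding = encoding_from_chcp(scalar `chcp`);
# A fixed sample string is used here so the sketch runs anywhere.
print encoding_from_chcp("Active code page: 437\n"), "\n";
```

The resulting name (here `cp437`) is in the form that Encode uses for the DOS code pages.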

> Under cygwin, you can use the command:
>
>   iconv -f CP437 -t UTF-8 text.437 > text.utf8
>
> to convert the file from cp437 to utf8.

I don't have iconv.

> But that second step can also be done with Perl.
>
> (Almost) platform-independent way to see all available
> encodings:
>
>   perl -MEncode -e "print join $/, Encode->encodings(':all')" |more

OK, this, and Mr King's reply tell me that Encode is capable
of doing this. I need 'cp437', 'cp850' and 'cp852'
(depending on which machine I'm using). For the rest of this
post I'll assume that I'll be using 'cp437'.
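
As a rough Perl counterpart to the quoted `iconv` step, the conversion from a code page to UTF-8 can be sketched with `Encode::decode` and `Encode::encode`. The helper name `codepage_to_utf8` and the sample string are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode a byte string from a code page to UTF-8.
sub codepage_to_utf8 {
    my ($codepage, $bytes) = @_;
    # decode() turns code-page bytes into Perl characters;
    # FB_CROAK makes unmappable input die instead of being silently replaced.
    my $chars = decode($codepage, $bytes, Encode::FB_CROAK);
    # encode() turns the characters back into bytes, this time as UTF-8.
    return encode('UTF-8', $chars);
}

# Byte 0x82 is e-acute in cp437; in UTF-8 it becomes two bytes.
my $utf8 = codepage_to_utf8('cp437', "caf\x82");
print unpack('H*', $utf8), "\n";   # hex dump of the UTF-8 bytes
```

The same helper should work with 'cp850' or 'cp852' in place of 'cp437', since Encode lists all three.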

> Now it is your turn to create some code and try to make it
> work.

Here's the script (stripped for the purposes of this post)
*before* tackling the encoding issues:
----------
#!/usr/bin/perl
use warnings;
use strict;

opendir(DIR, '.') or die "Can't open input directory: $!";
my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);

while (<DATA>) {
    chomp;
    if ( exists $files{$_} ) {
        print "$_ matches.\n";
    }
    else {
        print "$_ doesn't match.\n";
    }
}
__DATA__
Đorđe Balašević
----------
A file named "Đorđe Balašević" *does* exist in the CWD, yet
when I run this script I get:
Ä?orÄ?e Bala-eviÄ? doesn't match.
So I tried the following fix:
----------
while (<DATA>) {
    chomp;
    my $key = decode('cp437', $_);
    if ( exists $files{$key} ) {
        print "$_ matches.\n";
    }
    else {
        print "$_ doesn't match.\n";
    }
}
----------
But this gives the same exact result. What am I doing wrong?
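
One likely reading of this result: the strings under `__DATA__` are UTF-8 bytes, so `decode('cp437', $_)` reinterprets them as cp437 rather than converting anything, while the keys of `%files` are still undecoded bytes. A minimal sketch of comparing both sides as decoded character strings follows. The helper `match_names`, the simulated directory list, and the choice of `cp852` (a code page that contains Đ, đ, š and ć) are assumptions for illustration; which encoding `readdir` actually returns on a given Windows setup is not established in this thread and would need checking.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: report which UTF-8 encoded lines name an entry in a
# directory listing whose names are byte strings in $codepage.
sub match_names {
    my ($codepage, $dir_names, $utf8_lines) = @_;

    # Decode each directory name from the code page into Perl characters.
    my %files = map { ( decode($codepage, $_) => 1 ) } @$dir_names;

    my %result;
    for my $line (@$utf8_lines) {
        # Decode the list entry from UTF-8, not from the code page.
        my $key = decode('UTF-8', $line);
        $result{$line} = exists $files{$key} ? 1 : 0;
    }
    return \%result;
}

# Simulated input, so the sketch does not depend on a real directory.
my $name   = "\x{110}or\x{111}e Bala\x{161}evi\x{107}";   # Đorđe Balašević
my @dir    = ( encode('cp852', $name), 'readme.txt' );    # assumed readdir bytes
my @wanted = ( encode('UTF-8', $name), 'missing.txt' );   # lines of the UTF-8 file

my $result = match_names('cp852', \@dir, \@wanted);
for my $line (@wanted) {
    print $line, $result->{$line} ? " matches.\n" : " doesn't match.\n";
}
```

The point of the design is that neither side is compared as raw bytes: each is decoded with its own encoding first, and only the resulting character strings are used as hash keys.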
--
Best regards,
Angela Druss