Re: character encoding & regex



On 06/16/2007 05:01 PM, Tom Allison wrote:
Mumia W. wrote:

On 06/16/2007 02:29 PM, Tom Allison wrote:
I'm trying to run some regular expressions on strings in email. [...]
And with unicode and locales and bytes it all gets extremely ugly.

I found something that SpamAssassin uses to convert all this "goo" into a repeatable set of characters (which is all I'm really after) by running something that looks like this:


What do you mean by a "repeatable set of characters"? Unicode characters are repeatable.

The fundamental problem is that this:

$string =~ /(\w\w\w+)/
returns nothing because unicode/utf8/Big5 characters are not considered 'words'.
[...]

Many UTF8 characters are words, and many are not. Consider this program (written in utf-8):

#!/usr/bin/perl
use strict;
use warnings;
use encoding 'utf8', 'STDOUT', 'utf8';

my $string2 = '☺ 膄 膅 膆 ☺
á é í ó ú ¶ | ✗ ∷ е み む も
ä ë ï ö ü µ ± × ṁ · ';

my @wchars = $string2 =~ /(\w)/g;
print "@wchars\n";

exit;
__END__

My output for this program is this:

膄 膅 膆 á é í ó ú е み む も ä ë ï ö ü µ ṁ

Notice that some characters made it and some didn't. In order to do this right, I had to enable a utf8 locale in my Debian O/S [ :-) ]. Then I set LANG=en_US.UTF-8 before writing the program in vim.

Furthermore, I had to tell Perl that the program was written in utf8 using the 'encoding' module.

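Basically, the '\w' in a regular expression follows Unicode rules once the string has been decoded into Perl characters (which the 'encoding' module does for the string literals here), and it follows the current locale only under "use locale". With the text decoded as utf8, '\w' will (probably) know which unicode characters are word characters and which are not.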
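
To see how much '\w' depends on whether Perl is looking at bytes or at characters, here is a minimal sketch. It assumes the mail text arrives as UTF-8 octets; the helper name words_from_utf8_bytes and the sample string are made up for illustration. The same pattern is run once on the raw octets and once after decoding with the Encode module:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return every run of three or more word characters
# found in a UTF-8 encoded byte string.
sub words_from_utf8_bytes {
    my ($bytes) = @_;
    # Decode octets into Perl characters so \w can use Unicode properties.
    my $chars = decode('UTF-8', $bytes);
    return $chars =~ /(\w\w\w+)/g;
}

# "naive" with a diaeresis on the i, a smiley, and three CJK characters,
# spelled out as UTF-8 octets so this source file stays plain ASCII.
my $bytes = "na\xC3\xAFve \xE2\x98\xBA \xE8\x86\x84\xE8\x86\x85\xE8\x86\x86";

my @undecoded = $bytes =~ /(\w\w\w+)/g;          # pattern applied to raw octets
my @decoded   = words_from_utf8_bytes($bytes);   # pattern applied to characters

binmode(STDOUT, ':encoding(UTF-8)');
print scalar(@undecoded), " match(es) on raw bytes\n";
print scalar(@decoded), " match(es) after decoding: @decoded\n";
```

If the raw-byte count comes out as zero while the decoded count does not, the input was probably never decoded, which would explain /(\w\w\w+)/ returning nothing.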

BTW, I don't know Chinese or Korean. I just know how to play with vim digraphs enough to enter random foreign characters--sort of like a monkey banging on a computer keyboard :-)

And I don't really care to get exactly the right character.
I could just as easily use the characters' byte values, but the regex for that is not something I'm familiar with.

I got this far:
my $string = chr(0x263a);
my @A = unpack "C*", $string;

# @A = ( 226, 152, 186 )

At least this is consistent.
But there are a lot of characters that I want to break on and I don't know that I can do this. The best I can come up with is:

my $string = chr(0x263a);
$string = $string .' '. $string;
print $string,"\n";
foreach my $str (split / / ,$string) {
    my @A = unpack "C*", $str;
    print "FOO: @A\n";
}
exit;

Using the above I can get a consistent array of characters but I don't know if this will work for any character encoding. I guess this is part of my question/quandary.
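
On the "any character encoding" question, a rough sketch may help. The byte list depends on which encoding the characters are written out in, so it is only consistent if everything is first decoded to characters and then encoded to one fixed encoding. The helper name octets_of is hypothetical, and choosing UTF-8 as the fixed encoding is an assumption, not something the thread settles:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: octet values of a character string in a chosen encoding.
sub octets_of {
    my ($chars, $encoding) = @_;
    # Encode explicitly instead of relying on Perl's internal representation.
    return unpack 'C*', encode($encoding, $chars);
}

my $smiley = chr(0x263a);
my @utf8  = octets_of($smiley, 'UTF-8');      # the three octets seen in the post
my @utf16 = octets_of($smiley, 'UTF-16BE');   # same character, different octets

print "UTF-8:    @utf8\n";
print "UTF-16BE: @utf16\n";
```

Encoding explicitly also avoids depending on how a particular Perl version lets unpack "C*" treat a string holding characters above 255.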

One thing I'm not sure about is if the MIME::Parser is even decoding things sanely. I suspect it isn't because I get '?' a lot.

I installed urxvt from my Debian installation [ :) ] and I get...


:-)

Wide character in print at unicode_capture.pl line 5.
âº
Wide character in print at unicode_capture.pl line 9.
⺠âº
FOO: 226 152 186
FOO: 226 152 186

However it doesn't print the boxes, which is good.



Put "use encoding 'iso-8859-1', STDOUT => 'utf8';" at the top of your file. Also read up on the encoding module (perldoc encoding).
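
An alternative sketch, in case the 'encoding' pragma is not available (it has since been deprecated in newer Perl releases): put an encoding layer on STDOUT with binmode. This is an assumption about what suits your setup, not the only way to do it, and it presumes the terminal really expects UTF-8:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Encode everything printed to STDOUT as UTF-8, which should silence
# the "Wide character in print" warning for characters above 255.
binmode(STDOUT, ':encoding(UTF-8)');

my $string = chr(0x263a);
$string = $string . ' ' . $string;
print $string, "\n";
```

With the layer in place Perl does the character-to-byte conversion on output, so the program itself can keep working with characters.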

This will probably work a lot better if you've configured your system to support a utf8 locale:

http://www.debian.org/doc/manuals/reference/ch-tune.en.html#s-activate-locales

BTW, you're using a great O/S ;-)

