Re: replace chars



Chas. Owens wrote:
> On Dec 26, 2007 2:59 PM, Gunnar Hjalmarsson <noreply@xxxxxxxxx> wrote:
>> Well, then you'll probably need to identify the utf8 octet sequences
>> that correspond to the special characters you want to see transformed.
> snip
>
> Perl strings are in UTF-8*, but if you want to specify a character
> without using it directly (so the Perl file can still be treated as
> ASCII) you use the UNICODE representation instead:
>
> my $a_with_macron = "\x{0101}"; #UTF-8 encoding is C4 81
>
> So, knowing the UTF-8 sequences is fairly useless.

This is the approach I had in mind:

$ cat test.pl
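#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Read one line of raw UTF-8 octets (no decoding layer is set on DATA)
my $octets = <DATA>;

# Decode the octets into a Perl character string
my $chars = decode 'utf8', $octets;

# Map the UTF-8 octet sequences of the special characters to ASCII letters
my %special = ( "\xc3\x96" => 'O', "\xc3\xa5" => 'a' );
(my $translated = $octets) =~ s/(\xc3\x96|\xc3\xa5)/$special{$1}/g;

printf '%-28s%s', 'Raw data (utf8 encoded): ', $octets;
printf '%-28s%s', 'Readable characters: ', $chars;
printf '%-28s%s', 'Translated characters: ', $translated;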

__DATA__
Östen Mogård

$ ./test.pl
Raw data (utf8 encoded):    Östen Mogård
Readable characters:        Östen Mogård
Translated characters:      Osten Mogard

However, I now realize that there ought to be smarter approaches...
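
One possibility, offered only as a rough sketch and not as the definitive answer: decode the octets into characters first, decompose the characters with Unicode::Normalize, and drop the combining marks. That way no table of octet sequences is needed. The helper name strip_marks is made up for this example, and the sketch assumes the input really is UTF-8 encoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: remove diacritics from a UTF-8 encoded octet string.
sub strip_marks {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);    # octets -> characters
    my $nfd   = NFD($chars);                 # e.g. U+00D6 -> "O" + U+0308
    $nfd =~ s/\p{Mn}//g;                     # drop nonspacing combining marks
    return encode('UTF-8', $nfd);            # characters -> octets for output
}

print strip_marks("\xc3\x96sten Mog\xc3\xa5rd\n");
```

Note that this only handles characters that decompose into a base letter plus a combining mark. Letters without such a decomposition (for instance the Danish/Norwegian o with stroke) pass through unchanged and would still need an explicit mapping.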

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl