Re: replace chars



From: "Chas. Owens" <chas.owens@xxxxxxxxx>

>> I believe the OP will need to identify all the characters he would like
>> to see converted, and code the conversion rules himself using the tr///
>> or s/// operator.
>
> Yes, I think there may not be any standard transformation algorithm for
> doing this, and the programs that do it apply their own transformation.
> So finally I've decided to try finding all the possible chars with
> tildes, acute or grave accents, umlauts, etc, and replace using tr//.
>
> I hope I won't have any issues, because the chars are UTF-8.

Well, then you'll probably need to identify the utf8 octet sequences
that correspond to the special characters you want to see transformed.
snip

Perl strings are in UTF-8*, but if you want to specify a character
without using it directly (so the Perl file can still be treated as
ASCII) you use the UNICODE representation instead:

my $a_with_macron = "\x{0101}"; #UTF-8 encoding is C4 81

So, knowing the UTF-8 sequences is fairly useless.
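
As a rough illustration of the point above, here is a minimal sketch that assumes the core Encode module is available. It writes the character as a code point escape, counts it as one character, and only produces the C4 81 octets when the string is explicitly encoded. The variable names other than $a_with_macron are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, written as a code point escape so the source file stays ASCII.
my $a_with_macron = "\x{0101}";

# For a character string, length() counts characters, not octets.
my $char_count = length $a_with_macron;

# The UTF-8 octets only appear once the string is explicitly encoded.
my $octets      = encode('UTF-8', $a_with_macron);
my $octet_count = length $octets;
my $hex         = join ' ', map { sprintf('%02X', $_) } unpack('C*', $octets);

# tr/// accepts the same escapes, so the replacement table can stay ASCII too.
(my $plain = "\x{0101}\x{0100}") =~ tr/\x{0101}\x{0100}/aA/;

print "$char_count $octet_count $hex $plain\n";   # prints "1 2 C4 81 aA"
```

The tr/// line maps U+0101 and U+0100 to plain "a" and "A"; a real table would list every accented character of interest in the same way.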


Ok, and if I want to use tr// to replace a set of UTF-8 chars, how can I do it?

Can I simply use
tr/astâîASTÂÎ/astaiASTAI/;

I am not sure I can because I've tried this, and something's not ok so I'll need to check tomorrow.

I have also seen that length($string) returns the number of bytes of $string, and not the number of chars (if the string contains UTF-8 chars).

How can I get the array of UTF-8 chars and the length of the string in chars?

My script currently uses neither
use bytes;
nor
use utf8;

I've tried adding each of them, but nothing changed.
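
For what it's worth, one possible explanation is that the string is still a sequence of undecoded octets when length and tr are applied. The sketch below assumes that is the case and that the script file itself is saved as UTF-8; the sample input and the variable names are hypothetical.

```perl
use strict;
use warnings;
use utf8;                      # the accented literals in this file are UTF-8
use Encode qw(decode);

# Hypothetical input: raw UTF-8 octets, as they might arrive from a file
# or socket that has not been decoded. These are the octets of "âî".
my $octets = "\xC3\xA2\xC3\xAE";

my $byte_len = length $octets;             # still an octet string here

# Decode once at the input boundary; afterwards Perl works in characters.
my $string   = decode('UTF-8', $octets);
my $char_len = length $string;             # number of characters
my @chars    = split //, $string;          # one element per character

# With "use utf8" each accented literal in tr/// is a single character.
(my $ascii = $string) =~ tr/âîÂÎ/aiAI/;

print "$byte_len $char_len ", scalar(@chars), " $ascii\n";   # prints "4 2 2 ai"
```

If the data comes from a filehandle, opening it with an ":encoding(UTF-8)" layer should have the same effect as the explicit decode call.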

Thanks.

Octavian



