Re: Clean out accents in French names
- From: Arndt Jonasson <do-not-use@xxxxxxxxxxx>
- Date: 18 May 2005 15:05:31 +0200
thundergnat <thundergnat@xxxxxxxxxxx> writes:
> Patrick L. Nolan wrote:
> > I have a script that takes information, including people's
> > names, and builds an XML file. I just found that the
> > application that reads the XML is fussy about characters.
> > It choked on the name "Jean-Paul le Fevre", where there
> > was an accent over the first e in Fevre. I don't know
> > how to type that on this keyboard. I edited the file
> > by hand, changing that character to a plain "e", and
> > all was OK. By the way, this isn't Unicode, it's just
> > extended ASCII.
> > I think I know how to identify "non-printing" characters
> > like that, but I would like to translate each one to
> > its nearest equivalent in the basic ASCII character
> > set. Thus the various e's with acute, grave and
> > circumflex accents would all go to "e", and so forth.
> > Has this problem been solved?
> >
>
> As has been pointed out in several other posts, there
> are many reasons to avoid doing this, or at least to
> do it very sparingly.
>
> Nevertheless, I have written routines in the
> past to do this, for when I need to alphabetize a list
> of words which could contain Latin-1 characters > 127
> but I could not be certain of a particular locale
> setting.
>
>
> sub deaccent {
>     my $phrase = shift;
>     return $phrase unless ($phrase =~ m/[\xC0-\xFF]/);
>     $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
>     $phrase =~ s/\xC6/AE/g;
>     $phrase =~ s/\xE6/ae/g;
>     return $phrase;
> }
A few Latin-1 letters are not handled by the above function:
uppercase and lowercase Icelandic thorn (Þ, þ);
uppercase and lowercase Icelandic eth (Ð, ð);
German eszett (ß), which has no uppercase version in Latin-1.
The following additions may be appropriate:
$phrase =~ s/\xDE/TH/g;   # uppercase thorn
$phrase =~ s/\xFE/th/g;   # lowercase thorn
$phrase =~ s/\xD0/TH/g;   # uppercase eth
$phrase =~ s/\xF0/th/g;   # lowercase eth
$phrase =~ s/\xDF/ss/g;   # eszett
Thorn and eth are certainly not equivalent, but I leave it to an
Icelandic speaker to say whether there is a better conversion.
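
For reference, here is a minimal sketch of the function with the
additions folded in. It makes two assumptions that are not in the
original: the input is taken to be a Latin-1 string (one character per
byte), and the accented letters are written as hex escapes rather than
literal characters, so the result should not depend on the encoding the
script file itself is saved in. The TH/th mapping for eth is the same
rough guess as above.

```perl
use strict;
use warnings;

# Sketch only: assumes $phrase is Latin-1 (ISO-8859-1) text.
sub deaccent {
    my $phrase = shift;

    # Nothing to do if there are no characters in the accented range.
    return $phrase unless $phrase =~ /[\xC0-\xFF]/;

    for ($phrase) {
        # When the replacement list is shorter than the search list,
        # tr/// repeats its last character, so each range maps to
        # a single letter.
        tr/\xC0-\xC5/A/;        # A with grave .. A with ring
        tr/\xE0-\xE5/a/;
        tr/\xC7\xE7/Cc/;        # C cedilla
        tr/\xC8-\xCB/E/;
        tr/\xE8-\xEB/e/;
        tr/\xCC-\xCF/I/;
        tr/\xEC-\xEF/i/;
        tr/\xD1\xF1/Nn/;        # N tilde
        tr/\xD2-\xD6\xD8/O/;    # skips \xD7, the multiplication sign
        tr/\xF2-\xF6\xF8/o/;    # skips \xF7, the division sign
        tr/\xD9-\xDC/U/;
        tr/\xF9-\xFC/u/;
        tr/\xDD\xFD\xFF/Yyy/;   # Y acute, y acute, y diaeresis

        # Letters that expand to two ASCII characters.
        s/\xC6/AE/g;            # AE ligature
        s/\xE6/ae/g;
        s/\xDE/TH/g;            # thorn
        s/\xFE/th/g;
        s/\xD0/TH/g;            # eth (rough approximation)
        s/\xF0/th/g;
        s/\xDF/ss/g;            # eszett
    }
    return $phrase;
}

print deaccent("Jean-Paul le F\xE8vre"), "\n";   # prints "Jean-Paul le Fevre"
```

The multiplication and division signs are left untouched, as in the
original tr list.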