Re: Clean out accents in French names




thundergnat <thundergnat@xxxxxxxxxxx> writes:
> Patrick L. Nolan wrote:
> > I have a script that takes information, including people's
> > names, and builds an XML file. I just found that the
> > application that reads the XML is fussy about characters.
> > It choked on the name "Jean-Paul le Fevre", where there
> > was an accent over the first e in Fevre. I don't know
> > how to type that on this keyboard. I edited the file
> > by hand, changing that character to a plain "e", and
> > all was OK. By the way, this isn't Unicode, it's just
> > extended ASCII.
> > I think I know how to identify "non-printing" characters
> > like that, but I would like to translate each one to
> > its nearest equivalent in the basic ASCII character
> > set. Thus the various e's with acute, grave and
> > circumflex accents would all go to "e", and so forth.
> > Has this problem been solved?
> >
>
> As has been pointed out in several other posts, there
> are many reasons to avoid doing this, or at least to
> do it very sparingly.
>
> Nevertheless, I have written routines in the
> past to do this, for when I need to alphabetize a list
> of words which could contain Latin-1 characters > 127
> but I could not be certain of a particular locale
> setting.
>
>
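> sub deaccent {
>     my $phrase = shift;
>     # Nothing to do unless the string has a Latin-1 character in \xC0-\xFF.
>     return $phrase unless ($phrase =~ m/[\xC0-\xFF]/);
>     # Map each accented letter to its unaccented ASCII counterpart.
>     $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
>     # The AE ligatures expand to two letters, so they need s/// rather than tr///.
>     $phrase =~ s/\xC6/AE/g;
>     $phrase =~ s/\xE6/ae/g;
>     return $phrase;
> }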

A few Latin-1 characters are not taken care of by the above function:

upper and lowercase Icelandic thorn (Þ, þ);
upper and lowercase Icelandic eth (Ð, ð);
German ess-zet (ß) (there is no uppercase version).

The following additions may be appropriate:

$phrase =~ s/\xDE/TH/g;
$phrase =~ s/\xFE/th/g;
$phrase =~ s/\xD0/TH/g;
$phrase =~ s/\xF0/th/g;
$phrase =~ s/\xDF/ss/g;

Thorn and eth are certainly not equivalent, but I leave it to an
Icelandic speaker to say whether there is a better conversion.
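
Putting the pieces together, something like the following might do. It is
only a sketch: the name deaccent_latin1 is my own invention, it assumes the
string holds Latin-1 characters (code points up to \xFF), and it keeps the
tentative eth-to-"th" mapping from above. I have written the tr/// lists
with hex escapes so that the result does not depend on the encoding of the
script file itself.

```perl
use strict;
use warnings;

# Hypothetical helper: the quoted deaccent() plus the thorn, eth and
# ess-zet substitutions suggested above. Assumes Latin-1 input.
sub deaccent_latin1 {
    my $phrase = shift;

    # Nothing to do unless a character in \xC0-\xFF is present.
    return $phrase unless $phrase =~ /[\xC0-\xFF]/;

    # One-to-one replacements: A, a, C, c, E, e, I, i ...
    $phrase =~ tr/\xC0-\xC5\xE0-\xE5\xC7\xE7\xC8-\xCB\xE8-\xEB\xCC-\xCF\xEC-\xEF/AAAAAAaaaaaaCcEEEEeeeeIIIIiiii/;
    # ... then O, o, N, n, U, u, Y, y.
    $phrase =~ tr/\xD2-\xD6\xD8\xF2-\xF6\xF8\xD1\xF1\xD9-\xDC\xF9-\xFC\xDD\xFF\xFD/OOOOOOooooooNnUUUUuuuuYyy/;

    # One-to-two replacements need s/// because tr/// maps single characters.
    $phrase =~ s/\xC6/AE/g;    # AE ligature, uppercase
    $phrase =~ s/\xE6/ae/g;    # ae ligature, lowercase
    $phrase =~ s/\xDE/TH/g;    # thorn, uppercase
    $phrase =~ s/\xFE/th/g;    # thorn, lowercase
    $phrase =~ s/\xD0/TH/g;    # eth, uppercase (tentative mapping)
    $phrase =~ s/\xF0/th/g;    # eth, lowercase (tentative mapping)
    $phrase =~ s/\xDF/ss/g;    # German ess-zet

    return $phrase;
}

print deaccent_latin1("Jean-Paul le F\xE8vre"), "\n";   # prints "Jean-Paul le Fevre"
```

The multiplication sign (\xD7) and division sign (\xF7) fall inside the
\xC0-\xFF range but are deliberately left alone, since they have no obvious
letter equivalent.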