Re: character mapping functions and UNICODE : remove accents, case, etc

From: Alan J. Flavell (flavell_at_ph.gla.ac.uk)
Date: 10/23/03


Date: Thu, 23 Oct 2003 19:18:40 +0100

On Thu, 23 Oct 2003, An. Valula floated out upon a sea of TOFU:

> thank you for your answer, but, no, I do not want to remove bold or
> paragraph marks.

But that *is* what the term "rich text" format normally refers to -
whether used in the generic sense or in particular reference to
Microsoft's "RTF" interchange specification.

> I want to convert "rich" text to "poor" text.

Not really, and that's why you confused the previous respondent. You
need some better term. (Try a glossary of text processing if you
don't believe me.)

> There must be someone else who wants to compare strings without diacritical
> signs ?!

Is there a problem? You already know one solution.

> > does anyone out there know about perl capabilities to convert rich
> > text, such as "étrangères" to "etrangere" (remove accents)?
> > Of course, tr/éè/ee/ would do, but I look for sth better: you do not
> > tr/a-z/A-Z/ for uc(), do you?

You probably should note that your tr/// and your uc() perform
*different* operations, in general - also depending on the locale
setting.
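
To make the difference concrete, here is a minimal sketch. It assumes
the script is saved as UTF-8 and that the strings are handled as
character strings (hence "use utf8"); locale effects are left out
entirely, and the variable names are mine.

```perl
use strict;
use warnings;
use utf8;                              # string literals below are UTF-8 text

my $word = "étrangères";

# tr/// changes only the characters actually listed, here ASCII a-z,
# so the accented letters pass through untouched.
(my $by_tr = $word) =~ tr/a-z/A-Z/;

# uc() on a character string follows the Unicode case mappings,
# so the accented letters are uppercased as well.
my $by_uc = uc $word;

binmode STDOUT, ':encoding(UTF-8)';
print "$by_tr\n";                      # éTRANGèRES
print "$by_uc\n";                      # ÉTRANGÈRES
```

So the two are only interchangeable for plain ASCII letters.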

Anyhow, I don't have an answer to your requirement, other than the
obvious one. Well, perhaps I do: you could "do the Unicode
decomposition" thing, but it would seem distinctly inefficient
compared to a tr///.
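
For what it's worth, the decomposition route might look roughly like
this. It is only a sketch: it assumes the Unicode::Normalize module
(bundled with Perl since 5.8) and UTF-8 source text, and strip_marks()
is a name I made up for illustration.

```perl
use strict;
use warnings;
use utf8;                              # string literals below are UTF-8 text
use Unicode::Normalize qw(NFD);

# strip_marks() is a hypothetical helper, not an existing Perl function.
sub strip_marks {
    my ($text) = @_;
    # Canonical decomposition: "é" becomes "e" followed by
    # U+0301 COMBINING ACUTE ACCENT.
    my $decomposed = NFD($text);
    # Remove the nonspacing combining marks, keeping the base letters.
    $decomposed =~ s/\p{Mn}+//g;
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_marks("étrangères"), "\n"; # etrangeres
```

Note that this only helps for letters which actually decompose into a
base letter plus a mark; characters such as "ø" or "æ" have no such
decomposition and come through unchanged.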

Have a look at e.g. http://www.perldoc.com/perl5.8.0/pod/perlretut.html
and see whether you really want to fight this via Unicode-style regex
features. If you want to be sure of covering accents that you've
never even heard of, then I guess that's the way to go, but if you're
just looking for the usual Western-European accents then me, I'd go
with the tr/// I reckon. But this is all supposition - it's not a
requirement which I've needed myself.
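
The tr/// route for the usual Western-European letters could be
sketched as below. The character list is my own selection and is
certainly not exhaustive, fold_accents() is a made-up name, and the
script is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                              # string literals below are UTF-8 text

# fold_accents() is a hypothetical helper; extend the lists as needed.
# Each character in the left-hand list maps to the character at the
# same position in the right-hand list.
sub fold_accents {
    my ($text) = @_;
    $text =~ tr/àáâãäåçèéêëìíîïñòóôõöùúûüýÿ/aaaaaaceeeeiiiinooooouuuuyy/;
    $text =~ tr/ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ/AAAAAACEEEEIIIINOOOOOUUUUY/;
    return $text;
}

# Compare two strings while ignoring the accents covered above.
my $same = fold_accents("étrangères") eq fold_accents("etrangeres");

binmode STDOUT, ':encoding(UTF-8)';
print fold_accents("étrangères"), "\n";   # etrangeres
print $same ? "equal\n" : "different\n";  # equal
```

Anything not in the lists is simply left alone, which is the price of
the simpler approach.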