Re: Comparing diacritic chars



"Tom de Neef" <tdeneef@xxxxxxxx> wrote in message
news:472de835$0$237$e4fe514c@xxxxxxxxxxxxxxxxx

> I have to compare names (in family databases). Spelling is often not
> uniform, so I use an algorithm to reduce a name to a basic form.
> E.g.: in doubled consonants the second one is dropped; F and
> V are the same; all vowels are treated as one and the same.
>
> This last rule causes me some difficulties (for example: recognizing
> that Étienne starts with a variation of E).
> Is there an easy way to recognize all vowels with diacritics (like
> èéêë, ÒÓÔÕÖ, etc.) or do I have to test for them explicitly?

That sounds like a Unicode question. The code chart entry for U+00C0 ('À')
notes that it is equivalent to U+0041 ('A') plus U+0300 (combining
grave accent), but short of OCR'ing the PDFs, I know of no way to
extract that information in an automated manner. Some richer
database may have it, though.
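
For what it's worth, such a database does exist: the Unicode Character
Database records these decompositions in machine-readable form, and some
language runtimes expose them. A minimal sketch, assuming Python and its
standard unicodedata module; the helper names base_letter and
is_vowel_variant, and the vowel set "AEIOU", are made up for illustration:

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Strip combining marks from ch using canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # A lone combining mark would strip to nothing; return it unchanged.
    return stripped or ch


def is_vowel_variant(ch: str, vowels: str = "AEIOU") -> bool:
    """True if ch is one of the given vowels, with or without diacritics."""
    return base_letter(ch).upper() in vowels


# The decomposition described for U+00C0: U+0041 followed by U+0300.
print(unicodedata.decomposition("\u00c0"))
print(base_letter("Étienne"[0]), is_vowel_variant("ë"))
```

Note that this only helps for letters that actually have a decomposition.
A letter such as 'ø' (U+00F8) has none, so it comes back unchanged and
would still need explicit handling.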

Then again, considering the limited number of both vowels and diacritics,
perhaps simply printing a few code charts and hardcoding the lot is
best. While U+FE8F (Arabic letter beh, isolated form) could be
transliterated 'b', I doubt you want to go that far. (Have I mentioned
lately that my favourite code point is U+FDFB? Transliterate *that*,
as a single letter.)
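
The hardcoding route could look roughly like the sketch below, again in
Python. The table is hypothetical and deliberately incomplete: it covers
only the Latin-1 accented vowels, and filing 'Ø' under 'O' is an
assumption that may not suit every language. The names _VARIANTS, FOLD
and fold_vowels are illustrative only.

```python
# Hypothetical, incomplete table: base vowel -> accented variants to fold.
_VARIANTS = {
    "A": "ÀÁÂÃÄÅ", "a": "àáâãäå",
    "E": "ÈÉÊË",   "e": "èéêë",
    "I": "ÌÍÎÏ",   "i": "ìíîï",
    "O": "ÒÓÔÕÖØ", "o": "òóôõöø",
    "U": "ÙÚÛÜ",   "u": "ùúûü",
}

# Invert the table into a per-character lookup: variant -> base vowel.
FOLD = {
    variant: base
    for base, variants in _VARIANTS.items()
    for variant in variants
}


def fold_vowels(name: str) -> str:
    """Replace each hardcoded accented vowel by its unaccented base."""
    return "".join(FOLD.get(ch, ch) for ch in name)


print(fold_vowels("Étienne"))
```

Characters not in the table pass through untouched, so the table can be
extended one code chart at a time.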

Groetjes,
Maarten Wiltink

