Re: Function for removing Accents?



Chris Uppal wrote:
> You could probably speed up this process considerably by using a pre-existing
> Unicode package such as ICU:

I am not sure :-) As far as I can tell from the documentation, ICU doesn't do the destructive thing the OP might want to do. It looks more as if the authors of ICU tried very hard to get every aspect of Unicode right. Mapping an accented character to a single unaccented "equivalent" is certainly not correct as far as Unicode is concerned, nor for languages written with non-ASCII characters.

The effort to invest in a solution also depends on how good the solution has to be. Since the original text is going to be butchered anyway, I don't see a reason to aim for 100% accuracy.

So, scripting a parse of the UCD (the Unicode Character Database) to find the interesting values should not take much time; I would guess less than an hour. That includes scripting a check of the decomposition values for these "bad" accents (probably the combining diacritical marks, code points 0x300 through 0x36F). The result should be a map from accented characters to their base characters.

Some more scripting to get that output into a Java data structure, add a lookup method, compile, and that's it.
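
A rough sketch of what that could look like in Java, skipping the separate scripting step and parsing the data directly. Everything named here (AccentStripper, SAMPLE_UCD, buildMap, strip) is made up for illustration. The sample lines follow the semicolon-separated layout of UnicodeData.txt (field 0 is the code point, field 5 the decomposition) and were typed from memory, so check them against the real file; a real run would read the whole file instead of a handful of lines. Only canonical two-part decompositions whose second part lies in 0x300..0x36F are kept, which is a simplification.

```java
import java.util.HashMap;
import java.util.Map;

class AccentStripper {
    // A few sample lines in the UnicodeData.txt layout (assumed, for illustration).
    static final String[] SAMPLE_UCD = {
        "00AA;FEMININE ORDINAL INDICATOR;Lo;0;L;<super> 0061;;;;N;;;;;",
        "00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;;;;00E0;",
        "00C2;LATIN CAPITAL LETTER A WITH CIRCUMFLEX;Lu;0;L;0041 0302;;;;N;;;;00E2;",
        "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;;;00C9;;00C9",
        "00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303;;;;N;;;00D1;;00D1",
        "00FC;LATIN SMALL LETTER U WITH DIAERESIS;Ll;0;L;0075 0308;;;;N;;;00DC;;00DC",
        "1EA4;LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE;Lu;0;L;00C2 0301;;;;N;;;;1EA5;"
    };

    // The "bad" accents: the Combining Diacritical Marks block.
    static boolean isCombiningAccent(int cp) {
        return cp >= 0x0300 && cp <= 0x036F;
    }

    // Build a map from accented code point to the first part of its decomposition.
    static Map<Integer, Integer> buildMap(String[] lines) {
        Map<Integer, Integer> map = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split(";", -1);
            if (fields.length < 6) continue;
            String decomposition = fields[5];
            // Skip characters without a decomposition and compatibility
            // decompositions, which start with a tag such as <super>.
            if (decomposition.isEmpty() || decomposition.startsWith("<")) continue;
            String[] parts = decomposition.split(" ");
            if (parts.length != 2) continue;
            int base = Integer.parseInt(parts[0], 16);
            int mark = Integer.parseInt(parts[1], 16);
            if (isCombiningAccent(mark)) {
                map.put(Integer.parseInt(fields[0], 16), base);
            }
        }
        return map;
    }

    // The lookup method: replace each mapped character by its base character.
    static String strip(String text, Map<Integer, Integer> map) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            // Follow chains such as U+1EA4 -> U+00C2 -> U+0041.
            Integer base;
            while ((base = map.get(cp)) != null) {
                cp = base;
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }
}
```

With the sample data, AccentStripper.strip("caf\u00E9", AccentStripper.buildMap(AccentStripper.SAMPLE_UCD)) gives "cafe". Characters that are not in the map, including ones with only a compatibility decomposition like U+00AA, pass through unchanged.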

> Incidentally, why is ICU never mentioned around here?

Probably because people don't know about it (I didn't). And probably because it solves problems that few people run into day to day.

/Thomas
--
The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/