Re: [PHP] Language detection with PHP



Hi,

Thanks to all of you who made suggestions.

Stayman, I was aware of many of the things you said in your post but I
wasn't aware of some details, thanks for being so specific.

In my original post I was rather simplistic in explaining my approach of
using spell checkers; it is in fact a little more complex than that.
I took into account the fact that in some languages people do not write
every word exactly right all the time. For example, it is normal for
people to skip diacritical marks, and for this reason my library tries
to be a little more clever: if a spell check fails, it asks the
dictionary for a suggestion, removes all diacritical marks from both
words, and compares them; if they match, the word counts as correct.
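
For anyone who wants something concrete, the fallback described above
could be sketched roughly as follows. This is only an illustration in
Python, not the library's actual code: the tiny word set, the `check`
and `suggest` helpers, and the use of difflib as a stand-in for a real
spelling dictionary are all assumptions made for the example.

```python
import difflib
import unicodedata

# Hypothetical stand-in for a real spelling dictionary (assumption).
SPANISH_WORDS = {"canción", "está", "mañana", "el", "la", "que"}


def strip_marks(word):
    """Remove combining diacritical marks, e.g. 'canción' -> 'cancion'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def check(word, dictionary):
    """Plain spell check: is the word in the dictionary as written?"""
    return word.lower() in dictionary


def suggest(word, dictionary):
    """Stand-in for the dictionary's suggestion call (the slow step)."""
    return difflib.get_close_matches(word.lower(), dictionary, n=5, cutoff=0.6)


def is_word_ok(word, dictionary):
    """Accept a word if it checks, or if it equals a suggestion once
    diacritical marks are removed from both."""
    if check(word, dictionary):
        return True
    bare = strip_marks(word.lower())
    return any(strip_marks(s) == bare for s in suggest(word, dictionary))
```

With this sketch, `is_word_ok("cancion", SPANISH_WORDS)` is accepted
through the suggestion path, while an unrelated word such as "house" is
rejected.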

The problem with this approach is that asking for a suggestion is
extremely slow, and if you have to do that for every word that doesn't
check correctly, the whole detection becomes a lot slower.

Now, I tried the second option of using the PEAR class:
http://pear.php.net/package/Text_LanguageDetect

And it worked reasonably well. As I suspected, it is very fast, and it
can detect 52 different languages. The only problem with it, as with
all of your suggestions, is that it needs a sample text long enough to
be accurate. According to my tests it needs more than 10 or 20 words to
give reasonably confident results, but with longer samples it is very
accurate. My spell checking approach, on the other hand, can be accurate
enough with very short samples, sometimes even with just one word.

A big win for the PEAR class is that it can be very accurate with a
long enough sample even when the spelling is very bad; in that
scenario my spell checking approach would have failed miserably. By
bad spelling I mean not only skipped diacritical marks but also
missing characters.

Maybe I will use a combination of both: the PEAR class when I need to
detect the language of a long sample, and the spell checker when the
sample is short.
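
A rough sketch of that combination, in Python for illustration only.
The 20-word cutoff is an assumption taken from the upper end of the
"10 or 20 words" observed in my tests, and `long_detector` and
`short_detector` are hypothetical callables standing in for the PEAR
class and the spell checker respectively.

```python
# Hypothetical cutoff (assumption): below this many words, the
# statistical detector is not trusted and the spell checker is used.
MIN_WORDS_FOR_LONG_DETECTOR = 20


def detect_language(text, long_detector, short_detector,
                    min_words=MIN_WORDS_FOR_LONG_DETECTOR):
    """Pick a detector based on how many words the sample contains."""
    words = text.split()
    if len(words) >= min_words:
        # Long sample: fast, and tolerant of bad spelling.
        return long_detector(text)
    # Short sample: slower, but usable even with a single word.
    return short_detector(text)
```

The cutoff would need tuning against real samples; it is passed as a
parameter so that it can be changed without touching the function.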

Thanks again for sharing your comments,


-William



On Wed, 28-03-2007 at 09:44 +0200, Satyam wrote:
----- Original Message -----
From: "Zoltán Németh" <znemeth@xxxxxxxxxxxxxx>


In formal English, it's not allowed to use 've, 'm, etc.; I'm should be
written as I am. So that's not going to work, I think.
But words like and are really English, I think :)
Keep in mind that this is quite a hard way, I think, but I don't have a
better solution.
Just for example, Dutch and Afrikaans are not very different, so it's
really hard to see which of the two the text is written in.

Tijnema

PS: If you can't tell the difference between Dutch and Afrikaans, guess
Dutch :) It's used a lot more than Afrikaans.

Yeah, looking for very frequently used words seems a better idea.

greets
Zoltán Németh

In Spanish, as happens with many languages that use diacritical marks, in
informal chatting you often skip them. This has a long tradition on the
Internet, since years ago support for those extra characters was
non-existent, and today it is still somewhat patchy. I used to have two
modes of writing in Spanish: formal writing, with all the proper accents,
tildes and umlauts, and email mode, without any of them. Nowadays, with
support for languages using the Roman alphabet widely available, there is
no need to omit diacritical marks, but you will often find them missing,
particularly in comments on blogs and other informal writing, just because
of laziness or carelessness or simply lack of formal education, and in that
I include foreigners who more or less handle the language but not the minor
details. If English had accents, I would probably skip them.

So, using a spelling dictionary is not a good idea unless you can count on
your input being properly written. A text in Spanish with its accents
missing will give you lots of errors, and we use just one sort of accent
(acute) plus tilde and umlaut. The French use three sorts of accents, so
there is a far higher chance of getting misspellings. I don't know how
abundant accents are in Magyar; for me Zoltan Nemeth is the same as Zoltán
Németh, but the first is a misspelling.

This problem also affects the frequency of individual letters. Should you
first convert accented vowels to their plain version? If you find accented
letters, it is a sure sign that the text is not English, but if there are
none, it doesn't mean it is English; it might be some non-English text
without the correct accents. Should you count 'a' and 'á' separately or add
them together because people often omit the accent?
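
To make the two options concrete, here is a small Python sketch that
counts letters either way. The function names are made up for the
example, and folding accents via Unicode decomposition is one possible
choice, not the only one.

```python
from collections import Counter
import unicodedata


def fold_accents(text):
    """Replace accented letters with their plain version ('á' -> 'a')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def letter_counts(text, fold=False):
    """Count letters, with 'a' and 'á' either separate or added together."""
    # NFC keeps each accented letter as a single character when not folding.
    text = unicodedata.normalize("NFC", text).lower()
    if fold:
        text = fold_accents(text)
    return Counter(ch for ch in text if ch.isalpha())
```

For the sample "está allá", the unfolded count keeps one 'a' and two
'á', while the folded count reports three 'a'.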

So, I also vote for the frequently used words approach and against the
lowest number of misspellings. And I would first convert everything to
plain, with no accents, both for the needle and the haystack.
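
A minimal Python sketch of that idea, with both the word lists (the
needle) and the input (the haystack) converted to plain letters first.
The word lists are short illustrative samples chosen for the example,
not real frequency data, and the function names are hypothetical.

```python
import re
import unicodedata

# Illustrative samples of very common words (assumption, not real data).
COMMON_WORDS = {
    "english": ["the", "and", "of", "to", "is", "that", "it"],
    "spanish": ["el", "la", "de", "que", "y", "en", "los", "está", "más"],
    "dutch": ["de", "het", "een", "en", "van", "is", "dat"],
}


def to_plain(text):
    """Lowercase the text and remove all diacritical marks."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Needle: common words with accents removed, one set per language.
PLAIN_WORDS = {
    lang: {to_plain(word) for word in words}
    for lang, words in COMMON_WORDS.items()
}


def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    # Haystack: runs of letters only, taken from the plain version.
    tokens = re.findall(r"[^\W\d_]+", to_plain(text))
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in PLAIN_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Because both sides are converted, "esta" in an unaccented text still
matches the dictionary entry "está". Words shared between languages,
such as "de" or "is", score for several languages at once, which is
part of why closely related languages are hard to tell apart this way.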

Satyam

PS: Also, it is accepted practice to omit accents on uppercase letters, such
as in headings. It is not grammatically correct but a typographical
convention which the printing industry has been using for ages: the accents
simply don't fit nicely.
