Re: [PHP] Language detection with PHP
- From: williama_lovaton@xxxxxxxxxxxxxx (William Lovaton)
- Date: Thu, 29 Mar 2007 07:36:10 -0500
Hi,
Thanks to all of you who made suggestions.
Stayman, I was aware of many of the things you said in your post but I
wasn't aware of some details, thanks for being so specific.
In my original post I was rather simplistic in explaining my approach of
using spell checkers; it is in fact a little more complex than that.
I took into account the fact that in some languages people do not write
every word exactly right all the time. For example, it is
normal for people to skip diacritical marks, and for this reason my
library tries to be a little more clever: if a spell check fails,
it asks the dictionary for a suggestion, removes all diacritical marks
from both words and compares them. If they match, the word counts as correct.
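
To make the fallback concrete, here is a minimal sketch in Python (not PHP, and not the code of the library described above). The word set and the `toy_suggest` function are hypothetical stand-ins for a real spell checker's dictionary and suggestion call.

```python
import unicodedata

def strip_marks(word):
    """Remove combining diacritical marks, e.g. 'canción' -> 'cancion'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def word_is_acceptable(word, dictionary, suggest):
    """Accept a word if it is in the dictionary, or if it equals one of the
    suggestions once diacritics are removed from both words."""
    if word.lower() in dictionary:
        return True
    folded = strip_marks(word.lower())
    # The suggestion call is the slow step in a real spell checker.
    return any(strip_marks(s.lower()) == folded for s in suggest(word))

# Hypothetical toy data standing in for a real Spanish dictionary.
SPANISH = {"canción", "está", "mañana"}

def toy_suggest(word):
    # Assumption: a real checker returns ranked suggestions; this toy
    # version simply returns every dictionary entry.
    return sorted(SPANISH)
```

With this toy data, `word_is_acceptable("cancion", SPANISH, toy_suggest)` accepts the unaccented spelling, while a word with no matching suggestion is rejected.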
The problem with this approach is that asking for a suggestion is
extremely slow, and if you have to do that for every word that doesn't
check correctly, detection becomes a lot slower.
Now, I tried the second option of using the PEAR class:
http://pear.php.net/package/Text_LanguageDetect
And it worked reasonably well. As I suspected, it is very fast, and it can
detect 52 different languages. The only problem with it, as with
all of your suggestions, is that it needs a sample text long enough to
be accurate. According to my tests it needs more than 10 or 20 words to
give reasonably confident results, but with longer samples it is very
accurate. On the other hand, my spell checking approach can be accurate
enough with very short samples, sometimes even with just one word.
A big win for the PEAR class is that it can be very accurate with a
sample text that is long enough, even when the spelling is very bad. In
this scenario my spell checking approach would have failed miserably. By
this I mean not only skipped diacritical marks but also skipped
characters.
Maybe I will use a combination of both: the PEAR class when I need to
detect a long sample, and the spell checker when I need to detect a
short one.
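
A rough Python sketch of such a combination could dispatch on the sample length. The two detector callables and the 20-word threshold are assumptions for illustration; the threshold is taken loosely from the "10 or 20 words" observed in the tests above.

```python
def detect_language(text, short_detector, long_detector, min_words=20):
    """Pick a detection strategy based on the number of words in the sample.

    short_detector and long_detector are hypothetical callables that take
    the text and return a language name (for example a spell-checker based
    detector and a statistical detector).
    """
    words = text.split()
    if len(words) < min_words:
        # Short samples: the spell-checking approach is more reliable.
        return short_detector(text)
    # Long samples: the statistical approach is faster and tolerates typos.
    return long_detector(text)
```

The threshold would need tuning against real samples; nothing in the tests described here pins it down more precisely.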
Thanks again for sharing your comments,
-William
On Wed, 28-03-2007 at 09:44 +0200, Satyam wrote:
----- Original Message -----
From: "Zoltán Németh" <znemeth@xxxxxxxxxxxxxx>
In formal English, it's not allowed to use 've, 'm etc.; I'm should be
written as I am. So that's not going to work, I think.
But words like "and" are really English, I think :)
Keep in mind that this is quite a hard way, I think, but I don't have a
better solution.
Just for example, Dutch and Afrikaans are not very different, so it's
really hard to see which of the 2 the text is written in.
Tijnema
ps. If you can't tell the difference between Dutch and Afrikaans, guess
Dutch :) It's a lot more widely used than Afrikaans.
Yeah, looking for very frequently used words seems like a better idea.
greets
Zoltán Németh
In Spanish, as in many languages that use diacritical marks, people often
skip them in informal chatting. This has a long tradition on the
internet, because years ago support for those extra characters was
non-existent, and today it is still somewhat patchy. I used to have two
modes of writing in Spanish: formal writing, with all the proper accents,
tildes and umlauts, and email mode, without any of those. Nowadays, with
support for languages using the Roman alphabet widely available, there is no
need to omit diacritical marks, but you will often find them missing,
particularly in blog comments and other informal writing, out of laziness,
carelessness or simply lack of formal education; in that I include
foreigners who more or less handle the language but not its minor details.
If English had accents, I would probably skip them.
So, using a spelling dictionary is not a good idea unless you can count on
your input being properly written. A text in Spanish with its accents
missing will give you lots of errors, and we use just one sort of accent
(acute) plus the tilde and the umlaut. The French use three sorts of
accents, so there is a far higher chance of getting misspellings. I don't
know how abundant accents are in Magyar; for me Zoltan Nemeth is the same as
Zoltán Németh, but the first is a misspelling.
This problem also affects the frequency of individual letters. Should you
first convert accented vowels to their plain version? If you find
accented letters, it is a sure sign that the text is not English, but if
there are none, it doesn't mean it is English; it might be some non-English
text without the correct accents. Should you count 'a' and 'á' separately,
or add them together because people often omit the accent?
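
The two ways of counting can be compared with a small Python sketch. The `merge_accents` flag is a hypothetical parameter introduced here only to show both options side by side.

```python
from collections import Counter
import unicodedata

def letter_counts(text, merge_accents=True):
    """Count letters, optionally treating accented letters as their base
    letter (so 'á' is counted as 'a')."""
    text = text.lower()
    if merge_accents:
        # Decompose, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    else:
        # Compose, so each accented letter is a single character.
        text = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in text if ch.isalpha())
```

With `merge_accents=True`, a text typed with or without its accents produces the same counts, which is the property needed when writers omit accents unpredictably.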
So, I also vote for the frequently used words approach and against the
lowest number of misspellings. And I would first convert everything to
plain text, with no accents, both the needle and the haystack.
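
As an illustration only, the frequent-words approach with accents removed from both the word lists and the input might look like this in Python. The word lists are tiny hypothetical samples, not real frequency data.

```python
import re
import unicodedata

def fold(text):
    """Lowercase the text and remove diacritical marks."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical, deliberately tiny lists of very frequent words.
COMMON_WORDS = {
    "english": ["the", "and", "is", "of", "to", "that", "it"],
    "spanish": ["el", "la", "y", "que", "de", "está", "más", "también"],
}
# Fold the needles once, so 'está' and 'esta' compare as equal.
FOLDED = {lang: {fold(w) for w in words} for lang, words in COMMON_WORDS.items()}

def guess_language(text):
    """Return the language whose frequent words appear most often in the
    text, or None if no frequent word is found at all."""
    tokens = re.findall(r"[^\W\d_]+", fold(text))  # fold the haystack too
    scores = {lang: sum(1 for t in tokens if t in words)
              for lang, words in FOLDED.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Ties are resolved in favor of whichever language comes first in the table, so a real implementation would want longer word lists and some handling of close scores, such as the Dutch versus Afrikaans case mentioned earlier in the thread.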
Satyam
PS: also, it is accepted practice to omit accents on uppercase letters, such
as in headings. It is not grammatically correct, but it is a typographical
convention which the printing industry has been using for ages: the accents
simply don't fit nicely.