Re: Open Source English Language Parser



Luc The Perverse wrote:

A quality English language parser, which would be capable, with simple to moderate modification, of converting all words to their base form (plurals become singular, verbs are unconjugated, adjective endings drop off, etc.)


You ask for a language parser (which deals with the syntax of sentences), but what you describe is aimed at the morphology of words. So it depends on what you want. Do you really want the linguistic base form, as you would find it in a human-readable dictionary? Then a lemmatizer can give you the base form of a word and also tell you which form of the word you have at hand. There should definitely be something in Java: this is a very common problem, and implementations in various programming languages have been available for decades. Of course, the lemmatizer might not give you a unique analysis of a word; for instance, "run" could be a verb as well as a noun. A so-called tagger could then help by reducing this ambiguity based on the context of the word. Tagging is also a standard problem, and a lot of implementations are around. A tagger might still not give you a unique answer, but for most practical applications this should not be a problem.
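
To make the lemmatizer/tagger idea a bit more concrete, here is a minimal sketch in Java. Everything in it is made up for illustration: the class name ToyLemmatizer, the four-word lexicon and the "a determiner is followed by a noun" rule are my assumptions, not the API of any existing library. A real lemmatizer works from a full lexicon and morphological rules, and a real tagger usually uses a statistical model.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical toy example, not a real library: it only shows how a
// lemmatizer can return several readings and how context can filter them.
final class ToyLemmatizer {
    enum Pos { NOUN, VERB }

    /** One possible reading of a word: its base form plus its word class. */
    static final class Analysis {
        final String lemma;
        final Pos pos;

        Analysis(String lemma, Pos pos) {
            this.lemma = lemma;
            this.pos = pos;
        }

        @Override
        public String toString() {
            return lemma + "/" + pos;
        }
    }

    // Tiny hand-written lexicon (an assumption, for illustration only).
    private static final Map<String, List<Analysis>> LEXICON = Map.of(
            "run", List.of(new Analysis("run", Pos.VERB), new Analysis("run", Pos.NOUN)),
            "runs", List.of(new Analysis("run", Pos.VERB), new Analysis("run", Pos.NOUN)),
            "ran", List.of(new Analysis("run", Pos.VERB)),
            "mice", List.of(new Analysis("mouse", Pos.NOUN)));

    private static final Set<String> DETERMINERS = Set.of("a", "an", "the");

    /** Lemmatizer step: returns every known reading, possibly more than one. */
    static List<Analysis> lemmatize(String word) {
        return LEXICON.getOrDefault(word.toLowerCase(Locale.ROOT), List.of());
    }

    /**
     * Tagger step: uses the previous word as context to reduce ambiguity.
     * Crude rule (assumption): after a determiner expect a noun, otherwise a verb.
     * Without context, or if the rule removes every reading, all readings are kept.
     */
    static List<Analysis> tag(String previousWord, String word) {
        List<Analysis> candidates = lemmatize(word);
        if (candidates.size() < 2 || previousWord == null) {
            return candidates;
        }
        Pos expected = DETERMINERS.contains(previousWord.toLowerCase(Locale.ROOT))
                ? Pos.NOUN
                : Pos.VERB;
        List<Analysis> filtered = candidates.stream()
                .filter(a -> a.pos == expected)
                .collect(Collectors.toList());
        return filtered.isEmpty() ? candidates : filtered;
    }
}
```

With this sketch, ToyLemmatizer.lemmatize("run") returns both the verb and the noun reading, ToyLemmatizer.tag("a", "run") keeps only the noun reading, and ToyLemmatizer.tag(null, "run") still returns both, which is the "no unique answer" case.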


If you just want to do some searching, dictionary lookup, etc., a stemmer might be what you are looking for. It reduces words not to the linguistic base form but to a stem, which is not necessarily an actual English word. This would be the simplest solution and is used, for instance, in search engines. Stemmers have also been implemented by generations of computational linguists, so you might search for "stemmer Java" in Google and find something you could use.
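
Just to show the principle, a stemmer can be as crude as the following Java sketch. The class name ToyStemmer, the suffix list and the minimum stem length are my own assumptions for illustration; this is plain suffix stripping, not the Porter algorithm or any other published stemmer, so for real use you would take an existing implementation.

```java
import java.util.Locale;

// Hypothetical toy stemmer: strips one common English suffix.
// It is NOT the Porter algorithm, only an illustration of the idea.
final class ToyStemmer {
    // Tried in order, so longer suffixes come before shorter ones.
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    // Do not strip if fewer than this many characters would remain.
    private static final int MIN_STEM_LENGTH = 3;

    /** Returns the lower-cased word with at most one suffix removed. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = w.length() - suffix.length();
            if (w.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return w.substring(0, stemLength);
            }
        }
        return w;
    }
}
```

Here ToyStemmer.stem maps "parsing", "parsed" and "parses" all to "pars", which is fine as a search key although it is not an English word, while "parsers" becomes "parser" and a short word like "is" is left alone.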

Greetings,
Ralf