Re: collection framework: using the good interface



SL wrote:
I need to be able to retreive very quickly an objet according to some signifiant properties: its orthographic form and its part of speech for instance. A hash seems suited, but I cannot use a hash where the key is the orthographic form, for instance:
{
"plane" -> ref_to_object_holding_this_word
}
because several properties are signifiant for word identity: there is a plane "noun" and a plane "verbe".


If you want to retrieve by several different properties, and do it efficiently for each of them, you need several different index structures.


In fact I would like to retreive objects using several properties always
together: for instance "plane"+verb, not "plane" alone or "verb" alone.
Two objects may be defined as "equal" if they share this properties.

If you don't care about efficiency for some or all of the properties, you can just iterate over a Set of all the objects asking each of them "are you a noun?".


Efficiency is an issue indeed: the situation is that I have two lexicons
(two "collections" (Set or Map?) of Word objects); each Word is defined by
lexicographic properties (form + POS) and hold other properties (number of
occurrences in a sub-corpus for instance); for each object in one
collection, I need to look at the "same" object in the other collection
("same" = same forme and same POS), compare the two frequency and compute
a probability distribution function. An iteration over all the objects of
the second collection for all the objects of the first collection + a
check for equality without using hashcode are far too long I suppose.

I wouldn't suppose, I would start with the simplest implementation and check whether it is fast enough.


If necessary, consider creating a fixed order among your word objects. It can be fairly arbitrary, such as String sort order as primary key, followed by part of speech for two words with the same spelling. The word object's class would implement Comparable, and should have equals and hashCode consistent with compareTo, but I would not make two words equal unless they are, in all respects, the same word.

Comparing two ordered lists, with the same sort key, to find the intersection is linear time.

If you can't make this work reasonably efficiently with a fairly simple approach, I would consider putting the words in a database with a column for each property. Database managers are designed to do things like selects and joins on multiple attributes very efficiently.

That's why implementing my collections as Map would allow something like:

Word wordInCollection1 = collection1.get(wordInCollection2);

If I compute hashcode and equality using the form and POS properties. But
this imply Map where both key and value hold the same pointer, which looks
strange.

I would not override hashCode and equals for this, because there isn't a single, fixed definition of equality between two of your word-representing objects other than identity.


I was thinking that "lexicographic equality" could be an equality between
the two objects, or should the definition of equality between two objects
always involve all of the properties of the objects? In that case, is it a
good practice to create a "customised equality method" (say:
lexicographicalyEquals() and lexicographicHashcode()), and implement a Map
(LexicographicMap) using this methods insteed of equals() ?)

In programming, honesty is usually the best policy. If you override equals, you are saying "These two distinct objects have the same value". If you use equals to mean something else, it will probably give you trouble later.


I would deal with your properties at the word object level by having set and get methods for each property, such as spelling. The set method is not needed if you always set all properties in the constructor. Things like part of speech seem to me to be attributes of a word.

For generalization of search code, it may also be useful to have a generalized get method, where a parameter indicates which property you are asking about.



Patricia


Many thanks for your insights!

Sylvain
.



Relevant Pages

  • Re: collection framework: using the good interface
    ... signifiant properties: its orthographic form and its part of speech for instance. ... A hash seems suited, but I cannot use a hash where the key is the orthographic form, for instance: ... An iteration over all the objects of the second collection for all the objects of the first collection + a check for equality without using hashcode are far too long I suppose. ...
    (comp.lang.java.help)
  • Re: collection framework: using the good interface
    ... signifiant properties: its orthographic form and its part of speech for instance. ... A hash seems suited, but I cannot use a hash where the key is the orthographic form, for instance: ... An iteration over all the objects of the second collection for all the objects of the first collection + a check for equality without using hashcode are far too long I suppose. ...
    (comp.lang.java.help)
  • collection framework: using the good interface
    ... I need to be able to retreive very quickly an objet according to some signifiant properties: its orthographic form and its part of speech for instance. ... Implementing hashkey() and equalsin my Word object seems to be the natural way, with "orthographic form" and "part of speech" fields both contributing to the definition of hashkey and equality, but in that case I would have a Map where both the key and the corresponding value are references to the same object. ... But its indexOfmethod use equalsand not hashkey, and is probably not as much efficient as the HashMap hashing method. ...
    (comp.lang.java.help)