word aware distance algorithm
- From: rouadec <rouadec@xxxxxxxxx>
- Date: Fri, 29 Jun 2007 09:53:05 -0000
I'm missing some literature here. I'm doing web scraping from
different websites and trying to compute the similarity between two
sentences (agenda headers and addresses) so as to avoid redundancy.
Right now I'm using a Levenshtein distance whose result I naively
constrain between 0 and 1 like so:
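import editdist  # third-party module providing distance(a, b)

def same(d1, d2, value_of_same=0.1):
    # Levenshtein distance divided by the length of the longer string,
    # so the result lies between 0 (identical) and 1 (nothing in common).
    dis = editdist.distance(d1, d2)
    longest_len = max(len(d1), len(d2))
    if longest_len == 0:
        return True  # two empty strings: avoid dividing by zero
    levenshtein_to_len = (1.0 * dis) / longest_len
    return levenshtein_to_len < value_of_same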
But the Levenshtein distance isn't word aware, and this is causing
issues when matching data, i.e. something like
'xxxxxx' is wildly different in Levenshtein terms from 'xxxxxx - yyyyyy'
but not so for humans ;)
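To put a number on that example, here is a small self-contained sketch. It uses a plain dynamic-programming Levenshtein function (the names `levenshtein` and `normalised` are made up for this illustration and stand in for the editdist call) and applies the same "divide by the longer length" normalisation:

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def normalised(a, b):
    # Distance divided by the longer length, giving a value in [0, 1].
    longest = max(len(a), len(b))
    return float(levenshtein(a, b)) / longest if longest else 0.0

print(levenshtein('xxxxxx', 'xxxxxx - yyyyyy'))  # 9
print(normalised('xxxxxx', 'xxxxxx - yyyyyy'))   # 0.6
```

The nine appended characters (' - yyyyyy') each cost one insertion, so the ratio is 9/15 = 0.6, far above a 0.1 threshold, even though the first string is a whole-word prefix of the second.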
Any ideas about a metric which gives greater weight to sentences
with similar words (even better if it also weights the position of
similar words)?
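As a rough illustration of what "word aware" could mean, here is a minimal sketch of one possible score, the overlap coefficient on word sets: shared distinct words divided by the word count of the shorter sentence. The helper names `words` and `word_overlap` are hypothetical, the tokenisation (lower-casing, keeping alphanumeric runs) is an assumption, and the score deliberately ignores word position.

```python
import re

def words(text):
    # Lower-case and keep alphanumeric runs only, so ' - ' separators vanish.
    return re.findall(r"\w+", text.lower())

def word_overlap(a, b):
    """Shared distinct words divided by the distinct word count of the shorter side."""
    wa, wb = set(words(a)), set(words(b))
    if not wa or not wb:
        # No words on one side: identical only if both sides are empty.
        return 1.0 if wa == wb else 0.0
    return float(len(wa & wb)) / min(len(wa), len(wb))

print(word_overlap('xxxxxx', 'xxxxxx - yyyyyy'))  # 1.0
print(word_overlap('xxxxxx', 'zzzzzz - yyyyyy'))  # 0.0
```

With this score a sentence that is fully contained in a longer one counts as a perfect match, which suits the 'xxxxxx' versus 'xxxxxx - yyyyyy' case but may be too generous for very short headers.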
John