Word-aware distance algorithm



I'm missing some literature here. I'm scraping several websites and
trying to compute the similarity between two sentences (agenda headers
and addresses) in order to avoid redundant entries.

Right now I'm using a Levenshtein distance, whose result I naively
normalise to the range 0 to 1 like so:

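import editdist  # third-party module providing distance()

def same(d1, d2, value_of_same=0.1):
    # Normalise the edit distance by the length of the longer string.
    longest_len = max(len(d1), len(d2))
    if longest_len == 0:
        return True  # two empty strings are identical
    dis = editdist.distance(d1, d2)
    levenshtein_to_len = float(dis) / longest_len
    # Treat the pair as duplicates when the ratio is below the threshold.
    return levenshtein_to_len < value_of_same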

But the Levenshtein distance isn't word-aware, and this causes
problems when matching data, i.e. something like

'xxxxxx' is wildly different in Levenshtein terms from 'xxxxxx - yyyyyy',
but not so for humans ;)
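
To put a number on that example, here is a small self-contained sketch.
It assumes a plain-Python Levenshtein function as a stand-in for
editdist.distance (the names levenshtein and normalised_distance are
made up for this illustration), and applies the same normalisation as
same() above:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    # Stand-in for editdist.distance(); assumed to give the same counts.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalised_distance(a, b):
    # Edit distance divided by the length of the longer string.
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return float(levenshtein(a, b)) / longest

# 9 insertions over a longest length of 15 gives 0.6
print(normalised_distance('xxxxxx', 'xxxxxx - yyyyyy'))
```

So with a threshold of 0.1 the two strings are nowhere near a match,
even though the whole first string appears unchanged in the second.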

Any ideas about a metric that gives greater weight to sentences
with similar words (even better if it also weighs the position of the
similar words)?
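
To make the question concrete, here is a rough sketch of the kind of
word-level comparison meant here, using only the standard library. It
is an illustration under assumptions, not a recommendation: the
function names and the tokenisation rule (lowercase, split on
non-word characters) are invented for the example. word_jaccard
ignores word order, while word_sequence_ratio runs
difflib.SequenceMatcher over the word lists and so is sensitive to
position.

```python
import re
from difflib import SequenceMatcher

def tokens(s):
    # Assumed tokenisation: lowercase, keep runs of word characters.
    return re.findall(r"\w+", s.lower())

def word_jaccard(a, b):
    # Shared words divided by all distinct words; order is ignored.
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return float(len(sa & sb)) / len(sa | sb)

def word_sequence_ratio(a, b):
    # SequenceMatcher compares the two word lists element by element,
    # so words that match but appear in a different order count less.
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()

print(word_jaccard('xxxxxx', 'xxxxxx - yyyyyy'))         # 0.5
print(word_sequence_ratio('xxxxxx', 'xxxxxx - yyyyyy'))  # 2/3
```

Both scores are similarities (1.0 means identical), so they would be
compared against a lower bound rather than the upper bound used in
same().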

John
