Re: locating strings approximately
- From: John Machin <sjmachin@xxxxxxxxxxx>
- Date: Fri, 30 Jun 2006 09:37:14 +1000
On 29/06/2006 10:52 AM, John Machin wrote:
On 29/06/2006 10:07 AM, BBands wrote:On 6/28/06, John Machin <sjmachin@xxxxxxxxxxx> wrote:On 29/06/2006 9:28 AM, BBands wrote:
> I'd like to see if a string exists, even approximately, in another. For
> example if "black" exists in "blakbird" or if "beatles" exists in
> "beatlemania". The application is to look though a long list of songs
> and return any approximate matches along with a confidence factor. I
> have looked at edit distance, but that isn't a good choice for finding
> a short string in a longer one.
There is a trivial difference between the traditional
distance-matrix-based Levenshtein algorithm for edit distance and the
corresponding one for approximate string searching. Ditto between
finite-state-machine approaches. Ditto between modern bit-bashing
approaches.
The trivial difference is that in the searching case one edge of the dynamic programming matrix is initialised (in theory) to [0] * (len(text)+1) whereas in the distance case it is set to range(len(text)+1).
> I have also explored
> difflib.SequenceMatcher and .get_close_matches, but what I'd really
> like is something like:
>
> a = FindApprox("beatles", "beatlemania")
> print a
> 0.857
>
> Any ideas?
You got no ideas from googling "approximate string search python"???
Yes, many including agrepy and soundex in addition to those I
mentioned already, but none seem really handy at approximately looking
up smaller strings in larger ones. I also note that this has been the
topic of prior discussion without resolutiuon.
jab
It helps if you tell all that you've done. Otherwise people will tell you to do what you've done already :-)
It helps if you reply on-list so that others can see. You may get better answers sooner. I have to vanish now. Will check back tonight.
Here's a possibly better answer:
def approx_search(pattern, text, max_dist):
"""
Return a generator which yields tuples (endpos, dist)
for each possible endpos such that
(1) Levenshtein_distance(pattern, text[startpos:endpos]) = dist
(2) 0 <= dist <= max_dist
for some (undetermined) startpos.
Not restricted to strings:
(a) pattern must support pattern[i] and len(pattern).
(b) text needs only to support enumerate(text).
(c) contents of pattern and text need only support the != operator.
Example:
list(approx_search('Beatles', 'beetles Beat leverage Betelguese', 3)) ->
[(6, 3), (7, 2), (8, 3),
(12, 3), (13, 3), (14, 3), (15, 2), (16, 2), (17, 3),
(26, 3), (27, 3)]
"""
plen = len(pattern)
prange = range(plen)
dprev = range(plen+1)
dcurr = [0] * (plen+1)
dist = plen
for tx, tc in enumerate(text):
# dcurr[0] = 0 # not needed, never changes
for px in prange:
dist = dprev[px] + (tc != pattern[px])
temp = dcurr[px] + 1
if temp < dist: dist = temp
temp = dprev[px+1] + 1
if temp < dist: dist = temp
dcurr[px+1] = dist
if dist <= max_dist:
yield tx+1, dist
dprev, dcurr = dcurr, dprev
If you need/want to poke around discovering "startpos", then you need something like the following, which is similar to the function by Magnus Lie Hetland which you'll find on the web, but appears to be about twice as fast without destroying legibility completely :-)
def Levenshtein_SJM(aseq, bseq):
"""
Calculate Levenshtein distance between aseq and bseq.
Space O(min(m,n))
Time O(mn)
"""
alen = len(aseq)
blen = len(bseq)
if alen > blen:
aseq, bseq = bseq, aseq
alen, blen = blen, alen
if not alen: return blen
arange = range(alen)
dprev = range(alen+1)
dcurr = [0] * (alen+1)
for bx in xrange(blen):
bc = bseq[bx]
dcurr[0] = bx + 1
for ax in arange:
dist = dprev[ax] + (bc != aseq[ax])
temp = dcurr[ax] + 1
if temp < dist: dist = temp
temp = dprev[ax+1] + 1
if temp < dist: dist = temp
dcurr[ax+1] = dist
dprev, dcurr = dcurr, dprev
return dist
Note that both of the above functions use the ancient original algorithm, which takes O(mn) time. For heavy lifting a modern algorithm and use of Pyrex or C are indicated.
HTH,
John
.
- References:
- locating strings approximately
- From: BBands
- Re: locating strings approximately
- From: John Machin
- Re: locating strings approximately
- From: John Machin
- locating strings approximately
- Prev by Date: Re: Bug reporting impossible
- Next by Date: Re: Reddit broke - should have remained on Lisp?
- Previous by thread: Re: locating strings approximately
- Next by thread: socket close problems with Python 2.5b1
- Index(es):
Relevant Pages
|