Re: locating strings approximately



On 29/06/2006 10:52 AM, John Machin wrote:
On 29/06/2006 10:07 AM, BBands wrote:
On 6/28/06, John Machin <sjmachin@xxxxxxxxxxx> wrote:
On 29/06/2006 9:28 AM, BBands wrote:
> I'd like to see if a string exists, even approximately, in another. For
> example if "black" exists in "blakbird" or if "beatles" exists in
> "beatlemania". The application is to look though a long list of songs
> and return any approximate matches along with a confidence factor. I
> have looked at edit distance, but that isn't a good choice for finding
> a short string in a longer one.

There is a trivial difference between the traditional
distance-matrix-based Levenshtein algorithm for edit distance and the
corresponding one for approximate string searching. Ditto between
finite-state-machine approaches. Ditto between modern bit-bashing
approaches.

The trivial difference is that in the searching case one edge of the dynamic programming matrix is initialised (in theory) to [0] * (len(text)+1) whereas in the distance case it is set to range(len(text)+1).


> I have also explored
> difflib.SequenceMatcher and .get_close_matches, but what I'd really
> like is something like:
>
> a = FindApprox("beatles", "beatlemania")
> print a
> 0.857
>
> Any ideas?

You got no ideas from googling "approximate string search python"???

Yes, many including agrepy and soundex in addition to those I
mentioned already, but none seem really handy at approximately looking
up smaller strings in larger ones. I also note that this has been the
topic of prior discussion without resolutiuon.

jab

It helps if you tell all that you've done. Otherwise people will tell you to do what you've done already :-)

It helps if you reply on-list so that others can see. You may get better answers sooner. I have to vanish now. Will check back tonight.


Here's a possibly better answer:

def approx_search(pattern, text, max_dist):
"""
Return a generator which yields tuples (endpos, dist)
for each possible endpos such that
(1) Levenshtein_distance(pattern, text[startpos:endpos]) = dist
(2) 0 <= dist <= max_dist
for some (undetermined) startpos.
Not restricted to strings:
(a) pattern must support pattern[i] and len(pattern).
(b) text needs only to support enumerate(text).
(c) contents of pattern and text need only support the != operator.
Example:
list(approx_search('Beatles', 'beetles Beat leverage Betelguese', 3)) ->
[(6, 3), (7, 2), (8, 3),
(12, 3), (13, 3), (14, 3), (15, 2), (16, 2), (17, 3),
(26, 3), (27, 3)]
"""
plen = len(pattern)
prange = range(plen)
dprev = range(plen+1)
dcurr = [0] * (plen+1)
dist = plen
for tx, tc in enumerate(text):
# dcurr[0] = 0 # not needed, never changes
for px in prange:
dist = dprev[px] + (tc != pattern[px])
temp = dcurr[px] + 1
if temp < dist: dist = temp
temp = dprev[px+1] + 1
if temp < dist: dist = temp
dcurr[px+1] = dist
if dist <= max_dist:
yield tx+1, dist
dprev, dcurr = dcurr, dprev

If you need/want to poke around discovering "startpos", then you need something like the following, which is similar to the function by Magnus Lie Hetland which you'll find on the web, but appears to be about twice as fast without destroying legibility completely :-)

def Levenshtein_SJM(aseq, bseq):
"""
Calculate Levenshtein distance between aseq and bseq.
Space O(min(m,n))
Time O(mn)
"""
alen = len(aseq)
blen = len(bseq)
if alen > blen:
aseq, bseq = bseq, aseq
alen, blen = blen, alen
if not alen: return blen
arange = range(alen)
dprev = range(alen+1)
dcurr = [0] * (alen+1)
for bx in xrange(blen):
bc = bseq[bx]
dcurr[0] = bx + 1
for ax in arange:
dist = dprev[ax] + (bc != aseq[ax])
temp = dcurr[ax] + 1
if temp < dist: dist = temp
temp = dprev[ax+1] + 1
if temp < dist: dist = temp
dcurr[ax+1] = dist
dprev, dcurr = dcurr, dprev
return dist

Note that both of the above functions use the ancient original algorithm, which takes O(mn) time. For heavy lifting a modern algorithm and use of Pyrex or C are indicated.

HTH,
John
.



Relevant Pages

  • Re: Experimenting With 2-D Parabolic Reflectors
    ... There is some "string theory", where you could draw your own of any size ... the reflector) running parallel to the directrix. ... distance from the focus to a point on the parabola and fd is the distance ...
    (alt.internet.wireless)
  • Re: the black fishing hole
    ... say that she is *spatially* near the horizon for r just an itty bit ... there are competing notions of distance even after you have ... string will be redefined". ... the string diverges as the weight nears the horizon). ...
    (sci.physics.research)
  • Re: Bells Spaceship paradox
    ... distance in Relativity. ... observers, the "s^2" as measured by clocks carried by the observers. ... Assuming linear elasticity, the tension ... in the string will be proportional to this distance. ...
    (sci.physics.research)
  • RE: FileSearch to locate the latest (last saved) file
    ... Dim sReport As Workbook, sDashboard As Workbook ... Dim fLdr As String, Fil As String, FPath As String, x As String, _ ... FileDates= temp ... For i = 1 To NewestFile ...
    (microsoft.public.excel.programming)
  • Re: Barometer for building height
    ... floor, mark the string, pull it up, measure the length of the string. ... the Reflective method - place the B at the bottom of the building, ... energy is proportional to the number of revolutions of the shaft. ... distance from the building. ...
    (rec.puzzles)