Re: two interesting data structure/algorithm questions



de wrote:
Q2: Suppose you have a book written in English. Your goal is to search
for an arbitrary string and return the page numbers where this string
is found. There could be more than one page number returned if the
string is cross page or if it's found on multiple pages. For example,
you have a string "this is an " on page 20 and then " interesting
book" on page 21. The string you are going to search for could be
"this" or "this is" or "is an" or " an interesting" or "an interesting
book". Then you'll need to return page numbers [20], [20], [20],
[20,21], [20,21] respectively. Of course, if you look for "book", it
could return hundreds of page numbers since "book" is really common.
Question: what's the best data structure and algorithm to implement
this real-time lookup effeciently and what's the time complexity of
the implementation?
I can think out using hashes but since the string to look up can be
arbitrary, the hash table will be huge...

I think "arbitrary" in this case means that the string consists of
any list of whole words, but it does not include searches like
"teserting boo" that have to match "interesting book".

If that assumption is correct, one way to do this is to build an
inverted index of all the words. This is basically a set of key
value pairs, where the keys are words, and the values are lists of
page numbers on which the word occurs.

So for example in this case, you'd have something like:

"an" => [ ..., 20, ... ]
"book" => [ ..., 21, ... ]
"interesting" => [ ..., 21, ... ]
"is" => [ ..., 20, ... ]
"this" => [ ..., 20, ... ]

Note that these sets of page numbers (the values) can be stored
as bitmaps in some cases.

Now, the first crack at the algorithm is to try to create the set
of page numbers which contain all the words in the search string.
In other words, for now, forget about the search string being able
to span pages.

To do this, you simply look up every word and get the set of page
numbers for each word, and then intersect all those sets. Then
you have a set of page numbers which all at least have the required
words somewhere. You can then go through and linearly search each
page (or use Boyer-Moore or something) to see if your search string
is really there.

But, suppose you want to handle the case where the search string
wraps to the next page. That's still not that hard. All you have
to do is look up the words in the index (the data structure above)
in the order that they occur in the search string. So to get the
list of possible page numbers for "is an interesting", you look
up "is" and you get page 20 (and maybe some other pages). So your
intermediate result is that "is an interesting" might be (might
start, actually) on page 20. Then you look up "an", and you also
get 20. Since those two are the same page, and since both "is"
and "an" might be on the same page, the whole search string might
be on page 20. But then you look up "interesting" and you don't
see that "interesting" occurs on page 20; it just occurs on page
21. For this you use the knowledge that a phrase might be continued
on the following page. That means if your set contains 20 and
you look up a word that contains 21 but not 20, then you allow
the 21.

So, the operation goes like this:

look up "is"
--> get { 20 }
--> therefore search string "is" might be on page 20
look up "an"
--> get { 20 }
--> therefore search string "is an" might be on page 20
look up "interesting
--> get { 21 }
--> the sets { 20 } and { 21 } have some elements
where the first set's element plus one is an
element in the second set
--> therefore, the search string might be on page 20 and 21

Making it more methodical, it looks like this:

look up "is"
--> get { 20 }
--> call that current_set
look up "an"
--> get { 20 }
--> call that new_set
combine sets
--> convert current_set { 20 } into { 20, 21 }
by applying the rule { e | memberof(e, set) or memberof(e-1, set) }
(or in other words, for every e in the set, insert e+1)
--> now current_set is { 20, 21 }
--> let current_set = current_set intersect new_set
= { 20, 21 } intersect { 20 }
= { 20 }
look up "interesting"
--> get { 21 }
--> call that new_set
combine sets
--> convert current_set { 20 } into { 20, 21 }
--> now current_set is { 20, 21 }
--> let current_set = current_set intersect new_set
= { 20, 21 } intersect { 21 }
= { 21 }
notice that search string is finished
--> current_set is not empty
--> therefore, { 21 } is the set of pages that the
search string might *end* on
--> start there, and look for the search string, working
backwards since you know that the search string
ends on that page but may start on the previous page
(or for a really long search string, an earlier page)

To make the last step a little easier, you can reverse the tokens
in the search string and do the set intersection in the reverse
order. Then you need to insert e-1 into the sets instead of
inserting e+1 before you take the intersection. But the set of
page numbers you get out of the process is the page numbers on
which the search strings could possibly *start* instead of ending.

The above assumes that there are no pages which contain zero words
but which a search string could wrap across. For example, if "this
is an" was on page 20, a picture was on page 21, and "interesting
book" was on page 22, the above would fail. (You could get around
that by only indexing the pages that actually have at least one word
on them, then mapping those page numbers to actual page numbers.)

If you cannot index only whole words, you can do the same thing
with digraphs. Then you'd store strings "th", "hi", "is", "s ",
and so on in the inverted index instead of just the word "this".
For a long search string like "this is an interesting book", the
chances of all the digraphs appearing on one page together are
relatively small, generally. But there are some degenerate cases,
like when your search string is "eeeeeeeeeeeeee". So maybe you
want to store trigraphs instead or something.

- Logan
.



Relevant Pages

  • Re: Execution of a full-text operation failed. A clause of the query contained only ignored words.
    ... Paul's search string from his initial posting including a long string of ... The above can be converted to a contains clause using the pubs table ... this is less about searching on noise words in the noise word ... Why would someone want to parse a search string to ...
    (microsoft.public.sqlserver.fulltext)
  • RE: Excel 2003 Macro to add text to front of data in cell
    ... search string that you enter is not the full content of a cell, ... A step would have to be designed to test each cell for the ... this macro would pass it. ...
    (microsoft.public.excel.programming)
  • Re: Tokenizing a large buffer
    ... how long can a Regex match string be? ... So rather than looping with multiple searches using Regex, just loop on the tokens in creating the search string. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Search only values, then replace
    ... subordinate to the Formulas checkbox, though, so I didn't touch that. ... and enter your search string and replace string and ... whose *value* contains the search string, ... inside the cell with a replacement string if the cell doesn't contain ...
    (microsoft.public.excel)
  • Re: two interesting data structure/algorithm questions
    ... cell phone shows all names start with letters "ab" in sorted order. ... Lexical tree is O. ... for an arbitrary string and return the page numbers where this string ...
    (comp.programming)