Re: Searching substrings in records.



janesconference@xxxxxxxxx wrote:
Suppose I have a lot of records, each composed of a fixed number of
fields, like a database. Some of these fields are strings (or char
arrays, what you prefer).

I want to extract every record in my record collection whose string
field contain a given substring. Substrings can be matched only within
words, i.e. "foo" matches "you are a fool" but not "babyproof oo
programming languages" (I've come with the two most idiot examples a
human mind could conceive, I know).

The brute force solution would be iterate on every string field of
every record and perform a substring search on any of them.
If at least one field contains the substring, the record must be
returned.

The question: is there a smarter approach than this? Maybe indexing
somehow the records before performing the searches? And are there
decent (c/c++) libraries dealing with this specific problem?

If you want to know if a substring is in one of the strings of
your records you have to go through them once - how would you
know otherwise? Now, if you would search for the same substring
again and again and the records don't change then, of course, it
might make sense to store the result somewhere and use the al-
ready computed result.

I have no clear idea what you mean by "indexing the records be-
fore performing the search". Unless you have some prior infor-
mations about the substrings that will be searched for all that
you could reasonably precompute is the lengths of the strings
in the records and don't search those strings that are shorter
than the string that gets searched for and similar, very general
informatione (like does the strings contain upper or lower case
characters and, if you know the language, if characters with a
very high or low chance to appear in a text are in the string).
But the question you should askyourself is if searching for sub-
strings is theperformance bottle neck in your application before
investing much time to optimize this aspect of it. Did you con-
sider sticking the records into a database and let the database
do the dirty work? That's probably the nearest match you will
get when looking for a library dealing with such problems.

Regards, Jens
--
\ Jens Thoms Toerring ___ jt@xxxxxxxxxxx
\__________________________ http://toerring.de
.



Relevant Pages

  • Re: Is there a built in command to encode SQL strings?
    ... Semicolon has no special meaning inside strings, ... Actually, as long as the strings are encoded correctly, there are no characters that causes problems for the database. ... It's true that all input should be validated, but for strings it's mostly a matter of keeping the information sane rather than protecting the database. ... For instance, in my applications, I have a method that checks for valid characters on a per-property basis using regex. ...
    (microsoft.public.dotnet.languages.vb)
  • Re: How to return a variable length substring from a function ?
    ... I have a function that returns a variable length character substring. ... Depending on the input, the output is either 1 to 3 characters long. ... strings within fixed length strings. ...
    (comp.lang.fortran)
  • Re: Database Search
    ... portion of the string. ... The database includes the example strings as followed ... characters and add each record to display in a list that includes these ...
    (microsoft.public.vb.general.discussion)
  • Re: FindFirstFile, how much faster than FindNextFile?
    ... >substring in each of about 50 filenames, ... You have not specified how the 50 strings are specified. ... A number like 50 is so tiny that pretty much any in-memory algorithm that scans the ...
    (microsoft.public.vc.mfc)
  • Strings: Parsing data into other fields in the same table
    ... I am making a simple database and I was wondering how can ... the information from a field containing strings of both ... numbers and characters be broken and entered other ... The first number refers to the ...
    (microsoft.public.access.tablesdbdesign)

Loading