Re: Text indexing problem.



On Apr 30, 11:46 am, Andrew Tomazos <and...@xxxxxxxxxxx> wrote:
On Apr 29, 8:24 pm, mohangupta13 <mohangupt...@xxxxxxxxx> wrote:

 As part of my broader project to implement a search engine, I have
to make a text indexer, which should parse a text file (say) and store
the important words and the count of their occurrences in a database.
What would be an efficient way to implement this?

Formally the problem can be defined as:

Given a text file, find the important words (meaning all except
common words like "of, is, the, and, on, I, you", etc.)
present in the file and the number of times each occurs.

I would highly recommend Perl for this task.  It is basically what the
language lives for.

This is roughly what would be involved...

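my %excluded_words = ...;  # stop words to skip, keyed by word
my %word_count;
while (<>)  # read stdin (or the files named on the command line)
{
    my @words = split /\s+/, $_;  # split on whitespace with regex
    foreach my $word (@words)
    {
        next if $word eq '';  # leading whitespace yields an empty first field
        $word_count{$word}++ unless $excluded_words{$word};
    }

}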

foreach my $word (keys %word_count)
{
    enter_into_database($word, $word_count{$word});

}

...and done.
  -Andrew.
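
One caveat with splitting on whitespace alone is that "The", "the" and "the," are counted as three different words. Below is a minimal sketch of a normalizing variant, assuming case-insensitive matching and plain ASCII letters are acceptable for the input. The sub name count_words and the short stop list are made up for illustration and do not come from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: count words in a list of lines, skipping stop words.
# $stop is a reference to a hash whose keys are lowercase stop words.
sub count_words {
    my ($lines, $stop) = @_;
    my %count;
    for my $line (@$lines) {
        # Lowercase the line, then extract runs of letters (allowing an
        # apostrophe inside a word), so punctuation never sticks to a word.
        for my $word (lc($line) =~ /[a-z]+(?:'[a-z]+)*/g) {
            next if $stop->{$word};
            $count{$word}++;
        }
    }
    return \%count;
}

# Illustrative stop list only; a real one would be much longer.
my %stop = map { $_ => 1 } qw(of is the and on i you);
my $counts = count_words(
    ["The cat sat on the mat.", "Is the cat fat?"],
    \%stop,
);
print "$_ $counts->{$_}\n" for sort keys %$counts;
```

With the two sample lines this prints cat 2, fat 1, mat 1 and sat 1, one per line. Returning a hash reference keeps the counting separate from whatever database call is used afterwards.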

Thank you all for such good suggestions. But I would like to ask: is
this the most efficient method, given that you have to scan millions of
files?
Also, about the words that need to be discarded: is there a list of
such words available somewhere that anyone here knows of? Just
by looking at one or a few files, one can't predict which words are
redundant.
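
One possible way to approach the discard list without a ready-made one is to count, for each word, how many files it appears in at all; words that turn up in nearly every file are candidates for discarding. Below is a minimal sketch under that assumption. The sub name stop_word_candidates, the 0.8 threshold, and the sample documents are all chosen purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the words that occur in at least
# $threshold (a fraction between 0 and 1) of the given documents.
# Each document is passed as one string.
sub stop_word_candidates {
    my ($docs, $threshold) = @_;
    my %doc_freq;                  # word => number of documents containing it
    for my $doc (@$docs) {
        my %seen;                  # words already counted for this document
        for my $word (lc($doc) =~ /[a-z]+/g) {
            $doc_freq{$word}++ unless $seen{$word}++;
        }
    }
    my $min_docs = $threshold * scalar(@$docs);
    my @hits = grep { $doc_freq{$_} >= $min_docs } keys %doc_freq;
    return sort @hits;
}

my @docs = (
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird is on the roof",
    "the roof of the house",
    "the fish swam",
);
my @candidates = stop_word_candidates(\@docs, 0.8);
print "@candidates\n";
```

On these five sample documents only "the" appears in at least 80% of them, so it is the only candidate printed. Each file is read once and the hash holds one entry per distinct word, so the work grows roughly linearly with the total amount of text.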

mohan


