Re: Text indexing problem.



On Apr 29, 8:24 pm, mohangupta13 <mohangupt...@xxxxxxxxx> wrote:
> As part of my broader project to implement a search engine, I have
> to make a text indexer, which should parse a text file (say) and store
> the important words and the count of their occurrences in a database.
> What would be an efficient way to implement this?
>
> Formally, the problem can be defined as:
>
> Given a text file, find the important words (meaning all except
> common words like "of", "is", "the", "and", "on", "I", "you", etc.)
> present in the file and the number of times each occurs.

I would highly recommend Perl for this task. It is basically what the
language lives for.

This is roughly what would be involved...

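my %excluded_words = ...;   # keys are the common words to skip
my %word_count;
while (<>)   # read each line from the files named on the command line, or stdin
{
    my @words = split ' ', $_;   # split on runs of whitespace, skipping leading whitespace
    foreach my $word (@words)
    {
        # count the word unless it is one of the excluded words
        $word_count{$word}++ unless $excluded_words{$word};
    }
}

# store each distinct word with its count
foreach my $word (keys %word_count)
{
    enter_into_database($word, $word_count{$word});
}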

...and done.
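
For anyone who wants something that runs as-is, here is a slightly fuller sketch of the same idea. It rests on a few assumptions that are not in the outline above: the stop-word list holds only the examples from the question, words are lowercased and stripped of surrounding ASCII punctuation so that "Cat" and "cat." count as the same word, and count_words is a helper name made up for this example. enter_into_database here is only a stand-in that prints, since the real database layer is not specified.

```perl
use strict;
use warnings;

# Stop words taken from the examples in the question (assumption: a real
# list would be much longer). Keys are lowercase.
my %excluded_words = map { $_ => 1 } qw(of is the and on i you);

# Hypothetical helper: count the non-excluded words in a list of lines.
# Returns a reference to a hash mapping word => count.
sub count_words
{
    my @lines = @_;
    my %word_count;
    foreach my $line (@lines)
    {
        foreach my $word (split ' ', lc $line)
        {
            # trim leading/trailing punctuation (ASCII letters and digits only)
            $word =~ s/^[^a-z0-9]+|[^a-z0-9]+$//g;
            next if $word eq '' or $excluded_words{$word};
            $word_count{$word}++;
        }
    }
    return \%word_count;
}

# Stand-in for the real database call: just print "word<TAB>count".
sub enter_into_database
{
    my ($word, $count) = @_;
    print "$word\t$count\n";
}

# Small demonstration on two sample lines.
my $counts = count_words("The cat and the dog.", "Is the cat on the mat?");
enter_into_database($_, $counts->{$_}) foreach sort keys %$counts;
```

To index real input instead of the sample lines, pass `<>` to count_words, which reads every line from the files named on the command line (or from stdin if none are given).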
-Andrew.