Re: Text indexing problem.
- From: Andrew Tomazos <andrew@xxxxxxxxxxx>
- Date: Wed, 29 Apr 2009 23:46:06 -0700 (PDT)
On Apr 29, 8:24 pm, mohangupta13 <mohangupt...@xxxxxxxxx> wrote:
As part of my broader project to implement a search engine, I have
to make a text indexer, which should parse a text file (say) and store
the important words and the count of their occurrences in a database.
What would be an efficient way to implement this?
Formally, the problem can be defined as:
Given a text file, find the important words (meaning all except
common words like "of", "is", "the", "and", "on", "I", "you", etc.)
present in the file and the number of times each occurs.
I would highly recommend Perl for this task. It is basically what the
language lives for.
This is roughly what would be involved...
my %excluded_words = ...;
my %word_count;
while(<>) # read the files named on the command line, or stdin if none
{
my @words = split ' ', $_; # split on runs of whitespace, ignoring leading whitespace
foreach my $word (@words)
{
$word_count{$word}++ unless ($excluded_words{$word});
}
}
foreach my $word (keys %word_count)
{
enter_into_database($word, $word_count{$word});
}
....and done.
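
If you want something you can run as-is, here is a minimal sketch of the same idea. The stop-word list, the sample text, the count_words helper and the print-only enter_into_database are all placeholders I made up for illustration; swap in your own list and a real database call. Unlike the outline above, it also lowercases the input and strips punctuation, so that "The" and "the," count as the same word, which is an assumption about what you want.

```perl
use strict;
use warnings;

# Hypothetical stop-word list; a real one would be much longer.
my %excluded_words = map { $_ => 1 } qw(of is the and on i you a in to);

# Count the non-excluded words read from a filehandle.
# Returns a reference to a hash of word => count.
sub count_words {
    my ($fh, $excluded) = @_;
    my %word_count;
    while (my $line = <$fh>) {
        # Lowercase the line, then pull out runs of letters, digits and
        # apostrophes, so surrounding punctuation is dropped.
        foreach my $word (lc($line) =~ /[a-z0-9']+/g) {
            $word_count{$word}++ unless $excluded->{$word};
        }
    }
    return \%word_count;
}

# Stand-in for the real database call: just print the word and its count.
sub enter_into_database {
    my ($word, $count) = @_;
    print "$word\t$count\n";
}

# Sample input, read through an in-memory filehandle so the example is
# self-contained; use \*STDIN or a real file handle in practice.
my $sample = "The cat sat on the mat. The cat is happy.\n";
open(my $fh, '<', \$sample) or die "cannot open sample text: $!";

my $counts = count_words($fh, \%excluded_words);
enter_into_database($_, $counts->{$_}) for sort keys %$counts;
```

For the sample text this prints cat 2, happy 1, mat 1 and sat 1, one pair per line. Passing the filehandle and the stop-word hash in as arguments keeps the counting separate from where the text comes from and where the counts go.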
-Andrew.