Re: Text indexing problem.
- From: mohangupta13 <mohangupta13@xxxxxxxxx>
- Date: Sat, 2 May 2009 09:58:16 -0700 (PDT)
On Apr 30, 11:46 am, Andrew Tomazos <and...@xxxxxxxxxxx> wrote:
On Apr 29, 8:24 pm, mohangupta13 <mohangupt...@xxxxxxxxx> wrote:
As part of my broader project to implement a search engine, I have
to make a text indexer, which should parse a text file (say) and store
the important words and the count of their occurrences in a database.
What would be an efficient way to implement this?
Formally, the problem can be defined as:
Given a text file, find the important words (meaning all except
common words like "of, is, the, and, on, I, you", etc.)
present in the file and the number of times each occurs.
I would highly recommend Perl for this task. It is basically what the
language lives for.
This is roughly what would be involved...
my %excluded_words = ...;
my %word_count;
while (<>) # read stdin (or files named on the command line)
{
    my @words = split /\s+/, $_; # split on whitespace with regex
    foreach my $word (@words)
    {
        # count the word unless it is in the exclusion list
        $word_count{$word}++ unless ($excluded_words{$word});
    }
}
foreach my $word (keys %word_count)
{
    enter_into_database($word, $word_count{$word});
}
...and done.
-Andrew.
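
For anyone who wants to try the idea above end to end, here is a minimal self-contained sketch. The stop-word list is a tiny hypothetical one for illustration, count_words is a made-up helper name, and the lowercasing and punctuation stripping are assumptions that the sketch above does not make. It prints the counts instead of writing to a database.

```perl
use strict;
use warnings;

# Hypothetical stop-word list, for illustration only; a real one would be much longer.
my %excluded_words = map { $_ => 1 } qw(of is the and on i you);

# Count the non-excluded words in a list of text lines.
# (count_words is an illustrative helper, not part of the original sketch.)
sub count_words {
    my @lines = @_;
    my %word_count;
    for my $line (@lines) {
        # Lowercase the line, then extract runs of letters and apostrophes,
        # so that "The" and "the," both normalize to "the".
        for my $word (lc($line) =~ /[a-z']+/g) {
            next if $excluded_words{$word};
            $word_count{$word}++;
        }
    }
    return %word_count;
}

my %counts = count_words("The cat sat on the mat.", "The dog and the cat.");
for my $word (sort keys %counts) {
    print "$word\t$counts{$word}\n";
}
```

With the two sample lines, this should print cat 2, dog 1, mat 1, sat 1. The hash lookup per word is constant time on average, so the cost is dominated by reading the input.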
Thank you all for such good suggestions. But I would like to ask: are
these the most efficient methods, given that you have to scan millions of
files?
Also, about the words that need to be discarded: is there a list of
such words available somewhere that anyone here knows of? Just
by looking at one or a few files, one can't predict which words are
redundant.
mohan