Re: count of each word occurred
From: B. v Ingen Schenau (bart_at_ingen.ddns.info)
Date: 06/20/04
- Previous message: Gwar: "Re: do thing with the inputs"
- In reply to: Edo: "Re: count of each word occurred"
- Next in thread: Edo: "Re: count of each word occurred"
- Reply: Edo: "Re: count of each word occurred"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Sun, 20 Jun 2004 13:43:26 +0200
Edo wrote:
> Karl Heinz Buchegger wrote:
>>
>> How I came up with this?
>> Well. If you are anything like me, you would do the whole thing
>> on paper and pencil as follows:
>>
>> Have 2 tables. One contains the original words, the other contains
>> the unique words paired with a count how often this word has occoured:
>
> I found it more natural to;
> sort the list
> take each word in order and compare it with the rest of the list
After you have sorted the list, why would you compare the current word with
*all* the rest of them?
Suppose we start out with this list again:
In
the
beginning
the
earth
was
void
and
dark
After sorting, this list becomes:
and
beginning
dark
earth
In
the
the
void
was
Do you notice anything about this sorted list? Yes, all the words that are
identical are now located in adjacent spots.
This is what the sorting does for you, and you can use that fact in
determining how many duplicates of a word there are in the list.
When, for example, you have arrived at 'beginning' and you want to know how
many duplicates there are of that word, do you really look through the
entire remainder of the list, or do you stop once you have found the first
word that is different?
A simple-minded implementation of this algorithm would look like this:
<pseudocode>
sort the list of words
while (more words present in the list)
num_occurrences = 1
word_to_test = next word
while (current word == word_to_test)
advance word_to_test
num_occurrences++
end while
store (current word, num_occurrences)
skip num_occurrences-1 words
end while
</pseudocode>
Another version of this algorithm takes advantage of the fact that it does
not matter which occurrence of a word you store.
<pseudocode>
sort the list of words
num_occurrences = 1
loop over all words in the list
if (current word == next word)
num_occurrences++
else
store (current word, num_occurrences)
num_occurrences = 1
endif
end loop
</pseudocode>
<snip>
> int count = 1;
> for (vec_size i=0; i < words.size(); ++i) {
> for (vec_size j=i+1; j < words.size(); ++j) {
> if (words[i] == words[j]){
> count++; //
> ++i; // jumps to next unique word
> }
> }
> WordCount.word.push_back(words[i]);
> WordCount.count.push_back(count);
> count = 1;
> }
Lets see if this algorithm does what is is supposed to do.
Just after entering the two loops, we have the following situation:
words: +---------+ Wordcount .word: +---------+ .count: +---+
(size=9) |and | (size=0) (size=0)
+---------+
|beginning|
+---------+
|dark |
+---------+
|earth |
+---------+
|In |
+---------+
|the |
+---------+
|the |
+---------+
|void |
+---------+
|was |
+---------+
count: +---+ i: +---+ j: +---+
| 1 | | 0 | | 1 |
+---+ +---+ +---+
Now, we compare words[0] with words[1] (resp. 'and' and 'beginning').
These are not equal, so j gets incremented and we compare words[0] with
words[2]. As the list is sorted, and the previous comparison was false,
these two words will also compare unequal, as will the rest of the list.
Thus, words[0] gets stored, and at the end of the first iteration of the
outer loop, you end up with the situation:
words: +---------+ Wordcount .word: +---------+ .count: +---+
(size=9) |and | (size=1) |and | (size=1) | 1 |
+---------+ +---------+ +---+
|beginning|
+---------+
|dark |
+---------+
|earth |
+---------+
|In |
+---------+
|the |
+---------+
|the |
+---------+
|void |
+---------+
|was |
+---------+
count: +---+ i: +---+ j: +---+
| 1 | | 0 | | 9 |
+---+ +---+ +---+
The same thin is repeated until you are ready to process the word 'the' and
you are in this situation:
words: +---------+ Wordcount .word: +---------+ .count: +---+
(size=9) |and | (size=5) |and | (size=5) | 1 |
+---------+ +---------+ +---+
|beginning| |beginning| | 1 |
+---------+ +---------+ +---+
|dark | |dark | | 1 |
+---------+ +---------+ +---+
|earth | |earth | | 1 |
+---------+ +---------+ +---+
|In | |In | | 1 |
+---------+ +---------+ +---+
|the |
+---------+
|the |
+---------+
|void |
+---------+
|was |
+---------+
count: +---+ i: +---+ j: +---+
| 1 | | 5 | | 6 |
+---+ +---+ +---+
When you now compare words[5] and words[6], they are equal, and the
if-branch is taken:
words: +---------+ Wordcount .word: +---------+ .count: +---+
(size=9) |and | (size=5) |and | (size=5) | 1 |
+---------+ +---------+ +---+
|beginning| |beginning| | 1 |
+---------+ +---------+ +---+
|dark | |dark | | 1 |
+---------+ +---------+ +---+
|earth | |earth | | 1 |
+---------+ +---------+ +---+
|In | |In | | 1 |
+---------+ +---------+ +---+
|the |
+---------+
|the |
+---------+
|void |
+---------+
|was |
+---------+
count: +---+ i: +---+ j: +---+
| 2 | | 6 | | 6 |
+---+ +---+ +---+
Next (after j is incremented), words[6] and words[7] are compared (resp.
'the' and 'void') and you start the looping over the rest of the vector
without any possibility of finding more duplicates.
After all loops are finished, you end up with:
words: +---------+ Wordcount .word: +---------+ .count: +---+
(size=9) |and | (size=8) |and | (size=8) | 1 |
+---------+ +---------+ +---+
|beginning| |beginning| | 1 |
+---------+ +---------+ +---+
|dark | |dark | | 1 |
+---------+ +---------+ +---+
|earth | |earth | | 1 |
+---------+ +---------+ +---+
|In | |In | | 1 |
+---------+ +---------+ +---+
|the | |the | | 2 |
+---------+ +---------+ +---+
|the | |void | | 1 |
+---------+ +---------+ +---+
|void | |was | | 1 |
+---------+ +---------+ +---+
|was |
+---------+
count: +---+ i: +---+ j: +---+
| 1 | | 9 | | 9 |
+---+ +---+ +---+
So, in fact the algorithm does work as promised, even though it was not
clear to me when looking at just the code.
It is in fact a rather inefficient (due to unneeded looping) implementation
of the second algorithm I described in pseudocode.
Bart v Ingen Schenau
-- a.c.l.l.c-c++ FAQ: http://www.comeaucomputing.com/learn/faq c.l.c FAQ: http://www.eskimo.com/~scs/C-faq/top.html c.l.c++ FAQ: http://www.parashift.com/c++-faq-lite/
- Previous message: Gwar: "Re: do thing with the inputs"
- In reply to: Edo: "Re: count of each word occurred"
- Next in thread: Edo: "Re: count of each word occurred"
- Reply: Edo: "Re: count of each word occurred"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|