Re: count of each word occurred

From: B. v Ingen Schenau (bart_at_ingen.ddns.info)
Date: 06/20/04

  • Next message: B. v Ingen Schenau: "Re: count of each word occurred"
    Date: Sun, 20 Jun 2004 13:43:26 +0200
    
    

    Edo wrote:

    > Karl Heinz Buchegger wrote:
    >>
    >> How I came up with this?
    >> Well. If you are anything like me, you would do the whole thing
    >> on paper and pencil as follows:
    >>
    >> Have 2 tables. One contains the original words, the other contains
    >> the unique words paired with a count how often this word has occoured:
    >
    > I found it more natural to;
    > sort the list
    > take each word in order and compare it with the rest of the list

    After you have sorted the list, why would you compare the current word with
    *all* the rest of them?

    Suppose we start out with this list again:

    In
    the
    beginning
    the
    earth
    was
    void
    and
    dark

    After sorting, this list becomes:

    and
    beginning
    dark
    earth
    In
    the
    the
    void
    was

    Do you notice anything about this sorted list? Yes, all the words that are
    identical are now located in adjacent spots.
    This is what the sorting does for you, and you can use that fact in
    determining how many duplicates of a word there are in the list.
    When, for example, you have arrived at 'beginning' and you want to know how
    many duplicates there are of that word, do you really look through the
    entire remainder of the list, or do you stop once you have found the first
    word that is different?

    A simple-minded implementation of this algorithm would look like this:

    <pseudocode>
    sort the list of words
    while (more words present in the list)
      num_occurrences = 1
      word_to_test = next word
      while (current word == word_to_test)
        advance word_to_test
        num_occurrences++
      end while
      store (current word, num_occurrences)
      skip num_occurrences-1 words
    end while
    </pseudocode>

    Another version of this algorithm takes advantage of the fact that it does
    not matter which occurrence of a word you store.

    <pseudocode>
    sort the list of words
    num_occurrences = 1
    loop over all words in the list
      if (current word == next word)
        num_occurrences++
      else
        store (current word, num_occurrences)
        num_occurrences = 1
      endif
    end loop
    </pseudocode>

    <snip>
    > int count = 1;
    > for (vec_size i=0; i < words.size(); ++i) {
    > for (vec_size j=i+1; j < words.size(); ++j) {
    > if (words[i] == words[j]){
    > count++; //
    > ++i; // jumps to next unique word
    > }
    > }
    > WordCount.word.push_back(words[i]);
    > WordCount.count.push_back(count);
    > count = 1;
    > }

    Lets see if this algorithm does what is is supposed to do.
    Just after entering the two loops, we have the following situation:

    words: +---------+ Wordcount .word: +---------+ .count: +---+
    (size=9) |and | (size=0) (size=0)
             +---------+
             |beginning|
             +---------+
             |dark |
             +---------+
             |earth |
             +---------+
             |In |
             +---------+
             |the |
             +---------+
             |the |
             +---------+
             |void |
             +---------+
             |was |
             +---------+

    count: +---+ i: +---+ j: +---+
             | 1 | | 0 | | 1 |
             +---+ +---+ +---+

    Now, we compare words[0] with words[1] (resp. 'and' and 'beginning').
    These are not equal, so j gets incremented and we compare words[0] with
    words[2]. As the list is sorted, and the previous comparison was false,
    these two words will also compare unequal, as will the rest of the list.
    Thus, words[0] gets stored, and at the end of the first iteration of the
    outer loop, you end up with the situation:

    words: +---------+ Wordcount .word: +---------+ .count: +---+
    (size=9) |and | (size=1) |and | (size=1) | 1 |
             +---------+ +---------+ +---+
             |beginning|
             +---------+
             |dark |
             +---------+
             |earth |
             +---------+
             |In |
             +---------+
             |the |
             +---------+
             |the |
             +---------+
             |void |
             +---------+
             |was |
             +---------+

    count: +---+ i: +---+ j: +---+
             | 1 | | 0 | | 9 |
             +---+ +---+ +---+

    The same thin is repeated until you are ready to process the word 'the' and
    you are in this situation:

    words: +---------+ Wordcount .word: +---------+ .count: +---+
    (size=9) |and | (size=5) |and | (size=5) | 1 |
             +---------+ +---------+ +---+
             |beginning| |beginning| | 1 |
             +---------+ +---------+ +---+
             |dark | |dark | | 1 |
             +---------+ +---------+ +---+
             |earth | |earth | | 1 |
             +---------+ +---------+ +---+
             |In | |In | | 1 |
             +---------+ +---------+ +---+
             |the |
             +---------+
             |the |
             +---------+
             |void |
             +---------+
             |was |
             +---------+

    count: +---+ i: +---+ j: +---+
             | 1 | | 5 | | 6 |
             +---+ +---+ +---+

    When you now compare words[5] and words[6], they are equal, and the
    if-branch is taken:

    words: +---------+ Wordcount .word: +---------+ .count: +---+
    (size=9) |and | (size=5) |and | (size=5) | 1 |
             +---------+ +---------+ +---+
             |beginning| |beginning| | 1 |
             +---------+ +---------+ +---+
             |dark | |dark | | 1 |
             +---------+ +---------+ +---+
             |earth | |earth | | 1 |
             +---------+ +---------+ +---+
             |In | |In | | 1 |
             +---------+ +---------+ +---+
             |the |
             +---------+
             |the |
             +---------+
             |void |
             +---------+
             |was |
             +---------+

    count: +---+ i: +---+ j: +---+
             | 2 | | 6 | | 6 |
             +---+ +---+ +---+

    Next (after j is incremented), words[6] and words[7] are compared (resp.
    'the' and 'void') and you start the looping over the rest of the vector
    without any possibility of finding more duplicates.

    After all loops are finished, you end up with:
    words: +---------+ Wordcount .word: +---------+ .count: +---+
    (size=9) |and | (size=8) |and | (size=8) | 1 |
             +---------+ +---------+ +---+
             |beginning| |beginning| | 1 |
             +---------+ +---------+ +---+
             |dark | |dark | | 1 |
             +---------+ +---------+ +---+
             |earth | |earth | | 1 |
             +---------+ +---------+ +---+
             |In | |In | | 1 |
             +---------+ +---------+ +---+
             |the | |the | | 2 |
             +---------+ +---------+ +---+
             |the | |void | | 1 |
             +---------+ +---------+ +---+
             |void | |was | | 1 |
             +---------+ +---------+ +---+
             |was |
             +---------+

    count: +---+ i: +---+ j: +---+
             | 1 | | 9 | | 9 |
             +---+ +---+ +---+

    So, in fact the algorithm does work as promised, even though it was not
    clear to me when looking at just the code.
    It is in fact a rather inefficient (due to unneeded looping) implementation
    of the second algorithm I described in pseudocode.

    Bart v Ingen Schenau

    -- 
    a.c.l.l.c-c++ FAQ: http://www.comeaucomputing.com/learn/faq
    c.l.c FAQ: http://www.eskimo.com/~scs/C-faq/top.html
    c.l.c++ FAQ: http://www.parashift.com/c++-faq-lite/
    

  • Next message: B. v Ingen Schenau: "Re: count of each word occurred"

    Relevant Pages

    • Re: count of each word occurred
      ... Otherwise, the algorithm is fine. ... > I think Kerl?s way will require 2 tables to loop through, ... > above loop requires only one pass through the data ?after sorting-not ... a.c.l.l.c-c++ FAQ: http://www.comeaucomputing.com/learn/faq ...
      (alt.comp.lang.learn.c-cpp)
    • Re: 2.6.22.7 autoconf.h is screwed up
      ... compare to 2.6.10's autoconf.h. ... totally changed the meaning of autoconf.h ... how algorithm is written that it makes readable to non-readable. ... Please read the FAQ at http://www.tux.org/lkml/ ...
      (Linux-Kernel)
    • Re: 2.6.22.7 autoconf.h is screwed up
      ... compare to 2.6.10's autoconf.h. ... totally changed the meaning of autoconf.h ... how algorithm is written that it makes readable to non-readable. ... Please read the FAQ at http://www.tux.org/lkml/ ...
      (Linux-Kernel)
    • Re: Letter to US Sen. Byron Dorgan re unpaid overtime
      ... and Richard made it clear that he understands the order ... >> of evaluation of a for loop. ... > using strlen but using an Oalgorithm when there is a trivial O ... >> In most other languages the terminating ...
      (comp.programming)
    • Re: Efficiency questions: combined ifs and looping 4 times
      ... Choice of algorithm always has a far bigger impact in performance than ... Use sizeof before the loop, store the result in a var and use ... speaks well of PHP. ... your order of complexity analysis is off. ...
      (comp.lang.php)