Re: A (psossibly) fast, novel search table technique



On Fri, 23 Mar 2007 20:36:36 -0500, Logan Shaw
<lshaw-usenet@xxxxxxxxxxxxx> wrote:

Richard Harter wrote:
On Fri, 23 Mar 2007 01:30:04 -0500, Logan Shaw
<lshaw-usenet@xxxxxxxxxxxxx> wrote:
Richard Harter wrote:

Suppose we also have a function g(e_i,e_j) such that g returns a
value p such that if e_i .ne. e_j then f(p,e_i) .ne. f(p,e_j).
Again, as an example, g could be the first character position at
which e_i and e_j differ. Unlike function f which is O(1),
the cost of function g is data dependent. When the two elements
are the same, the cost of g is the common length; otherwise it
depends on the data. Commonly, however, it will also be O(1).

Whoa! I was with you all the way through the explanation up to
that last sentence. As far as I can tell, that last sentence is
equivalent to saying that there will often be a magic way to
know which one of M pairs of bytes is the right one to compare,
without having to even look at the M-1 other pairs of bytes.

To put it another way, if you can make g() efficient, can't you
instead compress the sequences and thus reduce the length of the
sequences in the first place?

I believe you've got it backwards. If the information density is high
it is easy to find a pair of bytes that differ; it is when the density
is low that there are potential problems. More precisely it is easy to
find a pair of bytes that differ when the sequences are uncorrelated
and can be hard if they are highly correlated.

I see what you're saying, but maybe it depends on what correlates with
what. I think you're talking about when the bytes correlate across
sequences, like this:

"nbowevfvucwggfjhABCngjfdvgeuif"
"nbowevfvucwggfjhDEFngjfdvgeuif"

And I'm talking about entropy of individual positions in a sequence,
like this:

"Ringo..........Starr.........."
"John...........Lennon........."

In this type of sequence, you know that there is more entropy in positions
0 and 15 in the sequence than in positions 14 and 29 (because most names
are shorter than the allowed maximum of 15 characters, so the 14th and
29th characters are usually "."). Thus you know that positions 14 and 29
are bad places to look for differences, but positions 0 and 15 are good
places to look.

What you say is true, but I don't think that it matters much - if one is
doing a simple sequential scan the thing that is expensive is long
sequences of identical bytes as in your first example.


Perhaps the point that you are missing is that it doesn't matter which
mismatched pair of bytes you find, only that you find a pair. The more
mismatched pairs there are, the easier it is to find one.

Maybe I'm reading too much into what your algorithm is intended to do.

That may be right.


The original problem could be phrased as "group identical things together,
but do not call two things identical unless it is PROVEN they really
are identical". There are sort of two phases to this:

(1) find efficient ways to separate things which different, and
(2) verify the things which step #1 called identical really don't
have hidden differences.

Phase 2 is sort of the end game. Previously, I had assumed that your
algorithm was intended to take the place of both phase 1 and phase 2,
but now I'm starting to think that assumption is wrong. It is when
you get into the endgame that I don't see how you can accomplish
much by trying to intelligently pick the positions to compare.

It doesn't work like that; there isn't a separate phase 1 and phase 2.
What happens is that we sequentially process data items and ask whether
they are a duplicate of any data item already present. There are two
possibilities: Either we efficiently determine that it is different from
all of the items processed up to date or we efficiently identify one and
only reference data item that it might duplicate. The reference item
and the new item are compared byte by byte until a difference is found.
If no difference is found the new item is a duplicate.

There is a step 1 which is roughly the equivalent of your phase 1. It
separates out things that can be proven to be different. However it
doesn't call things identical. If step 1 can't prove that an item is
different it goes on to step 2. Step 2 is converse of your phase 2;
instead of assuming the two items (the reference item and the new item)
are identical and attempting to verify their identity it assumes that
they are different and attempts to find the difference.

To reiterate: The algorithm doesn't call things identical unless they
are identical.

In phase 2/step 2 a full compare is required to identify a duplicate;
this cost is unavoidable.

I hope thing clarifies things.



.



Relevant Pages

  • Re: tuples, index method, Pythons design
    ... Unless you want to introduce a character type into Python ... The properties of strings didn't force the developers to make those ... The same method could then eventually be used in other sequences ... properties where they differ from other sequences it no longer ...
    (comp.lang.python)
  • Re: A (psossibly) fast, novel search table technique
    ... Again, as an example, g could be the first character position at which e_i and e_j differ. ... instead compress the sequences and thus reduce the length of the ... Phase 2 is sort of the end game. ...
    (comp.programming)
  • Re: ga9- Integers found that form the sqrt(2) or transcendental numbers.
    ... >>and how does it differ from that for sqrt? ... The definition says that if you divide an integer by a nonzero ... we have two sequences (of numerator and denominator). ...
    (sci.math)
  • Re: Why is C# 450% slower than C++ on nested loops ??
    ... >Sorry, above is not completely true, the C# IL may differ as I only compared ... >(only for the loop part), the results are quite different. ... exists because the two IL sequences are so different. ... just to set a different JIT flag. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Why is C# 450% slower than C++ on nested loops ??
    ... Here is an interesting article on the optimizations of the new 2005 compiler. ... >>Sorry, above is not completely true, the C# IL may differ as I only compared ... >>(only for the loop part), the results are quite different. ... > exists because the two IL sequences are so different. ...
    (microsoft.public.dotnet.languages.csharp)