Re: A (psossibly) fast, novel search table technique
- From: cri@xxxxxxxx (Richard Harter)
- Date: Sat, 24 Mar 2007 06:48:11 GMT
On Fri, 23 Mar 2007 20:36:36 -0500, Logan Shaw
<lshaw-usenet@xxxxxxxxxxxxx> wrote:
Richard Harter wrote:
On Fri, 23 Mar 2007 01:30:04 -0500, Logan Shaw
<lshaw-usenet@xxxxxxxxxxxxx> wrote:
Richard Harter wrote:
Suppose we also have a function g(e_i,e_j) such that g returns a
value p such that if e_i .ne. e_j then f(p,e_i) .ne. f(p,e_j).
Again, as an example, g could be the first character position at
which e_i and e_j differ. Unlike function f which is O(1),
the cost of function g is data dependent. When the two elements
are the same, the cost of g is the common length; otherwise it
depends on the data. Commonly, however, it will also be O(1).
Whoa! I was with you all the way through the explanation up to
that last sentence. As far as I can tell, that last sentence is
equivalent to saying that there will often be a magic way to
know which one of M pairs of bytes is the right one to compare,
without having to even look at the M-1 other pairs of bytes.
To put it another way, if you can make g() efficient, can't you
instead compress the sequences and thus reduce the length of the
sequences in the first place?
I believe you've got it backwards. If the information density is high
it is easy to find a pair of bytes that differ; it is when the density
is low that there are potential problems. More precisely it is easy to
find a pair of bytes that differ when the sequences are uncorrelated
and can be hard if they are highly correlated.
I see what you're saying, but maybe it depends on what correlates with
what. I think you're talking about when the bytes correlate across
sequences, like this:
"nbowevfvucwggfjhABCngjfdvgeuif"
"nbowevfvucwggfjhDEFngjfdvgeuif"
And I'm talking about entropy of individual positions in a sequence,
like this:
"Ringo..........Starr.........."
"John...........Lennon........."
In this type of sequence, you know that there is more entropy in positions
0 and 15 in the sequence than in positions 14 and 29 (because most names
are shorter than the allowed maximum of 15 characters, so the 14th and
29th characters are usually "."). Thus you know that positions 14 and 29
are bad places to look for differences, but positions 0 and 15 are good
places to look.
What you say is true, but I don't think that it matters much - if one is
doing a simple sequential scan the thing that is expensive is long
sequences of identical bytes as in your first example.
Perhaps the point that you are missing is that it doesn't matter which
mismatched pair of bytes you find, only that you find a pair. The more
mismatched pairs there are, the easier it is to find one.
Maybe I'm reading too much into what your algorithm is intended to do.
That may be right.
The original problem could be phrased as "group identical things together,
but do not call two things identical unless it is PROVEN they really
are identical". There are sort of two phases to this:
(1) find efficient ways to separate things which different, and
(2) verify the things which step #1 called identical really don't
have hidden differences.
Phase 2 is sort of the end game. Previously, I had assumed that your
algorithm was intended to take the place of both phase 1 and phase 2,
but now I'm starting to think that assumption is wrong. It is when
you get into the endgame that I don't see how you can accomplish
much by trying to intelligently pick the positions to compare.
It doesn't work like that; there isn't a separate phase 1 and phase 2.
What happens is that we sequentially process data items and ask whether
they are a duplicate of any data item already present. There are two
possibilities: Either we efficiently determine that it is different from
all of the items processed up to date or we efficiently identify one and
only reference data item that it might duplicate. The reference item
and the new item are compared byte by byte until a difference is found.
If no difference is found the new item is a duplicate.
There is a step 1 which is roughly the equivalent of your phase 1. It
separates out things that can be proven to be different. However it
doesn't call things identical. If step 1 can't prove that an item is
different it goes on to step 2. Step 2 is converse of your phase 2;
instead of assuming the two items (the reference item and the new item)
are identical and attempting to verify their identity it assumes that
they are different and attempts to find the difference.
To reiterate: The algorithm doesn't call things identical unless they
are identical.
In phase 2/step 2 a full compare is required to identify a duplicate;
this cost is unavoidable.
I hope thing clarifies things.
.
- References:
- A (psossibly) fast, novel search table technique
- From: Richard Harter
- Re: A (psossibly) fast, novel search table technique
- From: Logan Shaw
- Re: A (psossibly) fast, novel search table technique
- From: Richard Harter
- Re: A (psossibly) fast, novel search table technique
- From: Logan Shaw
- A (psossibly) fast, novel search table technique
- Prev by Date: counting nodes in BST
- Next by Date: Re: traversing BST
- Previous by thread: Re: A (psossibly) fast, novel search table technique
- Next by thread: Re: A (psossibly) fast, novel search table technique
- Index(es):
Relevant Pages
|