Re: computing on streams of data
- From: "ghoul" <ghoul@xxxxxxxxxxxxx>
- Date: 20 Oct 2006 07:02:05 -0700
pamelafluente@xxxxxxxxx wrote:
[...]
Hi ghoul , your answer seems to implicitly imply the opinion that
anyway (compressed or whatever) we do need to "remember" all the passed
numbers. My wondering was, instead, on the fact if there could exists
at all some "incremental way" (as this happens for the mean) to store
information on a function of a pointer(s) to the mean value(s). By
"incremental way" I mean something "more compact" than storing all the
numbers. For instance in the case of the mean, a "simple sum" and the
number of observed numbers will do it. Is it possible that something
similar (althought clearly quite hard to imagine) is possible for a
pointer to the median ? If it is not theoretically impossible (is it?),
perhaps trying to figure it out would be a nice challenge. That was
what I was wondering...
Yeah, I got the question, I was just trying to get you to nail down
some
things like the number of bits present per entry in the input stream,
and
how few bits were available in memory. I was trying to point out in a
perverse way that due to the Hilbert Hotel "paradox" one has to be
careful of infinities, in this case infinite precision. A similar
argument
applies "left of the decimal point" for unbounded integers. Thus, you
*have to* nail down what you mean by "number" to avoid those cheats.
Let's make it more concrete. Say you have an infinite stream of
positive
64-bit integers of arbitrary distribution. How many bits of memory do
you need to be able to keep a running median for the first N values?
If
it's constant I'll be surprised because out of curiosity I've also been
trying to figure this out for a few years.
When I needed to do something similar (for work), I chose a pragmatic
compromise. We wanted to be able to show the distribution of values
(arbitrary floats) that had been processed so far. We also had to log
this distribution periodically, so a compact representation was
important.
What I ended up doing was creating bins that were logarithmically
distributed in intervals whose right edge was 2^(1/10) times the left
edge. That gave ten intervals for each binary order of magnitude. I
then kept a simple counter within each bin. Adding a new value that
was smaller or larger than the boundary bins would cause new bins to
be added dynamically.
Since the exponent of a normalized float has ~255 values, there could
be at most ~2550 bins (we only processed positive values), which in
practice seems an adequately small amount of data to rewrite
periodically. Note that besides being able to plot a pretty good
(bounded) approximation of the distribution curve, we could also use
this information to locate the bin that actually contained the median.
The actual median must then be between X and X*1.07177, which is
probably adequate precision for most statistics.
Getting back to the original problem, it's likely that you can produce
a K-bit approximation using significantly fewer bits than for the
exact median. I.e., the median is definitely between A and
A + A*2^-K. I'll check back here later to see if you've been able to
produce a bound on the bits for tracking either the median or the
approximate median...
.
- Follow-Ups:
- Re: computing on streams of data
- From: ghoul
- Re: computing on streams of data
- References:
- computing on streams of data
- From: pamelafluente
- Re: computing on streams of data
- From: A.G.McDowell
- Re: computing on streams of data
- From: pamelafluente
- Re: computing on streams of data
- From: Patricia Shanahan
- Re: computing on streams of data
- From: ghoul
- Re: computing on streams of data
- From: pamelafluente
- computing on streams of data
- Prev by Date: Counter example for Mr. Diaby algorithm solving TSP problem in polynomial time
- Next by Date: Re: Counter example for Mr. Diaby algorithm solving TSP problem in polynomial time
- Previous by thread: Re: computing on streams of data
- Next by thread: Re: computing on streams of data
- Index(es):
Relevant Pages
|