Re: computing on streams of data



pamelafluente@xxxxxxxxx wrote:
[...]
Hi ghoul , your answer seems to implicitly imply the opinion that
anyway (compressed or whatever) we do need to "remember" all the passed
numbers. My wondering was, instead, on the fact if there could exists
at all some "incremental way" (as this happens for the mean) to store
information on a function of a pointer(s) to the mean value(s). By
"incremental way" I mean something "more compact" than storing all the
numbers. For instance in the case of the mean, a "simple sum" and the
number of observed numbers will do it. Is it possible that something
similar (althought clearly quite hard to imagine) is possible for a
pointer to the median ? If it is not theoretically impossible (is it?),
perhaps trying to figure it out would be a nice challenge. That was
what I was wondering...

Yeah, I got the question, I was just trying to get you to nail down
some
things like the number of bits present per entry in the input stream,
and
how few bits were available in memory. I was trying to point out in a
perverse way that due to the Hilbert Hotel "paradox" one has to be
careful of infinities, in this case infinite precision. A similar
argument
applies "left of the decimal point" for unbounded integers. Thus, you
*have to* nail down what you mean by "number" to avoid those cheats.

Let's make it more concrete. Say you have an infinite stream of
positive
64-bit integers of arbitrary distribution. How many bits of memory do
you need to be able to keep a running median for the first N values?
If
it's constant I'll be surprised because out of curiosity I've also been
trying to figure this out for a few years.

When I needed to do something similar (for work), I chose a pragmatic
compromise. We wanted to be able to show the distribution of values
(arbitrary floats) that had been processed so far. We also had to log
this distribution periodically, so a compact representation was
important.
What I ended up doing was creating bins that were logarithmically
distributed in intervals whose right edge was 2^(1/10) times the left
edge. That gave ten intervals for each binary order of magnitude. I
then kept a simple counter within each bin. Adding a new value that
was smaller or larger than the boundary bins would cause new bins to
be added dynamically.

Since the exponent of a normalized float has ~255 values, there could
be at most ~2550 bins (we only processed positive values), which in
practice seems an adequately small amount of data to rewrite
periodically. Note that besides being able to plot a pretty good
(bounded) approximation of the distribution curve, we could also use
this information to locate the bin that actually contained the median.
The actual median must then be between X and X*1.07177, which is
probably adequate precision for most statistics.

Getting back to the original problem, it's likely that you can produce
a K-bit approximation using significantly fewer bits than for the
exact median. I.e., the median is definitely between A and
A + A*2^-K. I'll check back here later to see if you've been able to
produce a bound on the bits for tracking either the median or the
approximate median...

.



Relevant Pages

  • Re: The Promise of Forth
    ... Returns the computed median value of a list of numbers, ... number of bins to use for the histogram (more bins brings the computed value ... Association for Computing Machinery Inc., New York, ...
    (comp.lang.forth)
  • Re: The Promise of Forth
    ... Returns the computed median value of a list of numbers, ... number of bins to use for the histogram (more bins brings the computed value ...
    (comp.lang.forth)
  • Re: How to test a distribution for uniformity?
    ... > observations occured is roughly uniform. ... > distribution of observation times differs significantly from ... I am using bins (those are my 45 minute ... I wonder is there such a thing as a chi-square test which is adjusted to ...
    (sci.stat.math)
  • Need help with this
    ... Prepare a plot that shows the distribution of noise in the raw DMA data ... use bins that range from 2.7815 - ... compute the Gaussian distribution for these data by using the ...
    (microsoft.public.excel.worksheet.functions)
  • Re: Goodness of fitting of a distribution
    ... distribution being tested. ... The K-S test has positive efficiency ... which the chi-squared test has decent power are ... To test the uniformity of the distribution in ALL bins. ...
    (sci.stat.math)