Re: Choose k random lines from file

From: Arthur J. O'Dwyer (ajo_at_nospam.andrew.cmu.edu)
Date: 06/02/04

  • Next message: Nick Landsberg: "Re: Choose k random lines from file"
    Date: Tue, 1 Jun 2004 20:55:56 -0400 (EDT)
    
    

    On Tue, 1 Jun 2004, Nick Landsberg wrote:
    >
    > Arthur J. O'Dwyer wrote:
    > > On Mon, 31 May 2004, Nick Landsberg wrote:
    > >>Arthur J. O'Dwyer wrote:
    > >>>Suppose our RNG (doesn't even have to be a PRNG, but any RNG) gives
    > >>>us numbers in the range 0..2^N-1. And we want a number in the range
    > >>>0..16 (seventeen choices total). There's no algorithm to give us
    > >>>that number without a bias. None. There are procedures much less
    > >>>wasteful than the canonical
    [...]
    > > Now you agree this is an unbiased generator, right? :)
    > > (BTW, I'm assuming that 'rand' is completely unbiased here.
    > > If 'rand' were biased, there would be nothing to discuss; we have
    > > to have *some* starting point!)
    >
    > Yes, that is an unbiased generator, but I had thought
    > that one of the things to avoid was the "non-deterministic"
    > nature of the algorithm. Firstly, it would seem to me
    > that there is a small (but non-zero) chance that one
    > could "goto again" almost-forever.

      Yes. Remember how I said a while back that there was *no* algorithm
    that was both unbiased *and* finite? This procedure is unbiased;
    therefore it has an (infinitesimal) chance of going on forever.
    If we make it a true algorithm again by removing the possibility of
    an infinite loop, we will introduce a bias, and vice versa.

    > > [ return (int)(17.*rand()/(RAND_MAX+1.)); ]
    > >
    > > It would be unbiased, yes. But it would use floating-point
    > > math, which on any real system would introduce its own bias; and
    > > furthermore, dragging in FP math isn't really necessary. Integer
    > > math should be plenty. :)
    >
    > Hmmm... if the mantissa of the *double* were greater than the
    > the precision of the integer (say 53(?) bits vs. 32 bits) then
    > would it really insert a bias? Not sure. (yes, I think
    > plain float would introduce a bias since the least
    > significant bits would be lost).

      Well, if you keep all the significant bits and don't introduce any
    more significant bits, then you end up with 2^32 different outcomes.
    17 does not divide 2^32. Thus there must be at least one favored
    choice. Mucking around with words like "precision" and "mantissa"
    doesn't have anything to do with it. :)

    > > [Gerry wrote:]
    > >>>>To get an equally good deterministic algorithm, I suppose you could call
    > >>>>rand() seventeen times and add them. To be honest I'm not 100% sure
    > >>>>that this is valid, but anyway if you have a lot of pages the amount of
    > >>>>calculation rises sharply...
    > >>>
    > >>> Not valid. That gets a random number in the range 0..(17*RAND_MAX),
    > >>>all right, but it's no longer unbiased. There are a lot of numbers
    > >>>clustered around 8.5*RAND_MAX, and only one way to get 0. So just
    > >>>taking the sum of 17 rand()s and modding by 17 doesn't get you an
    > >>>unbiased distribution either.
    > >>
    > >>As I remember from decades back, summing uniformly
    > >>distributed random numbers yields something like a Gaussian
    > >>distribution for the sum. IIRC, the more numbers you
    > >>sum up, the smaller the standard deviation of the
    > >>sum will be.
    > >
    > > Seems to me like the deviation will get *bigger* as you add more
    > > numbers, since each number you add increases the range of the sum
    > > by RAND_MAX. I'm not sure.
    >
    > Think pairs of numbers rather than a single number.
    > Each pair (probabilistically) would increase the computed sum
    > by (RAND_MAX/2) on average. The more pairs of numbers,
    > the greater the clustering around N*(RAND_MAX/2)
    > and the standard deviation relative to the total
    > range should be smaller.

      You mean RAND_MAX, not RAND_MAX/2, and you're wrong anyway. Now
    that I have a calculator handy...

      stdDev(seq(sum(rand(5)),X,1,50
        .5770925008
      stdDev(seq(sum(rand(10)),X,1,50
        1.091296406
      stdDev(seq(sum(rand(15)),X,1,50
        1.123236281
      stdDev(seq(sum(rand(20)),X,1,50
        1.205597181

    That's the TI-83 talking there. It says that if you take N
    random real numbers between 0 and 1 and add them up, and do this
    50 times, then the standard deviation of those 50 random sums
    is F(N), for the following values of N:

      N = 5 F(N) ~ 0.577092501
      N = 10 F(N) ~ 1.091296406
      N = 15 F(N) ~ 1.123236281
      N = 20 F(N) ~ 1.205597181

    > As an extreme case, think of a (flawed) RNG which for
    > a significant run of N numbers (say RAND_MAX*RAND_MAX)
    > gave a perfectly uniform distribution of each number
    > between 0 and RAND_MAX. Thus, there would be just as
    > many 0's in the sequence as there were RAND_MAX values
    > (sum of pairs = RAND_MAX/2), just as many 1's as RAND_MAX-1's
    > (sum of pairs = RAND=MAX/2), etc.
    > The sum of such a sequence would be indistinguishably
    > close to N*(RAND_MAX/2).

      This is true, but it only says that the sums are symmetric
    about N*(RAND_MAX/2). It doesn't say anything about the
    standard deviation. I'm going to believe my calculator unless
    you think you can prove otherwise. (And remember, I've already
    given a common-sense "proof" that makes more sense than yours,
    not to mention the actual computations. :)

    -Arthur


  • Next message: Nick Landsberg: "Re: Choose k random lines from file"

    Relevant Pages

    • Approximating mean and sd with recursive functions
      ... I would also like to calculate the standard deviation. ... the shortcut formula for standard deviation, ... standard deviation if I have N, the sum of the x values, and the sum of ... the squares of the x values. ...
      (sci.stat.math)
    • Re: Summing Noise Sources
      ... white or whatever) and we sum them, then the TOTAL noise is found by ... the square root of that sum. ... There is no peak value for the idealized amplitude statistics, ... the standard deviation of the sum. ...
      (sci.electronics.design)
    • Re: Calculating Standard Deviation
      ... When calculating the standard deviation for a set of values you use the ... sum of all the residuals squared. ... norms are close but not the same. ...
      (sci.math.num-analysis)
    • Re: Summing Noise Sources
      ... each noise source, squaring the values, summing the results and then taking ... the square root of that sum. ... Sounds like a confused school boy mindlessly regurgitating some rules with no rhyme or reason...Your so-called RMS is a simple standard deviation of two independent random variates, and as anyone knows, the standard deviation of the sum is the RSS of the component deviations. ... There is no peak value for the idealized amplitude statistics, there may be a practical limitation in the real world, but the idealized model usually fits with less than 1e-24 chance of deviation from reality, that's why people consider it useful. ...
      (sci.electronics.design)
    • Re: Standard Deviation
      ... is the sum of the divided by ... A differentiable function definitely makes the math more tractable. ... But standard deviation is usually used because either the physics ... If a simulation model matches the physics, ...
      (comp.dsp)