Re: Choose k random lines from file

From: Arthur J. O'Dwyer (ajo_at_nospam.andrew.cmu.edu)
Date: 06/01/04


Date: Tue, 1 Jun 2004 11:15:56 -0400 (EDT)


On Mon, 31 May 2004, Nick Landsberg wrote:
>
> Arthur J. O'Dwyer wrote:
> >
> > It's not the period that would be the problem; it's the range.
> > Suppose our RNG (doesn't even have to be a PRNG, but any RNG) gives
> > us numbers in the range 0..2^N-1. And we want a number in the range
> > 0..16 (seventeen choices total). There's no algorithm to give us
> > that number without a bias. None. There are procedures much less
> > wasteful than the canonical
> >
> > again:
> > R = rand();
> > if (R >= (RAND_MAX/17)*17) goto again;
> > return (R % 17);
> >
> > [which I see you showed below],
> > but there's no way to get an unbiased number with a range of M
> > out of an unbiased generator with a range of N, if M and N are
> > coprime. Or something like that. I'm not going to prove my
> > wild assertions on a federal holiday! ;)
>
> I'm no expert, but a thought experiment using small values
> for N and M (say 7 and 3) seems to indicate a bias towards
> the low end.

  What, for the algorithm shown above, or for any algorithm?
The algorithm shown above is perfectly unbiased; for an equivalent
but much more wasteful algorithm, consider the trivial case in
which we replace (RAND_MAX/17) by (1):

> > again:
> > R = rand();
            if (R >= (1)*17) goto again;
> > return (R % 17);

Now you agree this is an unbiased generator, right? :)
  (BTW, I'm assuming that 'rand' is completely unbiased here.
If 'rand' were biased, there would be nothing to discuss; we have
to have *some* starting point!)

> On the other hand, would something like the following
> give less of a bias using an extra float or double
> conversion? :

[ return (int)(17.*rand()/(RAND_MAX+1.)); ]

  It would be unbiased, yes. But it would use floating-point
math, which on any real system would introduce its own bias; and
furthermore, dragging in FP math isn't really necessary. Integer
math should be plenty. :)

[Gerry wrote:]
> >>To get an equally good deterministic algorithm, I suppose you could call
> >>rand() seventeen times and add them. To be honest I'm not 100% sure
> >>that this is valid, but anyway if you have a lot of pages the amount of
> >>calculation rises sharply...
> >
> > Not valid. That gets a random number in the range 0..(17*RAND_MAX),
> > all right, but it's no longer unbiased. There are a lot of numbers
> > clustered around 8.5*RAND_MAX, and only one way to get 0. So just
> > taking the sum of 17 rand()s and modding by 17 doesn't get you an
> > unbiased distribution either.
>
> As I remember from decades back, summing uniformly
> distributed random numbers yields something like a Gaussian
> distribution for the sum. IIRC, the more numbers you
> sum up, the smaller the standard deviation of the
> sum will be.

  Seems to me like the deviation will get *bigger* as you add more
numbers, since each number you add increases the range of the sum
by RAND_MAX. I'm not sure.

-Arthur



Relevant Pages

  • Re: Choose k random lines from file
    ... > Suppose our RNG ... > that number without a bias. ... > unbiased distribution either. ... distribution for the sum. ...
    (comp.programming)
  • Re: Choose k random lines from file
    ... > What, for the algorithm shown above, or for any algorithm? ... Yes, that is an unbiased generator, but I had thought ... >>give less of a bias using an extra float or double ... >>distribution for the sum. ...
    (comp.programming)
  • Re: Random numbers
    ... is that the resulting distribution should be uniform. ... hence you can see a slight bias. ... random variables, each say uniformly distributed on the integers from ...
    (sci.math)
  • Re: Where do I begin?
    ... >> that sum is a nonzero fraction in x; ... so it is a nonzero fraction whose denominator is the ... it a negative bias. ... >> This exceeds the error term. ...
    (sci.math)
  • Re: binomial association measure?
    ... >> I want to measure the bias in a certain distribution of events. ... So we can ask how extreme the number of successful Pickers is given the ... probability of having a significantly different distribution is, ...
    (sci.stat.math)