Re: Choose k random lines from file

From: Nick Landsberg (hukolau_at_NOSPAM.att.net)
Date: 05/31/04


Date: Mon, 31 May 2004 18:09:59 GMT

Arthur J. O'Dwyer wrote:
> On Mon, 31 May 2004, Gerry Quinn wrote:
>
>>gerryq@DELETETHISindigo.ie says...
>>
>>>ajo@nospam.andrew.cmu.edu says...
>>>
>>>> I don't like to use potentially-infinite procedures when there
>>>>are algorithms available, no matter how fast they are. Call it
>>>>an ideological position. ;)
>>
>>Something interesting just occurred to me - if you want an even
>>distribution you may *need* a non-deterministic algorithm, because the
>>alternative will be an infinite, or at least horribly long one!
>>
>>How so? Well, suppose you have seventeen pages, and you want to choose
>>one at random, but your RNG has a period that's a power of two, or
>>anyway some number that's not a multiple of seventeen.
>
>
> It's not the period that would be the problem; it's the range.
> Suppose our RNG (doesn't even have to be a PRNG, but any RNG) gives
> us numbers in the range 0..2^N-1. And we want a number in the range
> 0..16 (seventeen choices total). There's no algorithm to give us
> that number without a bias. None. There are procedures much less
> wasteful than the canonical
>
> again:
> R = rand();
> if (R >= (RAND_MAX/17)*17) goto again;
> return (R % 17);
>
> [which I see you showed below],
> but there's no way to get an unbiased number with a range of M
> out of an unbiased generator with a range of N, if M and N are
> coprime. Or something like that. I'm not going to prove my
> wild assertions on a federal holiday! ;)

I'm no expert, but a thought experiment using small values
for N and M (say 7 and 3) seems to indicate a bias towards
the low end.

On the other hand, would something like the following
give less of a bias using an extra float or double
conversion? :

     R = rand();
     X = (double) R / (RAND_MAX+1); /* X is of type double*/

     /* this assumes that (RAND_MAX +1) does not
     cause integer overflow (UB). The result should give a
     uniform distribution between 0.0 <= X < 1.0 */

     R2 = X * M; /* R2 and M are the appropriate integer types */

     /* I think this should yield a value 0 <= R2 < M
     with a uniform distribution */

Or did I get it wrong?

>
>
>>To get an equally good deterministic algorithm, I suppose you could call
>>rand() seventeen times and add them. To be honest I'm not 100% sure
>>that this is valid, but anyway if you have a lot of pages the amount of
>>calculation rises sharply...
>
>
> Not valid. That gets a random number in the range 0..(17*RAND_MAX),
> all right, but it's no longer unbiased. There are a lot of numbers
> clustered around 8.5*RAND_MAX, and only one way to get 0. So just
> taking the sum of 17 rand()s and modding by 17 doesn't get you an
> unbiased distribution either.

As I remember from decades back, summing uniformly
distributed random numbers yields something like a Gaussian
distribution for the sum. IIRC, the more numbers you
sum up, the smaller the standard deviation of the
sum will be.

>
>
>
>>Creating a PRNG with a period of 17 has been suggested.
>
>
> Has it? :) s/period/range/ and you have a point that I don't
> recall being made in this thread; but I'm not sure such a beast
> exists for any given prime. Sounds like a job for Knuth to me!
>
> -Arthur
>

Nick L.

-- 
"It is impossible to make anything foolproof
because fools are so ingenious"
  - A. Bloch


Relevant Pages

  • Re: Choose k random lines from file
    ... >> that number without a bias. ... What, for the algorithm shown above, or for any algorithm? ... >> unbiased distribution either. ... > distribution for the sum. ...
    (comp.programming)
  • Re: Random numbers
    ... is that the resulting distribution should be uniform. ... to use the RNG provided for the range 1 to 5. ... hence you can see a slight bias. ...
    (sci.math)
  • Re: Random numbers
    ... is that the resulting distribution should be uniform. ... hence you can see a slight bias. ... random variables, each say uniformly distributed on the integers from ...
    (sci.math)
  • Re: random numbers function
    ... Since 5 and 7 have no common divider, you need to call the RNG 7 times, ... distribution the RNG provides for 5 random numbers). ... Calling the random number generator fewer times in certain ...
    (comp.programming)
  • Re: What are the Parameters for Algorithm AS 62 APPL. STATIST. (1973) VOL.22, NO.2
    ... > c Users are much more likely to need the distribution function. ... SUBROUTINE UDIST(M, N, FRQNCY, LFR, WORK, LWRK, IFAULT) ... REAL ZERO, ONE, SUM ...
    (sci.stat.math)