Re: Choose k random lines from file

From: Nick Landsberg (hukolau_at_NOSPAM.att.net)
Date: 06/02/04


Date: Tue, 01 Jun 2004 23:26:34 GMT

Arthur J. O'Dwyer wrote:

> On Mon, 31 May 2004, Nick Landsberg wrote:
>
>>Arthur J. O'Dwyer wrote:
>>
>>> It's not the period that would be the problem; it's the range.
>>>Suppose our RNG (doesn't even have to be a PRNG, but any RNG) gives
>>>us numbers in the range 0..2^N-1. And we want a number in the range
>>>0..16 (seventeen choices total). There's no algorithm to give us
>>>that number without a bias. None. There are procedures much less
>>>wasteful than the canonical
>>>
>>> again:
>>> R = rand();
>>> if (R >= (RAND_MAX/17)*17) goto again;
>>> return (R % 17);
>>>
>>>[which I see you showed below],
>>>but there's no way to get an unbiased number with a range of M
>>>out of an unbiased generator with a range of N, if M and N are
>>>coprime. Or something like that. I'm not going to prove my
>>>wild assertions on a federal holiday! ;)
>>
>>I'm no expert, but a thought experiment using small values
>>for N and M (say 7 and 3) seems to indicate a bias towards
>>the low end.
>
>
> What, for the algorithm shown above, or for any algorithm?
> The algorithm shown above is perfectly unbiased; for an equivalent
> but much more wasteful algorithm, consider the trivial case in
> which we replace (RAND_MAX/17) by (1):

On deeper reflection, you're right. (It was a holiday after all.)
I was thinking about something else having to do with modulo.
Sorry

>
>
>>> again:
>>> R = rand();
>
> if (R >= (1)*17) goto again;
>
>>> return (R % 17);
>
>
> Now you agree this is an unbiased generator, right? :)
> (BTW, I'm assuming that 'rand' is completely unbiased here.
> If 'rand' were biased, there would be nothing to discuss; we have
> to have *some* starting point!)

Yes, that is an unbiased generator, but I had thought
that one of the things to avoid was the "non-deterministic"
nature of the algorithm. Firstly, it would seem to me
that there is a small (but non-zero) chance that one
could "goto again" almost-forever.
On average, the call to rand() could be executed
millions of times during each iteration. (Depending on the
relative sizes of RAND_MAX and your modulo number.
On my system RAND_MAX ix 0x7fffffff, probability
of finding a number less than 17 is about one
in 126 million.)

>
>
>>On the other hand, would something like the following
>>give less of a bias using an extra float or double
>>conversion? :
>
>
> [ return (int)(17.*rand()/(RAND_MAX+1.)); ]
>
> It would be unbiased, yes. But it would use floating-point
> math, which on any real system would introduce its own bias; and
> furthermore, dragging in FP math isn't really necessary. Integer
> math should be plenty. :)

Hmmm... if the mantissa of the *double* were greater than the
the precision of the integer (say 53(?) bits vs. 32 bits) then
would it really insert a bias? Not sure. (yes, I think
plain float would introduce a bias since the least
significant bits would be lost).

>
>
> [Gerry wrote:]
>
>>>>To get an equally good deterministic algorithm, I suppose you could call
>>>>rand() seventeen times and add them. To be honest I'm not 100% sure
>>>>that this is valid, but anyway if you have a lot of pages the amount of
>>>>calculation rises sharply...
>>>
>>> Not valid. That gets a random number in the range 0..(17*RAND_MAX),
>>>all right, but it's no longer unbiased. There are a lot of numbers
>>>clustered around 8.5*RAND_MAX, and only one way to get 0. So just
>>>taking the sum of 17 rand()s and modding by 17 doesn't get you an
>>>unbiased distribution either.
>>
>>As I remember from decades back, summing uniformly
>>distributed random numbers yields something like a Gaussian
>>distribution for the sum. IIRC, the more numbers you
>>sum up, the smaller the standard deviation of the
>>sum will be.
>
>
> Seems to me like the deviation will get *bigger* as you add more
> numbers, since each number you add increases the range of the sum
> by RAND_MAX. I'm not sure.

Think pairs of numbers rather than a single number.
Each pair (probabilistically) would increase the computed sum
by (RAND_MAX/2) on average. The more pairs of numbers,
the greater the clustering around N*(RAND_MAX/2)
and the standard deviation relative to the total
range should be smaller.

As an extreme case, think of a (flawed) RNG which for
a significant run of N numbers (say RAND_MAX*RAND_MAX)
gave a perfectly uniform distribution of each number
between 0 and RAND_MAX. Thus, there would be just as
many 0's in the sequence as there were RAND_MAX values
(sum of pairs = RAND_MAX/2), just as many 1's as RAND_MAX-1's
(sum of pairs = RAND=MAX/2), etc.
The sum of such a sequence would be indistinguishably
close to N*(RAND_MAX/2).

Multiple runs of that sequence should also yield sums
very, very close to the above. The standard deviation
of the *sums* (which, I believe was proposed as the
results of the algorithm) would be rather small,
I think.

>
> -Arthur
>

-- 
"It is impossible to make anything foolproof
because fools are so ingenious"
  - A. Bloch


Relevant Pages

  • Re: Choose k random lines from file
    ... >> that number without a bias. ... What, for the algorithm shown above, or for any algorithm? ... >> unbiased distribution either. ... > distribution for the sum. ...
    (comp.programming)
  • Re: Is there a polynomial algorithm for that problem?
    ... The sum over all n numbers is negative. ... Unfortunately that algorithm does not scale. ... An instance of 3-Partition consists of a list of 3m positive integers ... list can be partitioned into m lists of size 3, ...
    (sci.math)
  • Re: standard deviation, but without the mean
    ... Suppose Xsum is the sum of the the first n numbes and that X2 is ... Just keep track if X2 and Xsum as you go. ... Actually this is not a correct algorithm, ... the error in the least 9 digits.. ...
    (sci.stat.math)
  • Re: C# generic containers from a "C++ perspective"
    ... If you implement the floating-point and integer algorithms ... Maintain the sum of all the items in an accumulator. ... That is the fundamentally-broken algorithm that I was expecting you to ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Disjoint circle merge NP complete for L^n error?
    ... The key point is that we need to be sure there is no optimal algorithm ... We only need to erase MAX and replace with SUM in the proof ... minimum error partition) algorithm in P. ... MAX. I'm pretty sure that if both these are in NP then all L^n metrics ...
    (comp.theory)