Re: Choose k random lines from file
From: Nick Landsberg (hukolau_at_NOSPAM.att.net)
Date: 06/02/04
- Next message: Edward G. Nilges: "Re: Aspiring highest-order programmer"
- Previous message: Edward G. Nilges: "Re: [OT] Re: Offshore Outsourcing"
- In reply to: Arthur J. O'Dwyer: "Re: Choose k random lines from file"
- Next in thread: Arthur J. O'Dwyer: "Re: Choose k random lines from file"
- Reply: Arthur J. O'Dwyer: "Re: Choose k random lines from file"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 01 Jun 2004 23:26:34 GMT
Arthur J. O'Dwyer wrote:
> On Mon, 31 May 2004, Nick Landsberg wrote:
>
>>Arthur J. O'Dwyer wrote:
>>
>>> It's not the period that would be the problem; it's the range.
>>>Suppose our RNG (doesn't even have to be a PRNG, but any RNG) gives
>>>us numbers in the range 0..2^N-1. And we want a number in the range
>>>0..16 (seventeen choices total). There's no algorithm to give us
>>>that number without a bias. None. There are procedures much less
>>>wasteful than the canonical
>>>
>>> again:
>>> R = rand();
>>> if (R >= (RAND_MAX/17)*17) goto again;
>>> return (R % 17);
>>>
>>>[which I see you showed below],
>>>but there's no way to get an unbiased number with a range of M
>>>out of an unbiased generator with a range of N, if M and N are
>>>coprime. Or something like that. I'm not going to prove my
>>>wild assertions on a federal holiday! ;)
>>
>>I'm no expert, but a thought experiment using small values
>>for N and M (say 7 and 3) seems to indicate a bias towards
>>the low end.
>
>
> What, for the algorithm shown above, or for any algorithm?
> The algorithm shown above is perfectly unbiased; for an equivalent
> but much more wasteful algorithm, consider the trivial case in
> which we replace (RAND_MAX/17) by (1):
On deeper reflection, you're right. (It was a holiday after all.)
I was thinking about something else having to do with modulo.
Sorry
>
>
>>> again:
>>> R = rand();
>
> if (R >= (1)*17) goto again;
>
>>> return (R % 17);
>
>
> Now you agree this is an unbiased generator, right? :)
> (BTW, I'm assuming that 'rand' is completely unbiased here.
> If 'rand' were biased, there would be nothing to discuss; we have
> to have *some* starting point!)
Yes, that is an unbiased generator, but I had thought
that one of the things to avoid was the "non-deterministic"
nature of the algorithm. Firstly, it would seem to me
that there is a small (but non-zero) chance that one
could "goto again" almost-forever.
On average, the call to rand() could be executed
millions of times during each iteration. (Depending on the
relative sizes of RAND_MAX and your modulo number.
On my system RAND_MAX ix 0x7fffffff, probability
of finding a number less than 17 is about one
in 126 million.)
>
>
>>On the other hand, would something like the following
>>give less of a bias using an extra float or double
>>conversion? :
>
>
> [ return (int)(17.*rand()/(RAND_MAX+1.)); ]
>
> It would be unbiased, yes. But it would use floating-point
> math, which on any real system would introduce its own bias; and
> furthermore, dragging in FP math isn't really necessary. Integer
> math should be plenty. :)
Hmmm... if the mantissa of the *double* were greater than the
the precision of the integer (say 53(?) bits vs. 32 bits) then
would it really insert a bias? Not sure. (yes, I think
plain float would introduce a bias since the least
significant bits would be lost).
>
>
> [Gerry wrote:]
>
>>>>To get an equally good deterministic algorithm, I suppose you could call
>>>>rand() seventeen times and add them. To be honest I'm not 100% sure
>>>>that this is valid, but anyway if you have a lot of pages the amount of
>>>>calculation rises sharply...
>>>
>>> Not valid. That gets a random number in the range 0..(17*RAND_MAX),
>>>all right, but it's no longer unbiased. There are a lot of numbers
>>>clustered around 8.5*RAND_MAX, and only one way to get 0. So just
>>>taking the sum of 17 rand()s and modding by 17 doesn't get you an
>>>unbiased distribution either.
>>
>>As I remember from decades back, summing uniformly
>>distributed random numbers yields something like a Gaussian
>>distribution for the sum. IIRC, the more numbers you
>>sum up, the smaller the standard deviation of the
>>sum will be.
>
>
> Seems to me like the deviation will get *bigger* as you add more
> numbers, since each number you add increases the range of the sum
> by RAND_MAX. I'm not sure.
Think pairs of numbers rather than a single number.
Each pair (probabilistically) would increase the computed sum
by (RAND_MAX/2) on average. The more pairs of numbers,
the greater the clustering around N*(RAND_MAX/2)
and the standard deviation relative to the total
range should be smaller.
As an extreme case, think of a (flawed) RNG which for
a significant run of N numbers (say RAND_MAX*RAND_MAX)
gave a perfectly uniform distribution of each number
between 0 and RAND_MAX. Thus, there would be just as
many 0's in the sequence as there were RAND_MAX values
(sum of pairs = RAND_MAX/2), just as many 1's as RAND_MAX-1's
(sum of pairs = RAND=MAX/2), etc.
The sum of such a sequence would be indistinguishably
close to N*(RAND_MAX/2).
Multiple runs of that sequence should also yield sums
very, very close to the above. The standard deviation
of the *sums* (which, I believe was proposed as the
results of the algorithm) would be rather small,
I think.
>
> -Arthur
>
-- "It is impossible to make anything foolproof because fools are so ingenious" - A. Bloch
- Next message: Edward G. Nilges: "Re: Aspiring highest-order programmer"
- Previous message: Edward G. Nilges: "Re: [OT] Re: Offshore Outsourcing"
- In reply to: Arthur J. O'Dwyer: "Re: Choose k random lines from file"
- Next in thread: Arthur J. O'Dwyer: "Re: Choose k random lines from file"
- Reply: Arthur J. O'Dwyer: "Re: Choose k random lines from file"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|