Re: Choose k random lines from file
From: Arthur J. O'Dwyer (ajo_at_nospam.andrew.cmu.edu)
Date: 06/01/04
- Next message: SM Ryan: "Re: MSDS for Liquid Hydrogen"
- Previous message: Grant Edwards: "Re: Creating an operating system"
- Maybe in reply to: qWake: "Re: Choose k random lines from file"
- Next in thread: Nick Landsberg: "Re: Choose k random lines from file"
- Reply: Nick Landsberg: "Re: Choose k random lines from file"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 1 Jun 2004 11:15:56 -0400 (EDT)
On Mon, 31 May 2004, Nick Landsberg wrote:
>
> Arthur J. O'Dwyer wrote:
> >
> > It's not the period that would be the problem; it's the range.
> > Suppose our RNG (doesn't even have to be a PRNG, but any RNG) gives
> > us numbers in the range 0..2^N-1. And we want a number in the range
> > 0..16 (seventeen choices total). There's no algorithm to give us
> > that number without a bias. None. There are procedures much less
> > wasteful than the canonical
> >
> > again:
> > R = rand();
> > if (R >= (RAND_MAX/17)*17) goto again;
> > return (R % 17);
> >
> > [which I see you showed below],
> > but there's no way to get an unbiased number with a range of M
> > out of an unbiased generator with a range of N, if M and N are
> > coprime. Or something like that. I'm not going to prove my
> > wild assertions on a federal holiday! ;)
>
> I'm no expert, but a thought experiment using small values
> for N and M (say 7 and 3) seems to indicate a bias towards
> the low end.
What, for the algorithm shown above, or for any algorithm?
The algorithm shown above is perfectly unbiased; for an equivalent
but much more wasteful algorithm, consider the trivial case in
which we replace (RAND_MAX/17) by (1):
> > again:
> > R = rand();
if (R >= (1)*17) goto again;
> > return (R % 17);
Now you agree this is an unbiased generator, right? :)
(BTW, I'm assuming that 'rand' is completely unbiased here.
If 'rand' were biased, there would be nothing to discuss; we have
to have *some* starting point!)
> On the other hand, would something like the following
> give less of a bias using an extra float or double
> conversion? :
[ return (int)(17.*rand()/(RAND_MAX+1.)); ]
It would be unbiased, yes. But it would use floating-point
math, which on any real system would introduce its own bias; and
furthermore, dragging in FP math isn't really necessary. Integer
math should be plenty. :)
[Gerry wrote:]
> >>To get an equally good deterministic algorithm, I suppose you could call
> >>rand() seventeen times and add them. To be honest I'm not 100% sure
> >>that this is valid, but anyway if you have a lot of pages the amount of
> >>calculation rises sharply...
> >
> > Not valid. That gets a random number in the range 0..(17*RAND_MAX),
> > all right, but it's no longer unbiased. There are a lot of numbers
> > clustered around 8.5*RAND_MAX, and only one way to get 0. So just
> > taking the sum of 17 rand()s and modding by 17 doesn't get you an
> > unbiased distribution either.
>
> As I remember from decades back, summing uniformly
> distributed random numbers yields something like a Gaussian
> distribution for the sum. IIRC, the more numbers you
> sum up, the smaller the standard deviation of the
> sum will be.
Seems to me like the deviation will get *bigger* as you add more
numbers, since each number you add increases the range of the sum
by RAND_MAX. I'm not sure.
-Arthur
- Next message: SM Ryan: "Re: MSDS for Liquid Hydrogen"
- Previous message: Grant Edwards: "Re: Creating an operating system"
- Maybe in reply to: qWake: "Re: Choose k random lines from file"
- Next in thread: Nick Landsberg: "Re: Choose k random lines from file"
- Reply: Nick Landsberg: "Re: Choose k random lines from file"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|