Re: Choose k random lines from file
From: Arthur J. O'Dwyer (ajo_at_nospam.andrew.cmu.edu)
Date: 06/02/04
- Previous message: Fantius: "Re: moonlight coding"
- In reply to: Nick Landsberg: "Re: Choose k random lines from file"
- Next in thread: Nick Landsberg: "Re: Choose k random lines from file"
- Reply: Nick Landsberg: "Re: Choose k random lines from file"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 1 Jun 2004 20:55:56 -0400 (EDT)
On Tue, 1 Jun 2004, Nick Landsberg wrote:
>
> Arthur J. O'Dwyer wrote:
> > On Mon, 31 May 2004, Nick Landsberg wrote:
> >>Arthur J. O'Dwyer wrote:
> >>>Suppose our RNG (doesn't even have to be a PRNG, but any RNG) gives
> >>>us numbers in the range 0..2^N-1. And we want a number in the range
> >>>0..16 (seventeen choices total). There's no algorithm to give us
> >>>that number without a bias. None. There are procedures much less
> >>>wasteful than the canonical
[...]
> > Now you agree this is an unbiased generator, right? :)
> > (BTW, I'm assuming that 'rand' is completely unbiased here.
> > If 'rand' were biased, there would be nothing to discuss; we have
> > to have *some* starting point!)
>
> Yes, that is an unbiased generator, but I had thought
> that one of the things to avoid was the "non-deterministic"
> nature of the algorithm. Firstly, it would seem to me
> that there is a small (but non-zero) chance that one
> could "goto again" almost-forever.
Yes. Remember how I said a while back that there was *no* algorithm
that was both unbiased *and* finite? This procedure is unbiased;
therefore it has an (infinitesimal) chance of going on forever.
If we make it a true algorithm again by removing the possibility of
an infinite loop, we will introduce a bias, and vice versa.
> > [ return (int)(17.*rand()/(RAND_MAX+1.)); ]
> >
> > It would be unbiased, yes. But it would use floating-point
> > math, which on any real system would introduce its own bias; and
> > furthermore, dragging in FP math isn't really necessary. Integer
> > math should be plenty. :)
>
> Hmmm... if the mantissa of the *double* were greater than the
> the precision of the integer (say 53(?) bits vs. 32 bits) then
> would it really insert a bias? Not sure. (yes, I think
> plain float would introduce a bias since the least
> significant bits would be lost).
Well, if you keep all the significant bits and don't introduce any
more significant bits, then you end up with 2^32 different outcomes.
17 does not divide 2^32. Thus there must be at least one favored
choice. Mucking around with words like "precision" and "mantissa"
doesn't have anything to do with it. :)
> > [Gerry wrote:]
> >>>>To get an equally good deterministic algorithm, I suppose you could call
> >>>>rand() seventeen times and add them. To be honest I'm not 100% sure
> >>>>that this is valid, but anyway if you have a lot of pages the amount of
> >>>>calculation rises sharply...
> >>>
> >>> Not valid. That gets a random number in the range 0..(17*RAND_MAX),
> >>>all right, but it's no longer unbiased. There are a lot of numbers
> >>>clustered around 8.5*RAND_MAX, and only one way to get 0. So just
> >>>taking the sum of 17 rand()s and modding by 17 doesn't get you an
> >>>unbiased distribution either.
> >>
> >>As I remember from decades back, summing uniformly
> >>distributed random numbers yields something like a Gaussian
> >>distribution for the sum. IIRC, the more numbers you
> >>sum up, the smaller the standard deviation of the
> >>sum will be.
> >
> > Seems to me like the deviation will get *bigger* as you add more
> > numbers, since each number you add increases the range of the sum
> > by RAND_MAX. I'm not sure.
>
> Think pairs of numbers rather than a single number.
> Each pair (probabilistically) would increase the computed sum
> by (RAND_MAX/2) on average. The more pairs of numbers,
> the greater the clustering around N*(RAND_MAX/2)
> and the standard deviation relative to the total
> range should be smaller.
You mean RAND_MAX, not RAND_MAX/2, and you're wrong anyway. Now
that I have a calculator handy...
stdDev(seq(sum(rand(5)),X,1,50
.5770925008
stdDev(seq(sum(rand(10)),X,1,50
1.091296406
stdDev(seq(sum(rand(15)),X,1,50
1.123236281
stdDev(seq(sum(rand(20)),X,1,50
1.205597181
That's the TI-83 talking there. It says that if you take N
random real numbers between 0 and 1 and add them up, and do this
50 times, then the standard deviation of those 50 random sums
is F(N), for the following values of N:
N = 5 F(N) ~ 0.577092501
N = 10 F(N) ~ 1.091296406
N = 15 F(N) ~ 1.123236281
N = 20 F(N) ~ 1.205597181
> As an extreme case, think of a (flawed) RNG which for
> a significant run of N numbers (say RAND_MAX*RAND_MAX)
> gave a perfectly uniform distribution of each number
> between 0 and RAND_MAX. Thus, there would be just as
> many 0's in the sequence as there were RAND_MAX values
> (sum of pairs = RAND_MAX/2), just as many 1's as RAND_MAX-1's
> (sum of pairs = RAND=MAX/2), etc.
> The sum of such a sequence would be indistinguishably
> close to N*(RAND_MAX/2).
This is true, but it only says that the sums are symmetric
about N*(RAND_MAX/2). It doesn't say anything about the
standard deviation. I'm going to believe my calculator unless
you think you can prove otherwise. (And remember, I've already
given a common-sense "proof" that makes more sense than yours,
not to mention the actual computations. :)
-Arthur
- Previous message: Fantius: "Re: moonlight coding"
- In reply to: Nick Landsberg: "Re: Choose k random lines from file"
- Next in thread: Nick Landsberg: "Re: Choose k random lines from file"
- Reply: Nick Landsberg: "Re: Choose k random lines from file"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|