Re: Creating a unique random id



Fritz Bayer wrote:

> how likely is it that the following code produces duplicate ie
> identical random numbers, when it gets executed on different virtual
> machines running on different computers?

There are two aspects to this question.

The first is theoretical. /Assuming/ that you are generating 20-digit strings
completely at random, you can calculate the probability of finding duplicates.
If the number of possible values of some random value is N, then if you have a
pool of sqrt(N) values, the chances are that one value occurs twice (Google for
the birthday paradox). In this case you have 1e20 possible random strings, so
you would expect to have to create 1e10 (10 billion) strings on average before
you got a duplicate.

However, that's on the assumption that the strings are completely random.
Which happens not to be true in your case. For one thing, java.util.Random is
a pseudo-random generator, and may have statistical properties that make
duplicates more likely (I'm not mathematician enough to know). More
importantly it has the designed-in quality that each instance of Random always
produces the /same/ sequence of numbers given that it has started with a
specific seed. As a result each instance of your RandomNumberGenerators will
also produce the same sequence of numbers if the underlying Random has the same
seed. So the number of possible /sequences/ of 20-digit number is determined
by
the number of possible seeds to Random (the space of possible numbers that they
can produce is larger, because each generator can produce more than one
number -- it may even be as large as 1e20, though I doubt it). In fact, Random
only keeps 48 bits of its seed so the most you can have is 2**48 different
RandomNumberGenerators. That in turn means that if you create 2**24 instances
of RandomNumberGenerator then you'd expect to find two that were identical, and
therefore generated identical sequences of numbers. 2**24 is about 17 million,
which is quite a lot (depending on the application, of course), but remember
that that's an /upper/ bound. The actual situation may be worse.

So, in practise the chances of finding duplicate 20-digit numbers is dominated
by the chances of two machines using the same seed to initialise a
RandomNumberGenerator.

In Sun's Java 1.4.2 the seed used by default (if you don't provide one
yourself) is System.currentTimeMillis(). That is declared to answer a 64-bit
long, but there won't really be 64 bits-worth of possible values because it is
answering an absolute time, and that is likely to be similar on all the
machines.

If two machines create new RandomNumberGenerators at a steady (on average) rate
of one per second, and if their system clocks are synchronised closely enough
that they read the same time on each machine at some point in the run (not
necessarily the same point), then during the periods where they overlap, the
chances are 1 in 1000 of producing an identical RandomNumberGenerator at /each/
step.

Say that you have 10 machines, all synchronised to within 1 minute, and you run
the generators for an hour on all of them (creating a new instance once a
second as before). For 59 out of the 60 minutes, each new
RandomNumberGenerator will be created with a millisecond seed that is not more
than 1 second different from that of one of the others created on each of the
other 9 machines. So you will create 59 * 60 "batches" of 10
RandomNumberGenerators all with seeds that can differ by no more than 1000.
The chances of all 10 from one batch being different are 1.000 * 0.999 * 0.998
* ... * 0.991 = 0.956. So each batch has e a 4.4% chance of containing
duplicate RandomNumberGenerators. Since you create 59*60=3480 such batches in
the hour, the chances of none of them containing duplicates is 0.956**3480.
Which I make to be about 6e-69. I.e. it is essentially certain that you'll
create two identical RandomNumberGenerators which will then naturally produce
identical sequences of 20-digit numbers.

Of course, there's also the possibility that two /different/
RandomNumberGenerators will produce the same 20-digit number at some point, but
the chances are very much lower, as described above.

In Sun's Java 1.5.0 Random uses System.nanoTime() as its default seed; that
isn't an absolute time, so it is more "random". However there is no guarantee
that it can actually answer all 2**64 possible values. If, say, it only
answered 32-bits-worth of real information, then the number of possible random
sequences would be cut down to 2**32.

And there will probably be a tendency for machines to have reset their
nanosecond counters at roughly the same time (within the same month, say) so
there's a higher than chance probability that two machines will answer the same
number at some points in their lives. You can work out the probability in the
same kind of way as I did above.

Repeating the calculation. First off there is less chance that any two
machine's nanosecond clocks will overlap during the hour since they are not
synchronised, but instead reflect when the machine was last reset (or something
like that). For the sake of argument assume that they have all been reset in
the last month. Secondly there is much less chance that any two machines with
nanosecond clocks that differ by < 1 second will answer the same nanosecond
time. If we assume that the nanosecond timer actually has microsecond
resolution [*], then the chances are 1 in a million -- which is a lot smaller
than 1 in 1000.

([*] I don't know how likely this is, but this machine's 'high performance
counter' claims to click 3.6 million times a second so it may be plausible.)

With those assumptions, for each Random created (once per second per machine)
there is a 1 / (31 * 24) chance that it is created during the a "second" (as
seen by the nano clocks) that is covered by any specific one of the other
machines during that run. So there is approximately 9 / (31*24) chance of it
overlapping with any of the other 9 machines. That's about 1.2%. If it /does/
overlap, then there's a 1e-6 chance of it actually seeing the /same/ nanosecond
value. That means that each Random has about 1.2e-8 chance of being identical
to some other, or to put it another way, about (1-1.2e-8) chance of being
unique (i.e. very near certainty) During the run we produce 10*60*60 instances
of Random, so the chances of all of them being unique are about
(1-1.2e-8)**36000. That's about 0.99956, or to put it the other way around
there's about one chance in 2300 of finding a duplicate Random anywhere.

For comparison, if we consider 36000 genuinely random 20-digit numbers, then I
make the chances of there being a duplicate about 1 in 178 billion.

So it all comes down to the seed for Random. I'd say that duplicates might be
quite likely on 1.4 machines. On 1.5 machines they are probably quite unlikely,
providing that the nanosecond counter has a respectably large range of possible
values (as it probably does on Windows machines, I don't know about any
others).

-- chris


.



Relevant Pages

  • Re: Creating a unique random id
    ... Wasn't Fritz just looking for consecutive duplicates anyway? ... when it gets executed on different virtual machines running on different computers? ... If the number of possible values of some random value is N, then if you have a pool of sqrtvalues, the chances are that one value occurs twice. ... RandomNumberGenerators all with seeds that can differ by no more than 1000. ...
    (comp.lang.java.programmer)
  • Re: [PATCH 4/4] ACPI PCI slot detection driver
    ... called the _SUN method on a slot object that existed in the ACPI ... infrastructure warn about duplicate names because for my test ... And then there's the machines with duplicate slot names, ...
    (Linux-Kernel)
  • Re: duplicate name exists on network
    ... > Domain) I still received the error message that a duplicate computer name ... > The DHCP server is an NT server. ... I know how to turn this off on the machines, ... >>> A duplicate name has been detected on the TCP network. ...
    (microsoft.public.win2000.active_directory)
  • Re: [PATCH 4/4] ACPI PCI slot detection driver
    ... My main concern is that BIOS vendors will not fix these bugs, ... And then there's the machines with duplicate slot names, ... machine, one being hotplug, the other non-hotplug. ...
    (Linux-Kernel)
  • Re: mirroring domains
    ... Don't duplicate names in any circumstances; if the machines are connected to ... You can have "duplicate" computers name between domains, ... Whilst the dc is providing dns and wins services, ...
    (microsoft.public.win2000.active_directory)