MATLAB Answers

Does MATLAB have a Birthday Problem?

15 views (last 30 days)
The Birthday Paradox or problem asks for the probability that in a room of n people, 2 or more have the same birthday (not date), assuming all years have N = 365 days. It is called a paradox because most people are surprised by the answer when there are (say) 30 people in the room.
Many applications require long sequences (or large vectors) of random numbers. In MATLAB these are supplied by the simple statement r = rand(n,1). This statement fills the vector r(n,1) with double precision random numbers uniformly distributed on (0,1).
Here is my question:
What is the probability that duplicates occur in r = rand(n,1), as n gets large?
I have an answer to this question but I would like to see what others come up with so that I can check the validity of my answer.
  5 Comments
Derek O'Connor
Derek O'Connor on 9 Feb 2011
Malcolm,
Thank you for sharing that with us. However, it seems to have escaped your notice that we are not talking about actual births or birthdays here.
Derek

Sign in to comment.

Accepted Answer

Derek O'Connor
Derek O'Connor on 5 Feb 2011
Reply from Derek O'Connor:
I must apologize for the omission of the RNG used. It was in my original draft but somehow got left out. I'm using the default Matlab rand(n,1) Here is what the docs (R2008a) say:
"Use the Mersenne Twister algorithm by Nishimura and Matsumoto (the default in MATLAB® Versions 7.4 and later). This method generates double-precision values in the closed interval [2^(-53), 1-2^(-53)], with a period of (2^19937-1)/2."
My answer is here, which is based on the RNG information above: http://www.scribd.com/doc/48243415/The-Birthday-Paradox-and-Random-Number-Generation
A summary of my analysis is that the probability of duplicates rises significantly after n = 10^8 and has converged to 1 at n = 10^9.
Here are the results of testing, with 5 runs for each value of n.
n = 1*10^8, Prob = 0.426..., Duplicates = (1, 0, 1, 0, 1). Avg = 0.6
n = 5*10^8, Prob = 0.999..., Duplicates = (18, 18, 18, 16, 20). Avg = 16.0
n = 1*10^9, Prob = 1.000..., Duplicates = (66, 64,43, 48, 62). Avg = 56.6
I am not an expert on the mathematics or implementation of RNGs and tend to accept what the docs tell me until I'm told otherwise.
This is not "just a mathematics" question. I say this at the end of my notes above which starts with the real question behind my original question:
What are the implications of getting duplicates in long vectors? This question cannot be answered in general because the answer depends on how these random vectors are used. If they are being used to test a new sorting algorithm, for example, then it is difficult to see how a few duplicates would matter.
I would not be so sanguine about the use of such vectors in medical research. A lot of DNA research seems to involve large amounts of computation which may use large vectors.\footnote{I don't really know what is done, so this is purely speculative.} This is particularly troubling, given the appalling record that bio-medical researchers have in misusing and abusing probability, statistics, and software.
I would certainly be uneasy if I found out that the Boeing or Airbus I was flying in used long random vectors as part of their avionics software. Hence my often-repeated question to students:
Would you fly in a plane whose software was written by you? "
Regards,
Derek O'Connor.
------------------------
I'm replying to comments here because I can't use that tiny comment window.
|Dear Bruno, Walter, et al.
I realize that I have not been clear in my answers to your questions and that my statement that rand picks "integer multiples of 2^(-53)" is just plain wrong. Before I get into the floating point aspects of the problem (ugh!), let me state clearly what probability model the analysis is based on:
I assumed that r = rand(n,1) picks r1, r2,...,rn from a bag of N = 2^(53)-1 'objects', WITH replacement, independently, and with equal probability Pr(ri) = 1/N. In other words its the standard "random sample of size n from a population of size N, with replacement" model. The standard textbook analysis and results follow, giving 1-P(N,n) = Pr(1 or more duplicates).
I think these are reasonable assumptions to make about any good random number generator, in any programming language. Hence the analysis and probability calculations are not specific to Matlab, and carry over to any programming language. Of course, for each situation we need to know N, but that is the only parameter needed.
But what about the 'objects' in the bag? In this case they are double precision floating point numbers. But who cares? The Matlab program that counts the number of duplicates merely tests for the equality of two objects. These objects could be integers, singles, doubles, or even matrices.
But you sorted these objects before you tested for equality, and this implies a '<' on the set of objects. Yes, the '<' was used to reduce the search for duplicates to O(nlogn). If '<' does not apply to the bag of objects then we must use an O(n^2) search algorithm that uses '=' only.
And now to the floating point aspects of the problem, where I'll meet my Waterloo, more than likely.
An IEEE 64-bit double precision floating point number is represented as
fl(x) = +/- s x 2^e where s (significand) is a 52-bit integer and e (exponent) is an 11-bit integer and the sign is 1 bit. The precision is p = 53 bits, and machine epsilon emach = 2^(1-p) = 2^(-52) is the distance between fl(1)=1 and the next higher fpn. = 1+emach = 1+2^(-52). A floating point number is normalized if the first bit of s is 1. In base 2 this can be a hidden or implied bit which is not actually stored.
Distribution Properties:
  1. There are 2^(p-1) = 2^(52) normalized dpfpns in [1, 1-2^(-52)]
  2. These dpfpns are uniformly distributed with spacing emach = 2^(-52).
  3. These dpfpns have the form x = 1 + k*emach, k = 0,1,2,...,2^(52)-1.
These numbers have the following bit patterns, with fixed exponent 2^0:
k = 0 1.000...000 x 2^0
k = 1 1.000...001 x 2^0
...
k = 2^(52)-3 1.111...101 x 2^0
k = 2^(52)-2 1.111...110 x 2^0
k = 2^(52)-1 1.111...111 x 2^0
If we wanted uniform numbers in [1, 1-2^(-52)] we could randomly pick an integer k between 0 and 2^(52)-1 and return r = 1+k*emach. Alternatively, we may view a random number as one of the 2^(52) possible bit patterns (000...000) ... (111...111).
Matlab's rand however, picks dpfpns from [2^(-53), 1-2^(-53)]. If we assume that all significand bit patterns are generated then using 2 and 3 above we get
x = 2^(-53) + k*2^(-53)*emach = 2^(-53) + k*2^(-105), k = 0,1, 2^(52)-1.
But this does not give the correct result. To get numbers in the range [2^(-53), 1-2^(-53)] we need to change the exponent. As Walter pointed out 53 different exponents are used. And as Bruno pointed out this means non-uniformity.
Right now I'm stuck. I'm missing something obvious but I can't see what it is. Floating point Aghhhh!
Does this invalidate the Birthday problem analysis, calculations and Matlab test results? No, because I'm not assuming anything about the values of the random numbers generated. I test only for equality. However, I'm not sure: should N = 2^(52) or 2^(53)?
Regards,
Derek.
|
  10 Comments
Walter Roberson
Walter Roberson on 6 Feb 2011
Derek:
>> num2hex(2^-53)
ans =
3ca0000000000000
>> num2hex(1-2^(-53))
ans =
3fefffffffffffff
That is 53 different exponents that are used over the range of random numbers you have documented.
The only integral multiple of 2^(-53) that has an exponent of 3ca is 2^(-53) itself, with an all-0 mantissa. 2*2^(-53) has an exponent of 3cb and an all-0 mantissa. 3*2^(-53) has an exponent of 3cb and a mantissa whose first bit is set. 4*2^(-53) has an exponent of 3cc and a mantissa of 0.
If we follow this further than it becomes clear from the pigeon-hole principle that not all of the values starting with 3cf can be represented.
Example:
>> num2hex(hex2num('3fe0ffffffffffff') / 2^(-53))
ans =
4330ffffffffffff
It follows that 3fe0ffffffffffff is not an exact integral multiple of 2^(-53) and thus will not be randomly generated. (it is the representable number just below 0.53125 exactly)
Any argument made on the basis of equi-probable bit patterns after a constant exponent are suspect. One _can_, however, make arguments about taking integer multiples of 2^(-53) and then normalizing the numbers.
IEEE 754 does not actually store 53 bits of mantissa: it uses 52 bits of mantissa plus the "hidden bit" implied by the exponent.

Sign in to comment.

More Answers (4)

Richard Willey
Richard Willey on 7 Feb 2011
Hi Derek
Your question about the birthday problem is best viewed as a special case of a more general topic: How should measure the statistical properties of a random number generator?
Rather than using this formulation of the birthday problem to reinvent the wheel, I recommend looking at established literature on this topic. A lot of very qualified experts have spent a lot of time and effort developing criteria to measure the quality of random number generators. The “Diehard” test suite is one such test (Diehard uses a reformulated version of the birthday paradox as one part of the test suite). Alternatively, there is another test suite that uses a combination of
  1. A monobit test (testing the distribution of ones and zeros in a sequence)
  2. A “Poker” test (a modified chi-square test)
  3. A Longruns test (checking whether there are runs of length 34 or greater within a 20,000 bit sequence)
  4. An autocorrelation test
Most of MATLAB’s code can be inspected by users. If you have questions about the specific implementation of a given algorithm – for example, what version of Twister does MathWorks use – you typically have the option to inspect the function and see how the code was written.
Regards,
Richard
  4 Comments
Richard Willey
Richard Willey on 9 Feb 2011
Hi Derek
Sorry that I misconstrued your original question. Here’s one last quick comment:
Random number generators are designed to sample from specific distributions.
A random number generator that generates normally distributed random numbers is not identical to a random number generator that generates normally distributed random numbers (but excludes duplicate values)
Suppose you were using an RNG to generate random integers from a Poisson distribution with lambda = 10. Would you be happy if the RNG was “gimmicked” to exclude duplicate values? I’d argue that this would be a very flawed RNG because the samples being generated aren’t actually coming from a Poisson distribution.
Let’s move back to the example using “rand”. We’re dramatically changing the granularity of the problem, but the basic principle remains the same…
At the end of the day, I don’t think it’s reasonable evaluate an RNG based on a design criteria which is (arguably) contrary to the way in one would expect an RNG to behave (in this case, deliberately excluding duplicate values)
If you have an application that needs RNGs that are generated from “some distribution that looks very much like a normal distribution but is guaranteed never to produce a duplicate value” then you should probably find an RNG designed to do precisely that.
From my own perspective, I’d worry about implementations that were this persnicetky…
regards,
Richard

Sign in to comment.


Walter Roberson
Walter Roberson on 5 Feb 2011
Upper bound: 1/2+(1/2)*sqrt(1+8*S*ln(2)) where S is the number of distinct random numbers that can be generated.
Unfortunately Mathworks does not appear to document which Twister version they are using. Traditionally their random number generators were (0,1); mt19937ar over (0,1) is 32 bits precision, but over [0,1) is 53 bits precision.
If we speculate that Mathworks used the (0,1) version with 32 bits precision, then according to Wikipedia the 50% probability is at 7.7 × 10^4 numbers.
  1 Comment
Peter Perkins
Peter Perkins on 8 Feb 2011
Walter, the User Guide includes this: "mt19937ar: The Mersenne Twister, as described in [9], with Mersenne prime 2^19937-1. This is the generator documented at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html. It has a period of 2^19937-1. Each U(0,1) value is created using two 32-bit integers from the generator; the possible values are all multiples of 2^-53 strictly within the interval (0,1)."

Sign in to comment.


Mark Shore
Mark Shore on 5 Feb 2011
Using the formula Walter gives leads to the following numbers for a 50% probability:
32-bit 7.716e4
53-bit 1.112e8
57-bit 4.470e8
64-bit 5.056e9
80-bit 1.294e12
A brute-force check of 1e10 random numbers
for n=1:1000 mindiff(n) = min(diff(sort(rand(1e7,1))))/eps(1); end
found 11 identical values, or a ~50% probability for 4.44e8 numbers, suggesting that the equivalent of a 57-bit precision RNG is in use.
  14 Comments
Walter Roberson
Walter Roberson on 6 Feb 2011
Jan, the testing I did was unable to find a non-zero difference smaller than 2^(-53) as well. But it could just be quite rare. I don't think I have time to generate 2^53-ish values to check!

Sign in to comment.


Walter Roberson
Walter Roberson on 8 Feb 2011
The Matlab content of this question is:
  • How many distinct random numbers can be generated in Matlab. (We still don't know for sure, but 2^53-1 appears likely.)
  • Are the numbers generated with equal probability. (The algorithm would have to be examined in detail to be sure. If there is a bias, it is minute.)
  • Is the period of the generator greater than the number of different possible values, or is it equal to the number of possible values?
Anything beyond that is a straight-forward application of the Birthday Paradox and is thus a mathematical question rather than a Matlab question. This forum exists to serve Matlab questions, not to serve mathematical questions and especially not to discuss the risks of fly-by-wire or to discuss deliberate fraud by Matlab users.
One only has to look at the period of MT and compare that to the largest number directly representable in Matlab in order to see that rand() will eventually generate a duplicate. Software that expects otherwise is broken.
It would be fair game to ask Mathworks to create a random number generator with no repetitions for those people who need one; it would even be fair game to ask Mathworks to rename rand() to make it clear that duplicates were possible. Those would be Matlab issues. Considering the amount of legacy code that uses rand(), I think Mathworks would want some strong evidence that "rand with replacement" was widely enough misunderstood to be worth renaming; I personally don't think speculation about possible trouble in genome software would qualify.
  10 Comments
Bruno Luong
Bruno Luong on 10 Feb 2011
Thanks again Peter. That settles the question.

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!