12 views (last 30 days)

Show older comments

Abstractly, a random permutation generator (RPG) is a function that maps a random bit-string of length b into a permutation of {1,2, ..., n}.

There are 2^b distinct bit-strings of length b and n! distinct permutations of length n. Hence we must have 2^b > n! Using Stirling's approximation for n! and taking logs of both sides of the inequality we get b > n*log2(n).

Now RANDPERM can generate a permutation of length n = 10^6 in 0.2 sec and part of its output looks like this: 383882, 905744, 760491, 138901. However, to generate all of the (10^6)! ~ 10^(5,565,709) possible permutations the inequality above says that b > 10^6*Log2(10^6) ~ 2*10^7 bits are necessary. But RANDPERM uses indirectly the Mersenne Twister generator whose internal state has about 625*32 = 2*10^4 bits and so it cannot possibly generate all (10^6)! permutations. Indeed, it must miss most of these permutations.

My question is: What permutations does RANDPERM (or any other RPG) miss? Are these misses spread uniformly over the permutation space?

Jan
on 23 Jul 2011

You are right: The Mersenne-Twister cannot create all permutations from a certain limit. The period of 10^6000 means that the sequence of the created 32 bit numbers is repeated after this number of steps. RANDPERM uses 53 bit doubles, which are built from two 32 integers, such that the half period must be taken.

While the sequence of numbers is repeated after 0.5*10^6000 RNG calls only, the values of the 53 bit floating point number are repeated earlier. RANDPERM uses Matlab's SORT, which is a stable quicksort implementation. Here stability means, that the order of equal inputs is kept in the output. This leads to a lovely bias, which can be found without waiting for years by reducing the RNG range drastically:

rngLimit = 5;

n = 40;

elem = 1:n

elemSum = zeros(1, n);

indexSum = zeros(1, n);

for i = 1:10000

random = floor(rand(1, n) * rngLimit);

randomSum = randomSum + random;

[igno, p] = sort(random); % This is RANDPERM

elemSum = elemSum + elem(p);

end

% plot(randomSum) % no bias

plot(elemSum) % cute bias

However, for 53 bits and periods of 0.5*10^6000 the determination of the bias will lead to other problems: There are only about 10^80 nucleons in the universe. This limits the possibilities to store a list of already chosen permutations. So there are some permutations not reachable by RANDPERM, but the chance to suffer from this is tiny.

As you, Derek, know, I suggest not to use RANDPERM for large data sets, but the Fisher-Yates-Shuffle, e.g. implemented in FEX: RPG Lab or FEX: Shuffle. I've implemented the KISS RNG in Shuffle, which has a period of 10^37 and thought of using Marsaglias' CMWC4096 with a period of 10^33000. But as far as I can see, the limits of the universe are a stricter bound of the number of computable permutations.

I'd looking forward to know, if any advanced mathematician shares this estimation.

Peter Perkins
on 8 Sep 2011

A couple of people in this thread have mentioned the Fisher-Yates-Durstenfeld(-...) shuffle algorithm.

New in R2011b, the RANDPERM function lets you specify a second argument, as in, "give me a random selection of size k from a random permutaion of 1:n". This not only simplifies and speeds up the typical

perm = randperm(n);

r = perm(1:k)

that has been recommended in the past, but also uses F-Y-D. So of course you can ask for a random permutation of all of 1:n as

perm = randperm(n,n)

and have it generated using F-Y-D with the Mersenne Twister underneath. For backwards compatibility, randperm(n) continues to use the existing algorithm based on sort(rand(1,n)).

Daniel Shub
on 23 Jul 2011

Peter Perkins blog on Loren's blog a while back about the random number generators.

It doesn't answer your question exactly. Your question seems to be more is the Mersenne Twister a good random number generator. The answer to that is, yes, it is one of the better general purpose random number generators available today. Yes, there are sequences it will not generate, however, its repeat time is so long that it is basically impossible to detect that from the sequence itself.

Walter Roberson
on 25 Jul 2011

As long as randperm() is implemented as a stable sort of random numbers that can be duplicates, you cannot trust randperm() requesting more than 1 value.

Suppose you rand(1,2) and both results happen to come out exactly the same. If they cannot come out exactly the same then that would be a test that could be applied to easily distinguish pseudo-random numbers from true-random numbers: true random numbers can have duplicates (and the standard birthday-paradox analysis would make short work of detecting the problem.)

If the two values of rand(1,2) can come out exactly the same (no matter what the period is for full repetition), then the stable sort that is applied to the random numbers is always going to choose the same ordering, not a permutation of that.

This suggests an recursive "saferrandperm" algorithm

[vals, idx] = sort(rand(1,N));

diff(vals)

Look for runs of 0 in the diff: these indicate groups if identical values, and identifying them is an order-N operation.

For any one group of identified identical values, extract the indices, take the length of that, and saferrandperm() that length. Use the indices saferrandperm() returns on the subset size to shuffle the extracted indices in the vector to be returned; if there are more groups of identical values, do the same things for those.

When saferrandperm detects ties in the generated rand() values, it does the same recursive shuffle.

The proportion of duplicated values will get smaller and smaller and as log as the RNG does not generate all-identical values, eventually any one subset will be suitably permuted with the ordering *not* stable.

Jan
on 26 Jul 2011

@Walter: Thanks for this safer version of RANDPERM.

If RANDPERM is patched, I suggest to replace it by the faster Fisher-Yates-Alorithm directly.

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!