constrained random number generation

2 views (last 30 days)
Hi All,
I like to generate n numbers, a1, a2,...an, which are uniformly distributed, but there are one constraint that sum(a1,..., an)=1, so how i can solve this simulation task.
Thanks a lot.

Accepted Answer

Walter Roberson
Walter Roberson on 29 Apr 2012
Roger does it right; there are lots of ways to do it wrong, unfortunately.
[Copy of .doc file]
The Theory Behind the 'randfixedsum' Function
This function generates random, uniformly distributed vectors, x = [x1,x2,x3,...,xn]', which have a specified sum s, and for which we have a <= xi <= b, for specified values a and b. It is helpful to regard such vectors as points belonging to n-dimensional Euclidean space and lying in an n-1 dimensional hyperplane constrained to the sum s. Since, for all a and b, the problem can easily be rescaled to the case where a = 0 and b = 1, we will henceforth assume in this description that this is the case, and that we are operating within the unit n-dimensional "cube".
Each such point x can be uniquely classified in the following manner. Start with the "central" point p1 = [s/n,s/n,...,s/n] and project a ray from there running through x and on out until it meets some face of the unit cube - that is, until one of the components first becomes 1 or 0. Call the point reached, r2. Then we can express x as:
x = (1-a1)*p1 + a1*r2,
for some value 0 <= a1 <= 1. Next, select a second "central" point which has all components equal to one another except for the same 1 or 0 as in r2, with the sum of them still at s. Call this second central point, p2. Then follow the ray from p2 through r2 until it again meets the cube's surface at r3 with some second component attaining 1 or 0. This gives us
r2 = (1-a2)*p2 + a2*r3.
Let p3 be the point with these two 1's and/or 0's as in r3, but with all others equal and again with total sum s, and follow the ray from p3 through r3 to r4 as before with a third component at 1 or 0, giving
r3 = (1-a3)*p3 + a3*r4.
If this is continued until pn is reached, pn will have all components except one set to 1's or 0's, and we can combine all the above expressions to form:
x = (1-a1)*p1 + a1*(1-a2)*p2 + a1*a2*(1-a3)*p3 + ... + a1*a2*...*a_n-1*pn
The points p1, p2, ..., pn can be considered as the vertices of an n-1 dimensional simplex (a generalization of 1-dimensional line segments, 2-dimensional triangles, and 3-dimensional tetrahedra,) and the above equality expresses x as lying in its interior. If we ignore cases of x's where two or more cube faces are met simultaneously or where an r coincides with one of the p's, with the justification that they constitute cases of zero probability in n-1 dimensions, then this is a unique procedure. Any given x will lie in only one such simplex. Furthermore, this is reversible. Starting with any set of legitimate p vertices, and choosing any set of a's, each between 0 and 1, we will arrive at a solution x to the problem via the above expression. This completely classifies all solutions x as lying in one of the above-described simplexes.
The vertices of such simplexes can occur in any sequence of the n components, and for a given sequence, in any combination of 1's and 0's throughout for which the total sum is k = floor(s). Hence their total number is n factorial times n-1_C_k. If, for example, n = 7 and k = 3, this gives a total of 20 different types and 5,040 possible permutations of each, for a formidable total of 100,800 different simplexes. For the n = 25 case (which served as the inspiration for writing this function,) the total would be a mind-boggling 41,944,731,705,933,745,734,549,504,000,000 simplexes, (though of only a mere 2,704,156 different types.) Fortunately for our sanity there is an iterative process, explained below, that comes to the rescue and enables such large numbers of simplexes to be handled in a reasonable fashion without having to deal with all of them individually.
There are required three types of random processes that determine the placement of random points x in these simplexes. First of all, when a particular simplex has been selected, points within that simplex are to be uniformly distributed throughout the simplex. This is relatively easy to do in terms of the a1, a2, a3, ... quantities mentioned above. For any particular x given by these a's, we can define the set of "inner" points, xx, determined by all other sets of aa's such that each aai term is bounded by zero and ai: 0 <= aai <= ai. It can readily be seen that the set of these xx's constitutes the interior of another simplex inside the larger one. The volume of such a simplex can be shown to be a1^(n-1)*a2^(n-2)*a3^(n-3)*... times the volume of the outer simplex. For uniform distribution it is necessary and sufficient to have the probability of a point falling in any inner simplex be proportional to its volume. This can obviously be accomplished by n-1 calls on the 'rand' function, and setting a1 = rand^(1/(n-1)), a2 = rand^(1/(n-2)), ..., a_n-1 = rand, and this is what is done in 'randfixedsum'.
The second and more difficult aspect of random action is selecting between different types of simplexes corresponding to different combinations of 1's and 0's within a given permutation in the above "ray" tracing. The probability of landing in each of these different types must be proportional to their respective volumes, and their volumes are in general very different from one another in a way that depends on the value of s.
For this purpose, a table of total hyperplane volumes is generated iteratively starting with n = 1 and extending up to the desired n, before any random generation of x's is begun. By hyperplane volume is meant the n-1 dimensional volume of the set occupied by solutions to the problem for a given s. For a given n and s, call this v(n,s). This the second (optional) value returned by 'randfixedsum'. For n = 2 in the unit square, it is easy to see that it is given by:
v(2,s) = sqrt(2)*s , for 0 <= s <= 1,
sqrt(2)*(2-s), for 1 <= s <= 2, and
0 , elsewhere.
Referring back to the p's mentioned above, all the points in V(n,s) can be divided up into those that, as a first step, initially follow a ray to a 1 value at the cube's edge, and those that end up at a 0 (again ignoring cases of zero probability.) There are n cases that end up at 1, one for each component, and for each one we can define an n-1 dimensional pyramid with the central point p1 at the apex and the 1-surface of the cube at the base containing the corresponding p2. The n-1 dimensional volume of this pyramid will be 1/(n-1) times its altitude times the n-2 dimensional volume of its base. The altitude can be calculated from the definition of p1 and p2 to be (after all the smoke clears away):
(n-s)/sqrt(n*(n-1)),
and the base's volume is simply another similar problem with n-1 components and s-1 sum - that is, it is just v(n-1,s-1). Putting n of these pyramids together, their total volume is therefore:
n/(n-1)*(n-s)/sqrt(n*(n-1))*v(n-1,s-1) = (n/(n-1))^(3/2)*v(n-1,s-1)*(n-s)/n
In a similar fashion, the total volume of those x's projecting onto 0-surfaces will come out to be:
(n/(n-1))^(3/2)*v(n-1,s)*s/n
The sum of these two terms yields the iteration
v(n,s) = (n/(n-1))^(3/2)*(v(n-1,s-1)*(n-s)/n + v(n-1,s)*s/n)
If we define w(n,s) = K*v(n,s)/n^(3/2) where K is any constant, we can rewrite the iteration as:
w(n,s) = w(n-1,s-1)*(n-s)/n + w(n-1,s)*s/n
for easier computation. (In 'randfixedsum' K is selected as 'realmax' to allow full use of double precision's enormous range to delay the onset of underflow.) This constitutes a simple iteration, starting with n = 1 and ending up with the desired value of n. It is somewhat analogous to generating a Pascal triangle. Considered as a function of s, v(n,s) gradually approaches a normal distribution curve centered at n/2 as n increases, as would be predicted by the central limit theorem.
However, it is not the w values that is our principal aim here, but rather the relative value of the two terms involved in the above w iteration - those for ending up at 1's and those for ending up at 0's, and that is the purpose of the 't' table. (Relative values of the w-terms is the same as those of the v-terms.) The t(n,s) elements are set equal to the left term (that containing w(n-1,s-1)) divided by the total w(n,s). This is then used with 'rand' values to determine the choice that is being made for going the 1-route versus the 0-route at each point with the appropriate probability. It provides that this selection will be in accordance with the relative volumes of the two kinds of pyramids. The algorithm, in effect, follows a path through the t table, beginning at the bottom row with n and working its way backwards towards n = 1, at each point making a decision using t as to which way to proceed, the 1-route or the 0-route.
When the above process is finished, a complete simplex type has been chosen with appropriate probabilities corresponding to a particular sequence of 1's and 0's choices. However, the assumption has been made that the simplexes chosen all have 1's and 0's occurring starting with component 1 and going in order up to component n. This means only a fraction, namely 1/n!, of simplexes have been selected from at that point in the algorithm - only one simplex of each type. The last step is therefore to perform a random permutation on the result so as to uniformly distribute the results among simplexes of all possible permutations. There is no mystery about this last step in the program since it uses the 'randperm' algorithm.
The validity of 'randfixedsum' was checked by creating a special version which, along with x and v, output three additional quantities: 1) the particular sequence of 1's and 0's that were selected in each case, thereby determining the simplex type entered, 2) the probability of having selected that sequence in terms of the product of the 'rand' values used, and 3) the probability, again in terms of the product of 'rand' values,
of landing somewhere within the inner simplex defined by x. In effect, it told what simplex type was used, where in the simplex x was located, and the probabilities of this having occurred in terms of its various 'rand' values actually used. The random permutation feature was disabled so all generated x's were of the single permutation [1:n], and thus only one simplex of each type could be generated.
A second program received these probabilities and simplex information along with corresponding x's, and used the above ray tracing procedure to compare them with actual volume calculations of the inner and outer simplexes involved to verify that they were in agreement. It also verified that the [1:n] permutation actually occurred. This was allowed to run for some one hundred thousand x's for various values of n up to 32 with satisfactory results. For the lower values of n this will have resulted in having tested each of the simplex types with numerous x points. For the higher values of n, of course not all of the n-1_C_k types could be checked, but the many thousands that did occur all gave correct answers.
Another kind of test can be performed on 'randfixedsum', this time on the v output. Since this quantity is used in the above tests for simplex volumes in relation to the whole volume, v(n,s), this is deserving of an independent test. Considering v(n,s) as a function of s again, if we were to integrate v(n,s)/sqrt(n) with respect to s - the sqrt(n) being necessary to convert s to a true metric - the answer should turn out to be exactly 1, the n-dimensional volume of the unit cube. Actually, mathematical induction on the above iteration in w can easily be used to show that something stronger must hold. If f lies in the range, 0 <= f <= 1, the sum
(v(n,f) + v(n,f+1) + v(n,f+2) + ... + v(n,f+n-1)) / sqrt(n),
which would ordinarily be regarded as a mere approximation to the above integral, must in fact be precisely 1 for any f. Tests were run on 'randfixedsum', generating f with rand, that verified that indeed this holds true for all such f and many different values of n.
Finally, a third test was conducted, mostly for its dramatic appeal to potential users, whose results can be observed in the image which accompanies 'randfixedsum' in the MathWorks' File Exchange. It depicts a 3D plot of x values using x = randfixedsum(3,16384,1.25,0,1), and as can be seen, demonstrates the uniformness of distribution of the x points thoughout the 2-dimensional hexagonal area of solution for x1 + x2 + x3 = 1.25, with 0 <= xi <= 1.
Roger Stafford - January 19, 2006
  3 Comments
David Zhang
David Zhang on 30 Apr 2012
Hi Walter, do you know which textbook give more background on the very useful work by Roger? Thank you.
Walter Roberson
Walter Roberson on 30 Apr 2012
Hmmm, editors don't seem to like that documentation file. I think it might perhaps have been written for an older Mac word processor.
I stripped the header and trailer out and posted it above.

Sign in to comment.

More Answers (0)

Categories

Find more on Creating and Concatenating Matrices in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!