randperm non uniformly distributed

I want to sample from the integers 1 through 56 without replacement. Neither randperm nor datasample with 'Replace',false gives a uniformly distributed set if I iterate many times. Why is the last bin in the histogram double the size of the rest?
perms=zeros(10000,6);
samps=zeros(10000,6);
[rp, cp]=size(perms);
for p=1:rp
permstemp = randperm(56,6);
perms(p,:)=permstemp;
end
[rs, cs]=size(samps);
for s=1:rs
sampstemp = datasample(1:56,6,'Replace',false);
samps(s,:)=sampstemp;
end
histogram(perms(1:end))
histogram(samps(1:end))
nonuniform.png

Accepted Answer

John D'Errico on 15 Aug 2019
Sigh. This is NOT a question of non-uniformity. Just a question of not understanding how to recognize non-uniformity, and partially how to understand a histogram.
If you create a histogram with too few bins, SOME bins will have to collect more than one distinct value.
It turns out that histogram decided to use bin edges of 1:56 here, which gives only 55 bins for 56 distinct values, so the last bin collected two values and roughly twice as many samples.
Note the difference between these two calls to histogram:
histogram(perms(1:end))
histogram(perms(1:end),1:56)
histogram(perms(1:end),1:57)
The first two produce the same results. So it appears the default for the bin edges was 1:56. However, when I gave it another bin up to 57, all things appear normal.
So what happens when I have bin edges 1:56? There are integer events at 56, and some at 55. So that last bin had all events that were either 55 OR 56 in the bin, whereas bin number 1 only had the events that were strictly a 1. When I gave it one more bin to use for the histogram, things were now fine.
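The same effect can be checked numerically, without a plot. This is a minimal sketch, assuming histcounts follows the same edge convention as histogram (left edge included, right edge excluded, except for the last bin). The variable names nDraws, counts55 and counts56 are illustrative only and are not from the original code.

```matlab
% Draw 10000 samples of 6 distinct integers from 1:56, as in the question.
nDraws = 10000;
perms = zeros(nDraws, 6);
for p = 1:nDraws
    perms(p, :) = randperm(56, 6);
end

% Edges 1:56 give 55 bins; the last bin [55, 56] holds both 55 and 56.
counts55 = histcounts(perms(:), 1:56);

% Edges 1:57 give 56 bins; each bin holds exactly one integer value.
counts56 = histcounts(perms(:), 1:57);

% The last bin of the 55-bin version is the sum of the last two bins
% of the 56-bin version, which is why it looked twice as tall.
disp(counts55(end) == counts56(55) + counts56(56))
```

If the comparison holds, the doubled bar is purely a binning artifact and says nothing about randperm itself.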
So before you claim non-uniformity, think about whether the test you are using that asserts non-uniformity might be flawed.
  3 Comments
Steven Lord on 15 Aug 2019
John is correct. As stated in the histogram documentation page, "Each bin includes the left edge, but does not include the right edge, except for the last bin which includes both edges."
Before John added that last bin edge at 57, the last bin was [55, 56] and the next-to-last bin was [54, 55). So the last bin counted two distinct values from the data.
After John added that last bin edge at 57, the last bin is [56, 57] and the next-to-last bin is [55, 56). Each of the last two bins now counts only one distinct value from the data.
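Another way to sidestep the closed last bin is to center the bins on the integers, so that no data value ever sits on a bin edge. The sketch below is one possible approach rather than anything proposed in the thread: it uses half-integer edges with histcounts, compares the result with an exact tally from accumarray, and then plots with the 'integers' bin method. The names edges, countsCentered and countsExact are illustrative.

```matlab
% Rebuild the question's data: 10000 draws of 6 distinct integers from 1:56.
nDraws = 10000;
perms = zeros(nDraws, 6);
for p = 1:nDraws
    perms(p, :) = randperm(56, 6);
end

% Half-integer edges give each integer k its own bin from k-0.5 to k+0.5.
edges = 0.5:1:56.5;
countsCentered = histcounts(perms(:), edges);

% Exact per-value tally for comparison (no binning involved).
countsExact = accumarray(perms(:), 1, [56 1]).';

% Both ways of counting should agree value by value.
disp(isequal(countsCentered, countsExact))

% For plotting, the 'integers' bin method requests unit-width bins
% centered on integers.
histogram(perms(:), 'BinMethod', 'integers')
```

With bins centered this way, the rule about which edge a bin includes no longer matters for integer data.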
AbioEngineer on 15 Aug 2019
Yep! Thank you for the answer and comment! I can't believe I forgot to set the bin size properly. I remember back in R2014 there was an issue with random integers that I had to work around more cleverly, and I thought my current problem tessellated with that old one.


More Answers (1)

AbioEngineer on 15 Aug 2019
I'm an idiot, there were 55 bins in the above image... changing h.NumBins = 56 solves the problem.
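For reference, a sketch of that fix on a histogram handle. The handle name h matches the post; everything else is illustrative. After the bin count is changed the edges are generally no longer integers, so the last lines check, as a sanity test rather than a guarantee, that each of the values 1 through 56 lands in a different bin.

```matlab
% Rebuild the question's data: 10000 draws of 6 distinct integers from 1:56.
nDraws = 10000;
perms = zeros(nDraws, 6);
for p = 1:nDraws
    perms(p, :) = randperm(56, 6);
end

% Plot with default binning, then ask for 56 bins on the same object.
h = histogram(perms(:));
h.NumBins = 56;

% Sanity check: map each integer 1:56 to its bin index using the
% histogram's current edges and count how many distinct bins are used.
binIndex = discretize(1:56, h.BinEdges);
disp(numel(unique(binIndex)))
```

If the displayed number is 56, every integer has its own bin and the bar heights can be compared directly.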

Release

R2018a
