Distinguishing two probability distributions with a finite number of data points

See if two probability distributions can be distinguished with a finite number of data points.
Updated 19 Dec 2017

This algorithm determines the likelihood that two probability distributions can be distinguished with only a finite number of data points. If the data are distributed according to True_distribution, what is the probability that scientists interpreting the results in the context of Null_distribution will find a statistically significant result at the confidence level Confidence_level (e.g. 0.99)? A one-tailed test is assumed, so this is not a blind search for deviations from the null hypothesis but a search for deviations in the direction indicated by the alternative. For the purposes of this code, I assume the alternative is actually correct, so the N data points are distributed according to it. These data are then interpreted according to the null hypothesis, and the code records the probability that the observed value falls in the (1 - Confidence_level) tail of the null hypothesis, on the side predicted by the alternative.
To use the function, create two probability distributions with the same (not necessarily fixed-width) bins. This requires creating two vectors of the same length whose elements correspond to the same intervals in the observable x, e.g. 800 bins covering -4 to 4 in increments of 0.01. The Null_distribution vector must contain the probability of x falling in each bin according to the null hypothesis. At the same index, the True_distribution vector must contain the probability of x falling IN THE SAME BIN according to the alternative model. The input probability distributions do not have to sum to 1 (for example, if the bins do not cover the full range of x), but they must be normalised correctly - each element must actually equal the probability of x lying in that bin under the relevant model, not merely be proportional to it. The other arguments are the total number of data points and the confidence level beyond which the null hypothesis will be rejected.
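As a concrete illustration of this setup, the sketch below builds the two binned Gaussians from the screenshot example (dispersion 1, means 0 and 0.2) over 800 bins covering -4 to 4. The function name and argument order in the final, commented-out call are assumptions for illustration only - check the actual file in this package before running.

x_edges = -4:0.01:4;                                 % 801 bin edges, i.e. 800 bins
x_lo = x_edges(1:end-1);
x_hi = x_edges(2:end);

gauss_cdf = @(x, mu, sigma) 0.5*(1 + erf((x - mu)./(sigma*sqrt(2))));   % base MATLAB, no toolbox needed

% Probability of x falling in each bin under each model (actual bin-integrated
% probabilities, not just values proportional to them).
Null_distribution = gauss_cdf(x_hi, 0,   1) - gauss_cdf(x_lo, 0,   1);
True_distribution = gauss_cdf(x_hi, 0.2, 1) - gauss_cdf(x_lo, 0.2, 1);

N_data           = 200;              % total number of data points
Confidence_level = 0.99;             % confidence level for rejecting the null

% Hypothetical call - substitute the actual function name from this package:
% Probability_grid = Distinguish_distributions(Null_distribution, ...
%     True_distribution, N_data, Confidence_level);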

The idea is to use binomial statistics to determine the probability distribution of Q, the number of times the variable x falls in a particular range of values, according to the null hypothesis. The range of Q values inconsistent with the null is then determined, on the side facing the alternative's prediction. Finally, the total probability of Q falling in this range is calculated, assuming the alternative to be correct.
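As an illustration of this calculation (not the code itself), the following sketch works through a single candidate range of x, assuming the alternative predicts a higher incidence there; the actual code also handles the opposite case and switches to a Gaussian approximation for large expected counts. It reuses the variables defined in the sketch above.

% Binomial probability P(Q = q) for N trials with success probability p,
% written with gammaln so no toolbox is needed and large N does not overflow.
binom_pmf = @(q, N, p) exp(gammaln(N+1) - gammaln(q+1) - gammaln(N-q+1) ...
    + q.*log(p) + (N-q).*log(1-p));

i = 500;  j = 800;                                   % example bin range (x from roughly 1 to 4)
p_null = sum(Null_distribution(i:j));                % P(x in range | null)
p_true = sum(True_distribution(i:j));                % P(x in range | alternative)

% Rejection region under the null: smallest count Q_crit such that
% P(Q >= Q_crit | null) <= 1 - Confidence_level.
Q = 0:N_data;
P_Q_null  = binom_pmf(Q, N_data, p_null);
tail_prob = flip(cumsum(flip(P_Q_null)));            % P(Q >= q | null) for each q
Q_crit    = Q(find(tail_prob <= 1 - Confidence_level, 1, 'first'));
% (if no such count exists, this range cannot reject the null at this confidence)

% Chance that data drawn from the alternative land in this rejection region,
% i.e. the probability that the test succeeds for this choice of range.
P_Q_true  = binom_pmf(Q, N_data, p_true);
P_success = sum(P_Q_true(Q >= Q_crit));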

The output is a grid of probabilities that the test succeeds in showing the null to be wrong in a way that favours the alternative. Only half of the results table is filled. For j > i, the entry (i, j) corresponds to the range of x values covered by bins i through j. If j = i, the probability corresponds to focusing on how many times x falls in the range covered by bin i alone, compared with how many times it is expected to do so according to the null hypothesis. Generally, including more bins improves the statistics, but including too many bins means nearly all the data fall within the specified range under either model, destroying the contrast between them. Thus, to distinguish e.g. two normal distributions with dispersion 1 and means of 0 and 0.2 using only 200 data points, it is not good to focus on the range 0 to 0.001, but neither is it good to focus on the range -10 to 10. Something like -3 to 0 will probably work best.
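Once the grid is returned, the most promising range can be read off by locating its maximum. The sketch below assumes the output is called Probability_grid (as in the hypothetical call above), with rows indexing the first bin i and columns the last bin j, and with unfilled entries set to zero or NaN - adjust to match the actual output of the function.

[best_P, idx]    = max(Probability_grid(:));         % NaNs are ignored by max
[i_best, j_best] = ind2sub(size(Probability_grid), idx);
x_best_min = x_edges(i_best);                        % lower edge of the best range
x_best_max = x_edges(j_best + 1);                    % upper edge of the best range
fprintf('Best range: %.2f to %.2f, detection probability %.3f\n', ...
    x_best_min, x_best_max, best_P);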

Without the Statistics and Machine Learning Toolbox, please create an additional argument Threshold_sigmas and delete the lines from 'Two_CL_minus_1 = ..' to 'Threshold_sigmas = x;'. In this case, determine the value of Threshold_sigmas yourself, e.g. 2.3263 for 99% confidence (one-tailed test). This is just the number k of standard deviations such that the probability of a normal variable lying below Mean + k*Sigma equals Confidence_level. For other confidence levels, the appropriate value can readily be looked up online. Alternatively, delete the block of code inside the 'elseif Mean_null > Gaussian_approx_min && Mean_null < Gaussian_approx_max && Mean_true > Gaussian_approx_min && Mean_true < Gaussian_approx_max' branch. That branch uses a Gaussian approximation to the binomial distribution when the numbers are sufficiently large. However, this is not strictly necessary - the standard binomial coefficients can of course be used instead, as is done for smaller numbers.
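If the toolbox is unavailable, one way to compute Threshold_sigmas for any confidence level is via the inverse error function, which is part of base MATLAB; the 'Two_CL_minus_1' variable named above suggests the original code takes a similar route. This is just a sketch of the relationship, not the deleted lines themselves.

Confidence_level = 0.99;

% Number of standard deviations k such that a normal variable lies below
% Mean + k*Sigma with probability Confidence_level (one-tailed).
Threshold_sigmas = sqrt(2) * erfinv(2*Confidence_level - 1);   % = 2.3263 for 0.99

% With the Statistics and Machine Learning Toolbox, the equivalent would be:
% Threshold_sigmas = norminv(Confidence_level);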

The screenshot shows an example where two Gaussians with dispersions of 1 and means of 0 and 0.2 are compared. With 200 data points, there is a fair chance of being able to distinguish them at the 99% confidence level. The plot title shows the best range to focus on. Generally, there are two options: at very low values of x the null distribution is favoured due to its lower mean, while at very high values the alternative is favoured. Either choice leads to a large difference between the theories in the rate at which x falls in that range. However, this may look substantially different for strongly asymmetric, far-from-Gaussian distributions.

The overall idea is to forecast how future statistical investigations are likely to play out without actually doing them, thus helping to plan them.

Cite As

Indranil Banik (2025). Distinguishing two probability distributions with a finite number of data points (https://uk.mathworks.com/matlabcentral/fileexchange/65465-distinguishing-two-probability-distributions-with-a-finite-number-of-data-points), MATLAB Central File Exchange. Retrieved .

MATLAB Release Compatibility
Created with R2017b
Compatible with any release
Platform Compatibility
Windows macOS Linux

Version Published Release Notes
2.0.0.0

Slight update to the description.

1.0.0.0

Updated description slightly.
