I recommend a non-parametric density estimation method, since you don't seem to have a particular distribution in mind but may have a large number of data samples. One of the data types you mention (time in milliseconds) is clearly continuous, and the histogram, which you already brought up, is one way to estimate the probability density function (PDF) of a continuous random variable non-parametrically. However, Kernel Density Estimation (KDE) can be superior, as you'll see below.
The second type of data you mention is discrete (the number of letters in a sequence). Kernel density estimation can still be useful there, and can be thought of as a smoothing technique when there aren't enough samples for every value of the discrete variable.
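As a rough sketch of that discrete case, you could fit a KDE to the counts and evaluate it only at the integer values of interest; the counts array below is hypothetical data standing in for your sequence lengths, and the normalization step simply turns the density values into a smoothed probability mass function:

import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical sample of sequence lengths (number of letters)
counts = np.array([3, 4, 4, 5, 5, 5, 6, 7, 9])[:, np.newaxis]

# Fit a KDE and evaluate it at the integer values of interest
kde = KernelDensity(kernel='gaussian', bandwidth=1.0).fit(counts)
values = np.arange(0, 15)[:, np.newaxis]
weights = np.exp(kde.score_samples(values))

# Normalize so the values sum to 1, giving a smoothed probability mass function
pmf = weights / weights.sum()
print(dict(zip(values.ravel(), pmf.round(3))))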
Density Calculation
The following example shows how to generate data samples from a mixture of two Gaussian distributions and then use kernel density estimation to estimate the probability density function:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.neighbors import KernelDensity

# Generate random samples from a mixture of 2 Gaussians
# with modes at 5 and 10
df = np.concatenate((5 + np.random.randn(10, 1),
                     10 + np.random.randn(30, 1)))

# Plot the true distribution
x = np.linspace(0, 16, 1000)[:, np.newaxis]
norm_vals = norm.pdf(x, 5, 1) * 0.25 + norm.pdf(x, 10, 1) * 0.75
plt.plot(x, norm_vals)

# Plot the data using a normalized histogram
plt.hist(df, 50, density=True)

# Do kernel density estimation
kd = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(df)

# Plot the estimated density
kd_vals = np.exp(kd.score_samples(x))
plt.plot(x, kd_vals)

# Show the plots
plt.show()
In the resulting plot, the true distribution is shown in blue, the histogram in green, and the PDF estimated with KDE in red:
As you can see, the PDF produced by the histogram is not very informative in this case, while KDE provides a much better estimate. With a larger number of data samples and an appropriate choice of bin size, however, the histogram can also yield a good estimate.
In the case of KDE, the parameters you can tune are the kernel and the bandwidth. You can think of the kernel as the building block of the estimated PDF, and Scikit-Learn supports several kernel functions: gaussian, tophat, epanechnikov, exponential, linear, and cosine. Changing the bandwidth lets you adjust the bias-variance trade-off: a larger bandwidth increases the bias, which is helpful when you have fewer data samples; a smaller bandwidth increases the variance (fewer samples effectively contribute to each estimate), but gives a better estimate when more samples are available.
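As a minimal illustration of this trade-off, reusing the df and x arrays from the example above, the following sketch overlays estimates obtained with a few different bandwidths so the smoothing effect is visible directly:

# Compare KDE estimates for several bandwidths (assumes df and x from the example above)
for bw in [0.2, 0.75, 2.0]:
    kd_bw = KernelDensity(kernel='gaussian', bandwidth=bw).fit(df)
    plt.plot(x, np.exp(kd_bw.score_samples(x)), label='bandwidth = %.2f' % bw)
plt.legend()
plt.show()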
Probability Calculation
A probability is obtained from a PDF by integrating it over a range of values. As you may have noticed, this means the probability of any single exact value is 0.
Scikit-Learn does not appear to provide a built-in function for computing such a probability. However, it is easy to approximate the integral of the PDF over a range: evaluate the PDF at a number of points across the range, multiply each value by the step size between evaluation points, and sum the results. In the example below, K evaluation points are used with step size step:
# Get probability for range of values
start = 5 # Start of the range
stop = 6 # End of the range
K = 100 # Number of evaluation points
step = (stop - start) / (K - 1) # Step size
x = np.linspace(start, stop, K)[:, np.newaxis] # Generate values in the range
kd_vals = np.exp(kd.score_samples(x)) # Get PDF values for each x
probability = np.sum(kd_vals * step) # Approximate the integral of the PDF
print(probability)
Please keep in mind that kd.score_samples returns the log of the probability density at the given points. As a result, np.exp is required to obtain the density itself.
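For instance, to obtain the density value at a single point from the model fitted above:

# score_samples returns log-density values, so exponentiate to get the density
log_density = kd.score_samples(np.array([[5.0]]))  # log of the density at 5
density = np.exp(log_density)                       # density at 5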
The same calculation can be done with SciPy's built-in integration methods, which yields a somewhat more accurate result:
from scipy.integrate import quad
# quad passes a scalar to the integrand, while score_samples expects a 2D array,
# so the value is reshaped before evaluation
probability = quad(lambda v: np.exp(kd.score_samples(np.array([[v]])))[0], start, stop)[0]
For one run, the first method gave a probability of 0.0859024655305, while the second method gave 0.0850974209996139.