How to compute the probability of a value given a list of samples from a distribution in Python

0 votes

Not sure if this belongs in statistics, but I am trying to use Python to achieve this. I essentially just have a list of integers:

data = [300,244,543,1011,300,125,300 ... ]

And I would like to know the probability of a value occurring given this data. I graphed histograms of the data using matplotlib and obtained these:

[Figure: histogram of sequence lengths (number of characters)]

[Figure: histogram of measured times (milliseconds)]

In the first graph, the numbers represent the number of characters in a sequence. In the second graph, they are a measured amount of time in milliseconds. The minimum is greater than zero, but there isn't necessarily a maximum. The graphs were created using millions of examples, but I'm not sure I can make any other assumptions about the distribution. I want to know the probability of a new value given that I have a few million examples of values. In the first graph, I have a few million sequences of different lengths, and I would like to know the probability of a sequence of length 200, for example.

I know that for a continuous distribution the probability of any exact point is supposed to be zero, but given a stream of new values, I need to be able to say how likely each value is. I've looked through some of the numpy/scipy probability density functions, but I'm not sure which to choose or how to query for new values once I run something like scipy.stats.norm.pdf(data). It seems that different probability density functions will fit the data differently, and given the shape of the histograms I'm not sure how to decide which to use.

Mar 21 in Machine Learning by Dev
• 6,000 points

1 answer to this question.

0 votes

I recommend a non-parametric density estimation method, because you don't appear to have a specific distribution in mind but do have a large number of data samples. One of the data types you describe (time in milliseconds) is clearly continuous, and the histogram, which you have already used, is one method for non-parametric estimation of a probability density function (PDF) for a continuous random variable. Kernel Density Estimation (KDE) can, however, give a better estimate, as you'll see below.

The other type of data you mention (number of characters in a sequence) is discrete. Kernel density estimation can still be useful here: it can be thought of as a smoothing technique for cases where there aren't enough samples for every value of the discrete variable.
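For the discrete case (sequence lengths), with millions of samples the simplest estimate is the relative frequency of each observed value. The following is a minimal sketch, assuming a short made-up data list in place of the real one; empirical_probability is a hypothetical helper name, not a library function:

```python
from collections import Counter

# Hypothetical stand-in for the real list of sequence lengths
data = [300, 244, 543, 1011, 300, 125, 300]

counts = Counter(data)   # value -> number of occurrences
total = len(data)


def empirical_probability(value):
    """Fraction of observed samples equal to `value` (0.0 if never seen)."""
    return counts[value] / total


print(empirical_probability(300))  # 3 of the 7 samples are 300
print(empirical_probability(200))  # 200 never occurs in this small sample
```

A value that never occurs in the sample gets probability 0 under this estimate, which is the situation where KDE-style smoothing may be preferable.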

Density Calculation

The following example draws data samples from a mixture of two Gaussian distributions, then uses kernel density estimation to estimate the probability density function:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.neighbors import KernelDensity

# Generate random samples from a mixture of 2 Gaussians
# with modes at 5 and 10 (10 and 30 samples, i.e. weights 0.25 and 0.75)
df = np.concatenate((5 + np.random.randn(10, 1),
                     10 + np.random.randn(30, 1)))

# Plot the true distribution
# (matplotlib.mlab.normpdf was removed from Matplotlib; scipy.stats.norm.pdf replaces it)
x = np.linspace(0, 16, 1000)[:, np.newaxis]
norm_vals = norm.pdf(x, 5, 1) * 0.25 + norm.pdf(x, 10, 1) * 0.75
plt.plot(x, norm_vals, color='blue')

# Plot the data using a normalized histogram
# (density=True replaces the removed normed=True argument)
plt.hist(df, 50, density=True, color='green')

# Do kernel density estimation
kd = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(df)

# Plot the estimated density
kd_vals = np.exp(kd.score_samples(x))
plt.plot(x, kd_vals, color='red')

# Show the plots
plt.show()

The true distribution is presented in blue, the histogram is shown in green, and the PDF calculated using KDE is shown in red in the following plot:


As you can see, the PDF estimated by the histogram isn't particularly informative in this case, whereas KDE provides a far better estimate. A histogram, on the other hand, might yield a good estimate with a larger number of data samples and a well-chosen bin size.

In the case of KDE, the kernel and the bandwidth are the parameters that can be tuned. The kernel can be thought of as the building block of the estimated PDF, and scikit-learn includes multiple kernel functions: gaussian, tophat, epanechnikov, exponential, linear, and cosine. Changing the bandwidth adjusts the bias-variance trade-off. A larger bandwidth gives a smoother estimate with more bias, which is helpful if you have fewer data samples. A smaller bandwidth increases variance (fewer samples contribute to the estimate at each point), but gives a better estimate when more samples are available.
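One common way to choose the bandwidth, which is not part of the original example, is to pick the value with the best cross-validated log-likelihood. The sketch below assumes scikit-learn's GridSearchCV, which can score a KernelDensity through its score method; the candidate grid, the random seed, and the number of folds are arbitrary choices made for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KernelDensity

# Illustrative data: the same kind of two-Gaussian mixture as above
rng = np.random.default_rng(0)
samples = np.concatenate((5 + rng.standard_normal((10, 1)),
                          10 + rng.standard_normal((30, 1))))

# Candidate bandwidths (an arbitrary grid chosen for this sketch)
param_grid = {'bandwidth': np.linspace(0.2, 2.0, 10)}

# Shuffle the folds because the samples are ordered by mixture component
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# With no scoring argument, GridSearchCV uses KernelDensity.score,
# i.e. the total log-likelihood of the held-out samples
search = GridSearchCV(KernelDensity(kernel='gaussian'), param_grid, cv=folds)
search.fit(samples)

best_bandwidth = search.best_params_['bandwidth']
kd_cv = search.best_estimator_  # refitted on all samples with the best bandwidth
print(best_bandwidth)
```

With only 40 samples the selected bandwidth can vary noticeably between runs and seeds, so treat the result as a starting point and check it against a plot.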

Probability Calculation

For a PDF, a probability is obtained by integrating the density over a range of values. As you noted, this gives a probability of 0 for any single exact value, so in practice you ask for the probability of a small interval around that value.
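In symbols, this is the standard definition for a continuous random variable X with density f; the names f, a, and b are notation introduced here for illustration:

```latex
P(a \le X \le b) = \int_a^b f(x)\,dx,
\qquad
P(X = a) = \int_a^a f(x)\,dx = 0 .
```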

scikit-learn does not appear to provide a built-in function for computing this probability. However, the integral of the PDF over a range is easy to approximate: evaluate the PDF at several points across the range and sum the results, each multiplied by the step size between evaluation points. In the example below, K evaluation points are used, with spacing step between them.

# Get probability for a range of values
start = 5  # Start of the range
stop = 6   # End of the range
K = 100    # Number of evaluation points
step = (stop - start) / (K - 1)  # Step size between evaluation points
x = np.linspace(start, stop, K)[:, np.newaxis]  # Generate values in the range
kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
# Approximate the integral of the PDF with a simple Riemann sum.
# All K points are weighted by step, so the sum spans slightly more
# than [start, stop] and tends to slightly overestimate the integral.
probability = np.sum(kd_vals * step)

Note that kd.score_samples returns the log of the estimated density at each point, so np.exp is needed to recover the density values.

The same calculation can be done with SciPy's built-in integration routines, which yield a somewhat more accurate result:

from scipy.integrate import quad
# quad passes a scalar, while score_samples expects a 2D array of shape (n_samples, n_features)
probability = quad(lambda v: np.exp(kd.score_samples([[v]]))[0], start, stop)[0]

For one run, the probability computed by the first method was 0.0859024655305, while the probability computed by the second method was 0.0850974209996139.
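The two approaches can also be wrapped in a small reusable helper. The sketch below uses the trapezoidal rule, a swapped-in refinement of the Riemann sum rather than the method used above; interval_probability is a hypothetical helper name, and the data, seed, and number of grid points are assumptions made for illustration:

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def interval_probability(kde, start, stop, num_points=101):
    """Approximate P(start <= X <= stop) under a fitted KernelDensity
    with the trapezoidal rule on num_points evenly spaced points."""
    grid = np.linspace(start, stop, num_points)
    density = np.exp(kde.score_samples(grid[:, np.newaxis]))
    widths = np.diff(grid)
    # Average the density at both ends of each sub-interval
    return float(np.sum(widths * (density[:-1] + density[1:]) / 2.0))


# Illustrative data: the same kind of two-Gaussian mixture as in the answer
rng = np.random.default_rng(0)
samples = np.concatenate((5 + rng.standard_normal((10, 1)),
                          10 + rng.standard_normal((30, 1))))
kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(samples)

print(interval_probability(kde, 5, 6))             # probability of the range [5, 6]
print(interval_probability(kde, -20, 40, 601))     # whole support, should be close to 1
```

Integrating over a range wide enough to cover all of the data is a quick sanity check: the result should be very close to 1. For integer-valued data such as sequence lengths, one possible convention (an assumption, not something the data dictates) is to integrate over a unit-width window centered on the value, for example from 199.5 to 200.5 for a length of 200.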

answered Mar 23 by Nandini
• 5,480 points
