How to estimate number of clusters through EM in scikit-learn

+1 vote

I am trying to implement the cluster estimation method using EM found in Weka, more precisely the following description:

The cross validation performed to determine the number of clusters is done in the following steps:

  1. the number of clusters is set to 1
  2. the training set is split randomly into 10 folds.
  3. EM is performed 10 times using the 10 folds the usual CV way.
  4. the loglikelihood is averaged over all 10 results.
  5. if loglikelihood has increased the number of clusters is increased by 1 and the program continues at step 2.

My current implementation is as follows:

def estimate_n_clusters(X):
   "Find the best number of clusters through maximization of the log-likelihood from EM."
   last_log_likelihood = None
   kf = KFold(n_splits=10, shuffle=True)
   components = range(50)[1:]
   for n_components in components:
       gm = GaussianMixture(n_components=n_components)

       log_likelihood_list = []
       for train, test in kf.split(X):
           gm.fit(X[train, :])
           if not gm.converged_:
               raise Warning("GM not converged")
           log_likelihood = np.log(-gm.score_samples(X[test, :]))

           log_likelihood_list += log_likelihood.tolist()

       avg_log_likelihood = np.average(log_likelihood_list)

       if last_log_likelihood is None:
           last_log_likelihood = avg_log_likelihood
       elif avg_log_likelihood+10E-6 <= last_log_likelihood:
           return n_components
       last_log_likelihood = avg_log_likelihood

Weka and my function give a similar number of clusters. However, when I use the number of clusters n_clusters estimated by the function:

gm = GaussianMixture(n_components=n_clusters).fit(X)
print(np.log(-gm.score(X)))

the result is NaN, because -gm.score(X) is negative (about -2500), whereas Weka reports Log likelihood: 347.16447.

My guess is that the likelihood mentioned in step 4 of the Weka description is not the same quantity as the one returned by the function score_samples().
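One way to check this guess is to look at what scikit-learn returns. The sketch below uses synthetic data (an assumption, not the data from the question) to show that score_samples() gives the log-likelihood of each sample, that score() is their mean, and that taking np.log of a negative value produces NaN. How Weka normalizes its reported figure is not verified here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic two-blob data, used only for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(6.0, 1.0, size=(200, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)

per_sample = gm.score_samples(X)   # log-likelihood of each sample
mean_ll = gm.score(X)              # average log-likelihood per sample

# score() is the mean of score_samples(); both are already in log space.
same = np.isclose(mean_ll, per_sample.mean())

# Taking a logarithm again fails whenever the argument is negative.
with np.errstate(invalid="ignore"):
    log_of_negative = np.log(-abs(mean_ll))   # NaN

print(same, log_of_negative)
```

Since the values are already logarithms, they can be averaged and compared across component counts directly.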

Can anyone tell me where I am going wrong?

Thanks

Sep 26, 2018 in Python by bug_seeker
• 14,970 points
33 views

1 answer to this question.

0 votes

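score_samples() already returns the log-likelihood of each sample, so it needs neither np.log() nor a sign flip; higher values mean a better fit. For future reference, the fixed function looks like:

import warnings

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold


def estimate_n_clusters(X):
    """Find the best number of clusters by maximizing the cross-validated log-likelihood from EM."""
    last_log_likelihood = None
    kf = KFold(n_splits=10, shuffle=True)
    for n_components in range(1, 50):
        gm = GaussianMixture(n_components=n_components)

        log_likelihood_list = []
        for train, test in kf.split(X):
            gm.fit(X[train, :])
            if not gm.converged_:
                warnings.warn("GM not converged")
            # Per-sample log-likelihoods of the held-out fold, used as-is.
            log_likelihood = gm.score_samples(X[test, :])

            log_likelihood_list += log_likelihood.tolist()

        avg_log_likelihood = np.average(log_likelihood_list)
        print(avg_log_likelihood)

        # Stop once the average log-likelihood no longer increases and
        # return the previous component count, which scored best.
        if last_log_likelihood is not None and avg_log_likelihood + 10E-6 <= last_log_likelihood:
            return n_components - 1
        last_log_likelihood = avg_log_likelihood

    # Upper bound reached without the log-likelihood dropping.
    return n_components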
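
As a possible alternative, the inner fold loop can be delegated to cross_val_score, which calls the estimator's own score() method (the average log-likelihood per sample) on each held-out fold when no scoring argument is given. This is only a sketch: the function name estimate_n_clusters_cv and the parameters max_components, tol and random_state are hypothetical, the demo data is synthetic, and averaging the per-fold means can differ slightly from averaging over all samples when the folds are not the same size.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold, cross_val_score


def estimate_n_clusters_cv(X, max_components=10, tol=1e-5, random_state=0):
    """Return the last component count before the CV log-likelihood stops rising."""
    kf = KFold(n_splits=10, shuffle=True, random_state=random_state)
    last_score = None
    for n_components in range(1, max_components + 1):
        gm = GaussianMixture(n_components=n_components, random_state=random_state)
        # With scoring=None, cross_val_score uses gm.score on each held-out fold.
        fold_scores = cross_val_score(gm, X, cv=kf)
        avg_score = fold_scores.mean()
        if last_score is not None and avg_score + tol <= last_score:
            return n_components - 1
        last_score = avg_score
    # Upper bound reached without the score dropping.
    return max_components


# Synthetic demo data: two well-separated blobs (illustration only).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.5, size=(150, 2)),
                    rng.normal(5.0, 0.5, size=(150, 2))])
n_est = estimate_n_clusters_cv(X_demo)
print(n_est)
```

Fixing random_state makes the fold split and the EM initialization repeatable, so repeated runs on the same data give the same estimate.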
answered Sep 26, 2018 by Priyaj
• 56,120 points
