Naive Bayes Classifier for bag of vectorized sentences

0 votes

Summary: How to train a Naive Bayes Classifier on a bag of vectorized sentences?

Example:

X_train[0] = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]]
y_train[0] = 1
X_train[1] = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]
y_train[1] = 0


1) Context of the project: sentiment analysis on batches of tweets for market prediction

I am working on sentiment analysis for stock market classification. As I am new to these techniques, I tried to replicate one from this article:

But I am facing a big issue with it. Here are the main steps of the article that I have completed so far:

  1. I collected a large number of tweets over 4 months (7 million)

  2. I cleaned them

  3. I grouped them into 1-hour intervals

  4. I created a target that tells whether the price of Bitcoin went down or up after one hour (0 = down ; 1 = up)

What I need to do next is train my Bernoulli Naive Bayes model on this data. To do this, the article describes how to vectorize the tweets:

[images from the article, not available]

I did this with the CountVectorizer class from sklearn.
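For reference, here is a minimal sketch of that vectorization step. The tweets are made up for illustration, and `binary=True` is an assumption on my part: it makes CountVectorizer emit 0/1 presence flags rather than word counts, which matches the binary vectors described in this question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned tweets from a single 1-hour interval.
tweets = ["i like bitcoin", "nice portfolio", "bitcoin is going up"]

# binary=True records presence/absence (0 or 1) instead of word counts.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(tweets).toarray()

print(X.shape)  # (number of tweets, vocabulary size)
print(X.max())  # no entry exceeds 1
```

Each interval therefore yields a 2D array with one row per tweet, which is where the shape problem described next comes from.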

2) The issue: the dimension of the inputs doesn't match what Naive Bayes expects

But I run into an issue when I try to fit the Bernoulli Naive Bayes model following the article's method:

[image from the article, not available]

So one observation is shaped this way:

  • input shape (one observation): (nb_tweets_on_this_1hour_interval, vocabulary_size = 10 000)
    one_observation_input = [
        [0, 1, 0 ....., 0, 0],  # Tweet 1 vectorized
        ....,
        [1, 0, ....., 1, 0]     # Tweet N vectorized
    ]  # All of the values are 0 or 1

  • output shape (one observation): (1,)
    one_observation_output = [0]  # Can only be 0 or 1

When I try to fit my sklearn Bernoulli Naive Bayes model with this type of value, I get this error:

>>>  ValueError: Found array with dim 3. Estimator expected <= 2.

Indeed, the model expects each observation to be a single binary vector:

  • input: (nb_features,)
    ex: [0, 0, 1, 0, ...., 1, 0, 1]

while I am giving it a list of binary vectors (one per tweet) for each observation!
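The mismatch can be seen with NumPy alone. This is only an illustration on a toy array (4 features instead of 10 000): sklearn estimators treat each row of a 2D array as one sample, so wrapping a whole interval in an extra list produces a 3D array.

```python
import numpy as np

# One interval as described above: 3 tweets, each a binary vector of 4 features.
one_observation = np.array([[0, 1, 0, 0],
                            [1, 0, 0, 1],
                            [0, 0, 0, 1]])

print(one_observation.ndim)                 # 2 -> (n_tweets, vocabulary_size)
print(np.asarray([one_observation]).ndim)   # 3 -> (1, n_tweets, vocabulary_size)
```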

3) What I have tried

So far, I have tried several things to resolve this:

  • Assigning the interval's label to every tweet individually, but the results are not good since the tweets are really noisy

  • Flattening the inputs so the shape of one input is (nb_tweets_on_this_1hour_interval*vocabulary_size, ). But the model cannot train, as the number of tweets per hour is not constant.
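One further option, which is not taken from the article and is only a sketch under my own assumptions, is to pool each interval into a single fixed-length binary row (a word is 1 if it appears in any tweet of that hour). The helper name `pool_interval` is hypothetical, and the toy arrays reuse the example at the top of the question.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy data: two intervals with a different number of tweets each.
intervals = [
    np.array([[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]]),
    np.array([[0, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]),
]
targets = [1, 0]

def pool_interval(tweet_vectors):
    """Collapse (n_tweets, vocab) into one binary row of shape (1, vocab)."""
    return tweet_vectors.max(axis=0, keepdims=True)

bnb = BernoulliNB()
for tweet_vectors, target in zip(intervals, targets):
    pooled = pool_interval(tweet_vectors)              # always (1, vocab)
    bnb.partial_fit(pooled, [target], classes=[0, 1])  # one row, one label

print(pool_interval(intervals[0]).shape)
```

The row length no longer depends on the number of tweets, so the shapes match. The trade-off is that with thousands of tweets per hour most frequent words will be 1 in every interval, so this may discard much of the signal.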

4) Conclusion

I don't know if the error comes from my misunderstanding of the article or of Naive Bayes models.

How can I efficiently train a Naive Bayes classifier on a bag of tweets?

Here is my training code:

from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()
uniqueY = [0, 1]  # The 2 classes I want to classify the tweets with. This is needed for partial_fit

# I have to use a for loop to partially fit my Bernoulli Naive Bayes classifier to prevent out-of-memory issues
for _index, row in train_df.iterrows():
    # row["Tweet"] contains all the (cleaned) tweets of a 1-hour interval, this way:
    # ["I like Bitcoin", "Nice portfolio", ...., "I am the last tweet of the interval"]
    X_train = vectorizer.transform(row["Tweet"]).toarray()
    # X_train contains all of the row["Tweet"] tweets vectorized with a bag-of-words algorithm,
    # which returns this kind of data: [[0, 1, 0 ....., 0, 0], ...., [1, 0, ....., 1, 0]]
    y_train = row["target"]  # Target is 0 if the market goes down after the tweets and 1 if it goes up
    bnb.partial_fit([X_train], [y_train], uniqueY)

I use partial_fit to avoid out-of-memory issues.

Apr 7 in Blockchain by Soham
• 8,730 points

1 answer to this question.

0 votes

The error comes from [X_train], which adds an extra dimension to the input. In your code:

bnb.partial_fit([X_train], [y_train], uniqueY)  # The brackets around X_train are causing your error

BernoulliNB expects an array with at most TWO dimensions, and wrapping X_train in square brackets makes it three-dimensional instead.

If you pass X_train directly, the dimension error goes away. Note that partial_fit also expects one label per row of X_train, so the single interval label has to be repeated for each tweet:

bnb.partial_fit(X_train, [y_train] * X_train.shape[0], uniqueY)
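For a runnable check of the shapes involved, here is a minimal sketch on toy data. The arrays are hypothetical stand-ins for one interval of vectorized tweets; the point is that the features stay 2D and the label array has exactly one entry per row, which `partial_fit` requires.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical vectorized tweets for one interval, and that interval's label.
X_train = np.array([[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]])
y_train = 1

bnb = BernoulliNB()
# Repeat the interval label once per tweet so len(y_rows) == X_train.shape[0].
y_rows = np.full(X_train.shape[0], y_train)
bnb.partial_fit(X_train, y_rows, classes=[0, 1])

print(bnb.class_count_)  # number of training rows seen per class
```

This is equivalent to labelling every tweet with its interval's target, so it fixes the shape error but does not by itself address the noise in individual tweets.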

answered Apr 11 by Rahul
• 8,980 points
