Naive Bayes Classifier for bag of vectorized sentences


Summary: How to train a Naive Bayes Classifier on a bag of vectorized sentences?

Example:

X_train[0] = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]]
y_train[0] = 1
X_train[1] = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]
y_train[1] = 0


1) Context of the project: perform sentiment analysis on a batch of tweets to perform market prediction

I am working on sentiment analysis for stock market classification. As I am new to these techniques, I tried to replicate one from this article: http://cs229.stanford.edu/proj2015/029_report.pdf

But I am facing a big issue with it. Here are the main steps of the article that I have implemented so far:

  1. I collected a large number of tweets (7 million) over 4 months

  2. I cleaned them

  3. I grouped them into 1-hour intervals

  4. I created a target that tells if the price of the Bitcoin has gone down or up after one hour (0 = down ; 1 = up)
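Steps 3 and 4 could be sketched with pandas as follows. This is only an illustration under assumptions: the column names `timestamp`, `Tweet` and `target`, the toy tweets and the toy prices are all hypothetical, and a tie in price is counted as "down".

```python
import pandas as pd

# Hypothetical cleaned tweets with timestamps.
tweets = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-01 00:05", "2021-01-01 00:40",
        "2021-01-01 01:10", "2021-01-01 02:30",
    ]),
    "Tweet": ["i like bitcoin", "nice portfolio", "bad day", "bitcoin up"],
})

# Hypothetical Bitcoin price at the start of each hour.
price = pd.Series(
    [100.0, 105.0, 101.0, 103.0],
    index=pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 01:00",
        "2021-01-01 02:00", "2021-01-01 03:00",
    ]),
)

# Step 3: one row per 1-hour interval, holding the list of tweets in it.
hour = tweets["timestamp"].dt.floor("60min")
train_df = tweets.groupby(hour)["Tweet"].apply(list).to_frame()

# Step 4: target is 1 if the price one hour later is higher, else 0.
# The last price has no next-hour value; it is not used here because
# no tweets fall into that hour.
went_up = (price.shift(-1) > price).astype(int)
train_df["target"] = went_up.reindex(train_df.index)

print(train_df)
```

Each row of `train_df` then holds a variable-length list of tweets and a single 0/1 target, which is the structure the rest of the question works with.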

What I need to do next is train my Bernoulli Naive Bayes model on this data. To do this, the article describes a way of vectorizing the tweets (the two figures from the article that showed this step are missing here).

I did this with the CountVectorizer class from sklearn.
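For reference, a minimal sketch of this vectorization step. The toy tweets are hypothetical, and `binary=True` is an assumption made here so that the output contains only 0s and 1s, as in the data described below; the original vectorizer settings are not shown in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned tweets from a single 1-hour interval.
tweets = ["i like bitcoin", "nice portfolio", "bitcoin bitcoin up"]

# binary=True records presence/absence (0 or 1) instead of raw counts,
# which matches the Bernoulli event model.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(tweets).toarray()

# Columns of X follow the alphabetically sorted vocabulary.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X)
print(X.shape)  # (number of tweets, vocabulary size)
```

Note that the default tokenizer drops one-character tokens such as "i", so the vocabulary here has five words, and one interval yields a 2-D matrix with one row per tweet.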

2 ) The issue: the dimension of the inputs doesn't match Naive Bayes standards

But I run into an issue when I try to fit the Bernoulli Naive Bayes model following the article's method (the figure from the article illustrating it is missing here).

So, one observation is shaped this way:

  • input shape (one observation): (nb_tweets_on_this_1hour_interval, vocabulary_size = 10 000)

    one_observation_input = [
        [0, 1, 0 ....., 0, 0],  # Tweet 1 vectorized
        ....,
        [1, 0, ....., 1, 0]     # Tweet N vectorized
    ]  # All of the values are 0 or 1

  • output shape (one observation): (1,)
    one_observation_output = [0]  # Can only be 0 or 1

When I try to fit my sklearn Bernoulli Naive Bayes model with this type of value, I get this error:

>>>  ValueError: Found array with dim 3. Estimator expected <= 2.

Indeed, the model expects each observation to be a single binary vector:

  • input : (nb_features)
    ex: [0, 0, 1, 0, ...., 1, 0, 1]

while I am giving it a list of binary vectors (one per tweet) for each observation!
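To make the mismatch concrete, here is a small NumPy sketch using the first toy observation from the example at the top. NumPy is an assumption here; the original code does not show it being imported.

```python
import numpy as np

# Toy data: one interval with 3 tweets over a 4-word vocabulary.
one_interval = np.array([[0, 1, 0, 0],
                         [1, 0, 0, 1],
                         [0, 0, 0, 1]])

# A single interval is already 2-D: (n_tweets, vocabulary_size).
print(one_interval.ndim, one_interval.shape)  # 2 (3, 4)

# Treating the whole interval as ONE observation means wrapping it in a
# batch, which adds a third axis: (n_observations, n_tweets, vocabulary_size).
batch_of_one = np.array([one_interval])
print(batch_of_one.ndim, batch_of_one.shape)  # 3 (1, 3, 4)
```

scikit-learn estimators take a 2-D array of shape (n_samples, n_features), so the 3-D batch is what triggers the "Found array with dim 3" error.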

3 ) What I have tried

So far, I have tried several things to resolve this:

  • Associating the interval's label with every tweet, but the results are not good since the tweets are really noisy

  • Flattening the inputs so the shape of one input is (nb_tweets_on_this_1hour_interval*vocabulary_size, ). But the model cannot train because the number of tweets per hour is not constant.
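A third option, not taken from the article and offered only as a sketch, is to pool each interval into a single fixed-length binary vector, for example by marking a word as present if any tweet in the interval contains it. The helper name `pool_interval` is hypothetical.

```python
import numpy as np

def pool_interval(tweet_vectors):
    """Collapse a (n_tweets, vocabulary_size) binary matrix into one
    (vocabulary_size,) binary vector: 1 if any tweet contains the word."""
    return np.asarray(tweet_vectors).max(axis=0)

# Toy intervals with different numbers of tweets (from the example above).
X_train = [
    [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]],
    [[0, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]],
]

# Every interval now has the same length, whatever its number of tweets.
X_pooled = np.stack([pool_interval(x) for x in X_train])
print(X_pooled.shape)  # (2, 4)
print(X_pooled)
```

This gives a valid 2-D input with one row per interval, but with thousands of tweets per hour most common words will be 1 in every interval, so a plain "any" pooling may discard much of the signal.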

4 ) Conclusion

I don't know if the error comes from my misunderstanding of the article or of Naive Bayes models.

How can I efficiently train a Naive Bayes classifier on a bag of tweets?

Here is my training code:

from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()
# The 2 classes I want to classify the tweets with. This is needed for partial_fit.
uniqueY = [0, 1]

# I have to use a for loop to partially fit my Bernoulli Naive Bayes classifier
# to prevent out-of-memory issues.
for _index, row in train_df.iterrows():
    # row["Tweet"] contains all the (cleaned) tweets of a 1-hour interval, e.g.:
    # ["I like Bitcoin", "Nice portfolio", ...., "I am the last tweet of the interval"]
    # X_train contains all of the row["Tweet"] tweets vectorized with a bag-of-words
    # algorithm, which returns this kind of data:
    # [[0, 1, 0 ....., 0, 0], ...., [1, 0, ....., 1, 0]]
    X_train = vectorizer.transform(row["Tweet"]).toarray()
    # Target is 0 if the market goes down after the tweets and 1 if it goes up.
    y_train = row["target"]
    bnb.partial_fit([X_train], [y_train], uniqueY)

I use partial_fit to avoid out-of-memory issues.

Apr 7 in Blockchain by Soham
• 9,670 points
108 views

1 answer to this question.


The error comes from [X_train]: the extra brackets add a dimension to the input. In your code:

bnb.partial_fit([X_train], [y_train], uniqueY)  # the brackets around X_train cause your error

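BernoulliNB expects an array with only TWO dimensions, (n_samples, n_features), and putting X_train in square brackets makes it three-dimensional instead.

If you change your code to this, the dimension error should go away. Note that y must then contain one label per row of X_train, so the interval's target is repeated for every tweet:

bnb.partial_fit(X_train, [y_train] * len(X_train), uniqueY)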
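A self-contained sketch of the loop without the extra brackets is shown below. The toy `intervals` list is a hypothetical stand-in for `train_df`, and `binary=True` is an assumption. Because scikit-learn needs one label per row of X, the interval's target is repeated for each tweet; this removes the dimension error but is the per-tweet labelling the question already describes as noisy.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical stand-in for train_df: (tweets of one interval, target) pairs.
intervals = [
    (["i like bitcoin", "nice portfolio"], 1),
    (["bitcoin is crashing", "sell everything now", "bad day"], 0),
]

# The vocabulary is fixed before partial_fit so every batch has the same columns.
vectorizer = CountVectorizer(binary=True)
vectorizer.fit([tweet for tweets, _ in intervals for tweet in tweets])

bnb = BernoulliNB()
classes = [0, 1]  # all classes must be given on the first partial_fit call

for tweets, target in intervals:
    X_batch = vectorizer.transform(tweets)       # sparse, (n_tweets, vocab_size)
    y_batch = np.full(X_batch.shape[0], target)  # one label per tweet
    bnb.partial_fit(X_batch, y_batch, classes=classes)

pred = bnb.predict(vectorizer.transform(["i like my portfolio"]))
print(pred)
```

Keeping the batches sparse (no `.toarray()`) is a design choice that also helps with the memory concern, since BernoulliNB accepts sparse input.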

answered Apr 11 by Rahul
• 9,680 points
