Summary: How to train a Naive Bayes Classifier on a bag of vectorized sentences?

Example here :

    X_train = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]]
    y_train = 1

    X_train = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]
    y_train = 0


1) Context of the project: perform sentiment analysis on a batch of tweets to perform market prediction

I am working on sentiment analysis for stock market classification. As I am new to these techniques, I tried to replicate one from this article: http://cs229.stanford.edu/proj2015/029_report.pdf

But I am facing a big issue with it. Here are the main steps of the article that I have implemented so far:

1. I collected a huge amount of tweets over 4 months (7 million)

2. I cleaned them

3. I grouped them into period intervals of 1 hour

4. I created a target that tells if the price of the Bitcoin has gone down or up after one hour (0 = down ; 1 = up)
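Steps 3 and 4 could be sketched with pandas roughly as follows. This is only an illustration under stated assumptions: the column name `created_at`, the toy `prices` series, and the reading of "after one hour" as "price at the start of the next interval versus the start of this one" are all hypothetical, not taken from the article.

```python
import pandas as pd

# Hypothetical toy data: column names and values are assumptions for illustration.
tweets = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2021-01-01 10:05", "2021-01-01 10:40",
        "2021-01-01 11:15", "2021-01-01 12:30",
    ]),
    "Tweet": ["i like bitcoin", "nice portfolio", "selling everything", "to the moon"],
})
prices = pd.Series(
    [100.0, 105.0, 103.0, 104.0],
    index=pd.to_datetime([
        "2021-01-01 10:00", "2021-01-01 11:00",
        "2021-01-01 12:00", "2021-01-01 13:00",
    ]),
)

# Step 3: collect all tweets of each 1-hour interval into one list.
hour_start = tweets["created_at"].dt.floor("60min")
hourly = tweets.groupby(hour_start)["Tweet"].apply(list).to_frame()

# Step 4: target is 1 if the price one hour later is higher, else 0.
went_up = (prices.shift(-1) > prices).astype(int)
hourly["target"] = went_up.reindex(hourly.index)
```

Each row of `hourly` then holds a list of tweets and a single 0/1 target, which matches the `row["Tweet"]` and `row["target"]` layout used in the training loop of this question.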

The next step is to train my Bernoulli Naive Bayes model on this data. To do so, the article describes vectorizing the tweets this way: [.....] I did this with the `CountVectorizer` class from sklearn.
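For reference, a binary bag-of-words encoding with `CountVectorizer` might look like the sketch below. The toy corpus is made up, and `max_features=10_000` is only an assumption chosen to mirror the vocabulary size mentioned in this question; the article's exact settings are not known here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned tweets (illustrative only).
corpus = ["i like bitcoin", "nice portfolio", "bitcoin bitcoin to the moon"]

# binary=True records presence (1) or absence (0) instead of word counts,
# which matches the Bernoulli event model. max_features caps the vocabulary.
vectorizer = CountVectorizer(binary=True, max_features=10_000)
vectorizer.fit(corpus)

# One row per tweet, one column per vocabulary word.
X = vectorizer.transform(["bitcoin bitcoin moon", "nice"]).toarray()
```

Note that the default tokenizer drops single-character tokens such as "i", and that `BernoulliNB` also binarizes its input by default (`binarize=0.0`), so `binary=True` mainly makes the intent explicit.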

2 ) The issue: the dimension of the inputs doesn't match Naive Bayes standards

But then I run into an issue when I try to fit the Bernoulli Naive Bayes model following the article's method. One observation is shaped this way:

• input shape (one observation): (nb_tweets_on_this_1hour_interval, vocabulary_size= 10 000)

    one_observation_input = [
        [0, 1, 0 ....., 0, 0],  # Tweet 1 vectorized
        ....,
        [1, 0, ....., 1, 0]     # Tweet N vectorized
    ]  # All of the values are 0 or 1

• output shape (one observation): (1,)
one_observation_output =  #Can only be 0 or 1

When I try to fit my sklearn Bernoulli Naive Bayes model with this type of value, I get this error:

>>>  ValueError: Found array with dim 3. Estimator expected <= 2.

Indeed, the model expects binary input shaped this way :

• input : (nb_features)
ex: [0, 0, 1, 0, ...., 1, 0, 1]

while I am giving it a whole list of binary vectors (one per tweet) for each observation!

3 ) What I have tried

So far, I tried several things to resolve this :

• Assigning the interval's label to every tweet in it, but the results are not good since the tweets are really noisy.

• Flattening the inputs so that the shape of one input is (nb_tweets_on_this_1hour_interval*vocabulary_size, ). But the model cannot train this way, as the number of tweets per hour is not constant.
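A further option, offered here only as an assumption and not as the article's method, is to collapse each hour's tweet matrix into a single fixed-length binary vector that marks whether a word appears in any tweet of the interval. The result has shape (vocabulary_size,) regardless of how many tweets the hour contains. The helper name `hour_to_vector` is hypothetical; the data is the example from the top of this question.

```python
import numpy as np

def hour_to_vector(tweet_matrix):
    """Collapse a (n_tweets, vocabulary_size) 0/1 matrix into one
    (vocabulary_size,) 0/1 vector: 1 if any tweet contains the word."""
    return np.asarray(tweet_matrix).max(axis=0)

hour_a = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]]                # 3 tweets
hour_b = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]  # 4 tweets

# Stack the per-hour vectors into the 2D array an estimator expects.
X = np.vstack([hour_to_vector(hour_a), hour_to_vector(hour_b)])
y = np.array([1, 0])
```

With this toy data both hours collapse to the same vector, which shows the cost of the approach: it discards how many tweets mention each word, so it may or may not carry enough signal in practice.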

4 ) Conclusion

I don't know if the error comes from my misunderstanding of the article or of Naive Bayes models.

How can I efficiently train a Naive Bayes classifier on a bag of tweets?

Here is my training code :

    bnb = BernoulliNB()
    uniqueY = [0, 1]  # The 2 classes I want to classify the tweets with. This is needed for the partial fit.

    # I have to use a for loop to partially fit my Bernoulli Naive Bayes classifier to avoid out-of-memory issues.
    for _index, row in train_df.iterrows():
        # row["Tweet"] contains all the (cleaned) tweets of a 1-hour interval, like this:
        # ["I like Bitcoin", "Nice portfolio", ...., "I am the last tweet of the interval"]
        X_train = vectorizer.transform(row["Tweet"]).toarray()
        # X_train contains all of the row["Tweet"] tweets vectorized with a bag-of-words algorithm, which returns this kind of data:
        # [[0, 1, 0 ....., 0, 0], ...., [1, 0, ....., 1, 0]]
        y_train = row["target"]  # Target is 0 if the market goes down after the tweets and 1 if it goes up
        bnb.partial_fit([X_train], [y_train], uniqueY)

I use `partial_fit` to avoid out-of-memory issues.

Asked Apr 7

## 1 answer to this question.

The error comes from wrapping `X_train` in square brackets, which adds an extra dimension. In your code:

`bnb.partial_fit([X_train], [y_train], uniqueY) #X_train in brackets is causing your error`

Bernoulli NB expects an array with at most TWO dimensions, (n_samples, n_features), and putting `X_train` in square brackets makes it three-dimensional instead.
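The extra dimension can be checked directly with NumPy. This is a small sketch using the toy matrix from the question, not the real tweet data:

```python
import numpy as np

# 3 vectorized tweets over a 4-word vocabulary.
X_train = np.array([[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]])

shape_without_brackets = X_train.shape             # (n_tweets, vocabulary_size)
shape_with_brackets = np.asarray([X_train]).shape  # an extra leading axis appears
```

Here `shape_without_brackets` has two entries and `shape_with_brackets` has three, which is what triggers the "Found array with dim 3" error.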

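If you change the call to this, the dimension error should go away:

`bnb.partial_fit(X_train, y_train, uniqueY)`

Note that `partial_fit` also expects one label per row of `X_train`, so the scalar `y_train` would need to be repeated to that length, for example `[y_train] * X_train.shape[0]`.

Answered Apr 11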
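Putting the pieces together, a runnable version of the loop might look like the sketch below. The `hours` list is a hypothetical stand-in for `train_df`, and repeating the hour's label for every tweet is an assumption made so that `y` has one entry per row of `X_train`. This is the per-tweet labeling the question describes as noisy, so it shows the API shapes rather than a recommended model.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical stand-in for train_df: (tweets of one hour, target) pairs.
hours = [
    (["i like bitcoin", "nice portfolio", "bitcoin to the moon"], 1),
    (["selling everything", "bitcoin crash", "bad day", "crash again"], 0),
]

vectorizer = CountVectorizer(binary=True)
vectorizer.fit([tweet for tweets, _ in hours for tweet in tweets])

bnb = BernoulliNB()
classes = [0, 1]  # all classes must be declared for partial_fit

for tweets, target in hours:
    # Sparse 2D matrix of shape (n_tweets, vocabulary_size); no extra brackets.
    X_train = vectorizer.transform(tweets)
    # One label per tweet (row), all equal to the hour's target.
    y_train = np.full(X_train.shape[0], target)
    bnb.partial_fit(X_train, y_train, classes=classes)

prediction = bnb.predict(vectorizer.transform(["bitcoin to the moon"]))
```

Keeping the matrix sparse (no `.toarray()`) is a design choice that may also help with the memory pressure mentioned in the question, since `BernoulliNB` accepts sparse input.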
