Hi guys, I'm trying to use the Naive Bayes algorithm on my dataset. The dataset can be downloaded here: https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe
This is my code:
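# imports
import pandas as pd
from nltk.corpus import stopwords
from sklearn import naive_bayes
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
data = pd.read_json('/Users/rokayadarai/Desktop/Coding/DataSets/Hotel_Reviews.json')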
data.head()
# stopwords (a, and, the, ...) are not useful, so drop them
stopset = set(stopwords.words('english'))
# pass a list: recent scikit-learn versions reject a set for stop_words
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=list(stopset))
# merge the two columns Negative_Review and Positive_Review into one
data['Reviews'] = data['Negative_Review'] + data['Positive_Review']
y = data.Reviewer_Score
X = vectorizer.fit_transform(data.Reviews)
# 515738 observations and 83941 unique words
print(y.shape)
print(X.shape)
# split the data: test_size=0.2 holds out 20% for testing; random_state=123 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
#train naive bayes classifier
classifier = naive_bayes.MultinomialNB()
classifier.fit(X_train, y_train)
But when I run it, I keep getting this error on the line classifier.fit(X_train, y_train):
ValueError: Unknown label type: (array([ 7.5, 9.2, 9.2, ..., 5.8, 10. , 9.6]),)
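The error seems to be reproducible on a tiny example, which suggests the float scores are the trigger: MultinomialNB is a classifier and expects discrete class labels, while Reviewer_Score is continuous. Below is a minimal sketch under stated assumptions. The four reviews, their scores, and the 7.0 cut-off are made up for illustration and do not come from the dataset, and binning the score into two classes is only one possible workaround (a regression model would be the alternative if the raw score has to be predicted).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data standing in for the Reviews and Reviewer_Score columns.
reviews = [
    "great room friendly staff",
    "dirty room rude staff",
    "lovely breakfast great view",
    "noisy street tiny bathroom",
]
scores = np.array([9.2, 3.8, 10.0, 5.8])

X = TfidfVectorizer().fit_transform(reviews)

# Fitting a classifier on continuous float targets raises the ValueError.
try:
    MultinomialNB().fit(X, scores)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Hypothetical workaround: turn the score into two classes.
# The 7.0 threshold is an arbitrary assumption, not taken from the dataset.
labels = (scores >= 7.0).astype(int)
classifier = MultinomialNB().fit(X, labels)
predictions = classifier.predict(X)
print(predictions)
```

If this is the right diagnosis, the same binning would presumably be applied to y before train_test_split in the code above.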
Could somebody please help me out?