Bad logistic regression in trivial example scikit-learn

I am trying to run a trivial example of logistic regression using sklearn.linear_model.LogisticRegression.

Here is the code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# some randomly generated data with two well differentiated groups 
x1 = np.random.normal(loc=15, scale=2, size=(30,1))
y1 = np.random.normal(loc=10, scale=2, size=(30,1))
x2 = np.random.normal(loc=25, scale=2, size=(30,1))
y2 = np.random.normal(loc=20, scale=2, size=(30,1))

data1 = np.concatenate([x1, y1, np.zeros(shape=(30,1))], axis=1)
data2 = np.concatenate([x2, y2, np.ones(shape=(30,1))], axis=1)

dfa = pd.DataFrame(data=data1, columns=["F1", "F2", "group"])
dfb = pd.DataFrame(data=data2, columns=["F1", "F2", "group"])

df = pd.concat([dfa, dfb], ignore_index=True)

# the actual fitting
# ("group") is just a string, so `not in` would do a substring test; compare directly
features = [item for item in df.columns if item != "group"]
logreg = LogisticRegression(verbose=1)
logreg.fit(df[features], df.group)

# plotting and checking the result

theta = logreg.coef_[0,:] # parameters
y0 = logreg.intercept_    # intercept

print("Theta =", theta)
print("Intercept = ", y0)

xdb = np.arange(0, 30, 0.2)  # dummy x vector for decision boundary
ydb = -(y0+theta[0]*xdb) / theta[1] # decision boundary y values

fig = plt.figure()
ax = fig.add_subplot(111)
colors = {0 : "red", 1 : "blue"}
for i, group in df.groupby("group"):
    # matplotlib keyword arguments are lowercase
    plt.plot(group["F1"], group["F2"],
             markerfacecolor=colors[i], marker="o", linestyle="",
             markeredgecolor=colors[i])
plt.plot(xdb, ydb, linestyle="--", color="b")

Shockingly, the resulting plot looks like this:

[image: scatter plot of the two groups with the fitted decision boundary]

and, in fact, the accuracy can be calculated:

predictions = logreg.predict(df[features])
# accuracy_score expects (y_true, y_pred)
print(metrics.accuracy_score(df["group"], predictions))

which yielded 0.966...

I must be doing something wrong, just can't figure out what. Any help is much appreciated!

Mar 15, 2022 in Machine Learning by Nandini

1 answer to this question.

This is caused by regularization. The optimal line would have an intercept of around -16, but regularization prevents the intercept from reaching that value.

Logistic regression minimizes a loss function that combines the classification error with a penalty on the size of the weights. In scikit-learn, C is the inverse of the regularization strength, so increasing C shifts the focus toward minimizing the error (and thus finding a better decision boundary) rather than keeping the weights small. With a large enough C, a valid decision boundary is found.
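
As a sketch of how the error and the weights are combined, the L2-penalized objective can be written in the form below. The symbols are assumptions for illustration and do not appear in the answer itself: w is the weight vector (theta in the question), c is the intercept, x_i are the feature rows, and y_i are the labels recoded as -1 and +1.

```latex
\min_{w,\,c}\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + c\right)\right)\right)
```

The first term penalizes large weights and the sum measures the classification error, so a larger C gives the error term more influence. Some solvers (liblinear, for example) also penalize the intercept, which would be consistent with the intercept being held back here; which solver is used by default depends on the scikit-learn version.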

Regularization is critical in most real-world settings, but in particular situations such as this one it is better not to rely on it.

Make the following modification:

logreg = LogisticRegression(verbose=1, C=100)

With this change, the output is as follows:

[image: scatter plot of the two groups with the new decision boundary]
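
To check the effect of C numerically rather than from the plot, here is a minimal sketch that fits the same kind of data with C=1 and C=100 and compares the results. It rests on a few assumptions that are not in the original post: a fixed random seed for repeatability, and solver="liblinear", which was the default in older scikit-learn releases and also penalizes the intercept. The exact numbers will differ from those in the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability

# Two well-separated groups, mirroring the data in the question
X = np.vstack([
    rng.normal(loc=[15, 10], scale=2, size=(30, 2)),
    rng.normal(loc=[25, 20], scale=2, size=(30, 2)),
])
y = np.repeat([0, 1], 30)

results = {}
for C in (1, 100):
    # solver="liblinear" is an assumption (older default; regularizes the intercept too)
    model = LogisticRegression(C=C, solver="liblinear").fit(X, y)
    params = np.append(model.coef_.ravel(), model.intercept_)
    results[C] = {
        "coef": model.coef_.ravel(),
        "intercept": float(model.intercept_[0]),
        "norm": float(np.linalg.norm(params)),   # size of all fitted parameters
        "accuracy": float(model.score(X, y)),    # training accuracy
    }
    print(f"C={C}: coef={results[C]['coef']}, "
          f"intercept={results[C]['intercept']:.3f}, "
          f"accuracy={results[C]['accuracy']:.3f}")
```

With the weaker penalty (C=100) the fitted parameters are free to grow, so the intercept can move much further from zero than it can with C=1.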

answered Mar 17, 2022 by Dev