Bad logistic regression in trivial example scikit-learn

Question

I am trying to run a trivial example of logistic regression using sklearn.linear_model.LogisticRegression.

Here is the code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# some randomly generated data with two well differentiated groups 
x1 = np.random.normal(loc=15, scale=2, size=(30,1))
y1 = np.random.normal(loc=10, scale=2, size=(30,1))
x2 = np.random.normal(loc=25, scale=2, size=(30,1))
y2 = np.random.normal(loc=20, scale=2, size=(30,1))

data1 = np.concatenate([x1, y1, np.zeros(shape=(30,1))], axis=1)
data2 = np.concatenate([x2, y2, np.ones(shape=(30,1))], axis=1)

dfa = pd.DataFrame(data=data1, columns=["F1", "F2", "group"])
dfb = pd.DataFrame(data=data2, columns=["F1", "F2", "group"])

df = pd.concat([dfa, dfb], ignore_index=True)

# the actual fitting
# ("group") is just the string "group", not a tuple, so compare directly
features = [item for item in df.columns if item != "group"]
logreg = LogisticRegression(verbose=1)
logreg.fit(df[features], df.group)

# plotting and checking the result

theta = logreg.coef_[0,:] # parameters
y0 = logreg.intercept_    # intercept

print("Theta =", theta)
print("Intercept = ", y0)

xdb = np.arange(0, 30, 0.2)  # dummy x vector for decision boundary
ydb = -(y0+theta[0]*xdb) / theta[1] # decision boundary y values

fig = plt.figure()
ax = fig.add_subplot(111)
colors = {0 : "red", 1 : "blue"}
for i, group in df.groupby("group"):
    # recent matplotlib releases only accept lowercase keyword names
    plt.plot(group["F1"], group["F2"],
             markerfacecolor=colors[i], marker="o", linestyle="",
             markeredgecolor=colors[i])
plt.plot(xdb, ydb, linestyle="--", color="b")
plt.show()

Shockingly, the resulting plot looks like this:

[Image: scatter plot of the two groups with the fitted decision boundary]

and, in fact, the accuracy can be calculated:

predictions = logreg.predict(df[features])
metrics.accuracy_score(predictions, df["group"])

which yielded 0.966...

I must be doing something wrong, just can't figure out what. Any help is much appreciated!

Dev · Answer 1 · Mar 17, 2022

This is due to regularization. The optimal line would have an intercept of around -16, but regularization prevents the intercept from reaching that value.

Logistic regression is fit by minimizing a loss function that combines the classification error with the size of the weights. C is the inverse of the regularization strength (the default is C=1.0), so increasing C shifts the focus toward minimizing the error, and thus finding a better decision boundary, rather than keeping the weights small. As a result, a valid decision boundary is established.
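For reference, the trade-off described above can be written out. This is a sketch of the L2-penalized objective as I recall it from the scikit-learn user guide (older releases; newer releases document a rescaled form), assuming labels $y_i \in \{-1, 1\}$, weights $w$ and intercept $c$:

```latex
\min_{w,\,c} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^{T} w + c\right)\right)\right)
```

The first term penalizes large weights and the second is the classification error, so a larger C gives the error term more influence. Whether the intercept is also penalized depends on the solver; as far as I know the liblinear solver does penalize it, which would explain the intercept being held back here.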

Although regularization is critical in most real-world settings, in some situations it is better to weaken it or not use it at all.

Make the following modification:

logreg = LogisticRegression(verbose=1, C=100)
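To see the effect of C for yourself, here is a minimal sketch that fits the same kind of data with C=1 and C=100 and prints the intercept, the overall size of the parameters, and the training accuracy. The fixed seed and the explicit solver="liblinear" are my assumptions and are not in the question: I believe liblinear was the default solver in older scikit-learn releases and that it regularizes the intercept as well, which is the behaviour described here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data shaped like the question's: two well-separated 2-D groups.
# The seed is an assumption added for reproducibility.
rng = np.random.default_rng(0)
group0 = rng.normal(loc=[15, 10], scale=2, size=(30, 2))
group1 = rng.normal(loc=[25, 20], scale=2, size=(30, 2))
X = np.vstack([group0, group1])
y = np.concatenate([np.zeros(30), np.ones(30)])

results = {}
for C in (1.0, 100.0):
    # solver="liblinear" is an assumption (see the note above the code).
    model = LogisticRegression(C=C, solver="liblinear")
    model.fit(X, y)
    # Collect the two weights and the intercept into one parameter vector.
    params = np.append(model.coef_[0], model.intercept_[0])
    results[C] = {
        "intercept": model.intercept_[0],
        "param_norm": np.linalg.norm(params),
        "accuracy": model.score(X, y),  # accuracy on the training data
    }
    print(f"C={C:g}: intercept={results[C]['intercept']:.2f}, "
          f"param norm={results[C]['param_norm']:.2f}, "
          f"accuracy={results[C]['accuracy']:.3f}")
```

With the larger C the parameters are allowed to grow, so the intercept can move much further from zero. The exact numbers depend on the random data and on your scikit-learn version.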

The output with this change is as follows:

[Image: plot of the data with the refitted decision boundary]
