Alternatives to linear regression for a dataset with many small values and a few extreme values

0 votes
I want to model the pharmaceutical costs for a group of patients for the next year, based on this year's pharma data (codes for the pharmaceuticals), age, gender and this year's costs.

I used linear regression and got an R^2 of 0.69, which was surprisingly good. When I divided the patients into 5 groups of equal size based on their costs for the current year, I could see that the bottom 80% performed extremely poorly, while the top 20% made up for it with a score of 0.71.
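
The per-group check described above can be sketched roughly as follows. This is only an illustration on synthetic data, not the actual patient data: the lognormal cost distribution, the single predictor, and the helper names `fit_line` and `r_squared` are all assumptions made for the example.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(y, pred):
    """1 - SS_res / SS_tot, computed on whatever subset is passed in."""
    my = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Synthetic stand-in data (assumption): skewed costs with multiplicative noise.
rng = random.Random(0)
cost_now = [rng.lognormvariate(5, 1.5) for _ in range(1000)]
cost_next = [c * rng.lognormvariate(0, 0.5) for c in cost_now]

# One global linear fit, scored on the whole data set.
intercept, slope = fit_line(cost_now, cost_next)
pred = [intercept + slope * c for c in cost_now]
overall_r2 = r_squared(cost_next, pred)

# Score the same predictions separately within each cost quintile.
order = sorted(range(len(cost_now)), key=lambda i: cost_now[i])
size = len(order) // 5
group_r2 = []
for g in range(5):
    idx = order[g * size:(g + 1) * size]
    group_r2.append(r_squared([cost_next[i] for i in idx],
                              [pred[i] for i in idx]))

print("overall R^2:", round(overall_r2, 3))
print("R^2 per quintile (low to high cost):", [round(r, 3) for r in group_r2])
```

Note that R^2 within a group can be negative, which means the global model predicts that group worse than the group's own mean would.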

80% of the people have costs of roughly under 500 euros, while those with high costs have extreme costs, up to 500,000 euros.

I think that, since linear regression minimises the squared residuals, reducing the already relatively small residuals of the low-cost patients brings much less gain than reducing the residuals of the high-cost patients.

Is there an alternative model that would be more useful in this context and would also predict small costs better?
Feb 25 in Machine Learning by Nandini
• 5,480 points
77 views

1 answer to this question.

0 votes

This is a case where the variance increases with the expected mean, which is known as heteroscedasticity.
You can choose from the following options:

  • Use weighted least squares (WLS), with weights that depend on the predicted value.
  • Transform the dependent variable, for example by modelling log(y), which corresponds to estimating a lognormal model.
  • Use a distribution whose variance increases with the mean.
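
The first two options might look roughly like this. It is a sketch on synthetic data with a single predictor: the data, the helper `weighted_line`, the choice of weights 1 / prediction^2, and the `floor` guard are assumptions for illustration, not a recommended recipe.

```python
import math
import random

def weighted_line(x, y, w):
    """Fit y = a + b * x by minimising sum(w * residual ** 2)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b

# Synthetic stand-in data (assumption): skewed costs with multiplicative noise.
rng = random.Random(1)
cost_now = [rng.lognormvariate(5, 1.5) for _ in range(2000)]
cost_next = [c * rng.lognormvariate(0, 0.5) for c in cost_now]
ones = [1.0] * len(cost_now)

# Option 1: WLS. Fit plain OLS first, then reweight by 1 / prediction^2,
# which assumes the standard deviation grows in proportion to the mean.
a_ols, b_ols = weighted_line(cost_now, cost_next, ones)
floor = 1.0  # guard against tiny or negative OLS predictions
weights = [1.0 / max(a_ols + b_ols * c, floor) ** 2 for c in cost_now]
a_wls, b_wls = weighted_line(cost_now, cost_next, weights)

# Option 2: regress log(y) on log(x), i.e. a lognormal model.
log_x = [math.log(c) for c in cost_now]
log_y = [math.log(c) for c in cost_next]
a_log, b_log = weighted_line(log_x, log_y, ones)
resid_var = sum((ly - (a_log + b_log * lx)) ** 2
                for lx, ly in zip(log_x, log_y)) / (len(log_x) - 2)

def predict_log_model(cost):
    """Back-transformed mean; exp(resid_var / 2) is the lognormal correction."""
    return math.exp(a_log + b_log * math.log(cost) + resid_var / 2)

print("WLS fit:", round(a_wls, 2), round(b_wls, 3))
print("log-model prediction for a cost of 100:", round(predict_log_model(100.0), 1))
```

Without the `resid_var / 2` term, exponentiating the fitted log value estimates the median of a lognormal response rather than its mean, so predicted costs would be systematically too low on average.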

For the third option, note that the Poisson variance is equal to the mean; for continuous variables you must use quasi-Poisson instead. The Gamma variance is proportional to the square of the mean. These distributions are typically available in GLM (generalized linear model) implementations.
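
As a rough sketch of the GLM route, here is a Gamma model with a log link fitted by iteratively reweighted least squares (IRLS) in plain Python. The synthetic data, the single log-cost predictor, and the function names are assumptions for illustration; in practice one would normally use a library GLM routine (for example the `GLM` class in statsmodels with a Gamma family) rather than hand-written fitting code.

```python
import math
import random

def ols_line(x, y):
    """Ordinary least squares for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def fit_gamma_log_glm(x, y, n_iter=25, tol=1e-10):
    """Gamma GLM with log link and one predictor, fitted by IRLS.

    With a log link and variance proportional to mu^2 the IRLS weights are
    constant, so each step is a plain OLS fit to the working response.
    """
    a, b = ols_line(x, [math.log(yi) for yi in y])  # starting values
    for _ in range(n_iter):
        eta = [a + b * xi for xi in x]
        # Working response: eta + (y - mu) / mu, with mu = exp(eta).
        z = [e + yi * math.exp(-e) - 1.0 for e, yi in zip(eta, y)]
        a_new, b_new = ols_line(x, z)
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b

# Synthetic stand-in data (assumption): skewed costs with multiplicative noise.
rng = random.Random(2)
cost_now = [rng.lognormvariate(5, 1.5) for _ in range(2000)]
cost_next = [c * rng.lognormvariate(0, 0.5) for c in cost_now]
log_now = [math.log(c) for c in cost_now]

a_hat, b_hat = fit_gamma_log_glm(log_now, cost_next)

def predict_cost(cost):
    """Predicted mean cost next year for a given current cost."""
    return math.exp(a_hat + b_hat * math.log(cost))

# At the fitted values the ratios y / mu average to 1.
mean_ratio = sum(y / predict_cost(c)
                 for c, y in zip(cost_now, cost_next)) / len(cost_now)
print("coefficients:", round(a_hat, 3), round(b_hat, 3))
print("prediction for a cost of 100:", round(predict_cost(100.0), 1))
```

Because the Gamma model weights errors relative to the mean, a patient with costs of 200 euros influences the fit about as much as one with costs of 200,000 euros, which is the behaviour the question asks for. Unlike the log-transform approach, the model predicts the mean cost directly, so no back-transformation correction is needed.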

answered Feb 25 by Dev
• 6,000 points
