I want to model the pharmaceutical costs for a group of patients for the next year, based on this year's pharma data (codes for the pharmaceuticals), age, gender and this year's costs.

I used linear regression and got an R^2 of 0.69, which was surprisingly good. When I divided the patients into 5 groups of the same size based on their costs for the current year, I could see that the bottom 80% performed extremely poorly, while the top 20% made up for it with a score of 0.71.

80% of the people have costs roughly under 500 euros, while the few with high costs have extreme costs, up to 500,000 euros.

I think that, since linear regression minimises the squared residuals, shrinking the already relatively small residuals of the low-cost patients gains much less than shrinking the residuals of the high-cost patients.

Is there an alternative model that would be more useful in this context, so that small costs are predicted better as well?
Feb 25, 2022

## 1 answer to this question.

What you describe is heteroscedasticity: the variance of the costs increases with their expected mean.
You can choose from the following options:

• Use weighted least squares (WLS), with weights that depend on the predicted value.
• Transform the dependent variable, for example model log(y), which corresponds to estimating a lognormal model.
• Use a distribution whose variance increases with the mean.
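The first two options can be sketched with NumPy alone. This is a minimal illustration on synthetic data, not a definitive recipe: the variables `cost_now` and `cost_next`, the weight choice 1/pred², and the floor of 1 on the predictions are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the cost data (an assumption, not the asker's data):
# next year's cost scales with this year's cost, and so does the noise.
n = 2000
cost_now = rng.lognormal(mean=5.0, sigma=1.5, size=n)
cost_next = 0.8 * cost_now * rng.lognormal(mean=0.0, sigma=0.5, size=n)

X = np.column_stack([np.ones(n), cost_now])  # intercept + one feature

# Step 1: ordinary least squares to get first-pass predictions.
beta_ols = np.linalg.lstsq(X, cost_next, rcond=None)[0]
pred = np.clip(X @ beta_ols, 1.0, None)  # floor at 1 so the weights stay finite

# Step 2: WLS with weights 1 / pred**2, i.e. assuming the standard deviation
# is proportional to the predicted value. Scaling each row by sqrt(weight)
# turns the weighted problem into an ordinary least-squares problem.
sw = np.sqrt(1.0 / pred**2)
beta_wls = np.linalg.lstsq(X * sw[:, None], cost_next * sw, rcond=None)[0]

# Alternative: regress log(y) on log(x); requires strictly positive costs.
X_log = np.column_stack([np.ones(n), np.log(cost_now)])
beta_log = np.linalg.lstsq(X_log, np.log(cost_next), rcond=None)[0]

print("OLS coefficients:", beta_ols)
print("WLS coefficients:", beta_wls)
print("log-log coefficients:", beta_log)
```

Note that a model fitted on log(y) predicts the mean of log(y); exponentiating that prediction gives roughly the median cost rather than the mean, so a retransformation correction is needed if mean costs are the target.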

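For instance, the variance of the Poisson distribution is equal to the mean; for continuous variables such as costs you must use quasi-Poisson instead. The variance of the Gamma distribution is proportional to the square of the mean. These distributions are typically available in GLM (generalized linear model) implementations.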
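To make the role of the variance function concrete, here is a minimal sketch of how such a GLM can be fitted by iteratively reweighted least squares (IRLS) with a log link. The helper names `fit_log_link_glm` and `pearson_dispersion`, the log link, and the synthetic data are assumptions for illustration; in practice you would normally rely on a library's GLM routine.

```python
import numpy as np

def fit_log_link_glm(X, y, var_power, n_iter=50, tol=1e-8):
    """IRLS for a GLM with log link and variance function V(mu) = mu**var_power.

    var_power=1 gives the (quasi-)Poisson mean model, var_power=2 the Gamma one.
    """
    # Starting values from least squares on log(y); needs y > 0.
    beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu        # working response for the log link
        w = mu ** (2 - var_power)      # (dmu/deta)^2 / V(mu)
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

def pearson_dispersion(X, y, beta, var_power):
    """Pearson estimate of the dispersion, the 'quasi' part of quasi-Poisson."""
    mu = np.exp(X @ beta)
    return np.sum((y - mu) ** 2 / mu ** var_power) / (len(y) - X.shape[1])

# Synthetic data (an assumption): Gamma costs with mean exp(5 + 0.8 * x).
rng = np.random.default_rng(1)
n = 3000
x = rng.normal(size=n)
mu_true = np.exp(5.0 + 0.8 * x)
y = rng.gamma(shape=2.0, scale=mu_true / 2.0)
X = np.column_stack([np.ones(n), x])

beta_gamma = fit_log_link_glm(X, y, var_power=2)
beta_qpois = fit_log_link_glm(X, y, var_power=1)
phi_gamma = pearson_dispersion(X, y, beta_gamma, var_power=2)

print("Gamma coefficients:", beta_gamma)
print("quasi-Poisson coefficients:", beta_qpois)
print("Gamma dispersion:", phi_gamma)
```

Both fits estimate the same mean model and differ only in how strongly they down-weight high-cost observations. If statsmodels is available, the equivalent fits should be `sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()` and `sm.GLM(y, X, family=sm.families.Poisson()).fit(scale="X2")`, where `scale="X2"` requests the Pearson dispersion estimate.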

