I want to model next year's pharmaceutical costs for a group of patients, based on this year's pharmaceutical data (the codes of the drugs), age, gender and this year's costs.
I used linear regression and got an R^2 of 0.69, which was surprisingly good. When I divided the patients into 5 groups of the same size based on their costs for the current year, I could see that the model performed extremely poorly on the bottom 80% (the four lowest-cost groups), while the top 20% made up for it with a score of 0.71.
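To make the evaluation concrete, here is a minimal sketch of this kind of per-group scoring. It runs on synthetic data, not my real data: the lognormal costs, the single `age` feature and all variable names are assumptions for illustration only, and the fit uses plain NumPy least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic, heavy-tailed "this year" costs (assumption: lognormal).
cost_now = rng.lognormal(mean=5.0, sigma=1.5, size=n)
age = rng.integers(18, 90, size=n).astype(float)
# Synthetic "next year" costs: this year's cost times multiplicative noise.
cost_next = cost_now * rng.lognormal(mean=0.0, sigma=0.5, size=n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), cost_now, age])
beta, *_ = np.linalg.lstsq(X, cost_next, rcond=None)
pred = X @ beta


def r2(y, yhat):
    """Coefficient of determination computed on the given subset."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


overall_r2 = r2(cost_next, pred)

# Five equal-sized groups, ordered by this year's cost.
order = np.argsort(cost_now)
groups = np.array_split(order, 5)
group_r2 = [r2(cost_next[g], pred[g]) for g in groups]

print("overall R^2:", round(overall_r2, 3))
for i, score in enumerate(group_r2, start=1):
    print(f"group {i} R^2:", round(score, 3))
```

Note that the R^2 within a group is measured against that group's own mean, so it can be negative even when the overall score looks good.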
About 80% of the patients have costs under roughly 500 euros, while the remaining high-cost patients can have extreme costs, up to 500,000 euros.
I think this happens because linear regression minimises the sum of squared residuals: the residuals of the low-cost patients are small in absolute terms anyway, so predicting them better reduces the loss far less than improving the fit for the high-cost patients.
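A rough way to check this intuition is to look at how much of the total squared error comes from the top 20% of patients. The sketch below again uses synthetic lognormal costs (an assumption, not my real data) and the simplest possible least-squares model, a constant prediction equal to the mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic heavy-tailed costs (assumption: lognormal, not real data).
costs = rng.lognormal(mean=5.0, sigma=1.5, size=5000)

# Under squared error, the best constant prediction is the mean.
residuals = costs - costs.mean()
squared = residuals ** 2

# Mark the patients above the 80th percentile of cost.
cutoff = np.quantile(costs, 0.8)
top = costs > cutoff

# Fraction of the total squared error contributed by the top 20%.
share_top = squared[top].sum() / squared.sum()
print(f"Share of squared error from the top 20%: {share_top:.2f}")
```

If most of the loss sits in the top group, as it does with a tail this heavy, the fit is driven almost entirely by those patients.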
Is there an alternative model that would be more useful in this context, one that also predicts the small costs well?