Thursday, May 4, 2017

Machine Learning. Forecasting Numeric Data with Multiple Linear Regression (Medical Expenses)

We will develop a forecasting example using multiple linear regression. The exercise was originally published in "Machine Learning with R" by Brett Lantz, Packt Publishing, 2015 (Community Experience Distilled series).

The example we will develop is about predicting medical expenses.

We will carry out the exercise as published in the aforementioned reference, including some simple transformations to the dataset.

For more details on the linear regression algorithms, see the aforementioned reference or any other bibliography of your choice.

### install and load required packages
#install.packages("psych")
library(psych)

### read and explore the data
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
str(insurance)

### the model's dependent variable: $expenses
### rename $charges to $expenses

colnames(insurance)[7] <- "expenses"
summary(insurance$expenses)
hist(insurance$expenses, main = "Insurance Expenses", col = "red",
     xlab = "Expenses (USD)")

### explore $region
table(insurance$region)

### exploring relationships among features
### correlation matrix

cor(insurance[c("age", "bmi", "children", "expenses")])

### visualizing relationships among features
### scatterplot matrix

pairs(insurance[c("age", "bmi", "children", "expenses")])


### scatterplots, distributions and correlations
pairs.panels(insurance[c("age", "bmi", "children", "expenses")])


### training a model on the data
ins_model <- lm(expenses ~ age + children + bmi + sex +
                  smoker + region, data = insurance)
### this does the same thing:
#ins_model <- lm(expenses ~ ., data = insurance)

### explore model parameters

ins_model

### evaluating model performance
summary(ins_model)

The model explains 74.9% of the variation in the dependent variable (adjusted R-squared: 0.7494).
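As a minimal sketch of the mechanics, the adjusted R-squared that `summary()` prints can also be extracted programmatically from the fitted object. The snippet below uses simulated data (not the insurance dataset, so the numbers will differ from the 0.7494 above):

```r
## Simulated data: expenses driven linearly by age and children, plus noise.
## This is NOT the insurance dataset; it only demonstrates the mechanics.
set.seed(123)
n        <- 200
age      <- sample(18:64, n, replace = TRUE)
children <- sample(0:5, n, replace = TRUE)
expenses <- 250 * age + 500 * children + rnorm(n, sd = 2000)

fit <- lm(expenses ~ age + children)

## summary() returns a list; $adj.r.squared is the statistic quoted in the text
summary(fit)$adj.r.squared
```

The same call, `summary(ins_model)$adj.r.squared`, would return the 0.7494 figure for the insurance model.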

### improving model performance

### adding non-linear relationships
### adding second order term on $age

insurance$age2 <- insurance$age^2

### converting a numeric variable to a binary indicator
### the $bmi feature may only have an impact above a certain threshold

insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

### putting it all together
### improved regression model

ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex +
                   bmi30*smoker + region, data = insurance)

summary(ins_model2)



The fit has improved: the new model explains 86.5% of the variation in the dependent variable.
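To see why the added terms help, it is useful to compare the two specifications side by side. The sketch below does this on simulated data (not the insurance dataset) where the true relationship really is quadratic in age with a bmi30/smoker interaction, mirroring the structure assumed above:

```r
## Simulated data with a quadratic age effect and a bmi30:smoker interaction.
set.seed(42)
n      <- 300
age    <- runif(n, 18, 64)
bmi    <- rnorm(n, mean = 30, sd = 6)
smoker <- rbinom(n, 1, 0.2)
expenses <- 5 * age^2 + 8000 * (bmi >= 30) * smoker + rnorm(n, sd = 1500)

d <- data.frame(expenses, age, age2 = age^2, bmi,
                bmi30 = ifelse(bmi >= 30, 1, 0), smoker)

## base model vs. augmented model (squared term + interaction), as in the text
m1 <- lm(expenses ~ age + bmi + smoker, data = d)
m2 <- lm(expenses ~ age + age2 + bmi + bmi30 * smoker, data = d)

summary(m1)$adj.r.squared
summary(m2)$adj.r.squared   # higher: the added terms capture the true structure
anova(m1, m2)               # F-test: do the extra terms significantly improve fit?
```

Since m1 is nested in m2, `anova()` gives a formal F-test of whether the additional terms are worth their extra degrees of freedom; the same comparison could be run on `ins_model` and `ins_model2`.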

You can get the example and the dataset at:
https://github.com/pakinja/Data-R-Value