*"Machine Learning with R"* by Brett Lantz, Packt Publishing, 2015 (Open Source: Community Experience Distilled).

The example we will develop is about predicting medical expenses.

We will carry out the exercise as published in the aforementioned reference, including some simple transformations to the dataset.

For more details on linear regression algorithms, see the aforementioned reference or any other bibliography of your choice.

**### install and load required packages**

# install.packages("psych")  # uncomment on first run if psych is not installed

library(psych)

**### read and explore the data**

insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)

str(insurance)

**### model dependent variable: $expenses**

### change $charges name to $expenses

colnames(insurance)[7] <- "expenses"

summary(insurance$expenses)

hist(insurance$expenses, main = "Insurance Expenses", col = "red", xlab = "Expenses (USD)")

**### explore $region**

table(insurance$region)

**### exploring relationships among features**

### correlation matrix

cor(insurance[c("age", "bmi", "children", "expenses")])

**### visualizing relationships among features**

### scatterplot matrix

pairs(insurance[c("age", "bmi", "children", "expenses")])

**### scatterplots, distributions and correlations**

pairs.panels(insurance[c("age", "bmi", "children", "expenses")])

**### training a model on the data**

ins_model <- lm(expenses ~ age + children + bmi + sex + smoker + region, data = insurance)

**### this does the same**

#ins_model <- lm(expenses ~ ., data = insurance)

### explore model parameters

ins_model

**### evaluating model performance**

summary(ins_model)

The model explains 74.9% of the variation in the dependent variable (adjusted R-squared: 0.7494).
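Once a model like this is fitted, `predict()` can be used to estimate expenses for new cases. The sketch below uses a small simulated dataset (not the insurance data) purely to illustrate the mechanics; with the model above you would pass `ins_model` and a data frame with the same column names instead.

```r
# Minimal, self-contained sketch of predict() on an lm fit.
# The data are simulated for illustration only.
set.seed(1)
toy <- data.frame(age = sample(18:64, 100, replace = TRUE),
                  bmi = rnorm(100, mean = 30, sd = 5))
toy$expenses <- 250 * toy$age + 300 * toy$bmi + rnorm(100, sd = 2000)

toy_model <- lm(expenses ~ age + bmi, data = toy)

# predicted expenses for a new 40-year-old with BMI 31
predict(toy_model, newdata = data.frame(age = 40, bmi = 31))
```

The data frame passed to `newdata` must contain every predictor used in the formula, with matching names and types.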

**### improving model performance**

### adding non-linear relationships

### adding second order term on $age

insurance$age2 <- insurance$age^2

**### converting a numeric variable to a binary indicator**

### the $bmi feature only has an impact above some value

insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

### putting it all together

### improved regression model

ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex + bmi30*smoker + region, data = insurance)

summary(ins_model2)

The model fit has improved: it now explains 86.5% of the variation in the dependent variable.
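Because the improved model extends the first one with additional terms, the two fits can also be compared programmatically, via their adjusted R-squared values and an F-test with `anova()`. The sketch below again uses simulated data standing in for `ins_model` and `ins_model2`.

```r
# Sketch: comparing a simple and an extended lm fit (simulated data).
set.seed(42)
d <- data.frame(x = runif(200, 0, 10))
d$y <- 2 * d$x + 0.5 * d$x^2 + rnorm(200)

m1 <- lm(y ~ x, data = d)        # simple model
d$x2 <- d$x^2
m2 <- lm(y ~ x + x2, data = d)   # adds a second-order term

summary(m1)$adj.r.squared        # fit of the simpler model
summary(m2)$adj.r.squared        # fit of the extended model
anova(m1, m2)                    # F-test: are the extra terms worthwhile?
```

For nested models like these, a significant `anova()` F-test indicates that the added terms improve the fit beyond what chance would explain.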

You can get the example and the dataset at:

https://github.com/pakinja/Data-R-Value