Friday, May 12, 2017

Machine Learning. Linear Regression Full Example (Boston Housing).

It is important to mention that these posts began as a personal way of practicing R programming and machine learning. Subsequent feedback from the community urged me to continue performing these exercises and sharing them.
The bibliography and corresponding authors are cited at all times, as a way of honoring them and giving them the credit they deserve for their work.

We will develop a full linear regression example, including simple linear regression, multiple linear regression, linear regression with interaction terms, linear regression with higher-order terms, and finally linear regression with a log transformation of a predictor. The exercise was originally published in "An Introduction to Statistical Learning. With Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Springer 2015.

The example we will develop is about predicting the median house value in the Boston Housing dataset.

We will carry out the exercise essentially as published in the aforementioned reference, with only slight changes in coding style.

For more details on the models, algorithms and parameters interpretation, it is recommended to check the aforementioned reference or any other bibliography of your choice.



### install and load required packages
#install.packages("ISLR")
#install.packages("car")

library(ISLR)
library(car)
 

### MASS library contains the Boston dataset
library(MASS)

### explore dataset
#fix(Boston)
str(Boston)
summary(Boston)
names(Boston)

### we will seek to predict target: medv (median house value,
### in $1000s) using the 13 available predictors

### we will start with $medv as target and $lstat as predictor
### create the model

lm.fit <- lm(medv ~ lstat, data = Boston)

### explore the model parameters

lm.fit
summary(lm.fit)
names(lm.fit)
coef(lm.fit)

### explore confidence interval for the coefficient estimates

confint(lm.fit)

### use predict() to produce confidence intervals and prediction
### intervals for the prediction of $medv for a given value of
### $lstat

predict(lm.fit, data.frame(lstat = c(5, 10, 15)),
        interval = "confidence")
predict(lm.fit, data.frame(lstat = c(5, 10, 15)),
        interval = "prediction")
 

### the 95% confidence interval associated with $lstat = 10 is
### (24.47, 25.63) and the 95% prediction interval is
### (12.83, 37.28); both are centered around the predicted
### value of 25.05 for $medv when $lstat = 10, but the
### prediction interval is substantially wider

### now plot $medv and lstat along with least squares regression
### line

plot(Boston$lstat, Boston$medv, main = "Boston Housing Data",
     xlab = "Percent of households with low socioeconomic status",
     ylab = "Median house value")

abline(lm.fit, col = "red", lwd = 3)






### explore some diagnostic plots
par(mfrow = c(2, 2))
plot(lm.fit)

plot(predict(lm.fit), residuals(lm.fit))
plot(predict(lm.fit), rstudent(lm.fit))


 

### leverage statistics hatvalues()
plot(hatvalues(lm.fit))
which.max(hatvalues(lm.fit))
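Beyond locating the single largest leverage value, a common rule of thumb (not part of the original exercise) is to flag observations whose leverage is well above the average leverage, which always equals (p + 1)/n for a model with p predictors. A minimal sketch, assuming the simple regression of medv on lstat:

```r
# Sketch: flag high-leverage observations using the conventional
# cutoff of twice the average leverage (an assumption, not from
# the original ISLR text)
library(MASS)

lm.fit <- lm(medv ~ lstat, data = Boston)
h <- hatvalues(lm.fit)

p <- length(coef(lm.fit)) - 1    # number of predictors (1 here)
n <- nrow(Boston)                # number of observations (506)
avg.leverage <- (p + 1) / n      # average leverage = mean(h)

high <- which(h > 2 * avg.leverage)  # indices of high-leverage points
length(high)                         # how many points exceed the cutoff
```

The cutoff of twice the average is only a convention; in practice you would inspect the flagged rows of `Boston` before deciding whether they are influential.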



### multiple linear regression
### two predictors

lm.fit <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit)

### all 13 predictors
lm.fit <- lm(medv ~ ., data = Boston)
summary(lm.fit)

### R^2
summary(lm.fit)$r.squared
### RSE
summary(lm.fit)$sigma

### use vif() to compute variance inflation factors
### vif() is part of the car library

vif(lm.fit)
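To make the VIF output easier to act on, one can sort the values and flag predictors above a cutoff. The cutoff of 5 used below is a common convention, not something prescribed by the original exercise:

```r
# Sketch: rank predictors by VIF and flag those above 5
# (the cutoff of 5 is a common convention, not from the
# original ISLR text)
library(MASS)
library(car)

lm.fit <- lm(medv ~ ., data = Boston)
v <- vif(lm.fit)

sort(v, decreasing = TRUE)  # predictors ranked by collinearity
names(v)[v > 5]             # predictors exceeding the cutoff
```

For this dataset, tax and rad show the highest VIFs, suggesting these two predictors carry overlapping information.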

### in the last regression $age has a high p-value
### run regression using all predictors but one: $age

lm.fit1 <- lm(medv ~ . - age, data = Boston)
summary(lm.fit1)
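The same refit can also be produced with the update() function, which modifies the formula of an existing model rather than retyping it:

```r
# Alternative: drop $age from the full model with update()
library(MASS)

lm.fit <- lm(medv ~ ., data = Boston)       # all 13 predictors
lm.fit1 <- update(lm.fit, ~ . - age)        # same model without age
summary(lm.fit1)
```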

### interaction terms
summary(lm(medv ~ lstat * age, data = Boston))

### non-linear transformation of the predictors

lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)

### the near-zero p-value associated with the quadratic term
### suggests it leads to an improved model


### use anova() to further quantify the extent to which the
### quadratic fit is superior to the linear fit

lm.fit <- lm(medv ~ lstat, data = Boston)
anova(lm.fit, lm.fit2)

### the null hypothesis is that the two models fit the data
### equally well; the alternative hypothesis is that the full
### model is superior
### the F-statistic is 135 and the p-value is virtually zero,
### clear evidence that the model containing the predictors
### lstat and lstat^2 is far superior to the model containing
### only lstat
### we saw evidence of this nonlinearity in the diagnostic
### plots above



### now use higher order polynomials with poly()

lm.fit5 <- lm(medv ~ poly(lstat, 5), data = Boston)
summary(lm.fit5)
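As a quick check (not part of the original text), the same anova() comparison used earlier can quantify whether the extra cubic-to-quintic terms improve on the quadratic fit:

```r
# Sketch: F-test of the degree-5 polynomial against the
# quadratic fit, mirroring the earlier anova() comparison
library(MASS)

lm.fit2 <- lm(medv ~ poly(lstat, 2), data = Boston)
lm.fit5 <- lm(medv ~ poly(lstat, 5), data = Boston)

anova(lm.fit2, lm.fit5)  # small p-value favors the larger model
```

Note that poly() uses orthogonal polynomials by default, so the individual coefficients differ from those of the I(lstat^2)-style fit, but the fitted values and the F-test are unaffected.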

### the additional polynomial terms lead to an improvement in
### the model fit

### now try a log transformation for $rm

summary(lm(medv ~ log(rm), data = Boston))




You can get the example at:
https://github.com/pakinja/Data-R-Value