Friday, May 12, 2017

Machine Learning. Linear Regression Full Example (Boston Housing).

It is important to mention that these posts began as a personal way of practicing R programming and machine learning. Feedback from the community subsequently encouraged me to keep doing these exercises and sharing them.
The bibliography and corresponding authors are cited throughout, as a way of honoring them and giving them the credit they deserve for their work.

We will develop a full linear regression example, including simple linear regression, multiple linear regression, linear regression with interaction terms, linear regression with higher-order terms, and finally a non-linear transformation of a predictor. The exercise was originally published in "An Introduction to Statistical Learning. With Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Springer, 2015.

The example we will develop is about predicting the median house value using the Boston Housing dataset.

We will carry out the exercise essentially as published in the aforementioned reference, with only slight changes in coding style.

For more details on the models, algorithms, and interpretation of the parameters, see the aforementioned reference or any other bibliography of your choice.



### install and load required packages
#install.packages("ISLR")
#install.packages("car")

library(ISLR)
library(car)
 

### MASS library contains the Boston dataset
library(MASS)

### explore dataset
#fix(Boston)
str(Boston)
summary(Boston)
names(Boston)

### we will seek to predict the target medv (median house value)
### using the other 13 predictors

### we will start with $medv as target and $lstat as predictor
### create the model

lm.fit <- lm(medv ~ lstat, data = Boston)

### explore the model parameters

lm.fit
summary(lm.fit)
names(lm.fit)
coef(lm.fit)

### explore confidence interval for the coefficient estimates

confint(lm.fit)

### use predict() to produce confidence intervals and prediction
### intervals for the prediction of $medv for a given value of
### $lstat

predict(lm.fit, data.frame(lstat = c(5, 10, 15)),
        interval = "confidence")
predict(lm.fit, data.frame(lstat = c(5, 10, 15)),
        interval = "prediction")
 

### the 95% confidence interval associated with $lstat = 10 is
### (24.47, 25.63)
### the 95% prediction interval is (12.828, 37.28)
### both are centered around the predicted value of 25.05 for
### $medv when $lstat = 10
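
### the prediction interval is wider because it accounts for the
### irreducible error of an individual response, not just the
### uncertainty in the estimated regression line; as a quick check
### (a small sketch, not part of the original exercise) compare
### the interval widths directly

conf_int <- predict(lm.fit, data.frame(lstat = 10), interval = "confidence")
pred_int <- predict(lm.fit, data.frame(lstat = 10), interval = "prediction")
conf_int[, "upr"] - conf_int[, "lwr"]  ### confidence interval width
pred_int[, "upr"] - pred_int[, "lwr"]  ### prediction interval width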

### now plot $medv against $lstat along with the least squares
### regression line

plot(Boston$lstat, Boston$medv, main = "Boston Housing Data",
     xlab = "Percent of households with low socioeconomic status",
     ylab = "Median house value")

abline(lm.fit, col = "red", lwd = 3)






### explore some diagnostic plots
par(mfrow = c(2, 2))
plot(lm.fit)

plot(predict(lm.fit), residuals(lm.fit))
plot(predict(lm.fit), rstudent(lm.fit))
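
### these panels are drawn in the 2 x 2 layout set by par() above;
### reset to single-panel plotting for the figures that follow
### (a housekeeping line not in the original code)
par(mfrow = c(1, 1))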


 

### compute leverage statistics with hatvalues()
plot(hatvalues(lm.fit))
### index of the observation with the largest leverage statistic
which.max(hatvalues(lm.fit))
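
### the average leverage over all observations is (p + 1)/n, where
### p is the number of predictors; as a small sketch (not part of
### the original exercise, here p = 1) compare the largest
### leverage against that baseline
max(hatvalues(lm.fit))  ### largest leverage statistic
(1 + 1) / nrow(Boston)  ### average leverage (p + 1)/n with p = 1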



### multiple linear regression
### two predictors

lm.fit <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit)

### all 13 predictors
lm.fit <- lm(medv ~ ., data = Boston)
summary(lm.fit)

### R^2
summary(lm.fit)$r.sq
### RSE
summary(lm.fit)$sigma

### use vif() to compute variance inflation factors
### vif() is part of the car library

vif(lm.fit)

### in the last regression $age has a high p-value
### so we run the regression again using all predictors but $age

lm.fit1 <- lm(medv ~ . - age, data = Boston)
summary(lm.fit1)

### interaction terms
summary(lm(medv ~ lstat * age, data = Boston))
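
### in R's formula syntax lstat*age is shorthand for
### lstat + age + lstat:age (both main effects plus the
### interaction), so the same model can be written out explicitly
summary(lm(medv ~ lstat + age + lstat:age, data = Boston))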

### non-linear transformation of the predictors

lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)

### the near-zero p-value associated with the quadratic term
### suggests that it leads to an improved model
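
### as a quick sketch (not part of the original exercise), compare
### the R^2 of the purely linear fit with that of the quadratic fit
summary(lm(medv ~ lstat, data = Boston))$r.sq  ### linear fit
summary(lm.fit2)$r.sq                          ### quadratic fit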


### use anova() to further quantify the extent to which the
### quadratic fit is superior to the linear fit

lm.fit <- lm(medv ~ lstat, data = Boston)
anova(lm.fit, lm.fit2)

### the null hypothesis is that the two models fit the data
### equally well; the alternative hypothesis is that the full
### model is superior
### the F-statistic of 135 and an associated p-value of virtually
### zero are clear evidence that the model containing the
### predictors lstat and lstat^2 is far superior
### we saw evidence of this nonlinearity in the plots above
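
### the reference also re-examines the diagnostic plots once the
### quadratic term is included: with lstat^2 in the model there is
### little discernible pattern left in the residuals
par(mfrow = c(2, 2))
plot(lm.fit2)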



### now use higher order polynomials with poly()

lm.fit5 <- lm(medv ~ poly(lstat, 5), data = Boston)
summary(lm.fit5)

### the additional polynomial terms lead to an improvement in the
### model fit
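
### the reference notes that no polynomial terms beyond the fifth
### order have significant p-values; this can be checked directly
### by fitting a higher-order polynomial and inspecting the
### coefficient table (a quick sketch)
summary(lm(medv ~ poly(lstat, 7), data = Boston))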

### now try a log transformation for $rm

summary(lm(medv ~ log(rm), data = Boston))




You can get the example at:
https://github.com/pakinja/Data-R-Value

Tuesday, May 9, 2017

Looking for a High-Level Programming or Statistics Course? MIT OpenCourseWare.

Although MIT OCW has been operating for more than 15 years, I think this post is worth doing, as there are still many people who do not know it exists.



MIT OpenCourseWare (OCW) is a web-based publication of virtually all MIT course content. OCW is open and available to the world and is a permanent MIT activity.

“The idea is simple: to publish all of our course materials online and make them widely available to everyone.”
Dick K.P. Yue, Professor, MIT School of Engineering.

OCW offers about 2,340 courses and has received more than 200 million visitors.

Some of the most popular courses are featured on the OCW site.


Go to MIT OCW:
https://ocw.mit.edu


Monday, May 8, 2017

Machine Learning. Regression Trees and Model Trees (Predicting Wine Quality)

We will develop a forecasting example using the model tree and regression tree algorithms. The exercise was originally published in "Machine Learning with R" by Brett Lantz, Packt Publishing, 2015 (Community Experience Distilled).

The example we will develop is about predicting wine quality.

Image from Wine Folly


We will carry out the exercise verbatim as published in the aforementioned reference.

For more details on the model tree and regression tree algorithms, see the aforementioned reference or any other bibliography of your choice.


### install and load required packages
#install.packages("rpart")
#install.packages("rpart.plot")
#install.packages("RWeka")

library(rpart)
library(rpart.plot)
library(RWeka)  ### note: RWeka requires a working Java installation

### read and explore the data
wine <- read.csv("whitewines.csv")
str(wine)

### examine the distribution of the outcome variable
hist(wine$quality, main= "Wine Quality", col= "red")



### examine the output for outliers or other potential data problems
summary(wine)

### split the data into training and testing datasets
### 75% - 25%

wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]
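
### this index-based split relies on the rows already being in
### random order, as the reference indicates for this dataset; if
### that does not hold for your copy of the data, a random split
### is safer (a commented-out sketch with an arbitrary seed)
#set.seed(123)  ### any fixed seed, for reproducibility
#train_idx <- sample(nrow(wine), 0.75 * nrow(wine))
#wine_train <- wine[train_idx, ]
#wine_test <- wine[-train_idx, ]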

### training a model on the data

### training a regression tree model

m.rpart <- rpart(quality ~ ., data = wine_train)

### explore basic information about the tree
### for each node in the tree, the number of examples reaching
### the decision point is listed

m.rpart

### a more detailed summary of the tree's fit

summary(m.rpart)

### visualizing the decision tree
rpart.plot(m.rpart, digits = 3)



rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE,
           type = 3, extra = 101)


 
### evaluating model performance
p.rpart <- predict(m.rpart, wine_test)

### model is not correctly identifying the extreme cases
summary(p.rpart)
summary(wine_test$quality)

### correlation between actual and predicted quality
cor(p.rpart, wine_test$quality)

### Measuring performance with the mean absolute error

MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

### the MAE for our predictions (the argument order does not
### matter here, since the absolute difference is symmetric)

MAE(p.rpart, wine_test$quality)

### the mean quality rating in the training data
mean(wine_train$quality)

### If we predicted the value 5.87 for every wine sample,
### we would have a mean absolute error of only about 0.67:

MAE(5.87, wine_test$quality)
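
### rather than hard-coding 5.87, the training-set mean can be
### passed directly; this is equivalent and avoids rounding
MAE(mean(wine_train$quality), wine_test$quality)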

### improving model performance?

### model tree
### a model tree improves on a regression tree by replacing the
### leaf nodes with regression models
### we use the M5' (M5-prime) algorithm via M5P() from RWeka

m.m5p <- M5P(quality ~ ., data = wine_train)

### examine the tree and linear models

m.m5p
summary(m.m5p)

### predicted values and performance
p.m5p <- predict(m.m5p, wine_test)

summary(p.m5p)

cor(p.m5p, wine_test$quality)

MAE(wine_test$quality, p.m5p)
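
### collect both models' metrics side by side for an easier
### comparison (a small sketch, not part of the original exercise)
data.frame(model = c("regression tree", "model tree"),
           correlation = c(cor(p.rpart, wine_test$quality),
                           cor(p.m5p, wine_test$quality)),
           MAE = c(MAE(wine_test$quality, p.rpart),
                   MAE(wine_test$quality, p.m5p)))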



You can get the example and the dataset at:
https://github.com/pakinja/Data-R-Value