Friday, April 28, 2017

Machine Learning Classification Using Naive Bayes

We will develop a classification exercise using the Naive Bayes algorithm. The exercise was originally published in "Machine Learning in R" by Brett Lantz, Packt Publishing, 2015 (open source community experience distilled).

Naive Bayes is a probabilistic classification algorithm that can be applied to problems such as text classification for spam filtering, detection of intrusions or network anomalies, and diagnosis of medical conditions given a set of symptoms, among others.
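
As a quick, hand-computed illustration (all numbers below are made up), Bayes' rule gives the probability that a message is spam given that it contains a particular word:

### toy example of Bayes' rule (hypothetical numbers)
### P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam      <- 0.20   # prior probability that a message is spam
p_word_spam <- 0.40   # probability the word appears in a spam message
p_word_ham  <- 0.05   # probability the word appears in a ham message
p_word      <- p_word_spam * p_spam + p_word_ham * (1 - p_spam)
p_word_spam * p_spam / p_word   # posterior P(spam | word), about 0.67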

The exercise we will develop is about filtering SMS messages into spam and ham.

We will carry out the exercise verbatim as published in the aforementioned reference.

To develop the Naive Bayes classifier, we will use data adapted from the SMS Spam Collection at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ .

### install required packages

install.packages("tm")
install.packages("NLP")
install.packages("SnowballC")
install.packages("wordcloud")
install.packages("e1071")
install.packages("RColorBrewer")
install.packages("gmodels")

library(tm)
library(NLP)
library(SnowballC)
library(RColorBrewer)
library(wordcloud)
library(e1071)
library(gmodels)

### read the data
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)

### transform type into factor (ham/spam)
sms_raw$type <- factor(sms_raw$type)

### see frequencies
table(sms_raw$type)

### create a volatile (stored in memory) corpus
### VCorpus creates a complex list, so we can use list manipulation
### commands to manage it

sms_corpus <- VCorpus(VectorSource(sms_raw$text))

### inspect the first two elements
inspect(sms_corpus[1:2])

### see the first SMS (as text)
as.character(sms_corpus[[1]])

### to view multiple documents
lapply(sms_corpus[1:2], as.character)

### cleaning the text in documents (corpus)

### tm_map applies a transformation to every document in the corpus
### content_transformer wraps a function so it can access the content
### tolower to lowercase all strings

sms_corpus_clean <- tm_map(sms_corpus,
                           content_transformer(tolower))
### check and compare the result of the cleaning
as.character(sms_corpus[[1]])
as.character(sms_corpus_clean[[1]])

### remove numbers
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)

### remove "stop words"
### see the "stop words" list

stopwords()

sms_corpus_clean <- tm_map(sms_corpus_clean,
                           removeWords, stopwords())
### remove punctuation
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)

### stemming (transform words into their root form)
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
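
### the effect of stemming on individual words can be previewed with
### SnowballC's wordStem (a quick check, not part of the pipeline):
wordStem(c("learn", "learned", "learning", "learns"))
### returns "learn" "learn" "learn" "learn"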

### remove additional whitespace
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)

### check and compare result of cleaning

as.character(sms_corpus[[1]])
as.character(sms_corpus_clean[[1]])

### tokenization
### document term matrix (DTM)
### rows are SMS messages and columns are words

sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
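
### a quick sanity check on the DTM (a sparse matrix):
### dim() gives messages x terms; inspect() previews a corner of it
dim(sms_dtm)
inspect(sms_dtm[1:5, 1:10])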

### data preparation (training and test sets)
### 75% train 25% test

sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]

### labels for the train and test sets
### (the target variable to be predicted)

sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels <- sms_raw[4170:5559, ]$type

### confirm that the subsets are representative of the
### complete set of SMS data

prop.table(table(sms_train_labels))
prop.table(table(sms_test_labels))
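
### the messages in this dataset already come in random order, which is
### why the sequential split above is representative; if they did not,
### a random split could be drawn instead (a sketch with hypothetical
### names, using the same 75/25 proportion):
set.seed(123)
train_idx <- sample(nrow(sms_dtm), round(0.75 * nrow(sms_dtm)))
sms_dtm_train_r    <- sms_dtm[train_idx, ]
sms_dtm_test_r     <- sms_dtm[-train_idx, ]
sms_train_labels_r <- sms_raw$type[train_idx]
sms_test_labels_r  <- sms_raw$type[-train_idx]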

### wordcloud
wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)

### separated wordclouds
spam <- subset(sms_raw, type == "spam")
ham <- subset(sms_raw, type == "ham")

wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))

### creating indicator features for frequent words
findFreqTerms(sms_dtm_train, 5)
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
str(sms_freq_words)

### filter DTM by frequent terms
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]

### convert the DTM counts to categorical ("Yes"/"No") indicators
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

sms_train <- apply(sms_dtm_freq_train, MARGIN = 2,
                   convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2,
                  convert_counts)

### training a model on the data
### an alternative package for Naive Bayes is "klaR" (sketched below)

sms_classifier <- naiveBayes(sms_train, sms_train_labels)
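
### a rough sketch of the klaR alternative mentioned above (shown for
### reference, not run here): klaR's NaiveBayes() expects a data frame
### of factors rather than a character matrix
### library(klaR)
### sms_train_df <- as.data.frame(sms_train, stringsAsFactors = TRUE)
### sms_classifier_klaR <- NaiveBayes(sms_train_df, sms_train_labels)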

### evaluating model performance
### make predictions on test set

sms_test_pred <- predict(sms_classifier, sms_test)

### compare predictions (classifications) with the
### true values

CrossTable(sms_test_pred, sms_test_labels,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))

### improving model performance?
### set the Laplace estimator to 1 so that a word appearing in zero
### training messages of one class cannot veto the classification on
### its own

sms_classifier2 <- naiveBayes(sms_train, sms_train_labels,
                              laplace = 1)

sms_test_pred2 <- predict(sms_classifier2, sms_test)

CrossTable(sms_test_pred2, sms_test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))


You can get the exercise and the dataset at:
https://github.com/pakinja/Data-R-Value 

Tuesday, April 25, 2017

Classification Using Nearest Neighbors (k-NN)

We will develop a well-known k-NN exercise originally published in "Machine Learning in R" by Brett Lantz, Packt Publishing, 2015 (open source community experience distilled).

k-nearest neighbors is a classification algorithm and perhaps one of the simplest machine learning algorithms: a new observation is assigned the majority class among its k closest labeled training points.
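
A minimal hand-rolled illustration of that idea, with made-up 2-D data:

### toy k-NN by hand (made-up data, k = 3)
train_x <- matrix(c(1, 1,
                    1, 2,
                    5, 5,
                    6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c("A", "A", "B", "B"))
query   <- c(1.5, 1.5)

### Euclidean distance from the query to every training point
d <- sqrt(rowSums(sweep(train_x, 2, query)^2))

### majority vote among the k = 3 nearest neighbors
k <- 3
votes <- train_y[order(d)[1:k]]
names(which.max(table(votes)))   # "A"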

The exercise we will develop is: "diagnosing breast cancer with the k-NN algorithm", and we will use the Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml.

We will carry out the exercise verbatim as published in the aforementioned reference.


### libraries
library("class")
library("gmodels")

### import data

wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

### explore the data structure

str(wbcd)


### remove the id column, which will not be needed

wbcd <- wbcd[-1]


### explore how many records are classified as benign and how many
### as malignant
table(wbcd$diagnosis)

### set the levels of the diagnostic variable

wbcd$diagnosis<- factor(wbcd$diagnosis, levels = c("B", "M"),
                        labels = c("Benign", "Malignant"))
### explore the percentages of the diagnoses

round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)

### explore some other variables
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])

### define normalizing function
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
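
### a quick check that the function behaves as intended: both vectors
### rescale to the same 0-1 range
normalize(c(1, 2, 3, 4, 5))       # 0.00 0.25 0.50 0.75 1.00
normalize(c(10, 20, 30, 40, 50))  # identical result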

### normalize the dataset
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))

summary(wbcd_n$area_mean)

### data preparation (training and test sets)
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]

wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]

### training and using the model: knn() classifies each test example
### directly; k = 21 is roughly the square root of the 469 training
### examples, and an odd k avoids tied votes
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)

### evaluating the model
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq=FALSE)



### improving model performance?

### the exercise is repeated using z-score standardization to see
### whether the accuracy of the algorithm improves

### Z-score
wbcd_z <- as.data.frame(scale(wbcd[-1]))

summary(wbcd_z$area_mean)
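
### scale() with its default settings computes (x - mean(x)) / sd(x)
### column by column; a manual equivalent (hypothetical name) is:
wbcd_z_check <- as.data.frame(lapply(wbcd[-1],
                                     function(x) (x - mean(x)) / sd(x)))
all.equal(wbcd_z_check$area_mean, wbcd_z$area_mean)   # TRUE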

### data preparation (training and test sets)
wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]


wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]


wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)



With z-score standardization, no improvement in model accuracy is observed.

You can get the exercise and the dataset at:
https://github.com/pakinja/Data-R-Value