We will work through a classification exercise using the Naive Bayes algorithm. The exercise was originally published in "Machine Learning with R" by Brett Lantz, Packt Publishing, 2015 (Open Source: Community Experience Distilled).
Naive Bayes is a
probabilistic classification algorithm that can be applied to problems
such as text classification (for example, spam filtering), detection of
intrusions or anomalies in networks, and diagnosis of medical conditions
given a set of symptoms, among others.
The exercise is about classifying SMS messages as spam or ham (legitimate messages).
We will carry out the exercise verbatim as published in the reference above.
To develop the Naive Bayes classifier, we will use data adapted from the SMS Spam Collection at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ .
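To make the probabilistic idea concrete before touching the real data, here is a minimal sketch of Bayes' theorem applied to a single word. All counts are invented for illustration and do not come from the SMS Spam Collection.

```r
### hypothetical counts, invented for illustration only
n_spam <- 20                 # spam messages
n_ham  <- 80                 # ham messages
n_spam_with_word <- 4        # spam messages containing the word "free"
n_ham_with_word  <- 1        # ham messages containing the word "free"

### prior probabilities of each class
p_spam <- n_spam / (n_spam + n_ham)
p_ham  <- 1 - p_spam

### likelihood of seeing the word in each class
p_word_given_spam <- n_spam_with_word / n_spam
p_word_given_ham  <- n_ham_with_word / n_ham

### Bayes' theorem: P(spam | word)
posterior_spam <- (p_word_given_spam * p_spam) /
  (p_word_given_spam * p_spam + p_word_given_ham * p_ham)
print(posterior_spam)
```

With these made-up numbers, a message containing "free" is far more likely to be spam than the prior alone would suggest.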
### install required packages
install.packages("tm")
install.packages("NLP")
install.packages("SnowballC")
install.packages("wordcloud")
install.packages("e1071")
install.packages("RColorBrewer")
install.packages("gmodels")
library(tm)
library(NLP)
library(SnowballC)
library(RColorBrewer)
library(wordcloud)
library(e1071)
library(gmodels)
### read the data
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
### transform type into factor (ham/spam)
sms_raw$type <- factor(sms_raw$type)
### see frequencies
table(sms_raw$type)
### create a volatile (stored in memory) corpus
### VCorpus creates a complex list, so we can use list manipulation
### commands to manage it
sms_corpus <- VCorpus(VectorSource(sms_raw$text))
### inspect the first two elements
inspect(sms_corpus[1:2])
### see the first sms (as text)
as.character(sms_corpus[[1]])
### to view multiple documents
lapply(sms_corpus[1:2], as.character)
### clean the text in the documents (corpus)
### tm_map applies a transformation to every document in the corpus
### content_transformer wraps a plain function so it can work on corpus documents
### tolower converts all strings to lowercase
sms_corpus_clean <- tm_map(sms_corpus,
                           content_transformer(tolower))
### check and compare the result of the cleaning
as.character(sms_corpus[[1]])
as.character(sms_corpus_clean[[1]])
### remove numbers
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
### remove "stop words"
### see the "stop words" list
stopwords()
sms_corpus_clean <- tm_map(sms_corpus_clean,
                           removeWords, stopwords())
### remove punctuation
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
### stemming (reduce words to their root form)
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
### remove additional whitespace
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)
### check and compare result of cleaning
as.character(sms_corpus[[1]])
as.character(sms_corpus_clean[[1]])
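To see what the cleaning steps do to a single message, the following is a rough base R sketch that imitates them on one string. It is not how tm implements these transformations: stemming is left out, the two-word stop list is a hypothetical stand-in for stopwords(), and clean_sms is a name made up for this illustration.

```r
### hypothetical helper that imitates the cleaning steps with base R only
clean_sms <- function(x, stop_words = c("your", "now")) {
  x <- tolower(x)                                   # lowercase
  x <- gsub("[[:digit:]]+", "", x)                  # remove numbers
  stop_pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(stop_pattern, "", x, perl = TRUE)       # remove stop words
  x <- gsub("[[:punct:]]+", "", x)                  # remove punctuation
  x <- gsub("\\s+", " ", x)                         # collapse repeated whitespace
  trimws(x)                                         # also trim the ends (an extra step)
}

cleaned <- clean_sms("WINNER!! Call 0800 now,  claim your PRIZE.")
print(cleaned)
```

The order of the steps matters: stop words are removed before punctuation here, as in the script above, so that they are still separated by word boundaries when they are matched.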
### tokenization
### document term matrix (DTM)
### rows are sms and columns are words
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
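As an illustration of what a document term matrix contains, here is a small base R sketch that builds one by hand for three made-up messages. The real DocumentTermMatrix object is stored as a sparse matrix, so this dense toy version only shows the layout (rows are messages, columns are words, cells are counts).

```r
### three made-up, already cleaned messages
docs <- c("free prize call now", "call me later", "free free entry")

### tokenization: split each message into words
tokens <- strsplit(docs, " ", fixed = TRUE)

### vocabulary: every distinct word, sorted
vocab <- sort(unique(unlist(tokens)))

### count how often each vocabulary word occurs in each message
toy_dtm <- t(vapply(tokens,
                    function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
                    integer(length(vocab))))
dimnames(toy_dtm) <- list(paste0("sms", seq_along(docs)), vocab)
print(toy_dtm)
```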
### data preparation (training and test sets)
### 75% train 25% test
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]
### labels for train and test sets
### feature to be classified
sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels <- sms_raw[4170:5559, ]$type
### confirm that the subsets are representative of the
### complete set of SMS data
prop.table(table(sms_train_labels))
prop.table(table(sms_test_labels))
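Taking the first 4169 rows for training only gives representative subsets if the messages are stored in random order. If that cannot be assumed, one alternative is to draw the training rows at random. The sketch below only shows the index logic; the seed value is arbitrary and the variable names are ours, not part of the original exercise.

```r
### sketch of a random 75/25 split (an alternative to the sequential split)
set.seed(123)                       # arbitrary seed, only for reproducibility
n_messages <- 5559                  # number of rows used in the script above
train_idx <- sample(n_messages, size = 4169)
test_idx  <- setdiff(seq_len(n_messages), train_idx)

### these indices could then replace 1:4169 and 4170:5559, for example
### sms_dtm[train_idx, ] and sms_raw[train_idx, ]$type
length(train_idx)
length(test_idx)
```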
### wordcloud
wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)
### separated wordclouds
spam <- subset(sms_raw, type == "spam")
ham <- subset(sms_raw, type == "ham")
wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))
### creating indicator features for frequent words
findFreqTerms(sms_dtm_train, 5)
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
str(sms_freq_words)
### filter DTM by frequent terms
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]
### convert DTM counts to a categorical "Yes"/"No" indicator
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2,
                   convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2,
                  convert_counts)
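To check what this conversion produces, the following sketch applies the same function to a tiny made-up count matrix. The function is repeated here only so the example runs on its own; the toy data is invented.

```r
### same conversion as above, repeated so the example is self-contained
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

### made-up counts: 3 messages, 2 words
toy_counts <- matrix(c(0, 2, 1,
                       0, 0, 3),
                     nrow = 3,
                     dimnames = list(paste0("sms", 1:3), c("call", "free")))

### MARGIN = 2 applies the function column by column (word by word)
toy_flags <- apply(toy_counts, MARGIN = 2, convert_counts)
print(toy_flags)
```

The result is a character matrix of the same shape, where any count above zero becomes "Yes" regardless of how many times the word appears.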
### training a model on the data
### an alternative package for Naive Bayes is "klaR"
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
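The "naive" part of the algorithm is the assumption that words occur independently of each other within a class, so the per-word likelihoods can simply be multiplied. Here is a hand-computed sketch for a message containing two words. All probabilities are hypothetical, and only two words are considered, whereas the fitted model uses every column of the training matrix, including the words that are absent.

```r
### hypothetical class priors
prior <- c(ham = 0.8, spam = 0.2)

### hypothetical P(word present | class)
lik <- rbind(free = c(ham = 0.05, spam = 0.40),
             call = c(ham = 0.20, spam = 0.50))

### unnormalized score for a message that contains both "free" and "call"
score <- prior * lik["free", ] * lik["call", ]

### normalize so that the class probabilities sum to 1
posterior <- score / sum(score)
print(posterior)
names(which.max(posterior))
```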
### evaluating model performance
### make predictions on test set
sms_test_pred <- predict(sms_classifier, sms_test)
### compare predictions (classifications) with the
### true values
CrossTable(sms_test_pred, sms_test_labels,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))
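The cross table can also be summarized with a few numbers. The sketch below computes accuracy, precision and recall for the spam class with base R, using short made-up label vectors rather than the real predictions.

```r
### made-up labels, for illustration only
actual    <- factor(c("ham", "ham", "ham", "spam", "spam", "ham", "spam", "ham"),
                    levels = c("ham", "spam"))
predicted <- factor(c("ham", "ham", "ham", "spam", "ham", "ham", "spam", "spam"),
                    levels = c("ham", "spam"))

### confusion matrix: rows are predictions, columns are true values
cm <- table(predicted, actual)

accuracy  <- sum(diag(cm)) / sum(cm)                  # share of correct predictions
precision <- cm["spam", "spam"] / sum(cm["spam", ])   # predicted spam that is really spam
recall    <- cm["spam", "spam"] / sum(cm[, "spam"])   # real spam that was detected
print(c(accuracy = accuracy, precision = precision, recall = recall))
```

For a spam filter, precision matters a lot in practice, because a ham message wrongly flagged as spam may never be read.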
### improving model performance?
### set Laplace estimator to 1
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels,
                              laplace = 1)
sms_test_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_test_pred2, sms_test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))
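The reason for the Laplace estimator is that a word never seen in one class during training gets a conditional probability of zero, which wipes out the whole product for that class. The sketch below shows the usual add-one formula on made-up counts. We assume it corresponds to what the laplace argument does for a two-level "Yes"/"No" feature, but it is an illustration and not the package's code.

```r
### made-up counts: a word that never appears in the spam training messages
n_spam           <- 20   # spam messages in the training set
n_spam_with_word <- 0    # spam messages containing the word
laplace          <- 1    # value added to every count
n_levels         <- 2    # the feature has two levels: "Yes" and "No"

### without smoothing the estimate is exactly zero
p_raw <- n_spam_with_word / n_spam

### with smoothing every level gets a small non-zero probability
p_smoothed <- (n_spam_with_word + laplace) / (n_spam + laplace * n_levels)
print(c(raw = p_raw, smoothed = p_smoothed))
```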
You can get the exercise and the dataset at:
https://github.com/pakinja/Data-R-Value
This is a blog dedicated to the R programming language and to the field of data science. Here you will find scripts, hints, algorithms, examples, etc. It will also publish data analyses that contribute to the improvement of societies and of people's quality of life.
Friday, April 28, 2017
Machine Learning Classification Using Naive Bayes