Tuesday, April 25, 2017

Classification Using Nearest Neighbors (k-NN)

We will work through a well-known k-NN exercise originally published in "Machine Learning with R" by Brett Lantz (Packt Publishing, 2015; "Community Experience Distilled" series).

k-nearest neighbors (k-NN) is a classification algorithm and perhaps one of the simplest machine learning algorithms: it labels a new example with the majority class among its k closest training examples.
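
To make the idea concrete, here is a minimal base R sketch of the nearest-neighbor vote. It is only an illustration, not the implementation in the "class" package used below; the helper name knn_predict_one and the toy data are made up for this example, and Euclidean distance is assumed.

```r
# Minimal k-NN sketch in base R (illustrative only).
# knn_predict_one is a hypothetical helper, not part of any package.
knn_predict_one <- function(train, labels, query, k = 3) {
  # Euclidean distance from the query point to every training row
  dists <- sqrt(rowSums(sweep(train, 2, query)^2))
  # Labels of the k closest training rows
  nearest <- labels[order(dists)[1:k]]
  # Majority vote among those labels
  names(which.max(table(nearest)))
}

# Toy data (made up): two features, two well-separated classes
train <- matrix(c(1, 1,
                  1, 2,
                  2, 1,
                  8, 8,
                  8, 9,
                  9, 8), ncol = 2, byrow = TRUE)
labels <- c("Benign", "Benign", "Benign",
            "Malignant", "Malignant", "Malignant")

pred_low  <- knn_predict_one(train, labels, query = c(1.5, 1.5), k = 3)
pred_high <- knn_predict_one(train, labels, query = c(8.5, 8.5), k = 3)
print(c(pred_low, pred_high))
```

Because the vote depends on raw distances, features measured on larger scales dominate the result, which is why the exercise rescales the data before calling knn().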

The exercise is "diagnosing breast cancer with the k-NN algorithm", and it uses the Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml.

We will carry out the exercise verbatim as published in the aforementioned reference.


### libraries
library("class")
library("gmodels")

### import data

wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

### explore data structure

str(wbcd)


### remove the id column that will not be needed

wbcd <- wbcd[-1]


### explore how many records are classified as benign and how many as malignant
table(wbcd$diagnosis)

### recode the diagnosis variable as a factor with descriptive labels

wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))
### explore the percentages of the diagnoses

round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
 

### explore some other variables
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])

### define normalizing function
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
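
### A quick sanity check of min-max normalization on made-up vectors.
### The function is repeated here only so the check runs on its own;
### both inputs should map to the same values between 0 and 1.

```r
# Same min-max formula as above, repeated so this block is self-contained
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Two made-up vectors that differ only by a factor of 10
a <- normalize(c(1, 2, 3, 4, 5))
b <- normalize(c(10, 20, 30, 40, 50))

print(a)                # rescaled to the range 0..1
print(all.equal(a, b))  # the scale of the input no longer matters
```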
 

### normalize the dataset
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
 

summary(wbcd_n$area_mean)

### data preparation (training and test sets)
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]

wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
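
### The sequential split above only gives a fair test set if the rows are
### not ordered by diagnosis. If that were in doubt, a random split could
### be used instead. A sketch on a made-up data frame (toy, the seed and
### the test size are assumptions for illustration):

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Made-up data frame standing in for the real dataset
toy <- data.frame(id = 1:10, value = rnorm(10))

# Draw 3 row indices at random for the test set; the rest is training data
test_idx  <- sample(nrow(toy), size = 3)
toy_train <- toy[-test_idx, ]
toy_test  <- toy[test_idx, ]

print(nrow(toy_train))
print(nrow(toy_test))
```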

### training the model
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)

### evaluating the model
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)
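
### The cross table can be summarized as a single accuracy figure.
### A sketch with made-up label vectors (actual and predicted are
### hypothetical stand-ins for wbcd_test_labels and wbcd_test_pred):

```r
lv <- c("Benign", "Malignant")

# Made-up true labels and predictions, 5 cases
actual    <- factor(c("Benign", "Benign", "Malignant", "Malignant", "Benign"),
                    levels = lv)
predicted <- factor(c("Benign", "Benign", "Malignant", "Benign", "Benign"),
                    levels = lv)

# Confusion matrix: rows are true labels, columns are predictions
conf <- table(actual, predicted)

# Accuracy is the share of cases on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)

# Malignant cases predicted as Benign (false negatives)
false_negatives <- conf["Malignant", "Benign"]

print(conf)
print(accuracy)
```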



### improving model performance?

### repeat the exercise using z-score standardization to see whether the accuracy of the algorithm improves

### Z-score
wbcd_z <- as.data.frame(scale(wbcd[-1]))

summary(wbcd_z$area_mean)
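
### What scale() computes for each column: subtract the mean and divide
### by the standard deviation. A sketch on a made-up vector:

```r
# Made-up values
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score via scale() and via the explicit formula
z_scale  <- as.vector(scale(x))
z_manual <- (x - mean(x)) / sd(x)

print(all.equal(z_scale, z_manual))
print(round(mean(z_scale), 10))  # centered on zero
print(sd(z_scale))               # unit standard deviation
```

### Unlike min-max normalization, z-scores have no fixed minimum or
### maximum, so extreme values are not compressed toward the center.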

### data preparation (training and test sets)
wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]


wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]


wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)



With z-score standardization, no improvement in model accuracy is observed.
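
A natural follow-up, not part of the script above, is to check how accuracy changes with k. The sketch below assumes the built-in iris data as a stand-in for wbcd so that it runs on its own; the seed, the split size, and the k values are arbitrary choices.

```r
library(class)

set.seed(42)  # arbitrary seed, only for reproducibility

# Random split of the built-in iris data (stand-in for wbcd)
idx <- sample(nrow(iris), 100)

# Standardize the training features, then apply the same
# center and scale to the test features
train <- scale(iris[idx, 1:4])
test  <- scale(iris[-idx, 1:4],
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))

train_labels <- iris$Species[idx]
test_labels  <- iris$Species[-idx]

# Accuracy on the test set for several values of k
ks  <- c(1, 5, 11, 21)
acc <- sapply(ks, function(k) {
  pred <- knn(train = train, test = test, cl = train_labels, k = k)
  mean(pred == test_labels)
})
names(acc) <- ks
print(acc)
```

The same loop could be run on wbcd_train and wbcd_test by swapping in those objects and their label vectors.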

You can get the exercise and the dataset at:
https://github.com/pakinja/Data-R-Value