*"Machine Learning in R"*by Brett Lantz, PACKT publishing 2015 (open source community experience destilled).

K-nearest neighbors (k-NN) is perhaps one of the simplest machine learning classification algorithms: a new observation is assigned the majority class among its k closest training examples.
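To make the voting rule concrete, here is a minimal base-R sketch (an illustration only; the exercise itself uses `knn()` from the `class` package):

```r
# Toy sketch of the k-NN idea: classify a query point by majority
# vote among the labels of its k nearest training points.
knn_sketch <- function(train, labels, query, k = 3) {
  # Euclidean distance from the query to every training row
  dists <- sqrt(rowSums((train - matrix(query, nrow(train), ncol(train),
                                        byrow = TRUE))^2))
  nearest <- order(dists)[1:k]          # indices of the k closest rows
  names(which.max(table(labels[nearest])))  # majority label
}

train  <- rbind(c(0, 0), c(0, 1), c(5, 5), c(5, 6))
labels <- c("A", "A", "B", "B")
knn_sketch(train, labels, c(4.5, 5.2), k = 3)  # predicts "B"
```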

The exercise we will develop is: "diagnosing breast cancer with the k-NN algorithm", and we will use the Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml.

We will carry out the exercise verbatim as published in the aforementioned reference.

**### libraries**

library("class")

library("gmodels")

**### import data**

wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

**### explore data structure**

str(wbcd)

**### remove the id column that will not be needed**

wbcd <- wbcd[-1]

**### explore how many records are classified as benign and how many as malignant**

table(wbcd$diagnosis)

**### set the levels of the diagnosis variable**

wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),

labels = c("Benign", "Malignant"))

**### explore the percentages of the diagnoses**

round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)

**### explore some other variables**

summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])

**### define normalizing function**

normalize <- function(x) {

return ((x - min(x)) / (max(x) - min(x)))

}
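As a quick sanity check (not part of the book's code), the function maps any numeric vector onto the [0, 1] range, so variables measured on very different scales become directly comparable:

```r
# Min-max normalization: rescales a numeric vector to [0, 1]
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

normalize(c(1, 2, 3, 4, 5))       # 0.00 0.25 0.50 0.75 1.00
normalize(c(10, 20, 30, 40, 50))  # same result: only relative position matters
```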

**### normalize the dataset**

wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))

summary(wbcd_n$area_mean)

**### data preparation (training and test sets)**

wbcd_train <- wbcd_n[1:469, ]

wbcd_test <- wbcd_n[470:569, ]

wbcd_train_labels <- wbcd[1:469, 1]

wbcd_test_labels <- wbcd[470:569, 1]

**### training the model**

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,

cl = wbcd_train_labels, k = 21)

**### evaluating the model**

CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,

prop.chisq=FALSE)

**### improving model performance?**

**### the exercise will be repeated using z-score standardization to see if the accuracy of the algorithm improves**

**### Z-score**

wbcd_z <- as.data.frame(scale(wbcd[-1]))

summary(wbcd_z$area_mean)

**### data preparation (training and test sets)**

wbcd_train <- wbcd_z[1:469, ]

wbcd_test <- wbcd_z[470:569, ]

wbcd_train_labels <- wbcd[1:469, 1]

wbcd_test_labels <- wbcd[470:569, 1]

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,

cl = wbcd_train_labels, k = 21)

CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,

prop.chisq = FALSE)

Using z-score standardization, no improvement in model accuracy is observed.
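Another knob worth turning (a sketch of an extension, not part of the verbatim exercise) is the value of k itself: repeating the prediction over several k values and counting test-set errors shows how sensitive the result is to this choice. The loop below illustrates the pattern on simulated two-class data; with the real data you would substitute wbcd_train, wbcd_test, and the corresponding label vectors:

```r
# Sketch: compare test-set errors across several k values.
# Simulated two-class data keeps the example self-contained.
library(class)
set.seed(123)
n <- 200
x <- rbind(matrix(rnorm(n, mean = 0), ncol = 2),   # class 1 cluster
           matrix(rnorm(n, mean = 2), ncol = 2))   # class 2 cluster
y <- factor(rep(c("Benign", "Malignant"), each = n / 2))
idx <- sample(nrow(x), 150)                        # 150 train / 50 test split

for (k in c(1, 5, 11, 15, 21, 27)) {
  pred <- knn(train = x[idx, ], test = x[-idx, ], cl = y[idx], k = k)
  cat("k =", k, " errors =", sum(pred != y[-idx]), "\n")
}
```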

You can get the exercise and the dataset at:

https://github.com/pakinja/Data-R-Value