Saturday, May 6, 2017

MIT Step by Step Instructions for Creating Your Own R Package

We will outline the steps to create your own R package using RStudio, and we will provide the link to download the complete MIT guide: "Instructions for Creating Your Own R Package" by In Song Kim, Phil Martin, and Nina McMurry, February 23, 2016.


1. Start by opening a new .R file. Make sure your workspace is clear by typing rm(list = ls()). Check that it is empty using ls() (you should see character(0)).
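In the console, that sequence looks like this:

rm(list = ls())  # remove all objects from the workspace
ls()             # should print character(0)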

2. Write the code for your functions in this .R file. You can create one file with all of your functions or a separate file for each function. Save these files somewhere you can easily find them.
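As a minimal sketch, a source file for this step might contain a single function (the file and function names here are hypothetical):

### contents of add_two.R (hypothetical example)
add_two <- function(x, y) {
  # return the sum of two numeric inputs
  x + y
}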




3. Install the ‘devtools’ package (install.packages(‘devtools’)).

4. Open a new project in RStudio. Go to the ‘File’ menu and click on ‘New Project.’ Then select ‘New Directory,’ and ‘R Package’ to create a new R package.




5. Type the name of your package, then upload the .R file you created in step 1 under ‘Create package based on source files’. Click ‘Create project.’




6. On the lower right hand side of your screen, you should see a file directory. The ‘R’ folder contains the code for your functions. The ‘man’ folder will contain the help files for each function in your package. Depending on your version of RStudio, the help files may have been generated automatically as .Rd or “R documentation” files when you created your package.
If the ‘man’ folder already contains .Rd files, open each file, add a title under the ‘title’ heading, and save (if not, see step 7). You can go back and edit the content later, but you will need to add a title to each .Rd file in order to compile your package.





7. If your ‘man’ folder is empty, you will have to manually create a .Rd file for each function. To do this, go to File > New File > R Documentation, enter the title of the function and select ‘Function’ under the ‘Rd template’ menu. Edit your new file to include something in the ‘title’ field (again, you may make other edits now or go back and make edits later, but your package will not compile if the ‘title’ field is empty). Save each .Rd file in the ‘man’ folder; a minimal example appears after the note below.
NOTE: You will need to complete this step if you add more functions to your package at a later point, even if RStudio automatically generated R documentation files when you initially created the package.
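For reference, a minimal .Rd skeleton for the hypothetical add_two() function from step 2 might look like this (RStudio's Rd template includes more optional sections):

\name{add_two}
\alias{add_two}
\title{Add Two Numbers}
\description{Returns the sum of two numeric inputs.}
\usage{add_two(x, y)}
\arguments{
  \item{x}{A numeric value.}
  \item{y}{A numeric value.}
}
\examples{add_two(2, 3)}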





8. Now you are ready to compile your package. Go to ‘Build’ on the top toolbar and select ‘Build and Reload’ (note you can also use the keyboard shortcut Ctrl+Shift+B). If this works, your package will automatically load and you will see library(mynewpackage) at the bottom of your console. Test your functions to make sure they work.
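For example, with the hypothetical package and function from the earlier steps:

library(mynewpackage)  # loaded automatically after ‘Build and Reload’
add_two(2, 3)          # should return 5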




9. Go back and edit the documentation (the help file) for each function. Open each .Rd file, add a brief description of the function, define its arguments and, if applicable, its return values, and include at least one example. Then, re-compile your package and test your documentation in the R console (?myfun). NOTE: You will need to re-compile (repeating step 8) each time you make changes to your functions or documentation.




10. Once you have finished creating your functions and documentation, compiled your package, and double checked that the functions and help files work, copy the entire folder containing your package to the Dropbox folder with your name on it.


Download the complete guide at:
https://github.com/pakinja/Data-R-Value

Thursday, May 4, 2017

Machine Learning. Forecasting Numeric Data with Multiple Linear Regression (Medical Expenses)

We will develop a forecasting example using multiple linear regression. The exercise was originally published in "Machine Learning with R" by Brett Lantz, Packt Publishing, 2015 (Community Experience Distilled series).

The example we will develop is about predicting medical expenses.




We will carry out the exercise essentially as published in the aforementioned reference, adding some simple transformations to the dataset.

For more details on the linear regression algorithm, it is recommended to check the aforementioned reference or any other text of your choice.

### install and load required packages
#install.packages("psych")
library(psych)

### read and explore the data
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
str(insurance)

### the model's dependent variable: $expenses
### rename $charges to $expenses

colnames(insurance)[7] <- "expenses"
summary(insurance$expenses)
hist(insurance$expenses, main = "Insurance Expenses", col = "red",
     xlab = "Expenses (USD)")

### explore $region
table(insurance$region)

### exploring relationships among features
### correlation matrix

cor(insurance[c("age", "bmi", "children", "expenses")])





### visualizing relationships among features
### scatterplot matrix

pairs(insurance[c("age", "bmi", "children", "expenses")])





### scatterplots, distributions and correlations
pairs.panels(insurance[c("age", "bmi", "children", "expenses")])


### training a model on the data
ins_model <- lm(expenses ~ age + children + bmi + sex +
                  smoker + region, data = insurance)
### this does the same
#ins_model <- lm(expenses ~ ., data = insurance)

### explore model parameters

ins_model




### evaluating model performance
summary(ins_model)




The model explains 74.9% of the variation in the dependent variable (adjusted R-squared: 0.7494).

### improving model performance

### adding non-linear relationships
### adding second order term on $age

insurance$age2 <- insurance$age^2

### converting a numeric variable to a binary indicator
### the $bmi feature may only have an impact above a certain value

insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

### putting it all together
### improved regression model

ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex +
                   bmi30*smoker + region, data = insurance)

summary(ins_model2)
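Note that bmi30*smoker is R formula shorthand for both main effects plus their interaction, so the model above is equivalent to writing the terms out explicitly (ins_model2b is just an illustrative name):

### equivalent model with the interaction written out
ins_model2b <- lm(expenses ~ age + age2 + children + bmi + sex +
                    bmi30 + smoker + bmi30:smoker + region,
                  data = insurance)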



The model has improved: it now explains 86.5% of the variation in the dependent variable.
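As a quick supplementary check (not part of the original exercise), the improved model's fitted values can be compared against the actual expenses:

### correlation between fitted and actual values
cor(predict(ins_model2), insurance$expenses)

### fitted vs. actual, with the ideal y = x line in red
plot(predict(ins_model2), insurance$expenses,
     xlab = "Predicted Expenses (USD)", ylab = "Actual Expenses (USD)")
abline(a = 0, b = 1, col = "red", lwd = 2)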

You can get the example and the dataset in:
https://github.com/pakinja/Data-R-Value 




Wednesday, May 3, 2017

Machine Learning Classification with 1R and RIPPER Rule Learners (Edible/Poisonous Mushrooms)

We will develop a classification example using the 1R and RIPPER rule learner algorithms. The exercise was originally published in "Machine Learning with R" by Brett Lantz, Packt Publishing, 2015 (Community Experience Distilled series).

The example we will develop is about classifying edible and poisonous mushrooms.


We will carry out the exercise essentially as published in the aforementioned reference, adding some simple transformations to the dataset.

For more details on the algorithms, it is recommended to check the aforementioned reference.


### install and load required packages
#install.packages("RWeka")

library(RWeka)

### read and explore the data
mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)
str(mushrooms)

### drop the $veil_type column (it does not
### provide useful information)

mushrooms$veil_type <- NULL

### look at the distribution of the mushroom
### type (class) variable

table(mushrooms$type)

### data transformation

### the function below transforms the $type column

transformType <- function(key){
  switch (as.character(key),
          'p' = 'poisonous',
          'e' = 'edible'
  )
}

### the function below transforms the $odor column
transformOdor <- function(key){
  switch (as.character(key),
          'a' = 'almond',
          'l' = 'anise',
          'c' = 'creosote',
          'y' = 'fishy',
          'f' = 'foul',
          'm' = 'musty',
          'n' = 'none',
          'p' = 'pungent',
          's' = 'spicy'
  )
}

### apply transformations
mushrooms$type <- sapply(mushrooms$type, transformType)
mushrooms$odor <- sapply(mushrooms$odor, transformOdor)

mushrooms$type <- as.factor(mushrooms$type)
mushrooms$odor <- as.factor(mushrooms$odor)
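A quick sanity check (not in the original exercise) confirms the recoded factor levels:

table(mushrooms$type)
table(mushrooms$odor)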
 

### training a model on the data

### use the 1R implementation in the RWeka package
### called OneR()


### consider all possible features to predict $type
mushroom_1R <- OneR(type ~ ., data = mushrooms)
mushroom_1R


### 1R selected a single rule, shown in the output above
 




### evaluating model performance
summary(mushroom_1R)



### improving model performance
### JRip(), a Java-based implementation of the RIPPER
### rule learning algorithm

mushroom_JRip <- JRip(type ~ ., data = mushrooms)

### RIPPER selected rules (9):
mushroom_JRip
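As a supplementary step beyond the original exercise, RWeka's built-in evaluator can cross-validate the RIPPER model:

### 10-fold cross-validation of the RIPPER model
evaluate_Weka_classifier(mushroom_JRip, numFolds = 10)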




You can get the example and the dataset in:
https://github.com/pakinja/Data-R-Value

Monday, May 1, 2017

Machine Learning Classification with C5.0 Decision Tree Algorithm

We will develop a classification exercise using the C5.0 decision tree algorithm. The exercise was originally published in "Machine Learning with R" by Brett Lantz, Packt Publishing, 2015 (Community Experience Distilled series).

The example we will develop is about identifying risky bank loans.

We will carry out the exercise verbatim as published in the aforementioned reference.



### install required packages
#install.packages("C50")

### load required packages

library(C50)
library(gmodels)

### read and explore the data
credit <- read.csv("credit.csv")
str(credit)

### look at a couple of loan features that seem likely
### to predict a default (currency: Deutsche Marks, DM)

table(credit$checking_balance)
table(credit$savings_balance)

### some of the loan features are numeric
summary(credit$months_loan_duration)
summary(credit$amount)

### $default indicates whether the loan applicant
### went into default or not


### change levels of $default
credit$default<- factor(credit$default, levels = c("1", "2"),
                        labels = c("no", "yes"))
 

### see frequencies of $default
table(credit$default)

### data preparation
### creating random training (90%) and test (10%) datasets

### random sample numbers

set.seed(123)
train_sample <- sample(1:1000, 900)
 
### split the dataset
credit_train <- credit[train_sample, ]
credit_test <- credit[-train_sample, ]

### check that each dataset has about 30%
### defaulted loans

prop.table(table(credit_train$default))
prop.table(table(credit_test$default))

### training a model on the data

### the $default column in credit_train is the class
### variable, so we need to exclude it from the training
### data frame, but supply it as the target factor vector
### for classification

credit_model <- C5.0(credit_train[-17], credit_train$default)

### see some basic data about the tree

credit_model
summary(credit_model)

### evaluating model performance
credit_pred <- predict(credit_model, credit_test)
 

CrossTable(credit_test$default, credit_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
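As a one-line supplement to the cross table (not part of the original exercise), the overall accuracy on the test set can be computed directly:

### proportion of correctly classified test cases
mean(credit_pred == credit_test$default)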



### improving model performance
### boosting the accuracy of decision trees
### add an additional trials parameter indicating the
### number of separate decision trees to use in the
### boosted team

credit_boost10 <- C5.0(credit_train[-17], credit_train$default,
                       trials = 10)

credit_boost10
summary(credit_boost10)

credit_boost_pred10 <- predict(credit_boost10, credit_test)
 

CrossTable(credit_test$default, credit_boost_pred10,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
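The same supplementary accuracy check can be applied to the boosted model:

### proportion of correctly classified test cases (boosted model)
mean(credit_boost_pred10 == credit_test$default)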



### a way to reduce false negatives
### set penalties to different types of errors
### cost matrix

matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))
 

names(matrix_dimensions) <- c("predicted", "actual")

matrix_dimensions

### assign the penalty for the types of errors

### if a loan default costs four times as much as
### a missed opportunity


error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
                     dimnames = matrix_dimensions)
 

### false negative has a cost of 4 versus a false
### positive's cost of 1

error_cost

### now use the costs parameter of C5.0()
credit_cost <- C5.0(credit_train[-17], credit_train$default,
                    costs = error_cost)
 

credit_cost_pred <- predict(credit_cost, credit_test)
 

CrossTable(credit_test$default, credit_cost_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))




### false negatives were reduced at the expense of
### an increase in false positives


You can get the exercise and the dataset in:
https://github.com/pakinja/Data-R-Value