Monday, June 12, 2017

Machine Learning. Association Rules (Market Basket Analysis).

This series of posts began as a personal way of practicing R programming and machine learning. Feedback from the community later encouraged me to continue these exercises and share them. The bibliography and corresponding authors are cited at all times; this series is a way of honoring them and giving them the credit they deserve for their work.
 
We will work through a pattern-finding example using association rules. The example was originally published in "Machine Learning with R" by Brett Lantz, Packt Publishing, 2015 (open source community experience distilled).


The example is about market basket analysis.




We will carry out the exercise verbatim as published in the aforementioned reference.

For more details on association rule algorithms, check the aforementioned reference or any other bibliography of your choice.




### "Machine Learning with R" by Brett Lantz,
### Packt Publishing 2015
### (open source community experience distilled)



### install and load required packages
#install.packages("arules")

library(arules)


### read and explore the data

groceries <- read.transactions("groceries.csv", sep = ",")
summary(groceries)


### density value 0.0260888 (2.6%) refers to the proportion
### of nonzero matrix cells

### since 2513/9835 = 0.2555, we can determine that whole
### milk appeared in 25.6% of the transactions

### a total of 2159 transactions contained only a single item
### the mean is 4.409 items per transaction
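
The quantities reported by summary() above (density, item frequency, items per transaction) can be reproduced by hand on a small matrix. The following is a minimal sketch in base R, assuming a made-up 4 x 4 toy matrix rather than the groceries data; the names toy, toy_density, toy_item_freq and toy_basket_size are hypothetical and only serve the illustration.

```r
# Toy data (made up for illustration, NOT the groceries dataset):
# rows are transactions, columns are items, TRUE means the item was bought.
toy <- matrix(
  c(TRUE,  TRUE,  FALSE, FALSE,
    TRUE,  FALSE, FALSE, FALSE,
    FALSE, TRUE,  TRUE,  FALSE,
    TRUE,  FALSE, FALSE, TRUE),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("whole milk", "herbs", "root vegetables", "berries"))
)

# Density: proportion of nonzero (TRUE) cells in the transaction matrix
toy_density <- sum(toy) / length(toy)

# Item frequency: share of transactions that contain each item
toy_item_freq <- colMeans(toy)

# Basket size: number of items in each transaction
toy_basket_size <- rowSums(toy)

print(toy_density)
print(toy_item_freq)
print(mean(toy_basket_size))
```

On the real data, the same reasoning gives 2513/9835 for the item frequency of whole milk, which is the figure quoted in the comments above.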

### examining transaction data

inspect(groceries[1:5])

itemFrequency(groceries[, 1:3])


### visualizing items with at least 10% support

itemFrequencyPlot(groceries, support = 0.1, col="green")






 

itemFrequencyPlot(groceries, topN = 20, col="green")

 

### visualizing the transaction data, sparse matrix

image(groceries[1:5])




image(sample(groceries, 100))





### training a model on the data
### default settings support = 0.1 and confidence = 0.8

apriori(groceries)

### adjusting parameters
groceryrules <- apriori(groceries,
                        parameter = list(support = 0.006,
                                         confidence = 0.25,
                                         minlen = 2))

### set of 463 rules

### evaluating model performance

summary(groceryrules)

### inspect some rules
inspect(groceryrules[1:3])

### improving model performance

### sorting the set of association rules

inspect(sort(groceryrules, by = "lift")[1:5])

### lift = 3.96 implies that people who buy herbs are nearly
### four times more likely to buy root vegetables than the
### typical customer
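
To make the reading of lift in the comment above concrete, here is a minimal sketch in base R that computes support, confidence and lift for a single rule by hand. The baskets are made up for illustration (they are not the groceries data), and support_of is a hypothetical helper, not a function from the arules package.

```r
# Toy baskets (made up for illustration, NOT the groceries dataset)
baskets <- list(
  c("herbs", "root vegetables"),
  c("herbs", "root vegetables", "whole milk"),
  c("whole milk"),
  c("root vegetables", "whole milk"),
  c("whole milk", "berries"),
  c("herbs", "whole milk"),
  c("berries"),
  c("whole milk", "yogurt")
)

# Hypothetical helper: share of baskets that contain every item in `items`
support_of <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule under study: {herbs} => {root vegetables}
lhs <- "herbs"
rhs <- "root vegetables"

# support(lhs => rhs)    = share of baskets containing both sides
rule_support <- support_of(c(lhs, rhs), baskets)

# confidence(lhs => rhs) = support(lhs and rhs) / support(lhs)
rule_confidence <- rule_support / support_of(lhs, baskets)

# lift(lhs => rhs)       = confidence(lhs => rhs) / support(rhs)
rule_lift <- rule_confidence / support_of(rhs, baskets)

print(c(support = rule_support,
        confidence = rule_confidence,
        lift = rule_lift))
```

A lift above 1 means the right-hand item shows up more often in baskets containing the left-hand item than in baskets overall, which is how the value 3.96 is read for herbs and root vegetables.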


### taking subsets of association rules
berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)

### there are four rules involving berries, two of which seem
### to be interesting enough to be called actionable

### saving association rules to a file or dataframe

write(groceryrules, file = "groceryrules.csv",
      sep = ",", quote = TRUE, row.names = FALSE)


groceryrules_df <- as(groceryrules, "data.frame")
str(groceryrules_df)




You can get the dataset for this exercise at:
https://github.com/pakinja/Data-R-Value

Sunday, June 11, 2017

How Much Embellishment In Your Data Visualizations?

I am sharing this excellent paper on the impact of embellishment in data visualizations and the effect it can have on how well the intended information is understood.