Thursday, July 6, 2017

How the Game of Go Can Help You Improve Your Analytics and Programming Skills.

I will start this post by telling the story of how the game of Go catapulted me into learning mathematics.



A physics degree rests on mathematics, from the very basic to the most complex. When I started the program my preparation was quite poor, and for that reason it was very difficult for me to advance in my studies. Luckily, many of my classmates in college played Go, a game that until then had never caught my eye. I decided to learn to play it; I found it quite difficult at first, but with practice it became more natural and interesting.

As I began to understand Go, I also began to understand subjects such as differential and integral calculus, higher algebra, linear algebra and other courses that demand a lot of structured thinking; Go helped my brain click on math in an incredible way.

Go makes our brain create, maintain and exercise neural connections, so it also helps us learn and practice programming languages.

Go is an ancient game in which the objective is to seize as much territory as possible while preventing your opponent from doing the same; this requires the ability to count quickly and to anticipate a fair number of future moves.
You can find more information about the history of Go at the following link:
https://www.britgo.org/intro/history



There are several platforms for playing Go with people from practically all over the world; one of them, the Kiseido Go Server, is 100% free: https://www.kiseido.com/

I would also mention that playing Go can be a great way to relieve the stress of employees engaged in technical, intellectual work while strengthening their cognitive skills.

Finally, I share the following information about Go and Artificial Intelligence:

In October 2015, AlphaGo became the first computer program ever to beat a professional Go player by winning 5-0 against the reigning three-time European champion Fan Hui (2-dan professional). That work was featured on the cover of the journal Nature in January 2016.





Monday, June 12, 2017

Machine Learning. Association Rules (Market Basket Analysis).

It is important to mention that this post series began as a personal way of practicing R programming and machine learning. Feedback from the community later encouraged me to keep doing these exercises and to share them. The bibliography and corresponding authors are cited at all times; this post series is a way of honoring them and giving them the credit they deserve for their work.
 
We will develop a pattern-finding example using association rules. The example was originally published in "Machine Learning with R" by Brett Lantz, Packt Publishing 2015 (open source community experience distilled).


The example is about market basket analysis.




We will carry out the exercise verbatim as published in the aforementioned reference.

For more details on the association rules algorithms, it is recommended to check the aforementioned reference or any other bibliography of your choice.




### "Machine Learning in R" by Brett Lantz,
### PACKT publishing 2015
### (open source community experience destilled)



### install and load required packages
#install.packages("arules")

library(arules)


### read and explore the data

groceries <- read.transactions("groceries.csv", sep = ",")
summary(groceries)


### the density value of 0.0260888 (2.6%) refers to the
### proportion of nonzero matrix cells

### since 2513/9835 = 0.2555, we can determine that whole
### milk appeared in 25.6% of the transactions

### a total of 2,159 transactions contained only a single item,
### with a mean of 4.409 items per transaction

### examining transaction data

inspect(groceries[1:5])

itemFrequency(groceries[, 1:3])
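
### (quick check, not in the original exercise) itemFrequency() returns the
### support of every item as a named vector, so the 25.6% figure for whole
### milk quoted above can be confirmed directly:
itemFrequency(groceries)["whole milk"]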


### visualizing items with at least 10% support

itemFrequencyPlot(groceries, support = 0.1, col="green")






 

itemFrequencyPlot(groceries, topN = 20, col="green")

 

### visualizing the transaction data, sparse matrix

image(groceries[1:5])




image(sample(groceries, 100))





### training a model on the data
### default settings support = 0.1 and confidence = 0.8

apriori(groceries)

### adjusting parameters
groceryrules <- apriori(groceries,
                        parameter = list(support = 0.006,
                                         confidence = 0.25,
                                         minlen = 2))

### set of 463 rules

### evaluating model performance

summary(groceryrules)

### inspect some rules
inspect(groceryrules[1:3])

### improving model performance

### sorting the set of association rules

inspect(sort(groceryrules, by = "lift")[1:5])

### lift = 3.96 implies that people who buy herbs are nearly
### four times more likely to buy root vegetables than the
### typical customer
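
### (optional check, not in the original exercise) the quality() function
### returns the support, confidence and lift of every rule, so the lift
### values used for the sorting above can be inspected directly:
head(quality(sort(groceryrules, by = "lift")))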


### taking subsets of association rules
berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)

### there are four rules involving berries, two of which seem
### to be interesting enough to be called actionable

### saving association rules to a file or dataframe

write(groceryrules, file = "groceryrules.csv",
      sep = ",", quote = TRUE, row.names = FALSE)


groceryrules_df <- as(groceryrules, "data.frame")
str(groceryrules_df)




You can get the dataset needed to perform the exercise at:
https://github.com/pakinja/Data-R-Value

Sunday, June 11, 2017

How Much Embellishment In Your Data Visualizations?

I share this excellent paper on the impact of embellishment in data visualizations and the effects it can have on how well the intended information is understood.




Thursday, May 25, 2017

The Most Popular Baby Names in the US Per Year. An Exercise in R.

Using the databases of the Social Security Administration (https://www.ssa.gov/oact//babynames/limits.html), I will analyze the most used names in each year from 1880 to 2016.

We can see that there are periods when certain names become fashionable and that these fashions coincide with the appearance of certain songs, films or events.

I used the following script in R to perform the analysis:

### Author: Francisco Jaramillo
### Data Scientist Pakinja
### Data R Value 2017


### load required libraries
library(dplyr)
library(ggplot2)


### read all 137 data files (one per year)
temp = list.files(pattern="*.txt")
myfiles = lapply(temp, read.csv, header=FALSE)


### filter first data file by female names, get the
### most frequent and make it the first row of a dataframe
pop_names <- filter(data.frame(myfiles[1]), V2 == "F")[1,c(1,3)]

### loop to do the previous step for all data files
for(i in 2:length(myfiles)){
  pop_names <- rbind(pop_names,
                     filter(data.frame(myfiles[i]), V2 == "F")[1, c(1, 3)])
}
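
### (alternative sketch, not in the original script) the same table could
### be built without an explicit loop, for example with dplyr::bind_rows:
# pop_names <- bind_rows(lapply(myfiles, function(df)
#   filter(df, V2 == "F")[1, c(1, 3)]))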


### set levels for $names variable
pop_names$V1 <- factor(pop_names$V1, levels = unique(pop_names$V1))

### make year series
Year <- seq(1880, 2016, 1)


### bind $year variable to dataframe
pop_names <- cbind(pop_names, Year)

### set dataframe names
names(pop_names) <- c("Name", "Frequency", "Year")

### most used female names visualization
ggplot(pop_names, aes(x = Year, y = Frequency))+
  geom_bar(aes(fill = Name), stat = "identity")+
  labs(title="Most Used Female Names in USA Per Year")+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_continuous(breaks = seq(1880, 2016, by = 5))+
  scale_y_continuous(breaks = seq(0, 100000, by = 10000))+
  scale_fill_brewer(palette="Paired")+
  theme(legend.text=element_text(size=13))+
  theme(legend.position=c(0.1, 0.7))+
  theme(legend.background = element_rect(size=0.5, linetype="solid", colour ="darkblue"))




Incredibly, the name Mary remained the most used name for 75 years.

We also observe that the use of the name Linda exploded starting in 1947, coinciding with the song "Linda" (Ray Noble & Buddy Clark, Columbia Records) entering the Billboard chart on March 21, 1947 and peaking at number one for several months.

As for the name Jennifer, there are several analyses of why it was the most popular name for about 15 years; you just have to google it.

In recent years, the most popular name has been Emma.
 
################ now for males

### filter first data file by male names, get the
### most frequent and make it the first row of a dataframe
pop_names_m <- filter(data.frame(myfiles[1]), V2 == "M")[1,c(1,3)]

### loop to do the previous step for all data files
for(i in 2:length(myfiles)){
  pop_names_m <- rbind(pop_names_m,
                       filter(data.frame(myfiles[i]), V2 == "M")[1, c(1, 3)])
}


### set levels for $names variable
pop_names_m$V1 <- factor(pop_names_m$V1, levels = unique(pop_names_m$V1))

### make year series
Year <- seq(1880, 2016, 1)


### bind $year variable to dataframe
pop_names_m <- cbind(pop_names_m, Year)

### set dataframe names
names(pop_names_m) <- c("Name", "Frequency", "Year")

### most used male names visualization
ggplot(pop_names_m, aes(x = Year, y = Frequency))+
  geom_bar(aes(fill = Name), stat = "identity")+
  labs(title="Most Used Male Names in USA Per Year")+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_continuous(breaks = seq(1880, 2016, by = 5))+
  scale_y_continuous(breaks = seq(0, 100000, by = 5000))+
  scale_fill_brewer(palette="Paired")+
  theme(legend.text=element_text(size=13))+
  theme(legend.position=c(0.1, 0.7))+
  theme(legend.background = element_rect(size=0.5, linetype="solid", colour ="darkblue"))


Not surprisingly, the most popular name for 45 years was John.

The rise in popularity of the name James coincides with the premiere of the film "The Return of Frank James," starring Henry Fonda, on August 6, 1940.

The name Michael remained the most popular for about 43 years.

The name Jacob became popular starting in 1999, coinciding with the premiere of the film "Jakob the Liar," starring Robin Williams.

In recent years, the most popular name has been Noah.

I hope this exercise has been interesting and useful for you.



Get everything you need to repeat the exercise at:

https://github.com/pakinja/Data-R-Value

Monday, May 22, 2017

Machine Learning. Support Vector Machines (Optical Character Recognition).

It is important to mention that this post series began as a personal way of practicing R programming and machine learning. Feedback from the community later encouraged me to keep doing these exercises and to share them. The bibliography and corresponding authors are cited at all times; this post series is a way of honoring them and giving them the credit they deserve for their work.
 
We will develop a support vector machine example. The example was originally published in "Machine Learning with R" by Brett Lantz, Packt Publishing 2015 (open source community experience distilled).


The example is about optical character recognition.

 
We will carry out the exercise verbatim as published in the aforementioned reference.

For more details on the support vector machines algorithms it is recommended to check the aforementioned reference or any other bibliography of your choice.



### "Machine Learning in R" by Brett Lantz,
### PACKT publishing 2015
### (open source community experience destilled)



### install and load required packages
install.packages("kernlab")

library(kernlab)

### read and explore the data
letters <- read.csv("letterdata.csv")
str(letters)

### SVMs require all features to be numeric and each
### feature to be scaled to a fairly small interval;
### the kernlab package performs the rescaling
### automatically


### split the data into training and testing sets
### 80% - 20%
### data is already randomized

letters_train <- letters[1:16000, ]
letters_test <- letters[16001:20000, ]


### training a model on the data
### simple linear kernel function

letter_classifier <- ksvm(letter ~ ., data = letters_train,
                          kernel = "vanilladot")
letter_classifier


### evaluating model performance
letter_predictions <- predict(letter_classifier, letters_test)

head(letter_predictions)

table(letter_predictions, letters_test$letter)
agreement <- letter_predictions == letters_test$letter
table(agreement)
prop.table(table(agreement))


### improving model performance
### Gaussian RBF kernel

letter_classifier_rbf <- ksvm(letter ~ ., data = letters_train,
                              kernel = "rbfdot")
letter_predictions_rbf <- predict(letter_classifier_rbf,
                                  letters_test)
agreement_rbf <- letter_predictions_rbf == letters_test$letter
table(agreement_rbf)
prop.table(table(agreement_rbf))


### by changing the kernel function, accuracy increased
### from 84% to 93%
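
### (optional extension, not part of the original exercise) kernlab's
### ksvm() also exposes a cost parameter C (default 1); raising it
### penalizes training errors more heavily and may change accuracy:
# letter_classifier_rbf_c5 <- ksvm(letter ~ ., data = letters_train,
#                                  kernel = "rbfdot", C = 5)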


You can get the dataset at:
https://github.com/pakinja/Data-R-Value 

Thursday, May 18, 2017

Machine Learning. Artificial Neural Networks (Strength of Concrete).

It is important to mention that this post series began as a personal way of practicing R programming and machine learning. Feedback from the community later encouraged me to keep doing these exercises and to share them. The bibliography and corresponding authors are cited at all times; this post series is a way of honoring them and giving them the credit they deserve for their work.
 
We will develop an artificial neural network example. The example was originally published in "Machine Learning with R" by Brett Lantz, Packt Publishing 2015 (open source community experience distilled).


The example we will develop is about predicting the strength of concrete based on the ingredients used to make it.

We will carry out the exercise verbatim as published in the aforementioned reference.

For more details on artificial neural network algorithms, it is recommended to check the aforementioned reference or any other bibliography of your choice.


### "Machine Learning in R" by Brett Lantz,
### PACKT publishing 2015
### (open source community experience destilled)
### based on: Yeh IC. "Modeling of Strength of
### High Performance Concrete Using Artificial
### Neural Networks." Cement and Concrete Research
### 1998; 28:1797-1808.

### Strength of concrete example
### relationship between the ingredients used in
### concrete and the strength of finished product

### Dataset
### Compressive strength of concrete
### UCI Machine Learning Data Repository
### http://archive.ics.uci.edu/ml


### install and load required packages

#install.packages("neuralnet")

library(neuralnet)

### read and explore the data
concrete <- read.csv("concrete.csv")
str(concrete)

### neural networks work best when the input data
### are scaled to a narrow range around zero

### normalize the dataset values

normalize <- function(x){
  return((x - min(x)) / (max(x) - min(x)) )
}
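
### quick sanity check of normalize() on a toy vector (not in the
### original exercise); the result should be 0.00 0.25 0.50 0.75 1.00
# normalize(c(1, 2, 3, 4, 5))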

### apply normalize() to the dataset columns
concrete_norm <- as.data.frame(lapply(concrete, normalize))

### confirm and compare normalization
summary(concrete_norm$strength)
summary(concrete$strength)

### split the data into training and testing sets
### 75% - 25%

concrete_train <- concrete_norm[1:773, ]
concrete_test <- concrete_norm[774:1030, ]

### training model on the data
concrete_model <- neuralnet(strength ~ cement + slag +
              ash + water + superplastic + coarseagg +
              fineagg + age, data = concrete_train)

### visualize the network topology

plot(concrete_model)



### there is one input node for each of the eight
### features, followed by a single hidden node and
### a single output node that predicts the concrete
### strength
### at the bottom of the figure, R reports the number
### of training steps and an error measure called
### the sum of squared errors (SSE)

### evaluating model performance

### predictions

model_results <- compute(concrete_model, concrete_test[1:8])
predicted_strength <- model_results$net.result

### because this is a numeric prediction problem rather
### than a classification problem, we cannot use a confusion
### matrix to examine model accuracy
### obtain correlation between our predicted concrete strength
### and the true value

cor(predicted_strength, concrete_test$strength)

### the correlation indicates a strong linear relationship
### between the two variables

### improving model performance
### increase the number of hidden nodes to five

concrete_model2 <- neuralnet(strength ~ cement + slag +
                             ash + water + superplastic + coarseagg +
                             fineagg + age,
                             data = concrete_train, hidden = 5)


plot(concrete_model2)




### SSE has been reduced significantly

### predictions
model_results2 <- compute(concrete_model2, concrete_test[1:8])
predicted_strength2 <- model_results2$net.result

### performance
cor(predicted_strength2, concrete_test$strength)

### notice that your results can differ because neuralnet
### begins training with random weights
### if you'd like to match these results exactly, use set.seed(12345)
### before building the neural network
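
### (optional, not in the original exercise) because the model was trained
### on normalized targets, predictions can be mapped back to the original
### strength units by inverting normalize():
# unnormalize <- function(x, orig) {
#   x * (max(orig) - min(orig)) + min(orig)
# }
# strength_pred2 <- unnormalize(predicted_strength2, concrete$strength)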

  
You can get the example and the dataset at:
https://github.com/pakinja/Data-R-Value

 

Wednesday, May 17, 2017

Training Neural Networks with Backpropagation. Original Publication.

Neural networks have been a very important area of scientific study, shaped by different disciplines such as mathematics, biology, psychology and computer science.
The study of neural networks leapt from theory to practice with the emergence of computers.
Training a neural network by adjusting the weights of its connections is computationally very expensive, so its application to practical problems had to wait until the mid-1980s, when a more efficient training algorithm became widely known.

That algorithm is now known as the backward propagation of errors, or simply backpropagation.
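
To give a flavor of what the algorithm does, here is a minimal sketch (my own illustration, not taken from the paper) of the gradient-descent weight update at its core, written for a single linear neuron with squared error:

### one gradient-descent step for a single linear neuron
### E = 0.5 * (y - sum(w * x))^2, so dE/dw = -(y - sum(w * x)) * x
x <- c(1, 0.5, -0.2)      # inputs
y <- 0.7                  # target output
w <- c(0.1, -0.3, 0.2)    # current weights
eta <- 0.05               # learning rate
error <- y - sum(w * x)   # prediction error
w <- w + eta * error * x  # step downhill on the error surface

In a multilayer network, backpropagation uses the chain rule to push this same error signal backwards through the hidden layers, which is what made training such networks practical.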

One of the most cited articles on this algorithm is:

Learning representations by back-propagating errors
David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams
Nature 323, 533 - 536 (09 October 1986)



Although it is a very technical article, anyone who wants to study and understand neural networks should work through this material.

I share the entire article at:
https://github.com/pakinja/Data-R-Value

Tuesday, May 16, 2017

Machine Learning. Stock Market Data, Part 3: Quadratic Discriminant Analysis and KNN.

It is important to mention that this post series began as a personal way of practicing R programming and machine learning. Feedback from the community later encouraged me to keep doing these exercises and to share them. The bibliography and corresponding authors are cited at all times; this post series is a way of honoring them and giving them the credit they deserve for their work.

We will develop quadratic discriminant analysis and K-nearest neighbors examples. The examples were originally published in "An Introduction to Statistical Learning. With Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Springer 2015.




The example we will develop is about classifying whether the market will rise (Up) or fall (Down).
 

We will carry out the exercise verbatim as published in the aforementioned reference, with only slight changes in coding style.

For more details on the models, algorithms and parameters interpretation, it is recommended to check the aforementioned reference or any other bibliography of your choice.



### "An Introduction to Statistical Learning.
### With applications in R" by Gareth James,
### Daniela Witten, Trevor Hastie and Robert Tibshirani.
### Springer 2015.


### install and load required packages

library(ISLR)
library(psych)
library(MASS)
library(class)

### split the dataset into train and test sets
train <- (Smarket$Year < 2005)
Smarket.2005 <- Smarket[!train, ]
Direction.2005 <- Smarket$Direction[!train]

### Quadratic Discriminant Analysis

### perform quadratic discriminant analysis QDA on the stock
### market data

qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

### the output does not contain the coefficients of the linear discriminants,
### because the QDA classifier involves a quadratic, rather than a linear,
### function of the predictors

qda.fit

### predictions
qda.class <- predict(qda.fit, Smarket.2005)$class
table(qda.class, Direction.2005)
mean(qda.class == Direction.2005)

### QDA predictions are accurate almost 60 % of the time,
### even though the 2005 data was not used to fit the model
### quite impressive for stock market data, which is known to
### be quite hard to model accurately

### K-Nearest Neighbors

### split the dataset into train and test sets

train.X <- cbind(Smarket$Lag1, Smarket$Lag2)[train, ]
test.X <- cbind(Smarket$Lag1, Smarket$Lag2)[!train, ]
train.Direction <- Smarket$Direction[train]

### knn() function can be used to predict the market’s movement
### for the dates in 2005

set.seed(1)
knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.2005)                  
(83+43)/252                  

### the results using K = 1 are not very good, since only 50 % of the
### observations are correctly predicted

### try k=3

knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2005)
mean(knn.pred == Direction.2005)

### the results have improved slightly. But increasing K further
### turns out to provide no further improvements. It appears that for
### this data, QDA provides the best results of the methods that we
### have examined so far.
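
### (optional note, not in the original exercise) because KNN is based on
### distances, predictors on very different scales are usually standardized
### first, e.g. with scale(); Lag1 and Lag2 are already on the same scale
### here, so it would make little difference in this example:
# train.X <- scale(cbind(Smarket$Lag1, Smarket$Lag2))[train, ]
# test.X <- scale(cbind(Smarket$Lag1, Smarket$Lag2))[!train, ]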

You can get the example at:
https://github.com/pakinja/Data-R-Value 

Monday, May 15, 2017

Machine Learning. Stock Market Data, Part 2: Linear Discriminant Analysis.

It is important to mention that this post series began as a personal way of practicing R programming and machine learning. Feedback from the community later encouraged me to keep doing these exercises and to share them. The bibliography and corresponding authors are cited at all times; this post series is a way of honoring them and giving them the credit they deserve for their work.
 
We will develop a linear discriminant analysis example. The exercise was originally published in "An Introduction to Statistical Learning. With applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Springer 2015.

The example we will develop is about classifying whether the market will rise (Up) or fall (Down).
 
We will carry out the exercise verbatim as published in the aforementioned reference, with only slight changes in coding style.

For more details on the models, algorithms and parameters interpretation, it is recommended to check the aforementioned reference or any other bibliography of your choice.



### "An Introduction to Statistical Learning.
### with applications in R" by Gareth James,
### Daniela Witten, Trevor Hastie and Robert Tibshirani.
### Springer 2015.


### install and load required packages

library(ISLR)
library(psych)
library(MASS)

### perform linear discriminant analysis LDA on the stock
### market data

train <- (Smarket$Year < 2005)
Smarket.2005 <- Smarket[!train, ]
Direction.2005 <- Smarket$Direction[!train]

lda.fit <- lda(Direction ~ Lag1 + Lag2 , data = Smarket , subset = train )
lda.fit





### LDA indicates that 49.2% of the training observations
### correspond to days during which the market went down

### group means suggest that there is a tendency for the
### previous 2 days' returns to be negative on days when the
### market increases, and a tendency for the previous days'
### returns to be positive on days when the market declines

### coefficients of linear discriminants output provides the
### linear combination of Lag1 and Lag2 that are used to form
### the LDA decision rule

### if (−0.642 * Lag1 − 0.514 * Lag2) is large, then the LDA
### classifier will predict a market increase, and if it is
### small, then the LDA classifier will predict a market decline

### plot() function produces plots of the linear discriminants,
### obtained by computing (−0.642 * Lag1 − 0.514 * Lag2) for
### each of the training observations

plot(lda.fit)
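
### (optional check, not from the original text) the linear combination
### (-0.642 * Lag1 - 0.514 * Lag2) can be computed by hand from the fitted
### coefficients; it differs from the plotted discriminant scores only by a
### constant shift, because predict() centers the predictors first
ld_scores <- as.matrix(Smarket[train, c("Lag1", "Lag2")]) %*% lda.fit$scaling
head(ld_scores)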




### predictions

### the LDA and logistic regression predictions are almost identical
lda.pred <- predict(lda.fit, Smarket.2005)
names(lda.pred)

lda.class <- lda.pred$class
table(lda.class, Direction.2005)
mean(lda.class == Direction.2005)

### applying a 50% threshold to the posterior probabilities allows
### us to recreate the predictions contained in lda.pred$class

sum(lda.pred$posterior [ ,1] >= 0.5)
sum(lda.pred$posterior [ ,1] < 0.5)

### posterior probability output by the model corresponds to
### the probability that the market will decrease

lda.pred$posterior[1:20 ,1]
lda.class[1:20]

### use a posterior probability threshold other than 50 % in order
### to make predictions

### suppose that we wish to predict a market decrease only if we
### are very certain that the market will indeed decrease on that
### day, say, if the posterior probability is at least 90%

sum(lda.pred$posterior[ ,1] > 0.9)

### No days in 2005 meet that threshold! In fact, the greatest
### posterior probability of decrease in all of 2005 was 52.02%
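
### (quick check, not in the original text) the largest posterior
### probability of a decrease among the 2005 observations:
max(lda.pred$posterior[, 1])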


You can get the example at:
https://github.com/pakinja/Data-R-Value