This is a blog dedicated to the R programming language and to the field of data science.
Here you will find scripts, hints, algorithms, examples, and more. It will also publish data analyses that contribute to improving societies and people's quality of life.

I will start this post by telling the story of how the game of Go catapulted me into learning mathematics.

A physics degree is built on mathematics, from the very basic to the most complex. When I started that degree my preparation was very poor, and for that reason it was very difficult for me to advance in my studies. Luckily, many of my classmates at college played Go, a game that until then had never caught my eye. I decided to learn to play it, finding it quite difficult at first, but with practice it became more natural and interesting. As I began to understand Go, I also began to understand academic subjects such as differential and integral calculus, higher algebra, linear algebra and other subjects that require a lot of structured thinking; Go helped my brain click on math in an incredible way. Go lets our brain create, maintain and exercise neural connections, so it also helps us learn and practice programming languages. In broad terms, Go is a millennia-old game in which the objective is to seize as much territory as possible while preventing your opponent from obtaining it; to do so you need to count quickly and anticipate a good number of future moves. You can find more information about Go at the following link: https://www.britgo.org/intro/history

There are different platforms for playing Go with people from practically all over the world; one of them is the Kiseido Go Server, which is 100% free: https://www.kiseido.com/. I would also like to mention that playing Go would be a great way to relieve the stress of employees engaged in technical intellectual work while strengthening their cognitive skills at the same time. Finally, I share the following information about Go and artificial intelligence: in October 2015, AlphaGo became the first computer program ever to beat a professional Go player, winning 5-0 against the reigning three-time European Champion Fan Hui (2-dan pro). That work was featured in a front-cover article in the journal Nature in January 2016.

It is important to mention that this series of posts began as a personal way of practicing R programming and machine learning. Feedback from the community subsequently urged me to keep doing these exercises and sharing them. The bibliography and the corresponding authors are cited at all times; this series is a way of honoring them and giving them the credit they deserve for their work. We will develop a pattern-finding example using association rules. The example was originally published in "Machine Learning in R" by Brett Lantz, PACKT Publishing 2015 (open source community experience distilled).

The example is about market basket analysis.

We will carry out the exercise verbatim as published in the aforementioned reference.

For more details on association rule algorithms, it is recommended to check the aforementioned reference or any other bibliography of your choice.

### "Machine Learning in R" by Brett Lantz,
### PACKT Publishing 2015
### (open source community experience distilled)

### install and load required packages
# install.packages("arules")

library(arules)

### read and explore the data
groceries <- read.transactions("groceries.csv", sep = ",")
summary(groceries)

### the density value of 0.0260888 (2.6%) refers to the
### proportion of nonzero matrix cells

### since 2513/9835 = 0.2555, we can determine that whole
### milk appeared in 25.6% of the transactions

### a total of 2159 transactions contained only a single item,
### with a mean of 4.409 items per transaction
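Both figures can be verified directly from the transactions object; the short check below uses the arules functions itemFrequency() and size() (the item label "whole milk" is taken from the summary output above):

### verify the support of whole milk and the mean transaction size
itemFrequency(groceries)["whole milk"]   # should be close to 0.2555
mean(size(groceries))                    # should be close to 4.409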

### examine the first five transactions
inspect(groceries[1:5])

### support of the first three items
itemFrequency(groceries[, 1:3])

### visualize items with support of at least 10%
itemFrequencyPlot(groceries, support = 0.1, col = "green")
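The groceryrules object inspected below is never created in this excerpt. In the book it is produced with the apriori() function from arules; the sketch below follows that approach, but the support, confidence and minlen values are assumptions rather than the book's definitive settings:

### learn association rules with the Apriori algorithm
### (support, confidence and minlen values are assumed)
groceryrules <- apriori(groceries,
                        parameter = list(support = 0.006,
                                         confidence = 0.25,
                                         minlen = 2))
summary(groceryrules)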

### sort the set of association rules by lift and inspect the top five
inspect(sort(groceryrules, by = "lift")[1:5])

### lift = 3.96 implies that people who buy herbs are nearly
### four times more likely to buy root vegetables than the
### typical customer

### take subsets of association rules
berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)

### there are four rules involving berries, two of which seem
### interesting enough to be called actionable

### save the association rules to a file or a data frame
write(groceryrules, file = "groceryrules.csv",
      sep = ",", quote = TRUE, row.names = FALSE)
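The comment above also mentions saving the rules to a data frame, a step not shown in the excerpt. A minimal sketch using the standard arules coercion (the name groceryrules_df is only illustrative):

### convert the rules to a data frame for further processing
groceryrules_df <- as(groceryrules, "data.frame")
str(groceryrules_df)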

I share this excellent paper on the impact of embellishment in data visualizations and the effect it can have on how well the intended information is understood.

We can see that there are seasons when certain names become fashionable and that these fashions coincide with the appearance of certain songs, films or events.

I used the following script in R to perform the analysis:

### Author: Francisco Jaramillo
### Data Scientist Pakinja
### Data R Value 2017

### load the packages used below
library(dplyr)
library(ggplot2)

### read all 137 data files (one per year)
temp <- list.files(pattern = "*.txt")
myfiles <- lapply(temp, read.csv, header = FALSE)

### filter the first data file by female names, take the
### most frequent one and make it the first row of a data frame
pop_names <- filter(data.frame(myfiles[1]), V2 == "F")[1, c(1, 3)]

### loop to do the previous step for all data files
### (the loop body is reconstructed from the step above)
for(i in 2:length(myfiles)){
  pop_names <- rbind(pop_names,
                     filter(data.frame(myfiles[i]), V2 == "F")[1, c(1, 3)])
}

### set the levels of the names variable
pop_names$V1 <- factor(pop_names$V1, levels = unique(pop_names$V1))

### make the year series
Year <- seq(1880, 2016, 1)

### bind the year variable to the data frame
pop_names <- cbind(pop_names, Year)

### set the data frame names
names(pop_names) <- c("Name", "Frequency", "Year")

### most used female names visualization
ggplot(pop_names, aes(x = Year, y = Frequency)) +
  geom_bar(aes(fill = Name), stat = "identity") +
  labs(title = "Most Used Female Names in USA Per Year") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks = seq(1880, 2016, by = 5)) +
  scale_y_continuous(breaks = seq(0, 100000, by = 10000)) +
  scale_fill_brewer(palette = "Paired") +
  theme(legend.text = element_text(size = 13)) +
  theme(legend.position = c(0.1, 0.7)) +
  theme(legend.background = element_rect(size = 0.5, linetype = "solid",
                                         colour = "darkblue"))

Incredibly, the name Mary remained the most used name for 75 years.

We also observe that the use of the name Linda explodes from 1947 on, coinciding with the song "Linda" (Ray Noble & Buddy Clark, Columbia Records) entering the Billboard chart on March 21, 1947 and holding the number one spot for several months.

About the name Jennifer, there are different analyses of why it was the most popular name for about 15 years; you just have to google it.

The most popular name recently is Emma.

################ now for males

### filter the first data file by male names, take the
### most frequent one and make it the first row of a data frame
pop_names_m <- filter(data.frame(myfiles[1]), V2 == "M")[1, c(1, 3)]

### loop to do the previous step for all data files
### (the loop body is reconstructed from the step above)
for(i in 2:length(myfiles)){
  pop_names_m <- rbind(pop_names_m,
                       filter(data.frame(myfiles[i]), V2 == "M")[1, c(1, 3)])
}

### set the levels of the names variable
pop_names_m$V1 <- factor(pop_names_m$V1, levels = unique(pop_names_m$V1))

### make the year series
Year <- seq(1880, 2016, 1)

### bind the year variable to the data frame
pop_names_m <- cbind(pop_names_m, Year)

### set the data frame names
names(pop_names_m) <- c("Name", "Frequency", "Year")

### most used male names visualization
ggplot(pop_names_m, aes(x = Year, y = Frequency)) +
  geom_bar(aes(fill = Name), stat = "identity") +
  labs(title = "Most Used Male Names in USA Per Year") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks = seq(1880, 2016, by = 5)) +
  scale_y_continuous(breaks = seq(0, 100000, by = 5000)) +
  scale_fill_brewer(palette = "Paired") +
  theme(legend.text = element_text(size = 13)) +
  theme(legend.position = c(0.1, 0.7)) +
  theme(legend.background = element_rect(size = 0.5, linetype = "solid",
                                         colour = "darkblue"))

Not surprisingly, the most popular name for 45 years was John.

The popularity of the name James coincides with the premiere of the film "The Return of Frank James" (August 6, 1940), starring Henry Fonda.

The name Michael remained the most popular for about 43 years.

The name Jacob begins to be popular in 1999, coinciding with the premiere of the film "Jakob the Liar," starring Robin Williams.

It is important to mention that this series of posts began as a personal way of practicing R programming and machine learning. Feedback from the community subsequently urged me to keep doing these exercises and sharing them. The bibliography and the corresponding authors are cited at all times; this series is a way of honoring them and giving them the credit they deserve for their work. We will develop a support vector machines example. The example was originally published in "Machine Learning in R" by Brett Lantz, PACKT Publishing 2015 (open source community experience distilled).

The example is about optical character recognition.

We will carry out the exercise verbatim as published in the aforementioned reference.

For more details on the support vector machines algorithms it is recommended to check the aforementioned reference or any other bibliography of your choice.

### "Machine Learning in R" by Brett Lantz, ### PACKT publishing 2015 ### (open source community experience destilled)

### install and load required packages
install.packages("kernlab")

library(kernlab)

### read and explore the data
letters <- read.csv("letterdata.csv")
str(letters)

### SVMs require all features to be numeric and each
### feature scaled to a fairly small interval;
### the kernlab package performs the rescaling automatically

### split the data into training and testing sets (80% - 20%)
### the data is already randomized
letters_train <- letters[1:16000, ]
letters_test <- letters[16001:20000, ]

### train a model on the data with a simple linear kernel function
letter_classifier <- ksvm(letter ~ ., data = letters_train,
                          kernel = "vanilladot")
letter_classifier

### evaluate model performance
letter_predictions <- predict(letter_classifier, letters_test)
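The excerpt stops right after computing the predictions. A minimal way to complete the evaluation, assuming the true labels live in letters_test$letter as above, is to cross-tabulate predictions against actual letters and compute the overall agreement:

### compare predicted and actual letters
table(letter_predictions, letters_test$letter)

### overall proportion of correctly classified letters
agreement <- letter_predictions == letters_test$letter
prop.table(table(agreement))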

It is important to mention that this series of posts began as a personal way of practicing R programming and machine learning. Feedback from the community subsequently urged me to keep doing these exercises and sharing them. The bibliography and the corresponding authors are cited at all times; this series is a way of honoring them and giving them the credit they deserve for their work. We will develop an artificial neural network example. The example was originally published in "Machine Learning in R" by Brett Lantz, PACKT Publishing 2015 (open source community experience distilled).

The example we will develop is about predicting the strength of concrete based on the ingredients used to make it.

We will carry out the exercise verbatim as published in the aforementioned reference.

For more details on artificial neural network algorithms it is recommended to check the aforementioned reference or any other bibliography of your choice.

### "Machine Learning in R" by Brett Lantz, ### PACKT publishing 2015 ### (open source community experience destilled) ### based on: Yeh IC. "Modeling of Strength of ### High Performance Concrete Using Artificial ### Neural Networks." Cement and Concrete Research ### 1998; 28:1797-1808.

### strength of concrete example:
### relationship between the ingredients used in
### concrete and the strength of the finished product

### Dataset: compressive strength of concrete
### UCI Machine Learning Data Repository
### http://archive.ics.uci.edu/ml

### install and load required packages
# install.packages("neuralnet")

library(neuralnet)

### read and explore the data
concrete <- read.csv("concrete.csv")
str(concrete)

### neural networks work best when the input data
### are scaled to a narrow range around zero
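The normalize() function applied in the next line is not defined anywhere in this excerpt; in the book it is a small min-max scaling helper introduced in an earlier chapter, along these lines:

### min-max normalization helper (rescales a vector to the [0, 1] range)
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}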

### apply normalize() to every column of the dataset
concrete_norm <- as.data.frame(lapply(concrete, normalize))

### confirm the normalization and compare with the original scale
summary(concrete_norm$strength)
summary(concrete$strength)

### split the data into training and testing sets (75% - 25%)
concrete_train <- concrete_norm[1:773, ]
concrete_test <- concrete_norm[774:1030, ]

### train the model on the data
concrete_model <- neuralnet(strength ~ cement + slag + ash + water +
                              superplastic + coarseagg + fineagg + age,
                            data = concrete_train)

### visualize the network topology
plot(concrete_model)

### there is one input node for each of the eight
### features, followed by a single hidden node and
### a single output node that predicts the concrete strength

### at the bottom of the figure, R reports the number
### of training steps and an error measure called the
### sum of squared errors (SSE)
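The correlation computed below uses a predicted_strength object that is never created in the excerpt. With the neuralnet package, predictions come from compute(); a sketch of the missing step, assuming the first eight columns of concrete_test are the feature columns:

### generate predictions for the test set
model_results <- compute(concrete_model, concrete_test[1:8])

### extract the predicted strength values
predicted_strength <- model_results$net.result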

### because this is a numeric prediction problem rather
### than a classification problem, we cannot use a confusion
### matrix to examine model accuracy

### obtain the correlation between the predicted concrete strength
### and the true value
cor(predicted_strength, concrete_test$strength)

### the correlation indicates a strong linear relationship between
### the two variables

### improve model performance:
### increase the number of hidden nodes to five
concrete_model2 <- neuralnet(strength ~ cement + slag + ash + water +
                               superplastic + coarseagg + fineagg + age,
                             data = concrete_train, hidden = 5)

### notice that results can differ because neuralnet
### begins with random weights
### if you'd like to match results exactly, use set.seed(12345)
### before building the neural network

You can get the example and the dataset at: https://github.com/pakinja/Data-R-Value

Neural networks have been a very important area of scientific study, one that has evolved through different disciplines such as mathematics, biology, psychology and computer science. The study of neural networks leapt from theory to practice with the emergence of computers. Training a neural network by adjusting the weights of its connections is computationally very expensive, so its application to practical problems had to wait until the mid-1980s, when a more efficient algorithm was discovered. That algorithm is now known as backpropagation of errors, or simply backpropagation. One of the most cited articles on this algorithm is:

Learning representations by back-propagating errors. David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams. Nature 323, 533-536 (09 October 1986).

Although it is a very technical article, anyone who wants to study and understand neural networks should work through this material. I share the entire article at: https://github.com/pakinja/Data-R-Value

It is important to mention that this series of posts began as a personal way of practicing R programming and machine learning. Feedback from the community subsequently urged me to keep doing these exercises and sharing them. The bibliography and the corresponding authors are cited at all times; this series is a way of honoring them and giving them the credit they deserve for their work. We will develop a quadratic discriminant analysis example and a K-nearest neighbors example. The examples were originally published in "An Introduction to Statistical Learning. With Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Springer 2015.

The example we will develop is about classifying whether the market will rise (Up) or fall (Down).

We will carry out the exercise verbatim as published in the aforementioned reference and only with slight changes in the coding style.

For more details on the models, algorithms and parameter interpretation, it is recommended to check the aforementioned reference or any other bibliography of your choice.

### "An Introduction to Statistical Learning. ### With applications in R" by Gareth James, ### Daniela Witten, Trevor Hastie and Robert Tibshirani. ### Springer 2015.

### install and load required packages
library(ISLR)
library(psych)
library(MASS)
library(class)

### split the dataset into train and test sets
train <- (Smarket$Year < 2005)
Smarket.2005 <- Smarket[!train, ]
Direction.2005 <- Smarket$Direction[!train]

### Quadratic Discriminant Analysis

### perform quadratic discriminant analysis (QDA) on the stock
### market data
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

### the output does not contain the coefficients of the linear
### discriminants, because the QDA classifier involves a quadratic,
### rather than a linear, function of the predictors
qda.fit
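The next comment quotes an accuracy figure, but the prediction step itself is missing from the excerpt. In the ISLR workflow it would look roughly like this (predict() on a qda fit returns the predicted classes in its class component):

### predict on the 2005 test data and measure accuracy
qda.class <- predict(qda.fit, Smarket.2005)$class
table(qda.class, Direction.2005)
mean(qda.class == Direction.2005)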

### QDA predictions are accurate almost 60% of the time,
### even though the 2005 data was not used to fit the model;
### quite impressive for stock market data, which is known to
### be quite hard to model accurately

### K-Nearest Neighbors

### split the dataset into train and test sets
train.X <- cbind(Smarket$Lag1, Smarket$Lag2)[train, ]
test.X <- cbind(Smarket$Lag1, Smarket$Lag2)[!train, ]
train.Direction <- Smarket$Direction[train]

### the knn() function can be used to predict the market's movement
### for the dates in 2005
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.2005)
(83 + 43) / 252

### the results using K = 1 are not very good, since only 50% of the
### observations are correctly predicted
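The comment that follows refers to a slight improvement from increasing K, but that run is not shown here. The ISLR text repeats the fit with K = 3; a sketch of that step:

### repeat KNN with K = 3
knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2005)
mean(knn.pred == Direction.2005)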

### the results have improved slightly, but increasing K further
### turns out to provide no further improvement; it appears that for
### this data, QDA provides the best results of the methods that we
### have examined so far

It is important to mention that this series of posts began as a personal way of practicing R programming and machine learning. Feedback from the community subsequently urged me to keep doing these exercises and sharing them. The bibliography and the corresponding authors are cited at all times; this series is a way of honoring them and giving them the credit they deserve for their work. We will develop a linear discriminant analysis example. The exercise was originally published in "An Introduction to Statistical Learning. With Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Springer 2015. The example we will develop is about classifying whether the market will rise (Up) or fall (Down). We will carry out the exercise verbatim as published in the aforementioned reference and only with slight changes in the coding style.

For more details on the models, algorithms and parameter interpretation, it is recommended to check the aforementioned reference or any other bibliography of your choice.

### "An Introduction to Statistical Learning. ### with applications in R" by Gareth James, ### Daniela Witten, Trevor Hastie and Robert Tibshirani. ### Springer 2015.

### install and load required packages
library(ISLR)
library(psych)
library(MASS)

### perform linear discriminant analysis (LDA) on the stock
### market data
train <- (Smarket$Year < 2005)
Smarket.2005 <- Smarket[!train, ]
Direction.2005 <- Smarket$Direction[!train]
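The comments and calls that follow refer to an lda.fit object that is never created in this excerpt. In the ISLR text the model is fit with the lda() function from MASS, roughly as follows:

### fit the LDA model on the training years
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit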

### LDA indicates that 49.2% of training observations
### correspond to days during which the market went down

### the group means suggest that there is a tendency for the
### previous 2 days' returns to be negative on days when the
### market increases, and a tendency for the previous days'
### returns to be positive on days when the market declines

### the coefficients of linear discriminants output provides the
### linear combination of Lag1 and Lag2 that is used to form
### the LDA decision rule

### if (-0.642 * Lag1 - 0.514 * Lag2) is large, then the LDA
### classifier will predict a market increase, and if it is
### small, then the LDA classifier will predict a market decline

### the plot() function produces plots of the linear discriminants,
### obtained by computing (-0.642 * Lag1 - 0.514 * Lag2) for
### each of the training observations
plot(lda.fit)

### predictions
### the LDA and logistic regression predictions are almost identical
lda.pred <- predict(lda.fit, Smarket.2005)
names(lda.pred)

### applying a 50% threshold to the posterior probabilities allows
### us to recreate the predictions in lda.pred$class
sum(lda.pred$posterior[, 1] >= 0.5)
sum(lda.pred$posterior[, 1] < 0.5)

### the posterior probability output by the model corresponds to
### the probability that the market will decrease
lda.pred$posterior[1:20, 1]

### predicted classes for the same observations
lda.class <- lda.pred$class
lda.class[1:20]

### use a posterior probability threshold other than 50% in order
### to make predictions

### suppose that we wish to predict a market decrease only if we
### are very certain that the market will indeed decrease on that
### day, say, if the posterior probability is at least 90%
sum(lda.pred$posterior[, 1] > 0.9)

### No days in 2005 meet that threshold! In fact, the greatest
### posterior probability of decrease in all of 2005 was 52.02%
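The 52.02% figure can be checked directly from the posterior probabilities, assuming as before that the first posterior column corresponds to a market decrease:

### largest posterior probability of a market decrease in 2005
max(lda.pred$posterior[, 1])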