Thursday, May 25, 2017

The Most Popular Baby Names in the US Per Year. An Exercise in R.

Using the baby name databases of the Social Security Administration (https://www.ssa.gov/oact//babynames/limits.html), I will analyze the most popular name in each year from 1880 to 2016.

We can see that there are periods when certain names become fashionable, and that these fashions coincide with the appearance of certain songs, films, or events.

I used the following script in R to perform the analysis:

### Author: Francisco Jaramillo
### Data Scientist Pakinja
### Data R Value 2017


### load required libraries
library(dplyr)
library(ggplot2)


### read all 137 data files (one per year)
temp <- list.files(pattern = "\\.txt$")
myfiles <- lapply(temp, read.csv, header = FALSE)


### filter first data file by female names, get the
### most frequent and make it the first row of a dataframe
pop_names <- filter(myfiles[[1]], V2 == "F")[1, c(1, 3)]

### loop to do the previous step for all data files
for(i in 2:length(myfiles)){
  pop_names <- rbind(pop_names,
                     filter(myfiles[[i]], V2 == "F")[1, c(1, 3)])
}


### set factor levels for the $V1 (name) variable
pop_names$V1 <- factor(pop_names$V1, levels = unique(pop_names$V1))
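Setting the levels to unique() keeps the names in order of first appearance rather than the default alphabetical order, so the plot legend follows the timeline. A minimal sketch of the difference, using a toy vector rather than the SSA data:

```r
# toy vector of yearly top names (hypothetical, for illustration only)
top <- c("Mary", "Mary", "Linda", "Linda", "Jennifer")

# default factor(): levels are sorted alphabetically
levels(factor(top))                          # "Jennifer" "Linda" "Mary"

# levels = unique(): levels follow order of first appearance
levels(factor(top, levels = unique(top)))    # "Mary" "Linda" "Jennifer"
```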

### make year series
Year <- seq(1880, 2016, 1)


### bind $year variable to dataframe
pop_names <- cbind(pop_names, Year)

### set dataframe names
names(pop_names) <- c("Name", "Frequency", "Year")

### most used female names visualization
ggplot(pop_names, aes(x = Year, y = Frequency))+
  geom_bar(aes(fill = Name), stat = "identity")+
  labs(title="Most Used Female Names in USA Per Year")+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_continuous(breaks = seq(1880, 2016, by = 5))+
  scale_y_continuous(breaks = seq(0, 100000, by = 10000))+
  scale_fill_brewer(palette="Paired")+
  theme(legend.text=element_text(size=13))+
  theme(legend.position=c(0.1, 0.7))+
  theme(legend.background = element_rect(size=0.5, linetype="solid", colour ="darkblue"))




Incredibly, the name Mary remained the most used name for 75 years.

We also observe that use of the name Linda exploded from 1947, coinciding with the song "Linda" (Ray Noble & Buddy Clark, Columbia Records) entering the Billboard charts on March 21, 1947 and holding number one for several months.

About the name Jennifer, there are various analyses of why it was the most popular name for about 15 years; you just have to google it.

The most popular female name in recent years is Emma.
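As an aside, the same "top name per year" extraction done with the loop above can also be written with dplyr verbs. A minimal sketch on a toy data frame whose columns mimic the SSA layout (not the real data; slice_max() requires dplyr >= 1.0):

```r
library(dplyr)

# toy data in the SSA layout (name, sex, count), plus a year column
toy <- data.frame(
  Name  = c("Mary", "Anna", "Mary", "Emma"),
  Sex   = c("F", "F", "F", "F"),
  Count = c(7065, 2604, 6919, 7000),
  Year  = c(1880, 1880, 1881, 1881)
)

# most frequent female name in each year
top_per_year <- toy %>%
  filter(Sex == "F") %>%
  group_by(Year) %>%
  slice_max(Count, n = 1) %>%
  ungroup()
```

Here slice_max() picks the row with the largest Count within each Year group, replacing the explicit for loop.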
 
################ now for males

### filter first data file by male names, get the
### most frequent and make it the first row of a dataframe
pop_names_m <- filter(myfiles[[1]], V2 == "M")[1, c(1, 3)]

### loop to do the previous step for all data files
for(i in 2:length(myfiles)){
  pop_names_m <- rbind(pop_names_m,
                       filter(myfiles[[i]], V2 == "M")[1, c(1, 3)])
}


### set factor levels for the $V1 (name) variable
pop_names_m$V1 <- factor(pop_names_m$V1, levels = unique(pop_names_m$V1))

### make year series
Year <- seq(1880, 2016, 1)


### bind $year variable to dataframe
pop_names_m <- cbind(pop_names_m, Year)

### set dataframe names
names(pop_names_m) <- c("Name", "Frequency", "Year")

### most used male names visualization
ggplot(pop_names_m, aes(x = Year, y = Frequency))+
  geom_bar(aes(fill = Name), stat = "identity")+
  labs(title="Most Used Male Names in USA Per Year")+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_continuous(breaks = seq(1880, 2016, by = 5))+
  scale_y_continuous(breaks = seq(0, 100000, by = 5000))+
  scale_fill_brewer(palette="Paired")+
  theme(legend.text=element_text(size=13))+
  theme(legend.position=c(0.1, 0.7))+
  theme(legend.background = element_rect(size=0.5, linetype="solid", colour ="darkblue"))


Not surprisingly, the most popular name for 45 years was John.

The rise in popularity of the name James coincides with the premiere of the film "The Return of Frank James" on August 6, 1940, starring Henry Fonda.

The name Michael remained the most popular for about 43 years.

The name Jacob begins to be popular in 1999, coinciding with the premiere of the film "Jakob the Liar", starring Robin Williams.

The most popular male name in recent years is Noah.

I hope this exercise has been interesting and useful for you.



Get everything you need to repeat the exercise at:

https://github.com/pakinja/Data-R-Value

Monday, May 22, 2017

Machine Learning. Support Vector Machines (Optical Character Recognition).

It is important to mention that this series of posts began as a personal way of practicing R programming and machine learning. Subsequently, feedback from the community urged me to continue performing these exercises and sharing them. The bibliography and corresponding authors are cited at all times, and this series is a way of honoring them and giving them the credit they deserve for their work.
 
We will develop a support vector machine example. The example was originally published in "Machine Learning in R" by Brett Lantz, Packt Publishing 2015 (open source community experience distilled).


The example is about optical character recognition.

 
We will carry out the exercise verbatim as published in the aforementioned reference.

For more details on support vector machine algorithms, it is recommended to check the aforementioned reference or any other bibliography of your choice.



### "Machine Learning in R" by Brett Lantz,
### Packt Publishing 2015
### (open source community experience distilled)



### install and load required packages
install.packages("kernlab")

library(kernlab)

### read and explore the data
letters <- read.csv("letterdata.csv")
str(letters)

### SVMs require all features to be numeric and each
### feature scaled to a fairly small interval;
### the kernlab package performs the rescaling
### automatically


### split the data into training and testing sets
### 80% - 20%
### data is already randomized

letters_train <- letters[1:16000, ]
letters_test <- letters[16001:20000, ]
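The 80/20 split above relies on the rows already being in random order, as noted in the comment. When that is not the case, one might shuffle the row indices first. A minimal sketch of that pattern on a toy data frame (hypothetical columns, not the letters data):

```r
set.seed(123)  # make the shuffle reproducible

# toy data frame standing in for the dataset to split
df <- data.frame(x = 1:100, y = rnorm(100))

# draw 80% of the row indices at random for training
train_idx <- sample(nrow(df), size = 0.8 * nrow(df))
train <- df[train_idx, ]
test  <- df[-train_idx, ]   # remaining 20% for testing
```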


### training a model on the data
### simple linear kernel function

letter_classifier <- ksvm(letter ~ ., data = letters_train,
                          kernel = "vanilladot")
letter_classifier


### evaluating model performance
letter_predictions <- predict(letter_classifier, letters_test)

head(letter_predictions)

table(letter_predictions, letters_test$letter)
agreement <- letter_predictions == letters_test$letter
table(agreement)
prop.table(table(agreement))
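Since agreement is a logical vector, its mean is the overall accuracy; prop.table(table(...)) reports the same proportions. A minimal self-contained sketch with toy labels (not the model's actual predictions):

```r
# toy predicted vs. true labels
pred  <- c("A", "B", "C", "C")
truth <- c("A", "B", "B", "C")

# TRUE where the prediction matches the truth
agreement <- pred == truth

mean(agreement)                  # accuracy: 0.75
prop.table(table(agreement))     # FALSE 0.25, TRUE 0.75
```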


### improving model performance
### Gaussian RBF kernel

letter_classifier_rbf <- ksvm(letter ~ ., data = letters_train,
                              kernel = "rbfdot")
letter_predictions_rbf <- predict(letter_classifier_rbf,
                                  letters_test)
agreement_rbf <- letter_predictions_rbf == letters_test$letter
table(agreement_rbf)
prop.table(table(agreement_rbf))


### changing the kernel function increased accuracy
### from about 84% to about 93%


You can get the dataset at:
https://github.com/pakinja/Data-R-Value