Thursday, May 25, 2017

The Most Popular Baby Names in the US Per Year. An Exercise in R.

Using the databases of the Social Security Administration (https://www.ssa.gov/oact//babynames/limits.html), I will analyze the most used names in each year from 1880 to 2016.

We can see that there are periods when certain names become fashionable, and that these fashions coincide with the release of certain songs or films, or with particular events.

I used the following script in R to perform the analysis:

### Author: Francisco Jaramillo
### Data Scientist Pakinja
### Data R Value 2017


### load required libraries
library(dplyr)
library(ggplot2)


### read all 137 data files (one per year)
### note: pattern is a regular expression, not a shell glob
temp <- list.files(pattern = "\\.txt$")
myfiles <- lapply(temp, read.csv, header = FALSE)


### filter first data file by female names, get the
### most frequent and make it the first row of a dataframe
### (within each sex the files list names by descending count,
### so the first matching row is the most frequent name)
pop_names <- filter(myfiles[[1]], V2 == "F")[1, c(1, 3)]

### loop to do the previous step for all data files
for (i in 2:length(myfiles)) {
  pop_names <- rbind(pop_names,
                     filter(myfiles[[i]], V2 == "F")[1, c(1, 3)])
}


### set levels for $names variable
pop_names$V1 <- factor(pop_names$V1, levels = unique(pop_names$V1))

### make year series
Year <- seq(1880, 2016, 1)


### bind $year variable to dataframe
pop_names <- cbind(pop_names, Year)

### set dataframe names
names(pop_names) <- c("Name", "Frequency", "Year")

### most used female names visualization
ggplot(pop_names, aes(x = Year, y = Frequency))+
  geom_bar(aes(fill = Name), stat = "identity")+
  labs(title="Most Used Female Names in USA Per Year")+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_continuous(breaks = seq(1880, 2016, by = 5))+
  scale_y_continuous(breaks = seq(0, 100000, by = 10000))+
  scale_fill_brewer(palette="Paired")+
  theme(legend.text=element_text(size=13))+
  theme(legend.position=c(0.1, 0.7))+
  theme(legend.background = element_rect(size=0.5, linetype="solid", colour ="darkblue"))




Remarkably, Mary remained the most used name for 75 years.
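A figure like this can be read off the data rather than the plot. The following is a minimal sketch in base R, assuming a data frame shaped like pop_names (one row per year, with the top name); the toy data frame `top` and its names are made up for illustration and are not SSA data.

```r
# Toy stand-in for pop_names: made-up names, not SSA data
top <- data.frame(
  Name = c("Ann", "Ann", "Ann", "Bea", "Bea", "Ann"),
  Year = 2001:2006,
  stringsAsFactors = FALSE
)

# Total number of years each name spent at the top
years_on_top <- table(top$Name)
print(years_on_top)

# Consecutive runs, to tell one long reign from several shorter ones
runs <- rle(top$Name)
reigns <- data.frame(
  Name  = runs$values,
  Start = top$Year[cumsum(runs$lengths) - runs$lengths + 1],
  Years = runs$lengths
)
print(reigns)
```

With the real data, the Name column is a factor, so it would need to be converted first, for example rle(as.character(pop_names$Name)).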

We also observe that the use of the name Linda surges from 1947, coinciding with the song "Linda" (Ray Noble & Buddy Clark, Columbia Records), which reached the Billboard magazine charts on March 21, 1947 and held the number one position for several months.
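To see how sharp a jump like this is, one could follow a single name across all yearly files instead of looking only at the top row. This is a rough sketch under assumptions: `name_counts` is a hypothetical helper, and the two toy data frames use made-up names and counts in the same three-column layout (V1 = name, V2 = sex, V3 = count) as the files read into myfiles.

```r
# Toy stand-ins for two yearly files (V1 = name, V2 = sex, V3 = count);
# the names and counts are made up, not SSA figures
files <- list(
  data.frame(V1 = c("Ann", "Bea", "Carl"), V2 = c("F", "F", "M"),
             V3 = c(100L, 40L, 90L), stringsAsFactors = FALSE),
  data.frame(V1 = c("Bea", "Ann", "Carl"), V2 = c("F", "F", "M"),
             V3 = c(120L, 95L, 85L), stringsAsFactors = FALSE)
)
years <- c(2001, 2002)

# Hypothetical helper: count for one name and sex in each year
# (0 when the name does not appear in that year's file)
name_counts <- function(files, years, name, sex) {
  counts <- vapply(files, function(df) {
    hit <- df$V3[df$V1 == name & df$V2 == sex]
    if (length(hit) == 0) 0 else as.numeric(hit[1])
  }, numeric(1))
  data.frame(Year = years, Count = counts)
}

bea <- name_counts(files, years, "Bea", "F")

# Year-over-year ratio; a value well above 1 marks a sudden rise
bea$Ratio <- c(NA, bea$Count[-1] / bea$Count[-nrow(bea)])
print(bea)
```

On the real data the call would look like name_counts(myfiles, 1880:2016, "Linda", "F").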

As for the name Jennifer, there are various analyses of why it was the most popular name for about 15 years; a quick web search will turn them up.

The most popular name recently is Emma.
 
################ now for males

### filter first data file by male names, get the
### most frequent and make it the first row of a dataframe
pop_names_m <- filter(myfiles[[1]], V2 == "M")[1, c(1, 3)]

### loop to do the previous step for all data files
for (i in 2:length(myfiles)) {
  pop_names_m <- rbind(pop_names_m,
                       filter(myfiles[[i]], V2 == "M")[1, c(1, 3)])
}


### set levels for $names variable
pop_names_m$V1 <- factor(pop_names_m$V1, levels = unique(pop_names_m$V1))

### make year series
Year <- seq(1880, 2016, 1)


### bind $year variable to dataframe
pop_names_m <- cbind(pop_names_m, Year)

### set dataframe names
names(pop_names_m) <- c("Name", "Frequency", "Year")

### most used male names visualization
ggplot(pop_names_m, aes(x = Year, y = Frequency))+
  geom_bar(aes(fill = Name), stat = "identity")+
  labs(title="Most Used Male Names in USA Per Year")+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_continuous(breaks = seq(1880, 2016, by = 5))+
  scale_y_continuous(breaks = seq(0, 100000, by = 5000))+
  scale_fill_brewer(palette="Paired")+
  theme(legend.text=element_text(size=13))+
  theme(legend.position=c(0.1, 0.7))+
  theme(legend.background = element_rect(size=0.5, linetype="solid", colour ="darkblue"))


Not surprisingly, the most popular name for 45 years was John.

The rise of the name James coincides with the premiere of the film "The Return of Frank James", starring Henry Fonda, on August 6, 1940.

The name Michael remained the most popular for about 43 years.

The name Jacob becomes the most popular from 1999, coinciding with the premiere of the film "Jakob the Liar", starring Robin Williams.

Recently the most popular name is Noah.
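The female and male passes in the script differ only in the sex code, so they could be folded into one function. The sketch below is one possible way to do that in base R, not the script used for the plots above; `top_names` is a hypothetical helper and the toy data frames contain made-up names and counts.

```r
# Hypothetical helper: most frequent name of one sex per yearly data frame.
# Like the original script, it relies on each file listing the names of a
# sex in descending order of count, so the first matching row is the top name.
top_names <- function(files, years, sex) {
  rows <- lapply(files, function(df) df[df$V2 == sex, c(1, 3)][1, ])
  out <- do.call(rbind, rows)
  names(out) <- c("Name", "Frequency")
  out$Year <- years
  rownames(out) <- NULL
  out
}

# Toy stand-ins for two yearly files (made-up names and counts)
files <- list(
  data.frame(V1 = c("Ann", "Bea", "Carl", "Dan"),
             V2 = c("F", "F", "M", "M"),
             V3 = c(100L, 40L, 90L, 30L), stringsAsFactors = FALSE),
  data.frame(V1 = c("Bea", "Ann", "Dan", "Carl"),
             V2 = c("F", "F", "M", "M"),
             V3 = c(120L, 95L, 88L, 85L), stringsAsFactors = FALSE)
)

females <- top_names(files, c(2001, 2002), "F")
males   <- top_names(files, c(2001, 2002), "M")
print(females)
print(males)
```

With the real files this would be called as top_names(myfiles, 1880:2016, "F") and top_names(myfiles, 1880:2016, "M"), and the resulting data frames could be passed to the same ggplot calls.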

I hope this exercise has been interesting and useful for you.



Get everything you need to repeat the exercise at:

https://github.com/pakinja/Data-R-Value