Exploring the popularity of Pig and Hive
Pig and Hive are often compared with one another for their ability to manipulate data on a Hadoop cluster. There are some important differences. Pig scripts are written in Pig Latin, a procedural data-flow language, while Hive offers HiveQL, a SQL-like query language (not a full implementation of the SQL standard), which gives it a leg up in terms of user familiarity. I wanted to see how the two compare in the number of posts on Stack Overflow, a popular question-and-answer site for software developers.
I first found the most popular tag for each technology on Stack Overflow. Then I used the public Stack Exchange Data Explorer and queried for those tags. I downloaded the results as CSV files and brought them into R for some visualizations.
library(tidyverse)
## Warning: Installed Rcpp (0.12.12) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
pig <- read.csv("pig.csv", header = TRUE)
hive <- read.csv("hive.csv", header = TRUE)
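pighive <- rbind(pig, hive) # combine the two data sets into one data frame
# Parse the month column as POSIXct, a plain date-time vector that sits cleanly
# in a data frame column (strptime() returns the list-based POSIXlt class instead)
pighive$mo <- as.POSIXct(as.character(pighive$mo), format = "%Y-%m-%d %H:%M:%S")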
ggplot(pighive, aes(mo, Total.Votes)) +
  geom_line(aes(color = TagName)) +
  ggtitle("Popularity of Pig vs Hive on Stack Overflow") +
  ylab("Tag Votes") +
  xlab("Time")
The number of posts tagged apache-pig has plateaued and dropped slightly from its peak in 2014. Hive has gained in popularity and now has more than 3x as many posts. Seems like a clear winner for Hive here.
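A ratio like "3x" is easier to trust when it is computed rather than read off the chart. Here is a minimal sketch of how that check might look, assuming the combined data frame has the same `TagName`, `mo`, and `Total.Votes` columns used in the plot. The data frame `pighive_toy`, its numbers, and the `hive` tag name are made up for illustration and are not the downloaded Stack Overflow data.

```r
# Toy stand-in for the combined data frame. These values are invented for
# illustration only and are NOT the Stack Exchange export used in this post.
pighive_toy <- data.frame(
  TagName     = rep(c("apache-pig", "hive"), each = 3),
  mo          = as.POSIXct(rep(c("2017-01-01", "2017-02-01", "2017-03-01"),
                               times = 2), tz = "UTC"),
  Total.Votes = c(100, 90, 80, 300, 310, 320)
)

# Keep only the rows for the most recent month present in the data.
latest <- pighive_toy[pighive_toy$mo == max(pighive_toy$mo), ]

# Named vector of the latest value per tag.
latest_counts <- setNames(latest$Total.Votes, as.character(latest$TagName))

# Ratio of Hive to Pig in the latest month.
hive_to_pig <- latest_counts[["hive"]] / latest_counts[["apache-pig"]]
print(hive_to_pig)
```

Swapping `pighive_toy` for the real `pighive` data frame should give the actual ratio for the latest month, provided the tag names match the ones in the export.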