Hadoop Popularity

Hive and Pig are no match for Spark

Exploring the popularity of Pig and Hive

Pig and Hive are often compared for their ability to manipulate data on a Hadoop cluster, but there are some important differences. Hive's query language, HiveQL, closely resembles SQL, which gives it a leg up in user familiarity, while Pig uses its own scripting language, Pig Latin. I wanted to see how the two compare in the number of posts on Stack Overflow, a popular question-and-answer site for software developers.

I first found the most popular tag for each technology on Stack Overflow. Then I queried those tags in the public Stack Exchange Data Explorer, downloaded the results as CSV files, and brought them into R for some visualizations.

library(tidyverse)  # loads ggplot2, which is used for the plot below
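# Read the CSV exports downloaded from the data explorer
pig <- read.csv("pig.csv", header = TRUE)
hive <- read.csv("hive.csv", header = TRUE)
pighive <- rbind(pig, hive)  # combine the data into one data frame

# Parse the month column as POSIXct; strptime() returns POSIXlt,
# which does not work well as a data frame column
pighive$mo <- as.POSIXct(as.character(pighive$mo), format = "%Y-%m-%d %H:%M:%S")

# One line per tag, colored by tag name
ggplot(pighive, aes(mo, Total.Votes)) +
  geom_line(aes(color = TagName)) +
  ggtitle("Popularity of Pig vs Hive on Stack Overflow") +
  ylab("Tag Votes")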

The number of posts tagged apache-pig has plateaued and dropped slightly from its peak in 2014. Hive has gained in popularity and now has more than 3x as many posts. Hive looks like the clear winner here.
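To put a number on a comparison like the "3x" above, one option is to divide the two tags' values for the most recent month they share. The sketch below is only an illustration: `tag_ratio` is a hypothetical helper, not part of the original analysis. It assumes the same `mo`, `TagName`, and `Total.Votes` columns as the plotting code, and the `toy` data frame holds made-up numbers purely to show the call.

```r
# Hypothetical helper (not from the original analysis): ratio of one tag's
# value to another's in the most recent month that both tags have data for.
tag_ratio <- function(df, numerator, denominator, value_col = "Total.Votes") {
  # Reduce the timestamp to a Date so both tags can be matched by month
  df$month <- as.Date(as.character(df$mo))
  num <- df[df$TagName == numerator, c("month", value_col)]
  den <- df[df$TagName == denominator, c("month", value_col)]

  # Keep only months present for both tags
  both <- merge(num, den, by = "month", suffixes = c(".num", ".den"))
  if (nrow(both) == 0) stop("the two tags share no months")

  latest <- both[which.max(as.numeric(both$month)), ]
  latest[[paste0(value_col, ".num")]] / latest[[paste0(value_col, ".den")]]
}

# Made-up values for illustration only; these are not the real Stack Overflow counts
toy <- data.frame(
  mo = rep(c("2017-01-01 00:00:00", "2017-02-01 00:00:00"), times = 2),
  TagName = rep(c("apache-pig", "hive"), each = 2),
  Total.Votes = c(100, 90, 280, 300)
)

ratio <- tag_ratio(toy, "hive", "apache-pig")
print(ratio)
```

With the real data, the call would be `tag_ratio(pighive, "hive", "apache-pig")`, assuming those are the tag names in the `TagName` column.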
