Hadoop Popularity

Exploring the popularity of Pig and Hive

Pig and Hive are often compared for their ability to manipulate data on a Hadoop cluster, but there are some important differences. Hive's query language, HiveQL, is closely modeled on SQL, which gives it a leg up in terms of user familiarity, while Pig uses its own data-flow scripting language, Pig Latin. I wanted to see how the two compare in the number of posts on Stack Overflow, a popular question-and-answer site for software developers.

I first found the most popular tag for each technology on Stack Overflow. Then I used the public Stack Exchange Data Explorer and entered the tags as queries. Finally, I downloaded the results as CSV files and brought them into R for some visualizations.

library(tidyverse)

## Warning: Installed Rcpp (0.12.12) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

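# Load the monthly tag data exported from the Data Explorer
pig <- read.csv("pig.csv", header = TRUE)
hive <- read.csv("hive.csv", header = TRUE)
pighive <- rbind(pig, hive) # combine the data into one data frame
# Parse the month column as POSIXct; POSIXlt (what strptime returns) is awkward as a data frame column
pighive$mo <- as.POSIXct(as.character(pighive$mo), format = "%Y-%m-%d %H:%M:%S")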

ggplot(pighive, aes(mo, Total.Votes)) +
  geom_line(aes(color = TagName)) + 
  ggtitle("Popularity of Pig vs Hive on Stack Overflow") +
  ylab("Tag Votes") +
  xlab("Time")

The number of posts tagged apache-pig has plateaued and dropped slightly from its peak in 2014. Hive has kept gaining popularity and now has more than three times as many posts. Hive looks like the clear winner here.
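The "more than 3x" comparison can be checked directly from the combined data frame rather than read off the plot. The following is a minimal sketch using base R. The `pighive_demo` data frame, its numbers, and the helper `latest_by_tag` are hypothetical stand-ins, and the `hive` tag name is an assumption; only the column names (`mo`, `TagName`, `Total.Votes`) come from the code above.

```r
# Hypothetical stand-in for the combined pig/hive data. The numbers are made up
# purely to show the calculation; they are not taken from Stack Overflow.
pighive_demo <- data.frame(
  mo = as.POSIXct(rep(c("2014-06-01 00:00:00", "2017-06-01 00:00:00"), times = 2),
                  format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
  TagName = rep(c("apache-pig", "hive"), each = 2),
  Total.Votes = c(120, 100, 150, 320)
)

# Return the value for the most recent month in the data, named by tag.
# Assumes every tag has a row for that most recent month.
latest_by_tag <- function(df) {
  latest <- df[df$mo == max(df$mo), ]
  setNames(latest$Total.Votes, as.character(latest$TagName))
}

latest <- latest_by_tag(pighive_demo)

# Ratio of Hive to Pig in the most recent month
hive_to_pig <- unname(latest["hive"] / latest["apache-pig"])
print(hive_to_pig)
```

With the real data, passing `pighive` to the same helper in place of `pighive_demo` would give the ratio for the latest month in the export.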

Comparing the popularity of all Hadoop-related technologies

How do the other Hadoop-related technologies compare?

hadoop <- read.csv("hadoop.csv", header = TRUE)
hbase <- read.csv("hbase.csv", header = TRUE)
spark <- read.csv("spark.csv", header = TRUE)
mahout <- read.csv("mahout.csv", header = TRUE)
mapreduce <- read.csv("mapreduce.csv", header = TRUE)

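hadoop_all <- rbind(hadoop, pig, hive, hbase, spark, mahout, mapreduce) # combine the data into one data frame
# Parse the month column as POSIXct; POSIXlt (what strptime returns) is awkward as a data frame column
hadoop_all$mo <- as.POSIXct(as.character(hadoop_all$mo), format = "%Y-%m-%d %H:%M:%S")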

ggplot(hadoop_all, aes(mo, Total.Votes)) +
  geom_line(aes(color = TagName)) + 
  ggtitle("Popularity of Hadoop-related technologies on Stack Overflow") +
  ylab("Tag Votes") +
  xlab("Time")

Hive’s rise is completely dwarfed by Spark’s acceleration over the past year and a half. Spark is tagged in twice as many posts as the general Hadoop tag, even though Spark grew out of the Hadoop ecosystem. This has convinced me to put my full effort into learning Spark going forward.
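The loading steps in this post repeat the same `read.csv` call for every tag. One way to shorten that is a small helper that reads a list of files and stacks them. This is only a sketch: `read_tag_files` is a hypothetical helper, the demo files and their values are made up, and it assumes every export has the same three columns used above (`mo`, `TagName`, `Total.Votes`).

```r
# Read one CSV per tag and stack them into a single data frame.
# Assumes every file has the same columns (mo, TagName, Total.Votes).
read_tag_files <- function(paths) {
  frames <- lapply(paths, read.csv, header = TRUE, stringsAsFactors = FALSE)
  combined <- do.call(rbind, frames)
  # Parse the month column as POSIXct so it plots on a date-time axis
  combined$mo <- as.POSIXct(combined$mo, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  combined
}

# Demo with two tiny throwaway files (made-up values) in a temporary directory
demo_dir <- tempfile("tags")
dir.create(demo_dir)
write.csv(data.frame(mo = "2017-06-01 00:00:00", TagName = "hive", Total.Votes = 10),
          file.path(demo_dir, "hive.csv"), row.names = FALSE)
write.csv(data.frame(mo = "2017-06-01 00:00:00", TagName = "apache-spark", Total.Votes = 30),
          file.path(demo_dir, "spark.csv"), row.names = FALSE)

hadoop_demo <- read_tag_files(
  list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
)
print(hadoop_demo)
```

With the real exports, the same helper could be called on a vector of file names such as `c("hadoop.csv", "pig.csv", "hive.csv")`, and the result passed to `ggplot()` exactly as `hadoop_all` is above.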
