Monte Carlo of Random Correlations

Exploring correlations of random numbers

When working with big data, you need to be more aware of statistical outliers than you do with more typical data sizes. Basic statistical tests like a Student's t-test or Pearson correlation are acceptable when you only test a few relationships in a small data set. But when you examine correlations across thousands of columns of data, you are bound to find several that appear strongly correlated purely by chance. [Read More]
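To make the point concrete, here is a minimal sketch of such a Monte Carlo experiment. It is not the code from the full post; the function names (`pearson`, `count_spurious`), the sample sizes, and the 0.5 threshold are assumptions chosen for illustration. It draws one random target column, correlates it against many other independent random columns, and counts how many cross the threshold even though no real relationship exists.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_rows=20, n_cols=2000, threshold=0.5, seed=0):
    """Count random columns whose |r| with a random target exceeds threshold.

    All values are independent standard normal draws, so every
    "strong" correlation found here is a chance result.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    strong = 0
    for _ in range(n_cols):
        column = [rng.gauss(0, 1) for _ in range(n_rows)]
        if abs(pearson(target, column)) > threshold:
            strong += 1
    return strong


if __name__ == "__main__":
    hits = count_spurious()
    print(f"{hits} of 2000 random columns have |r| > 0.5 with the target")
```

With only 20 rows per column, a small but steady fraction of the columns clears the threshold. Increasing `n_rows` shrinks that fraction, while increasing `n_cols` raises the absolute number of chance hits.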

Hadoop Popularity

Hive and Pig are no match for Spark

Exploring the popularity of Pig and Hive

Pig and Hive are sometimes compared with one another for their ability to manipulate data on a Hadoop cluster. There are some important differences. Hive's query language, HiveQL, closely follows SQL, which gives it a leg up in terms of user familiarity. I wanted to see how the two compared in the number of posts on Stack Overflow, a popular question-and-answer site for software developers. [Read More]

Working with Pigs

Grunt

This week I've been learning about the Pig language for Hadoop distributed computing systems. This is the first of several languages we are covering this semester that were designed as abstractions on top of MapReduce. It's really interesting to me how many different layers of programming can sit on top of one another and work together to make working with what is essentially machine language more human-like. In the case of Pig, the language is called Pig Latin, and it resembles SQL in several ways. [Read More]