Tag Archives: mapreduce

Moving from MapReduce to Apache Spark

MapReduce provides only the map and reduce primitives. Everything else (group-by, sort, top, join, filter, etc.) has to be force-fit into those two primitives. Spark, on the other hand, provides many other primitives as methods on the RDD (1, 2, 3), so it is easier to write programs in Spark than in MapReduce. Spark programs also tend to be faster, because Spark keeps the data in memory between iterations and spills it to disk only when required. In MapReduce, the data has to be written to disk between iterations, which makes it slow.
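To make "force-fit" concrete, here is a minimal sketch of a join expressed with nothing but a map function and a reduce function (a reduce-side join). It is plain Python, not Hadoop or Spark code: `run_mapreduce` is a hypothetical single-process stand-in for the framework, and the sample users and orders are made up for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy, single-process stand-in for a MapReduce job (hypothetical helper)."""
    # Map phase, followed by a shuffle that groups the emitted values by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: one call per key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Made-up sample data: (user_id, name) and (user_id, item).
users = [("u1", "alice"), ("u2", "bob")]
orders = [("u1", "book"), ("u2", "pen"), ("u1", "lamp")]


def join_mapper(record):
    # Tag every value with its source so the reducer can tell the two sides apart.
    source, (user_id, payload) = record
    yield user_id, (source, payload)


def join_reducer(user_id, tagged_values):
    # Split the values for this key back into the two sides, then pair them up.
    names = [payload for source, payload in tagged_values if source == "user"]
    items = [payload for source, payload in tagged_values if source == "order"]
    for name in names:
        for item in items:
            yield (user_id, name, item)


tagged_input = [("user", u) for u in users] + [("order", o) for o in orders]
joined = run_mapreduce(tagged_input, join_mapper, join_reducer)
print(joined)
```

The tagging and untagging is bookkeeping that every join written this way has to repeat. On Spark pair RDDs the same operation is a single `join` call.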

The map and reduce primitives are the common denominator of MapReduce and Spark, but they do not behave identically in the two frameworks. For those who are familiar with the MapReduce model and would like to move to Spark, here is a nice blog entry from Sean Owen (Cloudera). As usual, reading the article is a good place to start, but actual practice is what makes things clear.

Underlying Hive Engines

One thing to note is that Hadoop/MapReduce programming is fairly low level, and it takes a lot of boilerplate and logic code to complete even a simple task like WordCount. Here is the program to count the occurrences of each word in a given input, which gives an idea of how complex MapReduce programs can become.
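For readers who have not seen WordCount before, the logic itself is small. The following is a rough sketch of the three phases in plain Python; it is not the Hadoop program referred to above, and the function names and sample lines are invented for illustration. The real Java version needs mapper and reducer classes plus job configuration around this same logic, which is where the boilerplate comes from.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Group all emitted values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    # Sum the ones collected for a single word.
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


sample = ["the quick brown fox", "the lazy dog", "The fox"]
counts = word_count(sample)
print(counts)
```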

For these reasons, Facebook started Apache Hive (note that Facebook is moving from Hive towards Presto) and Yahoo started Apache Pig. Both are higher-level abstractions that convert HiveQL (for Hive) and PigLatin (for Pig) into low-level MapReduce execution. Hive and Pig provide better developer productivity than raw MapReduce, so most companies are opting for higher-level abstractions like Hive and Pig for data processing.

[Figure: Hive-Pig-MR]

As mentioned in the previous blog, MapReduce is batch oriented in nature, so frameworks like Hive and Pig are batch oriented as well. Work is therefore in progress to support multiple Hive processing engines, such as Apache Incubator Tez and Apache Spark, besides the old MapReduce. With simple configuration changes, Hive queries can be converted to Tez, Spark, or MR programs to be executed on the appropriate platform.

[Figure: Hive-Engines]
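As an illustration of what such a configuration change looks like, the engine can be chosen per session through the `hive.execution.engine` property. Treat this as a sketch: which values are accepted depends on the Hive version and on which engines are actually installed on the cluster.

```sql
-- Pick one execution engine for the current Hive session.
SET hive.execution.engine=mr;     -- classic MapReduce
SET hive.execution.engine=tez;    -- Apache Tez
SET hive.execution.engine=spark;  -- Apache Spark (Hive on Spark)
```

Queries issued after the `SET` statement are planned for the selected engine; the HiveQL itself does not change.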