Moving from MapReduce to Apache Spark

MapReduce provides only map and reduce primitives. Rest of the things (groupby, sort, top, join, filter etc) have to be force fit into the map and reduce primitives. But, Spark provides a lot of other primitives as methods on RDD (1, 2, 3). So, it’s easy to write programs in Spark than using MapReduce. Also, Spark programs are a bit faster because Spark keeps the data in the memory and spills it to the disk when required between each iteration. In case of MapReduce, the data has to be written to the disk between iterations which make it slow.

The common denominator to MapReduce and Spark are the map and the reduce primitives. But, there are some differences between the map and the reduce primitives in MapReduce and Spark. For those familiar with the MapReduce model and would like to move to Spark, here is an nice blog entry from Sean Owen (Cloudera). As usual reading the article is something to start with, but the actual practice is one which will make things more clear.

One thought on “Moving from MapReduce to Apache Spark

  1. Hello Praveen garu,

    Your blog is very helpful.

    I have a doubt in oozie, and thought you would be the best to suggest. I have written the java code using FileUtil.CopyMerge(….). Below is my oozie java action :

    Merging ${nameNode}/user/abhime01/haadoop/Merge/merge_output
    file:///home/abhi01/yoooize.txt

    In the above if I try the second arg as a path in hdfs I am able to merge the data.
    But if I give it as path in local file system I am getting the below error :

    “Mkdirs failed to create file:/home/abhime01 (exists=false, cwd=file:/CDH/sdu1/yarn/nm/usercache/abhime01/appcache/application_1440579785423_1755/container_e27_1440579785423_1755_01_000001)”

    Can you please suggest me how to merge output and store it into local FS using ooize.

    PS: Java code work fine if it is run directly, prblm exist only if it is run through oozie.

Leave a Reply

Your email address will not be published. Required fields are marked *