Hadoop/MR vs Spark/RDD WordCount program

Spark provides an efficient way of solving iterative algorithms by keeping the intermediate data in memory. This avoids the overhead of reading/writing the intermediate data to and from disk, as is the case with MR.
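
For illustration, here is a minimal PySpark sketch (the input path and the per-pass computation are invented) of an iterative job that caches its working set once and then runs every pass against memory:

```python
from pyspark import SparkContext

sc = SparkContext(appName="IterativeSketch")

# Parse the input once and keep it in memory for the following passes.
points = (sc.textFile("hdfs:///data/points.txt")          # placeholder path
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())

result = 0.0
for i in range(10):
    # Each pass reads 'points' from memory; an equivalent MR job would
    # rescan the input from disk on every pass.
    result = points.map(lambda p: sum(p)).reduce(lambda a, b: a + b)

print(result)
sc.stop()
```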

Also, when running the same operation again and again, data can be cached/fetched from memory without performing the same operation again. MR is stateless: say a program/application in MR has been executed 10 times, then the whole dataset has to be scanned 10 times.

[Figure: Spark vs MR]

Also, as the name suggests, MR supports only the Map and Reduce operations, and everything (join, groupBy etc.) has to be fit into the Map and Reduce model, which might not be the most efficient approach. Spark supports a number of other transformations and actions besides just Map and Reduce, as mentioned here and here.
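
As a rough illustration (the tiny datasets below are made up), a cached RDD can be reused by several actions, and operations such as reduceByKey and join are single calls instead of hand-built Map/Reduce passes:

```python
from pyspark import SparkContext

sc = SparkContext(appName="BeyondMapReduce")

# Invented (user, amount) and (user, name) pairs for illustration.
orders = sc.parallelize([("u1", 30.0), ("u2", 15.5), ("u1", 7.25)])
users = sc.parallelize([("u1", "alice"), ("u2", "bob")])

# join/reduceByKey would each need a dedicated map+shuffle+reduce pass in plain MR.
totals = orders.reduceByKey(lambda a, b: a + b).join(users).cache()

print(totals.count())    # first action computes and caches the result
print(totals.collect())  # second action is served from memory
sc.stop()
```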

Spark code is also compact when compared to the MR code. Below is the program for performing WordCount using Python in Spark.
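
A minimal sketch of such a WordCount in PySpark (the input and output paths are placeholders):

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

# Read the input, split lines into words, emit (word, 1) pairs and sum per word.
counts = (sc.textFile("hdfs:///input/words.txt")          # placeholder input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///output/wordcount")          # placeholder output path
sc.stop()
```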

The same program in the MR model using Hadoop is a bit more verbose, as shown below.
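
A sketch along the lines of the stock Hadoop MapReduce WordCount (input and output paths are taken from the command-line arguments):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```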

Here we looked at WordCount, which is similar to a HelloWorld program in terms of simplicity, for one to get started with a new concept/technology/language. In future blogs, we will look into some of the more advanced features in Spark.

10 thoughts on “Hadoop/MR vs Spark/RDD WordCount program”

  1. Well, it's not easy or fault-tolerant to keep large data outputs (intermediate key-values) in memory for a long duration. I have worked on some very robust in-memory databases and, trust me, there are a lot of complexities involved in keeping the data consistent over a period of time.

    By the way, anything that's abstract leaves little scope for customization. Arguably, having a few lines of code to achieve the implementation may be a little tempting, but there is a whole bunch of other software life-cycle concerns that will open up caveats in such an approach. Take the case of Hive: it fails in many ways when we work on some real-time implementations.

    1. Maybe you could throw some light on the points below; it's not clear from the comments.

      a) What are the problems you faced with Hive?
      b) What is the problem with storing the intermediate key-values in memory?

      And I agree with you: the more the abstraction, the less the flexibility and the more the developer productivity.

      1. We had a requirement to scan through a large log file to spit out some essential key parameters. The initial thought was to load the whole dataset into Hive and run through some queries, but there were challenges involved in processing each input line: for instance, a bunch of lines containing the relevant message info had to be parsed in one iteration, in certain cases the queries resulted in null outputs (the input dataset was not structured), and multiple log files contained the required data.

        Eventually the model adopted was: run a MapReduce (MR) job to filter the initial log files (BTW, the MultipleInputs class was used as there were multiple log files generated by different nodes); the generated files contain specific messages of a given category and are loaded into Hive external tables for further analytics.

        Perhaps there could be other ways to overcome these challenges, but MR gave us some flexibility for customization.

        There's certainly a dearth of Java talent, so not everyone likes to have their implementations in MR, but it does have its own benefits when we write the code to design the software.

        In-memory data storage has the overhead of checkpointing for failover recovery; also, imagine working on large datasets producing huge numbers of key-value pairs: this would result in a lot of store keeping activities. Well, it's really interesting to have in-memory storage, and I am sure things will evolve in the coming days.

        Ah! And finally I see someone who really cares about developers :).

        1. Picking the right tool is really important. This is the pattern I see: use Pig for preprocessing the data and then use Hive for doing some analytics or generating reports. Pig lies between Hive and MR in terms of flexibility and is much easier than MR.

          Checkpointing is a necessity, especially when using memory, so as to start from the last checkpointed state in case of failures. Checkpointing can be tweaked using different approaches, like a proper SerDe, compression and others (sketched below); not sure how it is an overhead.

          Not sure what you mean by store keeping activities.
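
          In Spark, for example, these knobs could look something along the following lines (the checkpoint directory, input path and configuration values are purely illustrative):

          ```python
          from pyspark import SparkConf, SparkContext

          # Illustrative settings only: Kryo serialization and compressed RDD data.
          conf = (SparkConf()
                  .setAppName("CheckpointSketch")
                  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                  .set("spark.rdd.compress", "true"))
          sc = SparkContext(conf=conf)

          sc.setCheckpointDir("hdfs:///tmp/checkpoints")   # placeholder directory

          pairs = (sc.textFile("hdfs:///data/log.txt")     # placeholder input
                     .map(lambda line: (line.split(" ")[0], 1))
                     .reduceByKey(lambda a, b: a + b))

          pairs.checkpoint()   # state is saved to the checkpoint dir, truncating the lineage
          pairs.count()        # an action forces the computation and the checkpoint
          sc.stop()
          ```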

  2. I have had a question running in my mind for a long time, and no answer has been convincing:

    What if I am trying to work on datasets whose size is more than my memory?
    Suppose I am working on a clustering algorithm and have data whose size is 3x greater than my memory.

    Here I can't pull the complete data into my RAM (in-memory), nor can I break the data into pieces which fit in my memory and perform the clustering.

    I would be happy if you can brief us about this.
    How does it work with SparkContext and in normal MR mode?

    1. As Praveen has rightly mentioned,
      in normal Hadoop mode the MapReduce program works off HDFS, so main memory is not a problem, whereas in Spark it would be.
      But Spark handles it by spilling the data onto the filesystem when the main memory is fully utilized.
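
      For example, a rough sketch (the dataset path is invented) of persisting with a disk fallback, so partitions that do not fit in memory spill to local disk:

      ```python
      from pyspark import SparkContext, StorageLevel

      sc = SparkContext(appName="SpillSketch")

      # Assume this dataset is larger than the available executor memory.
      big = sc.textFile("hdfs:///data/huge-input/*")       # placeholder path

      # MEMORY_AND_DISK keeps what fits in RAM and spills the rest to local disk.
      big.persist(StorageLevel.MEMORY_AND_DISK)

      print(big.count())
      sc.stop()
      ```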

  3. Can you be more precise? I would request you to elaborate a little more on this.
    You mean to say that some of the data resides on the disk and the required operations are still performed?
    In that case, the performance is gone.

  4. I'm a bit confused when it comes to creating an RDD. An RDD is a collection of partitions, and a partition is a logical division of the data. When an RDD is created, I want to know if the data is physically loaded in memory or if the RDD only has the address of the physical data located in HDFS, for example?
    Thanks for your help
