Tag Archives: python

Spark Accumulators vs Hadoop Counters

In the previous blog, we looked at how to find the maximum temperature for a particular year using Spark/Python. Now, we will extend the program to find the number of valid and invalid records in the weather data set using accumulators. Spark accumulators are similar to Hadoop counters and can be used to count the number of times an event happens during job execution.
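
Here is a minimal sketch of how that record counting might look with PySpark accumulators. The HDFS paths, accumulator names, and NCDC field offsets are assumptions carried over from the earlier max-temperature example, not the exact code from this post:

```python
from pyspark import SparkContext

sc = SparkContext(appName="MaxTemperatureWithAccumulators")

# Accumulators for counting valid and invalid records (names are illustrative)
valid_records = sc.accumulator(0)
invalid_records = sc.accumulator(0)

def parse_record(line):
    # NCDC fixed-width layout: year at [15:19], temperature at [87:92], quality flag at [92:93]
    year = line[15:19]
    temperature = int(line[87:92])
    quality = line[92:93]
    if temperature != 9999 and quality in "01459":
        valid_records.add(1)
        return [(year, temperature)]
    invalid_records.add(1)
    return []

max_temperature_per_year = (sc.textFile("hdfs:///user/hduser/weather")   # input path is an assumption
                              .flatMap(parse_record)
                              .reduceByKey(max))

max_temperature_per_year.saveAsTextFile("hdfs:///user/hduser/weather-accumulator-output")

# Accumulator values are read on the driver after an action has executed
print("valid = %d, invalid = %d" % (valid_records.value, invalid_records.value))
```

Note that in this sketch the accumulators are updated inside a transformation (flatMap), which is exactly the situation the next paragraph cautions about.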

In the sunny-day scenario, each RDD record is processed only once. But during a failure, the same RDD record might be processed multiple times, so the accumulator gets updated multiple times for the same record, which can lead to an incorrect accumulator value. The same is the case with Hadoop: a block of data might be processed multiple times, either due to a failure or due to speculative execution, and the counter might be incremented multiple times for the same record. Not exactly sure whether this failure scenario is handled in Spark and Hadoop.

Finding the top 3 movies using Spark/Python

Given movie and movie rating data in the format below (with scaled-down sample data sets for movie and movierating), the Spark program aggregates the movie ratings and gives the top 3 movies by name. For this to happen, an aggregation has to be done on the movierating data and a join has to be done with the movie data.

Note that the MapReduce model doesn't support the join primitive, so a join has to be forcefully fitted into the map and reduce primitives. But Spark supports join and other high-level primitives, which makes it easier to write Spark programs.

movie
id (pk) | name | year

movierating
userid | movieid (fk) | rating

The Spark program is a bit complex when compared to the previous programs mentioned in this blog, but it can be easily understood by going through the comments and the Spark Python API documentation.
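
Here is a minimal sketch of the approach. The post doesn't specify whether the ratings are summed or averaged, so this version uses the total rating; the HDFS paths and the '|' delimiter are also assumptions based on the schema above:

```python
from pyspark import SparkContext

sc = SparkContext(appName="Top3Movies")

# movierating: userid | movieid | rating  ->  (movieid, rating)
ratings = (sc.textFile("hdfs:///user/hduser/movierating")        # path and '|' delimiter are assumptions
             .map(lambda line: line.split("|"))
             .map(lambda f: (f[1], float(f[2]))))

# movie: id | name | year  ->  (id, name)
movies = (sc.textFile("hdfs:///user/hduser/movie")
            .map(lambda line: line.split("|"))
            .map(lambda f: (f[0], f[1])))

# Aggregate the ratings per movie, join with the movie data to pick up the name,
# then sort on the aggregated rating and take the top 3
top3 = (ratings.reduceByKey(lambda a, b: a + b)                  # (movieid, total_rating)
               .join(movies)                                     # (movieid, (total_rating, name))
               .map(lambda kv: (kv[1][0], kv[1][1]))             # (total_rating, name)
               .sortByKey(ascending=False)
               .take(3))

for total_rating, name in top3:
    print(name, total_rating)
```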

Finally, here is the output from the above Spark program in HDFS (movie-db-output screenshot).

Maximum temperature for year using Spark SQL

In the previous blog, we looked at how to find the maximum temperature of each year from the weather data set. Below is the code for the same using Spark SQL, which is a layer on top of Spark. SQL on Spark was earlier supported through Shark, which is being replaced by Spark SQL. Here is a nice blog from Databricks on the future of SQL on Spark.
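
The post dates from the Shark-to-Spark-SQL transition, so the exact API it used likely differs from today's; below is a sketch of the same idea written against the current SparkSession/DataFrame API, with the HDFS paths and NCDC field offsets being assumptions:

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("MaxTemperatureSQL").getOrCreate()
sc = spark.sparkContext

def to_row(line):
    # Same NCDC fixed-width assumptions as in the non-SQL version
    return Row(year=line[15:19], temperature=int(line[87:92]), quality=line[92:93])

weather = spark.createDataFrame(sc.textFile("hdfs:///user/hduser/weather").map(to_row))  # path is an assumption
weather.createOrReplaceTempView("records")

max_temperature_per_year = spark.sql(
    "SELECT year, MAX(temperature) AS max_temperature "
    "FROM records "
    "WHERE temperature != 9999 AND quality IN ('0', '1', '4', '5', '9') "
    "GROUP BY year")

print(max_temperature_per_year.count())
max_temperature_per_year.rdd.saveAsTextFile("hdfs:///user/hduser/weather-sql-output")
```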

Not sure why, but a lot of tasks are getting spawned and it is taking much more time than the previous program that didn't use SQL. Also, max_temperature_per_year.count() returns a proper value, but the output in HDFS has a lot of empty files along with the results spread across some of the files. The total number of files equals the number of tasks spawned in the executor.

Looks like a bug in Spark. If something has been missed in the above code, please let me know in the comments and the code will be updated accordingly.

Maximum temperature for year using Spark/Python

Hadoop – The Definitive Guide revolves around the example of finding the maximum temperature for a particular year from the weather data set. The code for it is here and the data here. Below is the equivalent Spark code implemented in Python.
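
A minimal sketch of such a program, assuming the usual NCDC fixed-width layout (the field offsets and HDFS paths are assumptions):

```python
from pyspark import SparkContext

sc = SparkContext(appName="MaxTemperature")

max_temperature_per_year = (
    sc.textFile("hdfs:///user/hduser/weather")                          # input path is an assumption
      # NCDC fixed-width layout: year at [15:19], temperature at [87:92], quality flag at [92:93]
      .map(lambda line: (line[15:19], int(line[87:92]), line[92:93]))
      .filter(lambda rec: rec[1] != 9999 and rec[2] in "01459")         # drop missing/poor-quality readings
      .map(lambda rec: (rec[0], rec[1]))                                # (year, temperature)
      .reduceByKey(max))

max_temperature_per_year.saveAsTextFile("hdfs:///user/hduser/max-temperature-output")
```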

The map transformation here plays the role of the map function in the MR model, and the reduceByKey transformation plays the role of the reduce function in the MR model.

Not sure why, but the above program spawns two tasks within an executor, and each task generates a separate file in HDFS in Spark standalone mode. The input data is small enough that it could have been easily processed by a single task. Posted a query on StackOverflow and waiting for a response.

In the future blogs, we will see how to perform complex processing using the other transformations and actions provided by Spark.

Hadoop/MR vs Spark/RDD WordCount program

Spark provides an efficient way of solving iterative algorithms by keeping the intermediate data in memory. This avoids the overhead of reading/writing the intermediate data to and from disk, as is the case with MR.

Also, when running the same operation again and again, data can be cached in and fetched from memory without performing the same operation repeatedly. MR is stateless: if a program/application in MR has been executed 10 times, then the whole data set has to be scanned 10 times (see the Spark-vs-MR diagram).

Also, as the name suggests, MR supports only the Map and Reduce operations, and everything else (join, groupBy etc.) has to be fit into the Map and Reduce model, which might not be the most efficient way. Spark supports a number of other transformations and actions besides just Map and Reduce, as mentioned here and here.
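
A minimal sketch of what that caching looks like in PySpark (the input path is an assumption):

```python
from pyspark import SparkContext

sc = SparkContext(appName="CachingExample")

# Cache the parsed words so repeated actions reuse the in-memory copy
# instead of re-reading and re-parsing the input from HDFS each time
words = (sc.textFile("hdfs:///user/hduser/input")   # input path is an assumption
           .flatMap(lambda line: line.split())
           .cache())

print(words.count())             # first action: scans HDFS and populates the cache
print(words.distinct().count())  # later actions work off the cached partitions
```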

Spark code is also compact when compared to MR code. Below is the program for performing WordCount using Python in Spark.
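
A minimal sketch of the WordCount program (the HDFS input/output paths are assumptions):

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs:///user/hduser/input")       # input/output paths are assumptions
            .flatMap(lambda line: line.split())          # split each line into words
            .map(lambda word: (word, 1))                 # (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))            # sum the counts per word

counts.saveAsTextFile("hdfs:///user/hduser/wordcount-output")
```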

Continue reading Hadoop/MR vs Spark/RDD WordCount program