Maximum temperature for year using Spark/Python

Hadoop: The Definitive Guide revolves around the example of finding the maximum temperature for each year in a weather data set. The book's code is here and the data here. Below is the same computation implemented in Python on Spark.
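A minimal sketch of such a program, not necessarily the exact listing from this post, might look like the following. It assumes the fixed-width NCDC record layout used in the book (year at offsets 15-19, signed temperature at 87-92, quality code at 92). The helper `parse_record`, the app name, and the command-line paths are hypothetical names chosen for illustration.

```python
import re
import sys

MISSING = "+9999"                        # NCDC marker for a missing reading
GOOD_QUALITY = re.compile(r"[01459]")    # quality codes treated as valid


def parse_record(line):
    """Return (year, temperature) for a valid NCDC record, else None."""
    record = line.strip()
    year, temp, quality = record[15:19], record[87:92], record[92:93]
    if temp != MISSING and GOOD_QUALITY.match(quality):
        return (year, int(temp))
    return None


def main(input_path, output_path):
    # Imported here so parse_record can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="MaxTemperature")  # assumed app name
    max_temps = (
        sc.textFile(input_path)
        .map(parse_record)                       # line -> (year, temp) or None
        .filter(lambda pair: pair is not None)   # drop invalid records
        .reduceByKey(max)                        # keep the largest temp per year
    )
    max_temps.saveAsTextFile(output_path)
    sc.stop()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed usage: spark-submit max_temperature.py <input> <output>
    main(sys.argv[1], sys.argv[2])
```

The temperatures in this data set are recorded in tenths of a degree Celsius, so a result of 22 means 2.2°C.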

The map transformation is similar to the map function in the MapReduce (MR) model, and the reduceByKey transformation is similar to the reduce function in the MR model.
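To make the analogy concrete, here is a small pure-Python sketch of what reducing by key amounts to. The function `reduce_by_key` and the sample readings are illustrative only and are not part of Spark. One difference worth noting: an MR reducer receives all values for a key at once, whereas `reduceByKey` merges values two at a time with an associative function.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key pairwise with func, like reduceByKey."""
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are merged in.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Illustrative (year, temperature) pairs, as a map step might emit them.
readings = [("1949", 111), ("1950", 0), ("1950", 22), ("1949", 78), ("1950", -11)]
print(reduce_by_key(readings, max))  # {'1949': 111, '1950': 22}
```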

I am not sure why, but the above program spawns two tasks within an executor, and each task generates a separate file in HDFS when run in Spark standalone mode. The input data is small enough to have been processed easily by a single task. I have posted a query on StackOverflow and am waiting for a response.
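One possible explanation, offered as an assumption rather than a confirmed answer: PySpark's `textFile` takes a `minPartitions` argument that defaults to `min(sc.defaultParallelism, 2)`, so even a small splittable file may be read as two partitions, and `saveAsTextFile` writes one part file per partition. The sketch below mirrors that default in a hypothetical helper and shows `coalesce(1)` as one way to end up with a single output file. Both function names are illustrative.

```python
def default_min_partitions(default_parallelism):
    """Mirror PySpark's default when textFile is called without minPartitions."""
    return min(default_parallelism, 2)


def save_as_single_file(rdd, output_path):
    """Collapse an RDD to one partition so only one part file is written."""
    print("partitions before coalesce:", rdd.getNumPartitions())
    rdd.coalesce(1).saveAsTextFile(output_path)
```

Coalescing to one partition is reasonable for a result this small, but it removes parallelism from the write, so it is not a good default for large outputs.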

In future posts, we will see how to perform more complex processing using the other transformations and actions provided by Spark.
