Hadoop: The Definitive Guide revolves around the example of finding the maximum temperature for each year from the NCDC weather data set. The code for the same is here and the data here. Below is the equivalent Spark code implemented in Python.
import re
from pyspark import SparkContext

#Function to extract the year and temperature from a line
#based on position and to filter out the invalid records
def extractData(line):
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if temp != "+9999" and re.match("[01459]", q):
        return [(year, temp)]
    return []
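To sanity-check the fixed-width offsets outside Spark, here is a standalone copy of the same parsing logic run against a synthetic 93-character record. The values are made up for illustration; real NCDC lines are longer and carry many more fields.

```python
import re

#Standalone copy of the parser for a quick check without a Spark cluster
def extractData(line):
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if temp != "+9999" and re.match("[01459]", q):
        return [(year, temp)]
    return []

#Synthetic record (hypothetical values): filler, year at 15:19,
#filler, temperature at 87:92, quality flag at position 92
record = "0" * 15 + "1950" + "0" * 68 + "+0022" + "1"

print(extractData(record))                 # [('1950', '+0022')]
print(extractData(record[:92] + "2"))      # [] - bad quality flag
```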
logFile = "hdfs://localhost:9000/user/bigdatavm/input"
#Create Spark Context with the master details and the application name
sc = SparkContext("spark://bigdata-vm:7077", "max_temperature")
#Create an RDD from the input data in HDFS
weatherData = sc.textFile(logFile)
#Transform the data to extract/filter and then find the max temperature
max_temperature_per_year = weatherData.flatMap(extractData).reduceByKey(lambda a,b : a if int(a) > int(b) else b)
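To see what the reduceByKey step computes, here is a Spark-free sketch of the same max-by-key reduction over a handful of (year, temperature) pairs such as extractData would emit. The sample values are hypothetical, and reduce_by_key is a toy stand-in for Spark's distributed implementation.

```python
#Toy, single-machine stand-in for RDD.reduceByKey: fold values per key
def reduce_by_key(pairs, f):
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

#Hypothetical (year, temperature) pairs
pairs = [("1949", "+0111"), ("1950", "-0011"),
         ("1949", "+0078"), ("1950", "+0022")]

#Same lambda as in the Spark job: keep the numerically larger reading
max_per_year = reduce_by_key(pairs, lambda a, b: a if int(a) > int(b) else b)
print(max_per_year)  # {'1949': '+0111', '1950': '+0022'}
```

Note that the comparison converts to int so that signed, zero-padded strings like "-0011" are ordered numerically rather than lexicographically.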
#Save the RDD back into HDFS (output path assumed to parallel the input path)
max_temperature_per_year.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")
Not sure why, but the above program spawns two tasks within an executor, and each task generates a separate file in HDFS in Spark standalone mode. The input data is small and could easily have been processed by a single task. I have posted a query on StackOverflow and am waiting for a response.
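One likely explanation: sc.textFile reads the file into a minimum of two partitions by default, and each partition becomes a task that writes its own part file. If a single output file is wanted, the RDD can be coalesced to one partition before saving. A sketch under that assumption (not verified against the cluster above):

```python
from pyspark import SparkContext

sc = SparkContext("spark://bigdata-vm:7077", "max_temperature")
weatherData = sc.textFile("hdfs://localhost:9000/user/bigdatavm/input")

#coalesce(1) merges the partitions so a single task writes one part file;
#alternatively, pass an explicit partition count to textFile
max_temperature_per_year = (weatherData
    .flatMap(extractData)  # extractData as defined above
    .reduceByKey(lambda a, b: a if int(a) > int(b) else b)
    .coalesce(1))

max_temperature_per_year.saveAsTextFile(
    "hdfs://localhost:9000/user/bigdatavm/output")
```

coalesce avoids a full shuffle when reducing the partition count, which is cheap here given the tiny input; for large outputs a single writer task can become a bottleneck.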