Spark Accumulators vs Hadoop Counters

In the previous blog, we looked at how to find the maximum temperature for a particular year using Spark/Python. Now, we will extend that program to count the number of valid and invalid records in the weather data set using accumulators. Spark accumulators are similar to Hadoop counters and can be used to count the number of times an event happens during job execution. A sketch of the extended program follows.
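Here is a minimal sketch of that extension, assuming the fixed-width NCDC record layout commonly used with this weather data set (year at columns 15-19, signed temperature at columns 87-92, quality code at column 92). The input and output paths, accumulator names, and the parse helper are illustrative, not taken from the original program.

```python
from pyspark import SparkContext

sc = SparkContext(appName="MaxTemperatureWithAccumulators")

# Accumulators for tracking record quality; the names are illustrative.
valid_records = sc.accumulator(0)
invalid_records = sc.accumulator(0)

MISSING = 9999  # sentinel used for missing temperatures in the NCDC data

def parse(line):
    """Parse one fixed-width weather record (assumed NCDC layout)."""
    try:
        year = line[15:19]
        temperature = int(line[87:92])   # signed value, e.g. "+0011" or "-0022"
        quality = line[92]
        if temperature != MISSING and quality in "01459":
            valid_records.add(1)
            return [(year, temperature)]
    except (ValueError, IndexError):
        pass  # malformed or truncated line falls through to the invalid count
    invalid_records.add(1)
    return []

records = sc.textFile("hdfs:///weather/input")        # path is an assumption
max_temps = records.flatMap(parse).reduceByKey(max)   # max temperature per year

max_temps.saveAsTextFile("hdfs:///weather/output")    # action forces evaluation

print("Valid records:   %d" % valid_records.value)
print("Invalid records: %d" % invalid_records.value)
```

Note that accumulator values are only dependable on the driver after an action has run; here saveAsTextFile forces the evaluation. Also, since these accumulators are updated inside flatMap, a transformation, re-executed tasks can inflate them, which is exactly the caveat discussed next.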

During a sunny-day run, each RDD record is processed exactly once. During a failure, however, the same RDD record might be processed multiple times, and the accumulator would then be updated more than once for the same record, leading to an incorrect value. Hadoop has the same exposure: a block of data might be processed multiple times, either due to a failure or due to speculative execution, and the counter could be incremented repeatedly for the same record. Both frameworks address this, at least partially. Spark guarantees that accumulator updates performed inside actions (such as foreach) are applied exactly once per task, even if the task is restarted, while updates made inside transformations (such as map or flatMap) may be applied more than once when tasks are re-executed. Hadoop discards the counters of failed and killed task attempts, including the losing side of a speculative run, and aggregates only the counters of the attempt that succeeds.
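The Spark distinction can be seen in a small, self-contained sketch; the accumulator names and data are illustrative, and a local run is assumed:

```python
from pyspark import SparkContext

sc = SparkContext(appName="AccumulatorSemantics")

in_transform = sc.accumulator(0)
in_action = sc.accumulator(0)

rdd = sc.parallelize(range(1000))

def tag(x):
    # Updated inside a transformation: if this task is re-executed
    # (failure or speculative run), the add may be applied again.
    in_transform.add(1)
    return x

rdd.map(tag).count()  # action that triggers the transformation

# Updated inside an action: Spark applies each task's update exactly once,
# even if the task is restarted.
rdd.foreach(lambda x: in_action.add(1))

print("transformation count:", in_transform.value)  # may exceed 1000 on retries
print("action count:", in_action.value)             # always 1000
```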
