Tag Archives: sql

Maximum temperature for year using Spark SQL

In the previous blog, we looked at how find out the maximum temperature of each year from the weather dataset. Below is the code for the same using Spark SQL which is a layer on top of Spark. SQL on Spark was supported using Shark which is being replaced by Spark SQL. Here is a nice blog from DataBricks on the future of SQL on Spark.

Not sure why but a lot of tasks are getting spawned and it is taking much more time than the previous program without using SQL. Also, max_temperature_per_year.count() is returning a proper value, but the output in HDFS has a lot of empty files along with the results in the some of the files. The total number of files equals to the number of tasks spawned in the executor.

Looks like a bug in Spark. If something has been missed in the above code, please let know in the comments and the code will be updated accordingly.