Finding the top 3 movies using Spark/Python

Given the movie and a movie rating data in the below format and the scaled down sample data sets for the movierating and movie. The Spark program aggregates the movie ratings and gives the top 3 movies by name. For this to happen aggregation had to happen on the movierating data and a join has to be done with the movie data.

Note that  MapReduce model doesn’t support the join primitive, so join has to be forcefully fitted in the map and reduce primitives. But, Spark supports join and other high level primitives which make it easier to write Spark programs.

movie
id (pk) | name | year

movierating
userid | movieid (fk) | rating

The Spark program is a bit complex when compared to the previous programs mentioned in this blog, but can be easily understood by going through the comments and the Spark Python API documentation for the same.

Finally, here is the output from the above Spark program in HDFS. movie-db-output

6 thoughts on “Finding the top 3 movies using Spark/Python

    1. I haven’t tried it out and not familiar with the image/pdf/document processing API. But, guess a sequence file with K as the file name and V as image/pdf/document has to be created and the same has to be fed to the Spark program for processing. It shouldn’t be that difficult. Take a shot at the code and get back for any clarifications.

Leave a Reply

Your email address will not be published. Required fields are marked *