Apache Spark MOOC

Databricks has announced MOOC around Apache Spark. More details here. The first one Introduction to Big Data with Apache Spark starts on 23rd Feb, 2015 and the second one Scalable Machine Learning starts on 14th April, 2015 which is a long time for now. For those who are interested in Apache Spark, would strongly recommend to enroll to the courses and get notified before the course starts.

Moving from MapReduce to Apache Spark

MapReduce provides only map and reduce primitives. Rest of the things (groupby, sort, top, join, filter etc) have to be force fit into the map and reduce primitives. But, Spark provides a lot of other primitives as methods on RDD (1, 2, 3). So, it’s easy to write programs in Spark than using MapReduce. Also, Spark programs are a bit faster because Spark keeps the data in the memory and spills it to the disk when required between each iteration. In case of MapReduce, the data has to be written to the disk between iterations which make it slow.

The common denominator to MapReduce and Spark are the map and the reduce primitives. But, there are some differences between the map and the reduce primitives in MapReduce and Spark. For those familiar with the MapReduce model and would like to move to Spark, here is an nice blog entry from Sean Owen (Cloudera). As usual reading the article is something to start with, but the actual practice is one which will make things more clear.

Apache Spark cluster on a single machine

In the previous blogs we looked at how to get started with Spark on a single machine and to execute a simple Spark program from iPython. In the current blog. we will look into how to create a small cluster (group of machines) on a single machine as shown below and execute a simple word count program with the data in HDFS. For those who are curious here is a screencast on starting the cluster and running a Spark program on the same.
spark-cluster-wordcount-python 1) The first step is to setup Hadoop in multi node mode as mentioned here and here using virtualization (1, 2).

2) Download and extract the Spark on both the host and slaves.

3) Change the Spark conf/slaves file on the host machine to reflect the host names of the slaves.

4) Start HDFS/Spark and put some data in HDFS.

5) Execute the Spark Python program for WordCount as shown in the screencast.

The good thing about the entire setup is that it can work while offline, there is no need for an internet connection. Oracle VirtualBox host only networking can be used for the communication between the host and the guest OS.

The entire setup process takes quite some time and for those who are doing it for the first time, it’s be a bit of challenge. But, the main intension is to show that it’s possible to create a small Hadoop/Spark cluster in a single machine.

WordCount with Spark and IPython Notebook

In the previous blog, we looked at how to install Apache Spark. In this blog, we will look at how to run Spark Python programs in an interactive way using IPython. For those who are curious, here is a screencast on the same.

IPython is a web based interactive environment for executing code snippets, plotting graphs, collaborating with others and a lot of nice cool things.  IPython can be installed as a stand alone or as a part of Anaconda which includes IPython and libraries like Pandas, Numpy which we will try to explore later. So, here are the steps.

1) Download and install Anaconda. Anaconda is not part of Ubuntu repository, so it has to be installed manually. Also, Anaconda has to be updated manually.

2) Edit the .bashrc file to add Anaconda to the path and to specify the IPython options related to Spark.

3) Go to the Spark installation folder and start pyspark

4) A browser will be launched where in a notebook can be created and a Spark wordcount program can be executed in an interactive way as shown in this screencast.

Using Hadoop Streaming programs with Spark Pipes

Hadoop provides bindings(API) for only Java. To implement MapReduce programs in other languages Hadoop Streaming can be used. Any language which can R/W to standard streams can be programmed against Hadoop. It’s good that we can write MapReduce programs in non-Java languages based on our requirement, but the catch is that there is a bit of an overhead of launching an additional process for each block of data and also due to the inter process communication.

In any case, the non-Java program reads data from the Standard Input, does processing of the data and finally writes to the Standard Output. Spark also has a similar concept of integrating external programs which R/W to STDIO. So, any streaming program which has been developed for Hadoop/MapReduce can be easily used with minimal or no changes with Spark by using the pyspark.rdd.RDD.pipe() API in case of Python.hadoop spark streamingThe max_temperature_map.rb program from the Hadoop – The Definitive Guide book can be integrated with the Spark program to find the maximum temperature for an year as shown in the below code snippet. The max_temperature_map.rb takes a line from the RDD as an input, extract the different fields of interest and then only return the valid records to create a new RDD.


Big Data frameworks provide bindings mostly for Java, since most of them are developed in Java. Integration of Big Data frameworks with multiple languages is really important.  Writing extensions in other languages is not possible or is a bit of pain in most of the cases. All Hadoop Streaming programs can be integrated with Spark with minimal or no changes.