
Apache Spark cluster on a single machine

In the previous blogs we looked at how to get started with Spark on a single machine and how to execute a simple Spark program from IPython. In the current blog, we will look at how to create a small cluster (a group of machines) on a single machine as shown below, and execute a simple word count program on data stored in HDFS. For those who are curious, here is a screencast on starting the cluster and running a Spark program on it.
1) The first step is to set up Hadoop in multi-node mode as mentioned here and here using virtualization (1, 2).

2) Download and extract Spark on both the host and the slaves.

3) Change the Spark conf/slaves file on the host machine to list the hostnames of the slaves.
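For example, if the slave VMs are named slave1 and slave2 (these hostnames are assumptions; use whatever names you gave the guests in /etc/hosts), conf/slaves simply lists one hostname per line:

```
slave1
slave2
```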

4) Start HDFS/Spark and put some data in HDFS.
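The exact commands depend on the environment, but with Hadoop's and Spark's script directories in place they typically look like the following sketch (the user name, file name, and HDFS paths are all assumptions):

```shell
# on the host: start the HDFS daemons
start-dfs.sh

# start the Spark master and the workers listed in conf/slaves
$SPARK_HOME/sbin/start-all.sh

# put some text data into HDFS for the word count
hdfs dfs -mkdir -p /user/hduser/input
hdfs dfs -put words.txt /user/hduser/input/
```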

5) Execute the Spark Python program for WordCount as shown in the screencast.
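The word count program itself is shown only in the screencast, but the computation it performs is the classic flatMap/map/reduceByKey pipeline. As a sketch of what that pipeline computes, here is the same logic in plain Python (no cluster needed); Spark's version just distributes these steps across the workers:

```python
from collections import Counter

def word_count(lines):
    """Same result as Spark's flatMap/map/reduceByKey word count,
    computed locally instead of across a cluster."""
    # flatMap: split every line into words
    words = [w for line in lines for w in line.split()]
    # map to (word, 1) pairs, then reduceByKey with addition == counting
    return Counter(words)

counts = word_count(["to be or not to be", "to be"])
print(counts["to"])  # 3
print(counts["be"])  # 3
```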

The good thing about the entire setup is that it works offline; there is no need for an internet connection. Oracle VirtualBox host-only networking can be used for the communication between the host and the guest OSes.

The entire setup process takes quite some time, and for those who are doing it for the first time, it can be a bit of a challenge. But the main intention is to show that it’s possible to create a small Hadoop/Spark cluster on a single machine.

WordCount with Spark and IPython Notebook

In the previous blog, we looked at how to install Apache Spark. In this blog, we will look at how to run Spark Python programs in an interactive way using IPython. For those who are curious, here is a screencast on the same.

IPython Notebook is a web-based interactive environment for executing code snippets, plotting graphs, collaborating with others and a lot of other nice things. IPython can be installed standalone or as part of Anaconda, which includes IPython along with libraries like Pandas and NumPy that we will try to explore later. So, here are the steps.

1) Download and install Anaconda. Anaconda is not part of the Ubuntu repositories, so it has to be installed manually. It also has to be updated manually.

2) Edit the .bashrc file to add Anaconda to the path and to specify the IPython options related to Spark.
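In Spark 1.x, bin/pyspark honors the IPYTHON and IPYTHON_OPTS environment variables, so the .bashrc additions look roughly like this (the Anaconda install path is an assumption; adjust it to where Anaconda was installed):

```shell
# put Anaconda's python/ipython ahead of the system ones
export PATH=$HOME/anaconda/bin:$PATH

# make bin/pyspark launch the IPython Notebook instead of the plain REPL
export IPYTHON=1
export IPYTHON_OPTS="notebook"
```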

3) Go to the Spark installation folder and start pyspark.
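Assuming Spark was extracted into the home directory (the folder name below is a placeholder for whatever version was downloaded):

```shell
cd ~/spark      # the extracted Spark folder
./bin/pyspark
```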

4) A browser will be launched, wherein a notebook can be created and a Spark word count program can be executed interactively, as shown in this screencast.
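Inside the notebook, pyspark has already created a SparkContext named sc, so a word count cell is just the RDD chain below. Note that this snippet only runs inside the pyspark-launched notebook, not as a standalone script, and the HDFS URL and path are assumptions for this setup:

```python
# sc is the SparkContext created by bin/pyspark
lines = sc.textFile("hdfs://localhost:9000/user/hduser/input/words.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.collect()
```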