In the previous blogs we looked at how to get started with Spark on a single machine and how to execute a simple Spark program from IPython. In this blog, we will look at how to create a small cluster (a group of machines) on a single machine, as shown below, and execute a simple word count program on data stored in HDFS. For those who are curious, here is a screencast on starting the cluster and running a Spark program on it.
1) The first step is to set up Hadoop in multi-node mode, as mentioned here and here, using virtualization (1, 2).
2) Download and extract Spark on both the host and the slaves.
3) Edit the conf/slaves file in the Spark installation on the host machine to list the hostnames of the slaves.
4) Start HDFS/Spark and put some data into HDFS.
5) Execute the Spark Python program for WordCount as shown in the screencast.
The good thing about the entire setup is that it works completely offline; no internet connection is needed. Oracle VirtualBox host-only networking can be used for communication between the host and the guest OSes.
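With host-only networking, the host and the guests share a private subnet (VirtualBox defaults to 192.168.56.0/24), so name resolution can be pinned by adding entries to /etc/hosts on every machine. The hostnames and addresses below are purely illustrative:

```
192.168.56.1    master
192.168.56.101  slave1
192.168.56.102  slave2
```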
The entire setup process takes quite some time, and for those doing it for the first time it can be a bit of a challenge. But the main intention is to show that it is possible to create a small Hadoop/Spark cluster on a single machine.