WordCount with Spark and IPython Notebook

In the previous blog, we looked at how to install Apache Spark. In this blog, we will look at how to run Spark Python programs interactively using IPython. For those who are curious, here is a screencast covering the same steps.

The IPython Notebook is a web-based interactive environment for executing code snippets, plotting graphs, collaborating with others, and a lot of other nice things. IPython can be installed standalone or as part of Anaconda, which bundles IPython with libraries like Pandas and NumPy that we will try to explore later. So, here are the steps.

1) Download and install Anaconda. Anaconda is not part of the Ubuntu repositories, so it has to be installed manually, and it also has to be updated manually.

2) Edit the .bashrc file to add Anaconda to the PATH and to set the IPython-related options for Spark.
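As a sketch, the .bashrc additions might look like the following. The Anaconda install location (`~/anaconda`) is an assumption; adjust it to match your installation. The two `PYSPARK_DRIVER_PYTHON*` variables tell pyspark to use IPython as its driver and to open a notebook on startup (older Spark releases used `IPYTHON=1` and `IPYTHON_OPTS="notebook"` for the same purpose).

```shell
# Add Anaconda to the PATH (assumes Anaconda was installed in ~/anaconda).
export PATH="$HOME/anaconda/bin:$PATH"

# Have pyspark launch IPython as the driver, in notebook mode.
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
```

Remember to run `source ~/.bashrc` (or open a new terminal) for the changes to take effect.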

3) Go to the Spark installation folder and start pyspark.
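For example, assuming Spark was unpacked under your home directory (the exact folder name depends on the version you downloaded):

```shell
# Change into the Spark installation folder (path is an assumption).
cd ~/spark
# Launch pyspark; with the .bashrc settings above, this starts the IPython Notebook.
./bin/pyspark
```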

4) A browser window will open, in which a notebook can be created and a Spark word-count program executed interactively, as shown in the screencast.

