Category Archives: Big Data

Integrating Apache HBase (Big Data) with Arduino Uno (IOT)

For some time I have been blogging about the intersection of Big Data and IOT. The combination of Cloud, Big Data and IOT opens up a lot of opportunities. In this blog we will look at how to integrate Apache HBase with Arduino Uno. HBase is a columnar NoSQL database and has some nice documentation here.

Adding Ethernet Shield to Arduino Uno

Arduino Uno by itself doesn’t come with a facility to connect to the internet, so an Ethernet Shield has to be added on top of the Arduino Uno board. Here we see the Arduino Uno connected to the USB cable on the right and the Ethernet Shield connected to the RJ45 network jack on the left. The Ethernet Shield has to be mounted on top of the Arduino Uno to gain the network capabilities.

Shields are expansion boards that provide additional functionality to the Arduino board. More than one shield can be stacked on top of another. Many shields are available in the market, and a custom shield can also be built using the Arduino Protoshield.

In the coming blogs, we will look at the Arduino Motor Shield and build some cool things using it.

Starting the HBase and the REST service

HBase provides different interfaces (Java, REST, Phoenix, Protobuf) for the DDL/DML operations. In this tutorial the REST interface will be used. After starting HBase, the REST server, which is part of the HBase installation, has to be started. More about the HBase REST interface here and 1, 2, 3.

The REST interface runs on port 8080 and uses HTTP verbs like POST, PUT and GET, and it can be invoked from the browser as shown below. The schema for the table mytable is displayed in the browser here. Because REST is language agnostic, applications can be built on top of HBase using any language.

[Screenshot: the REST call made from the browser, showing the schema of mytable]
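
As a quick illustration of the language agnostic nature of the interface, below is a minimal Python sketch of the same schema call made with the requests library; the host name and port are assumptions and should be replaced with those of the machine running the HBase REST server.

# A minimal sketch of the schema GET made from Python instead of the browser.
# The host name/port below are assumptions for illustration.
import requests

HBASE_REST = "http://192.168.1.10:8080"  # hypothetical address of the HBase REST server

# Ask the REST server for the schema of 'mytable' as JSON (XML is also supported)
response = requests.get("%s/mytable/schema" % HBASE_REST,
                        headers={"Accept": "application/json"})
print(response.status_code)
print(response.json())
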
Creating a table in HBase and querying it in Arduino Uno

Now an HBase table has to be created from the HBase shell using the command create ‘mytable’, ‘metrics’. Once the table has been created, it can be accessed from the Arduino Uno. Make sure that the Arduino Uno and HBase are on the same LAN.

[Screenshot: creating the table from the HBase shell]

Now that the table has been created, the Arduino Sketch has to be executed to make a REST call. Here (GET, POST) are the complete Arduino Sketches. The HBase-Arduino-GET.ino Sketch gets the schema for a table in HBase and the HBase-Arduino-POST.ino Sketch inserts a row into HBase. Note that the code is not optimized, but it is good enough to convey the basic concepts and to get started.

The screenshot below shows the Arduino IDE used for developing/compiling/deploying the Arduino Sketches. At the bottom right of the same screen is the Arduino Serial Monitor, which shows the output of the HBase-Arduino-GET.ino Sketch. Note that the Sketch does a simple HTTP GET operation to get the schema for a particular table.

[Screenshot: Arduino IDE with the Serial Monitor showing the output of the GET Sketch]

Similarly, HTTP POST/PUT can be used to put some data into HBase. The HBase REST interface expects the row key, column and value as Base64, so the input data has to be converted into Base64 before inserting it. The arduino-base64 library can be used to encode/decode Base64 on the Arduino.
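
For reference, here is a minimal Python sketch of the same insert done from a regular machine, which shows the Base64 encoding and the JSON payload the REST server expects; the host name, row key, column and value are illustrative.

import base64
import json
import requests

HBASE_REST = "http://192.168.1.10:8080"  # hypothetical address of the HBase REST server

def b64(text):
    # The REST interface expects row keys, columns and cell values as Base64 strings
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

# One row with a single cell in the 'metrics' column family created earlier
payload = {"Row": [{"key": b64("row1"),
                    "Cell": [{"column": b64("metrics:temperature"),
                              "$": b64("23.5")}]}]}

response = requests.put("%s/mytable/row1" % HBASE_REST,
                        data=json.dumps(payload),
                        headers={"Content-Type": "application/json"})
print(response.status_code)  # 200 indicates the row was stored
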

Conclusion

In this blog we have seen how to integrate the Arduino Uno directly with the HBase REST Server, which is a somewhat restrictive approach because the Arduino Uno can’t process much and there aren’t many libraries for Arduino Sketches when compared to languages like Python. The Arduino Uno can also be interfaced with an intermediate (through pySerial or Processing), and the intermediate communicates with the HBase REST Server. We will look into the pySerial and the Processing approach in a future blog.

[Diagram: the Arduino Uno to HBase interfacing options]

Integrating Big Data with IOT directly was fun, but it was not straightforward. Big Data and IOT exist in their own spaces; maybe down the line the different m2m platform vendors will provide some sort of connectors to the Big Data space to make the connection between the Big Data and the IOT worlds a bit smoother.
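
To give a flavour of the pySerial intermediate mentioned above, here is a minimal sketch that reads one value per line from the Arduino over the serial port and forwards it to the HBase REST server; the serial port name, host name, table and column are assumptions for illustration.

import base64
import json
import time

import requests
import serial  # pySerial

HBASE_REST = "http://192.168.1.10:8080"  # hypothetical address of the HBase REST server

def b64(text):
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

# A typical port name for the Uno on Linux; on Windows it would be something like COM3
arduino = serial.Serial("/dev/ttyACM0", 9600, timeout=5)

while True:
    reading = arduino.readline().decode("ascii", errors="ignore").strip()
    if not reading:
        continue
    row_key = "reading-%d" % int(time.time())
    payload = {"Row": [{"key": b64(row_key),
                        "Cell": [{"column": b64("metrics:value"),
                                  "$": b64(reading)}]}]}
    requests.put("%s/mytable/%s" % (HBASE_REST, row_key),
                 data=json.dumps(payload),
                 headers={"Content-Type": "application/json"})
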

In the coming blogs we will look at how to get the ambient temperature from a temperature sensor and put it into HBase at regular intervals.

Prerequisites to get started with Big Data

Very often I do get the query ‘I am familiar with X, Y and Z, is it good enough to get started with Big Data?’. This post is to address the same. Here we will look at what is required to get started with Big Data and the rationale behind it.

There is a lot of information on the internet to get started with each one of these and it is easy to get lost. So, references to get started with those technologies are also included.

Linux : Big Data (the Hadoop revolution) started on the Linux OS. Even now most Big Data software is initially developed on Linux and porting to Windows has been an afterthought. Microsoft partnered with Hortonworks to speed up the porting of Big Data software to Windows.

To get started with the latest software around Big Data, knowledge of Linux is a must. The good thing about Linux is that it is free and it opens up a lot of opportunities. I would recommend going through all the tutorials here, except the seventh one.

There are more than 100 different flavors of Linux, and Ubuntu is one of the more popular distributions for those who are new to Linux to get started with.

Java : Most Big Data software is developed in Java. I say most, since there are exceptions: Spark has been developed in Scala, Impala in C/C++ and so on.

To extend Big Data software, knowledge of Java is a must. Also, sometimes the documentation might not be up to the mark and so it might be required to go through the underlying code of the Big Data software to see how something works, or why it is not working the way it is expected to.

For the above mentioned reasons, knowledge of Java is a must. The basics of core Java are enough; knowledge of enterprise Java is not required. Go through the Java Basics section and the Java Object Oriented section here.

Java programs can be developed with something as simple as Notepad, but developing in an IDE like Eclipse makes it a piece of cake. Here is a nice tutorial on Eclipse.

SQL : Not everyone is comfortable with programming in Java and other languages. That’s the reason why SQL abstractions have been introduced on top of the different Big Data frameworks. Those who are from a database background can get started with Big Data very easily because of the SQL abstraction. Hive, Impala and Phoenix are a few examples of such software.

Expertise in SQL is not required; the basics of the DDL and the DML operations are more than enough. Here are some nice tutorials to get started with SQL.

Others : The above mentioned skills are good enough to get started with Big Data. As one gets deeper into Big Data, I would also recommend looking at R, Python and Scala. Each of these languages has its strengths and weaknesses, and depending upon the requirement the appropriate option can be picked to write Big Data programs.

To become good at Big Data, an aspirant needs a good overview of the different technologies, and the above guide mentions what is required and where to start reading about it.

Best of luck !!!

Apache Spark MOOC

Databricks has announced MOOCs around Apache Spark. More details here. The first one, Introduction to Big Data with Apache Spark, starts on 23rd Feb, 2015 and the second one, Scalable Machine Learning, starts on 14th April, 2015, which is still some time away. For those who are interested in Apache Spark, I would strongly recommend enrolling in the courses now to get notified before they start.

Moving from MapReduce to Apache Spark

MapReduce provides only the map and reduce primitives. Everything else (groupBy, sort, top, join, filter etc.) has to be force fit into the map and reduce primitives. Spark, on the other hand, provides a lot of other primitives as methods on the RDD (1, 2, 3), so it’s easier to write programs in Spark than in MapReduce. Spark programs are also faster because Spark keeps the data in memory between iterations and spills it to disk only when required, whereas in MapReduce the data has to be written to disk between iterations, which makes it slow.
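
As a small illustration, here is a minimal PySpark sketch of a word count that also uses some of the extra primitives directly on the RDD; the input path is an assumption.

from pyspark import SparkContext

sc = SparkContext(appName="PrimitivesDemo")

lines = sc.textFile("hdfs:///data/input.txt")  # hypothetical input path
words = lines.flatMap(lambda line: line.split())

# The word count itself still uses the map and reduce style primitives
counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# filter, sortBy and top are single method calls on the RDD instead of
# having to be force fit into additional map/reduce jobs
frequent = counts.filter(lambda kv: kv[1] > 10)
sorted_counts = counts.sortBy(lambda kv: kv[1], ascending=False)
top_ten = counts.top(10, key=lambda kv: kv[1])

print(top_ten)
sc.stop()
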

The common denominator between MapReduce and Spark is the map and reduce primitives, but there are some differences in how those primitives behave in the two frameworks. For those who are familiar with the MapReduce model and would like to move to Spark, here is a nice blog entry from Sean Owen (Cloudera). As usual, reading the article is a good start, but actual practice is what will make things clearer.

Apache Spark cluster on a single machine

In the previous blogs we looked at how to get started with Spark on a single machine and how to execute a simple Spark program from iPython. In the current blog, we will look at how to create a small cluster (a group of machines) on a single machine as shown below and execute a simple word count program with the data in HDFS. For those who are curious, here is a screencast on starting the cluster and running a Spark program on it.
[Diagram: Spark cluster word count with Python]

1) The first step is to set up Hadoop in multi node mode as mentioned here and here using virtualization (1, 2).

2) Download and extract Spark on both the host and the slaves.

3) Change the Spark conf/slaves file on the host machine to reflect the host names of the slaves.

4) Start HDFS/Spark and put some data in HDFS.

5) Execute the Spark Python program for WordCount as shown in the screencast (a minimal sketch of the program is shown below).
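
Here is a minimal sketch of such a word count program, assuming the standalone master runs on a host named master and that the HDFS paths below are the ones used in step 4; adjust both to match the actual setup.

# wordcount.py - run against the standalone cluster started above
from pyspark import SparkContext

# spark://master:7077 is the default standalone master URL (host name assumed)
sc = SparkContext("spark://master:7077", "WordCount")

lines = sc.textFile("hdfs://master:9000/user/input/data.txt")  # data put into HDFS in step 4
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs://master:9000/user/output/wordcount")
sc.stop()
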

The good thing about the entire setup is that it can work offline; there is no need for an internet connection. Oracle VirtualBox host-only networking can be used for the communication between the host and the guest OS.

The entire setup process takes quite some time and, for those who are doing it for the first time, it can be a bit of a challenge. But the main intention is to show that it’s possible to create a small Hadoop/Spark cluster on a single machine.