Using Hadoop Streaming programs with Spark Pipes

Hadoop provides APIs only for Java. To implement MapReduce programs in other languages, Hadoop Streaming can be used: any language that can read from and write to the standard streams can be programmed against Hadoop. It's convenient to be able to write MapReduce programs in non-Java languages when the requirement calls for it, but the catch is the overhead of launching an additional process for each block of data, along with the cost of the inter-process communication. A sketch of the streaming contract follows.
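To make the contract concrete, here is a minimal streaming-style mapper in Python (a sketch of the canonical word-count mapper, not tied to any particular dataset). Hadoop launches one such process per task, feeds the input split to its standard input, and collects whatever it writes to standard output:

```python
#!/usr/bin/env python3
# A minimal Hadoop Streaming mapper: read lines from stdin,
# emit tab-separated key/value pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        # Hadoop Streaming treats everything before the first tab
        # as the key and the rest of the line as the value.
        print(f"{word}\t1")
```

Because the program only touches stdin and stdout, the very same script can later be reused unchanged with Spark's pipe() mechanism.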

In any case, the non-Java program reads data from standard input, processes it, and writes the results to standard output. Spark has a similar mechanism for integrating external programs that read from and write to the standard streams. So any streaming program developed for Hadoop MapReduce can be used with Spark with minimal or no changes, via the pyspark.rdd.RDD.pipe() API in the case of Python.

The max_temperature_map.rb program from the book Hadoop: The Definitive Guide can be integrated with a Spark program to find the maximum temperature for a year, as shown in the sketch below. max_temperature_map.rb takes a line from the RDD as input, extracts the fields of interest, and returns only the valid records, which form a new RDD.
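A minimal PySpark sketch of this integration follows. The script location /path/to/max_temperature_map.rb and the input path hdfs:///data/ncdc/sample.txt are placeholders, and the output parsing assumes the mapper emits "year\ttemperature" lines, as it does in the book:

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="MaxTemperature")

# Ship the Ruby mapper to every executor node (path is an assumption).
sc.addFile("/path/to/max_temperature_map.rb")

records = sc.textFile("hdfs:///data/ncdc/sample.txt")  # assumed input path

# pipe() writes each element of the RDD to the script's stdin and turns
# each line the script prints to stdout into an element of the new RDD.
# The mapper emits "year\ttemperature" only for valid records.
piped = records.pipe("ruby " + SparkFiles.get("max_temperature_map.rb"))

# Parse the tab-separated output and reduce to the maximum per year.
max_temps = (piped
             .map(lambda line: line.split("\t"))
             .map(lambda fields: (fields[0], int(fields[1])))
             .reduceByKey(max))

for year, temp in max_temps.collect():
    print(year, temp)
```

Note that nothing inside max_temperature_map.rb had to change: the same stdin-to-stdout program that served as a Hadoop Streaming mapper now acts as a transformation step in a Spark job.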

Conclusion

Big Data frameworks provide bindings mostly for Java, since most of them are themselves written in Java. Yet integrating these frameworks with other languages really matters, and in most cases writing extensions in another language is painful or outright impossible. Hadoop Streaming programs are a welcome exception: all of them can be integrated with Spark with minimal or no changes.
