Hadoop provides its MapReduce bindings (API) only for Java. To implement MapReduce programs in other languages, Hadoop Streaming can be used: any language that can read from standard input and write to standard output can be programmed against Hadoop. It’s good that we can write MapReduce programs in non-Java languages when a requirement calls for it, but the catch is some overhead, both from launching an additional process for each block of data and from the inter-process communication.
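To make the "read standard input, write standard output" contract concrete, here is a minimal sketch of a streaming-style mapper in Python. It is not taken from any particular program: the fixed-width offsets for year, temperature and quality code, the "+9999" missing-value marker, and the names map_line and run are assumptions chosen for illustration.

```python
import re
import sys

# Fixed-width positions of the fields of interest.
# These offsets are assumed for illustration; check the real record layout.
YEAR = slice(15, 19)
TEMP = slice(87, 92)
QUALITY = slice(92, 93)


def map_line(line):
    """Return "year<TAB>temperature" for a valid record, otherwise None."""
    year, temp, quality = line[YEAR], line[TEMP], line[QUALITY]
    # "+9999" is treated as a missing reading; only some quality codes pass.
    if temp != "+9999" and re.match(r"[01459]", quality):
        return f"{year}\t{temp}"
    return None


def run(lines, write):
    """Apply map_line to every input line and emit only the valid records."""
    for line in lines:
        out = map_line(line.rstrip("\n"))
        if out is not None:
            write(out + "\n")
```

To use this as a streaming script, call run(sys.stdin, sys.stdout.write) at the end of the file. Keeping the per-line logic in a plain function means it can be checked without launching Hadoop or Spark.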
In any case, the non-Java program reads data from standard input, processes it, and finally writes the results to standard output. Spark has a similar concept of integrating external programs that read from and write to the standard streams. So, any streaming program that has been developed for Hadoop/MapReduce can be used with Spark with minimal or no changes by using the pyspark.rdd.RDD.pipe() API in the case of Python.
The max_temperature_map.rb program from the Hadoop – The Definitive Guide book can be integrated with a Spark program to find the maximum temperature for each year, as shown in the code snippet below. The max_temperature_map.rb program takes a line from the RDD as input, extracts the fields of interest, and returns only the valid records, which form a new RDD.
from pyspark import SparkContext
#Create Spark Context with the master details and the application name
sc = SparkContext("spark://bigdata-vm:7077", "max_temperature")
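#Add a file to be downloaded with this Spark job on every node.
#(the script location below is assumed; point it at wherever max_temperature_map.rb actually lives)
sc.addFile("max_temperature_map.rb")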
#Create an RDD from the input data in HDFS
weatherData = sc.textFile("hdfs://localhost:9000/user/bigdatavm/input")
#Transform the data to extract/filter and then find the max temperature
max_temperature_per_year = weatherData.pipe("max_temperature_map.rb").map(lambda x: (x.split("\t")[0], x.split("\t")[1])).reduceByKey(lambda a, b: a if int(a) > int(b) else b).coalesce(1)
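#Save the RDD back into HDFS (the output path below is assumed)
max_temperature_per_year.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")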
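The pipe-then-reduce flow can also be tried without a cluster. The sketch below is a rough local model, not Spark's implementation: it feeds lines to an external command through standard input, collects the command's standard output, and then keeps the largest value per key, much as the reduceByKey step does. The helper names pipe_lines and max_by_key, the sample records, and the inline stand-in filter are all hypothetical. Spark itself runs the external command once per partition rather than once for the whole data set.

```python
import subprocess
import sys


def pipe_lines(lines, command):
    """Send lines to an external command on stdin; return its stdout lines."""
    result = subprocess.run(
        command,
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def max_by_key(pairs):
    """Keep, for each key, the value that is largest when read as an integer."""
    best = {}
    for key, value in pairs:
        if key not in best or int(value) > int(best[key]):
            best[key] = value
    return best


# Stand-in for an external filter: drops records whose second field is "+9999".
# Any executable that reads stdin and writes stdout could take its place.
FILTER = [
    sys.executable,
    "-c",
    "import sys\n"
    "for l in sys.stdin:\n"
    "    if l.rstrip('\\n').split('\\t')[1] != '+9999':\n"
    "        sys.stdout.write(l)\n",
]

# Made-up sample records in "year<TAB>temperature" form.
records = ["1949\t+0111", "1950\t+0022", "1949\t+9999", "1950\t-0011", "1949\t+0078"]

piped = pipe_lines(records, FILTER)                  # like rdd.pipe(...)
pairs = [tuple(line.split("\t")) for line in piped]  # like the map step
result = max_by_key(pairs)                           # like the reduceByKey step
```

Because the external program only sees text on its standard streams, the same script can be exercised this way first and then handed to pipe() unchanged.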
Big Data frameworks mostly provide bindings for Java, since most of them are developed in Java. Being able to integrate them with other languages is really important, yet writing extensions in other languages is either not possible or a bit of a pain in most cases. With pipe(), Hadoop Streaming programs can be integrated with Spark with minimal or no changes.