Persisting RDD in Spark

Spark revolves around the idea of RDDs (Resilient Distributed Datasets). These datasets are distributed, meaning their data is spread across multiple nodes, and resilient, meaning they can be recreated if part of an RDD is lost.

The first RDD (RDD1 below) can be created from an input source such as a file, or from a collection. Different transformations (flatMap, map, reduceByKey, join, sort) can then be applied to it in sequence to achieve the desired result. Each transformation applies a function to the elements of its input RDD to create a new RDD, so both the input and the output of a transformation are RDDs.
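
To make the "RDD in, RDD out" idea concrete, here is a minimal sketch in plain Python. It does not use Spark: the helpers `flat_map`, `map_values` and `reduce_by_key` are hypothetical stand-ins written for this illustration, and ordinary lists play the role of RDDs. Unlike Spark, this toy version runs each step immediately and sorts the reduced output so the result is deterministic.

```python
from itertools import chain

# Hypothetical stand-ins for Spark transformations (NOT the Spark API).
# Each helper takes a "dataset" (a plain list) and returns a NEW list,
# leaving its input untouched.

def flat_map(dataset, f):
    """Apply f to every element and flatten the results into one list."""
    return list(chain.from_iterable(f(x) for x in dataset))

def map_values(dataset, f):
    """Apply f to every element, one output element per input element."""
    return [f(x) for x in dataset]

def reduce_by_key(dataset, f):
    """Combine the values of (key, value) pairs that share a key using f."""
    acc = {}
    for key, value in dataset:
        acc[key] = f(acc[key], value) if key in acc else value
    # Sorted only to keep this toy output deterministic (an assumption of the sketch).
    return sorted(acc.items())

rdd1 = ["a b a", "b c"]                                # created from a collection
rdd2 = flat_map(rdd1, lambda line: line.split())       # lines -> words
rdd3 = map_values(rdd2, lambda word: (word, 1))        # words -> (word, 1) pairs
rdd4 = reduce_by_key(rdd3, lambda x, y: x + y)         # sum the counts per word

print(rdd4)
```

Every step leaves its input unchanged and hands a new dataset to the next step, which is the property that lets transformations be chained one after another.
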

[Figure: spark-without-caching]

Once the transformations (t1, t2, t3, t4) are done, an action (a1, a2) is called on the RDD to return some data to the client/driver or to persist the RDD to a file. The Spark engine allows us to create a DAG like the one above and execute it on a cluster of nodes. In the above diagram there are two jobs (Job 1 and Job 2) in the context of a single application.
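
The relationship between transformations, actions and jobs can be sketched with a small toy model in plain Python. This is not Spark code: `ToyRDD` and its `jobs_run` counter are hypothetical names invented for this illustration. The method names `map`, `filter`, `collect` and `count` mirror real RDD operations, but the behaviour here is only an approximation under the assumption that each action triggers one job. The labels t1, t2, a1 and a2 loosely follow the text and are not a reconstruction of the diagram.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: remembers how to compute itself
    and only does the work when an action is called."""

    jobs_run = 0  # how many actions (jobs) have been executed so far

    def __init__(self, compute):
        # compute is a zero-argument function that produces the elements
        self._compute = compute

    @classmethod
    def from_collection(cls, data):
        items = list(data)
        return cls(lambda: list(items))

    # Transformations: return a new ToyRDD, nothing is computed yet.
    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, predicate):
        return ToyRDD(lambda: [x for x in self._compute() if predicate(x)])

    # Actions: run the whole chain and return data to the caller.
    def collect(self):
        ToyRDD.jobs_run += 1
        return self._compute()

    def count(self):
        ToyRDD.jobs_run += 1
        return len(self._compute())


rdd1 = ToyRDD.from_collection([1, 2, 3, 4])
rdd2 = rdd1.map(lambda x: x * 10)        # t1: only recorded, not executed
rdd3 = rdd2.filter(lambda x: x > 10)     # t2: only recorded, not executed
jobs_before_actions = ToyRDD.jobs_run    # still 0, no action called yet

result_a1 = rdd3.collect()               # a1: first job
result_a2 = rdd3.count()                 # a2: second job

print(result_a1, result_a2, ToyRDD.jobs_run)
```

In this sketch the two actions are counted as two separate jobs inside one program, which mirrors the two jobs in a single application described above.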