Persisting RDDs in Spark

Spark revolves around the idea of RDDs (Resilient Distributed Datasets). These datasets are distributed in nature, spread across multiple nodes, and are also resilient: if any part of an RDD is lost, it can be recreated from its lineage.

The first RDD (RDD1 below) can be created from any input source, like a file or a collection, and then different transformations (flatMap, map, reduceByKey, join, sort) can be applied to it in sequence to achieve the desired result. Each transformation takes an RDD as input and applies a function to it to create a new RDD. So both the input and the output of a transformation are RDDs.
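For instance, such a chain could be sketched in Scala as a word-count style pipeline (the application name, input path, and variable names are illustrative, loosely mirroring the RDD1 → RDD5 chain in the diagram):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative setup; the master URL and input path are placeholders.
val conf = new SparkConf().setAppName("rdd-persist-demo").setMaster("local[*]")
val sc   = new SparkContext(conf)

// RDD1: created from an input source (a file here; sc.parallelize works for collections)
val rdd1 = sc.textFile("input.txt")

// Each transformation takes an RDD and produces a new RDD
val rdd2 = rdd1.flatMap(line => line.split("\\s+")) // t1: split lines into words
val rdd3 = rdd2.map(word => (word, 1))              // t2: pair each word with 1
val rdd4 = rdd3.reduceByKey(_ + _)                  // t3: sum counts per word
val rdd5 = rdd4.sortByKey()                         // t4: sort by word
```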
[Figure: spark-without-caching — a DAG of transformations (t1, t2, t3, t4) and actions (a1, a2) over RDDs]

Once the transformations (t1, t2, t3, t4) are done, an action (a1, a2) is called on the RDD to return data to the client/driver or to persist the RDD to a file. The Spark engine allows us to create such a DAG and execute it on a cluster of nodes. In the above diagram there are two jobs (Job 1 and Job 2) in the context of a single application.
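Continuing the sketch, the two actions below would correspond to a1 and a2, each triggering its own job (the output path is a placeholder):

```scala
// a1 (Job 1): bring the results back to the driver
val counts: Array[(String, Int)] = rdd5.collect()

// a2 (Job 2): write the RDD's contents out as files
rdd5.saveAsTextFile("word-counts-out")
```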

Spark works in a lazy fashion and won’t process an RDD until an action is called. So, in the above diagram, action a1 invokes Job 1 and a2 invokes Job 2. Also, by default Spark recomputes all the RDDs in a job’s lineage each time. So, in the above application, RDD1/RDD2 are computed once for each of Job 1 and Job 2. This is fine for small data sets, but a real performance hit when processing huge ones.
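One way to see this recomputation (still assuming the sketch above): since nothing has been persisted yet, each action below replays the full lineage from the input file. The lineage Spark replays can be printed with toDebugString.

```scala
// Without persistence, every action re-runs the whole chain rdd1..rdd5
val total = rdd5.count()    // Job 1: reads the file, runs t1..t4
val words = rdd5.collect()  // Job 2: reads the file and runs t1..t4 again

// Inspect the lineage Spark replays for each action
println(rdd5.toDebugString)
```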

To overcome this problem, Spark offers RDD persistence, which stores RDDs in memory, on disk, or in a combination of both, with different levels of replication. Let’s assume RDD2 is persisted by Job 1; then RDD2 need not be computed again in the Job 2 flow. Once RDDs are persisted, they can also be unpersisted manually, or they are automatically evicted in an LRU fashion when memory is full.
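A minimal sketch of persisting RDD2 (the storage level is chosen for illustration; MEMORY_ONLY, DISK_ONLY, and replicated variants such as MEMORY_ONLY_2 also exist):

```scala
import org.apache.spark.storage.StorageLevel

// Mark rdd2 to be kept in memory, spilling to disk if it doesn't fit.
// The first action that computes rdd2 materializes the cache.
rdd2.persist(StorageLevel.MEMORY_AND_DISK) // rdd2.cache() is shorthand for MEMORY_ONLY

// ... actions for Job 1 and Job 2 now reuse the cached rdd2 ...

// Release the cached partitions manually when done; otherwise Spark
// evicts them in LRU fashion under memory pressure.
rdd2.unpersist()
```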

Summary

While creating a Spark application, the intermediate RDDs can be persisted so they don’t have to be computed again. Note that a persisted RDD is only available within the context of a single application. To share an RDD across multiple applications, it has to be saved to an external store (like HDFS) and read back in the other application.
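As a sketch (the HDFS URI is a placeholder): one application writes the RDD out, and a second application, with its own SparkContext, reads it back as a new RDD:

```scala
// Application A: save the computed RDD to an external store
rdd4.saveAsTextFile("hdfs://namenode:8020/shared/word-counts")

// Application B: a separate SparkContext loads it as a fresh RDD
val shared = sc.textFile("hdfs://namenode:8020/shared/word-counts")
```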
