Caching and Persistence-
By default, RDDs are recomputed each time you run an action on them.
This can be expensive (in time) if you need to use a dataset more than once.
Spark allows you to control what is cached in memory.
[code lang="scala"]val logs: RDD[String] = sc.textFile("/log.txt")
val logsWithErrors = logs.filter(_.contains("ERROR")).persist()
val firstnrecords = logsWithErrors.take(10)[/code]
Here, we cache logsWithErrors in memory.
After firstnrecords is computed for the first time, Spark stores the contents of logsWithErrors in memory, so future operations that reuse it can run faster.
[code lang="scala"]val numErrors = logsWithErrors.count() // faster result[/code]
Now, computing the count on logsWithErrors is much faster.
There are many ways to configure how your data is persisted.
It is possible to persist a data set:
- in memory as regular Java objects
- on disk as regular Java objects
- in memory as serialized Java objects (more compact)
- on disk as serialized Java objects (more compact)
- both in memory and on disk (spill over to disk to avoid re-computation)
cache-
Shorthand for using the default storage level, which is in memory only as regular Java objects.
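As a minimal sketch (the lines RDD and path are made up for illustration), the equivalence can be seen by checking the storage level that cache() assigns:
[code lang="scala"]import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("/log.txt")   // illustrative RDD

// cache() is just shorthand for persist(StorageLevel.MEMORY_ONLY)
lines.cache()
assert(lines.getStorageLevel == StorageLevel.MEMORY_ONLY)[/code]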
persist-
Persistence can be customized with this method. Pass the storage level you’d like as a parameter to persist.
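For example (a sketch only; the path and variable names are illustrative), a level from org.apache.spark.storage.StorageLevel can be passed to persist:
[code lang="scala"]import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("/log.txt")              // illustrative path
val errors = logs.filter(_.contains("ERROR"))

// keep serialized copies in memory, spilling to disk when memory runs out
errors.persist(StorageLevel.MEMORY_AND_DISK_SER)

val numErrors = errors.count()   // the first action materializes and persists the data[/code]
The table below summarizes the trade-offs between the storage levels.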
Level | Space Used | CPU Time | In memory | On disk |
---|---|---|---|---|
MEMORY_ONLY | High | Low | Y | N |
MEMORY_ONLY_SER | Low | High | Y | N |
MEMORY_AND_DISK | High | Medium | Some | Some |
MEMORY_AND_DISK_SER | Low | High | Some | Some |
DISK_ONLY | Low | High | N | Y |