
Caching and Persistence – Apache Spark


Caching and Persistence-
By default, RDDs are recomputed each time you run an action on them.
This can be expensive (in time) if you need to use a dataset more than once.

Spark allows you to control what is cached in memory.

[code lang="scala"]import org.apache.spark.rdd.RDD

val logs: RDD[String] = sc.textFile("/log.txt")
val logsWithErrors = logs.filter(_.contains("ERROR")).persist()
val firstnrecords = logsWithErrors.take(10)[/code]

Here, we mark logsWithErrors to be cached in memory.
After firstnrecords is computed (the first action on the RDD), Spark stores the contents of logsWithErrors in memory for faster access in future operations, should we want to reuse it.

[code lang="scala"]val numErrors = logsWithErrors.count() // faster result[/code]

Now, computing the count on logsWithErrors is much faster.
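
If the dataset will not be reused again, the cached copy can be released explicitly. A minimal sketch, continuing the example above:

[code lang="scala"]// further actions on logsWithErrors reuse the in-memory copy
val distinctErrors = logsWithErrors.distinct().count()

// release the cached partitions once the dataset is no longer needed
logsWithErrors.unpersist()[/code]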

There are many ways to configure how your data is persisted.
It is possible to persist a dataset (see the sketch after this list):
- in memory as regular Java objects
- on disk as regular Java objects
- in memory as serialized Java objects (more compact)
- on disk as serialized Java objects (more compact)
- both in memory and on disk (spill over to disk to avoid re-computation)
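
These options correspond to constants on org.apache.spark.storage.StorageLevel. A quick sketch; the names match the table further below:

[code lang="scala"]import org.apache.spark.storage.StorageLevel

StorageLevel.MEMORY_ONLY          // in memory as regular Java objects (the default)
StorageLevel.MEMORY_ONLY_SER      // in memory as serialized Java objects (more compact)
StorageLevel.DISK_ONLY            // on disk
StorageLevel.MEMORY_AND_DISK      // in memory, spilling to disk when memory is full
StorageLevel.MEMORY_AND_DISK_SER  // same, but kept serialized in memory[/code]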

cache-
Shorthand for using the default storage level, which is in memory only as regular Java objects.

persist-
Persistence can be customized with this method. Pass the storage level you’d like as a parameter to persist.
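
For example, a minimal sketch reusing the logs RDD from above (note that an RDD's storage level can only be set once):

[code lang="scala"]import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist() with the default MEMORY_ONLY level
val cached = logs.filter(_.contains("ERROR")).cache()

// persist(level) lets you pick the storage level explicitly
val onDiskToo = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_AND_DISK)[/code]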

Level                 Space Used   CPU Time   In memory   On disk
MEMORY_ONLY           High         Low        Y           N
MEMORY_ONLY_SER       Low          High       Y           N
MEMORY_AND_DISK       High         Medium     Some        Some
MEMORY_AND_DISK_SER   Low          High       Some        Some
DISK_ONLY             Low          High       N           Y


