Caching and Persistence-
By default, RDDs are recomputed each time you run an action on them.
This can be expensive (in time) if you need to use a dataset more than once.
Spark allows you to control what is cached in memory.
[code lang="scala"]val logs: RDD[String] = sc.textFile("/log.txt")
val logsWithErrors = logs.filter(_.contains("ERROR")).persist()
val firstnrecords = logsWithErrors.take(10)[/code]
Here, we cache logsWithErrors in memory.
After firstnrecords is computed for the first time, Spark stores the contents of logsWithErrors in memory, so future operations that reuse it can run faster.
[code lang="scala"]val numErrors = logsWithErrors.count() // faster result[/code]
Now, computing the count on logsWithErrors is much faster.
There are many ways to configure how your data is persisted.
It is possible to persist a data set:
- in memory as regular Java objects
- on disk as regular Java objects
- in memory as serialized Java objects (more compact)
- on disk as serialized Java objects (more compact)
- both in memory and on disk (spill over to disk to avoid re-computation)
cache-
Shorthand for using the default storage level, which is in memory only as regular Java objects.
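As a minimal sketch (the lines RDD and path are made up for illustration), the equivalence can be seen by checking the storage level that cache() assigns:
[code lang="scala"]import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("/log.txt")   // illustrative RDD

// cache() is just shorthand for persist(StorageLevel.MEMORY_ONLY)
lines.cache()
assert(lines.getStorageLevel == StorageLevel.MEMORY_ONLY)[/code]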
persist-
Persistence can be customized with this method. Pass the storage level you’d like as a parameter to persist.
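For example (a sketch only; the path and variable names are illustrative), a level from org.apache.spark.storage.StorageLevel can be passed to persist:
[code lang="scala"]import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("/log.txt")              // illustrative path
val errors = logs.filter(_.contains("ERROR"))

// keep serialized copies in memory, spilling to disk when memory runs out
errors.persist(StorageLevel.MEMORY_AND_DISK_SER)

val numErrors = errors.count()   // the first action materializes and persists the data[/code]
The table below summarizes the trade-offs between the storage levels.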
Level | Space Used | CPU Time | In memory | On disk |
---|---|---|---|---|
MEMORY_ONLY | High | Low | Y | N |
MEMORY_ONLY_SER | Low | High | Y | N |
MEMORY_AND_DISK | High | Medium | Some | Some |
MEMORY_AND_DISK_SER | Low | High | Some | Some |
DISK_ONLY | Low | High | N | Y |