
Memory Management in Spark and Its Tuning


Spark has two kinds of memory (a small code illustration follows the figure below):
1. Execution memory - used to store the temporary data of shuffles, joins, sorts, and aggregations.
2. Storage memory - used to cache RDDs and DataFrames.

[Figure: execution and storage memory regions inside a Spark executor]
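As a minimal sketch of the two regions in action (the local master, app name, and toy dataset below are illustrative assumptions, not from the original post), the first operation fills storage memory while the second exercises execution memory:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-regions-demo")   // illustrative name
  .master("local[*]")               // assumes a local run
  .getOrCreate()
import spark.implicits._

val df = (1 to 1000000).toDF("n")

// Storage memory: cache() keeps the DataFrame's blocks in memory for reuse.
df.cache().count()

// Execution memory: the shuffle and aggregation buffers of groupBy live here.
df.groupBy($"n" % 10).count().show()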

An executor has some total amount of memory, which is divided into two regions: an execution region and a storage region. This split is governed by two configuration options, set in the sketch below:
1. spark.executor.memory - the total amount of memory available to each executor; 1 GB by default.
2. spark.memory.fraction - the fraction of that memory (more precisely, of the JVM heap minus a 300 MB reserve) available for execution and storage combined; 0.6 by default.
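A minimal sketch of setting both options when building a session; the 4g value and app name are purely illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-config-demo")            // illustrative name
  .config("spark.executor.memory", "4g")    // total memory per executor
  .config("spark.memory.fraction", "0.6")   // share used for execution + storage
  .getOrCreate()

Note that spark.executor.memory takes effect only at application startup, so it must be set before the first SparkSession is created, or passed on the command line as --executor-memory.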

In early versions of Spark, the sizes of these two regions were fixed. If your job filled all of the execution space, Spark had to spill data to disk, reducing the performance of the application.

On the other hand, if your application relied heavily on caching and filled all of the storage space, Spark had to push out cached data. This is done using a least-recently-used (LRU) strategy, which frees the blocks with the earliest access time (the idea is sketched below). And if your app does not use caching much, the executor has free storage memory sitting idle, but because the execution region is fixed, the job's performance suffers anyway.
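To make the eviction policy concrete, here is a minimal, self-contained sketch of the LRU idea built on Java's LinkedHashMap. It illustrates the strategy only; it is not Spark's actual BlockManager code, and the class and block names are made up:

import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// accessOrder = true makes iteration order follow access recency,
// so the eldest entry is always the least recently used one.
class LruBlockCache[K, V](maxBlocks: Int)
    extends JLinkedHashMap[K, V](16, 0.75f, true) {
  override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
    size() > maxBlocks
}

val cache = new LruBlockCache[String, String](maxBlocks = 2)
cache.put("rdd_0_0", "block 0")
cache.put("rdd_0_1", "block 1")
cache.get("rdd_0_0")              // touching block 0 makes block 1 the LRU
cache.put("rdd_0_2", "block 2")   // over capacity: "rdd_0_1" is evicted
println(cache.keySet())           // prints [rdd_0_0, rdd_0_2]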

Starting with version 1.6, Spark introduced unified memory management. Execution and storage are no longer fixed regions; each can use as much memory as is currently available in the executor.

Unified memory management rests on two premises. The first is to evict storage, not execution: if you evict execution data, you will definitely have to read it back soon, whereas if you evict cached data, you may well never need it again.
The second premise is that the user can specify a minimum unevictable amount of cached data for applications that rely heavily on caching. This minimum is defined by the spark.memory.storageFraction configuration option, which defaults to 0.5, i.e. half of the unified memory region (a configuration sketch follows).
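A sketch for a cache-heavy application; the 0.7 value, the app name, the toy DataFrame, and the MEMORY_AND_DISK choice are illustrative assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("cache-heavy-job")                       // illustrative name
  .master("local[*]")                               // assumes a local run
  .config("spark.memory.storageFraction", "0.7")    // raise the unevictable floor from 0.5
  .getOrCreate()
import spark.implicits._

val lookup = (1 to 100000).toDF("id")

// MEMORY_AND_DISK lets cached blocks spill to disk rather than be dropped
// outright if storage memory still fills up.
lookup.persist(StorageLevel.MEMORY_AND_DISK).count()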

Summary:
If execution memory is full – spill to disk.
If storage memory is full – evict the LRU blocks.
It is better to evict storage, not execution.
If your app relies heavily on caching – tune spark.memory.storageFraction.

 


