Tricky Deployment:
Once you’re done writing your app, you have to deploy it, right? That’s where things get a little out of hand. Although there are many options for deploying your Spark app, the simplest and most straightforward approach is standalone deployment. Spark also supports Mesos and YARN, so if you’re not familiar with either of those, it can become quite difficult to understand what’s going on. You might face some initial hiccups when bundling dependencies as well: if you don’t do it correctly, the Spark app will work in standalone mode but you’ll encounter classpath exceptions when running in cluster mode (see the sketch below).
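For example, with an sbt-based build one common way to avoid those classpath surprises is to produce a fat jar (e.g. with the sbt-assembly plugin) and mark Spark itself as “provided”, so the cluster’s own Spark jars are used while your app’s other dependencies get bundled. The snippet below is a minimal sketch; the artifact names and version numbers are illustrative, not a recommendation.

```scala
// build.sbt (minimal sketch; versions are illustrative)
// Spark is marked "provided" so it is excluded from the assembled fat jar
// and the cluster's own Spark distribution is used at runtime. Everything
// else your app needs is bundled, avoiding classpath mismatches between
// standalone and cluster mode.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided",
  "com.typesafe"      % "config"     % "1.3.1"             // example app dependency, bundled
)
```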
Memory Issues:
As Apache Spark is built to process huge chunks of data, monitoring and measuring memory usage is critical. While Spark works just fine for normal usage, it has tons of configuration options and should be tuned per use case. You’ll often hit memory limits if your configuration isn’t based on your actual usage; running Apache Spark with default settings is rarely the best choice. It is strongly recommended to check the documentation section that deals with tuning Spark’s memory configuration.
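As an illustration, the Scala sketch below shows a few memory-related settings being overridden when the session is built; the specific values are placeholders and only make sense once you’ve measured your own workload.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: overriding a few memory-related settings rather than
// relying on the defaults. The values are illustrative only and should be
// tuned against your own workload.
val spark = SparkSession.builder()
  .appName("memory-tuned-app")
  .config("spark.executor.memory", "4g")           // heap size per executor
  .config("spark.memory.fraction", "0.6")          // share of heap for execution + storage
  .config("spark.memory.storageFraction", "0.5")   // portion of that share reserved for cached data
  .getOrCreate()
```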
API Changes Due to Frequent Releases:
Apache Spark follows a three-month release cycle for 1.x.x releases and a three- to four-month cycle for 2.x.x releases. Although frequent releases mean developers can push out more features relatively fast, this also means lots of under-the-hood changes, which in some cases necessitate changes to the API. This can be problematic if you’re not anticipating changes with a new release, and can entail additional overhead to ensure that your Spark application is not affected by API changes.
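One well-known example of this kind of churn is the consolidation of the SQL entry points in Spark 2.0. The sketch below contrasts the two styles; it is illustrative only, and older code typically keeps working via deprecated paths for a while.

```scala
// Spark 1.x style: a separate SQLContext built on top of a SparkContext.
// val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Spark 2.x style: SparkSession is the single entry point, and the older
// context is still reachable through it for migration purposes.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("migrated-app")
  .getOrCreate()
val sqlContext = spark.sqlContext   // kept around for older code paths
```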
Half-baked Python Support:
It’s great that Apache Spark supports Scala, Java, and Python. Having support for your favorite language is always preferable. However, the Python API is not always on a par with Java and Scala when it comes to the latest features, and it takes some time for the Python bindings to catch up with the latest APIs. If you’re planning to use the latest version of Spark, you should probably go with a Scala or Java implementation, or at least check whether the feature/API you need has a Python implementation available.
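The typed Dataset API is one concrete example: it is exposed in Scala and Java, while PySpark works with untyped DataFrames. A minimal Scala sketch of what that looks like:

```scala
import org.apache.spark.sql.SparkSession

// The typed Dataset API (Scala/Java only; PySpark exposes untyped DataFrames).
case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .appName("dataset-example")
  .getOrCreate()
import spark.implicits._

// Typed transformations: the compiler checks the fields you refer to.
val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDS()
people.filter(_.age > 40).show()
```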
Poor Documentation:
Documentation, tutorials, and code walkthroughs are extremely important for bringing new users up to speed. However, in the case of Apache Spark, although samples and examples are provided along with the documentation, the quality and depth leave a lot to be desired. The examples covered in the documentation are too basic and might not give you that initial push to fully realize the potential of Apache Spark.