Spark

How to filter DataFrame based on keys in Scala List using Spark UDF [Code Snippets]

There are situations where you need to filter a Spark DataFrame based on keys that are already available in a Scala collection. Let's see how we can achieve this in Spark – you need a Spark UDF for it.

Step 1: Create a DataFrame from sample data using the parallelize method.

[code lang="scala"]scala> val df = sc.parallelize(Seq((2,"a"),(3,"b"),(5,"c"))).toDF("id","name")
df: org.apache.spark.sql.DataFrame = [id: int, name: string][/code]

Step 2: Create a UDF that concatenates columns of the DataFrame. The UDF below accepts a collection of columns and returns a single concatenated column, separated by the given delimiter.

[code lang="scala"]scala> val concatKey = udf( (xs: Seq[Any], sep: String) => xs.filter(_ != null).mkString(sep))
concatKey: org.apache.spark.sql.UserDefinedFunction = UserDefinedFu...[/code]
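The excerpt cuts off here; a minimal sketch of how the remaining steps might look, assuming the goal is to keep only rows whose concatenated key appears in a Scala List (the keyList values and the isin-based filter are illustrative assumptions, modelled on the UDF above):

[code lang="scala"]// assumes spark-shell, so toDF and the SQL implicits are already available
import org.apache.spark.sql.functions.{udf, lit, array, col}

val keyList = List("2_a", "5_c")   // hypothetical keys we want to keep

val df = sc.parallelize(Seq((2, "a"), (3, "b"), (5, "c"))).toDF("id", "name")
val concatKey = udf((xs: Seq[Any], sep: String) => xs.filter(_ != null).mkString(sep))

// cast every column to string so they can be packed into one array column, then build the key
val keyed = df.withColumn("key",
  concatKey(array(df.columns.map(c => col(c).cast("string")): _*), lit("_")))

// keep only the rows whose generated key appears in the Scala List
val filtered = keyed.filter(keyed("key").isin(keyList: _*))
filtered.show()[/code]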

Caching and Persistence – Apache Spark

Caching and Persistence – By default, RDDs are recomputed each time you run an action on them. This can be expensive (in time) if you need to use a dataset more than once. Spark allows you to control what is cached in memory.

[code lang="scala"]val logs: RDD[String] = sc.textFile("/log.txt")
val logsWithErrors = logs.filter(_.contains("ERROR")).persist()
val firstnrecords = logsWithErrors.take(10)[/code]

Here, we cache logsWithErrors in memory. Once firstnrecords has been computed, Spark stores the contents of logsWithErrors for faster access in future operations, should we want to reuse it.

[code lang="scala"]val numErrors = logsWithErrors.count() // faster result[/code]

Now, computing the count on logsWithErrors is much faster. There are many ways to c...
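The cut-off sentence above refers to the different ways to control caching; a minimal sketch using Spark's standard StorageLevel API (the /log.txt path is just the placeholder from the snippet above):

[code lang="scala"]import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("/log.txt")
val logsWithErrors = logs.filter(_.contains("ERROR"))

logsWithErrors.persist(StorageLevel.MEMORY_ONLY)   // equivalent to cache()
// other levels include MEMORY_AND_DISK, MEMORY_ONLY_SER, DISK_ONLY, ...

val firstnrecords = logsWithErrors.take(10)   // first action materialises and caches the RDD
val numErrors = logsWithErrors.count()        // subsequent actions reuse the cached data[/code]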

Transformation and Actions in Spark

Transformations and Actions – Spark defines transformations and actions on RDDs. Transformations return new RDDs as results. They are lazy: their result RDD is not immediately computed. Actions compute a result based on an RDD, which is either returned to the driver or saved to an external storage system (e.g., HDFS). They are eager: their result is computed immediately. This laziness/eagerness is how we can limit network communication using the programming model. Example – consider the following simple example:

[code lang="scala"]val input: List[String] = List("hi", "this", "is", "example")
val words = sc.parallelize(input)
val lengths = words.map(_.length)[/code]

Nothing has happened on the cluster at this point; execution of map (a transformation) is deferred. We need t...
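The excerpt breaks off before the action is shown; a minimal sketch of how it might be completed (the choice of reduce is an assumption – any action, e.g. collect or count, would equally trigger execution):

[code lang="scala"]val input: List[String] = List("hi", "this", "is", "example")
val words = sc.parallelize(input)    // RDD[String]
val lengths = words.map(_.length)    // transformation: deferred, nothing runs yet

// an action finally triggers the computation on the cluster
val totalChars = lengths.reduce(_ + _)   // eager: 2 + 4 + 2 + 7 = 15[/code]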

How to Create a Spark RDD?

RDDs can be created in two ways:
1) Transforming an existing RDD.
2) From a SparkContext (or SparkSession) object.

– Transforming an existing RDD: just as map called on a List returns a new List, many of the higher-order functions defined on RDDs return a new RDD.
– From a SparkContext (or SparkSession) object: the SparkContext object (or, in newer Spark versions, the SparkSession that wraps it) can be thought of as your handle to the Spark cluster. It represents the connection between the Spark cluster and your running application. It defines a handful of methods which can be used to create and populate a new RDD:
a) parallelize: convert a local Scala collection to an RDD. e.g. val rdd = sc.parallelize(Seq("1","2","3"))
b) textFile: read a text file from HDFS or a local file system and return an RDD of String. e.g. val ...
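A minimal sketch of both creation methods mentioned above (the HDFS path is a placeholder, not from the article):

[code lang="scala"]// a) parallelize: local Scala collection -> RDD
val fromCollection = sc.parallelize(Seq("1", "2", "3"))

// b) textFile: file on HDFS or the local file system -> RDD[String], one element per line
val fromFile = sc.textFile("hdfs:///data/input.txt")

// transforming an existing RDD also yields a new RDD
val asInts = fromCollection.map(_.toInt)[/code]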

Resilient Distributed Datasets(RDDs) – Spark

Spark implements a distributed data parallel model called Resilient Distributed Datasets (RDDs). Given some large dataset that can't fit into memory on a single node:
-> Chunk up the data.
-> Distribute it over the cluster of machines.
-> From there, think of your distributed data like a single collection.

RDDs are Spark's distributed collections. They look a lot like immutable sequential or parallel Scala collections.

[code]abstract class RDD[T] {
  def map[U](f: T => U): RDD[U] = …
  def flatMap[U](f: T => TraversableOnce[U]): RDD[U] = …
  def filter(f: T => Boolean): RDD[T] = …
  def reduce(f: (T, T) => T): T = …
}[/code]

Most operations on RDDs, like Scala's immutable List and Scala's parallel collections, ar...
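To illustrate how closely RDD operations mirror Scala's collections, here is a minimal sketch (the sample numbers are illustrative only):

[code lang="scala"]val xs = List(1, 2, 3, 4, 5)
val rdd = sc.parallelize(xs)

xs.filter(_ % 2 == 0).map(_ * 10).sum               // local, sequential collection
rdd.filter(_ % 2 == 0).map(_ * 10).reduce(_ + _)    // distributed RDD, same shape of code[/code]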

Hadoop/MapReduce Vs Spark

Hadoop/MapReduce – Hadoop is a widely used large-scale batch data processing framework. It's an open-source implementation of Google's MapReduce. MapReduce was ground-breaking because it provided:
-> a simple API (simple map and reduce steps)
-> fault tolerance

Fault tolerance is what made it possible for Hadoop/MapReduce to scale to 100s or 1000s of nodes at all.

Hadoop/MapReduce + fault tolerance: why is this important? With 100s or 1000s of old commodity machines, the likelihood of at least one node failing midway through a job is very high. Thus, Hadoop/MapReduce's ability to recover from node failure enabled:
-> computations on unthinkably large data sets to succeed to completion.

Fault tolerance + simple API: at Google, MapReduce made it possible for an average Googl...

Data Parallelism – Shared Memory Vs Distributed

The primary concept behind big data analysis is parallelism, defined in computing as the simultaneous execution of processes. The reason for this parallelism is mainly to make analysis faster, but it is also because some data sets may be too dynamic, too large or simply too unwieldy to be placed efficiently in a single relational database. Parallelism is a very important concept when it comes to data processing. Scala achieves data parallelism within a single compute node, which is considered shared memory, while Spark achieves data parallelism in a distributed fashion, spread across multiple nodes, which makes processing much faster.

Shared Memory Data Parallelism (Scala) –
-> Split the data.
-> Workers/threads independently operate on the data in parallel.
-> Combine when done....
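A minimal sketch contrasting the two models (the data and the doubling step are illustrative; the parallel-collections part assumes Scala 2.12 or earlier, where .par is in the standard library):

[code lang="scala"]val data = (1 to 1000).toVector

// shared-memory data parallelism: threads on a single node via Scala parallel collections
val localSum = data.par.map(_ * 2).sum

// distributed data parallelism: Spark spreads the same work across the cluster
val clusterSum = sc.parallelize(data).map(_ * 2).reduce(_ + _)[/code]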

Why Scala? Why Spark?

Why Scala? In general, data science and analytics is done in the small using R, Python, MATLAB, etc. If your dataset gets too large to fit into memory, these languages/frameworks won't let you scale; you have to reimplement everything in some other language or system. Now the industry is shifting towards data-oriented decision making, and many applications are data science in the large. By using a language like Scala, it's easier to scale your small problem to the large with Spark, whose API is almost 1-to-1 with Scala's collections. That is, by working in Scala in a functional style, you can quickly scale your problem from one node to tens, hundreds, or even thousands by leveraging Spark, a successful and performant large-scale data processing framework which looks a...

Ways to create DataFrame in Apache Spark [Examples with Code]

Ways to create DataFrame in Apache Spark – a DATAFRAME is a table-like structure: like a matrix it is organised into rows and columns, but different columns can hold different data types (all values within a single column share the same data type). When working with Spark, most of the time you are required to create a DataFrame and play around with it. A DATAFRAME is simply a data structure held in memory, and it can be created in the following ways –
1) Using a case class
2) Using the createDataFrame method
3) Using the SQL method
4) Using the read/load methods
   i) from flat files (JSON, CSV)
   ii) from RDBMS databases

1) Using a case class

[code lang="scala"]val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Employee(name: String, sal: Int)[/code]

Below is the sample...
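A minimal sketch of how method 1 might continue from the snippet above (the sample employees are illustrative, not the article's data):

[code lang="scala"]val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

case class Employee(name: String, sal: Int)

// build a DataFrame from an RDD of case-class instances
val employeeDF = sc.parallelize(Seq(Employee("Anu", 1000), Employee("Raj", 2000))).toDF()
employeeDF.show()[/code]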

Steps for creating DataFrames, SchemaRDD and performing operations using SparkSQL

Spark SQL: Spark SQL is a Spark module for structured data processing. One use of Spark SQL is to execute SQL queries using basic SQL syntax. There are several ways to interact with Spark SQL, including SQL, the DataFrames API and the Datasets API. The backbone for all these operations is DataFrames and SchemaRDDs.

DataFrames – a DataFrame is a distributed collection of data organised into named columns. It is conceptually equivalent to a table in a relational database.

SchemaRDD – SchemaRDDs are made of Row objects along with metadata information.

Spark SQL needs an SQLContext object, which is created from an existing SparkContext. Below are the steps for creating DataFrames and SchemaRDDs and performing some operations using the sql methods provided by sqlContext.

Step 1: Start the spark shell using the following command....
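Since the excerpt stops at Step 1, here is a minimal sketch of the overall flow it describes (the table name, column names and sample rows are assumptions for illustration):

[code lang="scala"]val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

case class Person(name: String, age: Int)
val peopleDF = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25))).toDF()

// expose the DataFrame to SQL and query it via sqlContext
peopleDF.registerTempTable("people")
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 26")
adults.show()[/code]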

Word count program in Spark

WordCount in Spark – the WordCount program is the basic "hello world" program of the big data world. Below is a program to achieve WordCount in Spark with very few lines of code.

[code lang="scala"]val inputlines = sc.textFile("/users/guest/read.txt")
val words = inputlines.flatMap(line => line.split(" "))
val wMap = words.map(word => (word, 1))
val wOutput = wMap.reduceByKey(_ + _)
wOutput.saveAsTextFile("/users/guest/wordcount_output")   // output directory must not already exist[/code]
