DataFrame vs Dataset

Spark Release:
DataFrame - Spark 1.3
Dataset - Spark 1.6

Data Representation:
DataFrame - A distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database.
Dataset - An extension of the DataFrame API that provides the type-safe, object-oriented programming interface of the RDD API together with the performance benefits of the Catalyst query opti...
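To make the distinction concrete, here is a minimal spark-shell sketch; the Person case class and the people.json input are illustrative assumptions, not part of the comparison above:

[code lang="scala"]
import spark.implicits._   // in spark-shell, `spark` is the SparkSession

case class Person(name: String, age: Long)   // hypothetical schema

// DataFrame: untyped rows with named columns
val peopleDF = spark.read.json("people.json")   // assumed input file

// Dataset: the same data with a compile-time checked element type
val peopleDS = peopleDF.as[Person]
peopleDS.filter(_.age > 21).show()   // the lambda is type-checked at compile time
[/code]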
scala> val inputDF = sc.parallelize(Seq((1,"oclay",400,"2015-01-01 00:00:00"),(1,"oclay",800,"2018-01-01 00:00:00"))).toDF("pid","pname","price","last_mod")

scala> inputDF.show
+---+-----+-----+-------------------+
|pid|pname|price|           last_mod|
+---+-----+-----+-------------------+
|  1|oclay|  400|2015-01-01 00:00:00|
|  1|oclay|  800|2018-01-01 00:00:00|
+---+-----+-----+-------------------+

sca...
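The transcript is cut off here. Given the duplicate pid values with different last_mod timestamps, a plausible continuation is keeping only the latest record per key; this is a hedged sketch using a window function (the goal and the column roles are assumptions, not from the original post):

[code lang="scala"]
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank the records within each pid by last_mod, newest first (assumed goal)
val byLatest = Window.partitionBy("pid").orderBy(col("last_mod").desc)

val latestDF = inputDF
  .withColumn("rn", row_number().over(byLatest))
  .filter(col("rn") === 1)
  .drop("rn")

latestDF.show()
[/code]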
Caching and Persistence - By default, RDDs are recomputed each time you run an action on them. This can be expensive (in time) if you need to use a dataset more than once. Spark allows you to control what is cached in memory.

[code lang="scala"]
val logs: RDD[String] = sc.textFile("/log.txt")
val logsWithErrors = logs.filter(_.contains("ERROR")).persist()
val firstnrecords = logsWithErrors.take(10)   // reuses the persisted RDD; the original snippet is truncated here
[/code]
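Besides the default persist(), Spark exposes several storage levels. A brief sketch (the file path and the filter predicates are illustrative assumptions):

[code lang="scala"]
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("/log.txt")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val errors = logs.filter(_.contains("ERROR")).cache()

// An explicit level can spill partitions to disk when memory is tight
val warnings = logs.filter(_.contains("WARN")).persist(StorageLevel.MEMORY_AND_DISK)

errors.count()   // first action computes and materializes the cache
errors.count()   // subsequent actions reuse the cached partitions
[/code]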
Transformations and Actions - Spark defines transformations and actions on RDDs.
Transformations - return new RDDs as results. They are lazy: their result RDD is not immediately computed.
Actions - compute a result based on an RDD, which is either returned to the driver or saved to an external storage system (e.g., HDFS). They are eager: their result is immediately computed.
Laziness/eagerness is ho...
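A small sketch of the difference (the numbers are arbitrary):

[code lang="scala"]
val numbers = sc.parallelize(1 to 1000000)

// Transformation: builds a new RDD lazily; no work happens yet
val doubled = numbers.map(_ * 2)

// Action: eagerly triggers the computation and returns a value to the driver
val total = doubled.reduce(_ + _)
[/code]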
RDDs can be created in two ways:
1) Transforming an existing RDD.
2) From a SparkContext or SparkSession object.
- Transforming an existing RDD: when map is called on a List, it returns a new List. Similarly, many higher-order functions defined on RDDs return a new RDD.
- From a SparkContext (or SparkSession) object: the SparkContext object (superseded by SparkSession as the entry point in Spark 2.0) can be thought of as your handle...
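A sketch of both creation paths (the path and the values are illustrative assumptions):

[code lang="scala"]
// From a SparkContext: parallelize a local collection, or read a file
val fromCollection = sc.parallelize(List("spark", "scala", "rdd"))
val fromFile = sc.textFile("hdfs:///data/input.txt")   // assumed path

// Transforming an existing RDD: map returns a new RDD,
// just as map on a List returns a new List
val upper = fromCollection.map(_.toUpperCase)
[/code]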
Spark implements a distributed data-parallel model called Resilient Distributed Datasets (RDDs). Given some large dataset that can't fit into memory on a single node:
-> Chunk up the data (diagram to be added).
-> Distribute it over the cluster of machines.
-> From there, think of your distributed data like a single collection.
RDDs are Spark's distributed collections. It see...
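The collection analogy in code (a sketch; the data is arbitrary):

[code lang="scala"]
// A local Scala collection...
val localWords = List("spark", "scala", "rdd")
val localShort = localWords.filter(_.length < 5)

// ...and an RDD expose the same collection-style API,
// but the RDD's work is split across the machines of the cluster
val distributedWords = sc.parallelize(localWords)
val distributedShort = distributedWords.filter(_.length < 5)
[/code]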
Hadoop/MapReduce - Hadoop is a widely used large-scale batch data processing framework. It's an open-source implementation of Google's MapReduce. MapReduce was ground-breaking because it provided:
-> a simple API (simple map and reduce steps)
-> fault tolerance
Fault tolerance is what made it possible for Hadoop/MapReduce to scale to 100s or 1000s of nodes at all. Hadoop/MapReduce +...
The primary concept behind big data analysis is parallelism, defined in computing as the simultaneous execution of processes. The reason for this parallelism is mainly to make analysis faster, but also because some data sets may be too dynamic, too large, or simply too unwieldy to be placed efficiently in a single relational database. Parallelism is a very important concept when it comes to dat...
Why Scala? In general, data science and analytics is done in the small using R, Python, MATLAB, etc. If your dataset gets too large to fit into memory, these languages/frameworks won't let you scale; you have to reimplement everything in some other language or system. Now the industry is shifting towards data-oriented decision making, and many applications are data science in the large...
Algorithm/Program for sorting elements in an Array using Scala. The algorithm used is Bubble Sort, the simplest sorting algorithm; it works by repeatedly swapping adjacent elements that are in the wrong order.

[code lang="scala"]
object SortArray {
  def main(args: Array[String]): Unit = {
    val inputarray = Array(1, 2, 3, 2, 4, 1, 4)
    println("Input")
    println(inputarray.mkString(","))
    // Bubble sort: after pass i, the largest remaining element
    // has bubbled up to position length - 1 - i
    for (i <- 0 until inputarray.length - 1) {
      for (j <- 0 until inputarray.length - 1 - i) {
        if (inputarray(j) > inputarray(j + 1)) {
          val temp = inputarray(j)
          inputarray(j) = inputarray(j + 1)
          inputarray(j + 1) = temp
        }
      }
    }
    println("Sorted")
    println(inputarray.mkString(","))
  }
}
[/code]
Write a program to print only the duplicate elements in an integer array. Logic: loop through each element of the array and compare it with the elements after it.

[code lang="scala"]
object PrintDuplicates {
  def main(args: Array[String]): Unit = {
    // Sample input; the original snippet omitted this definition
    val inputarray = Array(1, 2, 3, 2, 4, 1, 4)
    // Compare each element with every element after it; print on a match
    for (i <- 0 until inputarray.length) {
      for (j <- i + 1 until inputarray.length) {
        if (inputarray(i) == inputarray(j)) {
          println(inputarray(i))
        }
      }
    }
  }
}
[/code]
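For comparison, a more idiomatic sketch (not the post's solution) groups equal elements and keeps the values that occur more than once; unlike the nested-loop version, it prints each duplicate exactly once even when a value occurs more than twice:

[code lang="scala"]
object PrintDuplicatesIdiomatic {
  def main(args: Array[String]): Unit = {
    val inputarray = Array(1, 2, 3, 2, 4, 1, 4)
    // Group equal values, then keep those with more than one occurrence
    val duplicates = inputarray
      .groupBy(identity)
      .collect { case (value, occurrences) if occurrences.length > 1 => value }
    println(duplicates.mkString(","))
  }
}
[/code]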