RDDs can be created in two ways:
1) Transforming an existing RDD.
2) From a SparkContext (or SparkSession) object.

– Transforming an existing RDD: just as map called on a List returns a new List, many of the higher-order functions defined on RDD return a new RDD.
– From a SparkContext (or SparkSession) object: the SparkContext object (wrapped by SparkSession since Spark 2.0) can be thought of as your handle to the Spark cluster. It represents the connection between the Spark cluster and your running application. It defines a handful of methods which can be used to create and populate a new RDD:
a) parallelize: convert a local Scala collection to an RDD. ex:- val rdd = sc.parallelize(Seq("1","2","3"))
b) textFile: read a text file from HDFS or a local file system and return an RDD of String. ex:- val ...
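A minimal sketch of both creation paths, assuming a local SparkContext built inside the application itself (in spark-shell, sc is already provided for you); the file path is hypothetical:

[code lang="scala"]
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration only.
    val conf = new SparkConf().setAppName("RddCreationSketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // a) parallelize: turn a local Scala collection into an RDD
    val numbers = sc.parallelize(Seq("1", "2", "3"))
    println(numbers.count())                   // 3

    // b) textFile: read a text file (local or HDFS) as an RDD[String]
    val lines = sc.textFile("data/sample.txt") // hypothetical path
    println(lines.count())

    sc.stop()
  }
}
[/code]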
Spark implements a distributed data parallel model called Resilient Distributed Datasets (RDDs). Given some large dataset that can’t fit into memory on a single node:
-> Chunk up the data.
-> Distribute it over the cluster of machines.
-> From there, think of your distributed data like a single collection.

RDDs are Spark’s distributed collections. They look a lot like immutable sequential or parallel Scala collections.

[code]abstract class RDD[T] {
  def map[U](f: T => U): RDD[U] = ...
  def flatMap[U](f: T => TraversableOnce[U]): RDD[U] = ...
  def filter(f: T => Boolean): RDD[T] = ...
  def reduce(f: (T, T) => T): T = ...
}[/code]

Most operations on RDDs, like Scala’s immutable List and Scala’s parallel collections, ar...
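As a rough illustration of how these signatures are used, here is a small sketch that assumes a SparkContext named sc is already in scope (for example in spark-shell):

[code lang="scala"]
// Assumes sc: SparkContext is already available (e.g. in spark-shell).
val rdd = sc.parallelize(1 to 10)

// map and filter are transformations: they return new RDDs and leave rdd untouched.
val doubled = rdd.map(_ * 2)             // 2, 4, ..., 20
val evens   = doubled.filter(_ % 4 == 0) // 4, 8, 12, 16, 20

// reduce is an action: it triggers the computation and returns a plain value.
val total = evens.reduce(_ + _)
println(total) // 60
[/code]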
Hadoop/MapReduce – Hadoop is a widely-used large-scale batch data processing framework. It is an open-source implementation of Google’s MapReduce. MapReduce was ground-breaking because it provided:
-> a simple API (simple map and reduce steps)
-> fault tolerance

Fault tolerance is what made it possible for Hadoop/MapReduce to scale to 100s or 1000s of nodes at all.

Hadoop/MapReduce + fault tolerance. Why is this important? Across 100s or 1000s of old commodity machines, the likelihood of at least one node failing midway through a job is very high. Thus, Hadoop/MapReduce’s ability to recover from node failure enabled:
-> computations on unthinkably large data sets to succeed to completion.

Fault tolerance + simple API. At Google, MapReduce made it possible for an average Googl...
The primary concept behind big data analysis is parallelism, defined in computing as the simultaneous execution of processes. The main reason for this parallelism is to make analysis faster, but it is also because some data sets may be too dynamic, too large, or simply too unwieldy to be placed efficiently in a single relational database. Parallelism is a very important concept when it comes to data processing. Scala achieves data parallelism within a single compute node, which is considered shared memory, while Spark achieves data parallelism in a distributed fashion, spread across multiple nodes, which makes processing much faster.

Shared Memory Data Parallelism (Scala) –
-> Split the data.
-> Workers/threads independently operate on the data in parallel.
-> Combine when done....
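A minimal shared-memory sketch using Scala’s parallel collections (this assumes Scala 2.12, where .par ships with the standard library; on 2.13+ it comes from the separate scala-parallel-collections module):

[code lang="scala"]
object SharedMemorySum {
  def main(args: Array[String]): Unit = {
    val data = (1 to 1000000).toArray

    // .par splits the collection into chunks, worker threads fold each chunk
    // independently, and the partial results are combined at the end.
    val parallelSum = data.par.fold(0)(_ + _)

    println(parallelSum) // 500000500000
  }
}
[/code]

The same split / operate / combine shape is what Spark applies across machines instead of threads.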
Why Scala? In general, data science and analytics is done in the small using R, Python, MATLAB, etc. If your dataset gets too large to fit into memory, these languages/frameworks won’t let you scale; you have to reimplement everything in some other language or system. Now the industry is shifting towards data-oriented decision making, and many applications are data science in the large. By using a language like Scala, it’s easier to scale your small problem to the large with Spark, whose API is almost 1-to-1 with Scala’s collections. That is, by working in Scala, in a functional style, you can quickly scale your problem from one node to tens, hundreds, or even thousands by leveraging Spark, a successful and performant large-scale data processing framework which looks a...
Algorithm/program for sorting elements in an Array using Scala. The algorithm used is Bubble Sort. Bubble Sort is a simple sorting algorithm that works by repeatedly swapping adjacent elements that are in the wrong order.

[code lang="scala"]
object SortArray {
  def main(args: Array[String]): Unit = {
    val inputarray = Array(1, 2, 3, 2, 4, 1, 4)
    println("Input")
    println(inputarray.mkString(","))

    // Bubble sort: after each outer pass, the largest remaining element
    // has bubbled to the end of the unsorted portion.
    for (i <- 0 until inputarray.length - 1) {
      for (j <- 0 until inputarray.length - i - 1) {
        if (inputarray(j) > inputarray(j + 1)) {
          val temp = inputarray(j)
          inputarray(j) = inputarray(j + 1)
          inputarray(j + 1) = temp
        }
      }
    }

    println("Sorted elements in Array")
    println(inputarray.mkString(","))
  }
}
[/code]

Output:
Input
1,2,3,2,4,1,4
Sorted elements in Array
1,1,2,2,3,4,4
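For comparison, in everyday Scala you would usually lean on the standard library instead of hand-rolling the sort; a one-line sketch:

[code lang="scala"]
// sorted returns a new, sorted array and leaves the input untouched
println(Array(1, 2, 3, 2, 4, 1, 4).sorted.mkString(",")) // 1,1,2,2,3,4,4
[/code]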
Write a program to print only the duplicate elements in an integer Array.
Logic: loop through each element of the array and compare it with the elements that come after it.

[code lang="scala"]
object PrintDuplicates {
  def main(args: Array[String]): Unit = {
    // Sample input array (assumed for illustration)
    val inputarray = Array(1, 2, 3, 2, 4, 1, 4)

    for (i <- 0 until inputarray.length) {
      for (j <- i + 1 until inputarray.length) {
        if (inputarray(i) == inputarray(j)) {
          println(inputarray(i))
        }
      }
    }
  }
}
[/code]
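The nested loop above prints a value once per matching pair, so an element that occurs three times is reported more than once. An alternative sketch using groupBy prints each duplicate exactly once; the sample array here is assumed purely for illustration:

[code lang="scala"]
object PrintDuplicatesGroupBy {
  def main(args: Array[String]): Unit = {
    val inputarray = Array(1, 2, 3, 2, 4, 1, 4) // sample input for illustration

    // Group equal elements together, then keep only groups with more than one member.
    inputarray.groupBy(identity)
      .collect { case (value, occurrences) if occurrences.length > 1 => value }
      .foreach(println)
  }
}
[/code]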
Write a program to print the triangle pattern below using Scala.
#
##
###
####
#####
Using Scala’s functional style of programming, it’s much easier to print patterns than in Java. Below is the code for printing the pattern using Scala for loops.

Approach 1 –
[code lang="scala"]
object PrintTriangle {
  def main(args: Array[String]): Unit = {
    for (i <- 1 to 5) {
      for (j <- 1 to i) {
        print("#")
      }
      println("")
    }
  }
}
[/code]

Approach 2 –
[code lang="scala"]
object PrintTriangle {
  def main(args: Array[String]): Unit = {
    for (x <- 1 until 6) {
      println("#" * x)
    }
  }
}
[/code]

Output:
#
##
###
####
#####
Removing the header and trailer of a file using Scala might not be a real-time use case, since you would usually use Spark when dealing with large datasets. This post is helpful mainly for interview purposes: an interviewer might ask you to write code for this in Scala instead of Unix/Spark. Here is the code snippet to achieve the same using Scala –

[code lang="scala"]
import scala.io.Source

object RemoveHeaderTrailer {
  def main(args: Array[String]): Unit = {
    println("start")
    val input = Source.fromFile("C:/Users/Sai/input.txt")
    //input.getLines().drop(1).foreach(println) // this removes the header alone
    val lines = input.getLines().toList
    val required_data = lines.slice(1, lines.size - 1).mkString("\n")
    import java.io._
    val pw = new PrintWriter(new File("C:/Users/...
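Since the snippet above is cut off, here is a self-contained sketch of the same idea with hypothetical input and output paths:

[code lang="scala"]
import scala.io.Source
import java.io.{File, PrintWriter}

object RemoveHeaderTrailerSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical paths, used purely for illustration.
    val source = Source.fromFile("input.txt")
    val lines  = source.getLines().toList
    source.close()

    // slice(1, size - 1) drops the first line (header) and the last line (trailer).
    val requiredData = lines.slice(1, lines.size - 1).mkString("\n")

    val pw = new PrintWriter(new File("output.txt"))
    pw.write(requiredData)
    pw.close()
  }
}
[/code]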
According to the Stack Overflow survey, Apache Spark is a hot, trending, and highly paid skill in the IT industry. Apache Spark is extremely popular in the big data analytics world. Here are the frequently asked Apache Spark interview questions to crack a Spark job in 2018.

What is Apache Spark?
Apache Spark is a lightning-fast, in-memory (RAM) computation engine for processing big data files stored in Hadoop’s HDFS, NoSQL databases, or on local systems.

What are the Spark ecosystem components?
Spark Core/SQL, Spark Streaming, Spark MLlib, Spark GraphX.

Spark vs MapReduce
a. Speed: Spark is ten to a hundred times faster than MapReduce.
b. Analytics: Spark supports streaming, machine learning, and complex analytics.
c. Spark is suitable for real-time processing, while MapReduce is suitable for batch processing.
d. Spark is ...