RDDs can be created in two ways:
1) Transforming an existing RDD.
2) From a SparkContext (or SparkSession) object.

– Transforming an existing RDD: just as map called on a List returns a new List, many of the higher-order functions defined on RDD return a new RDD.
– From a SparkContext (or SparkSession) object: the SparkContext object (wrapped by SparkSession since Spark 2.0) can be thought of as your handle to the Spark cluster. It represents the connection between the Spark cluster and your running application. It defines a handful of methods which can be used to create and populate a new RDD:
a) parallelize: convert a local Scala collection to an RDD. ex:- val rdd = sc.parallelize(Seq("1","2","3"))
b) textFile: read a text file from HDFS or a local file system and return an RDD of String. ex:- val ...
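A minimal sketch of both creation paths, assuming a local SparkContext built inside the application itself (in spark-shell, sc is already provided for you); the file path is hypothetical:

[code lang="scala"]
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration only.
    val conf = new SparkConf().setAppName("RddCreationSketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // a) parallelize: turn a local Scala collection into an RDD
    val numbers = sc.parallelize(Seq("1", "2", "3"))
    println(numbers.count())                   // 3

    // b) textFile: read a text file (local or HDFS) as an RDD[String]
    val lines = sc.textFile("data/sample.txt") // hypothetical path
    println(lines.count())

    sc.stop()
  }
}
[/code]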
Spark implements a distributed data parallel model called Resilient Distributed Datasets (RDDs). Given some large dataset that can’t fit into memory on a single node:
-> Chunk up the data.
-> Distribute it over the cluster of machines.
-> From there, think of your distributed data like a single collection.

RDDs are Spark’s distributed collections. They look a lot like immutable sequential or parallel Scala collections.

[code]abstract class RDD[T] {
  def map[U](f: T => U): RDD[U] = ...
  def flatMap[U](f: T => TraversableOnce[U]): RDD[U] = ...
  def filter(f: T => Boolean): RDD[T] = ...
  def reduce(f: (T, T) => T): T = ...
}[/code]

Most operations on RDDs, like Scala’s immutable List and Scala’s parallel collections, ar...
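As a rough illustration of how these signatures are used, here is a small sketch that assumes a SparkContext named sc is already in scope (for example in spark-shell):

[code lang="scala"]
// Assumes sc: SparkContext is already available (e.g. in spark-shell).
val rdd = sc.parallelize(1 to 10)

// map and filter are transformations: they return new RDDs and leave rdd untouched.
val doubled = rdd.map(_ * 2)             // 2, 4, ..., 20
val evens   = doubled.filter(_ % 4 == 0) // 4, 8, 12, 16, 20

// reduce is an action: it triggers the computation and returns a plain value.
val total = evens.reduce(_ + _)
println(total) // 60
[/code]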
Hadoop/MapReduce – Hadoop is a widely-used large-scale batch data processing framework. It is an open-source implementation of Google’s MapReduce. MapReduce was ground-breaking because it provided:
-> a simple API (simple map and reduce steps)
-> fault tolerance

Fault tolerance is what made it possible for Hadoop/MapReduce to scale to 100s or 1000s of nodes at all.

Hadoop/MapReduce + fault tolerance. Why is this important? Across 100s or 1000s of old commodity machines, the likelihood of at least one node failing midway through a job is very high. Thus, Hadoop/MapReduce’s ability to recover from node failure enabled:
-> computations on unthinkably large data sets to succeed to completion.

Fault tolerance + simple API. At Google, MapReduce made it possible for an average Googl...
The primary concept behind big data analysis is parallelism, defined in computing as the simultaneous execution of processes. The main reason for this parallelism is to make analysis faster, but it is also because some data sets may be too dynamic, too large, or simply too unwieldy to be placed efficiently in a single relational database. Parallelism is a very important concept when it comes to data processing. Scala achieves data parallelism within a single compute node, which is considered shared memory, while Spark achieves data parallelism in a distributed fashion, spread across multiple nodes, which makes processing much faster.

Shared Memory Data Parallelism (Scala) –
-> Split the data.
-> Workers/threads independently operate on the data in parallel.
-> Combine when done....
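A minimal shared-memory sketch using Scala’s parallel collections (this assumes Scala 2.12, where .par ships with the standard library; on 2.13+ it comes from the separate scala-parallel-collections module):

[code lang="scala"]
object SharedMemorySum {
  def main(args: Array[String]): Unit = {
    val data = (1 to 1000000).toArray

    // .par splits the collection into chunks, worker threads fold each chunk
    // independently, and the partial results are combined at the end.
    val parallelSum = data.par.fold(0)(_ + _)

    println(parallelSum) // 500000500000
  }
}
[/code]

The same split / operate / combine shape is what Spark applies across machines instead of threads.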
Why Scala? In general, data science and analytics is done in the small using R, Python, MATLAB, etc. If your dataset gets too large to fit into memory, these languages/frameworks won’t let you scale; you have to reimplement everything in some other language or system. Now the industry is shifting towards data-oriented decision making, and many applications are data science in the large. By using a language like Scala, it’s easier to scale your small problem to the large with Spark, whose API is almost 1-to-1 with Scala’s collections. That is, by working in Scala, in a functional style, you can quickly scale your problem from one node to tens, hundreds, or even thousands by leveraging Spark, a successful and performant large-scale data processing framework which looks a...
Algorithm/program for sorting elements in an Array using Scala. The algorithm used is Bubble Sort. Bubble Sort is a simple sorting algorithm that works by repeatedly swapping adjacent elements that are in the wrong order.

[code lang="scala"]
object SortArray {
  def main(args: Array[String]): Unit = {
    val inputarray = Array(1, 2, 3, 2, 4, 1, 4)
    println("Input")
    println(inputarray.mkString(","))

    // Bubble sort: after each outer pass, the largest remaining element
    // has bubbled to the end of the unsorted portion.
    for (i <- 0 until inputarray.length - 1) {
      for (j <- 0 until inputarray.length - i - 1) {
        if (inputarray(j) > inputarray(j + 1)) {
          val temp = inputarray(j)
          inputarray(j) = inputarray(j + 1)
          inputarray(j + 1) = temp
        }
      }
    }

    println("Sorted elements in Array")
    println(inputarray.mkString(","))
  }
}
[/code]

Output:
Input
1,2,3,2,4,1,4
Sorted elements in Array
1,1,2,2,3,4,4
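For comparison, in everyday Scala you would usually lean on the standard library instead of hand-rolling the sort; a one-line sketch:

[code lang="scala"]
// sorted returns a new, sorted array and leaves the input untouched
println(Array(1, 2, 3, 2, 4, 1, 4).sorted.mkString(",")) // 1,1,2,2,3,4,4
[/code]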
Write a program to print only the duplicate elements in an integer Array.
Logic: loop through each element of the array and compare it with the elements that come after it.

[code lang="scala"]
object PrintDuplicates {
  def main(args: Array[String]): Unit = {
    // Sample input array (assumed for illustration)
    val inputarray = Array(1, 2, 3, 2, 4, 1, 4)

    for (i <- 0 until inputarray.length) {
      for (j <- i + 1 until inputarray.length) {
        if (inputarray(i) == inputarray(j)) {
          println(inputarray(i))
        }
      }
    }
  }
}
[/code]
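The nested loop above prints a value once per matching pair, so an element that occurs three times is reported more than once. An alternative sketch using groupBy prints each duplicate exactly once; the sample array here is assumed purely for illustration:

[code lang="scala"]
object PrintDuplicatesGroupBy {
  def main(args: Array[String]): Unit = {
    val inputarray = Array(1, 2, 3, 2, 4, 1, 4) // sample input for illustration

    // Group equal elements together, then keep only groups with more than one member.
    inputarray.groupBy(identity)
      .collect { case (value, occurrences) if occurrences.length > 1 => value }
      .foreach(println)
  }
}
[/code]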
Write a program to print the triangle pattern below using Scala.
#
##
###
####
#####
Using Scala’s functional style of programming, it’s much easier to print patterns than in Java. Below is the code for printing the pattern using Scala for loops.

Approach 1 –
[code lang="scala"]
object PrintTriangle {
  def main(args: Array[String]): Unit = {
    for (i <- 1 to 5) {
      for (j <- 1 to i) {
        print("#")
      }
      println("")
    }
  }
}
[/code]

Approach 2 –
[code lang="scala"]
object PrintTriangle {
  def main(args: Array[String]): Unit = {
    for (x <- 1 until 6) {
      println("#" * x)
    }
  }
}
[/code]

Output:
#
##
###
####
#####
Removing the header and trailer of a file using Scala might not be a real-time use case, since you would usually use Spark when dealing with large datasets. This post is helpful mainly for interview purposes: an interviewer might ask you to write code for this in Scala instead of Unix/Spark. Here is the code snippet to achieve the same using Scala –

[code lang="scala"]
import scala.io.Source

object RemoveHeaderTrailer {
  def main(args: Array[String]): Unit = {
    println("start")
    val input = Source.fromFile("C:/Users/Sai/input.txt")
    //input.getLines().drop(1).foreach(println) // this removes the header alone
    val lines = input.getLines().toList
    val required_data = lines.slice(1, lines.size - 1).mkString("\n")
    import java.io._
    val pw = new PrintWriter(new File("C:/Users/...
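Since the snippet above is cut off, here is a self-contained sketch of the same idea with hypothetical input and output paths:

[code lang="scala"]
import scala.io.Source
import java.io.{File, PrintWriter}

object RemoveHeaderTrailerSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical paths, used purely for illustration.
    val source = Source.fromFile("input.txt")
    val lines  = source.getLines().toList
    source.close()

    // slice(1, size - 1) drops the first line (header) and the last line (trailer).
    val requiredData = lines.slice(1, lines.size - 1).mkString("\n")

    val pw = new PrintWriter(new File("output.txt"))
    pw.write(requiredData)
    pw.close()
  }
}
[/code]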
According to the Stack Overflow survey, Apache Spark is a hot, trending, and highly paid skill in the IT industry. Apache Spark is extremely popular in the big data analytics world. Here are the frequently asked Apache Spark interview questions to crack a Spark job in 2018.

What is Apache Spark?
Apache Spark is a lightning-fast, in-memory (RAM) computation engine for processing big data files stored in Hadoop’s HDFS, NoSQL databases, or on local systems.

What are the Spark ecosystem components?
Spark Core/SQL, Spark Streaming, Spark MLlib, Spark GraphX.

Spark vs MapReduce
a. Speed: Spark is ten to a hundred times faster than MapReduce.
b. Analytics: Spark supports streaming, machine learning, and complex analytics.
c. Spark is suitable for real-time processing, while MapReduce is suitable for batch processing.
d. Spark is ...