What is a companion object?

A companion object is a singleton object that has the same name as its class and holds what would otherwise be static methods and members. Since Scala does not support defining static members inside a class, we create a companion object instead. The class and its companion object must have the same name and must be defined in the same source file. Consider the example below –

class ListToString private (list: List[Any]) {

  def size(): Int = {
    list.length
  }

  def makeString(sep: String = ","): String = {
    list.mkString(sep)
  }
}

object ListToString {

  def makeString(list: List[Any], sep: String = ","): String = {
    list.mkString(sep)
  }

  def apply(list: List[Any]): ListToString = new ListToString(list)
}

The class ListToString is defined with two ...
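A brief, hypothetical usage sketch of the pair above (the sample lists are made up): because the constructor is private, instances are created through the companion's apply method, while the companion's makeString plays the role of a static method.

// Instance creation goes through the companion's apply, since the constructor is private.
val lts = ListToString(List(1, "two", 3.0))

println(lts.size())                                   // 3 – instance method on the class
println(lts.makeString("|"))                          // 1|two|3.0 – instance method on the class
println(ListToString.makeString(List(1, 2, 3), "-"))  // 1-2-3 – "static" method on the companion object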
Binary search is a fast and efficient algorithm for finding an element in a sorted list of elements. It works on the divide-and-conquer technique.

Data structure: Array
Time complexity:
Worst case: O(log n)
Average case: O(log n)
Best case: O(1)
Space complexity: O(1) for the iterative version (the plain recursive version uses O(log n) stack space unless it is tail-recursive)

Let's see how it can be implemented in Scala with different approaches – iterative, recursive, and tail-recursive (a sketch of these follows the driver program below).

Driver program –

val arr = Array(1, 2, 4, 5, 6, 7)
val target = 7

println(binarySearch_iterative(arr, target) match {
  case -1    => s"$target doesn't exist in ${arr.mkString("[", ",", "]")}"
  case index => s"$target exists at index $index"
})

println(binarySearch_Recursive(arr, target)() match {
  case -1    => s"$target doesn't match"
  case index => s"$target exists at index $index"
})
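The implementation bodies are not reproduced above, so here is a minimal sketch consistent with the calls made by the driver program; the names binarySearch_iterative and binarySearch_Recursive, and the curried parameter list with defaults, are inferred from those calls, so treat this as an illustration rather than the original code.

import scala.annotation.tailrec

// Iterative approach: shrink the [low, high] window until the target is found or the window is empty.
def binarySearch_iterative(arr: Array[Int], target: Int): Int = {
  var low = 0
  var high = arr.length - 1
  var result = -1
  while (low <= high && result == -1) {
    val mid = low + (high - low) / 2   // avoids overflow compared with (low + high) / 2
    if (arr(mid) == target) result = mid
    else if (arr(mid) < target) low = mid + 1
    else high = mid - 1
  }
  result
}

// Recursive approach; the recursive call is in tail position, so @tailrec also covers the
// tail-recursion variant. Defaults in the second parameter list let the driver call it with ().
@tailrec
def binarySearch_Recursive(arr: Array[Int], target: Int)(low: Int = 0, high: Int = arr.length - 1): Int = {
  if (low > high) -1
  else {
    val mid = low + (high - low) / 2
    if (arr(mid) == target) mid
    else if (arr(mid) < target) binarySearch_Recursive(arr, target)(mid + 1, high)
    else binarySearch_Recursive(arr, target)(low, mid - 1)
  }
}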
1. What is the version of Spark you are using?
Check the Spark version you are using before going to the interview. As of 2020, the latest version of Spark is 2.4.x.

2. What is the difference between RDD, DataFrame, and Dataset?
RDD – RDD stands for Resilient Distributed Dataset. It is the fundamental data structure of Spark: an immutable collection of records partitioned across the nodes of a cluster. It allows us to perform in-memory computations on large clusters in a fault-tolerant manner. Compared with DataFrames and Datasets, an RDD does not hold a schema; it holds only the data. If the user wants to apply a schema over an RDD, they have to create a case class and apply that schema over the data (see the sketch after this list). We use RDDs in the following cases:
- When our data is unstructured, e.g. streams of text or media streams.
- When we don't ...
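A rough sketch of the case-class point above, using the Scala API; the Employee case class and the sample data are made-up illustrations, not code from the original article.

import org.apache.spark.sql.SparkSession

// An RDD carries only the data; mapping it into a case class lets Spark derive a schema
// when the RDD is converted into a Dataset (or DataFrame).
case class Employee(id: Int, name: String)

object RddSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-schema-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq((1, "Asha"), (2, "Ravi")))   // schema-less records
    val ds  = rdd.map { case (id, name) => Employee(id, name) }.toDS()        // schema comes from the case class
    ds.printSchema()    // prints the id and name fields with their inferred types
    spark.stop()
  }
}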
SparkContext is the main entry point for Spark functionality. It is basically a class in the Spark framework that, when initialized, gives access to the Spark libraries. A SparkContext is responsible for connecting to the Spark cluster; it can be used to create RDDs (Resilient Distributed Datasets) and to broadcast variables on that cluster, and it has many more useful methods. To create or initialize a SparkContext, a SparkConf needs to be created beforehand. SparkConf is basically the class used to set configurations for Spark applications, such as the master, the application name, etc.

Creating SparkContext –

from pyspark import SparkConf, SparkContext

# Configure the application; setMaster and setAppName are the SparkConf helpers for the
# settings mentioned above (the app name here is illustrative).
conf = SparkConf().setMaster("yarn").setAppName("example-app")
sc = SparkContext(conf=conf)

In the latest versions of Spark, the sparkContext is available from SparkSession (a class in the Spark SQL component / the main entry ...
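A minimal sketch of the SparkSession route mentioned above, shown in the Scala API for consistency with the other examples in this collection (the master and app name are illustrative placeholders):

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession; configuration values here are placeholders.
val spark = SparkSession.builder()
  .master("yarn")
  .appName("example-app")
  .getOrCreate()

// The underlying SparkContext is exposed directly on the session.
val sc = spark.sparkContext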
PySpark core components include –

Spark Core – The foundation of Spark; all other functionality is built on top of Spark Core. Contains classes like SparkContext and RDD.
Spark SQL – Provides the API for structured data processing. Contains important classes like SparkSession, DataFrame, and Dataset.
Spark Streaming – Provides functionality for streaming data processing using the micro-batching technique. Contains classes like StreamingContext and DStream.
Spark ML – Provides the API to implement machine learning algorithms.
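To make the micro-batching idea above concrete, here is a minimal Spark Streaming sketch, again using the Scala API for consistency with the other examples in this collection; the master, app name, host, and port are illustrative assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// StreamingContext is the Spark Streaming entry point; micro-batches are formed every 5 seconds.
val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-sketch")
val ssc = new StreamingContext(conf, Seconds(5))

// A DStream of text lines read from a socket source (host and port are placeholders).
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()

ssc.start()             // start receiving data and processing micro-batches
ssc.awaitTermination()  // block until the streaming computation is stopped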