1. What is the version of Spark you are using?
Check the Spark version you are using before going to an interview. As of 2020, the latest version of Spark is 2.4.x.

2. Difference between RDD, DataFrame, Dataset?
RDD – RDD stands for Resilient Distributed Dataset. It is the fundamental data structure of Spark: an immutable collection of records partitioned across the nodes of a cluster. It allows us to perform in-memory computations on large clusters in a fault-tolerant manner. Compared with a DataFrame or Dataset, an RDD does not hold a schema; it holds only the data. If users want to impose a schema over an RDD, they have to create a case class and apply it over the data (see the sketch after this list). We use an RDD in the cases below:
- When our data is unstructured, e.g. streams of text or media.
- When we don't ...
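A minimal sketch in Scala of the case-class approach described above, assuming a hypothetical Person record parsed from comma-separated lines: the case class itself supplies the schema when the RDD is converted to a DataFrame.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type: the case class fields define the schema.
case class Person(name: String, age: Int)

object RddToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-to-df").master("local[*]").getOrCreate()
    import spark.implicits._

    // A raw RDD holds only data -- no schema.
    val lines = spark.sparkContext.parallelize(Seq("alice,34", "bob,29"))

    // Map each line into the case class, then convert to a DataFrame:
    // the schema (name: string, age: int) comes from the case class.
    val peopleDF = lines
      .map(_.split(","))
      .map(f => Person(f(0), f(1).trim.toInt))
      .toDF()

    peopleDF.printSchema()
    spark.stop()
  }
}
```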
According to the Stack Overflow Developer Survey, Apache Spark is a hot, trending, and highly paid skill in the IT industry, and it is extremely popular in the big data analytics world. Here are frequently asked Apache Spark interview questions to crack a Spark job in 2018.

What is Apache Spark?
Apache Spark is a lightning-fast, in-memory (RAM) computation engine for processing big data stored in Hadoop's HDFS, in NoSQL stores, or on local systems.

What are the Spark ecosystem components?
Spark Core/SQL, Spark Streaming, Spark MLlib, Spark GraphX.

Spark vs MapReduce
a. Speed: Spark is ten to a hundred times faster than MapReduce.
b. Analytics: Spark supports streaming, machine learning, and complex analytics.
c. Spark is suitable for real-time processing; MapReduce is suitable for batch processing.
d. Spark is ...
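As a rough illustration of the in-memory computation claim, here is a sketch that caches an RDD so repeated actions reuse the RAM-resident partitions instead of re-reading the source; the HDFS path is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-demo").master("local[*]").getOrCreate()

    // Hypothetical HDFS input; any text source works the same way.
    val lines = spark.sparkContext
      .textFile("hdfs:///data/events.txt")
      .persist(StorageLevel.MEMORY_ONLY) // keep partitions in RAM after first computation

    // The first action reads from storage; later actions reuse the cached data.
    println(s"total lines: ${lines.count()}")
    println(s"error lines: ${lines.filter(_.contains("ERROR")).count()}")
    spark.stop()
  }
}
```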
- What is the input split size for a 64 MB block when the minimum split size is 32 MB and the maximum is 128 MB? What about files of 64 KB, 67 MB, and 127 MB?
- How to change the replication factor in HDFS?
- How to get the size of each file in the HDFS path /user/hdfs?
- What are the default partitioner and combiner?
- What is the output of inner, left, right, and full joins for the tables below (see the sketch after this list)?

  Customer          Transaction
  id  name          id  amount
  1   A             8   200
  2   B             2   100
  2   B             2   100
  4   C             9   200
  5   D             6   200

- How to compare two files?
- How to get the IDs of all jobs?
- How to run a process in the background, bring it back to the foreground, and kill the job?
- How to get the 50th line of a text file?
- Copy the lines of a file, line by line, that come after line 10, and write them to sales.txt.
- How to exclude two tables and import all the other tables (Sqoop)?
- How to skip the special characters \n and \r present in an RDBMS table when importing into HDFS?
- Can Spark Streaming be stopped without stopping the SparkContext?
- Zlib ...
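For the join question above, a sketch using Spark DataFrames built from the given Customer and Transaction rows; running it shows each join's output (for the inner join, the duplicated id 2 on both sides yields 2 × 2 = 4 rows).

```scala
import org.apache.spark.sql.SparkSession

object JoinDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val customer    = Seq((1, "A"), (2, "B"), (2, "B"), (4, "C"), (5, "D")).toDF("id", "name")
    val transaction = Seq((8, 200), (2, 100), (2, 100), (9, 200), (6, 200)).toDF("id", "amount")

    // Inner join keeps only matching ids (id = 2); duplicates on both
    // sides multiply: 2 customer rows x 2 transaction rows = 4 result rows.
    customer.join(transaction, Seq("id"), "inner").show()
    customer.join(transaction, Seq("id"), "left").show()  // all customer rows, nulls for non-matches
    customer.join(transaction, Seq("id"), "right").show() // all transaction rows
    customer.join(transaction, Seq("id"), "full").show()  // union of both sides
    spark.stop()
  }
}
```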
Hadoop MapReduce Interview Questions and Answers

Explain the usage of the Context object.
The Context object is used to help the mapper interact with the rest of the Hadoop system. It can be used to update counters, report progress, and provide application-level status updates. The Context object holds the configuration details for the job, as well as the interfaces that allow it to generate output.

What are the core methods of a Reducer? (A sketch follows below.)
The 3 core methods of a reducer are:
1) setup() – used to configure various parameters like the input data size, distributed cache, heap size, etc. Function definition: public void setup(context)
2) reduce() – the heart of the reducer, called once per key with the associated list of values. Function definition: public void reduce(Key, Value, Context)
3) cleanup() – called only once, at the end of the task, to clear temporary files and release resources. Function definition: public void cleanup(context)
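A sketch of a reducer showing the three core methods and the Context object, written in Scala against the Hadoop MapReduce API; WordCountReducer and the counter names are hypothetical.

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Reducer
import scala.collection.JavaConverters._

// Hypothetical word-count reducer illustrating setup(), reduce(), cleanup().
class WordCountReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  type Ctx = Reducer[Text, IntWritable, Text, IntWritable]#Context

  // setup(): runs once per task, before the first reduce() call.
  override def setup(context: Ctx): Unit = {
    val conf = context.getConfiguration // e.g. read job parameters or the distributed cache
  }

  // reduce(): runs once per key, with all values for that key.
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable], context: Ctx): Unit = {
    val sum = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(sum))            // the Context generates the output
    context.getCounter("app", "keysSeen").increment(1L) // and updates counters
  }

  // cleanup(): runs once per task, after the last reduce() call.
  override def cleanup(context: Ctx): Unit = {
    // e.g. close resources opened in setup()
  }
}
```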
What is a block and block scanner in HDFS?
Block – The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x).
Block Scanner – The block scanner tracks the list of blocks present on a DataNode and verifies them to find checksum errors. Block scanners use a throttling mechanism to conserve disk bandwidth on the DataNode.

Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: The NameNode is at the heart of the HDFS file system. It manages the metadata, i.e. the data of the files is not stored on the NameNode itself; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. The NameNode uses two files for the namespace:
- fsimage – keeps track of the latest checkpoint of the namespace.
- edits – a log of the changes made to the namespace since the last checkpoint.
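A sketch using the Hadoop FileSystem API, which also answers the earlier questions on per-file sizes and changing the replication factor; the paths /user/hdfs and sales.txt are taken from the question list above.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsInspect {
  def main(args: Array[String]): Unit = {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    val fs = FileSystem.get(new Configuration())

    // Size, block size, and replication of each file under /user/hdfs.
    fs.listStatus(new Path("/user/hdfs")).filter(_.isFile).foreach { st =>
      println(s"${st.getPath}  len=${st.getLen}  blockSize=${st.getBlockSize}  repl=${st.getReplication}")
    }

    // Change a file's replication factor (the equivalent of `hdfs dfs -setrep`).
    fs.setReplication(new Path("/user/hdfs/sales.txt"), 2.toShort)
    fs.close()
  }
}
```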