Below are the major components of the Hive architecture:
- UI – The user interface through which users submit queries and other operations to the system. As of 2011 the system had a command-line interface, and a web-based GUI was being developed.
- Driver – Hive queries are sent to the driver for compilation, optimization, and execution.
- Compiler – The component that parses the query, performs semantic analysis o...
Hive Command Line Interface (CLI) – Interaction with Hive is commonly done through the CLI. The Hive CLI is started with the $HIVE_HOME/bin/hive command, which is a bash script. The Hive prompt is hive>. Using the CLI, you can create tables, inspect schemas, and query tables. The CLI is a thick client for Hive – it needs a local copy of all the Hive and Hadoop client components along with their configurations. It can ...
History – At Facebook, data grew from GBs (2006) to 1 TB/day (2007), and today it is 500+ TB per day. Rapidly growing data made traditional warehousing expensive, and scaling up vertically is very costly. Hadoop is an alternative for storing and processing large data, but MapReduce is very low-level and requires custom code. Facebook developed Hive as a solution. Sept 2008 – Hive becomes a Hadoop subproject. What...
Spark SQL: Spark SQL is a Spark module for structured data processing. One use of Spark SQL is to execute SQL queries using basic SQL syntax. There are several ways to interact with Spark SQL, including SQL, the DataFrame API, and the Dataset API. The backbone of all these operations is the DataFrame (formerly SchemaRDD). DataFrames – A DataFrame is a distributed collection of data organized into named columns. It is ...
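The two ways of interacting described above (the DataFrame API and plain SQL over a temporary view) can be sketched as follows. This is a minimal sketch assuming Spark 2.x+ and a local SparkSession; the column names and sample rows are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlExample extends App {
  // SparkSession is the Spark 2.x+ entry point; "local[*]" uses all local cores
  val spark = SparkSession.builder()
    .appName("SparkSqlExample")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Build a DataFrame (named columns) from an in-memory sequence of rows
  val people = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

  // Query it with the DataFrame API...
  people.filter($"age" > 26).show()

  // ...or register it as a temporary view and use plain SQL
  people.createOrReplaceTempView("people")
  spark.sql("SELECT name FROM people WHERE age > 26").show()

  spark.stop()
}
```

Both queries return the same rows; the DataFrame API and SQL compile down to the same execution plan.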
WordCount in Spark – The WordCount program is the basic "hello world" of the Big Data world. Below is a program that achieves WordCount in Spark in very few lines of code (the final line, completing the standard pattern, reduces the per-word counts by key). [code lang="scala"]// sc is the SparkContext provided by the Spark shell
val inputlines = sc.textFile("/users/guest/read.txt")
val words = inputlines.flatMap(line => line.split(" "))
val wMap = words.map(word => (word, 1))
val wOutput = wMap.reduceByKey(_ + _)[/code]
Reversal of a String in Scala using a recursive function – [code lang="scala"]object reverseString extends App {
  val s = "24Tutorials"
  print(revs(s))

  def revs(s: String): String = {
    if (s.isEmpty) ""
    else revs(s.tail) + s.head // equivalently: revs(s.substring(1)) + s.charAt(0)
  }
}[/code] Output: slairotuT42
Q1) Case Classes: A case class is a class that may be used with the match/case statement. Case classes can be pattern matched. Case classes automatically define hashCode and equals. Case classes automatically define getter methods for the constructor arguments. Case classes can be seen as plain, immutable data-holding objects that should depend exclusively on their constructor arguments. Case cla...
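The properties listed above can be demonstrated with a small example (the Point class and its values are illustrative, not from the original answer):

```scala
// A case class automatically gets equals, hashCode, toString,
// a companion apply (so no `new` is needed), and a copy method.
case class Point(x: Int, y: Int)

object CaseClassDemo extends App {
  val p = Point(1, 2)        // companion apply: no `new` required
  println(p.x)               // constructor arguments become public getters
  println(p == Point(1, 2))  // structural equality via the generated equals

  // Pattern matching destructures the constructor arguments
  p match {
    case Point(0, _) => println("on the y-axis")
    case Point(x, y) => println(s"at ($x, $y)")
  }
}
```

Because equality and hashing are structural, case class instances also behave well as keys in sets and maps.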
Hadoop MapReduce Interview Questions and Answers – Explain the usage of the Context object. The Context object is used to help the mapper interact with the rest of the Hadoop system. It can be used to update counters, report progress, and provide application-level status updates. The Context object holds the configuration details for the job, along with interfaces that help it generate the o...
What is a block and a block scanner in HDFS? Block – The minimum amount of data that can be read or written is generally referred to as a "block" in HDFS. The default block size in HDFS is 64 MB (raised to 128 MB in Hadoop 2.x and later). Block Scanner – The block scanner tracks the list of blocks present on a DataNode and verifies them to find checksum errors. Block scanners use a throttling mechanism to reserve disk...
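As a quick illustration of how block size determines block count, a file occupies ceil(fileSize / blockSize) blocks; the file sizes below are made-up examples:

```scala
object BlockCount extends App {
  // Number of HDFS blocks for a file: ceiling division of file size by block size
  def blocks(fileSizeMb: Long, blockSizeMb: Long = 64): Long =
    (fileSizeMb + blockSizeMb - 1) / blockSizeMb

  println(blocks(200))      // a 200 MB file needs 4 blocks of 64 MB
  println(blocks(200, 128)) // with 128 MB blocks it needs only 2
}
```

Note that a file smaller than one block does not consume a full block on disk; the block is only an upper bound on chunk size.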
PIG QUICK NOTES: Pig Latin is the language used to analyze data in Hadoop using Apache Pig. A RELATION is the outermost structure of the Pig Latin data model; it is a bag, where:
- A bag is a collection of tuples
- A tuple is an ordered set of fields
- A field is a piece of data
Pig Latin Statements – While processing data with Pig Latin, statements are the basic constructs. 1. These statements work w...