Author: Sai Kumar

Program to print triangle pattern using Scala

Write a program to print the below triangle pattern using Scala: # ## ### #### ##### Using Scala’s functional style of programming, it is much easier to print patterns than in Java. Below is the code for printing the same using Scala for loops. Approach 1 – [code lang=”scala”]object PrintTriangle { def main(args: Array[String]) { for(i <- 1 to 5){ for(j <- 1 to i){ prin...
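Since the excerpt above is cut off, here is a minimal, self-contained sketch of the same nested-loop approach (the object name PrintTriangle comes from the excerpt; everything after the truncation point is an assumption, so the original post may differ):

[code lang="scala"]
object PrintTriangle {
  def main(args: Array[String]): Unit = {
    // Outer loop: one iteration per row
    for (i <- 1 to 5) {
      // Inner loop: print i '#' characters on row i
      for (j <- 1 to i) {
        print("#")
      }
      println() // move to the next row
    }
  }
}
[/code]

A more functional one-liner for the same output would be (1 to 5).foreach(i => println("#" * i)).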

How to Remove Header and Trailer of File using Scala

Removing the header and trailer of a file using plain Scala may not be a real-time use case, since you will be using Spark when dealing with large datasets. This post is mainly helpful for interview preparation: an interviewer might ask you to write the code for this using Scala instead of Unix utilities or Spark. Here is the code snippet to achieve the same using Scala – [code lang=”scala”] import scala.io.Source obje...
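The snippet above is truncated, so here is a hedged sketch of one way to do it with scala.io.Source (the object name RemoveHeaderTrailer and the file name input.txt are assumptions; the original post may structure the code differently):

[code lang="scala"]
import scala.io.Source

object RemoveHeaderTrailer {
  def main(args: Array[String]): Unit = {
    // Read all lines into memory (fine for small files; file name is an example)
    val source = Source.fromFile("input.txt")
    val lines = source.getLines().toList
    source.close()

    // Drop the first line (header) and the last line (trailer)
    val body = lines.drop(1).dropRight(1)

    body.foreach(println)
  }
}
[/code]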

Top Apache Spark Interview Questions and Answers For 2018

According to the StackOverflow Survey, Apache Spark is a hot, trending and highly paid skill in the IT industry. Apache Spark is extremely popular in the Big Data analytics world. Here are the frequently asked Apache Spark interview questions to crack a Spark job in 2018. What is Apache Spark? Apache Spark is a lightning-fast, in-memory (RAM) computation tool for processing big data files stored in Hadoop’...
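To make that definition concrete, here is a minimal Spark example in Scala that reads a file from Hadoop and counts its lines in memory (the HDFS path and application name are assumptions, not taken from the original post):

[code lang="scala"]
import org.apache.spark.sql.SparkSession

object SparkLineCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkLineCount")
      .getOrCreate()

    // Read a text file stored in Hadoop (path is just an example)
    val lines = spark.read.textFile("hdfs:///data/sample.txt")

    // The computation happens in memory across the cluster
    println(s"Number of lines: ${lines.count()}")

    spark.stop()
  }
}
[/code]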

Ways to create DataFrame in Apache Spark [Examples with Code]

Ways to create DataFrame in Apache Spark – a DataFrame is the representation of a matrix, except that its columns can have different datatypes; in other words, it is similar to a table with rows and columns of different types (the values within each column are of the same data type). When working with Spark, most of the time you are required to create a DataFrame and play around with it. A DataFrame is nothing but a data s...
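Since the excerpt is cut off before the examples, here is a hedged sketch of two common ways to create a DataFrame in Scala (the column names, sample data and CSV path are made up for illustration and are not from the original post):

[code lang="scala"]
import org.apache.spark.sql.SparkSession

object DataFrameCreation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameCreation")
      .master("local[*]") // local master just for a quick test
      .getOrCreate()
    import spark.implicits._

    // 1. From a local collection using toDF
    val df1 = Seq((1, "A"), (2, "B"), (3, "C")).toDF("id", "name")
    df1.show()

    // 2. From an external file, e.g. a CSV with a header row (path is an example)
    val df2 = spark.read.option("header", "true").csv("/tmp/customers.csv")
    df2.printSchema()

    spark.stop()
  }
}
[/code]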

Hadoop Questions Part 2

– What is the input split size for a 64MB block when the minimum split size is 32MB and the maximum split size is 128MB? (64KB, 67MB, 127MB)
– How to change the replication factor in HDFS
– How to get the size of each file in the HDFS path user/hdfs
– What are the default partitioner and combiner
– What is the output of an inner join and of left, right and full outer joins on the tables below?

Customer
1 A
2 B
2 B
4 C
5 D

Transaction
8 200
2 100
2 100
9 200
6 200

– How to compare two files
– How to get the IDs of all jobs
– How to ...
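For the join question above, here is a hedged Scala/Spark sketch that builds the two tables from the listed rows and prints the output of each join type (the column names id, name and amount are assumptions; the excerpt does not include a worked answer):

[code lang="scala"]
import org.apache.spark.sql.SparkSession

object JoinOutputs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JoinOutputs")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Rows taken from the question; column names are assumed
    val customer    = Seq((1, "A"), (2, "B"), (2, "B"), (4, "C"), (5, "D")).toDF("id", "name")
    val transaction = Seq((8, 200), (2, 100), (2, 100), (9, 200), (6, 200)).toDF("id", "amount")

    // Show how the row set changes with each join type
    for (joinType <- Seq("inner", "left", "right", "full")) {
      println(s"=== $joinType join ===")
      customer.join(transaction, Seq("id"), joinType).show()
    }

    spark.stop()
  }
}
[/code]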

Bucketing in Hive

• Bucketing decomposes data sets into more manageable parts
• Users can specify the number of buckets for their data set
• Specifying bucketing does not guarantee that the table is properly populated
• The number of buckets does not vary with the data
• Bucketing is best suited for sampling
• Map-side joins can be done well with bucketing

In the sample code below, a hash function will be applied to the ‘emp…
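Because the sample code in the excerpt is truncated, here is a hedged sketch of a bucketed Hive table created from Scala through Hive’s JDBC driver (the connection URL, the table name emp_details and the column emp_id are assumptions suggested by the truncated ‘emp…’ above):

[code lang="scala"]
import java.sql.DriverManager

object HiveBucketingExample {
  def main(args: Array[String]): Unit = {
    // Requires the hive-jdbc dependency and a running HiveServer2
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()

    // The hash of emp_id decides which of the 4 buckets a row lands in
    stmt.execute(
      """CREATE TABLE IF NOT EXISTS emp_details (
        |  emp_id   INT,
        |  emp_name STRING,
        |  dept     STRING
        |)
        |CLUSTERED BY (emp_id) INTO 4 BUCKETS
        |STORED AS ORC""".stripMargin)

    stmt.close()
    conn.close()
  }
}
[/code]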

Partitioning in Hive

Partitioning improves query performance
The way Hive structures data storage changes with partitioning
Partitions are stored as sub-directories in the table directory
Over-partitioning is to be avoided:
– Each partition creates an HDFS directory with many files in it
– It increases the number of small-sized files in HDFS
– It eventually consumes the capacity of the namenode, as the metadata is kept in main m...
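To make the directory layout concrete, here is a hedged sketch of a partitioned table created from Scala through Hive’s JDBC driver (the connection URL, the table name sales and the partition column country are assumptions):

[code lang="scala"]
import java.sql.DriverManager

object HivePartitioningExample {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()

    // Each distinct country value becomes a sub-directory of the table directory,
    // e.g. .../sales/country=IN/
    stmt.execute(
      """CREATE TABLE IF NOT EXISTS sales (
        |  order_id INT,
        |  amount   DOUBLE
        |)
        |PARTITIONED BY (country STRING)""".stripMargin)

    // Adding a partition creates the corresponding HDFS directory
    stmt.execute("ALTER TABLE sales ADD IF NOT EXISTS PARTITION (country='IN')")

    stmt.close()
    conn.close()
  }
}
[/code]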

Hive Tables

Hive supports 2 types of tables:
1. Managed / Internal tables
2. External tables

Managed Tables
– The life cycle of data in the table is controlled by Hive
– Data is stored under the sub-directory defined by ‘hive.metastore.warehouse.dir’
– When the table is dropped, both data and metadata are deleted
– Not a good choice for sharing data with other tools

External Tables
– Use the keyword EXTERNAL with CRE...
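Here is a hedged sketch showing the two table types side by side, issued from Scala via Hive’s JDBC driver (the connection URL, the table names and the HDFS location are assumptions):

[code lang="scala"]
import java.sql.DriverManager

object HiveTableTypes {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()

    // Managed table: data lives under hive.metastore.warehouse.dir;
    // DROP TABLE removes both the data and the metadata
    stmt.execute("CREATE TABLE IF NOT EXISTS emp_managed (id INT, name STRING)")

    // External table: Hive manages only the metadata;
    // DROP TABLE leaves the files at the LOCATION untouched
    stmt.execute(
      """CREATE EXTERNAL TABLE IF NOT EXISTS emp_external (id INT, name STRING)
        |LOCATION '/data/shared/emp'""".stripMargin)

    stmt.close()
    conn.close()
  }
}
[/code]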

Hive Databases

Hive databases are like namespaces/catalogs
If no database name is specified, the ‘default’ database is used
We can also use the keyword SCHEMA instead of DATABASE in all the database-related commands below
Hive creates a directory for each of the databases it creates
The directory for each database is created under a top-level directory specified by the property hive.metastore.warehouse.dir
You ...
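A hedged sketch of basic database commands run from Scala over Hive’s JDBC driver (the connection URL and the database names sales_db and reporting_db are assumptions):

[code lang="scala"]
import java.sql.DriverManager

object HiveDatabaseExample {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()

    // Creates a sales_db.db directory under hive.metastore.warehouse.dir
    stmt.execute("CREATE DATABASE IF NOT EXISTS sales_db")

    // SCHEMA can be used interchangeably with DATABASE
    stmt.execute("CREATE SCHEMA IF NOT EXISTS reporting_db")

    // Switch the current database for subsequent statements
    stmt.execute("USE sales_db")

    stmt.close()
    conn.close()
  }
}
[/code]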

File Formats in Hive

File format specifies how records are encoded in files
Record format implies how the stream of bytes for a given record is encoded
The default file format is TEXTFILE – each record is a line in the file
Hive uses different control characters as delimiters in text files: ^A (octal 001), ^B (octal 002), ^C (octal 003), \n
The term field is used when overriding the default delimiter: FIELDS TERMINATED BY...
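A hedged sketch of overriding the default text-file delimiters from Scala via Hive’s JDBC driver (the connection URL, the table name emp_csv and the comma delimiter are assumptions):

[code lang="scala"]
import java.sql.DriverManager

object HiveFileFormatExample {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()

    // Plain text storage, with ',' replacing the default ^A field delimiter
    stmt.execute(
      """CREATE TABLE IF NOT EXISTS emp_csv (
        |  id     INT,
        |  name   STRING,
        |  salary DOUBLE
        |)
        |ROW FORMAT DELIMITED
        |  FIELDS TERMINATED BY ','
        |  LINES TERMINATED BY '\n'
        |STORED AS TEXTFILE""".stripMargin)

    stmt.close()
    conn.close()
  }
}
[/code]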

Data Model and Datatypes in Hive

Data in Hive is organised into:
– Databases – a namespace to separate tables and other data
– Tables – a homogeneous collection of data having the same schema
– Partitions – divisions of table data based on a key value
– Buckets – divisions within partitions based on the hash value of a particular column

Hive Data Types: Hive supports primitive data types and three collection types. Primitiv...
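A hedged sketch of a table that combines primitive and collection types, created from Scala via Hive’s JDBC driver (the connection URL and the employee table layout are assumptions):

[code lang="scala"]
import java.sql.DriverManager

object HiveDataTypesExample {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()

    // Primitive types (INT, STRING, DOUBLE) plus the collection types ARRAY, MAP and STRUCT
    stmt.execute(
      """CREATE TABLE IF NOT EXISTS employee (
        |  id       INT,
        |  name     STRING,
        |  salary   DOUBLE,
        |  skills   ARRAY<STRING>,
        |  dept_ids MAP<STRING, INT>,
        |  address  STRUCT<city:STRING, zip:INT>
        |)""".stripMargin)

    stmt.close()
    conn.close()
  }
}
[/code]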

Hive Metastore Configurations

In order to store metadata, Hive can use any of the three strategies below:
– Embedded
– Local
– Remote

Hive Metastore – Embedded
Mainly used for unit tests
Only one process is allowed to connect to the metastore at a time
Hive metadata is stored in an embedded Apache Derby database

Hive Metastore – Local
Metadata is stored in another database such as MySQL
The Hive client will open the connection ...
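As a client-side illustration of the remote mode, here is a hedged Scala sketch of pointing a Spark application at a remote Hive metastore (the thrift URI, application name and warehouse path are assumptions; the original post presumably covers the corresponding hive-site.xml properties instead):

[code lang="scala"]
import org.apache.spark.sql.SparkSession

object RemoteMetastoreClient {
  def main(args: Array[String]): Unit = {
    // Remote mode: metadata is served by a separate metastore service over Thrift
    val spark = SparkSession.builder()
      .appName("RemoteMetastoreClient")
      .config("hive.metastore.uris", "thrift://metastore-host:9083") // assumed host/port
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse")     // assumed path
      .enableHiveSupport()
      .getOrCreate()

    // Verify the connection by listing the databases known to the metastore
    spark.sql("SHOW DATABASES").show()

    spark.stop()
  }
}
[/code]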

