Spark DataFrame Examples

How to flatten JSON in a Spark DataFrame

How do we flatten a whole JSON document that contains both ArrayType and StructType fields? Spark has no predefined function that flattens a JSON structure completely, so we can write our own. The function accepts a DataFrame and inspects the DataType of each field in its schema. If a field is of ArrayType, we create a new column by exploding the array with Spark's explode_outer function. If a field is of StructType, we create a new column named parentfield_childfield for every field inside the struct. The function is recursive: it keeps iterating through the schema, and once it no longer finds any ArrayType or StructType field it returns the fully flattened DataFrame...
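
A minimal sketch of such a recursive flatten function is shown below. It is not the post's exact code (the full listing is cut off above); it simply follows the logic described: explode arrays with explode_outer and promote struct children to parentfield_childfield columns. It assumes Spark 2.2+ (for explode_outer) and column names without special characters.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

def flattenDataFrame(df: DataFrame): DataFrame = {
  // Find the first column that still needs flattening.
  val nested = df.schema.fields.find(f =>
    f.dataType.isInstanceOf[ArrayType] || f.dataType.isInstanceOf[StructType])

  nested match {
    case None => df                                        // nothing left to flatten
    case Some(field) => field.dataType match {
      case _: ArrayType =>
        // Explode the array; explode_outer keeps rows whose array is null or empty.
        flattenDataFrame(df.withColumn(field.name, explode_outer(col(field.name))))
      case st: StructType =>
        // Promote each child field to a top-level parentfield_childfield column.
        val expanded = st.fieldNames.map(child =>
          col(s"${field.name}.$child").alias(s"${field.name}_$child"))
        val others = df.columns.filter(_ != field.name).map(col)
        flattenDataFrame(df.select(others ++ expanded: _*))
      case _ => df
    }
  }
}

Calling flattenDataFrame(spark.read.json("some.json")) would then return a DataFrame with no remaining ArrayType or StructType columns.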

How to filter DataFrame based on keys in Scala List using Spark UDF [Code Snippets]

There are situations where you need to filter a Spark DataFrame based on keys that are already available in a Scala collection. Let's see how we can achieve this in Spark; you need a Spark UDF for this.

Step 1: Create a DataFrame from sample data using the parallelize method.

scala> val df = sc.parallelize(Seq((2,"a"),(3,"b"),(5,"c"))).toDF("id","name")
df: org.apache.spark.sql.DataFrame = [id: int, name: string]

Step 2: Create a UDF that concatenates columns of the DataFrame. The UDF below accepts a collection of columns and returns a single concatenated column separated by the given delimiter.

scala> val concatKey = udf( (xs: Seq[Any], sep:String) => xs.filter(_ != null).mkString(sep))
concatKey: org.apache.spark.sql.UserDefinedFunction = UserDefinedFu...
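
The teaser is cut off above, but a plausible continuation, purely as a sketch, is to build a key column with the concatKey UDF and keep only the rows whose key appears in the Scala list. The key values and the "_" separator below are illustrative, not taken from the post.

scala> import org.apache.spark.sql.functions.{array, lit}

scala> val keys = List("2_a", "5_c")   // keys held in a Scala collection

scala> val withKey = df.withColumn("concat_key",
     |   concatKey(array(df("id").cast("string"), df("name")), lit("_")))

scala> // Keep only the rows whose concatenated key is present in the Scala list.
scala> val filtered = withKey.filter(withKey("concat_key").isin(keys: _*))

scala> filtered.show()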

Ways to create DataFrame in Apache Spark [Examples with Code]

Ways to create a DataFrame in Apache Spark – a DataFrame is a tabular, matrix-like data structure whose columns can hold different data types, although the values within a single column all share the same type. When working with Spark, most of the time you are required to create a DataFrame and work with it. A DataFrame is simply a data structure held in memory, and it can be created in the following ways:

1) Using a case class
2) Using the createDataFrame method
3) Using the SQL method
4) Using read..load methods
   i) from flat files (JSON, CSV)
   ii) from RDBMS databases

1) Using a case class

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Employee(name: String, sal: Int)

Below is the sample...
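
As a brief sketch of two of the ways listed above, the snippet below uses the newer SparkSession entry point rather than the SQLContext shown in the post; the employee names, salaries, and file path are illustrative only.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("df-examples").getOrCreate()
import spark.implicits._

// 1) Using a case class + toDF
case class Employee(name: String, sal: Int)
val dfCase = Seq(Employee("Anil", 5000), Employee("Ravi", 6000)).toDF()

// 2) Using createDataFrame with an explicit schema
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("sal", IntegerType, nullable = true)))
val rows = spark.sparkContext.parallelize(Seq(Row("Anil", 5000), Row("Ravi", 6000)))
val dfExplicit = spark.createDataFrame(rows, schema)

// 4) Using read..load from a flat file (path is hypothetical)
val dfJson = spark.read.json("/path/to/employees.json")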
