How to Add Serial Number to Spark Dataframe

By Sai Kumar on January 20, 2019

You may sometimes need to add a serial number column to a Spark DataFrame.
One option is the Spark function monotonically_increasing_id(). It generates a new column with a unique, monotonically increasing 64-bit ID for each row. The IDs are not consecutive, though: each value depends on the partition the row belongs to, so the numbering jumps by a large gap from one partition to the next. In short, the IDs are unique and increasing, but they do not form a 1, 2, 3, ... sequence.
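To see where the gaps come from, here is a minimal plain-Python sketch of the ID layout described in the Spark documentation for the current implementation: the partition ID goes in the upper 31 bits and the record number within the partition in the lower 33 bits. The function name simulated_monotonic_ids and the partition sizes are made up for illustration; this only imitates the numbering and does not call Spark.

```python
def simulated_monotonic_ids(partition_sizes):
    """Imitate the ID layout: (partition index << 33) + record number."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for record_number in range(size):
            # Upper bits carry the partition index, lower 33 bits the row position
            ids.append((partition_index << 33) + record_number)
    return ids


# Hypothetical DataFrame with two partitions of three rows each
ids = simulated_monotonic_ids([3, 3])
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

The first partition gets 0, 1, 2, while the second starts at 1 << 33, which is why the column cannot be used as a serial number.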

If the goal is to add a consecutive serial number to the DataFrame, you can use the zipWithIndex method available on RDDs.
Below is how you can achieve the same on a DataFrame.

[code lang="python"]

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StructField, StructType


def dfZipWithIndex(df, offset=1, colName="rowId"):
    '''
    Enumerates dataframe rows in native order, like rdd.zipWithIndex(), but on a dataframe,
    and preserves the schema

    :param df: source dataframe
    :param offset: adjustment to zipWithIndex()'s index
    :param colName: name of the index column
    '''
    # Get the active SparkSession
    spark = SparkSession.builder.getOrCreate()

    new_schema = StructType(
        [StructField(colName, LongType(), True)]  # new added field in front
        + df.schema.fields                        # previous schema
    )

    # Pair each Row with its 0-based position: (row, index)
    zipped_rdd = df.rdd.zipWithIndex()

    # Prepend the shifted index to the values of each row
    new_rdd = zipped_rdd.map(lambda pair: [pair[1] + offset] + list(pair[0]))

    return spark.createDataFrame(new_rdd, new_schema)

[/code]
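To make the result of dfZipWithIndex concrete, here is a rough plain-Python sketch of the numbering that zipWithIndex produces: indices run consecutively across partitions, in partition order, which is why Spark first has to count the rows in the earlier partitions. The helper zip_with_index and the sample partitions are hypothetical and only mimic the behaviour; no Spark is involved.

```python
from itertools import accumulate


def zip_with_index(partitions, offset=0):
    """Assign consecutive indices across partitions, in partition order."""
    sizes = [len(partition) for partition in partitions]
    # Each partition starts at the total row count of all earlier partitions
    starts = [0] + list(accumulate(sizes))[:-1]
    indexed = []
    for start, partition in zip(starts, partitions):
        for position, row in enumerate(partition):
            # Prepend the index, as dfZipWithIndex does with the new column
            indexed.append([start + position + offset] + list(row))
    return indexed


# Hypothetical data: five single-column rows spread over three partitions
partitions = [[("a",), ("b",)], [("c",)], [("d",), ("e",)]]
rows = zip_with_index(partitions, offset=1)
print(rows)  # [[1, 'a'], [2, 'b'], [3, 'c'], [4, 'd'], [5, 'e']]
```

With offset=1, the default used by dfZipWithIndex, the serial numbers start at 1 and have no gaps, regardless of how the rows are split across partitions.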

Credits: Stack Overflow

