You may sometimes need to add a serial number column to a Spark DataFrame.
One option is the Spark function monotonically_increasing_id(). It generates a new column with a unique, monotonically increasing 64-bit ID for each row. But the IDs are not consecutive, because the partition ID is encoded in the upper bits of each value and the numbering jumps from one partition to the next. In short, the values are unique and increasing, but they do not form a 1, 2, 3, ... sequence.
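To see why the IDs jump, here is a small pure-Python sketch that imitates the bit layout described in the Spark documentation for the current implementation (partition ID in the upper 31 bits, record number within the partition in the lower 33 bits). It does not call Spark; `simulated_id` and the sample partitions are made up for illustration.

```python
# Imitates the documented layout of monotonically_increasing_id():
# partition ID in the upper 31 bits, per-partition record number in the lower 33 bits.
# simulated_id and the sample data are hypothetical, not part of the Spark API.
RECORD_BITS = 33


def simulated_id(partition_id, record_number):
    return (partition_id << RECORD_BITS) + record_number


# Three partitions holding 2, 3 and 1 rows respectively.
partitions = [["a", "b"], ["c", "d", "e"], ["f"]]

ids = [
    simulated_id(p, r)
    for p, rows in enumerate(partitions)
    for r, _ in enumerate(rows)
]
print(ids)
# [0, 1, 8589934592, 8589934593, 8589934594, 17179869184]
```

The IDs stay unique and increasing, but each new partition starts at a multiple of 2^33, so they cannot be used as a serial number.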
If the goal is to add a consecutive serial number to the DataFrame, you can use the zipWithIndex method available on RDDs.
Below is how you can achieve the same on a DataFrame.
[code lang="python"]
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StructField, StructType

# Reuse the active session, or create one if none exists yet
spark = SparkSession.builder.getOrCreate()


def dfZipWithIndex(df, offset=1, colName="rowId"):
    '''
    Enumerates dataframe rows in native order, like rdd.zipWithIndex(), but on a dataframe,
    and preserves the schema.

    :param df: source dataframe
    :param offset: adjustment to zipWithIndex()'s index
    :param colName: name of the index column
    '''
    new_schema = StructType(
        [StructField(colName, LongType(), True)]  # new field added in front
        + df.schema.fields                        # previous schema
    )
    zipped_rdd = df.rdd.zipWithIndex()  # yields (row, index) pairs, index starting at 0
    # Python 3 does not allow tuple unpacking in lambda arguments, so index into the pair
    new_rdd = zipped_rdd.map(lambda pair: [pair[1] + offset] + list(pair[0]))
    return spark.createDataFrame(new_rdd, new_schema)
[/code]
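For intuition about why zipWithIndex gives consecutive numbers, the sketch below imitates its behaviour in plain Python. It is not Spark's implementation, and `zip_with_index` and the sample partitions are made-up names. The idea is that each partition's starting index is the number of rows in all earlier partitions. This is also why, as the RDD documentation notes, zipWithIndex has to trigger a Spark job when the RDD has more than one partition.

```python
from itertools import accumulate


def zip_with_index(partitions, offset=0):
    # First pass: count the rows in each partition.
    counts = [len(rows) for rows in partitions]
    # Each partition starts where the earlier partitions left off.
    starts = [0] + list(accumulate(counts))[:-1]
    # Second pass: number the rows inside each partition from its start.
    return [
        (row, start + i + offset)
        for rows, start in zip(partitions, starts)
        for i, row in enumerate(rows)
    ]


# Three partitions holding 2, 3 and 1 rows respectively.
partitions = [["a", "b"], ["c", "d", "e"], ["f"]]
indexed = zip_with_index(partitions, offset=1)
print(indexed)
# [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5), ('f', 6)]
```

With offset=1, as in dfZipWithIndex above, the rows are numbered 1 through 6 without gaps, regardless of how they are split across partitions.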
Credits: Stack Overflow