Spark implements a distributed data-parallel model called Resilient Distributed Datasets (RDDs).
Given some large dataset that can’t fit into memory on a single node:
->Chunk up the data.
->Distribute it over the cluster of machines.
->From there, think of your distributed data like a single collection.
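As a minimal sketch of how this looks in practice (assuming a SparkContext named sc, as provided in the Spark shell, and a placeholder file path), the chunking and distribution happen behind a single call:
[code]import org.apache.spark.rdd.RDD

// Sketch: `sc` is a SparkContext (available automatically in the Spark shell);
// the path is a placeholder. Spark splits the file into partitions and
// distributes them across the nodes of the cluster.
val lines: RDD[String] = sc.textFile("hdfs://...")

// From here on, `lines` can be used as if it were one collection.
[/code]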
RDDs are Spark’s distributed collections. They look a lot like immutable sequential or parallel Scala collections.
[code]abstract class RDD[T] {
  def map[U](f: T => U): RDD[U] = ...
  def flatMap[U](f: T => TraversableOnce[U]): RDD[U] = ...
  def filter(f: T => Boolean): RDD[T] = ...
  def reduce(f: (T, T) => T): T = ...
}[/code]
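For example (a sketch assuming the RDD[String] named lines from the snippet above), these methods compose just like their collection counterparts:
[code]// Sketch: `lines` is the RDD[String] created earlier with sc.textFile.
val lineLengths = lines.filter(_.nonEmpty)  // keep non-empty lines
                       .map(_.length)       // length of each remaining line

// reduce returns a plain value rather than an RDD.
val totalChars: Int = lineLengths.reduce(_ + _)
[/code]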
Most operations on RDDs, like those on Scala’s immutable List and Scala’s parallel collections, are higher-order functions.
That is, they are methods that operate on RDDs, take a function as an argument, and typically return RDDs.
While their signatures differ a bit, their semantics are the same:
Scala List | Spark RDD |
map[B](f: A => B): List[B] | map[B](f: A => B): RDD[B] |
flatMap[B](f: A => TraversableOnce[B]): List[B] | flatMap[B](f: A => TraversableOnce[B]): RDD[B] |
filter(pred: A => Boolean): List[A] | filter(pred: A => Boolean): RDD[A] |
reduce(op: (A, A) => A): A | reduce(op: (A, A) => A): A |
fold(z: A)(op: (A, A) => A): A | fold(z: A)(op: (A, A) => A): A |
aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B | aggregate[B](z: B)(seqop: (B, A) => B, combop: (B, B) => B): B |
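To make the shared semantics concrete, here is a small sketch (assuming an existing SparkContext sc; the numbers are made up) running the same pipeline on a local List and on an RDD built from it:
[code]// Sketch: `sc` is an existing SparkContext; the data is made up.
val numbers: List[Int] = List(1, 2, 3, 4, 5)
val rdd = sc.parallelize(numbers)   // distribute the local collection

// The same pipeline, written identically for both:
val localSum = numbers.filter(_ % 2 == 0).map(_ * 10).reduce(_ + _)
val distSum  = rdd.filter(_ % 2 == 0).map(_ * 10).reduce(_ + _)

// Both evaluate to 60; only the RDD version runs across the cluster.
[/code]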
Using RDDs in Spark feels a lot like using normal Scala sequential/parallel collections, with the added knowledge that your data is distributed across several machines.