Spark implements a distributed data-parallel model called Resilient Distributed Datasets (RDDs).
Given some large dataset that can’t fit into memory on a single node:
->Chunk up the data.
->Distribute it over the cluster of machines.
->From there, think of your distributed data like a single collection.
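As a minimal sketch of how this looks in practice (assuming a SparkContext named sc, as provided in the Spark shell, and a placeholder file path), the chunking and distribution happen behind a single call:
[code]import org.apache.spark.rdd.RDD

// Sketch: `sc` is a SparkContext (available automatically in the Spark shell);
// the path is a placeholder. Spark splits the file into partitions and
// distributes them across the nodes of the cluster.
val lines: RDD[String] = sc.textFile("hdfs://...")

// From here on, `lines` can be used as if it were one collection.
[/code]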
RDDs are Spark’s distributed collections. They look a lot like immutable sequential or parallel Scala collections.
[code]abstract class RDD[T] {
  def map[U](f: T => U): RDD[U] = ...
  def flatMap[U](f: T => TraversableOnce[U]): RDD[U] = ...
  def filter(f: T => Boolean): RDD[T] = ...
  def reduce(f: (T, T) => T): T = ...
}[/code]
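For example (a sketch assuming the RDD[String] named lines from the snippet above), these methods compose just like their collection counterparts:
[code]// Sketch: `lines` is the RDD[String] created earlier with sc.textFile.
val lineLengths = lines.filter(_.nonEmpty)  // keep non-empty lines
                       .map(_.length)       // length of each remaining line

// reduce returns a plain value rather than an RDD.
val totalChars: Int = lineLengths.reduce(_ + _)
[/code]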
Most operations on RDDs, like those on Scala’s immutable List and Scala’s parallel collections, are higher-order functions.
That is, they are methods that operate on RDDs, take a function as an argument, and typically return RDDs.
While their signatures differ a bit, their semantics are the same:
Scala List | Spark RDD |
map[B](f: A => B): List[B] | map[B](f: A => B): RDD[B] |
flatMap[B](f: A => TraversableOnce[B]): List[B] | flatMap[B](f: A => TraversableOnce[B]): RDD[B] |
filter(pred: A => Boolean): List[A] | filter(pred: A => Boolean): RDD[A] |
reduce(op: (A, A) => A): A | reduce(op: (A, A) => A): A |
fold(z: A)(op: (A, A) => A): A | fold(z: A)(op: (A, A) => A): A |
aggregate[B](z: => B)(seqop: (B, A) => B, combop: (B, B) => B): B | aggregate[B](z: B)(seqop: (B, A) => B, combop: (B, B) => B): B |
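To make the shared semantics concrete, here is a small sketch (assuming an existing SparkContext sc; the numbers are made up) running the same pipeline on a local List and on an RDD built from it:
[code]// Sketch: `sc` is an existing SparkContext; the data is made up.
val numbers: List[Int] = List(1, 2, 3, 4, 5)
val rdd = sc.parallelize(numbers)   // distribute the local collection

// The same pipeline, written identically for both:
val localSum = numbers.filter(_ % 2 == 0).map(_ * 10).reduce(_ + _)
val distSum  = rdd.filter(_ % 2 == 0).map(_ * 10).reduce(_ + _)

// Both evaluate to 60; only the RDD version runs across the cluster.
[/code]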
Using RDDs in Spark feels a lot like using normal Scala sequential/parallel collections, with the added knowledge that your data is distributed across several machines.