Hadoop/MapReduce
Hadoop is a widely-used large-scale batch data processing framework. It’s an open source implementation of Google’s MapReduce.
MapReduce was ground-breaking because it provided:
-> a simple API (just map and reduce steps; see the sketch after this list)
-> fault tolerance
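To make the two steps concrete, here is a minimal word-count sketch written with plain Scala collections. It only illustrates the map and reduce functions a programmer supplies; the real Hadoop API is Java-based, and the names used here (WordCountSketch, mapStep, reduceStep) are made up for illustration. Everything between the two steps (distributing work, shuffling, grouping by key) is what the framework does for you.

```scala
// Word count expressed in the MapReduce model, sketched with plain Scala
// collections. Illustrative only; not the actual Hadoop API.
object WordCountSketch {

  // Map step: for each input line, emit (word, 1) pairs.
  def mapStep(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reduce step: for one key, sum all the counts emitted for it.
  def reduceStep(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val input = Seq("the quick brown fox", "the lazy dog")

    // The framework handles everything between the two steps:
    // distributing the map calls, shuffling/grouping pairs by key,
    // and invoking reduce once per key.
    val result = input
      .flatMap(mapStep)                                      // map phase
      .groupBy(_._1)                                         // shuffle/group by key
      .map { case (w, ps) => reduceStep(w, ps.map(_._2)) }   // reduce phase

    result.foreach(println)   // e.g. (the,2), (quick,1), ...
  }
}
```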
Fault tolerance is what made it possible for Hadoop/MapReduce to scale to 100s or 1000s of nodes at all.
Hadoop/MapReduce + Fault Tolerance
Why is this important?
With 100s or 1000s of old commodity machines, the likelihood of at least one node failing midway through a job is very high.
Thus, Hadoop/MapReduce’s ability to recover from node failure enabled:
-> computations on unthinkably large data sets to succeed to completion.
Fault tolerance + simple API
At Google, MapReduce made it possible for an average Google software engineer to craft a complex pipeline of map/reduce stages on extremely large data sets.
Why Spark?
Fault-tolerance in Hadoop/MapReduce comes at a cost.
Between each map and reduce step, in order to recover from potential failures, Hadoop/MapReduce shuffles its data and writes intermediate data to disk:
Remember:
Reading/writing to disk: 1000x slower than in-memory
Network communication: 1,000,000x slower than in-memory
Spark…
-> Retains fault-tolerance
-> Different strategy for handling latency (latency significantly reduced!)
Achieves this using ideas from functional programming!
Idea: Keep all data immutable and in-memory. All operations on data are just functional transformations, like regular Scala collections. Fault tolerance is achieved by replaying functional transformations over the original dataset.
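As a rough illustration of that idea, here is a minimal Spark word count in Scala. The application name, the local master setting, and the input path are placeholder assumptions, not anything prescribed by Spark; the point is that each step is an immutable, functional transformation whose lineage Spark can replay if a partition is lost.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch of the same idea in Spark: data stays immutable and
// (ideally) in memory, and each step is a functional transformation.
object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("word-count").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Each transformation produces a new, immutable RDD; nothing is
    // written to disk between steps.
    val counts = sc.textFile("hdfs://.../input.txt")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .cache()                  // keep the result in memory for reuse

    counts.take(10).foreach(println)

    // Fault tolerance: if a node holding part of `counts` fails, Spark
    // re-runs the recorded chain of transformations (the lineage) on the
    // lost partitions, instead of checkpointing intermediate data to disk.
    sc.stop()
  }
}
```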
Result: Spark has been shown to be 100x more performant than Hadoop, while adding even more expressive APIs.