Spark Release 2.1.0:
The Apache Spark 2.1.0 release makes significant strides in the production readiness of Structured Streaming, with added support for event-time watermarks and Kafka 0.10. In addition, this release focuses on usability, stability, and polish, resolving over 1200 tickets. Below is a list of the high-level changes.
Core and Spark SQL:
- This version adds the from_json and to_json functions for parsing and serializing JSON string columns (see the sketch after this list).
- This version allows for the use of DDL commands to manipulate partitions for tables stored with Spark’s native formats.
- It speeds up group-by aggregate performance by adding a fast aggregation cache backed by a row-based hashmap.
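A minimal sketch of the new JSON functions; the column name, schema, and sample record are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, to_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("json-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("""{"name":"alice","age":30}""").toDF("value")
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

// Parse the JSON string column into a struct, then serialize it back.
val parsed    = df.select(from_json($"value", schema).as("data"))
val roundTrip = parsed.select(to_json($"data").as("value"))
```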
Structured Streaming:
- This version adds Kafka 0.10 support in Structured Streaming (see the sketch after this list).
- This version supports all file formats in Structured Streaming.
- This version separates instantaneous state from progress performance statistics.
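A minimal sketch of the Kafka 0.10 source combined with an event-time watermark, continuing with the spark session and implicits from the earlier sketch; the broker address, topic name, and window sizes are illustrative:

```scala
import org.apache.spark.sql.functions.window

// Read from Kafka 0.10 as a streaming DataFrame.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// Windowed counts with a watermark: state for data arriving more
// than 10 minutes late can be dropped.
val counts = stream
  .select($"value".cast("string"), $"timestamp")
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()
```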
Stability:
- This version addresses long-running Structured Streaming requirements.
Spark Release 2.2.0:
The Apache Spark 2.2.0 release removes the experimental tag from Structured Streaming. In addition, this release focuses on usability, stability, and polish, resolving over 1100 tickets.
Core and Spark SQL:
- This version supports creating Hive tables with DataFrameWriter and Catalog.
- This adds support for LATERAL VIEW OUTER explode() for accessing nested data.
- It supports a session-local time zone (spark.sql.session.timeZone).
- It supports an AES-based authentication mechanism for Spark.
- It adds partial aggregation support for HiveUDAFFunction.
- It introduces a limit on the maximum number of records written per file.
- It supports parsing multi-line JSON and CSV files (both features are sketched after this list).
- This version removes support for Hadoop 2.5 and earlier.
- This version removes Java 7 support.
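A minimal sketch of the per-file record limit and the multi-line readers; the paths and limit value are illustrative, and spark.sql.files.maxRecordsPerFile is assumed to be the config key exposing the limit:

```scala
// Cap the number of records written per output file (0 means unlimited).
spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000)

// Parse JSON/CSV records that span multiple lines.
val people = spark.read.option("multiLine", true).json("/data/people.json")
val trades = spark.read
  .option("multiLine", true)
  .option("header", true)
  .csv("/data/trades.csv")
```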
Structured Streaming:
- Kafka improvements: support for reading data from and writing data to Apache Kafka, in both streaming and batch.
- It also adds a cached Kafka producer for lower-latency Kafka-to-Kafka streaming.
- It supports complex stateful processing and timeouts using the flatMapGroupsWithState operation (sketched below).
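A minimal sketch of flatMapGroupsWithState, keeping a running per-user event count with a processing-time timeout, continuing with the session and implicits from the earlier sketches; the Event and Summary case classes and the socket source are illustrative:

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(user: String, action: String)
case class Summary(user: String, count: Long)

// Parse "user,action" lines from a socket into events.
val events = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999).load()
  .as[String]
  .map { line => val Array(u, a) = line.split(","); Event(u, a) }

val summaries = events
  .groupByKey(_.user)
  .flatMapGroupsWithState[Long, Summary](
      OutputMode.Update(), GroupStateTimeout.ProcessingTimeTimeout) {
    (user, rows, state: GroupState[Long]) =>
      val newCount = state.getOption.getOrElse(0L) + rows.size
      state.update(newCount)
      state.setTimeoutDuration("30 minutes") // drop state for idle users
      Iterator(Summary(user, newCount))
  }
```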
Spark Release 2.0.0:
The major updates in Apache Spark 2.0.0 are API usability, SQL:2003 support, performance improvements, Structured Streaming, R UDF support, and operational improvements. In addition, this release includes over 2500 patches from over 300 contributors.
Core and Spark SQL:
Programming APIs:
One of the largest changes in this version is the new, updated set of programming APIs:
- Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset[Row].
- SparkSession: a new entry point that replaces the old SQLContext and HiveContext for the DataFrame and Dataset APIs; SQLContext and HiveContext are kept for backward compatibility (see the sketch after this list).
- This version also provides a new, streamlined configuration API for SparkSession
- This version also provides a simpler, more performant accumulator API.
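A minimal sketch of the new entry point and the DataFrame/Dataset unification; the app name and config value are illustrative:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// SparkSession replaces SQLContext/HiveContext as the single entry point.
val spark = SparkSession.builder()
  .appName("demo")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "8") // streamlined configuration API
  .getOrCreate()

// In Scala, DataFrame is just a type alias for Dataset[Row].
val df: DataFrame = spark.range(10).toDF("n")
val ds: Dataset[Row] = df // same type, no conversion needed
```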
SQL:
- Spark 2.0 substantially improves SQL functionality with SQL:2003 support, and Spark SQL can now run all 99 TPC-DS queries.
- It includes a native SQL parser that supports both ANSI SQL and HiveQL.
- It supports native DDL command implementations.
- It includes subquery support: correlated and uncorrelated scalar subqueries, and IN, NOT IN, EXISTS, and NOT EXISTS predicate subqueries in WHERE and HAVING clauses (see the sketch after this list).
- When building without Hive support, Spark SQL should have almost all the functionality as when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.
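A minimal sketch of the new predicate-subquery support; the table and column names are illustrative:

```scala
// IN and NOT EXISTS predicate subqueries in a WHERE clause.
val result = spark.sql("""
  SELECT o.id, o.total
  FROM orders o
  WHERE o.customer_id IN (SELECT c.id FROM customers c WHERE c.active = true)
    AND NOT EXISTS (SELECT 1 FROM returns r WHERE r.order_id = o.id)
""")
```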
New Features:
- Native CSV data source, based on Databricks’ spark-csv module.
- Off-heap memory management for both caching and runtime execution
- Hive style bucketing support
- Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch.
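A minimal sketch of the sketch-based summary statistics exposed through DataFrameStatFunctions; the column name and accuracy parameters are illustrative:

```scala
val nums = spark.range(100000).toDF("x")

// Approximate median with 1% relative error.
val Array(median) = nums.stat.approxQuantile("x", Array(0.5), 0.01)

// Bloom filter and count-min sketch built over a column.
val bloom = nums.stat.bloomFilter("x", expectedNumItems = 100000L, fpp = 0.03)
val cms   = nums.stat.countMinSketch("x", eps = 0.01, confidence = 0.95, seed = 42)

println(bloom.mightContain(42L)) // membership test with false-positive rate fpp
```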
Spark Release 1.6.0:
Apache Spark 1.6.0 is the seventh release on the 1.x line. This release includes contributions from 248+ contributors.
Spark Core/SQL:
API Updates:
- Dataset API – A new Spark API, similar to RDDs, that allows users to work with custom objects and lambda functions while still gaining the benefits of the Spark SQL execution engine (see the sketch after this list).
- Session Management – Different users can share a cluster while having different configuration and temporary tables.
- SQL Queries on Files – Concise syntax for running SQL queries over files of any supported format without registering a table.
- Reading non-standard JSON files – Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
- Per-operator Metrics for SQL Execution – Display statistics on a per-operator basis for memory usage and spilled data size.
- Advanced Layout of Cached Data – Storing partitioning and ordering schemes in the in-memory table scan, and adding distributeBy and localSort to the DataFrame API.
- table supports specifying a database name. For example, sqlContext.read.table("dbName.tableName") can be used to create a DataFrame from a table called "tableName" in the database "dbName".
- With schema inference from JSON into a DataFrame, users can set primitivesAsString to true (in data source options) to infer all primitive value types as strings. The default value of primitivesAsString is false.
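A minimal sketch of the 1.6 Dataset API and the SQL-on-files syntax; the Person case class and paths are illustrative (1.6 still uses SQLContext as the entry point):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ds-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

case class Person(name: String, age: Int)

// Typed Dataset: custom objects and lambdas on top of the SQL engine.
val people = sqlContext.read.json("/data/people.json").as[Person]
val adultNames = people.filter(_.age >= 18).map(_.name)

// SQL queries directly over files, no table registration needed.
val events = sqlContext.sql("SELECT * FROM parquet.`/data/events`")
```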
Performance:
- Unified Memory Management – Shared memory for execution and caching instead of exclusive division of the regions.
- Parquet Performance – Improve Parquet scan performance when using flat schemas.
- Improved query planner for queries having distinct aggregations – Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
- Adaptive query execution – Initial support for automatically selecting the number of reducers for joins and aggregations.
- Avoiding double filters in Data Source API – When implementing a data source with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
- Fast null-safe joins – Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a Cartesian product (see the sketch after this list).
- In-memory Columnar Cache Performance – Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
- SQL Execution Using Off-Heap Memory – Support for configuring query execution to occur using off-heap memory to avoid GC overhead
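A minimal sketch of a null-safe equality join, reusing the sqlContext implicits from the previous sketch; the DataFrames and key column are illustrative:

```scala
val left  = Seq((Some(1), "a"), (None, "b")).toDF("k", "v1")
val right = Seq((Some(1), "x"), (None, "y")).toDF("k", "v2")

// <=> treats NULL as equal to NULL, so the null-keyed rows match each other;
// as of 1.6 this plans a SortMergeJoin instead of a Cartesian product.
val joined = left.join(right, left("k") <=> right("k"))
```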
Spark Streaming:
API Updates:
- New improved state management – mapWithState, a DStream transformation for stateful stream processing, supersedes updateStateByKey in functionality and performance (see the sketch after this list).
- Kinesis record de-aggregation – Kinesis streams have been upgraded to use KCL 1.4.0 and support transparent de-aggregation of KPL-aggregated records.
- Kinesis message handler function – Allows an arbitrary function to be applied to a Kinesis record in the Kinesis receiver to customize what data is stored in memory.
- Python Streaming Listener API – Get streaming statistics (scheduling delays, batch processing times, etc.) from Python.
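A minimal sketch of mapWithState maintaining running word counts, reusing the sc from the previous sketch; the socket source and checkpoint path are illustrative:

```scala
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("/tmp/checkpoint") // required for stateful transformations

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map((_, 1))

// For each key, fold the new value into the stored state and emit the total.
val spec = StateSpec.function((word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
})

val runningCounts = pairs.mapWithState(spec)
runningCounts.print()
```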
UI Improvements:
- Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
- Made output operations visible in the streaming tab as progress bars.