Apache Spark Acceleration Using GridGain

GridGain Provides SQL with Indexing for 1,000x Faster Spark Queries

The GridGain in-memory computing platform is built on Apache Ignite, which includes an in-memory data grid, a compute grid, streaming, and acceleration solutions for Hadoop and Apache Spark. Apache Spark is a fast, general-purpose open source engine for large-scale data processing. Both GridGain and Spark are in-memory computing solutions, but they target different use cases and complement each other. Used together, they can deliver better performance and functionality than either provides alone.

Apache Spark Plus GridGain for Faster SQL Queries

Apache Spark supports a fairly rich SQL syntax. However, it does not index the data, so every Spark query must perform a full scan. As a result, Spark queries may take minutes, even on moderately small data sets. GridGain supports SQL indexes, which let a query read only the matching entries, so Spark SQL can be accelerated by over 1,000x when Spark is used together with GridGain. The result set returned by GridGain Shared RDDs also supports the Spark DataFrame API, so it can be analyzed further using standard Spark data frames. Both Apache Spark and GridGain integrate natively with Apache Hadoop YARN and Apache Mesos, so they can easily be deployed together.
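
The difference between a full scan and an indexed lookup can be illustrated with a small stand-alone Java sketch. It does not use Spark or GridGain: an int array plays the role of a table column and a sorted TreeMap is an assumed stand-in for a SQL index. The class and method names are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: plain Java collections stand in for a table and its index.
final class IndexedLookupSketch {

    // Full scan: every row is inspected, whether it matches or not.
    static List<Integer> fullScan(int[] values, int lo, int hi) {
        List<Integer> rowIds = new ArrayList<>();
        for (int rowId = 0; rowId < values.length; rowId++) {
            if (values[rowId] > lo && values[rowId] < hi) {
                rowIds.add(rowId);
            }
        }
        return rowIds;
    }

    // Build a sorted index once: column value -> ids of the rows holding it.
    static NavigableMap<Integer, List<Integer>> buildIndex(int[] values) {
        NavigableMap<Integer, List<Integer>> index = new TreeMap<>();
        for (int rowId = 0; rowId < values.length; rowId++) {
            index.computeIfAbsent(values[rowId], v -> new ArrayList<>()).add(rowId);
        }
        return index;
    }

    // Indexed lookup: only the matching slice of the sorted index is visited.
    static List<Integer> indexedLookup(NavigableMap<Integer, List<Integer>> index,
                                       int lo, int hi) {
        List<Integer> rowIds = new ArrayList<>();
        for (List<Integer> ids : index.subMap(lo, false, hi, false).values()) {
            rowIds.addAll(ids);
        }
        return rowIds;
    }

    public static void main(String[] args) {
        int[] values = {42, 7, 99, 15, 7, 63};
        // Both calls find the same rows; the indexed result is ordered by value.
        System.out.println(fullScan(values, 10, 50));
        System.out.println(indexedLookup(buildIndex(values), 10, 50));
    }
}
```

The scan touches all rows on every query, while the indexed lookup pays the cost of building the index once and then visits only the entries inside the requested range. The actual speedup in a real deployment depends on data size and query selectivity.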

GridGain and Apache Spark Integration

The Shared RDD API implemented in GridGain is one of the main integration points between GridGain and Apache Spark. Spark shared RDDs are essentially wrappers around GridGain caches, which can be deployed directly inside the Spark processes that execute Spark jobs. Spark shared RDDs can also be used with the cache-aside pattern, where GridGain clusters are deployed separately from Spark but still hold the data in memory. In both cases, the data is accessed through the Spark RDD APIs.

IgniteContext is the main entry point to GridGain RDDs. It allows users to specify different GridGain configurations and to access GridGain in either client mode or server mode. Users can also create new shared RDDs, which essentially means creating new GridGain caches with different configurations and indexing strategies. GridGain offers both fully replicated and partitioned caches, which allows a variety of partitioning and replication strategies.
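
The distinction between partitioned and fully replicated caches can be sketched in plain Java. This is a deliberately simplified model, not the GridGain API: each hypothetical node is a HashMap, and a hash-modulo rule is an assumed stand-in for the way a real data grid assigns keys to nodes (a real grid uses a more sophisticated mapping and can also keep backup copies). All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: models where entries live in each cache mode.
final class CacheModeSketch {

    // One map per hypothetical cluster node.
    private final List<Map<String, Integer>> nodes = new ArrayList<>();
    private final boolean replicated;

    CacheModeSketch(int nodeCount, boolean replicated) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
        this.replicated = replicated;
    }

    // Assumed placement rule: hash of the key modulo the number of nodes.
    int ownerOf(String key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    void put(String key, int value) {
        if (replicated) {
            // Replicated: every node keeps a full copy of the data.
            for (Map<String, Integer> node : nodes) {
                node.put(key, value);
            }
        } else {
            // Partitioned: exactly one node owns each key.
            nodes.get(ownerOf(key)).put(key, value);
        }
    }

    // Number of nodes that currently hold the key.
    int copiesOf(String key) {
        int copies = 0;
        for (Map<String, Integer> node : nodes) {
            if (node.containsKey(key)) {
                copies++;
            }
        }
        return copies;
    }

    public static void main(String[] args) {
        CacheModeSketch partitioned = new CacheModeSketch(3, false);
        CacheModeSketch replicatedCache = new CacheModeSketch(3, true);
        partitioned.put("user:1", 10);
        replicatedCache.put("user:1", 10);
        System.out.println("partitioned copies: " + partitioned.copiesOf("user:1"));
        System.out.println("replicated copies:  " + replicatedCache.copiesOf("user:1"));
    }
}
```

In this model a partitioned cache spreads the data set across nodes, so capacity grows with the cluster, while a replicated cache trades memory for the ability to read any entry locally on any node.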

Everything that can be done in GridGain can be done through IgniteContext by passing the appropriate GridGain configuration. GridGain RDDs are accessed using the native Spark RDD syntax. The main difference is that GridGain RDDs are mutable, while Spark RDDs are immutable. Because they are mutable, GridGain RDDs can be updated during or at the end of every job or task execution, and other applications and jobs can be notified of the changes and read the updated state.
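
The practical effect of mutability can be shown with a small stand-alone Java sketch. It does not use Spark or GridGain: a plain list plays the role of an immutable RDD, and a ConcurrentHashMap is an assumed stand-in for a shared cache that several jobs can see. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Illustrative only: contrasts immutable transformations with in-place updates.
final class SharedStateSketch {

    // Immutable style: the transformation returns a new collection
    // and leaves its input untouched.
    static List<Integer> doubled(List<Integer> input) {
        return input.stream().map(x -> x * 2).collect(Collectors.toList());
    }

    // Mutable shared style: a job updates the entries of a shared store in place,
    // so any other holder of the same store sees the new values.
    static void doubleInPlace(Map<String, Integer> sharedCache) {
        sharedCache.replaceAll((key, value) -> value * 2);
    }

    public static void main(String[] args) {
        List<Integer> source = Arrays.asList(1, 2, 3);
        List<Integer> result = doubled(source);
        System.out.println(source + " -> " + result);

        Map<String, Integer> sharedCache = new ConcurrentHashMap<>();
        sharedCache.put("counter", 1);
        doubleInPlace(sharedCache);                      // first "job" writes
        System.out.println(sharedCache.get("counter")); // second "job" reads the update
    }
}
```

With the immutable style, a second job only sees new results if they are handed to it explicitly. With the shared mutable store, the second job reads whatever state the first job left behind, which is the property the shared RDD approach relies on.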

Shared In-Memory File System with Apache Spark Plus GridGain

When working with files instead of RDDs, it is still possible to share state between Spark jobs and applications by using the GridGain In-Memory File System (GGFS). GGFS implements the Hadoop FileSystem API and can be deployed as a native Hadoop file system, just like HDFS. GridGain therefore plugs natively into any Hadoop or Spark environment, and the in-memory file system can be used in plug-and-play fashion with zero code changes.