GridGain Integration with Apache Spark

1,000x Faster Spark Queries and Shared RDDs

The GridGain In-Memory Data Fabric is built on Apache Ignite, an open source in-memory data fabric that provides a wide range of computing solutions, including an in-memory data grid, a compute grid, streaming, and acceleration for Hadoop and Spark. Apache Spark is a fast, general-purpose open source engine for large-scale data processing. Both GridGain and Spark are in-memory computing solutions, but they target different use cases and are complementary. In many cases they can be used together to achieve better performance and functionality.

Because Apache Spark and GridGain address somewhat different use cases, they rarely “compete” for the same task. The main differences are:

Solution comparison: Apache Spark and GridGain In-Memory Data Fabric

Data Retention
  • Apache Spark: Loads data for processing from other storage systems, usually disk-based, and then discards the data when the processing is finished. It doesn’t store data.
  • GridGain In-Memory Data Fabric: Provides a distributed in-memory key-value store (distributed cache or data grid) with ACID transactions and SQL querying capabilities, which retains data in memory and can write through to an underlying database.

OLAP/OLTP
  • Apache Spark: Works with non-transactional, read-only data (RDDs don’t support in-place mutation), so it is used for OLAP.
  • GridGain In-Memory Data Fabric: Supports non-transactional (OLAP) payloads as well as fully ACID-compliant transactions (OLTP).

Data Types
  • Apache Spark: Is based on RDDs and works only on data-driven payloads.
  • GridGain In-Memory Data Fabric: Fully supports pure computational payloads (HPC/MPP) that can be “dataless”.
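
The write-through behavior mentioned under Data Retention can be sketched in a few lines. This is a conceptual Python model only, not GridGain's API: the `WriteThroughCache` class and the `kv` table are hypothetical names, and an in-memory SQLite database stands in for the underlying database.

```python
import sqlite3


class WriteThroughCache:
    """Toy write-through cache: every put updates the database and memory."""

    def __init__(self, conn):
        self._mem = {}      # in-memory copy served to readers
        self._conn = conn   # stand-in for the underlying database
        conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")

    def put(self, key, value):
        # Write to the database first, then to memory, so the two stay in sync.
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value)
        )
        self._conn.commit()
        self._mem[key] = value

    def get(self, key):
        # Reads are served from memory and never touch the database.
        return self._mem.get(key)


conn = sqlite3.connect(":memory:")
cache = WriteThroughCache(conn)
cache.put("user:1", "alice")
stored = conn.execute("SELECT v FROM kv WHERE k = ?", ("user:1",)).fetchone()
```

After the `put`, the value is available both from the cache and from the database row, which is the property that lets data stay in memory without being lost to the system of record.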

Apache Spark is built for in-memory processing of event-driven data. Spark doesn’t provide shared storage, so ETL-ed data must be loaded from HDFS or another disk-based store into Spark for processing. State can be passed from one Spark job to the next only by saving the processed data back to external storage. GridGain can share Spark state directly in memory, without writing it to disk.
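
The difference between the two hand-off styles can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not Spark or GridGain code: the job functions, the `clicks` field, and the plain dictionary standing in for a shared in-memory cache are all hypothetical.

```python
import json
import os
import tempfile


def job_a_via_disk(path):
    # Job A persists its result so that a later job can find it.
    with open(path, "w") as f:
        json.dump({"clicks": 42}, f)


def job_b_via_disk(path):
    # Job B must read and deserialize the file before it can continue.
    with open(path) as f:
        state = json.load(f)
    return state["clicks"] + 1


def job_a_via_memory(store):
    # Job A writes its result into the shared in-memory store.
    store["clicks"] = 42


def job_b_via_memory(store):
    # Job B reads the same entry directly; no file round trip is needed.
    return store["clicks"] + 1


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "state.json")
    job_a_via_disk(path)
    disk_result = job_b_via_disk(path)

shared_store = {}  # stand-in for a cache that outlives a single job
job_a_via_memory(shared_store)
memory_result = job_b_via_memory(shared_store)
```

Both paths produce the same result; the in-memory path simply skips the serialize, write, read, and deserialize steps between jobs.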

The Shared RDD API implemented in GridGain is one of the main integrations between GridGain and Apache Spark. Spark shared RDDs are essentially wrappers around GridGain caches, which can be deployed directly inside the Spark processes that execute Spark jobs. Spark shared RDDs can also be used with the cache-aside pattern, where GridGain clusters are deployed separately from Spark but still hold the data in memory. In both cases the data is accessed using the Spark RDD APIs.

IgniteContext is the main entry point into GridGain RDDs. It allows users to specify different GridGain configurations, and GridGain can be accessed in client mode or server mode. Users can create new shared RDDs, which essentially means that new GridGain caches are created with different configurations and different indexing strategies. GridGain caches can be fully replicated or partitioned, which supports a variety of partitioning and replication strategies.
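
The difference between partitioned and fully replicated caches can be sketched as follows. This is a deliberately simplified Python model, not GridGain's actual data placement logic: the node names, the `owner` function, and the modulo-based placement are assumptions for illustration, and backups and rebalancing are ignored.

```python
import zlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical cluster members


def owner(key, nodes=NODES):
    # crc32 gives a hash that is stable across runs, unlike built-in hash().
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


def place_partitioned(data, nodes=NODES):
    # Each entry is stored on exactly one node.
    placement = {n: {} for n in nodes}
    for key, value in data.items():
        placement[owner(key, nodes)][key] = value
    return placement


def place_replicated(data, nodes=NODES):
    # Every node holds a full copy of the data.
    return {n: dict(data) for n in nodes}


data = {f"key-{i}": i for i in range(12)}
partitioned = place_partitioned(data)
replicated = place_replicated(data)
```

In this model a partitioned cache spreads the 12 entries across the nodes so capacity grows with the cluster, while a replicated cache keeps all 12 entries on every node so any node can serve any read locally.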

Everything that can be done in GridGain can be done with IgniteContext by passing a proper GridGain configuration. GridGain RDDs are accessed using the native Spark RDD syntax. The main difference is that GridGain RDDs are mutable while Spark RDDs are immutable. Because they are mutable, GridGain RDDs can be updated during or at the end of every job or task execution, so other applications and jobs can be notified of the change and read the new state.

Apache Spark Plus GridGain for Faster SQL Queries

Apache Spark supports a fairly rich SQL syntax. However, it doesn’t support indexing the data, so Spark must do a full scan for every query. Spark queries may take minutes, even on moderately small data sets. GridGain supports SQL indexes for faster queries, so Spark SQL can be accelerated over 1,000x when using Spark plus GridGain. The result set returned by GridGain shared RDDs also supports the Spark DataFrame API, so it can be analyzed further using standard Spark DataFrames. Both Apache Spark and GridGain natively integrate with Apache YARN and Apache Mesos, so they can easily be used together.
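
Why an index avoids the full scan can be shown with a short Python sketch. It is a conceptual illustration only and uses neither Spark SQL nor GridGain: the row layout, the `city` column, and the dictionary-based index are hypothetical, and the counts below describe this toy data set rather than any benchmark.

```python
from collections import defaultdict

# Hypothetical rows of (id, city, amount).
rows = [(i, "city-%d" % (i % 100), i * 10) for i in range(10_000)]


def full_scan(rows, city):
    # Without an index, every row is examined for every query.
    examined = 0
    hits = []
    for row in rows:
        examined += 1
        if row[1] == city:
            hits.append(row)
    return hits, examined


def build_index(rows):
    # One pass up front maps each city to the rows that contain it.
    index = defaultdict(list)
    for row in rows:
        index[row[1]].append(row)
    return index


def indexed_lookup(index, city):
    # With the index, only the matching rows are touched.
    hits = index.get(city, [])
    return hits, len(hits)


index = build_index(rows)
scan_hits, scan_examined = full_scan(rows, "city-7")
index_hits, index_examined = indexed_lookup(index, "city-7")
```

Both functions return the same rows, but the scan examines all 10,000 rows while the indexed lookup touches only the 100 that match. The actual speedup in a real system depends on data volume and on how selective the query is.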

Shared In-Memory File System with Apache Spark Plus GridGain

When working with files instead of RDDs, it is still possible to share state between Spark jobs and applications using the GridGain In-Memory File System (IGFS). IGFS implements the Hadoop FileSystem API and can be deployed as a native Hadoop file system, just like HDFS. GridGain plugs in natively to any Hadoop or Spark environment, so the in-memory file system can be used in plug-and-play fashion with zero code changes.

The Benefits of Apache Spark Plus GridGain

GridGain and Spark are both in-memory computing solutions, but they target different use cases and are complementary to each other. They can be used together in in-memory use cases to achieve superior results:

  • GridGain can provide shared storage so state can be passed from one Spark application or job to another
  • GridGain can provide SQL with indexing so Spark SQL can run over 1,000x faster
  • The GridGain In-Memory File System (IGFS) can share state between Spark jobs and applications when working with files instead of RDDs