Persistent Apache Spark RDDs with the GridGain In-Memory Data Fabric

Share State Across Spark Applications or Jobs

The GridGain in-memory computing platform includes an in-memory data grid, an in-memory database, streaming analytics, and acceleration solutions for Hadoop and Spark. Apache Spark is an open source, fast, general-purpose engine for large-scale data processing. Both GridGain and Spark are in-memory computing solutions, but they target different use cases. In many cases, they achieve better performance and richer functionality when used together.

Apache Spark Shared RDDs Powered by GridGain

Apache Spark is built for in-memory processing of event-driven data. Spark doesn’t provide shared storage, so data that has been through ETL must be loaded from HDFS or other disk-based storage into Spark for processing. State can be passed from one Spark job to the next only by saving the processed data back to external storage. GridGain can share Spark state directly in memory, without writing it to disk.

The Shared RDD API implemented in GridGain is one of the main integration points between GridGain and Apache Spark. Spark shared RDDs are essentially wrappers around GridGain caches, which can be deployed directly inside the Spark processes that execute Spark jobs. Spark shared RDDs can also be used with the cache-aside pattern, in which GridGain clusters are deployed separately but still hold the data in memory. In both cases the data remains accessible through the Spark RDD APIs.

IgniteContext is the main entry point into GridGain RDDs. It lets users specify different GridGain configurations and access GridGain in either client or server mode. Creating a new shared RDD creates a new GridGain cache, and each cache can have its own configuration and indexing strategy. GridGain supports a variety of partitioning and replication strategies, including fully replicated and partitioned caches.
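As a rough illustration of the entry point described above, the following Scala sketch creates an IgniteContext on top of a SparkContext and obtains a shared RDD backed by a partitioned cache. It assumes the open source Apache Ignite 2.x ignite-spark module (where IgniteContext and IgniteRDD live) is on the classpath, that a cluster is already running separately, and that the job is launched with spark-submit. The application name, the cache name sharedNumbers, and the single-backup setting are hypothetical values chosen for illustration; they do not come from this article.

```scala
import org.apache.ignite.cache.CacheMode
import org.apache.ignite.configuration.{CacheConfiguration, IgniteConfiguration}
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
import org.apache.spark.{SparkConf, SparkContext}

object SharedRddEntryPoint {
  // Hypothetical cache name, used only in this sketch.
  val CacheName = "sharedNumbers"

  // Cache settings are ordinary configuration objects. Here: a partitioned
  // cache with one backup copy per partition. CacheMode.REPLICATED would
  // instead keep a full copy of the data on every node.
  def cacheConfig(): CacheConfiguration[Int, Int] =
    new CacheConfiguration[Int, Int](CacheName)
      .setCacheMode(CacheMode.PARTITIONED)
      .setBackups(1)

  // Builds the IgniteContext and returns it together with the shared RDD.
  def openSharedRdd(sc: SparkContext): (IgniteContext, IgniteRDD[Int, Int]) = {
    // The closure is evaluated in each Spark process, so each process builds
    // its own IgniteConfiguration. Assumption: in Ignite 2.x this two-argument
    // constructor defaults to standalone mode, where Spark processes join a
    // separately deployed cluster as clients.
    val igniteContext = new IgniteContext(sc, () => new IgniteConfiguration())
    val sharedRdd = igniteContext.fromCache[Int, Int](cacheConfig())
    (igniteContext, sharedRdd)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-rdd-entry-point"))
    val (igniteContext, sharedRdd) = openSharedRdd(sc)

    // IgniteRDD is a regular RDD of (key, value) pairs, so count() works as usual.
    println(s"Entries currently in $CacheName: ${sharedRdd.count()}")

    igniteContext.close(false) // leave the cluster nodes running
    sc.stop()
  }
}
```

Keeping the cache configuration in its own function makes it easy to swap in a replicated cache or a different backup count without touching the Spark side of the code.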

Everything that can be done in GridGain can be done through IgniteContext by passing the appropriate GridGain configuration. GridGain RDDs expose the native Spark RDD API, so they can be used with standard Spark RDD syntax. The main difference is that GridGain RDDs are mutable, while Spark RDDs are immutable. Because they are mutable, GridGain RDDs can be updated during or at the end of every job or task execution, so other applications and jobs can be notified of the changes and read the updated state.
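The difference between a mutable shared RDD and an immutable Spark RDD can be sketched as follows: one application publishes key/value pairs into the shared cache, and a second, independent application later reads them with ordinary RDD operations. This is a minimal sketch, assuming the Apache Ignite 2.x ignite-spark module and a separately running cluster that outlives both Spark applications. The object name, the cache name sharedNumbers, and the helper functions are hypothetical and exist only for this example.

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SharedState {
  // Hypothetical cache name shared by the writer and the reader.
  val CacheName = "sharedNumbers"

  // Plain Spark helper: an immutable RDD of (i, i * i) pairs for i in 1..n.
  def squares(sc: SparkContext, n: Int): RDD[(Int, Int)] =
    sc.parallelize(1 to n).map(i => (i, i * i))

  // Application A: writes (or updates) state in the shared cache.
  def publish(sc: SparkContext, n: Int): Unit = {
    val igniteContext = new IgniteContext(sc, () => new IgniteConfiguration())
    val sharedRdd = igniteContext.fromCache[Int, Int](CacheName)

    // savePairs mutates the underlying cache in place. With overwrite = true,
    // values for keys that already exist are replaced rather than kept.
    sharedRdd.savePairs(squares(sc, n), overwrite = true)

    igniteContext.close(false) // keep the cluster (and the data) alive
  }

  // Application B: reads the state that application A left in memory.
  def countLargeSquares(sc: SparkContext, threshold: Int): Long = {
    val igniteContext = new IgniteContext(sc, () => new IgniteConfiguration())
    val sharedRdd = igniteContext.fromCache[Int, Int](CacheName)

    // Standard RDD transformations and actions apply to the shared RDD.
    val matches = sharedRdd.filter { case (_, square) => square > threshold }.count()

    igniteContext.close(false)
    matches
  }
}
```

In this sketch the two functions would normally run in different Spark applications. Nothing is written to disk between them, because the cache itself is the hand-off point.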

Shared In-Memory File System with Apache Spark Plus GridGain

When working with files instead of RDDs, it is still possible to share state between Spark jobs and applications by using the GridGain In-Memory File System (GGFS). GGFS implements the Hadoop FileSystem API and can be deployed as a native Hadoop file system, just like HDFS. GridGain plugs natively into any Hadoop or Spark environment, so the in-memory file system can be used in plug-and-play fashion with zero code changes.
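To make the "zero code changes" point concrete, here is a minimal Scala sketch of a file-based Spark job that never names its storage layer. Spark resolves input and output URIs through the Hadoop FileSystem API, so the same job could be pointed at HDFS or at an in-memory file system simply by passing different URIs. The object name and the word-count logic are hypothetical. The URI scheme under which GGFS is registered depends on the Hadoop configuration of the deployment and is not given in this article, so the sketch takes both locations as arguments.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SharedFiles {
  // Storage-agnostic logic: counts how often each word occurs in the lines.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: SharedFiles <inputUri> <outputUri>")
      sys.exit(1)
    }
    val Array(inputUri, outputUri) = args

    val sc = new SparkContext(new SparkConf().setAppName("shared-files"))

    // textFile and saveAsTextFile go through the Hadoop FileSystem API, so
    // any file system registered in the Hadoop configuration can be used
    // here without changing this code.
    countWords(sc.textFile(inputUri))
      .map { case (word, n) => s"$word\t$n" }
      .saveAsTextFile(outputUri)

    sc.stop()
  }
}
```

A later job can read the output location the same way, which is how state is shared through files without any GridGain-specific calls in the job itself.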