Share State Across Persistent Apache Spark Applications or Jobs

The GridGain® in-memory computing platform includes an in-memory data grid, an in-memory database, streaming analytics, a continuous learning framework for machine and deep learning, and acceleration solutions for Hadoop and Spark. Apache Spark is a fast, general-purpose open source engine for large-scale data processing. GridGain and Spark are both in-memory computing solutions, but they target different use cases. In many cases, such as architectures that need persistent Apache Spark RDDs and DataFrames, the two achieve better performance and functionality when used together.


Apache Spark Shared RDDs and DataFrames Powered by GridGain

Apache Spark is built for in-memory processing of event-driven data. Spark doesn’t provide shared storage, so data that has been through ETL must be loaded into Spark from HDFS or another disk-based store before it can be processed. State can be passed from one Spark job to the next only by saving the processed data back to external storage. GridGain removes that round trip: by acting as an in-memory data store for the Spark data, it shares Spark state directly in memory without writing it to disk.

The Shared RDD and DataFrames APIs implemented in GridGain are the main integration points between GridGain and Apache Spark. Spark shared RDDs are essentially wrappers around GridGain caches, and those caches can be deployed directly inside the Spark processes that execute Spark jobs. Spark shared RDDs can also be used with the cache-aside pattern, where the GridGain cluster is deployed separately from Spark but still holds the data in memory. In both deployments the data remains accessible through the Spark RDD APIs.

IgniteContext is the main entry point into GridGain RDDs. It lets users specify different GridGain configurations, and GridGain can be accessed in either client or server mode. Creating a new shared RDD creates a new GridGain cache, and each cache can have its own configuration and indexing strategy. Caches can be fully replicated or partitioned, so a variety of partitioning and replication strategies are available.
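
To make the difference between partitioned and fully replicated caches concrete, here is a toy model in plain Python. It is a sketch for illustration only and does not use GridGain or IgniteContext: the node names, the `partitioned_owner` and `placement` helpers, and the simple modulo hash are all assumptions made for this example, and GridGain's real key-to-node mapping is more elaborate.

```python
from zlib import crc32

# Hypothetical node names; a real cluster discovers its own topology.
NODES = ["node-a", "node-b", "node-c"]


def partitioned_owner(key, nodes=NODES):
    """Return the one node that stores `key` in a partitioned cache (toy model)."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    digest = crc32(str(key).encode("utf-8"))
    return nodes[digest % len(nodes)]


def placement(keys, mode, nodes=NODES):
    """Map each node to the keys it would hold under the given cache mode."""
    if mode not in ("partitioned", "replicated"):
        raise ValueError(f"unknown cache mode: {mode!r}")
    layout = {node: [] for node in nodes}
    for key in keys:
        # Partitioned: one owner per key. Replicated: every node holds every key.
        owners = [partitioned_owner(key, nodes)] if mode == "partitioned" else nodes
        for node in owners:
            layout[node].append(key)
    return layout


if __name__ == "__main__":
    keys = list(range(6))
    print(placement(keys, "partitioned"))
    print(placement(keys, "replicated"))
```

In this model a partitioned cache stores each key on exactly one node, so capacity grows with the cluster, while a replicated cache repeats every key on every node, so any node can serve any read locally.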

Everything that can be done in GridGain can be done with IgniteContext by passing it a suitable GridGain configuration. GridGain RDDs are accessed with the native Spark RDD syntax. The main difference is that GridGain RDDs are mutable, while Spark RDDs are immutable. Because they are mutable, GridGain RDDs can be updated during or at the end of every job or task execution, and other applications and jobs can be notified of the change and read the updated state.
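
The idea of jobs sharing mutable state in memory can be sketched with a second toy model in plain Python. Again, this is not GridGain code: `SharedCache`, `totals_job`, and `report_job` are hypothetical names invented for this example, and a dictionary inside one process stands in for a cache that would really live in a cluster and outlive any single job.

```python
class SharedCache:
    """Toy stand-in for an in-memory cache that outlives any single job."""

    def __init__(self):
        self._data = {}

    def put_all(self, pairs):
        # Updates happen in place, which is what makes the state mutable.
        self._data.update(pairs)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def items(self):
        # Return a copy so readers cannot change the cache by accident.
        return list(self._data.items())


def totals_job(cache, events):
    """Job 1: add each event's amount to the running total held in the cache."""
    updates = {}
    for user, amount in events:
        base = updates.get(user, cache.get(user, 0))
        updates[user] = base + amount
    cache.put_all(updates)


def report_job(cache, threshold):
    """Job 2: read the totals that job 1 left in memory; no file is involved."""
    return sorted(user for user, total in cache.items() if total >= threshold)


if __name__ == "__main__":
    cache = SharedCache()
    totals_job(cache, [("ann", 5), ("bob", 2), ("ann", 4)])
    print(report_job(cache, 5))
    totals_job(cache, [("bob", 7)])
    print(report_job(cache, 5))
```

The second job never reloads anything from disk: it sees whatever the first job most recently wrote. With immutable Spark RDDs alone, the first job would have to save its result to external storage before the second could read it.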

Shared In-Memory File System with Apache Spark Plus GridGain

When working with files instead of RDDs or DataFrames, it is still possible to share state between Spark jobs and applications by using the GridGain In-Memory File System (GGFS). GGFS implements the Hadoop FileSystem API and can be deployed as a native Hadoop file system, just like HDFS. As a result, GridGain plugs natively into any Hadoop or Spark environment, and the in-memory file system can be used in plug-and-play fashion with no code changes.