GridGain Integration with Apache Spark

1,000x Faster Spark Queries and Shared RDDs

The GridGain® in-memory computing platform provides Apache® Spark™ data management for streaming data, machine learning, and big data analytics with real-time responsiveness and unlimited horizontal scalability. The GridGain integration with Apache Spark is the broadest provided by any in-memory computing platform, and makes in-memory data management for Spark simple. GridGain provides native support for Spark DataFrames, a GridGain RDD API for reading and writing data to GridGain as mutable Spark RDDs, and an in-memory implementation of HDFS with the Ignite File System (IGFS). Spark developers can easily read, write and share streaming data and state across Spark jobs, in memory, using existing Spark APIs.

Companies can use GridGain to process streaming data and data at rest together for Spark in real time, whether to perform data preparation, big data analytics, or machine learning. GridGain provides a massively parallel processing (MPP) architecture that delivers up to 1,000x performance gains. It collocates processing with the data to perform computations on massive data sets in place, without waiting for data to move across the network into a separate infrastructure. GridGain uses MPP for built-in distributed SQL, advanced analytics, and machine and deep learning.
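The idea of collocating processing with the data can be illustrated with a toy model. The sketch below is plain Python, not GridGain or Spark code: the four "nodes" are in-process lists, and `node_for`, `load`, `local_sums`, and `collocated_total_by_key` are hypothetical names chosen for illustration. It assumes each key is owned by exactly one node, so every node can aggregate its own records and only the small per-key results need to be combined.

```python
from collections import defaultdict
from zlib import crc32

NODES = 4  # hypothetical cluster size for this toy model


def node_for(key: str) -> int:
    """Map a key to a node with a stable hash (a stand-in for an affinity function)."""
    return crc32(key.encode("utf-8")) % NODES


def load(records):
    """Place each (key, value) record on the node that owns its key."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[node_for(key)].append((key, value))
    return partitions


def local_sums(partition):
    """Runs 'on the node': aggregate only the records stored there."""
    sums = defaultdict(int)
    for key, value in partition:
        sums[key] += value
    return dict(sums)


def collocated_total_by_key(partitions):
    """Combine per-node results and count how many items had to be 'moved'."""
    merged = {}
    moved = 0
    for partition in partitions.values():
        partial = local_sums(partition)
        moved += len(partial)   # only the aggregates leave the node
        merged.update(partial)  # safe: a key lives on exactly one node
    return merged, moved


records = [(f"customer-{i % 5}", i) for i in range(1000)]
partitions = load(records)
totals, moved = collocated_total_by_key(partitions)
print(f"{len(records)} records stored, {moved} aggregate values moved")
```

In this model the number of values that cross node boundaries equals the number of distinct keys, not the number of records, which is the effect the paragraph above describes.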

Apache Spark Plus GridGain for Faster SQL Queries

Apache Spark supports a fairly rich SQL syntax. However, it doesn’t support data indexing, so Spark must perform a full scan for every query. As a result, Spark queries may take minutes, even on moderately small data sets. GridGain provides fast SQL capabilities and accelerates SparkSQL by optimizing Spark’s query execution plans to leverage GridGain’s distributed SQL and advanced indexing. Any data in a GridGain cluster is accessible using ANSI-99 SQL or through APIs for a host of programming languages. Developers can also write and distribute their own code for MPP and expose it as microservices. Both Apache Spark and GridGain natively integrate with Apache YARN and Apache Mesos, so they can easily be used together.
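The difference between a full scan and an indexed lookup can be seen in any SQL engine that exposes its query plan. The sketch below uses Python's built-in sqlite3 module purely as a stand-in; it is not GridGain or Spark, and the `person` table, the `idx_person_city` index, and the `plan` helper are hypothetical names for illustration. It runs the same query before and after creating an index and prints the plan the engine chooses each time.

```python
import sqlite3


def plan(conn, sql, params=()):
    """Return the engine's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail text.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO person (name, city) VALUES (?, ?)",
    [(f"name-{i}", f"city-{i % 100}") for i in range(10_000)],
)

query = "SELECT name FROM person WHERE city = ?"

# Without an index the engine has to examine every row of the table.
before = plan(conn, query, ("city-7",))

# With an index on the filtered column it can jump to the matching rows.
conn.execute("CREATE INDEX idx_person_city ON person (city)")
after = plan(conn, query, ("city-7",))

rows = conn.execute(query, ("city-7",)).fetchall()
print("before:", before)
print("after: ", after)
print("matching rows:", len(rows))
```

The exact plan wording varies between SQLite versions, but the first plan reports a scan of the table and the second reports a search that uses the index.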

Machine Learning and HTAP

The combination of Spark and GridGain for end-to-end in-memory computing enables companies to rapidly deliver new in-process hybrid transactional/analytical processing (HTAP) applications. With Spark and GridGain, companies can train models against massive data sets and re-run training “mid-stream” during Spark processing to improve the models based on the latest data. As Spark processes streams, it can access all the data, learning, and processing power in GridGain in real time to gain insights, share state across Spark jobs, automate decisions, and save the results. Real-time responsiveness and automation depend on this combination: Spark performs the stream processing in memory, while GridGain keeps the data management, machine learning, and big data analytics in memory and in place.

Solution | Apache Spark | Apache Spark + GridGain
Data Management | Apache Spark loads data for processing from other storages, usually disk-based, and then discards the data when the processing is finished. It doesn’t store data. | GridGain provides the data in-memory via RDDs, DataFrames or HDFS APIs. Data can be directly loaded into GridGain, enriched and provided as RDDs or DataFrames; or processed by Spark as RDDs and DataFrames that interact directly with GridGain to maximize performance and minimize big data movement.
State Management | Apache Spark has no simple built-in mechanism for sharing state while stream processing. (RDDs and DataFrames don’t support in-place mutation) | GridGain provides a mutable RDD API, as well as the ability to write DataFrames to the same underlying store that makes it much easier to store and share state across Spark jobs and over time.
SQL Support | SparkSQL does not scale very well as data grows due to a lack of advanced indexing. | GridGain improves SparkSQL query performance up to 1000x by optimizing Spark query execution plans to use GridGain distributed SQL, which distributes data based on data affinity, collocates processing to minimize network data movement, and leverages advanced indexing.
Analytics | Spark requires code to perform any analytics and store the results. | Spark can leverage GridGain’s MPP for distributed SQL and advanced analytics directly at petabyte scale, then store and access results.
Machine Learning | Spark is not well suited for training models against massive data sets that require any type of processing or algorithm not suited for streaming. It also requires moving all of the data at least once, which can take hours. | GridGain includes built-in machine learning optimized for MPP-style distributed processing that runs the machine learning in place, making it possible to train and re-train models in real-time while Spark is processing.

The Benefits of Apache Spark Plus GridGain

GridGain and Spark are both in-memory computing solutions, but they target different use cases and are complementary to each other. They can be used together in many use cases to achieve superior results:

  • GridGain can provide shared storage so state can be passed from one Spark application or job to another
  • GridGain can provide SQL with indexing so Spark SQL can run up to 1,000x faster
  • The GridGain In-Memory File System (GGFS) can share state between Spark jobs and applications when working with files instead of RDDs