Apache Spark is a hugely popular open source cluster computing framework that has spread like wildfire since it was introduced to the Apache community. It's a key tool for working with streaming data, machine learning, and graph analytics. Spark's adoption has been breathtakingly rapid, and it has become the primary engine for processing big data.
The excitement was palpable at Spark Summit 2017 in San Francisco earlier this week. Over 3,000 attendees enjoyed three days of in-depth tutorials, lectures, and an exciting exhibit floor. That is a 20% increase in attendance over last year.
Denis Magda, GridGain's product manager and PMC Chair of Apache® Ignite™, gave a well-attended lecture on Wednesday afternoon entitled "Apache Spark and Apache Ignite: Where Fast Data Meets the IoT". Denis discussed how to build a fast data solution to receive and process streaming data using an Apache Ignite cluster and Apache Spark.
An Evolving Ecosystem
What may be even more breathtaking is the rapid growth of a commercial ecosystem around Spark. Numerous companies, some old and some new, have launched major initiatives around improving and extending Apache Spark. Many products integrate Spark into existing computing frameworks, both software and hardware products optimize Spark processing, and libraries enhance Spark itself. Plenty of companies have also leveraged Spark to build some amazing products that offer real-time insight into streaming data.
I particularly enjoyed Intel's demo of Spotlight, a web-based tool that leverages Spark to help Thorn: Digital Defenders of Children provide a critical service to law enforcement in the United States. Spotlight uses AI to identify missing children from photographs discovered across the web. The demo showed how hard and time-consuming it is for a human to make these identifications, and then showed how fast and easy it is when guided by AI.
The need for real-time big data analysis is also driving the increased popularity of in-memory computing. Several exhibitors showed off some aspect of in-memory enhancements for Spark. The most mature is GridGain, the commercial version of Apache Ignite, because it is a complete in-memory computing platform containing an in-memory Data Grid, Compute Grid, and SQL Grid, and it is widely used for streaming analytics.
GridGain Enhances Spark
Apache Ignite and GridGain are in-memory solutions that complement Apache Spark. While Apache Spark supports SQL, it does not index the data, so it must do a full scan to respond to each query. GridGain supports SQL indexes for faster queries, so Spark SQL can be accelerated up to 100x when using Spark plus GridGain. GridGain also provides shared RDDs, so multiple Spark jobs can access the same RDD. It's also possible to share state between Spark jobs and applications using the GridGain In-Memory File System.
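To make the full-scan versus index point concrete, here is a minimal sketch in plain Python. It does not use Spark, Ignite, or GridGain APIs; the table, the function names, and the hash-map index are all hypothetical stand-ins chosen for illustration. It only shows why a lookup through an index can skip the row-by-row work a scan has to do.

```python
from collections import defaultdict

# Hypothetical toy table: (id, city) rows standing in for a cached data set.
ROWS = [(i, "city_%d" % (i % 1000)) for i in range(100_000)]


def full_scan(rows, city):
    """Check every row, as an engine without an index has to do."""
    return [row_id for row_id, row_city in rows if row_city == city]


def build_index(rows):
    """Build a hash index mapping each city to the ids of its rows."""
    index = defaultdict(list)
    for row_id, row_city in rows:
        index[row_city].append(row_id)
    return index


def indexed_lookup(index, city):
    """Answer the same query from the index without reading other rows."""
    return index.get(city, [])


if __name__ == "__main__":
    index = build_index(ROWS)
    # Both paths return the same ids; only the amount of work differs.
    assert full_scan(ROWS, "city_42") == indexed_lookup(index, "city_42")
    print(len(indexed_lookup(index, "city_42")))
```

The index costs one pass over the data to build, and every later query on that column becomes a single dictionary lookup instead of a pass over all rows. Real SQL indexes are typically tree-based and also serve range queries, so treat this as an analogy rather than a description of how GridGain implements them.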
Download a free trial and start exploring how GridGain improves Spark today!