Enabling Real-Time Analytics for Hadoop Data Lakes with GridGain

In this IMCS Europe 2019 session, Denis Magda describes how Apache Ignite and GridGain, as in-memory computing platforms, can modernize existing data lake architectures and enable real-time analytics that spans operational, historical, and streaming data sets.

Data lakes, such as those powered by Hadoop, are an excellent choice for analytics and reporting at scale. Hadoop scales horizontally and cost-effectively and handles long-running operations that span large data sets. However, the continual growth of real-time analytics requirements, where operations must complete in seconds rather than minutes, or in milliseconds rather than seconds, has brought new challenges to Hadoop-based solutions.

In particular, you'll learn the following:

  • How to choose the right deployment mode and responsibilities when working with GridGain and Hadoop
  • How to determine which operations should be handled by GridGain and which should be sent to Hadoop
  • How to use Spark DataFrames to run federated (aka cross-database) queries that span GridGain and Hadoop
  • How to perform initial data loading from Hadoop to GridGain
  • How to set up bi-directional synchronization between Hadoop and GridGain

Presenters
Denis Magda
VP, Developer Relations in R&D at GridGain; Apache Ignite committer and PMC member