Enabling Real-Time Analytics for Hadoop Data Lakes with GridGain
Data lakes, such as those powered by Hadoop, are an excellent choice for analytics and reporting at scale. Hadoop scales horizontally and cost-effectively and handles long-running operations that span large data sets. However, the continual growth of real-time analytics requirements, where operations need to complete in seconds rather than minutes, or milliseconds rather than seconds, has brought new challenges to Hadoop-based solutions.
In this session, Denis Magda, GridGain VP of Product and Apache Ignite PMC Chair, describes how Apache® Ignite™ and GridGain®, as in-memory computing platforms, can modernize existing data lake architectures and enable real-time analytics that spans operational, historical, and streaming data sets.
In particular, you'll learn the following:
- How to choose the right deployment mode and divide responsibilities between GridGain and Hadoop
- How to determine which operations should be handled by GridGain and which should be sent to Hadoop
- How to use Spark DataFrames to run federated (also known as cross-database) queries that span GridGain and Hadoop
- How to perform initial data loading from Hadoop to GridGain
- How to set up bi-directional synchronization between Hadoop and GridGain