HydraDragon: Hybrid Transactional/Analytical Storage Engine for Apache Ignite

Apache Ignite's default storage engine is sophisticated enough to enable us to use the database for various use cases, ranging from transactional workloads to real-time analytics. The multi-tiered storage architecture allows you to configure Ignite as a distributed, in-memory cache without persistence or to have Ignite function as a hybrid transactional/analytical database that scales beyond available memory capacity. Built-in support for essential APIs (SQL, key-value), high-performance computing, and real-time streaming makes Ignite a database that is easy to transition to and innovate with.

When best practices are followed and optimization is applied, Ignite can easily give us up to a 100x performance boost. One might ask, “Isn’t that enough?” Certainly, this level of performance satisfies the needs of many users. However, we kept searching for ways to boost performance by an additional 10x to 100x, and we found what works: hardware-level and storage-level optimizations. So, we began working on HydraDragon.

HydraDragon Overview

HydraDragon is a hybrid transactional/analytical storage engine for Apache Ignite. The engine extends Ignite’s default storage engine, which is optimized for operational and high-performance computing workloads. HydraDragon introduces hardware- and storage-level optimizations, many of which are specific to data-intensive, analytical use cases.

HydraDragon consists of three planes (or layers)—data formats, storage, and compute. In Figure 1, the optimizations that exist in the default Ignite storage engine and in HydraDragon are in red, and the optimizations that are unique to HydraDragon are in orange.

Figure 1. HydraDragon architecture

The data formats plane defines the format that is used to serialize and organize records. Ignite 2 currently supports the row format, which stores records one by one in contiguous memory blocks. When records are read, written to, or updated, B+ trees are used to access them (see Ignite memory architecture for details). In Ignite 3, LSM tree support will be added. HydraDragon supports not only the row format but also a columnar format that organizes data in sequences of columns and places all the data entries that are associated with a column next to each other. The columnar format enables ad hoc analytical queries, such as queries that aggregate large volumes of data for a subset of columns.
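To make the difference between the two formats concrete, here is a minimal sketch in plain Python. It does not use Ignite or HydraDragon APIs. The sample records, the `row_store` and `column_store` names, and both helper functions are assumptions made purely for illustration.

```python
from array import array

# Hypothetical sample records: (id, age, city). Not taken from the article.
rows = [
    (1, 34, "Oslo"),
    (2, 41, "Lima"),
    (3, 29, "Oslo"),
    (4, 52, "Pune"),
]

# Row format: each record is kept whole, and records follow one another.
row_store = list(rows)

# Columnar format: all values of one column sit next to each other.
column_store = {
    "id": array("q", (r[0] for r in rows)),
    "age": array("q", (r[1] for r in rows)),
    "city": [r[2] for r in rows],
}


def average_age_row(store):
    """Visits every whole record even though only the age field is needed."""
    return sum(record[1] for record in store) / len(store)


def average_age_columnar(store):
    """Reads one contiguous sequence of ages and nothing else."""
    ages = store["age"]
    return sum(ages) / len(ages)


row_result = average_age_row(row_store)
col_result = average_age_columnar(column_store)
print(row_result, col_result)
```

Both functions return the same answer. The point of the sketch is the access pattern: the columnar version never touches the `id` or `city` values, which is what makes aggregations over a subset of columns cheaper on large datasets.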

The storage plane defines the storage types (physical devices) where your data can reside. For years, the Ignite default storage engine has supported DRAM, HDD, and SSD. To this collection of storage devices HydraDragon adds native support for Intel Optane Persistent Memory (through App Direct Mode) and GPU memory. In the following Sample Use Case sections, you’ll see the benefits that these two types of memory provide.

The compute plane enumerates the processing units that HydraDragon supports. When an Ignite node processes application requests, it uses the CPU’s standard instruction set to make things happen. To be more precise, the JVM and the OS split requests into multiple CPU commands, and Ignite, as Java middleware, takes this behavior for granted. HydraDragon is more involved in the execution of commands. It natively supports the SIMD (single instruction, multiple data) and GPU instruction sets to exploit hardware-level parallelism. If HydraDragon decides that vectorized instructions can speed up a running request, it directs the CPUs and GPUs to use those instructions. Otherwise, the JVM or OS uses the CPU’s standard instruction set. Most of these decisions will be automated and, therefore, will work transparently. However, you will be able to control and configure some of them.
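The idea of choosing between a vectorized path and a standard path can be sketched as follows. This is an emulation in plain Python, so it does not execute real SIMD instructions, and it is not HydraDragon's actual decision logic. The lane width `LANES`, the `min_vector_length` threshold, and all function names are hypothetical.

```python
LANES = 4  # values handled per step; an assumed width for illustration


def sum_scalar(values):
    """Standard path: one value per step."""
    total = 0
    for v in values:
        total += v
    return total


def sum_lanewise(values, lanes=LANES):
    """SIMD-style reduction, emulated: keep `lanes` partial sums and
    advance all of them in each step."""
    partial = [0] * lanes
    full = len(values) - len(values) % lanes
    for i in range(0, full, lanes):
        # On real hardware, this inner loop would be one vector instruction.
        for lane in range(lanes):
            partial[lane] += values[i + lane]
    # Values that do not fill a whole vector are handled by the scalar path.
    return sum(partial) + sum_scalar(values[full:])


def choose_and_sum(values, min_vector_length=16):
    """Hypothetical dispatch rule: vectorize only when the input is long
    enough for the wider steps to pay off."""
    if len(values) >= min_vector_length:
        return "vectorized", sum_lanewise(values)
    return "scalar", sum_scalar(values)


long_result = choose_and_sum(list(range(1, 22)))
short_result = choose_and_sum([1, 2, 3])
print(long_result, short_result)
```

A real engine would base the choice on far more than input length (data types, column layout, available hardware), but the shape is the same: two implementations of one operation and a rule that picks between them.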

Sample Use Case 1: Real-Time Analytics with Affordable In-Memory Speed

It’s fair to say that Ignite performs real-time analytics exceptionally well—especially if the data resides in DRAM. But, analytical workloads must deal with vast datasets and, if all historical records are kept in DRAM, costs can go through the roof. That’s why, even with Ignite, most users store historical data on SSDs, thus absorbing the inevitable hit on performance.

The beauty of DRAM as a storage device is that it’s byte-addressable. Therefore, the CPU can directly read from or write to any memory location. With SSDs, the CPU must wait patiently as a block of data that contains the requested record is copied (loaded) from disk to memory. Only after the record is copied can the CPU access the record in the byte-addressable way. As the copy (load) procedure is repeated for each of the thousands or millions of records that a typical analytical query traverses, the performance hit continues to mount. 
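The cost of block-level access can be estimated with a small model. The sketch below is a simplification under stated assumptions: a 4 KiB block, 64-byte records, records scattered so that each one lives in a different block, and only the most recently loaded block held in memory. None of these figures come from Ignite or from any specific device.

```python
BLOCK_SIZE = 4096   # bytes per device block (assumed)
RECORD_SIZE = 64    # bytes per record (assumed)


def blocks_loaded(record_offsets, block_size=BLOCK_SIZE):
    """Count block copies from disk to memory, assuming only the most
    recently loaded block is still available."""
    loads = 0
    current_block = None
    for offset in record_offsets:
        block = offset // block_size
        if block != current_block:
            loads += 1
            current_block = block
    return loads


# 1,000 records, each placed at the start of a different block.
offsets = [i * BLOCK_SIZE for i in range(1000)]

# Block-addressable storage: a whole block is copied per record.
bytes_copied = blocks_loaded(offsets) * BLOCK_SIZE

# Byte-addressable storage: only the record itself is read.
bytes_read = len(offsets) * RECORD_SIZE

amplification = bytes_copied / bytes_read
print(bytes_copied, bytes_read, amplification)
```

Under these assumptions, the block-based path moves 64 times more data than the query needs. Real systems soften this with caching and sequential layouts, so treat the ratio as an upper-bound illustration rather than a measurement.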

So, DRAM is fast but limited in capacity and expensive, and SSD has large capacity and is cheap but slow. Thus, we need another option for analytical operations. Fortunately, Intel Optane Persistent Memory operating in App Direct Mode is a perfect fit for real-time analytics, offering in-memory speed at an affordable price. 

How is this discussion related to HydraDragon? Assume that your historical data is stored in the columnar format (instead of in the row format) and that the records are physically located in Intel Optane Persistent Memory operating in App Direct Mode (instead of in SSDs). Also, suppose that HydraDragon, in addition to using the standard instruction set, uses the SIMD instruction set whenever possible. In Figure 2, the optimizations that HydraDragon uses are in orange.

Figure 2. Real-time analytics with HydraDragon

By combining capabilities and optimizations, HydraDragon can deliver a performance increase of up to 100x for real-time analytics. Consider the following:

  • First, the columnar format is superior when you need to scan through an entire dataset, comparing or aggregating the values of a specific column (such as values for date, age, or location).
  • Second, when Intel Optane Persistent Memory in App Direct Mode is used for physical storage, the CPU accesses columnar data directly in the byte-addressable way, eliminating the need to copy records from Optane to DRAM!
  • Third, vectorized execution, which uses the SIMD instruction set, exploits hardware-level parallelism for a selected number of analytical queries, thus driving performance even higher.

Sample Use Case 2: GPU Acceleration for Ultrahigh Performance Data Processing

If you are not familiar with GPUs, I recommend that you stop reading and watch this one-minute video. The video compares GPUs to CPUs and illustrates how much faster GPUs can perform. Looks impressive, doesn’t it? 

Initially, GPUs were used primarily in the gaming and animation industries to accelerate graphical operations. Now, GPUs are also used for high-performance computing, machine learning, simulations, and many other tasks across industries such as finance, biotech, manufacturing, and oil and gas.

Ignite is a distributed database for high-performance computing (HPC). Ignite shines in HPC use cases by using all the cores of all the CPUs in parallel. Even the most performant CPUs have on the order of 64 cores, whereas a typical GPU has thousands of processing units. Thus, HydraDragon’s native support of GPUs is a logical step toward getting the most value from hardware-level parallelism.

However, GPUs are not suitable for all types of operations. HydraDragon uses GPUs to optimize data-intensive and compute-intensive SQL operations (vector operations, scans, and complex analytical functions). Also, we’re thinking about integrating GPUs into the Ignite compute APIs so that, as necessary, you can execute parts of your custom compute tasks on GPUs. Finally, as Ignite application developers begin to experiment with Ignite’s support for GPUs, more use cases will emerge.

HydraDragon: How Do I Try It?

When this article was written, HydraDragon was in active development. HydraDragon is based on the Apache Ignite 3 architecture and is planned to be released in 2022.

The full set of HydraDragon capabilities will be available to all Apache Ignite applications through GridGain Nebula, a cloud-native, fully managed service for Ignite. Those who want to use HydraDragon in on-premises environments will be able to do so through a downloadable version of GridGain Platform.

Stay tuned and be ready to boost hybrid transactional/analytical workloads with Ignite 3 and HydraDragon!