Apache Ignite Native Persistence Storage Engine: Overview

The Apache Ignite native persistence storage engine follows a classic database approach based on the ARIES architecture. However, the Ignite developers needed to make some adjustments to the architecture in order to improve development speed and support memory-only storage.

In this blog, I will provide an overview of the Ignite native persistence storage engine and discuss the tradeoffs that were made during its development. I will also delve into the reasons behind these decisions and share the challenges that were encountered during the implementation of the chosen approach in Java. In addition, I'll highlight how the Ignite community successfully navigated these difficulties.

This blog post will appeal to architects and engineers who are interested in gaining a deeper understanding of the ARIES and Apache Ignite architectures.

 

Introduction to Ignite Native Persistence

My name is Alexey, and I am a committer and PMC member of the Apache Ignite project. In this blog post, I would like to discuss the Ignite Native Persistence feature, which was added a few years ago. This post will serve as a retrospective, covering the high-level architecture of native persistence. I will focus on the persistent storage itself, excluding the protocols. Additionally, I’ll explore the trade-offs and implementation choices we made, as well as the obstacles we faced during the Java implementation.

Before diving into the details, let's discuss why we felt the need to add native persistence to Apache Ignite in the first place. In Apache Ignite 1.x, there was a file store SPI that allowed durable data to be kept on disk in an Apache Ignite cluster. However, this feature had its downsides. First, when the cluster was restarted, all previously loaded data had to be preloaded again before SQL queries could be served. With large amounts of data this process took hours, which made maintenance inconvenient. Second, although the data itself could be stored on disk, the key index still resided in memory, so the amount of data the cluster could hold was limited by the available memory. Over time, these restrictions became more and more inconvenient for Ignite users.

To address these limitations, we decided to introduce a new feature that would enable faster restarts and efficient handling of SQL queries after a restart. In this architecture, pre-loading all the data would no longer be necessary. Instead, only the data required to answer a specific query would be pre-loaded. Additionally, we aimed to remove the limitation on the amount of data that could be stored by utilizing disk storage. Furthermore, we wanted to maintain a unified architecture for both in-memory and persistence layer storage. This decision was driven by the desire to keep the codebase simple and avoid the complexity of maintaining two different storage engine implementations.

After conducting initial research, we determined that the ARIES architecture would be a suitable approach. ARIES is a well-known approach in databases and fulfills all our requirements. It supports arbitrary data structures, allowing us to implement the data storage, SQL indexes, and other necessary auxiliary data structures. It provides recoverable durability, enabling data to be stored on disk, and it supports transactions, a feature that has been present in Ignite since its early stages.

Now, let's delve into the details of the ARIES approach and understand its components and underlying workings.

 

Part 1: Pages

The first thing we need to understand is how the data we want to store on disk is organized. The data is split into blocks of equal size, known as pages. Each page has a unique identifier; the specific format of this identifier is not prescribed by the ARIES paper. What matters is that, given an identifier, we can locate the corresponding page on disk. There are different ways to implement this. One approach is to split the logical identifier into two parts: a file ID and a page index. The file ID can be encoded in the file name, while the page index lets us calculate the offset of the page within that file.
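To make this concrete, here is a minimal sketch of one possible page ID encoding, assuming a 4 KB page size and a simple "file ID in the upper bits, page index in the lower bits" layout; this is an illustration, not Ignite's exact internal format:

```java
// Sketch of one possible page ID layout: the upper 32 bits identify the
// partition file and the lower 32 bits hold the page index inside that file.
final class PageId {
    static final int PAGE_SIZE = 4096; // assumed page size in bytes

    static int fileId(long pageId) {
        return (int) (pageId >>> 32);                 // which file to open
    }

    static int pageIndex(long pageId) {
        return (int) pageId;                          // which page inside the file
    }

    static long fileOffset(long pageId) {
        return (long) pageIndex(pageId) * PAGE_SIZE;  // byte offset of the page
    }

    static long pageId(int fileId, int pageIndex) {
        return ((long) fileId << 32) | (pageIndex & 0xFFFFFFFFL);
    }
}
```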

To ensure durability, we also need the ability to load these pages into memory. We allocate a chunk of memory to hold the data loaded from disk, and this memory is divided into page buffers. Each page buffer is an in-memory image of a page. Because Ignite is a multi-threaded system, access to each page must be synchronized, so every page is protected by its own read-write lock: readers acquire it in shared mode and writers acquire it in exclusive mode, which guarantees that reads and writes to a page happen atomically.

As our goal is to store more data than the available memory on each node, there is a need for indirection. We need to know which page buffer holds a particular page. This mapping is implemented as a translation table or hash map, where logical page IDs are mapped to the address of the corresponding page buffer.
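A minimal sketch of such a translation table is shown below. A ConcurrentHashMap is used purely for illustration; the actual implementation avoids per-page heap objects, for the reasons discussed in the lock section later on:

```java
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of the translation table: logical page ID -> address of
// the page buffer currently holding that page in memory.
final class TranslationTable {
    private final ConcurrentHashMap<Long, Long> pageIdToBufferAddr = new ConcurrentHashMap<>();

    /** Returns the page buffer address, or null if the page is not in memory. */
    Long bufferAddress(long pageId) {
        return pageIdToBufferAddr.get(pageId);
    }

    /** Records that the page was loaded into the buffer at the given address. */
    void onPageLoaded(long pageId, long bufferAddr) {
        pageIdToBufferAddr.put(pageId, bufferAddr);
    }

    /** Removes the mapping when a page is evicted. */
    void onPageEvicted(long pageId) {
        pageIdToBufferAddr.remove(pageId);
    }
}
```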

In some cases, a query tries to access a page that is not currently loaded into memory, and a process called page replacement comes into play. Page replacement selects a page to evict from memory in order to make space for the requested one. The selection algorithm is beyond the scope of this discussion; typically, the victim is a page that has been accessed less frequently or was last accessed a long time ago. Once the victim has been evicted, the requested page can be loaded into its place.
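The flow around page replacement can be sketched roughly as follows; the victim-selection policy, dirty-page flushing, and disk reads are left as abstract placeholders, precisely because they are out of scope here:

```java
// Sketch of the page replacement flow around the translation table.
abstract class PageReplacement {
    abstract long selectVictim();                           // e.g. a rarely or least-recently used page
    abstract void flushIfDirty(long pageId, long bufAddr);  // write the victim back if it was modified
    abstract void readFromDisk(long pageId, long bufAddr);  // load the requested page into the buffer

    long acquirePage(long pageId, TranslationTable table) {
        Long addr = table.bufferAddress(pageId);
        if (addr != null)
            return addr;                        // fast path: the page is already in memory

        long victimId = selectVictim();         // pick a page to evict
        long bufAddr = table.bufferAddress(victimId);

        flushIfDirty(victimId, bufAddr);        // persist the victim if it is dirty
        table.onPageEvicted(victimId);

        readFromDisk(pageId, bufAddr);          // reuse the freed buffer for the requested page
        table.onPageLoaded(pageId, bufAddr);

        return bufAddr;
    }
}
```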

It's important to note that when a page is not in memory, it must be read from disk, which limits the overall throughput of the system: the number of input/output operations per second (IOPS) the disk can sustain becomes the bottleneck. If each cache get requires 10 page reads, the maximum throughput is the disk's IOPS divided by 10; for example, a disk sustaining 50,000 IOPS would cap such a workload at roughly 5,000 gets per second. This is why having more memory available on the nodes is crucial.

This memory management and page replacement strategy is particularly relevant to Ignite, as it was originally designed as an in-memory data grid. The goal was to maintain high performance while also allowing for fast restarts.

 

Part 2: Page-Organized Data Structures

As part of the ARIES architecture, we have pages, page buffers, and a mechanism for accessing any page. These components serve as the foundation for building data structures. To be durable, those data structures must themselves be organized in terms of pages.

For example, we can build a persistent tree or a durable free list. These data structures are implemented in Ignite and follow this page organization. To link pages together, we keep using the logical page IDs that the translation table already understands.
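As an illustration of what "page-organized" means, here is a hypothetical layout of a tree inner node stored in a page: links to other pages are logical page IDs resolved through the translation table, never raw memory pointers. The layout is invented for the example and is not Ignite's actual on-page format:

```java
import java.nio.ByteBuffer;

// Illustrative page-organized B+ tree inner node: a 4-byte key count header
// followed by 8-byte logical page IDs pointing at child pages.
final class InnerNodePage {
    private final ByteBuffer buf;   // the page buffer backing this page

    InnerNodePage(ByteBuffer buf) {
        this.buf = buf;
    }

    int keyCount() {
        return buf.getInt(0);                // number of keys stored in this node
    }

    long childPageId(int idx) {
        return buf.getLong(4 + idx * 8);     // child links are logical page IDs, not pointers
    }
}
```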

Pages play a crucial role in the ARIES architecture by providing the framework for organizing and linking data structures in a persistent manner.

 

Part 3: Write-Ahead Log

One of the most important components of the ARIES architecture is the write-ahead log (WAL). It serves a crucial purpose: it enables crash recovery. The write-ahead log is an append-only file, or set of files, that holds information about the changes applied to the system. The data written to the WAL falls into two categories: logical records and physical records.

Logical records represent information in terms of operations performed on the key-value store. For example, if a put operation is executed, the logical record for that operation will contain the key and the corresponding value that was written to the key-value store. Physical records, on the other hand, are not tied to operational semantics but describe the physical changes made to pages. For instance, if the bytes at offsets 10 through 50 of a specific page are changed, the physical record will contain exactly that information.
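The two record categories might be modeled roughly like this; the field layouts are illustrative rather than Ignite's actual WAL record formats:

```java
// Simplified sketch of the two WAL record categories described above.
interface WalRecord { }

/** Logical record: describes the operation in key-value terms. */
final class PutRecord implements WalRecord {
    final byte[] key;
    final byte[] value;

    PutRecord(byte[] key, byte[] value) {
        this.key = key;
        this.value = value;
    }
}

/** Physical record: describes a byte-level change to a specific page. */
final class PageDeltaRecord implements WalRecord {
    final long pageId;     // which page was modified
    final int offset;      // where in the page the change starts
    final byte[] newBytes; // the bytes that were written

    PageDeltaRecord(long pageId, int offset, byte[] newBytes) {
        this.pageId = pageId;
        this.offset = offset;
        this.newBytes = newBytes;
    }
}
```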

The write-ahead log serves as the single source of truth. In the event of a process crash, the information stored in the WAL is read and reapplied to the persistent storage. In theory, one could rely solely on the write-ahead log, but that would require an infinitely large disk, which is clearly impractical. To address this, a mechanism called checkpointing is employed to truncate the WAL. Checkpointing collects all the changes that have been applied in memory and writes them to disk.

The write-ahead log thus plays a critical role in the ARIES architecture by facilitating crash recovery and storing essential information about system changes. The use of both logical and physical records ensures the accuracy and integrity of the data, while checkpointing keeps the size of the WAL bounded and storage utilization efficient.

Now, let's examine how the coordination between updates and checkpointing is implemented in Ignite's variant of ARIES. The first thing to note is that a read-write lock, called the checkpoint lock, facilitates this coordination. When an update arrives at the storage, it acquires this lock in shared mode, which allows multiple updates to be applied concurrently. Once the lock is acquired, a logical record is written to the write-ahead log. If the WAL is synced to disk at this point, we can be confident that the operation is durable: in the event of a process crash, this update can already be replayed and applied to the storage.

After the logical record is written to the write-ahead log, the update proceeds to modify all the necessary data structures: it inserts data into the data pages, updates the primary index tree, updates secondary SQL indexes if any, and adjusts the layout of the data pages. As part of this protocol, whenever changes are applied to pages, physical records are also written to the WAL. It is crucial that these physical records are written before the page's write lock is released. Essentially, the protocol is: take a page, acquire its write lock, apply the change, put the physical record into the write-ahead log, and then release the lock.
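Putting the ordering together, a sketch of the update path could look like the following. It reuses the record types sketched above and uses ReentrantReadWriteLock as a stand-in for both the checkpoint lock and the per-page locks; the Wal and Page interfaces are placeholders, not Ignite's API:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the update ordering: shared checkpoint lock, logical record,
// then per-page write lock + physical record before releasing the page.
final class UpdateProtocol {
    private final ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();

    void put(byte[] key, byte[] value, Wal wal, Iterable<Page> pagesToUpdate) {
        checkpointLock.readLock().lock();          // shared: many updates may run concurrently
        try {
            wal.append(new PutRecord(key, value)); // 1. logical record first

            for (Page page : pagesToUpdate) {
                page.writeLock().lock();           // 2. exclusive access to the page
                try {
                    PageDeltaRecord delta = page.apply(key, value);  // mutate the page buffer
                    wal.append(delta);             // 3. physical record before the page lock is released
                } finally {
                    page.writeLock().unlock();
                }
            }
        } finally {
            checkpointLock.readLock().unlock();    // release the shared checkpoint lock
        }
    }

    interface Wal { void append(WalRecord rec); }
    interface Page {
        ReentrantReadWriteLock.WriteLock writeLock();
        PageDeltaRecord apply(byte[] key, byte[] value);
    }
}
```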

Once all the data structures are updated, the shared checkpoint lock is released. At this point, a number of pages are kept in memory whose content differs from their on-disk versions; these pages are referred to as "dirty." The essence of a checkpoint is to write these dirty pages to disk so that the portion of the write-ahead log covering them can be removed. Let's now delve into the checkpoint protocol.

The first step of the checkpoint protocol is acquiring the checkpoint lock in exclusive mode. Once the exclusive lock is held, there are no in-flight updates to the pages and all pages are mutually consistent; this is the state we want to transfer to disk. While the lock is held, a lazy copy of the pages to be written is taken: only the logical identifiers of the dirty pages are copied, not the actual data, to minimize the time the exclusive lock is held. Once the page identifiers are copied, the exclusive lock is released and further updates can proceed.

If a subsequent update wants to write to a page that is scheduled to be written to disk, a copy-on-write happens: the original state of that page is copied into a temporary buffer called the checkpoint buffer, and the checkpoint writes the page from there. After the checkpoint lock is released, the pages are transferred to disk. This, at a high level, summarizes the checkpointing protocol in Ignite.
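Here is a sketch of the checkpoint side of the protocol, under the same simplifying assumptions (the dirty-page set, the disk write, and the copy-on-write path into the checkpoint buffer are placeholders, not Ignite's implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of a sharp checkpoint: briefly take the exclusive checkpoint lock,
// copy only the dirty page IDs, release the lock, then write pages to disk.
final class Checkpointer {
    private final ReentrantReadWriteLock checkpointLock;
    private final Set<Long> dirtyPages;     // page IDs modified since the last checkpoint

    Checkpointer(ReentrantReadWriteLock checkpointLock, Set<Long> dirtyPages) {
        this.checkpointLock = checkpointLock;
        this.dirtyPages = dirtyPages;
    }

    void checkpoint() {
        List<Long> toWrite;

        checkpointLock.writeLock().lock();          // exclusive: no in-flight updates, pages are consistent
        try {
            toWrite = new ArrayList<>(dirtyPages);  // copy only the page IDs, not the page contents
            dirtyPages.clear();
        } finally {
            checkpointLock.writeLock().unlock();    // updates may proceed again
        }

        // Updates that touch a page from 'toWrite' before it is flushed copy its
        // original content into the checkpoint buffer (copy-on-write, not shown).
        for (long pageId : toWrite)
            writePageToDisk(pageId);
    }

    private void writePageToDisk(long pageId) {
        // placeholder for the actual partition-file write
    }
}
```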

Please note that, for the sake of simplicity, some points have been omitted. What we've covered, however, should be enough to discuss the implementation specifics I have in mind and want to share with you.

 

Read-Write Lock

When we started implementing this architecture, the first question we faced was how to represent the read-write lock for each page. The straightforward solution was to have a separate java.util.concurrent read-write lock instance (such as ReentrantReadWriteLock) per page. However, since we were targeting hundreds of gigabytes of RAM and each page holds 4 to 16 kilobytes of data, this approach would mean tens or even hundreds of millions of pages in memory. Keeping a lock object per page on the heap would not only consume a large amount of heap memory but also create issues with garbage collection.

An alternative we considered was to create and destroy locks on demand. For example, when accessing page number 10, we would first create a lock for that page, acquire it, perform the necessary operations on the page, release the lock, and then destroy it if it was no longer needed. However, this approach would require additional synchronization to ensure that multiple threads do not end up with different lock instances for the same page.

Moreover, since every update operation touches multiple pages, allocating a large number of lock objects would be inefficient. In addition, keeping the locks separate from the pages would hurt processor cache locality.

To address these issues, we decided to inline the lock into the page buffer itself. The lock is embedded directly in the page's memory, which improves cache locality and avoids separate lock instances altogether. That said, inlining alone is not the whole story; let's look at the lock itself.

 

Lock

Despite being inlined, the lock still provides all the infrastructure needed to implement the ARIES architecture. It consists of four fields, three of which are generic and common to any read-write lock: the number of currently waiting readers, the number of currently waiting writers, and the lock state itself (acquired in shared or in exclusive mode). The final field, called the tag, is used by Ignite to detect and prevent ABA problems. An ABA problem occurs when you expect the page with a certain ID, but due to concurrent processes the page behind that ID is no longer of the type you expect. This can happen, for example, when running a SQL query with a cursor over a large index. Embedding the tag in the lock helps avoid this problem.

From an implementation standpoint, the lock is not fair, so there are situations where reader or writer threads could starve. Writer threads are given preference, but some randomization is employed to balance readers and writers. Additionally, the lock is not re-entrant; however, during the implementation of Ignite native persistence we never needed re-entrancy and never regretted this choice. The lock itself is lightweight, only 8 bytes, yet it provides all the infrastructure needed for page access and synchronization.
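One way to picture such an 8-byte lock word is shown below. The field widths are an assumption made for this sketch, not Ignite's exact layout; the real word is embedded in the page buffer and updated in place with compare-and-swap:

```java
// Illustrative bit layout for the 8-byte lock word inlined into each page buffer.
// | 16 bits: tag | 16 bits: lock state | 16 bits: waiting writers | 16 bits: waiting readers |
final class PageLockWord {
    static int waitingReaders(long word) { return (int) (word & 0xFFFF); }
    static int waitingWriters(long word) { return (int) ((word >>> 16) & 0xFFFF); }
    static int lockState(long word)      { return (int) ((word >>> 32) & 0xFFFF); } // shared or exclusive
    static int tag(long word)            { return (int) ((word >>> 48) & 0xFFFF); } // ABA protection

    /** Returns a new word with the waiting-readers field replaced. */
    static long withWaitingReaders(long word, int readers) {
        return (word & ~0xFFFFL) | (readers & 0xFFFFL);
    }
}
```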

 

Checkpoint

The second implementation choice, and the one most visible to users, is the concept of a sharp checkpoint. In the checkpointing protocol we acquire an exclusive lock to capture the state that needs to be written to disk. Since this state is fully consistent, the checkpoint is called sharp. The alternative, a fuzzy checkpoint, is more challenging from an implementation standpoint and also limits the kinds of data structures that can be implemented.

With a fuzzy checkpoint, each data structure must be designed so that it can be written out and recovered consistently on its own. With a sharp checkpoint, by contrast, a data structure can have arbitrary complexity and still be checkpointed easily. The downside of this approach is that the lock is exclusive: even though we try to hold it for a very short time, it still affects latency, particularly in the high percentiles. The checkpoint buffer and the checkpoint-related configuration properties are therefore the parameters most often tuned when working with persistence.

 

File

A question users sometimes raise is how file storage relates to partitions. There are two options: keep the data of each partition in a separate file, or write all the data belonging to a node into a single file (or set of files) that doesn't directly map to partition numbers. We chose the file-per-partition approach for several reasons. First, it gives a transparent mapping from logical pages to physical pages and from files to sets of pages. This simplifies various internal processes in Ignite, such as rebalancing and taking snapshots. When taking a snapshot, having a single, self-contained copy of each partition is beneficial, and this would be hard to achieve if data from multiple partitions were mixed in one file.

However, there is a downside to this approach when dealing with secondary indexes. It becomes challenging to drop a specific partition because if two partitions share an index and one of them is dropped, it would result in a broken index. Therefore, before actually dropping the partition file, Ignite needs to clear the data records one by one and then release the file.

 

Indexed Scans

Another important aspect of the ARIES-based implementation is that data scans are always index-organized. This means that when you scan a large amount of data, you may end up reading the same pages over and over again, so avoiding duplicate page accesses is an important optimization for performance. ARIES was originally optimized for OLTP use cases, where such scans are uncommon; index access, however, works very well for small result sets.

 

Translation Table

Finally, an interesting part of the implementation is the translation table. As we have seen, this table translates logical page IDs into page buffer addresses.

 

Partition Eviction

One of the important operations on this table is partition eviction. With a simple hash map, evicting a partition would require scanning the entire table to find and remove every entry belonging to that partition, which can be time-consuming. To address this, Ignite introduces the concept of a generation for each partition and stores the generation alongside the page identifier in the table.

When we evict a partition, we simply increment its generation number. Once that happens, the corresponding entries in the hash table are considered invalid and can be reclaimed lazily. This gives us an efficient clear operation for any partition; a minimal sketch follows below. After that, let's recap what we have discussed so far.
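The generation trick might look roughly like this, with names invented for the example; Ignite keeps this data off-heap and lays it out differently:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of generation-based invalidation in the translation table: evicting a
// partition only bumps its generation; stale entries fail the generation check.
final class GenerationalTranslationTable {
    private static final class Entry {
        final long bufferAddr;
        final int generation;   // partition generation at the time the page was loaded
        Entry(long bufferAddr, int generation) {
            this.bufferAddr = bufferAddr;
            this.generation = generation;
        }
    }

    private final Map<Long, Entry> table = new ConcurrentHashMap<>();
    private final Map<Integer, AtomicInteger> partGeneration = new ConcurrentHashMap<>();

    private AtomicInteger generationOf(int partId) {
        return partGeneration.computeIfAbsent(partId, p -> new AtomicInteger());
    }

    /** Returns the buffer address, or null if the page is absent or its partition was evicted. */
    Long bufferAddress(int partId, long pageId) {
        Entry e = table.get(pageId);
        if (e == null || e.generation != generationOf(partId).get())
            return null;        // stale entry: it belongs to an evicted generation
        return e.bufferAddr;
    }

    void onPageLoaded(int partId, long pageId, long bufferAddr) {
        table.put(pageId, new Entry(bufferAddr, generationOf(partId).get()));
    }

    /** Evicting a partition is just a generation bump; stale entries are reclaimed lazily. */
    void evictPartition(int partId) {
        generationOf(partId).incrementAndGet();
    }
}
```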

 

Further Considerations


From an architectural and implementation point of view, there are a few key considerations to keep in mind. First, if you have more data than available RAM, you will eventually hit the disk when accessing data, so the more data you can fit in RAM, the better the performance will be. When all the data cannot fit in RAM, it is important to have a disk that can sustain a large number of IOPS (input/output operations per second).

In most cases, Ignite delivers its best performance for point queries, which require random access. If you have arbitrary workloads, it is recommended to transform and optimize them to resemble OLTP (online transaction processing) workloads as much as possible: avoid full scans and use per-partition scans instead. Ignite provides APIs to preload partitions into memory, to collocate data, and to read data by a particular key. It also helps to work with small data sets and small result sets.
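For example, a partition-scoped scan and a colocated read might look like the following; the cache name and the surrounding class are hypothetical, and the exact method signatures should be checked against the Ignite version in use:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.ScanQuery;
import org.apache.ignite.lang.IgniteRunnable;

import javax.cache.Cache;

// Keeping scans partition-scoped and reads colocated with the data.
public class PartitionScanExample {
    public static void scanOnePartition(Ignite ignite, int partition) {
        IgniteCache<Long, String> cache = ignite.cache("myCache");

        // Scan a single partition instead of the whole cache.
        ScanQuery<Long, String> qry = new ScanQuery<Long, String>().setPartition(partition);

        try (QueryCursor<Cache.Entry<Long, String>> cursor = cache.query(qry)) {
            for (Cache.Entry<Long, String> entry : cursor)
                process(entry.getKey(), entry.getValue());
        }
    }

    public static void colocatedRead(Ignite ignite, Long key) {
        // Run the logic on the node that owns the key, so the read stays local.
        ignite.compute().affinityRun("myCache", key, (IgniteRunnable) () -> {
            IgniteCache<Long, String> cache = Ignition.localIgnite().cache("myCache");
            process(key, cache.localPeek(key));
        });
    }

    private static void process(Long key, String value) { /* application logic */ }
}
```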

 
