Techniques to Follow If Your Apache Ignite Cluster Is Running Out of Memory Space

Apache Ignite can scale horizontally to accommodate the data that your applications and services generate. If your in-memory cluster is about to run out of memory space, you can take advantage of horizontal scaling, which is one of Ignite’s foundational architectural capabilities. The “throw more resources into the cluster” approach is an often-heard piece of advice. However, in practice, most of us cannot scale out a cluster in a moment. Usually, an Ignite cluster of a specified memory capacity is provisioned for an application, and getting more resources can be a challenging, ongoing task.

We’ll discuss several architectural techniques that enable you to keep your cluster operational and your applications running even if memory becomes a scarce resource.

Configure Ignite Eviction Policies to Avoid Out-of-Memory Issues

Data eviction is a classic mechanism for protecting against memory overconsumption: Ignite monitors how much memory space is in use and evicts excess data when usage exceeds a maximum-allowed threshold.

Ignite supports several eviction policies that, ultimately, begin to purge the least recently used pages from memory when the maximum data-region size is reached. The following code snippet shows how to enable the DataPageEvictionMode.RANDOM_2_LRU policy for a custom data region:

  
DataStorageConfiguration storageCfg = new DataStorageConfiguration();
DataRegionConfiguration regionCfg = new DataRegionConfiguration();

regionCfg.setName("20GB_Region");

// 500 MB initial region size (RAM).
regionCfg.setInitialSize(500L * 1024 * 1024);

// 20 GB maximum region size (RAM).
regionCfg.setMaxSize(20L * 1024 * 1024 * 1024);

// Enabling RANDOM_2_LRU eviction for this region.
regionCfg.setPageEvictionMode(DataPageEvictionMode.RANDOM_2_LRU);

// Applying the data region configuration.
storageCfg.setDataRegionConfigurations(regionCfg);
  

Ignite eviction policies can be configured for pure in-memory clusters as well as for clusters that persist records in external databases, such as Oracle or MySQL. Clusters with Ignite native persistence disregard the eviction policy’s settings and, instead, use a page-replacement algorithm to control memory-space usage.

If you use Ignite eviction policies as a preventative measure for out-of-memory incidents, then Ignite can bring evicted records back automatically only if a) an external database, connected through the CacheStore interface, has an on-disk copy of the records and b) an application attempts to read the records with Ignite key-value APIs. In all other cases, it’s your responsibility to reload the evicted data to Ignite when it is needed.   
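
For illustration, here is a minimal sketch of such a setup. The Person value type, the PersonStore class, and the cache name are hypothetical placeholders; PersonStore stands for a CacheStore implementation (for example, a custom CacheStoreAdapter over JDBC) that knows how to load records from your external database. With read-through enabled, a key-value read of an evicted record falls back to the store and pulls the record back into Ignite:

CacheConfiguration<Long, Person> cacheCfg = new CacheConfiguration<>("Person");

// Keep the cache in the data region that has the eviction policy enabled.
cacheCfg.setDataRegionName("20GB_Region");

// PersonStore is a hypothetical CacheStore implementation backed by the external database.
cacheCfg.setCacheStoreFactory(FactoryBuilder.factoryOf(PersonStore.class));

// With read-through enabled, cache.get(key) reloads an evicted record from the store.
cacheCfg.setReadThrough(true);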

Consider Swapping to Avoid Memory Overflow

Swapping is a memory-overflow protection technique that is supported by operating systems. If an operating system detects that memory consumption is growing beyond a certain threshold, it can swap data in and out to avoid application and system outages.

Ignite allows the swapping mechanism to be enabled for data regions. The following code snippet shows how to turn swapping on via the DataRegionConfiguration.setSwapPath setting:

  
DataStorageConfiguration storageCfg = new DataStorageConfiguration();
DataRegionConfiguration regionCfg = new DataRegionConfiguration();

regionCfg.setName("500MB_Region");
regionCfg.setInitialSize(100L * 1024 * 1024);
regionCfg.setMaxSize(5L * 1024 * 1024 * 1024);

// Enabling swapping.
regionCfg.setSwapPath("/path/to/some/directory");

// Applying the data region configuration.
storageCfg.setDataRegionConfigurations(regionCfg);
  

With such a setting, the Ignite storage engine stores data in memory-mapped files. The operating system takes responsibility for moving the content of the files between memory and disk, depending on the current memory usage.

However, before you rely on swapping, you should consider a couple of drawbacks. First, swapping offers no data-durability guarantee: if a node goes down, any record that was swapped out evaporates when the process is terminated. Second, even if a node process has enough memory capacity to retain all application records (that is, nothing has been swapped to disk yet), you can't expect a cluster with swapping enabled to match the performance of a similar cluster without it. If you decide to use swapping as a memory-overflow protection technique that minimally impacts the performance of the cluster, you need to review and tune various operating-system-level settings related to swapping (vm.swappiness, vm.extra_free_kbytes, etc.).

Use Ignite Native Persistence to Read Disk-Only Records and Avoid Losing Data on Restarts

Ignite multi-tier storage can be configured to use Ignite native persistence as a disk tier. Native persistence stores all data records and indexes on disk, letting you decide how much of the data needs to be cached in memory. Even if an application’s dataset doesn’t fit in memory, the cluster contains disk-only records that the application can access. Unlike with the swapping approach, with native persistence you don’t lose a bit of data on restarts. Once the cluster nodes are interconnected, they can serve all the records straight from disk.

Configuring native persistence is effortless. When you set the DataRegionConfiguration.persistenceEnabled property to true, Ignite keeps on disk all of the records that are associated with your data region:

  
DataStorageConfiguration storageCfg = new DataStorageConfiguration();
DataRegionConfiguration regionCfg = new DataRegionConfiguration();

regionCfg.setName("20GB_Region");

// 500 MB initial region size (RAM).
regionCfg.setInitialSize(500L * 1024 * 1024);

// 20 GB maximum region size (RAM).
regionCfg.setMaxSize(20L * 1024 * 1024 * 1024);

// Enable Ignite Native Persistence for all the data from this data region.
regionCfg.setPersistenceEnabled(true);

// Applying the data region configuration.
storageCfg.setDataRegionConfigurations(regionCfg);
  

Because persistence is configured per data region, you can enable it for only the caches and tables that are likely to overflow. As a general guideline, turn persistence on for all data regions and then decide whether to turn it off for one or more data subsets.
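
As an illustration, here is a minimal sketch of binding only an overflow-prone cache to the persistent region while another cache stays in the default, purely in-memory region. The cache names and value types (Purchase, Session) are hypothetical:

// The "Purchase" cache is likely to overflow, so bind it to the persistent region.
CacheConfiguration<Long, Purchase> purchaseCacheCfg = new CacheConfiguration<>("Purchase");
purchaseCacheCfg.setDataRegionName("20GB_Region");

// Caches that don't specify a region name stay in the default (in-memory) data region.
CacheConfiguration<Long, Session> sessionCacheCfg = new CacheConfiguration<>("Session");

// Applying the storage and cache configurations to the node configuration.
IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setDataStorageConfiguration(storageCfg);
cfg.setCacheConfiguration(purchaseCacheCfg, sessionCacheCfg);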

With native persistence, Ignite doesn't impose API restrictions on the clusters. Quite the opposite: all Ignite APIs, including SQL and ScanQueries, can query records from disk if the records are not found in memory. This ability eliminates the burden, seen with the eviction-policy approach, of requiring applications to reload records that were purged from memory. With native persistence, Ignite uses page replacement and rotation to avoid memory-space overconsumption. Although the algorithm automatically evicts pages from memory, it never touches the disk copy, which is pulled back into the memory tier whenever applications need it.
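
Here is a short, hedged sketch of what this looks like in practice. It builds on the hypothetical Purchase cache from the previous snippet and assumes that SQL was enabled for it (for example, via CacheConfiguration.setIndexedTypes). A cluster with native persistence must be activated on first start; after that, reads and queries transparently fetch records that live only on disk:

Ignite ignite = Ignition.start(cfg);

// A persistent cluster starts inactive; activate it before use.
ignite.cluster().state(ClusterState.ACTIVE);

IgniteCache<Long, Purchase> purchases = ignite.cache("Purchase");

// Key-value read: if the page with this record was rotated out of memory,
// Ignite reads the record from the disk tier transparently.
Purchase purchase = purchases.get(42L);

// SQL over the same cache: records that are no longer in memory are also fetched from disk.
List<List<?>> rows = purchases.query(
    new SqlFieldsQuery("SELECT id, total FROM Purchase WHERE total > ?").setArgs(100)
).getAll();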

Use a Bigger Heap and a Pauseless Garbage Collector to Eliminate Java Heap Issues

Ignite stores all application records and indexes in its off-heap memory, which is frequently called “page memory” (because of the way it is organized and managed). Ignite splits the space into pages of fixed size, as modern operating systems do, and keeps the application records in those pages. As you can guess, eviction policies and native persistence are the techniques to use if there is a chance that you will run out of off-heap memory.

At the same time, as with any Java middleware, Ignite uses Java heap as interim storage of the objects and records that are requested by applications. For example, when you retrieve data via key-value or SQL calls, a copy of the requested off-heap records is maintained in Java heap and is garbage collected after the on-heap-based result set is transmitted to the application.

If you start running out of Java-heap space, it's highly likely that the JVM will not throw the out-of-memory exceptions that bring down cluster nodes. Instead, you will observe long, stop-the-world garbage collection pauses on the cluster nodes. The pauses degrade cluster performance, and nodes that stop responding can cause outages.

A generic solution for this issue is to allocate a Java heap that is big enough to handle the application requests under the production load. Heap size is use-case specific. It can be as small as 3GB or as big as 30GB per cluster node.

High-throughput and low-latency applications usually have larger Java heaps. For such applications, you can consider pauseless Java garbage collectors, such as the C4 collector of the Azul Zing JVM, which shows reliable and consistent performance regardless of heap size.

Use Memory Quotas for SQL to Make Java Heap Usage Predictable

SQL queries are among the most Java-heap-demanding operations of Apache Ignite. A single query can scan through thousands of table records, group and sort hundreds of records in memory, and join data that resides across multiple tables. The following diagram shows all the steps of the query-execution process. The “computing” phase uses the Java heap the most:

[Diagram: the steps of the SQL query-execution process, with the “computing” phase as the most Java-heap-intensive one]

It’s not surprising that, in many Apache Ignite deployments, the final Java heap size is influenced by the needs of the SQL-based operations. The suggestion from the previous section for a bigger Java heap works for SQL. However, you can also configure memory quotas to better manage the usage of the Java heap space.

Memory quotas are designed specifically for SQL and are available in GridGain Community Edition and other editions of GridGain. The following configuration example shows how to set quotas per cluster node:

    
// Creating Ignite configuration.
IgniteConfiguration cfg = new IgniteConfiguration();

// Defining SQL configuration.
SqlConfiguration sqlCfg = new SqlConfiguration();

// Setting the global quota per cluster node.
// All the running SQL queries combined cannot use more memory than is set here.
sqlCfg.setSqlGlobalMemoryQuota("500M");

// Setting the per-query quota per cluster node.
// A single running SQL query cannot use more memory than is set here.
sqlCfg.setSqlQueryMemoryQuota("40M");

// If any of the quotas is exceeded, Ignite will start offloading result sets to disk.
sqlCfg.setSqlOffloadingEnabled(true);

cfg.setSqlConfiguration(sqlCfg);
    

With this configuration, a single Ignite cluster node can use no more than 500MB of the allocated Java heap space for the needs of the SQL queries that are currently running. Also, SqlConfiguration.setSqlQueryMemoryQuota instructs Ignite to allow no individual SQL query to consume more than 40MB of the heap. Finally, if the per-query or global quotas are exceeded, Ignite begins to offload the result sets of the queries to the disk tier (as any relational database does), because the SqlConfiguration.setSqlOffloadingEnabled parameter is enabled. If the offloading-to-disk feature is disabled, then queries that exceed the quotas are terminated and an exception is reported.

So, the suggestion is to enable the quotas together with the offloading-to-disk feature, especially if an application is going to perform sorting (ORDER BY) or grouping (DISTINCT, GROUP BY) of the data or is going to run complex queries with sub-selects and joins.
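
For example, a query like the following one (the Purchase table and its columns are hypothetical, reusing the cache from the native-persistence section) groups and sorts a potentially large result set. With the quotas and offloading enabled, a node spills the intermediate results to disk instead of exhausting its Java heap or failing the query:

SqlFieldsQuery heavyQry = new SqlFieldsQuery(
    "SELECT customerId, SUM(total) AS spent " +
    "FROM Purchase " +
    "GROUP BY customerId " +
    "ORDER BY spent DESC");

// Iterate over the cursor; result sets that exceed the quotas are offloaded to disk.
try (FieldsQueryCursor<List<?>> cursor = purchases.query(heavyQry)) {
    for (List<?> row : cursor)
        System.out.println(row);
}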

Conclusion

All the preventive techniques that this article discusses can handle memory-overconsumption issues. However, do not forget about the implications. The techniques, once triggered, can affect application performance. The cluster can start working with disk more actively (page replacement with native persistence and the offloading of SQL result sets to disk) or can make records unavailable by deleting them from the cluster (eviction policies). Thus, experiment with scenarios in which the cluster runs out of off-heap space and scenarios in which it runs out of Java-heap space. The results will show whether native persistence is well-tuned and whether the application knows how to reload the records that the Ignite eviction policies removed.