Monitoring Apache Ignite Memory and Disk Usage

Head of Developer Relations at GridGain
Apache Ignite Committer and PMC Member

The Apache Ignite storage engine not only spreads application records across a distributed cluster but also stores the records on various storage tiers of an individual cluster node. Monitoring Apache Ignite storage usage requires metrics that report off-heap and Java heap utilization, as well as the disk space consumed by Ignite on-disk records and write-ahead logs (WALs).

In this part of the tutorial, you create a GridGain Control Center dashboard that monitors Ignite’s primary storage metrics. In the end, your dashboard should look like the following:

Apache Ignite Memory and Disk Monitoring Dashboard

Create Storage Usage Dashboard

Start by creating a dashboard with various memory and disk-usage metrics:

  1. Open the Dashboard screen of Control Center.

  2. Create a new dashboard named Storage Usage.

    Storage Usage Dashboard

Monitor Off-Heap Memory

Ignite keeps all application records and indexes in its own memory regions, which are invisible to Java garbage collectors and fully managed by Ignite. These memory regions are usually referred to as “off-heap memory.”

Now, you add a widget that displays the off-heap memory usage to the dashboard:

  1. Add a widget of the Metrics (table) type.

  2. In the widget creation dialog, search for the PhysicalMemorySize metric under the io.dataregion.default package. You can also add this metric in the Advanced View tab. Its full name is `io.dataregion.default.PhysicalMemorySize`.

    Ignite Memory Monitoring

  3. After the metric widget is created, click the menu icon (three dots) in the top-right corner of the widget, and select Rename. Rename the widget to Off-Heap Memory.
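
To make the idea of memory that lives outside the Java heap more concrete, here is a minimal sketch that uses only the standard JDK. It allocates a direct `ByteBuffer` and reads the JVM's "direct" buffer pool through `BufferPoolMXBean`. This is an illustration of the general concept, not of Ignite internals: Ignite manages its own memory regions and reports them through its own metrics such as PhysicalMemorySize. The class name `OffHeapDemo` and the 16 MB size are assumptions chosen for the example.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

class OffHeapDemo {
    /** Returns the bytes used by the JVM's "direct" buffer pool, or -1 if the pool is not exposed. */
    static long directPoolBytes() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directPoolBytes();

        // Memory for a direct buffer is allocated outside the garbage-collected heap.
        ByteBuffer region = ByteBuffer.allocateDirect(16 * 1024 * 1024);
        region.putLong(0, 42L);

        long after = directPoolBytes();
        System.out.println("Direct pool before: " + before + " bytes");
        System.out.println("Direct pool after:  " + after + " bytes");
        System.out.println("Value read back:    " + region.getLong(0));
    }
}
```

The heap figures reported by the JVM barely move when such a buffer is allocated, which is why off-heap usage needs its own metric on the dashboard.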

Monitor Java Heap Usage

Off-heap memory is the primary storage tier for application records. However, Ignite nodes still use Java heap, mostly for temporary objects generated while processing the operations that the application executes.

Next, you add a widget that monitors Java heap memory usage to the dashboard:

  1. Add a widget of the Metrics (chart) type.

  2. In the widget creation dialog, search for Used and select the Heap Used metric under the heap category. The full metric name is sys.memory.heap.Used.

    Ignite Java Heap Monitoring

  3. After the metric widget is created, rename it to Java Heap Memory.
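
For a quick cross-check outside Control Center, the JVM itself exposes a used-heap figure through the standard `MemoryMXBean`. The sketch below reads that value for the JVM it runs in. It is an assumption that this figure is comparable to what the sys.memory.heap.Used metric reports; the class name `HeapUsageProbe` is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

class HeapUsageProbe {
    /** Returns the number of heap bytes currently used by this JVM. */
    static long usedHeapBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    public static void main(String[] args) {
        long usedMb = usedHeapBytes() / (1024 * 1024);
        System.out.println("Heap used: " + usedMb + " MB");
    }
}
```

Because the value fluctuates with garbage collection, a chart widget is a better fit for it than a table, which matches the widget type chosen above.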

Track Disk-Storage Size

If Ignite native persistence is configured for your cluster, then a copy of every application record will be persisted on disk.

Perform the following steps to add a widget that reports the disk space consumed by the on-disk copies of your application records:

  1. Add a widget of the Metrics (table) type.

  2. In the widget creation dialog, search for and select the StorageSize metric under the datastorage category. The full metric name is io.datastorage.StorageSize.

    Ignite Storage Size Monitoring

  3. After the metric widget is created, rename it to Disk Storage Size.
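
If you want to sanity-check the reported number against the file system, one option is to sum the sizes of all regular files under a directory. The sketch below does that with the standard `java.nio.file` API. The directory is passed as an argument because the location of the persistence files depends on your configuration; nothing here is Ignite-specific, the class name `DirectorySize` is hypothetical, and the result may differ from the StorageSize metric depending on what that metric counts.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

class DirectorySize {
    /** Returns the total size in bytes of all regular files under the given directory. */
    static long sizeOf(Path dir) {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(DirectorySize::fileSize)
                        .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static long fileSize(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass the directory to measure as the first argument; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(sizeOf(dir) + " bytes under " + dir.toAbsolutePath());
    }
}
```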

Monitor WAL Size

Ignite does not persist changes into storage files immediately, because doing so would require Ignite to perform updates in a random file location, an operation that is quite time-consuming. Instead, all changes are first appended to the WAL and then, later, are checkpointed to the storage files.

Now, you add a widget that enables you to observe the WAL size:

  1. Add a widget of the Metrics (table) type.

  2. In the widget creation dialog, search for and select the WALTotalSize metric under the datastorage category. The full metric name is `io.datastorage.WALTotalSize`.

    Ignite WAL Size Monitoring

  3. After the metric widget is created, rename it to WAL Size.

Watch Checkpointing Duration

Checkpointing is the process of copying dirty pages from memory to storage files on disk. After the dirty pages are written to disk, Ignite can safely trim the part of the WAL that contained the changes that were checkpointed during the previous checkpointing cycle. It’s prudent to monitor the duration of checkpointing cycles and, if Ignite spends too much time in checkpointing, to perform various disk optimizations.

Now, you create a widget to monitor the duration of the checkpointing process:

  1. Add a widget of the Metrics (chart) type.

  2. In the widget creation dialog, search for and select the LastCheckpointDuration metric under the datastorage category. The full metric name is io.datastorage.LastCheckpointDuration.

    Ignite Checkpointing Monitoring

  3. After the widget is created, rename it to Checkpointing Duration.
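
The interplay between the WAL, dirty pages, and checkpoints can be sketched with a toy in-memory model. The class below is purely illustrative and is not how Ignite is implemented: `ToyWalStore`, its maps, and its method names are all hypothetical. Each update is appended to a log and marks a key as dirty; a checkpoint copies the dirty entries to a stand-in for the storage files, trims the log, and records how long the cycle took.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ToyWalStore {
    private final List<String> wal = new ArrayList<>();          // append-only log of changes
    private final Map<String, String> memory = new HashMap<>();  // in-memory copy of the data
    private final Set<String> dirty = new HashSet<>();           // keys changed since last checkpoint
    private final Map<String, String> storage = new HashMap<>(); // stand-in for storage files
    private long lastCheckpointNanos;

    /** Appends the change to the log first, then updates memory and marks the key dirty. */
    void put(String key, String value) {
        wal.add(key + "=" + value);
        memory.put(key, value);
        dirty.add(key);
    }

    /** Copies dirty entries to storage, trims the log, and returns the number of entries flushed. */
    int checkpoint() {
        long start = System.nanoTime();
        int flushed = dirty.size();
        for (String key : dirty) {
            storage.put(key, memory.get(key));
        }
        dirty.clear();
        wal.clear(); // the logged changes are now reflected in storage
        lastCheckpointNanos = System.nanoTime() - start;
        return flushed;
    }

    int walRecords() { return wal.size(); }
    String stored(String key) { return storage.get(key); }
    long lastCheckpointNanos() { return lastCheckpointNanos; }

    public static void main(String[] args) {
        ToyWalStore store = new ToyWalStore();
        store.put("a", "1");
        store.put("a", "2");
        store.put("b", "3");
        System.out.println("WAL records before checkpoint: " + store.walRecords()); // 3
        System.out.println("Entries flushed: " + store.checkpoint());               // 2
        System.out.println("WAL records after checkpoint: " + store.walRecords());  // 0
    }
}
```

In this model, three updates produce three log records but only two dirty entries, because the key `a` was updated twice. A long checkpoint in the real system delays WAL trimming in a similar way, which is why the WAL size and checkpoint duration are worth watching side by side.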

Watch WAL Synchronization Duration

Ignite appends changes to the WAL for every data update. However, the operating system (OS) must still fsync the changes to disk to ensure that no data is lost, even if a node fails unexpectedly. If Ignite or the OS spends a significant amount of time in the fsync stage, then you might need to undertake various WAL-related optimizations.

Finally, you add a widget to the dashboard to monitor the WAL’s fsync stage duration:

  1. Add a widget of the Metrics (chart) type.

  2. In the widget creation dialog, search for and select the WALFsyncTimeDuration metric under the datastorage category. The full metric name is io.datastorage.WALFsyncTimeDuration.

    Ignite WAL Fsync Monitoring

  3. After the metric widget is created, rename it to WAL Fsync Duration.
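
To get a feel for what the fsync stage costs on a particular disk, you can time it directly with the standard `FileChannel.force` call. The sketch below appends a record to a log file and measures only the time spent forcing it to the storage device. It is a generic illustration, not Ignite's WAL code; the class name `FsyncTimer`, the record format, and the temporary file are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class FsyncTimer {
    /** Appends one record to the log file, forces it to disk, and returns the nanoseconds spent in force(). */
    static long appendAndSync(Path logFile, String record) {
        try (FileChannel channel = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buffer = ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8));
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            long start = System.nanoTime();
            channel.force(true); // flush file content and metadata to the storage device
            return System.nanoTime() - start;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path logFile = Files.createTempFile("toy-wal", ".log");
        for (int i = 0; i < 5; i++) {
            long nanos = appendAndSync(logFile, "update-" + i);
            System.out.println("fsync " + i + ": " + (nanos / 1_000) + " us");
        }
    }
}
```

Running this on the volume that holds the WAL gives a rough baseline to compare against the values shown in the WAL Fsync Duration widget.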

Rearrange the Dashboard Widgets

Rearrange the widgets of the dashboard by moving them around as you want. In the end, your Storage Usage dashboard should be similar to the following dashboard.

Apache Ignite Memory and Disk Monitoring Dashboard

What’s Next

The dashboard that you configured in this part of the tutorial can be displayed on the screens of the room that your team uses to monitor Ignite production clusters. But what if nobody is in the room when the dashboard begins reporting unexpected cluster activity? In the next part of the tutorial, you set up alerting, so that you are notified about every event that might affect the stability or performance of the cluster.