GridGain Developers Hub
GridGain Software Documentation

Introduction: Monitoring and Metrics

This chapter covers monitoring and metrics for GridGain. We’ll start with an overview of the methods available for monitoring, and then we’ll delve into the GridGain specifics, including a list of JMX metrics and MBeans.

Overview

The basic task of monitoring in GridGain involves metrics. You can access metrics in several ways, including JMX, the Web Console, and programmatic access.

What to Monitor

Start with monitoring:

  • Each node in isolation.

  • Connection between nodes.

  • The system as a whole.

A node consists of several layers: the hardware, the operating system, the virtual machine (such as the JVM), and the application. You need to check all of these layers, as well as the network surrounding the node.

  • Hardware (Hypervisor): CPU/Memory/Disk ⇒ System Logs/Cloud Provider’s Logs

  • Operating System

  • JVM: GC Logs, JMX, Java Flight Recorder, Thread Dumps, Heap dumps, etc.

  • Application: Logs, JMX, Throughput/Latency, Test queries

    • For log-based monitoring, the key is to act proactively: watch the logs for trends instead of waiting until something breaks before you check them.

  • Network: ping monitoring, network hardware monitoring, TCP dumps

This should give you a good place to start for setting up monitoring of your hardware, operating system, and network. To monitor the application layer (the nodes that make up your in-memory computing solution), you’ll need to perform GridGain-specific monitoring using metrics that you access through JMX MBeans, through the Web Console, or programmatically.
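
As a rough illustration of the JVM layer listed above, the following sketch reads a few values through the standard platform MXBeans of the JDK and lists the MBeans registered under a given JMX domain. It uses only `java.lang.management` and `javax.management`; the class name `JvmLayerProbe` and its method names are made up for this example. The JMX domain under which GridGain registers its own MBeans is not stated in this section, so the sketch queries the JVM's built-in `java.lang` domain instead.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper class; not part of GridGain.
public class JvmLayerProbe {

    /** Heap bytes currently in use by this JVM. */
    public static long heapUsedBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    /** Collections summed over all collectors; a collector may report -1 when the count is undefined. */
    public static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Number of live threads, including daemon threads. */
    public static int liveThreadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    /** Canonical names of the MBeans registered in the local platform MBean server under one domain. */
    public static Set<String> mbeanNames(String domain) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Set<String> names = new TreeSet<>();
        try {
            // "<domain>:*" is an ObjectName pattern that matches every MBean in the domain.
            for (ObjectName name : server.queryNames(new ObjectName(domain + ":*"), null)) {
                names.add(name.getCanonicalName());
            }
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid JMX domain: " + domain, e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes): " + heapUsedBytes());
        System.out.println("GC collections:    " + totalGcCount());
        System.out.println("live threads:      " + liveThreadCount());
        // Assumption: "java.lang" is only the JVM's own domain. Use a tool such as
        // JConsole to find the domain that your GridGain nodes actually register.
        for (String name : mbeanNames("java.lang")) {
            System.out.println("  " + name);
        }
    }
}
```

Run inside the same JVM as a node, `mbeanNames` would also list whatever application MBeans that node has registered, once you pass it the right domain.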

Global vs. Node-specific Metrics

Metrics differ in scope (applicability): some values are the same on every node, while others depend on the node from which you obtain them. The following list explains the metric scopes.

Global metrics

Information about the cluster in general, for example, the number of nodes or the state of the cluster. This information is available on any node of the cluster.

Node-specific metrics

Information that is specific to the node on which you obtain the metrics. For example, memory consumption, data region metrics, WAL size, queue size, etc.

Cache-related metrics can be global as well as node-specific. For example, the total number of entries in a cache is a global metric, and you can obtain it on any node. You can also get the number of entries of the cache that are stored on a specific node.
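
The difference between the two scopes for a cache size can be pictured with a small toy model. This is not the GridGain API: the class `CacheSizeScopes`, its methods, and the node names and entry counts are all hypothetical, chosen only to show that the global value is the same wherever it is requested while the node-specific value depends on the node.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only; the class, node names, and counts are invented for illustration.
public class CacheSizeScopes {

    // Number of entries of one cache held by each node, keyed by node name.
    private final Map<String, Long> entriesByNode = new LinkedHashMap<>();

    /** Records how many entries of the cache a node currently stores. */
    public void report(String nodeName, long entries) {
        entriesByNode.put(nodeName, entries);
    }

    /** Global metric: the total number of entries, identical for every caller. */
    public long globalSize() {
        long total = 0;
        for (long entries : entriesByNode.values()) {
            total += entries;
        }
        return total;
    }

    /** Node-specific metric: the entries stored on one node, or 0 for an unknown node. */
    public long localSize(String nodeName) {
        return entriesByNode.getOrDefault(nodeName, 0L);
    }

    public static void main(String[] args) {
        CacheSizeScopes cache = new CacheSizeScopes();
        cache.report("node-A", 400);
        cache.report("node-B", 350);
        cache.report("node-C", 250);

        System.out.println("global size:          " + cache.globalSize());        // 1000
        System.out.println("local size on node-B: " + cache.localSize("node-B")); // 350
    }
}
```

In the Java API, `IgniteCache.size(...)` and `IgniteCache.localSize(...)` appear to draw the same distinction, but check the API reference for your GridGain version before relying on them.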