GridGain Developers Hub
GridGain Software Documentation

Troubleshooting and Debugging

This article covers some common tips and tricks for debugging and troubleshooting GridGain and Ignite deployments.

Debugging Tools: Consistency Check Command

The ./control.sh|bat utility includes a set of consistency check commands that help with verifying internal data consistency invariants.
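As a rough illustration, the commands below run two of these checks from a shell. This is a sketch that assumes the idle_verify and validate_indexes subcommands of --cache are available in your version, and that IGNITE_HOME (an assumption, not something this article defines) points at the installation directory. idle_verify is intended to be run while the cluster is not receiving updates.

```shell
# Assumption: IGNITE_HOME points at the installation; falls back to the current directory.
CONTROL="${IGNITE_HOME:-.}/bin/control.sh"

if [ -x "$CONTROL" ]; then
  # Compare partition hashes between primary and backup copies.
  "$CONTROL" --cache idle_verify
  # Check that SQL indexes are consistent with the cache data.
  "$CONTROL" --cache validate_indexes
else
  echo "control.sh not found at $CONTROL" >&2
fi
```

If the cluster requires authentication or uses a non-default host and port, additional connection options are likely needed; check the utility's help output for your version.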

Persistence Files Disappear on Restart

On some systems, the default location for Ignite persistence files may be under a temporary folder. In that case, the operating system can delete the persistence files when a node process is restarted. To avoid this:

  • Ensure that WARN logging level is enabled for GridGain. You will see a warning if the persistence files are written to the temporary directory.

  • Change the location of all persistence files using the DataStorageConfiguration APIs, such as setStoragePath(…), setWalPath(…), and setWalArchivePath(…).

Cluster Doesn’t Start After Field Type Changes

When developing your application, you may need to change the type of a custom object’s field. For instance, let’s say you have object A with field A.range of int type and then you decide to change the type of A.range to long right in the source code. When you do this, the cluster or the application will fail to restart because GridGain doesn’t support field/column type changes.

When this happens and you are still in development, you need to go into the file system and remove the following directories: marshaller/, db/, and wal/ located in the GridGain working directory (db and wal might be located in other places if you have redefined their location).
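The cleanup described above could be scripted roughly as follows. This is a sketch for development environments only: clean_dev_state is a hypothetical helper name, and it assumes that marshaller/, db/, and wal/ all live directly under the work directory you pass in. Stop every node that uses the directory before running it, because the data is deleted permanently.

```shell
# Hypothetical helper for DEVELOPMENT environments only: deletes persisted state.
# $1 is the GridGain work directory (adjust if db/ or wal/ were relocated).
clean_dev_state() {
  work_dir="$1"
  if [ -z "$work_dir" ] || [ ! -d "$work_dir" ]; then
    echo "usage: clean_dev_state <work-directory>" >&2
    return 1
  fi
  # Remove only the three directories named in this article; leave everything else.
  for dir in marshaller db wal; do
    rm -rf -- "${work_dir:?}/$dir"
  done
}

# Example (assumes the work directory is $IGNITE_HOME/work):
# clean_dev_state "$IGNITE_HOME/work"
```

The helper refuses to run without an existing directory argument, which guards against an empty variable expanding into a path at the file system root.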

However, if you are in production, do not change field types. Instead, add a new field with a different name to your object model and remove the old one. This operation is fully supported. You can also use the ALTER TABLE command to add new columns or remove existing ones at run time.

Debugging GC Issues

This section contains information that may be helpful when you need to debug and troubleshoot issues related to Java heap usage or GC pauses.

Heap Dumps

If the JVM throws OutOfMemoryError, configure it to dump the heap automatically the next time the error occurs. This helps when the root cause is not clear and you need a closer look at the heap state at the moment of failure:

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/path/to/heapdump
-XX:OnOutOfMemoryError="kill -9 %p"
-XX:+ExitOnOutOfMemoryError
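
One possible way to apply these flags is through the JVM_OPTS environment variable, assuming your node start script passes JVM_OPTS to the java command. In the sketch below, HEAP_DUMP_DIR is a hypothetical variable; the directory should exist and have enough free space for a full heap dump. The -XX:OnOutOfMemoryError option is left out on purpose, because its quoted value contains spaces and may not survive word splitting when a start script expands the variable.

```shell
# Assumption: HEAP_DUMP_DIR is a hypothetical location with room for a full heap dump.
HEAP_DUMP_DIR="${HEAP_DUMP_DIR:-/path/to/heapdump}"

# Append to any options already set rather than overwriting them.
JVM_OPTS="${JVM_OPTS:-} -XX:+HeapDumpOnOutOfMemoryError"
JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=$HEAP_DUMP_DIR"
JVM_OPTS="$JVM_OPTS -XX:+ExitOnOutOfMemoryError"
export JVM_OPTS

# Print the result so it can be checked before the node is started.
echo "JVM_OPTS=$JVM_OPTS"
```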

Detailed GC Logs

To capture detailed information about GC-related activity, make sure the settings below are configured in the JVM settings of your cluster nodes:

-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=100M
-Xloggc:/path/to/gc/logs/log.txt

Replace /path/to/gc/logs/ with an actual path on your file system.

In addition, for the G1 collector, set the property below. It provides many additional details that are purposefully not included in the -XX:+PrintGCDetails setting:

-XX:+PrintAdaptiveSizePolicy
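
The GC logging flags in this section use the Java 8 syntax. On Java 9 and later, several of them (for example -XX:+PrintGCDateStamps and -XX:+UseGCLogFileRotation) were removed in favor of unified logging through the -Xlog option. A roughly equivalent setup might look like the sketch below, where GC_LOG_DIR is a hypothetical variable and JVM_OPTS is assumed to be passed to the java command by your start script.

```shell
# Assumption: GC_LOG_DIR is a placeholder; replace it with a real directory.
GC_LOG_DIR="${GC_LOG_DIR:-/path/to/gc/logs}"

# gc*            : log all tags that start with "gc"
# file=...       : write to a file instead of stdout
# time,uptime,.. : decorations added to each log line
# filecount/size : keep 10 rotated files of up to 100M each
GC_LOG_OPTS="-Xlog:gc*:file=$GC_LOG_DIR/log.txt:time,uptime,level,tags:filecount=10,filesize=100M"

export JVM_OPTS="${JVM_OPTS:-} $GC_LOG_OPTS"
echo "JVM_OPTS=$JVM_OPTS"
```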

Performance Analysis With Flight Recorder

When you need to debug performance or memory issues, you can use Java Flight Recorder to continuously collect low-level runtime statistics, which enables after-the-fact incident analysis. To enable Java Flight Recorder, use the following settings:

-XX:+UnlockCommercialFeatures
-XX:+FlightRecorder
-XX:+UnlockDiagnosticVMOptions
-XX:+DebugNonSafepoints

To start recording the state on a particular GridGain node, use the following command:

jcmd <PID> JFR.start name=<recording_name> duration=60s filename=/var/recording/recording.jfr settings=profile

For details on Flight Recorder, refer to Oracle’s official documentation.
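
The recording command can be wrapped in a small helper, sketched below. jfr_record is a hypothetical function name, and the PID, recording name, and output path in the example are placeholders. The sketch also calls JFR.check, a jcmd command that lists the recordings on the target JVM, to confirm that the recording is running. On OpenJDK 11 and later, Flight Recorder is available without -XX:+UnlockCommercialFeatures, so that flag applies mainly to Oracle JDK 8.

```shell
# Hypothetical helper: records for 60 seconds on the JVM with the given PID.
# $1 = PID of the node's JVM, $2 = recording name, $3 = output .jfr file
jfr_record() {
  pid="$1"; name="$2"; out="$3"
  jcmd "$pid" JFR.start name="$name" duration=60s filename="$out" settings=profile
  # List recordings on that JVM to confirm that this one is running.
  jcmd "$pid" JFR.check
}

# Example (running "jcmd -l" lists the PIDs of local JVMs):
# jfr_record 12345 node1 /var/recording/recording.jfr
```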

JVM Pauses

Occasionally you may see a warning message about the JVM being paused for too long. This can happen during bulk loading, for example.

Adjusting the IGNITE_JVM_PAUSE_DETECTOR_THRESHOLD timeout setting may give the process time to finish without generating the warning. You can set the threshold via an environment variable, or pass it as a JVM argument (-DIGNITE_JVM_PAUSE_DETECTOR_THRESHOLD=5000) or as a parameter to ignite.sh (-J-DIGNITE_JVM_PAUSE_DETECTOR_THRESHOLD=5000).

The value is in milliseconds.
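
For reference, the three options look like this in a shell session. This is a sketch: 5000 (five seconds) is only the example value used above, and the JVM_OPTS variant assumes that your start script passes JVM_OPTS to the java command. Only one of the three is needed.

```shell
# Option 1: environment variable, set before the node is started.
export IGNITE_JVM_PAUSE_DETECTOR_THRESHOLD=5000

# Option 2: JVM argument (assumes the start script passes JVM_OPTS to java).
# export JVM_OPTS="${JVM_OPTS:-} -DIGNITE_JVM_PAUSE_DETECTOR_THRESHOLD=5000"

# Option 3: parameter to ignite.sh.
# ./ignite.sh -J-DIGNITE_JVM_PAUSE_DETECTOR_THRESHOLD=5000
```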