GridGain Developers Hub
GridGain Software Documentation

Exception Handling

This article outlines the basic exceptions that Ignite and GridGain can generate, and explains how to set up and use the critical failure handler.

Handling Ignite/GridGain Exceptions

The exceptions supported by the Ignite API, and the actions you can take in response to them, are described below. For checked exceptions, see the throws clause in the Javadoc.

| Exception | Description | Action | Runtime exception |
|---|---|---|---|
| IgniteException | Indicates an error condition in the cluster. | Operation failed. Exit from the method. | Yes |
| IgniteClientDisconnectedException | Thrown by the Ignite API when a client node gets disconnected from the cluster. Thrown from cache operations, the compute API, and data structures. | Wait and use retry logic. | Yes |
| IgniteAuthenticationException | Thrown when there is either a node authentication failure or a security authentication failure. | Operation failed. Exit from the method. | No |
| IgniteClientException | Can be thrown from cache operations. | Check the exception message for the action to be taken. | Yes |
| IgniteDeploymentException | Thrown when the Ignite API fails to deploy a job or task on a node. Thrown from the compute grid API. | Operation failed. Exit from the method. | Yes |
| IgniteInterruptedException | Used to wrap the standard InterruptedException into IgniteException. | Retry after clearing the interrupted flag. | Yes |
| IgniteSpiException | Thrown by various SPIs (CollisionSpi, LoadBalancingSpi, TcpDiscoveryIpFinder, FailoverSpi, UriDeploymentSpi, etc.). | Operation failed. Exit from the method. | Yes |
| IgniteSQLException | Thrown when there is a SQL query processing error. This exception also provides query-specific error codes. | Operation failed. Exit from the method. | Yes |
| IgniteAccessControlException | Thrown when there is an authentication or authorization failure. | Operation failed. Exit from the method. | No |
| IgniteCacheRestartingException | Thrown from the Ignite cache API if a cache is restarting. | Wait and use retry logic. | Yes |
| IgniteFutureTimeoutException | Thrown when a future computation times out. | Either increase the timeout limit or exit from the method. | Yes |
| IgniteFutureCancelledException | Thrown when the result of a future computation cannot be retrieved because it was cancelled. | Use retry logic. | Yes |
| IgniteIllegalStateException | Indicates that the Ignite instance is in an invalid state for the requested operation. | Operation failed. Exit from the method. | Yes |
| IgniteNeedReconnectException | Indicates that a node should try to reconnect to the cluster. | Use retry logic. | No |
| IgniteDataIntegrityViolationException | Thrown if a data integrity violation is found. | Operation failed. Exit from the method. | Yes |
| IgniteOutOfMemoryException | Thrown when the system does not have enough memory to process Ignite operations. Thrown from cache operations. | Operation failed. Exit from the method. | Yes |
| IgniteTxOptimisticCheckedException | Thrown when a transaction fails optimistically. | Use retry logic. | No |
| IgniteTxRollbackCheckedException | Thrown when a transaction has been automatically rolled back. | Use retry logic. | No |
| IgniteTxTimeoutCheckedException | Thrown when a transaction times out. | Use retry logic. | No |
| ClusterTopologyException | Indicates an error with the cluster topology (e.g., a crashed node). Thrown from the Compute and Events APIs. | Wait on the future and use retry logic. | Yes |
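Several of the exceptions listed above call for "retry logic". The following is a minimal sketch of one way to structure such a retry loop with a growing delay between attempts. It deliberately uses only the Java standard library so that it runs on its own: `RetrySketch`, `retry`, and `TransientClusterException` are hypothetical names invented for this illustration, and `TransientClusterException` merely stands in for a retryable Ignite exception such as IgniteCacheRestartingException. It is not part of the Ignite or GridGain API.

```java
import java.util.function.Supplier;

public class RetrySketch {
    /** Hypothetical stand-in for a retryable Ignite runtime exception. */
    static class TransientClusterException extends RuntimeException {
        TransientClusterException(String msg) { super(msg); }
    }

    /**
     * Runs op up to maxAttempts times. Only exceptions of the given type are
     * retried; any other exception propagates immediately. The delay between
     * attempts doubles after each failure.
     */
    public static <T> T retry(Supplier<T> op, Class<? extends RuntimeException> retryable,
                              int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1)
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                if (!retryable.isInstance(e) || attempt == maxAttempts)
                    throw e; // not retryable, or no attempts left
            }
            try {
                Thread.sleep(delay); // wait before the next attempt
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // restore the interrupted flag
                throw new IllegalStateException("Interrupted while waiting to retry", ie);
            }
            delay *= 2;
        }
    }

    /** Demo: the operation fails twice, then succeeds. Returns the number of calls made. */
    public static int demo() {
        int[] calls = {0};
        String result = retry(() -> {
            if (++calls[0] < 3)
                throw new TransientClusterException("cache is restarting");
            return "ok";
        }, TransientClusterException.class, 5, 1);
        return "ok".equals(result) ? calls[0] : -1;
    }

    public static void main(String[] args) {
        System.out.println("succeeded after " + demo() + " attempts");
    }
}
```

In a real application, the class passed to `retry` would be one of the retryable exception types from the table. For IgniteClientDisconnectedException specifically, the exception exposes a reconnectFuture() that can be waited on before retrying, which may be preferable to a fixed delay.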

Critical Failures Handling

GridGain is a robust, fault-tolerant system. Even so, unpredictable problems can arise that affect the state of an individual node or of the whole cluster. Such problems can be detected at runtime and handled by a preconfigured critical failure handler.

Critical Failures

The following failures are treated as critical:

  • System critical errors (e.g. OutOfMemoryError).

  • Unintentional system worker termination (e.g. due to an unhandled exception).

  • System workers hanging.

  • Cluster node segmentation.

A system critical error is an error that makes the system inoperable. For example:

  • File I/O errors - usually an IOException thrown by file read/write operations. These can occur when Ignite native persistence is enabled (e.g., when no space is left or on a device error), and also in in-memory mode, because GridGain uses disk storage to keep some metadata (e.g., when the file descriptor limit is exceeded or file access is prohibited).

  • Out of memory error - when the GridGain memory management system fails to allocate more space (IgniteOutOfMemoryException).

  • Out of memory error - when a cluster node runs out of Java heap (OutOfMemoryError).

Failures Handling

When GridGain detects a critical failure, it handles the failure according to a preconfigured failure handler. The failure handler can be configured as follows:

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="failureHandler">
        <bean class="org.apache.ignite.failure.StopNodeFailureHandler"/>
    </property>
</bean>

GridGain supports the following failure handlers:

| Class | Description |
|---|---|
| NoOpFailureHandler | Ignores any failure. Useful for tests and debugging. |
| RestartProcessFailureHandler | A specific implementation that can be used only with ignite.sh or ignite.bat. The process must be terminated using an Ignition.restart(true) call. |
| StopNodeFailureHandler | Stops a GridGain node when a critical error occurs, using an Ignition.stop(true) or Ignition.stop(nodeName, true) call. |
| StopNodeOrHaltFailureHandler | This is the default handler; it tries to stop the node. If the node can’t be stopped, the handler terminates the JVM process. |

Critical Workers Health Check

GridGain has a number of internal workers that are essential for the cluster to function correctly. If one of them is terminated, a GridGain node can become inoperative.

The following system workers are considered mission critical:

  • Discovery worker - discovery events handling.

  • TCP communication worker - peer-to-peer communication between nodes.

  • Exchange worker - partition map exchange.

  • Workers of the system’s striped pool.

  • Data Streamer striped pool workers.

  • Timeout worker - timeouts handling.

  • Checkpoint thread - checkpointing in Ignite persistence.

  • WAL workers - write-ahead logging, segments archiving, and compression.

  • Expiration worker - TTL-based expirations.

  • NIO workers - base networking.

GridGain has an internal mechanism for verifying that critical workers are operational. Each worker is regularly checked to confirm that it is alive and updating its heartbeat timestamp. If a worker stops updating its heartbeat for longer than the configured period, it is regarded as blocked and GridGain prints a message to the log file. This period of inactivity is specified by the IgniteConfiguration.systemWorkerBlockedTimeout property.
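The heartbeat check described above can be pictured with a small standalone sketch. It is only an illustration of the general idea (record a timestamp on each sign of progress, then flag any worker whose timestamp is older than a timeout), not GridGain's internal implementation. The class `WorkerWatchdogSketch`, its methods, and the worker names are all hypothetical; `blockedTimeoutMs` plays the role that systemWorkerBlockedTimeout plays in the text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WorkerWatchdogSketch {
    /** Worker name mapped to its last heartbeat timestamp, in milliseconds. */
    private final Map<String, Long> heartbeats = new LinkedHashMap<>();

    /** Maximum allowed time without a heartbeat before a worker counts as blocked. */
    private final long blockedTimeoutMs;

    public WorkerWatchdogSketch(long blockedTimeoutMs) {
        this.blockedTimeoutMs = blockedTimeoutMs;
    }

    /** Called by a worker each time it makes progress. */
    public synchronized void heartbeat(String worker, long nowMs) {
        heartbeats.put(worker, nowMs);
    }

    /** Returns the workers whose last heartbeat is older than the timeout. */
    public synchronized List<String> blockedWorkers(long nowMs) {
        List<String> blocked = new ArrayList<>();
        for (Map.Entry<String, Long> e : heartbeats.entrySet()) {
            if (nowMs - e.getValue() > blockedTimeoutMs)
                blocked.add(e.getKey());
        }
        return blocked;
    }

    /** Demo with explicit timestamps so the result does not depend on the wall clock. */
    public static List<String> demo() {
        WorkerWatchdogSketch watchdog = new WorkerWatchdogSketch(10_000);
        watchdog.heartbeat("exchange-worker", 0);       // never updates again
        watchdog.heartbeat("checkpoint-thread", 0);
        watchdog.heartbeat("checkpoint-thread", 9_000); // still making progress
        return watchdog.blockedWorkers(12_000);
    }

    public static void main(String[] args) {
        System.out.println("blocked: " + demo()); // prints "blocked: [exchange-worker]"
    }
}
```

Passing timestamps in explicitly keeps the sketch deterministic; a real watchdog would read the clock itself and run the check periodically.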

Even though GridGain considers an unresponsive system worker to be a critical error, it doesn’t handle this situation automatically, other than printing out a message to the log file. If you’d like to enable a particular failure handler for unresponsive system workers of all types, clear the ignoredFailureTypes property of the handler as shown below:

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Treat a critical worker as blocked after one hour without a heartbeat (value in milliseconds). -->
    <property name="systemWorkerBlockedTimeout" value="#{60 * 60 * 1000}"/>

    <property name="failureHandler">
        <bean class="org.apache.ignite.failure.StopNodeFailureHandler">
            <!-- An empty list means no failure type is ignored, so the handler also reacts to unresponsive critical workers. -->
            <property name="ignoredFailureTypes">
                <list>
                </list>
            </property>
        </bean>
    </property>
</bean>
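For applications that configure the node in code rather than in Spring XML, the same settings can be expressed through the Java configuration API. This is a sketch that assumes the ignite-core library is on the classpath; the class name `FailureHandlerConfigSketch` is hypothetical, and the setters used are the Java counterparts of the XML properties shown above.

```java
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeFailureHandler;

public class FailureHandlerConfigSketch {
    public static void main(String[] args) {
        // An empty set means no failure type is ignored, so the handler
        // also reacts to unresponsive (blocked) system workers.
        StopNodeFailureHandler handler = new StopNodeFailureHandler();
        handler.setIgnoredFailureTypes(Collections.emptySet());

        IgniteConfiguration cfg = new IgniteConfiguration();
        // One hour, in milliseconds, matching the XML value above.
        cfg.setSystemWorkerBlockedTimeout(60L * 60 * 1000);
        cfg.setFailureHandler(handler);

        // Starts a node with this configuration.
        Ignite ignite = Ignition.start(cfg);
    }
}
```

A one-hour timeout is simply the value carried over from the XML example; a suitable value depends on how long your workload can legitimately keep a system worker busy.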