
Managing and Monitoring Replication

This chapter describes how to manage and monitor your GridGain Data Center Replication.

Managing Replication Process

The replication process can be managed per cache.

If sender nodes are available, the replication process starts automatically when the master cluster is started. If sender nodes are not available, replication will pause. You will need to start it manually.

Full State Transfer

Full state transfer is the process of sending the entire cache content to the replica cluster.

Before starting the replication process, you must synchronize the caches in the master and replica clusters. If you start with empty caches, you do not need to do anything. However, if the caches in the master cluster already have data, you need to perform a full state transfer.

You can perform a full state transfer in several ways, for example, by invoking the transferTo operation of the cache data replication JMX bean described in the Using JMX Beans to Monitor and Manage Replication section below.

Starting Replication

To start the replication process for a specific cache, use the cache's data replication JMX bean and perform the start operation (see the Using JMX Beans to Monitor and Manage Replication section below). After performing the start operation, GridGain starts processing data updates that occur in the cache.
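For example, assuming you can reach the node's MBean server (locally through the platform MBean server, or through a remote JMX connection), a minimal sketch of starting replication for a hypothetical cache named myCache could look like the following; the wildcard domain in the ObjectName pattern and the exact bean naming may vary by version:

// Obtain the MBean server of the node's JVM (a remote JMX connection works the same way).
MBeanServer server = ManagementFactory.getPlatformMBeanServer();

// Match the cache data replication bean by its group/name properties (see the
// Using JMX Beans to Monitor and Manage Replication section below); the domain
// is node-specific, so a wildcard is used. "myCache" is a placeholder cache name.
ObjectName pattern = new ObjectName("*:group=myCache,name=\"Cache data replication\",*");

for (ObjectName bean : server.queryNames(pattern, null)) {
    // "start" begins replication for this cache; "stop" stops it.
    server.invoke(bean, "start", new Object[0], new String[0]);
}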

Stopping Replication

The replication process for a particular cache can be stopped either manually or automatically. If GridGain detects that the data cannot be replicated for any reason, the replication process will be stopped automatically. Possible reasons include:

  • The sender nodes configured to replicate data are not available.

  • The sender storage is full or corrupted.

A manual stop can be performed through the JMX bean.

Pausing/Resuming Replication

You can pause the replication process between the master cluster and a specific replica cluster by suspending the process on the sender nodes. The sender nodes stop sending data to the replica; all data updates that happen during the pause are accumulated in the sender storage. When you resume the replication process, the sender nodes send the accumulated data to the replica cluster.

The pause and resume operations must be performed on all sender nodes. The Using JMX Beans to Monitor and Manage Replication section explains how to pause/resume replication using JMX beans.

Using JMX Beans to Monitor and Manage Replication

The following JMX Bean provides information about the replication process of a specific cache. The bean can be obtained on any node that hosts the cache.

MBean’s ObjectName:
group=<Cache_Name>,name="Cache data replication"
Attribute | Type | Description | Scope
DrSenderGroup | String | The sender group name. | Global
DrStatus | String | The status of the replication process for this cache. | Global
DrQueuedKeysCount | int | The number of entries waiting to be sent to the sender node. | Node
DrSenderHubsCount | int | The number of sender nodes available for the cache. | Global

Operation | Description
start | Start the replication process for this cache.
stop | Stop the replication process for this cache.
transferTo (id) | Perform a full state transfer of the content of this cache to the given replica cluster ID.
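As an illustration, the sketch below reads the monitoring attributes and triggers a full state transfer through the same bean. It makes the same assumptions as the earlier sketch; "myCache" and the replica data center ID 2 are placeholders, and the byte parameter type of transferTo is an assumption (verify the operation signature, for example in JConsole, for your version):

MBeanServer server = ManagementFactory.getPlatformMBeanServer();
ObjectName pattern = new ObjectName("*:group=myCache,name=\"Cache data replication\",*");

for (ObjectName bean : server.queryNames(pattern, null)) {
    // Check the replication status and the local queue size.
    System.out.println("DrStatus: " + server.getAttribute(bean, "DrStatus"));
    System.out.println("DrQueuedKeysCount: " + server.getAttribute(bean, "DrQueuedKeysCount"));

    // Perform a full state transfer to the replica cluster with ID 2.
    server.invoke(bean, "transferTo", new Object[] {(byte) 2}, new String[] {"byte"});
}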

The following MBean can be obtained on a sender node and allows you to pause/resume the replication process on that node.

MBean’s ObjectName:
group="Data Center Replication",name="Sender Hub"
Operation | Description
pause (id) | Pause replication for the given replica cluster ID. This operation will stop sending updates to the replica cluster. The updates will be stored in the sender storage until the replication is resumed.
pause | Pause replication for all replica clusters.
resume (id) | Resume replication for the given replica cluster ID.
resume | Resume replication for all replica clusters.
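For example, the following sketch pauses and later resumes replication to one replica cluster on a single sender node; repeat it on every sender node. The data center ID 2 is a placeholder, and the byte parameter type is an assumption:

// Run on a sender node (or through a remote JMX connection to it).
MBeanServer server = ManagementFactory.getPlatformMBeanServer();
ObjectName pattern = new ObjectName("*:group=\"Data Center Replication\",name=\"Sender Hub\",*");

for (ObjectName senderHub : server.queryNames(pattern, null)) {
    // Pause replication to the replica cluster with ID 2; updates accumulate
    // in the sender storage until replication is resumed.
    server.invoke(senderHub, "pause", new Object[] {(byte) 2}, new String[] {"byte"});

    // ... later, resume replication to the same cluster.
    server.invoke(senderHub, "resume", new Object[] {(byte) 2}, new String[] {"byte"});
}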

Troubleshooting

When using data replication, you can monitor various statistics provided through JMX beans and events to make sure that the replication process is running smoothly.

The most common problems that you may want to monitor include:

Problem: Size of the sender storage is growing

If the sender storage size on a specific sender node is growing, it means that the sender is not keeping up with the load or there is a problem with the connection to the replica cluster.

How to monitor:

Monitor the metric that shows the size of the sender storage.

Actions:

Check network capacity or add more sender nodes and/or receivers.
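One way to locate the relevant metric, assuming your version exposes the sender node's DR metrics through the Sender Hub MBean described above, is to enumerate the bean's attributes and watch the storage-related ones (exact attribute names vary by version):

MBeanServer server = ManagementFactory.getPlatformMBeanServer();
ObjectName pattern = new ObjectName("*:group=\"Data Center Replication\",name=\"Sender Hub\",*");

for (ObjectName senderHub : server.queryNames(pattern, null)) {
    // Print every attribute the bean exposes; storage-related attributes show
    // how much data is accumulated in the sender storage.
    MBeanInfo info = server.getMBeanInfo(senderHub);
    for (MBeanAttributeInfo attr : info.getAttributes()) {
        System.out.println(attr.getName() + " = " + server.getAttribute(senderHub, attr.getName()));
    }
}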

Problem: Sender storage is full or corrupted

When the sender storage gets full or becomes corrupted (e.g., due to an error), the replication process will stop for all caches.

How to monitor:

Listen to the EVT_DR_STORE_OVERFLOW or EVT_DR_STORE_CORRUPTED events. Refer to the Data Replication Events section for more information.

Actions:

After addressing the issue that caused the sender storage to get full, you have to do a full state transfer.
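For example, building on the listener pattern shown in the Data Replication Events section below, a local listener on a sender node could react to both events. This is a sketch that assumes a started Ignite instance (ignite) whose configuration already includes these event types in IgniteConfiguration.setIncludeEventTypes():

// React to sender storage problems on the local (sender) node.
IgnitePredicate<Event> storeListener = evt -> {
    System.out.println("Sender storage problem detected: " + evt.name());
    return true; // Continue listening.
};

ignite.events().localListen(storeListener,
    org.gridgain.grid.events.EventType.EVT_DR_STORE_OVERFLOW,
    org.gridgain.grid.events.EventType.EVT_DR_STORE_CORRUPTED);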

Problem: There are failed batches

Updated entries are accumulated into batches on the primary nodes; then, each batch is sent to one of the sender nodes. If no sender is available, the batch will be marked as failed. Failed batches will never be resent.

How to monitor:

Monitor the GridDr.senderCacheMetrics("myCache").batchesFailed() metric for all caches that are configured to replicate their data.

Actions:

Make sure that at least one sender is available to all server nodes in the master cluster. You will need to do a full state transfer to synchronize the cache’s content with the replica cluster.
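A minimal sketch of such a check, assuming the GridDr API is obtained from the GridGain plugin of a running Ignite node that hosts the cache ("myCache" is a placeholder cache name):

// Obtain the DR API from the GridGain plugin.
GridGain gg = ignite.plugin("GridGain");
GridDr dr = gg.dr();

// Check the failed-batches counter for the replicated cache.
long failedBatches = dr.senderCacheMetrics("myCache").batchesFailed();

if (failedBatches > 0) {
    // Failed batches are never resent; a full state transfer is required
    // to re-synchronize the cache content with the replica cluster.
    System.out.println("Failed DR batches for myCache: " + failedBatches);
}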

Data Replication Events

In addition to metrics, you can listen to replication-specific events. DR events must be enabled first and can be handled as regular Ignite events. To learn how to listen to specific events, refer to the Working with Events section.

IgniteConfiguration cfg = new IgniteConfiguration();

// Enable the "replication stopped" event type.
cfg.setIncludeEventTypes(org.gridgain.grid.events.EventType.EVT_DR_CACHE_REPLICATION_STOPPED);

// Enable data center replication by assigning the data center ID.
cfg.setPluginConfigurations(new GridGainConfiguration().setDataCenterId((byte) 1));

Ignite ignite = Ignition.start(cfg);

IgniteEvents events = ignite.events();

// Local listener that listens to local events.
IgnitePredicate<DrCacheReplicationEvent> localListener = evt -> {
    CacheDrPauseReason reason = evt.reason();

    System.out.println("Replication stopped. Reason: " + reason);
    return true; // Continue listening.
};

// Listen to the "replication stopped" events.
events.localListen(localListener, org.gridgain.grid.events.EventType.EVT_DR_CACHE_REPLICATION_STOPPED);

The replication event types are defined in the org.gridgain.grid.events.EventType class and are listed in the following table.

Event Type | Event Description | Where Event Occurred
EVT_DR_REMOTE_DC_NODE_CONNECTED | A sender node connects to the node in the replica cluster. | The sender node.
EVT_DR_REMOTE_DC_NODE_DISCONNECTED | A sender node loses connection to the node in the replica cluster. The sender will try to connect to another receiver node if more than one is configured. If no receiver is available, the sender will try to reconnect after the period defined in the DrSenderConfiguration.reconnectOnFailureTimeout property. | The sender node.
EVT_DR_CACHE_REPLICATION_STOPPED | Replication of a specific cache is stopped due to any reason. A full state transfer is required before resuming the replication process. | All nodes that host the primary partitions of the cache.
EVT_DR_CACHE_REPLICATION_STARTED | Replication of a specific cache is started. | All nodes that host the primary partitions of the cache.
EVT_DR_CACHE_FST_STARTED | A full state transfer for a specific cache is started. | All nodes that host the primary partitions of the cache.
EVT_DR_CACHE_FST_FAILED | A full state transfer for a specific cache fails. | All nodes that host the primary partitions of the cache.
EVT_DR_STORE_OVERFLOW | A sender node cannot store entries to the sender storage because the storage is full. Manual restart and a full state transfer may be required after the issue is fixed. | The sender node.
EVT_DR_DC_REPLICATION_RESUMED | Replication to a specific cluster is resumed on a sender node. | The sender node.

Data Replication Incremental State Transfer

The Incremental Data Center Replication mode enables a safer data replication process featuring higher stability, lower memory requirements, and simplified configuration.

Data Replication Incremental State Transfer has the following main advantages:

  • It is not necessary to configure DrSenderStore on the sender node anymore. Instead, the sender uses a non-persistent state transfer buffer to smooth network load spikes and is only responsible for routing data to remote data centers. Data replication does not rely on a sender store and is entirely tolerant of sender-node failure.

  • You do not have to worry about the backup queue size for data replication purposes, which was very hard to configure. The backup queues have been removed, which prevents potential data loss in case of primary-node failure.

  • DR Incremental State Transfer ensures better stability. Even if data replication stops for any reason, such as a DrSenderStore (state transfer buffer) overflow or a sender-node failure, you do not need to start a full state transfer, because all grid nodes track DR progress automatically.

  • It is a fully asynchronous background process and introduces no additional latency to cache operations; its cost does not exceed that of an additional SQL index.

  • DR Incremental State Transfer offers the "Incremental state transfer over snapshot" feature, which enables transferring only a "delta" to the remote DC and may save time and network resources in some cases.

Limitations:

  • IMPORTANT: Rolling upgrade is NOT seamless. The new mode is not compatible with the previous one. DR should be stopped for all caches before the rolling restart procedure.

  • IMPORTANT: Only a single state transfer per cache is allowed at a time.

  • IMPORTANT: The remove operation is not replicated (this will be fixed in future versions).

  • Replication to multiple remote data centers is possible. However, if one of the DCs becomes unavailable, DR may get stuck until the connection to the failed DC is restored.

  • Full state transfer and background replication process use the same buffer on the sender, and recent updates have no priority over full state transfer updates.

Setting the Data Replication State transfer to Incremental mode

Set the GG_INCREMENTAL_STATE_TRANSFER=true system property to start in incremental data replication mode. No additional configuration is needed. Note: all unsupported configuration options are ignored. Use DrSenderConfiguration.setFullStateTransferBufferSize(long bytes) to limit the DR buffer on the sender node.
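For example (a minimal sketch; the 512 MB limit is an arbitrary value), you can pass the property as a JVM option or set it programmatically before the node starts, and cap the buffer in the sender configuration:

// Enable incremental data replication mode (equivalent to passing
// -DGG_INCREMENTAL_STATE_TRANSFER=true as a JVM option).
System.setProperty("GG_INCREMENTAL_STATE_TRANSFER", "true");

// Limit the DR state transfer buffer on the sender node (512 MB here as an example).
DrSenderConfiguration senderCfg = new DrSenderConfiguration();
senderCfg.setFullStateTransferBufferSize(512L * 1024 * 1024);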

Using the "Incremental state transfer over snapshot" feature

To set the remote DC in-sync using the snapshot, perform the following steps:

  1. Take a snapshot in the source DC.

  2. Copy it to the target DC.

  3. Restore the snapshot on the target DC.

  4. Start the Incremental State Transfer, providing the snapshot ID.

Note: For Incremental State Transfer, a user must run a full state transfer (either regular or snapshot-based) to start the IST process. Until then, the old protocol should be maintained in the cluster. Activation of the new protocol must be tied to switching off Rolling Upgrade: a user must manually acknowledge termination of the rolling upgrade mode. After that, the cluster prohibits nodes of older versions from joining and stops the old replication protocol.