GridGain Developers Hub
GridGain Software Documentation

Point-in-Time Recovery

Overview

Continuous Archiving for Point-in-Time Recovery (PITR) lets you recover a cluster to a previous point in time. With PITR, you can roll back the data in the cluster to the state it was in at a moment you choose.

When PITR is enabled, the cluster continually records all operations that modify the data to the write-ahead log (WAL). Recovery consists of two stages: first, GridGain restores a full snapshot; then it applies the operations recorded in the WAL from the time that snapshot was taken up to the required moment. This brings the cluster to the state it was in as of the specified moment.

[Figure: PITR timeline with three snapshots and the WAL archives between them]

In the figure above, three snapshots were created during cluster operation, and we want to restore the cluster to a specific moment between point 2 and point 3. In this case, GridGain takes the most recent full snapshot created before that moment (snapshot 2) and then applies the operations from WAL Archive 2, recreating the state of the cluster at the given moment.

Because PITR replays operations starting from the latest available snapshot, the longer the period between that snapshot and the point you want to restore to, the more operations must be reapplied and the longer the recovery takes. For this reason, create snapshots on a regular basis. Regular snapshots split the lifetime of the cluster into smaller periods, with each snapshot serving as the starting point for recovery to any time in the subsequent period.
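
The choice of starting snapshot described above can be illustrated with a small sketch. It is not GridGain code: the class RecoveryStartPoint and its methods are hypothetical names introduced only for this example, and snapshots are modeled as plain timestamps. It picks the latest snapshot taken at or before the target time and reports how much WAL history would have to be replayed from it.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Illustrative model only; hypothetical helper, not part of the GridGain API. */
final class RecoveryStartPoint {
    private RecoveryStartPoint() {
    }

    /** Returns the latest snapshot time that is not after the target, if one exists. */
    static Optional<Instant> latestSnapshotAtOrBefore(List<Instant> snapshotTimes, Instant target) {
        return snapshotTimes.stream()
            .filter(t -> !t.isAfter(target))
            .max(Comparator.naturalOrder());
    }

    /** Returns the span of WAL history to replay from the chosen snapshot up to the target. */
    static Optional<Duration> replayWindow(List<Instant> snapshotTimes, Instant target) {
        return latestSnapshotAtOrBefore(snapshotTimes, target)
            .map(start -> Duration.between(start, target));
    }
}
```

With daily snapshots, the replay window in this model never exceeds one day. With no snapshot at or before the target, both methods return an empty result, which mirrors the requirement that a snapshot must always be available.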

Write-ahead Log and Continuous Archiving

The WAL keeps track of all operations that were performed on the data. For efficiency reasons, log files contain operations for a fixed period of time. However, if PITR is enabled, GridGain keeps all WAL files permanently, archiving them in a directory specified in DataStorageConfiguration. This process is known as continuous archiving. For more information about WAL files and performance, see Keep WALs Separate.

Data Consistency

To ensure data consistency, transactions that have not finished by the recovery point are disregarded. Similarly, if a series of dependent transactions was in progress at the recovery point, all transactions from the series are ignored and the recovery point is shifted to the moment before the series began. This means that with point-in-time recovery the cluster is restored to the latest consistent state prior to the given point.

Requirements

In order to use PITR, you need to make sure your server and cluster configuration meets the following requirements.

Time Synchronization

All machines running the cluster nodes must be configured to synchronize time via NTP.

Storage Size

When PITR is enabled, the WAL segments will not be automatically deleted. It is, therefore, crucial to make sure that each node has enough disk space.

Consider the following points as general guidelines for managing disk space when PITR is enabled.

Schedule Periodic Snapshot Creation

Snapshots should be created periodically to reduce the number of changes between snapshots and, with it, the time a recovery operation takes. You can use the Snapshots Management Tool (or any other scheduler) to schedule snapshot creation.

The following command sets up a schedule that creates a full snapshot every day at 00:00.

snapshot-utility.sh schedule -command=create -name="snapshot creation schedule" -full_frequency=daily

Move or Delete Old Snapshots Regularly

Because snapshots and WAL files take up a significant amount of disk space, make sure you regularly remove the snapshots you no longer need. Snapshots can be moved or deleted using the Snapshots Management Tool.

To remove a specific snapshot, execute the following command:

snapshot-utility.sh delete -id=snapshot_id

To create a snapshot deletion schedule, use the following command:

snapshot-utility.sh schedule -command=delete -name="snapshot deletion schedule" -ttl=5d -frequency=hourly

This schedule will execute a snapshot deletion command every hour; each command will delete any snapshots that are older than 5 days at the time the command is executed.
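
The age rule applied on each run can be modeled as follows. This is a sketch for illustration only, not how the Snapshots Management Tool is implemented: the class SnapshotTtl and its method are hypothetical, snapshots are represented by their creation times, and the sketch covers only the age check, not the restrictions on which snapshots may be deleted.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative model only; hypothetical helper, not part of the GridGain tooling. */
final class SnapshotTtl {
    private SnapshotTtl() {
    }

    /** Returns the snapshot times that are strictly older than the TTL at the moment "now". */
    static List<Instant> expired(List<Instant> snapshotTimes, Instant now, Duration ttl) {
        // Anything created before this instant has exceeded the TTL.
        Instant cutoff = now.minus(ttl);
        return snapshotTimes.stream()
            .filter(t -> t.isBefore(cutoff))
            .collect(Collectors.toList());
    }
}
```

Because the cutoff is recomputed from the execution time on every run, an hourly schedule with a 5-day TTL in this model removes each snapshot within about an hour of it passing the 5-day mark.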

Functional Limitations

Please consider the following limitations before using PITR in a production environment.

  • PITR is not supported with caches that have disk page compression enabled. In that case, the cache fails to start with an exception such as: "Failed to start cache because disk page compression is enabled."

  • When PITR is enabled, you cannot create snapshots with a subset of caches. You can only create snapshots with all the caches stored in the cluster.

  • Dynamic caches created within one group of caches will be lost if they are not saved in a full snapshot. In other words, a dynamically created cache can be restored only at a point in time after it has been saved in a full snapshot.

  • If you manually remove a snapshot, PITR may fail. Use the provided tools to manage snapshots.

  • You will not be able to move or delete the final snapshot using the Snapshots Management Tool.

  • Because PITR always requires a snapshot to be available, a full snapshot is automatically created during the cluster activation. This first snapshot must be preserved at all times.

  • If you delete a snapshot using the Snapshots Management Tool and then restore the cluster to a time after that snapshot was taken, an earlier snapshot will be used as the starting point.

Enabling Point-in-Time Recovery

To enable continuous archiving for point-in-time recovery, you have to enable snapshots and set the pointInTimeRecoveryEnabled property to true, as follows:

<bean class="org.apache.ignite.configuration.IgniteConfiguration">

  <!-- Enabling the Ignite Native Persistence. -->
  <property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
      <property name="defaultDataRegionConfiguration">
        <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
          <property name="persistenceEnabled" value="true"/>
        </bean>
      </property>
    </bean>
  </property>

  <!-- Enabling the snapshots. -->
  <property name="pluginConfigurations">
    <bean class="org.gridgain.grid.configuration.GridGainConfiguration">
      <property name="snapshotConfiguration">
        <bean class="org.gridgain.grid.configuration.SnapshotConfiguration">
          <property name="pointInTimeRecoveryEnabled" value="true"/>
        </bean>
      </property>
    </bean>
  </property>
</bean>
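
The equivalent Java configuration:

import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.gridgain.grid.configuration.GridGainConfiguration;
import org.gridgain.grid.configuration.SnapshotConfiguration;

IgniteConfiguration cfg = new IgniteConfiguration();

// Enabling the Ignite Native Persistence for the default data region.
DataStorageConfiguration storageCfg = new DataStorageConfiguration();
storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
cfg.setDataStorageConfiguration(storageCfg);

GridGainConfiguration ggCfg = new GridGainConfiguration();

SnapshotConfiguration snapshotCfg = new SnapshotConfiguration();

// Enabling point-in-time recovery.
snapshotCfg.setPointInTimeRecoveryEnabled(true);

// Enabling the snapshots.
ggCfg.setSnapshotConfiguration(snapshotCfg);

cfg.setPluginConfigurations(ggCfg);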

Recovering to Point in Time

To restore the cluster to a specific point in time, use the restore command in the Snapshots Management Tool, and specify the -to parameter. The time must be specified in yyyy-MM-dd-HH:mm:ss.SSS format.
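
A timestamp in the required format can be produced with the standard java.time API, as in the sketch below. The class PitrTimestamp is a hypothetical helper introduced for this example. It uses LocalDateTime because this section does not state which time zone the tool assumes, so check that against your environment before relying on the value.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for building the value passed to the -to parameter. */
final class PitrTimestamp {
    /** The pattern named in the documentation: yyyy-MM-dd-HH:mm:ss.SSS. */
    static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd-HH:mm:ss.SSS");

    private PitrTimestamp() {
    }

    /** Formats a date-time as a recovery point string. */
    static String format(LocalDateTime time) {
        return FORMAT.format(time);
    }

    /** Parses a recovery point string; throws DateTimeParseException if it does not match the pattern. */
    static LocalDateTime parse(String text) {
        return LocalDateTime.parse(text, FORMAT);
    }
}
```

For example, PitrTimestamp.format(LocalDateTime.of(2024, 3, 5, 14, 30, 0)) yields 2024-03-05-14:30:00.000, which would then be supplied as the value of the -to parameter. Parsing the string before running the restore command is a cheap way to catch a malformed recovery point.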