Data Partitioning

Data partitioning is a method of subdividing large sets of data into smaller chunks and distributing them between all server nodes in a balanced manner.

Partitioning is controlled by the affinity function. The affinity function determines the mapping between keys and partitions. Each partition is identified by a number from a limited set (0 to 1023 with the default of 1024 partitions). The set of partitions is distributed among the server nodes available at the moment; thus, each key is mapped to a specific node and is stored on that node. When the number of nodes in the cluster changes, the partitions are redistributed among the new set of nodes through a process called rebalancing.

[Figure: Data partitioning]
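For example, the Affinity API lets you check which partition and which primary node a given key maps to. The sketch below is illustrative only: it assumes a node is already running and a cache named "Person" exists, and uses an arbitrary key.

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.affinity.Affinity;
import org.apache.ignite.cluster.ClusterNode;

// Obtain the Affinity API for an existing cache (the cache name "Person" is assumed).
Ignite ignite = Ignition.ignite();
Affinity<Integer> affinity = ignite.affinity("Person");

// Find the partition number and the primary node for an illustrative key.
int part = affinity.partition(42);
ClusterNode primary = affinity.mapKeyToNode(42);

System.out.println("Key 42 -> partition " + part + " on node " + primary.id());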

The affinity function takes the affinity key as an argument. The affinity key can be any field of the objects stored in the cache (any column in the SQL table). If the affinity key is not specified, the default key is used (in the case of SQL tables, it is the PRIMARY KEY column).

Partitioning boosts performance by distributing both read and write operations. Moreover, you can design your data model in such a way that the data entries that are used together are stored together (i.e. in one partition). When you request that data, only a small number of partitions will be scanned. This technique is called Affinity Collocation (Partition Pruning).
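For example, in the Java API the affinity key can be declared with the @AffinityKeyMapped annotation on a field of a custom key class. The PersonKey class below is a hypothetical sketch: all entries with the same companyId end up in the same partition and, therefore, on the same node.

import org.apache.ignite.cache.affinity.AffinityKeyMapped;

// Hypothetical key class: entries are partitioned by companyId rather than by the whole key,
// so all persons of the same company are stored in the same partition.
public class PersonKey {
    private int personId;

    // The affinity key used by the affinity function.
    @AffinityKeyMapped
    private int companyId;

    public PersonKey(int personId, int companyId) {
        this.personId = personId;
        this.companyId = companyId;
    }
}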

Partitioning helps achieve linear scalability at virtually any scale. You can add more nodes to the cluster as your data set grows, and GridGain will make sure that the data is distributed "equally" among all the nodes.

Affinity Function

The affinity function controls how data entries are mapped onto partitions and partitions onto nodes. The default affinity function implements the rendezvous hashing algorithm. It allows a bit of discrepancy in partition-to-node mapping (i.e. some nodes may be responsible for a slightly larger number of partitions than others); however, it guarantees that, when topology changes, partitions are migrated only to the node that joined or from the node that left. No data exchange will happen between the remaining nodes.
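If you need a non-default partition count, the default affinity function can be configured explicitly on the cache. The sketch below is an assumption-driven example: the cache name "myCache" and the partition count of 512 are illustrative.

import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;

// Configure the rendezvous affinity function explicitly with a custom partition count (512 here).
CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("myCache");
cacheCfg.setAffinity(new RendezvousAffinityFunction(false, 512));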

Partitioned/Replicated Mode

When creating a cache or SQL table, you can choose between partitioned and replicated mode of cache operation. The two modes are designed for different use case scenarios and provide different performance and availability benefits.

PARTITIONED

In this mode, partitions are distributed equally among all server nodes. This is the most scalable distributed cache mode and allows you to store as much data as will fit in the total memory (RAM and disk) available across all nodes. Essentially, the more nodes you have, the more data you can store.

Unlike the REPLICATED mode, where updates are expensive because every node in the cluster needs to be updated, in PARTITIONED mode updates are cheap because only the primary node (and, optionally, one or more backup nodes) needs to be updated for every key. However, reads are somewhat more expensive because only certain nodes have the data cached.
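Here is a minimal sketch of a PARTITIONED cache with one backup copy; the cache name is illustrative, and PARTITIONED is also the default cache mode, so setting it explicitly is optional.

import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

// Partitioned cache with one backup: each key has one primary and one backup node.
CacheConfiguration<Integer, String> partCfg = new CacheConfiguration<>("myPartitionedCache");
partCfg.setCacheMode(CacheMode.PARTITIONED);
partCfg.setBackups(1);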

The picture below illustrates the distribution of a partitioned cache. Essentially, key A is assigned to the node running in JVM1, key B to the node running in JVM3, and so on.

[Figure: Partitioned cache]

REPLICATED

In the REPLICATED mode, all the data (every partition) is replicated to every node in the cluster. This cache mode provides the utmost availability of data as it is available on every node. However, every data update must be propagated to all other nodes, which can impact performance and scalability.

In the diagram below, the node running in JVM1 is a primary node for key A, but it also stores backup copies for all other keys as well (B, C, D).

[Figure: Replicated cache]

Because the same data is stored on all cluster nodes, the size of a replicated cache is limited by the amount of memory (RAM and disk) available on a single node. This mode is ideal for scenarios where cache reads are a lot more frequent than cache writes and the data set is small. If your system performs cache lookups over 80% of the time, consider using the REPLICATED cache mode.
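Here is a minimal sketch of a REPLICATED cache; the cache name is illustrative.

import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

// Replicated cache: every node keeps a full copy of the data,
// so reads on server nodes can always be served locally.
CacheConfiguration<Integer, String> replCfg = new CacheConfiguration<>("myReplicatedCache");
replCfg.setCacheMode(CacheMode.REPLICATED);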

Backup Partitions

By default, GridGain keeps a single copy of each partition (a single copy of the entire data set). In this case, if one or more nodes become unavailable, you lose access to the partitions stored on those nodes. To avoid this, you can configure GridGain to maintain backup copies of each partition.

Backup copies are configured per cache (table). If you configure 2 backup copies, the cluster maintains 3 copies of each partition. One of them is called the primary partition, and the other two are called backup partitions. By extension, the node that stores the primary partition is called the primary node for the keys stored in that partition, and a node that stores a backup partition is called a backup node.

Backup partitions increase the availability of your data and, in some cases, the speed of read operations, since GridGain reads data from backup partitions if they are available on the local node. However, they also increase memory consumption and, if persistence is enabled, the size of the persistent storage.
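The number of backup copies is set per cache via CacheConfiguration. The sketch below (cache name assumed) keeps two backups, i.e. three copies of each partition, and explicitly allows local reads from backup partitions, which is the default behavior.

import org.apache.ignite.configuration.CacheConfiguration;

// Keep two backup copies of every partition (three copies of the data in total).
CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("myCache");
cacheCfg.setBackups(2);

// Allow reads to be served from a backup partition stored on the local node (default: true).
cacheCfg.setReadFromBackup(true);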

Partition Map Exchange

Partition map exchange (PME) is a process of sharing information about partition distribution (partition map) across the cluster so that every node knows where to look for specific keys. PME is required whenever the partition distribution for any cache changes, for example, when new nodes are added to the topology or old nodes leave the topology (whether on user request or due to a failure).

Examples of events that trigger PME include (but are not limited to):

  • A node joins or leaves the topology.

  • A cache starts or stops.

  • An index is created.

When one of the PME-triggering events occurs, the cluster waits for all ongoing transactions to complete and then starts PME. New transactions started during PME are postponed until the process finishes.

The PME process works as follows: the coordinator node requests information about the partitions each node owns, and every node sends its local partition map to the coordinator. Once the coordinator has received the messages from all nodes, it merges them into a full partition map and sends it back to all nodes. PME is considered complete when the coordinator has received a confirmation message from every node.

Rebalancing

When a new node joins the cluster, some of the partitions will be relocated to the new node so that the data remains distributed equally in the cluster. This process is called data rebalancing.

If an existing node permanently leaves the cluster and backups are not configured, you will lose the partitions stored on this node. When backups are configured, one of the backup copies of the lost partitions will become a primary partition and the rebalancing process will be initiated.

GridGain supports both synchronous and asynchronous rebalancing. In the synchronous mode, any call to the cache data is blocked until rebalancing is finished. In the asynchronous mode (the default), rebalancing happens in the background and cache operations are not blocked. You can also disable rebalancing for a cache. See the Configuring Caches section for configuration details.
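The rebalancing mode is set per cache via CacheConfiguration. The sketch below (cache name assumed) makes rebalancing synchronous; ASYNC is the default, and NONE disables rebalancing for the cache.

import org.apache.ignite.cache.CacheRebalanceMode;
import org.apache.ignite.configuration.CacheConfiguration;

// SYNC blocks cache operations until rebalancing is finished;
// ASYNC (the default) rebalances in the background; NONE disables rebalancing.
CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("myCache");
cacheCfg.setRebalanceMode(CacheRebalanceMode.SYNC);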

When Native Persistence is enabled, data rebalancing does not happen automatically, but is triggered by the changes in the baseline topology. See the Baseline Topology and Cluster Activation section for more details.

Partition Loss Policy

It may happen that throughout the cluster’s lifecycle some of the data partitions are lost due to a failure of all primary and backup nodes that held a copy of those partitions. Such a situation leads to partial data loss and needs to be addressed according to your use case. For instance, some applications treat it as an urgent issue and block all write operations to the lost partitions, while others might ignore the event completely because the lost data can be repopulated over time.

GridGain supports the following partition loss policies (a configuration sketch follows the list):

  • READ_ONLY_SAFE - all writes to a cache/table will fail with an exception. Reads will only be allowed for entries belonging to survived/alive partitions. Reads from lost partitions will fail with an exception.

  • READ_ONLY_ALL - reads are allowed from any partitions including the lost ones. Any attempt to write to any partition results in an exception. The result of reading from a lost partition is undefined and may be different on different nodes in the cluster.

  • READ_WRITE_SAFE - all reads and writes are allowed for entries in survived/alive partitions. All reads and writes of entries belonging to the lost partitions will fail with an exception.

  • READ_WRITE_ALL - all reads and writes will proceed as if all partitions are in a consistent state (as if no partition loss happened). The result of reading from a lost partition is undefined and may be different on different nodes in the cluster.

  • IGNORE - lost partitions are never marked as lost: the cluster behaves as if no partition loss has happened and clears the partition loss state right away. Technically, the partition is not added to the collection of lostPartitions, which is the main difference from the READ_WRITE_ALL mode. IGNORE is the default mode.
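The policy is set per cache via CacheConfiguration. The sketch below (cache name assumed) configures READ_WRITE_SAFE and then clears the partition loss state with resetLostPartitions() once the data has been restored or accepted as lost.

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

// Fail reads and writes that touch lost partitions until the loss state is reset.
CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("myCache");
cacheCfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

// Once the lost data has been restored (or accepted as lost), clear the loss state for the cache.
Ignite ignite = Ignition.ignite();
ignite.resetLostPartitions(Collections.singleton("myCache"));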

Cache Groups

For each cache deployed in the cluster, there is always overhead: the cache is split into partitions whose state must be tracked on every cluster node.

If Native Persistence is enabled, then for every partition there will be an open file on the disk that GridGain actively writes to and reads from. Thus, the more caches and partitions you have:

  • The more Java heap will be occupied by partition maps. Every cache has its own partition map.

  • The longer it might take for a new node to join the cluster.

  • The longer it might take to initiate rebalancing if a node leaves the cluster.

  • The more partition files will be kept open and the worse the performance of the checkpointing might be.

Usually, you will not notice any of these problems in deployments with dozens or several hundreds of caches. However, when the number of caches reaches the thousands, the impact can be noticeable.

To avoid this impact, consider using cache groups. Caches within a single cache group share various internal structures, such as partition maps, which speeds up topology event processing and decreases overall memory usage. Note that from the API standpoint, there is no difference whether a cache is part of a group or not.

Cache groups can be created by setting the groupName property of CacheConfiguration. The following equivalent XML, Java, and C# examples show how to assign caches to a specific group:

XML:

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  <property name="cacheConfiguration">
    <list>
      <!-- Partitioned cache for Persons data. -->
      <bean class="org.apache.ignite.configuration.CacheConfiguration">
        <property name="name" value="Person"/>
        <property name="backups" value="1"/>
        <!-- Group the cache belongs to. -->
        <property name="groupName" value="group1"/>
      </bean>
      <!-- Partitioned cache for Organizations data. -->
      <bean class="org.apache.ignite.configuration.CacheConfiguration">
        <property name="name" value="Organization"/>
        <property name="backups" value="1"/>
        <!-- Group the cache belongs to. -->
        <property name="groupName" value="group1"/>
      </bean>
    </list>
  </property>
</bean>

Java:

// Defining cluster configuration.
IgniteConfiguration cfg = new IgniteConfiguration();

// Defining Person cache configuration.
CacheConfiguration personCfg = new CacheConfiguration("Person");

personCfg.setBackups(1);

// Group the cache belongs to.
personCfg.setGroupName("group1");

// Defining Organization cache configuration.
CacheConfiguration orgCfg = new CacheConfiguration("Organization");

orgCfg.setBackups(1);

// Group the cache belongs to.
orgCfg.setGroupName("group1");

cfg.setCacheConfiguration(personCfg, orgCfg);

//Starting the node.
Ignition.start(cfg);

C#:

var cfg = new IgniteConfiguration
{
    CacheConfiguration = new[] {
        new CacheConfiguration {
            Name = "Person",
            Backups = 1,
            GroupName = "group1"
        },
        new CacheConfiguration {
            Name = "Organization",
            Backups = 1,
            GroupName = "group1"
        }
    }
};
Ignition.Start(cfg);

In the above example, the Person and Organization caches belong to group1.

The reason for grouping caches is simple: if you group 1000 caches, you will have 1000x fewer structures that store partition data, partition maps, and open partition files.