Data partitioning is a method of subdividing large sets of data into smaller chunks and distributing them between all server nodes in a balanced manner.
Partitioning is controlled by the affinity function. The affinity function determines the mapping between keys and partitions. Each partition is identified by a number from a limited set (0 to 1024 by default). The set of partitions is distributed between the server nodes available at the moment. Thus, each key is mapped to a specific node and will be stored on that node. When the number of nodes in the cluster changes, the partitions are re-distributed — through a process called rebalancing — between the new set of nodes.
The affinity function takes the affinity key as an argument. The affinity key can be any field of the objects stored in the cache (any column in the SQL table). If the affinity key is not specified, the default key is used (in case of SQL tables, it is the PRIMARY KEY column).
Partitioning boosts performance by distributing both read and write operations. Moreover, you can design your data model in such a way that the data entries that are used together are stored together (i.e. in one partition). When you request that data, only a small number of partitions will be scanned. This technique is called Affinity Collocation (Partition Pruning).
Partitioning helps achieve linear scalability at virtually any scale. You can add more nodes to the cluster as your data set grows, and GridGain will make sure that the data is distributed "equally" among all the nodes.
The affinity function controls how data entries are mapped onto partitions and partitions onto nodes. The default affinity function implements the rendezvous hashing algorithm. It allows a bit of discrepancy in partition-to-node mapping (i.e. some nodes may be responsible for a slightly larger number of partitions than others); however, it guarantees that, when topology changes, partitions are migrated only to the node that joined or from the node that left. No data exchange will happen between the remaining nodes.
When creating a cache or SQL table, you can choose between partitioned and replicated mode of cache operation. The two modes are designed for different use case scenarios and provide different performance and availability benefits.
In this mode, all partitions are split equally between all server nodes. This mode is the most scalable distributed cache mode and allows you to store as much data as will fit in the total memory (RAM and disk) available across all nodes. Essentially, the more nodes you have, the more data you can store.
REPLICATED mode, where updates are expensive because every node in the cluster needs to be updated, with
PARTITIONED mode, updates become cheap because only one primary node (and optionally 1 or more backup nodes) need to be updated for every key. However, reads are somewhat more expensive because only certain nodes have the data cached.
The picture below illustrates the distribution of a partitioned cache. Essentially we have key A assigned to a node running in JVM1, key B assigned to a node running in JVM3, etc.
REPLICATED mode, all the data (every partition) is replicated to every node in the cluster. This cache mode provides the utmost availability of data as it is available on every node. However, every data update must be propagated to all other nodes, which can impact performance and scalability.
In the diagram below, the node running in JVM1 is a primary node for key A, but it also stores backup copies for all other keys as well (B, C, D).
Because the same data is stored on all cluster nodes, the size of a replicated cache is limited by the amount of memory (RAM and disk) available on the node. This mode is ideal for scenarios where cache reads are a lot more frequent than cache writes, and data sets are small. If your system does cache lookups over 80% of the time, then you should consider using the
REPLICATED cache mode.
By default, GridGain keeps a single copy of each partition (a single copy of the entire data set). In this case, if one or multiple nodes become unavailable, you lose access to partitions stored on these nodes. To avoid this, you can configure GridGain to maintain back-up copies of each partition.
Backup copies are configured per cache (table). If you configure 2 backup copies, the cluster will maintain 3 copies of each partition. One of the partitions is called the primary partition, and the other two backup partitions. By extension, the node that has the primary partition is called the primary node for the keys stored in the partition. The node with backup partitions is called the backup node.
Backup partitions increase availability of your data and in some cases the speed of read operations, since GridGain will read data from backed-up partitions if they are available on the local node. However, they also increase memory consumption or the size of the persistent storage if it is enabled.
Partition Map Exchange
Partition map exchange (PME) is a process of sharing information about partition distribution (partition map) across the cluster so that every node knows where to look for specific keys. PME is required whenever the partition distribution for any cache changes, for example, when new nodes are added to the topology or old nodes leave the topology (whether on user request or due to a failure).
Examples of events that trigger PME include (but are not limited to):
A new node joins/leaves the topology.
A new cache starts/stops.
An index is created.
When one of the PME-triggering events occurs, the cluster waits for all ongoing transactions to complete and then starts PME. Also, during PME new transactions are postponed until the process finishes.
The PME process is essentially the following. The coordinator node requests from all nodes the information about the partitions they own. Each node sends this information to the coordinator. Once the coordinator node receives the messages from all nodes, it merges the information into a full partition map and sends it to all nodes. When the coordinator has received confirmation messages from all nodes, PME is considered completed.
When a new node joins the cluster, some of the partitions will be relocated to the new node so that the data remains distributed equally in the cluster. This process is called data rebalancing.
If an existing node permanently leaves the cluster and backups are not configured, you will lose the partitions stored on this node. When backups are configured, one of the backup copies of the lost partitions will become a primary partition and the rebalancing process will be initiated.
GridGain supports both synchronous and asynchronous rebalancing. In the synchronous mode, any call to cache data is blocked until rebalancing is finished. In the asynchronous mode, the rebalancing process is done asynchronously. You can also disable rebalancing for a cache. See the Configuring Caches section for configuration details.
Partition Loss Policy
It may happen that throughout the cluster’s lifecycle some of the data partitions are lost due to a failure of all primary and backup nodes that held a copy of the partitions. This situation leads to a partial data loss and needs to be addressed according to your use case. For instance, some applications treat this as an urgent issue blocking all write operations that go to the lost partitions while others might ignore this event completely because the lost data can be repopulated over time.
Ignite supports the following partition loss policies:
READ_ONLY_SAFE- all writes to a cache/table will fail with an exception. Reads will only be allowed for entries belonging to survived/alive partitions. Reads from lost partitions will fail with an exception.
READ_ONLY_ALL- reads are allowed from any partitions including the lost ones. Any attempt to write to any partition results in an exception. The result of reading from a lost partition is undefined and may be different on different nodes in the cluster.
READ_WRITE_SAFE- all reads and writes are allowed for entries in survived/alive partitions. All reads and writes of entries belonging to the lost partitions will fail with an exception.
READ_WRITE_ALL- all reads and writes will proceed as if all partitions are in a consistent state (as if no partition loss happened). The result of reading from a lost partition is undefined and may be different on different nodes in the cluster.
IGNORE- this mode never marks a lost partition as lost, pretending that no partition loss has happened and clearing the partition loss state right away. Technically, the partition will not be added to the collection of
lostPartitionswhich is the main difference from
IGNOREmode is used by default.