Micro Learning Unit: Understanding Distributed Data in Apache Ignite

Distributed Data

This six-minute Micro Learning Unit explores how distributed data is implemented in Apache Ignite and identifies solutions to three key data challenges: hardware capacity, hardware reliability, and performance issues.

Capacity: It’s usually more cost-efficient to scale capacity horizontally (adding machines) than vertically (buying a bigger machine). For example, an extremely fast CPU typically costs several times as much as two slightly slower CPUs.

Reliability: All hardware eventually fails, and the data caches and databases running on it fail with it. Redundancy increases reliability.

Performance: A single server typically supports hundreds of concurrent queries or tasks, but a workload may need to support thousands, or even hundreds of thousands, of concurrent queries or tasks. Transactional overhead can also grow steeply as workloads increase. Scaling processing capacity works better if the working data for user requests is distributed along with the processing capacity.

Two relevant data distribution techniques are replication and partitioning. Replicating data means adding nodes that hold copies of the same data. Replication increases reliability, but it does not increase data capacity. With partitioning, sometimes called sharding, the data set is split into two or more pieces, or shards, each stored in a different place. Partitioning increases capacity, but on its own it does not improve reliability, because a failure that takes out a shard makes that shard's data unavailable.
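The trade-off between the two techniques can be sketched in a few lines of Python. This is an illustrative model only, not Ignite code: the helper names (replicate, partition, surviving_keys) and the modulo placement rule are assumptions made for the example.

```python
def replicate(records, node_count):
    """Replication: every node holds a full copy of the data."""
    return [dict(records) for _ in range(node_count)]


def partition(records, node_count):
    """Partitioning: each node holds a disjoint subset (shard) of the data.

    Placement by `key % node_count` is an assumption chosen for simplicity.
    """
    shards = [{} for _ in range(node_count)]
    for key, value in records.items():
        shards[key % node_count][key] = value
    return shards


def surviving_keys(nodes, failed_index):
    """Return the keys that are still readable after one node fails."""
    alive = set()
    for index, node in enumerate(nodes):
        if index != failed_index:
            alive.update(node)
    return alive


records = {key: f"value-{key}" for key in range(12)}
replicated = replicate(records, 3)
partitioned = partition(records, 3)

# Replication: each node stores all 12 records, so capacity does not grow,
# but every record survives the loss of a node.
replicated_per_node = [len(node) for node in replicated]
replicated_survivors = surviving_keys(replicated, failed_index=0)

# Partitioning: each node stores only 4 records, so capacity grows with the
# node count, but the records on the failed node become unavailable.
partitioned_per_node = [len(node) for node in partitioned]
partitioned_survivors = surviving_keys(partitioned, failed_index=0)

print(replicated_per_node, len(replicated_survivors))
print(partitioned_per_node, len(partitioned_survivors))
```

With three nodes, replication keeps all twelve records readable after a failure but stores twelve records on every node, while partitioning stores four records per node and loses access to four of them.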

Apache Ignite combines both techniques: its tables are partitioned, and each partition has a configurable number of replicas, or backups. This lets Ignite scale elastically and serve both workloads that benefit from a high degree of redundancy and workloads where horizontal scalability is more important. To find sharded data, we need either a defined scheme, so that an algorithm can tell us which shard to look in, or a coordinator that routes each request to the right shard.
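To make the effect of the backup count concrete, here is a minimal sketch of partitions with a configurable number of copies. It is not how Ignite places data: the round-robin placement, the node names, and the functions assign_owners and available_partitions are all hypothetical, chosen only to show how redundancy changes what survives a failure.

```python
def assign_owners(partition_count, nodes, backups):
    """Map each partition to a primary node followed by its backup nodes.

    Uses simple round-robin placement (an assumption for illustration).
    """
    copies = min(backups + 1, len(nodes))
    return {
        partition: [nodes[(partition + offset) % len(nodes)] for offset in range(copies)]
        for partition in range(partition_count)
    }


def available_partitions(owners, failed_nodes):
    """Partitions that still have at least one copy on a live node."""
    return {
        partition
        for partition, owner_nodes in owners.items()
        if any(node not in failed_nodes for node in owner_nodes)
    }


nodes = ["node-a", "node-b", "node-c", "node-d"]

no_backups = assign_owners(8, nodes, backups=0)
one_backup = assign_owners(8, nodes, backups=1)

# Without backups, losing one node loses every partition it owned.
lost_without_backup = 8 - len(available_partitions(no_backups, {"node-a"}))

# With one backup per partition, a single node failure loses nothing.
lost_with_backup = 8 - len(available_partitions(one_backup, {"node-a"}))

# Two simultaneous failures can still exceed what one backup covers.
lost_two_failures = 8 - len(available_partitions(one_backup, {"node-a", "node-b"}))

print(lost_without_backup, lost_with_backup, lost_two_failures)
```

The design choice is a trade: every additional backup multiplies the storage used per partition, in exchange for tolerating one more simultaneous node failure.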

Using a coordinator introduces additional overhead. It can also introduce a bottleneck for data access and might become a single point of failure. A coordinator that maintains explicit mappings needs its own additional database, which doesn’t scale well to billions of records. 
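The difference between the two lookup approaches can be sketched as follows. The Coordinator class, the shard count, and the hash-based function are hypothetical stand-ins written for this example; they only illustrate why an explicit mapping grows with the number of records while an algorithmic scheme keeps no per-record state.

```python
import hashlib

SHARD_COUNT = 4  # assumed value for the example


def shard_by_algorithm(key):
    """Compute the shard directly from the key; no lookup table is needed."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT


class Coordinator:
    """Routes requests using an explicit key-to-shard directory."""

    def __init__(self):
        self.directory = {}  # one entry per record ever stored
        self._placed = 0

    def place(self, key):
        shard = self._placed % SHARD_COUNT
        self._placed += 1
        self.directory[key] = shard
        return shard

    def locate(self, key):
        # Every read has to pass through this single lookup structure.
        return self.directory[key]


coordinator = Coordinator()
keys = [f"record-{n}" for n in range(1000)]
for key in keys:
    coordinator.place(key)

# The directory holds as many entries as there are records.
directory_size = len(coordinator.directory)

# The algorithmic scheme answers the same question from the key alone,
# and any client can compute it without contacting a central service.
computed_shards = [shard_by_algorithm(key) for key in keys]

print(directory_size, coordinator.locate("record-5"), computed_shards[:5])
```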

A sharding scheme provides much faster access to data but also has some challenges. A poor sharding strategy can produce skewed data that is unevenly distributed between shards. Adding capacity in this situation will not always have the desired benefit. For example, imagine sharding data based on date: all data from a busy day would end up in one shard, and data from a light day in another.
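The date example can be simulated in a few lines. The order volumes, the dates, and the two sharding functions below are made-up values for illustration; the point is only to compare a date-based scheme with one that hashes a unique identifier.

```python
import hashlib
from collections import Counter
from datetime import date

SHARD_COUNT = 4  # assumed value for the example

# Hypothetical workload: one very busy day followed by three light days.
daily_volume = {
    date(2024, 11, 29): 850,
    date(2024, 11, 30): 50,
    date(2024, 12, 1): 50,
    date(2024, 12, 2): 50,
}

orders = []
for day, count in daily_volume.items():
    for _ in range(count):
        orders.append((len(orders), day))  # (order_id, order_date)


def shard_by_date(order):
    """Poor scheme: everything from the same day lands in the same shard."""
    _, order_date = order
    return order_date.toordinal() % SHARD_COUNT


def shard_by_hash(order):
    """Better scheme: hash a unique ID so records spread across shards."""
    order_id, _ = order
    digest = hashlib.sha256(str(order_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT


by_date = Counter(shard_by_date(order) for order in orders)
by_hash = Counter(shard_by_hash(order) for order in orders)

print("by date:", sorted(by_date.values()))
print("by hash:", sorted(by_hash.values()))
```

Under the date scheme one shard holds 850 of the 1,000 orders, so adding more shards would not relieve it. Hashing the order ID spreads the same orders roughly evenly.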

Apache Ignite improves on traditional sharding by using one key to statically map data to partitions and another key to dynamically map partitions or shards to nodes. This means Ignite always knows where to find data. It implements these mappings using the rendezvous, or highest random weight, hashing algorithm as a sharding scheme, which means that data and nodes tend towards being evenly distributed as the cluster size and quantity of data increase. This also means, together with dynamic mapping of partitions to nodes, that if nodes are added to or removed from the cluster, rebalancing of data can occur with minimum disruption and communication between nodes.
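The two-level mapping and the rendezvous idea can be sketched as follows. This is a simplified model written for this summary, not Ignite's implementation: the partition count, the node names, the SHA-256 based weight function, and the function names are all assumptions.

```python
import hashlib

PARTITION_COUNT = 1024  # assumed value for the example


def stable_hash(*parts):
    """Deterministic 64-bit hash of the given parts."""
    data = "|".join(str(part) for part in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def partition_for_key(key):
    """Static mapping: a key always belongs to the same partition."""
    return stable_hash(key) % PARTITION_COUNT


def node_for_partition(partition, nodes):
    """Dynamic mapping via rendezvous (highest random weight) hashing.

    Each node gets a pseudo-random weight for the partition, and the
    node with the highest weight owns it.
    """
    return max(nodes, key=lambda node: stable_hash(node, partition))


def assignment(nodes):
    return [node_for_partition(p, nodes) for p in range(PARTITION_COUNT)]


before = assignment(["node-a", "node-b", "node-c"])
after = assignment(["node-a", "node-b", "node-c", "node-d"])

# Partitions whose owner changed when node-d joined the cluster.
moved = [p for p in range(PARTITION_COUNT) if before[p] != after[p]]

print(f"{len(moved)} of {PARTITION_COUNT} partitions moved")
print("all moved to the new node:", all(after[p] == "node-d" for p in moved))
```

Because each node's weight for a partition does not depend on which other nodes exist, a partition changes owner only when the newly added node wins it. Data therefore moves only to the new node, never between the existing ones, and the key-to-partition mapping is untouched.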

Apache Ignite manages data distribution by partitioning and replication, providing elastic capacity and configurable redundancy for fault tolerance and reliability. This makes it highly scalable, very reliable, and highly performant.

Watch the six-minute Micro Learning Unit now for a quick introduction to distributed data in Apache Ignite, and sign up for GridGain University.