Micro Learning Unit: Colocation and Data Affinity for Apache Ignite


This eight-minute GridGain University Micro Learning Unit explains why colocation and affinity matter to the performance of data queries and computations in Apache Ignite.

The two largest sources of latency in a distributed system are typically network traffic and disk access. In traditional client-server applications, data is constantly moved over the network and is usually read from disk. To reduce latency, Apache Ignite keeps data in memory, distributes both data and computation across the cluster, and provides a transactional SQL environment, including the ability to control where data is located.

By default, Ignite partitions data and spreads the partitions across the nodes in the cluster. For optimal performance and efficiency, however, it’s important to control which partitions and nodes the data lands on, so that queries (especially joins) and computations run where the data they need already resides and network traffic is kept to a minimum.
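The default behavior described above can be pictured with a small sketch. It is a plain-Python illustration, not Ignite code: the partition count, the node names, and the round-robin assignment of partitions to nodes are all assumptions chosen for readability, and a real cluster uses its own affinity function and far more partitions.

```python
import zlib

PARTITIONS = 8                            # illustrative value only
NODES = ["node-a", "node-b", "node-c"]    # hypothetical node names


def partition_for(key) -> int:
    """Map a key to a partition with a stable hash (same key, same partition)."""
    return zlib.crc32(str(key).encode("utf-8")) % PARTITIONS


def node_for(partition: int) -> str:
    """Assign partitions to nodes round-robin; a stand-in for a real affinity function."""
    return NODES[partition % len(NODES)]


def locate(key) -> str:
    """Return the node that would own the given key in this toy model."""
    return node_for(partition_for(key))
```

Because nothing here looks at how records relate to one another, a customer and that customer's withdrawals can easily land on different nodes, which is the problem colocation addresses.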

There are two types, or stages, of colocation to consider: data location and compute location. Data location, considered first, refers to the practice of keeping associated data together on one physical node. For example, imagine a large banking system that has millions of customers stored in a customer table and every bank withdrawal each customer makes is stored in a withdrawal table. By colocating all withdrawals for a particular customer together with said customer’s record, calculating interest or balance totals will be more efficient. 

Compute colocation means executing computations on the same node as the data. Imagine calculating a percentage bank fee on every bank customer’s withdrawals for the last calendar year. Run from a single node, this computation would need to fetch a lot of data over the network, which could lead to significant latency. Colocating compute with the data avoids moving that data and so removes this source of latency.
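To make the difference concrete, here is a minimal sketch that compares fetching the data with sending the computation to it. The cluster is modeled as a dictionary, and the node names, amounts, and fee rate are made up for illustration; no Ignite API is used.

```python
from typing import Dict, List, Tuple

# Hypothetical cluster: node name -> withdrawal amounts stored on that node.
CLUSTER: Dict[str, List[float]] = {
    "node-a": [100.0, 250.0, 40.0],
    "node-b": [75.0, 500.0],
    "node-c": [20.0, 20.0, 60.0, 300.0],
}
FEE_RATE = 0.01  # assumed 1% fee, for illustration only


def fee_by_fetching(cluster: Dict[str, List[float]]) -> Tuple[float, int]:
    """Pull every row to the caller, then compute. Returns (total fee, values moved)."""
    rows: List[float] = []
    for node_rows in cluster.values():
        rows.extend(node_rows)          # every row crosses the "network"
    return sum(r * FEE_RATE for r in rows), len(rows)


def fee_by_colocated_compute(cluster: Dict[str, List[float]]) -> Tuple[float, int]:
    """Compute a partial fee on each node; only one number per node is moved."""
    partials = [sum(r * FEE_RATE for r in node_rows) for node_rows in cluster.values()]
    return sum(partials), len(partials)
```

Both functions return the same total fee, but the first moves one value per withdrawal while the second moves one value per node, and that gap widens as the withdrawal table grows.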

Colocating data with other data, or colocating a compute task with data, requires a strategy for defining where that data should be and what other data should be colocated with it. This is done with an affinity function, which maps each key to a partition. The simplest way to control it is usually to designate one field of the data object or table as the key used for placement. Specifying this key ensures that data objects or table rows are placed in the same partition as other data with a matching key value. Note: Ignite 3.0 uses the term “colocate” directly, while previous versions used the term “affinity key.”
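The idea of an affinity key can be sketched with the banking example. This is a plain-Python illustration rather than Ignite code: the class names, the affinity_value method, and the hash used are assumptions. The point is only that the partition is derived from the customer ID, not from the full key.

```python
import zlib
from dataclasses import dataclass

PARTITIONS = 16  # illustrative value only


def partition_of(affinity_value) -> int:
    """Derive the partition from the affinity value alone, using a stable hash."""
    return zlib.crc32(str(affinity_value).encode("utf-8")) % PARTITIONS


@dataclass(frozen=True)
class CustomerKey:
    customer_id: int

    def affinity_value(self) -> int:
        return self.customer_id


@dataclass(frozen=True)
class WithdrawalKey:
    withdrawal_id: int
    customer_id: int  # the field chosen as the affinity key

    def affinity_value(self) -> int:
        # Only customer_id decides placement; withdrawal_id is ignored here.
        return self.customer_id


customer = CustomerKey(customer_id=1001)
withdrawal = WithdrawalKey(withdrawal_id=777, customer_id=1001)
same_partition = (
    partition_of(customer.affinity_value())
    == partition_of(withdrawal.affinity_value())
)
```

In this model every withdrawal for customer 1001 maps to the same partition as the customer record, so a join between the two never needs to leave the node.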

When designing a cluster topology, it’s important to consider colocation functions or keys to reduce the possibility of skewed data or data that is poorly distributed over a cluster. For example, if the bank were to colocate based on date, all the data from one day would be colocated; however, banking transactions tend to decrease on weekends, so the data would not end up well distributed.
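The skew risk can be illustrated with a small simulation. Everything in it is an assumption made for the sketch: the workload numbers (1,000 withdrawals per weekday, 200 per weekend day, 500 customers), the 32 partitions, and the use of simple modulo in place of a real hash. Skew is measured as the busiest partition's load divided by the mean load across all partitions.

```python
from collections import Counter
from datetime import date, timedelta

PARTITIONS = 32  # illustrative value only


def skew(loads: Counter) -> float:
    """Busiest partition's load divided by the mean load over all partitions."""
    mean = sum(loads.values()) / PARTITIONS
    return max(loads.values()) / mean


# Synthetic workload with made-up numbers: two weeks of withdrawals.
start = date(2024, 1, 1)  # a Monday
withdrawals = []          # list of (day, customer_id) pairs
for offset in range(14):
    day = start + timedelta(days=offset)
    count = 200 if day.weekday() >= 5 else 1000  # fewer on weekends
    for _ in range(count):
        withdrawals.append((day, len(withdrawals) % 500))

# Placement by date: one partition per day, so few partitions hold any data.
by_date = Counter(day.toordinal() % PARTITIONS for day, _ in withdrawals)

# Placement by customer: many distinct key values spread over all partitions.
by_customer = Counter(cust % PARTITIONS for _, cust in withdrawals)

print(f"partitions used by date:     {len(by_date)}  skew {skew(by_date):.2f}")
print(f"partitions used by customer: {len(by_customer)}  skew {skew(by_customer):.2f}")
```

With these assumptions, placement by date leaves most partitions empty and makes weekday partitions several times busier than the average, while placement by customer stays close to even. A key with many distinct, evenly used values tends to distribute better than one with few values or uneven usage.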

Keeping disk access and network traffic to a minimum is the key to Apache Ignite’s exceptional performance.

Watch this eight-minute Micro Learning Unit now to better understand colocation and data affinity, and sign up for GridGain University here.