How to Improve Apache Ignite Query Performance with Summary Tables

Summary Tables

To improve application and query performance, developers frequently use summary tables. The summary table pattern is where we feed data into an Apache Ignite or GridGain cluster into two tables: the original data; and a summary, or rollup, of that data.

With GridGain’s SQL engine, it’s not always necessary to have a summary table! For many, perhaps most, use cases, the fact that the data is held in memory and distributed efficiently means that even a complex “group by” expression can be performed very quickly. The additional memory used for the summary table would be hard to justify in this case.

However, when the lowest latency is required, dynamically performing large aggregations will not cut it. Here we typically “pre-aggregate” the data, creating a summary table that can be queried directly. With this approach, clients have seen ten to a hundred times improvement in query execution time, though the exact benefit varies dramatically. (GridGain's consulting team can help you figure out the best approach.)

The most common solution I see is where GridGain is not involved in generating the summary. Instead, some upstream application builds the summary and writes the latest values to the GridGain cluster at the same time as the underlying data.

This option works and is perhaps the best you can do with other, more basic caching solutions. We can do better with Apache Ignite and GridGain!

We’re going to explore two possible other solutions. You can find a project on GitHub that shows them both.

Service Grid

First, let’s have a quick sidebar on a feature that we will use for both options: the Service Grid. The Service Grid is a way of executing your own code on GridGain nodes. You can have as many services as you like and configure where they run. A “cluster singleton” is where one copy runs across the whole cluster; a “node singleton” is where a copy of the service runs on every node in the cluster. Of course, it’s entirely customisable. It’s sometimes described as the ability to run “stored procedures” on a GridGain cluster. In the sense that you can provide an API for clients to call, this is true. For our “summary” use case, we will use another feature: the ability to have long-running processes. 

Before we start, let’s define our prerequisites. We will co-locate the base table and summary tables. Co-location ensures that the summary and source records will be on the same node. Check out the documentation if you’re not familiar with affinity and co-location.

Refreshing the Summary Table on a Schedule

The first option we will look at involves periodically firing a SQL query and writing the results into the cluster.

There’s some code to set up the service, but the gist is that we create a timer and start it in the “execute” method of our service. Whenever the timer is triggered, we make a summary using a SQL query and write that into the summary table.

var query = new SqlFieldsQuery("select b.foreignKey, sum(b.value) from RollupValue b group by b.foreignKey")
        .setLocal(true);
try (var cursor = sourceCache.query(query)) {
    for (var r : cursor) {
        Long id = (Long) r.get(0);
        BigDecimal value = (BigDecimal) r.get(1);
        destCache.put(new RollupKey(id, rollupNumber), new RollupValue(value.longValue()));
    }
}

Note that the SQL query is executed only against the current, local node. Making each node work only on its local data means that you can increase throughput by scaling out the cluster.

Some optimisations can be made here, for example, checking that the value needs to be updated before the put statement, but this code works well.

The nice thing about this solution is that it’s relatively straightforward. Its performance depends not on the number of updates per second pumped through the system but on the number of rows stored on a node. The flipside is that the summary table is always slightly out of date. Also, if you delete the last base record with the above implementation, the summary record is not removed. I left that out to keep the code simple.

This option works well when relatively few summary records get updated frequently. Since Ignite scales horizontally, what we mean by “relatively few” will depend on the size of your cluster but think of hundreds or thousands rather than millions of records.

Update the Summary Table with a Continuous Query

There is another option if you need the summary table to always be up to date. This variation trades throughput for improved latency, but that may be a price worth paying.

We use another feature of GridGain called a Continuous Query. A Continuous Query listens for updates in a table and executes a code block when it happens. As with the previous implementation, we’ll run one copy of our service on every node, meaning the solution can scale.

We need that block of code to be as fast as possible, so we don’t want to execute a SQL query every time. Instead, the event callback gives us both the old and the new record so we can calculate the delta ourselves.

public void onUpdated(Iterable<CacheEntryEvent<? extends RollupKey, ? extends RollupValue>> cacheEntryEvents) throws CacheEntryListenerException {
    for (var r : cacheEntryEvents) {
        var key = new RollupKey(r.getKey().getForeignKey(), rollupNumber);
        var l = destCache.invoke(key, (CacheEntryProcessor<RollupKey, RollupValue, Long>) (entry, arguments) -> {
            Long val = null;
            switch (r.getEventType()) {
                case CREATED:
                    val = r.getValue().getValue();
                    break;
                case UPDATED:
                    val = r.getValue().getValue() - r.getOldValue().getValue();
                    break;
                case REMOVED:
                    val = -r.getOldValue().getValue();
                    break;
            }
            Long oldVal = 0L;
            if (entry.exists()) {
                oldVal = entry.getValue().getValue();
            }
            if (val != null) {
                entry.setValue(new RollupValue(oldVal + val));
            }
            return val;
        });
    }
}

Since this is an event-driven solution, the main disadvantage is that the summary will be incorrect if you miss an event. The way a continuous query works, you won’t miss an event while it’s running, but if data is loaded into the cluster before the service is deployed, you would not see the update. You can fix this by adding an initial query and some extra logic to refresh the summary.

Continuous queries work well when you have a lot of summary records, many of which will not be updated regularly. The cost of updating a single row will be a little higher, but won’t be necessary for most of them.

Conclusion

“Pre-aggregating” a table is a valuable method of improving query performance in cases where large GROUP BY expressions would ordinarily be used. As we have seen, there are some features in GridGain and Apache Ignite that you can use to streamline this process.