January 28, 2020

Memory access is so much faster than disk I/O that many of us expect to gain striking performance advantages by merely deploying a distributed in-memory cluster and start reading data from it. However, sometimes we overlook the fact that a network interconnects cluster nodes with our applications, and it can quickly diminish the positive effects of having an in-memory cluster if a lot of data gets transferred continuously over the wire.

That said, using proper data access patterns provided by distributed in-memory technologies can negate the effect of the network latency. In this article, we're using the APIs of Apache Ignite in-memory computing platform to see how the performance of our application changes if we put less pressure on the communication channels. The ultimate goal is to be able to deploy horizontally scalable in-memory clusters that can tap into the pool of RAM and CPUs spread across all machines with minimal impact of the network.

Running a simple experiment

For the sake of simplicity, let's assume that our Apache Ignite in-memory cluster stores thousands of records, and the application needs to calculate the highest temperature and the longest distance across the data set. Three APIs are to be compared to show how performance changes if network utilization is minimized - individual key-value calls, bulk key-value reads, and co-located computations.

A local laptop is suitable for this experiment. Thus, I deployed a two-node Ignite cluster on my machine and ran this example over 200,000 records preloaded to the nodes (MacBook Pro of Early 2015 with 2.7 GHz Dual-Core Intel Core i5 and 8 GB 1867 MHz DDR3). Those two cluster nodes and the app communicated via the loopback network interface and competed for RAM and CPU resources. If we run the same test in a truly distributed environment, then the difference between the compared APIs will be more noteworthy. I encourage you to experiment with other deployment configurations by following the instructions provided in the example.

Stressing out the network with thousands of key-value calls

We're starting with key-value APIs that pull each record one by one from the two nodes. Ignite provides a standard cache.get(key) API method for that (check calcByPullingDataFromCluster for full implementation):


for (int i = 0; i < RECORDS_CNT; i++) {
    SampleRecord record = recordsCache.get(i);

    if (record.getDistance() > longestDistance)
        longestDistance = record.getDistance();

    if (record.getTemperature() > highestTemp)
        highestTemp = record.getTemperature();

    // Running other custom logic...
  }

What we do here can be seen as a brute-force approach as long as the application reads all 200,000 records and does the same or more network roundtrips. Unsurprisingly, it took ~35 seconds for the app to finish the calculation. If this method of data access is selected for similar calculations, then we might not win at all by keeping data in RAM vs. disk since a bunch of records are transferred over the network.

Speeding up by dropping down the number of network roundtrips

The first obvious optimization for our experiment is to reduce the number of network roundtrips between two Ignite server nodes and the application. Ignite has the cache.getAll(keys) version of key-value API that queries data in bulk. The following code snippet shows how to use the API for our task (full implementation can be found in the calcByPullingDataInBulkFromCluster method):


Collection<SampleRecord> records = recordsCache.getAll(keys).values();

Iterator<SampleRecord> recordsIterator = records.iterator();

while (recordsIterator.hasNext()) { 
    // Calculating highest temperature and longest distance
}

With this approach, our calculation finishes in ~5 secs, which is 5x faster in comparison to individual key-value calls used previously. The application still reads all 200,000 records from the server nodes, but it does it in a few network roundtrips. Ignite cache.getAll(keys) does this optimization for us - when we pass the primary keys, Ignite first maps the keys to the server nodes that store the records, and then connects to the server nodes reading data in bulk.

Removing network impact with co-located computations

Finally, let's see what happens if we stop pulling 200,000 records from the server nodes to the application. With Ignite compute APIs, we can wrap our calculation into an Ignite compute task that will be executed on the server nodes over their local data set. The application only receives results from the servers and no longer pulls the actual data; thus, there is almost no network utilization (full implementation can be found in the calcByComputeTask method):


// Step #1: Application sends the compute task

Collection<Object[]> resultsFromServers = compute.broadcast(new IgniteCallable<Object[]>() {
           
    @Override public Object[] call() throws Exception {
        // Step #2: Calculating the highest temperature and longest distance on each server node
    }
}

// Step #3: Application selects the highest temperature and longest distance by checking results from servers

This co-located compute-based approach completes in ~1 second, which is 5x faster than the cache.getAll(keys) solution and 35x more performant than issuing individual key-value requests. Moreover, if we load X times more data into the cluster, the compute-based approach will keep scaling linearly while cache.getAll(keys) will be slowing down.

Should we stick to co-located compute?

The goal of this experiment was to show that with distributed in-memory systems, the performance of our applications can vary greatly depending on the way we access distributed data sets since the network glues the cluster and applications together. This experiment does not discourage you from using key-value APIs or sticking to co-located computations only. If an application needs to read an individual record and return it to the end-user or join two data sets, then go ahead with key-value calls or SQL queries. If the logic is more complicated or data-intensive, then consider compute APIs to eliminate or reduce the network traffic. In short, know the APIs an in-memory technology provides and choose those that will be most performant for a particular task.

Why my in-memory cluster underperforms? Negating network impact.

Running a simple experiment

Stressing out the network with thousands of key-value calls

Speeding up by dropping down the number of network roundtrips

Removing network impact with co-located computations

Should we stick to co-located compute?

Share this

Why my in-memory cluster underperforms? Negating network impact.

Running a simple experiment

Stressing out the network with thousands of key-value calls

Speeding up by dropping down the number of network roundtrips

Removing network impact with co-located computations

Should we stick to co-located compute?

Share this

Subscribe to our technical newsletter