Implementing ‘anti-caching’ and NoSQL large OLAP workloads

The forces of big data and fast data are so overwhelming that a "data-driven" business like ours needs to react to these changes and build capabilities that match the needs of the evolving data economy.

GridGain, as well as other in-memory data grid providers, has for many years used the argument that the price of RAM drops over time.

This has meant that entire datasets could be held and processed in memory at a cost per GB that was competitive with disk. While datasets were not large, perhaps a few tens of GB at most, this approach held true; that is, until now.

The advent of the big data and fast data movement, however, necessitates a change in approach. The quantities of data, and the nature of the processing that needs to be done in OLAP (recommendations, personalized content, machine learning, analytics, etc.), mean that the effective data set that needs to be held in memory is much larger.

The velocity of change for this data is also much higher: think of trading platforms, social media, and ad exchanges, where data must be ingested, processed, analyzed, and sometimes persisted very quickly. So the economics of in-memory have been skewed to some extent. It is no longer clear cut that a dataset can fit entirely in memory, even if that data is distributed. Terabytes of data are not uncommon, but terabytes of memory are still the preserve of very large and well-financed organizations.

It has therefore become imperative to find a new approach, and one such approach is to implement the "Anti-Caching" pattern [1]. The main idea of this pattern is to use main memory as the primary store for hot data, retaining the benefits of pure-play in-memory, and to move cold data to disk as a secondary store - the opposite of caching. Part of the reasoning follows Jim Gray's "five-minute rule" [2], which speaks about trading disk accesses for memory:

“...with today’s technologies, if a 1KB record is accessed at least once every 30 hours, it is not only faster to store it in memory than on disk, but also cheaper (to enable this access rate only 2 percent of the disk space can be utilized).”
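The trade-off behind this quote can be made concrete with the break-even formula from the five-minute rule [2]: the break-even interval in seconds is (pages per MB of RAM / accesses per second per disk) multiplied by (price per disk drive / price per MB of RAM). The sketch below simply evaluates that formula. The class name, method name, and the input figures in `main` are illustrative assumptions, and they are not the figures behind the 30-hour number quoted above.

```java
public class FiveMinuteRule {

    /**
     * Break-even interval in seconds: a page accessed more often than this
     * is cheaper to keep in RAM than to fetch from disk on each access.
     */
    public static double breakEvenSeconds(double pagesPerMbOfRam,
                                          double accessesPerSecondPerDisk,
                                          double pricePerDiskDrive,
                                          double pricePerMbOfRam) {
        return (pagesPerMbOfRam / accessesPerSecondPerDisk)
             * (pricePerDiskDrive / pricePerMbOfRam);
    }

    public static void main(String[] args) {
        // Assumed, illustrative inputs: 4 KB pages (256 per MB), a disk that
        // serves 83 random accesses per second, $80 per drive, $0.047 per MB of RAM.
        double seconds = breakEvenSeconds(256, 83, 80.0, 0.047);
        System.out.printf("break-even interval: %.0f s (%.2f hours)%n",
                          seconds, seconds / 3600.0);
    }
}
```

Under this formula, cheaper RAM or a more expensive or slower disk lengthens the interval, so data that is accessed less often still earns its place in memory.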

So as disk densities go up, RAM actually gets cheaper for random access reads. Data economics therefore leads us to use disk (usually meaning flash or SSD) as a secondary store, while retaining in main memory any data that is accessed as infrequently as once every 30 hours. This capability is expressed in the new "Durable Memory" architecture introduced as of Ignite 2.1 [3] and donated by GridGain to Apache Ignite. It also extends querying across memory and disk in a read-through pattern to underpin the "Anti-Caching" pattern.
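To illustrate the pattern itself, here is a minimal single-threaded sketch of an anti-cache. Memory is the primary store, the least recently used entries are moved (not copied) to a cold tier when memory is full, and a read of a cold key brings it back into memory. The class `AntiCache` and its methods are hypothetical names for illustration, a `HashMap` stands in for disk, and values are assumed to be non-null. This is not how Ignite's Durable Memory is implemented.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class AntiCache<K, V> {
    private final int hotCapacity;
    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<K, V> hot = new LinkedHashMap<>(16, 0.75f, true);
    // Stand-in for the on-disk secondary store (assumption for this sketch).
    private final Map<K, V> cold = new HashMap<>();

    public AntiCache(int hotCapacity) {
        if (hotCapacity < 1) {
            throw new IllegalArgumentException("hotCapacity must be at least 1");
        }
        this.hotCapacity = hotCapacity;
    }

    public void put(K key, V value) {
        cold.remove(key);      // a key lives in exactly one tier
        hot.put(key, value);
        evictColdest();
    }

    public V get(K key) {
        V value = hot.get(key);        // a hit also marks the key as recently used
        if (value != null) {
            return value;
        }
        value = cold.remove(key);      // miss in memory: pull the record back from "disk"
        if (value != null) {
            hot.put(key, value);
            evictColdest();
        }
        return value;
    }

    // Move least recently used entries to the cold tier until memory fits.
    private void evictColdest() {
        Iterator<Map.Entry<K, V>> it = hot.entrySet().iterator();
        while (hot.size() > hotCapacity && it.hasNext()) {
            Map.Entry<K, V> eldest = it.next();
            cold.put(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }

    public boolean isHot(K key)  { return hot.containsKey(key); }
    public boolean isCold(K key) { return cold.containsKey(key); }
}
```

The difference from a conventional cache is that the disk tier holds only what memory has evicted, rather than a full copy of the data that memory merely mirrors.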

Companies utilizing big data and fast data have been forced to choose between scalability and the consistency they enjoy with an RDBMS, and more and more of them have had to choose scalability because of increased data volumes. This has led them to choose NoSQL, Hadoop, etc. for their scaling characteristics. However, using these inconsistent databases and data stores in mission-critical business applications takes a lot of expertise and hand-crafted defensive coding.

NoSQL stores generally rely on the principle of "eventual consistency": an update to one replica will eventually make its way to the others. This type of consistency does not allow for transactional consistency, where there must be a guarantee that an update was definitely copied to all replicas.

Apache Ignite, with its new "Durable Memory" and ACID semantics, offers both consistency and scalability. Users can take advantage of the anti-caching pattern via Ignite, either through Durable Memory or by continuing to use NoSQL/Hadoop as a secondary store, and at the same time go back to a traditional, easier-to-use, RDBMS-like interaction. Apache Ignite achieves this via its CacheStore interface, which allows it to act as the primary (in-memory) store for MongoDB, Cassandra, Hadoop, etc., and even traditional databases like MySQL at the same time.
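The read-through and write-through behaviour of an in-memory store sitting in front of a system of record can be sketched in plain Java as follows. All names here (`BackingStore`, `MapStore`, `ThroughCache`) are hypothetical and exist only for illustration. Ignite's actual CacheStore interface has a different and richer signature, and a real deployment would also handle transactions, concurrency, and eviction.

```java
import java.util.HashMap;
import java.util.Map;

public class ReadWriteThrough {

    /** Hypothetical system-of-record contract (not Ignite's CacheStore API). */
    public interface BackingStore<K, V> {
        V load(K key);
        void write(K key, V value);
        void delete(K key);
    }

    /** In-process stand-in for an external database; counts loads for inspection. */
    public static class MapStore<K, V> implements BackingStore<K, V> {
        public final Map<K, V> rows = new HashMap<>();
        public int loads = 0;

        @Override public V load(K key) { loads++; return rows.get(key); }
        @Override public void write(K key, V value) { rows.put(key, value); }
        @Override public void delete(K key) { rows.remove(key); }
    }

    /** Memory-first store that reads through and writes through to the backing store. */
    public static class ThroughCache<K, V> {
        private final Map<K, V> memory = new HashMap<>();
        private final BackingStore<K, V> store;

        public ThroughCache(BackingStore<K, V> store) { this.store = store; }

        public V get(K key) {
            V value = memory.get(key);
            if (value == null) {
                value = store.load(key);        // read-through on a memory miss
                if (value != null) {
                    memory.put(key, value);
                }
            }
            return value;
        }

        public void put(K key, V value) {
            store.write(key, value);            // write-through: persist first,
            memory.put(key, value);             // then make it visible in memory
        }

        public void remove(K key) {
            store.delete(key);
            memory.remove(key);
        }
    }
}
```

Writing to the backing store before updating memory is a deliberate choice in this sketch: if the store rejects the write, memory never exposes a value that was not persisted.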

Ultimately, Apache Ignite’s low latency will enable a database with whatever level of consistency, from a fully ACID-compliant RDBMS to a NoSQL offering with fewer consistency guarantees, to scale out much further horizontally than would otherwise be possible. Apache Ignite also does this in a much more operationally effective way than many NoSQL databases can, due to its master-less architecture and shared-nothing clustering.

Apache Ignite is fully ACID transaction compliant. As such, it brings ACID guarantees back within the reach of database users whose scalability needs had grown past the point where they could use an RDBMS. This transactional model is also already familiar to Ops teams, which further enhances operational efficiency by building on current expertise.

So I think Apache Ignite’s ability to front databases and data stores (NoSQL, Hadoop-based, and traditional), retaining consistency guarantees while bringing scalability, is an extraordinary capability. Apache Ignite also offers the power of a full SQL interface to NoSQL stores that have minimal or zero SQL query capabilities, like Cassandra (CQL) and MongoDB (native API).

Apache Ignite has its own "Durable Memory", and this is a stand-out feature, but read/write-through to a system of record, via the CacheStore interface to any database, is just as powerful if we consider that it brings low latency, consistency, and scalability to traditional databases.

As we begin to think more about the move to the Anti-Caching pattern, the term in-memory becomes slightly moot. Instead, we are able to talk about low-latency, scalable, consistent, durable, and transactional systems without any conflict between those properties.

Additionally, we are then able to talk about the utility of the SQL Grid as a capability that allows customers to retain business logic and application expertise, with the Service Grid and Streaming Grid providing powerful functionality for modern application architectures. These intrinsic capabilities make Apache Ignite a singularly powerful engine for building the most demanding low-latency, transactional, consistent, scalable, and operationally effective solutions.

[1]. DeBrabant, Pavlo, Tu et al., "Anti-Caching: A New Approach to Database Management System Architecture". http://www.vldb.org/pvldb/vol6/p1942-debrabant.pdf

[2]. The Five-Minute Rule 20 Years Later - ACM Queue, http://queue.acm.org/detail.cfm?id=1413264

[3]. Apache Ignite: Durable Memory. https://apacheignite.readme.io/v2.1/docs/durable-memory