In-Memory Data Grid: Explained...

In-Memory Compute Grids and In-Memory Data Grids are the fundamental components of an in-memory architecture. The goal of an In-Memory Data Grid (IMDG) is to provide exceptionally high availability of data by keeping that data in memory in a highly distributed and parallelized manner. Because they can load terabytes of data into memory, IMDGs can handle most of today's Big Data processing requirements.

What is an In-Memory Data Grid?

An in-memory data grid (IMDG) is a distributed object store that functions similarly to a concurrent hash map: you store objects under keys. Unlike many other key-value systems, which restrict keys and values to byte arrays or strings, an IMDG lets you use any domain object as a key or a value.

This means that you can store the same objects your business logic works with, without the explicit marshaling and de-marshaling steps that alternative technologies require. It also makes the grid easy to use: in most cases you can interact with it as if it were a simple hash map. Working directly with domain objects is one key distinction between an IMDG and an In-Memory Database (IMDB). An IMDB requires Object-To-Relational Mapping, which often introduces significant performance overhead; an IMDG does not.
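
As a rough illustration of the hash-map style of access described above, here is a minimal sketch in Python. A plain `dict` stands in for the grid, and `AccountKey` and `Account` are hypothetical domain classes invented for this example. A real IMDG would still serialize objects when they cross the network, and it would typically require an explicit put to publish a change, which the sketch mimics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountKey:
    """Hypothetical composite key; frozen so instances are hashable."""
    region: str
    account_id: int


@dataclass
class Account:
    """Hypothetical domain object stored directly as the value."""
    owner: str
    balance: float


# Assumption: a plain dict stands in for the grid's map-like API.
grid = {}

key = AccountKey("EU", 42)
grid[key] = Account(owner="Ada", balance=100.0)

# Look the object up with an equal key and work on it as a domain object.
account = grid[AccountKey("EU", 42)]
account.balance += 25.0

# Write the updated object back, as a distributed grid would usually require.
grid[key] = account
```

The application code never converts keys or values to strings or byte arrays itself; any such conversion would be the grid's concern.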

IMDGs have distinct features that set them apart from other products such as NoSQL databases, IMDBs, or NewSQL databases. One key difference is scalable data partitioning across a cluster. In essence, an IMDG is a distributed hash map in which each key is cached on a specific cluster node, so the larger the cluster, the more data it can cache. The key to this architecture is colocating your processing with the cluster nodes where the data is cached. All cache operations then become local, minimizing or eliminating data movement within the cluster. In a well-designed IMDG, data does not move at all while the topology is stable; it moves only when new nodes join or existing nodes leave, which may require some repartitioning within the cluster.
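
The partitioning behavior described above can be sketched with rendezvous (highest-random-weight) hashing. This is only one possible key-to-node scheme, chosen here for illustration; the text does not say which scheme any particular IMDG uses. The node names and the `owner` helper are hypothetical.

```python
import hashlib


def owner(key, nodes):
    """Return the node with the highest hash score for this key."""
    def score(node):
        digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(nodes, key=score)


keys = [f"k{i}" for i in range(1000)]
nodes = ["node-a", "node-b", "node-c"]

# Stable topology: every key maps to exactly one node, deterministically.
before = {k: owner(k, nodes) for k in keys}

# Topology change: a fourth node joins the cluster.
after = {k: owner(k, nodes + ["node-d"]) for k in keys}

# Only keys whose new owner is the joining node change placement.
moved = [k for k in keys if before[k] != after[k]]
```

With this scheme, repeated lookups on an unchanged node list always return the same owner, and a join relocates only the keys that the new node takes over; no key moves between the existing nodes.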

The diagram below depicts a typical IMDG with a key set of {k1, k2, k3}, where each key is assigned to a different node. The external database component is optional. If it is present, the IMDG typically reads data from and writes data to the database automatically, a pattern commonly called read-through and write-through.

In-memory Datagrid graphic
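
The automatic retrieval from and storage to an optional database, often called read-through and write-through, can be sketched as follows. The `ReadWriteThroughCache` class is hypothetical, a `dict` stands in for the external database, and the `loads` counter exists only to make the behavior observable.

```python
class ReadWriteThroughCache:
    """Minimal single-node sketch of read-through / write-through caching."""

    def __init__(self, store):
        self._store = store   # assumption: a dict stands in for the database
        self._memory = {}     # in-memory entries
        self.loads = 0        # number of times the backing store was read

    def get(self, key):
        # Serve from memory when possible.
        if key in self._memory:
            return self._memory[key]
        # Read-through: on a miss, load from the backing store and cache it.
        self.loads += 1
        value = self._store.get(key)
        if value is not None:
            self._memory[key] = value
        return value

    def put(self, key, value):
        # Write-through: update the backing store and memory together.
        self._store[key] = value
        self._memory[key] = value


database = {"k1": "v1"}
cache = ReadWriteThroughCache(database)

first = cache.get("k1")    # miss: loaded from the database
second = cache.get("k1")   # hit: served from memory
cache.put("k2", "v2")      # also written to the database
```

The caller only ever talks to the cache; whether a value came from memory or from the database is hidden behind `get` and `put`.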

IMDGs also offer transactional ACID support, which ensures data consistency within the cluster. Typically, a two-phase commit (2PC) protocol is employed for this purpose. Different IMDGs use different underlying locking mechanisms, and more advanced implementations often incorporate concurrency control mechanisms such as MVCC (multi-version concurrency control). These techniques minimize network communication while preserving transactional ACID consistency with very good performance.
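
The basic shape of a two-phase commit can be sketched as below. This is a deliberately simplified, in-process illustration: the `Participant` class and `two_phase_commit` function are hypothetical, and the sketch omits locking, timeouts, logging, and crash recovery, all of which a real implementation needs.

```python
class Participant:
    """A node that stages updates in phase 1 and applies them in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumption: simulates a "no" vote
        self.data = {}
        self._staged = None

    def prepare(self, updates):
        # Phase 1: stage the updates and vote yes or no.
        if not self.can_commit:
            return False
        self._staged = dict(updates)
        return True

    def commit(self):
        # Phase 2: make the staged updates visible.
        self.data.update(self._staged or {})
        self._staged = None

    def rollback(self):
        # Discard anything staged during phase 1.
        self._staged = None


def two_phase_commit(work):
    """work is a list of (participant, updates) pairs; returns True on commit."""
    prepared = []
    for participant, updates in work:
        if participant.prepare(updates):
            prepared.append(participant)
        else:
            # Any "no" vote aborts the transaction on every prepared node.
            for p in prepared:
                p.rollback()
            return False
    for participant, _ in work:
        participant.commit()
    return True


a, b = Participant("a"), Participant("b")
committed = two_phase_commit([(a, {"x": 1}), (b, {"y": 2})])

c = Participant("c", can_commit=False)
aborted = two_phase_commit([(a, {"x": 99}), (c, {"z": 3})])
```

In the second transaction, node `c` votes no, so the update staged on node `a` is discarded and `a` keeps its previously committed value.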

In-Memory Data Grid vs. NoSQL Databases

Transactional data consistency is a key distinguishing factor between IMDGs and NoSQL databases. NoSQL databases typically follow the principle of Eventual Consistency (EC), under which data may be temporarily inconsistent as long as it eventually becomes consistent. In EC-based systems, writes are generally fast, but reads are only about as fast as the writes. The latest IMDGs with an optimized two-phase commit (2PC) protocol, however, can match or even outperform EC-based systems on write performance while being substantially faster on reads. The industry has in effect come full circle: it moved from the slower 2PC implementations of the past to EC, and is now moving from EC to an optimized 2PC that is often significantly faster.

Products optimize 2PC in different ways, but the goal is typically the same: increase concurrency, minimize network overhead, and reduce the number of locks needed to complete a transaction. For instance, Google's Spanner, a globally distributed database, relies on a transactional 2PC approach because it offers a faster and more straightforward way to ensure data consistency and achieve high throughput than alternatives such as MapReduce or EC.

Although IMDGs generally share some fundamental functionality, there are significant differences in features and implementation details among vendors. When evaluating an IMDG product, it is crucial to consider factors such as eviction policies, (pre)loading techniques, concurrent repartitioning, and memory overhead. Additionally, the ability to query data at runtime should be taken into account. Some IMDGs, like GridGain, offer the capability to query in-memory data using standard SQL, including support for distributed joins, which is quite rare.
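
To make one of the evaluation factors above concrete, here is a minimal sketch of an eviction policy. It uses least-recently-used (LRU) eviction, which is just one common policy picked for illustration; the text does not say which policies a given product supports. The `LruCache` class is hypothetical.

```python
from collections import OrderedDict


class LruCache:
    """Bounded cache that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        # A read marks the entry as most recently used.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        # Over capacity: drop the entry at the least recently used end.
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def keys(self):
        # Keys ordered from least to most recently used.
        return list(self._entries)


cache = LruCache(capacity=2)
cache.put("k1", "v1")
cache.put("k2", "v2")
cache.get("k1")          # k1 becomes the most recently used entry
cache.put("k3", "v3")    # capacity exceeded, so k2 is evicted
```

Which entry is evicted, and when, directly affects hit rates and memory overhead, which is why the policy is worth checking when comparing products.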

The typical use case for IMDGs involves partitioning data across a cluster and executing computations on the nodes where the data resides. Since these computations are typically part of Compute Grids and require proper deployment, load balancing, failover, and scheduling, the integration between Compute Grids and IMDGs plays a crucial role. It is particularly advantageous if both In-Memory Compute and Data Grids are integrated within the same product and utilize the same APIs. This eliminates the need for separate integration efforts and often results in highly performant and reliable systems.
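
The idea of running computations where the data resides can be sketched as follows. The `Node` and `Cluster` classes are hypothetical, everything runs in one process, and simple modulo hashing is used for key placement only to keep the example short. In a real grid, the task would be shipped to each node over the network, with the compute grid handling deployment, load balancing, and failover.

```python
import zlib


class Node:
    """Holds one partition of the data and runs tasks against it locally."""

    def __init__(self, name):
        self.name = name
        self.local = {}

    def run_local(self, task):
        # The task sees only this node's entries.
        return task(self.local)


class Cluster:
    def __init__(self, names):
        self.nodes = [Node(n) for n in names]

    def node_for(self, key):
        # Assumption: modulo hashing stands in for the grid's partitioner.
        return self.nodes[zlib.crc32(key.encode()) % len(self.nodes)]

    def put(self, key, value):
        self.node_for(key).local[key] = value

    def map_reduce(self, task, reducer):
        # Each node computes over its own partition; only the small
        # partial results are combined afterwards.
        partials = [node.run_local(task) for node in self.nodes]
        return reducer(partials)


cluster = Cluster(["node-a", "node-b", "node-c"])
for i in range(100):
    cluster.put(f"order-{i}", i)

total = cluster.map_reduce(lambda entries: sum(entries.values()), sum)
```

Only one partial sum per node leaves its partition, rather than all 100 entries, which is the point of colocating computation with data.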

In-Memory Computing graphic

In-Memory Data Grids (IMDGs), along with Compute Grids, are utilized across various industries in a wide range of applications. These include Risk Analytics, Trading Systems, Bioinformatics, eCommerce, and Online Gaming. Essentially, any project that faces challenges related to scalability and performance can greatly benefit from the use of In-Memory Processing and IMDG architecture.