In-Memory Compute Grid
In-Memory Compute Grid is one of the two core technologies that form the foundation of GridGain. GridGain Systems pioneered the Java-based In-Memory Compute Grid in 2007, and today it represents the most feature-rich and advanced platform for parallel processing and distributed computations on the JVM.
Computational grid technology, often referred to as MapReduce, provides the means to distribute processing logic, i.e. to parallelize computations across more than one computer.

More specifically, computational grids or MapReduce-type processing define a method of splitting the original compute task into multiple sub-tasks, executing these sub-tasks in parallel on any managed infrastructure, and aggregating (a.k.a. reducing) the results back into one final result.
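The split, execute and reduce steps described above can be sketched in plain Java. This is only a local illustration and does not use the GridGain API: a thread pool stands in for the grid nodes, and the class and method names (SplitAndReduceSketch, sumRange) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Local stand-in for a compute grid: a thread pool plays the role of the grid nodes.
final class SplitAndReduceSketch {
    // Splits the range [from, to) into at most `parts` sub-tasks, sums each chunk
    // in parallel, then reduces the partial sums into one final result.
    static long sumRange(long from, long to, int parts) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            // Split step: one job per chunk of the range.
            List<Callable<Long>> jobs = new ArrayList<>();
            long chunk = (to - from + parts - 1) / parts;
            for (long start = from; start < to; start += chunk) {
                final long lo = start;
                final long hi = Math.min(start + chunk, to);
                jobs.add(() -> {
                    long acc = 0;
                    for (long i = lo; i < hi; i++) acc += i;
                    return acc;
                });
            }
            // Execute step (invokeAll waits for all jobs), then reduce step.
            long total = 0;
            for (Future<Long> partial : pool.invokeAll(jobs)) total += partial.get();
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumRange(0, 1_000_000, 4)); // prints 499999500000
    }
}
```

On a real grid the jobs would be serialized and sent to remote nodes, but the shape of the computation stays the same.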
Key Features
GridGain provides comprehensive In-Memory Compute Grid and MapReduce capabilities:
- In-Memory MapReduce
- Pluggable failover, topology and collision resolution
- Distributed task session
- Distributed continuations & recursive split
- Support for Streaming MapReduce
- Support for Complex Event Processing (CEP)
- Node-local cache
- AOP-based, OOP/FP-based, synch/asynch execution modes
- Support for direct closure distribution in Java, Scala and Groovy
- Cron-based scheduling
- Direct redundant mapping support
- Zero deployment with P2P class loading
- Partial asynchronous reduction
- Direct support for weighted and adaptive mapping
- State checkpoints for long running tasks
- Early and late load balancing
- Affinity routing with data grid
Zero Deployment
GridGain is the first platform whose In-Memory Data Grid features a zero-deployment capability: users simply bring default GridGain nodes online, and the nodes immediately become part of the data grid topology and can store any user objects without any explicit deployment of the user’s classes.
In-Memory MapReduce
MapReduce is a distributed computing paradigm that lets you map your task into smaller jobs based on some key, execute these jobs on grid nodes, and reduce the multiple job results into one task result. This is essentially what GridGain’s MapReduce does. However, GridGain MapReduce differs from other MapReduce frameworks, such as Hadoop, in that it is geared towards streaming, low-latency, in-memory processing.
Where a Hadoop MapReduce task takes its input from disk, produces intermediate results on disk and writes its result to disk, GridGain does everything Hadoop does in memory: it takes input from memory via direct API calls, produces intermediate results in memory and creates the result in memory as well. Full in-memory processing allows GridGain to provide results in under a second where other MapReduce frameworks would take minutes.
Collocation of Computations & Data
One of the major scalability problems in using data grids is unnecessary noise traffic, which may consume a significant amount of bandwidth and can often bring a server to its knees. Imagine a scenario where you are using a partitioned cache and constantly have to retrieve various key-value pairs from the cache and perform some computation on them. In partitioned mode, any given key-value pair may or may not be cached on the local node, so it often has to be fetched from a remote node. Once the data is fetched and brought to the local node, you perform the computation on it and, once you are done, the data you just requested is most likely discarded. It may be cached in the Near cache on the local node (which is the default GridGain behavior), but Near caches are generally much smaller than partitioned caches and have more aggressive eviction policies. To summarize, most of the data fetched from caches is either discarded immediately or discarded shortly after, thus creating unnecessary noise traffic.
It is much more effective to bring the computation to the node where the data resides than to bring the data to the computation. In the vast majority of cases computations are much smaller to transport over the network, and they change much less frequently, if at all.
Collocation of computations and data is often called Affinity Routing, highlighting that computations and data have an affinity between them and that computation jobs are routed based on this affinity. In GridGain such collocation is easily achieved via compute and data grid integration.
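The routing idea can be illustrated with a small local model. The sketch below does not use the GridGain API: Node, ownerOf and affinityCall are hypothetical names, each "node" is just a map partition with its own worker thread, and a simple hash stands in for a real affinity function. The point is that the closure travels to the partition that owns the key, so the value itself is never moved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class AffinityRoutingSketch {
    // Hypothetical stand-in for a grid node: its own data partition plus its own worker thread.
    static final class Node {
        final Map<String, Integer> partition = new ConcurrentHashMap<>();
        final ExecutorService worker = Executors.newSingleThreadExecutor();
    }

    final List<Node> nodes = new ArrayList<>();

    AffinityRoutingSketch(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) nodes.add(new Node());
    }

    // Simplified affinity function: the same key always maps to the same node.
    Node ownerOf(String key) {
        return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
    }

    void put(String key, int value) {
        ownerOf(key).partition.put(key, value);
    }

    // Ships the (small) computation to the node that owns the key
    // instead of pulling the value to the caller.
    <R> R affinityCall(String key, Function<Integer, R> job) throws Exception {
        Node owner = ownerOf(key);
        Callable<R> task = () -> job.apply(owner.partition.get(key));
        return owner.worker.submit(task).get();
    }

    void shutdown() {
        nodes.forEach(n -> n.worker.shutdown());
    }

    public static void main(String[] args) throws Exception {
        AffinityRoutingSketch grid = new AffinityRoutingSketch(3);
        grid.put("order-42", 20);
        System.out.println(grid.affinityCall("order-42", v -> v * 2)); // prints 40
        grid.shutdown();
    }
}
```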
Distributed Continuations & Recursive Split
Continuations are useful when jobs need to be suspended and their resources released. For example, if you spawn a new task from within a job, it would be wrong to wait synchronously for that task to complete, because the job thread would remain occupied while waiting and your grid may therefore run out of threads. The proper approach is to suspend the job so that it can be continued later, for example when the newly spawned task completes.
This is where GridGain continuations become really helpful. GridGain allows users to suspend and resume jobs at any point. So in our example, where a remote job needs to spawn another task and wait for its result, the job would start the task execution and then suspend itself. Then, whenever the new task completes, the job would wake up and resume its execution. This approach allows for easy task nesting and recursive task execution. It also makes it possible to have many more cross-dependent jobs and tasks in the system than there are available threads.
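The thread-starvation problem and its fix can be shown with standard Java futures. This is an analogy rather than the GridGain continuation API: ContinuationSketch and parentJob are hypothetical names, and CompletableFuture chaining stands in for suspending and resuming a job. The pool has a single worker thread, so a parent job that blocked on its nested task would never finish, while the non-blocking version completes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ContinuationSketch {
    // A "parent job" that spawns a nested task on the same pool.
    // Instead of blocking on the nested result, it registers a continuation,
    // so no pool thread is held while the nested task is pending.
    static CompletableFuture<Integer> parentJob(ExecutorService pool, int input) {
        return CompletableFuture
            // First half of the parent job.
            .supplyAsync(() -> input + 1, pool)
            // Spawn the nested task; the parent does not wait for it on a thread.
            .thenCompose(x -> CompletableFuture.supplyAsync(() -> x * 10, pool))
            // The parent resumes only once the nested task has completed.
            .thenApplyAsync(nested -> nested + 5, pool);
    }

    public static void main(String[] args) {
        // One worker thread only: blocking inside a job would deadlock here.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            System.out.println(parentJob(pool, 1).join()); // prints 25
        } finally {
            pool.shutdown();
        }
    }
}
```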
Support for Streaming MapReduce
Streaming MapReduce usually refers to streaming data sets and the ability to process them immediately, as the data arrives. Unlike other MapReduce frameworks, which spawn separate external processes that read data from disk files and write output to disk files (even when working in streaming mode), GridGain MapReduce works seamlessly on streaming data directly in memory. As data comes into the system, the user can keep spawning MapReduce tasks and distributing them to any set of remote nodes, where the data is processed in parallel and the result is returned to the caller. The main advantage is that all MapReduce tasks execute directly in memory and can take input and store results using GridGain in-memory caching, thus providing very low latencies.
Cron-Based Scheduling
In addition to running MapReduce tasks directly on the whole grid or on any user-defined portion of the grid (a virtual subgrid), you can schedule your tasks to run repeatedly as often as you need. GridGain supports Cron-based scheduling syntax for tasks, so you can schedule them using the familiar standard Cron syntax.
For example, this Cron syntax would schedule a task to run every 2 seconds up to 5 times: "{2,5} * * * * *".
Direct Support for Redundant Mapping
In some cases a guarantee of a timely, successful result matters much more than avoiding redundant work. In such cases GridGain allows you to spawn multiple copies of the same job within your MapReduce task and execute them in parallel on remote nodes. As soon as the first job completes successfully, the other identical jobs are cancelled and ignored. This approach gives a much higher guarantee of timely, successful job completion at the expense of redundant executions. Use it whenever your grid is not overloaded and spending CPU on redundancy is not costly.
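The first-success-wins behavior has a close analogue in the standard Java library, which makes for a compact local sketch. The code below does not use GridGain: RedundantMappingSketch and runRedundantly are hypothetical names, and ExecutorService.invokeAny supplies the semantics (return the result of one copy that completed without an exception, and cancel the copies that have not completed).

```java
import java.util.Collections;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class RedundantMappingSketch {
    // Submits `copies` identical copies of the job and returns the result of the
    // first copy that completes successfully; unfinished copies are cancelled.
    static <T> T runRedundantly(Callable<T> job, int copies) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(copies);
        try {
            return pool.invokeAny(Collections.nCopies(copies, job));
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger attempts = new AtomicInteger();
        // The first copy to start simulates a failed node; another copy still succeeds.
        String result = runRedundantly(() -> {
            if (attempts.incrementAndGet() == 1)
                throw new IllegalStateException("simulated node failure");
            return "done";
        }, 3);
        System.out.println(result); // prints done
    }
}
```

If every copy fails, invokeAny throws an ExecutionException, which corresponds to the whole redundant job failing.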
Partial Asynchronous Reduction
Sometimes when executing MapReduce tasks you don’t need to wait for all the remote jobs to complete in order for your task to complete. A good example is a simple search. Assume that you are searching for some pattern in data cached in the GridGain data grid on many remote nodes. Once the first job returns with the pattern found, you don’t need to wait for the other jobs to complete, as you have already found what you were looking for. For cases like this GridGain allows you to reduce (i.e. complete) your task before all the results from remote jobs are received, hence the name “partial asynchronous reduction”. The remaining jobs belonging to your task are then cancelled across the grid.
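The search example above can be modelled locally as follows. This is a minimal sketch and not GridGain code: PartialReductionSketch and findPartition are hypothetical names, each list stands in for the data cached on one node, and a CompletionService delivers job results in completion order so the task can be reduced as soon as one job reports a hit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PartialReductionSketch {
    // Returns the index of a partition that contains `pattern`, or -1 if none does.
    // The task completes on the first hit without waiting for the remaining jobs.
    static int findPartition(List<List<String>> partitions, String pattern) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        CompletionService<Integer> results = new ExecutorCompletionService<>(pool);
        List<Future<Integer>> jobs = new ArrayList<>();
        try {
            // Map step: one search job per partition.
            for (int p = 0; p < partitions.size(); p++) {
                final int index = p;
                jobs.add(results.submit(() -> partitions.get(index).contains(pattern) ? index : -1));
            }
            // Partial reduce: consume results in completion order and stop at the first hit.
            for (int done = 0; done < jobs.size(); done++) {
                int hit = results.take().get();
                if (hit >= 0) return hit;
            }
            return -1;
        } finally {
            // Cancel whatever is still running; its result is no longer needed.
            jobs.forEach(j -> j.cancel(true));
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> partitions = List.of(
            List.of("alpha", "beta"), List.of("gamma", "needle"), List.of("delta"));
        System.out.println(findPartition(partitions, "needle")); // prints 1
    }
}
```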
AOP-based Execution
Aspect Oriented Programming (AOP) provides a way to cut across the code base and intercept it by attaching different behaviors to certain logic. One of the biggest advantages of AOP is that it happens transparently to the main code execution. For example, AOP-based grid-enabling in GridGain is done by attaching the @Gridify annotation to any Java method. Once GridGain finds such an annotation, the method is automatically executed on the grid instead of running locally.
GridGain supports all main AOP frameworks out-of-the-box, including AspectJ, JBoss AOP, and Spring AOP.
Support for Complex Event Processing (CEP)
The concept of complex event processing usually refers to the ability to receive events from various data sources and infer certain patterns and behaviors from them. CEP is most often associated with real-time, low-latency processing that enables smart decisions where seconds matter, for example serving a targeted ad on a website. GridGain supports CEP in several ways:
- By providing Streaming MapReduce capability, which allows spawning low-latency in-memory MapReduce tasks to provide responses in real time.
- By utilizing GridGain messaging (including Actor-based functional APIs), which allows streamlined message processing and the ability to maintain conversational state between two endpoints across multiple message exchanges.
Also note that with continuous mapping, users can keep adding jobs to the same MapReduce task even after the mapping step has already happened, which makes it possible to continuously add data to an already running task.
Adaptive and Weighted Load Balancing
When some grid nodes are more powerful or have more resources than others, you can run into scenarios where nodes are under-utilized or over-utilized. Under-utilization and over-utilization are equally bad for a grid; ideally all nodes in the grid should be equally utilized. GridGain provides several ways to achieve equal utilization across the grid.
Weighted Load Balancing
If you know in advance that some nodes are, say, 2 times more powerful than others, you can attach proportional weights to the nodes. For example, part of your grid nodes would get a weight of 1 and the other part would get a weight of 2. In this case job distribution is proportional to node weights, and nodes with a heavier weight get proportionally more jobs assigned to them than nodes with a lower weight. So nodes with weight 2 get 2 times as many jobs as nodes with weight 1.
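The proportional distribution can be sketched with a simple weighted round-robin. This is an illustration under stated assumptions, not GridGain's balancer: WeightedBalancerSketch and assign are hypothetical names, nodes are identified by plain strings, and weights are small positive integers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WeightedBalancerSketch {
    // Assigns `jobCount` jobs to nodes in proportion to their integer weights
    // and returns how many jobs each node received.
    static Map<String, Integer> assign(Map<String, Integer> weights, int jobCount) {
        // A node with weight w occupies w slots in the rotation.
        List<String> slots = new ArrayList<>();
        weights.forEach((node, weight) -> {
            for (int i = 0; i < weight; i++) slots.add(node);
        });
        if (slots.isEmpty()) throw new IllegalArgumentException("no node has a positive weight");

        Map<String, Integer> assigned = new LinkedHashMap<>();
        weights.keySet().forEach(node -> assigned.put(node, 0));
        // Walk the rotation once per job.
        for (int job = 0; job < jobCount; job++)
            assigned.merge(slots.get(job % slots.size()), 1, Integer::sum);
        return assigned;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("nodeA", 1);
        weights.put("nodeB", 2);
        System.out.println(assign(weights, 300)); // prints {nodeA=100, nodeB=200}
    }
}
```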
Adaptive Load Balancing
When nodes are not equal and you don’t know exactly how different they are, GridGain automatically adapts to differences in load and processing power, sending more jobs to more powerful nodes and fewer jobs to weaker nodes. GridGain achieves this by listening to various metrics on the nodes and constantly adapting its load balancing policy to the differences in load.
Distributed Task Session & Connected Tasks
A distributed task session is created for every task execution and allows state to be shared between the different jobs within the task. Jobs can add attributes, get them, and wait for them to be set, which allows grid jobs and tasks to remain connected and synchronize their execution with each other. This opens up solutions to a whole new range of problems.
Imagine, for example, that you need to compress a very large file (say, terabytes in size). To do that in a grid environment you would split the file into multiple sections and assign every section to a remote job for execution. Every job would have to scan its section to look for repetition patterns. Once this scan is done by all jobs in parallel, the jobs would need to synchronize their results with their siblings so that compression happens consistently across the whole file. This can be achieved by having every job set the repetition patterns it discovered into the session.
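The set/wait pattern between sibling jobs can be modelled locally like this. The sketch is not the GridGain session API: TaskSessionSketch, setAttribute, waitForAttribute and siblingJob are hypothetical names, and the "session" is an in-process map of futures rather than state replicated across nodes. Each job publishes the number of patterns it found and then waits for its sibling's value.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class TaskSessionSketch {
    // One future per attribute: set completes it, wait blocks on it.
    private final ConcurrentMap<String, CompletableFuture<Object>> attrs = new ConcurrentHashMap<>();

    private CompletableFuture<Object> slot(String key) {
        return attrs.computeIfAbsent(key, k -> new CompletableFuture<>());
    }

    // In this simplified model only the first value set for a key is kept.
    void setAttribute(String key, Object value) {
        slot(key).complete(value);
    }

    // Blocks until some job sets the attribute, or the timeout elapses.
    Object waitForAttribute(String key, long timeoutMs) throws Exception {
        return slot(key).get(timeoutMs, TimeUnit.MILLISECONDS);
    }

    // A job publishes its own finding, then synchronizes with its sibling.
    static int siblingJob(TaskSessionSketch session, String me, String sibling, int found) throws Exception {
        session.setAttribute(me, found);
        int theirs = (Integer) session.waitForAttribute(sibling, 5_000);
        return found + theirs;
    }

    public static void main(String[] args) throws Exception {
        TaskSessionSketch session = new TaskSessionSketch();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Callable<Integer> jobA = () -> siblingJob(session, "job-0", "job-1", 3);
            Callable<Integer> jobB = () -> siblingJob(session, "job-1", "job-0", 4);
            Future<Integer> a = pool.submit(jobA);
            Future<Integer> b = pool.submit(jobB);
            System.out.println(a.get() + " " + b.get()); // prints 7 7
        } finally {
            pool.shutdown();
        }
    }
}
```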
Local Node Memory Space
When working in a distributed environment you often need a consistent local state per grid node that is reused between job executions. For example, what if multiple jobs require a database connection pool for their execution? How do they get this connection pool initialized once and then reused by all jobs running on the same grid node? Essentially you can think of it as a per-grid-node singleton service, but the idea is not limited to services; it can be just a regular Java bean that holds some state to be shared by all jobs running on the same grid node.
GridGain 3.0 introduced GridNodeLocal, a per-grid-node local cache. The name was borrowed from the ThreadLocal class in Java: just as ThreadLocal provides a unique space per thread in Java, GridNodeLocal provides a unique space per grid node in GridGain. GridNodeLocal implements the java.util.concurrent.ConcurrentMap interface. In fact, it simply extends the java.util.concurrent.ConcurrentHashMap implementation and therefore inherits all the methods available there, along with its thread-safety characteristics (retrievals generally do not block).
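The initialize-once-and-share pattern looks roughly like this. The sketch uses a plain ConcurrentHashMap as a stand-in for the node-local map, since obtaining a GridNodeLocal instance is not shown in this document; NodeLocalSketch, ConnectionPool and the "connPool" key are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class NodeLocalSketch {
    // Stand-in for the per-node map (assumption: the real one is a GridNodeLocal).
    static final ConcurrentMap<String, Object> NODE_LOCAL = new ConcurrentHashMap<>();
    static final AtomicInteger POOLS_CREATED = new AtomicInteger();

    // Hypothetical expensive resource that should exist once per node.
    static final class ConnectionPool {
        ConnectionPool() {
            POOLS_CREATED.incrementAndGet();
        }
    }

    // Every job calls this; ConcurrentHashMap.computeIfAbsent runs the
    // initializer atomically, so only one pool is ever created.
    static ConnectionPool pool() {
        return (ConnectionPool) NODE_LOCAL.computeIfAbsent("connPool", k -> new ConnectionPool());
    }

    public static void main(String[] args) throws Exception {
        ExecutorService jobs = Executors.newFixedThreadPool(8);
        try {
            List<Callable<ConnectionPool>> calls = new ArrayList<>();
            for (int i = 0; i < 8; i++) calls.add(NodeLocalSketch::pool);
            jobs.invokeAll(calls);
            System.out.println(POOLS_CREATED.get()); // prints 1
        } finally {
            jobs.shutdown();
        }
    }
}
```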
Load Balancing & Job Stealing
In the MapReduce pattern, mapping is the process of splitting the initial task into sub-tasks and assigning them to grid nodes. Mapping generally involves the splitting logic itself, the assignment of sub-tasks to nodes (including load balancing), and potential failover and collision resolution. GridGain provides several load balancers out-of-the-box to make sure that the best available node is picked during the mapping step. You can choose from Round-Robin and Random load balancers to an Adaptive load balancer, which listens to the load characteristics of each node and picks the least loaded node for execution. This step is called early load balancing.
However, even though early load balancing works well in the vast majority of cases, by the time your job gets to the remote node the load characteristics of that node may have changed. In short, it may no longer be the least loaded node in the topology. For cases like this GridGain has the concept of job stealing, where less loaded nodes steal jobs directly from more loaded nodes. This step is called late load balancing.
The combination of early and late load balancing allows GridGain to keep all nodes in the grid equally loaded, which is very important for achieving low latencies and high throughput.
Fail-Over & Fault-Tolerance
As in many other distributed systems, when a node crashes in GridGain its jobs are automatically failed over to another node. However, in GridGain you can also treat any result that comes back from a remote job execution as a failure. The remote node may still be alive but running low on CPU, I/O, disk space, etc.; there are many conditions that may amount to a failure within your application. GridGain allows you to optionally fail over a job based on any job result. Moreover, you can choose which node a job should be failed over to, as this could differ between applications or between computations within the same application.
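Result-based failover can be sketched as a small retry loop. This is a local model, not the GridGain failover SPI: ResultFailoverSketch and executeWithFailover are hypothetical names, nodes are plain strings tried in the given order, and the caller supplies the predicate that decides whether a returned result counts as a failure.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

final class ResultFailoverSketch {
    // Runs the job on each candidate node in turn until one returns a result
    // that the application-defined policy accepts.
    static <R> R executeWithFailover(List<String> nodes,
                                     Function<String, R> job,
                                     Predicate<R> isFailure) {
        for (String node : nodes) {
            try {
                R result = job.apply(node);
                if (!isFailure.test(result)) return result; // result accepted
                // The node is alive but the result is rejected: fail over.
            } catch (RuntimeException crash) {
                // Simulated node crash: fail over to the next node.
            }
        }
        throw new IllegalStateException("job failed on all nodes");
    }

    public static void main(String[] args) {
        String result = executeWithFailover(
            List.of("nodeA", "nodeB", "nodeC"),
            node -> {
                if (node.equals("nodeA")) return "LOW_DISK"; // alive, but unusable result
                if (node.equals("nodeB")) throw new IllegalStateException("crashed");
                return "ok:" + node;
            },
            r -> r.equals("LOW_DISK"));
        System.out.println(result); // prints ok:nodeC
    }
}
```

The order of the node list plays the role of the failover target choice; a real policy could pick the next node based on load or topology instead.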
Checkpoints
Checkpointing a job is the ability to periodically save its state. This is especially useful in combination with fail-over functionality. Imagine a job that takes 5 minutes to execute, but after the 4th minute the node on which it is running crashes. The job will be failed over to another node, but it would usually have to be restarted from scratch and would take another 5 minutes. However, if the job was checkpointed every minute, then the most work that could be lost is the last minute of execution, and upon failover the job would restart from the last saved checkpoint. GridGain allows you to easily checkpoint jobs to better control the overall execution time of your jobs and tasks.
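The resume-from-checkpoint behavior can be sketched as follows. This is not the GridGain checkpoint API: CheckpointSketch and run are hypothetical names, a map passed in by the caller stands in for a checkpoint store that survives the crash, and the "job" simply sums the step numbers, saving its progress after every step.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class CheckpointSketch {
    // Sums 1..totalSteps, saving {nextStep, partialSum} after every step.
    // A non-positive crashAtStep means the run never crashes.
    static long run(Map<String, long[]> store, String jobId, int totalSteps, int crashAtStep) {
        // Resume from the last checkpoint if one exists, otherwise start from step 1.
        long[] saved = store.getOrDefault(jobId, new long[] {1, 0});
        long sum = saved[1];
        for (long step = saved[0]; step <= totalSteps; step++) {
            if (step == crashAtStep) throw new IllegalStateException("simulated node crash");
            sum += step;
            store.put(jobId, new long[] {step + 1, sum}); // checkpoint
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, long[]> store = new ConcurrentHashMap<>();
        try {
            run(store, "job", 5, 4); // crashes at step 4, after steps 1..3 were saved
        } catch (IllegalStateException crash) {
            // Failover: the job is restarted elsewhere with access to the same store.
        }
        System.out.println(run(store, "job", 5, -1)); // resumes at step 4, prints 15
    }
}
```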
Affinity Routing With Data Grid
Affinity routing is one of the key concepts behind Compute and Data Grid technologies (whether they are in-memory or disk-based). In general, affinity routing makes it possible to co-locate a job with the data set that the job needs to process.
The idea is pretty simple: if the job and the data are not co-located, the job will arrive on some remote node and will have to fetch the necessary data from yet another node where the data is stored. Once processed, this data will most likely have to be discarded (since it is already stored and backed up elsewhere). This process incurs an expensive network trip plus all the associated marshaling and demarshaling. At scale, this behavior can bring almost any system to a halt.
Affinity co-location solves this problem by co-locating the job with the data set it needs. We say that there is an affinity between the processing (i.e. the job) and the data that this processing requires, and therefore we can route the job, based on this affinity, to a node where the data is stored, avoiding unnecessary network trips and extra marshaling and demarshaling.
GridGain provides advanced capabilities for affinity co-location: from a simple single-method call to sophisticated APIs supporting complex affinity keys and non-trivial topologies.
Example
The following examples demonstrate simple Pi calculations (in Java, Scala and Groovy++) on the grid and how simple the implementation can be with GridGain. Note that this is the full source code: copy it and compile. It works on one node, and just as well on thousands of nodes in the grid or cloud with no code change.
What is even more interesting is that this application automatically includes:
- Auto topology discovery
- Auto load balancing
- Distributed failover
- Collision resolution
- Zero code deployment & provisioning
- Pluggable marshalling & communication
In Scala:
import org.gridgain.scalar._
import scalar._
import org.gridgain.grid.GridClosureCallMode._
import org.gridgain.grid.Grid

object ScalarPiCalculationExample {
    private val N = 10000

    def main(args: Array[String]) = scalar { g: Grid =>
        println("Pi estimate: " +
            g.@<[Double, Double](
                SPREAD,
                // One closure per grid node, each summing its own block of N terms.
                for (i <- 0 until g.size()) yield () => calcPi(i * N),
                _.sum
            )
        )
    }

    // Partial sum of the Leibniz series for Pi over terms [start, start + N).
    def calcPi(start: Int): Double =
        (start until (start + N)) map
            (i => 4.0 * (1 - (i % 2) * 2) / (2 * i + 1)) sum
}
In Groovy++:
import org.gridgain.grid.*
import static org.gridgain.grid.GridClosureCallMode.*
import static org.gridgain.grover.Grover.*
import org.gridgain.grover.categories.*

@Typed
@Use(GroverProjectionCategory)
class GroverPiCalculationExample {
    private static int N = 10000

    static void main(String[] args) {
        grover { Grid g ->
            println("Pi estimate: " +
                g.reduce$(
                    SPREAD,
                    (0..<g.size()).collect {
                        { -> calcPi(it * N) }
                    },
                    { it.sum() }
                )
            )
        }
    }

    private static double calcPi(int start) {
        (start..<(start + N)).inject(0.0d) { sum, i ->
            sum + (4.0 * (1 - (i % 2) * 2) / (2 * i + 1))
        }
    }
}
In Java:
import org.gridgain.grid.*;
import org.gridgain.grid.typedef.*;
import static org.gridgain.grid.GridClosureCallMode.*;

public final class GridPiCalculationExample {
    private static final int N = 1000;

    // Partial sum of the Leibniz series for Pi over terms [start, start + N).
    private static double calcPi(int start) {
        double acc = 0.0;

        for (int i = start; i < start + N; i++)
            acc += 4.0 * (1 - (i % 2) * 2) / (2 * i + 1);

        return acc;
    }

    public static void main(String[] args) throws GridException {
        G.start();

        try {
            Grid g = G.grid();

            System.out.println("Pi estimate: " +
                g.reduce(
                    SPREAD,
                    // One closure per grid node, each summing its own block of N terms.
                    F.yield(F.range(0, g.size()),
                        new C1<Integer, Double>() {
                            @Override public Double apply(Integer i) {
                                return calcPi(i * N);
                            }
                        }
                    ),
                    F.sumDoubleReducer()
                )
            );
        }
        finally {
            G.stop(true);
        }
    }
}
