Do you remember where you were on October 20th, 2025? It's not exactly up there with "Where were you for the moon landing?", but if you deployed your production Ignite cluster in AWS US-East-1, there's a good chance that the scars are still fresh.
Apache Ignite has many native facilities to minimise your downtime. In common with other distributed systems, some level of redundancy is baked right in. If a node storing the primary copy of a partition goes down, Ignite will seamlessly switch to a backup copy on another node, provided the cache is configured with at least one backup. Your client applications don't need to know about this topology change; Ignite is sophisticated enough to figure it out on its own. Similarly, compute tasks, long-running services and data structures are all distributed and will transparently fail over to one of the remaining nodes.
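To make the idea of failing over to a backup concrete, here is a minimal toy model in plain Java. It is only a sketch of the general primary/backup pattern, not Ignite's implementation or API: the `Partition` class, the node names and the single shared map (standing in for replicas that are assumed to be fully in sync) are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy model only: illustrates primary/backup failover, not Ignite's internals.
class BackupFailoverSketch {

    static final class Partition {
        // Ordered owners of the partition: the head is the primary, the rest are backups.
        private final Deque<String> owners = new ArrayDeque<>();
        // Assumption: every owner holds an identical copy, so one map stands in for all of them.
        private final Map<String, String> data = new HashMap<>();

        Partition(String primary, String... backups) {
            owners.add(primary);
            owners.addAll(Arrays.asList(backups));
        }

        // The node currently serving reads and writes for this partition.
        String primary() {
            if (owners.isEmpty()) {
                throw new IllegalStateException("partition lost: no owners left");
            }
            return owners.peekFirst();
        }

        // Removing a failed node implicitly promotes the next backup to primary.
        void nodeFailed(String node) {
            owners.remove(node);
        }

        void put(String key, String value) {
            primary(); // fails if no owner is left to accept the write
            data.put(key, value);
        }

        String get(String key) {
            primary(); // fails if no owner is left to serve the read
            return data.get(key);
        }
    }

    public static void main(String[] args) {
        Partition p = new Partition("node-a", "node-b");
        p.put("order-42", "shipped");

        p.nodeFailed("node-a"); // the primary goes down

        System.out.println("primary after failure: " + p.primary());
        System.out.println("order-42 = " + p.get("order-42"));
    }
}
```

The caller keeps using `put` and `get` without knowing which node answers. Once the last owner is removed, the model throws instead of answering, which is the situation a whole-data-centre outage creates.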
But October's AWS outage was different. Ignite can't fail over to another node if the whole data centre has gone down: there is no node to fail over to! It also doesn't help if the virtual machines are up, but the network is down and no one can reach the running cluster.
Ignite can only be as reliable as the underlying infrastructure.
A common architectural pattern to reduce downtime in these scenarios is to replicate your cluster's data to another, completely different data centre. A second cluster, constantly kept up to date with the most recent changes, allows you to fail over to another data centre should the worst happen with your primary.
We often see companies architect their own solution here (replicating data with Kafka is common), but it typically takes longer to implement than expected, and there are often operational complexities that do not become apparent while testing a proof of concept. In many cases, the Data Center Replication that ships with GridGain can be a much more cost-effective solution.
Downtime is not only associated with catastrophic failures. Sometimes an upstream system goes haywire and sends corrupted or incorrect data. How can you fix the data as quickly as possible? GridGain includes the option to enable Point-in-Time Recovery, which allows you to roll back the whole database to a specific time, rather than just to the last backup. It also offers Full and Incremental Snapshots that can later be used for cluster recovery. This is over and above Ignite’s Snapshots, which are local, partition-level copies of data stored on each node and do not offer the same enterprise-level guarantees.
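The difference between restoring the last backup and recovering to a point in time can be sketched in a few lines of Java. This is a conceptual illustration only, not how GridGain implements the feature: the `Change` class, the `restore` method and the idea of a timestamp-ordered change log replayed on top of a snapshot are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch: snapshot plus change-log replay. Not GridGain's mechanism.
class PointInTimeRecoverySketch {

    // One logged write; a null value represents a delete.
    static final class Change {
        final long timestamp;
        final String key;
        final String value;

        Change(long timestamp, String key, String value) {
            this.timestamp = timestamp;
            this.key = key;
            this.value = value;
        }
    }

    // Rebuilds the state as of targetTime: start from the snapshot, then replay
    // every logged change made after the snapshot and up to targetTime.
    // Assumption: the log is ordered by timestamp.
    static Map<String, String> restore(Map<String, String> snapshot, long snapshotTime,
                                       List<Change> log, long targetTime) {
        if (targetTime < snapshotTime) {
            throw new IllegalArgumentException("target time is before the snapshot");
        }
        Map<String, String> state = new HashMap<>(snapshot); // the snapshot itself is not modified
        for (Change c : log) {
            if (c.timestamp <= snapshotTime || c.timestamp > targetTime) {
                continue; // already in the snapshot, or after the recovery point
            }
            if (c.value == null) {
                state.remove(c.key);
            } else {
                state.put(c.key, c.value);
            }
        }
        return state;
    }

    public static void main(String[] args) {
        Map<String, String> snapshot = new HashMap<>();
        snapshot.put("price", "10"); // snapshot taken at t=100

        List<Change> log = new ArrayList<>();
        log.add(new Change(110, "price", "11"));   // legitimate update
        log.add(new Change(120, "price", "-999")); // corrupted upstream data
        log.add(new Change(130, "qty", "5"));

        // Recover to just before the corruption at t=120.
        System.out.println(restore(snapshot, 100, log, 115));
    }
}
```

Restoring the snapshot alone would lose the legitimate update at t=110, while replaying the whole log would reapply the corrupted write. Choosing a recovery point between the two keeps the good data and drops the bad.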
Ironically, good hygiene can also cause downtime, albeit usually scheduled. It's good practice to update to the latest version of your software to get the latest features, bug fixes and patches for security vulnerabilities. However, Ignite requires downtime to upgrade the cluster version. To design for 24/7 operation you’ll want to minimise (or even eliminate) planned downtime. GridGain has the ability to perform Rolling Upgrades, where server nodes are replaced with a newer version a few at a time while the rest of the cluster keeps running. No downtime required.
In short, no matter how stable Ignite is, minimising your overall system's downtime requires more than just a distributed database. It can require some additional enterprise features that make your underlying architecture even more reliable.