Overcoming Performance Issues in a Giant Banking Project
While helping to implement a nationwide banking data repository for India, we had to solve a number of issues around scale and latency requirements. The project is massive in volume and scope and will touch the lives of nearly every person in the country. It was initially rolled out at 5-10 banks and, once proven, will be used as the central repository of data from all banks. Our initial design did not meet the performance requirements, but we overcame those issues with a new, more performant design.
In this talk, you’ll learn about:
- Challenges faced around designing for this demanding project
- Performance achieved compared to the baseline numbers
- Implementation strategies employed to overcome issues
- The current architecture design
- Plans for the future including how to solve remaining challenges
Solutions Architect at Tata Consultancy Services
Hello everyone, my name is Methun. I work as a Solution Architect for TCS. Today I’ll walk you through our journey of leveraging GridGain to overcome multiple performance bottlenecks and achieve our SLA targets. To set the context, the use case we’ll discuss is a critical part of a large-scale banking application that serves over 2,000 clients across India. The system required ultra-fast processing speeds, but we encountered several challenges along the way. In this presentation, I’ll cover the use case, the challenges we faced, and how we overcame each one.
I’ll begin with the architecture and use case overview, then the technical challenges and how we used GridGain to address them and meet our SLAs. I’ll also touch on the optimizations we implemented and, most importantly, the results we achieved. I’ll wrap up with the key takeaways and learnings from our experience.
To summarize our performance journey up front, our baseline throughput (TPS) was 900 records per second, and we ultimately reached about 21,000 records per second. The path from 900 to 21,000 TPS required many tuning stages: SQL and index optimizations (to ~2,500 TPS), configuration changes, memory sizing, key-value processing, persistence enablement, and workarounds for third-party constraints. It took many late nights of iterative tuning to get there.
For architecture, I’ll focus on the parts relevant to this use case. The use case had a strict SLA and was our highest priority. In production we used a six-node GridGain cluster, each node with about 512 GB of RAM. For this discussion, I’ll reference testing performed in UAT, which used a two-node cluster with the same per-node memory. The workflow began with clients uploading a ZIP file containing 10–15 smaller files, with total records ranging from 1 million to 50 million. A containerized platform performed initial validations and parsing, then pushed records to Kafka. From Kafka, records entered a rule engine for field-level validations (mandatory checks, regex checks, dependency checks). Validated records were pushed to another Kafka topic. From that point, our GridGain processing began. (Note: we also performed significant tuning in the microservices, file-upload layer, and big-data layer, but those figures are out of scope here; we’re focusing on the GridGain aspects.)
Within GridGain, we used three sets of caches. First were the “initial caches,” one per input file in the ZIP (e.g., 15 files → 15 initial caches), used to ingest and stage the Kafka records. Second were the “primary caches,” which we persisted; these caches represent a subset of the big-data database and were used later for database validations. Third were the “final caches,” again mirroring the number of input files, holding records that had passed all validations and cross-joins; these were the source for moving clean data into the big-data database. In short, files are uploaded, validated in the containerized platform, cross-joined and DB-validated in GridGain, and then written to the big-data store.
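To make the cache layout concrete, here is a minimal sketch of how the three tiers might be declared with Ignite's Java configuration API. The cache and region names, sizes, and backup counts are illustrative placeholders, not our production values:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CacheLayoutSketch {
    public static void main(String[] args) {
        // Two data regions: a volatile one for the staging/final caches and a
        // persistent one backing the primary caches (sizes are placeholders).
        DataRegionConfiguration volatileRegion = new DataRegionConfiguration()
            .setName("volatile-region")
            .setMaxSize(64L * 1024 * 1024 * 1024);

        DataRegionConfiguration persistentRegion = new DataRegionConfiguration()
            .setName("persistent-region")
            .setMaxSize(128L * 1024 * 1024 * 1024)
            .setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(new DataStorageConfiguration()
                .setDefaultDataRegionConfiguration(volatileRegion)
                .setDataRegionConfigurations(persistentRegion));

        try (Ignite ignite = Ignition.start(cfg)) {
            // Persistence is enabled, so the cluster must be activated first
            // (on older versions: ignite.cluster().active(true)).
            ignite.cluster().state(ClusterState.ACTIVE);

            // Initial (staging) cache, one per input file, purged after processing.
            ignite.getOrCreateCache(new CacheConfiguration<String, Object>("initial_file_01")
                .setCacheMode(CacheMode.PARTITIONED)
                .setBackups(1)
                .setDataRegionName("volatile-region"));

            // Primary cache: the persisted subset of the big-data store used for DB validations.
            ignite.getOrCreateCache(new CacheConfiguration<String, Object>("primary_accounts")
                .setCacheMode(CacheMode.PARTITIONED)
                .setBackups(1)
                .setDataRegionName("persistent-region"));

            // Final cache: records that passed all validations, ready for the big-data store.
            ignite.getOrCreateCache(new CacheConfiguration<String, Object>("final_file_01")
                .setCacheMode(CacheMode.PARTITIONED)
                .setBackups(1)
                .setDataRegionName("volatile-region"));
        }
    }
}
```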
We built four services around these caches. First, we performed a cross-file duplicate check across all staged files to remove duplicate records. Second, we ran cross-file validations (joins) and populated the primary caches. Third, we streamed data from Kafka to GridGain. Fourth, we handled tokenization for two specific files by calling a third-party tool; we sent records for tokenization and stored the tokenized results back in GridGain. Initially, we deployed the Kafka-streaming and tokenization services inside the container platform and the purely GridGain processing services on GridGain servers outside the platform. This split later caused connectivity issues and inconsistencies.
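For the Kafka-to-GridGain service, the natural building block is Ignite's IgniteDataStreamer, which batches entries and routes each batch to the node that owns the keys. A simplified sketch, assuming string keys and values, a topic named validated-records, and a staging cache that already exists (broker address and group id are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToIgniteStreamer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // placeholder broker address
        props.put("group.id", "gridgain-loader");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (Ignite ignite = Ignition.start();           // in practice, a client-mode config
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             // The streamer batches entries and sends them to the owning nodes.
             IgniteDataStreamer<String, String> streamer =
                 ignite.dataStreamer("initial_file_01")) {

            streamer.allowOverwrite(false);              // staging data is insert-only
            consumer.subscribe(Collections.singletonList("validated-records"));

            while (true) {
                for (ConsumerRecord<String, String> rec :
                         consumer.poll(Duration.ofMillis(500))) {
                    if (rec.key() != null)               // keyless records cannot be streamed
                        streamer.addData(rec.key(), rec.value());
                }
                streamer.flush();                        // push the current batch before polling again
            }
        }
    }
}
```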
Our overall SLA in production was to process about 500 million records within four hours, equating to roughly 30,000 records per second end-to-end. Because UAT was smaller, our end-to-end target there was about 10,000 records per second. Since processing stages run sequentially, GridGain’s internal TPS target needed to be ~70,000 in production to support the overall 30,000 end-to-end target; this was scaled to ~24,000 for UAT. We initially designed the solution with SQL joins because file sizes were expected to be ~50 MB (40,000–50,000 records). A scope change increased file size dramatically, up to 50 million records per file (tens of gigabytes), which broke our original assumptions. Additionally, the third-party tokenization tool was limited to ~200 records per second, creating a severe bottleneck. Early on, files that should have completed in ~30 minutes sometimes took 10 hours; we’d see uploads at night still processing the next morning. It was clear we needed significant redesign and tuning.
Our tuning journey proceeded in stages based on what we observed. First, we optimized SQL and indexes. We began with three primary caches set to REPLICATED to avoid cross-network joins for SQL. Using explain plans, we rewrote queries and adjusted or added indexes (including index column ordering). This raised GridGain TPS from ~900 to ~2,500, with spikes near 20,000 only for very small files. Next, we made configuration changes. We converted caches from REPLICATED to PARTITIONED and carefully set affinity keys to collocate related data, minimizing cross-node joins. We also discovered that a 10-second metadata lookup (used to check whether GridGain had received all records for a given file) caused threads to wait and sometimes triggered critical thread-block errors under concurrent writes. Increasing the lookup interval to two minutes reduced contention and raised TPS to ~4,000, with occasional spikes still dependent on file size.
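As an illustration of the collocation and indexing changes, the affinity key and SQL indexes can be expressed with annotations on the key and value classes. The class and field names below are hypothetical stand-ins, not our actual data model:

```java
import java.io.Serializable;

import org.apache.ignite.cache.affinity.AffinityKeyMapped;
import org.apache.ignite.cache.query.annotations.QuerySqlField;

/** Key class: entries sharing the same accountId are stored on the same node,
 *  so joins on accountId stay node-local instead of crossing the network. */
public class RecordKey implements Serializable {
    private String recordId;

    @AffinityKeyMapped                 // collocation field
    private String accountId;

    public RecordKey(String recordId, String accountId) {
        this.recordId = recordId;
        this.accountId = accountId;
    }
}

/** Value class with the SQL fields and indexes used by the rewritten queries. */
class RecordValue implements Serializable {
    @QuerySqlField(index = true)       // indexed column for the frequent lookups
    private String accountId;

    @QuerySqlField
    private String payload;

    RecordValue(String accountId, String payload) {
        this.accountId = accountId;
        this.payload = payload;
    }
}

// Registered on the cache configuration with:
//   cacheCfg.setIndexedTypes(RecordKey.class, RecordValue.class);
```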
We then moved to memory sizing and key-value processing. We reduced backups from two to one, as only the primary caches were persisted and the initial/final caches were purged after processing. We uploaded large files to measure RAM/disk usage, tuned on-heap/off-heap allocations, and enabled persistence temporarily for sizing analysis. To prevent RAM exhaustion, we shifted processing control to GridGain: after upload, microservices sent file metadata to GridGain; GridGain then assessed current load and RAM, prioritized files via a queue, and only started processing when resources were available. This raised TPS to ~6,000 and stabilized performance across varying file sizes.
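The load-aware queueing can be sketched roughly as follows: file metadata goes into a distributed queue, and a controller dequeues the next file only while the staging region has headroom. The threshold, region size, and memory metric used here are assumptions (the exact metric APIs vary by Ignite/GridGain version and require metrics to be enabled), and startProcessing is a hypothetical hook:

```java
import org.apache.ignite.DataRegionMetrics;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteQueue;
import org.apache.ignite.configuration.CollectionConfiguration;

public class FileAdmissionController {
    private static final double MAX_REGION_USAGE = 0.70;                     // assumed threshold
    private static final long REGION_MAX_BYTES = 64L * 1024 * 1024 * 1024;   // must match the region config

    static void run(Ignite ignite) throws InterruptedException {
        // Distributed FIFO queue of uploaded-file identifiers (cap 0 = unbounded).
        IgniteQueue<String> pending =
            ignite.queue("pending-files", 0, new CollectionConfiguration());

        while (true) {
            if (regionUsage(ignite, "volatile-region") < MAX_REGION_USAGE) {
                String fileId = pending.take();   // blocks until a file is queued
                startProcessing(fileId);          // hypothetical: submit compute tasks for this file
            } else {
                Thread.sleep(5_000);              // wait for memory to free up before admitting more
            }
        }
    }

    static double regionUsage(Ignite ignite, String regionName) {
        for (DataRegionMetrics m : ignite.dataRegionMetrics()) {
            if (regionName.equals(m.getName()))
                // Allocated bytes vs. the configured region size (metric name per Ignite 2.x).
                return (double) m.getTotalAllocatedSize() / REGION_MAX_BYTES;
        }
        return 0.0;
    }

    static void startProcessing(String fileId) { /* kick off validation/join jobs */ }
}
```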
The major breakthrough came when we abandoned SQL joins for large joins and moved to key-value processing in binary format. We rewrote duplicate checking and join logic to use fast key-value lookups and leveraged Ignite compute tasks for distributed processing. This eliminated query overhead and unlocked in-memory performance, pushing TPS to ~34,000—consistently 30–34k even under heavier loads—making the system far more scalable and resilient to data-size variation.
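A rough sketch of what this collocated key-value processing looks like: the job runs on the node that owns the affinity key and works on BinaryObject values, so records are never fully deserialized. The cache names, key choice, and dedup/join logic are simplified stand-ins for our actual services:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.lang.IgniteRunnable;
import org.apache.ignite.resources.IgniteInstanceResource;

/** Collocated dedup/join step: runs on the node that owns the key and
 *  replaces the SQL join with direct key-value lookups in binary format. */
public class JoinAndDedupTask implements IgniteRunnable {
    @IgniteInstanceResource
    private transient Ignite ignite;

    private final String accountId;   // affinity key for this record

    public JoinAndDedupTask(String accountId) {
        this.accountId = accountId;
    }

    @Override public void run() {
        IgniteCache<String, BinaryObject> staged =
            ignite.cache("initial_file_01").withKeepBinary();
        IgniteCache<String, BinaryObject> primary =
            ignite.cache("primary_accounts").withKeepBinary();
        IgniteCache<String, BinaryObject> finalCache =
            ignite.cache("final_file_01").withKeepBinary();

        BinaryObject record = staged.get(accountId);
        if (record == null || finalCache.containsKey(accountId))
            return;                                        // missing or duplicate: skip

        BinaryObject reference = primary.get(accountId);   // local lookup thanks to collocation
        if (reference != null)                             // the "join" succeeds
            finalCache.put(accountId, record);
    }
}

// Submission side, routing each task to the node that owns its affinity key:
//   ignite.compute().affinityRun("primary_accounts", accountId, new JoinAndDedupTask(accountId));
```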
Next, we stabilized the environment. Because the Kafka connector and tokenizer services were inside the container platform and Ignite servers were outside, we saw frequent disconnections that hurt continuity. After trying alternatives (including moving Ignite into the container), we ultimately moved all services out of the container platform and co-located them with the Ignite server instances. We also resolved thread-pool starvation caused by overlapping TCP/communication port ranges across multiple services by assigning distinct port ranges. These changes reduced network and thread-pool issues and delivered a consistent ~40,000 TPS in UAT.
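The port fix amounts to giving every co-located service its own discovery and communication port window. A configuration sketch along those lines (host names, base ports, and range sizes are placeholders):

```java
import java.util.Arrays;

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class PortRangeSketch {
    /** Builds a client config for one service; each service passes a distinct basePort
     *  so their discovery/communication sockets never overlap. */
    static IgniteConfiguration clientConfigForService(int basePort) {
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setAddresses(Arrays.asList("server1:47500..47509", "server2:47500..47509"));

        TcpDiscoverySpi discovery = new TcpDiscoverySpi();
        discovery.setIpFinder(ipFinder);
        discovery.setLocalPort(basePort);           // e.g. 47600 for service A, 47700 for service B
        discovery.setLocalPortRange(20);

        TcpCommunicationSpi communication = new TcpCommunicationSpi();
        communication.setLocalPort(basePort + 100); // separate, non-overlapping communication window
        communication.setLocalPortRange(20);

        return new IgniteConfiguration()
            .setClientMode(true)
            .setDiscoverySpi(discovery)
            .setCommunicationSpi(communication);
    }
}
```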
We then enabled persistence in the primary caches as required for durability. Enabling persistence initially caused a throughput dip due to slow SSD I/O. We upgraded to faster SSDs and tuned write-ahead log settings and checkpoint frequency. After this, TPS stabilized around ~25,000. Although lower than the peak 40,000 without persistence, it met our durability goals and prevented data loss—an acceptable trade-off.
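For reference, the knobs involved here live in Ignite's DataStorageConfiguration. A sketch with illustrative values only; the paths, sizes, and intervals below are placeholders, not our tuned production settings:

```java
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;

public class PersistenceTuningSketch {
    static IgniteConfiguration storageConfig() {
        DataRegionConfiguration primaryRegion = new DataRegionConfiguration()
            .setName("persistent-region")
            .setMaxSize(128L * 1024 * 1024 * 1024)
            .setPersistenceEnabled(true)
            .setCheckpointPageBufferSize(4L * 1024 * 1024 * 1024);

        DataStorageConfiguration storage = new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(primaryRegion)
            .setWalMode(WALMode.LOG_ONLY)              // lighter-weight WAL mode than FSYNC
            .setWalSegmentSize(256 * 1024 * 1024)
            .setCheckpointFrequency(180_000)           // checkpoint every 3 minutes
            .setWriteThrottlingEnabled(true)           // smooth out checkpoint spikes
            .setWalPath("/fast-ssd/wal")               // keep the WAL on the faster SSDs
            .setWalArchivePath("/fast-ssd/wal-archive")
            .setStoragePath("/fast-ssd/storage");

        return new IgniteConfiguration().setDataStorageConfiguration(storage);
    }
}
```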
Finally, we addressed the third-party tokenization bottleneck (~200 records/sec). Only two files required tokenization, so we introduced a caching strategy in GridGain: we stored a hash of the raw data mapped to its tokenized value. This meant repeat values didn’t require another third-party call. With this change, end-to-end performance constraints eased, and our GridGain TPS settled around ~21,000, which aligned with the overall end-to-end objective (>10,000 records/sec in UAT).
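The lookup-aside token cache is conceptually simple: hash the raw value, check GridGain for an existing token, and only call the external tool on a miss. A sketch of the idea, where TokenizerClient is a hypothetical wrapper for the third-party tool and the cache and field names are illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;

/** Token lookup-aside cache: repeat values skip the ~200 rec/s third-party call. */
public class TokenizationCache {
    private final IgniteCache<String, String> tokens;
    private final TokenizerClient thirdParty;      // hypothetical wrapper for the external tool

    public TokenizationCache(Ignite ignite, TokenizerClient thirdParty) {
        this.tokens = ignite.getOrCreateCache("token_lookup");
        this.thirdParty = thirdParty;
    }

    public String tokenize(String rawValue) throws Exception {
        String key = sha256(rawValue);             // store a hash, never the raw value itself
        String token = tokens.get(key);
        if (token == null) {                       // cache miss: one external call, then reuse
            token = thirdParty.tokenize(rawValue);
            tokens.put(key, token);
        }
        return token;
    }

    private static String sha256(String value) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(value.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(digest);
    }

    /** Stand-in interface for the real third-party tokenization client. */
    public interface TokenizerClient {
        String tokenize(String rawValue) throws Exception;
    }
}
```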
In summary, our tuning journey moved from 900 TPS to 21,000 TPS through iterative changes: SQL/index tuning, cache mode and affinity optimization, memory sizing, switching to key-value processing with compute tasks, deployment/network stabilization, persistence enablement with SSD upgrades and WAL/checkpoint tuning, and a tokenization cache to reduce third-party dependency. Key takeaways from our experience:
- SQL tuning alone wasn't enough; moving to key-value lookups and compute tasks was essential to scale.
- Partitioning and affinity key design were critical to minimize cross-node communication.
- Memory sizing and on-heap/off-heap tuning improved cache utilization and reduced excessive GC.
- Deployment strategy mattered: co-locating clients/services with Ignite servers reduced network bottlenecks.
- Persistence and SSD quality required careful tuning but provided necessary durability.
- Overall success came from a holistic approach across data layout, processing model, infrastructure, and durability settings.
At present, we’re operating at ~21,000 records per second in GridGain and about 13,000 end-to-end in UAT’s two-node cluster. We’ll begin production testing soon. Looking ahead, we plan to make the cross-joins more dynamic. Tuning isn’t just about faster queries; it’s about transforming how we handle data at large scale. We turned a slow, unreliable system into a high-speed, scalable platform, and this is only the beginning. I appreciate the opportunity to present at this summit and welcome your questions, feedback, and suggestions.