Chitij Chauhan is a subject matter expert on distributed databases at Incedo, an IT service management company. Specifically, he’s an expert in database architecture ranging from core RDBMS platforms to massively parallel distributed databases to in-memory databases. His talk at the In-Memory Computing Summit Europe (June 25-26 in London) is titled, “How to Identify the Most Appropriate In-Memory Database Solution for Computing and Application Requirements.”
Tom: Chitij, as we both know, businesses around the world are grappling with exponentially more data from their customers each year – data coming in fast from swipes, clicks, micropayments and other touch points. This torrent of streaming data is overwhelming IT systems.
Gartner has reported that traditional data solutions are struggling to update, organize and index their conventional disk-based databases so they can be queried in real time to reveal the true state of the business.
In your talk next month in London at the In-Memory Computing Summit Europe, you’ll address how in-memory distributed databases provide an alternative to disk-resident databases. What are the main advantages that in-memory distributed databases provide organizations like those mentioned above – companies that are well-entrenched in conventional disk-resident databases?
Chitij: Traditional disk-resident databases usually operate by allocating a memory area known as the buffer cache, which stores a small subset of the database’s data. Any time a client requests data that is not currently in the buffer cache, the database must fetch it from the physical disk and load it into the buffer cache before it can be read. While this strategy may work for queries whose data is already being served from the cache, it leads to poor performance for queries whose data is not in the cache and therefore requires an expensive disk I/O operation.
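The buffer-cache read path Chitij describes can be sketched as a simple cache-aside loop. This is only an illustration – the dictionaries, key names and sleep-based "disk" below are all hypothetical stand-ins, not any real database's internals:

```python
import time

DISK = {f"row{i}": f"data-{i}" for i in range(100)}  # simulated on-disk pages
buffer_cache = {}                                    # small in-memory subset

def read(key):
    """Cache-aside read: serve from the buffer cache when possible,
    otherwise pay a (simulated) disk I/O and populate the cache."""
    if key in buffer_cache:
        return buffer_cache[key], "cache hit"
    time.sleep(0.001)          # stand-in for an expensive disk seek
    value = DISK[key]
    buffer_cache[key] = value  # warm the cache for subsequent reads
    return value, "disk read"
```

The first read of a key pays the simulated disk cost and populates the cache; repeated reads of the same key are served from memory – which is exactly where disk-resident databases win or lose on a given query.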
In contrast, a distributed in-memory database serves data from RAM, improving query performance and eliminating costly disk I/O operations. Another often-ignored fact is that in-memory databases are essentially built on lock-free data structures. This is a huge benefit compared to traditional disk-resident databases, where transaction locking during concurrent operations often chokes system performance and blocks user sessions.
Another major area of improvement with distributed in-memory databases is the ability to shard table data evenly across multiple data nodes in the cluster. By comparison, disk-resident databases let you partition your tables, but it is a well-known observation that once a dataset grows beyond the 5-10 TB range, users experience performance degradation, and table partitioning is no longer a feasible workaround for a disk-resident database.
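Sharding table data evenly across data nodes typically reduces to hashing a row key onto a node. A minimal sketch, assuming a fixed node list (the node names are hypothetical, and real products use more sophisticated schemes such as consistent hashing):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def shard_for(key, nodes=NODES):
    """Map a row key to a data node by hashing, so table data
    spreads roughly evenly across the cluster."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]
```

Because the hash is deterministic, every node can compute where a given row lives without consulting a central catalog; the trade-off of this naive modulo scheme is that adding a node remaps most keys, which is why production systems prefer consistent hashing or fixed partition maps.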
It is also easier to scale out distributed in-memory databases by adding more data nodes to the cluster. Furthermore, there is no single point of failure in a distributed in-memory database: they usually deploy a shared-nothing architecture in which a given node’s data is replicated to one or more other nodes in the cluster, so that if a data node is lost, the remaining nodes take over and continue serving the cluster, and availability is unaffected.
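The shared-nothing replication and takeover behavior described above can be sketched with a toy ring: each node's data is synchronously copied to the next node, and reads fail over to the replica when a node is lost. This is an illustrative assumption, not any particular product's design:

```python
class Cluster:
    """Toy shared-nothing cluster: each node's data is replicated to the
    next node in the ring, so losing one node does not lose data."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.primary = {n: {} for n in nodes}  # each node's own data
        self.replica = {n: {} for n in nodes}  # copies held for the previous node
        self.alive = set(nodes)

    def _next(self, node):
        i = self.nodes.index(node)
        return self.nodes[(i + 1) % len(self.nodes)]

    def put(self, node, key, value):
        self.primary[node][key] = value
        self.replica[self._next(node)][key] = value  # synchronous replication

    def get(self, node, key):
        if node in self.alive:
            return self.primary[node][key]
        return self.replica[self._next(node)][key]  # fail over to the replica

    def fail(self, node):
        self.alive.discard(node)
```

Because no disk or memory is shared between nodes, losing one node leaves its replica holder able to serve the data, which is the availability property Chitij contrasts with shared-everything designs below.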
Disk-resident databases usually deploy a shared-everything architecture, such as Oracle RAC, where the disk is shared among multiple nodes and remains a single point of failure.
Tom: Finding the right in-memory distributed database is a challenge for organizations because there are so many choices on the market. With so many offerings available, how can a CTO identify the right fit for his or her company’s applications?
Chitij: The key to identifying the right in-memory database is the use case of a given application, its corresponding data model, and what the underlying in-memory database product is designed for.
As an example, one customer was running a MySQL InnoDB production database environment and was looking for a suitable in-memory database alternative. MySQL NDB Cluster was a perfect fit in this situation because of its compatibility with InnoDB tables: table data could easily be moved to NDB tables without changing the data model.
Most of the available in-memory distributed databases are designed for specific use cases, so it is important to choose accordingly. For instance, SAP HANA and EXASOL are appliance-based in-memory columnstore products designed purely for analytics. VoltDB targets extreme OLTP performance, while MemSQL is geared toward the data warehousing space. Both VoltDB and MemSQL run on commodity hardware, in contrast to appliance-based products such as SAP HANA and EXASOL.
If an organization is looking for an open-source in-memory database product then Altibase and Apache® Ignite™ are suitable alternatives.
Tom: What are some examples of companies that have already made this transition? And what have they learned in the process that readers who are just starting this journey can learn from?
Chitij: Samsung, Tapjoy, HP Korea, Comcast, Citibank and Verizon Business are among the companies that have already made the switch to an in-memory database. One of the key things these organizations have begun to understand is the evolution and architecture of distributed in-memory databases and where these products fit in their tech stacks.
One frequently observed trend is that organizations starting the switch to in-memory databases often compare them feature-for-feature against traditional disk-resident databases. For instance, foreign keys and incremental backups are not supported in the majority of distributed in-memory database systems, and it is important to understand why they are not supported rather than simply demanding such features.
Over time, organizations have come to understand that disk-resident databases and distributed in-memory databases differ in their design architecture, so it is important to make the right comparisons. Metrics such as query performance and data-ingestion times are valid comparisons. A better understanding of the underlying architecture has led to greater clarity, and a majority of these organizations now run heterogeneous tech stacks with data flowing between different data products.
Tom: Can you share any “before” and “after” stories – organizations that have embraced a particular in-memory database solution?
Chitij: Tapjoy is a very good example in this context. Tapjoy implemented MemSQL in its heterogeneous tech stack, which comprises messaging and data storage platforms such as SQS, SNS, Kafka, Google BigQuery and MySQL. Tapjoy’s systems had a constant stream of incoming data from their processing pipelines, and part of that data was financial in nature. Hence, they needed an ACID-compliant system.
Secondly, they needed a system capable of handling several orders of magnitude more than their current MySQL traffic, which had reached its capacity. They also wanted a system that was horizontally scalable. MemSQL, being wire-compatible with MySQL, ticked all the boxes Tapjoy was looking for while also delivering faster performance.
While evaluating performance metrics, a query that took approximately 68 seconds in MySQL ran in around 1.23 seconds in MemSQL – a dramatic improvement of roughly 55x.
Chitij’s session at the In-Memory Computing Summit Europe 2018 conference is scheduled for June 25 at 1:45 p.m. Details here.
Meantime, get the latest news and updates by following the conference on Twitter @IMCSummit. And attend the conference for just £20! A limited number of £20 tickets are being raffled off each week. Enter to win a full conference pass for only £20 (you'll save hundreds if you win!). Enter here. Good luck!