5 Limitations of MySQL with Big Data

Monday, January 8 2018 | Tom Diederich
Share This

MySQL is a widely used open-source relational database management system (RDBMS) and is an excellent solution for many applications, including web-scale applications. However, its architecture has limitations when it comes to big data analytics.

A closer look at the strengths and weaknesses of MySQL reveals several use cases where the RDBMS, powerful though it is, can benefit from some assistance. The following are the five limitations of MySQL in this area:

1. Delivering Hot Data

In large applications, the data cache stored in RAM can grow very large and be subjected to thousands or even millions of requests per second. MySQL does not have a strong memory-focused search engine. Because it is not designed for very high concurrency, users can be exposed to bottlenecks and periodic performance issues. MySQL is saddled with relatively high overhead and cannot deliver optimal speed.

A Solution: It is common to use either Memcached or Redis as external solutions for caches and other hot data needs, to eliminate the overhead of SQL parsing and transactions. But caching is difficult because you risk reading data that is not current. It is also possible to use internal scalability improvements in MySQL, such as the thread pool, available either as a MySQL enterprise feature or as an open source implementation. But such features typically help many queries to run concurrently, without improving the performance of any given query.

A Better Solution: The Apache® Ignite™ in-memory computing platform includes an in-memory data grid and an in-memory SQL grid. Redis and Memcached are primarily in-memory key-value stores. All three solutions partition and cache data in memory and they can be scaled out across distributed clusters. However, there are many differences in the way caching, transactions, persistence, and data querying are supported. The Apache Ignite in-memory computing platform includes many additional features not in Redis or Memcached that are often highly valuable for companies moving to in-memory computing. For example, ANSI-99 SQL, distributed ACID-compliant transactions, advanced computing and streaming capabilities, and more.<

2. Dealing with Highly Volatile Data

In thousands of updates per second are applied to a single database row (for example, flash online sales for high-demand concert tickets), it is crucial to maintain exact values at every second. MySQL is designed around full transactional semantics with support for long transactions and works with disk-based log durability. It is therefore not well suited for use with this kind of highly volatile data. 

A Solution: To some extent, good data design within MySQL can address this issue. Splitting the counter over several rows can help, and optimal configuration of the MySQL installation can yield up to ten times better performance than a stock MySQL configuration. Parallel replication, another historical problem, has been addressed in MySQL 5.7, or by solutions such as Percona XtraDB Cluster. Some users even move frequently-updated data to Memcached or Redis, and then periodically synchronize to the RDBMS. 

A Better Solution: Apache Ignite automatically distributes data across all nodes in a cluster. Replication between nodes and clusters is configurable and takes place automatically. Apache Ignite can be configured to provide the needed level of consistency. Once the data is in Apache Ignite, you can leverage the ANSI-99 distributed SQL engine and ACID compliant transactions in order to work with your data in the same way you work with it in MySQL. The Apache Ignite in-memory computing platform also includes many additional features not included in Percona XtraDB Cluster, Redis or Memcached that are often highly valuable for companies that are moving to in-memory computing.

3. Handling Large Data Volumes

MySQL was originally designed as a single-node system and not with the modern data center concept in mind. Today’s largest MySQL installations cannot scale by using MySQL as a single system and must rely on sharding, or splitting a data set over multiple nodes or instances. However, most sharding solutions in MySQL are manual and make application code more complex. Any performance gain is lost when queries must access data across multiple shards.

A Solution: Tools such as Vitess, a framework YouTube released for MySQL sharding, or ProxySQL are often used to help implement sharding. Memcached and Redis also offer front-end solutions, as do MongoDB and Cassandra.

A Better Solution: Apache Ignite was built from the ground up as a high performance and highly scalable distributed in-memory computing platform. There are no limitations to the amount of CPU and memory that can be used by any node. Furthermore, nodes are automatically load balanced. Data is automatically distributed across all nodes in a cluster so manual sharding is not necessary. Apache Ignite is a complete in-memory computing platform that includes many additional features beyond those offered by point solutions such as Vitess, Redis, Memcached, MongoDB, and Cassandra.

4. Providing Analytics

MySQL was not designed for running complicated queries against massive data volumes which requires crunching through a lot of data on a huge scale. MySQL optimizer is quite limited, executing a single query at a time using a single thread. A given MySQL query can neither scale among multiple CPU cores in a single system nor execute distributed queries across multiple nodes.

A Solution: Although MySQL offers no comprehensive solution for robust, large-scale data processing, many external solutions, such as Hadoop® and Apache® Spark™, are often used. Vertica and ClickHouse have also emerged as worthy analytics solutions.

A Better Solution: Apache Ignite easily integrates with Hadoop and Spark, using in-memory technology to complement these technologies and achieve significantly better performance and scale. The Apache Ignite in-memory SQL grid is highly optimized and easily tuned to execute high performance SQL queries. The in-memory SQL grid provides JDBC/ODBC connections to any analytics solution.

5. Powering Full Text Searches

MySQL can handle basic full text searches. However, because of its inability to manage parallel processing, searches do not scale well as data volumes increase.

A Solution: For small-scale search applications, InnoDB, first available with MySQL 5.6, can help. When the search spans beyond a few gigabytes, however, an advanced search solution such as Elasticsearch, Apache Solr, Crate.io or Sphinx Search makes sense.

A Better Solution: Apache Ignite supports text-based queries based on Lucene indexing. Text searches are distributed across the cluster for parallel processing against data stored in memory. This results in high performance and highly scalable full text searches.

When two trends converge

Although MySQL is an excellent choice for many applications, two major trends have converged to change the landscape of large-scale data processing. Together they are forcing a wholesale change in the database world:

  1. In-memory computing and the concept of caching has been around for 40 years, but it has come to the forefront as real-time data processing has become standard operating procedure (or at least the goal) of enterprises of all sizes. In-memory computing is the key to that speed, and at some point, in the not-too-distant future, advances in memory technology may lead us to a world of in-memory storage as well, with disks serving only as historical backups.

  1. Memory prices have dropped to the point where, for example, a clustered terabyte of RAM costs about $20,000, making it affordable for large enterprises that demand real-time data processing and analytics of data sets of that scale.

The good news is that as these trends converge, in-memory platforms are ready to deliver as the most sophisticated distributed systems in existence because they are efficient and well-architected. The largest implementations of Apache Ignite rival the largest installations of Cassandra and Hadoop with several thousand nodes and clusters – while providing greater performance and scalability.

Tom Diederich
Director of Community Engagement