SQL Performance Tuning
This article outlines basic and advanced optimization techniques for GridGain SQL queries. Some of the sections are also useful for debugging and troubleshooting.
Basic Considerations: GridGain vs RDBMS
GridGain and Ignite are frequently compared to relational databases for their SQL capabilities with an expectation that existing SQL queries, created for an RDBMS, will work out of the box and perform faster in GridGain without any changes. Usually, such a faulty assumption is based on the fact that GridGain stores and processes data in-memory. However, it’s not enough just to put data in RAM and expect an order of magnitude performance increase. GridGain as a distributed platform requires extra changes for the sake of performance and below you can see a standard checklist of best practices to consider before you benchmark GridGain against an RDBMS or do any performance testing:
Ignite/GridGain is optimized for multi-nodes deployments with RAM as a primary storage. Don’t try to compare a single-node GridGain cluster to a relational database that was optimized for such single-node configurations. You should deploy a multi-node GridGain cluster with the whole copy of data in RAM.
Be ready to adjust your data model and existing SQL queries if any. Use the affinity colocation concept during the data modelling phase for proper data distribution. Remember, it’s not enough just to put data in RAM. If your data is properly collocated you can run SQL queries with JOINs at massive scale and expect significant performance benefits.
Define secondary indexes and use other standard, and GridGain-specific, tuning techniques described below.
Keep in mind that relational databases leverage local caching techniques and, depending on the total data size, an RDBMS can complete some queries even faster than GridGain even in a multi-node configuration. If your data set is around 10-100GB and an RDBMS has enough RAM for caching data locally than it, for instance, can outperform a multi-node GridGain cluster because the latter will be utilizing the network. Store much more data in GridGain to see the difference.
Using the EXPLAIN Statement
EXPLAIN syntax for reading execution plans and query performance investigation purposes.
Use this command whenever you need to optimize your queries. Note that the plan will contain multiple rows: the
last one will contain a query for the reducing side (usually your application), others are for map nodes (usually server
EXPLAIN SELECT name FROM Person WHERE age = 26;
The execution plan itself is generated by H2 as described here.
OR Operator and Selectivity
If a query contains an
OR operator, then indexes may not be used as expected depending on the complexity of the query.
For example, for the query
select name from Person where gender='M' and (age = 20 or age = 30), an index on the
gender field will be used instead of an index on the
age field, although the latter is a more selective index. As a workaround
for this issue, you can rewrite the query with
UNION ALL (notice that
ALL will return
rows, which will change the query semantics and will further penalize your query’s performance):
SELECT name FROM Person WHERE gender='M' and age = 20 UNION ALL SELECT name FROM Person WHERE gender='M' and age = 30
Avoid Having Too Many Columns
Avoid having too many columns in the result set of a
SELECT query. Due to limitations of the H2 query parser,
queries with 100+ columns may perform worse than expected.
Result Set Lazy Loading
By default, GridGain attempts to load the whole result set to memory and send it back to the query initiator (which is usually your application). This approach provides optimal performance for queries of small or medium result sets.
However, if the result set is too big to fit in the available memory, then it can lead to prolonged GC pauses and even
To minimize memory consumption, at the cost of a moderate performance hit, you can load and process the result sets lazily by passing the lazy parameter to the JDBC and ODBC connection strings or use a similar method available for Java, .NET, and C++ APIs:
SqlFieldsQuery query = new SqlFieldsQuery("SELECT * FROM Person WHERE id > 10"); // Result set will be loaded lazily. query.setLazy(true);
Querying Collocated Data
Whenever GridGain executes a distributed query, it sends sub-queries to individual cluster nodes and groups the
results on the reducer node (usually your application). If you know in advance that the data you are querying is
collocated by the
GROUP BY condition, you can use
SqlFieldsQuery.collocated = true to reduce network traffic
between the nodes and query execution time. When this flag is set to
true, the query is executed on individual nodes
first and the results are sent to the reducer node for final calculation.
Consider the following examples, in which we assume that data is collocated by
SELECT SUM(salary) FROM Employee GROUP BY department_id
Because of the nature of the SUM operation, GridGain will sum the salaries across the elements stored on individual
nodes, and then send these sums to the reducer node where the final result will be calculated. Enabling the
flag will only slightly improve performance.
Let’s take a slightly different example.
SELECT AVG(salary) FROM Employee GROUP BY depatment_id
In this example, GridGain has to fetch all (
department_id) pairs to the reducer node and calculate the results
there. However, if employees are collocated by the
department_id field, i.e. employee data for the same department
is stored on the same node, setting
SqlFieldsQuery.collocated = true will reduce query execution time because
GridGain will calculate the averages for each department on the individual nodes and send the results to the reducer
node for final calculation.
Increasing Index Inline Size
If a value type allows, GridGain includes indexed values in the index itself to optimize querying and data updates. Fixed-size data types (bool, byte, short, int, etc.) are included in full. For variable-size data (string, byte), only a fixed-length portion is included. The length of the included portion is called the inline size and by default equals the first 10 bytes of the value.
When values are not fully included in the index, comparing these values may require reading the corresponding data pages, which may have a negative impact on performance. When indexing variable length data, it is recommended that you estimate the length of your fields and set the inline size to include most or all values.
The inline size can be changed by adjusting the
INLINE_SIZE property of the CREATE INDEX command and
other APIs available for your programming language.
By default, a SQL query is executed in a single thread on each participating GridGain node. This approach is optimal for queries returning small result sets involving index search. For example:
SELECT * FROM Person WHERE p.id = ?;
Certain queries might benefit from being executed in multiple threads. This relates to queries with table scans and aggregations, which is often the case for HTAP and OLAP workloads. For example:
SELECT SUM(salary) FROM Person;
You can control query parallelism through the
CacheConfiguration.queryParallelism property which defines the number
of threads that will be used to execute a query on a single node. Use a preconfigured
that adjusts a value of the parameter whenever
CREATE TABLE generates the SQL schema and underlying caches.
If a query contains
JOINs, then all the participating caches must have the same degree of parallelism.
Index hints are useful in scenarios when it’s known that one index is more suitable for certain queries than another.
They are also needed to instruct the query optimizer to choose a more efficient execution plan. To do this
optimization in GridGain, you can use
USE INDEX(indexA,…,indexN) statement that tells Ignite to apply only one of the
named indexes provided for query execution.
Below is an example that leverages this capability:
SELECT * FROM Person USE INDEX(index_age) WHERE salary > 150000 AND age < 35;
Partition pruning is a technique that optimizes queries that use affinity keys in the
WHERE condition. When
executing such a query, GridGain will scan only those partitions where the requested data is stored. This will reduce
query time because the query will be sent only to the nodes that store the requested partitions.
In the following example, the employee objects are collocated by the
id field (if an affinity key is not set
explicitly then the primary key is used as the affinity key):
CREATE TABLE employee (id BIGINT PRIMARY KEY, department_id INT, name VARCHAR) /* This query is sent to the node where the requested key is stored */ SELECT * FROM employee WHERE id=10; /* This query is sent to all nodes */ SELECT * FROM employee WHERE department_id=10;
In the next example, the affinity key is set explicitly and, therefore, will be used to colocate data and direct queries to the nodes that keep primary copies of the data:
CREATE TABLE employee (id BIGINT PRIMARY KEY, department_id INT, name VARCHAR) WITH "AFFINITY_KEY=department_id" /* This query is sent to all nodes */ SELECT * FROM employee WHERE id=10; /* This query is sent to the node where the requested key is stored */ SELECT * FROM employee WHERE department_id=10;
Skip Reducer on Update
When GridGain executes a DML operation, first, it fetches all the affected intermediate rows for analysis to the reducer node (usually your application), and only then prepares batches of updated values that will be sent to remote nodes.
This approach might affect performance and saturate the network if a DML operation has to move many entries over it.
Use this flag as a hint for the SQL engine to do all intermediate rows analysis and updates "in-place" on corresponding remote server nodes. The hint is supported for JDBC, ODBC, and other programming APIs:
SQL On-heap Row Cache
GridGain stores data and indexes in its own memory space outside of Java heap. This means that with every data access, a part of the data will be copied from the off-heap space to Java heap, potentially deserialized, and kept in the heap as long as your application or server node references it.
The SQL on-heap row cache is intended to store hot rows (key-value objects) in Java heap, minimizing resources spent for data copying and deserialization. Each cached row refers to an entry in the off-heap region and can be invalidated when one of the following happens:
The master entry stored in the off-heap region is updated or removed.
The data page that stores the master entry is evicted from RAM.
The on-heap row cache can be enabled for a specific cache/table (if CREATE TABLE is used for SQL tables and caches creation, then the parameter can be passed via a cache template:
<bean class="org.apache.ignite.configuration.CacheConfiguration"> <property name="name" value="person"/> ... <property name="sqlOnheapCacheEnabled" value="true"/> </bean>
If the row cache is enabled, you might be able to trade RAM for performance. You might get up to a 2x performance increase for some SQL queries and use cases by allocating more RAM for rows caching purposes.
Using TIMESTAMP instead of DATE
TIMESTAMP type instead of
DATE whenever possible. Presently, the
DATE type is serialized/deserialized very inefficiently resulting in performance degradation.