GridGain Developers Hub

Working with SQL

GridGain comes with ANSI-99 compliant, horizontally scalable and fault-tolerant distributed SQL database. The distribution is provided either by partitioning the data across cluster nodes or by full replication, depending on the use case.

As a SQL database, GridGain supports all DML commands including SELECT, UPDATE, INSERT, and DELETE queries and also implements a subset of DDL commands relevant for distributed systems.

You can interact with GridGain as you would with any other SQL enabled storage by connecting with JDBC or ODBC drivers from both external tools and applications. Java, .NET and C++ developers can leverage native SQL APIs.

Internally, SQL tables have the same data structure as key-value caches. It means that you can change partition distribution of your data and leverage affinity colocation techniques for better performance.

GridGain’s SQL engine uses H2 Database to parse and optimize queries and generate execution plans.

Distributed Queries

Queries against partitioned tables are executed in a distributed manner:

  • The query is parsed and split into multiple “map” queries and a single “reduce” query.

  • All the map queries are executed on all the nodes where required data resides.

  • All the nodes provide result sets of local execution to the query initiator that, in turn, will merge provided result sets into the final results.

You can force a query to be processed locally, i.e. on the subset of data that is stored on the node where the query is executed.

By default, the queries that require multiple records are executed on a specific node, and data required for these queries must be collocated with the other data.

When the node executes a simple query that only required one record, there is no need to collocate data, as queries can be distributed across the cluster and executed on nodes independently.

However, advanced queries may require access to multiple records. For example, JOIN operations may need data from several caches. In this case, GridGain would need all data available locally to perform the query. This data can be collocated and stored locally, or loaded over network as required. GridGain handles data loading differently for different operations.

Distributed JOIN

In GridGain, JOIN operations are executed with the assumption that the data is collocated. If your operation attemts a JOIN on non-colocated data, you may see incorrect results.

In case you need to perform a join on non-collocated data, you can perform the non-collocated join instead. This comes at a cost of performance for the whole query, as data will be transferred to the node before query execution.

Due to the high data load caused by JOIN operations, using non-collocated joins should be the last resort for when you cannot collocate data. This may have a significant impact on your performance and load.

Collocated Aggregation

GridGain assumes that data you query by using the aggregations (for example, GROUP BY or ORDER BY) is not collocated. If you know in advance that the data you are querying is already collocated, and your query does not need non-collocated data, you can use SqlFieldsQuery.collocated = true to tell the SQL engine that all required data for the query is located on the same node. This will reduce network traffic between the nodes and query execution time.

When this flag is set to true, the query is executed on individual nodes first and the results are sent to the reducer node for final calculation.

Consider the following example, in which we assume that the data is collocated by department_id (in other words, the department_id field is configured as the affinity key).

SELECT SUM(salary) FROM Employee GROUP BY department_id

Because of the nature of the SUM operation, GridGain will sum the salaries across the elements stored on individual nodes, and then send these sums to the reducer node where the final result will be calculated. This operation is already distributed, and enabling the collocated flag will only slightly improve performance.

Let’s take a slightly different example:

SELECT AVG(salary) FROM Employee GROUP BY department_id

In this example, GridGain has to fetch all (salary, department_id) pairs to the reducer node and calculate the results there. However, if employees are collocated by the department_id field, i.e. employee data for the same department is stored on the same node, setting SqlFieldsQuery.collocated = true will reduce query execution time because GridGain will calculate the averages for each department on the individual nodes and send the results to the reducer node for final calculation.

WHERE Clause

GridGain always assumes that the data used in the WHERE clause is collocated. For example, the following query must be executed with both Salaries and Employees collocated.

SELECT * FROM Salaries WHERE salary = (SELECT MAX(salary) FROM Employees WHERE Age=42)

Local Queries

If a query is executed over a replicated table, it will be run against the local data.

Queries over partitioned tables are executed in a distributed manner. However, you can force local execution of a query over a partitioned table. See Local Execution for details.

Working in Multiple Timezones

Each GridGain cluster exists in one timezone. All DATE, TIME or TIMESTAMP operations are performed relative to this specific timezone. However, because clients can operate in different timezones, GridGain converts time for operations performed from thin clients to represent a local user’s timezone.

For operations performed directly on caches, cluster’s timezone is used. If you perform direct cache operations from multiple time zones, make sure you keep track of the timezone users are in.