GridGain supports both
JOIN clauses. To ensure that the joins are both functionally correct
and performant, it is important to understand the collocation model.
If an SQL statement joins two or more tables, those tables need to be collocated. "Collocated" means that the related data of the tables is stored on the same node.
The join between tables A and B is collocated if any of the following is true:
Either A or B (or both) is REPLICATED
The join is done on the partitioning column of both tables (affinity key)
A distributed join is an SQL statement with a join clause that combines two or more partitioned tables. If the tables are joined on the partitioning column (affinity key), the join is called a collocated join. Otherwise, it is called a non-collocated join.
Collocated joins are more efficient because each node can execute its part of the join over local data, without requesting rows from other nodes.
By default, GridGain treats each join query as if it is a collocated join and executes it accordingly.
The following image illustrates the procedure of executing a collocated join. A collocated join (Q) is sent to all the nodes that store the data matching the query condition. Then the query is executed over the local data set on each node (E(Q)). The results (R) are aggregated on the node that initiated the query (the client node).
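The Q, E(Q), R flow described above can be pictured with a small, self-contained Java sketch. It is a toy model, not GridGain code: the Customer and Order types, the modulo-based nodeFor affinity function, and the two-node cluster are all assumptions made for illustration. It also shows what happens in this model when a join is executed as collocated although the matching rows live on different nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CollocatedJoinSketch {
    // Hypothetical row types, used only for this illustration.
    static final class Customer {
        final int id;
        final String name;
        Customer(int id, String name) { this.id = id; this.name = name; }
    }

    static final class Order {
        final int id;
        final int customerId;
        Order(int id, int customerId) { this.id = id; this.customerId = customerId; }
    }

    // Toy affinity function (an assumption): key modulo the number of nodes.
    static int nodeFor(int key, int nodes) {
        return Math.floorMod(key, nodes);
    }

    // E(Q): a node joins only the rows it stores locally.
    static List<String> localJoin(List<Customer> customers, List<Order> orders) {
        List<String> rows = new ArrayList<>();
        for (Order o : orders)
            for (Customer c : customers)
                if (c.id == o.customerId)
                    rows.add(o.id + ":" + c.name);
        return rows;
    }

    // Sends Q to every node, runs E(Q) there, and merges the partial results R.
    // If ordersByCustomerId is false, orders are partitioned by their own id,
    // so the two tables are no longer collocated.
    static List<String> join(List<Customer> customers, List<Order> orders,
                             int nodes, boolean ordersByCustomerId) {
        List<List<Customer>> customersOn = new ArrayList<>();
        List<List<Order>> ordersOn = new ArrayList<>();
        for (int n = 0; n < nodes; n++) {
            customersOn.add(new ArrayList<>());
            ordersOn.add(new ArrayList<>());
        }
        for (Customer c : customers)
            customersOn.get(nodeFor(c.id, nodes)).add(c);
        for (Order o : orders)
            ordersOn.get(nodeFor(ordersByCustomerId ? o.customerId : o.id, nodes)).add(o);

        List<String> result = new ArrayList<>();
        for (int n = 0; n < nodes; n++)
            result.addAll(localJoin(customersOn.get(n), ordersOn.get(n)));
        Collections.sort(result);
        return result;
    }

    // Fixed sample data on a two-node "cluster".
    static List<String> demo(boolean collocated) {
        List<Customer> customers = Arrays.asList(new Customer(1, "Ann"), new Customer(2, "Bob"));
        List<Order> orders = Arrays.asList(new Order(10, 1), new Order(11, 2), new Order(12, 1));
        return join(customers, orders, 2, collocated);
    }

    public static void main(String[] args) {
        System.out.println(demo(true));   // [10:Ann, 11:Bob, 12:Ann]
        System.out.println(demo(false));  // []
    }
}
```

With this sample data, the layout partitioned by customer id returns all three order rows, while the layout partitioned by order id returns none, because no node holds both sides of any match.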
Collocated joins have the following known limitations:
OUTER JOIN and REPLICATED Tables
There is currently a limitation in GridGain’s support of OUTER JOIN. Given a REPLICATED table R and a PARTITIONED table P, the following queries may not work correctly out-of-the-box and require special handling:
SELECT * FROM R LEFT JOIN P ON R.X = P.X
SELECT * FROM P RIGHT JOIN R ON P.X = R.X
To work around the limitation, the following setup is required:
P and R need to have equal affinity functions (specifically, the same number of partitions)
The join columns R.X and P.X must be the affinity keys of both tables; note that, unlike most cases, this operation requires the REPLICATED table to have a specific affinity key
Non-collocated joins must be turned off
If all of the above is true, then the JOIN is performed correctly.
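One way to picture why an outer join against a REPLICATED table needs this setup is the toy model below. It is only a sketch under stated assumptions and does not describe GridGain's actual engine: the nodeFor function, the two-node cluster, and the filterRByAffinity switch are hypothetical. In the model, every node holds all of R, so a naive per-node LEFT JOIN emits a NULL row wherever the matching P row lives on another node. Restricting each node to the R rows it owns by the same affinity function removes those spurious rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ReplicatedOuterJoinSketch {
    // Toy affinity function (an assumption): X modulo the number of nodes.
    static int nodeFor(int x, int nodes) {
        return Math.floorMod(x, nodes);
    }

    // R LEFT JOIN P ON R.X = P.X, evaluated node by node and then merged.
    // rX: X values of the replicated table R (every node holds all of them).
    // pX: X values of the partitioned table P (each node holds only its share).
    // filterRByAffinity: if true, a node only processes the R rows it owns by X.
    static List<String> leftJoin(List<Integer> rX, List<Integer> pX,
                                 int nodes, boolean filterRByAffinity) {
        List<String> result = new ArrayList<>();
        for (int node = 0; node < nodes; node++) {
            for (int r : rX) {
                if (filterRByAffinity && nodeFor(r, nodes) != node)
                    continue;
                boolean matched = false;
                for (int p : pX) {
                    // A node sees only the P rows that are mapped to it.
                    if (nodeFor(p, nodes) == node && p == r) {
                        result.add(r + "," + p);
                        matched = true;
                    }
                }
                if (!matched)
                    result.add(r + ",NULL");
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> r = Arrays.asList(1, 2, 3);
        List<Integer> p = Arrays.asList(1, 2);
        // Naive per-node execution: spurious and duplicated NULL rows.
        System.out.println(leftJoin(r, p, 2, false));
        // R filtered by the shared affinity function: the expected three rows.
        System.out.println(leftJoin(r, p, 2, true));
    }
}
```

In this model the naive run yields six rows (including "1,NULL" and "2,NULL", which should not exist), while the filtered run yields exactly "1,1", "2,2", and "3,NULL".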
If you execute a query in a non-collocated mode, the SQL engine executes the query locally on all the nodes that store the data matching the query condition. However, because the data is not collocated, each node requests the missing data (data that is not present locally) from other nodes by sending either broadcast or unicast requests. This process is shown in the image below.
If the join is done on the primary or affinity key, the nodes send unicast requests because in this case the nodes know the location of the missing data. Otherwise, nodes send broadcast requests. For performance reasons, both broadcast and unicast requests are aggregated into batches.
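The difference between unicast and broadcast requests, and the effect of batching, can be sketched as follows. This is an illustrative model only: the planRequests helper, the modulo-based nodeFor function, and the four-node cluster are assumptions and are not part of any GridGain API. For one requesting node, the sketch groups the join values it needs into one batch per target node.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class NonCollocatedJoinSketch {
    // Toy affinity function (an assumption): key modulo the number of nodes.
    static int nodeFor(int key, int nodes) {
        return Math.floorMod(key, nodes);
    }

    // Plans the requests that node `self` sends to resolve `joinValues`.
    // Returns a map: target node -> batch of values requested from that node.
    static Map<Integer, Set<Integer>> planRequests(int self, Collection<Integer> joinValues,
                                                   int nodes, boolean joinOnAffinityKey) {
        Map<Integer, Set<Integer>> batches = new TreeMap<>();
        for (int value : joinValues) {
            if (joinOnAffinityKey) {
                // Unicast: the owner is known, so only that node is asked.
                int owner = nodeFor(value, nodes);
                if (owner != self)
                    batches.computeIfAbsent(owner, t -> new TreeSet<>()).add(value);
            } else {
                // Broadcast: the location is unknown, so every other node is asked.
                for (int n = 0; n < nodes; n++)
                    if (n != self)
                        batches.computeIfAbsent(n, t -> new TreeSet<>()).add(value);
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        Collection<Integer> values = Arrays.asList(1, 4, 5, 2);
        // Join on the affinity key: two unicast batches; value 4 is local to node 0.
        System.out.println(planRequests(0, values, 4, true));
        // Join on another column: one batch to each of the three other nodes.
        System.out.println(planRequests(0, values, 4, false));
    }
}
```

In this model, the affinity-key join asks two nodes for three values in total, while the broadcast variant asks three nodes for all four values each, which illustrates why joins on the primary or affinity key are cheaper even in non-collocated mode.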
Enable the non-collocated mode of query execution by setting a JDBC/ODBC parameter or, if you use SQL API, by calling