
Data Loading

Most production systems have large amounts of data that need to be loaded into the cluster as soon as it is up and running. This chapter describes data loading techniques for clusters running on GridGain Cloud: streaming, the COPY command, and batch loading.

Streaming

Streaming data in bulk can be done with the JDBC driver. When streaming is enabled, the JDBC driver will pack your commands in batches and send them to the cluster. On the cluster side, the batch is converted into a stream of cache update commands which are distributed asynchronously between server nodes. This technique increases peak throughput because at any given time all cluster nodes are busy with data loading.
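
If you prefer to drive streaming from application code rather than from a SQL script, the same commands can be issued through the JDBC thin driver. Below is a minimal Java sketch, not an official reference implementation: the connection string placeholders and the City table (created in the Usage example below) are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class StreamingLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string: substitute the address, port, and
        // password from your Cluster Info window.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:ignite:thin://[address]:[port];user=ignite;password=[cluster_password]")) {

            // Switch the connection into streaming mode.
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("SET STREAMING ON");
            }

            // While streaming is on, the driver packs INSERTs into batches
            // and the cluster applies them asynchronously.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO City(ID, Name, CountryCode, District, Population) VALUES (?, ?, ?, ?, ?)")) {
                ps.setInt(1, 1);
                ps.setString(2, "Kabul");
                ps.setString(3, "AFG");
                ps.setString(4, "Kabol");
                ps.setInt(5, 1780000);
                ps.executeUpdate();
                // ...repeat for the remaining rows...
            }

            // Flush buffered updates and return to normal mode.
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("SET STREAMING OFF");
            }
        }
    }
}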

Usage

To stream data into your GridGain Cloud deployment, you can prepare a file that turns on streaming with the SET STREAMING ON command and then lists the INSERT statements for the data to be loaded. For example, the snippet below creates a table and streams data into it:

CREATE TABLE City (
  ID INT(11),
  Name CHAR(35),
  CountryCode CHAR(3),
  District CHAR(20),
  Population INT(11),
  PRIMARY KEY (ID, CountryCode)
) WITH "template=partitioned, backups=1, affinityKey=CountryCode, CACHE_NAME=City, KEY_TYPE=demo.model.CityKey, VALUE_TYPE=demo.model.City";

SET STREAMING ON;

INSERT INTO City(ID, Name, CountryCode, District, Population) VALUES (1,'Kabul','AFG','Kabol',1780000);
INSERT INTO City(ID, Name, CountryCode, District, Population) VALUES (2,'Qandahar','AFG','Qandahar',237500);
INSERT INTO City(ID, Name, CountryCode, District, Population) VALUES (3,'Herat','AFG','Herat',186800);
INSERT INTO City(ID, Name, CountryCode, District, Population) VALUES (4,'Mazar-e-Sharif','AFG','Balkh',127800);
INSERT INTO City(ID, Name, CountryCode, District, Population) VALUES (5,'Amsterdam','NLD','Noord-Holland',731200);
-- More INSERT commands --

SET STREAMING OFF;

Example

First, we need to connect to the cluster using the JDBC driver. For this we will use SQLLine, providing the JDBC connection string obtained from the Cluster Info window. Open a command shell and execute the first command below on Unix or the second on Windows:

./sqlline.sh --verbose=true -u "jdbc:ignite:thin://[address]:[port];user=ignite;password=[cluster_password];sslMode=require;sslClientCertificateKeyStoreUrl=keyStore.jks;sslTrustCertificateKeyStoreUrl=keyStore.jks"

sqlline.bat --verbose=true -u "jdbc:ignite:thin://[address]:[port];user=ignite;password=[cluster_password];sslMode=require;sslClientCertificateKeyStoreUrl=keyStore.jks;sslTrustCertificateKeyStoreUrl=keyStore.jks"

Substitute [address], [port], [cluster_password], and the SSL-related parameters with the values copied from the Cluster Info window.

Once the SQLLine tool is connected to the cluster, we can execute the !run command supported by SQLLine to load the data into the cluster running on GridGain Cloud. For this example we will use the sample World Database that is shipped with the latest GridGain Client Cloud Pack distribution or with any other GridGain edition. You can find it in your {gridgain-version}/examples/sql/ directory.

!run /path/to/sample/ignite_world.sql

After executing the above command and closing the JDBC connection, all the data will be loaded into the cluster, ready to be queried.
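
To confirm that the data is in place, you can run any query against the loaded tables, either from SQLLine or programmatically. The following is a minimal Java sketch for illustration; the connection string is a placeholder:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VerifyLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string: use the values from Cluster Info.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:ignite:thin://[address]:[port];user=ignite;password=[cluster_password]");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM City")) {
            rs.next();
            System.out.println("Rows loaded into City: " + rs.getLong(1));
        }
    }
}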


COPY Command

The COPY command allows you to copy the content of a file in the local file system to the server and apply its data to a SQL table. Internally, it reads the file content in binary form into data packets and sends those packets to the server, where the content is parsed and executed in streaming mode. Use this command if you have data dumped to a file. Currently, the COPY command is supported only via the JDBC driver and works only with data in CSV format.

Usage

It’s fairly simple to use the COPY command to load the data from a CSV file into a GridGain Cloud deployment.

COPY FROM "/path/to/local/file.csv" INTO tablename (col1, col2, col3, col4, ...) FORMAT CSV

In the above command, substitute /path/to/local/file.csv with the actual path to your CSV file, and replace tablename and col1, col2, col3, col4, …​ with your actual table name and column names.
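
Because COPY is executed through the JDBC driver, you can also run it from application code. Below is a minimal Java sketch; the connection string and file path are placeholders, and it assumes the target table already exists:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CopyCsv {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and file path.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:ignite:thin://[address]:[port];user=ignite;password=[cluster_password]");
             Statement stmt = conn.createStatement()) {

            // The driver reads the local file and streams it to the server,
            // where it is parsed and applied to the table.
            stmt.executeUpdate(
                "COPY FROM \"/path/to/local/file.csv\" " +
                "INTO City (ID, Name, CountryCode, District, Population) FORMAT CSV");
        }
    }
}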

Example

First, we need to connect to the cluster using the JDBC driver. As before, we will use SQLLine with the JDBC connection string obtained from the Cluster Info window. Open a command shell and execute the first command below on Unix or the second on Windows:

./sqlline.sh --verbose=true -u "jdbc:ignite:thin://[address]:[port];user=ignite;password=[cluster_password];sslMode=require;sslClientCertificateKeyStoreUrl=keyStore.jks;sslTrustCertificateKeyStoreUrl=keyStore.jks"

sqlline.bat --color=true --verbose=true -u "jdbc:ignite:thin://[address]:[port];user=ignite;password=[cluster_password];sslMode=require;sslClientCertificateKeyStoreUrl=keyStore.jks;sslTrustCertificateKeyStoreUrl=keyStore.jks"

Substitute [address], [port], [cluster_password], and the SSL-related parameters with the values copied from the Cluster Info window.

Once the SQLLine tool is connected to the cluster, we can execute the COPY command:

COPY FROM "/path/to/local/file.csv" INTO city (ID, Name, CountryCode, District, Population) FORMAT CSV

The City table must already exist before the data is loaded; we created it in the Streaming section above. In the above command, substitute /path/to/local/file.csv with the actual path to your CSV file, and provide the table name and columns just as you would for an INSERT statement. For this example, we use a CSV file, city.csv, which is shipped with the latest GridGain distributions. You can find it in your {gridgain-version}/examples/src/main/resources/sql/ directory.

Batch Loading

The simplest way to load data into the cluster is to use the JDBC or ODBC driver and execute an INSERT command for each row, one at a time. This approach requires a synchronous request-response round trip between the client and the cluster for every row and is not very efficient for large data sets.

The JDBC and ODBC drivers also support batch loading: you add several statements to a local batch and then execute all of them in a single network call. This approach typically speeds up data loading several times. The optimal batch size depends on the network parameters and on the length of the INSERT commands and their arguments, but batches of at least 50 rows are usually a good starting point. For more details, refer to the JDBC and ODBC driver documentation. A minimal Java sketch of the batching API follows below.
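
As an illustration, the sketch loads rows into the City table from the Streaming example via the standard JDBC addBatch/executeBatch calls; the connection string is a placeholder:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:ignite:thin://[address]:[port];user=ignite;password=[cluster_password]");
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO City(ID, Name, CountryCode, District, Population) VALUES (?, ?, ?, ?, ?)")) {

            Object[][] rows = {
                {1, "Kabul", "AFG", "Kabol", 1780000},
                {2, "Qandahar", "AFG", "Qandahar", 237500},
                {3, "Herat", "AFG", "Herat", 186800}
            };

            for (Object[] row : rows) {
                ps.setInt(1, (Integer) row[0]);
                ps.setString(2, (String) row[1]);
                ps.setString(3, (String) row[2]);
                ps.setString(4, (String) row[3]);
                ps.setInt(5, (Integer) row[4]);
                ps.addBatch();   // accumulate locally instead of sending now
            }

            // A single network call applies the whole batch.
            ps.executeBatch();
        }
    }
}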