GridGain Developers Hub

Batch Data Export from GridGain to Apache Iceberg on AWS

Vadim Kolodin
Senior Software Engineer

Overview

GridGain offers batch data export/import with the COPY FROM <source> INTO <target> command, which supports a variety of data formats and sources/destinations.

This tutorial explains how to:

  • Export data from GridGain as an Apache Iceberg table to Amazon services - AWS Glue - and store it in Amazon S3.

    • Apache Iceberg is an open table format for huge analytic datasets.

    • AWS Glue is a fully managed ETL (extract, transform, and load) service.

  • Query the exported data using Amazon Athena - an interactive query service that enables analyzing data directly in AWS Glue and Amazon S3 with the standard SQL queries.

  • Import data from AWS to GridGain.

Prerequisites

Prepare AWS environment:

  1. Create AWS access keys.

  2. Create an S3 bucket named s3://iceberg-warehouse-1/. Accept the default values for all settings.

  3. Create an AWS Glue database named gridgain_glue_iceberg.

  4. Create an AWS Athena workgroup named iceberg-workgroup-1, with s3://iceberg-warehouse-1/athena/ as the query result location.

Create Table

Create a simple table in GridGain:

CREATE TABLE cities(id int primary key, city varchar, country varchar, population_est int);

INSERT INTO cities(id, city, country, population_est)
VALUES
(1, 'Tokyo', 'Japan', 37468000),
(2, 'Delhi', 'India', 28514000),
(3, 'Shanghai', 'China', 25582000);

Batch-export the Table to AWS Glue

The AWS Glue data catalog is a persistent technical metadata store. The data is stored in AWS S3.

Command

Export data from the GridGain table to the AWS cloud using the COPY command:

COPY FROM cities(id, city, country, population_est)
INTO 's3://iceberg-warehouse-1/'
FORMAT ICEBERG
PROPERTIES(
    'table-identifier'='gridgain_glue_iceberg.cities',
    'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
    'catalog-impl'='org.apache.iceberg.aws.glue.GlueCatalog',
    's3.client-region'='eu-east-1',
    's3.access-key-id'='YOUR_ACCESS_KEY',
    's3.secret-access-key'='YOUR_SECRET_KEY'
);

AWS-specific properties:

  • The AWS Client region requires s3.client-region to be explicitly passed as an export property.

  • The AWS Client credentials can be defined as:

    • Bulk export properties s3.access-key-id, s3.secret-access-key, and s3.session-token.

    • Environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN.

    • Java system properties aws.accessKeyId, aws.secretKey, and aws.sessionToken.

Examine the Export Results

When the export is complete:

  1. Examine the "destination" S3 bucket:

    1 s3 data
  2. Note that a new AWS Glue table has been created, and it refers to S3:

    2 glue tables
  3. Query the exported data with Amazon Athena (using the previously created workgroup) read more. The query should pick up the new table created in AWS Glue:

    3 athena query

Batch-import Apache Iceberg Table to GridGain

To import an Apache Iceberg table from AWS Glue (Amazon S3) to GridGain.

  1. Create a "destination" GridGain table (for example, using DDL). For this tutorial purposes, you can reuse an existing GridGain table - just clean it up first:

    DELETE FROM CITIES;
  2. Swap FROM and INTO keywords in the command you have used to upload the data to the cloud:

    COPY INTO cities(id, city, country, population_est)
    FROM 's3://iceberg-warehouse-1/'
    FORMAT ICEBERG
    PROPERTIES(
        'table-identifier'='gridgain_glue_iceberg.cities',
        'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
        'catalog-impl'='org.apache.iceberg.aws.glue.GlueCatalog',
        's3.client-region'='eu-central-1',
        's3.access-key-id'='YOUR_ACCESS_KEY',
        's3.secret-access-key'='YOUR_SECRET_KEY'
    );
  3. After the import is complete, verify that new entries have appeared in the destination table:

    SELECT * FROM cities;

    The response should look as follows:

    +------+-------------+-------------+--------------+
    |  ID  |     CITY    |   COUNTRY   |POPULATION_EST|
    +------+-------------+-------------+--------------+
    | 1    | Tokyo       | Japan       | 37468000     |
    | 2    | Delhi       | India       | 28514000     |
    | 3    | Shanghai    | China       | 25582000     |
    +------+-------------+-------------+--------------+

Conclusion

You have successfully completed the tutorial. You have learned how to bulk-upload Apache Iceberg tables to an AWS Glue catalog (with data residing in Amazon S3) and bulk-download data from AWS Glue / Amazon S3 to GridGain.