GridGain Developers Hub

FAQ

This section answers common questions about the recommended Data Lake Acceleration architecture.

Can GridGain be used as a data lake, a data warehouse, or an analytical database?

GridGain is not an analytical database or data lake solution. It’s designed for speed and scale. In particular:

  • GridGain’s storage format is similar to a row-based format and is optimized for fast in-memory access and real-time workloads. GridGain consumes more memory and disk space and, in return, provides ultra-low latencies (microseconds, milliseconds, seconds). Data lakes such as Hadoop and analytical databases make the opposite trade-off: the storage format is more compact, but latencies are high (dozens of seconds, minutes, or hours).

  • GridGain is not suited for ad-hoc queries. With GridGain’s storage format, queries rely on secondary indexes for performance. This contradicts the primary purpose of ad-hoc queries, which end users should be able to add easily, without extra optimization effort. These limitations will no longer apply once GridGain supports a columnar storage format.
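To make the point about secondary indexes concrete, here is a minimal sketch that contrasts an unindexed scan with a lookup through a pre-built index. It uses plain Java collections only and does not call any GridGain API. The `Order` type, the `customer` field, and all method names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual illustration only: plain Java collections, not GridGain APIs.
class IndexVsScanSketch {
    // Hypothetical row type used only in this example.
    static final class Order {
        final int id;
        final String customer;
        final double amount;

        Order(int id, String customer, double amount) {
            this.id = id;
            this.customer = customer;
            this.amount = amount;
        }
    }

    // Ad-hoc style: no index was prepared, so every row is inspected.
    static List<Order> scanByCustomer(List<Order> rows, String customer) {
        List<Order> result = new ArrayList<>();
        for (Order row : rows) {
            if (row.customer.equals(customer)) {
                result.add(row);
            }
        }
        return result;
    }

    // Pre-tuned style: a secondary index on "customer" is built ahead of time.
    static Map<String, List<Order>> buildCustomerIndex(List<Order> rows) {
        Map<String, List<Order>> index = new HashMap<>();
        for (Order row : rows) {
            index.computeIfAbsent(row.customer, k -> new ArrayList<>()).add(row);
        }
        return index;
    }

    // Indexed lookup: goes straight to the matching rows without a scan.
    static List<Order> lookupByCustomer(Map<String, List<Order>> index, String customer) {
        return index.getOrDefault(customer, List.of());
    }

    public static void main(String[] args) {
        List<Order> rows = List.of(
            new Order(1, "alice", 10.0),
            new Order(2, "bob", 25.5),
            new Order(3, "alice", 7.25));
        Map<String, List<Order>> index = buildCustomerIndex(rows);
        System.out.println(scanByCustomer(rows, "alice").size());
        System.out.println(lookupByCustomer(index, "alice").size());
    }
}
```

Both calls return the same rows. The difference is that the index must be chosen and built before the query arrives, which is exactly the preparation step that ad-hoc queries are meant to avoid.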

Can I use GridGain for BI reports?

Yes and no.

If you follow the suggested Data Lake Acceleration architecture, that is, you pre-select the reports that require real-time response and tune for them (secondary indexes, data collocation, etc.), then yes, GridGain can be used as storage for BI for those specific reports.

Otherwise, you might be tempted to use GridGain as an analytical database or data lake. However, that is not a supported GridGain use case. Please see the question above.

Is SQL the only API for real-time analytics?

No. SQL is one of the most widespread and standardized APIs for analytics, but it’s not the only one. With GridGain, you can use the compute grid to run complex calculations with custom Java, .NET, or C++ code in a map-reduce fashion, which boosts performance by reducing data movement over the network. In addition, the GridGain Machine Learning and Deep Learning APIs are suitable for custom calculations with generic ML/DL models.
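The map-reduce idea can be sketched in plain Java as follows. This is a conceptual illustration under stated assumptions, not the GridGain compute API: threads stand in for cluster nodes, each inner list stands in for a data partition held on one node, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Conceptual illustration only: threads stand in for cluster nodes; no GridGain APIs are used.
class CollocatedMapReduceSketch {
    // Map step: runs next to one partition and returns a single partial sum.
    static long sumPartition(List<Integer> partition) {
        long sum = 0;
        for (int value : partition) {
            sum += value;
        }
        return sum;
    }

    // Submits one map task per partition, then reduces the partial results.
    static long mapReduceSum(List<List<Integer>> partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Integer> partition : partitions) {
                partials.add(pool.submit(() -> sumPartition(partition)));
            }
            // Reduce step: only one number per partition reaches the caller.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Map task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = List.of(
            List.of(1, 2, 3),
            List.of(10, 20),
            List.of(100));
        System.out.println(mapReduceSum(partitions)); // prints 136
    }
}
```

The rows never leave their partition. Only the partial sums travel back to the caller, which is where the reduction in network traffic comes from.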

How do I decide what is stored in GridGain vs. Hadoop?

A simple, generic approach is best:

  • If low latency (microseconds, milliseconds, seconds) and high throughput (thousands to millions of operations per second) are required for a set of business operations, store the data set needed for those operations in GridGain and use GridGain APIs for the fastest data processing.

  • If high latency (dozens of seconds, minutes, hours) and batch processing are acceptable or required for another set of operations, continue using your data lake (Hadoop) for them.
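The rule above can be written down as a small helper, shown here as a rough sketch. The 10-second cutoff between "seconds" and "dozens of seconds" is an assumption made for this example, not GridGain guidance, and the class, enum, and method names are hypothetical.

```java
import java.time.Duration;

// Illustrative decision helper; the threshold is an assumption, not GridGain guidance.
class StorePlacementSketch {
    enum Store { GRIDGAIN, DATA_LAKE }

    // Assumed cutoff between "seconds" and "dozens of seconds".
    static final Duration LOW_LATENCY_LIMIT = Duration.ofSeconds(10);

    // Picks a store from the longest response time the operation can tolerate.
    static Store choose(Duration acceptableLatency) {
        if (acceptableLatency.compareTo(LOW_LATENCY_LIMIT) < 0) {
            return Store.GRIDGAIN;
        }
        return Store.DATA_LAKE;
    }

    public static void main(String[] args) {
        System.out.println(choose(Duration.ofMillis(5)));   // prints GRIDGAIN
        System.out.println(choose(Duration.ofMinutes(15))); // prints DATA_LAKE
    }
}
```

In practice, throughput requirements and the cost of keeping the data set in memory would also feed into the decision. The sketch covers only the latency criterion.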

Consider the Spark integrations for federated queries (also known as cross-database queries).