Introducing Apache Ignite White Paper
This white paper describes the Apache® Ignite™ in-memory computing platform. Apache Ignite includes an in-memory data grid, in-memory database, streaming analytics, and a continuous learning framework. The Introducing Apache Ignite white paper describes how Ignite delivers in-memory speed and massive scalability to modern data processing by supporting high-performance transactions, real-time streaming, and fast analytics in a single, comprehensive data access and processing layer. Apache Ignite easily powers both existing and new applications in a distributed, massively parallel architecture on affordable, industry-standard hardware. Apache Ignite can run on-premise, in a hybrid environment, or on a cloud platform such as AWS, Microsoft Azure or Google Cloud Platform.
Apache Ignite provides a unified API which supports SQL, C++, .NET, Java/Scala/Groovy, Node.js and more access for the application layer. The unified API connects cloud-scale applications with multiple data stores containing structured, semi-structured and unstructured data (SQL, NoSQL, Hadoop). It offers a high-performance data environment that allows companies to process full ACID transactions and generate valuable insights from real-time, interactive and batch queries.
In-memory computing platforms offer a strategic approach to in-memory computing. They deliver performance, scale and comprehensive capabilities far above and beyond what traditional in-memory databases (IMDBs), in-memory data grids (IMDGs) or other in-memory-based point solutions can offer by themselves.
Unlike in-memory databases, Apache Ignite works on top of existing databases and requires no rip-and-replace or any changes to an existing RDBMS. Users can keep their existing RDBMSs in place and deploy Apache Ignite as a layer above it. Apache Ignite can even automatically integrate with different RDBMS systems, such as Oracle, MySQL, Postgres, DB2, Microsoft SQL and others. This feature automatically generates the application domain model based on the schema definition of the underlying database and then loads the data. Moreover, IMDBs typically only provide a SQL interface while Apache Ignite provides a much wider ecosystem of supported access and processing paradigms in addition to ANSI SQL. Apache Ignite supports key/value stores, SQL access, MapReduce, HPC/MPP processing, streaming/CEP processing and Hadoop acceleration, all in one well-integrated in-memory data fabric.
When comparing Apache Ignite with in-memory data grids, it should be noted that an in-memory data grid is just one of the capabilities that Apache Ignite provides. In addition to the data grid function, Apache Ignite also supports HPC/MPP processing, streaming, clustering, and Hadoop acceleration, allowing for a much broader set of use cases than a typical IMDG.
Introducing Apache Ignite Architecture
Apache Ignite is JVM-based distributed middleware software. It is based on a homogeneous cluster topology implementation that does not require separate server and client nodes. All nodes in an Apache Ignite cluster are equal and can play any logical role per runtime application requirement.
At the core of Apache Ignite is a Service Provider Interface (SPI) design. The SPI-based design makes every internal component of Apache Ignite fully customizable and pluggable by the developer. This enables tremendous configurability of the system, with adaptability to any existing or future server infrastructure.
Another core tenet of Apache Ignite is the direct support for parallelization of distributed computations based on Fork/Join, MapReduce or MPP-style processing, the largest implementation ecosystem of distributed processing algorithms. Distributed parallel computations are used extensively internally by Apache Ignite and are fully exposed at the API level for user-defined functionality.
IN-MEMORY DATA GRID
One of the core Apache Ignite capabilities is an in-memory data grid. The data grid handles distributed in-memory data management including ACID transactions, failover and advanced load balancing, extensive SQL support and many other features. The Apache Ignite data grid is a distributed, object-based, ACID transactional, in-memory key-value store. Apache Ignite stores its data in memory as opposed to traditional Database Management Systems, which utilize disk as their primary storage mechanism. By utilizing system memory rather than disk, Apache Ignite is orders of magnitude faster than traditional DBMS systems.
The primary benefits and capabilities of the In-Memory Data Grid in Apache Ignite include:
- Distributed ANSI SQL-99 queries with distributed joins
- Lightning-fast performance
- Distributed in-memory caching
- Elastic scalability
- Distributed in-memory ACID transactions
- Distributed in-memory queue and other data structure
- Web session clustering
- Hibernate L2 cache integration
- Tiered off-heap storage
- Unique deadlock-free transactions implementation for fastest speed in in-memory transaction processing
Apache Ignite supports free-form SQL queries with virtually no limitations. The SQL syntax is ANSI-99 compliant. Apache Ignite can use any SQL function, aggregation, or grouping. Apache Ignite supports distributed SQL joins and allows for cross-cache joins. Joins between partitioned and replicated caches work without limitations while joins between partitioned data sets require that the keys are collocated. Apache Ignite supports the concept of fields queries as well to help minimize network and serialization overhead.
IN-MEMORY COMPUTE GRID
Apache Ignite includes a compute grid which enables parallel, in-memory processing of CPU-intensive or other resource-intensive tasks, including traditional High Performance Computing (HPC) and Massively Parallel Processing (MPP).
The primary capabilities of the In-Memory Compute Grid in Apache Ignite include:
- Dynamic clustering
- Fork-Join and MapReduce processing
- Distributed closure execution
- Load balancing and fault tolerance
- Distributed messaging and events
- Linear scalability
- Standard Java ExecutorService support
IN-MEMORY SERVICE GRID
The Apache Ignite Service Grid provides users with complete control over services being deployed on the cluster. It allows users to control how many instances of their service should be deployed on each cluster node, ensuring proper deployment and fault tolerance. The Service Grid guarantees continuous availability of all deployed services in case of node failures.
The primary capabilities of the In-Memory Service Grid in Apache Ignite include:
- Automatic deployment of multiple instances of a service
- Automatic deployment of a service as singleton
- Automatic deployment of services on node start-up
- Fault tolerant deployment
- Removal of deployed services
- Retrieval of service deployment topology information
- Remote access to deployed services via service proxy
In-memory streaming processing addresses a large family of applications for which traditional processing methods and disk-based storages, such as disk-based databases or file systems, are inadequate. Such applications are extending the limits of traditional data processing infrastructures.
Streaming support enables querying so-called rolling windows of incoming data to enable users to answer questions such as “What are the 10 most popular products over the last 2 hours?” or “What is the average product price in a certain category for the past day?”.
Another common use case for stream processing is controlling and properly pipelining a distributed events workflow. As events are coming into the system at high rates, the processing of events is split into multiple stages and each stage has to be properly routed within a cluster for processing.
The primary capabilities of In-Memory Streaming in Apache Ignite include:
- Programmatic window-based querying
- Customizable event workflow / Complex Event Processing (CEP)
- At-least-once guarantee
- Built-in, user-defined sliding windows
- Streaming data indexing
- Distributed streaming queries
- Co-location with in-memory data grid
IN-MEMORY HADOOP ACCELERATION
The Apache Ignite accelerator for Hadoop enhances existing Hadoop environments by enabling fast data processing using the tools and technology your organization is already using today.
In-Memory Hadoop Acceleration in Apache Ignite is based on the industry’s first dual-mode, high-performance in-memory file system that is 100% compatible with Hadoop HDFS and an in-memory optimized MapReduce implementation. In-memory HDFS and in-memory MapReduce provide easy to use extensions to disk-based HDFS and traditional MapReduce, delivering up to 100x faster performance.
This plug-and-play feature requires minimal to no integration. It works with open source Hadoop or any commercial version of Hadoop, including Cloudera, HortonWorks, MapR, Apache, Intel, AWS, as well as any other Hadoop 1.x or Hadoop 2.x distribution.
The main capabilities of In-Memory Hadoop Acceleration in Apache Ignite include:
- Up to 100x faster performance for MapReduce and HIVE jobs
- In-memory MapReduce
- Highly optimized in-memory processing
- Dual mode – standalone Ignite File System (IGFS) file system & primary caching layer for HDFS
- Highly tunable read-through and write-through behavior
DISTRIBUTED IN-MEMORY FILE SYSTEM
One of the unique capabilities of Apache Ignite is a file system interface to its in-memory data called the Ignite File System (IGFS). IGFS delivers similar functionality to Hadoop HDFS, including the ability to create a fully functional file system in memory. IGFS is at the core of the Apache Ignite In-Memory Accelerator for Hadoop.
The data from each file is split on separate data blocks and stored in cache. Developers can access the data in each file with a standard Java streaming API. For each part of the file a developer can calculate an affinity and process the file’s content on corresponding nodes to avoid unnecessary networking.
The primary capabilities of the Distributed In-Memory File System in Apache Ignite include:
- Standard file system “view” on in-memory data
- Listing of directories or information for a single path
- Create/move/delete of files or directories
- Write/read of data streams into/from files
The Apache Ignite In-Memory Data Fabric provides one of the most sophisticated clustering technologies on Java Virtual Machines (JVM). With Apache Ignite, nodes can automatically discover each other. This helps scale the cluster when needed, without having to restart the entire cluster. Developers can also take advantage of the hybrid cloud support in Apache Ignite which allows users to establish connections between private clouds and public clouds such as Amazon Web Services or Microsoft Azure.
The main capabilities of Advanced Clustering in Apache Ignite include:
- Dynamic topology management
- Automatic discovery on LAN, WAN, and AWS
- Automatic “split-brain” (i.e., network segmentation) resolution
- Unicast, broadcast, and group-based message exchange
- On-demand and direct deployment
- Support for virtual clusters and node groupings
Apache Ignite provides high performance, cluster-wide messaging functionality to exchange data via publish- subscribe and direct point-to-point communication models.
The primary capabilities of Distributed Messaging in Apache Ignite include:
- Support for topic-based publish-subscribe model
- Support for direct point-to-point communication
- Pluggable communication transport layer
- Support for message ordering
- Cluster-aware message listener auto-deployment
The distributed events functionality in Apache Ignite allows applications to receive notifications about cache events occurring in a distributed grid environment. Developers can use this functionality to be notified about the execution of remote tasks or any cache data changes within the cluster.
In Apache Ignite, event notifications can be grouped together and sent in batches and/or timely intervals. Batching notifications help attain high cache performance and low latency.
The main capabilities of Distributed Events in Apache Ignite include:
- Subscribing of local and remote listeners
- Ability to enable and disable any event
- Local and remote filters for fine-grained control over notifications
- Automatic batching of notifications for enhanced performance
DISTRIBUTED DATA STRUCTURES
Apache Ignite allows for most of the data structures from the java.util.concurrent framework to be used in a distributed fashion. For example, you can take java.util.concurrent.BlockingDeque and add to it on one node and poll it from another node. Or you could have a distributed Primary Key generator, which would guarantee uniqueness on all nodes.
Distributed Data Structures in Apache Ignite includes support for these standard Java APIs:
- Concurrent map
- Distributed queues and sets
The Apache Ignite unified API supports a wide variety of common protocols for the application layer to access data. Supported protocols include:
Apache Ignite supports several protocols for client connectivity to Ignite clusters including Ignite Native Clients, REST/HTTP, SSL/TLS, and Memcached.
Apache Spark is an open source fast and general engine for large-scale data processing. Apache Ignite and Spark are complementary in-memory computing solutions that target different use cases. They can be used together in many instances to achieve superior performance and functionality.
Apache Spark and Apache Ignite address somewhat different use cases. They rarely “compete” for the same task. Some differences:
Apache Spark loads data for processing from other storages, usually disk-based, and discards the data when the processing is finished. It doesn’t store data.
Apache Ignite provides a distributed in-memory key-value store (distributed cache or data grid) with ACID transactions and SQL querying capabilities which retains data in memory and can write through to an underlying database
Apache Spark is for non-transactional, read-only data (RDDs don’t support in-place mutation) so is used for OLAP
Apache Ignite supports non-transactional (OLAP) payloads as well as fully ACID compliant transactions (OLTP)
Apache Spark is based on RDDs and works only on data-driven payloads
Apache Ignite fully supports pure computational payloads (HPC/MPP) that can be “data-less”
Apache Spark is for in-memory processing of event-driven data. Spark doesn’t provide shared storage, so ETL-ed data from HDFS or another disk storage must be loaded into Spark for processing. State can only be passed from Spark job to job by saving the processed data back into external storage. Apache Ignite can share Spark state directly in memory, without storing the state to disk.
The Apache Ignite Shared RDD API is one of the main integrations for Apache Ignite and Apache Spark. Apache Ignite RDDs are essentially wrappers around Apache Ignite caches which can be deployed directly inside of Spark processes that are executing Spark jobs. Apache Ignite RDDs can also be used with the cache-aside pattern, where Apache Ignite clusters are deployed separately from Spark, but still in-memory. The data is still accessed using Spark RDD APIs.
Apache Ignite RDDs are used through IgniteContext which is the main entry point into Apache Ignite RDDs. It allows users to specify different Apache Ignite configurations. Apache Ignite can be accessed in client mode or server mode. Users can create new shared RDDs, which means new Apache Ignite caches are created with different configurations and different indexing strategies. Apache Ignite supports support a variety of partitioning and replication strategies with fully replicated or partitioned caches.
Everything that can be done in Apache Ignite can be done with IgniteContext by passing a proper Apache Ignite configuration. The RDD syntax is native so it can be accessed using the native Spark RDD syntax. The main difference is Apache Ignite RDDs are mutable while Spark RDDs are immutable. Mutable Apache Ignite RDDs can be updated at the end of or during every job or task execution and ensure that other applications and jobs can be notified and can read the state.
Apache Spark Plus Apache Ignite for Faster SQL Queries
Apache Spark supports a fairly rich SQL syntax but it doesn’t support data indexing so Spark must do full scans all the time. Spark queries may take minutes, even on moderately small data sets. Apache Ignite supports SQL indexes for faster queries, so Spark SQL can be accelerated over 1,000x when using Spark plus Apache Ignite. The result set returned by Apache Ignite Shared RDDs also supports Spark Dataframe API, so it can be further analyzed using standard Spark data frames as well. Both Apache Spark and Apache Ignite natively integrate with Apache YARN and Apache Mesos so they can easily be used together.
Shared In-Memory File System with Apache Spark Plus Apache Ignite
When working with files instead of RDDs, it is still possible to share state between Spark jobs and applications using the Apache Ignite In-Memory File System (IGFS). IGFS implements the Hadoop FileSystem API and can be deployed as a native Hadoop file system, just like HDFS. Apache Ignite plugs in natively to any Hadoop or any Spark environment. The in-memory file system can be used with zero code changes in plug-n-play fashion.
Apache Cassandra can be a high performance solution for structured queries. However, Cassandra requires that the data must be modeled such that each pre-defined query results in one row retrieval. This pre-planning requires knowledge of the required queries before modeling the data.
While very powerful in the right use cases, Cassandra has limitations such as the lack of an in-memory option. Cassandra can be a good match for OLAP applications but it lacks support for transactions, ACID or otherwise, so is not used for OLTP. Cassandra can be efficient for pre-defined queries but it lacks SQL support and does not support joins, aggregations, groupings, or usable indexes. Cassandra is useful for pre-defined queries but does not support ad hoc queries.
Apache Ignite offers native support for Apache Cassandra. When used as complements, Cassandra plus Apache Ignite provides Cassandra users with very powerful new capabilities such as the ability to leverage in-memory computing to reduce query times by 1,000x. With Apache Ignite, Cassandra users can leverage ANSI-99 compliant SQL support to run ad hoc and structured queries and to perform joins, aggregations, groupings and usable indexes.
Apache Ignite is the leading open source in-memory computing platform. It is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time. It performs orders of magnitude faster than is possible with traditional disk-based or flash technologies. As an in-memory data management software layer, it sits between applications and various data sources, and does not require the rip-and-replacement of existing databases.
Apache Ignite comprises, in one well-integrated framework, a set of key in-memory capabilities, including:
- An in-memory data grid
- An in-memory compute grid
- An in-memory service grid
- In-memory streaming processing
- In-memory acceleration for Hadoop
Despite the breadth of its feature set, Apache Ignite is very easy to use and deploy. There are no custom installers. The code base comes as one .zip file with only one mandatory dependency: ignite-core.jar. All other dependencies, such as integration with Spring for configuration, can be added to the process a la carte. The project is fully Mavenized and is composed of over a dozen Maven artifacts that can be imported and used in any combination. Apache Ignite is based on standard Java APIs. For distributed caches and data grid functionality, Apache Ignite implements the JCache (JSR107) standard.
The Apache Ignite large scale, distributed in-memory framework offers transactional and analytical applications performance gains of 100 to 1,000 times faster throughput and/or lower latencies. Apache Ignite is an important open source foundation that holds the key to the world of Fast Data across high-volume transactions, real-time analytics and the emerging class of hybrid transaction/analytical workloads (HTAP).