Tame The Avalanche
In-Memory Streaming key innovations:
Streaming processing fits a large family of applications for which traditional processing methods and disk-based storages, like databases or file systems, fall short. Such applications are pushing the limits of traditional data processing infrastructures.
Processing of market feeds, electronic trading by many financial companies on Wall Street, security and fraud detection, military data analysis – all these applications produce large amounts of data at very fast rates and require appropriate infrastructure capable of processing data in real-time without bottlenecking.
One of the most common use cases for stream processing is the ability to control and properly pipeline distributed events workflow. As events are coming into system with high rates, the processing of events is split into multiple stages and each stage has to be properly routed within a cluster for processing.
Affinity collocation of event processing with nodes on which data resides is essential for achieving high-throughput and scalability in real-time streaming.
Another important use case for streaming is the support for Complex Event Processing (CEP). One of the key features of many CEP systems is the ability to control the scope of operations on streamed data. As streaming data never ends, an application must be able to provide a size limit or a time boundary on how far back each request or each query should go.
In the past few years several technologies have emerged specifically to address the challenges of processing high-volume, real-time streaming data. Generally such technologies evolve around single specific use case such event workflow management or streaming data querying.
Customers looking for a real-time streaming solution usually require both, rich event workflow combined with CEP data querying, and as a result are left with the difficult task of integrating different streaming technologies together. Such integration is rarely effective or simple.
GridGain In-Memory Streaming combines both event workflow and CEP capabilities fully integrated in one product.
Here is a list of some of the key features provided by GridGain’s In-Memory Streaming:
In streaming CEP applications some querying mechanism must be used to find events of interest as well as to aggregate, analyze, and process them. There are two main approaches to querying stream data, using standard SQL or using programmatic coding directly, both having their advantages and disadvantages.
The ability to query event data using semi-standard SQL seems desirable at first as it provides a familiar view on streaming data using a high level query language. However, standard SQL does not provide sliding window capabilities and, hence, new SQL dialect needs to be created to support stream processing. These semi-standard SQL dialects are not extensible and users are limited to the set of features supported by a certain CEP vendor.
GridGain In-Memory Streaming uses programmatic coding utilizing Java or Scala with rich data indexing support to provide CEP querying capabilities over streaming data.
Using programmatic coding such as Java or Scala to query stream data may require additional work, but is generally more flexible as users are not limited by any proprietary syntax or pre-defined routines. Moreover, using Java or Scala directly usually renders dramatically better performance as there is no overhead associated with SQL parsing or processing, and users have much more fine-grained control over data indexing and data aggregation.
Even though majority of streaming events are often processed in one step there are many use cases when processing must be split into different stages and routed to different nodes. GridGain provides comprehensive support for customizable event workflow. As events come into the system they can go through different execution chains, supporting branching and joining of execution paths, with every stage possibly producing new types of events.
By allowing every stage to declare the next one or multiple stages for execution, GridGain In-Memory Streaming can support multiple execution paths for the same events executing in parallel on one or more nodes, supporting loops and recursive branching. The execution workflow ends when all branches finish.
The at-least-once execution semantic provides a guarantee that as long as there is one node standing, the event workflow chain will run its course. GridGain fails over the execution chain as a whole. This means that if at any point a node responsible for some execution branch fails, the whole execution will be cancelled and restarted from the root.
Failing over and restarting workflow from the root is important as different execution stages may store partial results on individual nodes, and the aggregated result as a whole becomes invalid whenever some partial results are lost. Restarting from scratch guarantees that the final result will always be complete and consistent.
As streaming data is fairly constant, it is important to define the scope of streaming data operations by limiting the size of data being queried. Sliding window constructs supports exactly that. GridGain rich windowing functionality includes sliding windows that can be limited by size or time, windows that slide with either every individual event or in batches, windows that are unique or allow duplicates, and windows that can be sorted or snapshotted.
Processing of sliding windows usually involves reacting to new events or a batch of events entering the window, as well as reacting to events that leave. This essentially allows users to control any sort of time- or size-based metrics such as sliding averages, counts, sums, as well as instantaneous selection of any group of events by sorting them in any custom order. Different windows may coexist together and can optionally store and control different types or groupings of events.
By being able to query sliding windows programmatically, users are free to implement any custom algorithms on a streaming window of data.
Quite often iterating through windows is not performant, especially when window sizes are large, and indexing into window data is necessary to achieve efficient query processing. GridGain provides comprehensive indexing APIs which allows you to create any type of index, as well as maintain any type of running aggregate metrics within indexes.
Every window can have as many indexes as needed and every index can be based on any arbitrary field or group of fields. Such flexibility allows users get immediate responses for a variety of use cases and queries, from maintaining running averages to figuring out best selling products, or top co-selling product groups over a certain sliding period of time.
When processing streaming data it is important to keep any network communication or disk level access to the bare minimum. Otherwise there is always a possibility for some workflow stage to start bottlenecking and essentially bring the whole system to a full stop. To avoid such a scenario, GridGain took special care to make sure that no data or event distribution happens unless it is absolutely required.
All windows on every node accepting streaming events are local and are not copied over network for redundancy (if redundancy is required for certain events, then such events should be stored in in-memory database). Whenever a collective cluster-wide result needs to be computed, a streamer query is issued across all participating nodes which will gather results from all participating nodes, aggregate them, and return to user.
GridGain supports various types of streamer queries, allows for local and remote result aggregation, as well as support for visiting queries, which perform computations remotely on the nodes where the event data resides without sending anything back.
To preserve data integrity for mission-critical information and to provide a fault-tolerant highly available system, certain streaming events may need to be stored in an in-memory database. This way applications will avoid any disruptions in real-time processing, survive node crashes, and ensure that all event-related data will always remain intact and consistent.
However, to utilize data stored in an in-memory database while minimizing any data migration during stage execution, stages must be routed exactly to the nodes where the data is cached. To achieve this GridGain provides a special router which automatically co-locates stage execution with required in-memory data, based on affinity information provided from incoming events.
Moving stage execution logic directly to the data is a lot cheaper than moving data to the execution. Transferring data over network is one of the most expensive operations a distributed system can perform and should be avoided whenever possible. In fact, whenever a GridGain cluster topology is stable, there will be zero data movement during event processing. Instead, stages will be automatically routed to the appropriate nodes for execution.
GridGain In-Memory Streaming, as with any other GridGain platform product, comes with a comprehensive and unified GUI-based management and monitoring tool called GridGain Visor. It provides deep operations, management and monitoring capabilities.
A starting point of the Visor management console is the Dashboard tab which provides an overview on grid topology, many relevant graphs and metrics, as well as event panel displaying all relevant grid events:
To manage and monitor configured streamers there is a Streaming tab which displays various metrics and routing information for streamer events and stages:
Click here for more information about Visor.