Anatomy of an In-Memory Data Fabric: JCache and Beyond

Are you ready for a deep dive into Apache Ignite™, the high-performance, distributed in-memory data fabric that massively boosts application performance and scale? The on-demand webinar "Anatomy of an In-Memory Data Fabric: JCache and Beyond" is an in-depth look at the architecture of this powerful open source technology that is helping leading-edge companies turn Big Data into Fast Data.


Dmitriy Setrakyan, Chair of the Apache Ignite Project Management Committee and founder and EVP of Engineering at GridGain Systems, reveals the technical details of distributed clusters and compute grids as well as distributed data grids using various demos and examples.


This is an excellent opportunity for big data developers, engineers, architects and consultants to learn from one of the masters in the field of distributed, high-performance, big data systems.

Dmitriy Setrakyan
Founder & CPO, GridGain Systems

Dane Christensen:
Hello, everyone. I’m Dane Christensen, the digital marketing manager at GridGain Systems, and I want to thank you for attending today’s webinar, Anatomy of an In-Memory Data Fabric: JCache and Beyond. Your presenter today is Dmitriy Setrakyan, Co-Founder and Vice President of Engineering at GridGain Systems as well as the Chairman of the Project Management Committee for the Apache Ignite project. He’ll be providing a deep dive into how Apache Ignite works today.

Dmitriy Setrakyan:
Thank you, Dane. Again, my name is Dmitriy Setrakyan. I’m a founder and EVP of Engineering for GridGain Systems. I’m also Chair of the Apache Ignite PMC, and today, given that the title of the webinar is Anatomy of an In-Memory Data Fabric, I thought that we should get anatomically correct and actually go through most of the components an in-memory data fabric has or should have. So we’ll be talking about clustering and the data grid; we’ll focus on the data grid in some detail, since it’s one of the biggest components we provide. There’s also the compute grid, streaming, and the service grid, and we’ll also talk about Apache Ignite’s Spark and Hadoop features, such as our in-memory file system, in-memory MapReduce, and in-memory RDDs for Spark. And at the end, we’ll do a cool demo comparing Ignite SQL versus Spark SQL and go over the differences.

All right, so let’s take a look at what an in-memory data fabric is. Essentially, the main purpose of an in-memory data fabric is to improve the performance and scalability of your application, and we do it by providing a collection of independent yet very well-integrated components. All of them have to do with in-memory computing, and all of them have to do with improving performance and scale. All of these components can be used independently from each other, but when used together, they provide a nice integration which improves the overall performance and scale. For example, if you use the data grid together with the compute grid and you need to send a computation to a node to compute on some data, you want to make sure that this computation goes exactly to the node where the data is at. Otherwise, you would create a lot of extra network traffic and most likely would worsen the scalability rather than improve it. So all these components provide this type of integration. The same goes for streaming, for service grids, et cetera.

So despite such a broad feature set, Apache Ignite is actually very easy to install and very easy to use. It comes as a ZIP file. In order to install it, all you need to do is unzip it and you’re ready to go; it’s very simple. We tried to use as many standard APIs as possible. For example, we have standard queues, sets, and maps, all of them in a clustered, distributed fashion. We also implement the JSR 107 (JCache) standard, and we add distribution and other extra clustering features to it.

So the project is very simple to use, and it also brings only one mandatory dependency to your application. It’s fully Mavenized, and the only dependency you need to add is the Ignite Core JAR. Other dependencies are brought in on an à la carte basis. For example, if you need Spring integration, you would add the Spring dependency. If you want to use Hibernate persistence, you would need to add the Hibernate dependency. But on the whole, the project only requires one JAR, which is the Ignite Core JAR.
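Since the talk doesn’t show the actual build file, here is a minimal sketch of what that single mandatory Maven dependency might look like; the version shown is a placeholder for whatever release you run:

```xml
<!-- The only mandatory dependency: the Ignite Core JAR. -->
<!-- Integrations such as ignite-spring or ignite-hibernate are added a la carte. -->
<dependency>
    <groupId>org.apache.ignite</groupId>
    <artifactId>ignite-core</artifactId>
    <!-- placeholder: use the Ignite version you are actually running -->
    <version>1.5.0</version>
</dependency>
```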

And the last thing I’ll add here is that there are no hardware requirements for running an Apache Ignite cluster, and we’ll utilize pretty much all the resources you make available to the cluster. If you have high-end servers with maybe 500 gig of RAM and 64 cores available, we’ll utilize all that memory and all that CPU power. Likewise, if you have a low-end server with maybe 10 gig of RAM and only four cores, we’ll utilize those resources as well. Of course, you will not be able to cache as much data in a 10 gig of RAM server, but you can stack such servers up, scale them out, and actually collectively cache a lot of data together.

So this jigsaw puzzle actually represents most of the components available within Apache Ignite. The data grid, being one of our biggest components, has most of the code written for it. It provides a distributed in-memory cache. It partitions data in memory and adds a lot of transactional and querying capabilities to that data. But it’s not the only component we have. I like to say that Ignite is more than just a data grid. We also have a compute grid, a very feature-rich compute grid. We have a service grid for deploying singleton services onto the cluster. We have streaming for ingesting large amounts of data. We integrate very well with Hadoop and Spark environments. We provide an in-memory file system, in-memory data structures, et cetera, et cetera. So all these components are available within Apache Ignite, and today we’ll focus a little bit on most of them.

So let’s start with clustering. The main feature of any clustering product is to make sure that nodes can discover each other in every environment, and every cluster environment is very different. For example, a private cloud has certain features that are not available in a public cloud; multicast may not be available on the public cloud. Public clouds also have some shared storage that may not be available on a private cloud. So it’s important that your clustering product plugs in automatically into all those environments, be that a private cloud, the Amazon public cloud, maybe Google Compute Engine, maybe a local laptop. So we ensure that an Ignite cluster can be automatically brought up in any environment you start it in, and we do that by making our discovery protocol pluggable. So we have different implementations that are plugged in for different environments. If you’re starting from a local laptop or within a private network, we’ll use multicast-based discovery; if you’re starting on Amazon AWS, we’ll utilize S3 storage as a discovery mechanism, and so on. And I also want to add that Ignite works very well with Docker, and we do provide Docker containers out of the box as well, so do take advantage of that if you’re running in a container environment.

So let’s talk about the data grid. As I mentioned, we implement a very nice standard called JSR 107, which is a standard for local caching. It essentially provides a very nice API, which is in many ways similar to the Java concurrent map API, and provides a lot of atomic operations on that map: replace, remove, et cetera. It also provides a mechanism for collocated processing via entry processors, and it also has various events and metrics and allows you to plug in your own persistence. That’s very important; persistence is always an important part. If you already have your data stored in, let’s say, an Oracle database, you don’t have to replace it. Keep storing it in Oracle; Ignite will automatically write through cache updates to your Oracle database, and whenever a read happens that is not cached, it will automatically read it through from the Oracle database, or from any other database, for that matter. It will automatically integrate with almost any type of RDBMS or any other persistence that you might be working with. We also support the write-behind mode, which essentially allows you to accumulate updates to the database in one batch and then flush that batch to the database. This basically allows you to reduce the load that would be imposed on the database by frequent updates and actually improve overall performance.
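The write-behind idea the speaker describes, accumulating updates and flushing them to the database in one batch, can be sketched in plain Java. This is a conceptual illustration, not Ignite’s actual implementation; the `storeWriter` callback is a stand-in for your RDBMS batch writer:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Minimal write-behind sketch: updates are coalesced in a buffer and
// flushed to the backing store in one batch once a threshold is reached.
class WriteBehindBuffer<K, V> {
    private final Map<K, V> pending = new LinkedHashMap<>(); // coalesces repeated updates per key
    private final int batchSize;
    private final Consumer<Map<K, V>> storeWriter;            // stand-in for the RDBMS batch write
    public int flushCount = 0;

    WriteBehindBuffer(int batchSize, Consumer<Map<K, V>> storeWriter) {
        this.batchSize = batchSize;
        this.storeWriter = storeWriter;
    }

    public synchronized void put(K key, V val) {
        pending.put(key, val);
        if (pending.size() >= batchSize)
            flush();
    }

    public synchronized void flush() {
        if (pending.isEmpty())
            return;
        storeWriter.accept(new LinkedHashMap<>(pending)); // one batched write instead of N
        pending.clear();
        flushCount++;
    }
}
```

Note how two updates to the same key collapse into one row in the batch, which is exactly why write-behind reduces database load under frequent updates.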

So those features come by implementing the JCache standard. But on top of that, Ignite adds clustering and distribution. As I mentioned, JCache is just a standard for local caching, so Ignite actually adds data partitioning capabilities, data querying capabilities in a distributed fashion, and distributed transactions. You can look at it as a distributed key-value store, where every node owns a portion of the data and, collectively across all the nodes, you have your total memory available to the cluster, and you can transact across multiple nodes in ACID fashion; Ignite is actually fully ACID-compliant. At no time will you get an inconsistent read or inconsistent write with an Apache Ignite cluster. The data is always consistent, and each transaction will either commit as one unit or fail as one unit, regardless of whether the failure happens on the client side or the server side, on a primary copy or on a backup copy. On top of that, from a querying standpoint, we support ANSI-99 SQL. This is a very unique feature of Apache Ignite: it’s virtually the only product within the in-memory data grid market which supports SQL and is ANSI-99-compliant. So you can pretty much run any type of SQL on top of Apache Ignite. You can do distributed joins, you can do group-by clauses, you can do sorting, averages, all sorts of aggregation functions, et cetera, et cetera. I’ll show you a couple of examples on some following slides. And all the queries actually execute fast because we’re doing indexing in memory, so all the indexes are kept in memory and all the index look-ups are pretty fast.

So this diagram actually shows how the data is distributed within an Apache Ignite cluster, and the ways you can distribute data: you can partition it, or you can replicate it. This diagram shows a partitioned cache. Essentially, here we have four server nodes named JVM1, 2, 3, and 4, and a client node, which remotely connects to the data cluster. And we have keys A, B, C, D, and as you see, the keys are evenly distributed within the cluster, so every node becomes an owner of a certain key. If you have more keys, then every node becomes an owner of a collection of keys. And one of the interesting things here is that there’s no backup server. We don’t back up servers; we back up keys. So if you look at this diagram, all the backup copies for the keys are also evenly distributed within the cluster. So, for example, if one cluster node has ten keys, then some backups will end up on Node 2, some backups will end up on Node 3, and some backups will end up on Node 4. So backups are also evenly distributed within the cluster. We do it to make sure that, if any of the nodes crashes, you don’t get another node all of a sudden overloaded. Since backups are evenly distributed, the extra load will also be evenly distributed within the cluster.
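The key-to-node mapping described above can be illustrated with a toy affinity function: each key hashes to a primary node, and its backup lands on the next node in the ring, so backups spread evenly across the cluster instead of piling onto one dedicated backup server. This is a simplified stand-in for Ignite’s real affinity function, not its actual algorithm:

```java
// Toy affinity function: primary = hash(key) mod nodeCount,
// backup = next node in the ring. Backups are thus spread across
// the whole cluster instead of living on one backup server.
class ToyAffinity {
    private final int nodeCount;

    ToyAffinity(int nodeCount) { this.nodeCount = nodeCount; }

    public int primary(Object key) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    public int backup(Object key) {
        return (primary(key) + 1) % nodeCount; // always a node other than the primary
    }
}
```

With four nodes, a crash of any one node shifts its keys' load onto all three survivors rather than onto a single standby.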

And lastly, I want to bring your attention to how a client connection works. We have a concept of a near cache. It’s a smaller local cache that you can instantiate on the client side. The way it works is, for example, if a key is accessed and it’s not in the near cache, then a request will be made to the node that is the primary holder of that key; on this diagram, JVM1 is the primary holder of Key A. So it will ship Key A back to the client, which will cache it in the near cache and return it to you. If you’re trying to access Key B, on this diagram, Key B is already present in your near cache, so no request to the server is made.
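The near-cache read path can be sketched in plain Java: check a small client-side map first, and only on a miss go to the server, then populate the local copy. The `serverLookup` function below is a hypothetical stand-in for the remote request to the primary node, not Ignite’s API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Near-cache sketch: reads check a small client-side map first and only
// go to the server (simulated by a loader function) on a miss.
class NearCache<K, V> {
    private final Map<K, V> local = new HashMap<>();
    private final Function<K, V> serverLookup; // stand-in for the remote request
    public int serverRequests = 0;

    NearCache(Function<K, V> serverLookup) { this.serverLookup = serverLookup; }

    public V get(K key) {
        V v = local.get(key);
        if (v == null) {                // miss: ask the primary node, then cache locally
            serverRequests++;
            v = serverLookup.apply(key);
            local.put(key, v);
        }
        return v;                       // hit: no network trip at all
    }
}
```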

So replicated caches actually work a bit differently from partitioned caches. Replicated caches hold the whole data set in the memory of each node. If in the partitioned scenario we took the data set and split it across multiple nodes, so the more nodes you add, the more data you can cache, in a replicated scenario, each node holds the full data set, so by adding more nodes, you are not actually adding more memory to your cluster. And updates in replicated caches are a lot more expensive than in partitioned caches. In a partitioned cache scenario, we only have to update the primary copy and as many backups as you have configured. If you have ten nodes in the cluster and, let’s say, one backup, you only update the primary and one backup copy. If you have 100 nodes in the cluster, you still only update the primary and backup. A thousand nodes in the cluster, same thing. In a replicated cache, the more nodes you add, the more network trips you have to make to do an update. If you have 100 nodes in the cluster, then you have to update all 100 nodes. So the updates are very expensive and, therefore, replicated caches should not be used in scenarios where updates are frequent. The predominant use case for replicated caches is to cache lookup data or reference data: data that gets read a lot but is not updated a lot. And typically, in your deployments, you will end up with some combination of the two. You probably will have maybe 20 partitioned caches and maybe five replicated caches. The cool thing here is that you can transact between those caches, so different caches can participate in the same transaction. Transactions support multiple caches. And you can also query and do distributed joins between multiple caches.

So let’s take a look at how Ignite stores the data. We have a concept called off-heap memory, and anybody who has worked with Java knows that Java does not work very well with large heaps. Java actually works well with maybe between 16 and 20 gig of RAM, and if you go beyond that, you run into lengthy GC pauses. In our own experiments, if you have a cluster of, let’s say, ten nodes with 50 gigabytes of RAM allocated to each node, you may end up with a GC pause of about five minutes, and that is unacceptable for most production environments. So to solve this problem, Ignite introduces a concept called off-heap memory, where the data is still stored in memory, still a pointer access away, so access to the off-heap memory is very fast, except that now the JVM does not know about this memory, and therefore garbage collection is not affected. So the way you would use off-heap memory is this: you allocate for your JVM as much memory as you would without Ignite even being there, so only the memory that your own application requires. And then you specify how much memory Ignite should use for its off-heap storage, and Ignite will start delegating all of the cache data off-heap. One of the differentiating features of Ignite’s off-heap implementation is that we also keep indexes off-heap. It is a very important distinction, because indexes generally occupy anywhere between 30 and 50 percent of your overall data set, depending on how aggressively you index. So if you are already caching maybe 100 gigabytes of data and you have 30 gigabytes of indexes, they should be stored off-heap, not on-heap. If you store them on-heap, then you are back to square one and you’re having the same GC problem.
So Ignite actually supports storing indexes in off-heap storage as well, which makes them work very well with garbage collection.
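The off-heap idea can be illustrated with plain Java: `ByteBuffer.allocateDirect` gives you memory outside the JVM heap, invisible to the garbage collector, at the cost of manual serialization. This is only a conceptual illustration, not Ignite’s actual off-heap storage:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Conceptual off-heap storage: the bytes live in native memory
// (a direct ByteBuffer), so they never contribute to GC pressure.
class OffHeapSlot {
    private final ByteBuffer buf;

    OffHeapSlot(int capacity) {
        buf = ByteBuffer.allocateDirect(capacity); // native memory, outside the Java heap
    }

    public void write(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        buf.clear();
        buf.putInt(bytes.length); // length prefix, then payload
        buf.put(bytes);
    }

    public String read() {
        buf.flip();               // prepare the buffer for reading what was written
        int len = buf.getInt();
        byte[] bytes = new byte[len];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The trade-off is visible here: values must be serialized to bytes on the way in and deserialized on the way out, which is the price paid for keeping them out of the garbage collector’s sight.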

Another cool feature that we recently added to Ignite is deadlock-free transactions. Anybody who has worked with transactions and distributed locks knows that it’s important to always keep track of the order in which you acquire locks, because if at any point you acquire locks in the wrong order, you can end up with a deadlock situation. So Ignite introduced a deadlock-free transaction mode, or fail-fast transaction mode, where we optimistically proceed with the transaction without acquiring any locks, so you don’t have to worry about lock order. You can do updates within different transactions in a different order, and you will never end up in a deadlock scenario. At the same time, we resolve most of the contention: if you’re updating the same keys from multiple threads, most likely your update will still go through. But sometimes, when it’s impossible to resolve, we optimistically fail the transaction with an optimistic exception and allow the user to retry it. This mode actually makes it possible to transact across multiple components in organizations that have large teams accessing the APIs of those components in any order, since it’s almost impossible to coordinate lock ordering across multiple teams within a project. So deadlock-free transactions are definitely the preferable way. And on top of that, these transactions are faster: because we don’t acquire any locks and always optimistically proceed, the transaction performs a lot faster than its pessimistic counterpart, which requires locks. So do take advantage of this functionality. It’s a pretty cool feature.
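The optimistic, lock-free approach described above can be mimicked in plain Java with compare-and-swap: each "transaction" reads the current value, computes a result without holding any lock, and commits only if nothing changed underneath it, retrying on conflict. This is a conceptual analogue, not Ignite’s transaction implementation:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Optimistic "transaction" sketch: no locks are ever taken, so lock
// ordering is irrelevant and deadlock is impossible; on a conflict the
// commit fails and the operation is simply retried.
class OptimisticCell<V> {
    private final AtomicReference<V> value;

    OptimisticCell(V initial) { value = new AtomicReference<>(initial); }

    public V get() { return value.get(); }

    public V update(UnaryOperator<V> tx) {
        while (true) {
            V read = value.get();            // optimistic read
            V proposed = tx.apply(read);     // do the work without holding a lock
            if (value.compareAndSet(read, proposed))
                return proposed;             // commit succeeded
            // otherwise someone committed first: retry, analogous to
            // catching Ignite's optimistic exception and re-running
        }
    }
}
```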

Another recent addition to Apache Ignite is our cross-platform binary protocol. Starting from version 1.5, we switched, by default, to storing data in binary format. So you actually never have to deploy classes on the server side, because we always serialize them on the client side and keep them serialized on the server side. Our indexing and our queries are able to look inside the binary format, so you never even have to de-serialize objects to do index look-ups or field lookups. So it’s a very efficient, very compact protocol, and the cool thing here is that it’s cross-platform. Another cool addition in Ignite 1.5 is that we have added C++ and .NET/C# APIs to Apache Ignite. And the difference in our .NET and C++ integration is that it’s not just that .NET, for example, can connect to the grid using a client API. It’s a full-fledged in-memory data fabric available for .NET. You can transact from .NET, you can do near caching from .NET, you can even execute .NET closures in collocated fashion on top of your Apache Ignite cluster. You can utilize .NET for persistence; you don’t have to implement Java persistence. So the goal here is to make sure that, if you’re using .NET, you do not have to write a single line of code in Java. We’re not 100 percent there yet, but with every release, we’re getting closer and closer, and it’s already a fairly feature-rich in-memory data fabric for .NET and for C++.

As I mentioned before, one of the unique features of Apache Ignite is our SQL capabilities and the ability to query data using standard ANSI-99 SQL. All the queries are always consistent. They return consistent data and they execute in a [unintelligible] fashion. So regardless of a crash within the cluster, the queries will always return a consistent result. That is actually very different from the way some other products support querying capabilities. There’s virtually no limit to the SQL queries you can execute on top of Apache Ignite. You’re working with Java classes: essentially, the class name becomes the table name and the fields in the class become columns. You tell us which fields you want to index, and after that, you can execute pretty much any type of query. You can do distributed joins, you can do group-by, you can do having clauses, you can do sorting, order-by, you can do any type of aggregation. Any type of SQL is supported. So SQL is yet another way you can interact with an Apache Ignite cluster. There are several ways: one is the key-value API, which I mentioned for the data grid; another is a MapReduce API, which we’ll talk about; and now there’s an SQL API. And Ignite also comes with JDBC and ODBC drivers, so you can interact with Ignite using standard JDBC/ODBC connectivity from any kind of standard tool that supports it.

Here is an example of a typical query that you can run on top of Apache Ignite. In this case, we simply do a join between person and organization based on the organization ID field. We’re using standard SQL syntax, and we’re also showing that you can use any type of SQL function; here we’re using the concat function to concatenate the first name and last name of a person. Once we’ve executed the query, we get a cursor back, and we can iterate through that cursor in paginated fashion rather than fetching all the data at once and storing it in a collection.
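The actual slide isn’t reproduced in the transcript, so here is a hypothetical reconstruction of a query matching the description; all table and column names (Person, Organization, firstName, lastName, orgId) are assumptions, not the demo’s real schema:

```sql
-- Join person to organization on the organization ID field,
-- using concat to build the full name.
SELECT concat(p.firstName, ' ', p.lastName), o.name
  FROM Person p
  JOIN Organization o
    ON p.orgId = o.id;
```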

Here’s another example of SQL you can execute on top of Ignite. Here we’re actually demonstrating group-by and order-by clauses and how they can be used. Again, we’re using a distributed join, and we’re querying for the organization name and the average, minimum and maximum salaries of all people working within that organization. So we’re doing a group-by based on organization, and we’re utilizing the average and other aggregates here. Again, this is one of the fairly simple queries you can run on top of Apache Ignite.
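Again, as a hypothetical reconstruction of what the slide describes (same assumed schema as above, with a `salary` field added as an assumption):

```sql
-- Per-organization salary statistics over a distributed join.
SELECT o.name, AVG(p.salary), MIN(p.salary), MAX(p.salary)
  FROM Person p
  JOIN Organization o
    ON p.orgId = o.id
 GROUP BY o.name
 ORDER BY o.name;
```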

So let’s switch from the data grid to the compute grid. The compute grid is probably the second-most utilized component available within Apache Ignite, and what it allows you to do is distribute and execute closures and tasks within a cluster. You can use it to simply distribute simple jobs, simple runnables or callables, standard Java, within the cluster, and you can execute thousands of jobs per second. Or you can take a longer-running task, split it into multiple jobs, and execute those jobs in parallel within your cluster, therefore making your task execute faster. This paradigm is called the ForkJoin paradigm, and this is exactly what is depicted on this diagram. We have a task C that is split into multiple jobs, C1, C2 and C3. They execute in parallel within the cluster, then we get results R1, R2 and R3, aggregate them, and return them to the user. It’s a very fast, very efficient algorithm for distributed task execution, and we recommend utilizing it rather than standard MapReduce APIs, simply because MapReduce was never built for performance. As a matter of fact, ForkJoin is an edge case of the MapReduce API where both mapping and reduction happen on the client side, which works well when the mapping and reduction steps themselves execute fast and there is no point in distributing them.
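The ForkJoin flow on the diagram (task C splits into C1..Cn, the subtasks run in parallel, results R1..Rn are aggregated back) is exactly what the JDK’s own `ForkJoinPool` models. Here is a plain-Java illustration of the paradigm; no Ignite APIs are involved:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// ForkJoin illustration: a task splits into subtasks, the subtasks run
// in parallel, and their partial results are aggregated into one result.
class SumTask extends RecursiveTask<Long> {
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= 1000) {              // small enough: compute directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) / 2;            // split the task into two jobs
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                          // run the left half in parallel
        return right.compute() + left.join(); // aggregate R1 + R2
    }

    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }
}
```

In Ignite’s compute grid the same split/aggregate shape applies, except the subtasks run on other cluster nodes rather than other threads of one JVM.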

So from an API standpoint, the tasks are pretty simple, but there’s a lot of complexity that happens underneath. For example, Ignite supports automatic load balancing and fault tolerance, we support automatic job checkpointing for longer-running jobs, et cetera. And all these implementations are pluggable. For example, load balancing comes in several permutations: we have random load balancing, we have round-robin load balancing, and we have adaptive load balancing, where we pick the least loaded node based on metrics that we constantly collect from all the nodes in the cluster. The same goes for fault tolerance: we have various pluggable failover implementations. As a matter of fact, Ignite comes with a guarantee that, regardless of what kind of failure happens in the cluster, no job will ever be lost. So as long as you have one cluster node standing, an Ignite job will always complete. And checkpointing is utilized whenever your job takes a long time and you want to periodically checkpoint its state, to make sure that, in case of a crash, whenever the job gets restarted, it’s restarted from the last saved checkpoint. There are also other additional features within the Ignite compute grid, so please do check it out. It’s a pretty cool API, pretty simple and yet very powerful.

Streaming is another component available within Apache Ignite, and essentially the main goal of streaming is to make sure that you can continually ingest large amounts of data into your cluster. Streaming data never ends, so you are never able to hold all the data you’re streaming; you generally work with a sliding window of data available within the cluster. As the data keeps coming in, the sliding window keeps shifting together with the data. And since sliding windows are based on top of data grid caches, all the features available within data grid caches are also available within the streaming and sliding window APIs. For example, you can transact between sliding windows, and you can query and do joins between sliding windows of data. So on one side, you’re ingesting data in large, continuous volumes into Ignite, and on the other side, you can execute SQL queries on that data and on the averages or other aggregates that you choose to maintain for your streaming use case. But even if you do not have a continuous data stream and just want to preload your cache, utilizing the streaming APIs is probably the best way to do it. If you have a large amount of data that you simply want to ingest into a cache once, you would still take advantage of our data streamer API to do so, because it will automatically batch all the data and ship it to nodes in batches to provide the best possible throughput.
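The sliding-window behavior, a bounded, most-recent slice of a never-ending stream that shifts as data arrives, can be sketched with a bounded deque. This is plain Java for illustration, not Ignite’s windowing API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding window sketch: keeps only the N most recent events; as new
// data streams in, the oldest entries fall off the other end.
class SlidingWindow<E> {
    private final Deque<E> window = new ArrayDeque<>();
    private final int capacity;

    SlidingWindow(int capacity) { this.capacity = capacity; }

    public void ingest(E event) {
        window.addLast(event);
        if (window.size() > capacity)
            window.removeFirst();   // the window shifts together with the data
    }

    public int size() { return window.size(); }

    public E oldest() { return window.peekFirst(); }
}
```

In Ignite the window contents live in a cache, which is why you can still run SQL and joins over whatever currently sits inside the window.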

The last component I’d like to mention in Apache Ignite, before I switch to our Spark and Hadoop integration, is the service grid. One of the main features the service grid has is the ability to deploy cluster singletons onto your cluster. If you look at this diagram, we have a cluster singleton and we have a node singleton. In the case of a cluster singleton, the service will be deployed only once onto the cluster, and if the node on which it ends up being deployed crashes, then Ignite will guarantee that the service is automatically redeployed on another node. So this ensures constant availability of your service. In the case of a node singleton, we make sure that the service is always available on each node. But a singleton is also just an edge case. As a matter of fact, you can control the number of service instances you deploy into the cluster. For example, you can say that you only want five instances of a service, and Ignite will automatically distribute these five instances across the cluster and make sure that, no matter what kind of failures happen within your cluster, the number of your service instances always remains five.

So why would you need a service? Essentially, a lot of times you need something that only happens once, or only happens a controlled number of times. For example, maybe you have a preloading process that needs to preload data into the cluster from some remote data source. If this process crashes, you want it to automatically be restarted on a remote node, but if it doesn’t crash, you want it to run from start to end, complete, and execute only once. A cluster singleton is actually the best way to deploy and provide such a capability in your cluster.

Now, the last thing I want to talk about before we switch to the Hadoop and Spark integration is benchmarking. I’ve been saying that Ignite is very fast, that Ignite improves the performance of applications. How do we know it? We know it because we constantly benchmark it ourselves. We run benchmarks against previous versions of Ignite to show that the new releases are faster, and we always run them against competitor products, so we always know where we stand in comparison to the competition. As a matter of fact, benchmarks of Ignite versus Hazelcast are available on the Apache Ignite web site, so I encourage you to come and take a look at our benchmarks. The code is also available. We use Yardstick as a distributed benchmarking framework. It produces very nice graphs, so you can take a look. We’re also working on benchmarking Ignite versus Infinispan, and when we finish that, we’ll benchmark it versus Cassandra as well.

So let’s talk about how Ignite can help with Hadoop and Spark installations. First of all, I want to mention that Ignite natively integrates with Hadoop YARN and Mesos clusters. So if you’re running those clusters in your environment, Ignite will automatically integrate and automatically scale on demand as well.

So let’s take a look at how Ignite can integrate with Spark. As you know, Spark is very good at running computations in parallel and computing on data in parallel. What Spark is not very good at is sharing state between those computations. All Spark jobs and Spark tasks run in complete isolation from one another, while Ignite, on the other hand, is very good at sharing data. So why not just implement a native RDD API on top of an Ignite cache and provide state-sharing capabilities to Spark RDDs? That’s exactly what the Ignite shared RDD does. Unlike a Spark RDD, the shared RDD provided by Ignite is mutable, so you can have one job or task update this RDD and another one read the update. This allows you to share state between various jobs, tasks, and applications. As a side effect, you also get faster SQL. Because the Ignite shared RDD is based on top of a cache, on top of the data grid, you get all the data grid features, and indexed SQL is one of them. So whenever you run SQL on top of a shared RDD, it executes a lot faster than in Spark, because Spark always has to do full scans while Ignite utilizes [unintelligible]. As a matter of fact, towards the end of this presentation, I will do a quick demo comparing Spark SQL execution to Ignite SQL execution.

Another way you can share state between Spark computations, tasks, and jobs is by utilizing the Ignite in-memory file system. The diagram, as you see, looks pretty similar to the shared RDD one, except that here we can automatically persist data to a secondary file system such as HDFS or any other Hadoop-compliant file system. The benefits you get with a shared file system are pretty similar to the benefits you would get with the shared RDD. So why use one versus the other? When do you use the in-memory file system and when do you use the RDDs? That is pretty simple. Sometimes you really do have to work with files: CSV files, text files, log files, PNG files, and the most convenient way to store them is an in-memory file system. On the other hand, whenever you work with objects or key-value pairs, the most convenient way to work with them is the RDD format, namely the shared RDD in Ignite. So the selection of what to use, either the RDD or the in-memory file system, is really based on your use case.

Yet another useful integration between Ignite and Hadoop is our in-memory MapReduce implementation. If you work with native Hadoop MapReduce, you probably know that it’s not fast. It’s actually pretty slow, and the reason it’s slow is that it was never built for performance. When you execute a Hadoop MapReduce task, essentially you start with a client, then you contact a job tracker, then a name node, then a task tracker. Then the task tracker will go to an HDFS node, the HDFS node will go to disk, and all this process will bubble up, and the name node will probably be contacted more than once by multiple task trackers going back and forth. So all the gray arrows on this diagram represent what Hadoop does during MapReduce task execution. A very [unintelligible] process, obviously never built for performance; built for batch processing. The blue arrow, on the other hand, is what Ignite does. Essentially, we not only collocate MapReduce computations with data, we do it in one step. We immediately send the computations to the node where the data is at. And on top of that, we are often able to collocate in-process, so the MapReduce task itself will execute in the same Java process where the data is actually cached. That allows us to access data by dereferencing a pointer, which is very fast, and that’s where most of the acceleration comes from: executing the MapReduce task in one step and collocating in-process.

And before I move to the demo, I just want to mention that GridGain provides enterprise features on top of Apache Ignite. GridGain does not change Apache Ignite in any way. As a matter of fact, it plugs into the plug-in architecture provided by Apache Ignite and starts up as an Apache Ignite plug-in. This diagram shows all the features available in Ignite and GridGain, and as you can see, GridGain provides a lot of additional features that have to do with data center replication, security, management and monitoring, network segmentation, rolling upgrades, et cetera. So do check it out. You can always download the trial version from the GridGain.com web site.

And the last thing I’ll be showing is a demo. I’ll be comparing Apache Ignite SQL to Spark SQL using a small data set that I loaded on my laptop. And I’ll be showing it using Apache Zeppelin, which is a notebook tool for running analytics. It integrates with Spark and Ignite, among other projects, so we’ll be using Zeppelin for running our SQL queries in this case.

So this is what a Zeppelin window looks like. I’m actually going to have two windows open – one for Spark and one for Apache Ignite. I have preloaded about six gigabytes of data into both. I essentially have one instance of Spark running and one instance of Ignite, both holding six gigabytes of data. Given today’s data sizes, the data I’m working with here can be considered minuscule – we’re not doing terabytes here, just about six gigabytes – and what I’m going to do is run a couple of join queries that will do joins between two tables, namely person and organization, and we’re going to see how long each query takes. So let’s start with Spark. We are going to select all person names, manager names and organization names by doing a triple join – SQL with a triple join – based on the data we’ve loaded, so let’s go ahead and run it. We’re still running. Spark has to do a full scan across the tables, and that’s why the query is taking a while. Just to remind you, the data set is fairly small, but the full scan on this data set still takes quite a bit of time, and the query still has not completed yet... and now it’s done. The execution took about 30 seconds.

So let’s go ahead and run the same query in Ignite. The only difference here is that I have to tell Zeppelin that I’m using the Ignite interpreter and not the Spark interpreter, and I do that by specifying this parameter, ignite.ignitesql. Otherwise the query looks exactly the same. I’m going to go ahead and run it and – boom, it’s done. It takes literally less than a second; Zeppelin actually shows a time of zero seconds. Just to ensure that we have not hard-coded anything here, I’m going to change the value, maybe to 223, and run it again. It’s done. So all person names, manager names and organization names are retrieved.
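Outside Zeppelin, the same kind of query can be issued from Java through Ignite’s SQL API. Here is a hypothetical reconstruction – the table and column names below are assumptions matching the demo’s description, not the actual data set; on a real cluster the string would be executed with `cache.query(new SqlFieldsQuery(...))`:

```java
// Hypothetical reconstruction of the demo's triple-join query. The table and
// column names (Person, Organization, managerId, orgId) are assumptions.
public class JoinQueryDemo {
    // On a running cluster this string would be executed with, e.g.:
    //   cache.query(new SqlFieldsQuery(TRIPLE_JOIN)).getAll();
    static final String TRIPLE_JOIN =
        "SELECT p.name, m.name, o.name " +
        "FROM Person p " +
        "JOIN Person m ON p.managerId = m.id " +
        "JOIN Organization o ON p.orgId = o.id";

    public static void main(String[] args) {
        System.out.println(TRIPLE_JOIN);
    }
}
```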

So let’s execute another query with Spark, and this time we are going to do a group-by and an order-by: we are going to select average salaries of all employees for each manager within a certain organization. A set manager ID is what marks a person as a manager. I’m going to go ahead and run it – I’m currently in the Spark window – and we are still running. Again, Spark has to do a full scan in order to execute the join between the person and organization tables. We’re still running... all right, it’s almost done. Okay, so it took about 33 seconds, and we got the salaries ordered in ascending order.

So let’s go ahead and run the same query in Ignite. Again, the only difference is that I specified the Ignite interpreter. Run it – it’s done. Zeppelin shows zero seconds. Just to change the result, I can change ascending to descending here, for example. I run it, and now the results are shown in descending order. I might also mention that Zeppelin has some pretty cool graphs to visualize your data, so do take advantage of them whenever playing with Ignite together with Zeppelin. So on this note, I’m going to end my presentation and move to the Q&A session.

Dane Christensen:
Hey, thanks for that, Dmitriy. Excellent information, and we do have some questions from the audience here, so I’m going to go ahead and run those by you. And by the way, we have some time for additional questions, so if you do have a question to ask, go ahead and feel free to enter it now. Here’s the first question I’ve got for you, Dmitriy: do you tag the information coming into the network?

Dmitriy Setrakyan:
Do we tag the information coming into the network? I think this question has to do with security, which is something that is provided by GridGain, actually, not by Ignite. But generally speaking, every node, before it can receive any packets, has to pass through the built-in authentication and authorization mechanisms, and every node, be that a client or a server node, will have to be authenticated before it enters the cluster.

Dane Christensen:
Okay. Excellent. Now here’s another one for you. Actually, let me do this one. How do I filter on lists within my objects with SQL?

Dmitriy Setrakyan:
All right, so the question is, I think: if I have a nested list inside of my object, how would I filter on it? My recommendation usually is to store all objects in caches individually. You can have multiple caches, and as I mentioned, every cache can store different objects, so I would put the objects in the list into one cache and the parents into another cache, and then the filtering would happen by doing a join with a where clause. It’s actually very simple. You would, of course, have an index on the join key, and then you would select, let’s say, all persons that work for an organization and meet certain criteria – maybe a salary over $50,000.
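To make that modeling concrete, here is a self-contained plain-Java sketch of the layout – HashMaps stand in for the two Ignite caches, and the hand-written lookup plays the role of the indexed SQL join; all class and field names are made up for illustration:

```java
import java.util.*;

// Plain-Java sketch of the suggested modeling: list elements go into their own
// cache keyed by id, parents into another. HashMaps stand in for IgniteCache so
// the example is self-contained; in Ignite the loop below would be one SQL join
// with an index on the join key, roughly:
//   SELECT p.name FROM Person p JOIN Organization o ON p.orgId = o.id
//   WHERE o.name = ? AND p.salary > ?
// All class and field names here are made up for illustration.
public class NestedListModeling {
    record Org(int id, String name) {}
    record Person(int id, int orgId, String name, double salary) {}

    static List<String> wellPaidIn(Map<Integer, Org> orgs,
                                   Map<Integer, Person> people,
                                   String orgName, double minSalary) {
        List<String> out = new ArrayList<>();
        for (Person p : people.values()) {
            Org o = orgs.get(p.orgId());              // the "join" on the key
            if (o != null && o.name().equals(orgName) && p.salary() > minSalary)
                out.add(p.name());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Org> orgs = Map.of(1, new Org(1, "Acme"));
        Map<Integer, Person> people = Map.of(
            1, new Person(1, 1, "Ann", 60_000),
            2, new Person(2, 1, "Bob", 40_000));
        System.out.println(wellPaidIn(orgs, people, "Acme", 50_000)); // [Ann]
    }
}
```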

Dane Christensen:
Okay, great. Thanks for that answer, and here’s another one. You showed SQL queries as text. Is there a builder API?

Dmitriy Setrakyan:
So the question is whether there is a builder API to build SQL queries. The answer is no – to us, a SQL query is a string. I’m sure you can get a builder adapter for SQL outside of Ignite; as I mentioned, it’s industry-standard ANSI-99 SQL, so it shouldn’t be hard to create one.

Dane Christensen:
Okay, excellent. Next question. What was the size of the clusters in the demo?

Dmitriy Setrakyan:
The size of the clusters in the demo was one, actually. I had one Ignite node running and one Spark node running, so on both sides, there was one worker for Spark and one for Ignite.

Dane Christensen:
Okay. Now, that’s the question there, as I read it, but do you think they were asking about RAM or – or –

Dmitriy Setrakyan:
Ah, so both clusters were running on my laptop. I have a MacBook Pro, 13-inch. The total data set was about six gigabytes, and the whole data set was loaded in memory on both servers, Spark and Ignite. And then we ran the SQL queries against that.

Dane Christensen:
Okay, great. So that – thanks for that clarification. Okay, here’s another question for you. In your demo, was the Spark RDD cached prior to running the query?

Dmitriy Setrakyan:
Correct. As a matter of fact, I loaded the data in memory prior to running the query. So the RDD was in memory.

Dane Christensen:
Okay, good. And the next question here. Is there a way to store a consistent snapshot for all data of a map?

Dmitriy Setrakyan:
So the question is whether we can snapshot the cached data, I believe. Right now there’s no built-in mechanism to do a snapshot. There is a ticket open on Apache Ignite with several proposals on how to support it, so I believe it will be coming soon. In the meantime, you can always iterate through the cache and serialize the key-value pairs into a file, and do the reverse on the other side whenever you need to read it, if you really need the snapshotting functionality.
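A minimal sketch of that workaround, with a plain HashMap standing in for the IgniteCache so it runs without a cluster. Note this copies a point-in-time view of one node’s map; it is not a cluster-wide consistent snapshot under concurrent updates:

```java
import java.io.*;
import java.util.*;

// Self-contained sketch of the workaround described above: iterate the entries,
// serialize them to a file, and deserialize on the other side. A HashMap stands
// in for the IgniteCache; with a real cache you would iterate its entries the
// same way.
public class CacheSnapshot {
    static void write(Map<String, String> cache, File f) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(new HashMap<>(cache)); // copy = point-in-time view
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @SuppressWarnings("unchecked")
    static Map<String, String> read(File f) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
            return (Map<String, String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> cache = new HashMap<>(Map.of("k1", "v1", "k2", "v2"));
        File f = File.createTempFile("snapshot", ".bin");
        write(cache, f);
        System.out.println(read(f)); // same key-value pairs restored
        f.delete();
    }
}
```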

Dane Christensen:
All right, excellent. All right, here’s another question. Is it possible to work with Ignite from any language other than Java, like Python?

Dmitriy Setrakyan:
That’s a good question. Currently, as I mentioned, we have pretty rich APIs for Java, .NET and C++. There are also JDBC and ODBC client APIs, and there’s REST-based connectivity – a pretty powerful REST API which is very well documented on the Ignite web site. So that’s one way to interact from Python. We also have a Node.js native client coming out pretty soon, probably within a month or so. Additional languages are coming, but for the languages we don’t have a direct API for, there’s always the REST protocol that you can utilize.

Dane Christensen:
All right. Another question. How do you compare Ignite data grid capabilities with Hazelcast?

Dmitriy Setrakyan:
All right. So how does Ignite compare to Hazelcast? We get that question a lot. Ignite has quite a few unique features that are not present in other data grid products. First of all, Ignite is an in-memory data fabric, not just a data grid – the data grid is just one of the components it provides, while Hazelcast is a data grid product, so the data grid is everything that Hazelcast provides. But even within the data grid, I think there are quite a few unique features in Ignite that are not available in Hazelcast. One of the main ones would be the SQL capability. As I mentioned, we are ANSI-99 SQL compliant, and you can do distributed joins in SQL. In Hazelcast, for example, you wouldn’t be able to do a join with or without SQL, so you always have to move data back and forth to connect different types of data. In Ignite, you just do it with one SQL query. You can do joins across two, three, four or five tables together, add a few indexes in cache, and the joins work. So SQL is probably the biggest differentiator. There are quite a few others – for example, our transaction support, I believe, is a lot better. As I mentioned, there is a deadlock-free mode for transactions, so you never have to worry about acquiring locks in the same order. In performance benchmarks against Hazelcast we also generally come out on top, and we publish all the benchmarks and all the code for the benchmarks, so you can always come and take a look. I’m sure there are quite a few more differences – they’re all listed on the Ignite web site, by the way. There are a few comparison pages where we compare Ignite to Hazelcast and to other data grids, so do visit the Ignite web site for a detailed list of differences.

Dane Christensen:
Okay, excellent. Next question. Can I use stored procedures and functions?

Dmitriy Setrakyan:
Oh, that’s a good question. Can I use stored procedures and functions in SQL? Right now we support functions: you can define custom functions that you can call in your SQL queries. So, for example, if you have some non-standard behavior, or some discrepancy in behavior with certain functions when converting a database query to an Ignite query, you can always fix it in Java code. You can create your own logic for a function and then use it in SQL down the road. PL/SQL itself is not supported, and we don’t really see a need for it, because we have a compute grid: if you need to execute any kind of operation on the data, you just send a computation to the nodes where the data is and perform all the updates directly from Java. So there is no real need to do it from PL/SQL.
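A small sketch of such a user-defined function. The function name `sqr` is a made-up example; with ignite-core on the classpath, the method would carry the `@QuerySqlFunction` annotation and the class would be registered via `CacheConfiguration.setSqlFunctionClasses` – both are shown in comments here so the sketch compiles on its own:

```java
// Sketch of a user-defined SQL function; the name "sqr" is a made-up example.
// With ignite-core on the classpath, the method would be annotated with
// @QuerySqlFunction and the class registered via
// CacheConfiguration.setSqlFunctionClasses(MySqlFunctions.class).
public class MySqlFunctions {
    // @QuerySqlFunction
    public static int sqr(int x) {
        return x * x;
    }

    public static void main(String[] args) {
        // Once registered, the function is usable in SQL, e.g.:
        //   SELECT name FROM Person WHERE sqr(grade) > 50
        System.out.println(sqr(12)); // 144
    }
}
```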

Dane Christensen:
All right. Great. And we do have several questions, but if you did have a question that you wanted to ask, feel free to get it in. Again, if we can’t get to it on this call, we’ll follow up with an answer to your question by e-mail. But here’s your next question, Dmitriy. It’s: how does re-indexing happen in the cache, the same as in a database?

Dmitriy Setrakyan:
All right, so let me try to make sense of this. Essentially, indexes are dynamic – they’re always up-to-date. As I mentioned, Ignite is always consistent and you never get an inconsistent read or write, and the same holds for indexes. So as you update data in caches, the index updates happen simultaneously with the data updates. Indexes are always in sync with the cache data. I hope that was the question.

Dane Christensen:
Okay. Yeah, and you know, feel free, everyone on the call, to e-mail Dmitriy, or sales at GridGain, or just go to the Apache Ignite web site if you’ve got questions afterwards, and you can get those answered. But here are some other questions. How good is Ignite for parallel processing? We currently have SQL Server and it has a lot of contention with deadlocks. We have evaluated a few distributed database products, but none of them seem to scale well.

Dmitriy Setrakyan:
Well, this is an interesting question, because Ignite is all about parallel processing. As a matter of fact, Ignite doesn’t have non-parallel processing, so we believe we’re pretty good at it. All the queries get automatically split if there are multiple nodes, and the data set also gets split across multiple nodes, so just because each node has a smaller data set, by virtue of that alone you get faster query execution. On top of that, the indexes are also split across multiple nodes, so all the indexes are processed in parallel as well. And as far as deadlocks go, we implemented a deadlock-free approach, so if you use our deadlock-free transactions, you will never run into a deadlock scenario within Ignite.

Dane Christensen:
All right, fantastic. And I think we’ve got time for one really quick answer. Let me run this one by you. Is it possible to add event listeners on value changes in caches? Not all cache values but only some specific value of interest? Is that one you can answer in one minute?

Dmitriy Setrakyan:
Yes, I can. So the question is: can I subscribe listeners to cache updates, and can I subscribe listeners to only a subset of cache updates? The answer is yes. We have a concept of a continuous query in Apache Ignite, and a continuous query has a concept of a remote filter – a filter that executes directly on the server where the update happens, so that’s where the first level of filtering takes place. Then whatever passes the remote filter is shipped to the client, and on the client you can set up additional filters. So the answer is: use continuous queries. They’re very well documented on the Apache Ignite web site.
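The two-stage filtering can be pictured with a self-contained sketch – a plain functional interface stands in for Ignite’s `ContinuousQuery` with its `setRemoteFilterFactory` (server-side filter) and `setLocalListener` (client-side callback), so the example runs without a cluster:

```java
import java.util.*;
import java.util.function.BiPredicate;

// Self-contained sketch of the continuous-query pattern described above. In
// Ignite itself this is ContinuousQuery with setRemoteFilterFactory (the
// server-side filter) and setLocalListener (the client-side callback); a plain
// BiPredicate stands in here so the sketch runs without a cluster.
public class ContinuousQuerySketch {
    // Applies the "remote" filter to a batch of updates and returns only the
    // entries that would be shipped to the client's local listener.
    static <K, V> List<Map.Entry<K, V>> deliver(List<Map.Entry<K, V>> updates,
                                                BiPredicate<K, V> remoteFilter) {
        List<Map.Entry<K, V>> delivered = new ArrayList<>();
        for (Map.Entry<K, V> e : updates)
            if (remoteFilter.test(e.getKey(), e.getValue()))
                delivered.add(e);
        return delivered;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, Double>> updates =
            List.of(Map.entry(1, 50.0), Map.entry(2, 150.0));
        // Only updates with a value over 100 reach the client.
        System.out.println(deliver(updates, (k, v) -> v > 100.0));
    }
}
```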

Dane Christensen:
All right. Thanks for that answer, Dmitriy. And there’s a few other questions that we just didn’t have time to get to. We’re up against the end of our time today, but as I said, we’ll go ahead and follow up with the answers to those questions by e-mail.

So I want to thank Dmitriy first for all that excellent information about Apache Ignite, and I want to thank all of you who asked such excellent questions. I hope you got a lot of value from today’s presentation. Just a reminder that you can download the PDF version from the resources widget, and we will also be posting an on-demand version of this webinar to the web site; you’ll receive an e-mail with information about where to get that.

So again, thank you all for attending today. We appreciate it, and we hope to hear from you soon, and have a great rest of the day.

Dmitriy Setrakyan:
Thanks, everyone.