How to Share State Across Multiple Spark Jobs using Apache Ignite™

Watch this deep-dive webinar to learn how to easily share state in-memory across multiple Spark jobs, either within the same application or between different Spark applications, using the implementation of the Spark RDD abstraction provided in Apache Ignite™.


In this one-hour webinar, Dmitriy Setrakyan, Apache Ignite Project Management Committee Chairman and co-founder and EVP of Engineering at GridGain, will explain in detail how IgniteRDD — an implementation of the native Spark RDD and DataFrame APIs — shares the state of an RDD with other Spark jobs, applications, and workers. Dmitriy will also demonstrate how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or DataFrames.

Dmitriy Setrakyan
Founder & CPO, GridGain Systems

Dane Christensen:
Hello, everyone. I’m Dane Christensen, the digital marketing manager at GridGain Systems, and I want to thank you for attending today’s webinar, How to Easily Share State across Multiple Spark Jobs Using Apache Ignite. Your presenter today is Dmitriy Setrakyan, co-founder and vice-president of engineering [break in audio] at GridGain Systems as well as the chairman of the Project Management Committee for the Apache Ignite Project, who will be providing a deep dive into how Apache Ignite can improve the performance of Spark.

But before we get into Dmitriy’s presentation, I wanted to cover just a few housekeeping items for you. First, you can expand your slide area by clicking on the Maximize icon on the top right of the slide area or by dragging the bottom right corner of the slide area. At the bottom of your audience console are multiple application widgets you can use. If you have any technical difficulty, please click on the yellow Help widget, which covers common technical issues.

If you have any questions for Dmitriy at any point during the webcast, you can click on the red Q&A widget to submit your question. Dmitriy will answer questions during the Q&A section at the end of the presentation. It is possible we may not have time to get to all of your questions during the live event, but if he’s unable to get to your question, we’ll certainly follow up with a response by e-mail after the event.

A copy of today’s slide deck in PDF format is available in the green Resource List widget. An on-demand version of the webcast will be available approximately one day after the event and can be accessed using the same audience link that was sent to you earlier or from the GridGain website. So, with that, we’re ready to turn the floor over to Dmitriy Setrakyan. Dmitriy?

Dmitriy Setrakyan:
Yes. Thank you, Dane. Hello, everyone. Again, my name is Dmitriy Setrakyan. I’m one of the founders of GridGain Systems. I’m also on the PMC of Apache Ignite, so I actively participate in several Apache projects, and my main one is probably Apache Ignite.

So, this presentation is about how Apache Ignite can integrate with Spark, what kind of cool stuff Apache Ignite has that will help you use Ignite with Spark, and how it makes things better in certain cases. Before we get into that, though, I will give a brief Ignite overview and I will talk about Ignite architecture – not too deeply, but just at a high level. I will talk about the data grid architecture and how the data grid is the backbone of many of the other features provided by Ignite, like file systems, Spark integration, in-memory caching, et cetera.

And I will also do a quick demo towards the end of this presentation, where I’ll be using a pretty cool project called Apache Zeppelin to query Spark and to query Ignite, and we will see how the SQL differs. The SQL is actually the same, but the performance is different – you’ll notice that Ignite does it a lot faster, and I’ll explain why.

All right. So, let’s get started. Let’s take a look at what the Ignite In-Memory Data Fabric is. Essentially, Ignite is more than just one thing. Ignite is a collection of multiple components. All of them are in-memory. All of them work in-memory. All of them are distributed, so all of them are scalable and can scale across multiple nodes in the cluster – at virtually unlimited scale. The largest installations of Ignite include over 1,000 nodes in a cluster. Not to say that every installation has to be that way, but some of the largest ones are.

So, Ignite is a collection of in-memory components, all of them independent, all of them distributed, and all of them existing for one purpose: to make the application run faster. Ignite is all about increasing the performance and scalability of your application. And all of these components can be used standalone, independent from each other. You do not have to pick Ignite because you have to use three or four components. Usually, our users start off with one, maybe two, and as their usage gets a little more involved, they start using other components as well.

However, the strength of Ignite is not in providing all these independent components, but in making them work in an integrated fashion. It’s the integration of these components with each other that makes it very powerful. For example, let’s take a look at the data grid and the compute grid. The data grid distributes data within the cluster, and the compute grid helps you compute on that data, which means essentially that a computation can go to any node in the cluster, or be broadcast to multiple nodes in the cluster, and perform the computation there.

The power of the integration here is that a computation will always know where to go to compute on certain data. It will always know where the data resides, and it will always go exactly to the node where the data is. Otherwise, if a computation went to the wrong node, you would end up in a situation where the data would have to be fetched and brought over. And if that happens too much, then you’re not gaining performance, you’re losing performance. And if you misuse it, then you’re not adding scalability; you’re probably removing scalability.

So integration is very important. The same goes for the data grid and streaming: the streamed data always goes to the nodes where the data is. The same goes for SQL and for MapReduce. All these components are very well integrated with each other, and the main portion of the integration comes from data locality – knowing where the data resides and where to go at all times.

So, just several quick facts about Ignite. Ignite is open source, licensed under Apache 2.0. It’s an Apache project – Apache Ignite. We went through the Apache incubation process and were in incubation for over nine months, and around August of last year we graduated to a top-level Apache project, so now we’re enjoying the status of being a top-level project.

So, some more quick facts. Ignite is very easy to integrate into a system, to deploy into a system. It comes as one ZIP file. You download it, unzip it, and it’s ready to go. It has only one mandatory dependency, which is ignite-core.jar. All other dependencies can be brought in on an a la carte basis. The project is fully mavenized, so whichever integration you’d like to use, you can add that dependency at that time.

For example, if you would like to use Spring to configure Ignite, using Spring beans, you would add the Spring dependency. If you would like to use Log4J for logging, you would add the Log4J dependency. But those dependencies are not mandatory, and you only add them as needed, whenever you use them. Ignite is resilient. It’s fault tolerant. It does its best to load balance the work you send to the cluster.

And, lastly, I’ll mention that it runs on any type of hardware. There are absolutely no requirements as to what hardware you need to run or deploy Ignite. We will utilize all of the resources that you make available to Ignite. Essentially, if it’s a smaller box and you want us to use only four gigabytes of RAM, we’ll take that RAM and utilize it. If it’s a larger box, maybe with 64 cores and 300 gigabytes of RAM, we will gladly utilize all that RAM as well. It doesn’t matter to us. Of course, the more memory you have, the more data you can cache and the better performance you get, so always think about trying to cache more, not less.

All right. So, let’s take a look at the list of components available in Ignite. If you look at this famous jigsaw puzzle, all of these components are available in Ignite. It’s not an exhaustive list, but I think it captures most of them – the data grid, of course, being, as I mentioned, one of the most important components to us. It is responsible for distributing data within the cluster, for transacting on that data, and for querying that data, and I’ll be covering it in a little bit of detail today.

There’s also the compute grid, which is responsible for distributing computations within the cluster; the service grid, for deploying singletons – guaranteed singletons – within the cluster; streaming, for continuous data ingestion; an in-memory file system; and messaging and events. All the components are available. Again, you never have to use all of them, but most of our users end up using a good portion of them, so you will probably use the data grid, the compute grid, maybe messaging and events, maybe data structures, et cetera.

All right. So, given that we’re talking about Spark and how Ignite integrates with Spark, before I dive into the data grid, I’ll start off with why we need to share state within Spark. Essentially, Spark is very good at processing data in-memory. It will process in parallel. It will distribute the load across multiple servers, and it’s very good at it.

What it doesn’t do is keep the data – keep the state – in memory. So if you run one task, one job, or one Spark application, and then you want another application to use the same data or share some state, you cannot. The data has to be persisted – maybe to HDFS, converted to a file, or written to some other medium. Spark will not provide you the means to keep the data between applications, jobs, or tasks.

And we actually get multiple user requests on how to use Ignite with Spark, just to make sure that folks can share state across multiple executions, and that’s where the idea of a shared in-memory RDD came in. Spark is very good at parallel processing, at executing tasks in parallel across a cluster – Ignite is also good at that, by the way – but Ignite is also very good at caching data and distributing data across a cluster.

So we thought, “Why not just provide the Spark RDD API over Ignite caches?” So we wrapped Ignite caches into the native RDD API. It actually also supports the DataFrame API from Spark natively. And all the transformations that you could do on native Spark RDDs you can now do on Ignite RDDs as well. So, a pretty cool integration. I will talk about it a little deeper; that was just a small teaser.

So where does the data grid come in? A data grid is an in-memory key-value store – it stores data as key-value pairs, which is ideal for storing tuples, because tuples are also a pair of objects. The first object becomes the key; the second becomes the value. And that’s why the data grid could easily become the foundation for providing shared state for Spark RDDs. It also acts as the foundation for the shared in-memory file system, which is a Hadoop-compliant file system in Ignite.

It is very good at storing large data sets because, even though Ignite runs in Java, it utilizes off-heap memory, so we’re not affecting GC in any way. Essentially, the data is still a pointer access away. It’s still in memory. The access is still very fast, but the JVM doesn’t know about it on the heap and therefore is not spending GC cycles on it. We manage that portion of memory ourselves, and we do it very fast and very efficiently.

And another cool thing about the data grid is that it supports indexes for SQL querying. Essentially, all the data that is stored in the data grid can be indexed in-memory and then utilized in distributed in-memory SQL queries. That’s probably one of the biggest advantages of the Ignite data grid over other data grids you may see on the market, and it allows us to run SQL faster. And that’s exactly what the demo at the end of this webinar will be: comparing our SQL versus Spark SQL.

Spark actually supports a pretty rich SQL syntax. However, it doesn’t do indexes. And not doing indexes forces Spark to do full scans all the time, so queries may take minutes in Spark, even on a moderately small data set. All right?
Moving on.

All right. So, briefly, let’s go deeper into what a data grid is. As I mentioned, primarily a data grid is a key-value store – a distributed, in-memory key-value store. Think about it as a distributed map in memory, where every node in your cluster is responsible for caching a portion of the data. The more nodes you have, the more data you can cache. The whole memory space across the whole cluster becomes the total memory available to your application. So it’s like a big clustered RAM.

And there are several APIs that you can use to talk to the data grid. We implement the JCache standard, JSR 107, so you can use the JCache API to interact with the data grid. We also implement a few other standard APIs, SQL of course being one of them. Ignite SQL is ANSI-99 compliant – you have to dig very deep to find a limitation, but for the most part it’s ANSI-99 compliant. We also support MapReduce as a standard way to talk to Ignite. Spark RDD is yet another way to talk to the Ignite data grid. And the file system is yet another API.
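
For illustration, a minimal, hedged sketch of talking to the data grid through the cache API might look like this; the cache name and values are made up:

```scala
import org.apache.ignite.Ignition

object DataGridExample extends App {
  // Start an Ignite node. IgniteCache implements javax.cache.Cache (JSR 107),
  // so these are standard JCache-style operations.
  val ignite = Ignition.start()
  val cache = ignite.getOrCreateCache[Int, String]("myCache")

  cache.put(1, "Hello")
  cache.put(2, "World")
  println(cache.get(1)) // prints "Hello"
}
```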

And I should mention the data grid can also be accessed in a transactional fashion. Even though Spark use cases, for the most part, are non-transactional, there are plenty of other, non-Spark, transactional use cases where you literally interact with the data grid as you would with a database and you move data around. Maybe you withdraw money from one account and put it into another account. All of that has to be done within a transaction, so the data grid has strong transactional support as well.

Let’s take a look at how data is distributed within the data grid. You’re looking at a diagram showing a partitioned cache within Ignite – this is a simplified diagram. If you look at the keys, you have four keys: A, B, C, D. And you also have backups. All the keys are evenly distributed within the cluster. For example, JVM1 caches key A, JVM3 caches key B, et cetera. Note that there’s no dedicated backup node. Instead, there’s a concept of backup keys, and the backup keys are also evenly distributed within the cluster. Users control how many backup copies they want. For example, you can have one, two, three, or four backups if you like.

And you can access this data locally or remotely. If your data is available locally, it will generally be served locally. If not, it will be fetched from the node responsible for storing it and then returned to the user. If you’re accessing it from a remote client, we also have the concept of a near cache. A near cache is a smaller LRU cache that stores the most recently accessed data. If the data is available in the near cache, it will be fetched and returned from there. Otherwise, it will be fetched from a remote server, brought to the client, cached, and then returned.

So, a pretty cool concept. The data is partitioned and, again, the more nodes you have, the more data you can cache. Partitioned caches are very good for updates, and this is one of the most scalable approaches to caching. Since there’s a concept of data ownership, and every node owns its portion of the data, updates only have to go to the node that owns the data – and to the backups as well. Let’s say you have one primary copy, two backups, and 100 nodes in the cluster: an update would only have to touch the primary node and the two backup nodes, so a total of three nodes. If you have 1,000 nodes in the cluster, you would still only touch 3 nodes.

So it’s a very scalable approach. The number of network trips per update does not increase as you add more nodes to the cluster. And the reads are also very fast because you can do near caching on the remote side. And you generally co-locate processing with data. We strongly advise, as I mentioned, that computations be co-located with the data and, when that happens, all the reads are local as well. So, generally, this is the most scalable way to cache data.

There’s also another type of cache available in Ignite, called a replicated cache. In a replicated cache, every node has a full copy of the whole data set, so now you can only cache whatever fits on one node. In partitioned mode we could cache the total across all the nodes; in replicated mode, we can only cache whatever fits on one node.

And the updates in a replicated cache are more expensive because now every node has the same set of keys. As you see in the diagram, every node has A, B, C, D. So, since every node has the same set of keys, every update has to touch the whole cluster. If you have ten nodes in the cluster, you’re making ten updates. If you have 1,000 nodes in the cluster, you have to make 1,000 updates. So now updates are much more expensive, and it’s not very scalable from an update standpoint.

But replicated caches are good for caching read-mostly data with smaller data sets – data like reference data, lookup data, data that you use in join queries quite a lot. And Ignite actually supports transactions between caches and queries between caches. So a typical Ignite deployment may actually have maybe 50 partitioned caches and maybe 10, 20, 30 replicated caches. You can transact across those caches and you can execute SQL join queries across those caches as well, so you can join partitioned data with replicated data, which is pretty cool.
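
As a hedged sketch of configuring the two modes just described, it might look roughly like this; the cache names and types are placeholders:

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.cache.CacheMode
import org.apache.ignite.configuration.CacheConfiguration

val ignite = Ignition.ignite()

// Partitioned cache: data is split across the cluster; each key also gets
// two backup copies on other nodes.
val personCfg = new CacheConfiguration[Long, String]("persons")
personCfg.setCacheMode(CacheMode.PARTITIONED)
personCfg.setBackups(2)

// Replicated cache: every node holds the full data set -- good for smaller,
// read-mostly reference data that is used in joins a lot.
val orgCfg = new CacheConfiguration[Long, String]("organizations")
orgCfg.setCacheMode(CacheMode.REPLICATED)

val persons = ignite.getOrCreateCache(personCfg)
val orgs = ignite.getOrCreateCache(orgCfg)
```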

All right. So, now that we’re talking about SQL, briefly, what type of SQL does Ignite support? As I mentioned, Ignite is ANSI-99 SQL compliant. All SQL queries are always consistent. So if you started an SQL query and, in the middle of that query, a node in the cluster dropped or a new node joined, and rebalancing kicked in in the background to make sure all the nodes hold about an equal amount of data, the SQL query will still return the correct result. So SQL is always consistent and always fault tolerant.

As I mentioned, we have in-memory indexes, so SQL queries are very fast, and we’ll show that in the demo. You’ll get millisecond, sub-millisecond results on pretty complex queries. It supports grouping, aggregations, and sorting – all the clauses like GROUP BY, HAVING, and UNION; all the standard aggregate functions like MIN, MAX, AVG, SUM, COUNT; and all the standard SQL functions like CONCAT. Pretty much any SQL function you have in the ANSI-99 standard is supported.

And it allows you to query the Ignite data grid – query memory – in an ad-hoc fashion. Ignite comes with ODBC and JDBC drivers, so you can connect to Ignite from pretty much any ODBC or JDBC environment, which has enabled a pretty cool Tableau integration as well. You can connect to Ignite from Tableau using standard ODBC connectivity and visualize your data there. It will also work with other data visualization technologies, like MicroStrategy, et cetera.

So here’s an example of a typical SQL query that you can run on top of Ignite. It’s not too complex – it’s probably one of the simplest queries you can run – but it shows you that you can do a join between the person and organization caches. In database terms, it would be between person and organization tables. You’re doing a group by and an order by, and you’re selecting the average salary and the max salary. So this is just one query, but it utilizes a lot of cool concepts from SQL. I’ve seen people running queries on top of Ignite that take two, three pages – with inner joins, unions, multiple unions in the middle, sorting, group bys – and they all work fast, assuming that you indexed the data correctly.

And, by the way, indexing is very important. We provide an EXPLAIN command so that, just like with any database, you can look at your execution plan, make sure there are no full scans and, if there are any, add or remove indexes to tune your queries. On a tuned, well-indexed query, you should get really fast, sub-millisecond results.
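
The slide itself isn’t captured in the transcript, but a hedged reconstruction of such a query, and the EXPLAIN check, through Ignite’s SqlFieldsQuery API might look like this. The cache names and the Person/Organization fields (name, salary, orgId, id) are assumptions for illustration, and the sketch presumes both types were registered as indexed SQL types on their caches:

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery
import scala.collection.JavaConverters._

val ignite = Ignition.ignite()
val persons = ignite.cache[Long, AnyRef]("persons")

// Join the partitioned persons cache with the (replicated) organizations
// cache, aggregating salaries per organization.
val avgByOrg = new SqlFieldsQuery(
  """SELECT o.name, AVG(p.salary), MAX(p.salary)
    |FROM Person p JOIN "organizations".Organization o ON p.orgId = o.id
    |GROUP BY o.name
    |ORDER BY o.name""".stripMargin)

persons.query(avgByOrg).getAll.asScala.foreach(println)

// EXPLAIN returns the execution plan, so you can spot full scans and tune indexes.
persons.query(new SqlFieldsQuery(
  "EXPLAIN SELECT p.name FROM Person p WHERE p.salary > 50000"))
  .getAll.asScala.foreach(println)
```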

All right. So let’s take a look now at how Ignite integrates with Spark and Hadoop. I’ll start off by saying that Ignite natively integrates with YARN and Mesos environments. Lots of Hadoop and Spark deployments run on YARN and Mesos, and Ignite will natively integrate, so if you run in those environments, you should be able to add Ignite fairly easily.

As I mentioned, one of the main integrations between Ignite and Spark is the RDD API provided by Ignite. If you look at this diagram, you have several servers. You’re running a Spark application. It gets deployed on Spark workers. And those workers, or jobs, are now able to share state across multiple applications, multiple jobs, multiple workers. You can deploy Ignite caches directly inside the Spark processes that are executing the jobs or workers, or you can have a cache-aside pattern where the Ignite cluster is deployed separately from Spark but still in memory, and you can still access the data the same way, using the Spark RDD APIs.

And as I mentioned, as a side effect of providing Spark RDDs – because we use Ignite caches and the Ignite data grid to implement those RDDs – you get much faster SQL, because now you can index those RDDs and get results a lot faster. I’ll show you a small code snippet to demonstrate that.

So, to use an Ignite RDD, you start from IgniteContext. It’s the main entry point when using Ignite with Spark – the main entry point into Ignite RDDs. And it allows you to specify different Ignite configurations. You can access Ignite in client mode or in server mode. You can create new RDDs, which essentially means that new Ignite caches are created, with different configurations and different indexing strategies. And you can create an RDD that is partitioned, for example, or an RDD that is replicated, depending on what kind of partitioning or replication strategy you choose.

So, essentially, everything you can do in Ignite you can do here with IgniteContext by passing the proper Ignite configuration. And the RDD syntax, as I mentioned, is native too – you’re pretty much using the native Spark RDD syntax. Here’s a small code example where we create an RDD over a partitioned cache.
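
The slide’s code isn’t in the transcript, so here is a minimal sketch of what this looks like with the Spark integration of the Ignite 1.x line; the app name, cache name, and default configuration are placeholders:

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ignite-shared-rdd"))

// IgniteContext is the entry point: the configuration closure is invoked on
// each Spark worker to wire up an Ignite node (client or server mode).
val ic = new IgniteContext[Int, String](sc, () => new IgniteConfiguration())

// fromCache returns an IgniteRDD backed by the named Ignite cache; the cache
// is created on first use if it doesn't exist yet.
val sharedRDD = ic.fromCache("partitioned")
```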

The main difference here is that Ignite RDDs are mutable. In Spark, RDDs are immutable; in Ignite, they’re mutable. And the reason for Ignite RDDs to be mutable is to be able to update them at the end of every job or task execution, making sure that other applications and jobs can read that state. The way you update them is by using a method called savePairs, which allows you to save tuples into Ignite RDDs.
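
Continuing the sketch above, writing state back might look like this; the key range is arbitrary:

```scala
// IgniteRDDs are mutable: savePairs writes the tuples of any Spark pair RDD
// into the backing Ignite cache, where other jobs and applications can see them.
sharedRDD.savePairs(sc.parallelize(1 to 100000, 10).map(i => (i, i.toString)))
```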

And as I mentioned, the SQL gets a lot faster, and this is how you would use SQL with an Ignite RDD – standard SQL syntax. You just execute it and get a result. The result supports the DataFrame API, so you can work with it as a standard Spark DataFrame as well.
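
A rough sketch, continuing from above; it assumes the cache was configured with indexed types so its value class is visible as a SQL table, and _key and _val are Ignite’s implicit columns:

```scala
// Runs against Ignite's in-memory indexes and returns a Spark DataFrame.
val df = sharedRDD.sql("SELECT _val FROM String WHERE _key < 10")

// From here on it's a standard DataFrame.
df.show()
```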

Another way you can use a shared memory layer with a Spark application is by using Ignite’s in-memory file system, called IGFS. It gives you another form of shared memory across all the Spark cluster nodes, and it natively implements the Hadoop FileSystem API – the same API as HDFS. So it plugs natively into any Hadoop environment and any Spark environment. With zero code change you can start using an in-memory file system.

And it can also be used as a caching layer on top of HDFS because it supports writing through to HDFS. If data is updated in-memory, it will be written through to HDFS. If it is not available in-memory, it will be automatically read from HDFS. So it’s yet another way to share state across multiple Spark applications and jobs in-memory.
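
For illustration, accessing IGFS through the standard Hadoop FileSystem API might look like the sketch below. The endpoint URI (IGFS instance name, host, and port) and the paths are assumptions that depend on your IGFS configuration:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Map the igfs:// scheme to Ignite's Hadoop FileSystem implementation
// (shipped in the ignite-hadoop module).
val conf = new Configuration()
conf.set("fs.igfs.impl", "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem")

// igfs://<igfs-name>@<host>:<port>/ -- placeholder endpoint.
val fs = FileSystem.get(new URI("igfs://igfs@localhost:10500/"), conf)

// Reads and writes hit memory; with write-through configured they reach HDFS.
val out = fs.create(new Path("/shared/state.txt"))
out.writeUTF("state visible to every Spark job")
out.close()
```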

So the question is: when do I use the RDD and when do I use the file system? In my view, the answer is pretty simple. If you’re working with objects, use RDDs. If you’re working with files – maybe PNGs or log files or text files – then maybe the in-memory file system is a more natural form of processing the data. So it really depends on the use case.

And the last form of integration with Hadoop environments I will mention here is in-memory MapReduce. As you know, the MapReduce pattern in Hadoop was never built for performance; it was built for batch processing. Just warming up usually takes maybe 10 to 15 seconds in Hadoop. So what we’ve done is implement the Hadoop MapReduce API natively, which – just like the file system – plugs into any Hadoop environment.

And if used in combination with our file system, it provides a significant performance boost. Just by plugging it into most MapReduce environments, you’ll probably get about a two-times performance boost with zero code change, because it implements the native Hadoop APIs.

And if you look at this diagram, everything in gray is what Hadoop does to execute a MapReduce task, and the blue is what Ignite does. As you see, Hadoop will contact the JobTracker, the NameNode, and the TaskTracker, then the HDFS node and disk, and that whole chain goes back up. And the NameNode is contacted more than once throughout this process. So it was never really built for performance, and all of this explains why it’s not very fast.

In Ignite, note that there is no NameNode because, again, we’re based on top of the data grid, and the data grid uses hash values of keys to determine where the data resides. So we do not need to do any lookups in a NameNode; we go directly to the node where the data is, and we can execute in the same memory space where the data is.

Hadoop would usually execute in a different process – even if it executes on the same server – so the data still needs to be copied from one process to another. But in Ignite, we can execute MapReduce inside the process that holds the data in memory, and that makes it a lot faster. This goes for both the Spark integration and the Hadoop integration: we co-locate in process, we cache in process, and we execute in the same memory space where the data resides.

So, with that, I actually want to start the demo. I will switch off from this presentation mode and start the demo. It’s using SQL with Apache Zeppelin – again, a pretty cool open source project which allows you to visualize data queries. So I’ll start the demo and then we’ll move to Q&A.

So this is what a Zeppelin window looks like. I’m going to have two windows open, one for Spark and one for Apache Ignite. I have preloaded about six gigabytes of data into both. Essentially, I have one instance of Spark running and one instance of Ignite, both holding six gigabytes of data. Given today’s data sizes, the data I’m working with here can be considered minuscule. We’re not doing terabytes here; we’re doing just about six gigabytes of data. And what I’m going to do is run a couple of join queries that do joins between two tables, namely person and organization, and we’re going to see how long each query takes.

So let’s start with Spark. We are going to select all person names, manager names, and organization names by doing a triple join, based on the data we’ve loaded. So let’s go ahead and run it. We’re still running. Spark has to do a full scan across the tables, and that’s why the query is taking a little while. I just want to remind you that the data set is fairly small, but the full scan on this data set still takes quite a bit of time, and the query still has not completed yet. And now it’s done. The execution took about 30 seconds.

So let’s go ahead and run the same query in Ignite. The only difference here is I have to tell Zeppelin that I’m using the Ignite interpreter and not the Spark interpreter, and I do that by specifying this parameter: %ignite.ignitesql. Otherwise the query looks exactly the same. I’m going to go ahead and run it and, boom, it’s done. It takes literally less than a second – Zeppelin actually shows a time of zero seconds. Just to ensure that we have not hard-coded anything here, I’m going to change the value, maybe to 223, and run it again. It’s done. So all person names, manager names, and organization names are retrieved.

So let’s execute another query with Spark. This time we are going to do a group by and an order by, and we are going to select the average salaries of all the managers for each organization. So the manager ID is null – that means the person is a manager. And I’m going to go ahead and run it. I’m currently in the Spark window. We’re still running. Again, Spark has to do a full scan in order to execute the join between the person and organization tables. We’re still running. All right, it’s almost done. Okay. So it took about 33 seconds, and we got the salaries ordered in ascending order.

So let’s go ahead and run the same query in Ignite. Again, the only difference is that I specify the Ignite interpreter. Run it. It’s done. Zeppelin shows zero seconds. Just to change the result, I can change ascending to descending here, for example, and re-run it, and now the results are shown in descending order. I might also mention that Zeppelin has some pretty cool graphs to visualize your data, so do take advantage of them whenever you’re playing with Ignite together with Zeppelin.

So, on this note, I’m going to end my presentation and move to the Q&A session. All right. With this, let’s move to questions. If you have any questions, you can always type them in. And, Dane, if you don’t mind, you can read those out.

Dane Christensen:
Yep, sure will. And we do have several questions for you here, Dmitriy. And, yes, we probably do have time for some more, so feel free to enter your questions while we’re taking these. Here’s the first question for you. “Does Ignite use its own SQL engine or does it use something else, for example, Apache Calcite?”

Dmitriy Setrakyan:
Yeah, Ignite actually integrates with a pretty cool open source project called the H2 database, which is a local in-memory database. We have committers from the H2 database also committing and contributing to Ignite. From H2 we borrow SQL parsing and the creation of the execution plan, and then we changed all that to work in a distributed fashion. So our execution is actually a two-step distributed SQL query execution: a map step, and then a reduce step on the results.

And indexing and everything else is implemented by us, so we plug our own distributed indexing into H2. So the SQL standard, the ANSI-99 support, comes from the H2 database, but all the distribution and distributed execution planning on top of that comes from Ignite.

Dane Christensen:
Okay. Excellent. Now, next question. “Can Apache Ignite also accelerate transactions, particularly ACID transactions?”

Dmitriy Setrakyan:
Yeah. So, as I mentioned, with Spark use cases you don’t see transactions a lot. But Ignite is used a lot by large banks, and one of the main reasons why is its strong consistency model and strong transactional support. If you know how to use Oracle transactions, you can use transactions in Ignite – optimistic or pessimistic, repeatable read or serializable.

Moreover, the optimistic serializable mode is also what we call a deadlock-free mode: a transaction in this mode can never get into a deadlock. In a distributed system, especially with larger teams working with APIs for different components, acquiring locks in the same order is often impossible, and it’s very difficult to keep track of the order of lock acquisition. And if you don’t keep track of it, you can get into a deadlock situation – unless you use optimistic serializable transactions in Ignite, which are faster and do not acquire locks. So it’s a lock-free mode, and deadlocks are impossible in that mode – yet another reason why large projects, large banks, and large financials like to use Ignite transactions.
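
As a hedged illustration of that mode, a money transfer under an optimistic serializable transaction might look like this; the cache name and account keys are made up:

```scala
import java.math.BigDecimal
import org.apache.ignite.Ignition
import org.apache.ignite.transactions.TransactionConcurrency.OPTIMISTIC
import org.apache.ignite.transactions.TransactionIsolation.SERIALIZABLE

val ignite = Ignition.ignite()
val accounts = ignite.cache[String, BigDecimal]("accounts")

// Optimistic + serializable: no locks are taken while the transaction runs;
// conflicting transactions fail at commit instead of deadlocking.
val tx = ignite.transactions().txStart(OPTIMISTIC, SERIALIZABLE)
try {
  val amount = new BigDecimal(100)
  accounts.put("alice", accounts.get("alice").subtract(amount))
  accounts.put("bob", accounts.get("bob").add(amount))
  tx.commit()
} finally {
  tx.close() // rolls back automatically if commit was never reached
}
```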

Dane Christensen:
Okay, very good. Next question is “If I am loading data from various secure sources, what security might there be to avoid one process from seeing the data loaded by another process?”

Dmitriy Setrakyan:
Well, Ignite caches actually provide pretty good isolation, so you can have different caches for different processes and then access only those caches from those processes. However, Ignite does not enforce that, and if you need it to be enforced, you should look at some enterprise products. GridGain, for example, has a strong security model around Ignite caches: you can implement your own access control list (ACL) and have your own authorization, so users have to authenticate to log in to caches, and then they will only be authorized for certain activities. So, from Ignite, you only get cache isolation. If you need something stronger than that, you should look at some enterprise offerings on top of Ignite.

Dane Christensen:
Okay. Very good. Next question: “What if we have a rolling update scenario and Ignite configured for replicated cache? How is that handled, assuming Ignite is embedded into each node?”

Dmitriy Setrakyan:
Well, there are different use cases that can be called a rolling update. Let’s take a look at the two main ones. One is that Ignite itself needs to be updated to a newer version. Currently with Ignite, you’d have to restart the cluster to upgrade it to a newer version, unless you’re using some enterprise offering on top of Ignite. Again, GridGain, for example, would provide you a rolling upgrade capability without a cluster restart.

However, there’s another part to it: user classes. What if I’m changing my own code from version A to version B? That is actually pretty well supported in Ignite. Ignite is essentially schema-less on the server side – or I should say class-less: it does not deploy any user classes on the server side. Say, for example, you deploy a person class that has name and title – that’s version one of that class – and some clients are accessing caches using that version, and it’s fine.

Let’s say another client comes in with a newer version and maybe adds salary to that class. That can be done transparently as you go: a new field will just pop up in the cache. And whenever the data is deserialized on the clients, Ignite will introspect the user class. In this case, if a person class does not have the salary field in it – if it’s an older version – then the salary field will not be populated. On the other hand, if the class has the salary field but the data doesn’t have it, then it will be null. So, essentially, you can add or remove fields and change your schema on the fly with Ignite, and we’re not storing any user classes on the server side.

The same goes for compute. Compute is a little different because we do have to deploy a class – we’re executing user logic inside that class, executing a certain computation. But for compute we support a concept called peer deployment. A class will be deployed. It can access data stored in caches. It can do pretty much any kind of user logic, anything you like. And then, whenever you change this class and a new execution comes in, Ignite will automatically detect that the class has a newer version, un-deploy the old one, and deploy the new one.

So, for computations, the feature is called zero deployment, or peer deployment. And for data we have no need to deploy anything, because we always keep it in a binary format. So, overall, you never have to restart your cluster any time you change your compute or data definitions. Hence, rolling upgrades are supported.

Dane Christensen:
All right. Thanks for that really detailed answer to that question, Dmitriy. And, by the way, for our audience, I think we’ve got plenty of time for Q&A, so this is a great opportunity to ask your use-case-specific questions for Dmitriy. We do have a few more here, and here’s the next one. “In the demo, was the data in person and organization preloaded and available in Ignite before running the query? What is the performance like if the data isn’t in Ignite yet?”

Dmitriy Setrakyan:
Oh, all right. Good question. By the way, the data was preloaded for both Spark and Ignite. The queries work only in memory, so the data has to be preloaded in order for a query to be executed. And preloading into Ignite will generally take longer than into Spark because we index the data. If you don’t have any indexes, it will be pretty fast, but most of the time you will have indexes because you want your query execution to be faster. In those cases the preloading will take a bit longer because we index the data. But once the data is preloaded, all the queries execute very fast.

So if your use case is preload, run the query, and dump – you don’t reuse that data set anymore – then you will not see a lot of gain from Ignite. You would probably be better off with Spark. If your use case is preload once, or not as often, and then run lots of queries on that data set, then you’ll definitely see a good performance boost from Ignite.

Dane Christensen:
Okay. Great. Thanks. Next question. “How do queries in Apache Ignite compare to Apache Spark? Is there any reason to go with Ignite over Spark?”

Dmitriy Setrakyan:
Well, the first difference is indexing, I think. In terms of syntax support, both Ignite and Spark support pretty rich SQL. I’m not sure what Spark’s compliance is; Ignite is ANSI SQL-99 compliant. So the main reason to use Ignite SQL would be performance and the ability to index in memory. As you saw in the demo, the difference can be too significant to ignore. So if you have SQL performance issues in Spark, you should try Ignite with indexes.

Dane Christensen:
Okay, great. Next question, kind of a general one here. “How big of an effort is it to integrate with Ignite?”

Dmitriy Setrakyan:
Well, it depends. Essentially, Ignite supports many standard APIs, so if you’re using SQL, the effort is pretty minimal. If you’re using Spark, you can start using Ignite RDDs directly. If you’re using a standard database and you want to move to in-memory, then you’ll probably need to change your access logic to key-value store access logic – unless you want to access it through the JDBC driver, in which case, again, the effort would be pretty minimal, but the integration won’t be as flexible.

For example, one of the main benefits that Ignite provides is being able to co-locate computations with data – with the nodes on which the data resides – and there are probably 20 different ways in Ignite to co-locate. The whole co-location concept is built into the API, and you can take good advantage of it. For that, you would probably want to look into how the Ignite compute API works and how the Ignite data API works, and see what it takes to migrate.

But, generally, the data API is very close to the HashMap API, with some additional features. And the compute API is simply Java Runnable, Callable, or lambda execution on remote nodes. So we try to keep it very simple.
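
As one hedged example of the co-location he describes, affinityRun ships a closure to the node that owns a given key, so the read inside it is local; the cache name and key here are placeholders:

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.lang.IgniteRunnable

val ignite = Ignition.ignite()

// Sends the runnable to whichever node owns key 42 in the "persons" cache,
// so the cache read inside it is served from local memory.
ignite.compute().affinityRun("persons", Long.box(42L), new IgniteRunnable {
  override def run(): Unit = {
    val persons = Ignition.localIgnite().cache[java.lang.Long, String]("persons")
    println(s"local value: ${persons.get(42L)}")
  }
})
```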

Dane Christensen:
Okay. Excellent. Next question is another SQL question, plenty of SQL questions today. “Are there any complex SQL functions – i.e. grouping, joins – that Ignite can’t do in a distributed environment?”

Dmitriy Setrakyan:
All right, so the answer is no, with an explanation. Ignite supports full SQL. All of the SQL-99 language is supported, and you can run joins, aggregations, et cetera, in a distributed environment. However, joins can be run in two modes: co-located and non-co-located. If joins are done in co-located mode, they’re very fast, and there’s virtually no limit to what is supported. You can use the whole SQL syntax and the joins will work.

However, in non-co-located mode, the joins get slower, first of all because now you’re joining keys from one server with keys from another server, so we have to move some data around to make that join happen. So that use case becomes slower. And in those use cases, some aggregations may not work yet if you use them in inner queries – top-level SQL always works, but if you stick a complex aggregation into an inner query, it may not work yet. It probably will work in a couple of months; we’re actively removing all the remaining limitations from SQL in Ignite. So, yes, there are some differences, but only in very specific use cases.

On top of that, I just want to mention that Ignite is one of very few products that even supports non-co-located SQL joins. They are much more difficult to support, and they’re usually neglected because they’re not that fast, right? They don’t provide good performance, so why even focus on them? However, in our experience, if you want to integrate an in-memory technology into your project, you will try to co-locate as much as possible – you definitely should. You can co-locate data with data, and you can co-locate compute with data. Once you co-locate properly, most of the computations and most of the queries will be co-located and execute locally on the nodes where the data resides.

However, it’s very rare, if even possible, to co-locate 100 percent of your processing. Sometimes you’ll get to 50, 60, 70 percent – and what do you do with the remaining 30 percent of your computations, 30 percent of your SQL queries? In those cases, the ability to execute non-co-located SQL becomes very important. That’s why we’ve added it. It’s going to be part of Ignite 1.6, so it’s already implemented, sitting in a separate branch. We’re doing some preliminary testing on it with certain users, and it will be released in a month. That’s exactly the reason we added non-co-located join support: to make sure you can integrate Ignite into your environment and co-locate most of your processing, but still be able to use Ignite when you cannot co-locate.

Dane Christensen:
All right. Great. Now here’s the next question. “How does query performance compare between using Ignite’s JDBC driver versus Ignite’s API?”

Dmitriy Setrakyan:
Oh, I would say the performance should not be any different at all. It was different at some point in previous versions of Ignite, but now the JDBC driver uses the same APIs as Ignite directly, so you should not notice any performance difference. The same goes, by the way, for ODBC, which is newly supported in Ignite.
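
For reference, connecting through the JDBC driver in the Ignite 1.x line might look roughly like this; the driver here starts a client node from a Spring config file, and the cache name and config path are placeholders – check the docs for the URL format of your version:

```scala
import java.sql.DriverManager

// Register Ignite's JDBC driver (client-node based in the 1.x line).
Class.forName("org.apache.ignite.IgniteJdbcDriver")

// URL format: jdbc:ignite:cfg://cache=<cacheName>@<config-url> -- placeholders.
val conn = DriverManager.getConnection(
  "jdbc:ignite:cfg://cache=persons@file:///path/to/ignite-client-config.xml")

val rs = conn.createStatement().executeQuery(
  "SELECT name, salary FROM Person WHERE salary > 50000")
while (rs.next()) println(s"${rs.getString(1)}: ${rs.getDouble(2)}")

conn.close()
```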

Dane Christensen:
Okay, excellent. Next question. “How do Ignite data grid capabilities compare to Hazelcast ones?”

Dmitriy Setrakyan:
Can you ask that again? I’m sorry. How do Ignite queries compare to Hazelcast?

Dane Christensen:
Yes. Data grid capabilities.

Dmitriy Setrakyan:
Oh, data grid capability. All right.

Dane Christensen:
Yes – how do Ignite data grid capabilities compare to Hazelcast?

Dmitriy Setrakyan:
All right. Got it. So, how is the Ignite data grid different from the Hazelcast data grid? I get that question a lot. There are quite a few differences. Both are open source projects. Both support the JCache standard. Both are data grids. Both are transactional. However, the devil is usually in the details – there are quite a few differences in how Ignite does things versus how Hazelcast does things.

For example, SQL would probably be the number one differentiator. The way Ignite allows you to query your data is very rich. As I mentioned, you can join between caches, and you can do any type of averages, group bys, et cetera. With Hazelcast you have a very simplistic query API and you cannot do joins, so essentially my simple example – joining a person table and an organization table – would not be possible in Hazelcast. You would have to implement the join in code: fetch data from persons, filter it, feed those keys back into a query against organizations, and then merge the results. So, essentially, you would have to implement the join manually. That would be the number one differentiator.

There are also differences in how we support off-heap memory. I should mention that off-heap is part of open source Ignite but not part of open source Hazelcast – off-heap memory is only part of Hazelcast’s closed source offering. And there are other differences, but one more I want to mention is performance. We constantly benchmark our data grid operations versus Hazelcast’s data grid operations in-memory under load, and we generally see anywhere between a 30 and 70 percent performance benefit for Ignite versus Hazelcast.

And the benchmarks are very transparent. They’re published on the Ignite website. Hazelcast has reviewed those results, so we’re very open. It’s an Apache project; we welcome all kinds of feedback. You can run those benchmarks yourself and take a look at the results – we publish them as well.

So, I will not go deeper into the differences, but we have a matrix on the Ignite website showing how Ignite compares to major data grids. There’s a comparison to Hazelcast there, and a comparison to Oracle Coherence; GemFire, I believe, is there, and Redis is there, and we’ll be adding more.

Dane Christensen:
All right. Fantastic. Dmitriy, thank you. We do have a couple more questions, but it looks like we’re coming up to the end of our time, so, as promised, we will definitely answer by e-mail any questions that we weren’t able to get to live here today. So I want to thank you, Dmitriy, for all that excellent information and for answering all those questions in such detail. And I thank you all for attending the webinar today and for all those excellent questions.

Now, as we wrap up today, I did just want to make a brief announcement about the upcoming In-Memory Computing Summit that will be held in San Francisco on May 23 and 24, which GridGain is sponsoring. The conference committee is still accepting proposals for speaker sessions until February 17th, so if you have a relevant topic you’d like to present or you know someone who might, please check it out at IMCSummit.org, and if you’re thinking about attending, now is the time to get the early bird discounts on that.

So, once again thank you all very much for attending, and thank you again, Dmitriy, and we’ll see you next time.

Dmitriy Setrakyan:
Thanks everyone. Bye.