Turbocharge Your SQL Queries In-Memory with Apache Ignite™
Join Nikita Ivanov, CTO of GridGain Systems and member of the Project Committee for Apache Ignite, to learn how to boost performance 1,000x and scale to over 1 billion transactions per second with in-memory storage of hundreds of TBs of data for your SQL-based applications.
Nikita will show how Apache Ignite handles auto-loading of SQL schema and data, SQL indexes, compound indexes support, and various forms of SQL queries including distributed SQL joins, joins across caches, predicate-based queries, queries over sliding windows for streaming data, and more.
He will also take live questions from the audience, so this is your opportunity to learn how to accelerate your SQL-based applications directly from a foremost expert in the field.
Hello, everyone. My name’s Alisa Baum, and I’m GridGain’s Director of Product Marketing. We’ll begin in just a moment. But first I would like to conduct a bit of housekeeping. First, if you could please raise your hand using the hand icon located in the Go To Webinar Control Panel to let me know if you can hear me. And let me take a look. I see hands. Thank you very much, everyone.
Next, during this webinar, you will be on mute. Should you have any questions during the discussion, please enter them in the questions field within the control panel. At the end of the webinar, we’ll take time to answer as many questions as possible. And those that aren’t addressed will be answered in a follow–up blog entry on GridGain’s blog.
In addition, a recording of this webinar will be made available to everyone within 48 hours. I would like to thank you for attending today’s webinar “Turbocharge Your SQL Queries In Memory with Apache Ignite.” It’s being presented by GridGain founder and CTO Nikita Ivanov. And with that said, I will turn the floor over to Nikita. Go ahead.
Thanks, Alisa. Good morning, everybody, or good afternoon, depending on where you’re joining from. What we’re gonna be talking today is what Alisa just said. We’re gonna be talking about how do you turbocharge your SQL queries in memory using Apache Ignite.
So we’ll talk a little bit about Apache Ignite, but we’re not gonna do, like, another full-blown presentation about it because we’re gonna concentrate on a data grid. And a data grid is a component that actually hosts the SQL capabilities of it. And we’ll talk in details and in-depth and what and how we implement SQL capabilities in Apache Ignite.
In the last maybe 15 minutes, maybe ten minutes, I’ll show you a pretty cool demo. Basically, we’ll run Spark SQL and Ignite SQL on exactly the same date set from an Apache Zeppelin, which is the very nice kind of notebook interface. And we’ll see the performance difference between the two.
So let’s talk about just briefly what is Apache Ignite. And, again, I’ll be very quick here. If you guys are interested in a lot more about it Apache Ignite and all the different capabilities because this is dramatically more than just a SQL end, you guys can look on the Apache Ignite website and just – or GridGain.com website and get all the information.
But in a nutshell, Apache Ignite is what we call in-memory data fabric. It is a distributed Java-based middleware infrastructure software that architecturally slides in between your traditional data sources, your databases, your file systems, your HDFS, whatever you have on the bottom, and your applications on top. So it slides in the middle.
And it provides fundamentally speed and scale. It provides you speed by moving data from slow disk and flash-based devices to RAM across distributed computers and provides you essentially scale by master distribution and parallelization that we’re processing throughout this data.
So we call a fabric for a reason. It has many components with which we can process and store that data in RAM across multiple, multiple computers which have a data grid and streaming grids and a compute grid. We have Hadoop and Spark acceleration. We have many, many components in this fabric. And, again, the fundamental in-memory Apache – in-memory data fabric, which is Apache Ignite, is all about speed and scale. It’s always kind of interesting – I understand that in-memory is not just about a marginal speed increase.
And when I present about this subject, I always keep saying if you’re looking for just a two or three times performance increase, you can just go with a flash. Move your data from spinning disk to spinning _____ as we call them, to flash. And, again, the sizable increase, two to three, four times.
In-memory systems in Apache – one of them – can give you several orders of magnitude performance increase. You can go anywhere between ten to a thousand times faster than a traditional disk–based system. One of the prime examples I always give about performance of in-memory system, not related to the Apache Ignite specifically but about in-memory systems, a couple of years ago, we achieved a billion transactions per second.
I’m talking about a fully ACID normal transactions per second on ten commodity blades with total capacity about 1 terabyte. So that’s what in-memory systems can do for you, and specifically Apache Ignite.
Just a quick history about our project. I think maybe some of you heard about a GridGain project. And essentially, GridGain was a kind of precursor to Apache Ignite. And GridGain project was started by me and my cofounder around about ten years ago in 2005. In 2007, we have our first public release. And from that point on, the project grew and grew. And a couple of years later, we had a pretty large production installation with over 2000 physical notes.
In 2010 we started the company GridGain Systems around this project. And then the project grew and grew in popularity. Today actually starts almost every second around the globe. In 2014 – we’ve been always an open-source project by the way. We’ve been always an open-sourced company. But in 2014, we decided to not only have the license, Apache license, but we decided to join ASF, Apache Software Foundation.
And we were accepted in incubation. In 2015, last year, we were graduated to the top-level project in Apache Software Foundation. One of the fastest projects – I think we’re just the second project behind Spark in terms of the speed of graduation. Still to this day, it’s about the only in-memory data fabric that is done in a project in Apache Ignite.
So what’s really interesting about the Apache Ignite, although it’s fairly new project in the Apache ecosystem, it’s extremely mature code base. We have hundreds and hundreds of production installations around the globe. The code has been in production for almost seven to eight years by now. So it is extremely mature code-base and very mature project overall.
So as I mentioned to you a couple of slides before, it’s a fabric. And it’s a fabric of many, many different components. But we’ll be talking about the SQL today. And SQL is essentially a fundamental part of – we’ll call it data grid. So we’ll talk about data grid, what it is, and we’ll then basically dive into the SQL capabilities of what with do.
So what is a data grid? And it’s kind of important to understand what it is to understand what does SQL mean in terms of the in-memory data grids because that’s a part of it. So we’ll talk about in-memory data grid first.
So what is in-memory data grid? The very simple idea behind in-memory data grid as a component – essentially imagine this. Imagine you have a terabyte of data that you want to store in memory across multiple computers. Essentially you cannot feed all the terabyte into the one computer’s RAM. You have to spread it over multiple computers.
And when you have to do that, you essentially open a whole can of worms. How do you partition? How do you load balance? How do you transact? How do you query? How do you ensure availability? What happens when one node, dies? How do you ensure all of this different – or semantics about storing data? That’s essentially what data grids do for you. They provide you fairly simple interface, how to interact with it, you can basically put data in, get data out, query this data. And data grids manage everything else under the hood.
Fundamentally, most of the in-memory data grids – and Apache Ignite data grid is no exception – based on a key value store, in-memory distributed key-value store. It’s a very important fact. The reason that we chose the key value because this is the lowest common denominator in terms of the data storage.
Once you have a key value store, you can then look at the same key value store as a SQL store. You can look at the same data as the file system. You can look at the same data as map/reduce type of jobs on top of it. So the key value is very fundamental because it gives you ability to look at this data, essentially providing you a view on this data from a very different perspective. And that’s exactly what we do, by the way, with SQL.
So we’re based on a distributed in-memory key-value store. The data isn’t – in data grid can be traditionally stored in two major ways. It’s a replicated way and a partitioned way. And let me explain really quickly. In replicated way, each not in data grid would have exactly the same date set.
Essentially, you’re obviously limiting the entire capacity by the size of a single note. On the flip side of that, you get a tremendous high availability because you can pick any note in cluster. It will have exactly the same date set. So for the read, most systems, this is the best way. Naturally, for the write, mostly it’s a pretty bad way. It’s a pretty inefficient way because you have to update all the nodes on the updates.
Partition is about really the main mode for data grids. It’s where you actually partition in data. And each note in a cluster stores only a portion of a data. There’s plenty of configuration settings. There’s plenty of caveats, and how do you do it, and what not to do. But fundamentally, there’s just two modes. And they’re kind of important for understanding how SQL work as well, especially when we talk about distributed joins by the way. And we’ll come back this later.
So naturally, data grids are good for terabytes of data. I don’t think data grids – in-memory data grids are good for gigabytes. You can store a gigabyte on your iPhone. It’s not really good for the petabytes yet. It’s just not economical. It’s hard to scale in-memory systems with petabyte scale.
So in-memory data grids are an excellent choice for lower tens of terabytes. It’s when you have half a terabyte, a terabyte, five, ten, 20 terabytes. That’s where I see most of the value. As you go up, it becomes pretty problematic economically to do that. Memory is still somewhat more expensive than flash.
What we also do in our data grid, which is fairly unique in this sense, is that we support both on-heap and off-heap RAM. It’s very, very importantly, especially nowadays. If you guys work in the Java world, you guys know. The maximum that heap – on-heap can support before the garbage collecting starts killing you, somewhere around six to eight gigabytes.
So if you have a box with 96 gigabyte of RAM, you have to have ability to use that. And Apache Ignite has a tremendous capabilities to essentially complete and seamlessly use both on-heap and off-heap RAM. You don’t really know what’s going on. You basically just enable that and it’s just gonna use the entire physical RAM available.
So we have essentially high availability for in-memory replicas and automatic failover. Those are very traditional things. I’m not gonna spend too much time on this. What’s interesting, we have a full distribution ACID transactions. It’s a very important feature. It’s one of the most differentiating features that we have in our data grid apart from a very strong SQL support.
We have exactly the same transactional support, which by the way dovetails into the SQL level, but exactly the same transactional support as a normal database would have. We have both pessimistic and optimistic transactions. We have all the same isolations level for Java transactions like recommit and even serializable
So if you guys work with transactions in database, you can work exactly the same way with Apache Ignite. The same guarantee, same semantics. We are a – participate in other transactions. So if you want to spin transactions from in memory to database to JMS, for example, you can do that. We can participate in normal JDS, JDA transactions in the Java world.
So without further ado, we have SQL and JDBC drivers. As a matter of fact, we’re coming up with ODBC driver as well. So as we transition into the SQL conversation, it’s very important to know that you can literally just open JDBC driver and soon enough ODBC driver from a non-java environment and just literally connect with the Ignite and just use any other tools that actually work over at the JDBC or ODBC to query the data.
For example, if you guys work with Tableau, Tableau works over all ODBC. You can connect Tablor to Ignite and just visualize the data from Ignite. The same goes with JDBC as well. So this is kind of very, very quickly just a brief introduction to in-memory data grid because the SQL is the fundamental part of a data grid.
How? It’s actually very interesting. I get this question – even to this day, I get this question. Why do we need SQL if we have key-value store? And it’s very interesting because originally data grids never had SQL. And some of the data grids today, like Coherence for example from Oracle doesn’t really have a SQL either. And my answer to this the very simple. If you have a key-value store, in order to get data out, you have to have a key. If you don’t have a key, you are out of luck. You have to do the full scan.
And the fundamental property with SQL is the indexing support. And you can actually run very effective queries where you don’t have a specific key. Imagine you have a large date set and you want to say, hey –– it’s a simple date set of employee in the companies, right? And you want to say, “You know what, find me all employees that start with the – that has the first name Nikita.”
And if you don’t have a normal SQL with a fully indexed support, you have to do full scan. You have to scan the entire data set with the predicate and basically look for the first name, right? In SQL, what SQL allows you to do is extremely effectively you can index this field properly, and it can select with the WHERE clause. It will be instantaneous response. That’s the power of SQL.
I know it’s kind of obvious for most of you probably, but in the world of in-memory data grids, that’s not it. Not every project would have it in. As a matter of fact, only a few. And I would claim that Apache Ignite has the best SQL support. And we’ll talk about why I’m saying this.
So first of all, what do we have? We have a full ANSI-99 SQL. There is no gimmicks. There’s no custom things. If you know SQL, and especially ANSI-99, you know how to use Ignite. Period, full stop. There’s nothing specific, custom, proprietary here. It’s a full normal standard. It’s actually very important. And that’s the essentially side effect of how we decided to implement that particular feature.
And like almost anybody else in kind of between our friends and competitors, we decided not to implement our entire SQL engine by ourselves. It’s a very hard job. It took Oracle 25 years to get fully optimized SQL engine with query planning and optimization and everything else. So what we decided to do, we decided to take the H2 engine, which is one of the best local SQL engine available today, which is open source and free. And we have the H2 engine on each note in a cluster.
And the only thing that we do at Apache Ignite, we provide all the distribution and distributed logic and how to basically do this distributed SQL. So that actually allows us to have an extremely well-tested and well-maintained H2 execution engine locally in each node. And that’s basically where this ANSI-99 standard support comes from, and that’s where the speed of the local execution and local query planning comes from. It’s dealt with H2.
And the only thing that we do, which is in retrospect is a fairly large thing to do and a fairly complex thing to do, but the only thing we do is we provide this distribution and distributed logic for the SQL processing. That’s a very, very, very important point because we just seen how many of our friends and community is struggling with the SQL performance anywhere from Spark to all of the data grids. They’re basically trying to implement their own SQL support. And I think our design decision early on has been spot-on because we are in absolute holding record in terms of the SQL performance in our project. And I’ll show it to you.
So with shipping with JDBC drivers, soon enough, I think literally in the next point release of Apache Ignite you guys can look at the dev list. I think it’s been already developed. It’s just a matter of releasing it. We’re gonna have ODBC drivers as well. So essentially the entire ecosystem of connectivity to our system from the outside for the SQL will be covered from JDBC and ODBC as well.
Another interesting feature we have topology change tolerance. That’s actually fairly kind of a deep technical question but fairly important. The minute you have more than one node – let’s say traditionally in our clusters, clients would have ten, 15, 20, 30, sometimes hundreds of nodes. The nodes have a tendency to go up and down. You either add nodes to the cluster at one time to kind of scale out, or your nodes has crashed, or you’re doing development, doing the testing. Something is happening not constantly, but in the large clusters, that’s not something you can ignore.
So when you’re querying the data, you have to be guaranteed that in any given point in time, no matter happen nodes up and down or starting or stopping, you’re querying a consistent set of data. It’s easy to say. It’s much harder to do. But we do have a tall change tolerance.
So you can query data with SQL, and you can add notes, you can remove notes. And we’re always gonna give you either an exception back saying, “Hey, we cannot really query the data because literally the cluster is in stable condition,” or we’re always gonna give you the result from a stable dataset. We will never give you the incorrect result or, say, incomplete result because some of the data wasn’t transferred between the notes and whatnot. It’s a very important consideration. It’s actually a very nice feature that you don’t have to literally reinvent the wheel, how to support this type of functionality.
So we have in-memory indexes. As a matter of fact, we have pluggable indexes in our project. It’s one of the only implementation that I know of where you can actually plug your own index. Obviously, we support primary and secondary indexes and whatnot. You can plug your own indexes. If you fancy some kind of balloon-based indexes and you want to develop them, fine. We have this little interface. You implement interface, plug it in, and then you can use this index right in the SQL definitions.
It’s a very, very advanced feature. Not everybody would use it. You’ve got to know what you’re doing when you’re developing indexes. But guess what. If you have something very specific you want to do for optimization, fine, you can develop that. Naturally, since we support on-heap and off-heap in our project, we do support storing indexes both on-heap and off heap. Very nice feature.
So as we are distributed systems or system, we support distributed our SQL aggregated functions like group by aggregations and sorting. It kind of goes without saying. It’s natural. But we do have that. And that actually helped the performance quite a bit. We gain performance by parallelizing SQL. And naturally, it sounds simple, but believe me, implementation of that is fairly non-trivial to do that.
The next point is kind of very important. We support a distributed cross joins on different cache and joins and unions as well. If you guys know a little bit of the theory behind the SQL, the joins don’t really work very well in the general sense in distributed systems because you have the explosion of keys exchange. For you to basically to do distributed joints, each – every note has to change a dramatic amount of keys, if not all the keys in some cases. And that’s hard to do.
So naturally, there’s multiple ways to do that. And one of the ways we support almost automatically is what we call a star schema. It’s when you keep your dimensions in the duplicated cache and you keep your values in the partition cache. And remember this is replicated in partitions. So if you can do that and if you can basically store your data in this way, you can do extremely efficient a distributed joints in SQL that will actually perform extremely well. Naturally, you can join multiple caches.
And if you think about cache as a table, and it’s just kind of synonymous. Cache in in-memory system kind of equals to a table in the SQL databases. So naturally, we can do the cross-cache joins, joins between multiple tables. And if you do them in the star schema, again, dimensions in bucket cache and values in a partition cache, we can do these joins extremely efficiently. But there is no ways to change the laws of physics. If you don’t do that in having terabytes of data, you’re distributed joins will be slow. No matter what, they’ll be slow in any systems today.
Another interesting thing, it will support ad-hoc SQL. So it sounds like a normal thing, but you’ll be surprised. In some of the in-memory systems like – it’s kind of escaping me – some of the databases – in-memory databases, you have to know your SQL queries up front. And based on those queries, the system will do the data allocations. That’s not the case for us. We don’t have that limitation. We support full natural SQL. Whatever you want to write, it’s a completely ad-hoc SQL.
Another thing I want to talk about is – and by the way, we’re gonna stay on this slide for quite awhile. This is the kind of main slide we’ll talk here. Another – and basically one of the greatest features we have in our SQL support is the custom SQL functions. I cannot emphasize enough how important it is to really – for the real live use cases in SQL. So we do not have a store procedures in our support because store procedures are always specific to any database, right? And we literally don’t have store procedures. You don’t really have to have them because you’re working in Java environment.
To kind of mitigate that, we can have the ability to write custom SQL functions in whatever language you want, in any JVM language. If you Java, Scala, Groovy, whatever you like, you can literally write a function and use that function in your SQL queries. So that’s a tremendous capability to basically inject whatever logic you have. And believe me, this function can be whatever. This function can essentially do a custom map/reduce job right from a SQL.
Now think about it again. You can have a complex SQL query with the joins and what not, and right in that SQL query, you can invoke a complex, for example, map/reduce function for the distributed kind of MPT type of processing. It’s a very powerful paradigm. And it’s very easy to do. And it basically our SQL capability extremely powerful. You’re not only limited by what SQL provides you. Those are very basic functions. You can essentially do whatever, frankly, you like to do in your SQL. Pretty cool stuff.
We have a GUI-based schema import. Essentially, if you have an existing database and you want to import this database into the in memory, we have a very nice GUI-based tool that will read this database, will create the schema for you, create the Java objects, basically will do all the dirty work for you on how to basically import this data effectively and cleanly from database. You know, we’re gonna spit out a bunch of Java clusters so you can take it off and do whatever you like with them, change them, or add some functionality to it.
But we’re gonna do all this kind of in a boring error-prone for you. We’re gonna scan the database, create the Java definitions and everything else. So great start by the way. It really cuts down on hours, sometimes days of work into literally half an hour. Run the tool, get your stuff in, look at it, change here and there, and boom, you have ready. You can basically import the data from your database into the in-memory system. Very, very nice tool.
We also support dynamic schema change. So it’s actually a pretty complex subject by the way. It kind of dovetails into another feature we have in our project, which is essentially called a – basically it’s dynamic objects or a cross-platform objects is where you can store data, let’s say from a .NET and then read this data from a Java site. And we’re gonna properly convert all the types properly. So you literally store .NET object in a cache, and you can read it as Java object or you can read it as C object. And we’re gonna properly convert all those types.
So part of this is, essentially, those objects – you can think of them kind of in the JSON-based objects. Although the format is binary is much more optimized. But that gives ability to have support for dynamic schema change if we need to. We can add fields to it in a remote.
We have geo-spatial indexes as well. So if you are into this realm of processing, we support geo-spatial indexes. So you can basically use this functionality.
Kind of another feature about support of Lucene and predicate queries, it’s a little bit kind of very parallel to this, but in exactly the same functional area where we actually have Scala SQL support, we also support a Lucene text queries and basic predicate queries. So if you want to do a full scan of predicate, be my guest. Scan the whole thing with the predicate. We also have fairly optimized Lucene text queries. So if you’re just looking for a text search rather than Lucene-based search, we support that as well.
So as a matter of fact, on co-located join support is something we already I think introduced. I’m not sure if this was released with Apache, but it’s literally done and it’s either already in the release. You guys can grab it nightly build and have it, or it will be released in I think 1.6 Apache Ignite.
Again, it’s an improvement in our join support. You don’t have to co-locate any more data for it to work. It’s just gonna work but slowly. At the point right now, it doesn’t work. So it will basically not work. And if you don’t have co-locations. With this new capability, you do not have to collocate. It’s just gonna work slowly, but it’s still gonna work. So that’s an interesting thing.
And we also work in very, very strong – I mean, it’s part of a community work and part of what we do at GridGain as well is we’re gonna support more and more with disk storage capabilities for the use cases where you cannot fit everything in the RAM. And today one limitation on the project is that for your SQL queries to work in a proper way, all the data has to be in RAM.
And there is some limitation to that because there are use cases where you cannot fit all of this in RAM. You literally have terabytes and terabytes of data. It just may be impractical for you.
So for those use cases, we’re working very hard in supporting a – kind of what we call a hybrid storage model where the data can be stored in RAM in multiple types of RAM, and on flash, and on disk. And your queries, including SQL queries, will work transparently over all of this different mediums. That’s coming up very shortly, probably by early summer. Literally, we’re talking about a couple of months ago we’re gonna have that support as well.
So this is basically, again, briefly but kind of the bulk of the features we have on the SQL support. As you can see, we pay a lot of attention to the SQL stuff. And naturally – and when it comes to SQL, there is literally just a few important factors. What kind of SQL do we support? We support ANSI-99 and standard. Is there any limitation? There are some but very few. And speed. And speed is basically – is pretty much paramount. Speed has to be one of the top concerns. I’m gonna show it to you in a second.
But let me show you some very brief examples. This is basically just literally – I believe it’s a Java code. And that’s basically how simply you can use it from a Java application. So literally just get the Ignite cache, which is think about it is just the table or basically SQL’s like a table.
And then once you have a cache, you can create the query as simple as – this is a simple query. There is no gimmicks. It’s literally just a simple, normal SQL query. You can use any type of SQL in it. And you get a nice distributed cursor back with the result site, and then you can print your things.
So there’s a lot of machinery here by the way. You can literally query terabytes of data. Obviously, we’re gonna do distributed fashion. We’re gonna only pull the data that’s necessary for a particular course or asynchronously for you so you can literally iterate it over terabytes of data very nicely. And this is another basically joined example. Again, very, very simple. Again, you get the cache. You create the query. You execute the query. Simple as this. And if you know the results are pretty small, you can say “Get all,” and basically have a nice data result set back.
So, by the way, this is a “Group By” example, too. So here’s what I’m gonna show you. We have about – we have a good half an hour. So let me go and spend maybe ten or 15 minutes on the demo. And I want to leave about ten minutes for the Q&A if we have in questions. So the demo I’m gonna show you is pretty interesting.
So we thought about how can we demonstrate this speed because speed has to be somehow comparative. So we picked a pretty good project from our friends at Spark community. So what we’re gonna do – we have a data set. I’m gonna show you the data set. And what we’re gonna do, we’re gonna start the Spark cluster, and we’re gonna start the Ignite cluster. And then we’re gonna use exactly the same interface, which is Zeppelin.
And Zeppelin is a very nice – a notebook-style interface. If you guys haven’t looked at it, I highly recommend it. It’s pretty cool stuff. It has both integration with Spark and Apache Ignite. So it’s very unbiased. And we’re gonna use exactly the same SQL queries to run against exactly the same data set, but the data set will be loaded in Apache Ignite and Apache Spark. And we’re gonna see the performance.
So let me kind of walk you through some of the setup here. I’m gonna get out of this presentation slide deck. And what I’m gonna show you first is where to get this demo. The demo is completely open-source. It’s on the GitHub. And by the way, when you get the slide, on the last slide – on the slide deck, you’re gonna see exactly this URL over here.
So don’t worry if you don’t catch it right now. It’s all on GitHub. It’s all free. It has a very detailed documentation of how to do it. It’s very simple. Literally, the entire documentation is basically like this. It takes you about five minutes to set up. Everything is prebuilt and everything is ready. So you can basically replicate this entire demo very easily in about 15 minutes from scratch. If you’re interested, you can play with it.
So what I’ve done basically is we have a Spark cluster. And it’s right here. I kind of blew up the font. So I hope you guys can see that. So it’s 1.6 Spark. It’s running one node over here – one working note.
And what I’ve also done is I started two nodes on Apache Ignite. And I started them – basically when you download the Zeppelin demo, essentially, you get a very nice Maven-based project and you can basically just click on this note file, and it will run this on any IDE, Eclipse or IDE. And you can basically start two nodes. So you can see that we have two notes in Apache Ignite, and we confirm that they actually have a – two servers. So essentially we have two nodes.
Now, in terms of the dataset, there is a little program here that actually generates dataset for you. And let me see if I can actually show you a little bit of data in the presentation mode. So if you look at it over here, essentially it’s a very, very simple program that basically generates two JSON files.
And those JSON files is exactly what gets loaded into the Apache Spark or into the Apache Ignite. So we have a very simple structure here. It’s organizations, persons within the organizations. We have 500 organizations. And we have a 1,000 persons per organization. So it’s a small data set, something we can test and play pretty quickly and easily.
And we basically generate two JSON files, organization JSON and person JSON. This is a very simple program right here. So I’m gonna show you data files. I already pregenerate them so we don’t spend time right now. So if you look at here, I’m not sure if you can see. But we have the two JSON files here. We can actually open this data file, JSON file, and we can take a look at this. So this is the organization. It’s pretty boring, nothing there.
And this is the person file. And let me just open it. Oh, it’s too big. All right, fair enough. So we can open it a little bit differently, I guess. So we have the persons file that contains about half a million persons. So pretty simple data set.
So let’s go back to Zeppelin. So this is basically a Zeppelin interface. If you’re not familiar with it, I’ll give you a very, very quick overview. It’s very simple, by the way. It’s basically have notebook-style interface to a data set. We have two notebooks, Ignite demo and Spark demo.
So if you open the Spark demo, for example, this is essentially what you do. You basically can write a SQL or Scala code right there, click run, and it executes for you, and you can basically see the results. It’s a very nice, very productive kind of analytics, data analytics interface. By the way, Databricks – if you’re familiar with the Databricks analytical suite, it very much looks like this. So it’s a pretty powerful concept.
So here’s what we did for a Spark demo. For in Spark demo, essentially, you need to do a little bit of a prep work because you need to load those JSON files into the Spark. And that’s what we did here. I already did that so we don’t have to spend time. It’s very simple. You just SQL context and then read in JSON. You could do this for the organizations, and you do this for the person JSON. So you basically create those data frames in Spark.
And once you’ve done that, you register a table, basically link data frame with a table through this command. Once you’ve done that, you’re all ready. So now basically you can start querying SQL, using SQL to query this data. So once again, just to again reiterate, we have a one node in Spark and one worker, and one master. And here on the Ignite side, we just started those notes right from the ID, literally from the code. It doesn’t really matter where they are. We just start them right here.
So we can go back in our Zeppelin notebook. And let’s execute this. So essentially, this is our SQL right here. Again, we can copy this SQL right from the GitHub page, so we don’t have to type all of this. So I essentially, like this, copied this and we can play with all of this as much as we like. So I copied this over here. And we just can click Run. And it’s running right now. As you can see, basically, that’s what’s happening with the Spark. This is the running SQL against Spark. So it gave us all the results, and it took about eight seconds to do that.
So now we can go back to a different notebook, which is Ignite notebook. And literally, I already copied it again. But the only difference here is literally this. So you have to let Zeppelin know that you essentially running a different interpreter. One was for Spark, and this one is for Ignite. The data set is exactly the same as with Spark. It’s exactly the same, two JSON files. We’ll load it up. We’ll load it up on the Spark cluster. We’ll load it up on the Ignite cluster.
So now right here we’re gonna be querying data from Ignite cluster. Very much the same interface. Again, we just click Run, and we’ve got results. Now, notice the difference. Now, this is what’s really striking here. It actually shows zero seconds, but believe me, the Zeppelin doesn’t really show you the milliseconds. So if you really query this in your applications, so somewhere you can actually get down to the millisecond level, you will notice that Ignite performance is about 100 times, two orders of magnitude, faster than the Spark SQL.
And the reason for that is – it’s not that we are smart than Spark folks. Absolutely not. Spark is an awesome project. But the Spark SQL does not really have indexes. Indexes is very hard to do for RDDs which is resilient distributed datasets — the concept that Spark is based on. It’s very hard to do proper indexes there. And the Spark SQL still doesn’t have a proper indexes at all. So naturally, in the Spark SQL, almost every query is the full scan.
And even for a small microscopic data set, as we have today, literally, 500 organizations and thousands of people per organization, something you can do on your iPhone very easily, even on this small data set, the difference is noticeable. Imagine you do this on your streaming pipeline or any kind of processing you can do. The difference between eight seconds and literally eight milliseconds is quite dramatic. And we can prove it again. So it can go back and play with the Spark.
As a matter of fact, let’s just clean up the output, and let’s pick up something else here from a different query. We can pick up this query. Let’s say second one. And we can literally run it again here. And it’s a simple join with a “Group by.” And it’s going. So as you can see, that’s a little bit more involved than the previous query, but it will be done.
So it took us about 20 seconds. It took about 20 seconds. And let’s go ahead and try. Well, I don’t want to have to type this whole thing. I can just copy it over from here. We go back to Ignite notebook. Now, let’s just clean this up so we don’t have, and let’s just paste it again here, and run it.
Now, notice the result here. As you can see, there’s absolutely no slowdown, right? As a matter of fact, the difference here is much more dramatic than the difference in the previous example. The difference here is probably almost close to three orders of magnitude. Not exactly, but you guys can test it out properly with a proper instrumentation to see exact numbers. But the difference here is quite dramatic. You can probably see it visually. One thing took 20 seconds to process, which is very, very slow. You can visually see this.
This one took nothing. You press the button, you get results. It’s instantaneous. It’s exactly the same data set. It’s exactly the same SQL query. That’s the power of what we do with Apache Ignite. And, again, not to brag too much, but we do quite a lot in optimization for what we do in SQL, right?
Remember, we use H2 locally. So we use one of the best local, non-distributed SQL execution app available today. We add very sophisticated distribution logic on top of that. And we pay a tremendous attention to the performance, which is essentially indexes, and how do we keep indexes, and how do we process indexes in Apache Ignite. So the end result is this dramatic performance increase.
Again, it’s a little bit unfair because Spark simply doesn’t have indexes. But even if you compare us with the other SQL engines, we’re gonna be – nine out of ten times, we’re gonna be on top in terms of performance.
So this is the kind of all I want to talk today and show you guys today. We are a little bit ahead of time just by a few minutes. Once again, the last thing I want to show you is this little slide over here. By the way, thank you for attending this. And we’re gonna basically answer some of the questions right now. But you can always get – let me blow up this slide a little bit.
You can always get this entire demo, by the way, on this GitHub URL. If you don’t catch it, just look for Zeppelin demo on – just Google it and you’ll find it. It’s extremely easy to follow up. Literally, there’s nothing there to do. You basically go to this GitHub page. You pull it down. There’s a POM file. You open this POM file from Eclipse or ID. Literally follow this little instructions, how to set up Spark and how to set up Zeppelin.
And that’s it. You literally just generate the JSON files, you load them up into the Spark, you load them up into the GridGain or Apache Ignite, and you run the queries. It’s as simple as this. And you can play with this.
So, Alisa, back to you.
Great. Thank you so much. Can you hear me okay?
Okay, good. So the first question is, “Do we support complex data types? For example, arrays, structs, arrays of structs, nested structs?”
Good question. Kind of caught me off guard. We definitely do but maybe a little bit more setups needs to be done on how to properly get them. That’s a good question, as a matter of fact. You know what, the best way to answer this question, post this question or just ask this question on a dev list in Apache Ignite. You’re probably gonna get a better answer with a code example right away.
The answer is absolutely yes. But the only kind of hesitation on my part is that I don’t think we support it completely out of the box. You may need to convert some of the times. You may need to provide some little bit of a code to support that. But definitely, definitely yes. But again, the best way, just pop this question on dev list in Apache Ignite. Like within a half an hour, you’ll get the answer back with the specific answer.
Okay. The next question, “Is there support for functions like hives, collect list, collect set, which are useful in aggregations?”
Not out of the box, no. That’s why we have our own custom SQL functions. If you want to write them, you can write them. But no, not directly.
Okay. Is there support for Java 8?
Yes. We are fully Java 8 compatible. As a matter of fact, most of our API is Java 8. As a matter of fact, I don’t think we support Java 7 as a source code. We’re in Java 8 fully.
Okay. So the next question is about – let’s see here. It’s about the demo. “I’m not familiar with Zeppelin. But is there pre-caching occurring on the physical database server? Is there a difference between the first and second query on the database?”
Well, Zeppelin doesn’t do any pre-cache. Zeppelin is just the interface, right? So Zeppelin is just the UI. And the reason we use Zeppelin, just to make sure we’re kind of showing you from the same UI point.
As far as the pre-caching, it really depends on implementation of Spark Ignite. None of us, I believe, do any pre-cache because maybe H2 engine on each individual node for Apache Ignite does a little bit of pre-caching, but I highly doubt this. I mean, it does, of course, some of it, but I highly doubt that affects the performance because we specificcally in our implementation of our integration of H2, we instruct H2 to not do anything pre-caching because we are in-memory system. It will be kind of, you know, caching of caching.
So everything you saw today and everything you can play today, play yourself with this demo, will be straight-out just literally call after call in the system. It will get a little bit faster naturally because there is some level of pre-caching happening. But it’s not gonna be dramatic, no. And the difference in performance will be pretty much the same.
Okay. “Can you manage data distribution by a logical key, as teradata primary index for example?”
Yes. One of the strongest parts of a data grid – and, again, I can only speak here for Ignite, not for Spark. For Spark it’s a little bit different. For Ignite, the data affinity, which is the proper term here, is one of the strongest part. You can basically define it as your own affinity as much as you like.
And not only you can have a data affinity, you can have affinity between the computations in the data. Even outside of the SQL context, by the way. If you want to run a traditional kind of APP level processing or any kind of HPC-level type of processing, high-performance computing, you can collocate your a compute jobs with a data sets those compute jobs require.
And we support strong – we have a strong support for affinity between compute jobs and data jobs, not only between just data allocations itself. So, yes, I do believe we have a probably more sophisticated support on this level than most of the systems in existence because that’s the core of what in-memory data grids and compute grids do. And that’s a part of our fabric.
Okay. “Do you support ORM on top of in-memory data fabric?”
No. And we don’t have to most of the time because there isn’t any impedance between like in traditional SQL world where you have a SQL table presentation and an object–level presentation. And most of the time in-memory data grids, you’re dealing with objects. You store objects. You read objects.
In terms of the SQL, the same story. Look in our example here. Let me actually pop up this slide deck so we can look at this. Look at these examples. So the question mark is – although it’s a question mark here and here, it’s actually gonna be object. So you’re gonna get back an object. Object can be the binary object, which kind of like a JSON-like presentation, which is schemaless, or it could be normal objects.
So the need for object relational mapping is almost nonexistent in the data grids. You can still use it if you like it. If you happen to store a binary objects everywhere, you don’t want to convert them to something else. But that’s entirely your application–specific control. And most of the time, and it’s one of the greatest advantage of a data grids as a component of our fabric is that you don’t have to have an ORM. You’re dealing with objects. Your keys are objects. Your values are objects. What you’re processing are objects. What you get out of the SQL is objects, too.
Okay. “What are the memory specs on the machine on which the demo was done?”
[Laughs]. The demo was done on my MacBook Pro laptop. So it’s a very simple laptop. It has 16 gigabyte of RAM. And if you’re asking about this demo, by the way – I’m not sure. Maybe you’re asking about those billion transactions per second we’ve done. But that’s different. This demo today right now I’m running from my MacBook Pro.
Okay. “Are indexes created on the data in off-heap memory? And SQL query work better than the scan query on the data in off-heap memory?”
Yes. Yes. So we can keep indexes both on-heap and off-heap. I think we’re moving – and by the way, this is actually a very interesting discussion we’re gonna have another day. We’re actually moving most of our data storage to off-heap model because our blades get more and more RAM with every year.
Today you’re buying a blade, you’re unlikely to buy a blade with less than 64 gigabyte of RAM. And obviously, Java cannot support that. And there is no point in running multiple, multiple JVMs because you’re basically gonna get queued by thread context switching. So we’re trying to catch up with this development and we’re trying to – a little bit of rearchitect what we do in terms of memory management.
And we store more and more stuff in off-heap, basically where we fully control the RAM and we don’t hit the garbage collection anymore. So it’s a little bit kind of detour. But, yes, you can store in the – at least currently both on-heap and off-heap. And as far as performance of scan query versus SQL query, SQL query will always outperform, always, unless we’re talking about some kind of a very extreme use case where a small data set and whatnot.
I hope I answered that question, if I understood it correctly.
Okay. “Does Spark support in-memory partitioning of data across computing engines?”
I believe so, yes. I’m not a Spark expert, per se. So you guys need to ask this question to the Spark community, but I believe so, yes.
Okay. Well, that brings us to the end of – oh, wait – never mind. Somebody just came in last minute. We have time for a couple more questions. So if you do have questions, please go ahead and enter them in the Go To Webinar Questions panel.
Okay. “Do you have any advice or exact method for sizing GridGain cluster based on size of input data?”
It’s a good question. And we get this actually quite a lot. When we talk to our users and customers – potential customers, it’s one of the first questions. My immediate reaction and probably the best answer I can give to you, try it. Measure it and try because we’ve seen a dramatic difference in how a particular data sets lays out from disk to RAM, for example, because objects on disk take a different amount of space than on RAM. And there is no exact science.
If you’re looking for an absolute back-of-a-napkin estimate, I would say with the two replicas in RAM, essentially, if you want to have any kind of a full tolerance, you keep to keep data in RAM on more than one computer. If the computer goes down, you don’t lose the data.
So if you want to that behavior, which is pretty standard, I would say you’re looking for about 3X RAM requirement for the same data set you have on the disk. So if you have 100 gigabyte on the disk and you want to move this 100 gigabyte onto the RAM with a two replicas for high availability and fault tolerance, you probably need to have about 300 gigabyte, 3X of that in RAM across the cluster.
And again, this is just a back-of-a-napkin. Your use case can be different. I’ve seen use cases where a RAM requires less memory than the disk for the same data set. But most of the time, it’s about the same or slightly larger. Slightly larger mostly because of the fault tolerance requirement. You want it if you really care about this.
If you want to have a fault tolerance, you have to replicate your data at least onto the one more computer in case of the other computer goes down, you still have the data. So that’s a kind of basic answer. But again, the best way – and I always keep saying, if you’re serious about it, spend half a day and just load it up in RAM and measure it.
It’s as simple as this. And most of the time, you can optimize it, you can basically play with this, and you get your exact results you’re looking for, for your specific data set for your application.
Okay. We have a couple more questions. “How do you compare with Exasol in-memory database?”
Exasol. I don’t think we – I don’t think we’ve run any tests against that at all. But again, if you’re asking me about – just let me answer this question generally about in-memory databases. You’ll be surprised.
We do not really compete in any way, either as an open-source project or as enterprise version GridGain. We don’t because there’s a dramatic difference between database and in-memory data fabric. Database predominantly is just a SQL engine. It can be memory. There’s plenty of them. MemSQL’s a great one. If you just need essentially a SQL database that does in-memory, there’s plenty of options.
But SQL databases don’t do map/reduce They don’t do streaming. They don’t do Spark integration. They don’t do Hadoop acceleration. They don’t do file systems. They don’t do HPC. They don’t do many other things. And that’s what’s really cool about a fabric is it gives you a much more strategic view on in memory. It’s not a single–trick pony.
Yes, we do have a SQL, but we also have map/reduce. We also have in-memory streaming. We also have Spark integration with shared RDDS. We also accelerate Hadoop if you need to. We also do complex clustering and service grids. So we do a lot of different things on the fabric. That’s why we call ourselves a fabric, not a database, not a data grid.
So I’m pretty biased here, naturally. My bias towards fabric, it’s more kind of a canonical solution for in-memory processing. But if all you need is a SQL and you need a little bit faster SQL, then SQL databases or a perfect choice.
Okay. And last question. “Do you compress the data in memory? If so, what kind of compression ratios do you normally see?”
We do not, and for a good reason. We thought about it very hard because some of our friends do that. And we believe that’s not the right approach. The compression can be done much more effectively in application level if you need to because, keep in mind, when you compress the data, you have to decompress it every time you want to run a query against it. And that has enormous penalty performance-wise.
So you essentially are trading capacity for a very significant penalty in performance-wise. And instead of doing it automatically, we basically decided a long time ago that we’re gonna leave it at the application level. Compression can be done extremely simple. There’s plenty of libraries. You can do it in one-liner.
But you at application level can decide which and when and how you compress your particular data object before you store it. And you will be aware that you will not be able to query it easily because, obviously, if you compress it, it becomes just a byte store or byte blob, and you cannot really query inside of it.
So it’s a good question. I don’t believe our decision is absolutely 100 percent correct. But I believe on the balance, I think it’s a better approach than actually have a built-in compression capability into the product because, again, there’s plenty of libraries that can be done with one line of code.
There’s nothing else to do. There’s no complexity there. But we believe it’s better to be left for our application developers who know precisely when, how to compress data before it’s stored in date grid or in data fabric.
Okay. And this brings us to the top of the hour. If Nikita didn’t get to your question, I will make sure that he gets a copy of all the questions, and he will either reach out to you or post a follow-up blog entry – blog post of our GridGain blog. I also wanted to just remind everyone that I will be send…