From Big Data to Fast Data

This on-demand webinar titled "From Big Data to Fast Data" features Jason Stamper, Data Platforms & Analytics Analyst for 451 Research, and Nikita Ivanov, co-founder and CTO at GridGain Systems.


In this one hour webinar you'll hear how real-time organizations are keeping pace with massively expanding data from the perspectives of a leading analyst firm, and you’ll learn how these organizations can benefit from an In-Memory Data Fabric to enable cloud-scale, high-volume/low-latency data processing as a response. Jason and Nikita field questions from the audience in a Q&A section at the end.


This is a must-see for technology leaders in the transition to high-speed, low-latency big data systems.

Nikita Ivanov
Founder & CTO, GridGain Systems
Jason Stamper
Data Platforms & Analytics Analyst, 451 Research

Dane Christensen: Good morning or afternoon, depending on where you’re joining us from. I’m Dane Christensen, the Digital Marketing Manager at GridGain Systems. I want to thank you all for taking the time to join us today for this very informative webinar, From Big Data to Fast Data.

We have two speakers today, Jason Stamper, who is an analyst with 451 Research, with 20 years of experience in the IT sector. Jason meets regularly with the most influential people in the industry, including CEOs of hundreds of technology companies like IBM, Oracle, HP, SAP, and many more.

Nikita Ivanov is the Co-founder and CTO of GridGain Systems, as well as a member of the Podling Project Management Committee for the Apache Ignite incubating project. In short, we have two of the foremost authorities on the subject of big data with us today.

So with that, we’re ready to turn the floor over to Jason Stamper. Jason?

Jason Stamper:
Thanks very much. Thanks for the introduction. So hello everybody. Welcome to this webinar, In-Memory Computing: From Big Data to Fast Data. Thank you very much for joining the session. I do sincerely hope you get something out of it. So a very quick word on 451 Research.

We’re one of the industry’s best-kept secrets when it comes to analyst firms. We’re actually now about the fourth largest, having made a number of acquisitions over the previous years. We bought Yankee Group, and we bought Uptime Institute. So Yankee gave us mobile coverage.

Uptime Institute is the organization that rates data centers and says how well they’re doing in terms of efficiency, and power, and so on. We’ve got over 100 analysts in the organization today. And I won’t read all of these bullet points out, but through another acquisition we made, called TheInfoPro, we interview over 10,000 senior IT people every year to get a real handle on what’s going on in the industry.

So we’re less about sort of counting boxes, and counting numbers, and how many servers have been sold in a quarter. And we’re more about what is the sentiment about some of these technologies, and what’s really going on, in the real world, on the ground.

Anyway, obviously, you can find us at 451 dot com. So let’s get going then. What are we here to talk about really? Why is anybody interested in a webinar with GridGain and what I have to say about this? Well, the problem is there’s a big challenge in enterprises today, in that end users aren’t getting the rapid response and the access to information that they need, when they need it.

And the reason for that is because the IT infrastructure that we’ve put in over the years – I think Sun Microsystems’ CEO called it, a jalopy. You know, it’s become a bit of an animal. It’s a bit chaotic. We’ve done it in piecemeal fashion, and that has meant the things aren’t able to scale and grow as we need them to.

And they’re just not flexible enough. And, of course, the end users see that they can get results from Google in a fraction of a millisecond, and they can’t understand why they can’t get information from their IT department in the same amount of time.

It takes days or weeks to find out new information, get new applications delivered, have access to new sets of data and data marts that they want to analyze. We call this, at 451, the perfect storm in terms of that bottleneck. If any of you have seen Fawlty Towers, the British comedy program, you’ll be familiar with this picture. But the reason I put this up is – the magazine I used to work for – we surveyed 200 U.K. CIOs. And we asked them – what sort of level of gap do you think there is between what the business expects and what IT is able to deliver? And you know, we weren’t surprised that there was a gap, but we were somewhat surprised that 98 percent of those British CIOs said the gap was significant.

And that’s CIOs admitting to themselves and to the business that, actually, we’re not delivering what the business needs right now. We’re not delivering the information in the format that they require, on the device that they’d like to see it, in a usable fashion, with the tools that they’d like to use.

And it’s just not accessible, and it’s not there for them. And equally, the applications that we’re delivering to the business take too long to come through. CIOs and IT directors are waiting for the business to tell them what they want, because IT is often seen as the cost center.

They’re unable to break through that cycle of just being a cost center, and having to deliver basically, yesterday’s promises tomorrow, kind of thing, which seems to be a lot of the struggle that the CIOs that I talk to, at a lot of events, tell me about.

So what does a business want? And this will be familiar to most of you. So I won’t spend too much time on it. Of course they want speed. Various analyst firms have talked about big data in terms of the three Vs – velocity, variety, and volume. People since then have added more.

You can add veracity – how accurate is the data? You can add value – how important is your data that affects where you store it and how you deal with it, and so on. We can add as many Vs as we want, but speed is definitely an issue. With businesses today, more and more of them are seeing themselves as much digital businesses as anything else.

And so, of course, there’s this question of how you get information into the hands of businesspeople. But it’s not just the kind of static data that we used to analyze with the likes of Cognos or BusinessObjects, where we’re essentially looking in the rearview mirror. It’s data that is fed into the application in front of people in their day-to-day jobs – embedded analytics, or what some might call operational analytics – which gives you access to information when you need it, rather than a look in the rearview mirror. So we’re seeing a lot more of that as well. So the picture in the top left – I’m not much of a petrol head, but I believe that’s a Bugatti Veyron. So speed is, obviously, incredibly important. And I’ve already talked about some of these, but – mobility.

We need this stuff on the move. Ease of use and self-service. I don’t want to have to ask the IT department six weeks in advance when I need to do some sort of new type of analytics application. And the bottom right picture is of a Heads Up Display. I think we’ll start to see these on our cars, unfortunately.

I think it’s probably going to be a distraction. But this is what people want. They want information as and when they’re driving, rather than look in a rearview mirror. So what causes IT to become a bottleneck? And, again, I won’t spend too much time on this because I think most of you know this.

But IT is still, if you like, the gatekeeper charged with making sure that the business data is secure, that it’s well managed, that it’s governed, and that it’s kept safe. And it’s all very well having some new, fancy, open-source tool on a laptop, which somebody can do some sort of analysis on or whatever.

But the IT department knows that their jobs are on the line if this data is not kept secure, well managed, and well governed. At the same time of course, they’re struggling with – you see the word, legacy, there. And we know that up to 80 percent of most IT directors’ budget is spent on legacy technology.

They’re struggling with some of the technologies that they put in a long time ago, that they spent a lot of money on, that aren’t really delivering value, and aren’t giving them the performance and flexibility that the business is expecting. Staff, I put there because – especially in the area of analytics – we know for a fact that people like data scientists are in short supply.

And the kind of experts who can take a large data set, analyze it, and give you the results are definitely few and far between. So staff is, obviously, an issue for many companies. So more and more companies are looking at other alternatives as to how they can get better flexibility and agility for their infrastructure without breaking the budget – which is the other word on that slide.

Now that picture on the bottom right – if you do know your cars, you might recognize that as being – I think that’s a Mitsubishi Lacetti, which starred in Top Gear’s Star in a Reasonably Priced Car segment. That’s why I put that on there: the IT department is often expected to do an awful lot, but the budget isn’t there for them. The enterprise wants the Bugatti Veyron, but they’re given the budget of the Lacetti.

And we won’t say too much about Mr. Clarkson’s fracas, which seems to have taken up more news pages this month than the general elections in the U.K. But I think you get my point, which is that IT budgets are stretched; they’re, obviously, not doing what they’re expected to, and they have all these sorts of challenges.

Let’s move on. So what are companies trying to do to solve this bottleneck woe? A lot of companies have played around with Hadoop, or implemented Hadoop, and seen what they can do with that.
Will that solve the bottleneck? Does it mean free storage?

We can just put in some commodity Intel servers and put all of our data in that. Well, no. It hasn’t solved the problem. Neither have NoSQL databases like MariaDB, and Mongo, and so on. And the reason is, they’re designed for large data sets, but they’re not necessarily designed for transactional workloads.

They’re not designed to be, in any sense, real time. They work very much in batch mode. So they definitely have their uses, but they, obviously, haven’t solved the bottleneck, which is about speed. And my fourth point down there is IT is, obviously, still critical, but it needs to enable the business to help itself.

And the question is, how does the IT department start to get past that bottleneck? And my final bullet point there, which I’m sure you’ve already read, because I know on these sorts of things, people read the bullets long before the speaker tells them.

But it’s not just about the real time access to information. It’s also about the process in which you, as an enterprise, are delivering applications to the business. And what I mean by that is – how much is involved for application developers when they want to build a new application or when they want to deliver a new data set to the end users?

How much work is involved for those application developers? How much work is involved for the likes of enterprise architects? How much time and management does this need the CIO to deliver? There’s all of those sorts of things as well that go on in the background which seem like kind of – they’re just day-to-day. We just need to do that kind of thing, but they actually can be improved. The development cycle can be reduced with some of these latest technologies that we’re going to hear about from GridGain. And definitely there’s an opportunity to make it a lot easier for developers to get their job done.

And, in so doing, the idea is that we’d reduce some of that bottleneck I talked about. And the slide I put up with John Cleese thrashing his car, in Fawlty Towers, I think, says it all. So that’s what we’re trying to do. So I talked briefly about Hadoop.

And just in case you are under any kind of impression that there’s a bit of a halo around Hadoop at the moment, and everybody’s using Hadoop now for their data storage, and that the likes of EMC, and IBM, and HP, and so on – the storage businesses – are going to go away: we interview, as I said at the beginning, 10,000 people a year about their adoption of technologies.

The number of them using Hadoop is that tiny orange slice on the very right-hand side. We’re just coming up to produce our next version of this. So, as I said on my previous slide, Hadoop hasn’t solved the problem, and it’s not designed to solve this issue of speed and big data.

It’s really designed to be a relatively cheap storage platform for data that you haven’t quite decided what you’re doing with it yet. But of course, costs come in later on when you start to work out what you do want to do with it, and how you’re going to analyze that data.

And it’s been said before that Hadoop is a bit like buying a puppy. It’s relatively free when you buy it, but the costs over the life cycle grow considerably. So what are some of the challenges? I’ve already alluded to some of these, so I won’t read them all out.

But most enterprises are finding that they’ve got more users and connections. Even if they don’t have that many more customers than they used to have, they’ve got more people hitting their web site and analyzing their form. They’ve got more people interacting on social media.

Transactions are going up. Of course, there are more and more businesses in the eCommerce space – online gaming, social, and so on. We’ve talked about the internet of speed, and so there’s all sorts of pressure on the traditional IT infrastructure, which wasn’t really designed for these kinds of loads. And you understand the scalability problems and challenges when you consider that there’s a web site in China called Qzone, which I believe is the equivalent of Facebook.

They hit their first 60 million customers in the first week. It’s just staggering – 20 years ago, that would have been inconceivable. And so you’ve got companies like this that are having to deal with this, and work out what kind of infrastructure they can use to handle this incredible scale that so many companies are seeing.

And even in more traditional businesses, we’re trying to analyze data from more and more sensors in our factories, our shop floors, our web logs, our web sites, and so on. So even if we don’t have a huge growth in customers, or even if we’re seeing a decline in customers – we’re trying to analyze vastly more data than we used to, because we can, and we’re able to get a bit more sophisticated about what we do with that data if – and this is a big if – we can cost-effectively analyze that data and get it to the right people, at the right time, rather than it just being a bit of an academic exercise.

So again, I talked about some of these challenges, but if we get specifically to databases, which is the area where GridGain obviously is and is going to talk to you about, there are some clear challenges with traditional databases. And I don’t think it’s any surprise that you’ll see some of the younger database companies raising large amounts of venture capital, because they know, and the VCs know, and their new, young customers know that the likes of Oracle, IBM, HP, SAP, Microsoft, etc., with their databases – they’re doing what they can to address these challenges.

But, equally, they weren’t designed with them in mind. And so, depending on what you’re trying to do, there are different approaches which may well work better for your organization. And that’s why we’re seeing quite a lot of investment in younger database companies. That’s why we’re seeing so much excitement around the likes of Hadoop, and Mongo, and MariaDB – whether or not they solve all of the challenges.

There’s definitely a problem for some of those incumbent vendors. Of course, they’re working on it, and they’re adding things like in memory options and so on. But they know and the market knows that there’s a change happening, and the question is how your organization deals with that change.
So these are some of the potential options that companies are trying to adopt in order to break through that bottleneck that I talked about. If you remember my slide with the Bugatti Veyron and the star in a reasonably priced car from Top Gear, you’ll remember that the organization wants more speed and flexibility.

And that’s not just in terms of delivery of information in real time, but also in terms of the development life cycle. And they also have budget constraints. So traditional relational databases aren’t always working. So they’re looking at a number of different options, and these are some of those that companies are thinking about, that we talked to.

So the relational database vendors, like IBM and Oracle, have added what they call in-memory options or add-ons. IBM’s is called BLU Acceleration. Oracle has got the Oracle Database In-Memory option. But fundamentally, there’s still a price to pay for those, and you’re still dealing basically in the same environment that you were dealing in, potentially with a little bit of an increase in speed, but you’ve still got issues around scalability.

Especially when you start to think about wanting to store time-series data – for example, a sensor that emits time-series data every second. Relational databases were really designed for transactional data – a few hundred or a few thousand transactions an hour, rather than hundreds of thousands of transactions an hour.

So their in-memory options certainly seem to help with analytics, but possibly not with the transactional side. You’ve obviously then also got pure in-memory databases. You’ll know many of them – VoltDB and so on. And they’re doing a great job, depending on what you want to do, again.

They also mean, to some extent, that you have to pay for memory – which is expensive, compared to disk. And although the price of memory is coming down, it’s still not free. And also, they’re not ideal for very large volumes of data storage. So they can be good to work as sort of an in-between role.

But you’re adding a layer of complexity, and you still probably need a relational database, etc. The data streaming offerings from lots of companies these days – IBM, Software AG, or Palmerston, or Progress, and so on. And they have their roles, especially in things like very high volume, very low latency transactions, where you need very fast throughput.
They’ve definitely got a role to play, but they’re not really designed for storage. So you still need to consider what your database platform is going to be there. Then you’ve got analytics in the cloud and database as a service, again, for testing and dev projects, for certain types of use cases – absolutely fantastic.

You can pay as you go. You can try it out without a huge investment – all great. But if you want real-time, very fast data, you’re relying on the internet. So that’s not necessarily going to be ideal for everyone. And then there’s the in-memory data grid/cache approach.

Some of the advantages are that you need less rewriting and replication in the database, and there are massive performance improvements, which we’ve heard about, and which I think you’re going to hear about from GridGain. And the only question now, I suppose, that someone would pose would be that, to some extent, you’re adding another layer, rather than completely swapping out the database.

And that has its own pros and cons. Swapping out the database is a vastly complex and expensive job. So I think we’re going to hear now from GridGain to see how they try to minimize that complexity, and make things as easy as they possibly can, in the grid and cache approach.

This is my final slide. It’s our view of the total data approach. It’s just to show you really all the different areas that my team covers. I won’t talk too much about it. But you can see the three Vs in the middle there, and the economics. Around the outside, we surround it with a business case.

So, if you’d like to continue the discussion with me, feel free to ping me an email, or let me know what you thought, or how informative, or not informative, my discussion was. And without further ado – I’m just about over time – thanks very much. I’ll hand it back to Dane.

Dane Christensen:
Thanks very much, Jason, that was excellent, and I thought it was very informative. As you said, now we’re going to go ahead and hear from Nikita, who’s going to take a deeper dive into how GridGain is solving these problems that you talked about. Nikita, why don’t you take it over?

Nikita Ivanov:
Good afternoon everybody. Thanks, Jason, for the great overview. So where we’re going to spend the next 20 minutes is giving you a fairly high-level overview of what GridGain does and what our vision of fast data is. We’re building software to support that vision. I’m going to give you a bit of a rundown of what we’re doing and how we do things. Just a quick note: GridGain is essentially the enterprise version of Apache Ignite. Apache Ignite is the free Apache Software Foundation project on top of which we built.

We were the original developers of this project. The project is pretty mature. It’s been in development for almost 10 years by now. We, again, developed the enterprise version of that. By the way, I’m going to be using Apache Ignite and GridGain almost interchangeably; apart from the enterprise version and some of the services we bundle with the enterprise version, this is the same software.

So what is Apache Ignite? We call it an in-memory data fabric. And the words “in-memory data fabric” are pretty important here. First of all, it is a software solution that runs on a cluster of computers. And architecturally, it slides in-between your data sources and your applications, and delivers all the benefits of in-memory computing, predominantly extremely high performance and high scalability.

Those are the two metrics – performance (high performance/low latency) and scalability – that typically define in-memory solutions. Before we dive into the details, let me just talk a little bit about disks. We generally get asked, what’s the difference between disk-based processing, Flash-based processing, and memory? What are the differences?

And, specifically, how is in-memory computing different from just utilizing Flash storage? It’s actually fairly unintuitive, because people sometimes confuse Flash and memory, but they’re really nothing alike. Typically, in-memory systems – systems where you use the RAM of computers across a cluster as your primary data store – are anywhere between two and five orders of magnitude faster than disk-based systems.

We’re talking about 1,000 times faster than an equivalent disk-based system. And that’s the major difference between Flash and RAM. Flash can only give you a marginal improvement. If you’re looking for two or three times faster reads on your data, Flash is probably a good way to speed up, because, remember, Flash is still basically a disk, and that’s how systems view Flash.

It’s still essentially a disk, just with a much faster seek time. In-memory solutions, such as the in-memory data fabric we developed, have a completely different paradigm of how you store data and how you process data. Data is stored in exactly the same RAM where your processing is happening and, therefore, it’s not a disk. It’s actually byte-addressable storage. It has completely different characteristics that lead to this massive performance increase.

And the performance increase is truly massive. At the end of the presentation, I have a slide where I’m going to show you some of the metrics we achieved for one of our clients. But the difference between Flash and RAM is this dramatic performance increase of multiple orders of magnitude.

Back to the in-memory data fabric – the key architectural idea here, depicted on the slide, is that it will truly slide in-between your data sources and your apps. And this ability to slide in without asking you to replace your databases, as Jason mentioned, is key for us.

If you’re working in an organization, go ahead and think about what it would cost to replace, let’s say, Oracle, or MS-SQL, or DB2 from IBM. In many cases, it’s literally impossible. It’s often impossible to even change those systems. That’s why, basically, with the in-memory data fabric, if you have your existing database, keep it.

Don’t even touch it. You can layer this fabric on top – essentially deploy software on a cluster of computers, architecturally in-between your databases (we support them all, by the way: SQL, NoSQL, even Hadoop) and your applications. The fabric will take care of intelligently moving and storing data in RAM across multiple computers, away from the databases, and give you the ability to process data right in this layer.
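As a rough illustration of this layering idea, here is a minimal read-through/write-through cache sketch in plain Python. This is not GridGain’s or Ignite’s actual API; `SlowDatabase` and `InMemoryLayer` are hypothetical names standing in for the existing database and the fabric layer:

```python
class SlowDatabase:
    """Stand-in for an existing SQL/NoSQL store that stays untouched."""
    def __init__(self):
        self.rows = {"user:1": "Alice"}

    def read(self, key):
        return self.rows.get(key)

    def write(self, key, value):
        self.rows[key] = value


class InMemoryLayer:
    """Slides in between the app and the database: RAM is consulted first,
    and writes are pushed through to the underlying store."""
    def __init__(self, db):
        self.db = db
        self.ram = {}  # primary working copy lives in memory

    def get(self, key):
        if key not in self.ram:        # read-through on a cache miss
            self.ram[key] = self.db.read(key)
        return self.ram[key]

    def put(self, key, value):
        self.ram[key] = value          # served from RAM from now on
        self.db.write(key, value)      # write-through keeps the DB in sync


db = SlowDatabase()
fabric = InMemoryLayer(db)
print(fabric.get("user:1"))   # first read comes through from the database
fabric.put("user:2", "Bob")
print(db.read("user:2"))      # the database stays consistent underneath
```

The application talks only to the layer; the database underneath is never replaced, which is the “keep your database, don’t even touch it” point.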

Now the reason we call it a fabric – I’m going to switch to my next slide – is because we support multiple types of payloads, all types of processing on that in-memory layer. We essentially look at the memory more strategically than a lot of different projects and vendors.

For us, it’s not just a data grid, a compute grid, or caching. Those are just different use cases of what you can do if you take a strategic view of memory as your storage tier. When I present on this subject, I always keep telling my audiences that traditional computing is what I call a “disk first, memory second” approach. Your data storage is on disk, as we’ve typically been doing for the last 40 or 50 years, and we use the RAM to cache some of the most frequently accessed pieces of data. It’s a very logical approach, because if you’re constantly going back to the disk for exactly the same piece of data, why not store it locally, right?

And that’s what “disk first, memory second” means: disk is the primary storage. In in-memory systems such as the in-memory data fabric, where we take a strategic view of RAM, the RAM becomes the primary store of the data. And we use disk only for backup purposes.

This is the “RAM first, disk second” approach. Once again, we use RAM across multiple computers as a distributed, fault-tolerant, in-memory data store, and we use disks essentially as backup devices, just to make sure we’re backing up the data if you decide to do so.

So the name “fabric” is actually kind of important for us because, as I mentioned, the unique property of what we do is that we support multiple use cases, multiple types of payloads, on the fabric itself. The data grid, compute grid, service grid, in-memory streaming, even the Hadoop acceleration that we have – it’s all part of the fabric.

It’s a very nice unified approach where you don’t have to essentially cobble together apples, oranges, and cucumbers from multiple projects and try to integrate them – you get all of that in a cohesively developed product, with the same documentation, same learning curve, same configuration, and same management tools for all these different types of use cases.

Some of the key features of the fabric, which cut across all of the different sub-functions that we have, are on the slide. Definitely performance. We talked about it; it’s pretty obvious. When we’re talking about RAM and in-memory computing, performance seems to be an obvious thing.

What a lot of people don’t realize is what I mentioned – how dramatically better the performance of an in-memory system is, compared to other systems, like disk-based systems. Scalability – that’s actually much less intuitive. Many people don’t realize it, but let me give you a quick historical walk back here. What’s actually unique about in-memory systems goes back to day one, when we started doing in-memory computing as an industry in the early ’90s. If you remember, back in the ’90s we had 16-bit CPUs, and later on we had 32-bit CPUs.

There was not enough RAM on a single computer to do anything useful. So from day one, when we started building in-memory computing systems, we had to do distribution. We had to be able to basically link together multiple computers to get this kind of virtualized pool of RAM that was big enough to do anything useful.

So what you’ll find today is that in-memory computing systems are the most advanced distributed systems in existence. To a lot of folks, that’s very counter-intuitive. What does memory have to do with distribution? It has a lot to do with it, because from day one – not because we were smart enough, but because we were technically forced to – we’ve been doing distribution for the last 25 to 30 years, and that’s what basically makes systems like GridGain so advanced in how scalability is handled.

Just to give you a good example – our largest customer runs 2,000 nodes in a fully transactional topology, and I will challenge anybody to find many projects or products that can sustain that type of significant load in a fully ACID transactional topology, running transactions on 2,000 nodes.

That’s the level of scalability that you can get with systems like this. High availability, kind of a sister property to scalability, is handled well too. A lot of people are basically saying – well, if it’s in memory, what happens if I unplug my computer, since RAM is volatile?

Well, technically, that’s correct, but practically any system – and GridGain is no exception here – has fairly deep and rich functionality for ensuring that data is highly available. The data can be replicated across multiple nodes. The data can be stored on a disk locally if it needs to be.

There are multiple strategies for dealing with this. For example, GridGain even supports geographical data center replication, if you need that functionality. So the high availability is there. What’s also really unique about what we do, compared with a lot of other products, is that we support full transactionality.

Unlike a lot of NoSQL databases that only support eventual consistency, we support fully ACID transactions – exactly the same transactional behavior you’ll find in traditional databases. So when you move some of your logic from stored procedures and PL/SQL and things of that nature into the data fabric, you essentially have the same data consistency guarantees in terms of transactions.

And it’s very convenient, because the biggest problem a lot of people have when they move from databases to different types of storage is that the consistency model breaks, and you have to re-architect your system to deal with that. In our case, it’s exactly the same transactional model – both pessimistic and optimistic transactions are supported, and you can use either.
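To illustrate the pessimistic/optimistic distinction, here is a conceptual sketch in plain Python (not GridGain’s transaction API; `Account` and the two `deposit_*` helpers are hypothetical). Pessimistic transactions lock up front; optimistic ones read freely, then validate a version at commit time and retry on conflict:

```python
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.version = 0              # checked by optimistic commits
        self.lock = threading.Lock()  # taken by pessimistic transactions

def deposit_pessimistic(acct, amount):
    # Pessimistic: acquire the lock first, so no one can interfere.
    with acct.lock:
        acct.balance += amount
        acct.version += 1

def deposit_optimistic(acct, amount, retries=10):
    # Optimistic: compute without locking, validate the version at commit,
    # and retry if another transaction got there first.
    for _ in range(retries):
        seen_version = acct.version
        new_balance = acct.balance + amount
        with acct.lock:  # commit point: validate-and-swap atomically
            if acct.version == seen_version:
                acct.balance = new_balance
                acct.version += 1
                return True
    return False  # gave up after repeated conflicts

acct = Account(100)
deposit_pessimistic(acct, 50)
deposit_optimistic(acct, 25)
print(acct.balance)  # 175
```

Pessimistic mode pays locking cost even without contention; optimistic mode is cheap when conflicts are rare but may retry when they are not, which is why systems typically offer both.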

Persistence. Sometimes people ask me – what does persistence have to do with in-memory computing? Isn’t the whole point of memory not to have disk persistence? That’s actually a fairly wrong view. Let me mention that every in-memory computing system today has persistent storage.

Remember I told you about “RAM first, disk second”? This is the right way to look at in-memory computing. In-memory computing is not about eliminating disks. It’s about using disks for a different reason – for backup. And that’s exactly what we do here in the data fabric.

You can naturally have asynchronous, optimized persistence of your data on disk if you decide so. You can persist to disk. You can persist to the database. You can do all kinds of different things. And last, but not least, security. Our enterprise version has great built-in security, with a full audit trail, authentication, authorization, and all of those different things.
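The asynchronous persistence idea – RAM is the primary store, and a background writer drains updates to slower storage off the critical path – can be sketched conceptually like this. Plain Python, not GridGain’s API; `WriteBehindStore` is a hypothetical name, and a dict stands in for the disk or database:

```python
import queue
import threading

class WriteBehindStore:
    """RAM-first store: writes complete at memory speed, while a background
    thread asynchronously persists them to 'disk' (a dict stand-in here)."""
    def __init__(self):
        self.ram = {}
        self.disk = {}
        self.pending = queue.Queue()
        self.flusher = threading.Thread(target=self._drain, daemon=True)
        self.flusher.start()

    def put(self, key, value):
        self.ram[key] = value            # the write completes at RAM speed
        self.pending.put((key, value))   # persistence happens asynchronously

    def _drain(self):
        while True:
            key, value = self.pending.get()
            self.disk[key] = value       # slow I/O, off the critical path
            self.pending.task_done()

    def flush(self):
        self.pending.join()              # wait until all backups have landed

store = WriteBehindStore()
store.put("order:7", "shipped")
store.flush()
print(store.disk["order:7"])  # "shipped"
```

The application never waits on disk I/O; the disk copy is a backup that trails the in-memory truth by a small, bounded amount, which is the “RAM first, disk second” trade-off.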

So I want to give you a quick rundown on some of the key functional areas of the in-memory data fabric. We’ll talk about compute grids, data grids, service grids, streaming, and Hadoop acceleration – one of the key areas. That’s not all of them, but those are the key areas.

So, compute grid. What is the compute grid? The compute grid is all about parallelized processing. If I have 20 computers in my cluster and they have all this data in RAM, how can I compute effectively on that data? It’s actually a very interesting question, because we keep talking about Hadoop in terms of data storage.

But a big portion of Hadoop, for example, is MapReduce – the means to process the data you store. In an in-memory computing system, the compute grid plays exactly the same role. How do you process the data you already have? That processing can be fairly sophisticated and complex in many cases.

How do you parallelize this processing? How do you load balance? How do you fail over? We support a very advanced feature set when it comes to computing. We support multiple versions of map/reduce: an in-memory one, and one for Hadoop compatibility. We support a very cool zero-deployment technology, which makes development very effective without constantly redeploying to the cluster.

All your code changes get automatically deployed to the cluster as you run it, so you have exactly the same development workflow as if you were working locally on one computer. We have a task scheduler. We have state checkpoints for long-running tasks.

It’s a very cool feature. It came to us from multiple clients. If you have long-running tasks – and in bioscience, for example, tasks typically run for a very long time – you can checkpoint them. If a task fails and you want to restore it, it restores from the last checkpoint, not from the beginning.
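The checkpointing idea can be sketched in a few lines of Python. This is a toy illustration of the resume-from-checkpoint behavior the speaker describes, not the actual GridGain checkpointing API; the dictionary stands in for durable, cluster-wide checkpoint storage:

```python
# Conceptual sketch of task checkpointing: a long-running task records its
# progress, so after a failure it resumes from the last checkpoint instead
# of restarting from scratch.

checkpoints = {}  # stand-in for durable, cluster-wide checkpoint storage

def long_task(task_id, items, fail_at=None):
    # Resume from the saved position and partial result, if any.
    start, total = checkpoints.get(task_id, (0, 0))
    for i in range(start, len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("node crashed")
        total += items[i]
        checkpoints[task_id] = (i + 1, total)  # checkpoint after each item
    return total

items = list(range(10))
try:
    long_task("job-1", items, fail_at=7)  # fails partway through
except RuntimeError:
    pass
assert checkpoints["job-1"][0] == 7          # 7 items already done
assert long_task("job-1", items) == sum(items)  # resumes, doesn't restart
```

In a real system the checkpoint store must itself be replicated, and the checkpoint interval is a trade-off between overhead and the amount of work lost on failure.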

If your task runs a while, that’s a big improvement. We have advanced load balancing and automatic failover. We have full cluster management. It’s a very important feature, because when you run a computational task, you often need very specific visibility into your cluster – which nodes are available, which resources are available on those particular nodes, how they change over time – and you want your mapping logic, your load-balancing logic, to be conscious of those available resources.

All of this functionality is available to you. You, as a developer and engineer, can take a very hands-on approach to how your task actually executes in the cluster. We’re not hiding this functionality from you. It’s exposed to you as a developer.
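The parallelized-processing pattern described above can be sketched generically. This Python example uses threads as stand-in "nodes" to show the split/map/reduce shape of a compute-grid job; it is an illustration of the idea, not the GridGain/Ignite compute API:

```python
# Conceptual sketch of compute-grid-style map/reduce: split a job into
# tasks, run them in parallel across "nodes" (threads here), and reduce
# the partial results into one answer.
from concurrent.futures import ThreadPoolExecutor

def map_task(chunk):
    # Each "node" computes a partial result over its local chunk of data.
    return sum(x * x for x in chunk)

def reduce_results(partials):
    # The caller aggregates the partial results into the final answer.
    return sum(partials)

data = list(range(1, 101))
chunks = [data[i::4] for i in range(4)]  # 4 "nodes" in the cluster

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_task, chunks))

assert reduce_results(partials) == sum(x * x for x in data)
```

In the real system the map step is shipped to remote nodes with failover and load balancing handled by the grid, but the split/map/reduce structure is the same.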

Data grid. The data grid is actually fairly similar to the compute grid, but it solves a different problem. Where the compute grid solves the parallelization of computations, the data grid essentially lets you solve the problem of parallelizing data storage.

So the problem is, again, very similar to the compute grid. If I have a terabyte of data and I have 10 computers, each with 100 gigabytes of RAM, so the entire cluster has a terabyte, how do I store my terabyte of data on that terabyte cluster across 10 computers?

So naturally, the data has to be partitioned somehow. And once the data is partitioned – how do I do failover, how do I do load balancing, how do I do high availability – all of those myriad questions are handled by the data grid. The data grid is essentially one of the biggest parts of the in-memory data fabric.

Fundamentally, it’s the core layer on top of which almost everything else is built. How do you store data in this virtualized middle layer? Fundamentally, the in-memory data grid is an object-based, distributed key-value store. It’s built in Java, so it’s JVM based.

Any JVM language, like Java, Scala, or Groovy, works perfectly fine. We also, by the way, have non-JVM clients for .NET and C++ applications. But fundamentally it’s based on an object key-value store. We have both replicated and partitioned modes – two different ways to store data.

In replicated mode, each key-value pair is stored on every node for extremely high availability, at the cost of cluster capacity. In partitioned mode, each key-value pair is stored on only one node, plus backup copies for high availability if you need them. That gets you much better capacity utilization, and it’s typically the main mode of storing data in any cluster.
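The difference between the two modes can be sketched as a key-to-nodes mapping. This is a deliberately naive Python illustration (real data grids use far more advanced partition mapping, such as rendezvous hashing), not the actual affinity function:

```python
# Conceptual sketch of partitioned vs. replicated key-value placement.
# Partitioned mode: one primary owner per key, plus optional backups.
# Replicated mode: every node holds a copy of every key.

def partitioned_nodes(key, nodes, backups=1):
    """Primary owner plus `backups` following nodes for a key."""
    primary = hash(key) % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(backups + 1)]

def replicated_nodes(key, nodes):
    """Every node holds a copy of every key."""
    return list(nodes)

nodes = ["node-0", "node-1", "node-2", "node-3"]
owners = partitioned_nodes("user:42", nodes, backups=1)
assert len(owners) == 2                              # one primary + one backup
assert len(replicated_nodes("user:42", nodes)) == 4  # full copy on every node
```

With 4 nodes and 1 backup, partitioned mode uses a quarter of the memory per key that replicated mode does, which is why it is the default for large datasets.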

Sometimes people call this sharding, but it’s a much more advanced technique than that. Typically a data grid stores tens of terabytes. In-memory computing systems in general perform best with payloads limited to tens of terabytes. We’re not yet at the point of using these systems for hundreds of terabytes of data.

It’s not there yet from an economic standpoint, but we’ve seen plenty of customers and users who run payloads from sub-terabyte to the low tens of terabytes. That’s where the sweet spot is. Being JVM-based, Java-based software, we support both on-heap and off-heap storage – very important.

We utilize the entire hardware capacity available on a particular node – not only what’s managed by Java, but everything beyond Java as well. As I mentioned, one of the biggest things we have in the data grid is full ACID transactions.

The same transactions you have in databases – a big, big deal. It’s one of the most complex parts to implement, and we have it. We also have full SQL support. So once the data is in the data grid, you can access it by key-value, or you can access it through standard SQL.

You can literally issue ad hoc SQL queries. If you don’t know the keys up front, you can run a SQL query to get the data out. Our SQL supports distributed joins as well. So we support full SQL ’93, and we have pretty cool functionality on top of SQL, like custom SQL functions, so you can develop in Java and other languages.
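The dual access pattern – key-value puts plus ad hoc SQL over the same data – can be illustrated with Python's in-memory SQLite as a stand-in for the data grid's SQL engine (the table and column names here are invented for the example):

```python
# Conceptual sketch: data stored by key is also queryable with ad hoc SQL
# when the keys aren't known up front. SQLite in memory stands in for the
# data grid's SQL engine; this is not the Ignite API.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE person (key TEXT PRIMARY KEY, name TEXT, age INTEGER)")

# Key-value style puts...
people = [("p1", "Alice", 34), ("p2", "Bob", 19), ("p3", "Carol", 41)]
db.executemany("INSERT INTO person VALUES (?, ?, ?)", people)

# ...and an ad hoc SQL query over the same data, no keys required.
rows = db.execute(
    "SELECT name FROM person WHERE age > 30 ORDER BY name"
).fetchall()
assert rows == [("Alice",), ("Carol",)]
```

In the distributed case the query is fanned out across partitions and the results merged, but the programming model is the same.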

Another interesting aspect of what we do – and this goes to the message of the fabric – is that we provide tight integration between the compute grid and the data grid, and it’s only possible because of that integration. What it allows you to do is very smart co-location – we call it affinity co-location – meaning there is affinity between a computation and the data that computation needs.

It’s kind of a deep topic, but it’s an extremely important consideration in distributed systems, because in distributed systems you want to avoid network traffic. You want to avoid any data movement unless it’s absolutely, minimally needed. And the ability to co-locate – essentially to send the computation to the node where the data for that particular computation lives – is the key to a distributed system performing well.

We have that in automatic mode, and there’s tight integration between the compute grid, which manages the computations, and the data grid, which manages the data distribution. Because they’re fully integrated, you gain this co-location ability in a very simple way. It’s a very important consideration.
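Affinity co-location can be sketched like this. The node names, routing function, and data layout below are invented for illustration; the point is only the shape of the idea – ship the computation to the data's owner so that only the small result crosses the network:

```python
# Conceptual sketch of affinity co-location: instead of pulling data to
# the computation, send the computation to the node that owns the data.

nodes = {
    "node-0": {"acct:1": 100, "acct:3": 300},
    "node-1": {"acct:2": 200, "acct:4": 400},
}

def owner_of(key):
    # Affinity function: map a key to the node that stores it.
    return "node-0" if int(key.split(":")[1]) % 2 == 1 else "node-1"

def run_colocated(key, fn):
    # "Ship" the closure to the owning node and run it on local data;
    # only the small result travels, never the data itself.
    node = owner_of(key)
    return fn(nodes[node][key])

assert run_colocated("acct:3", lambda balance: balance * 2) == 600
```

The crucial property is that `owner_of` is the same function the data grid used to place the data in the first place, so routing is deterministic and requires no lookup traffic.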

In-memory service grid. Yet another component in the fabric. And by now you can probably get a sense of what the fabric really is: the combination of these functional areas in one cohesive product, which is great. So, the service grid. We’ve heard from our customers – they’ve been asking us constantly, for a couple of years now – look, guys.

We have GridGain or Apache Ignite running, and we just need one or two services that will run robustly on this cluster. So if a node crashes, the service automatically restarts on some other node – essentially failing over.

And that’s exactly what we built. It’s a simple piece of functionality, but I’m surprised how widely it’s used. And that’s great stuff. So you’re building your applications, your systems. You have many of these micro-services that you want to run on the cluster, and you typically want to run one or two instances of each.

Without this functionality, you’d have to write quite a bit of code to support all of this availability management – restarting, starting, the whole lifecycle. With the service grid, you have an absolutely brilliant and simple API: literally a few lines of code and configuration – as a matter of fact, sometimes nothing to code at all – and you can run your service right on the grid. We’ll take care of the entire SLA: maintain it, monitor it, start and restart it as necessary. Great functionality, used quite a lot.
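The cluster-singleton guarantee described here can be sketched as follows. The class and method names are invented for illustration and the failure handling is deliberately simplistic; the real service grid adds leader election, deployment SLAs, and monitoring:

```python
# Conceptual sketch of a service grid's singleton guarantee: the cluster
# keeps exactly one instance of a service alive, restarting it on another
# node when its host fails.

class ServiceGrid:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.placement = {}  # service name -> hosting node

    def deploy_singleton(self, service):
        # Place the service on the first available node.
        self.placement[service] = self.nodes[0]

    def node_failed(self, node):
        # Fail over: restart affected services on a surviving node.
        self.nodes.remove(node)
        for service, host in self.placement.items():
            if host == node:
                self.placement[service] = self.nodes[0]

grid = ServiceGrid(["node-0", "node-1", "node-2"])
grid.deploy_singleton("billing-service")
assert grid.placement["billing-service"] == "node-0"
grid.node_failed("node-0")                             # host crashes...
assert grid.placement["billing-service"] == "node-1"   # ...service survives
```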

Yet another functional area is streaming and complex event processing, and it’s also part of the fabric. Again, it reinforces the idea behind a fabric: a combination of multiple types of payloads and use cases you can run in RAM. Streaming is a very exciting use case.

We’re starting to see more and more emphasis on this, from Apache Ignite and from different projects across the Apache ecosystem. Streaming is very different from a traditional database. In databases, you always have a finite set of data. If you look at your table, there’s always a beginning and there’s always an end.

You can query the table with SQL or any other means, but the data is finite. It can be large, but still, there’s a beginning and an end. What’s really unique about the streaming scenario is that there’s no beginning and there’s no end. There’s always a stream. The data is always coming in.

Because of that, different processing paradigms exist for streaming. The major one is the sliding window. You typically process streams over a sliding window. The sliding window can be defined by you as the developer, but it’s typically the last N events, the last five minutes, the last hundred transactions, and so on. You constantly run queries or processing on that sliding window, and you’d be surprised how non-trivial this can be, because it’s extremely hard to maintain this kind of system when you have a never-ending stream. There’s no way to buffer it all up.

There’s no way to store it somewhere, because it’s never-ending. It’s literally endless. So the entire system has to be built with this high performance in mind. There can’t be a single bottleneck, because a system like this is going to be as slow as its slowest component.
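The sliding-window paradigm itself is simple to sketch. This minimal Python example shows the "last N events" window the speaker describes, with an aggregate continuously evaluated over only the window (the event values are made up for the example):

```python
# Conceptual sketch of sliding-window stream processing: keep only the
# last N events and continuously evaluate a query over that window,
# since an endless stream can never be buffered or stored in full.
from collections import deque

window = deque(maxlen=5)  # "last N events" sliding window

def on_event(value):
    window.append(value)  # old events fall off the far end automatically
    return sum(window) / len(window)  # the query runs over the window only

averages = [on_event(v) for v in [10, 20, 30, 40, 50, 60, 70]]
assert list(window) == [30, 40, 50, 60, 70]  # only the last 5 survive
assert averages[-1] == 50.0                  # mean of the current window
```

Time-based windows ("last five minutes") work the same way, except events are evicted by timestamp rather than by count.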

So with the emphasis on in-memory processing – with the in-memory data grid as an extremely high-performance storage layer, and our compute grid as a fully parallelized computational engine – we added what we call continuous querying capabilities on the sliding window.

And that’s really great, because it’s a very simple, very intuitive API. If you have a sliding window, you can register a query with it – SQL, for example. You’re then going to get updates from this query – not just the one result you’d get in a traditional database application, but a continuous callback with the results from that continuous query.

That gives you a very simple, very effective paradigm for working with streaming data. Once you have those results, you can store them onward in a data grid, or process them, or do whatever you need. Great functionality, very simple to use.
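The continuous-query pattern can be sketched in a few lines. This is an illustration of the register-once, callback-per-update idea, not the actual Ignite `ContinuousQuery` API; the class and predicate here are invented for the example:

```python
# Conceptual sketch of a continuous query: register a predicate once and
# receive a callback for every future update that matches, instead of a
# single one-shot result set.

class ContinuousQueryCache:
    def __init__(self):
        self._data = {}
        self._listeners = []  # (predicate, callback) pairs

    def register(self, predicate, callback):
        self._listeners.append((predicate, callback))

    def put(self, key, value):
        self._data[key] = value
        for predicate, callback in self._listeners:
            if predicate(key, value):
                callback(key, value)  # continuous result delivery

cache = ContinuousQueryCache()
matches = []
cache.register(lambda k, v: v > 100, lambda k, v: matches.append((k, v)))
cache.put("t1", 50)    # ignored: predicate not met
cache.put("t2", 150)   # delivered to the callback
assert matches == [("t2", 150)]
```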

You can get up and running with streaming applications literally in a matter of a few hours once you download Apache Ignite. Great stuff, obviously. It’s fully distributed, fully scalable, and it has all the benefits of the underlying in-memory data fabric.

One last thing I want to talk about is the in-memory Hadoop Accelerator. I’m not going to spend too much time – I don’t have too much time today – but it’s a great feature, again, of our fabric. If you know anything about Hadoop in general, there are two subsystems there: HDFS for storing the data, and MapReduce for processing it.

What we’ve done in Apache Ignite is develop the first in-memory file system fully compatible with HDFS, which can work on top of HDFS as a cache. And we also developed – and this is the really neat part – a full re-implementation of MapReduce based on in-memory principles, fully compatible with the original MapReduce through the YARN infrastructure in Hadoop. So bottom line – it’s completely plug and play.

Nothing to change. You don’t have to change code for any of your Hive or MapReduce jobs. You don’t have to migrate data anywhere. Keep it in Hadoop. Keep your jobs running. Just install Apache Ignite and configure your MapReduce jobs – literally a one-line change – and you get tremendous performance increases.

Just to give you a quick overview: if you run the Pi calculation example that ships with Hadoop – just a basic example – with Ignite and then without Ignite, with just a few configuration changes, you get about a 30 times performance increase – three, zero.

Basically, you automatically get that performance increase with absolutely zero code change. For the last slide – since I’m a little bit out of time, I’m going to skip ahead – I want to talk about a use case from Sberbank.

That will give you a sense of what in-memory computing can do. I’m not going to bore you with the details. We worked with them a couple of years ago and closed the deal. It’s a very large bank in Europe – one of the largest banks in Europe.

They had a very traditional use case of portfolio risk analytics – basically recalculating risk on each change in the market. Just to give you a sense of what we were able to accomplish: on 10 commodity blades – 10 Dell R610 blades, literally commodity hardware – with a total capacity of about a terabyte of RAM across the cluster.

We were able to achieve a billion transactions per second. Now, think about that for a second. It’s a billion with a B – fully ACID transactions per second, in a financial application, on a hardware installation that cost less than $25K, less than the cost of a new car here in California.

That’s what’s unique about in-memory computing in general. It’s less about GridGain, though it definitely was the key software here, but this should give you an idea of what is possible with fast data software. That’s what fast data is. You can get to numbers like a billion transactions per second, literally, on a cheap hardware setup with the right type of software.

Dane, I think this is my last slide, and we can definitely open up the Q&A for whatever audience we have left.

Dane Christensen:
All right. Fantastic. And thanks a lot for that, Nikita. That was excellent detail there. We do have some questions, but we only have time for a few, so let me jump right into the first one. I think these questions are pretty much going to be for Nikita.

I mean, I do have some questions that would be excellent for Jason, but let’s go ahead and handle the more technical ones first. So here’s someone who asks – what RDBMSs are supported by this product? What about write-through DB performance?

Nikita Ivanov:
Any SQL-based, any JDBC or ODBC database is supported. We don’t have any specifics; typically it’s your Postgres, Oracle, MySQL, MS SQL. The write performance is whatever your JDBC driver delivers, so we don’t really affect that. We have an option to do the write-through synchronously or asynchronously.

So obviously, in synchronous mode you get full transactionality carried over from memory to the database, but you lose some performance because of the synchronous execution. In asynchronous mode – it’s called asynchronous write-behind – you don’t really take a performance impact, but you get a certain delay between the data in RAM and the database. But fundamentally, we support any standard SQL database.
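The synchronous versus asynchronous trade-off described here can be sketched as follows. This is a conceptual Python illustration of write-through versus write-behind (the class is invented for the example; in a real system a background thread drains the write-behind queue):

```python
# Conceptual sketch of write-through vs. write-behind persistence.
# Write-through: every put hits the database synchronously.
# Write-behind: puts are buffered and flushed later, trading a small
# replication delay for much faster writes.

class PersistentCache:
    def __init__(self, write_behind=False):
        self.ram = {}
        self.database = {}  # stand-in for the backing RDBMS
        self.write_behind = write_behind
        self._pending = []

    def put(self, key, value):
        self.ram[key] = value
        if self.write_behind:
            self._pending.append((key, value))  # flushed asynchronously later
        else:
            self.database[key] = value          # synchronous write-through

    def flush(self):
        # In a real system a background thread drains this queue.
        for key, value in self._pending:
            self.database[key] = value
        self._pending.clear()

sync_cache = PersistentCache(write_behind=False)
sync_cache.put("k", 1)
assert sync_cache.database == {"k": 1}  # DB updated immediately

async_cache = PersistentCache(write_behind=True)
async_cache.put("k", 1)
assert async_cache.database == {}       # DB lags behind RAM...
async_cache.flush()
assert async_cache.database == {"k": 1} # ...until the flush
```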

Dane Christensen:
Okay. Great. All right. Here’s another one. Can organizations and companies use this in-memory fabric for data analytics? How is it different from, and better than, something like Apache Spark? We get that question pretty routinely.

Nikita Ivanov:
Yeah, Spark is an awesome project. I think the difference here is pretty evident. Spark is great for what I call interactive data science – a classic data science approach where you have a human being sitting in front of a computer, typing queries, seeing results, analyzing them, thinking about them, and massaging them.

That’s the typical data science approach. GridGain or Apache Ignite can definitely be used for analytics, but it’s really geared toward machine-to-machine, real-time analytics. So basically, if you want to connect a machine or a process to another process, GridGain is definitely a much better system – it was designed for it – while Apache Spark is great for that kind of interactive, human-driven analytics. Not to mention that GridGain supports transactions, computational features, and things like that – that’s not the focus for Spark. It doesn’t support that at all.

Dane Christensen:
Okay. Super. Thanks for that. Here’s another one. Someone is asking about SAP HANA. They say SAP HANA uses flash memory, which improves the performance of data processing. What does Ignite do and provide that SAP HANA doesn’t?

Nikita Ivanov:
Again, Ignite is about in-memory performance, not flash-based performance, and we’re going to outperform SAP HANA dramatically – but again, not because we’re smarter than the SAP people, but because we do things differently. SAP HANA is predominantly a kind of cross between a columnar and a row database, and that’s where they concentrate.

It’s essentially a SQL-based database. The fabric is different because of its scope. We have streaming support. We have computation support. We have dedicated data grid support, and SQL is just one of the options we have. So that would be the difference. And again, performance will be the key difference as well – performance and scalability.

Dane Christensen: Okay, great. I think this question is probably better for Jason – I definitely want to get Jason back in. So here’s another question: we’re investigating Hadoop – they mention, Jason – as an alternative to a data warehouse, but you’ve said that it won’t save the world.

Why do you say that, and what are the implications for data management?

Jason Stamper:
OK. Yes. That’s a very good question. The reason I say that is that Hadoop really is set up and designed for relatively low-cost storage. And I say "relatively" in quotes, because it might run on commodity hardware, but the skills required to start analyzing and getting value from that data aren’t cheap.

So I say relatively low cost. Hadoop is designed as a relatively low-cost storage platform, but it’s not going to give you the very rapid, real-time access to data that more and more companies are looking for – unless you use something like GridGain on top of it, which gives you that caching ability to really help you scale up and get the performance you need.

We see Hadoop absolutely continuing to grow, because with the internet of things, more and more companies want to store all sorts of data before they’ve really worked out what they want to do with it, and Hadoop is great for that, because you can just put it in there.

It’s relatively cheap storage, and then you can start to do some data wrangling on it and work out what’s valuable and what isn’t. But it’s not replacing the traditional data warehouse – the Teradatas, and the IBMs, and the Oracles, and the Microsofts – because it doesn’t offer that very rapid, real-time analysis.

It doesn’t offer those joins between different types of data that you need in order to get, for example, a single view of the customer, or a single view of your logistics chain, or whatever it is. And that’s the reason that those companies – Teradata probably being the gorilla in the warehousing space – are still doing well.

They’re not slowing down. They’re still growing. If Hadoop solved everybody’s problems – and it’s pretty much free to download – those companies would be dying very rapidly, and they’re not. So there’s definitely a difference in terms of what we do with data on those different platforms.

Dane Christensen:
Thank you, Jason, and thank you, Nikita, for all that excellent information. We’re right up against the end of our hour here. Just before we close out, I did want to make everyone aware of the In-Memory Computing Summit 2015, coming up on June 29th and 30th this year. This is the first-ever conference dedicated to in-memory computing. GridGain is one of the sponsors, along with other companies like NoSQL, Data Tora, SanDisk, and others.

So mark your calendar for that date. Go check it out at www.IMCSummit.org, and we’ll hope to see you there. Otherwise, everyone, thank you very much for joining the webinar today, and have a great rest of the day.