The Emergence of Converged Data Platforms and the Role of In-Memory Computing
Organizations today typically have heterogeneous IT infrastructures which include a variety of database technologies and a wide variety of applications drawing on that data. Many organizations are dealing with massive and rapidly growing amounts of data with end users requiring immediate access to this real-time big data for both transactional and BI applications. Technology providers are responding to these evolving needs by moving towards a converged data platform which includes:
- A distributed grid/cache
- NoSQL database integration
- Relational operational database integration
- Analytic database integration
- Hadoop integration
- Stream processing
In this webinar, Matt Aslett from 451 Research will discuss the drivers behind the need for a converged data platform and the current state of the evolution of these solutions.
Nikita Ivanov from GridGain Systems will then discuss the role of in-memory computing within the converged data platform and describe example use cases.
Hello, everyone. My name is Alisa Baum, and I’m GridGain’s director of product marketing. We’ll begin in just a moment, but first I’d like to conduct a bit of housekeeping. Can you please raise your hand using the hand icon located in the GoToWebinar control panel to let me know if you can hear me? And let me take a look. Okay, great. I see hands.
Next, during this webinar, you will be on mute. Should you have any questions during the discussion, please enter them in the questions field within the control panel. At the end of the webinar, we will take time to answer as many questions as possible, and those that aren’t addressed will be answered in a follow-up blog entry on GridGain’s blog. In addition, a recording of this webinar will be made available to everybody within 48 hours. I would like to thank you for attending today’s webinar, The Emergence of Converged Data Platforms and the Role of In-Memory Computing. It’s being presented by GridGain CTO Nikita Ivanov and 451 Research Director Matt Aslett, and with that said, I’ll turn the floor over to Matt. Go ahead, Matt.
Thanks, Alisa, and thank you everybody for joining us on today’s webinar. So as Alisa said, I’m Matt Aslett. I’m research director for data platforms and analytics at 451 Research, and I’m gonna kick off the webinar with just an overview of the landscape as we see it in terms of converging data platforms, and I’ll touch on the role of in-memory in that. Before we get into the presentation itself, just very briefly to introduce you to 451 Research, for anybody who has not come across the company before: we are an IT research and advisory company focused particularly on innovation and emerging technologies.
We were founded in 2000. We’ve got 250 employees, including over 100 analysts, and over 1,000 clients, including technology and service providers, corporate advisory, finance, professional services, and IT decision makers. The final data point from this slide I’ll just point out is the last one, 50,000 plus IT professionals, business users, and consumers in our research community. These are not necessarily clients of 451 Research, but they are IT practitioners, people out there using software, hardware, and services that we are engaging with in terms of doing surveys and interviews, and they increasingly help shape our view on the market, so a very important part of 451 Research going forward.
So onto today’s presentation, and what we’re talking about today is converged data platforms, and this is something we’ve been looking at over recent months within 451 Research, particularly looking at the way in which some of the emerging data platforms have the potential to converge, or in some cases already are converging, with some of the existing data platforms, and how organizations are looking at their overall data management, data platform estate, and thinking about that as more of a cohesive whole. As you can see, some of the core technologies we’re talking about here are distributed grid/cache technologies, NoSQL, relational operational databases, analytic databases, Hadoop, and stream processing, and then the last one, which is a little bit of an outsider and more of a long term perspective, is containerization.
We sort of threw that in because we think it will have a significant role to play, and it certainly has some interesting connotations in terms of the convergence of the overall data platforms. We started to think about how this convergence is evolving. Historically, obviously, the data platform landscape has been dominated by relational operational databases and analytic databases. They make up the bulk of the market and we can expect that to continue, but what we’ve observed over recent years is that some of these technologies have been around for some time, but they’re getting more and more important in terms of enterprise deployments and in terms of providing an alternative, particularly for new application development projects. So stream processing is obviously about real time.
That obviously is a significant driver for distributed grid/cache. With NoSQL, obviously there’s a bunch of drivers around that and we could do a whole separate presentation on that, as we could on Hadoop and some of the other things here. And obviously, as I mentioned, containers are more of a long term, potential impact. And what we’ve seen, obviously, is that there is a great potential being driven by some of these emerging new platforms. So what’s represented on this chart is data from our 451 Research Market Monitor, through which we provide revenue estimates and growth estimates for all of the products and technologies that we cover within the data platforms and analytics team.
And so here we’re looking at data related to just the core emerging data platforms that I mentioned, so Hadoop, stream processing, distributed grid/cache, and NoSQL, and as you can see, combined, those represent revenue of about $3 billion in 2015, but we can see that growing towards $14 billion in 2020. So there’s $11 billion of sort of – the term we use is value creation, of revenue being generated by these emerging and new data platforms across the next five years. Of course, as I said, that market as a whole will continue to be dominated by some of the incumbent platforms, relational and analytic databases, but there is a clear value being generated by these new data platforms, particularly for new applications. So here we’re looking at the revenue being generated by each of those segments individually.
What is more interesting, of course, from our view, and when you talk about converged data platforms, is the potential that they have in combination, and what we clearly see is there are integration points between these emerging technologies. There are various integration points, but that’s really only half the story. I mean running a distributed grid/cache alongside or on top of the relational database, running Hadoop alongside, it’s interesting, and that’s where we see a lot of enterprises today. They’re looking at all these individual platforms.
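As a rough back-of-the-envelope check on those figures, growth from about $3 billion in 2015 to about $14 billion in 2020 implies a compound annual growth rate somewhere in the mid-30-percent range. The sketch below shows the arithmetic only; it assumes five compounding periods and treats the two rounded figures quoted in the talk as exact, and the function name is purely illustrative.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value and an end value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Rounded figures quoted in the talk, in billions of US dollars.
start_2015 = 3.0
end_2020 = 14.0

value_created = end_2020 - start_2015            # the "$11 billion" of value creation
cagr = implied_cagr(start_2015, end_2020, years=5)

print(f"Value creation 2015-2020: ${value_created:.0f} billion")
print(f"Implied compound annual growth rate: {cagr:.1%}")
```

Because the inputs are rounded, the result is only indicative of the order of magnitude, not a figure from the underlying research.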
They’re exploring how they can use them for individual applications, and they’re looking at how they can integrate them. The more interesting story perhaps in the long term is not how they can be integrated [coughing] – excuse me. Sorry. Frog in my throat. As I say, the more interesting view perhaps in the long term is not just how they can be integrated, but how enterprises come to think of them more as a cohesive whole, and that’s something we’ve seen in conversations we’ve had with enterprise clients over recent years. The way they’re looking at their long term investments in their data platforms and their data landscape is in terms of some of these technologies coming together to create at least a kind of logical whole [coughing].
I’m very sorry about this. I’ll do my best to stop coughing. Just pause for a second. Okay, so we’ve been covering this, as I said, for a number of months and we looked in particular towards the end of last year at this whole convergence issue and we published a series of reports, and you can see three reports that we published at the time, looking at how these technologies have the potential to come together. There’s a URL there, bit.ly/451converged, and if you’re a 451 client, you can go there and you can get to part one and you can read it through, click through to two and three. If you’re not a 451 client, if you go to that address, you can still sign up to have access and to go on a trial.
So there’s a lot more detail obviously than we’re covering in this webinar in terms of how we see this evolving, and so if you are interested, do visit that URL, and you can find out more. The fundamental issue is perhaps why do we see all these converging? I mean there’s some technological issues behind that, but actually it’s really about business issues, as it often is, and we’ll see, there’s three core drivers that are pushing this convergence of these data platforms. The need to accommodate different data types, different data formats, and data from different applications and workloads.
The need to drive operational efficiencies, so clearly there are investments in existing technologies. There are investments being made in emerging technologies, and obviously organizations wanna be as operationally efficient as they can, and bringing some of those capabilities and functionality together is one way of doing that. And then lastly it’s about enabling variable workloads. So I think this is still something that’s evolving, but gone are the days where there was a fundamental separation of perhaps transactional and analytical workloads.
There’s still a time and a place for separation of those workloads for individual applications. What we see is a lot more organizations wanna get faster insight, faster analytics on their operational data. And so that business driver is driving the convergence of the technology to deliver both of those capabilities, and add into that the different formats, different data types, and the operational efficiencies, and that’s when you begin to see this larger convergence across the different segments of data platforms that we illustrate here. To go into each of those in turn, I’ll talk in a little bit more detail about how we see this convergence playing out.
So we’ll start with NoSQL, and the reason we start there is because we’ve seen convergence within the NoSQL space, if you like. We saw that polyglot persistence initially drove the expansion of the database market within NoSQL, so specialist databases and multiple data models. Document databases, graph databases, wide column stores, key value stores, and there was a lot of interest obviously in using multiple databases to support an individual application, and in order to take advantage of those multiple data models. And what we have seen is that having multiple databases can lead to operational complexity and inflexibility driven by interdependence of those multiple databases and so we’ve seen a shift towards multi-model, NoSQL databases, multi-model enabling the flexibility of polyglot persistence and being able to support multiple models without the operational complexity that is required by supporting multiple databases. So that’s, you know.
Even within the NoSQL space, we’ve seen some convergence. Obviously when you think about NoSQL as it relates to relational SQL databases, there’s clearly some convergence going on there as well. We’ve seen relational databases adding characteristics of NoSQL databases, particularly in terms of key value access and also being able to store JSON documents. We’ve also seen the NoSQL database vendors adding support for SQL or, at the very least, SQL-like query languages in order to take advantage of integrating with existing BI tools and supporting languages that people have experience in and that are familiar to them. And we’ve also seen what we would describe as multi-model, multimode databases emerge that support not just a combination of NoSQL and SQL, especially JSON and key value, but also other specialist workloads.
So things like time series data, things like geospatial analysis and queries. And so the relational database has evolved over many years to take on support for things like objects and XML, and we see that continuing in terms of support of JSON and time series and other approaches, and we obviously see the NoSQL databases heading in the same direction. Analytic and relational operational database convergence: we’ve already touched on this a little bit. Obviously, for the history of the database market to date, we’ve seen this typical separation of operational and analytic database workloads.
In some cases, it still makes absolute sense to do so, but there are use cases increasingly where organizations want to query their operational data in real-time, and what we’ve seen is also the emerging database vendors taking advantage of in-memory and advanced processing performance to deliver that combined operational and analytic processing. Historically, it just hasn’t been possible to support operational and analytic workloads in the same database by and large because of the performance hit you took when trying to do both at the same time. Distributed grid/cache technologies obviously have been around for many years.
Historically, we’ve seen them deployed in combination with, but separately to, a relational database, specifically to provide a non-persistent, distributed data grid layer that sits above the underlying database. What we’ve seen is some of the in-memory databases have added the ability to act as an in-memory cache, while some of the data grid and cache technology providers have added durable database capabilities, specifically GridGain and Pivotal with GemFire, for example, and so the lines are blurring between what is a database and what is a data grid or a data cache. And another thing we’ve seen, specifically within financial services as it happens, but I’m sure there are other examples in other industries, is increasingly organizations that have made a significant investment in in-memory distributed data grid technologies seeing that as their primary data layer.
Even if they’re still running that alongside a relational database, disk-based database, that becomes kind of the back end data store and the in-memory distributed grid becomes, for them, the primary data layer and can provide a layer that crosses multiple data platforms in the background. Hadoop is obviously a key data platform for the future, as we see it, and by that, I mean Hadoop and Spark and the whole Hadoop ecosystem. And clearly that is a primary area of focus for convergence and we laid that out here starting from the top in terms of data grids and cache, Hadoop convergence.
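The idea of an in-memory layer sitting above a disk-based database as the primary data layer can be sketched as a read-through, write-through cache. This is a minimal single-process illustration, not any vendor’s implementation: the class names `BackingStore` and `InMemoryLayer` are hypothetical, and a plain dictionary stands in for the disk-based database.

```python
from typing import Dict, Optional


class BackingStore:
    """Hypothetical stand-in for a disk-based database (a dict here)."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}
        self.reads = 0  # counts how often the "slow" store is actually hit

    def load(self, key: str) -> Optional[str]:
        self.reads += 1
        return self._rows.get(key)

    def save(self, key: str, value: str) -> None:
        self._rows[key] = value


class InMemoryLayer:
    """Applications talk to this layer; the store becomes the back end."""

    def __init__(self, store: BackingStore) -> None:
        self._store = store
        self._cache: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        # Read-through: serve from memory, fall back to the store on a miss.
        if key in self._cache:
            return self._cache[key]
        value = self._store.load(key)
        if value is not None:
            self._cache[key] = value
        return value

    def put(self, key: str, value: str) -> None:
        # Write-through: update memory and the back-end store together.
        self._cache[key] = value
        self._store.save(key, value)


store = BackingStore()
store.save("acct-1", "100")
layer = InMemoryLayer(store)

first = layer.get("acct-1")   # miss: loaded from the backing store
second = layer.get("acct-1")  # hit: served from memory
print(first, second, "store reads:", store.reads)
```

A real data grid would also distribute the cached entries across many machines and handle eviction and consistency, which this sketch leaves out.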
We’ve seen Apache Geode, Apache Ignite become part of the Apache big data ecosystem, if you like, part of which is driven by the ability to integrate with Hadoop and provide caching and data grid capabilities on top of the Hadoop distributed file system. From an operational database perspective, we see convergence. Clearly NoSQL databases have had a role to play alongside Hadoop for a long time. Apache HBase, for example, Cassandra, MapR-DB and others including MongoDB and we’ve also seen very much earlier stages, but the emergence of some interesting efforts to combine the benefits of a relational operational database with Hadoop and Hadoop distributed file systems, so companies like Splice Machine, projects like Trafodion, and LeanXcale, another company there.
Obviously a lot more activity to date has taken place in terms of analytic database and Hadoop convergence, clearly well suited for, at the very least, integration, and we see that going forward with increased convergence as well. It started with kind of connectors, pretty simple. Get your analytic database talking to your Hadoop cluster. Be able to move data in and out as required. Moving on from there, we saw a lot of investment in recent years in SQL-on-Hadoop, more about bringing the skills and the expertise of the people and the tools that surround that analytic database to the Hadoop ecosystem, to the Hadoop environment, to enable organizations to actually be querying data in Hadoop using the same skills and the same tools that they have previously used to query their data warehouse.
Still initially sort of separately, but with federated query, we see that taken to kind of the next step, which is about organizations looking to query both the analytic database, the data warehouse, and the Hadoop environment at the same time with the same query. The last point here is about columnar storage within Hadoop itself. Again, this is very early stages, but it’s interesting, if you look at Apache Kudu and especially when you think about the potential combination of that with SQL-on-Hadoop, that what you’ve got there is basically a SQL query engine and a columnar storage engine. Effectively, what we would previously think about as being an analytic database, but it clearly exists in the Hadoop environment, so again, this is clearly an area where we’re gonna see increased convergence over time of some of the core technologies and some of the foundational capabilities. Stream processing, again.
Obviously event stream processing and complex event processing have been around for many years, and traditionally we’ve seen those as niche technologies. Again, used primarily in financial services, or where there was a real requirement for low latency data processing, high performance applications. What we’re seeing is that in other industries, there’s a growing interest in and growing understanding of the need for more frequent analysis of real-time data streams, and that’s pushed stream processing into the mainstream. We see again that a lot of those projects are part of the Apache community, and therefore have tight integration with Hadoop, and a lot of organizations are exploring what some refer to as the lambda architecture, but however you want to describe it, it’s real-time and batch processing in a single environment. And so clearly a lot of scope for increased convergence there.
And lastly, containerization, as I said. This is more of a sort of longer term impact, but we do see some potential and some interesting opportunities here. There’s emerging projects like Flocker from ClusterHQ, which enable more persistent data storage within a dockerized environment, or containerized environment. There’s the Apache Myriad project, which enables YARN to run elastically in dockerized containers on shared data infrastructure resources managed by Mesos, so that’s more about Hadoop itself running within a containerized environment. We also are seeing Docker support in YARN itself, so within Hadoop, enabling some of the workloads to run as containers within Hadoop, managed by YARN, and the advantage there being about isolation of resources and preventing some of the issues that can occur when you’ve got different processing engines that are competing for resources within Hadoop itself.
So these are some of the key areas that we see that are coming together, and as I said, our research and focus on this is driven not just by what some of the vendors are doing and what’s happening with the technology convergence in terms of the open source projects, but also by the discussions that we’re having with some of our enterprise clients about how they are increasingly viewing their landscape as a single, logical whole, particularly for, as I said, new application deployments. What we expect to happen over the long term is increased convergence driving through with those deployments, and then obviously, in the longer term lifecycle, some of the existing data processing infrastructure being retired in the natural course of its life cycle.
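The combination of real-time and batch processing in a single environment, which some call the lambda architecture, can be illustrated with a toy counter. This is only a sketch under simplifying assumptions: everything lives in one process, the class name `LambdaViews` is hypothetical, and a list stands in for the durable master dataset that a batch system such as Hadoop would hold.

```python
from collections import Counter
from typing import List


class LambdaViews:
    """Toy lambda architecture: a batch view plus a real-time (speed) view."""

    def __init__(self) -> None:
        self.master_log: List[str] = []  # append-only record of every event
        self.batch_view: Counter = Counter()
        self.speed_view: Counter = Counter()

    def on_event(self, key: str) -> None:
        # Real-time path: record the event and update the speed view at once.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self) -> None:
        # Batch path: recompute the view from the full log, then discard the
        # speed view because the batch view now covers those events.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key: str) -> int:
        # A query merges the (older) batch view with the (recent) speed view.
        return self.batch_view[key] + self.speed_view[key]


views = LambdaViews()
for page in ["home", "home", "pricing"]:
    views.on_event(page)
views.run_batch()        # periodic batch job
views.on_event("home")   # arrives after the batch job ran
home_count = views.query("home")
print("home views:", home_count)
```

The point of the design is that queries stay current between batch runs, while the batch recomputation remains the source of truth.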
And so the world will – it is gonna take some time, but the world will surely become increasingly converged [coughing]. Excuse me, again. Finally, just – I know Nikita’s gonna talk a lot more obviously about the role of in-memory, but I just wanted to just briefly touch on our perspective of this. What we see is obviously – I’ve already mentioned in-memory several times in my presentation, and we do see that memory and the lower cost of memory and the companies taking advantage of in-memory processing is a key enabler for data platform convergence.
As I mentioned, we see a new breed of data platform vendors taking advantage of improved performance in their hardware, memory, and processor performance to the extent that they’re able to support transactional and analytic workloads as well as, obviously, grid/cache, stream processing, and MapReduce and other batch processing as well, at the same time. And importantly, you know, what we see is enterprises not necessarily abandoning investments in disk-based systems, as I said, but if you’re talking about new applications, new projects, new data processing platforms, then it makes sense, given the trends that we’re seeing, for in-memory data processing and analytics to be a key consideration for application development projects and to be a focal point for architectural rejuvenation.
So very much a lot of organizations thinking about creating an in-memory data processing layer and then thinking about the use cases and the technologies that they want to adopt in order to take advantage of that layer. So that comes to the end of my presentation. So I wanna thank you again for your time. Apologies again for the frog in my throat. I’d be happy to take questions as we get to the end of the presentation, but for now, I’ll hand you over to Nikita, who can talk about where GridGain obviously fits within the current landscape I’ve just described, so again, thanks for your time.
Thanks, Matt. Good morning or good afternoon, everybody. My name is Nikita Ivanov. I’m the founder and chief technology officer of GridGain Systems. So I’m gonna follow up on what Matt excellently described as this concept of a converged data platform, and I’m gonna give you basically just an example of a product or a project that in many ways follow in steps with what Matt just described as this emergent idea with converging data platform. We look at a slightly different angle, predominantly with focus on in-memory and kind of driving from in-memory idea outwards, but nonetheless, as you will see throughout the next 15, 20 minutes, most of the components that Matt described are there in our product. And that’s why we’re pretty much excited that it’s not only us driving this idea.
There is a bigger analyst community that’s actually seen exactly the same ideas in talking to customers and IT professionals. So very quickly, we’ll talk about some of the history of the project. It’s pretty interesting. The GridGain project and a product and it’s built on Apache Ignite. Has a pretty long history, and we’ll talk about some of the key components about what were called in-memory data fabric. And you will notice that we’ll have the same or many of the same components that Matt described in our product date, from the computer data grid, to service grid, to streaming complexity and processing Hadoop and Spark acceleration and things of that nature. So we’ll stick on this project for now.
I was on this slide and I’ll start by briefly talking about the history of the GridGain, so GridGain started as an open source project roughly about ten years ago in 2005, and it basically has been steadily growing except a couple of years ago it joined Apache Software Foundation, the same body that produced the Spark and Hadoop and Cassandra and many other projects. And it was renamed Apache Ignite so the GridGain company can retain its name. So essentially, right now, GridGain is based on Apache Ignite. Think about this like Hadoop/Cloudera and then Apache Ignite and GridGain. That’s the same thing. So fundamentally we are based on Apache Ignite and what is Apache Ignite? When they say, “Apache Ignite,” it’s synonymous with GridGain, so I’m gonna use it interchangeably, GridGain/Apache Ignite. So what is Apache Ignite?
Apache Ignite essentially we’ll call as a data platform, in-memory data fabric. We use the word “fabric”. Again, it’s just we can easily replace with a platform here. The idea with memory data fabric is that essentially it’s a type of software. It’s a distributed, Java-based software or JVM-based software that fundamentally sits in between your variety of data sources and your applications. Data sources in the bottom of this picture, it could be practically anything today. It could be any SQL database. It could be a NoSQL database that’s disk-based. It could be any kind of Hadoop data structure with a file system. It could be a straight file system
And above the fabric, there are all your applications and we support – actually, we work with a variety of different applications with a SaaS and mobile application, or IoT, or just traditional enterprise applications in a variety of different languages that we support to talk to the fabric from a JVM based languages, Java and Scala enclosure to .NET and C++ and straight SQL and so on and so on. So the fabric itself, what does it do? It actually provides two fundamental things. It provides you speed and scale.
It provides you speed by moving data from slow disk or flash based black label storage into the byte addressable DRAM across many, many computers, and as you probably can guess, the speed of DRAM is over a million times faster than speed of access through a black label, flash, or let alone spinning disk devices. We ask the provider to scale to this processing adaptation because we’re parallelizing this process. We’re not only moving data in one computer, we basically create this fabric out of thousands—our largest installation with 2,000 physical servers in the cluster–and that gives you a tremendous parallelization, tremendous parallelization into the system, and that as a consequence gives you great scalability.
So when we talk about in-memory systems, the probably most important thing to remember is that it brings or gives you two fundamental advantages. It gives you unprecedented speed and gives you an awesome scalability. It’s interesting to note that when we talk about speed, almost everybody makes a claim about speed. What’s really unique about in-memory system is that if you think about it a little bit deeply, in-memory systems is the last frontier in terms of where you can keep your data. Think about this. In the last 50 if not more years we went from external tapes late ’50s, mid to late ’50s. Then the hard disk came around and IBM 360 revolutionized that.
We switched to disks around late ’60s, mid ’70s, so we had hard drives and then hard drives grew and grew in performance, and then mid-ninties, late nighties Flash came around and became prominent, and we kind of were moving data closer and closer to CPU and make this medium of storing faster and faster and faster. And so today, we arguably are in an era of in-memory storage where we store all data in the physical DRAM of a computer, across multiple computers, to gain capacity. An actual question, where else can we go? Think about it. It’s about the only physical place where the modern computer can actually use their own CPU cache, but it’s very small due to physical limitations. It’s only about two or three times faster than DRAM.
So what’s interesting about an in-memory processing or in-memory data storage is that this is actually the end of the story. This is the last frontier in terms of the data storage in a computer architecture as we know it today. There is nowhere else to go unless we fundamentally change our computer architecture, which probably would not happen anytime soon. So unlike flash, which is clearly a transient technology, in-memory storage is here to stay, especially with the new types of RAM that’s coming up in the market with 9 volt RAM, with all the different types of RAM that’s coming up rapidly into availability. In-memory computing becomes a fundamental trend, and how do we keep the data and therefore how do we process it out?
Naturally, not only the speed of in-memory processing is important, but it is, for the first time, a fundamentally different way to process data, and that comes from the fact that RAM is byte addressable. We’ve spent the last 50 years dealing with the black label devices, which is the tape, or disk, or flash device. Those are essentially your file systems and you deal with those data storage systems as external. You have to constantly marshal and de-marshal your objects from and to those disks.
And because naturally the presentation of the objects in RAM is completely different from representation objects into the external black label devices, whether it’s tapes, disk or flash. So, for the first time, we have a data storage system that is completely compatible. As a matter of fact, it’s the same as the one where we used to actually run out or code, where the code lays. The code leaves and ever lived since the first days of computers in RAM, into the random access memory of a computer, but now we keep our data there, and that opens up the tremendous possibility to optimize data processing, because no longer we have to read a kilobyte of data just to change one byte.
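The difference between block-level and byte-addressable storage can be made concrete with a small simulation. This is only an illustrative sketch: `BlockDevice` is a hypothetical in-process stand-in with a 1 KB block size chosen to match the “kilobyte” in the talk, and it models the amount of data moved, not real device latency.

```python
BLOCK_SIZE = 1024  # bytes per block, an assumption chosen for illustration


class BlockDevice:
    """Hypothetical block-level device: data moves only in whole blocks."""

    def __init__(self, num_blocks: int) -> None:
        self._blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        self.bytes_transferred = 0

    def read_block(self, index: int) -> bytes:
        self.bytes_transferred += BLOCK_SIZE
        return self._blocks[index]

    def write_block(self, index: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("a block device only accepts whole blocks")
        self.bytes_transferred += BLOCK_SIZE
        self._blocks[index] = bytes(data)


def set_byte_on_block_device(dev: BlockDevice, offset: int, value: int) -> None:
    """Changing one byte means read the block, patch it, write it back."""
    index, within = divmod(offset, BLOCK_SIZE)
    block = bytearray(dev.read_block(index))
    block[within] = value
    dev.write_block(index, bytes(block))


# Byte-addressable memory: touch exactly the byte you want.
ram = bytearray(4 * BLOCK_SIZE)
ram[1500] = 0xFF

# Block device: the same one-byte change moves two whole blocks' worth of data.
dev = BlockDevice(num_blocks=4)
set_byte_on_block_device(dev, 1500, 0xFF)
print("bytes moved on the block device:", dev.bytes_transferred)
```

In practice the marshalling of objects to and from an on-disk format adds further cost on top of the raw transfer shown here.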
We can literally use every point or (unintelligible) and change the byte. Instead of milliseconds, have nanosecond latencies. So all of this introduction on why in-memory computing is important I think is valid and it is important, because unlike any other previous technology, storing data in the distributed in-memory cluster is a fundamental frontier in terms of how do we store our data if we use it for years to come? There is literally nothing physically else available to do it and memory will get faster. Capacity will grow. Just one more last example. We recently tested with some of the Fujitsu boxes servers. Imagine this. You can buy today a server from Fujitsu with 64 terabyte of RAM.
It’s a single server, 64 terabyte of RAM. It runs Solaris. It runs Java. You can do whatever you like. 64 terabyte of RAM is available today, and it’s just in one box. Needless to say that anybody from Azure and Amazon already have a very nicely priced instances with 16 and 32 terabyte – gigabyte of RAM if you need it as well. So with that, having all said, so let’s just go ahead and talk about what is the in-memory data fabric from GridGain or Apache Ignite can do.
So fundamentally, we call it a fabric, and remember that Matt actually mentioned as in-memory data grids in his presentation. Well, in-memory data grids was basically a piece of technology that sort of started the whole in-memory growth about ten years ago, but we kind of are thinking about Apache Ignite as a more strategic in-memory play, where in-memory data grids is only one component, because only a source of one particular use case of what they can do with the data in RAM. Well, data grid is essentially is a key/value storage. It may or may not have a SQL.
It may not have transactions, but essentially it’s a NoSQL store in RAM. We also have a compute grid, which we’ll talk about it. We also have service grid, which is a different use case. We have streaming complexity processing, which is a completely different use case as well. We have Hadoop and Spark acceleration, which yet another different use case of how you can use in-memory storage. We have advanced clustering in RAM as well. We have an entire file system in RAM.
Think about this. You could have a distributed file system fully in RAM and have a tremendous performance characteristic of that. This is yet another interesting use case for RAM and the list goes on. So as they can see, and as you’ll see throughout presentation, throughout next 10 or 15 minutes, this is what basically makes fabric a fabric. It’s not just the one use case or what you can do in RAM. We looked at every possible imaginable use case of what actually people wanted to do once they have data in RAM or once they have a system that allows them to do that, and we’ve built those capabilities into the in-memory fabric for Apache Ignite and obviously it all comes with very consistent API’s, consistent configuration, consistent programming and everything else. So although that’s a wide swath of functionality, it looks and feels very consistent and it’s very nice and simple to use.
Let’s talk about the compute grid. As a matter of fact, historically GridGain many, many years ago started as a compute grid, and a typical application of a compute grid is traditional high performance computing, and it solves a very, very simple problem. If I have a task that I wanna compute, let’s say on one computer, and it takes a long time, I can split this task into multiple subtasks, execute those subtasks in parallel on a cluster of computers, get the results back and aggregate them into one final result. In the ideal, normalized use case, if I have N computers in front of me, I can execute that task N times faster. That’s all.
That’s the entire idea of high performance parallel computing. This is the entire idea behind a compute grid. The rest of it is obviously just details. How do we do it? How do we configure it? What do we do? So we have one of the best, if not the best, compute grids in the market. We’re the oldest Java based project that deals with compute grid capabilities, and we have very rich and fat, I would say, APIs for what we have. We have direct APIs for multiple variants of map/reduce. Not only do we have map/reduce that is compatible with Hadoop, we also have a much more simplified, more optimized map/reduce version as well.
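The split, execute, aggregate idea described here can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not the GridGain or Apache Ignite API: the function names (`split`, `execute`, `run_on_grid`) are hypothetical, and a local thread pool stands in for the cluster nodes, so it shows the control flow rather than a real N-times speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split(task, parts):
    """Split a list of numbers into roughly equal subtasks."""
    size = max(1, -(-len(task) // parts))  # ceiling division, at least 1
    return [task[i:i + size] for i in range(0, len(task), size)]


def execute(subtask):
    """Work done by one 'node': here, a sum of squares."""
    return sum(x * x for x in subtask)


def run_on_grid(task, nodes):
    """Split the task, run the subtasks in parallel, aggregate the results."""
    subtasks = split(task, nodes)
    # Threads are a stand-in for cluster nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(execute, subtasks))
    return sum(partials)  # aggregate partial results into one final result


total = run_on_grid(list(range(1000)), nodes=4)
```

In a real compute grid the subtasks would be shipped to other machines, and concerns such as failover and load balancing would sit between `split` and the final aggregation.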
We have technology like zero deployment that really eliminates any need for cluster based deployment, and obviously it can do cron-like task scheduling. We have state checkpoints for long running tasks. Anybody from biotech or pharmaceuticals, this is exactly the feature for you, because everything in that scenario sometimes takes hours to process. We have the basic load balancing and automatic failover. This goes without saying, and full cluster management deployable (unintelligible). So we have a very rich functional side when it comes to the compute grid.
You can literally run complex, high performance computations and high performance compute loads like you would do in, for example, the financial services industry, and you can do this right on the in-memory data fabric. Now the data grid: very similar in concept to the in-memory compute grid, but still slightly different. Remember that the in-memory compute grid essentially solves the problem of parallelization of computations, right? How do I parallelize my computations? The in-memory data grid solves a slightly different problem. It solves the problem of parallelizing data storage. Again, say I have a terabyte of data and I have ten computers in front of me, right?
So how do I store a terabyte across those ten? Well, there are probably multiple ways, but one of the ways is to split this terabyte into ten different parts and store one-tenth of a terabyte on each of those computers, and that’s exactly what data grids do. But once you do that, it opens a whole can of worms: how do you transact? How do you fail over? How do you load balance? How do you query? It’s a never ending list of questions, and this is exactly what in-memory data grids deal with.
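Splitting data across ten machines comes down to a rule that maps each key to a node. The sketch below assumes a simple hash-modulo rule plus one backup copy on the neighboring node; the node names and functions are hypothetical and this is not how any particular product assigns partitions, only a minimal illustration of the idea.

```python
import zlib

NODES = ["node-%d" % i for i in range(10)]  # hypothetical ten-node cluster


def primary_node(key, nodes=NODES):
    """Pick the node that owns a key using a stable hash."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


def backup_node(key, nodes=NODES):
    """Pick the next node in the list to hold a backup copy for failover."""
    index = zlib.crc32(key.encode("utf-8")) % len(nodes)
    return nodes[(index + 1) % len(nodes)]


def place(keys, nodes=NODES):
    """Group keys by the node that stores their primary copy."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[primary_node(key, nodes)].append(key)
    return placement


placement = place(["account-%d" % i for i in range(1000)])
```

A plain modulo rule reshuffles most keys whenever a node joins or leaves, which is one reason the "can of worms" of failover and load balancing needs more careful schemes in practice.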
It’s one of those components in the fabric that deals with distributed data storage in RAM across many, many computers. So the data grid has probably the longest list of features, which we can probably talk about almost endlessly, but fundamentally in Apache Ignite, the data grid is a key/value store. It’s a very important fact, because the key/value store is the fundamental data structure on which you can layer different views. You can view the key/value store as a SQL store, for example, where each combination of key and value types becomes essentially a table, if you will, in a SQL store. You can also view a key/value store as a file system.
And so there are multiple different views that we allow you to have on the same key/value store. So you have a key/value store. You have full SQL 99 capability with our system, and you can also have an entire POSIX file system on it as well, which is, by the way, compatible with HDFS. So the in-memory data grid is a transactional key/value store. The word ‘transaction’ is also very important. It’s very important to us. We have one of the strongest transaction capabilities in the market today. We have exactly the same transactions that you would expect from a database like DB2 or Oracle. We have pessimistic and optimistic transactions. We have all isolation levels, so read committed, (unintelligible) and serializable. So you have absolutely the same transactional capabilities as you would have in normal databases.
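To make the "SQL view over a key/value store" idea concrete, here is a small sketch using Python’s built-in `sqlite3` module. It is an assumption-laden illustration: the `person` table, its columns, and the sample entries are invented, and the data is copied into an in-memory SQLite table rather than queried in place, which is not how a real data grid would expose SQL.

```python
import sqlite3

# Hypothetical key/value entries: key = person id, value = a record.
store = {
    1: {"name": "Ann", "city": "London"},
    2: {"name": "Bob", "city": "Paris"},
    3: {"name": "Eve", "city": "London"},
}


def as_table(entries):
    """Expose key/value entries as rows of an in-memory SQL table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    db.executemany(
        "INSERT INTO person VALUES (?, ?, ?)",
        [(key, value["name"], value["city"]) for key, value in entries.items()],
    )
    return db


db = as_table(store)
londoners = [row[0] for row in db.execute(
    "SELECT name FROM person WHERE city = ? ORDER BY name", ("London",))]
```

The key becomes the primary key column and the fields of the value become the remaining columns, which is the sense in which one store can carry several views.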
So there are two modes in which we work in terms of the data grid: partitioned and replicated mode. We only have a little bit of time today, so we’re gonna skip the definitions of those. We work with both on-heap and off-heap storage. This is for anybody who deals with Java based applications. You guys know that Java has a limitation on how much on-heap storage it can handle. Typically it kind of maxes out at about ten gigabytes. With our off-heap support, we can use the entire physical RAM on the box. So if you have one of those Fujitsu M10 boxes I mentioned a few minutes ago with 64 terabytes of RAM, you can use all of those 64 terabytes. We’re gonna use the entire physical RAM on the box, or as much as you configure it to use.
We definitely have all the kind of lower level plumbing and functionality in the data grid: highly available replicas for high availability, automatic failover, not to mention distributed ACID transactions. By the way, when I say transactions, I wanna kinda emphasize here: I’m not talking about eventually consistent transactions. I’m talking about real transactions, something you can move money with. It’s the traditional transactional behavior that you would expect.
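"Something you can move money with" means a transfer either happens completely or not at all. The sketch below shows that all-or-nothing behavior on a single-process store guarded by a lock, in the spirit of a pessimistic transaction. The `Accounts` class and its methods are hypothetical; a distributed ACID implementation involves far more (multiple nodes, commit protocols, isolation levels) than this illustration.

```python
import threading


class Accounts:
    """A tiny in-process key/value store with all-or-nothing transfers."""

    def __init__(self, balances):
        self._balances = dict(balances)
        self._lock = threading.Lock()  # pessimistic: lock before touching data

    def balance(self, key):
        with self._lock:
            return self._balances[key]

    def transfer(self, source, target, amount):
        """Move money atomically; raise and change nothing if it cannot be done."""
        with self._lock:
            if target not in self._balances:
                raise KeyError(target)
            if self._balances[source] < amount:
                raise ValueError("insufficient funds")
            # Both updates happen under the same lock, so no reader
            # can observe the debit without the matching credit.
            self._balances[source] -= amount
            self._balances[target] += amount


bank = Accounts({"alice": 100, "bob": 50})
bank.transfer("alice", "bob", 30)
try:
    bank.transfer("alice", "bob", 500)  # rejected; balances stay untouched
    rejected = False
except ValueError:
    rejected = True
```

All validation happens before the first write, so a failed transfer leaves both balances exactly as they were.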
We have SQL queries. We have both JDBC and ODBC drivers, so not only do we have a SQL capability, you can actually connect to our system through standard protocols like JDBC and ODBC, connect any of your outside, external analytical tools to the system, and just use the data. As some of you know, we also have a service grid, which is yet another central use case of what you can do on an in-memory system. And now you should be getting the idea again about a fabric. That’s why we call it a fabric. It’s not one single use case of what we do. We try to cover many different patterns of what our customers and users typically wanna do.
And actually, the service grid is one of the recent additions, and it came out of requests from customers. They were basically telling us, “Guys, we already have a GridGain cluster running and we just need to have some kind of ability to run a service, which is essentially just an object with some APIs, with certain predefined SLAs. For example, can we run only one instance of that object on each node? Or can we run one instance of that particular object on the entire cluster, and you guys take care of maintaining the SLA? If the node crashes, you automatically restart it. Or, for example, I wanna have only three instances of that particular object in the entire cluster, and no matter what happens with cluster nodes joining or leaving, you maintain that SLA.”
So this is exactly what the in-memory service grid really does. It gives you a tremendously simple capability, really just a little configuration, where you can basically deploy any type of service: web service, GDI service, lookup services, any in-application services. Just give us a configuration with the SLA and we’re gonna maintain it. Very simple, very beautiful functionality.
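The SLA idea, "keep exactly N instances running no matter which nodes join or leave", can be sketched as a small reconciliation function that is re-run on every topology change. The function name, the node ids, and the placement rule (at most one instance per node, spare nodes picked in sorted order) are assumptions made for this illustration, not the service grid’s actual algorithm.

```python
def reconcile(nodes, running, total_instances):
    """Return a new placement that keeps `total_instances` copies of a service.

    nodes: ids of the nodes currently in the cluster.
    running: ids of the nodes that hosted the service before the change.
    """
    alive = [node for node in running if node in nodes]    # drop crashed hosts
    spare = [node for node in sorted(nodes) if node not in alive]
    needed = max(0, total_instances - len(alive))           # instances to restart
    return sorted(alive[:total_instances] + spare[:needed])


# Cluster singleton: exactly one instance in the whole cluster.
before = reconcile({"a", "b", "c"}, running=[], total_instances=1)
# Node "a" crashes; the service is restarted on a surviving node.
after = reconcile({"b", "c"}, running=before, total_instances=1)
# Three instances, kept at three when a node leaves and others join.
three = reconcile({"b", "c", "d", "e"}, running=["a", "b", "c"], total_instances=3)
```

A per-node singleton is the special case where the target placement is simply every node in the cluster.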
Surprisingly for me, you know, it’s one of the most frequently used features. In almost every project where Apache Ignite is used today, the service grid finds its use.
Streaming and complex event processing: this is yet another way to process data in RAM, and although there are quite a few different projects for in-memory streaming and complex event processing, we decided to add basic capabilities, and not so basic, as a matter of fact, into Apache Ignite, because again, it fits the idea of a fabric in that it’s yet another use case of how our customers and users practically use GridGain. They basically wanna stream data through the system and have a capability for stream processing.
Now, we’re not trying to compete with something like Kafka. We’re typically used with Kafka, with Kafka being the buffer in front of us, but we have multiple ingest options built in as well. So what we do in streaming comes down to two things. We have window based functions, or window based streaming. As a matter of fact, you can see this in the bottom picture here, and the window actually means, say, the last 100 events or the last 5 minutes.
You basically define a window in which you wanna do the processing, because the problem with streaming, or actually the characteristic of stream processing, is that there is no beginning and there is no end, unlike traditional databases, where you actually have a table and you know the size of the table. You could query this table entirely. In streaming, there is no beginning and there is no end. You can never really have a finite set of data.
So what we typically do, what everybody’s doing today, is operate on a sliding window. You define the sliding window, and we actually give you a continuous query capability. It’s a very traditional concept in stream processing. You define the window and then you define the query on that window, and the query will call you back with new results as they appear in that window. So you can say, “Look, my window is the last five minutes of events and I’m looking for all the events that have the type error,” and you define that query, define that window, and off you go.
It’s a fully distributed, fully parallelized system, and you get that callback when those events pop up, and then you have the entire capability of GridGain, with your compute grid and data grid and service grid and everything else, to do whatever you like with this event. You can dump it on the data grid. You can process it later. You can run SQL queries. You can do whatever you like. So streaming is there, predominantly with sliding window processing and continuous query capabilities, and it’s fully integrated with the rest of the fabric. That’s why it’s a good part of the fabric. It’s a very, very nice addition.
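The window-plus-continuous-query pattern described here ("my window is the last five minutes and I’m looking for events of type error") can be sketched in a single process as follows. The class and method names are hypothetical, timestamps are plain numbers in seconds, and events are assumed to arrive in time order; a distributed streaming engine would partition and parallelize this.

```python
from collections import deque


class SlidingWindowQuery:
    """Keep events from the last `window_seconds` and call back on matches."""

    def __init__(self, window_seconds, predicate, callback):
        self.window_seconds = window_seconds
        self.predicate = predicate
        self.callback = callback
        self.window = deque()  # (timestamp, event) pairs, oldest first

    def on_event(self, timestamp, event):
        """Add an event, evict expired ones, notify if the new event matches."""
        self.window.append((timestamp, event))
        while self.window and self.window[0][0] <= timestamp - self.window_seconds:
            self.window.popleft()
        if self.predicate(event):
            self.callback(event)

    def matches(self):
        """All matching events currently inside the window."""
        return [event for _, event in self.window if self.predicate(event)]


alerts = []
query = SlidingWindowQuery(
    window_seconds=300,                        # "the last five minutes"
    predicate=lambda e: e["type"] == "error",  # "events that have the type error"
    callback=alerts.append,
)
query.on_event(0, {"type": "error", "id": 1})
query.on_event(100, {"type": "info", "id": 2})
query.on_event(400, {"type": "error", "id": 3})  # events 1 and 2 have now expired
```

The callback fires once per matching event as it arrives, while `matches()` answers the bounded question "what is in the window right now", which is what makes an unbounded stream queryable.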
We also actually have, and that’s probably two extra, I think a couple more slides here. We also have very interesting technology for integrating with Hadoop and Spark, and this again came as feedback from our customers, and the feedback was pretty obvious. Basically, when we talked to somebody who’s using Spark and Hadoop, we often heard, “Gosh, we love the idea, but we already have a Hadoop cluster and we already have Pig and Hive or hand-rolled map/reduce jobs. We’re not gonna redesign everything. Can you just speed it up?” And we thought about it pretty hard about three years ago. How can we speed up Hadoop?
And if you know anything about Hadoop, you know that to do anything with Hadoop in terms of performance, you really have to address problems first with HDFS, the file system, and then with MapReduce as the processing component. So Hadoop consists of two major parts, HDFS and MapReduce, and if you want to solve the Hadoop performance problem, you have to solve these two, and that’s exactly what we did.
We developed the first fully HDFS compliant in-memory file system. On some of the operations, we’re over 2,000 percent faster than traditional HDFS. Well, it’s not surprising. We work fully in-memory and we can optimize this type of processing.
We developed the first HDFS compatible in-memory file system, and we also entirely redeveloped the MapReduce subsystem and made it in such a way that it’s completely plug and play. There is nothing to do. There’s no code change. There’s nothing to do really in the Hadoop cluster. It’s fully compatible and it gives you a significant performance boost.
So again, since we are kind of pressed for time, there’s a pretty nice picture on the right of this slide, if you can see it. If you can see those gray lines, this is exactly the execution flow of a traditional Hadoop cluster, you know? You submit the job. It goes to the job tracker. It talks endlessly to the Hadoop name node, because basically the name node manages the file system, then it goes to the task tracker. The task tracker launches the nodes, and it’s the entire thing.
And if you see the blue line, the blue line is the only interaction that exists between a GridGain client and a data node. As a matter of fact, like many of the modern Hadoop improvements, we actually do in-process executions, so we don’t start the nodes every time.
So the bottom line, just to give you an example of what we can achieve in Hadoop acceleration: you can download stock Apache Hadoop, run a basic example like a Pi calculation, then download Apache Ignite. Just install it alongside, on the same cluster. Change one line of configuration. Very important: there is no code recompilation. You don’t even have to have the source code, because you don’t have to recompile anything. Just change the configuration and run again, about ten times faster. Ten times faster with absolutely zero effort.
Now, your results may vary. Obviously if your particular Hadoop job spends most of its time in the CPU, we can’t change the laws of physics, you know? We cannot make the CPU run faster. But if many of your Hadoop payloads spend a lot of time in I/O ops, basically hitting HDFS and moving data back and forth, those types of jobs can see a very nice jump in performance when you use them with Apache Ignite acceleration.
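The "your results may vary" point follows an Amdahl-style argument: if only the I/O portion of a job gets faster, the overall gain is capped by the CPU portion. The sketch below works through that arithmetic. The function name and the sample numbers (a 20x I/O speedup, 90% versus 10% of time spent in I/O) are illustrative assumptions, not measurements from the talk.

```python
def overall_speedup(io_fraction, io_speedup):
    """Amdahl-style estimate: only the I/O share of the job gets faster.

    io_fraction: share of the original run time spent in I/O (0.0 to 1.0).
    io_speedup: how many times faster the I/O part becomes.
    """
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    if io_speedup <= 0:
        raise ValueError("io_speedup must be positive")
    cpu_fraction = 1.0 - io_fraction  # this part runs exactly as before
    return 1.0 / (cpu_fraction + io_fraction / io_speedup)


# Hypothetical I/O-heavy job: 90% of the time in I/O, I/O made 20x faster.
io_bound = overall_speedup(io_fraction=0.9, io_speedup=20)
# Hypothetical CPU-heavy job: only 10% of the time in I/O.
cpu_bound = overall_speedup(io_fraction=0.1, io_speedup=20)
```

Under these assumed numbers the I/O-heavy job ends up several times faster overall, while the CPU-heavy job barely moves, which matches the caveat about jobs that spend most of their time in the CPU.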
So this is yet another component of the fabric, and by now you should start to see that the whole idea of a converged data platform, or what we call an in-memory data fabric or platform, is that we’re addressing many, many, many different types of use cases or patterns, sometimes very unique and strange, right?
There’s a compute grid and then there’s the whole Hadoop acceleration, but nonetheless, those are very much different takes, different views on the same in-memory acceleration, in-memory processing of data.
The last slide I have, as a matter of fact, I think is this one, about essentially what we do with the Spark integration. I mentioned to you that Spark is kind of supplanting Hadoop today, especially the MapReduce side of things, and we started talking to the Spark community, the Spark users, and one problem came out very, very quickly as the top problem. Spark has no data store. Spark is a processing framework, and as Spark users have been moving from evaluation projects to some real projects, people started to have pipelines of Spark jobs. And what happens in that pipeline of Spark jobs, when you have more than one job, is that you have to keep the results from one job somewhere until the second job picks them up.
And typically people have been using Tachyon or HDFS, dumping the results somewhere, and that kills the entire in-memory process, because your job can be pretty fast, but then essentially you have to dump the results in some kind of slow system, and then the next job has to read them back from that slow system again.
That kills the entire performance, so what we developed is an Ignite based shared Spark RDD, a resilient distributed dataset. What it allows you to do is retain results in RAM between two (unintelligible) and two Spark jobs, and that in return gives you tremendous performance. You can have a pipeline of Spark jobs and you don’t have to lose performance by storing intermediate result sets somewhere in a slower system. As a side effect, by the way, since we have that shared RDD, you can run SQL on that shared RDD, and our SQL is about, you know, between 10 and 100 times faster than Spark SQL.
Again, that’s because, unlike Spark, we’re a transactional system to begin with, and we have a very mature (unintelligible) implementation for SQL. Spark doesn’t have any indexing at all, so for Spark, pretty much every single query is a full scan, at least for now. I know the guys are working on it, but at least for now it is this way, and for us, we have a very mature indexing implementation for SQL. So even SQL on Spark can be dramatically improved by using our integration with Spark.
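The cost of "every single query is a full scan" can be illustrated by comparing how many rows a lookup has to examine with and without an index. This sketch uses a plain dictionary as a hash index over an invented table; the table contents and function names are assumptions, and real SQL engines typically use tree-based indexes rather than this simplification.

```python
from collections import defaultdict

# Hypothetical table: 10,000 rows, 1% of them in London.
rows = [{"id": i, "city": "London" if i % 100 == 0 else "Other"}
        for i in range(10_000)]


def full_scan(table, city):
    """Check every row; returns the matches and the number of rows examined."""
    matches = [row for row in table if row["city"] == city]
    return matches, len(table)


def build_index(table, column):
    """Map each column value to the rows that hold it (built once, up front)."""
    index = defaultdict(list)
    for row in table:
        index[row[column]].append(row)
    return index


def indexed_lookup(index, city):
    """Jump straight to the matching rows; only those rows are touched."""
    matches = index.get(city, [])
    return matches, len(matches)


city_index = build_index(rows, "city")
scan_matches, scan_examined = full_scan(rows, "London")
index_matches, index_examined = indexed_lookup(city_index, "London")
```

Both approaches return the same rows; the difference is that the scan touches the whole table on every query, while the index pays a one-time build cost and then touches only the matches.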
So I believe this is the last slide I have. It’s just to give you a very quick overview, and again, I don’t wanna bore you with too much, of where we’re actually used. Again, this is gonna be GridGain’s view, but it should give you kind of a picture of what kinds of industries, what kinds of customers and applications people use GridGain for.
So we’re used a lot in financial services applications: trading systems analysis, risk assessment, trading, low latency/high performance transactions, fraud detection, risk analysis, analytics. Basically any kind of buzzword in the financial services industry you can imagine, we’re probably used for that. We’re used in one of probably the top three banks in the world. We’re used dramatically in one of the top banks in Eastern Europe, so we know this works very, very well, and it’s not surprising, because for financial services organizations, speed is money, and it has been for decades. So in-memory computing as the premier technology to gain that speed, to gain that scalability, is a no brainer. So it’s used a lot in this.
I would say probably over half of our customer base, our commercial customer base, let alone the open source users, is from that. We’re also starting to see a lot of uptick in traditional big data analytics, and again, this is a very amorphous term. Essentially, analytics is the big word. Big data, small data, I don’t really care. I hate the term, but the analytics is what’s important here, so not a transactional use case. It’s anything with machine learning, with the typical analytics, typical large queries and typical big calculations, you know, operational BI, things of that nature.
We started to see a lot of traction with IoT, especially in the last, I would say, 12 months. We just recently closed a couple of commercial customers in this space, and I think IoT is really ramping up from a concept to a real segment in the market where we’re starting to see a lot of activity. And it’s a very natural fit for in-memory computing. IoT is basically dealing with millions, if not sometimes billions, of end points and devices.
That requires a very sophisticated back end infrastructure. Speed and scale are paramount to be able to process that. Welcome to in-memory computing. This is the technology for it. Another subset of customers we’ve seen in kind of a robust way year over year is biotech and pharmaceuticals, kind of bioinformatics overall.
This is a traditional area where there are a lot of data sets to be processed. I’m always joking to my audience that, you know, pharmaceutical companies today basically, predominantly consist of software engineers. It’s all software development, and they do process a lot of data sets, and performance for them is critical. So this is basically a kind of landscape of where we are used, very, very quickly. I think it’s my last slide. Alisa, back to you, if we have any questions.
Hi, there. Can you hear me okay?
Okay, great. So the first question is, is there a machine learning library available for GridGain?
Short answer is no. You guys are more than welcome to use a variety of options: Flink, Spark, Mahout. We just don’t feel like reinventing the wheel again. I do believe that the ML support in Spark is excellent, and you know, literally we didn’t wanna reinvent the wheel, so if you guys are doing machine learning, Spark, hopefully with our shared RDD support, is an excellent choice.
Okay, and this question I believe is for Matt. What impact will the move to the cloud by many companies have on the development of the converged data platform?
Yeah, no. I think that is a good point. Clearly, along with those core technologies that we talked about, the movement to cloud is obviously a driving factor in terms of companies reconsidering their existing data platform estate, so I think it will definitely have a role to play. I think at this point, there certainly isn’t a converged data platform in the cloud that someone could move to. So at the moment, it would still be a matter of selecting multiple services and multiple service offerings and trying to integrate and pull them together. Over time, I suspect that may change as well, and we will see the emergence of what we might consider a converged data platform in the cloud.
Yeah. I wanna add something from my side here, just to give you a flavor. I’ll call it commentary. Five years ago we wouldn’t find a single financial services company that would do anything in the cloud, for obvious reasons: data security, regulations, and whatnot. Literally five years ago our business was bifurcated. Financial services companies would be entirely on premises and almost everybody else would already be in Amazon or somewhere else. In the last 12 months, literally almost every conversation we have with a financial services company, starting with the number one bank, number two bank in the world, number three bank in the world, number one bank in Eastern Europe, and all the way down to small companies, all of them are moving to the cloud.
We just talked to Barclays. The entire Barclays initiative right now, globally, is to move to the cloud as rapidly as physically possible. I’m talking about Barclays, which is a global behemoth of a bank, and we’ve talked to other banks as well, I will not name names, with almost identical corporate mandates. So I believe something has happened, or eventually the pressure has been growing. It’s so hard for them, but they have to do it. So I think the last holdouts in the whole cloud migration are probably giving up, and we’re gonna see mass adoption of cloud infrastructures, and from GridGain’s point of view, our key customers, the financial services segment, will be moving to the cloud very, very rapidly.
Okay, and we have time for one more question, so how does GridGain compare with Hazelcast? What are the pros and cons?
Well, we love Hazelcast. It’s one of the key competitors we have, of course, (unintelligible) couple of companies, but again, I think where we’re different a lot is based essentially on our conversations with our customers, because they always look at multiple vendors, including us and some of those companies. Hazelcast is predominantly a data grid company, and a good one. We are a fabric. For us, the data grid is one component of what we do, and I would say five, seven years ago, that would have been a very fair competition, because back then nobody really knew of anything but the data grid, and you would be competing on the data grid.
Today we hardly, hardly find a customer who will be satisfied with just a data grid. Almost everybody needs, not all of our components, but I would say two-thirds of our components are practically used by almost every client we have. It’s anything from the data grid, obviously, to the service grid. Our Hadoop and Spark integration plays an enormous role in winning customers and winning projects. Often enough, streaming plays an important role, but it’s typically a combination of those. I attribute that to the very simple fact that the seriousness, or complexity, of in-memory use cases has grown dramatically.
It’s not casual anymore. In-memory computing is not new. If you look 25 years ago, caching is in-memory computing, right? And caching has been in databases for ages, but the complexity of those use cases grew dramatically. Caching is not even considered to be a use case anymore, because it’s so obvious and so ubiquitous.
So people look for transactions, for SQL, for Hadoop and Spark integration. Somebody’s asking for machine learning algorithms. Yeah, it’s an excellent question. Streaming, things of that nature. So I think that’s the fundamental difference between us and Hazelcast and companies like Hazelcast, you know? They’re kind of one-trick companies that do one particular function, and we’re trying to position GridGain as a more strategic play, and we’re not only trying to position it that way, we have the technology and software that does that, and that’s the biggest difference here.
Okay. Well, that brings us to the top of the hour. I want to thank both Matt and Nikita for a great presentation and audience, thank you for joining us. We hope to see you in a future webinar and with that said, I will go ahead and close out the webinar. Thank you again, everybody.
Thank you. Have a good day.
[End of Audio]