Deploying the In-Memory Data Fabric
Businesses large and small are increasingly turning to comprehensive in-memory data processing solutions such as the GridGain In-Memory Data Fabric to address their Fast Data challenges and create a competitive advantage by operating as a real-time business. When deploying an In-Memory Data Fabric into a production environment, typical challenges that need to be addressed include availability and resilience, security, and manageability, among other things.
In this on-demand webcast, GridGain Solutions Architect Mac Moore explains how to harden the deployment of the GridGain In-Memory Data Fabric by taking advantage of a number of enterprise-grade features in the commercial version of the product, designed for always-up, always-on, real-time data processing.
This webinar is a must-see for technology leaders navigating the transition to high-speed, low-latency big data systems.
Presenter: Mac Moore, Solutions Architect, GridGain Systems
Interviewer: Good morning or afternoon depending on where you're joining us from, I'm Dane Christenson, the Digital Marketing Manager at GridGain Systems. I want to thank you all for taking the time to join us today for this very informative webinar, Deploying the In-Memory Data Fabric. Your presenter today is Mac Moore, a Solutions Architect at GridGain Systems. Mac has over 12 years of experience designing, developing, and integrating enterprise systems where performance and scalability are essential. He's a frequent speaker on the subject of in-memory computing and he knows the technology and the industry inside and out. But before we get into Mac's presentation I have just a few quick administrative points to make.
Mac will be speaking for about 35-40 minutes, after which we'll have about 15 minutes for Q&A. You'll notice that just above the presentation screen is a questions button; click there to ask a question at any point during the presentation and Mac will respond to your questions during the Q&A segment. We may not have time to get to all the questions during this event, but we'll definitely answer all of them afterwards, so please take advantage of this opportunity to ask questions of an expert in the field. You can also use these questions to let us know if you're having any technical difficulties. Also, I wanted to make sure you're aware that we have included a few case studies as well as some feature comparison documents in the attachments area, which can help you to determine if GridGain is the right solution for you. So with that we're ready to turn the floor over to Mac Moore. Mac?
Interviewee: Thank you Dane, and let me echo Dane's comments and thank everyone for joining us today, so let's go ahead and jump right in and get started. The first thing that we're going to do is talk a little bit about in-memory computing. So we're gonna start at the beginning today and kind of assume that not everyone is necessarily using in-memory computing today; I'm sure that some of you are. But for those of you that are sort of new to this discussion we'll begin at the beginning, talk a little bit about why this is relevant, and we'll discuss some of the use cases where we see this come into play quite frequently, plus some of the concepts and considerations as well. And from there we'll talk about how we address some of the challenges in this particular market. We'll talk specifically today about enterprise features, so that's our focus; we'll spend the majority of the time focused on that piece, but we will do a brief introduction for those of you that haven't taken a look at GridGain before. Then towards the end we'll wrap up with some practical recommendations for those of you thinking about deploying a solution like GridGain in a production environment, since again that is our focus for today. And then as Dane said we'll hopefully leave a little bit of time for some Q&A at the end, so if you have questions as we proceed through, please post those to the online tool and we'll be able to pick some of those up towards the end of the presentation.
So let's jump right in, first and foremost: what is GridGain, what are we focused on? We are entirely an in-memory focused company; our use cases typically center around performance. We are very active in many of the spaces that you see here, from big data analytics to SaaS providers, cloud computing providers, driving the back ends for mobile applications, for sensor applications and the internet of things, streaming, real-time use cases. These are the things that we're focused on, and as it states there at the bottom we are active in hundreds of deployments within these spaces today. If you've been to our website you may have seen some of this; we've won some great awards, some things we're pretty proud of like the one from Gartner, so I don't have to say too much about that, but if you haven't seen it already it's just something that we like to point out 'cause we're quite proud of it.
But let's jump into the trends piece, so we're gonna talk about what in-memory computing means to us before we go into the specifics of our particular products. But let's first set the stage: why is in-memory computing such a hot topic right now? Well the trends I think are fairly obvious and probably things that most of us accept as a given at this point in time, the first one being data growth. So here's another chart that kind of gives you an idea of data growth over the last few years and of course it's tremendous. All of us are dealing with more and more data all the time, and I certainly hope that your data sets are growing because that means that your customer base is growing, your business is growing, so that's the first factor. The second factor, fortunately for those of us in the in-memory space, is that the cost of RAM continues to decline, so it's much more cost-effective than it was even a few years ago to store large amounts of data in memory.
So what do we mean by in-memory computing? Again, if we start at the highest level, what does the concept mean? Well if you think about the picture on the left here, it's sort of the traditional architecture that we've all been building applications within for many years. The fundamental premise is all of my data lives in a relational database, it's all sitting on disk, and if you think about typical application servers, for the most part they don't have a lot of memory, or they haven't until recent years. So again the picture here is my data lives on disk in a relational database and my application server typically is not able to store a lot of data in memory. When we talk about in-memory computing we're really talking about turning that picture upside down: we're moving into a world where large amounts of RAM on servers are not only cheap but commonplace, and we're really saying we want to turn this picture around. And of course the relational database will still be there, it'll be there for years to come; it has lots of great attributes and things that we've worked on for years in terms of backup and recovery processes and ETL and those sorts of things, so that's gonna continue, we want to keep leveraging that investment. But fundamentally let's take the data that's important to us and put that in memory, and not in memory on one server but in memory across lots of servers. That's really what in-memory computing means: let's turn that picture upside down, let's store the majority of our data, whether it's transactional or analytic data, in memory on lots of boxes.
So if you're gonna do that, if we accept this premise and say okay, I've decided I now want to store large amounts of data, my most important data, in memory and keep it close to my applications, the next obvious implication is that there are a lot of features that I need in order to be able to accomplish that. To us those features would encompass all these things, and we'll talk in more detail about these specific features. But the point here is you need a rich set of functionality if you're going to move into this next step of evolution; there's a lot of functionality that comes with that premise as we move forward.
So I said the word evolution; to us this is really what is happening in the industry right now. We have seen a progression from the beginnings, you could say we started with the traditional local caches, small caching libraries, or homegrown approaches to caching. From there you move to things like distributed caching and in-memory data grids, and now we're seeing a whole proliferation of different options, all kinds of different products that have different capabilities, different aspects of in-memory that they're taking advantage of. Our idea at GridGain is really to bring these options back together to a certain extent and to offer a platform or data fabric that provides that rich set of functionality that you might need in one integrated solution.
So again we'll talk in a fair bit more detail about what that looks like, but before we do that, to complete the picture here let's talk for a moment about some of the use cases where this typically comes into play. The first things at the top you'll see are the pure big-data analytics types of applications, and also financials, a vertical where we see a tremendous amount of activity. There are a lot of problems, a lot of challenges in the financial space that ultimately drive you towards an in-memory based solution. Think about it: if you're a trading operation at a large bank you may have hundreds of traders, each one having hundreds of portfolios, each one having thousands of positions. And then let's say you want to build something that will calculate the value of those portfolios every time the market ticks, and it's not just the value either, it's probably some complex logic that you want to apply to figure out potential upside and downside, those sorts of things.
Or to take another example, in the online advertising space you have a very complex set of interactions: you have data about users, you have advertisers bidding to reach that audience. If you're the guy sitting in the middle, as somebody is loading a webpage or a mobile app you have to very, very quickly decide: what do I know about this person, what can I show them, how can I maximize the opportunities from a revenue perspective to serve the ad from the advertiser that's most interested in reaching that person? And you have maybe a second to make that decision. So across all these categories you get a feeling for the broad characteristics that they all have in common, which are really a requirement for real-time performance plus some combination of data volume and data velocity. In a lot of cases it is very large data sets, but not in all cases; everybody's requirements are different, it may be a modest sized data set that you need to serve up very quickly, do compute against very quickly, and that sort of thing.
So with that said let's take a look at what we do at GridGain, and again our focus today is around enterprise features, so we'll start with kind of a level set and explain what we do for those of you that have not been exposed to it before. Then we'll spend quite a bit of time talking about enterprise features and things that matter in an enterprise deployment scenario. What you're seeing here is the picture of the data fabric; the whole idea is that we essentially provide a scalability layer that runs on commodity boxes and that's dynamic, meaning I can grow this layer as I need to. This layer sits between my existing applications, whether those are written in any of the common languages that you see here on the chart, and my existing data stores, which today could be diverse. We may have relational databases, we may have NoSQL, you may have data in Hadoop. Again the point is let's take the data that's important, let's load this into the data fabric, and serve it up to our applications very rapidly. The key is we've now decoupled ourselves from an application perspective; we've decoupled ourselves from depending on the performance of those underlying storage tiers. So you can leverage the investments you have, use what you want to use at that tier, but achieve flexible scale and much, much better performance.
So there's essentially four components here, and we'll dive into them a little bit during the presentation: the data grid piece, the cluster and compute capabilities, and the streaming, so those three are really one integrated product. Of course you can choose to use or not use the different pieces as you see fit, but that's really one product. And then we have the Hadoop accelerator, which is built on much of the same technology but is kind of a stand-alone piece, so we'll treat that a little bit separately, but we'll cover that as well.
Now for those of you that don't know, before we start talking in specifics about several of the features, we've had an exciting development at GridGain: towards the end of last year our open source project, which we've had for quite some time, was accepted into the Apache Software Foundation, so this is going to be a project called Apache Ignite. But it does not change our approach to enterprise versus open source software, so it's still the same approach we've always had; we have that open source offering, which will now be an Apache project, which is very, very exciting.
And then we also have our enterprise software, so the key is the open source is there, it's free to use, do what you want with it, but the enterprise edition adds a lot of value if you're gonna deploy this in a production scenario. So what does that mean? Again we'll talk in more detail about some of these features, but just a quick chart: you get all the basic functionality in the open source, so the four pieces that we just spoke about are present. But there are a lot of features, as you can see, that we add; I won't go through these, but the theme here is manageability, security, high availability, those types of things that you care about when you're dealing with production data and production applications in an enterprise setting.
Along those same lines, if you're thinking about applying this type of solution there are a few things that I would suggest you might want to think about, and I sort of lumped them into these six broad categories. The first is performance; obviously you're gonna gain performance, but I bring this up because performance means different things to different people. The important point I think is to understand what performance means to you: is it about throughput, is it about latency, is it about being able to achieve approximately the same performance that you have now but with a two times or ten times larger data set or user base? So performance could mean different things, so number one, understand what it is that you're aiming for. Second I would say is scalability; again, scalability should be provided by a solution and certainly would be by GridGain. But a few things to think about: can I scale horizontally, can I scale vertically, do I have good monitoring tools to understand how I'm using the capacity that I have so that I can plan ahead?
Consistency: anytime you have data in a distributed system you want to think about consistency. What is the consistency model, are there choices that I'm offered, are there trade-offs? You want to think about that. Persistence: with any in-memory solution, of course, memory itself is volatile. Now in our approach we keep multiple copies of the data in memory so you have high availability, but you may also want persistence for a complete outage scenario. So what are the options there? Another thing to think about: is this important to you, what is it that you might want to take advantage of in that respect? And then the last two pieces: transactions, meaning do you want to transact on this data and what are the options there; and last but not least, search and query capabilities, so if I need to query this data set that's now in memory, how do I go about that, what are the available options?
So with that said let's jump into the next piece in terms of benefits. Along these same lines you should be able to achieve high throughput and low latencies. You should be able to achieve easy, flexible, cost-effective scalability, you should have high availability as well, and in our case fully ACID-compliant transactions in different flavors, plus lots of different persistence options, which we'll talk about, and good options for security, which of course is one of the themes around the enterprise edition as well. Also, for those of you that don't know, we do support a lot of standard approaches: we are very, very soon announcing support for JCache, the JSR-107 standard. We already support many of the standard constructs, so for those of you that are Java or .Net developers or have those folks on your team, things like maps, queues, executors, we support those things. From a SQL perspective our SQL is standards compliant; we have support for things like joins and custom SQL functions, things that many other products of this type do not necessarily support. And as I hinted at earlier we do support several different languages, so Java, .Net, C++, Scala; we also have a REST interface, so if you're not using one of those languages you can typically leverage that REST interface for easy connectivity.
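To make the JCache point concrete, here is a minimal sketch against the JSR-107 API itself. Because it is the standard API, nothing in it is GridGain-specific; whichever compliant provider is on the classpath determines which grid actually serves the cache. The cache name and key/value types are illustrative assumptions, not taken from the talk.

```java
import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;

public class JCacheSketch {
    public static void main(String[] args) {
        // Resolve whichever JCache (JSR-107) provider is on the classpath.
        CacheManager mgr = Caching.getCachingProvider().getCacheManager();

        MutableConfiguration<Integer, String> cfg =
            new MutableConfiguration<Integer, String>()
                .setTypes(Integer.class, String.class);

        Cache<Integer, String> cache = mgr.createCache("accounts", cfg);

        cache.put(1, "Alice");            // stored by the provider, not a local map
        System.out.println(cache.get(1)); // served from memory
    }
}
```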
Now if you're interested in trying out the product, or if you're thinking about installing it, what should you expect? First of all, when you download the software it's basically one big zip file, so it's very easy. It includes sample configuration files, it includes the start scripts that you would use, and there are lots of code examples that we ship with as well, so it's very easy to use. You just download the zip file, you unzip it, you can run some examples, you can run some scripts and kind of try different configurations. You can do it on your laptop; you can do it in the Amazon cloud. If you're looking at compute functionality you can write code and have it deployed on the grid automatically, so you're not having to do a bunch of file management. So again, the point being here, it's very, very easy to get started with. With that said, if you think back to that picture of the data fabric, we had the three components on the left, the data grid, the streaming, and the compute, and we want to go ahead and start drilling into those now.
So the first piece here is the data grid itself. Now the data grid, as you would expect, is a distributed in-memory key-value store: all the data is in memory and it is fundamentally a key-value structure. There are many different types of data stores or caches that you could use, so we can have local, replicated, or partitioned data sets. Now 90 percent of the time you're gonna be talking about partitioned data sets; typically the point of implementing a solution like this is that you want to distribute work and distribute data across lots of nodes in a clustered fashion. In that partitioned scenario we have high availability provided for that data: by default we keep two copies of the data in memory, and you can change that number of course.
But that means that we have one primary and one replica, so you can obviously lose some nodes; as long as you don't lose the whole grid you're not gonna lose data. If you're interested, talk to us, we can go into quite a bit more detail about how we do that, but suffice it to say that some of the primary data and some of the redundant data is going to live on every machine. So if you lose one node, the little bit of data that was on that particular node also resides on a whole bunch of other nodes, so those other nodes can pick up a little bit of extra work and you have no downtime, no outages, no major performance impacts. That's all automatic, automatic failover and recovery, and of course the same thing works in reverse: if you want to add capacity you add some nodes, and we'll automatically copy some data over there. Once that's done we'll bring them up, they'll become live, they'll start doing work, and they'll be participants in the overall grid, so none of that requires any downtime or any outages or anything like that. We also provide strong consistency for data; our philosophy is that we want you to trust the data that's in the grid the same way that you would trust data that is in a database. So we're always gonna provide strong consistency, and we support transactions with different flavors, different isolation levels. Again, you'll be very comfortable, very familiar with this based on your experience with existing databases.
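Here is a minimal sketch of the partitioned, highly available cache described above, using the Apache Ignite-style API (the open-source core of GridGain); the cache name, key/value types, and values are invented for the example.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.transactions.Transaction;

public class PartitionedCacheSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) { // starts a node / joins the grid
            CacheConfiguration<Long, Double> cfg = new CacheConfiguration<>("positions");
            cfg.setCacheMode(CacheMode.PARTITIONED);                // spread data across nodes
            cfg.setBackups(1);                                      // one primary + one replica copy
            cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL); // enable ACID transactions

            IgniteCache<Long, Double> cache = ignite.getOrCreateCache(cfg);

            // Database-style transaction over in-memory data.
            try (Transaction tx = ignite.transactions().txStart()) {
                cache.put(42L, 1_000_000.0);
                tx.commit();
            }
        }
    }
}
```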
We also support SQL, as we mentioned earlier; pretty much any standard SQL queries that you want to do for reporting or analytics, you can do. We even provide a JDBC driver, which can also be wrapped with ODBC if you want. We support read-through and write-through to the existing database, and this is kind of a key piece. When we talked about the data fabric we talked about the fact that we're sitting between applications and existing data stores, and you can continue to leverage the data stores that you have but you no longer have to worry about their performance. The way that works is through this read-through, write-through mechanism, or it could be write-behind as well. This is an extensible piece; we're not tying you to any particular database or solution, it's fully extensible, you can pretty much plug in any system that you want underneath it.
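A hedged sketch of both ideas in the Apache Ignite-style API: the read-through/write-through flags that delegate cache misses and updates to an underlying store, and a standard SQL query over the in-memory data (which assumes the cache has been configured with queryable types). The table, columns, and the PositionStore stub are hypothetical placeholders, not part of the product.

```java
import java.util.List;
import javax.cache.Cache;
import javax.cache.configuration.FactoryBuilder;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.configuration.CacheConfiguration;

public class SqlAndStoreSketch {
    // Hypothetical store: the extensibility point where any database plugs in.
    public static class PositionStore extends CacheStoreAdapter<Long, Object> {
        @Override public Object load(Long key) {
            return null; // e.g. SELECT ... WHERE id = key against your database
        }
        @Override public void write(Cache.Entry<? extends Long, ?> entry) {
            // e.g. INSERT or UPDATE the row for entry.getKey()
        }
        @Override public void delete(Object key) {
            // e.g. DELETE the row for key
        }
    }

    // Wire the cache to the existing database via the pluggable store.
    static void configureStore(CacheConfiguration<Long, Object> cfg) {
        cfg.setReadThrough(true);   // cache miss -> load from the database
        cfg.setWriteThrough(true);  // cache update -> write to the database
        cfg.setCacheStoreFactory(FactoryBuilder.factoryOf(PositionStore.class));
    }

    // Standard SQL against the in-memory data set.
    static void report(IgniteCache<Long, Object> cache) {
        SqlFieldsQuery qry = new SqlFieldsQuery(
            "SELECT trader, SUM(value) FROM Position GROUP BY trader");
        for (List<?> row : cache.query(qry))
            System.out.println(row.get(0) + " -> " + row.get(1));
    }
}
```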
One key piece of the data grid, when we talk about scalability, is the off-heap support, and this is pretty important. Those of you that work in Java, or in .Net where the challenges are a little different but similar, you have to think about your heap size, you have to think about avoiding garbage collection; this is a real killer for performance, which is what we're after here. So if you want to store large amounts of data and you're lacking some sort of off-heap capability, all you can really do is run lots and lots of processes; each process can only use a little bit of memory, so I have to run a lot of them, and it becomes a real management headache if you get into dozens and hundreds of processes. The much more elegant solution is to support off-heap memory, which is what we do, so you can run a single process on a single box. Say it's a 256 gig box, for example: I can run a modestly sized heap and then use that full amount of RAM with that single process via the off-heap support, so it's very easy to manage, and it really gives you flexibility on the vertical scale as well as the horizontal scale that we already talked about. If you're thinking about enterprise deployments, this is one of those things where you want to think about whether to scale more horizontally or more vertically, and of course there are trade-offs: how many CPUs do I want to put on the problem versus just how much raw storage. Those are the types of considerations, and of course we can help you think through that process.
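A hedged sketch of off-heap configuration in the Ignite 1.x-era API that matches this webinar's timeframe; later versions replaced these exact calls with an off-heap-by-default page memory model, so treat the method names as assumptions about that generation of the product.

```java
import org.apache.ignite.cache.CacheMemoryMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class OffHeapSketch {
    static CacheConfiguration<Long, byte[]> offHeapCache() {
        CacheConfiguration<Long, byte[]> cfg = new CacheConfiguration<>("bigData");

        // Keep entries outside the Java heap so the garbage collector
        // never has to scan them.
        cfg.setMemoryMode(CacheMemoryMode.OFFHEAP_TIERED);

        // Let one process on a big box address, say, ~200 GB of RAM
        // while running only a modest on-heap footprint.
        cfg.setOffHeapMaxMemory(200L * 1024 * 1024 * 1024);

        return cfg;
    }
}
```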
Now the next piece here is really, I think, the key differentiator for GridGain: we have a very rich compute capability against this data as well. So it's not just about serving data more quickly, although of course that is an important piece; it's also about being able to take compute to that data and chew through it, and we'll show you a picture of that in just a moment. But let me first talk about the compute API itself. This is a very rich API, and it's been part of our product since day one, so this was not an afterthought or something that came later; it was a core part of the fabric from the beginning. As a result it's a very rich set of APIs: we support many different paradigms of distributed computing, and we have an API for map-reduce if that fits your particular use case.
But we also support many other approaches: we have messaging, kind of a lightweight messaging support in the APIs, so if you want to do MPI-style applications, if you want to do MPP-style applications, you can do all of those things. And there's some pretty cool technology in there, like zero deployment, which I kind of hinted at earlier. If you think about running code in a distributed fashion across a grid, even if you've got 5 machines or 10 or 15 or 100, if I want to test some code I don't want to have to build jar files and copy those around to a whole bunch of different machines every time I want to try a new line of code. It's just too painful, so our zero deployment technology is there to address that.
If you're writing code in a development environment you can hit run, and your client machine, your laptop, whatever you're working on, would join the grid; it would send the computations that you've created out to the grid dynamically and then return the results back. So there's no work you have to do, you're not having to move files around; it's a huge time saver in development. In practical terms, what are we talking about here? We're talking about running compute tasks in the grid, so to use the Java example we're talking about things like closures, callables, runnables, those sorts of things; that's what we're executing in the grid. Typically it's event-based, but we do support scheduling, so if you want to schedule tasks, whether it's background tasks or maintenance tasks or those sorts of things, of course we support that as well, with sort of a cron-style scheduling.
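As a concrete illustration, here is a minimal sketch of shipping closures to the grid in the Apache Ignite-style API; with peer class loading enabled this matches the "hit run and the code travels" experience described above, though the lambdas and values are invented for the example.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class ComputeSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // A runnable broadcast to every node in the cluster;
            // the closure itself is shipped automatically, no jars copied.
            ignite.compute().broadcast(
                () -> System.out.println("Hello from a grid node"));

            // A callable executes on some node and returns a value here.
            Integer answer = ignite.compute().call(() -> 6 * 7);
            System.out.println(answer); // 42
        }
    }
}
```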
We also have all of the infrastructure underneath that you would expect, so the automatic load balancing, the automatic failover, all those things are part of the underlying infrastructure of the data fabric. You can choose where to run tasks if you have specific requirements based on types of machines or available resources, that sort of thing. Or you can let us make the decision automatically: find the machine that has the most free CPU cycles and execute the task there. So you have a lot of control over how this works.
And the last piece that I'll mention here is pluggability, so this is another fundamental difference that I've seen with GridGain versus a lot of other products. In the architecture of GridGain we thought about not just configurability but also extensibility from the beginning. This doesn't really apply just to compute, it applies to the whole product, but we mention it here because it's a really, really nice feature. You can take advantage of this in a couple of different ways. There are probably 14 or 15 different subsystems, these subsystems governing different aspects of behavior, things like discovery, load balancing, authentication, lots of different things like that. For those particular subsystems we typically define what we call a service provider interface, and we give you multiple implementations of that interface so you can choose different types of discovery, for example, which is how nodes find each other within the grid. But if for some reason none of the provided implementations is exactly right for you, you're free to write a little bit of code and implement custom behavior, which I think is very, very cool.
When we tie these concepts together, this very, very fast access to data and this ability to do compute, it brings us to this picture. We think this is a really important concept and the way that all of these cases ultimately are going to evolve. If you start with the picture on the left, that's what I would call the standard approach to caching. You have application servers, and instead of going to a disk-based database and getting data, I'm gonna go to an in-memory solution, hopefully a grid. That data's much faster, it's all in memory, it's already in object form so I spend less time converting the data into something that the application can use; that's all great, tremendous performance benefits, one or more orders of magnitude, so that's a great place to start and that's what most people do. But if you think about those data sets growing, and we have customers today doing dozens, even a few hundred, terabytes of data in these types of solutions, at a certain point moving the data across the network doesn't make a lot of sense. If you've got terabytes and terabytes of data I can't move that across the network fast enough to meet the kind of SLAs that we typically are aiming to meet. So the only other solution is to once again turn this picture around, and that's the picture on the right: instead of moving data across the network to an application server and working on it in one place, I'm essentially gonna have that application server make a request to the grid to do some work. I'm gonna send computations out to the grid, let them work in parallel on lots of machines against local in-memory data sets, and then all I have to do is send back a few results; they get aggregated, and I get a single unified result that was executed in parallel against local in-memory data. It doesn't get much faster than that, so as data sets grow over time we think this is the natural evolution.
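A minimal sketch of that "send the computation to the data" pattern, assuming the Apache Ignite-style affinity API; the cache name, key, and value type are invented for the example.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class CollocatedComputeSketch {
    // Runs the closure on whichever node holds the primary copy of the key,
    // so the read below is a purely local, in-memory lookup.
    static Double portfolioValue(Ignite ignite, long portfolioId) {
        return ignite.compute().affinityCall("portfolios", portfolioId, () -> {
            IgniteCache<Long, Double> cache = Ignition.localIgnite().cache("portfolios");
            return cache.localPeek(portfolioId); // no network hop for the data
        });
    }
}
```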
So with that out of the way, that brings us to the focus for today, which is enterprise features, and here's the first of the enterprise features that we offer as it relates to the data grid. This is something that's important for a lot of enterprises, especially large enterprises: you're not running in one data center, you're gonna run at least a couple. You have a primary data center, maybe a DR or a backup data center, or you may have several, perhaps in different geographic locations to serve users in particular countries or geographies. In that scenario what you want to be able to do is run a grid in each of those data centers, each of those geographic locations, and then keep data in sync between them. So we have a feature for that; that's what our data center replication feature does. It allows you to connect multiple grids together in different types of topologies, and it can be active-active or active-passive. As I mentioned, it doesn't necessarily have to be two data centers as we see here, it can be quite a few, so it's very flexible. And this is actually accomplished without any third party products; it's entirely done with the GridGain software.
The next piece that's very important in many enterprise environments is cross-language capabilities. In large enterprises you typically have a variety of different languages in use, typically Java, some .Net, maybe C++ in the mix, and you may even have some others. So we've put a lot of work over the past year or two into really enhancing this cross-language capability because we think it's an important feature. You have to be able to address questions like how do I share data between applications written in these different languages, and our answer to that is called the GridGain portable object. This is a compact binary storage format; it's essentially raw binary data with a little bit of metadata attached to it. We can store this very compact binary format in the grid, and all of our client APIs in the different languages know how to work with this format, so any client API can create one of these objects, modify it, and store it in the grid. And by the way, when you're using this format, since it begins life as binary, you're avoiding any serialization issues, so it performs very, very well. The other nice thing about it is if you should need to turn this data into a real Java object or a real .Net object, you can do that; the API knows how to do that as well. I can retrieve a portable object from the grid and then turn it into a real Java object, work on it, put it back, and the same thing with .Net, so it's very flexible.
It also gives you the nice advantage, even if you're using a single language, of being able to change your schema at any time. You can think of these portable objects as a generic object with any arbitrary getters and setters that you want. If you go to set a property that didn't exist before, then we'll just add it. If you go to retrieve a property that wasn't set in a particular object, we'll just return null, so there are no errors, there's no class versioning, none of that stuff that you have to worry about. So that is another enterprise feature that many of our customers appreciate.
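The portable objects described here later surfaced in Apache Ignite as the binary object API; this is a sketch under that assumption, with the type and field names invented for the example. Note how a never-set field reads back as null instead of raising a schema error.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.binary.BinaryObject;

public class PortableObjectSketch {
    static void demo(Ignite ignite, IgniteCache<Long, BinaryObject> people) {
        // Build a schema-free object: no Person class exists on any classpath.
        BinaryObject person = ignite.binary().builder("Person")
            .setField("name", "Alice")
            .setField("age", 30)
            .build();

        people.put(1L, person);

        BinaryObject stored = people.get(1L);
        String name = stored.field("name");   // "Alice"
        Object email = stored.field("email"); // null: field was never set, no error
    }
}
```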
The next piece is what's called the local restartable store, or recoverable store; we use both terms a little bit interchangeably. In order to explain what this is, let me first explain what you have in open source. What you have in open source is the piece that's above the line here: you essentially have three storage tiers, on-heap storage, off-heap storage, and then optionally, if you choose to turn it on, a swap file on disk, much like an OS swapping to disk when you're out of memory, the same idea. It's something that I as an architect would try to architect such that I don't need it, but it can be there as a fallback; obviously from a performance standpoint we don't want to hit disk, that's kind of the whole point.
Now that above-the-line swap piece is temporary, it's just an overflow; it does not persist across restarts, et cetera. The piece that we add for enterprise customers is the piece below: it essentially takes all the data that's in memory and persists it to a local disk. Now some of our customers may already have this data in a relational database, and they may say, I know that there's high availability in the grid, I know I've got multiple copies, I can lose some servers, I'm not gonna lose data, that's enough for me. Of course there's still the unlikely scenario where you lose your whole data center for some reason, you've lost every single node in the grid and a bunch of other things besides, and then you could lose that data; if you have it in a database and you can go get it, that's fine for some people. But for some of our enterprise customers, think about that scenario for a minute: my whole data center is down for whatever reason, whether it's a planned outage or unplanned. When I'm bringing everything back up, I'm restarting my databases, I'm restarting my application servers, I'm letting users back into the system, what's happening on the database at that point?
It's hammered already; I don't necessarily want to hammer the database further by going and loading a large amount of data to repopulate my data grid or my data fabric. For that particular scenario, that's where this local restartable store comes in: if the data's on local disks in the GridGain grid, you don't have to go across the network, you don't have to hit the database, and it's a much better recovery scenario, especially if you're interested in rapid recovery. Another feature that I'm sure many of you in enterprise deployments are thinking about is security; this has been another area of focus for us, so we support a number of different standards. As you can see we have pluggable authentication and authorization, which is what we mean by "auth and auth" up there. There's a number of things you can do: you can authenticate nodes to be able to join the grid, so you can ensure that only trusted nodes join, and you can authenticate clients as well. In the case of the REST interface, which we mentioned earlier, you can also maintain a session so that you're not having to authenticate every time you want to make a request of the grid. So that is another set of enterprise features as well.
There's a couple more. One thing that can happen in a data center environment is network interruptions; hopefully you don't have this problem frequently, but every once in a while it'll happen, a switch goes haywire or somebody unplugs the wrong port, that sort of thing. What can happen if you have a data grid or a data fabric or a product like this, where data lives in memory across lots of machines? If some of the machines suddenly can't talk to some of the other machines, yet both groups can still talk to you, or perhaps to some other clients or applications, you potentially have a problem. This is typically known as the split-brain problem. What I don't want is for each of these segmented grids to think that it's the only one, that it's the master, and allow data to be changed within it, and then suddenly, when they come back together and the network connectivity is restored, I have data drift and two different data sets that can't be reconciled. So that's what we don't want, and this feature, the network segmentation protection, is intended to prevent exactly what we just described. The idea here is any time there is a network interruption, (A) we detect it, and then (B) we run a series of tests or checks that you define to determine what a good segment is. You can determine this based on other resources that are available on the network, other things that may or may not be able to be reached. If one of these segments is able to successfully pass all the tests, it is identified as the good segment, the correct segment. The other segment would then not allow any data changes, so you're guaranteed not to run into that situation where you have two diverging data sets. And again, in an enterprise setting that's certainly a scenario that you want to be able to avoid, so that's a pretty important enterprise feature.
The next piece to me is also hugely important; it's not the kind of thing that you necessarily think about until it comes time to do it, but think about upgrading the data fabric itself, going from one version to the next version of the data fabric. What you'd like to be able to do, of course, is do this without downtime: I'd like to take a couple of machines out of my grid, upgrade those, put 'em back in the grid, and then continue through until I've upgraded every machine. If you think about what I'm doing during that process, as I work my way through the various machines in the grid, I'm actually running two different versions of the software together. For our enterprise customers we support that, so we add this backwards compatibility to ensure that you can successfully do that. In the open source that feature just isn't there, so you essentially would have to shut down the grid, upgrade everything, bring it back up, and repopulate the data. Again, for our enterprise customers we don't want you to have to do that, and that's why this feature is there.
So we've touched on most of the enterprise features; there's a couple more that we'll get to in a few minutes, and we'll also get to a little bit of practical advice along those lines as well. But as we've covered a lot about the data grid and talked a little bit about compute, let's cover the other two pieces that we haven't touched on, which are streaming and Hadoop. Conceptually these two are pretty straightforward. The streaming component is really intended to do two things. We believe if you're dealing with a streaming data set, the first thing you have to be able to do is keep up with the data. So the first assumption is I can't slow down a stream, I can't stop it, I can't make it pause until I'm ready; I have to keep up with it in real time, and that means I need horizontal scalability. And typically I can't just take the data from the stream the way that it is; it'd be nice, but in most cases I have to do some level of processing on that data. It may be fairly simple: cleanup, filtering, enrichment with metadata, these are the types of things that we typically see folks need to do to a stream. So you want to do that work, and you want to do it in a horizontally scalable fashion; that's the first piece that we offer within the streaming capabilities. It's another API set within the data fabric, and part of that API is this concept of workflows: I define, through a combination of configuration and code, a series of steps that becomes a workflow. Those steps can execute any arbitrary code that you wish, and that workflow gets executed on the grid. So that's the first step: I'm ingesting data, I'm cleaning it up, filtering it, adding metadata and so on.
The second bit of functionality within our streaming component is the ability to do some real-time analytics. So if I'm processing my streams and keeping up with the flow, that's step one; step two is I want to extract intelligence from those streams, I want to look for things that are interesting. To support that we have this concept of sliding windows, windows over events. You can have as many of these as you want; typically they're time-bound, so you might define a 5-minute window, a 15-minute window, a 2-hour window, et cetera. Within those windows you can bring events from those streams into the windows, the data will get indexed, and of course I mean an in-memory index here, and then you can run continuous queries.
So as events come in they're indexed, we'll look for any of the continuous query criteria that you've predefined, and you can even do ad hoc queries later for that matter, but the idea is to look for things that are interesting to me for some reason. They have a characteristic, a property that exceeds some threshold that I've set, or deviates from some moving average, or matches some pattern that I'm looking for. Ultimately the point is we find those events and you get a callback, so that your application can do whatever it wishes with the events that have been found. Typically that means you might notify somebody, put it on a dashboard, put a message on a queue, drop something into a reporting table, those sorts of things. So that's really the gist of the streaming component.
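A sketch of that callback pattern using the continuous query API in Apache Ignite, a reasonable stand-in for the capability described: the remote filter runs on the nodes holding the data and keeps only "interesting" events, while the local listener is the application callback. The threshold and cache shape are invented for the example.

```java
import javax.cache.Cache;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.cache.query.QueryCursor;

public class ContinuousQuerySketch {
    // Registers a continuous query; the returned cursor must stay open for
    // callbacks to keep flowing, and closing it unsubscribes the listener.
    static QueryCursor<Cache.Entry<String, Double>> watchForSpikes(
            IgniteCache<String, Double> readings) {
        ContinuousQuery<String, Double> qry = new ContinuousQuery<>();

        // Runs remotely, next to the data: keep only events over the threshold.
        qry.setRemoteFilter(evt -> evt.getValue() > 100.0);

        // Runs in the application: the callback for matching events.
        qry.setLocalListener(events ->
            events.forEach(e ->
                System.out.println("Spike: " + e.getKey() + " = " + e.getValue())));

        return readings.query(qry);
    }
}
```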
Those three, the data grid, the compute, and the streaming, are the core pieces of the data fabric, and that brings us to the fourth piece, which as I mentioned at the beginning is kind of a stand-alone component: our Hadoop accelerator. The Hadoop accelerator builds fundamentally on a lot of the same technology that we've talked about, but is essentially a stand-alone piece that can drop into a Hadoop environment. What it is is an in-memory file system, so we call it the GridGain file system. This typically would sit on top of HDFS; it implements the same API as HDFS, in fact, so it looks like HDFS to the other Hadoop components. It just plugs in as this file system and essentially becomes an intelligent caching layer on top of HDFS. The idea is you run a GridGain process on your Hadoop data nodes; that's how we grab memory on the appropriate data nodes, and we can preserve that data locality. Then in your configuration you point to this GridGain file system instead of HDFS. From that point you tell us what data you want cached, and you access that data whether through MapReduce or Hive or whatever other standard Hadoop tools.
If you access data that you've told us should be cached, we load that into memory using something that's derived from data grid technology. We cache at the block level, and we'll store as much data as we can within the space that you've told us to use. Once that's full, of course, we have intelligent eviction to get rid of the least frequently used, the sort of coldest data, and preserve the hottest data in memory, and if we need something later we go back to HDFS and get it. The other piece is what happens when you access data that you have not told us to cache: well, we simply proxy that call straight through to HDFS, almost as if we weren't there. So it's very seamless, very transparent, it works with any of the major Hadoop distributions, and again it's very, very easy to get up and running.
Now there is one additional piece of the Hadoop accelerator which I'll cover quickly; there's a particular use case, and this came from work with our prospects and customers around the Hadoop accelerator. We found that for certain folks, if you're trying to run map-reduce jobs very, very quickly, with low, sub-second latencies, that's kind of a challenge in Hadoop. Hadoop has a wonderful architecture for extreme scalability in terms of storing large amounts of data, petabytes of data, but the design focus there was really around resiliency, redundancy, and that sort of thing, not so much real-time performance. There are improvements being made in that respect, but that is essentially the original design, and given that, there are a lot of these different processes, they all have to talk to each other, and new processes are spun up for new jobs; that all just takes a certain amount of time. So for example, if you have a job that you want to run that's gonna take one second and it takes 30 seconds to actually start the job, that's just pretty painful. We quickly realized that we could offer a solution in that particular scenario through our map-reduce acceleration, and it's kind of a nice side effect, right, because the GridGain file system is the prerequisite for this piece, so we're assuming that you're storing data in the GridGain file system. Given that there's some data grid technology under the hood, and our data grid includes our approach to map-reduce, we realized that we could basically plug in our execution path, and in Hadoop 2 you can plug it in under the standard map-reduce APIs, so there are no code changes. Essentially we short-circuit all that overhead, so instead of having the different processes, name node, job tracker, task tracker, all talking to each other and spinning up new processes, you move to an execution path that's point-to-point. It's two already-running processes with an already-open connection: we just send the request across, we run the job, we send the result back, much, much faster. That's how you can get to sub-second latencies for map-reduce.
So I promised a couple more enterprise features, and the one here is probably one of the key pieces, at least from my perspective: monitoring. No matter what enterprise software you're working with, and I've worked with a lot of it over the years, one thing is always true for me: I want to have good visibility, I want to see what's going on under the hood, understand performance, and have the tools I need to understand what the solution is doing for me and how well things are running. That's something at GridGain that we believe in, and we've put a lot of work into our tool. It's called Visor, and it's really great both from an operations and a development perspective. From the operations side, the first thing you'd see in Visor is a dashboard: you're gonna see all these different real-time metrics, you're gonna see all the nodes in the cluster, and you have the ability to start, stop, and restart nodes, to search the logs on the machines, and to look at environment variables and our configuration.
So it's all the kinds of things that you would otherwise have to remote into all these machines for, grepping logs and editing or reviewing config files. All those types of things you can now do just through Visor, through a nice graphical interface. That's great in and of itself, but it actually goes quite a bit deeper than that: depending on the functionality that you're using there are dedicated screens, even more than we list here, for each type of functionality. If you're using streaming, for example, which we were just talking about, you'll see all the streaming stages that you have defined, the different processes, where they're running, what the throughput and latency is, those sorts of things. In the case of the data grid you'd see all the data caches that I have defined, how many objects are stored there, how much memory each one is using, what my hit ratios are, my read/write ratios; all those kinds of things will be right there at your fingertips. So again there's a lot of functionality here; we have some screencasts on our website if you want to see all the different bits and pieces. There are monitoring capabilities, I should say monitoring and alerting, so there's lots of stuff there that you can learn more about on our website.
So I think we're coming up close to the 45-minute mark and we want to leave some time for questions, so just a couple more points here, and I think we're pretty much right on time. First, we promised some practical recommendations, so if you're thinking about deploying this in an enterprise environment, we've already highlighted a lot of the features that you may want to take advantage of. To put this in practical terms I would recommend a few things. Number one, authenticate your grid nodes, so be sure who's joining the grid. This can be as simple as just putting a password on it; it can be much more complex with the different features that we looked at earlier, but at least put a password, and make sure that you don't have nodes joining that you didn't intend to have join the grid. Going along with that, I always assume that you're also going to use appropriate firewall rules, so treat the grid fairly similarly to the way you would a database: a database typically is not gonna live in the DMZ, for example, and a grid would be the same way. A database is not going to be arbitrarily restarted, and nodes in a grid would be treated the same way, so set those expectations, and especially for those of you that are developers, set those expectations with your ops teams.
The next piece is to size appropriately. This is very, very use case dependent, and of course this is the type of discussion that we can help with, and personally I always enjoy digging into different use cases, so feel free to reach out to us. But let me give you a rule of thumb. My rule of thumb is start with the data set size that you actually want to hold in memory, so the first question is: what's my hot data set? I may have terabytes or petabytes of data in my database, but what's actually important, what do my users need, what are my applications hitting day after day, minute after minute, as my users are using this system? First you identify that hot data set, then you ask how big it is; in my example we're gonna say it's 500 gigs of data. What I typically do from there to start planning capacity in the grid is, first of all, double it, because we're providing HA in memory, so we've got two copies of the data, so you gotta double it. Then for the next piece I always want to add some metadata and growth factor: with any type of data storage system there's always gonna be some metadata, some data about the data, that we have to keep track of, plus some operational overhead. And of course you don't want to start with a cluster that's at 100 percent capacity, you want to leave a little room for growth, so I lump all that together and I say let's add 30 percent; it's a good rule of thumb. So in our example we started with 500 gigs, we doubled it, we added 30 percent, so we're at 1,300 gigs, and that's the size cluster that I would want to build. Now again, we have customers that maybe only have a couple of gigs of data and it's a very small cluster; on the other hand we have customers with terabytes, and it would be much bigger than this.
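The arithmetic is simple enough to encode as a helper; this tiny sketch just restates the speaker's rule of thumb (double the hot set for one in-memory backup, then add roughly 30 percent for metadata and growth headroom), with the percentages as tunable assumptions rather than product requirements.

```java
public class CapacityRuleOfThumb {
    /**
     * hotDataGb      - the hot data set you actually want in memory
     * backups        - in-memory backup copies (the talk assumes 1)
     * overheadFactor - metadata + growth headroom (the talk suggests 0.30)
     */
    static double requiredClusterMemoryGb(double hotDataGb, int backups, double overheadFactor) {
        return hotDataGb * (1 + backups) * (1 + overheadFactor);
    }

    public static void main(String[] args) {
        // The worked example from the talk: 500 GB * 2 * 1.3 = 1,300 GB.
        System.out.println(requiredClusterMemoryGb(500, 1, 0.30));
    }
}
```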
And of course you can think about tweaking what the growth factor is based on your own expectations, so there are variables there, but it's a good rule of thumb, a good place to get started. Other recommendations: definitely take advantage of off-heap storage. As we mentioned earlier, from a manageability perspective it's much easier to manage a few large machines than dozens or hundreds of smaller machines. There might be some practical scenarios where you want to have more machines, particularly if you're gonna use the compute features heavily and you want to put more CPUs on the problem. But my preference typically would be to scale vertically first, so get a box of a reasonable size, and of course you always want to start with at least two nodes for HA, preferably three, but then scale out from there.
The next couple of points are about testing. As you're testing, these are scenarios where hopefully we would be in some of these discussions with you about what you're gonna test, and we love to engage at that level. But regardless, I would certainly recommend testing at scale if you can; it depends on how big your scale is. If it's too large to build a testing environment that matches production, and some can, some can't, then I would suggest you at least test at some sort of predictable fraction, one-half scale, one-tenth scale, so at least you can do the math and figure out how it would scale. And the last piece there: test the most likely points of failure, and the most likely scenarios are that you're gonna lose one box or one network connection, those kinds of things.
And then last but not least, we talked about monitoring, so be able to monitor and understand your data utilization patterns. The reason I point this out is we see this in practical terms with customers on a pretty regular basis: you may have an application that has been around for a while and is fairly well understood within the organization. But if you have good monitoring tools to actually look at how the data is used, you may find some surprises. You'll have your expectations of how things are going to work, you'll have expectations around what data you think is important, but the point here is to validate that, validate that those assumptions are correct and that you're seeing what you would expect to see.
So the last piece that we have for you today: as we have those conversations around the kinds of considerations we were talking about, what can GridGain do to help? We love to get involved in proof-of-concept projects, as you would expect; those may include things like demonstrating performance, typically we would of course, but we may demonstrate a number of other things: data consistency, different features, different capabilities that are appropriate to your particular use case. And of course we can support you throughout that process, so that may include things like a support Jira account, it may include the ability to ask questions and get answers, and even help in terms of configuration reviews or code reviews, and whether we need to do that via e-mail or phone or web meetings, all those things are possible. For our enterprise customers these are all the kinds of things that we're more than happy to do, and we'd love it if you would reach out to us and have some discussions about what that might look like for you.
So with that said, we have just a few minutes left and that brings us to the question and answer period. So Dane, if you're still on, I don't know if we had any questions from the group, but if we do, this would be a great time to address those.
Interviewer: Yeah, you bet, we've got a bunch of questions actually. Now, we are coming up to the top of the hour and the webinar will end right at the top of the hour, so we'll get in as many questions as we can, and then we'll definitely answer any questions that we didn't have time to get to after the webinar. So I'm gonna go ahead and jump in; here's one that was asked: how frequently is the local restartable store updated?
Interviewee: So it's transactional, so essentially anytime you're storing a new piece of data in the data fabric, that store on disk is going to be updated as well. So it's guaranteed to be a fully consistent replica, if you will, of the data that's in memory. Good question.
Interviewer: Okay, great. Next question: what happens if you fail to detect an isolated segment and then the grid topology is healed? Did that make sense?
Interviewee: It does. I'm trying to think if there is a scenario where we couldn't detect it. Basically, the way the feature we were talking about earlier works is that anytime there's a topology change we will run those checks, so I really can't think of a scenario where there is a network interruption between certain nodes and we would not see that as a topology change. To us a topology change can mean a number of different things, but at the most basic it means that any node has joined or left the grid. So if some nodes can't talk to other nodes, you're definitely in a situation where, from the perspective of either end, some nodes have left the grid, and all those checks would be run. So off the top of my head I can't think of a scenario where we wouldn't be running those checks and wouldn't be able to detect that something has happened.
Interviewer: Okay great, and here’s another one, so the Visor front end is part of the enterprise distribution correct?
Interviewee: That is correct.
Interviewer: Okay, and the follow-up is: does it use a proprietary protocol to pull metrics from the GridGain nodes, or is all the same information available in the GridGain API?
Interviewee: Gotcha, so the short answer is the same information is available via the API for almost everything you would see in Visor; it's also available via JMX. So you can certainly incorporate those statistics into third party tools; for example a lot of people will use something like Nagios or Hyperic, and you can certainly import those metrics into one of those tools if you wish.
Interviewer: Okay great, and here's another one: with respect to the GridGain map-reduce, is it possible to only install GridGain and not the rest of the Hadoop stack if low latency is the primary requirement for processing, i.e. only the purple boxes from the diagram in the slide showing the GridGain map-reduce execution path?
Interviewee: Great question, and the answer is yes, you can do map-reduce within GridGain without Hadoop, so you do not need all the Hadoop pieces to do map-reduce with GridGain. You would only need those additional pieces if you actually wanted to read data out of HDFS and accelerate the Hadoop flavor of map-reduce, but you can stand up our grid, our data fabric, load data, and run map-reduce against it entirely stand-alone.
Interviewer: Okay great and here’s another one, what kind of high-speed networking do you recommend when running GridGain?
Interviewee: You know, that's a good one, it's a common question. Even a year ago or so I would say pretty much everybody was on Gigabit Ethernet, and that's fine. If you're getting up into the larger data sets, terabytes of data, I would certainly prefer ten-gigabit, and we're seeing more and more people go to that. We've even had a couple of customers go to 40-gigabit; that's great if you can get it, but it's really not a requirement. So unless you're moving huge pieces of data around, typical Gigabit is fine; by default I'd say let's go to ten-gigabit because that's easily affordable in most cases nowadays, but you certainly don't need anything esoteric.
Interviewer: Super, okay, thanks Mac, and we're coming right up to the end of our time. We do have several other questions; again, we'll get to those after the event, but we're gonna go ahead and wrap it up now, so thanks again Mac for all that great information. I hope everyone got a lot out of today's presentation, and as we sign off I did want to remind you about the attachments that I mentioned earlier, and also, if you would be so kind as to rate the webinar and/or leave any comments in the questions area, that would help us to improve the material for future webinars. So have a great rest of the day everyone, and thanks again for attending.