Dmitriy Setrakyan

On Thursday, May 24, 2012, GridGain participated in a discussion hosted by DM Radio and Information Management on the performance requirements of many of today’s applications and services, and on the technologies suited to supporting these environments. Program hosts Eric Kavanagh and Jim Ericson interviewed several leading experts, including:

  • Dominic Delmolino of Agilex
  • Paul Groom of Kognitio
  • Graham Toppin of Infobright
  • Dmitriy Setrakyan of GridGain

Listen to the audio recording on Information Management.

Here are a few real-life facts about the economics of In-Memory Computing that I’ve recently observed when talking to our customers and users, and that put the concepts into simple perspective…

Let’s say you are one of the few dozen startups that sprang up just last week to build a social analytics platform, and Twitter’s firehose is your main source of streaming data. A few facts:

  • Twitter currently produces ~177 million tweets per day globally.
  • Let’s assume that you can store a tweet and all its meta information in roughly 512 bytes.
  • Your typical “working set” is 2 weeks of tweets.

A simple calculation (177 million tweets × 512 bytes × 14 days ≈ 1.27TB) shows that to keep this data in memory you’ll need a cluster with a total of a little over 1TB of RAM.

In other words, you can keep 2 weeks of all tweets in memory as long as you have an in-memory data grid with roughly 1TB of total capacity.

Now – how much does this cost today? You can easily buy a new single Xeon-based blade with 64GB RAM for ~$1500 (on eBay and elsewhere). About 20 such blades cover the working set, which brings the hardware cost for this cluster to roughly $30,000.

Let me say it again: you can buy a new cluster with 1TB RAM capacity for ~$30K today (in the US).
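
The arithmetic behind these figures can be checked with a few lines of Scala. This is only a back-of-the-envelope sketch: it assumes decimal units (1GB = 10^9 bytes), takes the tweet rate, record size, blade size, and price from the numbers above, and the object and value names are made up for illustration.

```scala
object ClusterSizing {
  val tweetsPerDay  = 177000000L // ~177 million tweets per day
  val bytesPerTweet = 512L       // assumed size of a tweet plus its metadata
  val workingDays   = 14L        // 2-week working set

  val nodeRamBytes = 64L * 1000 * 1000 * 1000 // one 64GB blade (decimal GB)
  val nodePriceUsd = 1500L                    // approximate price per blade

  // Total bytes that must fit in the cluster's RAM.
  val workingSetBytes: Long = tweetsPerDay * bytesPerTweet * workingDays

  // Round up: a partially filled blade is still a whole blade.
  val nodesNeeded: Long = (workingSetBytes + nodeRamBytes - 1) / nodeRamBytes

  val clusterCostUsd: Long = nodesNeeded * nodePriceUsd

  def main(args: Array[String]): Unit = {
    println("Working set (bytes): " + workingSetBytes)
    println("Blades needed:       " + nodesNeeded)
    println("Hardware cost (USD): " + clusterCostUsd)
  }
}
```

With these inputs the working set comes to about 1.27TB, which needs 20 blades of 64GB each, for about $30,000 in hardware.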

It will probably take about a day to set up and properly configure GridGain on this cluster, and you’ll have a full-featured in-memory platform with 1TB of RAM data capacity and the parallel computing capacity of ~20 Xeon blades.

Given the fact that modern RAM access is up to 10,000,000 times faster than disk access – you can start analyzing your 2 weeks of Twitter data in real time in no time at all!

That’s the power of in-memory computing…

Grid Dynamics, an eCommerce technology solutions company, and GridGain Systems, makers of the leading open source in-memory platform for big data processing, today announced the expansion of their partnership which began in 2008.

Read the entire press release here.

Lately there has been a lot of noise about “Real Time” Big Data. Many companies that associate themselves with this term are in the analytical space, and to them it really means “low-latency” analytical processing of data that is usually stored in some warehouse, like the Hadoop Distributed File System (HDFS). To achieve this they usually create in-memory indexes over HDFS data, which allow their customers to run fast queries on it.

Although low latencies are very important, they only cover one side of what “Real Time” really means. The part that is usually not covered is how current the analyzed data is, and in the case of HDFS it is as current as the last snapshot copied into it. The need for snapshotting comes from the fact that most businesses are still running on traditional RDBMS systems (with NoSQL gaining momentum where it fits), and data has to be migrated into HDFS at some point in order to be processed. Such snapshotting is currently part of most Hadoop deployments and it usually happens once or twice a day.

So how can your business run in “Real Time” if most of the decisions are actually made based on yesterday’s data? The architecture needs to be augmented to work with LIVE data, the data which has just been changed or created – not the data that is days old. This is where In-Memory Data Grids (a.k.a. Distributed Partitioned Caches) come in.

By putting a Data Grid in front of HDFS we can store recent or more relevant state in memory, which allows instant access and fast queries on it. When the data is properly partitioned, you can treat your whole Data Grid as one huge memory space – you can literally cache terabytes of data in memory. But even in this case, the memory space is still limited, and when the data becomes less relevant, or simply old, it should still be offloaded onto an RDBMS, HDFS, or any other storage. However, with this architecture, businesses can now process both current and historic data – and this is very powerful. Now financial companies can quickly react to the latest ticks in market data, gaming applications can react to the latest player updates, businesses can analyze the latest performance of ad campaigns, etc.
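
The idea of a bounded in-memory tier that offloads older entries to slower storage can be sketched in plain Scala. This is not GridGain code: TieredStore and its methods are hypothetical names, the "cold" map merely stands in for HDFS or an RDBMS, eviction is a simple oldest-first policy, and the class is not thread-safe.

```scala
import scala.collection.mutable

// Hypothetical two-tier store: a bounded in-memory "hot" tier in front of
// a "cold" tier that stands in for HDFS, an RDBMS, or any other storage.
final class TieredStore[K, V](hotCapacity: Int) {
  require(hotCapacity > 0, "hotCapacity must be positive")

  // LinkedHashMap iterates in insertion order, so head is the oldest entry.
  private val hot  = mutable.LinkedHashMap.empty[K, V]
  private val cold = mutable.Map.empty[K, V]

  def put(key: K, value: V): Unit = {
    hot.remove(key) // re-insert so that an updated key counts as fresh
    hot.put(key, value)
    while (hot.size > hotCapacity) {
      val (oldKey, oldValue) = hot.head
      hot.remove(oldKey)
      cold.put(oldKey, oldValue) // offload the oldest entry
    }
  }

  // Serve from memory first; fall back to the cold tier for historic data.
  def get(key: K): Option[V] = hot.get(key).orElse(cold.get(key))

  def isInMemory(key: K): Boolean = hot.contains(key)
}

object TieredStoreDemo {
  def main(args: Array[String]): Unit = {
    val ticks = new TieredStore[String, Double](hotCapacity = 2)
    ticks.put("09:30:00", 101.5)
    ticks.put("09:30:01", 101.7)
    ticks.put("09:30:02", 101.6) // pushes the 09:30:00 tick to the cold tier
    println(ticks.isInMemory("09:30:00"))
    println(ticks.get("09:30:00"))
  }
}
```

Reads of current data never leave memory, while historic data remains reachable through the same get call, which is the split between "live" and "historic" processing described above.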

Here are some of the important benefits our customers get when deploying GridGain In-Memory Compute and Data Grid in the above architecture:

  • Partitioning of data
  • Real Time MapReduce
  • Integration between Compute and Data Grids and ability to collocate your computations with data (a.k.a. data affinity)
  • In-Memory data indexing and ability to run complex SQL queries, including ability to execute “SQL Joins” between different types of data
  • Native affinity-aware clients for Java, .NET, C++, Android (with Objective-C client right around the corner)
  • Tight integration with various storage systems, including any type of RDBMS, Hadoop HDFS, HBase, etc…

Note that in this architecture In-Memory Data and/or Compute Grids really complement warehouse technologies like HDFS – one is for in-memory processing of current data and the other is for processing of historic data.

TriJUG

Come to our talk “Streaming MapReduce with GridGain” and see live coding of Hadoop’s famous example of counting popular words… but in a Real Time context. As always – live coding from scratch in Scala is never dull!

Stop by and say hello to our CTO Dmitriy Setrakyan.

GridGain 4.0.1 Released!

Posted by on Wednesday, April 11, 2012
 Archives, Blog, Product Releases


GridGain 4.0.1 was released this Monday. This is a point release that includes several bug fixes as well as a number of new features.

.NET
With 4.0.1 we are introducing native support for .NET with our C# Client. The C# Client provides native .NET/C# APIs for accessing both GridGain’s In-Memory Data Grid and Compute Grid from outside of the GridGain topology context. Internally it delegates to the REST protocol.

The C# Client is one of many native clients we’ll be releasing shortly, including Objective-C, C++, PHP, Scala, Ruby, and some others we’re already working on.

Improved Support for 32-bit and 64-bit Systems

We’ve modified our scripts for better out-of-the-box support for 32-bit and 64-bit systems. Several clients complained that additional configuration properties were required and, specifically, that GridGain Visor didn’t work fully on 32-bit systems with the default configuration. All these issues have been resolved.

Enhancements to GridGain Visor

We continue to make rapid improvements to GridGain Visor, which is part of the GridGain Enterprise and OEM editions.

We’ve added the ability to specify the time span for chart views.

We’ve added nice in-place filtering for events in the Dashboard.

You can now double-click an event to see its details.

A while back we promised to publish the code from the live-coding GridGain presentation we did at QCon London earlier this year. Since the presentation was in Scala, the code we are posting here is in Scala.

First, a brief intro. We all know Hadoop’s word-counting example, which takes a file with words and produces another file with the number of occurrences next to each word. Hadoop does this example very well; the main caveat, however, is that it is not real time.

The word-counting example we did at QCon actually counted words in real time. The program was split into two parts. The first part was responsible for loading the words into the GridGain data grid in real time, and the second part queried the grid every 3 seconds to continuously print out the top 10 words stored so far.

The example was done using ‘Scalar’ – the GridGain DSL for Scala – but it could have been done in Java as well using the GridGain Java APIs.

Continuously Populate Words In Real Time

Let’s start by continuously loading the data grid with new words. To do that, we downloaded several books in text format and read them concurrently from the populate(…) method, one thread per book. Every word read is stored in the cache, with the word itself as the key and its current number of occurrences as the value. Also note how we let the grid update the cache asynchronously, using an asynchronous run, while we keep reading from the book file (in reality you would most likely have more than one asynchronous job in flight, or have GridGain’s data loading functionality do it for you).

// Standard-library imports used below. The GridGain Scalar names
// (grid$, GridFuture, JInt) come from the imports of the shipped example.
import java.io.File
import java.util.concurrent.{Callable, CompletionService}
import scala.io.Source

def populate(threadPool: CompletionService[AnyRef], dir: File) {
  val bookFileNames = dir.list()

  // For every book, start a new thread and start populating cache
  // with words and their counts.
  for (bookFileName <- bookFileNames) {
    threadPool.submit(new Callable[AnyRef] {
      def call() = {
        val cache = grid$.cache[String, JInt]

        // Future of the previously submitted cache update, if any.
        var fut: GridFuture[_] = null

        Source.fromFile(new File(dir, bookFileName)).getLines().foreach(line => {
          line.split("[^a-zA-Z0-9]").foreach(word => {
            if (!word.isEmpty) {
              // Wait for the previous update before submitting the next one.
              if (fut != null)
                fut.get()

              fut = grid$.affinityRunAsync(null, word, () => {
                // Increment word counter and store it in cache.
                // We use cache transaction to make sure that
                // gets and puts are consistent and atomic.
                cache.inTx(
                  () => cache += (word -> (cache.getOrElse(word, 0) + 1))
                )

                ()
              })
            }
          })
        })

        None // Return nothing.
      }
    })
  }

  // Wait for all threads to finish.
  bookFileNames.foreach(_ => threadPool.take().get())
}

Distributed SQL Query

Now let’s implement our distributed query against the GridGain data grid, which will run every 3 seconds. Note that we are using standard SQL syntax to query remote grid nodes. Interestingly enough, the GridGain data grid allows you to use SQL virtually without any limitations. You can use any native SQL function and even SQL JOINs between different classes. Here, for example, we are using the SQL length(…) function to query only words longer than 3 letters, just to get rid of frequent short articles like “a” or “the” in our searches. We are also using the desc keyword to sort word counts in descending order and the limit keyword to limit our selection to only 10 words.

def queryPopularWords(cnt: Int) {
  // Type alias for sequences of strings (for readability only).
  type SSeq = Seq[String] 
 
  grid$.cache[String, JInt].sqlReduce(
    // PROJECTION (where to run):
    grid$.projectionForCaches(null),
    // SQL QUERY (what to run):
    "length(_key) > 3 order by _val desc limit " + cnt,
    // REMOTE REDUCER (how to reduce on remote nodes):
    (it: Iterable[(String, JInt)]) =>
      // Pre-reduce by converting 
      // Seq[(String, JInt)] to Map[JInt, Seq[String]].
      (it :\ Map.empty[JInt, SSeq])((e, m) => 
        m + (e._2 -> (m.getOrElse(e._2, Seq.empty[String]) :+ e._1))),
    // LOCAL REDUCER (how to finally reduce on local node):
    (it: Iterable[Map[JInt, SSeq]]) => {
      // Print 'cnt' of most popular words collected from all remote nodes.
      (new TreeMap()(implicitly[Ordering[JInt]].reverse) ++ it.flatten)
        .take(cnt).foreach(println _)
 
      println("------------") // Formatting.
    }
  )
}
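
The remote and local reducer stages can be sketched with plain Scala collections, without a grid. This is only an illustration: TwoPhaseTopN, remoteReduce, and localReduce are hypothetical names, and it assumes (as with a partitioned cache keyed by word) that each word lives on exactly one node, so partial results never need to be summed.

```scala
object TwoPhaseTopN {
  type WordCount = (String, Int)

  // Sort by descending count; break ties alphabetically for stable output.
  private def ranked(counts: Seq[WordCount]): Seq[WordCount] =
    counts.sortBy { case (word, count) => (-count, word) }

  // "Remote" step: each node filters and ranks only its own words,
  // mirroring "length(_key) > 3 order by _val desc limit n".
  def remoteReduce(nodeCounts: Map[String, Int], n: Int): Seq[WordCount] =
    ranked(nodeCounts.toSeq.filter { case (word, _) => word.length > 3 }).take(n)

  // "Local" step: merge the per-node results and keep the global top n.
  def localReduce(partials: Seq[Seq[WordCount]], n: Int): Seq[WordCount] =
    ranked(partials.flatten).take(n)

  def main(args: Array[String]): Unit = {
    val node1 = Map("grid" -> 5, "the" -> 9, "cache" -> 2)
    val node2 = Map("memory" -> 4, "data" -> 1)
    val partials = Seq(remoteReduce(node1, 2), remoteReduce(node2, 2))
    localReduce(partials, 3).foreach(println)
  }
}
```

Because every node returns at most n candidates, the local node merges only a small amount of data regardless of how many words the grid holds.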

Start Example

And finally let’s implement our main(…) method that calls our populate(…) and queryPopularWords(…) methods we just defined.

def main(args: Array[String]) {
  // Initialize book directory
  val bookDir = new File(BOOK_PATH);
 
  // Start GridGain with specified configuration file.
  scalar("examples/config/spring-cache-popularwords.xml") {
    // Create as many threads as we have books, so we can use
    // a thread per book to load the data grid concurrently.
    val threadPool = Executors.newFixedThreadPool(bookDir.list.length);
 
    val popWordsQryTimer = new Timer("words-query-worker");
 
    try {
      // Schedule word queries to run every 3 seconds.
      popWordsQryTimer.schedule(new TimerTask {
        def run() {
          queryPopularWords(10) // Query top 10 words from data grid.
        }
      }, 3000, 3000)
 
      // Populate cache with word counts.
      populate(new ExecutorCompletionService(threadPool), bookDir)
 
      // Force one more run to print final counts.
      queryPopularWords(POPULAR_WORDS_CNT)
    }
    finally {
      popWordsQryTimer.cancel() // Cancel timer.
 
      threadPool.shutdownNow() // Stop any remaining worker threads.
    }
  }
}

To execute the example, start several stand-alone GridGain nodes using the examples/config/spring-cache-popularwords.xml configuration file and then start the example we just created from your IDE. You may wish to add more printouts for better visibility into what’s happening.

This example is also shipped with GridGain 4.0 and is available in the GridGain GitHub repository.
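
To experiment with the idea before setting up a grid, the same "load concurrently, query the top words at any time" pattern can be approximated on a single JVM. The sketch below is not GridGain code: WordCounter and its methods are hypothetical, and a concurrent TrieMap with compare-and-swap retries stands in for the transactional cache update.

```scala
import scala.annotation.tailrec
import scala.collection.concurrent.TrieMap

// Hypothetical single-JVM stand-in for the word-count cache.
final class WordCounter {
  private val counts = TrieMap.empty[String, Int]

  // Atomically increment a word's counter, retrying if another
  // thread changed the value between the read and the replace.
  @tailrec
  final def increment(word: String): Unit =
    counts.putIfAbsent(word, 1) match {
      case None => () // first occurrence was stored as 1
      case Some(old) =>
        if (!counts.replace(word, old, old + 1)) increment(word)
    }

  def addLine(line: String): Unit =
    line.split("[^a-zA-Z0-9]+").filter(_.nonEmpty).foreach(increment)

  // Top n words longer than 3 letters, most frequent first.
  def top(n: Int): Seq[(String, Int)] =
    counts.readOnlySnapshot().toSeq
      .filter { case (word, _) => word.length > 3 }
      .sortBy { case (word, count) => (-count, word) }
      .take(n)
}

object WordCounterDemo {
  def main(args: Array[String]): Unit = {
    val counter = new WordCounter
    val books = Seq(
      Seq("the grid keeps data in memory", "memory is fast"),
      Seq("the grid runs queries on memory"))

    // One loader thread per "book", as in the example above.
    val loaders = books.map(lines => new Thread(() => lines.foreach(counter.addLine)))
    loaders.foreach(_.start())
    loaders.foreach(_.join())

    counter.top(3).foreach(println)
  }
}
```

The top method can be called from a timer while the loaders are still running, since it reads from a snapshot of the map.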

GridGain and Hadoop

Posted by on Wednesday, March 28, 2012
 Archives, For Your Information

Over the past few months I’ve been repeatedly asked how GridGain relates to Hadoop. Having answered this question over and over again, I’ve compacted the answer to just a few words:

We love Hadoop HDFS, but we are sorry for people who have to use Hadoop MapReduce.

Let me explain.

Hadoop HDFS


We love Hadoop HDFS. It is a new and improved version of the enterprise tape drive. It is an excellent technology for storing large historical data sets (TB and PB scale) in distributed disk-based storage. Essentially, every computer in a Hadoop cluster contributes a portion of its disk(s) to Hadoop HDFS and you get a unified view of this large virtual file system.

It has its shortcomings too, such as slow performance, complexity of ETL, the inability to update a file that has already been written, and the inability to deal effectively with small files – but some of them are expected, and the project is still in development, so some of these issues will be mitigated in the future. Still – today HDFS is probably the most economical way to keep a very large static data set of TB and PB scale in a distributed file system for long-term storage.

GridGain provides several integration points for HDFS, such as a dedicated data loader and cache loaders. The dedicated data loader allows data to be bulk-loaded into the In-Memory Data Grid, while a cache loader allows for much more fine-grained transactional loading and storing of data to and from HDFS.

The fact that many clients use GridGain with HDFS is a good litmus test for that integration.

Hadoop MapReduce

As much as we like Hadoop HDFS we think Hadoop’s implementation of MapReduce processing is inadequate and outdated:

Would you run your analytics today off the tape drives? That’s what you do when you use Hadoop MapReduce.

The fundamental flaw in Hadoop MapReduce is an assumption that a) storing data and b) acting upon data should be based off the same underlying storage.

Hadoop MapReduce runs jobs over the data stored in HDFS and thus inherits, and even amplifies, all the shortcomings of HDFS: extremely slow performance and disk-based storage that leads to a heavy batch orientation, which in turn leads to an inability to effectively process low-latency tasks… which ultimately makes Hadoop MapReduce an “elephant in the room” when it comes to delivering real time big data processing.

Yet one of the most glaring shortcomings of Hadoop MapReduce is that you’ll never be able to run your jobs over the live data. HDFS by definition requires some sort of ETL process to load data from traditional online/transactional (i.e. OLTP) systems into HDFS. By the time the data is loaded (hours if not days later) – the very data you are going to run your jobs over is… stale or frankly dead.

GridGain

GridGain’s MapReduce implementation addresses many of these problems. We keep both highly transactional and unstructured data smartly cached in an extremely scalable In-Memory Data Grid, and provide the industry’s first fully integrated In-Memory Compute Grid, which lets you run MapReduce or distributed SQL/Lucene queries over the data in memory.

Having both data and computations co-located in memory makes low latency or streaming processing simple.

You can, of course, still keep any data for long-term storage in an underlying SQL, ERP, or Hadoop HDFS store when using GridGain – and GridGain intelligently supports any type of long-term storage.

Yet GridGain doesn’t force you to use the same storage for processing data – we are giving you the choice to use the best of both worlds: keep data in memory for processing, and keep data in HDFS for long-term storage.

http://vimeo.com/39065963

GridGain 4.0 Released!

Posted by on Sunday, March 25, 2012
 Archives, Blog, Product Releases


I’m pleased to announce that today we released GridGain 4.0 – the latest edition of our platform for Real Time Big Data processing. I’m proud that our team set this final deadline almost 5 months ago and we were able to hit it without a single delay.

I’m especially proud of this fact because of the enormous complexity of the development process involved in making software like GridGain – dozens of production clients, testing on serious massively distributed environments, set of new features, and the usual array of setbacks that we had to go through to get here.

Needless to say, we have also grown significantly as a business in the last 6 months, including more than doubling our team headcount, rolling out a new website, new branding, a sales team, messaging, press and analyst relationships, investment, and a whole range of other business activities.

But… GridGain Systems is an engineering company first and foremost, so I’ll talk about the technology in GridGain 4.0:

Visor Management & Monitoring

Enterprise and OEM Editions of GridGain come standard with GridGain Visor – a GUI-based and scriptable environment for managing and monitoring GridGain distributed installations.

Visor GUI lets you perform all major management and monitoring operations for GridGain installations:

  • Various Node Actions
  • Topology View with Metrics
  • Metrics For Any Projection
  • Comprehensive Historical Charts
  • Advanced Grid-Wide Events

… and plenty of other cool stuff!

Affinity-Aware Native Clients

In GridGain 4.0 we are finally introducing native clients for various languages. The 4.0 release includes native Java and Android clients with rich APIs to support our Compute Grid and Data Grid connectivity. Our native .NET, C++, Groovy, and Scala clients are already in testing stage and will be coming out shortly as well. After that we will be adding Objective-C, Ruby, PHP, Python, and Node.js native clients.

All of our clients natively support essentially the same APIs, specifically adapted to each language. You can execute MapReduce tasks and perform a bunch of data operations, like storing and retrieving values to/from remote caches, compare-and-set/replace/put-if-absent atomic operations, etc… You can also subscribe to topology updates and get very creative with partitioning a remote data grid into logical subgrids.

But one of the coolest features in our clients is affinity-awareness. This basically means that when working with data grids, GridGain will automatically figure out on which node the data is stored and will route client requests to that node. Imagine the number of network trips you can save by retrieving data directly from the node which is responsible for storing it (same goes for updates). This feature is available for all of our native clients, not only for Java, which makes GridGain the only native cross-language distributed Real Time Big Data platform.
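
The routing idea can be illustrated with a small hash-based sketch. This is not GridGain’s actual affinity function: the object, the fixed partition count, and the modulo assignment of partitions to nodes are all assumptions chosen to keep the example short.

```scala
object AffinitySketch {
  // Map a key to one of a fixed number of partitions.
  // Masking with Int.MaxValue keeps the hash non-negative.
  def partition(key: Any, partitions: Int): Int =
    (key.hashCode & Int.MaxValue) % partitions

  // Map the key's partition to the node that owns it (simple modulo assignment).
  def nodeFor(key: Any, nodes: IndexedSeq[String], partitions: Int = 1024): String =
    nodes(partition(key, partitions) % nodes.size)

  def main(args: Array[String]): Unit = {
    val nodes = Vector("node-a", "node-b", "node-c")
    Seq("user:1", "user:2", "user:3").foreach { key =>
      // A client that knows this mapping can send the request straight to
      // the owning node instead of going through an intermediate hop.
      println(key + " -> " + nodeFor(key, nodes))
    }
  }
}
```

Because the mapping is a pure function of the key and the node list, every client computes the same owner for a given key without asking the cluster.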

Memcached Binary Protocol Support

In GridGain 4.0 we significantly enhanced our REST support for HTTP(S) and added Binary protocol support as well. What’s even cooler is that our binary protocol is fully Memcached-compliant. As a matter of fact, during our testing we have been connecting to GridGain using available open source Memcached clients and executing commands on GridGain data grid.

Having said that, GridGain 4.0 Binary connectivity protocol supports a lot more than Memcached does. Essentially we have taken Memcached protocol as our starting point and significantly enhanced it with our own commands and features. For example, you can configure security with proper authentication and secure sessions for remote clients, or you can execute MapReduce tasks, get remote node topology, etc…
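
For readers unfamiliar with the wire format, the sketch below builds a GET request as laid out in the standard Memcached binary protocol (a 24-byte header followed by the key). It covers only the plain Memcached part; GridGain’s own additional commands are not shown, and the MemcachedGet object and its request method are hypothetical names.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object MemcachedGet {
  // Build a Memcached binary-protocol GET request for the given key.
  def request(key: String, opaque: Int = 0): Array[Byte] = {
    val keyBytes = key.getBytes(StandardCharsets.UTF_8)
    val buf = ByteBuffer.allocate(24 + keyBytes.length) // big-endian by default

    buf.put(0x80.toByte)                  // magic: request packet
    buf.put(0x00.toByte)                  // opcode: GET
    buf.putShort(keyBytes.length.toShort) // key length
    buf.put(0.toByte)                     // extras length (none for GET)
    buf.put(0.toByte)                     // data type (raw bytes)
    buf.putShort(0.toShort)               // vbucket id
    buf.putInt(keyBytes.length)           // total body length = extras + key + value
    buf.putInt(opaque)                    // opaque: echoed back in the response
    buf.putLong(0L)                       // CAS (unused for GET)
    buf.put(keyBytes)                     // body: just the key

    buf.array()
  }

  def main(args: Array[String]): Unit = {
    val packet = request("user:1")
    println(packet.map(b => f"${b & 0xff}%02x").mkString(" "))
  }
}
```

Any client that can write these bytes to a socket and parse the matching 24-byte response header can talk to a Memcached-compliant endpoint.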

Advanced Security

In GridGain 4.0 we added a notion of secure grids. Grids can now require nodes to be authenticated before they join the topology. Authentication implementation is fully pluggable through our SPI-based architecture and comes with several implementations out of the box, such as Passcode or JAAS-based authentication.

Additionally remote clients can also be required to authenticate themselves and once authenticated, they establish a secure session with the server.

Both the authentication and secure-session SPIs are available in the GridGain Enterprise edition only.

Data Loaders + Hadoop HDFS Support

We have also added support for efficient concurrent data loading for our data grid. There are plenty of ways to load data into data grids, including using basic cache APIs or our support for bulk-loading of data from data stores. Data loaders make it easy to load data into the grid externally by adding collocation with data nodes, sending concurrent data loading jobs, and properly controlling the amount of memory consumed by the data loading process.

As a good use case, you can use a data loader to preload data from Hadoop HDFS in order to process it in Real Time on GridGain. In fact, GridGain 4.0 comes with an HDFS data loading example, GridCacheHdfsLoaderExample, which reads data from HDFS and then uses the GridGain data loader to load it into the data grid.

1000+ Nodes Guaranteed Discovery

In this release we enhanced our TCP-based discovery protocol with enterprise-proven support for network segmentation and half-connected sockets. Our discovery protocol was tested on thousands of grid nodes on Amazon EC2 cloud by us and by our customers to make sure there are no cluster or data inconsistencies. This protocol is already running on customer sites in several deployments quite successfully.

LevelDB Swap Space Implementation

In GridGain 4.0 you can load terabytes of data into cache. GridGain will try to fit as much of the data in memory as possible – the more grid nodes you have, the more memory is available for caching data. However, if you have more data than fits into the whole memory of the grid, you can use the LevelDB swap implementation, which is based on Google LevelDB storage, to swap infrequently used data to disk. We found that LevelDB can efficiently store large amounts of data with a fairly small disk footprint (using compression). We have also enhanced it with our swap eviction policy to prevent infinite disk growth.

Presenting At QCon London 2012!

Posted by on Tuesday, February 28, 2012
 Archives, Blog, Events and Meetups

GridGain will be presenting at QCon London 2012 on March 7-9. This is going to be a new presentation that we’ve prepared specifically for the event: live Scala coding of a streaming real time MapReduce application… We are going to take Hadoop’s ubiquitous popular-word-counting example and turn it on its head, making it a real time MapReduce application using the upcoming GridGain 4.0.

Come see us at our booth and talk to our CTO Dmitriy Setrakyan, who has been coding GridGain software since 2005.

Hope to see as many of you as possible!
