Archives

Come to our talk “Streaming MapReduce with GridGain” and see live coding of Hadoop’s famous word-counting example, but in a real-time context. As always, live coding from scratch in Scala is never dull!
Stop by and say hello to our CTO Dmitriy Setrakyan.

GridGain 4.0.1 was released this Monday. This is a point release that includes several bug fixes as well as a number of new features.

With 4.0.1 we are introducing native support for .NET with our C# Client. The C# Client provides native .NET/C# APIs for accessing both GridGain’s In-Memory Data Grid and Compute Grid from outside of the GridGain topology. Internally it delegates to the REST protocol. Check out the examples on GitHub.
The C# Client is one of many native clients we’ll be releasing shortly, including Objective-C, C++, PHP, Scala, Ruby, and some others we’re already working on.
Improved Support for 32-bit and 64-bit Systems
We’ve modified our scripts for better out-of-the-box support for 32-bit and 64-bit systems. Several clients complained that additional configuration properties were required and, specifically, that GridGain Visor didn’t work fully on 32-bit systems with the default configuration. All these issues have been resolved.
Enhancements to GridGain Visor
We are continuing to make rapid improvements to GridGain Visor, which is part of the GridGain Enterprise and OEM editions.
We’ve added the ability to specify the time span for chart views:

We’ve added nice in-place filtering for events in Dashboard:

You can now double-click an event to see its details:


A while back we promised to publish the code from the live-coding GridGain presentation we gave at QCon London earlier this year. Since the presentation was in Scala, the code posted here is in Scala.
First, a brief intro. We all know Hadoop’s word-counting example, which takes a file with words and produces another file with the number of occurrences next to each word. Hadoop does this example very well; the main caveat is that it is not real time.
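For comparison, the batch computation behind that example can be sketched on a single machine in plain Scala. This is only an illustration of the logic (split lines into words, count occurrences), not Hadoop code; the object and method names are made up for this sketch.

```scala
object BatchWordCount {
    // Count the occurrences of every word in the given lines.
    // Words are split on any non-alphanumeric character.
    def countWords(lines: Iterator[String]): Map[String, Int] =
        lines
            .flatMap(line => line.split("[^a-zA-Z0-9]").toSeq)
            .filter(_.nonEmpty)
            .foldLeft(Map.empty[String, Int]) { (counts, word) =>
                counts.updated(word, counts.getOrElse(word, 0) + 1)
            }
}
```

The result only exists once every line has been consumed, which is exactly the batch limitation described above.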
The word-counting example we did at QCon counted words in real time. The program was split into two parts. The first part loads the words in real time into the GridGain data grid, and the second part queries the grid every 3 seconds to continuously print the top 10 words stored so far.
The example was done using ‘Scalar’, the GridGain DSL for Scala, but it could have been done in Java as well using the GridGain Java APIs.
Continuously Populate Words In Real Time
Let’s start by continuously loading the data grid with new words. To do that, we downloaded several books in text format and started reading them concurrently from the populate(…) method, one thread per book. Every word read is stored in the cache, with the word itself as the key and its current number of occurrences as the value. Also note how we let the grid update the cache asynchronously, using an asynchronous run, while we read the next line from the book file (in reality you would most likely have more than one asynchronous job, or have GridGain’s data loading functionality do it for you).
// JDK and Scala standard library imports used below.
import java.io.File
import java.util.concurrent.{Callable, CompletionService}
import scala.io.Source

// NOTE: the [AnyRef] type parameters are an assumption; the listing as posted omits them.
def populate(threadPool: CompletionService[AnyRef], dir: File) {
    val bookFileNames = dir.list()
    // For every book, start a new thread and start populating cache
    // with words and their counts.
    for (bookFileName <- bookFileNames) {
        threadPool.submit(new Callable[AnyRef] {
            def call() = {
                val cache = grid$.cache[String, JInt]
                var fut: GridFuture[_] = null
                Source.fromFile(new File(dir, bookFileName)).getLines().foreach(line => {
                    line.split("[^a-zA-Z0-9]").foreach(word => {
                        if (!word.isEmpty) {
                            // Wait for the previous asynchronous update to complete
                            // before submitting the next one.
                            if (fut != null)
                                fut.get()
                            fut = grid$.affinityRunAsync(null, word, () => {
                                // Increment word counter and store it in cache.
                                // We use cache transaction to make sure that
                                // gets and puts are consistent and atomic.
                                cache.inTx(
                                    () => cache += (word -> (cache.getOrElse(word, 0) + 1))
                                )
                                ()
                            })
                        }
                    })
                })
                None // Return nothing.
            }
        })
    }
    // Wait for all threads to finish.
    bookFileNames.foreach(_ => threadPool.take().get())
}
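To experiment with this loading pattern without a running grid, the cache can be replaced by a local map. The sketch below is not GridGain code: it assumes a ConcurrentHashMap as a stand-in for the data grid, and the per-key atomic merge(…) plays the role of the cache transaction. The LocalWordLoader name and the use of in-memory "books" instead of files are choices made for this illustration.

```scala
import java.util.concurrent.{Callable, ConcurrentHashMap, ExecutorCompletionService, Executors}
import java.util.function.BiFunction

object LocalWordLoader {
    // Addition used by merge(); merge() applies it atomically per key.
    private val add = new BiFunction[Integer, Integer, Integer] {
        def apply(a: Integer, b: Integer): Integer = Integer.valueOf(a.intValue + b.intValue)
    }

    // Load every "book" (a sequence of lines) on its own thread and
    // return the word -> occurrences map once all loaders are done.
    def populate(books: Seq[Seq[String]]): ConcurrentHashMap[String, Integer] = {
        val counts = new ConcurrentHashMap[String, Integer]()
        val pool = Executors.newFixedThreadPool(math.max(1, books.size))
        val completion = new ExecutorCompletionService[Unit](pool)
        try {
            for (book <- books) {
                completion.submit(new Callable[Unit] {
                    def call(): Unit =
                        for (line <- book; word <- line.split("[^a-zA-Z0-9]") if word.nonEmpty)
                            counts.merge(word, Integer.valueOf(1), add)
                })
            }
            // Wait for all loaders; get() rethrows any loader failure.
            books.foreach(_ => completion.take().get())
        }
        finally pool.shutdown()
        counts
    }
}
```

Because each increment is atomic for its key, two threads that see the same word at the same time cannot lose an update, which is the property the cache transaction provides in the grid version.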
Distributed SQL Query
Now let’s implement our distributed query against the GridGain data grid, which will run every 3 seconds. Note that we are using standard SQL syntax to query remote grid nodes. Interestingly, the GridGain data grid allows you to use SQL virtually without limitations: you can use any native SQL function and even SQL JOINs between different classes. Here, for example, we use the SQL length(…) function to query only words longer than 3 letters, just to get rid of frequent short words like “a” or “the” in our searches. We also use the desc keyword to sort word counts in descending order and the limit keyword to limit our selection to 10 words.
def queryPopularWords(cnt: Int) {
    // Type alias for sequences of strings (for readability only).
    type SSeq = Seq[String]
    grid$.cache[String, JInt].sqlReduce(
        // PROJECTION (where to run):
        grid$.projectionForCaches(null),
        // SQL QUERY (what to run):
        "length(_key) > 3 order by _val desc limit " + cnt,
        // REMOTE REDUCER (how to reduce on remote nodes):
        (it: Iterable[(String, JInt)]) =>
            // Pre-reduce by converting
            // Seq[(String, JInt)] to Map[JInt, Seq[String]].
            (it :\ Map.empty[JInt, SSeq])((e, m) =>
                m + (e._2 -> (m.getOrElse(e._2, Seq.empty[String]) :+ e._1))),
        // LOCAL REDUCER (how to finally reduce on local node):
        (it: Iterable[Map[JInt, SSeq]]) => {
            // Print 'cnt' of most popular words collected from all remote nodes.
            (new TreeMap()(implicitly[Ordering[JInt]].reverse) ++ it.flatten)
                .take(cnt).foreach(println _)
            println("------------") // Formatting.
        }
    )
}
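The two reduction steps can be tried out locally on plain collections. The following is a sketch under stated assumptions, not the GridGain API: each "node" is simply a sequence of (word, count) pairs, the TopWordsReduce name is made up, and when two nodes report the same count this version concatenates their word lists, which is a choice made for this illustration.

```scala
import scala.collection.immutable.TreeMap

object TopWordsReduce {
    type SSeq = Seq[String]

    // "Remote" step: group one node's (word, count) pairs by count.
    def remoteReduce(entries: Iterable[(String, Int)]): Map[Int, SSeq] =
        entries.foldLeft(Map.empty[Int, SSeq]) { case (m, (word, count)) =>
            m.updated(count, m.getOrElse(count, Seq.empty[String]) :+ word)
        }

    // "Local" step: merge the per-node maps, highest counts first,
    // and keep the first 'cnt' count groups.
    def localReduce(perNode: Iterable[Map[Int, SSeq]], cnt: Int): Seq[(Int, SSeq)] = {
        val merged = perNode.flatten.foldLeft(TreeMap.empty[Int, SSeq](Ordering[Int].reverse)) {
            case (acc, (count, words)) =>
                acc.updated(count, acc.getOrElse(count, Seq.empty[String]) ++ words)
        }
        merged.take(cnt).toSeq
    }
}
```

Pre-reducing on each node keeps the amount of data sent back small: the local node only merges already grouped maps instead of raw entries.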
Start Example
And finally, let’s implement the main(…) method that calls the populate(…) and queryPopularWords(…) methods we just defined.
// BOOK_PATH and POPULAR_WORDS_CNT are constants of the example (not shown here).
def main(args: Array[String]) {
    // Initialize book directory.
    val bookDir = new File(BOOK_PATH)
    // Start GridGain with specified configuration file.
    scalar("examples/config/spring-cache-popularwords.xml") {
        // Create as many threads as we have books, so we can use
        // thread per book to load data grid concurrently.
        val threadPool = Executors.newFixedThreadPool(bookDir.list.length)
        val popWordsQryTimer = new Timer("words-query-worker")
        try {
            // Schedule word queries to run every 3 seconds.
            popWordsQryTimer.schedule(new TimerTask {
                def run() {
                    queryPopularWords(10) // Query top 10 words from data grid.
                }
            }, 3000, 3000)
            // Populate cache with word counts.
            populate(new ExecutorCompletionService(threadPool), bookDir)
            // Force one more run to print final counts.
            queryPopularWords(POPULAR_WORDS_CNT)
        }
        finally {
            popWordsQryTimer.cancel() // Cancel timer.
            threadPool.shutdownNow() // Stop the thread pool immediately.
        }
    }
}
To execute the example, start several GridGain stand-alone nodes using the examples/config/spring-cache-popularwords.xml configuration file and then start the example we just created from your IDE. You may wish to add more printouts for better visibility into what’s happening.
This example also ships with GridGain 4.0 and is available in the GridGain GitHub repository.

Over the past few months I’ve been repeatedly asked how GridGain relates to Hadoop. Having answered this question over and over again, I’ve compacted the answer to just a few words:
We love Hadoop HDFS, but we are sorry for people who have to use Hadoop MapReduce.
Let me explain.
Hadoop HDFS

We love Hadoop HDFS. It is a new and improved version of the enterprise tape drive. It is an excellent technology for storing large historical data sets (TB and PB scale) in distributed disk-based storage. Essentially, every computer in a Hadoop cluster contributes a portion of its disk(s) to HDFS, and you get a unified view of this large virtual file system.
It has its shortcomings too, like slow performance, complexity of ETL, inability to update a file that has already been written, and inability to deal effectively with small files. Some of these are expected, and the project is still in development, so some of these issues will be mitigated in the future. Still, today HDFS is probably the most economical way to keep a very large static data set of TB and PB scale in a distributed file system for long-term storage.
GridGain provides several integration points for HDFS, such as a dedicated data loader and cache loaders. The dedicated data loader allows data to be bulk-loaded into the In-Memory Data Grid, while a cache loader allows for much more fine-grained transactional loading and storing of data to and from HDFS.
The many clients using GridGain with HDFS are a good litmus test for that integration.
Hadoop MapReduce

As much as we like Hadoop HDFS, we think Hadoop’s implementation of MapReduce processing is inadequate and outdated:
Would you run your analytics today off tape drives? That’s what you do when you use Hadoop MapReduce.
The fundamental flaw in Hadoop MapReduce is the assumption that a) storing data and b) acting upon data should be based on the same underlying storage.
Hadoop MapReduce runs jobs over the data stored in HDFS and thus inherits, and even amplifies, all the shortcomings of HDFS: extremely slow performance and disk-based storage that leads to a heavy batch orientation, which in turn makes it unable to process low-latency tasks effectively. This ultimately makes Hadoop MapReduce an “elephant in the room” when it comes to delivering real-time big data processing.
Yet one of the most glaring shortcomings of Hadoop MapReduce is that you’ll never be able to run your jobs over live data. HDFS by definition requires some sort of ETL process to load data from traditional online/transactional (i.e. OLTP) systems into HDFS. By the time the data is loaded (hours if not days later), the very data you are going to run your jobs over is stale, or frankly dead.
GridGain
GridGain’s MapReduce implementation addresses many of these problems. We keep both highly transactional and unstructured data smartly cached in an extremely scalable In-Memory Data Grid and provide the industry’s first fully integrated In-Memory Compute Grid, which lets you run MapReduce or distributed SQL/Lucene queries over the data in memory.
Having both data and computations co-located in memory makes low-latency or streaming processing simple.
You can, of course, still keep any data for long-term storage in an underlying SQL, ERP, or Hadoop HDFS store when using GridGain, and GridGain intelligently supports any type of long-term storage.
Yet GridGain doesn’t force you to use the same storage for processing data. We give you the choice to use the best of both worlds: keep data in memory for processing, and keep data in HDFS for long-term storage.

I’m pleased to announce that today we released GridGain 4.0, the latest edition of our platform for Real Time Big Data processing. I’m proud that our team set this final deadline almost 5 months ago and that we were able to hit it without a single delay.
I’m especially proud of this fact because of the enormous complexity of the development process involved in making software like GridGain: dozens of production clients, testing on serious massively distributed environments, a set of new features, and the usual array of setbacks that we had to go through to get here.
Needless to say, we have also grown significantly as a business in the last 6 months, including more than doubling our team headcount and rolling out a new website, new branding, a sales team, messaging, press and analyst relationships, investment, and a whole range of other business activities.
But GridGain Systems is an engineering company first and foremost, so I’ll talk about the technology in GridGain 4.0:
Visor Management & Monitoring
The Enterprise and OEM Editions of GridGain come standard with GridGain Visor, a GUI-based and scriptable environment for managing and monitoring distributed GridGain installations.
The Visor GUI lets you perform all major management and monitoring operations for GridGain installations:
Various Node Actions

Topology View with Metrics

Metrics For Any Projection

Comprehensive Historical Charts


Advanced Grid-Wide Events

… and plenty of other cool stuff!
Affinity-Aware Native Clients
In GridGain 4.0 we are finally introducing native clients for various languages. The 4.0 release includes native Java and Android clients with rich APIs to support our Compute Grid and Data Grid connectivity. Our native .NET, C++, Groovy, and Scala clients are already in the testing stage and will be coming out shortly as well. After that we will be adding Objective-C, Ruby, PHP, Python, and Node.js native clients.
All of our clients natively support essentially the same APIs, each adapted to its language. You can execute MapReduce tasks and perform a range of data operations, such as storing and retrieving values to/from remote caches and atomic compare-and-set/replace/put-if-absent operations. You can also subscribe to topology updates and get very creative with partitioning the remote data grid into logical subgrids.
But one of the coolest features in our clients is affinity-awareness. This basically means that when working with data grids, GridGain will automatically figure out on which node the data is stored and will route client requests to that node. Imagine the number of network trips you can save by retrieving data directly from the node that is responsible for storing it (the same goes for updates). This feature is available for all of our native clients, not only for Java, which makes GridGain the only native cross-language distributed Real Time Big Data platform.
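The idea behind affinity-aware routing can be illustrated with a toy key-to-node mapping. This is a simplified sketch, not GridGain's affinity function (which is not described in this post): it assumes plain hash-modulo partitioning, and the Node class and nodeFor method are hypothetical names.

```scala
object AffinitySketch {
    // Hypothetical node descriptor; not a GridGain class.
    final case class Node(id: String)

    // Map a key to the node responsible for it (toy hash-modulo partitioning).
    def nodeFor(key: Any, nodes: IndexedSeq[Node]): Node = {
        require(nodes.nonEmpty, "topology must contain at least one node")
        // floorMod keeps the index non-negative even for negative hash codes.
        nodes(Math.floorMod(key.hashCode, nodes.size))
    }
}
```

A client that knows the topology can compute nodeFor(key, nodes) itself and send the request straight to that node, instead of asking an arbitrary node that would then have to forward it. Hash-modulo remaps most keys whenever the number of nodes changes, so a real grid needs a mapping that copes better with nodes joining and leaving.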
Memcached Binary Protocol Support
In GridGain 4.0 we significantly enhanced our REST support for HTTP(S) and added Binary protocol support as well. What’s even cooler is that our binary protocol is fully Memcached-compliant. As a matter of fact, during our testing we have been connecting to GridGain using available open source Memcached clients and executing commands on GridGain data grid.
Having said that, GridGain 4.0 Binary connectivity protocol supports a lot more than Memcached does. Essentially we have taken Memcached protocol as our starting point and significantly enhanced it with our own commands and features. For example, you can configure security with proper authentication and secure sessions for remote clients, or you can execute MapReduce tasks, get remote node topology, etc…
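For readers who have not seen the Memcached binary protocol: every request starts with a 24-byte header, followed by optional extras, the key, and the value. The sketch below builds a standard Memcached GET request packet. It shows only the plain Memcached framing and none of the GridGain-specific commands, which are not documented in this post; the MemcachedBinary object and getRequest method are names made up for this example.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object MemcachedBinary {
    val HeaderSize = 24
    val RequestMagic: Byte = 0x80.toByte
    val OpGet: Byte = 0x00

    // Build a binary-protocol GET request for the given key.
    def getRequest(key: String, opaque: Int = 0): Array[Byte] = {
        val keyBytes = key.getBytes(StandardCharsets.UTF_8)
        require(keyBytes.length <= 0xFFFF, "key length must fit into 16 bits")
        // ByteBuffer is big-endian by default, which matches network byte order.
        val buf = ByteBuffer.allocate(HeaderSize + keyBytes.length)
        buf.put(RequestMagic)                 // magic: request
        buf.put(OpGet)                        // opcode: GET
        buf.putShort(keyBytes.length.toShort) // key length
        buf.put(0.toByte)                     // extras length (GET has none)
        buf.put(0.toByte)                     // data type (raw bytes)
        buf.putShort(0.toShort)               // reserved / vbucket id
        buf.putInt(keyBytes.length)           // total body = extras + key + value
        buf.putInt(opaque)                    // opaque, echoed back in the response
        buf.putLong(0L)                       // CAS (unused for GET)
        buf.put(keyBytes)                     // body: just the key
        buf.array()
    }
}
```

Writing the resulting bytes to a socket connected to a Memcached-compatible server should produce a response with the same 24-byte header layout and the response magic 0x81.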
Advanced Security
In GridGain 4.0 we added the notion of secure grids. Grids can now require nodes to be authenticated before they join the topology. The authentication implementation is fully pluggable through our SPI-based architecture and comes with several implementations out of the box, such as Passcode or JAAS-based authentication.
Additionally, remote clients can also be required to authenticate themselves; once authenticated, they establish a secure session with the server.
Both the authentication and secure-session SPIs are available in the GridGain Enterprise edition only.
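As a rough illustration of what a pluggable authentication step looks like, here is a small sketch. All names in it (Authenticator, PasscodeAuthenticator, SecureTopology) are hypothetical and are not the GridGain SPI interfaces, which this post does not show.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object AuthSketch {
    // Hypothetical plug-in point; any implementation can be swapped in.
    trait Authenticator {
        def authenticate(subject: String, credentials: Array[Byte]): Boolean
    }

    // Passcode-style implementation: every node must present the shared passcode.
    final class PasscodeAuthenticator(passcode: String) extends Authenticator {
        private val expected = passcode.getBytes(StandardCharsets.UTF_8)
        def authenticate(subject: String, credentials: Array[Byte]): Boolean =
            // isEqual compares the arrays without stopping at the first mismatch.
            MessageDigest.isEqual(expected, credentials)
    }

    // A topology that only admits nodes accepted by the configured authenticator.
    final class SecureTopology(auth: Authenticator) {
        private var members = Set.empty[String]
        def join(nodeId: String, credentials: Array[Byte]): Boolean = synchronized {
            val ok = auth.authenticate(nodeId, credentials)
            if (ok) members += nodeId
            ok
        }
        def nodes: Set[String] = synchronized(members)
    }
}
```

The topology code depends only on the Authenticator trait, so a passcode check could be replaced by, say, a JAAS-backed implementation without touching the join logic.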
Data Loaders + Hadoop HDFS Support
We have also added support for efficient concurrent data loading for our data grid. There are plenty of ways to load data into data grids, including using the basic cache APIs or our support for bulk-loading of data from data stores. Data loaders make it easy to load data into the grid externally by adding collocation with data nodes, sending concurrent data loading jobs, and properly controlling the amount of memory consumed by the data loading process.
As a good use case, you can use a data loader to preload data from Hadoop HDFS in order to process it in real time on GridGain. In fact, GridGain 4.0 comes with an HDFS data loading example, GridCacheHdfsLoaderExample, which reads data from HDFS and then uses the GridGain data loader to load it into the data grid.
1000+ Nodes Guaranteed Discovery
In this release we enhanced our TCP-based discovery protocol with enterprise-proven support for network segmentation and half-connected sockets. Our discovery protocol was tested on thousands of grid nodes on Amazon EC2 cloud by us and by our customers to make sure there are no cluster or data inconsistencies. This protocol is already running on customer sites in several deployments quite successfully.
LevelDB Swap Space Implementation
In GridGain 4.0 you can load terabytes of data into the cache. GridGain will try to fit as much of the data in memory as possible: the more grid nodes you have, the more memory is available for caching data. However, if you have more data than fits into the total memory of the grid, you can use the LevelDB swap implementation, which is based on the Google LevelDB storage library, to swap infrequently used data to disk. We found that LevelDB can efficiently store large amounts of data with a fairly small disk footprint (using compression). We have also enhanced it with our own swap eviction policy to prevent infinite disk growth.
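The swap idea itself can be sketched in a few lines. The example below is an assumption-laden illustration, not GridGain's implementation: a plain in-memory map stands in for the LevelDB store, least-recently-used eviction is chosen only for simplicity, the SwappingCache name is hypothetical, and the class is not thread-safe.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap}
import scala.collection.mutable

object SwapSketch {
    // Keeps at most 'maxInMemory' entries in memory; colder entries move to 'swap'.
    final class SwappingCache[K, V](maxInMemory: Int) {
        // Stand-in for a disk store such as LevelDB.
        val swap = mutable.Map.empty[K, V]

        // Access-ordered LinkedHashMap: the eldest entry is the least recently used.
        private val memory = new JLinkedHashMap[K, V](16, 0.75f, true) {
            override def removeEldestEntry(eldest: java.util.Map.Entry[K, V]): Boolean = {
                val evict = size() > maxInMemory
                if (evict) swap.put(eldest.getKey, eldest.getValue) // swap out to "disk"
                evict
            }
        }

        def put(key: K, value: V): Unit = {
            swap.remove(key) // the in-memory copy becomes the current one
            memory.put(key, value)
        }

        def get(key: K): Option[V] =
            if (memory.containsKey(key)) Some(memory.get(key))
            else swap.remove(key).map { v => memory.put(key, v); v } // swap back in

        def inMemory: Int = memory.size()
    }
}
```

Frequently used entries stay in memory, while entries that have not been touched for a while end up in the swap store and are brought back on the next access.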

GridGain will be presenting at QCon London 2012 on March 7-9. This is going to be a new presentation that we’ve prepared specifically for the event: live Scala coding of a streaming real-time MapReduce application. We are going to take Hadoop’s ubiquitous word-counting example and turn it on its head, making it a real-time MapReduce application using the upcoming GridGain 4.0.
Come see us at our booth and talk to our CTO Dmitriy Setrakyan, who has been coding GridGain software since 2005.
Hope to see as many of you as possible!
GridGain Closes $2.5 Million Series A Funding to Accelerate Innovation in Real Time Big Data Processing
Innovative Cloud-Based Software Middleware Provider Receives Financing Led by RTP Ventures
Foster City, CA (PRWEB) December 06, 2011 — GridGain, the leader in high performance cloud computing and real time big data processing, today announced that it has closed a $2.5 million Series A round of financing led by RTP Ventures. The company will use the new funding to accelerate growth, continue innovation in real time big data processing, and expand its global market share.
GridGain will be presenting its “Live Coding” talk at 1DevDay in Detroit, MI on November 5th. If you are interested in “Functional Cloud Computing with Scala and GridGain”, come by and listen to the talk. No slides, only live Scala coding of distributed applications for 60 minutes. Several MapReduce and Data Grid apps built from scratch and working right in front of your eyes.
Hope to see you there!
I think this is the 4th year in a row that GridGain will be presenting at Silicon Valley Code Camp. With over 400 tracks, this is one of the largest events in the valley (and zero travel for me, which is a huge bonus!).
My talk “Distributed Programming with Scala and GridGain” will be held on Sunday, October 9th at 1pm. The presentation will be in a new format: zero slides and 100% live coding. As always, nothing is prepared and you’ll see everything that goes into creating several cool Scala-based highly distributed applications from scratch.
If you are around, come see my presentation. It’s always fun!

I will be presenting the talk “Distributed Functional Programming Done Right” at Scalathon 2011 hosted at Penn University on July 16-17th. What a great crowd!
Looking forward to catching up with the folks from Circumflex and giving my grief to the good people developing the IDEA Scala plugin :) Two days of Scala hacking, so I’ll probably be wiped out by the end.
Right after two grueling days at Scalathon 2011 I will stop by Boston to present “In-Memory Data Grid with Scala and GridGain” to the Boston Scala User Group.
Last time I presented there (a year ago), it was a great talk!
If you are around Boston and interested in In-Memory Data Grids (and Scala!), make sure to stop by!
