Impressions from the Spark Summit Europe

After attending the European Spark Summit I felt the need to share a few words summarizing my experience and impressions, which will hopefully provide insight to those who were not able to attend, or who attended and want a vendor's perspective on it.

As with every Spark Summit in recent years, and given the huge influx of data science and big data content, it is genuinely challenging to pick the right talk to attend. As a speaker and booth representative I was unfortunately not able to attend many talks, so my view is certainly limited. However, a few things grabbed my attention that are worth sharing.

To my surprise, the exhibition floor had a select few vendors, all highly relevant to the domain. It was easy to navigate around without being harassed by sales talks, engaging only when you wanted to. The booths that caught my interest were those of Cloudera, IBM, and Basho. It's great to see IBM reinventing itself and approaching the market with a fresh look and feel. For a behemoth of a company it's not easy to stay relevant in the current day and age, where startups pop up left, right, and center to snatch huge market shares overnight. As always, the Basho booth and branding were engaging. They had a fun toy car challenge where lap times were recorded in their time series database to demonstrate their product. Certainly not the typical kind of problem to solve, or rather to demonstrate your product with, but nevertheless an engaging installation that lured in the unsuspecting data scientist or engineer.

During the breaks between breakout sessions the conversations with attendees were highly relevant and interesting, but people did seem to share a common problem: "How do I speed up my existing Hadoop deployment?" or "How do I make Spark™ faster?" As a GridGain architect and an Apache® Ignite™ evangelist, I knew these are exactly the problems we specialize in, delivering both the technology and the expertise to solve them. Just have a look at our Hadoop and Spark acceleration, or watch my breakout session recording from the summit, which should be available soon.

During my session I covered, at a high level, the basics and building blocks of Apache Ignite, specifically the in-memory data grid functionality. I demonstrated how we can store data in memory in a resilient manner and provide real-time scalability with no sacrifice in uptime. I covered the compute grid only briefly, since the audience was Spark-oriented and our Ignite-Spark integration only uses the data grid feature of Ignite. Finally, I talked about how the integration between Ignite and Spark can deliver mutable, shared RDDs for Spark whilst at the same time improving Spark SQL dramatically through the use of in-memory indexes in Ignite. Following my presentation I had a number of great follow-up conversations with users who really understood the benefit of this integration.
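To give a flavor of what that integration looks like, here is a minimal sketch of sharing state between Spark jobs via an Ignite-backed RDD. The cache name and the Ignite configuration path are placeholders, and running this for real requires the `ignite-spark` module on the classpath and an Ignite cluster:

```scala
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

// Standard Spark setup; the app name is a placeholder.
val sc = new SparkContext(new SparkConf().setAppName("ignite-shared-rdd"))

// IgniteContext wraps the SparkContext and manages Ignite nodes on the
// workers. "config/example-shared-rdd.xml" is a hypothetical config path.
val ic = new IgniteContext(sc, "config/example-shared-rdd.xml")

// An IgniteRDD is a live view over an Ignite cache: writes made by one
// Spark job are immediately visible to others, which is what makes it
// both shared and, unlike a native RDD, mutable.
val sharedRdd = ic.fromCache[Int, Int]("partitioned")

// Write key/value pairs into the cache from a plain Spark RDD.
sharedRdd.savePairs(sc.parallelize(1 to 1000).map(i => (i, i * 2)))

// SQL queries run against Ignite's in-memory indexes rather than scanning
// the whole dataset, which is where the Spark SQL speed-up comes from.
val result = sharedRdd.sql("select _val from Integer where _val > ?", 500)
result.show()
```

The key design point is that the IgniteRDD is not a copy of the data: it is a facade over the cache, so its contents survive the Spark job that created them.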

One common challenge I had to address was that Spark users find themselves needing workarounds when they try to use Spark as a data source aggregation layer, extracting data from various sources and creating RDDs in Spark memory. That is not really what Spark was designed for; Spark is an in-memory processing platform with no direct storage abilities. Sure, you can use HDFS for that, but we all know you will only be as fast as your disk. Hence, something like Ignite is a better fit for such tasks, whilst still using Spark as the processing layer. This means your data and processing all stay in RAM, the fastest commodity medium of this day and age.
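That aggregation-layer pattern can be sketched roughly as follows: one Spark job extracts and aggregates data, then parks the result in Ignite's RAM instead of writing it out to disk, so that later jobs can pick it up directly. The input path, cache name, and configuration path below are all hypothetical:

```scala
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

// Job 1: extract from an external source and aggregate with plain Spark.
val sc = new SparkContext(new SparkConf().setAppName("aggregate-to-ignite"))
val ic = new IgniteContext(sc, "config/ignite-config.xml")
val cache = ic.fromCache[String, Long]("aggregates")

// Placeholder input: count events per key from a raw source.
val counts = sc.textFile("hdfs:///raw/events")
  .map(line => (line.split(",")(0), 1L))
  .reduceByKey(_ + _)

// The aggregated pairs now live in the Ignite cluster, in memory,
// and outlive this particular Spark application.
cache.savePairs(counts)

// Job 2, possibly an entirely separate Spark application, can later do:
//   val warm = ic.fromCache[String, Long]("aggregates")
// and continue processing straight from RAM, with no disk round trip.
```

The disk only appears at the original extraction step; everything downstream of it stays in memory.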