After attending the
Summit I felt the need to share a few words summarizing my experience and perception of it, which hopefully can
provide insight to those who were not able to attend, or attended and want a vendor’s perspective of it.
As with every Spark Summit in recent years, and given the huge influx of data science and big data practitioners, it is really challenging to pick the right talks to attend. As a speaker and booth representative I was unfortunately not able to attend many talks, so my view will certainly be biased. However, a few things grabbed my attention that are worth sharing.
To my surprise, the exhibition floor had a select few vendors, all highly relevant to the domain. It was easy to navigate around without being harassed by salespeople, engaging only when you wanted to. The booths that certainly caught my interest were those of Cloudera, IBM and Basho.
It's great to see IBM reinventing itself and approaching the market with a fresh look and feel. For a behemoth of its size, it is not easy to stay relevant in the current day and age, where startups pop up left, right and center to snatch huge market shares overnight. As always, the Basho booth and branding were engaging. They had a fun toy-car challenge where lap times were recorded in their time series database to demonstrate their product. Certainly not the typical problem to solve, or rather to demonstrate your product with, but nevertheless an engaging installation that lured in the unsuspecting data scientist or engineer.
During the breaks between breakout sessions the conversations with attendees were highly relevant and interesting, but people did seem to share a common problem: "How do I speed up my existing Hadoop deployment?" or "How do I make Spark™ faster?" As a GridGain architect working with Apache® Ignite™, I knew that these are problems we highly specialize in, delivering the technology and expertise to solve them. Just have a look at our Hadoop and Spark acceleration offerings, or watch my breakout session recording from the summit, which should be available soon.
During my session I covered, at a high level, the basics and building blocks of Apache Ignite, focusing on the in-memory data grid functionality. I demonstrated how data can be stored in memory in a resilient manner, with real-time scalability and no sacrifice in uptime. I covered the compute grid only briefly, since the audience was Spark-oriented and our Ignite-Spark integration uses only the data grid feature of Ignite. Finally, I talked about how the integration between Ignite and Spark can deliver mutable, shared RDDs for Spark while at the same time dramatically improving Spark SQL via the use of in-memory indexes in Ignite. Following my presentation I had a number of great follow-up conversations with users who really understood the benefit of this integration.
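To give a feel for why in-memory indexes make such a difference for SQL-style lookups, here is a deliberately simplified sketch in plain Python. This is a toy illustration of the general idea, not Ignite's actual implementation: without an index every query scans all records, while a hash index resolves the same lookup in a single step.

```python
# Toy illustration (not Ignite's actual implementation) of why an
# in-memory index speeds up lookups compared to a full scan.
records = [{"id": i, "name": f"user{i}"} for i in range(100_000)]

# Without an index: every query walks the whole dataset, O(n).
def find_by_id_scan(target):
    for r in records:
        if r["id"] == target:
            return r
    return None

# With an in-memory hash index: a single O(1) dictionary lookup.
index = {r["id"]: r for r in records}

def find_by_id_indexed(target):
    return index.get(target)

# Both approaches return the same record; only the cost differs.
assert find_by_id_scan(99_999) == find_by_id_indexed(99_999)
```

Ignite maintains such indexes over the data it holds in memory, which is what lets SQL queries against an IgniteRDD avoid full scans.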
One common challenge I had to address was that Spark users had to find workarounds when they tried to use Spark as a data-source aggregation layer, extracting data from various sources and creating RDDs in Spark memory. That is not really what Spark was designed for; Spark is an in-memory processing platform with no direct storage layer of its own. You can use HDFS for that, but we all know that you will only be as fast as your disk. Hence, something like Ignite is a better fit for such tasks, while still using Spark as the processing layer. This means your data and your processing all live in RAM, the fastest commodity medium of this day and age.
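The pattern above can be sketched conceptually in a few lines of plain Python. This is not the Ignite or Spark API, just an illustration of the architecture: an in-memory store plays the role of the storage layer, and separate "jobs" read and write that shared state in RAM rather than round-tripping through disk between steps.

```python
# Conceptual sketch (plain Python, not actual Ignite or Spark APIs):
# an in-memory key-value store standing in for the storage layer,
# with separate processing steps sharing state in RAM.

class InMemoryStore:
    """Stands in for an in-memory data grid such as Ignite."""
    def __init__(self):
        self._data = {}

    def put_all(self, pairs):
        self._data.update(pairs)

    def get(self, key):
        return self._data.get(key)

store = InMemoryStore()

# "Job 1": ingest from several sources straight into memory.
store.put_all({"a": 1, "b": 2})
store.put_all({"c": 3})

# "Job 2": a later processing step sees the same shared state,
# with no disk round-trip in between.
total = sum(store.get(k) for k in ("a", "b", "c"))
print(total)  # 6
```

In the real integration the store is a distributed, resilient Ignite cluster and the processing steps are Spark jobs operating on shared RDDs, but the division of labor is the same: storage and processing both stay in memory.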