The Case for Smart Open Storage

I'd like to explain why I believe we are on the cusp of a new chapter in storage technology: the ascendance of an open smart storage standard. Such a standard would define an open, vendor-agnostic interface for integrating compute and data processing with traditional storage systems. Using these open APIs, developers should be able to run SQL, MapReduce, key/value access, file system operations, and many other types of data processing directly on storage systems.
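To make this concrete, here is a toy sketch of what such an interface could enable. Everything below is hypothetical: the class, method names, and data layout are my own invention for illustration, not part of any existing or proposed standard. The key idea is that the computation (a key/value read, a scan predicate) is shipped to the storage node, so only results cross the network.

```python
from typing import Any, Callable, Dict, Iterator, List


class SmartStorage:
    """Toy sketch of a smart storage node that executes processing in place.

    Purely illustrative; not any vendor's actual API.
    """

    def __init__(self) -> None:
        self._kv: Dict[str, bytes] = {}        # key/value store
        self._rows: List[Dict[str, Any]] = []  # tabular dataset held by the node

    # --- K/V access served directly by the storage node ---
    def put(self, key: str, value: bytes) -> None:
        self._kv[key] = value

    def get(self, key: str) -> bytes:
        return self._kv[key]

    # --- predicate pushdown: ship the computation to the data ---
    def append(self, row: Dict[str, Any]) -> None:
        self._rows.append(row)

    def scan(self, predicate: Callable[[Dict[str, Any]], bool]) -> Iterator[Dict[str, Any]]:
        """Only rows matching the predicate leave the node, not the full dataset."""
        return (row for row in self._rows if predicate(row))


# Usage: the filter runs where the data lives; only matches are returned
store = SmartStorage()
store.put("config", b"v1")
store.append({"id": 1, "region": "eu"})
store.append({"id": 2, "region": "us"})
eu_rows = list(store.scan(lambda r: r["region"] == "eu"))
```

The design point this sketches is pushdown: instead of copying the whole dataset to a compute cluster and filtering there, the predicate travels to the storage system and only the qualifying rows travel back.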

There are three trends that are pushing the storage industry in this direction. Many would argue that the concept of software-defined storage has been pushing the same boundaries for quite some time. However, I think the trends below are new or substantially different.

Business value gets generated outside of "dumb" storage

One of the strongest trends dominating conversations about storage is the realization by storage vendors that most of the business value is being created outside of the products provided by EMC, NetApp, SanDisk, and HP.

To understand this, think about where you typically process data, i.e. where you extract the value that justified acquiring or storing that data in the first place. More and more often it happens in Apache Spark or other in-memory systems like Apache Ignite, in NoSQL databases, or in streaming applications. That is where the bulk of application and system logic lives, and that is where most of the application engineering happens. As a result, both the business focus and the engineering focus are concentrated away from “dumb” storage.

This situation on its own would not be so dire, but for those new data processing systems to have the data they need, that data must first be moved out of “dumb” storage. This leads us to the second point.

Moving data for processing has become prohibitively slow

Fifteen or twenty years ago, moving a few megabytes of data from an RDBMS to an application server for overnight processing wasn't a big deal. The data could be moved relatively quickly (a few minutes), and real-time processing wasn't the necessity it is today.

Fast forward to today. Try to move a few terabytes of data for analytics processing from a NetApp appliance to a Spark cluster. Not only can it take hours, it also rules out any real-time processing of that data (or, indeed, any processing in a timely manner). That is why storing actionable data (data lakes, etc.) on traditional storage systems is increasingly problematic. They have no processing capabilities, so to do anything with that data, it must be moved, physically sent over the network, which rapidly becomes impractical.
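A back-of-the-envelope calculation shows why. Assume (optimistically) a dedicated 10 Gb/s link running at full theoretical capacity; the 5 TB figure and the link speed are illustrative assumptions, and shared datacenter networks usually deliver far less:

```python
# Rough transfer-time estimate for moving data to a compute cluster.
# Both figures below are assumptions for illustration, not measurements.
data_bytes = 5 * 10**12   # 5 TB to move
link_bps = 10 * 10**9     # 10 Gb/s link, assumed fully dedicated

seconds = data_bytes * 8 / link_bps  # bytes -> bits, divided by bits/sec
print(f"{seconds / 3600:.1f} hours")  # about 1.1 hours at theoretical peak
```

And that is the best case: protocol overhead, contention, and the fact that the same move must be repeated every time fresh data needs processing push the real cost well past the point where timely analytics is possible.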

Proprietary storage APIs fly in the face of reality

These ideas are not new, of course. Pursuing the concept of software-defined storage, vendors big (EMC) and small (Nutanix) are working on integrating compute with data storage. What I believe is badly missing is an industry-wide, vendor-agnostic, open standard for such integration.

Given the tremendous success of open source organizations like the Apache Software Foundation (home of Hadoop, Spark, and many other projects defining today's data processing ecosystem) and OpenStack, it is essential that this integration happen in the open source world. Only truly open standards can win industry support and the hearts and minds of developers, both of which are essential for smart storage to succeed.