In a previous article, I discussed redefining the challenge facing companies that want to become data-driven. The way most people think about this problem – and the most commonly proposed solution – is putting all data into a single place, such as a data lake.
This strategy has challenges, the biggest of which is that while data lakes make it economical to store data, retrieval and analysis of that data can be slow and cumbersome, rendering data lakes impractical for low-latency analytics needs.
Instead, let us think about the problem simply as the need for real-time access to all relevant data across the enterprise and external sources in a way that enables building cross-sectional views and analyzing the complete set of data required to make informed decisions.
With this approach, there are only two requirements: 1. access to data from multiple internal and external sources in real time, and 2. the ability to curate and quickly access relevant sections of this data.
In this article, I will discuss one strategy for addressing these needs: a data integration hub (DIH).
What is a data integration hub?
Data integration hubs have been around for some time and have been successfully embraced by companies with extreme data processing speed and scale requirements, especially in the financial services and insurance industries.
A DIH architecture creates a common data-access layer that aggregates different types of data from multiple on-premises, cloud-based, and streaming sources. Multiple business applications can then access relevant portions of the aggregated data – ideally, cached in an in-memory data grid for real-time processing.
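To make the idea concrete, here is a minimal sketch of such a common data-access layer in Python. It is a toy under stated assumptions, not a production design: the class name `DataIntegrationHub`, the method names, and the sample "crm" and "billing" sources are all hypothetical, and a plain dictionary stands in for an in-memory data grid.

```python
from typing import Callable, Dict, Iterable, Optional

# Assumption: every record is a plain dict that carries an "id" key.
Record = Dict[str, object]
Loader = Callable[[], Iterable[Record]]


class DataIntegrationHub:
    """Toy hub: pulls records from named sources and serves reads from memory."""

    def __init__(self) -> None:
        self._sources: Dict[str, Loader] = {}
        self._cache: Dict[str, Dict[object, Record]] = {}

    def register_source(self, name: str, loader: Loader) -> None:
        # A loader is any callable returning records (database, API, stream...).
        self._sources[name] = loader

    def refresh(self) -> None:
        # Re-read every source and index its records by id in memory.
        for name, loader in self._sources.items():
            self._cache[name] = {rec["id"]: dict(rec) for rec in loader()}

    def get(self, source: str, record_id: object) -> Optional[Record]:
        return self._cache.get(source, {}).get(record_id)

    def cross_view(self, record_id: object) -> Dict[str, Record]:
        # Cross-sectional view: the same id looked up in every source.
        view: Dict[str, Record] = {}
        for name, records in self._cache.items():
            rec = records.get(record_id)
            if rec is not None:
                view[name] = rec
        return view


hub = DataIntegrationHub()
hub.register_source("crm", lambda: [{"id": 1, "name": "Ada"}])
hub.register_source("billing", lambda: [{"id": 1, "balance": 42.0}])
hub.refresh()
view = hub.cross_view(1)
```

Applications read only from the hub; none of them talks to the "crm" or "billing" source directly.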
Underpinning the DIH architecture are several capabilities:
• A multi-model datastore with a standards-based API layer that synchronizes the data with disparate back-end sources or systems of record.
• A high-performance and scalable data access layer, supporting all forms of data-interaction APIs, including SQL; programming-language APIs such as Java, C#, Python and Scala; and RESTful APIs.
• A data-integrity management mechanism that offers guarantees similar to ACID transactions.
• A strong security and access control framework to support secure and controlled data access by various audiences.
A distributed, in-memory platform that combines the features listed above would be an example of a DIH to address business needs for low-latency access to large amounts of data across the enterprise data ecosystem.
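Two of the capabilities above, data integrity and access control, can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the `GuardedStore` class, the role names, and the "no negative balances" rule are hypothetical, and the all-or-nothing batch write only approximates the atomicity part of ACID within a single process.

```python
import copy
import threading
from typing import Callable, Dict, Optional, Set


class GuardedStore:
    """Toy store with role checks and all-or-nothing batch writes."""

    def __init__(self, permissions: Dict[str, Set[str]]) -> None:
        # Hypothetical scheme: role name -> allowed actions ("read", "write").
        self._permissions = permissions
        self._data: Dict[str, int] = {}
        self._lock = threading.Lock()

    def _check(self, role: str, action: str) -> None:
        if action not in self._permissions.get(role, set()):
            raise PermissionError(f"role {role!r} may not {action}")

    def read(self, role: str, key: str) -> Optional[int]:
        self._check(role, "read")
        with self._lock:
            return self._data.get(key)

    def write_batch(self, role: str, updates: Dict[str, int],
                    validate: Callable[[Dict[str, int]], bool]) -> bool:
        # Stage the changes on a copy, validate, and swap only if valid.
        self._check(role, "write")
        with self._lock:
            staged = copy.deepcopy(self._data)
            staged.update(updates)
            if not validate(staged):
                return False  # the live data is left untouched
            self._data = staged
            return True


def no_negative(data: Dict[str, int]) -> bool:
    return all(value >= 0 for value in data.values())


store = GuardedStore({"teller": {"read", "write"}, "auditor": {"read"}})
accepted = store.write_batch("teller", {"a": 50, "b": 50}, no_negative)
rejected = store.write_batch("teller", {"a": -10, "b": 110}, no_negative)

try:
    store.write_batch("auditor", {"a": 0}, no_negative)
    auditor_blocked = False
except PermissionError:
    auditor_blocked = True
```

The second batch fails validation, so neither of its two updates is applied, and the read-only role cannot write at all.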
Why deploy a data integration hub?
Today, the need for extreme data processing speed and scale is spreading to an ever-greater number of companies in a larger number of industries, such as healthcare, telecommunications, retail, logistics and travel, as well as many others.
The reasons for this spread are simple:
1. More data is available to enterprises through several different sources, along with the ability to process such large amounts of data quickly.
2. Within companies, the number of use cases for real-time processing tends to explode once a first use case proves successful.
3. Especially recently, nearly every company has started looking for ways to exploit AI and generative AI to accelerate innovation, improve productivity and enhance customer experiences.
A DIH architecture can help achieve these cross-sectional data access goals at scale for immense amounts of information. The DIH layer also decouples systems of record from consuming applications, allowing applications and the underlying systems to evolve at their own pace without relying on or impacting the other technology components.
This capability is fundamental to another critical initiative at many organizations: developing the ability to move individual on-prem components to the cloud, or to switch cloud service providers or consumers, in any way they want.
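The decoupling described above can be sketched in a few lines of Python. The names are hypothetical (`SystemOfRecord`, `OnPremOrders`, `CloudOrders`, `Hub`, `order_status`), and the two back ends are in-memory stand-ins rather than real systems. The point is only that the consuming function depends on the hub and never on a specific back end.

```python
from typing import Dict, Optional, Protocol


class SystemOfRecord(Protocol):
    """Anything the hub can read from; the application never sees this."""

    def fetch(self, key: str) -> Optional[dict]: ...


class OnPremOrders:
    # Stand-in for an on-premises order database.
    def __init__(self) -> None:
        self._rows = {"o-1": {"status": "shipped"}}

    def fetch(self, key: str) -> Optional[dict]:
        return self._rows.get(key)


class CloudOrders:
    # Stand-in for the same data after a move to a cloud service.
    def __init__(self) -> None:
        self._rows = {"o-1": {"status": "shipped"}}

    def fetch(self, key: str) -> Optional[dict]:
        return self._rows.get(key)


class Hub:
    def __init__(self, backend: SystemOfRecord) -> None:
        self._backend = backend
        self._cache: Dict[str, Optional[dict]] = {}

    def swap_backend(self, backend: SystemOfRecord) -> None:
        # Replace the system of record; consumers are not modified.
        self._backend = backend
        self._cache.clear()

    def read(self, key: str) -> Optional[dict]:
        if key not in self._cache:
            self._cache[key] = self._backend.fetch(key)
        return self._cache[key]


def order_status(hub: Hub, order_id: str) -> str:
    # Consuming application: knows the hub, not the back end behind it.
    row = hub.read(order_id)
    return row["status"] if row else "unknown"


hub = Hub(OnPremOrders())
before = order_status(hub, "o-1")
hub.swap_backend(CloudOrders())
after = order_status(hub, "o-1")
```

`order_status` is the same function before and after the swap, which is the sense in which applications and systems of record can evolve independently.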
What are the limitations of a data integration hub?
By definition, a DIH creates a low-latency data access layer across multiple systems of record. Therefore, it inherently creates a separation between data read workloads (queries) and transactional data writes (changes to the data). This separation – known as command query responsibility segregation (CQRS) – introduces a degree of latency, as well as some complexity tied to data synchronization across two systems.
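A minimal Python sketch shows where that latency comes from. It assumes a write side that logs its changes and a read side that only catches up when a synchronization step runs; `WriteStore`, `ReadStore`, and `synchronize` are hypothetical names, and a real deployment would typically use change data capture or messaging instead of a direct function call.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, int]


class WriteStore:
    """Command side: the system of record, which accepts changes."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}
        self._pending: List[Event] = []  # changes not yet sent to the read side

    def set(self, key: str, value: int) -> None:
        self._data[key] = value
        self._pending.append((key, value))

    def drain(self) -> List[Event]:
        # Hand over the pending changes and start a fresh list.
        events, self._pending = self._pending, []
        return events


class ReadStore:
    """Query side: the hub's low-latency view, updated only by events."""

    def __init__(self) -> None:
        self._view: Dict[str, int] = {}

    def apply(self, events: List[Event]) -> None:
        for key, value in events:
            self._view[key] = value

    def get(self, key: str) -> Optional[int]:
        return self._view.get(key)


def synchronize(writes: WriteStore, reads: ReadStore) -> None:
    # The synchronization step that the separation makes necessary.
    reads.apply(writes.drain())


writes, reads = WriteStore(), ReadStore()
writes.set("balance:42", 100)
stale = reads.get("balance:42")   # read side has not caught up yet
synchronize(writes, reads)
fresh = reads.get("balance:42")   # read side now reflects the write
```

Between the write and the call to `synchronize`, a query against the read side does not see the change. That window is the latency, and keeping it small and correct is the synchronization complexity.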
Data integration hubs are also primarily a source of cross-enterprise data, and one still has to pull data from these data integration hubs to wherever the actual processing and analysis of such data happens. This means DIHs can fall a bit short of what today’s real-time data use cases require – speed, scale, and performance not only in accessing data but also in processing large amounts of data.
Think of the DIH as a “data bus,” similar to an enterprise service bus (ESB), that essentially creates a scalable and flexible plug-and-play architecture.
The first step to becoming data-driven is getting access to all the available data – once you have access, you have the opportunity to inspect, assess, and analyze that data. If your organization is on a journey to become data-driven and facing challenges accessing data with speed and at scale, a DIH offers a potential first step as a minimally intrusive and decoupled architecture for high-speed access to the relevant data anywhere it resides.
It is also now possible to overcome the DIH limitations above. I referenced the term “enterprise data ecosystem” a couple of times in this article. In my next article, I will investigate what an enterprise data ecosystem looks like and how to address DIH limitations in order to tackle a new type of data challenge being faced by enterprises.
This article was originally published at Forbes.com