Apache® Ignite™ has supported Machine Learning capabilities for a while now. With the release of Ignite v2.7, additional Machine Learning and Deep Learning capabilities have been added, including the much-anticipated support for TensorFlow™.
TensorFlow is an open-source library that can be used for numerical computations and for performing Machine Learning at scale. It is also flexible and can run on a variety of hardware, such as CPUs and GPUs.
The combination of TensorFlow and Ignite provides a complete and powerful solution for working with day-to-day (operational) data and long-term (historical) data, on which we can perform data analysis and build complex mathematical models. Using TensorFlow with Ignite also delivers a number of significant benefits, such as:
- Virtually Unlimited Capacity - Ignite is a distributed database whose capacity grows as nodes are added to the cluster, enabling much larger data sets to be used for many TensorFlow Machine Learning and Deep Learning tasks.
- Faster Performance - If TensorFlow workers are deployed on the same machines as Ignite server nodes, there will be little or no data movement over the network; TensorFlow can work with the local Ignite node.
- Fault Tolerance - If there is a failure during calculation, Ignite can restart the process from the point of failure.
Let’s now discuss the integration between Ignite and TensorFlow in a little more detail.
The integration between Ignite and TensorFlow is achieved using Ignite Dataset. This enables Ignite to be used as a data source for TensorFlow operations and has a number of advantages:
- Ignite provides a fast, distributed database for TensorFlow training and inference data.
- All pre-processing can be performed in the TensorFlow pipeline, because objects fed by Ignite Dataset can have any structure.
- Ignite Dataset supports SSL and distributed training.
Since Ignite Dataset is a part of TensorFlow, no third-party packages need to be installed and it can be used out-of-the-box. Technically, the integration uses Ignite’s Binary Client Protocol and TensorFlow’s tf.data.
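To make the data-source idea concrete, here is a minimal sketch of a dataset-style pipeline that pulls records from a cache in pages and pre-processes them downstream. It is an illustration only: a plain Python dict stands in for an Ignite cache, and the names `scan_pages`, `dataset`, `map_records`, `batch`, and `normalize`, as well as the record layout, are hypothetical. It does not call the Ignite or TensorFlow APIs.

```python
from typing import Any, Callable, Dict, Iterator, List


def scan_pages(cache: Dict[int, Any], page_size: int) -> Iterator[List[Any]]:
    """Yield cache values in fixed-size pages (stand-in for a paged cache scan)."""
    keys = sorted(cache)
    for start in range(0, len(keys), page_size):
        yield [cache[k] for k in keys[start:start + page_size]]


def dataset(cache: Dict[int, Any], page_size: int = 2) -> Iterator[Any]:
    """Flatten the pages into a stream of records of arbitrary structure."""
    for page in scan_pages(cache, page_size):
        yield from page


def map_records(records: Iterator[Any], fn: Callable[[Any], Any]) -> Iterator[Any]:
    """Apply a pre-processing function to every record in the stream."""
    for record in records:
        yield fn(record)


def batch(records: Iterator[Any], size: int) -> Iterator[List[Any]]:
    """Group records into batches; the last batch may be smaller."""
    buffer: List[Any] = []
    for record in records:
        buffer.append(record)
        if len(buffer) == size:
            yield buffer
            buffer = []
    if buffer:
        yield buffer


# Hypothetical cache contents: nested objects rather than flat rows.
cache = {
    1: {"pixels": [0, 128, 255], "label": 7},
    2: {"pixels": [255, 255, 0], "label": 1},
    3: {"pixels": [64, 0, 32], "label": 4},
}


def normalize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Scale pixel values to the range [0, 1]; pre-processing lives in the pipeline."""
    return {"pixels": [p / 255.0 for p in record["pixels"]], "label": record["label"]}


batches = list(batch(map_records(dataset(cache, page_size=2), normalize), size=2))
```

The point of the sketch is the separation of concerns: the source only streams records, and every transformation is a pipeline stage, which is why the structure of the stored objects does not matter to the source.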
One of the features supported by Ignite is an in-memory distributed file system called IGFS that is similar to HDFS. TensorFlow can use IGFS as a file system, for example to store checkpoints and event files.
On the Ignite side, the integration uses the native IGFS API; on the TensorFlow side, it is based upon a custom file system plugin. The IGFS integration provides several benefits:
- Improved reliability and fault-tolerance can be achieved by saving checkpoints of state to IGFS.
- Communication between training processes and TensorBoard is achieved by writing event files to a directory. If TensorBoard is running in a different process or machine, IGFS will enable this communication to take place.
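The checkpointing benefit can be sketched as follows. This is a simplified illustration, not the actual TensorFlow checkpoint mechanism: a local temporary directory stands in for an IGFS directory, the "training step" is a trivial sum, and the functions `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical names introduced for this example.

```python
import json
import os
import tempfile
from typing import Dict, Optional, Tuple


def save_checkpoint(directory: str, step: int, state: Dict[str, int]) -> None:
    """Write the checkpoint to a temporary file, then rename it into place."""
    path = os.path.join(directory, "checkpoint.json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)  # a crash mid-write leaves the old checkpoint intact


def load_checkpoint(directory: str) -> Tuple[int, Dict[str, int]]:
    """Return the last saved (step, state), or a fresh start if none exists."""
    path = os.path.join(directory, "checkpoint.json")
    if not os.path.exists(path):
        return 0, {"total": 0}
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(directory: str, total_steps: int,
          fail_at: Optional[int] = None) -> Tuple[int, Dict[str, int]]:
    """Resume from the last checkpoint and run until total_steps is reached."""
    step, state = load_checkpoint(directory)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated failure at step %d" % step)
        state["total"] += step  # stand-in for one real training step
        step += 1
        save_checkpoint(directory, step, state)
    return step, state


# Stand-in for a shared IGFS directory that survives the failed process.
checkpoint_dir = tempfile.mkdtemp()

try:
    train(checkpoint_dir, total_steps=10, fail_at=6)
except RuntimeError:
    pass  # the first run dies part-way through

resumed_from, _ = load_checkpoint(checkpoint_dir)
final_step, final_state = train(checkpoint_dir, total_steps=10)
```

Because the checkpoint lives outside the training process, the second run picks up at the step where the first one failed instead of starting from zero. Keeping that directory on a distributed file system is what makes the same pattern work when the restart happens on a different machine.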
TensorFlow natively supports distributed training. This enables better utilisation of resources and avoids bottlenecks. Ignite provides data partitioning across a distributed cluster, enabling resources to be used efficiently. TensorFlow distributed training with Ignite uses what is termed the standalone client mode. Let’s see how this works in more detail using an example.
In Figure 1, we have an Ignite cluster and a cache called MNIST. The MNIST cache is distributed across 8 partitions and each partition is stored on one of 8 Ignite server nodes.
Figure 1. Ignite Cluster with MNIST cache.
Each Ignite server node also has a TensorFlow worker, as shown. Each TensorFlow worker will access the data on the partition stored on the Ignite server node where it is running. In other words, the TensorFlow worker will only access local data. Also in Figure 1, we can see a TensorFlow client running on one of the Ignite cluster nodes on the right-hand side. This client runs as a service. Therefore, in the event of a service failure, Ignite can restart training.
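The data-locality arrangement described above can be modelled in a few lines. This is a toy model under stated assumptions: keys are mapped to partitions with a simple modulo, which is not Ignite's actual affinity function, and the node names, record values, and helper functions (`partition_of`, `owner_of`, `load_cache`, `local_keys`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

NUM_PARTITIONS = 8
# One server node per partition, as in the figure (node names are made up).
NODES = ["node-%d" % i for i in range(NUM_PARTITIONS)]


def partition_of(key: int) -> int:
    """Map a cache key to a partition (simplified: plain modulo)."""
    return key % NUM_PARTITIONS


def owner_of(partition: int) -> str:
    """Each partition is stored on exactly one node in this model."""
    return NODES[partition]


def load_cache(num_records: int) -> Dict[str, Dict[int, str]]:
    """Distribute records across the per-node stores according to their partition."""
    stores: Dict[str, Dict[int, str]] = defaultdict(dict)
    for key in range(num_records):
        stores[owner_of(partition_of(key))][key] = "image-%d" % key
    return stores


def local_keys(node: str, stores: Dict[str, Dict[int, str]]) -> List[int]:
    """A worker on `node` reads only from that node's store; no network hop."""
    return sorted(stores[node])


stores = load_cache(80)
per_worker = {node: local_keys(node, stores) for node in NODES}
```

In this model each of the 8 workers sees a disjoint tenth-of-eighty slice of the data (10 records each), and together the workers cover the whole cache, which is the property that lets training proceed without moving data between nodes.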
Once TensorFlow is initialised and configured, Ignite moves aside and lets TensorFlow do its work, as shown in Figure 2.
Figure 2. TensorFlow Client and Workers.
Ignite only intervenes if there are cluster failures or data need to be rebalanced. Behind the scenes, however, Ignite manages a complex infrastructure for maintaining a TensorFlow cluster. More details are available in the official documentation.
In this first article, we have discussed at a high level the integration between Ignite and TensorFlow. In further articles in this series during the coming weeks and months, we’ll look at this integration in detail and build some example applications to demonstrate the capabilities of using Ignite with TensorFlow. Stay tuned!