Command Line Tool

To let users control how a TensorFlow cluster is built on top of an Apache Ignite cluster, Ignite provides a simple command line tool, ignite-tf, with the following commands.

Start Command

The start command starts a new TensorFlow cluster on top of an Apache Ignite cluster for the specified cache and then starts the training job (specified by JOB_DIR, JOB_CMD, and JOB_ARGS). Once everything is started, Apache Ignite maintains all processes and automatically restarts them if any of them fails. The output of the start command is the output of the training job.

Usage: ignite-tf start [-hV] [-c=<cfg>] CACHE_NAME JOB_DIR JOB_CMD [JOB_ARGS...]

Starts a new TensorFlow cluster and attaches to user script process.

CACHE_NAME: Upstream cache name.

JOB_DIR: Job folder (or zip archive).

JOB_CMD: Job command.

[JOB_ARGS...]: Job arguments.

-c, --config=<cfg>: Apache Ignite client configuration.

-h, --help: Show this help message and exit.

-V, --version: Print version information and exit.
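When the start command is launched from a script, it can help to assemble the argument list programmatically. The following is a minimal Python sketch that only mirrors the usage line above. The helper name build_start_command and all example values (cache name, job folder, script name, configuration file) are hypothetical, and the sketch builds the command without running it.

```python
import shlex
from typing import List, Optional, Sequence


def build_start_command(
    cache_name: str,
    job_dir: str,
    job_cmd: str,
    job_args: Sequence[str] = (),
    config: Optional[str] = None,
) -> List[str]:
    """Assemble an `ignite-tf start` argument list that follows the usage line."""
    argv = ["ignite-tf", "start"]
    if config is not None:
        # --config=<cfg> is the long form of -c=<cfg> from the usage line.
        argv.append(f"--config={config}")
    # Positional parameters in the documented order.
    argv += [cache_name, job_dir, job_cmd, *job_args]
    return argv


if __name__ == "__main__":
    # Every value below is a hypothetical placeholder.
    argv = build_start_command(
        "TRAINING_DATA",
        "/opt/jobs/mnist",
        "python3",
        ["train.py", "--epochs", "5"],
        config="client.xml",
    )
    # Print a shell-quoted form for inspection; pass argv to subprocess.run to execute.
    print(shlex.join(argv))
```

Keeping the command as a list rather than a single string avoids quoting problems when a job argument contains spaces.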

Internally, the start command performs the following procedure:

  • Determine the placement of partitions for the specified cache.

  • Start workers on the nodes that hold those partitions.

  • Start the training code on a random node in the cluster, with a TF_CONFIG that describes the worker placement.

  • Route the output of training to the output of the start command.

  • In case of failure, stop everything and start again from the first step.

  • If training completes successfully, stop everything.
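The training code learns about worker placement through TF_CONFIG. The following is a minimal Python sketch of reading it inside a training script. It assumes that TF_CONFIG is passed as an environment variable in TensorFlow's standard layout (a JSON object with "cluster" and "task" keys). The exact contents that Ignite sets are not specified on this page, and the helper name read_tf_config is hypothetical.

```python
import json
import os
from typing import Any, Dict, List, Mapping, Tuple


def read_tf_config(
    environ: Mapping[str, str] = os.environ,
) -> Tuple[Dict[str, List[str]], Dict[str, Any]]:
    """Return (cluster, task) parsed from TF_CONFIG, or empty dicts if it is unset.

    Assumption: TF_CONFIG uses TensorFlow's standard JSON layout, for example
    {"cluster": {"worker": ["host:port", ...]}, "task": {"type": "worker", "index": 0}}.
    """
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return cluster, task


if __name__ == "__main__":
    cluster, task = read_tf_config()
    # List the worker addresses, if any were provided.
    for index, address in enumerate(cluster.get("worker", [])):
        print(f"worker {index}: {address}")
    print(f"this process: {task}")
```

Logging these values at the start of a job makes it easier to see which placement the job was given after an automatic restart.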

Stop Command

The stop command stops the specified TensorFlow cluster and the corresponding training job.

Usage: ignite-tf stop [-hV] [-c=<cfg>] CLUSTER_ID

Stops a running TensorFlow cluster.

CLUSTER_ID: Cluster identifier.

-c, --config=<cfg>: Apache Ignite client configuration.

-h, --help: Show this help message and exit.

-V, --version: Print version information and exit.

Attach Command

The attach command attaches to the training job of the specified TensorFlow cluster and routes its output to the output of the attach command.

Usage: ignite-tf attach [-hV] [-c=<cfg>] CLUSTER_ID

Attaches to running TensorFlow cluster (user script process).

CLUSTER_ID: Cluster identifier.

-c, --config=<cfg>: Apache Ignite client configuration.

-h, --help: Show this help message and exit.

-V, --version: Print version information and exit.

Ps Command

The ps command prints identifiers of all running TensorFlow clusters.

Usage: ignite-tf ps [-hV] [-c=<cfg>]

Prints identifiers of all running TensorFlow clusters.

-c, --config=<cfg>: Apache Ignite client configuration.

-h, --help: Show this help message and exit.

-V, --version: Print version information and exit.
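The stop, attach, and ps usage lines differ only in whether they take a CLUSTER_ID. The following is a minimal Python sketch that builds the argument list for any of the three and rejects combinations the usage lines do not allow. The helper name build_cluster_command and the example identifier are hypothetical, and nothing is executed.

```python
from typing import List, Optional

# Subcommands from the usage lines, mapped to whether they take a CLUSTER_ID.
_NEEDS_CLUSTER_ID = {"stop": True, "attach": True, "ps": False}


def build_cluster_command(
    subcommand: str,
    cluster_id: Optional[str] = None,
    config: Optional[str] = None,
) -> List[str]:
    """Assemble an `ignite-tf stop|attach|ps` argument list per the usage lines."""
    if subcommand not in _NEEDS_CLUSTER_ID:
        raise ValueError(f"unsupported subcommand: {subcommand!r}")
    needs_id = _NEEDS_CLUSTER_ID[subcommand]
    if needs_id and not cluster_id:
        raise ValueError(f"{subcommand} requires a CLUSTER_ID")
    if not needs_id and cluster_id is not None:
        raise ValueError(f"{subcommand} does not take a CLUSTER_ID")

    argv = ["ignite-tf", subcommand]
    if config is not None:
        # --config=<cfg> is the long form of -c=<cfg> from the usage lines.
        argv.append(f"--config={config}")
    if needs_id:
        argv.append(cluster_id)
    return argv


if __name__ == "__main__":
    # "my-cluster-id" is a hypothetical placeholder; real identifiers come from `ignite-tf ps`.
    print(build_cluster_command("ps", config="client.xml"))
    print(build_cluster_command("attach", "my-cluster-id"))
    print(build_cluster_command("stop", "my-cluster-id"))
```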

Cluster Manager

Apache Ignite has a complex infrastructure that maintains a TensorFlow cluster. The following diagram gives a quick overview:

Cluster Manager Infrastructure