Must:

- Apache Hadoop 3.1.0 or later, with YARN service enabled.

Optional:

- Enable YARN DNS (required for distributed training).
- Enable GPU on YARN support (required for GPU-based training).
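Before going further, a couple of quick sanity checks from the gateway host can confirm the prerequisites; this is a minimal sketch assuming the standard Hadoop CLI is on your PATH:

```
# Hadoop/YARN should report version 3.1.0 or later.
yarn version

# The ResourceManager should respond; this lists running YARN applications.
yarn app -list
```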
Submarine jobs are submitted with the `job run` command, which accepts the following options:

```
usage: job run
 -checkpoint_path <arg>       Training output directory of the job; can be a local or
                              other FS directory. This typically includes checkpoint
                              files and the exported model.
 -docker_image <arg>          Docker image name/tag.
 -env <arg>                   Common environment variable of worker/PS.
 -input_path <arg>            Input of the job; can be a local or other FS directory.
 -name <arg>                  Name of the job.
 -num_ps <arg>                Number of PS tasks of the job; 0 by default.
 -num_workers <arg>           Number of worker tasks of the job; 1 by default.
 -ps_docker_image <arg>       Docker image for PS. When not specified, PS uses
                              --docker_image as the default.
 -ps_launch_cmd <arg>         Commandline of PS; arguments are used directly to
                              launch the PS.
 -ps_resources <arg>          Resources of each PS, for example
                              memory-mb=2048,vcores=2,yarn.io/gpu=2.
 -queue <arg>                 Name of the queue to run the job in; the default
                              queue is used when unspecified.
 -saved_model_path <arg>      Exported model (savedmodel) path of the job; needed
                              when the exported model is not placed under
                              ${checkpoint_path}. Can be a local or other FS
                              directory. This will be used for serving.
 -tensorboard <arg>           Whether to run TensorBoard for this job; true by default.
 -verbose                     Print verbose logs for troubleshooting.
 -wait_job_finish             Specify when you want to wait for the job to finish.
 -worker_docker_image <arg>   Docker image for WORKER. When not specified, WORKER
                              uses --docker_image as the default.
 -worker_launch_cmd <arg>     Commandline of worker; arguments are used directly to
                              launch the worker.
 -worker_resources <arg>      Resources of each worker, for example
                              memory-mb=2048,vcores=2,yarn.io/gpu=2.
```
For example, to launch a standalone Tensorflow job:

```
yarn jar path-to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar job run \
  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
  --name tf-job-001 \
  --docker_image <your-docker-image> \
  --input_path hdfs://default/dataset/cifar-10-data \
  --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
  --worker_resources memory=4G,vcores=2,gpu=2 \
  --worker_launch_cmd "python ... (Your training application cmd)" \
  --tensorboard # this launches a companion Tensorboard container for monitoring
```
1) DOCKER_JAVA_HOME points to JAVA_HOME inside the Docker image.
2) DOCKER_HADOOP_HDFS_HOME points to HADOOP_HDFS_HOME inside the Docker image (see the sanity-check sketch below).
3) --worker_resources can include gpu when you need GPUs to train your task.
4) When --tensorboard is specified, you can go to the new YARN UI, open Services -> <the service you specified>, and use the quick link to access Tensorboard.
This launches a Tensorboard instance that monitors all your jobs. In the new YARN UI, go to the Services page, open the tensorboard-service, and the quick link (Tensorboard) takes you straight to it.
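Since the paths passed via --env must exist inside the image, a quick check before submitting can save a failed container launch. A minimal sketch, assuming Docker is available locally and using the same paths as the example above (the image name is a placeholder):

```
# Verify that JAVA_HOME and HADOOP_HDFS_HOME really exist inside your image.
docker run --rm <your-docker-image> \
  ls /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java /hadoop-3.1.0/bin/hdfs
```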
To launch a distributed Tensorflow job, additionally specify the number of PS tasks along with their resources and launch command:

```
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tf-job-001 --docker_image <your docker image> \
  --input_path hdfs://default/dataset/cifar-10-data \
  --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
  --num_workers 2 \
  --worker_resources memory=8G,vcores=2,gpu=1 \
  --worker_launch_cmd "cmd for worker ..." \
  --num_ps 2 \
  --ps_resources memory=4G,vcores=2,gpu=0 \
  --ps_launch_cmd "cmd for ps"
```
To check the status of a submitted job from the CLI:

```
yarn jar hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar job show --name tf-job-001
```
Output looks like:
```
Job Meta Info:
  Application Id: application_1532131617202_0005
  Input Path: hdfs://default/dataset/cifar-10-data
  Checkpoint Path: hdfs://default/tmp/cifar-10-jobdir
  Run Parameters: --name tf-job-001 --docker_image wtan/tf-1.8.0-gpu:0.0.3
                  (... the full command line used to run the job)
```
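Because the output is plain text, it can be scripted against. For example, a small sketch (the awk field index assumes the exact layout shown above) that pulls the application id and fetches the job's logs with the standard YARN log CLI:

```
# Grab the YARN application id from `job show`, then fetch container logs.
APP_ID=$(yarn jar hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
  job show --name tf-job-001 | awk '/Application Id:/ {print $3}')
yarn logs -applicationId "$APP_ID" | less
```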
After that, you can run `tensorboard --logdir=<checkpoint-path>` to view the job's Tensorboard.
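For example, with the checkpoint path used throughout this guide. Note that pointing Tensorboard directly at HDFS assumes your local TensorFlow build has HDFS support; otherwise copy the directory down first:

```
# Read event files straight from HDFS (requires TF built with HDFS support).
tensorboard --logdir=hdfs://default/tmp/cifar-10-jobdir

# Or copy them locally first with the standard HDFS CLI.
hdfs dfs -get hdfs://default/tmp/cifar-10-jobdir /tmp/cifar-10-jobdir
tensorboard --logdir=/tmp/cifar-10-jobdir
```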
To run a standalone Tensorboard service that monitors multiple jobs:

```
# Cleanup previous service if needed
yarn app -destroy tensorboard-service; \
yarn jar /tmp/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
  job run --name tensorboard-service --verbose --docker_image wtan/tf-1.8.0-cpu:0.0.3 \
  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
  --num_workers 0 --tensorboard
```
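Once the service is up, you can also check it from the CLI; `yarn app -status` is the standard YARN service command and prints the service spec, including component state, as JSON:

```
# Confirm the tensorboard-service is running.
yarn app -status tensorboard-service
```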
You can view the training history of multiple jobs from that Tensorboard link.