Tao-Runner is an opinionated way to organize and train machine learning models with NVIDIA TAO. It handles the housekeeping around directory structures and keeps your experiments clean.
- Clone this repository to a suitable directory for your projects and create a `projects/` dir inside the cloned repo.
- Install the required Python packages, preferably in a virtual environment (`./scripts/setup_venv.sh` can help you with that).
- Change the paths in `.tao_mounts.json` according to your local system.
- Download the pretrained models from NVIDIA NGC (using `python scripts/download_pretrained_models.py`).
The main entry point is `python -m tao-runner`.
A strict folder structure is required and enforced by tao-runner:
```
projects/
├─ example_01/
│  ├─ data/
│  │  ├─ kitti_detection/
│  │  ├─ vott_json/
│  ├─ models/
│  │  ├─ experiment_01/
│  │  ├─ experiment_02/
│  ├─ specs/
│  │  ├─ experiment_01/
│  │  ├─ experiment_02/
│  ├─ experiments.yml
├─ example_02/
│  ├─ ...
```
Each dataset is referenced by its directory name under `data/`.
A dataset can contain multiple subsets, such as `full`, `train`, and `val`.
You can name your subsets however you want, as long as you reference them correctly in your TAO config (e.g. `$dataset/full` or `$dataset/custom_subset`).
Most model architectures require just the `full` subset containing all images.
`retinanet` uses two subsets (`train` and `val`), which can be generated from the `full` set via the `split` task.
- `experiments/` --> `/workspace/experiments`
- `repositories/` --> `/workspace/repositories`
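For illustration, these mappings could look like the following in `.tao_mounts.json` (the `source` paths are placeholders you must adapt to your local system; the file follows the standard TAO Toolkit mounts format):

```json
{
    "Mounts": [
        {
            "source": "/path/to/tao-runner/projects",
            "destination": "/workspace/experiments"
        },
        {
            "source": "/path/to/tao-runner/repositories",
            "destination": "/workspace/repositories"
        }
    ]
}
```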
- `$experiment`: Name of the experiment (name of the section in `experiments.yml`).
- `$dataset`: Path to the dataset (most likely in KITTI format) as configured in `experiments.yml` (docker side).
- `$tfrecords`: Path to the tfrecord-formatted dataset directory (docker side).
- `$pretrained_model`: Path to the pretrained model file (`.hdf5` file, docker side).
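These variables are substituted into your spec files. As a hypothetical sketch, a detectnet_v2-style spec fragment could reference them like this (field names taken from the standard TAO detectnet_v2 spec; the wildcard pattern is illustrative):

```
dataset_config {
  data_sources {
    tfrecords_path: "$tfrecords/*"
    image_directory_path: "$dataset/full"
  }
}
model_config {
  pretrained_model_file: "$pretrained_model"
}
```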
Use `convert_dataset.sh` to convert between different data formats.
TAO mostly uses the KITTI format for object detection.
After that, you can use `python -m tao-runner convert -h` to see how to convert a KITTI dataset to TFRecords.
This task is idempotent.
Examples:
```
python -m tao-runner convert example_01 experiment_01
python -m tao-runner convert example_01 experiment_01 experiment_02 --overwrite
```
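For reference, each line of a KITTI detection label file describes one object with 15 whitespace-separated fields; TAO's object detection tasks mainly use the class name and the bounding box. A minimal parser sketch (a hypothetical helper, not part of tao-runner):

```python
def parse_kitti_label(line):
    """Parse one object from a KITTI label line (15 whitespace-separated fields)."""
    fields = line.split()
    return {
        "class": fields[0],                            # object class name
        "truncated": float(fields[1]),                 # 0.0 .. 1.0
        "occluded": int(fields[2]),                    # 0 = fully visible
        "bbox": tuple(float(v) for v in fields[4:8]),  # left, top, right, bottom (pixels)
    }

label = parse_kitti_label(
    "car 0.00 0 0.00 100.0 120.0 200.0 220.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
)
# label["class"] == "car", label["bbox"] == (100.0, 120.0, 200.0, 220.0)
```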
Use this task to split the dataset into disjoint `train` and `val` subsets.
You can set the fraction of the `val` subset by setting `--val` to a value between 0.0 and 1.0.
This task is idempotent.
Required by:
- RetinaNet
Examples:
```
python -m tao-runner split --subset full --val 0.1 example_01 experiment_01
python -m tao-runner split --subset custom_subset example_01 experiment_01 experiment_02 --overwrite
```
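Conceptually, splitting with `--val 0.1` behaves like this sketch (a hypothetical illustration of the semantics, not tao-runner's actual implementation):

```python
import random

def split_subset(image_ids, val_fraction, seed=42):
    """Shuffle image IDs and split them into disjoint train/val lists."""
    assert 0.0 <= val_fraction <= 1.0
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)      # deterministic shuffle
    n_val = int(len(ids) * val_fraction)  # size of the val subset
    return ids[n_val:], ids[:n_val]       # (train, val)

train, val = split_subset(range(100), val_fraction=0.1)
# len(train) == 90, len(val) == 10, no overlap
```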
I recommend using the samples from NVIDIA as a starting point.
See `python -m tao-runner train -h` for the required arguments.
You always provide the task you want to carry out (`split`, `convert`, `train`, or `export`), the project, and the experiments for which the task is executed.
This task is idempotent.
Examples:
```
python -m tao-runner train example_01 experiment_01
python -m tao-runner train example_01 experiment_01 experiment_02 --overwrite
```
This file defines all your different experiments inside of a project.
The following example explains its structure in detail:
```yaml
# This section contains tao-specific configurations.
tao_config:
  # Copied over from tao. Denominates the number of GPUs to use.
  gpus: 1
  # Copied over from tao. Lists the indices of the GPUs to use (see nvidia-smi).
  gpu_indices: [1]
# This section contains all experiments inside the project.
# Create new named experiments via keys:
experiments:
  # The name of the experiment
  dssd_resnet18_01:
    # Detection head ("retinanet", "dssd", "detectnet_v2")
    head: dssd
    # Detection backbone ("resnet10", "resnet18", "resnet50", "vgg16", "vgg19", etc.)
    backbone: resnet18
    # Directory of the repository which holds the pretrained model to be used (dir under 'repositories')
    repository: pretrained_object_detection
    # Encryption key of the trained model
    model_key: secret_key
    # Directory of the raw dataset under 'data/' to use. Often the kitti_detection format is required.
    dataset: kitti_detection
    # Filename of the model to export (.tlt model)
    export_model: dssd_resnet18_epoch_080
    # The data type of the exported model (fp32, fp16, int8)
    export_type: fp16
  detectnetv2_resnet18_01:
    head: detectnet_v2
    backbone: resnet18
    repository: pretrained_detectnet_v2
    model_key: secret_key
    dataset: kitti_detection_1000x1000
    export_model: detectnet_v2_resnet18_epoch_080
    export_type: fp16
```

Todo....