Multiple Object Detection using NVIDIA’s Transfer Learning Toolkit

Ronak Bhatia
19 min read · Mar 8, 2021

This post isn’t meant to be an in-depth explanation of machine or deep learning, but rather a practical guide to setting up object detection for projects. It covers building a custom object detection system using NVIDIA’s Transfer Learning Toolkit (TLT). It is important to note that training with TLT is only supported on x86 systems with an NVIDIA GPU (such as a V100); models trained with TLT can be deployed on any NVIDIA platform. I have written other blog posts on building an object classification model with Fast.AI as well as using TensorFlow’s Object Detection API for multiple object detection.

Example Output of an NVIDIA TLT Object Detection Model Trained on Nike Shoes

NVIDIA’s Transfer Learning Toolkit is a Python-based AI toolkit for taking pre-built AI models and customizing them with your own data. Transfer learning involves transferring learned features from an existing neural network to a new one; it is often used when creating a large training dataset isn’t feasible (NVIDIA). The tool claims to reduce AI training and development time via optimized pre-trained models and model pruning. You do not need AI framework expertise to use it, and no additional code needs to be written. However, please note that training with TLT requires an NVIDIA GPU.

The goal of this blog post is to walk through the nuances of setting up TLT, as the existing documentation may not be clear for those who are using custom data. This post will cover (1) Obtaining & Modifying Data for TLT; (2) Training the Model using TLT; and (3) Model Optimization and Visualizing Results. For training, we used an NVIDIA DGX-1, which we accessed remotely; if you use a different set-up, your training times may be longer. For the data manipulation and conversion, I used a MacBook Pro with the following specifications: macOS Catalina, Version 10.15.7, 16GB RAM, 2.3 GHz 8-core Intel Core i9.

References & Acknowledgements

Before beginning this post, I’d like to acknowledge the following people and resources, which this work references or was made possible by. Joev Valdivia has an incredible YouTube channel covering the TLT tool that I found incredibly useful and on which I based this walkthrough. In addition, I’d like to thank those who helped our team via NVIDIA’s developer forum; it is a good place to ask questions and helped us narrow down the problems we were having. Next, I’d like to acknowledge Eddie Weill’s useful convert-dataset tool for converting a COCO dataset to the KITTI format. Mityakov Aleksandr’s tool for converting XML to KITTI was incredibly useful as well. I also want to acknowledge Naveen Malwani’s helpful article on downloading Google’s OpenImages dataset for AI applications. Finally, I’d like to thank Allison Youngdahl for her help with proofreading this article and working on this project with me. Parts of this tutorial are taken from documentation and code that she had written.

Obtaining & Modifying Data for TLT

The most difficult part, in my opinion, of using the TLT tool is the fact that it requires data to be in the KITTI format. If you have a dataset in a different format (e.g., COCO), you will need to convert it to KITTI. In addition, depending on which model you’re using, you will need to resize your images or ensure that your images meet the sizing requirements for TLT.

As NVIDIA puts it, “the tlt-train tool doesn’t support training on images of multiple resolutions or resizing images during training. All the images must be resized offline to the final training size and the corresponding bounding boxes must be scaled accordingly.” For more information, see the following link. This link also contains information about image sizing requirements for the respective models (e.g., C * W * H [where C = 1 or 3, W >= 128, H >= 128, W, H are multiples of 32] for YOLOv3). Our team used the TLT tool with two datasets: (1) a custom dataset we created for sneakers and (2) a dataset we made with appended images from Google’s OpenImages. This article includes instructions on obtaining and modifying data for these two options.
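As a quick sanity check before training, you can scan your images and flag anything that breaks the YOLOv3 sizing rule quoted above. The snippet below is a minimal sketch using Pillow; the glob path is a placeholder you would point at your own image folder.

import glob
from PIL import Image

# Flag PNGs that don't satisfy: W >= 128, H >= 128, and W, H multiples of 32
for path in glob.glob("{your_path}/*.png"):  # point this at your image folder
    width, height = Image.open(path).size
    if width < 128 or height < 128 or width % 32 or height % 32:
        print("Needs resizing:", path, (width, height))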

Important Step Before Continuing

To save time and hassle in later steps, please make sure to convert any photos that are not in the PNG format to PNG. We noticed that the TLT Toolkit did not work properly if JPEG photos were used. Please do this step after you’ve obtained the data that you want to convert to KITTI but before actually converting the data. If you download images and labels from Google’s OpenImages (explained below), it sometimes puts the images in JPEG format. A quick way to convert images to PNG is to run the following in the terminal:

mkdir pngs; sips -s format png *.* --out pngs

This will create a directory called pngs and put the converted photos there. You can then delete the original non-PNG photos and use the converted PNGs in their place. For example, let’s say I have a folder called images with JPEG images.

Running the Command for Converting JPEGs to PNGs.
The Completed Output: PNGs are Stored in a Folder Called “pngs.”
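If you are not on macOS (sips is a macOS utility), a small Pillow script can do the same conversion. The following is a sketch that assumes your JPEGs live in a folder called images and that Pillow is installed.

import glob, os
from PIL import Image

os.makedirs("pngs", exist_ok=True)
for path in glob.glob("images/*.jpg") + glob.glob("images/*.jpeg"):
    name = os.path.splitext(os.path.basename(path))[0]
    # Convert to RGB first so JPEGs save cleanly as PNG
    Image.open(path).convert("RGB").save(os.path.join("pngs", name + ".png"))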

Option #1: Modifying a Custom Dataset for TLT

For our custom dataset, we had our relevant annotations in the XML format, which looked like the following, for example.

Example of XML Data for Object Detection for a Custom Dataset

The XML files were contained in two folders called test and train. To quickly and easily convert this format to KITTI, we used the following tool. After Git cloning the repository, navigate to the root folder and run the following command:

python3 xml2kitti.py {Path}
Successful Output for Running the XML2KITTI Script (python3 xml2kitti.py [name of folder]).

This will essentially convert all the XML files in the specified folder to the KITTI format (as text files) without deleting the XML files. Please make sure that the label (e.g., air_force_1) is lowercase. So, for example, if I go back to my test folder, I will see the following output:

Properly Converting the XML Files to the KITTI Format.
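If some of your labels are not already lowercase, a short script like the following can fix them in place. This is a sketch that assumes one object per line and that rootdir points at your label folder.

import glob

rootdir = "{your_path}/test"
for path in glob.glob(rootdir + "/*.txt"):
    lines = open(path).read().splitlines()
    fixed = []
    for line in lines:
        fields = line.split(" ")
        fields[0] = fields[0].lower()  # the class name is the first KITTI field
        fixed.append(" ".join(fields))
    open(path, "w").write("\n".join(fixed))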

While it would be ideal if these were all the steps, TLT requires the KITTI format to contain only 15 elements. Currently, there are 16 elements per text file; to address this, we needed to remove the last zero from all the generated text files. The quickest solution was to run the following CLI command and then delete the resulting .bak files it generates. The command takes the existing .txt files and edits them to remove the last 0. As the command appeared to run indefinitely for us, we hit CTRL+C after all the files had been modified (allow roughly 0.5 seconds per text file).

find . -maxdepth 1 -name '*.txt' -exec sed -i.bak 's/[[:space:]]\{1,\}[^[:space:]]\{1,\}$//' {} \;
The Original Text Files are now in the Correct KITTI Format with Fifteen Elements.
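If the sed syntax gives you trouble (the -i flag behaves differently across platforms), the same trimming can be done in Python. This is a minimal sketch that assumes every label line has 16 space-separated fields and that rootdir points at your label folder.

import glob

rootdir = "{your_path}/test"
for path in glob.glob(rootdir + "/*.txt"):
    lines = open(path).read().splitlines()
    # Keep only the first 15 fields of each non-empty line
    trimmed = [" ".join(line.split(" ")[:15]) for line in lines if line.strip()]
    open(path, "w").write("\n".join(trimmed))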

Follow these steps for both your training and test data. If you want a systematic way to check that your labels are properly formatted (i.e., have 15 elements), try running the following Python code in an empty Jupyter Notebook to check the number of elements.

# This code checks if your labels have 15 fields.
import glob

rootdir = "{your_path}"  # e.g., Data/testing/label_2
for i in glob.glob(rootdir + "/*.txt"):
    with open(i, 'r') as j:
        for line in j.readlines():
            label = line.strip()
            length = len(label.split(" "))
            print("This label has {} fields".format(length))
            assert length == 15, 'Ground truth KITTI labels should have only 15 fields, and there should be no empty lines. Please check the labels in the file %s' % (i)

Please ensure the photos you have are in the PNG format, as mentioned at the beginning of the article, before continuing to the next steps. In addition, we want to make sure that the photos meet the appropriate sizing requirements for YOLO: the PNGs need a width and height that are multiples of 32, per the YOLOv3 requirements quoted earlier.

In our case, we had to resize our custom dataset images to meet this requirement. The following code modified our original images in our custom dataset (which were 800x600) to 800x576 and is included here as an example of how to resize images if needed.

import glob
from PIL import Image

rootdir = "{your_path}/training/image_2"  # e.g., Data/training/image_2
for i in glob.glob(rootdir + "/*.png"):
    print(i)
    # Opens an image in RGB mode
    im = Image.open(i)
    # Size of the image in pixels (size of the original image)
    width, height = im.size  # 800, 600
    # Setting the points for the cropped image
    left = 0
    top = 0
    right = width
    bottom = height - 24

    # Cropped image of the above dimensions
    im = im.crop((left, top, right, bottom))
    newsize = (800, 576)
    im = im.resize(newsize)
    # Save the resized image
    im.save(i)

Therefore, we also need to modify the labels to match the new image size.

import os, re
import glob

rootdir = "{your_path}/training/label_2"
for i in glob.glob(rootdir + "/*.txt"):
    print(i)

    data = open(i, 'r').read().replace('\n', '')
    data_list = re.split(" ", data)  # list of values
    y_max = data_list[7]  # extract the bounding box bottom coordinate
    # fit inside the resized 800x576 image

    if float(y_max) > 576.0:
        data_list[7] = '576.0'
    new_data = " ".join(data_list)
    f = open(i, 'w')
    f.write(new_data)
    f.close()

Option #2: Obtaining Data from Google’s OpenImages for TLT

If you do not have a custom dataset, you can create one using Google’s OpenImages. We used the following tutorial to download Google’s OpenImage data. First, install the openimages package via the following command:

pip install openimages

Then, select which items or categories you are interested in. You can browse the available classes at the following link. For example, let’s say that I was interested in obtaining labels and photos for pastries; I would input the following command in the command line (this assumes you will be using the DarkNet-19 architecture with YOLO for object detection):

oi_download_dataset --base_dir ~/dir_Pastry --labels Pastry --format darknet --limit 2000 

Or, if you would like the directory to show up on the Desktop, do the following:

oi_download_dataset --base_dir /Users/[your name]/Desktop/dir_Pastry --labels Pastry --format darknet --limit 2000
The Output from Running the Above Command. It May Take a Few Minutes to Run Completely.

This command will create a dir_Pastry folder, which contains overall annotation information along with the images and labels. If you are planning on detecting multiple objects, repeat this process for the different categories until you have a directory for each item. If you look inside dir_Pastry, you will see a darknet_obj_names.txt file that contains the name of the class (in this case, pastry) and a folder called pastry with the images and labels (the labels folder is called darknet). We now need to convert the text files in the darknet folder to the KITTI format.

Directory After Running the Above Commands

In order to convert this format to the KITTI format, we will use the following tool. Clone the repository and create two folders (with format [classname] and [classname]KITTI) for each item you have pulled images and labels for. For example, if I am interested in detecting backpacks, suitcases, handbags, and briefcases, I will have the following folders in convert-datasets:

Initial Folder Structure for Convert-Datasets Directory. Renamed darknet_obj_names.txt for Various Classes and Added Them to Repository.

Now, for each folder with just the class name (e.g., briefcase), create two folders called train and val. We need to do this because it's the format that the convert-dataset tool recognizes. In the train folder, copy the darknet and images folders from dir_[classname] (e.g., dir_Briefcase) and rename them to labels and images, respectively. Do the exact same thing for the val folder, and repeat these steps for all the different classes you’re interested in.

Example File Structure for Each Individual Class in Convert-Datasets.

Also, take the darknet_obj_names.txt file in the dir_[classname] folder and rename it to darknet_obj_names_[classname].txt (e.g., darknet_obj_names_briefcase.txt). Then, move it to the root folder of the convert-datasets repository. Do this for all the classes you’re using.
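If you have several classes, this copying and renaming can be scripted. The following is a rough sketch rather than a definitive recipe: it assumes the openimages download used the default paths from the commands above (e.g., ~/dir_Pastry) and that you run it from the root of the convert-datasets repository; adjust classname and the paths for each of your classes.

import os, shutil

classname = "pastry"
download_dir = os.path.expanduser("~/dir_Pastry")   # output of oi_download_dataset
src = os.path.join(download_dir, classname)         # contains darknet/ and images/

for split in ["train", "val"]:
    dst = os.path.join(classname, split)
    os.makedirs(dst, exist_ok=True)
    # copytree will fail if labels/ or images/ already exist under dst
    shutil.copytree(os.path.join(src, "darknet"), os.path.join(dst, "labels"))
    shutil.copytree(os.path.join(src, "images"), os.path.join(dst, "images"))

os.makedirs(classname + "KITTI", exist_ok=True)
shutil.copy(os.path.join(download_dir, "darknet_obj_names.txt"),
            "darknet_obj_names_" + classname + ".txt")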

Now, return to the root directory of convert-datasets and run the following:

# Generic
python3 convert-dataset.py --from yolo --from-path [classname]/ --to kitti --to-path [classname]KITTI/ --label darknet_obj_names_[classname].txt

# Example with Pastry
python3 convert-dataset.py --from yolo --from-path pastry/ --to kitti --to-path pastryKITTI/ --label darknet_obj_names_pastry.txt

If there are issues with running this command, try doing the following:

# Installing Pillow (run via command line) 
python3 -m pip install --upgrade pillow
# Installing LXML (run via command line)
pip3 install lxml

Then, you should receive the following; please type yes to override.

A Successful Conversion to the KITTI Format

Sometimes, you won’t receive a “Conversion complete!!” output and the command will hang. In this case, it may have actually completed the conversion and you can just CTRL + C. You can check your labels to make sure they have properly converted to the KITTI format. As an example:

Proper Output to KITTI Format Using the Pastry Example

However, if you count the elements, there are 16. As mentioned previously, TLT requires the KITTI format to contain only 15 elements. The quickest solution was to run the following CLI command and then delete the resulting .bak files it generates. The command takes the existing .txt files and edits them to remove the last 0.

find . -maxdepth 1 -name '*.txt' -exec sed -i.bak 's/[[:space:]]\{1,\}[^[:space:]]\{1,\}$//' {} \;

After doing these steps, your [classname]KITTI directory will have your data. You can delete the labels directory in the val folder as it will not be utilized.

Example Output for the Pastry Example.

Steps After Obtaining KITTI Formatted Data

At this point, you should have your photos for training/testing and their KITTI labels, respectively. If you followed the OpenImages method, you will have one or several directories labeled [classname]KITTI, each with a train and val directory inside. For your own custom dataset, you will have a folder with your labels and a folder with your photos for your train/val sets, respectively. Regardless of the method, it’s important to consolidate all this information into a standard format for TLT. Because TLT ships a sample Jupyter notebook with pre-written code that expects specific file names and folder layout, using this standard format keeps modifications to that code minimal and prevents errors in later steps.

First, we need to ensure that the images and labels follow an ascending numerical order and start with four zeros in addition to being sized properly. Please note that you only need to rename the images for the validation dataset and both the image and label data for the training dataset, regardless of the method you used to obtain KITTI data. If you have only one repository or dataset, this process is fairly simple as you can create an empty Jupyter notebook and run the following for both the training and validation repositories. For label data in the training dataset:

import os

i = 0
# Path to the labels data (e.g., {your_path}/convert-datasets/pastryKITTI/train/labels)
path = "{your_path}/test/labels"
dir = os.listdir(path)
dir.sort()
for filename in dir:
    os.rename(path + '/' + filename, path + '/0000' + str(i) + '.txt')
    i = i + 1
Example Output of Running the Above Command.

This command renames everything in the folder in ascending order. Next, do the same command but for PNG images. Note that there should be the same number of text files as there are PNG images for each class. For image data:

import os

i = 0
# Path to the image data (e.g., {your_path}/convert-datasets/pastryKITTI/train/images)
path = "{your_path}/test/images"
dir = os.listdir(path)
dir.sort()
for filename in dir:
    os.rename(path + '/' + filename, path + '/0000' + str(i) + '.png')
    i = i + 1
Example Output of Running the Above Command.
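As noted above, the number of label files should match the number of images for each class. A quick way to confirm this is a check like the following sketch (adjust the paths to your own folders):

import glob

n_labels = len(glob.glob("{your_path}/test/labels/*.txt"))
n_images = len(glob.glob("{your_path}/test/images/*.png"))
print(n_labels, "labels /", n_images, "images")
assert n_labels == n_images, "Label and image counts should match."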

Now, you should have your labels and images in ascending numerical order, where 00001.txt corresponds to 00001.png and so on. If you used the Google OpenImages method and have multiple directories for different classes, it helps to consolidate your photos and labels into one directory as TLT assumes your data is stored in one repository. We found that images and labels may share the same name when pulling info on different classes. Therefore, if you were to consolidate the photos/labels immediately into one folder, you would have duplicates that refer to different information; it helps to rename the files in each repository separately before consolidating.

Do the above step on the first class’s repository; for each subsequent class, use the same code but change i=0 to i = [1 + the last number used in the previous repository]. In other words, if repository one had 100 images (numbered 00000 through 00099), we’d rerun the above commands on the second repository’s labels and images with i=100. Once done, combine the images and labels into one folder; more info on this specific format is below.
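Here is a rough sketch of that consolidation for one additional class, assuming the source folder layout from the previous steps; the paths and the offset value are placeholders to adjust for your own data.

import glob, os, shutil

offset = 100  # 1 + the last number used in the previous class's folder
src = "{your_path}/pastryKITTI/train"
dst = "{your_path}/consolidated/train"

os.makedirs(os.path.join(dst, "images"), exist_ok=True)
os.makedirs(os.path.join(dst, "labels"), exist_ok=True)

images = sorted(glob.glob(os.path.join(src, "images", "*.png")))
labels = sorted(glob.glob(os.path.join(src, "labels", "*.txt")))
for i, (img, lbl) in enumerate(zip(images, labels)):
    name = '0000' + str(i + offset)  # same naming scheme as the renaming code above
    shutil.copy(img, os.path.join(dst, "images", name + ".png"))
    shutil.copy(lbl, os.path.join(dst, "labels", name + ".txt"))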

Regardless of the method, we need to create or use one folder (e.g., tlt-experiments/data) that has two folders, training and testing, where testing contains just the images used for validation/testing and training contains the images and labels for training. In other words, the format for your data should look like so:

Folder Chart for Datasets for TLT. Use Label_2/Image_2 as the Internal Folders Within Training and Testing.

The reason we are using this specific format is that it matches the expected format in the YOLO sample Jupyter notebook that NVIDIA has provided for TLT. Putting our data into this format now saves us time in future steps.

You can pick a random subset of your images to put into testing under image_2 or pick new images entirely (make sure they follow the ascending naming format mentioned previously). For training, put your consolidated images folder there and rename it to image_2. Then, put your consolidated labels folder there and rename it to label_2. Finally, double-check that your labels have 15 fields and that your images still meet the sizing requirements.
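If you'd rather script the random test subset mentioned above, a minimal sketch (with a placeholder sample size and paths matching the structure described here) could look like this:

import glob, os, random, shutil

train_images = glob.glob("{your_path}/data/training/image_2/*.png")
test_dir = "{your_path}/data/testing/image_2"
os.makedirs(test_dir, exist_ok=True)

# Copy e.g. 20 randomly chosen images into the testing folder
for path in random.sample(train_images, 20):
    shutil.copy(path, test_dir)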

Once your Data folder is properly set up, upload it to GitHub or use an alternative storage method that lets you pull or access the data elsewhere. We uploaded our data to GitHub and pulled it onto the DGX-1.

Training a Model with TLT

Setting up TLT for Use

Next, we will need to set up TLT for use. NVIDIA’s instructions for doing so are provided here as well. As a reminder, we were running TLT on a remotely accessed DGX-1 and using TLT version 2.0. It appears version 3.0 handles this differently and doesn’t use Docker as explicitly; please see the documentation linked previously for more information. First, run:

docker pull nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3
Resulting Output of Running the Above Command

Next, create an NGC account at the following site: enter your email, select Create Account, and fill in your information. Then, configure the NGC command-line interface via ngc config set. You will be asked for an API key, which you can obtain from the following link (assuming you are logged into NGC). Copy the API key and paste it into the terminal; for the rest of the set-up, use the default options. After this, you will run the TLT Docker image (making sure to add your own project folder). In our case, we had uploaded our data to GitHub; we pulled the data (the Data folder created in previous steps) onto the DGX-1 via Git and moved it to a purestorage directory.

docker run --gpus all -it -v /purestorage/{project_name}:/workspace/{project_name} -p 8848:8848 nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 /bin/bash

As our data was in purestorage, our location starts with /purestorage; you should replace this with the location of your data. In addition, the -p 8848:8848 flag enables port forwarding on port 8848, which we needed in order to access Jupyter Notebook on the DGX-1 from a remote desktop. You may not need this flag if you aren’t running anything remotely.

It may take a few minutes to run the first time. Essentially, this starts up a docker container that contains sample Jupyter notebooks from NVIDIA for object detection. In our case, we want to use the YOLO example.

Change directories to the example (e.g., /workspace/examples/yolo) that you’re interested in and run jupyter notebook. Open localhost and navigate to the relevant .ipynb file (in this case, yolo.ipynb). In order for us to run Jupyter Notebook given our set-up, we used the following command:

jupyter notebook --ip 0.0.0.0 --port 8848 --allow-root

This will allow root access and port-forwards on port 8848. In the remote desktop, open http://localhost:8848 in Chrome or use one of the URLs provided in the terminal with a token.

Note: if you are done and would like to leave Docker, simply press CTRL + P + Q, which detaches from the container but keeps it running. If you would like to stop the container entirely, just type exit. To see all containers, run docker ps -a. To remove a container, type docker rm [container_name]. To start a container again (assuming you started a project, already configured it, and want to resume work), run docker start -i [container_id] and press Enter.

Assuming you have successfully started the Jupyter notebook and opened it (e.g., yolo.ipynb), you will first need to input some information about where your data is stored.

If You Run Everything Successfully, This Will Be The Jupyter Notebook.

The first step is to set up our environment variables. First, in the Set up Env variables section, put the API key from the previous section under the variable called KEY. Then, under USER_EXPERIMENT_DIR, put the directory that contains the yolo folder (for our case, this was /workspace/{project_name}/yolo). Next, under DATA_DOWNLOAD_DIR, put the directory that contains your Data folder from the previous part (for us, this was /workspace/{project_name}/data). Then, run the code block. You can check that the folder contains the images and labels via the following:

!dir /workspace/{project_name}/data/training/image_2
!dir /workspace/{project_name}/data/training/label_2
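For reference, the environment-variable cell can also be written with plain os.environ calls, as in the sketch below (the sample notebook may use %env / %set_env magics instead). The paths are placeholders matching our project layout, and SPECS_DIR is the variable the notebook uses later for the spec files.

import os

os.environ["KEY"] = "<your NGC API key>"
os.environ["USER_EXPERIMENT_DIR"] = "/workspace/{project_name}/yolo"
os.environ["DATA_DOWNLOAD_DIR"] = "/workspace/{project_name}/data"
os.environ["SPECS_DIR"] = "/workspace/examples/yolo/specs"  # assumed location of the sample spec files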

Next, you can skip to the last code block in “Prepare Dataset and Pre-Trained Model.” Add a new code block with the following code, which will generate the best anchor sizes based on your data.

!python kmeans.py -l $DATA_DOWNLOAD_DIR/training/label_2 -n 9
Example Output of Running the Above Command.

Now, we need to modify the paths in the yolo_tfrecords spec. If you go to 127.0.0.1:{port_number}/tree (for us, the port was 8848) and click on the specs directory, the file is called yolo_tfrecords_kitti_trainval.txt. Change the root_directory_path and image_directory_path to match your set-up (where your training data is). In our case, the format was the following:

root_directory_path: "/workspace/{project_name}/data/training"
image_directory_path: "/workspace/{project_name}/data/training"

Once you’ve modified this file, make sure you save it and then go back to the Jupyter Notebook. You can now run the code block with tlt-dataset-convert. This will create a directory and put the TF records there.

Successfully Running the Above Command.

If all goes well, the last output should say Tfrecords generation complete. Next, the Jupyter Notebook will guide you to view the generated TF records.

The next step is to download the pre-trained model. We will use the NGC CLI to get the pre-trained model (!ngc registry model list nvidia/tlt_pretrained_object_detection:* will show which models exist). We’re interested in DarkNet, so let’s run the following command after creating the directory (!mkdir -p $USER_EXPERIMENT_DIR/pretrained_darknet19/). These steps should be in the Jupyter Notebook already.

!ngc registry model download-version nvidia/tlt_pretrained_object_detection:darknet19 --dest $USER_EXPERIMENT_DIR/pretrained_darknet19
Example Output for Running the Above.

The next step is to provide training specifications. Remember that command we ran earlier that gave us anchor sizes? We will use that output now. We need to modify yolo_train_resnet18_kitti.txt, which you can find by going to 127.0.0.1:{port_number}/tree (for us, the port was 8848) and clicking on the specs directory. Under yolo_config, change the anchor shapes to match the output.

Changing the Anchor Boxes for Yolo_Config.

Then, change the arch parameter to say “darknet” and change the nlayers to say 19. You can also change the batch_size_per_gpu parameter and num_epochs here, which is recommended but will vary based on your situation (i.e., workstation) and how much data you have. If you are getting poor results, try changing these values and seeing if there’s a difference.

Under augmentation_config, you can change, for example, output_image_width to be 800 and output_image_height to be 576. Additionally, crop_right and crop_bottom will be 800 and 576, respectively. Now, under dataset_config, change the tfrecords_path to match where your tfrecords are stored (e.g., “/workspace/{project_name}/data/tfrecords/kitti_trainval/kitti_trainval*”). You will also need to change the image_directory_path to the path of the images, like so: /workspace/{project_name}/data/training. The final and arguably most important part is making sure that you include all the classes under target_class_mapping, each of which should have a key and value with the name of your class. Put all the classes that you’re trying to detect here. For example, if I have a dataset and am trying to detect different Nike shoes, my dataset_config may look like the following:

Part of the Yolo_Train_Resnet18_KITTI.txt File.
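As an illustration only (your paths and class names will differ, and you should follow the structure already present in the sample spec), the dataset_config portion might look roughly like this, with one target_class_mapping block per class; air_max_90 below is just a hypothetical second class:

dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/{project_name}/data/tfrecords/kitti_trainval/kitti_trainval*"
    image_directory_path: "/workspace/{project_name}/data/training"
  }
  target_class_mapping {
    key: "air_force_1"
    value: "air_force_1"
  }
  target_class_mapping {
    key: "air_max_90"
    value: "air_max_90"
  }
}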

After doing this, save the file and continue to step 3 in the Jupyter Notebook. In this step, we run the actual training process, which may take some time depending on the capacity of your machine. Additionally, please change the number of GPUs to match that of your system and include the DarkNet-19 pre-trained model (we did not change the name of our spec file so it still says resnet18 though it’s using darknet19). It should look something like this:

Running the Training Command.

Once finished, you will see that the final epoch number will show with the AP and mAP values for the classes/model, respectively. You can then check to see that the model for each epoch was saved via !ls -ltrh $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights. You then want to go through and select the model with the highest accuracy via the following:

# Now check the evaluation stats in the csv file and pick the model with the highest eval accuracy.
# This outputs all the models' accuracies.
!cat $USER_EXPERIMENT_DIR/experiment_dir_unpruned/yolo_training_log_darknet19.csv

# For example. Likely, you may have more than 80 epochs and the most accurate epoch may be a different number.
%set_env EPOCH=080
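If you'd rather not eyeball the CSV, a short script can pick the best epoch for you. This is a sketch that assumes the log has columns named epoch and mAP; check the header of your file and adjust the names if they differ.

import csv

best_epoch, best_map = None, -1.0
with open("yolo_training_log_darknet19.csv") as f:
    for row in csv.DictReader(f):
        # Skip rows where the mAP column is empty
        if row.get("mAP") and float(row["mAP"]) > best_map:
            best_epoch, best_map = row["epoch"], float(row["mAP"])
print("Best epoch:", best_epoch, "with mAP:", best_map)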

The next step is to evaluate the trained models via the following:

!tlt-evaluate yolo -e $SPECS_DIR/yolo_train_resnet18_kitti.txt \
                   -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/yolo_darknet19_epoch_$EPOCH.tlt \
                   -k $KEY

Optimizing the Model & Visualizing Results with TLT

The next step involves model pruning. You can just follow the instructions in the Jupyter Notebook for this part, which will look like the following:

The Output of the Pruning Trained Models Step.

Next, we will retrain the pruned model to bring back accuracy lost to pruning. In this case, we need to change the spec file yolo_retrain_resnet18_kitti.txt, which is found in the specs directory (the same place as the other spec files). Make the same changes here that you made to the yolo_train_resnet18_kitti.txt file, and save it.

Output of Running the Above Commands.

Just like in the last section, make sure that you %set_env EPOCH to the model with the highest evaluation accuracy before you run step seven, shown below.

!tlt-evaluate yolo -e $SPECS_DIR/yolo_retrain_resnet18_kitti.txt \
                   -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/yolo_darknet19_epoch_$EPOCH.tlt \
                   -k $KEY

Now, to visualize the output (finally), we run the following (which will copy some test images and run the object detection on those images):

Running the Output of the Above Commands.

The Jupyter Notebook does contain code that allows you to visualize the outputs in a grid-like format, but we felt like the images were too small to see the labels. This is an example output of the provided code:

Example Output of the YOLO Object Detection Model

Therefore, we used the following code to see images one by one:

from IPython.display import Image
Image(filename='/workspace/{project_name}/yolo/yolo_infer_images/000038.png')  # replace with the path of the image you want
Example Output of the Object Detection Model
Another Example Output of the Object Detection Model

If the results are not satisfactory, consider changing some parameters (e.g., epoch size) and re-training your model. Hopefully, you found this guide helpful. Feel free to leave a comment if you have any questions.
