Converting a Neural Network to TensorRT. Part 1: Using Existing Plugins.

4 min readJul 9, 2019
Every 3D bounding box estimation in the image above took only 6 milliseconds.

What is TensorRT?

TensorRT is a framework from NVIDIA that significantly speeds up neural network inference. TensorRT does this by fusing multiple layers together and selecting optimized CUDA kernels. In addition, lower-precision data types can be used (e.g. float16 or int8). In the end, up to a 4x-5x performance boost can be achieved, which is critical for real-time applications.

TensorRT usually comes together with JetPack (NVIDIA’s software for the Jetson series of embedded devices). For non-JetPack installations, check out the TensorRT installation guide.

DISCLAIMER: This post describes specific pitfalls and an advanced example with a custom layer. For “out-of-the-box” TensorRT examples, please check out the tf_to_trt_image_classification repo. NVIDIA has really good tutorials there!

The example described in this post has the following inference times:

  • 30 milliseconds TensorFlow
  • 18 milliseconds TensorRT with FP32 (Floating point 32bit)
  • 6 milliseconds TensorRT with FP16 (Floating point 16bit)

As you can see, the final result is 5x faster than pure TensorFlow.

Inference time in ms

This blog post is divided into two parts. Part 1 (this one) describes the overall workflow and shows how to use existing TensorRT plugins. Part 2 (link) shows how to create a custom TensorRT layer/plugin.

Selecting a neural network

As an example, I’ll take a TensorFlow/Keras-based network for 3D bounding box estimation (code, paper).

Why this network?

  • l2_normalize operation, which is not supported by TensorRT (we will build a custom plugin for it)
  • LeakyRelu operation, which is unsupported by default but can be replaced with the official LRelu_TRT plugin. More about official plugins below.
  • Flatten layer, which can silently confuse TensorRT and completely ruin network quality. This layer will be replaced with Reshape.

General workflow

The first thing to do is to freeze and optimize your graph. The resulting protobuf (.pb) file will be used in the next steps. This script can be used to freeze the graph.
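For reference, the core of a freeze script looks roughly like this. This is a sketch assuming a TF 1.x Keras model; the function name and output file name are illustrative:

```python
def freeze_keras_model(model, output_pb="frozen.pb"):
    """Freeze a TF 1.x / Keras model into a single .pb with constant weights.

    A rough sketch of what a freeze script does; `model` is assumed to be
    a built Keras model (e.g. the 3D bounding box network from this post).
    """
    import tensorflow as tf
    from tensorflow.python.framework import graph_util

    sess = tf.keras.backend.get_session()
    output_names = [out.op.name for out in model.outputs]
    # Replace variables with constants so the graph is self-contained
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_names)
    with open(output_pb, "wb") as f:
        f.write(frozen.SerializeToString())
    return output_names
```

The returned output node names are handy later, since the UFF converter needs them too.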

After graph has been frozen:

  1. Convert the TensorFlow protobuf to UFF format (it seems UFF is used only in TensorRT)
  2. Convert the UFF to a TensorRT plan (.engine)

INFO: A TensorRT plan is serialized binary data compiled exclusively for a specific hardware type (i.e. a plan for Jetson TX2 only works on Jetson TX2).

For PyTorch, Caffe, or other frameworks, the workflow is a bit different and not covered here.

In general, both steps can be done with one Python script. But because some TensorRT API functions are not available via the Python API (e.g. Deep Learning Accelerator related ones), most of NVIDIA’s official scripts use C++ for the second step. For my plugin (Part 2) I will also use a C++ script (feel free to make a PR with pybind to make it work with Python).

As a baseline example (save it for later ;)), take a look at the following Python script that converts our network (VGG16 backbone) to a TensorRT plan. To avoid custom layers, the l2_normalize operations are removed with “dynamic_graph.remove(node)” (covered in Part 2). But before this script can actually work, a few modifications to the original network are required.
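The conversion itself boils down to two calls: `uff.from_tensorflow_frozen_model` to get UFF, then TensorRT’s `UffParser` and `Builder` to produce the plan. A condensed sketch using the TensorRT 5.x Python API; the function name, file names, and shapes are illustrative:

```python
def pb_to_trt_plan(pb_path, input_name, input_shape, output_names,
                   plan_path="model.engine", fp16=False):
    """Convert a frozen TensorFlow graph to a TensorRT plan.

    Sketch of the two-step workflow: .pb -> UFF -> serialized engine.
    """
    import uff
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.INFO)
    trt.init_libnvinfer_plugins(logger, "")  # register _LReLU_TRT etc.

    # Step 1: TensorFlow protobuf -> UFF
    uff_model = uff.from_tensorflow_frozen_model(pb_path, output_names)

    # Step 2: UFF -> TensorRT plan
    with trt.Builder(logger) as builder, \
         builder.create_network() as network, \
         trt.UffParser() as parser:
        parser.register_input(input_name, input_shape)  # NCHW, e.g. (3, 224, 224)
        for name in output_names:
            parser.register_output(name)
        parser.parse_buffer(uff_model, network)

        builder.max_workspace_size = 1 << 30
        if fp16:
            builder.fp16_mode = True  # the FP16 mode measured above
        engine = builder.build_cuda_engine(network)

    with open(plan_path, "wb") as f:
        f.write(engine.serialize())
```

Note that the input shape is registered in NCHW order, which matters for the Flatten issue discussed next.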

Replacing ambiguous operations

Even if a TensorFlow/Keras operation is supported by TensorRT, it can sometimes work in an unexpected manner.

The Flatten operation does not specify an exact output shape, (?, ?), i.e. it depends on the input, and for some reason this confuses TensorRT. By default this silently messes up the NN output right after this operation. This may be related to the fact that TensorFlow uses NHWC and TensorRT uses NCHW dimension ordering (N is the number of batches, H is height, W is width, C is channels). So we have to specify exactly what we want and not rely on TensorRT’s assumptions about Flatten’s behavior.

- x = Flatten()(vgg16_model.output)
+ x = Reshape((25088,))(vgg16_model.output)
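Where does 25088 come from? It is simply the flattened size of VGG16’s last pooling output (7x7 spatial with 512 channels for a 224x224 input), computed explicitly instead of letting Flatten infer it:

```python
# VGG16's final pooling output for a 224x224 input is 7x7 spatial with
# 512 channels; Reshape needs this product spelled out explicitly.
h, w, c = 7, 7, 512
flat_size = h * w * c
print(flat_size)  # 25088
```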

Reshape with “-1”. TensorRT complains if “-1” is used in a reshape operation. In TensorFlow, “-1” is a special case where the exact value is computed based on the input shape.

UFFParser: Parser error: reshape_2/Reshape: Reshape: -1 dimension specified more than 1 time

This can be solved by using exact shapes:

- orientation = Reshape((bin_num, -1))(orientation)
+ orientation = Reshape((bin_num, 2))(orientation)

NOTE: After the above changes we don’t need to retrain the network, as the operations are equivalent.

Replacing LeakyRelu with TensorRT plugin

Here comes the slightly tricky part. Despite the fact that there is an existing plugin for LeakyRelu, it’s not obvious (from the docs) how to use it.

We need a surgeon.

It turns out this is a pretty simple operation called “collapse_namespaces” in graph surgeon. Graph surgeon, as the name states, is a special utility intended to cure graphs via surgery: with it you can remove/append/replace nodes. “collapse_namespaces” is intended to replace nodes. That’s it: we specify the original node name, prepare a new replacement node, and collapse_namespaces does the surgery.

“negSlope” is the alpha coefficient of LeakyRelu. Different plugins can have different attributes.
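With graphsurgeon (shipped alongside TensorRT’s Python packages) the replacement is only a few lines. A sketch of the pattern; the 0.1 slope matches this network, and the function name is illustrative:

```python
def replace_leaky_relu(dynamic_graph):
    """Swap every TensorFlow LeakyRelu node for the _LReLU_TRT plugin.

    `dynamic_graph` is assumed to be a graphsurgeon.DynamicGraph built
    from the frozen .pb file.
    """
    import graphsurgeon as gs

    namespace_map = {}
    for node in dynamic_graph.find_nodes_by_op("LeakyRelu"):
        # negSlope is the plugin's name for LeakyRelu's alpha coefficient
        plugin = gs.create_plugin_node(
            name=node.name, op="_LReLU_TRT", negSlope=0.1)
        namespace_map[node.name] = plugin
    dynamic_graph.collapse_namespaces(namespace_map)
```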

Under the hood, if we look at the original box3d.pbtxt and box3d-uff.pbtxt (not generated here), the original LeakyRelu node (.pbtxt) is replaced with the new plugin node.

Original TensorFlow node.

node {
  name: "leaky_re_lu_2/LeakyRelu"
  op: "LeakyRelu"
  input: "dense_3/BiasAdd"
  attr {
    key: "T"
    value {
      type: DT_FLOAT
    }
  }
  attr {
    key: "alpha"
    value {
      f: 0.10000000149011612
    }
  }
}
New TRT plugin node.

nodes {
  id: "leaky_re_lu_2/LeakyRelu"
  inputs: "dense_3/BiasAdd"
  operation: "_LReLU_TRT"
  fields {
    key: "negSlope_u_float"
    value {
      d: 0.1
    }
  }
}

IMPORTANT: Do not forget to initialize the TRT plugins, otherwise TRT will not be able to recognize _LReLU_TRT.


Python:

trt.init_libnvinfer_plugins(G_LOGGER, "")

C++:

nvinfer1::initLibNvInferPlugins(&gLogger, "");

All other TRT plugins are listed here. All supported operations/layers are here.

In Part 2, I will describe how to create a custom plugin.


Setup

  • NVIDIA Jetson AGX Xavier
  • JetPack 4.2 (CUDA 10.0.166, TensorRT 5.0.6)
  • TensorFlow 1.13.1