Converting Neural Network To TensorRT. Part 2: Creating a Custom Layer

Roman
3 min read · Jul 9, 2019

Part 1: https://medium.com/@r7vme/converting-neural-network-to-tensorrt-part-1-using-existing-plugins-edd9c2b9e42a

In part 1 I described how to convert a neural network with supported layers into a TensorRT plan. In this part I'll describe how to create a custom layer for TensorRT. The example will be the "l2norm_helper" plugin that I created to support the TensorFlow l2_normalize operation.

Source code: https://github.com/r7vme/tensorrt_l2norm_helper

A TensorRT plugin requires two major parts to be implemented:

  1. CUDA kernels (aka device code)
  2. TensorRT classes: IPluginV2 and IPluginCreator

Why is l2_normalize not supported by TensorRT?

This is a reasonable question. First let’s check what l2_normalize is in TensorFlow.

It actually consists of a bunch of operations (not just one), roughly x * rsqrt(maximum(reduce_sum(square(x)), eps)). Two of them throw errors in TensorRT (checked with 5.0.6); it seems these two do not pass some internal TensorRT restrictions (related thread).

l2_normalize/Maximum: Unsupported binary op max with constant right

l2_normalize/Rsqrt: Unary not supported for other non-constant node

CUDA kernels

Let's start with the workhorse: the CUDA kernels. If you're new to CUDA, I highly recommend this presentation from NVIDIA. The rsqrt kernel computes 1/sqrt(x) for every element of the input vector and writes the result to the output vector. Easy.
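
For illustration, a minimal sketch of such an element-wise rsqrt kernel could look like this (kernel and variable names are mine, not necessarily what the repo uses):

// Element-wise 1/sqrt(x): one thread per element.
__global__ void rsqrtKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = rsqrtf(in[i]);
}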

Initially I thought I'd be implementing the whole l2_normalize, but after realizing what it costs to implement even a "simple" reduce_sum, I fell back to a solution with just rsqrt and max. Check out the awesome presentation from Mark Harris about parallel reduction.

OK, moving on to the second kernel, max. It compares each vector element with "eps" (a really small number that prevents division by zero) and takes the bigger one.
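
A sketch of that second kernel, under the same assumptions as above:

// Element-wise max(x, eps) to avoid division by zero downstream.
__global__ void maxKernel(const float* in, float* out, float eps, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmaxf(in[i], eps);
}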

Implementing this kernel can actually be avoided by replacing "math_ops.maximum" with "math_ops.add", but that requires reimplementing l2_normalize in your network definition. That's it, just two CUDA kernels.

Implementing IPluginV2 and IPluginCreator

In general, you just have to implement two classes, IPluginV2 and IPluginCreator. Easy! This can even be a single .cpp file, as in "samplePlugin".

NOTE: For 5.1 you have to go with IPluginV2Ext, but in 5.0 it's not supported yet.

NOTE: In older plugin versions you had to implement IPluginFactory; in newer ones (from 5.0?) that's not necessary, but IPluginCreator has to be implemented instead.

Before implementing your own plugin, I recommend starting with the official docs (which contain many important details). A good starting point is the official normalizePlugin from the TensorRT repo (thanks for making it open source, NVIDIA). Also, under "samples" there are a bunch of trivial ones (e.g. samplePlugin is just one file). Below I'll highlight the most important details that took me some time to realize on my own.

Plugin input can consist of:

  • custom parameters (in my case “op_type” and “eps”)
  • input vector dimensions C, H, W
  • weights (no weights in my case)

In my case this maps to the constructor:

L2NormHelper(int op_type, float eps, int C, int H, int W);

At runtime, input values are read from a serialized buffer (a separate constructor is required).

Here are useful read/write snippets for serialization/deserialization from the NVIDIA samples.
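
In the spirit of those samples, the helpers typically look like the following; the deserializing constructor below is a hedged sketch (member names like mOpType are my own, and fields must be read in the same order they were written in serialize()):

// Copy a value into / out of the raw serialization buffer, advancing the pointer.
template <typename T>
void write(char*& buffer, const T& val)
{
    *reinterpret_cast<T*>(buffer) = val;
    buffer += sizeof(T);
}

template <typename T>
T read(const char*& buffer)
{
    T val = *reinterpret_cast<const T*>(buffer);
    buffer += sizeof(T);
    return val;
}

// Deserializing constructor: restores the plugin state saved by serialize().
L2NormHelper::L2NormHelper(const void* data, size_t length)
{
    const char* d = static_cast<const char*>(data);
    mOpType = read<int>(d);
    mEps = read<float>(d);
    mC = read<int>(d);
    mH = read<int>(d);
    mW = read<int>(d);
}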

The CUDA kernels should be launched from enqueue.

Input is always a 4D matrix in NCHW dimension order (as far as I can tell). Even if in TF you had a 3D matrix (e.g. shape (?,6,2)), in TRT it will be the 4D matrix (?,6,2,1). In my case, batch members are processed sequentially (TODO: this could actually be done in parallel quite easily).
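
As a rough illustration (not the repo's exact code), an enqueue that launches the rsqrt kernel above could look like this; mInputVolume (the per-sample element count, C*H*W) is an assumed member variable:

// Launch one kernel per batch member, all on the stream TensorRT provides.
int L2NormHelper::enqueue(int batchSize, const void* const* inputs, void** outputs,
                          void* workspace, cudaStream_t stream)
{
    const int blockSize = 256;
    const int gridSize = (mInputVolume + blockSize - 1) / blockSize;
    for (int b = 0; b < batchSize; ++b)
    {
        const float* in = static_cast<const float*>(inputs[0]) + b * mInputVolume;
        float* out = static_cast<float*>(outputs[0]) + b * mInputVolume;
        rsqrtKernel<<<gridSize, blockSize, 0, stream>>>(in, out, mInputVolume);
    }
    return 0;
}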

IPluginCreator can mostly be taken as-is from the samples. Its main function is createPlugin: it gets values from the "PluginFieldCollection" and then creates a new instance of the plugin.
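
A hedged sketch of what createPlugin can look like; the field names "op_type" and "eps" match the post, while the defaults and class names are illustrative:

nvinfer1::IPluginV2* L2NormHelperCreator::createPlugin(
    const char* name, const nvinfer1::PluginFieldCollection* fc)
{
    int opType = 0;
    float eps = 1e-12f;
    for (int i = 0; i < fc->nbFields; ++i)
    {
        const nvinfer1::PluginField& f = fc->fields[i];
        if (strcmp(f.name, "op_type") == 0)
            opType = *static_cast<const int*>(f.data);
        else if (strcmp(f.name, "eps") == 0)
            eps = *static_cast<const float*>(f.data);
    }
    // C, H, W are placeholders here; the real values come from the network
    // (e.g. via configureWithFormat) rather than from the field collection.
    return new L2NormHelper(opType, eps, 0, 0, 0);
}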

Plugin registration happens automatically by adding a special line in the header.
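
For reference, that line is the REGISTER_TENSORRT_PLUGIN macro applied to the creator class (the creator name below follows the plugin's naming and is an assumption on my part):

// Registers the creator with the global plugin registry at library load time.
REGISTER_TENSORRT_PLUGIN(L2NormHelperCreator);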

After this is done, an application can use the plugin simply by including the header file.

#include "l2norm_helper.h"

CMake 3.8 and later has built-in support for CUDA. The only problem I was not able to solve is autodetection of GPU compute capabilities (with CMake 3.10.2); the value is hard-coded for Xavier (in the fp16 branch). Help appreciated.

Last, but not least: FP16 (half-precision) support. I implemented all the pieces that are necessary IMO (branch f16support), but I still see that TensorRT adds reformat layers before and after the plugin. Is it even possible to use FP16 in a custom layer?

Adding reformat layer: leaky_re_lu_3/LeakyRelu output to be reformatted 0 (leaky_re_lu_3/LeakyRelu) from Half(1,1,1,12) to Float(1,1,1,12)
Adding reformat layer: orientation/l2_normalize output to be reformatted 0 (orientation/l2_normalize) from Float(1,1,2,12) to Half(1,1,2,12)

This was my first TensorRT plugin, so any feedback is appreciated.
