Optimizing any TensorFlow model using the TensorFlow Graph Transform Tools and TensorRT

If you have a frozen TF graph, you can use the following methods to optimize it before using it for inference.

There are two types of optimization. One makes the graph faster or smaller for inference. The other changes the weights from higher to lower precision, usually from FP32 to FP16 or INT8. For the latter, the GPU needs hardware support for mixed-precision operations (Tensor Cores). NVIDIA's desktop and laptop class GPUs such as the GTX 1080 generally cannot run these lower precision operations, while NVIDIA's server-class GPUs, especially the newer V100 and T4, support them. Not all server GPUs support this, though.

The GPU I use is an NVIDIA V100 32 GB, which supports mixed-precision operations. Also, you need to run the optimization on the GPU that you are optimizing for, especially if you are using TensorRT.
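To check whether the GPU that TensorFlow sees actually has Tensor Cores (compute capability 7.0 or higher), a quick device listing helps. This is a minimal sketch using the TF 1.x device_lib API and is not part of the original workflow:

# List local devices and print the GPU description, which includes the
# compute capability; Tensor Cores need compute capability 7.0+ (V100, T4, etc.)
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    if dev.device_type == 'GPU':
        # e.g. "device: 0, name: Tesla V100-PCIE-32GB, ..., compute capability: 7.0"
        print(dev.physical_device_desc)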

Step 0. The model, and the Docker Containers

The first thing that has to be done is to convert the TensorFlow graph to a frozen graph. If the model is Keras based, it will be in HDF5 format and has to be converted to a TF SavedModel first and then to a frozen graph. A frozen graph has the values of the variables embedded in the graph itself. It is in GraphDef/protocol buffer (pb) format like a SavedModel, only it cannot be retrained.
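If you are starting from a Keras HDF5 file, a minimal sketch of the export to a SavedModel (TF 1.x style) is shown below; the file name my_model.h5 and the export directory are hypothetical, and the frozen graph can then be produced from the SavedModel with the freeze_model helper shown later:

# Hedged sketch: export a Keras HDF5 model to a SavedModel (TF 1.x)
import tensorflow as tf

tf.keras.backend.set_learning_phase(0)  # inference mode, disables dropout etc.
model = tf.keras.models.load_model('my_model.h5')  # hypothetical HDF5 file
sess = tf.keras.backend.get_session()
tf.saved_model.simple_save(
    sess,
    './saved_model/1',  # hypothetical export directory
    inputs={'input_image': model.input},
    outputs={t.name: t for t in model.outputs})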

The model that we are using is the SSD model ssd_resnet_50_fpn_coco from the TF model zoo: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

The Docker container used for the optimization is tensorflow/tensorflow:1.13.0rc1-gpu-jupyter

docker run --entrypoint=/bin/bash --runtime=nvidia  -it --rm -p 8900:8500 -p 8901:8501 -v /usr/alex/:/coding --net=host tensorflow/tensorflow:1.13.0rc1-gpu-jupyter
once inside
cd /coding
jupyter notebook --allow-root &

Note: I changed the entry point from the default (tf-notebook, I believe) to something more convenient for me.

After optimizing, to run inferences I use the same Docker image after installing the TF Serving API and the opencv-python package in it. This is because we will be converting the optimized model to a TF Serving compatible model for inference.

docker run --entrypoint=/bin/bash --env http_proxy=<my proxy> --env https_proxy=<my proxy>  --runtime=nvidia  -it --rm -p 8900:8500 -p 8901:8501 -v /usr/alex/:/coding --net=host tensorflow/tensorflow:1.13.0rc1-gpu-jupyter
pip install tensorflow-serving-api
pip install opencv-python==3.3.0.9
cd coding
python ssd_client_1.py -num_tests=1 -server=127.0.0.1:8500 -batch_size=1 -img_path='../examples/google1.jpg/'

Step 1. Get the output node names in the Tensorflow Graph

Why is this important? We need to find the output node names of the frozen graph because they are needed to optimize the graph. Note: the TensorFlow version used here is TF 1.13.

# To freeze the SavedModel
# We need to freeze the model to do further optimisation on it
import os
import tensorflow as tf
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.tools import freeze_graph
from tensorflow.python import ops
from tensorflow.tools.graph_transforms import TransformGraph

def freeze_model(saved_model_dir, output_node_names, output_filename):
    output_graph_filename = os.path.join(saved_model_dir, output_filename)
    initializer_nodes = ''
    freeze_graph.freeze_graph(
        input_saved_model_dir=saved_model_dir,
        output_graph=output_graph_filename,
        saved_model_tags=tag_constants.SERVING,
        output_node_names=output_node_names,
        initializer_nodes=initializer_nodes,
        input_graph=None,
        input_saver=False,
        input_binary=False,
        input_checkpoint=None,
        restore_op_name=None,
        filename_tensor_name=None,
        clear_devices=True,
        input_meta_graph=False,
    )
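The helper takes the output node names as a single comma-separated string. A hedged example call, assuming the original SavedModel sits in the folder 01 used later in this post:

# Hypothetical usage of the freeze_model helper above
freeze_model('/coding/ssd_inception_v2_coco_2018_01_28/01',
             'detection_boxes,detection_scores,detection_classes,num_detections',
             'frozen_model.pb')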

For this, we can either plot the model in TensorBoard and read off the output nodes, or print the nodes and grep for some keywords.

# Source https://medium.com/google-cloud/optimizing-tensorflow-models-for-serving-959080e9ddbf
def get_graph_def_from_file(graph_filepath):
    tf.reset_default_graph()
    with ops.Graph().as_default():
        with tf.gfile.GFile(graph_filepath, 'rb') as f:
            graph_def = tf.GraphDef()
            graph_def.ParseFromString(f.read())
            return graph_def

Let us use the above helper to print the input and output nodes; the input nodes are found via the for loop below,

graph_def = get_graph_def_from_file('/coding/ssd_inception_v2_coco_2018_01_28/frozen_inference_graph.pb')
for node in graph_def.node:
    if node.op == 'Placeholder':
        print(node)  # this will be the input node

and the output nodes by exporting the graph in a format readable by TensorBoard.

with tf.Session(graph=tf.Graph()) as session:
    mygraph = tf.import_graph_def(graph_def, name='')
    writer = tf.summary.FileWriter(logdir='/coding/log_tb/1', graph=session.graph)
    writer.flush()

Let us invoke TensorBoard.

#ssh -L 6006:127.0.0.1:6006 root@<remoteip> # for tensor board - in your local machine type 127.0.0.1
tensorboard --logdir '/coding/log_tb/1'

From this, I could make out the output nodes. Note that if you are building the graph yourself you don't need to go through this circus; since I am using an open-sourced model with little documentation, I do. Sometimes, for automatically converted or imported graphs, the node names can be pretty long. You can then print the nodes in a for loop, as I did above for the Placeholder, and work out from the output shapes which nodes hold the detection classes, scores and rectangle coordinates.
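As a complement to TensorBoard, a small grep over the node names can narrow down candidate outputs. This is a hedged sketch; the keyword list is an assumption for this SSD graph:

# Print nodes whose names contain likely output keywords (assumed keywords)
keywords = ['detection', 'num_detections']
for node in graph_def.node:
    if any(k in node.name for k in keywords):
        print(node.name, node.op)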

# These are the output node names. Add an index, usually 0, to get the tensor names.
# You can print the node details by node name
output_node_names = ['detection_boxes','detection_scores','detection_classes','num_detections']
outputs = ['detection_boxes:0','detection_scores:0','detection_classes:0','num_detections:0']

Step 3 Optimise using TF Graph Transform Tools

The snippet below illustrates how you can optimize a graph after reading it from disk.

# Source https://medium.com/google-cloud/optimizing-tensorflow-models-for-serving-959080e9ddbf
# https://gist.github.com/lukmanr
# Optimizing the graph via the TensorFlow library
from tensorflow.tools.graph_transforms import TransformGraph

def optimize_graph(model_dir, graph_filename, transforms, output_names, outname='optimized_model.pb'):
    input_names = ['input_image', ]  # change this as per how you have saved the model
    graph_def = get_graph_def_from_file(os.path.join(model_dir, graph_filename))
    optimized_graph_def = TransformGraph(
        graph_def,
        input_names,
        output_names,
        transforms)
    tf.train.write_graph(optimized_graph_def,
                         logdir=model_dir,
                         as_text=False,
                         name=outname)
    print('Graph optimized!')

Let us use the above helper to optimize the graph, first with quantize_weights only.

# Optimization without node quantization (weight quantization only) - reduces the size of the model
# speed may actually be slower
# see https://medium.com/google-cloud/optimizing-tensorflow-models-for-serving-959080e9ddbf
transforms = ['remove_nodes(op=Identity)',
              'merge_duplicate_nodes',
              'strip_unused_nodes',
              'fold_constants(ignore_errors=true)',
              'fold_batch_norms',
              'quantize_weights']  # this reduces the size, but there is no speed up; it actually slows down, see below

optimize_graph('/coding/ssd_inception_v2_coco_2018_01_28', 'frozen_inference_graph.pb',
               transforms, output_node_names, outname='optimized_model_small.pb')

Let’s then convert the optimized model to TF serving compatible format.

# let us convert this to a TF Serving compatible model
convert_graph_def_to_saved_model('/coding/ssd_inception_v2_coco_2018_01_28/2',
                                 '/coding/ssd_inception_v2_coco_2018_01_28/optimized_model_small.pb', outputs)

The helper that does this is given below

# Source https://medium.com/google-cloud/optimizing-tensorflow-models-for-serving-959080e9ddbf
# https://gist.github.com/lukmanr
def convert_graph_def_to_saved_model(export_dir, graph_filepath, outputs):
    graph_def = get_graph_def_from_file(graph_filepath)
    with tf.Session(graph=tf.Graph()) as session:
        tf.import_graph_def(graph_def, name='')
        tf.saved_model.simple_save(
            session,
            export_dir,
            # change 'input_image' to node.name if you know the name
            inputs={'input_image': session.graph.get_tensor_by_name('{}:0'.format(node.name))
                    for node in graph_def.node if node.op == 'Placeholder'},
            outputs={t: session.graph.get_tensor_by_name(t) for t in outputs}
        )
        print('Optimized graph converted to SavedModel!')

And then with both 'quantize_weights' and 'quantize_nodes'.

This should also convert the calculations themselves to lower precision, but it does not work as of now.

“This process converts all the operations in the graph that have eight-bit quantized equivalents and leaves the rest in floating point. Only a subset of ops are supported and on many platforms, the quantized code may actually be slower than the float equivalents, but this is a way of increasing performance substantially when all the circumstances are right.”
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms#optimizing-for-deployment

transforms = ['add_default_attributes', 
'strip_unused_nodes',
'remove_nodes(op=Identity, op=CheckNumerics)',
'fold_constants(ignore_errors=true)',
'fold_batch_norms',
'fold_old_batch_norms',
'quantize_weights',
'quantize_nodes',
'strip_unused_nodes',
'sort_by_execution_order']
optimize_graph('/coding/ssd_inception_v2_coco_2018_01_28', 'frozen_inference_graph.pb' ,
transforms, output_node_names,outname='optimized_model_weight_quant.pb')

However, this does not work: inference using this optimized model gives the error below. I had tried with a Keras model earlier and got a different error message. This seems to be a bug, since this model is a pure TensorFlow model and I have not changed anything in it.

('Got an error', <_Rendezvous of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "input_max_range must be larger than input_min_range.
[[{{node Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/ClipToWindow_87/Area/mul_eightbit/Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/ClipToWindow_87/Area/sub_1/quantize}}]]
[[{{node Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/zeros_like_83}}]]"
debug_error_string = "{"created":"@1555723203.356344655","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1036,"grpc_message":"input_max_range must be larger than input_min_range.\n\t [[{{node Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/ClipToWindow_87/Area/mul_eightbit/Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/ClipToWindow_87/Area/sub_1/quantize}}]]\n\t [[{{node Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/zeros_like_83}}]]","grpc_status":3}"
>)
Response Received Exiting

Step 4 Optimise using NVIDIA TensorRT

The base references for this are these two posts

https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html

https://developers.googleblog.com/2018/03/tensorrt-integration-with-tensorflow.html

Inference with TF-TRT `SavedModel` workflow: we are using the TF Serving model.

import tensorflow.contrib.tensorrt as trt

tf.reset_default_graph()
graph = tf.Graph()
sess = tf.Session()
# Create a TensorRT inference graph from a SavedModel:
with graph.as_default():
    with tf.Session() as sess:
        trt_graph = trt.create_inference_graph(
            input_graph_def=None,
            outputs=outputs,
            input_saved_model_dir='/coding/ssd_inception_v2_coco_2018_01_28/01',
            input_saved_model_tags=['serve'],
            max_batch_size=1,
            max_workspace_size_bytes=7000000000,
            precision_mode='FP16')
            # precision_mode='FP32')
            # precision_mode='INT8')
        output_node = tf.import_graph_def(trt_graph, return_elements=outputs)
        # sess.run(output_node)
        tf.saved_model.simple_save(sess,
            "/coding/ssd_inception_v2_coco_2018_01_28/4",
            inputs={'input_image': graph.get_tensor_by_name('{}:0'.format(node.name))
                    for node in graph.as_graph_def().node if node.op == 'Placeholder'},
            outputs={t: graph.get_tensor_by_name('import/' + t) for t in outputs}
        )

Inference with TF-TRT `Frozen` graph workflow:

Reference https://medium.com/tensorflow/speed-up-tensorflow-inference-on-gpus-with-tensorrt-13b49f3db3fa

# Let us load the frozen model, reset the graph and use it
gdef = get_graph_def_from_file('/coding/ssd_inception_v2_coco_2018_01_28/frozen_inference_graph.pb')
tf.reset_default_graph()
graph = tf.Graph()
sess = tf.Session()
# Create a TensorRT inference graph from the frozen graph:
with graph.as_default():
    with tf.Session() as sess:
        trt_graph = trt.create_inference_graph(
            input_graph_def=gdef,
            outputs=outputs,
            max_batch_size=8,
            max_workspace_size_bytes=7000000000,
            is_dynamic_op=True,
            # precision_mode='FP16')
            # precision_mode='FP32')
            precision_mode='INT8')
        output_node = tf.import_graph_def(trt_graph, return_elements=outputs)
        # sess.run(output_node)
        tf.saved_model.simple_save(sess,
            "/coding/ssd_inception_v2_coco_2018_01_28/5",
            inputs={'input_image': graph.get_tensor_by_name('{}:0'.format(node.name))
                    for node in graph.as_graph_def().node if node.op == 'Placeholder'},
            outputs={t: graph.get_tensor_by_name('import/' + t) for t in outputs}
        )

Step 5: Pause and Check the models

The outputs for the various models are given below. You can see that the model size reduces after the Graph Transform optimizations, while the TensorRT-converted SavedModels are actually larger.
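The numbers below were printed with a small size helper along the lines of the one in the referenced Medium post. A hedged sketch (the function name and the exact output format are my assumptions):

# Hedged sketch of a size-printing helper (name and output format assumed)
import os

def print_model_size(model_file_path, variables_dir=None):
    pb_size = os.path.getsize(model_file_path)
    variables_size = 0
    if variables_dir and os.path.exists(variables_dir):
        for f in os.listdir(variables_dir):
            variables_size += os.path.getsize(os.path.join(variables_dir, f))
    print('Model size: {} KB'.format(round(pb_size / 1024.0, 3)))
    print('Variables size: {} KB'.format(round(variables_size / 1024.0, 3)))
    print('Total Size: {} KB'.format(round((pb_size + variables_size) / 1024.0, 3)))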

Original model ('/coding/ssd_inception_v2_coco_2018_01_28/frozen_inference_graph.pb', '')
Model size: 99591.409 KB
Variables size: 0.0 KB
Total Size: 99591.409 KB
---------
Tensorflow Transform Optimised model Weights Quantised ('/coding/ssd_inception_v2_coco_2018_01_28/2/saved_model.pb', '')
Model size: 26193.27 KB
Variables size: 0.0 KB
Total Size: 26193.27 KB
---------
Tensorflow Transform Optimised model Weights and Nodes Quantised ('/coding/ssd_inception_v2_coco_2018_01_28/3/saved_model.pb', '')
Model size: 29265.284 KB
Variables size: 0.0 KB
Total Size: 29265.284 KB
---------
NVIDIA RT Optimised model FP16 ('/coding/ssd_inception_v2_coco_2018_01_28/4/saved_model.pb', '')
Model size: 178564.229 KB
Variables size: 0.0 KB
Total Size: 178564.229 KB
---------
NVIDIA RT Optimised model INT8 ('/coding/ssd_inception_v2_coco_2018_01_28/5/saved_model.pb', '')
Model size: 178152.834 KB
Variables size: 0.0 KB
Total Size: 178152.834 KB

Step 6: Ready the TF Serving container to serve these models

Note the containers we are using here. The client (this is the same invocation as in Step 0):

docker run --entrypoint=/bin/bash --env http_proxy=<my proxy> --env https_proxy=<my proxy>  --runtime=nvidia  -it --rm -p 8900:8500 -p 8901:8501 -v /usr/alex/:/coding --net=host tensorflow/tensorflow:1.13.0rc1-gpu-jupyter
pip install tensorflow-serving-api
pip install opencv-python==3.3.0.9
cd coding
python ssd_client_1.py -num_tests=1 -server=127.0.0.1:8500 -batch_size=1 -img_path='../examples/google1.jpg/'

The server. This is run on the V100 32 GB Linux machine:

docker run  --net=host --runtime=nvidia  -it --rm -p 8900:8500 -p 8901:8501 -v /usr/alex/:/models  tensorflow/serving:1.13.0-gpu --rest_api_port=0  --enable_batching=true --model_config_file=/models/ssd_inception_v3_coco.json

where the config file is like the one below. I have placed the different models in folders under "/models/ssd_inception_v2_coco_2018_01_28/": 01 is the original model, 2 is the TF Graph Transform weight-quantized model, 3 is the TF Graph Transform weight- and node-quantized model, 4 is TensorRT FP16 and 5 is TensorRT INT8. I just change the version in the file to load a different servable for each test.

model_config_list {
  config {
    name: "ssd_inception_v2_coco",
    base_path: "/models/ssd_inception_v2_coco_2018_01_28/",
    model_version_policy: {
      specific: {
        versions: [01]
      }
    },
    model_platform: "tensorflow",
  }
}

Step 7: Write a TF Serving Client for tests

I have written about this in detail in a previous post.

The SavedModel of the SSD looks like the listing below. You can use the saved_model_cli to view it.

saved_model_cli show --dir '/coding/ssd_inception_v2_coco_2018_01_28/3' --all

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input_image'] tensor_info:
        dtype: DT_UINT8
        shape: (-1, -1, -1, 3)
        name: image_tensor:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['detection_boxes:0'] tensor_info:
        dtype: DT_FLOAT
        shape: unknown_rank
        name: detection_boxes:0
    outputs['detection_classes:0'] tensor_info:
        dtype: DT_FLOAT
        shape: unknown_rank
        name: detection_classes:0
    outputs['detection_scores:0'] tensor_info:
        dtype: DT_FLOAT
        shape: unknown_rank
        name: detection_scores:0
    outputs['num_detections:0'] tensor_info:
        dtype: DT_FLOAT
        shape: unknown_rank
        name: num_detections:0
  Method name is: tensorflow/serving/predict

Note that in this model the input and output names are slightly different from the original model, whose input is 'inputs' and whose outputs are 'detection_boxes', 'detection_classes', 'detection_scores' and 'num_detections' (without the ':0' part). This is a deficiency in the conversion scripts I have used, but it can be rectified easily.
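A hedged one-line fix in the convert_graph_def_to_saved_model helper above is to strip the ':0' suffix when building the signature keys, while still looking the tensor up by its full name:

# Assumed fix: use the bare node name as the signature key
outputs={t.split(':')[0]: session.graph.get_tensor_by_name(t) for t in outputs}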

Original model

root@ndn-oe:/coding/tfclient# saved_model_cli show --dir /coding/ssd_inception_v2_coco_2018_01_28/01/ --all

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['inputs'] tensor_info:
        dtype: DT_UINT8
        shape: (-1, -1, -1, 3)
        name: image_tensor:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['detection_boxes'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 100, 4)
        name: detection_boxes:0
    outputs['detection_classes'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 100)
        name: detection_classes:0
    outputs['detection_scores'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 100)
        name: detection_scores:0
    outputs['num_detections'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: num_detections:0
  Method name is: tensorflow/serving/predict

The TF Serving client is given here: https://gist.github.com/alexcpn/d7c28230af437dafb0d2cc7f50140eed

The rest of the imports are here: https://github.com/alexcpn/tf_serving_clients. The client is slightly different from my earlier one (the names of the inputs and outputs differ), which is why it is on a gist.

The image file used for the test is https://github.com/fizyr/keras-retinanet/blob/master/examples/000000008021.jpg
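For completeness, here is a minimal sketch of what such a gRPC Predict call can look like. This is not the gist's client; the servable name, the input key and the image handling are assumptions based on the converted models above:

# Hedged sketch of a TF Serving gRPC Predict call (not the gist's client)
import cv2
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel('127.0.0.1:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

image = cv2.imread('../examples/000000008021.jpg')   # HxWx3 uint8 image
request = predict_pb2.PredictRequest()
request.model_spec.name = 'ssd_inception_v2_coco'     # assumed servable name
request.model_spec.signature_name = 'serving_default'
# 'input_image' matches the signature of the converted models shown above
request.inputs['input_image'].CopyFrom(
    tf.contrib.util.make_tensor_proto(image, shape=[1] + list(image.shape)))

result = stub.Predict(request, 60.0)                  # 60 second timeout
print(result.outputs.keys())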

Step 8: The Output from various models

Basically, there is hardly any difference in inference time between the optimized and non-optimized models. The batch size is one here.

Time for parsing an HD image — 800*1066 (3 channels)

More details below

Original Model

Invocation:

coding/tfclient# python ssd_client_1.py -num_tests=1 -server=127.0.0.1:8500 -batch_size=1 -img_path='../examples/000000008021.jpg'

('Image path', '../examples/000000008021.jpg')
('original image shape=', (480, 640, 3))
('Input-s shape', (1, 800, 1066, 3)) → This is the size of the input tensor

Output:
('Label', u'person', ' at ', array([412, 171, 740, 624]), ' Score ', 0.9980476)
('Label', u'person', ' at ', array([ 6, 423, 518, 788]), ' Score ', 0.94931936)
('Label', u'person', ' at ', array([ 732, 473, 1065, 793]), ' Score ', 0.88419175)
('Label', u'tie', ' at ', array([529, 337, 565, 494]), ' Score ', 0.40442815)
('Time for ', 1, ' is ', 0.5993821620941162)

Tensorflow Transform Optimised model Weights Quantized

('Label', u'person', ' at ', array([409, 174, 741, 626]), ' Score ', 0.99797523)
('Label', u'person', ' at ', array([ 4, 424, 524, 790]), ' Score ', 0.9549346)
('Label', u'person', ' at ', array([ 725, 472, 1064, 793]), ' Score ', 0.8900732)
('Label', u'tie', ' at ', array([527, 338, 566, 494]), ' Score ', 0.3943166)
('Time for ', 1, ' is ', 0.6182711124420 → This is higher; the model size is reduced, but during inference the weights have to be converted back to higher precision

You should see that the size of the output graph is about a quarter of the original. The downside to this approach compared to round_weights is that extra decompression ops are inserted to convert the eight-bit values back into floating point, but optimizations in TensorFlow’s runtime should ensure these results are cached and so you shouldn’t see the graph run any more slowly.- https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md

TensorRT FP 16 Converted model

('Label', u'person', ' at ', array([412, 171, 740, 624]), ' Score ', 0.9980476)
('Label', u'person', ' at ', array([ 6, 423, 518, 788]), ' Score ', 0.9493193)
('Label', u'person', ' at ', array([ 732, 473, 1065, 793]), ' Score ', 0.8841917)
('Label', u'tie', ' at ', array([529, 337, 565, 494]), ' Score ', 0.40442812)
('Time for ', 1, ' is ', 0.5885560512542725)

I was hoping this would be half the original value, i.e. twice as fast. But during optimization TensorRT reported that it could convert only a few of the supported* operations: "There are 3962 ops of 51 different types in the graph that are not converted to TensorRT", including Conv2D, even though the convolution operation is listed as supported here → https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html. I have raised a bug for this.

2019-04-14 08:32:31.357592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-14 08:32:31.357620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-14 08:32:31.357645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-14 08:32:31.358154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30480 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:b3:00.0, compute capability: 7.0)
2019-04-14 08:32:34.582872: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1
2019-04-14 08:32:34.583019: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-04-14 08:32:34.583578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-14 08:32:34.583610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-14 08:32:34.583636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-14 08:32:34.583657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-14 08:32:34.583986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30480 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:b3:00.0, compute capability: 7.0)
2019-04-14 08:32:36.713848: I tensorflow/contrib/tensorrt/segment/segment.cc:443] There are 3962 ops of 51 different types in the graph that are not converted to TensorRT: TopKV2, NonMaxSuppressionV2, TensorArrayWriteV3, Const, Squeeze, ResizeBilinear, Maximum, Where, Add, Placeholder, Switch, TensorArrayGatherV3, NextIteration, Greater, TensorArraySizeV3, NoOp, TensorArrayV3, LoopCond, Less, StridedSlice, TensorArrayScatterV3, ExpandDims, Exit, Cast, Identity, Shape, RealDiv, TensorArrayReadV3, Reshape, Merge, Enter, Range, Conv2D, Mul, Equal, Sub, Minimum, Tile, Pack, Split, ZerosLike, ConcatV2, Size, Unpack, Assert, DataFormatVecPermute, Transpose, Gather, Exp, Slice, Fill, (For more information see https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops).
2019-04-14 08:32:36.848171: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:913] Number of TensorRT candidate segments: 4
2019-04-14 08:32:37.129266: W tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3710] Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,?,?,3] has an unknown non-batch dimension at dim 1
2019-04-14 08:32:37.129330: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:1021] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 707 nodes failed: Invalid argument: Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,?,?,3] has an unknown non-batch dimension at dim 1. Fallback to TF...
2019-04-14 08:32:37.129838: W tensorflow/contrib/tensorrt/convert/convert_nodes.cc:3710] Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,546,?,?] has an unknown non-batch dimension at dim 2
2019-04-14 08:32:37.129859: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:1021] TensorRT node TRTEngineOp_1 added for segment 1 consisting of 3 nodes failed: Invalid argument: Validation failed for TensorRTInputPH_0 and input slot 0: Input tensor with shape [?,546,?,?] has an unknown non-batch dimension at dim 2. Fallback to TF...
2019-04-14 08:32:38.309554: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_2 added for segment 2 consisting of 3 nodes succeeded.
2019-04-14 08:32:38.420585: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_3 added for segment 3 consisting of 4 nodes succeeded.
2019-04-14 08:32:38.644767: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] Optimization results for grappler item: tf_graph
2019-04-14 08:32:38.644837: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] constant folding: Graph size after: 6411 nodes (-1212), 10503 edges (-1352), time = 848.996ms.
2019-04-14 08:32:38.644858: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] layout: Graph size after: 6442 nodes (31), 10535 edges (32), time = 225.361ms.
2019-04-14 08:32:38.644874: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] constant folding: Graph size after: 6432 nodes (-10), 10535 edges (0), time = 559.352ms.
2019-04-14 08:32:38.644920: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] TensorRTOptimizer: Graph size after: 6427 nodes (-5), 10530 edges (-5), time = 2087.5769ms.

TensorRT INT 8 Converted model

One can see from the V100 server logs some Tensor Core magic happening

2019-04-20 01:30:39.563827: I external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:574] Starting calibration thread on device 0, Calibration Resource @ 0x7f4c341ac570
2019-04-20 01:30:39.563982: I external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:574] Starting calibration thread on device 0, Calibration Resource @ 0x7f4ce8008e60

('Label', u'person', ' at ', array([412, 171, 740, 624]), ' Score ', 0.9980476)
('Label', u'person', ' at ', array([ 6, 423, 518, 788]), ' Score ', 0.9493195)
('Label', u'person', ' at ', array([ 732, 473, 1065, 793]), ' Score ', 0.8841919)
('Label', u'tie', ' at ', array([529, 337, 565, 494]), ' Score ', 0.40442798)
('Time for ', 1, ' is ', 0.5967140197753906)

With batch size 2 there is an error, apparently running out of memory on the Tensor Cores:

python ssd_client_1.py -num_tests=1 -server=127.0.0.1:8500 -batch_size=2 -img_path='../examples/000000008021.jpg'

2019-04-20 01:34:25.042337: F external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:227] Check failed: t.TotalBytes() == device_tensor->TotalBytes() (788424 vs. 394212)
2019-04-20 01:34:25.042373: F external/org_tensorflow/tensorflow/contrib/tensorrt/kernels/trt_engine_op.cc:227] Check failed: t.TotalBytes() == device_tensor->TotalBytes() (34656 vs. 17328)
/usr/bin/tf_serving_entrypoint.sh: line 3: 6 Aborted (core dumped)

Results from other models (and comparison with different GPUs)

Here are some results from other tests and models

Details here — https://docs.google.com/spreadsheets/d/1Sl7K6sa96wub1OXcneMk1txthQfh63b0H5mwygyVQlE/edit?usp=sharing

Model — Resnet_50 FP 32 and FP16

FP32 = http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NCHW.tar.gz

FP16 = http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp16_savedmodel_NCHW.tar.gz

Resnet 50

You can see that there is a slight difference: the V100 32 GB takes slightly less time than the consumer-grade GTX 1070 8 GB. When the batch size increases, the larger memory of the V100 stands out, but not its higher number of CUDA cores. As noted in other blogs, simply having more CUDA cores does not automatically mean that inference will run faster; it also depends on the memory and the model characteristics.

Model Retinanet

One can see here that there is not much difference either. This was actually my first experiment, but it was a Keras model that had been converted to a TF frozen model and then optimised, so I thought I might get better results from a model written purely in TF, like SSD. It did not make much difference.

Summary

One can see that there are no drastic improvements in inference time between the models. Also, TF Graph Transform model quantization has not worked for me in this model, nor in one other model I tried; I will raise a bug for that. TensorRT does better but is only able to convert a few layers to lower precision. I have raised a bug/clarification for this, and if that works out, we can hopefully see the models run twice as fast on Tensor Cores, as advertised.

Main References

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md

https://medium.com/google-cloud/optimizing-tensorflow-models-for-serving-959080e9ddbf

https://colab.research.google.com/drive/1wQpWoc40kf__WSjfTqDaReMx6fFjUn48

Other related posts