The Triton Inference Server supports TensorRT models as well as the tensorrt_llm backend. One disadvantage of TensorRT is that it makes little sense to ship engine/plan files: an engine is tied to the GPU architecture (and TensorRT version) it was built with, so it has to be created on the same kind of GPU it will be deployed on. It is therefore often more practical to ship ONNX models and convert them right before deployment.

A useful tool for converting an ONNX model to a TensorRT engine is trtexec, the TensorRT command line tool. If we run NVIDIA’s latest Triton server container (e.g. nvcr.io/nvidia/tritonserver:24.07-py3), trtexec is not on the PATH. It is, however, already installed, so we don’t need to build it; we only need to know the path of the executable or create a symlink to it. Creating a symlink is probably the most convenient option.
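As a quick sanity check (not an official guarantee of the image layout), we can list the TensorRT directory inside the container to confirm the binary is there, using the path referenced below:

docker run --rm nvcr.io/nvidia/tritonserver:24.07-py3 ls /usr/src/tensorrt/bin/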

ln -s /usr/src/tensorrt/bin/trtexec /bin/trtexec does the job, and afterwards we can invoke trtexec without worrying about its path.
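With trtexec reachable, the conversion itself is a one-liner. A minimal sketch, where model.onnx and model.plan are placeholder file names and --fp16 is an optional precision flag:

trtexec --onnx=model.onnx --saveEngine=model.plan --fp16

The resulting plan file can then be dropped into the Triton model repository for the tensorrt backend.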