The Triton Inference Server supports tensorrt models as well as tensorrt_llm. One of the disadvantages of TensorRT is that it makes little sense to ship engine/plan files, since they need to be built on (or at least for) the GPU generation/family they will be deployed on. Therefore, it is often practical to ship ONNX models and convert them right before deployment.
A useful tool for converting an ONNX model to a TensorRT engine is trtexec, the CLI for TensorRT. However, if we run a recent NVIDIA Triton server container (e.g. nvcr.io/nvidia/tritonserver:24.07-py3), we won't find trtexec on the PATH. It is installed, though, so we don't need to build it ourselves; we only need to know the path of the executable or create a symlink to it. Creating a symlink is probably the most convenient option.
ln -s /usr/src/tensorrt/bin/trtexec /bin/trtexec
This does the job, and we can then call trtexec without worrying about its path.
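With trtexec in place, converting a model is a single command. A minimal sketch (the file names and the fp16 flag are illustrative assumptions, not something mandated by this setup):

# build a TensorRT engine from an ONNX file; --fp16 is optional
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16

If the ONNX model has dynamic input shapes, trtexec additionally needs --minShapes/--optShapes/--maxShapes to define the optimization profile.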
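If this is part of an automated deployment, the symlink can also be baked into a derived image rather than created at runtime. A possible sketch, assuming we layer on top of the same Triton container:

# hypothetical Dockerfile adding the trtexec symlink to the Triton image
FROM nvcr.io/nvidia/tritonserver:24.07-py3
RUN ln -s /usr/src/tensorrt/bin/trtexec /bin/trtexec

The ONNX-to-plan conversion itself should still run on (or on hardware matching) the target GPU, e.g. in an entrypoint script, for the reasons discussed above.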