Introduction
“Latency vs. throughput in machine learning pipelines” is a topic I end up discussing almost as often as how to avoid common mistakes. While many people seem to think primarily about the training side of pipelines, I consider the inference side a lot more interesting and challenging, as there are many more trade-offs. IMHO, throughput is key on the training side - latency basically doesn’t matter, although unnecessarily high latency decreases throughput of course. The thoughts presented here are valid for any kind of machine learning application; however, computer vision examples seem to be the easiest way to present some core ideas.
Let’s start with some definitions. We’ll define latency as the time required to execute a pipeline (e.g. 10 ms) and throughput as the number of items processed per unit of time in such a pipeline (e.g. 100 imgs/sec). In this particular example latency and throughput are simply reciprocals of each other (1 / 10 ms = 100 imgs/sec), which only holds for strictly serial, single-item processing.
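As a tiny worked example of that relationship (the numbers are the hypothetical ones from above), a short Python sketch:

```python
# Hypothetical numbers from the definitions above.
latency_s = 0.010           # 10 ms to execute the pipeline once
throughput = 1 / latency_s  # serial, one item at a time -> 100 imgs/sec
print(throughput)           # 100.0
```

As soon as batching or concurrency enters the picture, the two quantities decouple, which is exactly where the trade-off discussed below comes from.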
Suppose a machine learning pipeline should be made faster. This leaves us with basically two choices:
- reducing latency
- increasing throughput
They can be combined to a certain degree, but at some point a decision has to be made as to which approach is more important/useful for the task at hand.
Problem-centric View
The most fundamental question with respect to “latency vs. throughput” is a rather simple one: “What is the maximum acceptable latency?” The answer to this question determines the further steps. There might actually be an information gain from higher throughput despite higher latency (e.g. fewer frames dropped and cleaner signals recovered); however, if the information arrives too late, it is of no value anymore.
If reducing latency is not feasible for serial execution of a pipeline, then converting it into a concurrent/async pipeline might already solve the problem, provided higher throughput is what matters most. Another option, if possible and feasible, would be increasing the batch size from 1 to e.g. 8 or 32. This increases throughput, but the end-to-end single-frame latency increases at least a little bit as well, which brings us back to the question of the maximum acceptable latency.
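To make that concrete, here is a minimal sketch of a timeout-based micro-batcher (the names `collect_batch`, `max_batch_size` and `max_wait_s` are made up for illustration): it returns once the batch is full or a deadline expires, so the latency added by batching is bounded by `max_wait_s`.

```python
import queue
import time

def collect_batch(source: queue.Queue, max_batch_size: int = 8, max_wait_s: float = 0.005):
    """Collect up to max_batch_size items, waiting at most max_wait_s overall."""
    batch = [source.get()]                      # block until at least one item arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(source.get(timeout=remaining))
        except queue.Empty:                     # deadline hit while waiting
            break
    return batch                                # hand over to batched inference
```

Choosing `max_wait_s` is exactly the “maximum acceptable latency” question from above: a larger value fills batches more reliably (more throughput), a smaller one keeps the single-frame latency low.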
Implementation Complexity View
When deploying machine learning pipelines, we can’t neglect the implementation side completely. Personally, I don’t consider migrating something to C++ (or Rust) such a big deal that it should be avoided for any kind of deployment case. However, not all latency and throughput issues can be resolved by moving pipelines from Python to C++ or Rust.
While such a change to significantly optimized code will certainly reduce latency, it might not be enough, as many Python libraries are themselves only wrappers around a C++ library. In these rather common cases, only the Python overhead is removed. If this is not enough, there is a cost to re-implementing the functions in question to reduce latency further, e.g. by merging certain functions, using suitable instruction set extensions, or applying other neat tricks. At some point the economics of optimizing latency don’t make much sense anymore.
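As a rough illustration of what “merging certain functions” buys (sketched in Python/NumPy rather than C++, with made-up preprocessing steps): doing the work in place avoids materializing intermediate arrays, which is the same idea a fused, hand-written C++/SIMD kernel exploits at a much finer granularity.

```python
import numpy as np

mean = np.array([123.7, 116.3, 103.5], dtype=np.float32)
std = np.array([58.4, 57.1, 57.4], dtype=np.float32)

def preprocess_naive(img):
    # Each step allocates a fresh intermediate array.
    x = img.astype(np.float32)
    x = x - mean
    x = x / std
    return x

def preprocess_merged(img):
    # One unavoidable copy for the dtype change, then in-place updates.
    x = img.astype(np.float32)
    x -= mean
    x /= std
    return x
```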
Increasing throughput, as opposed to reducing serial execution latency, comes with different costs. Assuming that we want to increase batch sizes, there is a small price to pay for the wait period until a batch is full (after some waiting period, e.g. a half-full batch could be padded with blanks so that fixed-size tensors can be executed, but this is not the point here). Further, I/O management and redistribution to e.g. multiple threads adds complexity and development costs as well. The latency cost of doing this in Python is certainly much higher than in C++. However, the implementation time of such data stream management might exceed the development cost of reducing latency.
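The parenthetical remark about padding a half-full batch could look roughly like this (a sketch; `pad_batch` and the zero padding are illustrative assumptions):

```python
import numpy as np

def pad_batch(items, batch_size):
    """Pad a list of equally shaped arrays with zeros up to batch_size.

    Returns the fixed-size batch plus the number of real items, so that the
    padded results can be discarded again after inference.
    """
    n_real = len(items)
    blanks = [np.zeros_like(items[0])] * (batch_size - n_real)
    return np.stack(items + blanks), n_real
```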
As most machine learning applications are somewhat unique, all of these aspects have to be taken into consideration.
Systems Configuration
It should be pointed out that even rather small changes, such as deactivating hyperthreading, can reduce latency significantly.
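For reference, a minimal, Linux-only sketch of how to check whether SMT (“hyperthreading”) is currently active; the sysfs path assumes a reasonably recent kernel, and actually disabling it is done in the BIOS/UEFI or, as root, via `/sys/devices/system/cpu/smt/control`:

```python
# Linux-only: report whether SMT ("hyperthreading") is active on this machine.
with open("/sys/devices/system/cpu/smt/active") as f:
    print("SMT active:", f.read().strip())
```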