Life of a model after deployment

Example dashboard used to monitor models — Stakion

Application monitoring is a key part of running software in production. Without it, the only way of finding out about an issue is through sheer luck or because a client has reported it. Both are less than ideal, to say the least!

You wouldn’t deploy an application without monitoring, so why do it for Machine Learning models?

To keep this post to a manageable length, we will focus on just a few key topics:

  • The data required for monitoring
  • The key metrics to track and alert on
  • Similarity metrics between training and inference distributions

The data required for monitoring

In order to monitor Machine Learning models accurately, granular data needs to be recorded every time a prediction is made. These logs can be used to debug why a certain prediction was made and, once aggregated, they enable us to monitor the model.

Key parts to good logging:

  • A unique ID per request, provided by the system that called the ML model. This identifier is stored with each log and allows us to follow the path of a prediction before, during and after the ML model.
  • Input features before feature engineering
  • Input features after feature engineering
  • Output probabilities
  • Predicted value

While this might seem like a lot of logging, without it we will be blind to how the model is handling certain inputs.
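As an illustration, the fields listed above could be captured in one structured record per prediction. The sketch below is only one possible layout: the function names, the field names, the `logged_at` timestamp and the example values are all assumptions made for this example, not part of any particular logging system.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

# Handlers and log levels are left to the application's logging configuration.
logger = logging.getLogger("model_monitoring")


def build_prediction_log(request_id, raw_features, engineered_features,
                         probabilities, predicted_value):
    """Assemble one structured log record for a single prediction."""
    return {
        "request_id": request_id,                     # supplied by the calling system
        "logged_at": datetime.now(timezone.utc).isoformat(),  # assumed extra field
        "raw_features": raw_features,                 # before feature engineering
        "engineered_features": engineered_features,   # after feature engineering
        "probabilities": probabilities,               # output probabilities
        "predicted_value": predicted_value,           # value returned to the caller
    }


def log_prediction(record):
    """Serialise the record as one JSON line and hand it to the logger."""
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


# Made-up values; in practice the request id comes from the calling system.
record = build_prediction_log(
    request_id=str(uuid.uuid4()),
    raw_features={"country": "FR", "amount": "120.50"},
    engineered_features={"country_FR": 1, "amount_log": 4.79},
    probabilities={"fraud": 0.08, "not_fraud": 0.92},
    predicted_value="not_fraud",
)
line = log_prediction(record)
```

Writing one JSON object per line keeps each prediction easy to look up by `request_id` and easy to aggregate later.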

Key metrics to track

In the best case scenario, we would compare the ground truth label to our predictions in near real time. Unfortunately, this is usually not possible because, for most applications, the ground truth label isn’t available until long after the prediction has been made. This is the case when predicting next year’s revenue, for example, or predicting whether a transaction is fraudulent.

If you can’t track accuracy of a model in real-time, track the distribution of the input and output features instead.

The assumption behind monitoring model features is that the performance of the model will change when it is asked to make predictions on data that is distributed differently from the data it was trained on.

We can have a good idea of the health of a model by tracking just a handful of metrics:

  • Number of predictions made
  • Prediction latency: how long it takes to make a prediction
  • Changes in the distribution of input and output features
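The first two metrics in this list are cheap to collect at prediction time. Here is a minimal sketch, assuming the model is exposed as a plain Python callable; `PredictionStats`, `timed_predict` and `dummy_model` are hypothetical names used only for this example.

```python
import time
from statistics import mean


class PredictionStats:
    """Accumulates the number of predictions and their latencies."""

    def __init__(self):
        self.count = 0
        self.latencies = []  # seconds, one entry per prediction

    def timed_predict(self, predict_fn, features):
        """Call predict_fn(features) and record how long the call took."""
        start = time.perf_counter()
        result = predict_fn(features)
        self.latencies.append(time.perf_counter() - start)
        self.count += 1
        return result

    def summary(self):
        """Aggregated values that a dashboard or alert rule could consume."""
        if not self.latencies:
            return {"count": 0, "mean_latency_s": None, "max_latency_s": None}
        return {
            "count": self.count,
            "mean_latency_s": mean(self.latencies),
            "max_latency_s": max(self.latencies),
        }


# Stand-in for a real model: any callable that maps features to a prediction.
def dummy_model(features):
    return sum(features) > 1.0


stats = PredictionStats()
predictions = [stats.timed_predict(dummy_model, f) for f in ([0.2, 0.3], [0.9, 0.4])]
summary = stats.summary()
```

In a real service these values would typically be flushed to a metrics store at regular intervals rather than kept in memory.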

At a feature level, we will want to track a similarity metric between the distribution computed during training (training distribution) and the distribution computed during inference (inference distribution).

Similarity metrics between training and inference distributions

In order to monitor a feature, we need to compute a single metric that compares the training and inference distributions. There are many different similarity metrics to choose from, but few can be applied to both categorical and numerical variables.

Computing similarity metrics

The Wasserstein distance [1] is a similarity measure that can be computed for both numerical and categorical features and is defined as:

$$W_1(U, V) = \int_{-\infty}^{+\infty} \lvert U(x) - V(x) \rvert \, dx$$

Wasserstein distance with U as the training CDF and V the inference CDF

Graphically, the Wasserstein distance is the area between the two Cumulative Distribution Functions. If the training and inference distributions are the same, it will be 0. As they diverge, the distance increases, and it has no upper bound.

Graphical representation of the Wasserstein distance.
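For numerical features, the area between the two CDFs can be computed directly from raw samples. The sketch below uses only the standard library and assumes both samples fit in memory; the function name `wasserstein_distance_1d` is made up for this example. SciPy's `scipy.stats.wasserstein_distance` computes the same quantity for two samples.

```python
from bisect import bisect_right


def wasserstein_distance_1d(train_values, inference_values):
    """Area between the empirical CDFs of two numerical samples."""
    if not train_values or not inference_values:
        raise ValueError("both samples must be non-empty")
    train = sorted(train_values)
    inference = sorted(inference_values)
    # Empirical CDFs are step functions that only change at observed values.
    points = sorted(set(train) | set(inference))
    area = 0.0
    for left, right in zip(points, points[1:]):
        # On [left, right) each CDF equals the fraction of samples <= left.
        u = bisect_right(train, left) / len(train)
        v = bisect_right(inference, left) / len(inference)
        area += abs(u - v) * (right - left)
    return area


same = wasserstein_distance_1d([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = wasserstein_distance_1d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Identical samples give a distance of 0, and shifting every value by one unit gives a distance of one unit, so the result is expressed in the units of the feature itself.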

To compute the Wasserstein distance for categorical variables, you have to sum the absolute value of the differences in frequency for each category.
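The categorical case described above can be sketched in a few lines. This assumes the frequencies are relative (each distribution sums to 1) and treats a category missing from one side as having frequency 0; the function names are illustrative.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each category to its share of the observations."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def categorical_distance(train_freq, inference_freq):
    """Sum of the absolute frequency differences over all categories."""
    categories = set(train_freq) | set(inference_freq)
    return sum(
        abs(train_freq.get(c, 0.0) - inference_freq.get(c, 0.0))
        for c in categories
    )


train_freq = relative_frequencies(["a", "a", "b", "c"])
inference_freq = relative_frequencies(["a", "b", "b", "d"])
distance = categorical_distance(train_freq, inference_freq)
```

With relative frequencies, this value ranges from 0 (identical distributions) to 2 (no category in common), which makes it straightforward to alert on.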

Storing distributions

In order to compute the Wasserstein distance, we need to store both continuous and discrete distributions.

Continuous distributions can be represented using t-Digests [2]. Without going into details, a t-Digest is a data structure that stores a sparse representation of the CDF and has several useful properties (easily serialisable, parallelisable, etc.).

Discrete distributions are easier to store: we just need to store the categories and their respective frequencies. If the total number of categories is very high, we can store only the 1000 most frequent ones.
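A minimal sketch of such a summary is shown below, assuming category values are strings so that the result can be serialised as JSON. The `other` field, which keeps track of the share of data outside the retained categories, is an addition made for this example.

```python
import json
from collections import Counter


def summarise_categories(values, max_categories=1000):
    """Keep the most frequent categories and their relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    top = counts.most_common(max_categories)
    kept = sum(n for _, n in top)
    return {
        "total": total,
        "frequencies": {category: n / total for category, n in top},
        # Share of observations whose category was not retained (assumed field).
        "other": (total - kept) / total if total else 0.0,
    }


values = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
summary = summarise_categories(values, max_categories=2)
serialised = json.dumps(summary)
```

Storing the raw total alongside the frequencies makes it possible to merge summaries computed on different batches later on.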

Visualising discrete distributions can be tricky. One thing we can do is group the categories into buckets that each contain at least x% of the data (see below). Given that the buckets are defined on the training data, we expect the size of each bucket to fluctuate slightly during inference.

Bucketing categories into buckets of frequency of 10%.
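One way to build such buckets is a greedy pass over the categories, from most to least frequent. This is a sketch of one possible strategy, not necessarily the one used for the figure above; `bucket_categories` and the 10% default are assumptions for the example.

```python
def bucket_categories(frequencies, min_share=0.10):
    """Group categories so that each bucket holds at least min_share of the data."""
    buckets = []
    current, current_share = [], 0.0
    # Most frequent first; ties are broken alphabetically for reproducibility.
    ordered = sorted(frequencies.items(), key=lambda kv: (-kv[1], kv[0]))
    for category, share in ordered:
        current.append(category)
        current_share += share
        if current_share >= min_share:
            buckets.append((current, current_share))
            current, current_share = [], 0.0
    if current:
        if buckets:
            # The leftover is too small on its own: fold it into the last bucket.
            last_categories, last_share = buckets.pop()
            buckets.append((last_categories + current, last_share + current_share))
        else:
            buckets.append((current, current_share))
    return buckets


training_freq = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.0625, "e": 0.0625}
buckets = bucket_categories(training_freq, min_share=0.10)
```

Here the three largest categories each get their own bucket, while `d` and `e` are merged because neither reaches 10% alone. The bucket definitions are computed once on the training data and then reused at inference time.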


With a bit of planning, monitoring a Machine Learning model does not have to be complicated.

Monitoring models boils down to tracking the number of predictions and a similarity measure between training and inference distributions. With that in place, you can rest easy knowing that if something goes wrong, you will be the first to know!


[1] Wasserstein metric

[2] Computing Extremely Accurate Quantiles Using t-Digests

Life of a model after deployment was originally published in Towards Data Science on Medium.