Deploy new ML model only if better than currently deployed model

Laura Frolich
Jan 5, 2022 · 7 min read
If you already have something decent, only exchange it for something better.

This is part of a related set of posts describing challenges I have encountered and things I have learned on my MLOps journey on Azure. The first post is at: My MLOps Journey so far.

While it is easy to retrain machine learning (ML) models as new data comes in or code is updated, it is not a given that a new model is better than the model that is currently deployed. In this post, I will describe how I make sure that a new model is deployed only if it outperforms the current model.

When I deploy new models, I deploy the most recently registered version of the model. To make sure I only end up serving a model that is at least as good as the one already in use, I register a newly trained model only if it outperforms the currently registered model in a fair comparison.

Apples to apples comparison

When comparing models, the same data must be used. I decided that each training run should compare the newly trained model with whichever model was registered at the time of that run, on the same data, to keep the comparison apples-to-apples. To load the currently registered model after training the new model, I use the Model class from azureml.core:

from azureml.core import Model
current_model = Model(workspace=workspace, name=model_name)

At this point, I use the most recent 20% of available data to evaluate both these models.
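As a rough illustration (not the exact code from the project), loading the current model and scoring both models on the most recent 20% of a time-ordered dataset could look something like the sketch below. The helper name, the joblib loading, and the use of a scikit-learn metric are assumptions; they depend on how the model was saved and what framework it uses.

def evaluate_on_recent_data(candidate_model, workspace, model_name, X, y):
    """Sketch: evaluate candidate and currently registered model on the most recent 20% of data."""
    import joblib
    import numpy as np
    from azureml.core import Model
    from sklearn.metrics import mean_squared_error

    split_idx = int(len(X) * 0.8)                       # hold out the most recent 20%
    X_test, y_test = X[split_idx:], y[split_idx:]

    current_model = Model(workspace=workspace, name=model_name)   # latest registered version
    current_path = current_model.download(exist_ok=True)          # fetch the model files locally
    current_estimator = joblib.load(current_path)                 # assumes a single joblib file was registered

    rmse_candidate = np.sqrt(mean_squared_error(y_test, candidate_model.predict(X_test)))
    rmse_current = np.sqrt(mean_squared_error(y_test, current_estimator.predict(X_test)))
    return rmse_candidate, rmse_current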

Do not overfit to test data

I usually tune hyperparameters with HyperDrive runs in Azure ML. To find the best set of hyperparameters, the run with the best performance is identified, and the hyperparameters used to train that model are taken to be the best.

It is well known that training data should not be used in this comparison to determine best hyperparameters since an overfitted model that has “memorized” the training data would do well in such a comparison. However, we are not interested in evaluating whether a model works well on training data. Rather, we want to know if the model will perform well in the future on unseen data. Hence we need a separate dataset to evaluate which hyperparameters yield the best model.

We run into the same issue if we use the same dataset to both 1) determine the best hyperparameters and 2) compare to the currently deployed model. That is, if we first use a dataset to find the best model out of a large number of models, then it is likely that the best evaluation metric was overly optimistic. If we then compare this overly optimistic evaluation result to the currently deployed model, we risk wrongly deploying a new model.

To guard against this risk, I took the standard approach of splitting data into three parts. The first part, the training dataset, is used for training. The second part, the validation dataset, is used to select hyperparameters. Finally, the third part, the test dataset, is used to determine whether the model selected using the validation data is better than the currently deployed model.
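Since the data is time-ordered, the split is chronological rather than random. A minimal sketch (the 60/20/20 proportions are illustrative, not the ratios used in the project):

def chronological_split(df, train_frac=0.6, val_frac=0.2):
    """Split a time-sorted DataFrame into train, validation, and test parts."""
    n = len(df)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    train_df = df.iloc[:train_end]        # fit model parameters
    val_df = df.iloc[train_end:val_end]   # select hyperparameters
    test_df = df.iloc[val_end:]           # compare against the currently deployed model
    return train_df, val_df, test_df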

It could be argued that, after determining whether the new model or the current one is better, a final model should be retrained on all of the data using the configuration found to be best. However, the practical gain from this seemed minimal, so I decided to save the training cost.

Using pipelines to start model runs and select model to deploy

Since I use Azure DevOps Pipelines to run the Python scripts, I need to make sure that my scripts do not run for longer than the Pipelines timeout limit. Because a training run can take a long time, I split training and the comparison of the new and current models into two Python scripts, called by separate pipelines.

Training

My training pipeline calls a Python script that submits an experiment with a HyperDriveConfig. The training script used by each run in the HyperDriveConfig evaluates the model on the training, validation, and test set after completing training. Additionally, the training script evaluates the currently registered model on the test set. All these evaluation results are saved with the run and can be viewed in a browser through the Azure ML Studio or retrieved programmatically.
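For context, submitting such an experiment with the v1 azureml SDK could look roughly like the sketch below. The script name, compute target, environment, hyperparameter ranges, and the 'validation_rmse' metric name are placeholders, not the project's actual values; ws is the workspace returned by get_workspace (shown later in this post).

from azureml.core import Environment, Experiment, ScriptRunConfig
from azureml.train.hyperdrive import (HyperDriveConfig, PrimaryMetricGoal,
                                      RandomParameterSampling, choice, uniform)

# Run configuration for a single training run (placeholder names)
script_config = ScriptRunConfig(source_directory='src',
                                script='train.py',
                                compute_target='cpu-cluster',
                                environment=Environment.get(ws, 'project-env'))

# Hyperparameter search space (illustrative parameters only)
sampling = RandomParameterSampling({
    '--learning_rate': uniform(0.001, 0.1),
    '--num_layers': choice(2, 3, 4),
})

# The primary metric must match a metric the training script logs with run.log(...)
hyperdrive_config = HyperDriveConfig(run_config=script_config,
                                     hyperparameter_sampling=sampling,
                                     primary_metric_name='validation_rmse',
                                     primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
                                     max_total_runs=20)

experiment = Experiment(ws, 'experiment_type')
hyperdrive_run = experiment.submit(hyperdrive_config)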

I had a bug in early versions of the code where I saved the metrics DataFrame as a string, like when you print a DataFrame to text. As you might know if you have worked with Pandas DataFrames in Python, printing a large DataFrame will show only the first and last rows and columns. This meant that the metrics artifacts were saved as text, and had some missing columns. Luckily, all rows were preserved. To maintain compatibility when comparing old and new runs, I decided to stick with saving the metrics DataFrames as text. Going forward, I made sure to use the .to_csv method on DataFrames to retain all the information, using this helper function:

def log_dataframe_as_csv_text(df, df_name, run):
    # Log the full DataFrame as tab-separated text so no rows or columns are truncated
    run.log(df_name, df.to_csv(sep='\t', line_terminator='\n'))

Model selection

When an experiment with a HyperDriveConfig has completed, I find the best run by comparing the evaluation results from the validation data set. Once the best model has been found, I use the evaluation metrics from the test set to determine whether or not to register the model so that it will be deployed instead of the currently deployed model.

A complicating factor in model comparison was that I trained models for different lead times. Most models were trained to make predictions for the timestamp of the most recent observations, 10 minutes after that, 20 minutes after, and so on in 10-minute intervals up to 3 hours. That is, most models would make 19 predictions (0, 10, 20, …, 180 minutes after the most recent observations). After settling on a good model type, I wanted to evaluate how it works for longer lead times, and trained a model that predicted up to 5 hours after the most recent observations. If I were to compare models on the average Root Mean Squared Error (RMSE) over lead times, as I had done so far, the new model would be at a disadvantage, since prediction performance naturally degrades with longer lead times. To compare on an equal footing, I save the RMSE for each lead time in a Pandas DataFrame with each run. At comparison time, I can then take the intersection of row names (lead times) present in the DataFrames from the two runs being compared, and compare the average RMSE over only the lead times that both models were trained for, as sketched below.
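A sketch of how such a per-lead-time RMSE DataFrame could be assembled and logged with the helper above. The lead-time labels and random values here are made up for illustration; only the column names and the 'test_rmses_per_lead_time' metric name correspond to the project's code.

import numpy as np
import pandas as pd
from azureml.core import Run

run = Run.get_context()

# One row per lead time, one column per target; random values stand in for real RMSEs
lead_times = [f'{m}_min' for m in range(0, 190, 10)]   # 0, 10, ..., 180 minutes
rmse_df = pd.DataFrame({'wet_m3_per_s': np.random.rand(len(lead_times)),
                        'dry_m3_per_s': np.random.rand(len(lead_times))},
                       index=lead_times)

# log_dataframe_as_csv_text is the helper defined above
log_dataframe_as_csv_text(rmse_df, 'test_rmses_per_lead_time', run)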

My script to determine whether a new model is better than the current one, and register the new one if it is better, looks like this:

import os

from project_name.utils.cloud_development_helpers import get_workspace, \
    register_new_model_if_better, get_latest_completed_hyperdrive, \
    get_metrics_for_current_and_new_model


def main():
    ws = get_workspace()
    aml_experiment_name = 'experiment_type'
    model_name = 'model_name'
    experiment = ws.experiments[aml_experiment_name]
    hyper_drive = get_latest_completed_hyperdrive(experiment)
    best_run = hyper_drive.get_best_run_by_primary_metric(include_failed=False, include_canceled=False)
    dataframe_suffix = '_rmses_per_lead_time'
    primary_column_name = 'wet_m3_per_s'
    secondary_column_name = 'dry_m3_per_s'
    best_new_model_metric, current_model_metric = get_metrics_for_current_and_new_model(
        best_run, dataframe_suffix, primary_column_name, secondary_column_name)
    register_new_model_if_better(best_new_model_metric, best_run, current_model_metric, model_name)


if __name__ == '__main__':
    main()

The helper function get_workspace expects certain environment variables to be set and is defined like this:

import os

from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication


def get_workspace():
    svc_pr = ServicePrincipalAuthentication(
        tenant_id=os.environ['tenant_id'],
        service_principal_id=os.environ['application_id'],
        service_principal_password=os.environ['svc_pr_value'])
    ws = Workspace.get(subscription_id=os.environ['subscription_id'],
                       resource_group=os.environ['resource_group'],
                       name=os.environ['workspace_name'],
                       auth=svc_pr)
    return ws

The following function retrieves the most recently completed HyperDrive run from the experiment:

def get_latest_completed_hyperdrive(experiment):
    completed_hyper_drives = get_completed_hyperdrive_runs(experiment)
    start_times = [run.get_details()['startTimeUtc'] for run in completed_hyper_drives]
    latest_run_idx = int(np.argmax(start_times))
    hyper_drive = completed_hyper_drives[latest_run_idx]
    return hyper_drive
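The helper get_completed_hyperdrive_runs is not shown in the post. A plausible implementation (an assumption, not the author's code) filters the experiment's runs by type and status:

def get_completed_hyperdrive_runs(experiment):
    # Assumption: HyperDrive parent runs carry the run type 'hyperdrive';
    # the exact type string may differ between SDK versions.
    return [run for run in experiment.get_runs()
            if run.type == 'hyperdrive' and run.get_status() == 'Completed']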

When I submit the training runs, I define the RMSE calculated on the validation data as the primary metric, so the following line returns the best model out of the ones trained in the most recent experiment:

best_run = hyper_drive.get_best_run_by_primary_metric(include_failed=False, include_canceled=False)

Next, I define both a primary and a secondary metric to use for comparison. In the project this code was developed for, it is most important that the model does well when it rains, so we prefer to compare the models on their performance during rainy periods. Sometimes, though, the test set does not contain rain. In that case, we fall back to a secondary metric for the comparison. The function that retrieves the metrics for the two models being compared is defined like this:

def get_metrics_for_current_and_new_model(candidate_run, dataframe_suffix, primary_column_name, secondary_column_name):
    dataframe_name = 'test_current_registered_model' + dataframe_suffix
    current_model_rmses = read_metric_as_dataframe(candidate_run, dataframe_name)
    dataframe_name = 'test' + dataframe_suffix
    candidate_model_rmses = read_metric_as_dataframe(candidate_run, dataframe_name)
    common_index = current_model_rmses.index.intersection(candidate_model_rmses.index)
    candidate_model_metric = candidate_model_rmses.loc[common_index, primary_column_name].mean()
    current_model_metric = current_model_rmses.loc[common_index, primary_column_name].mean()
    if np.isnan(candidate_model_metric) or np.isnan(current_model_metric):
        candidate_model_metric = candidate_model_rmses.loc[common_index, secondary_column_name].mean()
        current_model_metric = current_model_rmses.loc[common_index, secondary_column_name].mean()
    return candidate_model_metric, current_model_metric

To read the logged DataFrames with metrics, which were saved as strings as described in the Training section, the following two helper functions are defined:

def read_metric_as_dataframe(run, dataframe_name):
    dataframe_string = extract_metric_value(run, dataframe_name)
    # Metrics that were logged by printing a DataFrame end with a "[x rows x y columns]" summary line; drop it
    if 'rows' in dataframe_string:
        dataframe_string = '\n'.join(dataframe_string.split('\n')[:-1])
    dataframe = pd.read_csv(io.StringIO(dataframe_string), lineterminator='\n', header=[0], delim_whitespace=True)
    return dataframe


def extract_metric_value(run, metric_name):
    metric_dict = run.get_metrics(metric_name)
    metric_val = metric_dict.get(metric_name, np.nan)
    return metric_val

Finally, I call the function that registers the new model if it is better than the currently registered model:

def register_new_model_if_better(best_new_model_metric, best_run, current_model_metric, model_name):
    print("Model name: " + model_name)
    print("Best new model metric: " + str(best_new_model_metric))
    print("Current model metric: " + str(current_model_metric))
    # Register if there is no metric for a current model (NaN) or the new model has a lower RMSE
    if np.isnan(current_model_metric) or (best_new_model_metric < current_model_metric):
        best_run.register_model(model_name)
        print("Registered new model")
    else:
        print("Did not register new model as currently registered model is better")

Making the new model available for users

The next time the model of this name is deployed, the most recently registered version of the model will be used. All it takes to make a better, newly trained model available to users is to redeploy it. The new model will then be the one used when the endpoint is queried.
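For completeness, a redeployment with the v1 SDK could look roughly like this. The service name is a placeholder, and the inference and deployment configurations are assumed to be defined elsewhere in the project.

from azureml.core import Model

# Without a version argument, Model resolves to the latest registered version
model = Model(ws, name=model_name)

service = Model.deploy(workspace=ws,
                       name='prediction-service',          # placeholder service name
                       models=[model],
                       inference_config=inference_config,   # assumed to exist in the project
                       deployment_config=deployment_config, # assumed to exist in the project
                       overwrite=True)
service.wait_for_deployment(show_output=True)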

Summary

I have described how I compare models, and how I split the data to avoid wrongly deploying a new model that only seems better than the current one. That can happen if the same dataset is used both to select the best of many candidate models and then to compare that best candidate to the deployed model. I also described how I compare models even though they were trained for different lead times.

I would love to hear from you, especially if there is some of this you disagree with, would like to add to, or have a better solution for.
