Access to data from ML model training

This is part of a related set of posts describing challenges I have encountered and things I have learned on my MLOps journey on Azure. The first post is at My MLOps Journey so far.
The recommended way to use tabular data in Azure Machine Learning (AML) Studio is to use the Azure TabularDataset. This allows easy versioning of data, which is of course desirable for full reproducibility of ML experiments and models. How to do this using Azure Datasets is documented here: Version and track Azure Machine Learning datasets.
Unfortunately, I was not able to use Azure Datasets due to a Python 3.6.2 dependency. I was getting memory errors with the Datastore.get() call and found posts recommending an upgrade of the azureml-core package to fix it. That was not possible with my Python 3.6.2 dependency, so I decided to connect directly to my Azure SQL Database instead. To do this I needed the correct ODBC driver on the Docker image that training runs on, and the default image did not include it.
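As a minimal sketch of what connecting directly can look like: the post does not name the Python library used, so this assumes pyodbc on top of the ODBC 17 driver. The helper name build_connection_string and the placeholder values are illustrative only. The function just assembles the connection string; the comments show where it would be handed to pyodbc.

```python
def build_connection_string(server, database, username, password,
                            driver="ODBC Driver 17 for SQL Server"):
    """Assemble an ODBC connection string for an Azure SQL Database.

    The driver name must match the name the ODBC driver is registered
    under on the machine (or Docker image) the code runs on.
    """
    parts = [
        f"DRIVER={{{driver}}}",          # renders as DRIVER={ODBC Driver 17 ...}
        f"SERVER=tcp:{server},1433",     # Azure SQL listens on port 1433
        f"DATABASE={database}",
        f"UID={username}",
        f"PWD={password}",
        "Encrypt=yes",
        "TrustServerCertificate=no",
        "Connection Timeout=30",
    ]
    return ";".join(parts) + ";"


# Hypothetical usage (assumes pyodbc is installed; values are placeholders):
#   import pyodbc
#   conn = pyodbc.connect(build_connection_string(
#       "<your-server>.database.windows.net", "<your-database>",
#       "<your-user>", "<your-password>"))
```

In practice the credentials would come from the environment or a secret store rather than being written into the training script.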
Make a Docker image with the necessary dependencies available as a base for the AML environment
In this section, I will describe how I wrote and built the needed Docker image and based my Azure ML environments on it.
Dockerfile with the appropriate ODBC driver
Since I am using Azure SQL Database to store my relational data, I need at least version 17.0 of the ODBC driver, as shown in this compatibility matrix: SQL version compatibility. This is the Dockerfile that worked for me to install the ODBC 17 driver:
# https://docs.microsoft.com/en-gb/azure/machine-learning/how-to-deploy-custom-docker-image
FROM ubuntu:18.04

ARG CONDA_VERSION=4.7.12
ARG PYTHON_VERSION=3.7
ARG AZUREML_SDK_VERSION=1.13.0
ARG INFERENCE_SCHEMA_VERSION=1.1.0

ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ENV PATH /opt/miniconda/bin:$PATH
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update --fix-missing && \
    apt-get install -y wget bzip2 curl && \
    apt-get install -y fuse && \
    apt-get install -y gnupg && \
    apt-get install -y git && \
    apt-get clean -y && \
    rm -rf /var/lib/apt/lists/*

RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
RUN curl https://packages.microsoft.com/config/ubuntu/18.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
RUN apt-get update
RUN ACCEPT_EULA=Y apt-get install -y msodbcsql17
# optional: for bcp and sqlcmd
RUN ACCEPT_EULA=Y apt-get install -y mssql-tools
RUN echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bash_profile
RUN echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bashrc
RUN /bin/bash -c "source ~/.bashrc"
# optional: for unixODBC development headers
RUN apt-get install -y unixodbc-dev

RUN useradd --create-home dockeruser
WORKDIR /home/dockeruser
USER dockeruser

RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-${CONDA_VERSION}-Linux-x86_64.sh -O ~/miniconda.sh && \
    /bin/bash ~/miniconda.sh -b -p ~/miniconda && \
    rm ~/miniconda.sh && \
    ~/miniconda/bin/conda clean -tipy
ENV PATH="/home/dockeruser/miniconda/bin/:${PATH}"

RUN conda install -y conda=${CONDA_VERSION} python=${PYTHON_VERSION} && \
    conda update -n base -c defaults conda && \
    conda clean -aqy && \
    rm -rf ~/miniconda/pkgs && \
    find ~/miniconda/ -type d -name __pycache__ -prune -exec rm -rf {} \;
One thing that caused issues for me was figuring out which steps needed which privileges to run. In the Dockerfile above, the driver installation steps all run as root, before the switch to dockeruser. These are the steps that worked in the end:

1. Add the key for the Microsoft packages repository to the list of trusted GPG public keys
2. Download the package source list for the appropriate Ubuntu version
3. Update the package list cache
4. Install the ODBC driver, agreeing to the end-user license agreement
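One way to confirm from Python that the installation steps took effect inside the image is to look at the unixODBC driver registry. This is only a sketch: it assumes the drivers are registered in /etc/odbcinst.ini (the usual unixODBC location on Ubuntu) and that the msodbcsql17 package registers itself under the name "ODBC Driver 17 for SQL Server". The function names are made up for illustration.

```python
import configparser


def installed_odbc_drivers(odbcinst_text):
    """Return the driver names (INI section headers) found in odbcinst.ini text."""
    # interpolation=None so that '%' characters in values are left untouched
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(odbcinst_text)
    return parser.sections()


def has_driver(odbcinst_text, name="ODBC Driver 17 for SQL Server"):
    """True if a driver with exactly this name is registered."""
    return name in installed_odbc_drivers(odbcinst_text)


# Hypothetical usage inside the container (the path is an assumption):
#   with open("/etc/odbcinst.ini") as handle:
#       print(has_driver(handle.read()))
```

If the check fails, a connection attempt from the training script would typically fail as well, with an error saying the driver cannot be found.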
Build and store Docker image in Azure Container Registry connected to your AML Studio Workspace
Make sure you have the Azure Command Line Interface installed. Then, if on Windows, you can use Command Prompt to log in to your account using the command:
az login
Then make sure the appropriate subscription is enabled, using the az account subscription enable command documented at:
https://docs.microsoft.com/en-us/cli/azure/account/subscription?view=azure-cli-latest#az_account_subscription_enable
Find the name of the Azure Container Registry associated with your AML Studio Workspace:
az ml workspace show -w <your-workspace-name> -g <your-resource-group-name> --query containerRegistry
The above will return a response like: “/subscriptions/<your-subscription-id>/resourceGroups/<your-resource-group-name>/providers/Microsoft.ContainerRegistry/registries/<your-container-registry-name>”.
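If this step is scripted, the registry name can be pulled out of the returned resource ID rather than copied by hand. The sketch below uses hypothetical helper names. The login_server helper assumes the default public-cloud naming of <registry-name>.azurecr.io in lower case, which matches the address used later in this post.

```python
def registry_name_from_resource_id(resource_id):
    """Extract the registry name from an Azure Container Registry resource ID."""
    # The CLI prints the ID as a quoted JSON string, so drop quotes and whitespace
    cleaned = resource_id.strip().strip('"')
    segments = [segment for segment in cleaned.split("/") if segment]
    lowered = [segment.lower() for segment in segments]
    if "registries" not in lowered:
        raise ValueError("Not a container registry resource ID: " + resource_id)
    index = lowered.index("registries")
    if index + 1 >= len(segments):
        raise ValueError("Resource ID has no registry name: " + resource_id)
    return segments[index + 1]


def login_server(registry_name):
    """Assumed default login server for a registry in the public Azure cloud."""
    return f"{registry_name.lower()}.azurecr.io"
```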
Now log in to the Azure Container Registry using the name returned above:
az acr login --name <your-container-registry-name>
You can now build the Docker image and store it in the Container Registry. Make sure you execute the following from the directory containing the Dockerfile you want to build from:
az acr build --image myimage:v1 --registry <your-container-registry-name> --file Dockerfile .
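If you prefer to trigger the build from a Python script (for example one called by a pipeline), the same command can be assembled as an argument list and passed to subprocess. This is a sketch with hypothetical function names. It assumes the az executable is on the PATH. On Windows, where az is a batch file, running it this way may additionally need shell=True or the name az.cmd.

```python
import subprocess


def acr_build_command(registry, image, tag, dockerfile="Dockerfile", context="."):
    """Build the argument list for the `az acr build` call shown above."""
    return [
        "az", "acr", "build",
        "--image", f"{image}:{tag}",
        "--registry", registry,
        "--file", dockerfile,
        context,
    ]


def build_image(registry, image, tag, dry_run=True):
    """Return the command; only execute it when dry_run is False."""
    command = acr_build_command(registry, image, tag)
    if not dry_run:
        # check=True raises CalledProcessError if the build fails
        subprocess.run(command, check=True)
    return command
```

Calling build_image("<your-container-registry-name>", "myimage", "v1") with the default dry_run=True only returns the command, which is handy for logging it before running the build for real.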
Base AML environment on the custom Docker image
To base an AML Environment on the Docker image using the Python SDK, use the following code to define the AML Environment:
import os
from azureml.core import Environment
from azureml.core.container_registry import ContainerRegistry
from azureml.core.environment import DockerSection

# Registry credentials are read from the environment, never hardcoded
container_registry_def = ContainerRegistry()
container_registry_def.address = "<your-container-registry-name>.azurecr.io"
container_registry_def.username = os.environ['SPN_ID']
container_registry_def.password = os.environ['SPN_PASSWORD']
docker_def = DockerSection(enabled=True, base_image_registry=container_registry_def,
                           base_image="<your-container-registry-name>.azurecr.io/<your-image-name>:v1")
aml_env = Environment(name=os.environ['AMLEnvironmentName'], docker=docker_def, inferencing_stack_version='latest')
The two environment variables SPN_ID and SPN_PASSWORD relate to a Service Principal that I have given access to the Azure Container Registry. These need to be present in the environment and should not be hardcoded.
The third environment variable, AMLEnvironmentName, could be hardcoded. I find it more convenient to have it as an environment variable, since I call the Python script that creates my AML Environment from an Azure DevOps Pipeline. I have other Azure DevOps Pipelines to train, register, and deploy the trained ML models. I pass in the environment variables from these pipelines, and the pipelines read them from a Variable Group in the Azure DevOps Pipeline Library. If I want to change the name of the AML Environment at any time, I can change it in one place, i.e. in the Variable Group.
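Because the script depends on these variables being set by the pipeline, it can help to check them up front and fail with a clear message rather than a bare KeyError halfway through. A minimal sketch follows; the helper name require_env is hypothetical, while the variable names are the ones used in the code above.

```python
import os


def require_env(names, environ=None):
    """Return the requested variables as a dict, or fail listing all missing ones."""
    environ = os.environ if environ is None else environ
    # Treat unset and empty variables the same way
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in names}


# Hypothetical usage at the top of the environment-creation script:
#   settings = require_env(["SPN_ID", "SPN_PASSWORD", "AMLEnvironmentName"])
```

Passing a plain dict as environ makes the check easy to exercise in a unit test without touching the real process environment.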
Ensure Python packages necessary for AML are present
When using a custom Docker image, you need to make sure that the AML dependencies azureml-train and azureml-defaults are present. You can do this either by installing them directly on the Docker image or by defining them in a Conda Dependencies section in the AML Environment definition.
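For the second option, one possible approach is to generate a small conda specification that lists the two packages under pip. The sketch below only renders the YAML text with the standard library. The function name, environment name, and Python version are illustrative assumptions, not values from my setup. The resulting file could then be loaded with the SDK, for example through CondaDependencies(conda_dependencies_file_path=...) and assigned to aml_env.python.conda_dependencies.

```python
def conda_spec_yaml(env_name, python_version, pip_packages):
    """Render a minimal conda environment specification as YAML text."""
    lines = [
        f"name: {env_name}",
        "dependencies:",
        f"  - python={python_version}",
        "  - pip",
        "  - pip:",
    ]
    # Each pip package becomes an entry in the nested pip list
    lines += [f"    - {package}" for package in pip_packages]
    return "\n".join(lines) + "\n"


# Hypothetical usage: write the spec next to the script that defines the environment
#   with open("conda_dependencies.yml", "w") as handle:
#       handle.write(conda_spec_yaml("aml-env", "3.7",
#                                    ["azureml-defaults", "azureml-train"]))
```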
Summary
It is possible to configure the AML Environment that you use for training and deployment to a very high degree, but it is not entirely hassle-free. An important element of customizing the AML Environment is being familiar with defining Docker images.
I would love to hear from you, especially if there is some of this you disagree with, would like to add to, or have a better solution for.