Near real-time ingestion and processing of image data

Laura Frolich
6 min read · Nov 24, 2021


This is part of a related set of posts describing challenges I have encountered and things I have learned on my MLOps journey on Azure. The first post is at: My MLOps Journey so far.

As mentioned in Ingestion of tabular data using Azure, there are several options for storage of non-relational data. For a good overview, I recommend the self-paced online courses collected in the Learning Path Microsoft Azure Data Fundamentals: Explore non-relational data in Azure.

In our setting, new image data is made available on an FTP server approximately every ten minutes as a .zip file. Each .zip file is deleted after 48 hours.

In the following, I will describe how we set up services to build up an archive of data, as well as the storage service considerations involved.

Building up archives of raw and preprocessed image data

As mentioned above, new image data is made available on an FTP server every ten minutes and deleted after 48 hours. We wish both to

  1. save all data so we have historical data available to train deep learning models, and
  2. make preprocessed image data available for inference by the current deep learning model.

These two steps are illustrated in the figure below. The first step is implemented with a Logic App that copies data from the FTP server to Blob Storage. The second step is implemented with a Logic App that starts a pipeline in Azure DevOps; the pipeline runs code that preprocesses the raw data saved in Blob Storage and saves the preprocessed data both in Table Storage, for fast access, and in a separate Blob Storage.

Illustration of data flow from raw data on external FTP source to preprocessed data in internal storage.

In the following, I will describe some caveats and insights I have learned since the initial setup for each of the two steps.

Step 1: Copy data from FTP server to Blob Storage using Logic App

Our first version of the Logic App that copies the raw data to Blob Storage was defined in the Logic App Designer like this:

Designer view of first version of Logic App that copies data from FTP server to Blob Storage.

The first task, “When a file is added or modified”, checks the FTP server for new files at regular intervals, as defined when adding the task. When a new file is detected, the Logic App is triggered.

One day, as I was looking through our existing services and clicking through all the various information tabs, I saw this:

Documentation on the “When a file is added or modified” task.

The image data we have varies in size, with the most interesting data files exceeding 50 megabytes. Reading this documentation, which had somehow escaped our notice when setting up the original Logic App, was shocking: it meant that we had lost the most interesting data, and it explained large data gaps that we had noticed but attributed to other causes. Googling, I found this documentation: https://docs.microsoft.com/en-us/azure/connectors/connectors-create-api-ftp. Following its advice, our current Logic App is designed like this:

Current implementation in Designer of Logic App that copies data from FTP server to Blob Storage.
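For intuition, here is a rough Python sketch of what this Logic App does conceptually: list the .zip files on the FTP server and copy any that have not been archived yet to Blob Storage. The host, container name, and the way already-copied files are tracked are placeholders, not our actual setup.

    from ftplib import FTP
    from io import BytesIO

    from azure.storage.blob import BlobServiceClient

    # Placeholder names; replace with your FTP host, credentials, and container.
    FTP_HOST = "ftp.example.com"
    CONTAINER = "raw-images"

    def copy_new_files(connection_string: str, already_copied: set) -> None:
        """Copy .zip files from the FTP server that have not been archived yet."""
        blob_service = BlobServiceClient.from_connection_string(connection_string)
        container = blob_service.get_container_client(CONTAINER)

        with FTP(FTP_HOST) as ftp:
            ftp.login()  # in practice: ftp.login(user, password)
            for name in ftp.nlst():
                if not name.endswith(".zip") or name in already_copied:
                    continue
                buffer = BytesIO()
                ftp.retrbinary(f"RETR {name}", buffer.write)  # stream the file down
                buffer.seek(0)
                container.upload_blob(name=name, data=buffer, overwrite=True)
                already_copied.add(name)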

Step 2: Preprocess data and store in Blob Storage and Table Storage

The initial design of the Logic App that starts preprocessing is illustrated below:

Original design of Logic App that starts preprocessing of raw data.

This Logic App was set up to run every ten minutes. When it is triggered, it starts a pipeline in Azure DevOps that runs Python code to preprocess any raw data that has not yet been preprocessed.
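Under the hood, starting a pipeline run boils down to a single REST call against Azure DevOps. A rough Python sketch of the equivalent call, assuming the “Run pipeline” REST endpoint and placeholder organization, project, and pipeline id, plus a personal access token with permission to run pipelines:

    import requests

    # Placeholder values for the Azure DevOps organization, project, and pipeline.
    ORGANIZATION = "my-org"
    PROJECT = "my-project"
    PIPELINE_ID = 42

    def start_preprocessing_run(personal_access_token: str) -> int:
        """Queue a run of the preprocessing pipeline and return its run id."""
        url = (
            f"https://dev.azure.com/{ORGANIZATION}/{PROJECT}"
            f"/_apis/pipelines/{PIPELINE_ID}/runs?api-version=7.0"
        )
        # Azure DevOps accepts a personal access token via basic auth with an empty username.
        response = requests.post(url, json={}, auth=("", personal_access_token))
        response.raise_for_status()
        return response.json()["id"]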

The reason we used a pipeline instead of an Azure Function when we set up these services was that Function Apps for Python code were not available at the time. They are now, so Function Apps should be the choice for running custom Python code.
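For illustration, a minimal sketch of such a Function in Python, using the decorator-based programming model with a Blob Storage trigger, might look like the following; the container name and connection setting name are placeholders, not our actual configuration.

    import azure.functions as func

    app = func.FunctionApp()

    # Placeholder container name; "RawStorageConnection" is the name of the app
    # setting that holds the storage connection string.
    @app.blob_trigger(arg_name="blob", path="raw-images/{name}",
                      connection="RawStorageConnection")
    def preprocess_image(blob: func.InputStream) -> None:
        """Run preprocessing as soon as a new raw file lands in Blob Storage."""
        raw_bytes = blob.read()
        # ... unzip, preprocess, and write the results to Blob Storage and Table Storage ...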

Starting a pipeline using a recurrence trigger in a Logic App is an overly complex approach. At the time, however, we were not aware that an Azure DevOps pipeline can be defined to run automatically on a regular schedule, as described here: Configure schedules for pipelines.

The design with the recurrence trigger implies that if a file is uploaded to the FTP server a bit later than usual, and is therefore available in the raw data storage a bit later than the recurrence trigger assumes, that file is missed and only preprocessed on the next invocation of the preprocessing code. Running the preprocessing code on a fixed schedule thus introduces delays in the data flow. Instead, we have now set up a trigger so that the Logic App starts the pipeline as soon as a new file is created in the Blob Storage for raw data:

Trigger preprocessing pipeline run when a new raw data file is available.

The definition of the “When a resource event occurs” task looks like this:

Settings in the definition of the event trigger task in the Logic App Designer.

To emit a trigger when a new blob is created in the storage account named “Storage-account” in the subscription “Subscription-name”, choose “Microsoft.Storage.BlobCreated” in the “Event Type Item -1” box. Choose a name for the event subscription; in the example above it is “My-blob-created”. You can verify that the event subscription was created, and monitor when events are emitted, by going to the storage account, as shown below:

View event subscription in Blob Storage.
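The event that fires the Logic App follows the Event Grid schema, so the name of the new blob can be recovered from the event payload and passed on to the preprocessing code. A small sketch, assuming the payload is forwarded as JSON:

    # Sketch: extract the new blob's name from a Microsoft.Storage.BlobCreated event
    # in the Event Grid schema, e.g. to tell the pipeline which file to preprocess.
    def blob_name_from_event(event: dict) -> str:
        assert event["eventType"] == "Microsoft.Storage.BlobCreated"
        blob_url = event["data"]["url"]  # e.g. https://<account>.blob.core.windows.net/raw-images/file.zip
        return blob_url.rsplit("/", 1)[-1]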

Storage service considerations

In terms of where to store data, we started with Blob Storage, since it seemed to be the most general-purpose storage for binary data. However, when we needed to retrieve data quickly to train a deep learning model, retrieval from Blob Storage turned out to be slow.

To maintain a consistent archive of data, we continued to store preprocessed data in Blob Storage.

To also enable quick retrieval, we additionally store the preprocessed data in Table Storage. However, it is only by luck that this solution works, since some of our files are only just under the limit on what can be stored in a row in Table Storage, which is 1 MB (see https://docs.microsoft.com/en-us/learn/modules/explore-non-relational-data-offerings-azure/2-explore-azure-table-storage).
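To make the limits concrete: a Table Storage entity can hold at most 1 MB in total, and a single binary property at most 64 KB, so a file close to 1 MB has to be split across several properties. A sketch using the azure-data-tables package (the table name and key scheme are placeholders, not our actual code):

    from azure.data.tables import TableServiceClient

    CHUNK = 64 * 1024                    # a single binary property holds at most 64 KB
    TABLE_NAME = "preprocessedimages"    # placeholder table name

    def save_preprocessed(connection_string: str, file_name: str, payload: bytes) -> None:
        """Store one preprocessed file as a single entity, split across binary
        properties; the whole entity must stay under Table Storage's 1 MB limit."""
        table = TableServiceClient.from_connection_string(
            connection_string).get_table_client(TABLE_NAME)
        entity = {"PartitionKey": file_name[:8], "RowKey": file_name}  # e.g. date prefix as partition
        for i in range(0, len(payload), CHUNK):
            entity[f"Data{i // CHUNK:02d}"] = payload[i:i + CHUNK]
        table.upsert_entity(entity)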

I have since learned that a better solution would have been to use Cosmos DB with links to Blob Storage files. If setting up the system today, the newer Data Lake offering built on top of Blob Storage (Azure Data Lake Storage Gen2), which was not available when we started working on this project, would probably be the better choice. Another important thing to consider in Blob Storage or a Data Lake is the hierarchy, since a flat structure makes listing files slow when there are many of them.
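As a small example of why the hierarchy matters: with a date-based naming scheme such as raw/2021/11/24/<file>.zip, a training job can list just one day's blobs by prefix instead of paging through the entire archive. A sketch, assuming a container named “image-archive”:

    from azure.storage.blob import ContainerClient

    def list_day(connection_string: str, day_prefix: str = "raw/2021/11/24/") -> list:
        """List only the blobs under one day's prefix in a date-partitioned archive."""
        container = ContainerClient.from_connection_string(connection_string, "image-archive")
        return [blob.name for blob in container.list_blobs(name_starts_with=day_prefix)]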

Summary

I have described how we set up a number of Azure services to build up a historical collection of raw as well as preprocessed data. Two important lessons I have learned are:

  • Using triggers to start the next step as soon as the previous step finishes can substantially decrease the delay from receiving raw data to having preprocessed data available.
  • Having a good overview of available services and their differences before setting up a system leads to better choices, which can be difficult to change at a later time.

I would love to hear from you, especially if you disagree with any of this, would like to add something, or have a better solution.
