Ingestion of tabular data using Azure
This is part of a related set of posts describing challenges I have encountered and things I have learned on my MLOps journey on Azure. The first post is at: My MLOps Journey so far.
I needed to find the cheapest way to store tabular data on Azure so my ML code would be able to access the data. When looking into storage options on Azure, I was overwhelmed with the many options. There is Blob Storage (and several variants of that), Cosmos DB (and several interfaces, some of which resemble SQL), Table Storage, Azure Database for PostgreSQL, Azure Database for MySQL, Azure Database for MariaDB, Azure SQL Server, Azure SQL Database, and Azure SQL Database Managed Instance.
In my previous projects, I had not had to set up new data storage myself as I was merely a consumer of data stored in existing data stores. So facing this many options without knowing the characteristics, pros, and cons of each was daunting.
In addition to identifying the optimal storage, I also had a challenge in how to copy data into the database in near real-time from a proprietary database.
In the following, I will first discuss my considerations around data storage location. Next, I will describe how I solved the issue of moving data from the proprietary database into a location on the cloud.
Where to store the data
It was quickly apparent which options were not meant for tabular data, and so could be excluded for further consideration (Blob Storage, Cosmos DB and Table Storage).
I was also able to relatively quickly find out that Azure Database for PostgreSQL, Azure Database for MySQL, Azure Database for MariaDB are mainly meant for cases in which such databases already exist on premises, but a company wishes to move them to the cloud with minor adjustment of systems dependent on those databases. In my case I was about to establish a new database, not migrate an existing PostgreSQL, MariaDB or MySQL DB, ruling out these options.
I was then left with deciding between Azure SQL Database, Azure SQL Database Managed Instance, and Azure SQL Server. I found it difficult to understand the differences between these offerings from the documentation. To get a broader understanding of these differences, I recommend the free online self-paced courses on relational data collected in the Learning Path “Microsoft Azure Data Fundamentals: Explore relational data in Azure” offered by Azure. In summary, here is what I learned about these SQL services from Azure:
- Azure SQL Database
This is a platform as a service offering, meaning that there is no infrastructure related management on the customer side, this is handled by Azure. This also means that some features that require access to the OS are not available. - Azure SQL Database Managed Instance
This is also a platform as a service offering, but is almost also a full SQL Server hosted in the cloud, capable of supporting several database instances. There are still a few limitations on features compared to running on infrastructure you own. - Azure SQL Server
This is an infrastructure as a service offering. This gives full access to all features, but also requires the customer to handle maintenance of the infrastructure.
Since I only needed a database to store data, without having a need for running complex scripts, I chose the Azure SQL Database.
Copying data to the database on Azure
The data I need is first stored in a DIMS.CORE system. To access the data, I originally set up an Azure Virtual Machine, that I also installed DIMS.CORE on. Data could then be replicated between the two DIMS.CORE systems, from their first landing place to my VM. I then wrote Python code and scheduled a Python script to be run regularly using the Windows Task Scheduler. This Python could would read the latest data in the DIMS.CORE database on the VM, add UTC timestamps, and insert it into my Azure SQL Database.
From the beginning, I have wanted to get rid of the VM to free myself from having to maintain it and to save costs. A colleague has recently been able to have the first DIMS.CORE system send the data directly to a cloud location in its raw form. I have then deployed an Azure Function App that regularly reads from this cloud location, adds the UTC timestamps as before, and inserts the data into my Azure SQL Database. Hence I have been able to stop the VM, saving compute costs. Once the new setup has run for a while, I will delete the VM to also save storage costs.
Summary
I identified that an Azure SQL Database was the best option for storing the data, and that an Azure Function App allowed me to use existing Python code to perform the minimal necessary preprocessing of raw data before inserting it into the Azure SQL Database.
I would love to hear from you, especially if there is some of this you disagree with, would like to add to, or have a better solution for.