AzureML: A Painless Solution for Developing ML Apps on Cloud

Shruti Chaturvedi
18 min read · Apr 3, 2021
Photo by Wolfgang Hasselmann on Unsplash

Machine Learning on Cloud is becoming the new black: lovable, but a trend that needs significant thought and research. ML PaaS is one of the most desirable and fastest-growing trends in the public cloud space. Almost every cloud provider is constantly trying to improve its ML platform by offering a more comprehensive solution for Developers, Data Scientists and DevOps Engineers. But a big problem with ML on Cloud is how complex and costly these solutions can become as a company scales. Microsoft, however, has built a very robust ML PaaS system: AzureML offers a painless end-to-end solution for developing and deploying ML applications for different business needs.

With AzureML, all you really need to start is an Azure account (… and a somewhat reliable internet connection), and you can begin building enterprise-level ML solutions even from your Windows XP machine! Everything from provisioning clusters, running training jobs, and building and operationalizing ML models to deploying scalable ML solutions is managed by AzureML at very low cost. Another really cool thing is that AzureML binds DevOps and Data Science seamlessly. Pretty great!

Let’s now take a look at building with AzureML. We will be designing a binary classification pipeline and will make use of AzureML SDK for Python along with AzureML Console to build a powerful solution. As we go through each step, we will take a closer look into various AzureML offerings and how they can be used across a variety of business-solutions.

Azure Portal

Login to Azure Portal

Assuming you already have an Azure account and a personal or enterprise subscription (this is what Azure uses to track your usage, permissions and expenses), log in to your Azure account and visit the Azure portal. The Azure portal is akin to a shopping mall: it contains everything (and more) you need to develop, collaborate and deploy on Azure. You can add a variety of cloud services (compute, storage, AI, analytics, IoT and more) to your subscription right from the portal. Alternatively, you can also use the Azure CLI to manage (create, update, delete) your resources.

Creating a new AML Resource

Create an AzureML Resource

Now that you are in the portal, let’s create a new resource by clicking on Create a Resource listed within Azure services, and search for Machine Learning within the Marketplace. This is where we will provision Azure Machine Learning (also known as AzureML and AML). To create an AML resource, you will need to specify the subscription and the resource group. A resource group bundles related resources together, allowing you to manage the whole bundle as one unit. Next, we will configure our ML Workspace. (Don’t worry! We will take a look at what a Workspace is right after we’re done creating our AML Resource.) In order to create a Workspace, here’s what we will need:

  • A name for the workspace
  • Region — As we define the region, we need to be mindful of what compute we need for this workload. Our compute target (the compute resource which will execute our jobs) will be configured in the same region as the workspace. So if our solution needs a GPU-optimized machine, we will have to make sure those machines are available in the region we choose for our workspace. Note: Resources in a resource group are location-independent, so you can have resources from 2 different locations in the same resource group.
  • Storage Account — Azure Storage is an Azure cloud service that provides highly available, secure, durable, scalable and redundant storage. You can store images, videos, logs, sensor data and a whole lot more in Azure Storage. An Azure Storage Account provides access to Azure Storage and manages configurations like redundancy for you. You can either go with the default storage account OR create a new account to tweak the redundancy level. The default is Locally Redundant (redundant within a single data center in your defined region).
  • Key Vault — A Key Vault is where you can store information you want to keep confidential, like secrets and credentials. In order to access this information in your scripts, you can make use of the AzureML SDK. Similar to the Storage Account, you can use a specific Key Vault you have defined earlier.
  • Application Insights — AzureML uses App Insights to collect data from models deployed as Web Services. You can collect insights on requests, responses, outputs, et cetera from the deployed ML App.
  • Container Registry — AzureML uses a Container Registry to store images used in pipeline runs. By default, no container registry is created when the ML Workspace is first created, to prevent users from paying unnecessarily. As we run the steps in the pipeline and generate Docker images for those steps, Azure will create and configure a new Container Registry and store the images there. In subsequent runs of the pipeline, AzureML pulls images from this registry, thereby saving significant time.
Creating a Machine Learning Workspace

Great! Once we have everything defined, we can go ahead and click on Review+Create. Our resource will first be reviewed, and once our permissions and setup are validated, we can click on Create. This will start the deployment of our ML Resource. You can definitely sip some coffee as the resource is being deployed!

So, if you are still here, either you don’t drink coffee, or you are curious to know more about AzureML … or you fall in the crazy Venn-diagram intersection of the two! Let us dive a little deeper into AzureML and talk about the components which make it so great.

Components of AzureML:

  1. ML Workspace — The Workspace is the umbrella which contains everything you need to develop and deploy your ML solution. The components we need, like data, compute targets, environments, notebooks and models, are all present inside the workspace. The AzureML Workspace provides a way to track and version each of these components, so teams are able to go back into history and easily trace the changes they have made over time.
  2. Environment — An Environment is what defines the runtime for the various stages in our pipeline. We can define dependencies, environment variables, the version of our runtime language to use, and more.
  3. Pipeline — Think of this as a roadmap, but for the computer. We define a bunch of steps which need to run in order, and that makes a pipeline. For example, we first define a step to get the data, then process the data, and finally train the model. In order to define a pipeline, we first configure the steps and then specify the order in which to run them. These steps can involve preparing our data, tuning hyperparameters, deploying our models, transferring data between different sources, and more. AzureML Pipelines gives us the freedom to define different hardware requirements for different steps: if a step is more compute-intensive, we can use compute-intensive hardware for that step and general-purpose hardware for the others. Note: You don’t necessarily have to run steps sequentially. Depending upon your business needs, you can also run steps in parallel.
  4. Run — Alright, so now we have a pipeline defined, which is nothing but a set of steps running in order. A Run in AzureML is one execution of those steps: it takes our defined pipeline and runs it on the defined compute targets (we define the compute targets in the steps of our pipeline).
  5. Experiment — An experiment is a collection of runs. We define an experiment and submit runs to it. For example, say an AzureML Run is running an AzureML Pipeline. Now, if we tweak a hyperparameter of a step and submit a new run, the pipeline will produce a different result. If we submit this run to the same experiment, we will be able to effectively compare the performance of the two runs within that experiment.
AzureML Workspace

Quite some discussion on AzureML components! Should we now check back on our AzureML Resource?! Once your resource is deployed, open the new resource by clicking on ‘Go to Resource’ and you’ll be brought into your AzureML Workspace interface. Pretty sleek UI, but what’s sleeker is AzureML Studio. Click on ‘Launch Studio’ to launch the super cool interface to AzureML.

Azure ML Studio

Now we are inside AzureML Studio. Here you will be able to see the data you are working with, a beautiful visualization of our pipeline as it runs, and information on your registered models (models you have registered to be used in your business). As we create our pipeline and run it, we will see more activity here, I promise!

Building Our Pipeline

By this step, we should have created an AzureML Workspace. If you are having trouble creating a Workspace, the first thing you’d want to do is check your subscription to make sure it allows you to create ML resources.

We will be working with Pima Indian Diabetes dataset to design a binary classification pipeline using AzureML.

To design our ML Pipeline, we will be using AzureML SDK for Python. Install the dependency using:

pip install azureml-sdk
Directory Structure
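
(The directory-structure screenshot doesn’t reproduce here, so below is roughly the layout we will be working with; the exact file names are my assumptions based on the steps that follow.)

data/
    diabetes.csv      <- the raw dataset
prep/
    prep.py           <- script for the data-prep step
train/
    train.py          <- script for the training step
config.json           <- workspace config we will download later
pipeline.ipynb        <- notebook where we configure and run the pipeline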

The directory structure seems quite intuitive, right? The data folder is where we will store the csv file containing the data. We will then use Azure Storage to store this data and make it accessible to the compute target running our pipeline by utilizing the Datastore class from the AzureML SDK we installed earlier. We can also write a script to download this csv file and pass it as a step to our pipeline; the output (the downloaded csv file) will be tracked and stored by our pipeline.

The two steps we want to define for our pipeline are prepping the data and then running a training script on the prepped data. What do we know about the steps?

  • The prepping step should be capable of retrieving the raw csv, so it needs an input: the raw csv file.
  • The prepping step has an intermediate data output; that is, the output of this stage will be used by the training step.
  • The training step will train the model, and we also want to save the model as a serialized file in a way that can be directly pushed to production on Azure.

Cool! We know a bit more; that’s always good. The really fun thing I have learned is just how much freedom AzureML gives Developers and Data Scientists to build ML solutions specific to their business needs. The sky really is the limit when it comes to storing data, designing the steps of a pipeline or deploying the pipeline!

Prep-Step Script

Here is the script we will be running to prepare the data; put it within the prep folder. The reason we want separate directories is to A) make the ecosystem more cohesive and B) ensure the Docker image is built from the right script: as we run our pipeline, Azure creates a Docker image from the script used in each step, and we will be defining the prep folder as the source directory for the prep step.
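
(The original script is embedded in the post as a gist; since it isn’t reproduced here, below is a minimal sketch of what prep.py could look like, reconstructed from the notes that follow. The Outcome column name, the output file names, and the scaling step are my assumptions.)

# prep.py: a minimal sketch, not the author's original gist
import argparse
import os

import joblib
import pandas as pd
from azureml.core import Run
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Get the context of the current pipeline run (inputs, logging, etc.)
run = Run.get_context()

# Parameters are passed in dynamically from the pipeline step definition
parser = argparse.ArgumentParser()
parser.add_argument("--train", required=True)
parser.add_argument("--test", required=True)
parser.add_argument("--scaler", required=True)
parser.add_argument("--test_size", type=float, default=0.3)
parser.add_argument("--seed", type=int, default=42)
args = parser.parse_args()

# Load the registered tabular dataset passed to this step as a named input
dataframe = run.input_datasets["raw_data"].to_pandas_dataframe()

X = dataframe.drop("Outcome", axis=1)  # "Outcome" is the label column in the Pima dataset
y = dataframe["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=args.test_size, random_state=args.seed)

# Scale the features and keep the fitted scaler so inference can reuse it
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

# Write outputs to the PipelineData paths handed to us as arguments
for path in (args.train, args.test, args.scaler):
    os.makedirs(path, exist_ok=True)
X_train.assign(Outcome=y_train.values).to_csv(os.path.join(args.train, "train.csv"), index=False)
X_test.assign(Outcome=y_test.values).to_csv(os.path.join(args.test, "test.csv"), index=False)
joblib.dump(scaler, os.path.join(args.scaler, "scaler.joblib"))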

Some interesting things about prep.py script:

  1. Using azureml.core…? This is the core library which gives us access to all the core functionalities of AzureML, like access to Datasets, Pipelines and Run (which we are using in this script).
  2. Run.get_context() — as we talked about earlier, a Run in an AzureML Pipeline is one run of the pipeline which contains logs, metrics, etc. about that particular run/trial. Here, we are getting the context of the current run of the pipeline.
  3. We are using an argument parser. What exactly is it doing here? Excellent thought! An Argument Parser is what we use to pass data, in the form of parameters, to the script when it is run. We do not want to hard-code the parameters this script needs, because that is not good design practice. What we rather want to do is pass these parameters dynamically. So if we want to change the seed for the random split, we will pass that value as a parameter to this script from our step. This is also a great way to send data between two steps of a pipeline. Think about it this way: the compute target running the pipeline does not have a directory structure like the one on our local system. As we run through each step, a Docker image of that script is created, so there has to be a way to define inputs to a script running in an isolated environment. In that circumstance, passing arguments through the Argument Parser helps profoundly. When we write our pipeline config code, we will look at this more closely.

Train-Step Script

Here is the script we will use to train our model. We have a Discriminant Analysis model we want to work with.
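
(Again, the gist itself isn’t reproduced here, so below is a minimal sketch of what train.py could look like given the notes that follow. I am assuming scikit-learn’s LinearDiscriminantAnalysis and the same file names as in the prep sketch.)

# train.py: a minimal sketch, not the author's original gist
import argparse
import os
import time

import joblib
import pandas as pd
from azureml.core import Run
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

run = Run.get_context()

# args.train and args.test reference where the prepped data lives;
# args.model references where the serialized model should be saved
parser = argparse.ArgumentParser()
parser.add_argument("--train", required=True)
parser.add_argument("--test", required=True)
parser.add_argument("--model", required=True)
args = parser.parse_args()

train_df = pd.read_csv(os.path.join(args.train, "train.csv"))
test_df = pd.read_csv(os.path.join(args.test, "test.csv"))

X_train, y_train = train_df.drop("Outcome", axis=1), train_df["Outcome"]
X_test, y_test = test_df.drop("Outcome", axis=1), test_df["Outcome"]

# Track how long training takes and log it to this run of the pipeline
start = time.time()
model = LinearDiscriminantAnalysis()
model.fit(X_train, y_train)
end = time.time()
run.log("training_time_seconds", end - start)
run.log("test_accuracy", float(model.score(X_test, y_test)))

# Serialize the model so it can be registered and pushed to production later
os.makedirs(args.model, exist_ok=True)
joblib.dump(model, os.path.join(args.model, "model.joblib"))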

Some interesting things about train.py script:

  1. This script receives parameters using the Argument Parser. args.train and args.test hold references to where the training and testing data are stored, while args.model provides a reference to where the final model should be saved.
  2. We are using run.log(); what could that be? As mentioned previously, Run.get_context() in an AzureML Pipeline provides information on that particular run of the pipeline. Here we record the start and end time of training to track how long the process takes, and log it to that run of the pipeline. Ideally, we would like to keep this time pretty low to keep our users happy. If we tweak our hyperparameters, or the algorithm itself, and submit the run again to the same experiment, we will be able to compare the total time taken between the runs; a great way to make important decisions in terms of UX.

Pipeline Configuration

OK, OK! Super cool part now: actually designing and running the pipeline. I am using a Jupyter notebook here; you have total freedom to use whatever works.

What we are trying to do here is access and configure the AML Workspace from our local system, and the AzureML SDK allows us to do just that. All we need is to go back to the AML Workspace and download the config.json file, which contains information on our workspace, including its name and the resource group inside which our Workspace lies. We need to download this file and store it at the root of the directory we have been working with.

config.json File
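
(For reference, the downloaded config.json has a simple shape; the values below are placeholders.)

{
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>"
}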

We are packed and ready to start writing the code for configuring the pipeline. Let’s first import all the needed dependencies.

import os
# azureml.core for importing core components like Experiment, Workspace
from azureml.core import Workspace, Experiment, Datastore, Dataset, Environment
#To track and log run-details as the experiment runs
from azureml.widgets import RunDetails

from azureml.pipeline.core import Pipeline, PipelineData, PipelineRun, StepRun, PortDataReference
from azureml.pipeline.steps import PythonScriptStep

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

from azureml.core.model import Model

We already talked about azureml.core. Moving on, we are importing RunDetails to print logs in the notebook as we submit a run to an experiment. PipelineData is data that is used either as an input or an output in one or more steps. PipelineRun provides information on the status of a running pipeline. StepRun is the run of a particular step in a pipeline; a great use of it is to download the output of a particular step from a run, along with PortDataReference. An AzureML Pipeline has several kinds of steps; PythonScriptStep is one kind which runs a Python script. Others can be checked here: AzureML Pipeline Steps. ComputeTarget and AmlCompute allow us to perform CRUD operations on compute resources. CondaDependencies gives us a great way to configure environment dependencies. Microsoft has really great documentation on each of these modules and classes. I recommend taking a look; you might find some really cool stuff.

Aight, with that done, I will move to getting access to my workspace, and begin working with that!

#Configure the Workspace from config.json
ws = Workspace.from_config()

Now, configuring our data. AzureML Pipelines use Datastores to get access to data stored across a variety of storage services on Azure; Datastores are attached to the workspace and can be referred to by name, and there is support for object storage, SQL DBs and more. The data within a datastore, which can be thought of as a versioned data object used within the pipeline as input to a step, is called a Dataset. There are 2 different kinds of Datasets: a File Dataset and a Tabular Dataset. A Tabular Dataset gives Data Scientists the capability of transforming the data into a Pandas OR a Spark DataFrame.

def_blob_store = ws.get_default_datastore()
def_blob_store.upload_files(["./data/diabetes.csv"], target_path="data", overwrite=True)

We now have access to the default datastore, and also uploaded our data to the store. In order to access this data as an input dataset to the pipeline, we will use Dataset and register it with our workspace (so that our Pipeline will have access to this dataset and can pass this as input to a script).

#creating a tabular dataset (the file was uploaded to the "data" folder of the datastore)
diabetes_data = Dataset.Tabular.from_delimited_files(def_blob_store.path('data/diabetes.csv'))
#registering the created dataset in the workspace with the name diabetes_data
diabetes_data = diabetes_data.register(ws, 'diabetes_data')

Super, we now have a dataset registered to our workspace. It can easily be passed as an input to a script inside a pipeline step and be loaded into a pandas DataFrame. With that done, we will configure the compute target which will be running all these hefty tasks below. We will first check if the cluster already exists and spin up a new one only if it is absent; otherwise, we use the existing one. We are using a standard general-purpose D-series VM. You can choose from a huge list of VM sizes depending on your use case. So amazing! Check VM Sizes here!

aml_compute_target = "ml-cluster"

#Search for an existing cluster; create one in the ML Workspace if not found
try:
    aml_compute = AmlCompute(ws, aml_compute_target)
    print("Compute Target exists already, using that")
except ComputeTargetException:
    print("Creating a new Compute Target")
    provisioning_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2",
                                                                min_nodes=1,
                                                                max_nodes=2)
    aml_compute = ComputeTarget.create(ws, aml_compute_target, provisioning_config)
    aml_compute.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

print("Azure Machine Learning Compute attached")

Next, we define a run configuration which will be used across all the runs in an experiment. One can also use different run configurations for different runs; for example, if one would like to change the compute cluster the pipeline runs on, AzureML gives developers the freedom to define and use varying run configurations.

aml_run_config = RunConfiguration()

aml_run_config.target = aml_compute
aml_run_config.environment.docker.enabled = True
aml_run_config.environment.docker.base_image = "mcr.microsoft.com/azureml/base:latest"

aml_run_config.environment.python.user_managed_dependencies = False

aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['pandas', 'scikit-learn', 'numpy'],
    pip_packages=['joblib', 'azureml-sdk', 'fusepy'],
    pin_sdk_version=True)

Over to the next step, where we will be defining inputs and outputs to various steps in the Pipeline we will run.

#passing the dataset as_named_input to retrieve in the script later, like a parameter for function
raw_data = diabetes_data.as_named_input('raw_data')
train_data = PipelineData("train_data", datastore=def_blob_store).as_dataset()
test_data = PipelineData("test_data", datastore=def_blob_store).as_dataset()
scaler_file = PipelineData("scaler_file", datastore=def_blob_store)
model_file = PipelineData("model_file", datastore=def_blob_store)

If this hasn’t been all that fun, here’s the pulp. In the next code-block, we will define our first step and have a look at how we are configuring inputs and outputs to the step.

source_directory="./prep"
prep_step = PythonScriptStep(name="prep_step",
script_name="./prep.py",
arguments=["--train", train_data,"--test", test_data,"--scaler",scaler_file, "--test_size",0.3, "--seed", 42],
inputs=[raw_data],
outputs=[train_data,test_data,scaler_file],
compute_target=aml_compute,
runconfig=aml_run_config,
source_directory=source_directory,
allow_reuse=True)

Looking at the code block above, we first define the source_directory, which is where the script is present. Next, we use a PythonScriptStep, which means this step runs a Python script: just one of the many different AzureML Pipeline steps available to us. We then define a bunch of arguments, all of which are used in the prep.py script we saw earlier. This is how we pass arguments to the script dynamically. We also saw how we used the input of this step back in our prep.py file:

dataframe = run.input_datasets["raw_data"].to_pandas_dataframe()

Here we also see the use of our dataset: raw_data, passed as a named input, gets loaded directly into a pandas DataFrame. Super snazzy and super clean!

Similarly, we will also define a second step, which is the training step.

source_directory="./train"
train_step = PythonScriptStep(name="train_step",
script_name="./train.py",
arguments=["--train", train_data,"--test", test_data,"--model",model_file],
inputs=[train_data,test_data],
outputs=[model_file],
compute_target=aml_compute,
runconfig=aml_run_config,
source_directory=source_directory,
allow_reuse=True)

The configuration of this step is pretty similar to the previous one. A few things:

  • We are using a similar environment and compute for both of these steps. However, one is totally free to use a different environment and/or hardware depending upon need.
  • allow_reuse=True lets a step reuse the results of its previous run if the step contents (scripts/dependencies) as well as its inputs and parameters remain unchanged. Note: changes to the underlying data in the datastore are not included here; such changes will not trigger a rerun. Only changes to the scripts or dependencies will.

Phew! We have come so far! All we need to do now is put the steps together to design our pipeline, and run it. Also, remember what we talked about earlier: the steps here run sequentially, but AzureML gives us room to run steps in a pipeline in parallel.

#Define the order of steps to run in a sequence
steps = [prep_step,train_step]
# Design the pipeline with the steps
pipeline1 = Pipeline(workspace=ws, steps=steps)

Nice… just one last step, and our pipeline is ready, set, go! In the code block below, we will submit a run of this pipeline to our experiment. As soon as we execute this cell, a run will start. Let’s run this and head back to AzureML Studio. We will see that a new experiment has started, and it is executing Run 1. One thing to note is the use of regenerate_outputs=False; by setting regenerate_outputs to False we are configuring the pipeline to reuse the output of previous runs. This way, we will not generate a new output for each run if a particular step has not changed. However, if we set this to True, a new output will be generated every time we submit a run to the Experiment.

pipeline_run = Experiment(ws, 'diabetes').submit(pipeline1, regenerate_outputs=False, show_output=True)
AzureML Studio Running an Experiment

As we click on the Run ID, we get a clearer look into the run itself, primarily the pipeline configuration of this run. The pipeline is presented as a DAG. This graph was generated for us on the console from how we defined the inputs and outputs of each step, along with the order in which the steps should run.

Over on the console, there are options to check the graph itself and look at the metrics of either the entire run or any particular step. The Experiments tab provides extensive information on the entire run and the runs of particular steps. Every time we submit a run to this experiment, the run is listed as Run#<number> within that experiment. One thing to note here is that the first time we run the pipeline (Run#1), it takes longer than subsequent runs. This is because it takes longer to build a Docker image the first time we run each step; in subsequent runs, we use a cached version of the image. As soon as the image is built, the status will change from preparing to running. However, if the process is still very slow, you can always play around with the hardware config of the compute. Plus, you can always browse through the logs to see what is going on as the pipeline is running. In case of an error, head over to the Outputs+logs of that particular stage to get a clearer idea of what went wrong.
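
And if you’d rather not leave the notebook, the RunDetails widget we imported earlier renders the same progress inline. A quick sketch:

#Watch the run's progress right inside the notebook
RunDetails(pipeline_run).show()

#Optionally block until the pipeline run finishes
pipeline_run.wait_for_completion(show_output=True)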

Visualizing the Run we submitted to our Experiment
A Run in progress

Once the run is completed, we can download the output of the step which produces our model, and then register the model to our Workspace. Once the model is registered, one way to expose it for inference is to deploy it as a Web Service, wrapping it inside a scoring script whose job is to load our registered model from the AML Workspace and make predictions.
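
As a rough sketch of that last mile (the model name here is illustrative, and the downloaded path may need adjusting for your run), fetching the train step’s output and registering the model could look something like this:

#Grab the run of the train step and download the model it produced
train_step_run = pipeline_run.find_step_run("train_step")[0]
model_output = train_step_run.get_output_data("model_file")
model_output.download(local_path=".")

#The files land under a run-specific folder; point model_path at the
#downloaded model_file directory (adjust the path for your run)
model = Model.register(workspace=ws,
                       model_name="diabetes-classifier",
                       model_path="./" + model_output.path_on_datastore)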

Possibilities for development and deployment with AzureML are endless. One reason why I love AzureML is how expansive yet easy to implement AML is. There are such great tools available for teams to accelerate the adoption of ML into their business. The AzureML SDK, for example, has greatly simplified the creation of ML Pipelines: by changing a single parameter, teams can switch from a CPU-based training cluster to a powerful GPU cluster running GPU-intensive workflows. Similarly, AzureML Workspaces have made tracking and versioning data and assets a cakewalk. AzureML Pipelines allow teams to automate the process of building and releasing ML solutions as steps. Plus, the pricing is amazingly flexible, giving teams the opportunity to try different components without setting their funds on fire. And of course, the great documentation and the ever-growing, ever-supporting developer community make developing with Azure so much more fun!!!
