Data Scientist’s starter toolkit for the end-to-end Machine Learning Lifecycle in 2022
Introduction
The field of data science is constantly evolving, both in the tools already available and in those being developed every day. Even though no tool is flawless and comprehensive, having the right tools on your side can make a big difference in productivity and execution.
In this tutorial, I'll talk about a few open-source tools for machine learning experimentation and integrated MLOps.
Table of Contents
- Define the business problem and use case.
- Perform a machine learning experiment using PyCaret to develop an end-to-end machine learning pipeline.
- Use MLflow to log experiment metrics and parameters.
- Use the MLflow Model Registry to register models on-premises (locally).
- Use the MLflow Model Registry to register models remotely (on DagsHub).
- Use the MLflow Sagemaker integration to deploy an API on the AWS cloud with just one command.
- Use DVC to version control the data files.
- Use DagsHub to manage the entire cycle.
PyCaret
PyCaret is an open-source, low-code machine learning library that we will use to build an end-to-end machine learning pipeline in Python. PyCaret is known for being simple and easy to use, and for building and deploying end-to-end machine learning pipelines quickly and efficiently. To learn more about PyCaret, check out this link.
DagsHub
DagsHub is a GitHub for machine learning. It is a platform for data scientists and machine learning developers to version their data, models, experiments, and code. DagsHub is built on popular open-source tools and formats, making it easy to integrate with the tools you already use. To learn more about DagsHub, check out this link.
Data Version Control
Data Version Control (DVC) is data versioning, workflow, and experiment management software that builds upon Git. DVC bridges the gap between well-known engineering tools and the needs of data science, letting users take advantage of new features while still using their existing skills and intuition.
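For example, when you track a folder with dvc add, DVC commits only a small metafile to Git while the data itself goes to remote storage. A representative sketch of such a metafile (the hash and sizes are illustrative):
# data.dvc: committed to Git in place of the data itself
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir
  size: 15420
  nfiles: 1
  path: data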
MLflow
MLflow is an open source platform to manage the machine learning lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
Before we start: last week I wrote a blog post, Simplify MLOps with PyCaret, MLflow, and DagsHub. It is a step-by-step data scientist’s guide to integrating MLOps into the machine learning experimentation lifecycle. Feel free to check it out!
Problem Statement
An insurance company wants to improve its cash flow forecasting by using demographic and basic patient health risk metrics at the time of hospitalization to get a better idea of what patient charges will be.
Objective
Build and deploy a machine learning pipeline in the form of an API endpoint that takes a patient's demographic and health information and gives back the estimated costs based on the machine learning model.
Tasks
- Create a project on DagsHub, clone the repo locally, and configure the DVC and MLflow remote origins.
- Train and develop a machine learning pipeline using PyCaret (we will first perform some experimentation to select the best model).
- Use DVC to version control data files.
- Use MLflow for experiment tracking.
- Use the MLflow Model Registry to register and deploy the model on-premises.
- Use the MLflow Sagemaker integration to push the MLflow image to AWS ECR and deploy the model as an endpoint on Sagemaker with just one command.
Let’s get started.
Create Projects on DagsHub
If you do not have an account already, you have to sign up for DagsHub. It is free. You will notice the interface of DagsHub looks like GitHub. This is not a coincidence, it is by design.
You can create a new repository by clicking on the Create button on the top right corner. If you want, you can connect to an existing Git repository, but in this tutorial I am going to click on “New Repository” and create it from scratch.
The repo is created. Since it’s public, you can also check it out here. Next, I am going to clone this repo locally and start working on the project.
git clone https://dagshub.com/moez.ali/insurance-app.git
In order to version control data files, we have to configure DVC. To configure, there are a few commands you have to run on a command line terminal. You can copy these commands by clicking the “Remote” button.
Now, run this in your terminal:
# init dvc repo
dvc init

# add remote origin
dvc remote add origin https://dagshub.com/moez.ali/insurance-app.dvc

# authenticate dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user moez.ali
dvc remote modify origin --local password YOUR_TOKEN_WILL_BE_HERE
Now I will create a data folder inside my project folder.
mkdir data
Finally, let’s add the data folder to DVC and push everything by running the following commands:
dvc add data
git add data.dvc .gitignore
git commit -m "added data folder to dvc"
git push
dvc push -r origin
At this point, we are ready. If you head over to the project repo, you will notice that the data folder is there, and next to it is the DVC logo, which signifies that all the data files under this folder are version controlled by DVC and hosted on the remote storage (up to 10 GB) provided for free by DagsHub.
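As a quick aside, anyone who clones the repo later can fetch the same version of the data with a single command (a minimal sketch, assuming the remote named origin configured above):
# fetch DVC-tracked data files from the remote
dvc pull -r origin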
Model Training and Selection using PyCaret
Now that configuration is out of the way, let’s jump right into a Jupyter Notebook to start experimenting. But first, you need to install PyCaret if you haven’t already.
# install pycaret
pip install pycaret
To organize things better, I have created a subfolder, Notebooks, where I will keep all my notebooks. The output data files will be stored in the data subfolder that we created above.
The dataset we will be working with is available within PyCaret. The first step is to download it and store it as a CSV in the data subfolder.
# load data from pycaret repo
from pycaret.datasets import get_data
data = get_data('insurance')

# store in data folder
data.to_csv('../data/raw_data.csv', index = False)
Initialize Experiment in PyCaret
setup is the first step in any machine learning experiment performed using PyCaret. This function takes care of all the data preparation required before training the models. Besides performing some basic default processing tasks, PyCaret also offers a wide array of preprocessing features. To learn more about all the preprocessing functionality in PyCaret, see this link.
from pycaret.regression import *
s = setup(data, target = 'charges', session_id = 123, log_experiment = True, experiment_name = 'insurance1')
Notice that we have passed log_experiment = True and experiment_name = 'insurance1'. This will automatically log the entire experiment using MLflow (yes, PyCaret has built-in integration with MLflow). Once the setup function completes, you will see the following output:
Once the setup is finished, we are ready to start model training and selection. A single function, compare_models, will train 25+ machine learning models and evaluate their performance using cross-validation.
# train baseline models
best = compare_models()
Output from compare_models will look like this:
The variable best contains the trained model. In this case, the best model is the Gradient Boosting Regressor.
print(best)
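If you also want to persist this pipeline outside of MLflow, PyCaret can finalize it (retrain it on the full dataset) and save it as a pickle file. A quick sketch; insurance_pipeline is just an illustrative file name:
# retrain the best pipeline on the entire dataset and save it locally
final_model = finalize_model(best)
save_model(final_model, 'insurance_pipeline')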
Now type mlflow ui in the terminal (or !mlflow ui in a Jupyter Notebook) and go to http://localhost:5000, where you will see the MLflow dashboard.
!mlflow ui
Click on the plus sign to open the experiment. Each run inside the experiment is a model trained by the compare_models function.
You can click on a specific model:
You have an entire machine learning pipeline packaged as an artifact. If you want, you can download it as a .pkl file and write your own scoring script. When you click on the model, you will notice a Register Model button towards the right.
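For example, a minimal custom scoring script could look like this (a hedged sketch; it assumes you downloaded the model artifact folder to ./model, next to the data folder from earlier):
# load the downloaded MLflow model artifact and score new data
import mlflow.pyfunc
import pandas as pd

model = mlflow.pyfunc.load_model('./model')
new_data = pd.read_csv('../data/raw_data.csv').drop('charges', axis=1)
print(model.predict(new_data))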
MLflow Model Registry and on-premise Deployment
You can use the MLflow Model Registry to register and version control the model. However, this functionality will not work in the current setup; if you try, you will see an error towards the top.
There are four scenarios for running MLflow. The one we have used so far is as follows:
Many developers run MLflow on their local machine, where both the backend and artifact store share a directory on the local filesystem.
The second scenario, which is the one we need because it supports the Model Registry, works as follows:
In this scenario, the MLflow client uses the following interfaces to record MLflow entities and artifacts:
- An instance of a LocalArtifactRepository (to save artifacts)
- An instance of an SQLAlchemyStore (to store MLflow entities in a SQLite file, mlruns.db)
To achieve this, we will have to do the following:
- Install SQLite
- Modify the script
To modify the script, we just have to add one line at the top:
# loading dataset
from pycaret.datasets import get_data
data = get_data('insurance')

# set mlflow tracking uri
import mlflow
mlflow.set_tracking_uri('sqlite:///mlruns.db')

# init setup
from pycaret.regression import *
s = setup(data, target = 'charges', session_id = 123, log_experiment = True, experiment_name = 'insurance3')

# compare baseline
best = compare_models()
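One detail worth noting: for the dashboard to show the runs and models stored in SQLite, launch the UI against the same backend store (the --backend-store-uri flag is part of the standard MLflow CLI):
# launch the MLflow UI pointed at the SQLite backend store
mlflow ui --backend-store-uri sqlite:///mlruns.db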
Now if you open the dashboard and click Register Model, the model will register successfully.
If you click on Models towards the top, you will see all the registered models:
In this case, we only have one version of this model.
Why did we register our model? Well, the first benefit is version control, but another benefit is that we can now use MLflow's native serving capabilities to generate predictions from the model. This is how we do it:
# score a registered model by name and version
import mlflow

def score_model(data, model_name, model_version):
    mlflow.set_tracking_uri('sqlite:///mlruns.db')
    model_uri = "models:/{}/{}".format(model_name, model_version)
    model = mlflow.pyfunc.load_model(model_uri)
    return model.predict(data)
Now let’s load the dataset:
# load dataset
from pycaret.datasets import get_data
data = get_data('insurance')
data.drop('charges', axis = 1, inplace=True)

# score data
score_model(data, 'my_first_model', 1)
Very simple.
There is one more scenario for running MLflow. Imagine you want to collaborate with other data scientists and engineers working on the project. Tracking experiments is great, but what is the benefit if other people cannot see those metrics, models, etc.? In this scenario, the tracking server, backend store, and artifact store all reside on remote hosts.
A Fully Configured, Free MLflow Remote Server
DagsHub falls under this category: the backend store and artifact store live on remote hosts and are provided to you as a managed service for free.
Now let’s modify our script to integrate it with the DagsHub project that we created.
# loading dataset
from pycaret.datasets import get_data
data = get_data('insurance')

# set environment variables
import os
os.environ['MLFLOW_TRACKING_USERNAME'] = 'username'
os.environ['MLFLOW_TRACKING_PASSWORD'] = 'password'
os.environ['MLFLOW_TRACKING_URI'] = 'https://dagshub.com/moez.ali/insurance-app.mlflow'

# set mlflow tracking uri
import mlflow
mlflow.set_tracking_uri("https://dagshub.com/moez.ali/insurance-app.mlflow")

# init setup
from pycaret.regression import *
s = setup(data, target = 'charges', session_id = 123, log_experiment = True, experiment_name = 'insurance-exp1')

# compare baseline
best = compare_models()
Now you can head over to https://dagshub.com/moez.ali/insurance-app.mlflow and you will see the same MLflow dashboard that you have worked with locally.
We can now register the model through the UI, similar to what we did locally above. Click on Gradient Boosting Regressor, since that’s the best model we want to deploy, and repeat the same steps as before.
Click on Models towards the top (next to Experiment) and you will see the registered model appear here.
MLflow Model Deployment on AWS Sagemaker
Now for the final part, where we deploy a model from the MLflow Model Registry to an AWS Sagemaker endpoint with one simple command.
First, we have to download the model from Model Registry hosted on DagsHub. Navigate to this link to get the Run ID of the model.
Run the following commands to download the model from DagsHub.
# set env variables
import os
os.environ['MLFLOW_TRACKING_USERNAME'] = 'username'
os.environ['MLFLOW_TRACKING_PASSWORD'] = 'password'
os.environ['MLFLOW_TRACKING_URI'] = 'https://dagshub.com/moez.ali/insurance-app.mlflow'

# set mlflow tracking uri
import mlflow
mlflow.set_tracking_uri("https://dagshub.com/moez.ali/insurance-app.mlflow")

# download the artifact to the current directory
logged_model = 'runs:/647b3bf77c3c46c4b56d68c16b8ae558/model'
mlflow.artifacts.download_artifacts(logged_model, dst_path='.')
Now that the model is downloaded locally, we are ready to build and push the image to Sagemaker.
If you have used Sagemaker in the past, you already have AWS IAM roles set up, but if you are doing this for the first time, follow this documentation to set up your IAM roles for Sagemaker.
Next, run aws configure on the command line to authenticate your machine to access AWS services in your account.
There are two steps to deployment in AWS Sagemaker. First is to push the Docker image, which you can do by running one simple command in the terminal:
mlflow sagemaker build-and-push-container
Once the image is built, it will show up in the AWS ECR that you can access through the AWS console.
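If you want to double-check from the command line instead, the pushed image can be listed with the AWS CLI (this assumes the default repository name, mlflow-pyfunc, that the build-and-push command creates):
# list images in the default mlflow-pyfunc repository
aws ecr describe-images --repository-name mlflow-pyfunc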
Next is a simple script that uses the deploy function from the mlflow.sagemaker API to create an endpoint in AWS Sagemaker.
import mlflow.sagemaker as mfs

# define variables (all values below are placeholders)
region = "region"
run_id1 = "mlflow_run_id"
model_uri = "runs:/" + run_id1 + "/model"
image_ecr_url = "image_url"
arn = "arn_number"
app_name = "insurance-app"

# deploy to sagemaker
mfs.deploy(app_name=app_name, model_uri=model_uri, image_url=image_ecr_url, region_name=region, mode="create", execution_role_arn=arn)
Boom! We are done here. This is how you can send a request to the endpoint to obtain predictions:
# load unseen dataset
from pycaret.datasets import get_data
df = get_data('insurance')
df.drop('charges', axis=1, inplace=True)

# predict on the first row of the dataset
payload = df.iloc[[0]].to_json(orient="split")

# send data to endpoint
import boto3
import json

endpoint_name = "insurance-app"  # matches the app_name used during deployment
runtime = boto3.client('runtime.sagemaker')
runtime_response = runtime.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/json', Body=payload)
result = json.loads(runtime_response['Body'].read().decode())
print(f'Payload: {payload}')
print(f'Prediction: {result}')
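When you are done experimenting, you may want to tear the endpoint down to avoid ongoing AWS charges. A small sketch using the same mlflow.sagemaker API (it assumes the mfs alias and variables from the deployment script above):
# delete the Sagemaker endpoint and its resources
mfs.delete(app_name=app_name, region_name=region)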
To recap, in this tutorial we have:
- Performed a machine learning experiment using PyCaret and developed an end-to-end machine learning pipeline.
- Used MLflow to log experiment metrics and parameters.
- Used the MLflow Model Registry to register model endpoints on-premises.
- Used the MLflow Sagemaker integration to deploy an API on the AWS cloud with just one command.
- Used DVC to version control the data files.
- Used DagsHub to manage the entire cycle.
Conclusion
The latest tooling in MLOps is very exciting and opens up many possibilities for automating the machine learning process. The ease of use of these tools is driving community adoption at a large scale.
PyCaret, MLflow, DVC, and DagsHub are all very useful frameworks by themselves. Combining them into a lightweight machine learning platform for experimentation and a few layers of MLOps is fast and easy.
Liked the blog? Connect with Moez Ali
Moez Ali is an innovator and technologist. A data scientist turned product manager dedicated to creating modern and cutting-edge data products and growing vibrant open-source communities around them.
Creator of PyCaret, 100+ publications with 500+ citations, keynote speaker and globally recognized for open-source contributions in Python.
Let’s be friends! Connect with me:
👉 LinkedIn
👉 Twitter
👉 Medium
👉 YouTube
🔥 Check out my brand new personal website: https://www.moez.ai.
To learn more about my open-source work on PyCaret, you can check out this GitHub repo or follow PyCaret’s official LinkedIn page.
Listen to my talk on Time Series Forecasting with PyCaret in DATA+AI SUMMIT 2022 by Databricks.