A step-by-step data scientist’s guide to integrating MLOps into the machine learning experimentation lifecycle
Introduction
The goal of Machine Learning Operations, or “MLOps”, is to simplify the process of deploying and monitoring machine learning models in production. Data scientists, machine learning engineers, and data engineers collaborate to make up the MLOps function. The term “MLOps” combines “machine learning” with “development operations” (DevOps), a practice from software engineering.
MLOps has the potential to cover everything from the data to the models. In some organizations, MLOps is used only for the deployment of machine learning models. In mature organizations, however, MLOps is also applied to other stages of the machine learning development lifecycle, such as data processing, feature engineering, and experiment management.
In this article I will touch upon one such area of MLOps, known as experiment tracking: the process of logging all metrics, parameters, and artifacts during the experimentation stage.
Experiment Tracking
Experiment tracking is the process of saving all experiment related information (Metadata). This normally includes:
- Model parameters
- Model metrics
- Model artifacts
- Environment variables / configuration files
Experiment tracking organizes all the metadata from machine learning experiments in one place and makes it easy to compare different experiments. Model versioning enables reproducibility of experiments with no extra effort. Experiment tracking also improves collaboration within the project and standardizes processes across the enterprise.
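To make this concrete, here is a minimal sketch of logging this metadata by hand with MLflow (something we will later get for free through PyCaret). The parameter name, metric value, and artifact file name are illustrative only:
# a minimal, hand-rolled experiment tracking sketch with MLflow
# (parameter name, metric value, and file name are illustrative)
import mlflow

with mlflow.start_run(run_name = 'baseline'):
    mlflow.log_param('n_estimators', 100)   # model parameters
    mlflow.log_metric('auc', 0.84)          # model metrics
    mlflow.log_artifact('config.yaml')      # artifacts / configuration files
Every run logged this way becomes a record you can query and compare later, which is exactly what the tools below automate for us.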
What will you learn in this tutorial?
- Build end-to-end machine learning pipelines using PyCaret.
- Log experiments (metrics, parameters, model artifacts) using MLflow (integrated within PyCaret).
- Use Data Version Control (DVC) to version control data files.
- Use DagsHub to host the project (MLflow and DVC integrated within DagsHub).
👉 What is PyCaret?
PyCaret is an open-source, low-code machine learning library which we will use to build an end-to-end machine learning pipeline in Python. PyCaret is known for its ease of use, simplicity, and ability to quickly and efficiently build and deploy end-to-end machine learning pipelines.
To learn more about PyCaret, check out this link.
👉 What is MLflow?
MLflow is an open source platform to manage the machine learning lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components:
- MLflow Tracking: recording and querying experiments (code, data, configurations, and results)
- MLflow Projects: packaging data science code in a reusable, reproducible form
- MLflow Models: deploying machine learning models in diverse serving environments
- MLflow Model Registry: storing, annotating, and managing models in a central repository
To learn more about MLflow, check out this link.
👉 What is Data Version Control (DVC)?
Data Version Control (DVC) is data versioning, workflow, and experiment management software that builds upon Git. DVC reduces the gap between established engineering toolsets and data science needs, allowing users to take advantage of new features while reusing existing skills and intuition.
In simple words, Git is used for versioning code, while DVC is the natural fit for versioning data, because in machine learning you iterate over data as much as you iterate over code. To learn more about DVC, check out this link.
👉 What is DagsHub?
DagsHub is a GitHub for machine learning. It is a platform for data scientists and machine learning developers to version their data, models, experiments, and code. It provides an experience similar to that of GitHub for machine learning, allowing you and your team to quickly share, review, and reuse the work that you create.
DagsHub is built on popular open-source tools and formats, making it easy to integrate with the tools you already use. To learn more about DagsHub, check out this link.
Now that we have introduced all four tools we will be using in this tutorial (PyCaret, MLflow, DVC, and DagsHub), let’s head over to the problem statement in the next section.
Machine Learning Problem Statement
Retaining existing customers is one of the most important key performance indicators (KPIs) for businesses that operate using a subscription-based business model. The SaaS industry is particularly difficult to compete in because customers have the freedom to select from a large number of service providers. A single negative interaction is all it takes for a consumer to decide to switch to a competitor, resulting in customer churn.
The percentage of a company’s clientele that stops purchasing the company’s goods or services during a specified period of time is referred to as “customer churn.” One method for determining a company’s churn rate is to divide the total number of customers who cancelled their subscriptions during a specific time period by the total number of subscribers who were still active at the start of the period. For instance, if you had 1000 clients at the beginning of the month and lost 50 of them, your churn rate for the month would be 5%.
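The same arithmetic in code, using the numbers from the example above:
# churn rate = customers lost during the period / customers at the start of the period
customers_at_start = 1000
customers_lost = 50
churn_rate = customers_lost / customers_at_start
print(f'Monthly churn rate: {churn_rate:.1%}')  # 5.0%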
Dataset
In this tutorial, I am using a Telecom Customer Churn dataset from Kaggle. You can read this dataset directly from this GitHub link. (Shoutout to srees1988)
# import libraries
import pandas as pd
import numpy as np

# read csv data
data = pd.read_csv('https://raw.githubusercontent.com/srees1988/predict-churn-py/main/customer_churn_data.csv')
Exploratory Data Analysis (EDA)
Let’s do some basic EDA on the data before modeling churn.
# check the data types
data.dtypes
Notice that TotalCharges is an object instead of a float. This is because missing values are coded as blank strings, causing the column to be typed as object instead of float. Let’s fix this problem first.
# replace blanks with np.nan
data['TotalCharges'] = data['TotalCharges'].replace(' ', np.nan).astype('float64')

# check dtypes after the fix
data.dtypes
Now, let’s check if there are missing values in the data:
# check missing values
data.isnull().sum()
We will not remove these, as PyCaret will automatically impute the missing values before training the model (more on controlling this below). Since this is a classification problem, let’s also check the target balance:
# check the target balance
data['Churn'].value_counts(normalize = True).plot.bar()
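On the imputation point above: PyCaret’s setup function exposes parameters to control how missing values are filled. A minimal sketch (these particular choices are illustrative only, not what we use in the rest of this tutorial):
# example: explicitly controlling imputation in setup (illustrative, not used below)
from pycaret.classification import setup
s = setup(data, target = 'Churn',
          numeric_imputation = 'median',    # instead of the default 'mean'
          categorical_imputation = 'mode')  # fill categorical gaps with the mode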
Model Training and Selection
Let’s start the modeling process by initializing the setup function in PyCaret. This function takes care of all the data preprocessing and cleaning automatically and generates processed train and test sets for the next step, i.e. model training and selection.
# initialize setup
from pycaret.classification import *
s = setup(data, target = 'Churn', session_id = 123, ignore_features = ['customerID'], log_experiment = True, experiment_name = 'churn1')
The setup function takes the dataframe, the name of the target variable, and any other preprocessing settings, all of which are optional. Check out all the preprocessing options available in PyCaret.
In the example above we also passed log_experiment = True and experiment_name = 'churn1'. This will automatically log experiments using MLflow. Once setup has completed successfully, we are ready to do model selection with one simple command.
# compare models
best_model = compare_models()
The compare_models function trains all the models in the model library and evaluates their performance using cross-validation. From the output of this function we can see that, in this use-case, the Ada Boost Classifier is the best performing model based on Accuracy and AUC. The trained scikit-learn model is also returned by this function.
print(best_model)
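compare_models also accepts optional arguments; for example, you can sort the leaderboard by a different metric or keep the top few models instead of one. A short sketch (choosing AUC and the top 3 here is illustrative):
# sort the leaderboard by AUC and return the top 3 models instead of one
top3 = compare_models(sort = 'AUC', n_select = 3)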
There are a few things you can do at this point if you want to improve the performance of the model; tuning hyperparameters is one of them. The tune_model function in PyCaret can automatically tune the hyperparameters. Check out all the functions available in PyCaret.
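A minimal sketch of such a call (optimizing for AUC here is just one possible choice):
# tune hyperparameters of the best model, optimizing for AUC
tuned_model = tune_model(best_model, optimize = 'AUC')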
At this point, if you want, you can save the final pipeline using the save_model function and deploy the pipeline in your environment of choice.
# save pipeline
save_model(best_model, 'my_first_pipeline')
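The saved pipeline can then score new data anywhere via load_model and predict_model. A sketch, where new_data is assumed to be a dataframe with the same columns as the training data:
# score unseen data with the saved pipeline
from pycaret.classification import load_model, predict_model
pipeline = load_model('my_first_pipeline')
predictions = predict_model(pipeline, data = new_data)  # new_data: same schema as training data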
Let’s visualize the pipeline:
# load pipeline from file
my_pipeline = load_model('my_first_pipeline')

# sklearn settings to render diagram
from sklearn import set_config
set_config(display = 'diagram')

# display pipeline (in a notebook, evaluating the object renders the diagram)
my_pipeline
Because we passed log_experiment = True in the setup function, PyCaret has logged everything using MLflow. There is a nice dashboard you can use to track your experiments. Run the following command in your Notebook or on the command line and head over to localhost:5000.
# start the mlflow ui
!mlflow ui
This is great. We now have a dashboard that tracks all the hyperparameters, metrics, and artifacts for all the models we have trained, with no extra effort. However, notice that this server is hosted on localhost:5000, i.e. your own computer. What if you are working in a team and want other members to collaborate on the same project, or even just read the information from the dashboard? This will not be possible unless you set up a remote server.
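For reference, one way to do that yourself is the mlflow server command; a minimal sketch (the backend store and artifact root below are placeholders you would adapt):
# run a shared MLflow tracking server that teammates can reach
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns --host 0.0.0.0 --port 5000
Managing that server, its storage, and its access control yourself is exactly the overhead that DagsHub removes, as we will see below.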
At this point, if we want, we can create a Git repo and commit our Notebook to it. This is what the repo looks like:
Now let’s redo this experiment and include DagsHub this time. This will give you a great idea of how DagsHub can add value. For this you will have to sign up for DagsHub; it is free. You will notice that the interface of DagsHub looks like GitHub. This is not a coincidence; it’s by design.
Step 1 — Create New Repository
You can create a new repository by clicking the Create button in the top right corner. If you want, you can connect an existing Git repository, but in this tutorial I am going to click on “New Repository”.
Fill up the details:
Step 2 — Clone the repo locally
I use VS Code, so that is what I am going to use to clone the repo.
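The clone command itself is the standard one; a sketch, assuming the URL of the repository created above:
# clone the DagsHub repo locally (URL shown on the repo page)
git clone https://dagshub.com/moez.ali/customer_churn.git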
This is what it looks like at this point:
Step 3 — Create Data Folder
Create a data folder where we will save our outputs (CSV files) and use DVC to version control the data. This could be raw data, intermediate outputs, final outputs, etc.
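From the repo root this is a single command:
# create the folder that DVC will version control
mkdir data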
Step 4 — Create Notebook to Download Data
Create a Notebook to read the data from the URL and write the raw CSV file and the cleaned CSV file into the data folder.
# read and write raw data
import pandas as pd
import numpy as np

data = pd.read_csv('https://raw.githubusercontent.com/srees1988/predict-churn-py/main/customer_churn_data.csv')
data.to_csv('./data/raw_data.csv', index = False)  # index = False avoids writing an extra index column

# replace blanks with np.nan
data['TotalCharges'] = data['TotalCharges'].replace(' ', np.nan).astype('float64')

# write final data
data.to_csv('./data/final_data.csv', index = False)
This is what the project looks like at this point:
Let’s commit and push the change to the repo.
Step 5 — Configure DVC
In order to version control data files, we have to configure DVC. There are a few commands you have to run in the terminal to configure it. You can copy these commands by clicking the “Remote” button.
Run this on terminal / command prompt:
# init dvc repo
dvc init

# remote add origin
dvc remote add origin https://dagshub.com/moez.ali/customer_churn.dvc

# authenticate dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user moez.ali
dvc remote modify origin --local password YOUR_TOKEN_WILL_BE_HERE
Now we can add the data folder to the DVC repo by running the following commands (note that dvc add creates a data.dvc pointer file and a .gitignore entry, which are what Git actually tracks):
dvc add data
git add data.dvc .gitignore
git commit -m "added data folder to dvc"
git push
dvc push -r origin
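Anyone who clones the repo (including you, on another machine) can now fetch the exact same data with:
# download the DVC-tracked data from the DagsHub remote
dvc pull -r origin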
Now if you refresh the DagsHub repo, you will notice the DVC logo next to the data folder:
The files are now version controlled by DVC and the storage account that is usually needed for DVC is managed and hosted by DagsHub.
DagsHub has an interactive CSV file viewer built in. It also has a custom diff tool built to show changes in data. I have made a few changes in the first line of raw_data.csv to show how the diff is rendered for CSV files.
Step 6 — Create Experiment Notebook
Let’s create a Notebook to do model training and selection using PyCaret, and log the experiment by passing log_experiment = True in the setup function.
# set env variables
import os
os.environ['MLFLOW_TRACKING_USERNAME'] = 'USERNAME'
os.environ['MLFLOW_TRACKING_PASSWORD'] = 'PASSWORD'
os.environ['MLFLOW_TRACKING_URI'] = 'https://dagshub.com/moez.ali/customer_churn.mlflow'
os.environ["PYCARET_CUSTOM_LOGGING_LEVEL"] = "CRITICAL"

# set mlflow tracking uri
import mlflow
mlflow.set_tracking_uri("https://dagshub.com/moez.ali/customer_churn.mlflow")

# read final_data
import pandas as pd
data = pd.read_csv('./data/final_data.csv')
data.head()

# initialize setup
from pycaret.classification import *
s = setup(data, target = 'Churn', session_id = 123, ignore_features = ['customerID'], log_experiment = True, experiment_name = 'churn1')

# model training and selection
best = compare_models()
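As an optional sanity check, you can query the runs programmatically against the same tracking URI; a sketch (mlflow.search_runs with experiment_names requires a reasonably recent MLflow version):
# optional: confirm the runs were logged to the DagsHub-hosted server
runs = mlflow.search_runs(experiment_names = ['churn1'])
print(runs[['run_id', 'status']].head())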
If you now head over to this URL: https://dagshub.com/moez.ali/customer_churn.mlflow, you will see the MLflow server, which is now hosted on DagsHub rather than on your localhost.
DagsHub also has its own experiment tracking UI; if you want, you can use it by clicking on “Experiments” next to “Files”.
Just like the CSV viewer, DagsHub also has a very nice built-in view to render Notebooks:
Conclusion
PyCaret, MLflow, DVC, and DagsHub are all very useful frameworks by themselves. Combining them to create a lightweight machine learning platform for experimentation, with a few layers of MLOps, is really fast and easy. In this tutorial we have seen how to use MLflow for experiment tracking and DVC for version control of data files. In the next tutorial I will show how you can use these tools for the deployment aspect of MLOps.
Thank you for reading!