Simplify MLOps with PyCaret, MLflow, and DagsHub



The goal of Machine Learning Operations, or “MLOps”, is to simplify the process of deploying and monitoring machine learning models in production. Data scientists, machine learning engineers, and data engineers collaborate to make up the MLOps function. The term “MLOps” is a compound of “machine learning” and “DevOps” (development operations), a practice borrowed from software engineering.

Experiment Tracking

Experiment tracking is the process of saving all experiment-related information (metadata). This normally includes:

  • Model parameters
  • Model metrics
  • Model Artifacts
  • Environment variables / configuration files

What you will learn in this tutorial

  • Build end-to-end machine learning pipelines using PyCaret.
  • Log experiments (metrics, parameters, model artifacts) using MLflow (integrated within PyCaret).
  • Use Data Version Control (DVC) to version control data files.
  • Use DagsHub to host the project (MLflow and DVC integrated within DagsHub).

👉 What is PyCaret?

PyCaret is an open-source, low-code machine learning library which we will use to build an end-to-end machine learning pipeline in Python. PyCaret is known for its ease of use, simplicity, and ability to quickly and efficiently build and deploy end-to-end machine learning pipelines.

👉 What is MLFlow?

MLflow is an open source platform to manage the machine learning lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components: MLflow Tracking, MLflow Projects, MLflow Models, and MLflow Model Registry.

👉 What is Data Version Control (DVC)?

Data Version Control (DVC) is a type of data versioning, workflow, and experiment management software, that builds upon Git. DVC reduces the gap between established engineering toolsets and data science needs, allowing users to take advantage of new features while reusing existing skills and intuition.

👉 What is DagsHub?

DagsHub is a GitHub for machine learning. It is a platform for data scientists and machine learning developers to version their data, models, experiments, and code. It provides an experience similar to that of GitHub for machine learning, allowing you and your team to quickly share, review, and reuse the work that you create.


Machine Learning Problem Statement

Retaining existing customers is one of the most important key performance indicators (KPIs) for businesses that operate using a subscription-based business model. The SaaS industry is particularly difficult to compete in because customers have the freedom to select from a large number of service providers. A single negative interaction is all it takes for a consumer to decide to switch to a competitor, resulting in customer churn.


In this tutorial, I am using a Telecom Customer Churn dataset from Kaggle. You can read this dataset directly from this GitHub link. (Shoutout to srees1988)

# import libraries
import pandas as pd
import numpy as np
# read csv data
data = pd.read_csv('')
Sample Dataset

Exploratory Data Analysis (EDA)

Let’s do some basic EDA on the data before modeling the churn.

# check the data types
data.dtypes
# replace blanks with np.nan and fix the data type
data['TotalCharges'] = data['TotalCharges'].replace(' ', np.nan).astype('float64')
# check dtypes after the fix
data.dtypes
# check missing values
data.isnull().sum()
# check the target balance
data['Churn'].value_counts(normalize = True)
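The TotalCharges fix above can be illustrated on a toy frame (the values below are made up): blank strings become NaN and the column becomes numeric, after which the target balance can be checked the same way as above.

```python
import pandas as pd
import numpy as np

# toy frame mimicking blank strings in TotalCharges (values are made up)
df = pd.DataFrame({'TotalCharges': ['29.85', ' ', '108.15'],
                   'Churn': ['No', 'Yes', 'No']})

# blanks become NaN and the column becomes float64
df['TotalCharges'] = df['TotalCharges'].replace(' ', np.nan).astype('float64')
print(df['TotalCharges'].isnull().sum())           # one missing value

# normalized target balance, as used above
print(df['Churn'].value_counts(normalize = True))
```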

Model Training and Selection

Let’s start the modeling process by initializing the setup function in PyCaret. This function takes care of all the data preprocessing and cleaning automatically and generates processed train and test sets for the next step, i.e. model training and selection.

# initialize setup
from pycaret.classification import *
s = setup(data, target = 'Churn', session_id = 123, ignore_features = ['customerID'], log_experiment = True, experiment_name = 'churn1')
Output from setup function (truncated…)
# compare models
best_model = compare_models()
Output from compare_models function
# save pipeline
save_model(best_model, 'my_first_pipeline')
Output from save_model(…)
# load pipeline from file
my_pipeline = load_model('my_first_pipeline')
# sklearn settings to render diagram
from sklearn import set_config
set_config(display = 'diagram')
# display pipeline
my_pipeline
Pipeline diagram (output truncated…)
# start the mlflow ui
!mlflow ui
A sample repo on DagsHub

Step 1 — Create New Repository

You can create a new repository by clicking the Create button in the top right corner. You can also connect an existing Git repository, but in this tutorial I will click “New Repository”.

New Repository

Step 2 — Clone the repo locally
Cloning the repo in VS Code

Step 3 — Create Data Folder

Create a data folder where we will save our outputs (csv files) and use DVC to version control the data. This could be raw data, intermediate output, final output, etc.

Step 4 — Create Notebook to Download Data

Create a notebook that reads the data from the URL and writes the raw csv file and the cleaned csv file to the data folder.

# import libraries
import pandas as pd
import numpy as np
# read raw data
data = pd.read_csv('')
# write raw data (file name assumed)
data.to_csv('./data/raw_data.csv', index = False)
# replace blanks with np.nan and fix the data type
data['TotalCharges'] = data['TotalCharges'].replace(' ', np.nan).astype('float64')
# write final data
data.to_csv('./data/final_data.csv', index = False)

Step 5 — Configure DVC

In order to version control data files we have to configure DVC. There are a few commands you have to run in a command-line terminal; you can copy them by clicking the “Remote” button.
# init dvc repo
dvc init
# remote add origin
dvc remote add origin
# authenticate dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user moez.ali
dvc remote modify origin --local password YOUR_TOKEN_WILL_BE_HERE
# add the data folder to dvc
dvc add data
# commit the generated .dvc file to git
git add data.dvc .gitignore
git commit -m "added data folder to dvc"
git push
dvc push -r origin

Step 6 — Create Experiment Notebook

Let’s create a notebook to do model training and selection using PyCaret, and log the experiment by passing log_experiment = True in the setup function.

# set env variables
import os
os.environ['MLFLOW_TRACKING_URI'] = ''
# set mlflow tracking uri
import mlflow
mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_URI'])
# read final_data
import pandas as pd
data = pd.read_csv('./data/final_data.csv')
# initialize setup
from pycaret.classification import *
s = setup(data, target = 'Churn', session_id = 123, ignore_features = ['customerID'], log_experiment = True, experiment_name = 'churn1')
# model training and selection
best = compare_models()
Output from compare_models()
MLFlow tracking
Comparing training time and accuracy of different models
Notebook rendering on DagsHub


PyCaret, MLflow, DVC, and DagsHub are all very useful frameworks by themselves. Combining them to create a lightweight machine learning platform for experimentation and a few layers of MLOps is fast and easy. In this tutorial we have seen how we can use MLflow for experiment tracking and DVC for version control of data files. In the next tutorial I will show how you can use these tools for the deployment aspect of MLOps.


