Build a simple MLOps stack using PyCaret and DagsHub
Introduction
Using PyCaret with DagsHub, you can now log your experiments and artifacts on remote DagsHub servers without changing any code.
With this integration, you can use MLflow on a remote server that DagsHub manages and hosts for free. Multiple people can access and work with the same MLflow experiment runs, enabling better collaboration on projects.
Also, by using a remote MLflow server, your data is backed up and will not be lost in the event of a local machine failure. You can access your experiment runs from anywhere, as long as you have an internet connection. This can be especially useful if you are working on a project from multiple locations.
PyCaret
PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive.
Compared with other open-source machine learning libraries, PyCaret is a low-code alternative that can replace hundreds of lines of code with just a few, making experiments faster and more efficient.
The design and simplicity of PyCaret are inspired by the emerging role of citizen data scientists, a term first used by Gartner.
To learn more about PyCaret, check out the official documentation.
DagsHub
DagsHub is a platform for data scientists and machine learning engineers to version their data, models, experiments, and code. It allows data science teams to easily share, review, and reuse their work, providing a GitHub experience for machine learning.
DagsHub is built on popular open-source tools and formats, making it easy to integrate with the tools you already use. To learn more about DagsHub, check out the official documentation.
Integration: PyCaret and DagsHub
PyCaret provides an out-of-the-box integration with MLflow, enabling users to log experiment metrics, parameters, artifacts, and data locally that can be accessed using the MLflow UI. This is great if you are working alone, but not so much if you want to coordinate with other data scientists and engineers on your team.
To collaborate on MLflow experiments, you must set up a remote URI, configure a database to store model metrics and parameters, and use a file storage system like AWS S3, Azure Blob, etc. Setting up a remote URI and managing the MLflow service on your own can be difficult, time-consuming, and costly in terms of run time and storage.
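For context, a self-managed setup typically means pointing MLflow at a tracking server and artifact store you run yourself before anything is logged. The sketch below illustrates the idea; the server address, database, and S3 bucket are placeholders, not real endpoints:

```python
import os

# Hypothetical self-managed setup: this endpoint is a placeholder, not a
# real server. MLflow clients read the tracking URI from this variable.
os.environ["MLFLOW_TRACKING_URI"] = "http://my-mlflow-server.example.com:5000"

# The tracking server itself must also be launched separately with a
# backend database and an artifact store, e.g. (shell, shown as a comment):
#   mlflow server --backend-store-uri postgresql://user:pw@host/db \
#                 --default-artifact-root s3://my-mlflow-artifacts

print(os.environ["MLFLOW_TRACKING_URI"])
```

All of this provisioning and upkeep is exactly what the DagsHub-hosted server removes.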
This is where integration with DagsHub comes in.
DagsHub provides a remote MLflow server for each repository, enabling users to log experiments with MLflow and view and manage the results and trained models from the built-in UI.
The DagsHub repository also has fully configured object storage for storing data, models, and any large files. These files can be diffed, allowing users to see the differences between different versions of their data and models and better understand the impact of those changes on their results.
With this integration, PyCaret users will be able to log their experiments on a DagsHub-hosted remote MLflow server and easily compare and share them with others.
Additionally, users can use DVC to version raw and processed data, which can then be pushed to DagsHub for viewing, comparison, and sharing.
All of this without changing a single line of code.
How to log experiments using DagsHub?
To use the DagsHub Logger with PyCaret, set log_experiment = 'dagshub' in the setup function.
# install libraries (run in a shell)
pip install --pre pycaret
pip install mlflow dagshub

# load dataset
from pycaret.datasets import get_data
data = get_data('iris')

# initialize setup
from pycaret.classification import *
s = setup(data, target = 'species', session_id = 123,
          log_experiment = 'dagshub', experiment_name = 'project_iris', log_data = True)

# compare base models
best = compare_models()

# save best model
save_model(best, 'best_iris_model')
On running the setup function, it will print a link to authorize your DagsHub account. Click on that link to authorize the connection. Once that is done, you will be asked to enter owner_name/repo_name to complete the setup.
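If you prefer to skip the interactive prompt (for example in a CI job), plain MLflow can also be pointed directly at the repository's tracking endpoint through environment variables. This is a sketch under assumptions: owner_name, repo_name, and the token below are placeholders you would replace with your own values.

```python
import os

# Placeholders -- substitute your own DagsHub user, repo, and access token.
owner, repo = "owner_name", "repo_name"

# Each DagsHub repository exposes its MLflow server at <repo URL>.mlflow
os.environ["MLFLOW_TRACKING_URI"] = f"https://dagshub.com/{owner}/{repo}.mlflow"

# MLflow reads basic-auth credentials from these standard variables.
os.environ["MLFLOW_TRACKING_USERNAME"] = owner
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<your-dagshub-token>"

print(os.environ["MLFLOW_TRACKING_URI"])
```

With these variables set, any MLflow-logging code (including PyCaret's) will write to the remote server without prompting.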
At this point, the repo is initialized on DagsHub. You can go to https://www.dagshub.com/moez.ali/project_iris to check out the project.
Click on the Experiments tab to see all the model runs. This is the built-in DagsHub logger (quite similar to MLflow).
You can click on the runs to see the parameter and metric details:
What about the MLflow logger? Well, it is also there.
Go to https://www.dagshub.com/moez.ali/project_iris.mlflow and you will be able to see the MLflow dashboard.
BOOM! The MLflow server is fully managed and hosted for you by DagsHub for free.
You can now share this link with other people on your team and also work collaboratively with them on the same experiment (if you give them permission to write).
Isn’t it amazing?
Check out this Colab Notebook for a full demo.
To learn more about this integration, you can also read DagsHub official announcement.
Liked the blog? Connect with Moez Ali
Moez Ali is an innovator and technologist. A data scientist turned product manager dedicated to creating modern and cutting-edge data products and growing vibrant open-source communities around them.
Creator of PyCaret, 100+ publications with 500+ citations, keynote speaker and globally recognized for open-source contributions in Python.
Let’s be friends! Connect with me:
👉 LinkedIn
👉 Twitter
👉 Medium
👉 YouTube
🔥 Check out my brand new personal website: https://www.moez.ai.
To learn more about my open-source work on PyCaret, check out the GitHub repo or follow PyCaret’s Official LinkedIn page.
Listen to my talk on Time Series Forecasting with PyCaret in DATA+AI SUMMIT 2022 by Databricks.