The importance of model versioning in machine learning

Doing iterations right


Version control is an essential concept in software development. Version control tools, such as Git, are used to manage and track source code. Developers and organizations alike value source code because it is the backbone of the products they earn revenue from. Model versioning is the application of version control concepts within the machine learning process.

Machine learning and artificial intelligence give businesses and developers an edge over their competitors, which motivates teams to build the best model they can. As developers optimize for speed and accuracy, minimizing errors also becomes a priority. Model versioning is a great tool for all of these purposes.

In this article, you’ll learn more about the importance of model versioning in machine learning by exploring a demo on how to apply it when building ML models and reviewing best practices to follow, limitations to be aware of, and some of the best tools currently available.


What is model versioning?

Model versioning derives its core processes from version control in software development. The difference is that software development runs mainly on code, while machine learning runs on both code and data, and that data is the backbone of the best-performing models.

Model versioning, therefore, is the process of tracking and managing changes made to both source code and data, including metrics, parameters, and hyperparameters. Popular version control platforms like GitHub, Bitbucket, and Perforce aren’t designed to track changes made to data and limit the amount of data they can host due to storage caps. Solutions such as Git Large File Storage (Git LFS) can store large amounts of data but still don’t track changes to data and models the way code changes are tracked.

Applying model versioning when building machine learning applications is important because it allows for the following:

  • Referencing previous versions of the model
  • Reverting changes quickly, making building models less risky
  • Easily reproducing and sharing models among development teams

How model versioning works

A hands-on approach will help you understand how model versioning works. In this guide, MLflow, a platform for tracking, reproducing, deploying, and managing models in a central repository, is used to demonstrate the concept.

To get started, install MLflow and scikit-learn using one of the following methods:

  • pip install mlflow and pip install scikit-learn: These install MLflow and scikit-learn separately. (Note that the package is published on PyPI as scikit-learn, not sklearn.)
  • pip install mlflow[extras]: This installs MLflow with other dependencies, including scikit-learn.

Next, download the sample data set (the Framingham Heart Study data, saved as framingham.csv) and create a file named train.py.

Copy the following code into the file:

# import libraries
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score, accuracy_score, precision_score
from urllib.parse import urlparse

# read the data and separate it into training and test sets (a 0.75/0.25 split)
heart_data = pd.read_csv('framingham.csv')
heart_data.dropna(inplace=True)
X = heart_data.drop(columns=["TenYearCHD"])
y = heart_data["TenYearCHD"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)


with mlflow.start_run():

    # model hyperparameters ('liblinear' is one of the solvers that supports the l1 penalty)
    PENALTY = 'l1'
    C = 1.0
    SOLVER = 'liblinear'

    # scale and fit the data using a pipeline, passing the hyperparameters to the model
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty=PENALTY, C=C, solver=SOLVER)
    )
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    # evaluate the model
    f1 = f1_score(y_pred=y_pred, y_true=y_test, average='binary')
    precision = precision_score(y_pred=y_pred, y_true=y_test, average='binary')
    accuracy = accuracy_score(y_pred=y_pred, y_true=y_test)

    print(f'F1-Score: {f1}')
    print(f'Precision Score: {precision}')
    print(f'Accuracy Score: {accuracy}')

    # log the model's parameters and metrics to MLflow
    mlflow.log_param("Penalty", PENALTY)
    mlflow.log_param("C", C)
    mlflow.log_param("Solver", SOLVER)
    mlflow.log_metric("F1 Score", f1)
    mlflow.log_metric("Precision Score", precision)
    mlflow.log_metric("Accuracy Score", accuracy)

    # logging runs
    tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme

    # the model registry does not work with a file store
    if tracking_url_type_store != "file":
        # Register the model
        # There are other ways to use the Model Registry, which depend on the use case,
        # please refer to the doc for more information:
        # https://mlflow.org/docs/latest/model-registry.html#api-workflow
        mlflow.sklearn.log_model(pipe, "model", registered_model_name="LogisticRegressionHeartModel")
    else:
        mlflow.sklearn.log_model(pipe, "model")

Then, run the script from a terminal in the directory that contains it using python train.py.

Run mlflow ui in the same directory to compare the models produced, and view the page at http://localhost:5000.

You should see the MLflow interface displaying the logged details about the model.

Next, set PENALTY to 'l2', C to 0.5 (scikit-learn requires C to be a positive float), and SOLVER to 'lbfgs' (a solver that supports the l2 penalty), and train the model again by running train.py. If mlflow ui is still running, refresh the page; otherwise, start it again with the same command. MLflow automatically saves the new metrics and parameters, logs them, and displays them alongside the first run.
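
The UI is the most convenient way to compare runs, but the same information can also be pulled programmatically. Here’s a minimal sketch using mlflow.search_runs(), which returns the logged runs as a pandas DataFrame whose columns mirror the parameter and metric names logged in train.py (run it from the same directory so MLflow’s default local file store is picked up):

import mlflow

# returns a pandas DataFrame with one row per logged run
runs = mlflow.search_runs()

# inspect the logged hyperparameters and metrics side by side
print(runs[["run_id", "params.Penalty", "params.C", "params.Solver",
            "metrics.F1 Score", "metrics.Accuracy Score"]])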

MLflow tracks logged objects, such as parameters and metrics, by creating a project directory in which the results of each run are stored in separate folders with unique hash names, making it easy to reference particular objects as needed.

Each time you run the code, a new folder is created. Within these hash-named folders are a different set of folders, such as artifacts and metrics, which contain the logged objects. Here’s an example of what the file structure looks like:
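
(The run ID below is an illustrative placeholder; each run gets its own unique hash.)

mlruns/
└── 0/                          # experiment ID (0 is the default experiment)
    ├── meta.yaml
    └── <run-id-hash>/          # unique hash-named run folder
        ├── artifacts/
        │   └── model/          # the logged scikit-learn pipeline
        ├── metrics/            # one file per logged metric
        ├── params/             # one file per logged parameter
        ├── tags/
        └── meta.yaml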

Benefits of model versioning

ML models are generally built by developers across a variety of teams, which requires a high level of collaboration. Model versioning then becomes necessary as it permits teams to do the following:

  • Build reproducible models: Building machine learning models is nondeterministic. In other words, training the same algorithm as you did in the past can produce a different result if the data, parameters, or hyperparameters have changed in the meantime, or if random seeds differ between runs. Model versioning makes it possible to track the data and everything else involved in the model-building process, allowing models to be conveniently reproduced.
  • Build shareable models: Working on machine learning projects requires collaborative effort, depending on the size and complexity of the project. Model versioning allows teams to share models by saving files in a remote storage location and recording/tracking each file that produced a model, so teams can constantly update and share models among themselves (see the sketch after this list).
  • Ensure proper data governance: Industries that deal with sensitive information, such as healthcare, need data governance to ensure that data is handled with care and that it remains consistent and accurate. Model versioning allows for proper model auditing to ensure compliance with government laws like the GDPR.
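
For example, once a run has been logged as in the demo above, anyone with access to the tracking store can reload the exact pipeline from that run. Here’s a minimal sketch; the run ID is a placeholder you’d copy from the MLflow UI:

import pandas as pd
import mlflow.sklearn

# reload the exact pipeline logged under a given run
# (replace <run_id> with a real run ID copied from the MLflow UI)
model = mlflow.sklearn.load_model("runs:/<run_id>/model")

# score fresh data with the restored model
new_data = pd.read_csv("framingham.csv").dropna().drop(columns=["TenYearCHD"])
print(model.predict(new_data)[:10])

If the model was registered, as in the non-file-store branch of train.py, it can also be loaded by name and version, for example mlflow.sklearn.load_model("models:/LogisticRegressionHeartModel/1").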

Best practices for model versioning

To prevent bugs from reaching production, where they can break the system, here are some best practices to follow:

  • Review models regularly: Regularly reviewing your model will help you spot errors that could otherwise cause it to break or produce inaccurate predictions. Be sure to keep an eye out for outliers and data points that can weaken or change your model’s performance.
  • Regularly delete models that are not in use: Keeping obsolete models uses up space and causes redundancy. Deleting these models provides clarity and ensures that your resources are invested only in models that are currently in use.
  • Test models before deployment: Testing models before deployment is crucial to help you identify errors that could otherwise lead to downtime, increased operating costs, and, for a commercial product, loss of users. A minimal example of such a check follows this list.
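
As a concrete illustration of that last point, here’s a minimal, hypothetical pre-deployment check for the demo model above: it reloads the logged pipeline and fails loudly if accuracy on a held-out split drops below a chosen threshold. The run ID and the 0.8 threshold are assumptions for this sketch, not recommendations:

import pandas as pd
import mlflow.sklearn
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

THRESHOLD = 0.8  # assumed minimum acceptable accuracy for this sketch

# create a held-out split (train.py didn't fix a random seed,
# so this won't be the exact split used during training)
heart_data = pd.read_csv("framingham.csv").dropna()
X = heart_data.drop(columns=["TenYearCHD"])
y = heart_data["TenYearCHD"]
_, X_test, _, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# reload the candidate model (<run_id> is a placeholder)
model = mlflow.sklearn.load_model("runs:/<run_id>/model")

accuracy = accuracy_score(y_test, model.predict(X_test))
assert accuracy >= THRESHOLD, f"Accuracy {accuracy:.3f} is below threshold {THRESHOLD}"
print(f"Model passed the check with accuracy {accuracy:.3f}")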

Limitations of model versioning

Model versioning is still in its early stages, with tools like DVC and MLflow made available for public use in 2017 and 2018, respectively, compared to Git, which was built in 2005. Thus, tools used for versioning—especially open-source tools—are not yet built to support end-to-end model versioning. For instance, some prioritize tracking model changes over robust storage options (or vice versa). Similarly, some model versioning tools come with pipeline management, leading to data redundancy in cases where your team already has a pipeline system in place.

To overcome these limitations, organizations like Uber and Airbnb have built their own internal versioning systems, while others opt to use third-party solutions like Openlayer.

Tools for model versioning

To tackle the challenges of model versioning and solve the limitations of versioning software such as Git, several open-source and enterprise solutions have been built. Here are a few useful tools to consider:

  • Openlayer: Openlayer is a collaborative quality assurance platform for machine learning that allows you to track and version your models with ease. It helps you discover errors when building models, boost model performance, and test your model before deployment. If you’re building an end-to-end machine learning project, it may be just the right tool for you.
  • DVC: Data Version Control (DVC) is an open-source tool used for versioning data sets. DVC commands are similar to Git’s, and the two are used in tandem when building models: DVC versions large data sets, while Git stores the .dvc pointer files used to retrieve data from DVC remote storage. DVC supports popular remote storage options like Amazon S3 and Google Cloud Storage and is a great open-source option for versioning your data (a short sketch of its Python API follows this list).
  • LakeFS: LakeFS is an open-source platform similar to DVC that provides Git-like commands for performing model versioning operations. It also integrates storage options like AWS and GCP to store data. LakeFS minimizes data duplication via a copy-on-write mechanism and is well suited for enterprise use due to its high performance over data lakes of any size.
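
To give a taste of the DVC workflow mentioned above, here’s a sketch of reading a specific version of a data set through DVC’s Python API; the repository URL, file path, and revision tag are all hypothetical:

import pandas as pd
import dvc.api

# open a DVC-tracked file at a specific Git revision;
# the repo URL and the "v1.0" tag are placeholders for this sketch
with dvc.api.open(
    "data/framingham.csv",
    repo="https://github.com/your-org/your-repo",
    rev="v1.0",
) as f:
    heart_data = pd.read_csv(f)

print(heart_data.shape)

Because the revision pins both the code and the .dvc pointer files, every teammate reading "v1.0" gets identical data, which is exactly the reproducibility guarantee discussed earlier.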

Conclusion

Model versioning is a valuable part of the development process for machine learning projects because of the need for collaboration, tracking changes to code and data, and monitoring the model’s performance over time. If you’re considering which tool might be right for you and your team, here are a few guidelines to keep in mind. The ideal tool will have the following features:

  • Ease of use: Model versioning tools should be easy for you and your team to get familiar with. Machine learning is complex enough without these tools—using them should make building models simpler, not more complicated.
  • Good stack integration: Whatever tool you choose to use should integrate well with your tech stack for smooth building. Fortunately, most tools built for model versioning are language-agnostic, so finding one that meets your needs shouldn’t be difficult.
  • Data set structure support: Choosing a tool or platform that supports the kind of data you are working with ensures faster model development. Versioning tabular-based data is different from versioning audio, video, or image files. You’ll want to pick a tool that supports the data type you work with now—as well as any others you plan on working with in the future—to avoid needing to switch platforms later on.

If you’re in the market for a great ML platform that allows you to expertly version and track your models, be sure to consider Openlayer. Get started today to learn how Openlayer can help you with error analysis, synthetic data generation, testing, deploying models, and more.



Oghenefejiro Esosuota

I am a Python developer with 2+ years of experience in data science and machine learning. I love writing technical articles because it helps me learn faster and build my skills. I also enjoy contributing to the developer community by sharing my knowledge and experience.

