MLOps is a discipline aimed at increasing the number of machine learning and data science projects that successfully mature into production. This article covers what MLOps is, its goal, principles, business benefits, and tools, and what you need to know to pursue a career in the field.
"87% of data science projects never make it into production", VentureBeat AI reported in 2019. Lack of access to adequate data, a shortage of the right talent, solving the wrong problems, and following a flawed model development process are some of the many issues responsible for this high rate of failure.
Of these four issues, a flawed development process is often the most damaging to the maturity of a data science or machine learning project, from data sourcing through production, especially for commercial ML solution providers.
Commercial ML solution providers also face serious issues with moving to the cloud, creating and managing ML pipelines at scale, deploying and automating model development workflows, and making the ML solution available to a large number of users.
A similar problem used to be prevalent in software development, where commercial software providers struggled to make software available to users at scale. DevOps addressed this by introducing a set of practices, techniques, and tools to develop, test, and deploy software at scale.
Similar to DevOps, MLOps introduces a set of practices, principles, and tools to the end-to-end development of machine learning models.
Wikipedia defines MLOps as "the set of practices at the intersection of Machine Learning, DevOps and Data Engineering." It is a practice that promotes the collaboration of hybrid teams—data science/ML engineering and Operations teams—to build and deploy models that serve real business needs.
MLOps, short for machine learning operations, has its roots in ModelOps, a broader concept that refers to the operationalization of all types of artificial intelligence models and that also builds on DevOps.
Because MLOps is, as Wikipedia says, an intersection, it shares many of its practices and methodologies with DevOps, such as continuous integration, continuous deployment, continuous monitoring, and testing. However, because some challenges are unique to machine learning model development, MLOps executes these practices a little differently.
The machine learning development workflow consists of three constituent pipelines: data pipeline, ML pipeline, and application pipeline. MLOps implements its principles at each pipeline level to make the model development cycle reliable.
Testing in regular software takes different forms, such as unit testing, penetration testing, and security testing. In addition to these, MLOps tests machine learning systems by training and validating the model at each of the data, ML model, and application pipelines, both to ensure that the system is reliable and to validate that the feature is useful when it eventually gets to production.
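As a minimal sketch of the kind of model validation test described above, the snippet below scores predictions against held-out labels and blocks a model that falls short. The function names and the 0.8 accuracy threshold are assumptions made for illustration, not part of any particular MLOps tool.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return the fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot score an empty validation set")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def passes_validation_gate(
    predictions: Sequence[int],
    labels: Sequence[int],
    min_accuracy: float = 0.8,  # assumed threshold; pick one that fits the use case
) -> bool:
    """Return True only if the model clears the minimum accuracy."""
    return accuracy(predictions, labels) >= min_accuracy
```

A pipeline could call `passes_validation_gate` after every training run and refuse to move the model forward when it returns `False`.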
Keeping track of changes to, and the performance of, machine learning models was another prevalent issue in enterprise ML workflows before MLOps. The MLOps monitoring principle calls for periodic checks on the model's dependencies, usage, and performance to ensure that it serves as expected. MLOps encourages recording the model's desired behavior in advance and using it as a benchmark, so that the necessary action can be taken when the model underperforms or spikes irregularly.
Due to the experimental and unstable nature of data used in ML model training, a significant number of events can bring about changes in data or anomalies in the model behavior. MLOps introduces versioning of the machine learning codebase in version control systems such as Git. This makes it easy to revert to a previous version or to know exactly which version of the code is at fault when a problem arises.
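The benchmark idea can be sketched in a few lines. In this hypothetical example, a live metric is compared with a pre-recorded benchmark, and anything outside an assumed tolerance band is flagged; the status names and the 0.05 tolerance are illustrative choices only.

```python
from typing import List, Sequence, Tuple


def check_against_benchmark(
    observed: float, benchmark: float, tolerance: float = 0.05
) -> str:
    """Classify a live metric relative to a pre-recorded benchmark."""
    if observed < benchmark - tolerance:
        return "underperforming"
    if observed > benchmark + tolerance:
        return "irregular_spike"
    return "ok"


def alerts_for(
    readings: Sequence[float], benchmark: float, tolerance: float = 0.05
) -> List[Tuple[int, str]]:
    """Return (index, status) for every reading outside the tolerance band."""
    alerts = []
    for index, value in enumerate(readings):
        status = check_against_benchmark(value, benchmark, tolerance)
        if status != "ok":
            alerts.append((index, status))
    return alerts
```

Each alert could then trigger whatever action the team has agreed on, such as a rollback or a retraining run.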
One of the beauties of DevOps that is adopted in MLOps is the continuous workflow. Machine learning models are temporary and subject to change based on their use cases as soon as new data is available. MLOps allows for easy implementation of the ML engineering processes, including continuous integration (CI), continuous delivery (CD), continuous testing (CT), and continuous monitoring (CM).
Unlike conventional continuous integration in DevOps, continuous integration in MLOps involves testing and validating data, data schemas, and models; continuous deployment is about deploying an ML pipeline that can automatically deploy another model prediction service or roll back changes from a model. MLOps' continuous testing also iteratively retrains, validates, and serves ML models.
Probably the most important principle of MLOps, as in DevOps, is automation. To successfully implement MLOps in your machine learning model development workflow, you must integrate automation. Automation in MLOps comes in three stages: the manual stage, the ML pipeline automation stage, and the CI/CD pipeline automation stage.
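To make the data and schema validation step more concrete, here is a small sketch of a check that a CI job could run before training. The schema, column names, and error messages are hypothetical; real projects often use a dedicated validation library instead.

```python
from typing import Any, Dict, List

# Hypothetical schema: column name -> expected Python type.
SCHEMA: Dict[str, type] = {"age": int, "income": float, "label": int}


def validate_rows(rows: List[Dict[str, Any]], schema: Dict[str, type]) -> List[str]:
    """Return a list of human-readable schema violations (empty if none)."""
    errors: List[str] = []
    for index, row in enumerate(rows):
        # Report columns the schema requires but the row lacks.
        for column in sorted(set(schema) - set(row)):
            errors.append(f"row {index}: missing column '{column}'")
        # Report columns whose value has the wrong type.
        for column, expected in schema.items():
            if column in row and not isinstance(row[column], expected):
                actual = type(row[column]).__name__
                errors.append(
                    f"row {index}: column '{column}' expected "
                    f"{expected.__name__}, got {actual}"
                )
    return errors
```

A CI job could fail the build whenever `validate_rows` returns a non-empty list, so that malformed data never reaches the training step.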
Being the first stage for many small teams, the manual stage involves the normal machine learning process where models are manually validated, tested, and executed iteratively to train the model for subsequent automated operations.
At the ML pipeline automation stage, continuous training is introduced to the model. When new data is available, the validation and retraining of the machine learning model are automatically triggered without any manual intervention.
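The trigger described here can be sketched as a single function. Everything in this example is an assumption made for illustration: the minimum row count, the `train_fn` and `evaluate_fn` callables, and the rule that a retrained model is promoted only when it scores higher than the current one.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def maybe_retrain(
    current_score: float,
    new_rows: Sequence[Any],
    train_fn: Callable[[Sequence[Any]], Any],
    evaluate_fn: Callable[[Any], float],
    min_new_rows: int = 100,  # assumed trigger threshold
) -> Tuple[str, Optional[Any]]:
    """Retrain when enough new data has arrived, then validate the result.

    Returns ("skipped", None) if there is too little new data,
    ("promoted", model) if the retrained model beats the current score,
    and ("rejected", None) otherwise.
    """
    if len(new_rows) < min_new_rows:
        return "skipped", None
    candidate = train_fn(new_rows)
    if evaluate_fn(candidate) > current_score:
        return "promoted", candidate
    return "rejected", None
```

In practice a scheduler or a data-arrival event would call something like `maybe_retrain`, which is what removes the manual step.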
The last stage, CI/CD pipeline automation, builds on the success of the two previous steps. Like in DevOps, continuous integration and delivery are introduced in the third stage to build, test, and deploy machine learning models automatically and continuously.
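As a rough sketch of what "build, test, and deploy automatically" means in code, the runner below executes named stages in order and stops at the first failure. The stage names and the convention that each step returns `True` on success are assumptions for this example; real CI/CD systems express the same idea in their own configuration formats.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: Sequence[Stage]) -> Tuple[List[str], Optional[str]]:
    """Run stages in order and stop at the first one that fails.

    Returns the names of the completed stages and the name of the failed
    stage, or None if every stage succeeded.
    """
    completed: List[str] = []
    for name, step in stages:
        if not step():
            return completed, name
        completed.append(name)
    return completed, None
```

Because a failed test stage stops the run, the deploy stage is never reached with a model that did not pass.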
After automation, reproducibility is the ultimate operational principle of MLOps. For MLOps to be successfully applied in machine learning, the design, data processing, model training, deployment, and other machine learning artifacts should be stored so that the model can be easily reproduced given the same input data.
In short, the goal of MLOps is to integrate automation and collaboration into the step-by-step workflow of ML model development.
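One simple way to think about reproducibility is to record a fingerprint of everything that went into a run and to fix the random seed. The sketch below assumes the data and configuration are JSON-serializable, and `train_toy_model` is a stand-in for real training, not an actual model.

```python
import hashlib
import json
import random
from typing import Any, Dict, List


def run_fingerprint(data: List[float], config: Dict[str, Any], seed: int) -> str:
    """Hash the data, configuration, and seed so a run can be identified later."""
    payload = json.dumps(
        {"data": data, "config": config, "seed": seed}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def train_toy_model(data: List[float], seed: int) -> float:
    """Stand-in for training: the mean of a seeded random sample of the data."""
    rng = random.Random(seed)  # a local generator keeps the run deterministic
    sample = rng.sample(data, k=max(1, len(data) // 2))
    return sum(sample) / len(sample)
```

Two runs with the same data, configuration, and seed produce the same fingerprint and the same toy model, which is the property a stored artifact should let you verify.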
To safely implement MLOps principles and reach the full business potential of machine learning, the entire model development workflow is grouped into 3 iterative phases—design, model development, and operations—with MLOps principles that streamline each process.
The design phase is the very beginning of model development. The application's potential users, a suitable business model, and a machine learning solution to solve the user's problem are designed in this phase. The software's possible use cases are also developed, and a check for the availability of necessary data to train the machine learning model is performed. Information gathered from each of these processes is then used to design the architecture of the ML-powered application.
In the experimentation and development phase, the output of the design phase is put to the test to validate that the proposed ML solution works in practice. A proof of concept (POC) is implemented and run iteratively on the machine learning algorithm until a stable model that can run in production is achieved.
With consistent integration of proven MLOps techniques, the stable model attained in the previous phase is then delivered into production. Testing, versioning, automation, continuous deployment, monitoring, and governance are applied to the machine learning model in this phase.
By applying the principles discussed earlier to each phase of the machine learning model development workflow, MLOps aims to make machine learning models easier to bring to production. It creates a solid end-to-end model development framework to help businesses design more efficient workflows that save time, reduce cost, and improve the customer experience while tapping into new revenue sources.
MLOps fosters rapid innovation, faster time to market, effective machine learning lifecycle management, consistency, reproducibility of machine learning workflows, and lower failure rate through continuous integration, deployment, delivery, monitoring, and testing of ML models.
Ultimately, MLOps helps experts concentrate on their field of specialization to drive business benefits, rather than spending a long time building a single solution due to ineffective production practices.
MLOps is applicable to any machine learning model development workflow. Diagnosis in the health industry is one example.
MLOps can be very handy in reusable and collaborative use cases such as health diagnosis systems. In the design phase of the diagnosis model, the model is trained with available data on the symptoms of a type of illness, say cancer, and the architecture of the ML application is designed.
In the experimentation and development phase, the model is tested on patients who have the illness to see whether it correctly identifies it. If not, the model is retrained and tested iteratively until a stable diagnosis model is achieved. After that, the stable model is pushed into deployment. With continuous integration, continuous deployment, continuous testing, and other MLOps principles actively enabled, the model can be retrained, validated, and deployed with new data when new symptoms are identified, without having to manually go through each phase all over again. Also, with MLOps, the model can be continuously monitored to ensure that it does not deviate from its goal of diagnosing cancer, and retrained if it does.
The easiest way to enable MLOps principles in your machine learning model development workflow is to adopt an MLOps platform.
MLOps platforms streamline the implementation of MLOps principles in enterprise machine learning model development workflows. They provide a framework that efficiently automates the administration and management of the end-to-end lifecycle of machine learning models. MLOps platforms also help enterprises establish a cross-functional, automated monitoring and governance system while allowing them to assess the health of the models, make the workflow auditable, and manage access control in real time.
You can build a custom, homegrown MLOps platform for your organization. Still, an easier way to adopt the practice is to use one or more of the open-source or managed MLOps platforms available on the market. To make a good choice, you should know some important features of a good MLOps platform.
A good MLOps platform helps to bootstrap the model development by providing reusable templates and artifacts. It must provide integration with a version control system to track changes to the dependencies, building code, and data used in training the model and enable easy reversal to an old version in case of anomalies. A good MLOps platform must also build and stage the model and start the CI/CD process for deployment. It must provide intuitive monitoring and visibility into the usage and performance of the model. Finally, a good MLOps platform must automate each of the processes mentioned earlier to ensure automatic and continuous testing, building, integration, deployment, delivery, and monitoring throughout the model's lifecycle.
While some platforms offer an end-to-end solution, others focus on a specific aspect, so you need to combine two or more of them to achieve a complete MLOps pipeline. Such platforms are better described as MLOps tools.
Some of the best open-source and managed MLOps platforms available on the market are:
An open-source MLOps tool for implementing iterative CI/CD for machine learning projects. It keeps track of changes to the model and enables continuous training and validation of models.
A full-featured MLOps platform that manages the deployment of machine learning workflows on Kubernetes. It offers a simple, scalable, and portable solution for running machine learning pipelines on Kubernetes. Kubeflow started as a platform for running TensorFlow tasks via Kubernetes but has since evolved into a full-fledged data pipeline experimentation platform that operates on multiple platforms.
A technology-agnostic ML platform for building and deploying machine learning models at scale. It allows management of end-to-end data science workflow in a single, simple and intuitive interface. Cnvrg accelerates the building of machine learning pipelines that are readily deployed in Kubernetes by leveraging available cloud resources. It offers a managed and free open source community version of the platform, which helps data scientists make the most out of their time and resources.
Built by Databricks, MLflow is a popular open-source MLOps platform for managing the machine learning lifecycle. It is designed with four components: MLflow Tracking, Projects, Models, and Model Registry, which together help manage the ML lifecycle from experimentation and reproducibility to deployment.
A production-level MLOps platform that offers fast, secure, and cost-effective end-to-end ML lifecycle management. It provides full automation of ML model deployment and flexible tooling that fosters collaboration between ML engineers and operations teams. It is easy to use and offers advanced security measures, and its GPU support makes it suitable for various use cases, including deep learning.
An end-to-end enterprise-grade platform for teams of data scientists, data engineers, DevOps, and managers to manage experiments and data and to orchestrate the workloads of a machine learning project. The platform integrates with a wide variety of machine learning tools, making it easy to adapt for various teams. It also supports on-prem, private cloud, multi-cloud tenants, and custom configurations.
"Our data scientists could review the automated model training results before deployment and build in a data drift system to automatically look for deviations that might negatively impact our solution," says Sze-Wan Ng, Director of Analytics & Development at TransLink, speaking of the adoption of MLOps in the company.
TransLink is a transit agency based in Vancouver, Canada, that serves over 2.5 million residents across 200 routes. The company struggled to meet a customer request: more accurate bus arrival and departure times.
To solve this challenge, TransLink partnered with Microsoft and T4G to build 16,000 adaptive AI models that collectively react to delays and changes in traffic patterns and provide customers with the most accurate time estimate possible. To support the new AI solution, the teams moved from their traditional software development workflow to MLOps practices on the Azure Machine Learning platform. This allowed the team to set up an automated build and release pipeline that trains and deploys models, an approval system that reviews and approves changes to the model, a data drift system that compares the model's performance to a benchmark to automatically identify a decline in the system, and an integrated data drift pipeline that automatically retrains, builds, and deploys the model pipeline.
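The source does not describe how TransLink's data drift system works internally. As a generic illustration only, one simple way to detect drift is to measure how far the mean of recent data has moved from a reference sample, in units of the reference standard deviation. The function names and the threshold of 2.0 below are assumptions for this sketch.

```python
from statistics import mean, pstdev
from typing import Sequence


def drift_score(reference: Sequence[float], current: Sequence[float]) -> float:
    """Shift of the current mean from the reference mean, measured in
    reference standard deviations."""
    spread = pstdev(reference)
    shift = abs(mean(current) - mean(reference))
    if spread == 0:
        # A constant reference sample: any shift at all counts as drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / spread


def has_drifted(
    reference: Sequence[float],
    current: Sequence[float],
    threshold: float = 2.0,  # assumed threshold for this sketch
) -> bool:
    """Return True when the current data has moved past the threshold."""
    return drift_score(reference, current) > threshold
```

A pipeline could run a check like this on each new batch of data and start retraining when `has_drifted` returns `True`. Production systems typically use richer statistics than a mean shift.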
The result of these MLOps-enabled abilities is a 74% improvement in the actual bus scheduling times, as TransLink noted.
Companies and organizations providing commercial ML services or looking to adopt machine learning models in the future need to learn about MLOps and how to apply it in their workflow because, like DevOps, MLOps has the potential of becoming the de facto practice in model development as the future unfolds.
MLOps helps organizations create an automated workflow that is repeatable and error-resilient. Such a workflow facilitates cross-team collaboration and regulatory compliance, helps overcome the prevalent challenges that keep machine learning projects from maturing into production, and lets organizations leverage the capabilities of the ML model to drive business growth.
Because MLOps is a discipline still in a relatively early phase, there are bound to be some challenges in implementing it. Even though it has a community with established principles, its tooling may not yet deliver the full value that MLOps proposes.
Also, MLOps might set unrealistic benchmarks for a successful machine learning implementation. And as mentioned earlier, many of the MLOps tools and platforms available do not yet provide enough automation in validating data, running experiments, deploying, monitoring, and retraining models. But as the discipline matures and more organizations adopt MLOps, there will be a need for higher speed in production to outperform competitors. This will force the tooling to level up to meet organizations' demands.
MLOps Engineer is one of the rising roles in the software engineering field. According to Glassdoor, the average base pay for an MLOps Engineer role in the US is $87,485, with a more senior role attracting up to $114k per year on average.
Considering that the field is still developing, the future of a career in MLOps looks promising, with growing compensation in cash and in kind because of the business-critical tasks an MLOps Engineer handles.
Becoming an MLOps Engineer, like a DevOps Engineer, involves a steep learning curve.
To become an MLOps Engineer, you need to learn at least one version control tool, preferably Git. You also need to know about linear and logistic regression, artificial and convolutional neural networks, decision trees, K-means, Bayes' theorem, and other core machine learning algorithms and concepts to familiarize yourself with machine learning projects. You also need to be well versed in CI/CD and monitoring tools such as GitLab, Jenkins, Prometheus, and Grafana, and in cloud and DevOps automation tools such as Docker, Kubernetes, and Terraform. Learning how to use at least one of the most common MLOps platforms will also come in handy and can reduce the number of other tools you need to learn.
MLOps is not just about data; it also cares about the operationalization of ML models and the code used in developing their software implementation. It is essentially DevOps for machine learning development, with small tweaks to the principles to suit the model development workflow. Implementing MLOps in your model development workflow will promote the collaboration and automation that help your machine learning business solution get successfully deployed into production.