Building blocks of MLOps: What, Why and How?

Raviraja Ganta
Founding Engineer
April 28, 2022

MLOps has become the centre of focus for any ML team to scale since 2020. There are approx 284 tools created in MLOps. So what is MLOps? Why do we need MLOps?

What is MLOps?

The goal of MLOps is to reduce the technical friction to get the model from an idea to production in the shortest possible time to market with as little risk as possible.

MLOps is the set of practices at the intersection of Machine Learning, DevOps, and Data Engineering.

Why MLOps?

Typical machine learning project flow looks like this: There is a data scientist who needs to solve some business problem that can provide some impact. The data scientist investigates the problem, gets the data, does data analysis, cleans the data, experiments with different kinds of models, and then saves the best model and the data used for the creation of that model. Now the model needs to be served so that it can be consumed by the business applications.

What can go wrong?

  1. The number of requests can go high that a single server might not be able to handle.
  2. The underlying data can change which can affect the model predictions.
  3. The model might not perform best in all cases, so re-training might be required.
  4. The business problems can change which leads to either modification of the model or the addition of new models..

and many more...

How MLOps help?

MLOps helps in enforcing certain practices which help in reducing risks. It can help in Shorter development cycles. Better collaboration between teams. Increased reliability, performance, scalability, and security of ML systems and more...

Industrial Surveys

  • 72% of a cohort of organizations that began AI pilots before 2019 have not been able to deploy even a single application in production.
  • Algorithmia’s survey of the state of enterprise machine learning found that 55% of companies surveyed have not deployed an ML model. i.e 1 in 2 companies.

Successful deployments, adapting to changes, and effective error handlings are the bottlenecks for getting value from AI.

Model Development

Data Engineering

Data Collection: Data is needed for developing a model. Data can come from different places like data storage, fetching the data lively from apps, etc. There could be cases where the data is not sufficient. So data needs to be generated either manually or programmatically in these cases.

Data Annotation: Data might not have been labeled all the time. At times, the labels assigned to the data might not be accurate. In either of the cases, data needs to be annotated (or) validated.

Data Processing: Depending on the problem, data needs to be processed so that it would be useful for model creation. Feature engineering needs to be done to identify useful features. Data visualizations might be needed for a better understanding of the characteristics of the data.

Model Engineering

Modeling: This is the part that everyone is aware of. There are many kinds of models available. It could be statistical models, machine learning algorithms, or deep learning models. There are different kinds of modeling techniques that are also available like supervised learning, unsupervised learning, and semi-supervised learning techniques.

Hyper-parameter tuning: Each model has its own parameters and some modeling parameters. Fine-tuning them will lead to a better model. Some examples include hidden size, number of layers, batch size, learning rate, loss functions, etc.

Model Validation: Choosing the right metric is a crucial part of model engineering. Once the metric is finalized, trained models need to be validated based on this metric. Data used for validation should not be present in training. Unsupervised approaches might not have the training, so all data can be used there.

Model Packaging: Developed model might not be used in the same environment. The deployed environment might not have the required dependencies or run in a different hardware setup. Model packaging will help in resolving these kinds of issues. Some examples like docker container, ONNX, pkl will be used.

Model Deployment

Once the model is developed, it needs to be deployed so that it can be used in real-time.

Deployment Methods

Online vs Batch Inferencing

Some problems might require immediate predictions from the model (ex: conversational agent, self-driving car). Some others might not need immediate predictions and it’s okay having some delay (Feedback analysis).

Auto-Scaling

Scaling is a core part of the deployment, where systems are scaled according to the number of requests. There are two kinds of scale:

Vertical scaling means replacing the same compute instances with larger more powerful machines.

Horizontal scaling means replacing with more or less a number of the same machines.

Auto-scaling can be achieved in multiple ways: Kubernetes-based scaling, Serverless (Lambda) based scaling.

Deployment Stages

Once the model is developed, we should not directly update it in production. It should be done in stages, so that we can have the flexibility to catch any errors, to compare the performance w.r.t model that is in production, the update should be smooth so that end-user does not feel any downtime, etc.

To tackle this, deployment is usually done in stages.

Staging

First, the model is deployed to staging and all the sanity checks are performed here like latency, outputs are coming in the expected format, validation metrics are coming same, scaling accordingly to the requests, etc.

Production

Model can be deployed in production in different ways.

In the progressive delivery approach, a new model candidate does not immediately replace the previous version. Instead, after the new model is deployed to production, it runs in parallel to the previous version. A subset of data is redirected to the new model in stages (canary deployment, Blue/green deployment, etc). Based on the outcome and performance of the subset, it can be decided whether the model can be fully released and can replace the previous version.

Blue/Green Deployment

Blue-green deployment is a deployment strategy that utilizes two identical environments, a blue and a green environment with different versions of an application or service. User traffic is shifted from the blue environment to the green environment once new changes have been tested and accepted within the blue environment.

Canary Deployment

Canary deployment is a deployment strategy that releases an application or service incrementally to a subset of users. All infrastructure in a target environment is updated in small phases (e.g: 2%, 25%, 75%, 100%).

The model can be directly replaced in production without all these steps also depending on the use-case.

Monitoring

Model deployment is not the end of development. It’s only halfway. How do we know that the deployed model is performing as expected?

Monitoring systems can help give us confidence that our systems are running smoothly and, in the event of a system failure, can quickly provide appropriate context when diagnosing the root cause.

Things we want to monitor during training and inference are different. During training, we are concerned about whether the loss is decreasing or not, whether the model is overfitting, etc. But, during inference, We like to have confidence that our model is making correct predictions.

What to monitor?

Performance Metrics:

Response time: This metric is useful in understanding the time taken for the request to get completed.

Monitoring this metric will help in understanding when requests are taking higher / lesser times. Alerts can be created on top of this, to get notified when there is an abnormality in response times.

Resource Utilisation: This metric is useful in monitoring CPU/GPU/Memory usage.

Monitoring this metric will help in understanding how the resources are being used. Are the resources being used to their capacity?  Alerts can be configured to get notified when the resources are under-utilized or over-utilized upon a specific value.

Invocations: This metric is useful for monitoring how many times a service a called.

There could be different services (or) different methods in a single service. Monitoring how many times either a service/method is invoked will help in understanding are the services are being called as expected. In the case of auto-scaling, this metric will help in understanding, how many lambdas, containers, pods, etc are invoked.

Success and Error rates: This metric is useful in monitoring how many requests are executed.

Not every request will be successful. There could be errors, timeouts, etc could be happening due to various issues. This metric will help in understanding how the system is performing overall like how many requests are successful and how many failed.

Data Metrics:

The major variable in ML systems is the data. Monitoring data will help in understanding if there are any data leaks, data distributions, etc. Let’s see some metrics which are used for these.

Label distribution: This metric is useful in monitoring the number of predictions made for a label.

There could be no predictions made for some labels. Inversely, the model could be predicting only some labels all the time. Having this metric will help in understanding how the label predictions are distributed in run time. And alerts can be configured if there is a surge/drop-in label counts → Anomalies.

Data drift is one of the top reasons model performance degrades over time. For machine learning models, data drift is the change in model input data that leads to model performance degradation. Monitoring data drift helps detect these model performance issues.

There are different cases where a data drift can occur depending on the use case. Some of the popular ones are:

  • Covariate Shift: This happens when the distribution of input data shifts between the training environment and inference environment. Although the input distribution may change, the output distribution or labels remain the same. Statisticians call this covariate shift because the problem arises due to a shift in the distribution of the covariates (features).

    Example:
    For a cat-dog classification problem, let’s say training is happened using real images. During inference time, if the input is animated images, still the output does not change. So though the input distribution changes (real → animated), the output distribution (cat/dog) is not changed.

    Monitoring this will help in improving the model performance.
  • Label Shift: This happens when the output distribution changes but for a given output, the input distribution stays the same.

    Example:
    Say we build a classifier using data gathered in June to predict the probability that a patient has COVID19 based on the severity of their symptoms. At the time we train our classifier, covid positivity was at a comparatively lower rate in the community. Now it’s November, and the rate of covid positivity has increased considerably. Is it still a good idea to use our classifier from June to predict who has covid?

    Monitoring this will help us in understanding the changes in output distribution which lead to the re-training model.

    Covariate shift is when the input distribution changes. When the input distribution changes, the output distribution also changes, resulting in both covariate shift and label shift happening at the same time.
  • Concept Drift: This is commonly referred to as the same input → different output i.e when the input distribution remains the same but the conditional distribution of the output given an input changes.

    Example:
    Before COVID-19, a 3 bedroom apartment in San Francisco could cost $2,000,000. However, at the beginning of COVID-19, many people left San Francisco, so the same house would cost only $1,500,000. So even though the distribution of house features remains the same, the conditional distribution of the price of a house given its features has changed.

    Concept drift is a major challenge for machine learning deployment and development, as in some cases the model is at risk of becoming completely obsolete over time. Monitoring this metric will help in updating the model with respect to data changes.

Monitoring systems can help give us confidence that our systems are running smoothly and, in the event of a system failure, can quickly provide appropriate context when diagnosing the root cause.

Artifact Management

One of the major differences different between ML and others is data and models. The size of data, and models can vary from a few MB to TB. Having a well-defined Artefact management will help in auditability, traceability, and compliance, as well as for shareability, reusability, and discoverability of ML assets.

Feature Management: This allows to share, re-use & discover features and create more effective machine learning pipelines. Maintaining features in a single place, will help in speeding up experiments and also inference times → no need to compute features from scratch, read the already computed ones and run the pipelines.

Dataset Management: There could be different versions of data, different splits of data, data from different sources, etc. Having dataset management in place ensures that the schema of the data will be constant. Being able to access any of the combinations at a later point in time will help in reproducibility and lineage tracking.

Model Management: There could be a number of models in production at scale, and it becomes difficult to keep track of all of them manually. Having model management helps in tracking. We should be able to lineage tracking on a model → what data is used, what configurations are used, and what is the performance on validation/test datasets? It should also help in the health checks of the models.

Summary

Machine Learning will enable the business to solve use cases that were previously impossible to solve. Delivering business value through ML is not only about building the best ML model. It is also about:

  • Able to experiment, train, and evaluate models at ease
  • Building systems that are replicable
  • Serving the model for predictions that can scale according to the usage
  • Monitoring the model performance
  • Tracking and Versioning datasets, model metadata, and artifacts

Though it is not required to have all the components, as it can vary from organization to organization, having a reliable robust MLOps system will lead to stable productionising of ML models and reduce sudden surprises.

References

  1. https://towardsdatascience.com/battling-label-distribution-shift-in-a-dynamic-world-bc1f4c4d2f92
  2. https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html
  3. https://cloud.google.com/resources/mlops-whitepaper
Get a demo with your data
Enterpret allowed us to listen to specific issues and come closer to our Members - prioritizing feedback which needed immediate attention, when it came to monitoring reception of new releases: Enterpret picked up insights for new updates and became the eyes of whether new systems and functionality were working well or not.
Louise Sellars
Analyst, Customer Insights
Enterpret is one of the most powerful tools in our toolkit. It's very Member-friendly. We've been able to share how other teams can modify and self-serve in Enterpret. It's bridged a gap to getting access to Member feedback, and I see all our teams finding ways to use Enterpret to answer Member-related questions.
Dina Mohammad-Laity
VP of Data
The big win-win is our VoC program enabled us to leverage our engineering resources to ship significantly awesome and valuable features while minimizing bug fixes and" keep the lights on" work. Magnifying and focusing on the 20% that causes the impact is like finding the needle in a haystack, especially when you have issues coming from all over the place
Abishek Viswanathan
CPO, Apollo.io
Since launching our Voice of Customer program six months ago, our team has dropped our human inquiry rate by over 40%, improved customer satisfaction, and enabled our team to allocate resources to building features that increase LTV and revenue.
Abishek Viswanathan
CPO, Apollo.io
Enterpret's Gong Integration is a game changer on so many levels. The automated labeling of feedback saves dozens of hours per week. This is essential in creating a customer feedback database for analytics.
Michael Bartimer
Revenue Operations Lead
Enterpret has made it so much easier to understand our customer feedback. Every month I put together a Voice of Customer report on feedback trends. Before Enterpret it would take me two weeks - with Enterpret I can get it done in 3 days.
Maya Bakir
Product Operations, Notion
The Enterpret platform is like the hero team of data analysts you always wanted - the ability to consolidate customer feedback from diverse touch points and identify both ongoing and emerging trends to ensure we focus on and build the right things has been amazing. We love the tools and support to help us train the results to our unique business and users and the Enterpret team is outstanding in every way.
Larisa Sheckler
COO, Samsung Food
Enterpret makes it easy to understand and prioritize the most important feedback themes. Having data organized in one place, make it easy to dig into the associated feedback to deeply understand the voice of customer so we can delight users, solve issues, and deliver on the most important requests.
Lauren Cunningham
Head of Support and Ops
With Enterpret powering Voice of Customer we're democratizing feedback and making it accessible for everyone across product, customer success, marketing, and leadership to provide evidence and add credibility to their strategies and roadmaps.
Michael Nguyen
Head of Research Ops and Insights, Figma
Boll & Branch takes pride in being a data driven company and Enterpret is helping us unlock an entirely new source of data. Enterpret quantifies our qualitative data while still keeping customer voice just a click away, adding valuable context and helping us get a more complete view of our customers.
Matheson Kuo
Senior Product Analyst, Boll & Branch
Enterpret has transformed our ability to use feedback to prioritize customers and drive product innovation. By using Enterpret to centralize our data, it saves us time, eliminates manual tagging, and boosts accuracy. We now gain near real-time insights, measure product success, and easily merge feedback categories. Enterpret's generative AI technology has streamlined our processes, improved decision-making, and elevated customer satisfaction
Nathan Yoon
Business Operations, Apollo.io
Enterpret helps us have a holistic view from our social media coverage, to our support tickets, to every single interaction that we're plugging into it. Beyond just keywords, we can actually understand: what are the broader sentiments? What are our users saying?
Emma Auscher
Global VP of Customer Experience, Notion
The advantage of Enterpret is that we’re not relying entirely on human categorization. Enterpret is like a second brain that is looking out for themes and trends that I might not be thinking about.”
Misty Smith
Head of Product Operations, Notion
As a PM, I want to prioritize work that benefits as many of our customers as possible. It can be too easy to prioritize based on the loudest customer or the flavor of the moment. Because Enterpret is able to compress information across all of our qualitative feedback sources, I can make decisions that are more likely to result in positive outcomes for the customer and our business.
Duncan Stewart
Product Manager
We use Enterpret for our VoC & Root Cause Elimination Program - Solving the issues of aggregating disparate sources of feedback (often tens of thousands per month) and distilling it into specific reasons, with trends, so we can see if our product fixes are reducing reasons.
Nathan Yoon
Business Operations, Apollo.io

Other articles of interest

No items found.