March 24, 2023
At Meta, we have developed measurement processes for specific metrics about AI systems that can be used to make managing models more effective and efficient, and we’ve tested these processes across a diverse ecosystem of tools and systems. We believe these techniques can be applied broadly in other organizations managing AI ecosystems, so we are sharing them here.
AI development ecosystems are increasingly complex and challenging to maintain, and technology companies need to develop highly efficient systems to build, serve, and improve their AI models for production applications.
AI model management is particularly important for large-scale technology companies like Meta, where billions of people trust us to respect their privacy, keep their data secure, and deliver consistent, high-quality product experiences. Meta has thousands of AI models as well as a diverse set of development tools and model serving platforms that need to comply with regulations in multiple areas of the world.
To manage these models, Meta has developed a machine learning operations (ML-Ops) ecosystem that consolidates key elements such as measurement while also offering flexible, decentralized tools for different product teams with different needs. At Meta, we have extensive experience with ML-Ops, and have pioneered new ways to navigate the process for a wide range of product applications. In this blog, we will walk through approaches we’ve developed for measuring the maturity of our AI models and tracking management outcomes. These approaches are foundationally important for any AI infrastructure, because a first step in managing an ecosystem of models is developing a unified way of understanding particular ML assets via metrics.
We’ll start by exploring the way we think about AI model management (one of the core components of Meta’s broader AI governance strategy), then discuss how to move toward consistently defining concepts across authoring, training, and deploying AI models. We’ll then walk through how to apply those concepts to platformize measurement and reporting, and describe how to build a cohesive metric framework on that platform. These measurement processes are being leveraged to more efficiently, reliably, and responsibly operate many AI platforms, products, and infrastructure components at Meta.
The following items represent goals and principles that AI practitioners should consider as they aim to effectively manage models across their systems:
Data governance: Effective AI systems management factors in data governance considerations to help ensure models only consume data for a limited, clearly stated purpose, that the minimum necessary data is used, and that data is retained only as long as it is needed. At Meta, these efforts support certain regulatory requirements and internal policies related to data governance.
Security: Leverage automation to better prevent misuse or accidental loss of models and their data, and to control access to them.
Ownership accountability: Ensure that models are actively maintained and documented with a clear owner, taking into account other accountability goals for your organization.
Fairness: Minimize risk of amplifying societal biases in the model building process, and ensure consistent outcomes across different demographic groups, taking into account other fairness goals for your organization.
Robustness: Promote AI best practices to help models run optimally. At Meta, we work to drive best practices for things like frequent retraining to combat concept drift.
Efficiency: Maximize usage of machine resources and developer time by lessening friction in the process and reducing unnecessary compute and storage used for modeling.
At Meta, we provide product teams with opportunities to match AI tooling to their needs. Some product teams use tightly coupled custom solutions to meet exceptional feature and scale requirements, while others benefit from the ease of use of simpler, more general tools. This variance allows for effective AI development and fast iteration, but it also adds complexity to model management: a variety of tools, each tied to a product area, means more touchpoints to manage. In our measurement of AI models, features, and datasets, we look to enable this decentralization by efficiently bridging across these varied tools to create visibility into outcomes without requiring tooling consolidation.
Measurement begins with a seemingly simple question: What counts as an AI model? In the context of complex modeling ecosystems, the answer can be quite ambiguous. For example, if you ask different practitioners to look at the same set of model artifacts and tell you how many models there are, you will often get several different answers. Some people may say there are five trained binaries, thus five models; others may say there are three deployments serving traffic, so three models; and still others may say these artifacts all came from one source code file, so there is one model.
All this becomes a challenge when you have to ask such questions as, “Are all my models being retrained regularly?” and need a uniform way to understand how many models to assess. In order to consistently manage models, we need consistent definitions for key concepts, so we’ll walk through how we established standardized definitions for ML model artifacts in technical systems.
The first step in assessing the artifacts of AI modeling is selecting the right component for each use case. To do so, it can be helpful to picture the general structure of model development as it looks at Meta, described below. Significant complexity is added in considering nonmodel assets (such as features and data sources), so this section will focus on models themselves.
Model source code is written by engineers to solve a given purpose (such as ranking posts), and then one or more model binaries are trained from this source, often being created periodically for either iterative model improvements or retraining. Finally, some of these versions are published to inference systems as deployments to serve traffic. Separate deployments occur for various reasons, including for infrastructure considerations (such as one streaming deployment and one batch), for experiments, or for reusing generic or pretrained models in different areas.
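To make these distinctions concrete, here is a minimal Python sketch of the three artifact types and how they relate; the class and field names are illustrative assumptions, not Meta's internal schema.

```python
from dataclasses import dataclass

@dataclass
class ModelSourceCode:
    """Source code written to solve a given purpose (e.g., ranking posts)."""
    repo_path: str
    purpose: str

@dataclass
class ModelBinary:
    """A trained artifact produced from the source, e.g., by periodic retraining."""
    binary_id: str
    source: ModelSourceCode
    trained_at: str  # ISO date of the training run

@dataclass
class ModelDeployment:
    """A published binary serving inference traffic in one location."""
    deployment_id: str
    binary: ModelBinary
    environment: str  # e.g., "streaming", "batch", or "experiment"

# One source file can fan out into several binaries and several deployments,
# which is why "how many models are there?" has multiple defensible answers.
```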
After selecting the model artifact to measure, additional complexity arises when artifacts with the same function carry different labels in different tools. This causes confusion when elements with a certain label are selected and relevant components from some systems are missed, or when selecting the same label across systems unintentionally groups together non-comparable AI concepts.
Take the example where system 1 calls the model source code the “Model,” with another term denoting the trained binaries, and system 2 considers the binaries the “Model,” calling the model source code something else. If we were to select all “Models” from these systems, we’d get a mix of source code and binaries, which are not comparable. This confusion can be minimized with a clear metadata architecture for the organization that bridges across specific system implementations via common labels.
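As a rough illustration of what such a bridging layer can look like, the hypothetical Python mapping below translates each system's local label into a shared vocabulary; the system names and labels are invented for this example.

```python
# Hypothetical mapping from (system, local label) to a canonical artifact type,
# so that "Model" in one tool is never mixed with "Model" in another.
CANONICAL_ARTIFACT_TYPE = {
    ("system_1", "Model"): "model_source_code",
    ("system_1", "Snapshot"): "model_binary",
    ("system_2", "Pipeline"): "model_source_code",
    ("system_2", "Model"): "model_binary",
}

def canonicalize(system: str, label: str) -> str:
    """Translate a system-specific label into the organization-wide vocabulary."""
    try:
        return CANONICAL_ARTIFACT_TYPE[(system, label)]
    except KeyError:
        raise ValueError(f"No canonical mapping for {label!r} in {system!r}")

# Selecting "all Models" now happens in canonical terms, not local ones:
assert canonicalize("system_1", "Model") != canonicalize("system_2", "Model")
```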
Meta uses decentralized AI tooling to create customized product experiences in various domains, from video feed ranking in Reels to helping people discover new VR experiences in Meta Horizon Worlds. This wide array of use cases has shown us that measurements can sometimes be inconsistent because they cover different subsets of the model population.
To avoid inconsistency while maintaining decentralization of our AI systems, we’ve worked toward consolidating certain logging information into a single platform that can then serve various needs when queried. This platformizing of measurement involves creating modular data collection endpoints that can be reused in different areas and ingesting data from these endpoints into a nuanced metadata graph. Together, decentralized data ingestion and a central data store create a scalable structure that gives us the flexibility to work with a diverse range of AI systems.
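The sketch below shows, under our own simplifying assumptions, what a modular data collection endpoint might look like: each decentralized tool emits events in a shared shape, and the central store decides how they attach to the metadata graph. The function and field names here are illustrative, not Meta's APIs.

```python
import json
import time

def log_model_event(store: list, system: str, artifact_type: str,
                    artifact_id: str, payload: dict) -> None:
    """Reusable ingestion endpoint: append one event in a shared shape."""
    record = {
        "system": system,                # which decentralized tool emitted this
        "artifact_type": artifact_type,  # canonical type, e.g., "model_binary"
        "artifact_id": artifact_id,
        "logged_at": time.time(),
        "payload": payload,              # tool-specific details, passed through
    }
    store.append(json.dumps(record))

# A training tool reports a new binary without knowing how other tools log:
events = []
log_model_event(events, "system_1", "model_binary", "ranker_v42",
                {"trained_at": "2023-03-01", "source": "ranker.py"})
```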
Two key elements in this process are the metadata graph and data interoperability standards.
The central graph stitches together data between systems and helps ensure coverage in diverse contexts. This is done by bridging across systems and creating common definitions for various artifacts, which are linked as follows (a simplified sketch of these links appears after the list):
A feature source record stores the origin of data for a particular feature.
These sources are linked to any number of logical feature records, which store the logic to transform the raw data into a training dataset.
Model code and its training dataset are connected to a workflow run, which is a pipeline that executes to produce one or more model binaries.
Finally, the model binaries are linked to one or more model deployments, which track each place where binaries are serving inference requests.
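As a toy illustration of the chain above, the following Python sketch represents the graph as nodes and typed edges; the node IDs and edge names are invented, and Meta's actual schema lives in the Ent framework described next.

```python
# Toy adjacency view of the metadata graph described in the list above.
graph = {
    "nodes": {
        "fs_clicks":   {"type": "feature_source"},
        "feat_ctr":    {"type": "logical_feature"},
        "run_2023_03": {"type": "workflow_run"},
        "bin_v7":      {"type": "model_binary"},
        "dep_stream":  {"type": "model_deployment"},
    },
    "edges": [
        ("fs_clicks", "feeds", "feat_ctr"),      # feature source -> logical feature
        ("feat_ctr", "trains", "run_2023_03"),   # training dataset -> workflow run
        ("run_2023_03", "produces", "bin_v7"),   # workflow run -> model binary
        ("bin_v7", "serves_via", "dep_stream"),  # binary -> deployment
    ],
}

def neighbors(graph: dict, node: str) -> list:
    """Follow outgoing edges, e.g., to trace a feature source to its deployments."""
    return [(edge, dst) for src, edge, dst in graph["edges"] if src == node]
```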
This graph is implemented via the Ent framework and is referenced by various model management systems that need to query and edit information about specific models.
The metadata graph has various flexible units and data flags to cover a vast number of infrequent edge cases, such as a single model deployment location running thousands of model binaries at the same time. Because of this, it can be parsed in many ways that produce different views from the same data. For company-wide data applications, where consistency is important, we establish standard ways to interpret the graph. To do this, practical trade-offs are necessary to determine what edge cases are appropriate to include for common uses and to set global definitions, such as what we mean by an “active model.” This allows us to transform the graph into flat tables for broad reporting and to interact with traditional (non-AI) programs that want to leverage AI data.
These tables are the first stop for most analytics or exploratory use cases. When more nuanced use cases with specific needs arise, teams can return to the graph and meet those needs via a custom data pull. The process of transforming a large graph into tables involves significant ETL operations, so we use Presto to make sure pipelines are efficient and reliable, and we supplement with Spark to handle areas that are particularly memory intensive.
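To illustrate the shape of this transformation (the real pipelines run on Presto and Spark), here is a small Python sketch, under our own assumptions, that walks deployment-to-binary-to-run edges in the toy graph above and emits one flat row per active deployment.

```python
def flatten_active_models(graph: dict, active_deployments: set) -> list:
    """Emit one flat, report-friendly row per deployment that meets the global
    definition of an "active model" (here: simply membership in the input set)."""
    reverse = {dst: src for src, _, dst in graph["edges"]}  # child -> parent
    rows = []
    for dep_id in active_deployments:
        binary_id = reverse.get(dep_id)
        run_id = reverse.get(binary_id) if binary_id else None
        rows.append({
            "deployment_id": dep_id,
            "binary_id": binary_id,
            "workflow_run_id": run_id,
        })
    return rows

# Using the toy graph from the earlier sketch:
# [{'deployment_id': 'dep_stream', 'binary_id': 'bin_v7', 'workflow_run_id': 'run_2023_03'}]
print(flatten_active_models(graph, {"dep_stream"}))
```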
After consistently defining artifacts and aggregating data from diverse sources, the final step in getting a broad picture of model management is to define metrics and interactions between them to describe model outcomes.
One way to do this is via a machine learning maturity framework.
The first problem we need to solve is how to compare metric outcomes from across areas of model management. The framework does this by setting out design principles for each component metric, which describe the outcomes expected of models in a standardized, bucketed format. Translation to consistent labels allows for apples-to-apples comparisons across diverse domains, and the bucketing allows for comparison of both continuous variables (e.g., time since a model has been retrained) and logical states (e.g., investigation in progress or investigation completed).
This design is flexible enough to work in various contexts, where expectations could be set out to mean anything from compliance with access control policy to proper annotation tags on models, depending on specific needs.
For instance, at Meta we use it to shape a company-wide view of exposure to concept drift among deployed models. Concept drift is when the statistical representation that a model learns from its data changes, making the model’s predictions less valid. By measuring the time since a model was last trained, we can proxy this risk of drift and understand the likelihood that the model learned a representation that no longer applies.
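A minimal sketch of how such a bucketed metric could be computed appears below; the thresholds, level numbers, and state names are illustrative assumptions, not the actual values Meta uses.

```python
from datetime import date

def retraining_recency_level(last_trained: date, today: date) -> int:
    """Proxy for concept-drift exposure: more days since retraining, lower level."""
    days = (today - last_trained).days
    if days <= 30:
        return 3   # recently retrained
    if days <= 90:
        return 2
    if days <= 180:
        return 1
    return 0       # high exposure to concept drift

def investigation_level(state: str) -> int:
    """A logical-state metric mapped onto the same standardized scale."""
    return {"investigation_completed": 3, "investigation_in_progress": 1}.get(state, 0)

# Continuous and logical metrics now share one bucketed scale, which is what
# makes apples-to-apples comparison across domains possible.
print(retraining_recency_level(date(2023, 1, 1), date(2023, 3, 24)))  # 2
print(investigation_level("investigation_in_progress"))               # 1
```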
Even with consistently defined metrics, it can be challenging to make sense of outcomes across each metric individually. Implementing a hierarchical framework structure to group metrics can help alleviate this problem. These groupings constitute a graph of metrics and their aggregations for each model, and various nodes on that graph can be reported for different purposes. Aggregations from component metrics to groups and to overall maturity can use different methods based on the nature of the metrics they aggregate.
At Meta, we use a maturity framework to coordinate key top-line metrics for ML model management and give a holistic view of how they interact. Each metric has a config file that defines its place in the graph hierarchy, and also defines how it is aggregated from finer model concepts, such as deployments, to coarser model concepts, such as source code.
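The sketch below illustrates, with an invented config format and aggregation function, how a metric's place in the hierarchy and its roll-up method might be expressed; the field names and methods are assumptions, not Meta's actual config schema.

```python
# Hypothetical metric config: where the metric sits in the hierarchy, and how
# deployment-level scores roll up to coarser concepts such as source code.
RETRAINING_RECENCY_CONFIG = {
    "metric": "retraining_recency",
    "group": "robustness",        # its group in the metric hierarchy
    "parent": "overall_maturity",
    "rollup": "min",              # aggregation from deployments to source code
}

def roll_up(values: list, method: str) -> float:
    """Aggregate finer-grained scores (per deployment) into one coarser score."""
    if method == "min":           # conservative: the worst deployment dominates
        return min(values)
    if method == "mean":
        return sum(values) / len(values)
    raise ValueError(f"Unknown rollup method: {method}")

# Three deployments of the same source code, scored 3, 2, and 1:
print(roll_up([3, 2, 1], RETRAINING_RECENCY_CONFIG["rollup"]))  # 1
```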
If you are looking to measure management of an AI development ecosystem, our experience shows some concrete ways to approach the situation:
Think about how different tools contribute to a generic AI development process, and map outputs from various tools to consistent artifacts in that process.
If your AI systems are complex and can’t be consolidated, think of measurement as a platform with a metadata graph and common interoperability standards.
For a broad view of outcomes, consider a metric framework that can compare and aggregate across model management domains.
Thanks to Michael Ashikhmin, Thomas Dudziak, Cindy Niu, and Reza Sharifi Sedeh for their significant contributions to this work, and Dany Daher, Prashant Dhamdhere, Maurizio Ferconi, Xin Fu, Giri Kumaran, Noah Lee, Joon Lim, and Jiang Wu for their guidance and support.