
Trustworthy AI
Knowledge Base
Monitoring and Metrics
Scaling AI is far from a purely technical endeavor. It is intuitive, in fact, that long-term execution of a strategy requires monitoring and metrics to gauge that strategy's efficacy in producing business results. Well-rounded monitoring regimes should account for five key considerations:
• Maturity and risk;
• Adoption;
• Content moderation;
• Technical performance;
• Return on investment.
We’re giving a bit away here in discussing maturity and risk at this stage of the paper. Skip ahead to the section presenting the AI Maturity Model to understand how to evaluate the organization’s readiness or maturity for AI, as well as how to identify dimensions that present risk to be mitigated.
Adoption concerns the rate at which users take up a particular AI capability, how consistently they engage with it, and whether they continue engaging with it over the long term. We favor “weekly active users” (WAU), i.e., the number of users who actively use a given workload (say, Microsoft 365 Copilot) each week. Monthly active users (MAU) has been a favorite metric in previous waves of technology adoption, but we find MAU misleading in the case of AI because AI requires such a cultural shift; a user who engages only once or twice a month is unlikely to adopt “born-in-AI” ways of working and is thus unlikely to realize significant productivity gains from AI.
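To make WAU concrete, here is a minimal sketch of computing it from a usage-event log; the event schema (user ID, event date, workload name) is an illustrative assumption, not a prescribed telemetry format.

```python
from collections import defaultdict
from datetime import date

# Hypothetical telemetry: (user_id, event_date, workload) tuples.
events = [
    ("alice", date(2024, 3, 4), "M365 Copilot"),
    ("bob",   date(2024, 3, 5), "M365 Copilot"),
    ("alice", date(2024, 3, 12), "M365 Copilot"),
]

def weekly_active_users(events, workload):
    """Count distinct active users per ISO week for a single workload."""
    weeks = defaultdict(set)
    for user_id, event_date, wl in events:
        if wl == workload:
            iso_year, iso_week, _ = event_date.isocalendar()
            weeks[(iso_year, iso_week)].add(user_id)
    return {week: len(users) for week, users in sorted(weeks.items())}

print(weekly_active_users(events, "M365 Copilot"))
# {(2024, 10): 2, (2024, 11): 1}
```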
We also recommend employing a “rings of release” method when releasing new capabilities, then monitoring uptake among colleagues. This is standard fare in software deployments, but for the uninitiated: the approach groups the users to whom the capability will be available into concentric rings, say a closed circle of technical users, then business early adopters, then an ever-widening pool of users until the capability has been released to all target users. Monitor WAU (or daily active users, if that makes more sense) in each ring, identify and remedy obstacles to adoption in the smaller rings, and avoid releasing more widely until you’re satisfied with adoption in the predecessor ring.
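As a sketch of how that ring gating might be automated, the snippet below promotes a release to the next ring only once WAU clears an adoption threshold; the ring names, sizes, and 60% gate are illustrative assumptions, not prescribed values.

```python
# Illustrative ring definitions; names, sizes, and the gate are assumptions.
RINGS = [
    {"name": "ring0-technical", "target_users": 25},
    {"name": "ring1-early-adopters", "target_users": 200},
    {"name": "ring2-department", "target_users": 2000},
]
ADOPTION_GATE = 0.60  # fraction of ring members active each week

def ready_to_expand(ring, wau):
    """Release to the next ring only when WAU clears the adoption gate."""
    return wau / ring["target_users"] >= ADOPTION_GATE

current = RINGS[0]
if ready_to_expand(current, wau=18):
    print(f"Adoption healthy in {current['name']}; release to the next ring.")
else:
    print(f"Hold release; investigate adoption blockers in {current['name']}.")
```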
Content moderation is crucial when implementing an AI product to ensure a safe, respectful, and legally compliant environment for users. Effective moderation addresses potentially harmful content, such as hate speech, explicit material, and misinformation, thereby preserving user trust and upholding community standards. For instance, input moderation may involve filtering user-uploaded images to detect obscene content, while output moderation could include analyzing AI-generated text to prevent the dissemination of inappropriate language. Content moderation is situational, depending on your business, customers, and goals. For example, aggressively blocking violent imagery is vital in business SaaS applications but calls for more nuance in gaming.
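The following sketch illustrates that input/output split; `image_classifier` and `text_classifier` are hypothetical stand-ins for whatever moderation models you deploy, and the threshold is a policy choice, not a recommended value.

```python
BLOCK_THRESHOLD = 0.8  # assumed policy: block above this classifier confidence

def moderate_input(image_bytes, image_classifier):
    """Input moderation: screen user-uploaded images before processing."""
    scores = image_classifier(image_bytes)  # e.g. {"obscene": 0.93, ...}
    return all(score < BLOCK_THRESHOLD for score in scores.values())

def moderate_output(generated_text, text_classifier):
    """Output moderation: screen AI-generated text before display."""
    scores = text_classifier(generated_text)  # e.g. {"hate": 0.02, ...}
    return all(score < BLOCK_THRESHOLD for score in scores.values())
```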
Azure AI offers robust content moderation features, such as image moderation, text moderation, and video moderation, capable of detecting offensive content across multiple formats. Benefits include real-time monitoring, scalability to handle large volumes of content, and compliance with various international standards. These capabilities enable organizations to protect their brand reputation, enhance user experience, and foster a safe community.
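As one concrete option, here is a sketch using the azure-ai-contentsafety Python SDK to screen generated text; the endpoint, key, and severity threshold are placeholders, and response fields can differ across SDK versions, so treat this as an outline rather than a definitive integration.

```python
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

# Placeholders: substitute your own resource endpoint and key.
client = ContentSafetyClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com",
    credential=AzureKeyCredential("<your-key>"),
)

SEVERITY_THRESHOLD = 2  # assumed policy: block content at severity 2 or above

def is_safe(generated_text):
    """Run model output through text moderation before showing it to users."""
    result = client.analyze_text(AnalyzeTextOptions(text=generated_text))
    return all(
        (category.severity or 0) < SEVERITY_THRESHOLD
        for category in result.categories_analysis
    )
```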
Monitoring the technical performance of AI products is a multifaceted task that encompasses various metrics and benchmarks to ensure the systems are functioning optimally. Key performance indicators such as model accuracy, precision, recall, and F1 score are critical in evaluating the effectiveness of machine learning models.
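These four metrics are straightforward to compute with scikit-learn; the labels below are illustrative values from a hypothetical validation set.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative ground-truth labels vs. model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.80
print(f"precision: {precision_score(y_true, y_pred):.2f}")  # 0.80
print(f"recall:    {recall_score(y_true, y_pred):.2f}")     # 0.80
print(f"f1:        {f1_score(y_true, y_pred):.2f}")         # 0.80
```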
Additionally, assessing workloads involves examining throughput and resource utilization to ensure the system can handle the expected volume of data and tasks. Responsiveness and latency are also vital metrics; low latency and high responsiveness indicate a well-optimized system capable of real-time or near-real-time processing.
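For latency and throughput, percentile summaries are more informative than averages because they expose tail behavior; the sketch below uses illustrative measurements.

```python
import numpy as np

# Illustrative per-request latencies (ms) from a load-test window.
latencies_ms = np.array([112, 98, 134, 101, 95, 420, 108, 99, 103, 117])
window_s = 2.0  # wall-clock duration of the measurement window

print(f"p50 latency: {np.percentile(latencies_ms, 50):.0f} ms")
print(f"p95 latency: {np.percentile(latencies_ms, 95):.0f} ms")  # pulled up by the 420 ms outlier
print(f"throughput:  {len(latencies_ms) / window_s:.1f} req/s")
```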
Tools like performance dashboards, log analysis, and automated monitoring systems provide continuous insights into these parameters. Regular performance testing and anomaly detection are essential practices to preemptively identify and address potential issues, thereby maintaining the robustness and efficiency of AI products.
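One simple form of anomaly detection is a rolling z-score over a monitored metric; the sketch below flags points far from the trailing-window mean. The window size and threshold are illustrative tuning choices.

```python
import statistics

def flag_anomalies(series, window=7, threshold=3.0):
    """Flag points more than `threshold` std devs from the trailing-window mean."""
    anomalies = []
    for i in range(window, len(series)):
        trailing = series[i - window:i]
        mean = statistics.mean(trailing)
        stdev = statistics.stdev(trailing)
        if stdev > 0 and abs(series[i] - mean) / stdev > threshold:
            anomalies.append((i, series[i]))
    return anomalies

# Illustrative daily error counts; the spike on day 9 gets flagged.
errors = [5, 6, 4, 5, 7, 6, 5, 6, 5, 42]
print(flag_anomalies(errors))  # [(9, 42)]
```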
Azure AI Studio allows you to evaluate single-turn or complex, multi-turn conversations where you ground the generative AI model in your specific data (RAG). You can also evaluate general single-turn question answering scenarios, where no context is used to ground your generative AI model (non-RAG).
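Beyond the built-in evaluators in Azure AI Studio, the shape of such an evaluation can be sketched in a few lines; `llm_judge` below is a hypothetical stand-in for a model-based scorer, not a real API.

```python
def llm_judge(metric, **fields):
    """Hypothetical placeholder: ask a judge model to score `fields` on `metric` (1-5)."""
    raise NotImplementedError  # wire up your evaluator of choice here

def evaluate_turn(question, answer, context=None):
    """Score one Q&A turn; supply `context` for RAG, omit it for non-RAG."""
    scores = {"relevance": llm_judge("relevance", question=question, answer=answer)}
    if context is not None:
        # RAG case: also check that the answer is grounded in the retrieved context.
        scores["groundedness"] = llm_judge("groundedness", answer=answer, context=context)
    return scores
```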
We learned through organizations’ experience adopting Power Platform (an earlier Microsoft platform technology that entered the mainstream in the 2018-2019 timeframe) that many organizations crave Return on Investment (ROI) data for every minor workload that’s deployed. This made sense in previous eras of big, monolithic software applications like ERP or CRM, but AI requires a more balanced, nuanced approach (as does Power Platform, though that is a story for another time). Organizations that truly transform themselves for the age of AI will infuse AI throughout many, many aspects of their work.
We therefore recommend that ROI be measured explicitly for major “anchor” workloads, and in the aggregate for smaller micro-workloads; in other words, an aggregate assessment of worker hours saved or costs reduced across the workforce or a department.
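A minimal sketch of that aggregate view follows; the workloads, hours-saved figures, loaded hourly rate, and spend are all illustrative assumptions.

```python
# Illustrative micro-workloads and worker hours saved per quarter.
micro_workloads = {
    "email-drafting": 1200,
    "meeting-summaries": 800,
    "code-review-assist": 450,
}
LOADED_HOURLY_RATE = 85       # assumed fully loaded cost per worker hour, USD
quarterly_ai_spend = 150_000  # assumed platform + licensing spend, USD

hours_saved = sum(micro_workloads.values())
value = hours_saved * LOADED_HOURLY_RATE
print(f"Aggregate hours saved: {hours_saved}")                           # 2450
print(f"Estimated value: ${value:,} vs. spend ${quarterly_ai_spend:,}")
print(f"Simple ROI: {(value - quarterly_ai_spend) / quarterly_ai_spend:.0%}")  # 39%
```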