AI Risk

AI Model Monitoring Template: Drift Detection, Thresholds, KRIs, and Evidence

TL;DR:

  • Zillow lost over $500 million because nobody noticed their pricing model went off the rails — that’s a model monitoring failure, not a modeling failure
  • SR 11-7 and OCC Bulletin 2011-12 are explicit: ongoing monitoring is required, not optional. Examiners will ask how you do it.
  • You don’t need a $200K monitoring platform to start — you need defined thresholds, an escalation path, and proof you’re actually executing the plan
  • This post gives you the monitoring framework, tier-based cadence, and 30/60/90 rollout that examiners expect to see

Who this is for: Model risk managers, compliance officers, and AI governance leads who need to defend a model monitoring program in their next exam — or build one before the exam comes.

Zillow Lost $500 Million Because No One Was Watching the Model

In 2021, Zillow’s home-pricing algorithm couldn’t keep up with a rapidly shifting housing market. The model was making purchase offers based on price forecasts months ahead, but the training data didn’t reflect 2021’s volatility. Zillow wrote down over $500 million, shut down iBuying entirely, and laid off 25% of its workforce.

The model didn’t fail because it was poorly built. It failed because nobody caught the drift in time. The monitoring framework either didn’t exist or wasn’t tied to decisions. The organization kept feeding model outputs into multimillion-dollar purchase calls until the loss was unrecoverable.

This is the gap regulators care about. SR 11-7 says it plainly: “Validation activities should continue on an ongoing basis after a model goes into use.” OCC Bulletin 2011-12 makes monitoring a first-class control. The EU AI Act Article 72 makes post-market monitoring a legal requirement. And every model risk examiner’s first follow-up after seeing your inventory is the same: “Show me how you’re monitoring these.”

If you’re running AI models in production and you don’t have a monitoring framework with defined thresholds, escalation paths, and evidence of execution — you’re not managing risk. You’re waiting for the loss event.

Need the framework to defend this in your next exam? The AI Risk Assessment Template gives you the model inventory, tiering structure, and monitoring documentation examiners expect — built for compliance and risk teams, not data scientists.

The Four Types of Drift Examiners Will Ask About

“Model drift” is a catch-all term, but the control depends on what’s actually drifting. Your monitoring framework needs to cover all four:

| Drift Type | What's Changing | Plain-English Example |
|---|---|---|
| Data drift | Input feature distributions shift | Customer income distributions change after a recession |
| Concept drift | The relationship between inputs and outputs changes | What makes a "good" borrower changes post-pandemic |
| Feature drift | One input shifts independently of the others | A vendor changes how they encode zip codes |
| Prediction drift | The model's output distribution shifts | Approval rate climbs from 62% to 78% with no policy change |

Most monitoring failures happen on concept drift. Data drift is loud — distributions change visibly. Concept drift is silent — the inputs look fine, but what they mean for the prediction has shifted. The Bank of England flagged exactly this during COVID: payment deferrals meant traditional delinquency signals stopped predicting default. The data hadn’t changed shape. The world had.

Your monitoring framework needs detection methods for each type, with thresholds tied to your model tiering. (If you don’t have model tiering yet, start here — without tiering you can’t justify your monitoring cadence.)

What to Actually Monitor (and How Often)

A monitoring framework isn’t useful if it drowns people in metrics. Here’s what matters, organized by cadence:

Daily / Real-Time

| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Prediction distribution | Model output is shifting | PSI > 0.2 vs. 30-day baseline |
| Error rate / accuracy | Model is getting worse | > 2 standard deviations from baseline |
| Data completeness | Missing inputs | > 5% null rate on critical features |
| Volume anomalies | Unusual traffic patterns | > 3 standard deviations from expected |

Weekly

| Metric | What It Tells You | Action Trigger |
|---|---|---|
| Feature-level PSI | Individual inputs drifting | Any feature PSI > 0.2 |
| Segment performance | Model underperforming for subgroups | > 10% degradation vs. overall |
| Fairness metrics | Bias emerging in production | Disparate impact ratio < 0.8 |
| Ground truth reconciliation | Predictions vs. actuals diverging | When outcome labels become available |
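The disparate impact ratio in the weekly checks is simple to compute from production decisions. A minimal sketch (pure Python; the four-fifths threshold of 0.8 matches the table, and the input format is an assumption, not a standard API):

```python
from collections import defaultdict

def disparate_impact_ratios(outcomes, groups):
    """Favorable-outcome rate per group, divided by the highest group's rate.

    outcomes: iterable of 1 (favorable decision, e.g. approved) or 0
    groups:   iterable of group labels, one per record
    Any ratio below 0.8 breaches the four-fifths rule in the table above.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [favorable, total]
    for y, g in zip(outcomes, groups):
        counts[g][0] += y
        counts[g][1] += 1
    rates = {g: fav / tot for g, (fav, tot) in counts.items()}
    best = max(rates.values())
    return {g: r / best for g, r in rates.items()}
```

For example, approval rates of 75% and 25% across two groups produce ratios of 1.0 and 0.33, with the second group flagged for investigation.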

Monthly / Quarterly

| Metric | What It Tells You | Action Trigger |
|---|---|---|
| Backtesting | Historical accuracy trending | Declining Gini, AUC, or KS |
| Stability across time windows | Long-term reliability | Consistent degradation 3+ months |
| Champion-challenger | Better alternatives exist | Challenger outperforms on key metrics |

The PSI threshold cheat sheet (the only statistical test most exam reports actually reference): PSI < 0.1 = stable; 0.1–0.2 = investigate; > 0.2 = significant drift, escalate. Run it on each input feature individually, not just on the prediction output. Feature-level PSI is what isolates which input is drifting and whether it’s a data quality issue or a real population shift.
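PSI is straightforward to implement without a platform. A minimal sketch in Python (binning strategy and epsilon handling are implementation choices, not a standard):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a current sample.

    Bin edges come from the baseline's quantiles, so each bin holds roughly
    equal baseline mass; a small floor avoids log-of-zero on empty bins.
    Interpretation per the cheat sheet: < 0.1 stable, 0.1-0.2 investigate,
    > 0.2 significant drift.
    """
    eps = 1e-4
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # capture out-of-range values
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

Run this per feature against a rolling baseline (e.g. the trailing 30 days for daily checks) and alert on the thresholds above.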

Monitoring Cadence by Model Tier

Your monitoring frequency should match the model’s risk tier. This is the table examiners want to see in your monitoring plan:

| Model Tier | Examples | Monitoring Cadence | Validation Cadence |
|---|---|---|---|
| Tier 1 (Critical) | Credit decisioning, fraud, AML | Daily automated + weekly review | Annual + triggered |
| Tier 2 (Significant) | Pricing, customer segmentation, collections | Weekly automated + monthly review | Annual |
| Tier 3 (Standard) | Marketing propensity, operational efficiency | Monthly automated + quarterly review | Every 18–24 months |
| Tier 4 (Low) | Internal reporting, non-decision support | Quarterly automated | Every 2–3 years |

Who owns what (and what an examiner will ask each one):

  • Model owner / first line: Runs daily/weekly monitoring, escalates breaches, initiates recalibration. Examiner question: “Show me your last drift alert and what you did about it.”
  • Model Risk Management / second line: Sets monitoring standards, reviews results, approves threshold changes. Examiner question: “Show me the documented thresholds and how they map to your tiering.”
  • Internal Audit / third line: Audits whether the monitoring framework itself is working. Examiner question: “When was monitoring effectiveness last independently reviewed?”

When Drift Triggers Recalibration vs. Re-Validation vs. Rebuild

Not every drift event requires rebuilding from scratch. Use this decision framework — it’s also the one to document in your monitoring plan, because examiners will ask “what’s your trigger for re-validation?”

Recalibration (lighter touch):

  • Prediction drift with stable feature distributions
  • PSI 0.1–0.25 on outputs only
  • Performance degradation < 5% from baseline
  • Action: Adjust thresholds or score cutoffs. Document and get sign-off from model risk.

Re-Validation (full cycle):

  • Feature-level PSI > 0.25 on multiple features
  • Concept drift confirmed (ground truth diverges from predictions)
  • Performance degradation > 10%
  • Action: Full re-validation per SR 11-7 — conceptual soundness review, outcomes analysis, updated documentation.

Rebuild / Retrain:

  • Multiple drift types compounding
  • Structural break in the data (post-merger, post-pandemic, new regulation)
  • Action: New model development cycle. Treat as a new model for governance purposes.
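The decision framework above can be encoded directly so first-line reviewers apply it consistently. A hedged sketch (thresholds mirror this post's framework; the signal names and the "monitor" fall-through are illustrative, not a standard taxonomy):

```python
def drift_action(output_psi, feature_psis, perf_drop_pct,
                 concept_drift, structural_break):
    """Map drift signals to recalibrate / re-validate / rebuild.

    output_psi:       PSI on the prediction distribution
    feature_psis:     list of per-feature PSI values
    perf_drop_pct:    performance degradation vs. baseline, in percent
    concept_drift:    True if ground truth confirms inputs-to-outputs shift
    structural_break: True after a merger, pandemic, new regulation, etc.
    """
    drifting_features = sum(p > 0.25 for p in feature_psis)
    if structural_break or (concept_drift and drifting_features >= 2):
        return "rebuild"        # compounding drift or structural break
    if concept_drift or drifting_features >= 2 or perf_drop_pct > 10:
        return "re-validate"    # full SR 11-7 cycle
    if 0.1 <= output_psi <= 0.25 and max(feature_psis, default=0) <= 0.1 \
            and perf_drop_pct < 5:
        return "recalibrate"    # output drift only, features stable
    return "monitor"            # within tolerance; keep watching
```

Documenting the logic this explicitly is also what turns "what's your trigger for re-validation?" into a one-line answer.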

If the model drifts so badly you need an emergency shutdown — not just recalibration — make sure you have a documented kill switch before that day comes. Trying to design one mid-incident is how you end up with a Zillow.

What Examiners Actually Want to See

When examiners assess your model risk management program, they’re not looking for the most sophisticated monitoring. They’re looking for defensible execution. Five things:

  1. Documented monitoring plan tied to your model tiering. Not a generic policy — a plan that names the model, the metrics, the thresholds, and the cadence.
  2. Defined thresholds for performance degradation and drift, mapped to the tier.
  3. Escalation procedures when thresholds are breached — who gets notified, who decides, who acts.
  4. Evidence of ongoing execution. Sample monitoring runs from the last 90 days. Not a plan that sits on a shelf.
  5. Outcomes analysis comparing predictions to actuals once ground truth is available.

The order matters. A team that has #4 and #5 but skipped #1 will fail an exam. A team that has #1 and #2 but no evidence of #4 will fail harder. The most common MRA we see in this space is “monitoring framework documented but not consistently executed.”

30/60/90-Day Rollout

If you’re starting from scratch — or rebuilding after an MRA — this is the rollout sequence that’s defensible to model risk and feasible for a small team:

Days 1–30: Foundation

  • Inventory all production models and their current monitoring state
  • Define drift thresholds per model tier (PSI, KS, performance degradation)
  • Implement automated PSI/KS checks on top 5 critical models
  • Create escalation matrix: who gets notified at each threshold
  • Document the monitoring plan for regulatory readiness
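The escalation matrix from the foundation phase can start as something as simple as a lookup keyed on severity. A minimal sketch (role names like "model_owner" and "mrm" and the SLA hours are illustrative placeholders, not prescribed titles):

```python
def escalate(psi_value):
    """Map a PSI reading to severity, recipients, and a response SLA,
    following the stable / investigate / significant bands above."""
    if psi_value > 0.2:
        return {"severity": "significant",
                "notify": ["model_owner", "mrm"],
                "decide": "mrm", "sla_hours": 24}
    if psi_value >= 0.1:
        return {"severity": "investigate",
                "notify": ["model_owner"],
                "decide": "model_owner", "sla_hours": 72}
    return {"severity": "stable", "notify": [],
            "decide": None, "sla_hours": None}
```

Even this much, written down and versioned, answers the examiner's "who gets notified at each threshold" question with evidence rather than intent.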

Days 31–60: Scale

  • Extend automated monitoring to all Tier 1 and Tier 2 models
  • Build a monitoring dashboard with automated alerts
  • Implement fairness metric tracking (disparate impact ratios)
  • Conduct the first monthly monitoring review meeting
  • Set up champion-challenger framework for top models

Days 61–90: Operationalize

  • Complete monitoring coverage for Tier 3 models
  • Run the first triggered re-validation based on a real drift event (or simulated one)
  • Document and test automated retraining triggers
  • Conduct a tabletop exercise: “model gone wrong” scenario
  • Submit the monitoring framework for internal audit review

So What?

Model monitoring is the control that sits between “working model” and “$500 million loss.” Regulators are explicit about it. SR 11-7 demands ongoing validation. The OCC Comptroller’s Handbook makes monitoring a first-class control. The EU AI Act codifies it as a legal requirement.

The good news: you don’t need to build everything from scratch, and you don’t need an enterprise platform on day one. Start with PSI on your critical models, define thresholds that match your tiering, build the escalation path, and prove execution. Do that, and your next exam goes from defense to demonstration.

Ready to put this into a documented program your model risk team can defend? The AI Risk Assessment Template gives you the model inventory, tier definitions, and monitoring documentation structured the way examiners expect — built for risk and compliance teams, not data scientists.

FAQ

How often should I check for model drift?

It depends on the model’s risk tier. Tier 1 models (credit decisioning, fraud, AML) need daily automated monitoring with weekly review. Tier 3 and 4 models can be checked monthly or quarterly. The principle: monitoring cadence should match the magnitude of the decision the model is supporting.

What’s the difference between model drift and model degradation?

Drift is the change in inputs or in the input-to-output relationship. Degradation is the consequence — your model’s performance gets worse because of drift. You monitor for drift to prevent degradation. Drift is the disease; degradation is the symptom.

Do I need a dedicated platform like Fiddler, Arize, or Evidently to do this?

No — you can start with statistical tests (PSI, KS) implemented in Python or R. Purpose-built platforms make sense at scale or when you need real-time alerting and explainability. For most mid-size institutions, custom scripts on critical models plus a platform for the long tail is the practical approach. Don’t let “we don’t have the tooling” be the reason you skip it.
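As an example of "statistical tests in Python without a platform," the two-sample Kolmogorov-Smirnov statistic fits in a few lines of NumPy (this is a sketch of the standard KS statistic; in practice `scipy.stats.ks_2samp` also gives you a p-value):

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum gap between the two
    empirical CDFs, evaluated at every observed value."""
    a, b = np.sort(a), np.sort(b)
    all_values = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, all_values, side="right") / len(a)
    cdf_b = np.searchsorted(b, all_values, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

Pair this with the PSI check on critical models and you have a defensible statistical baseline before buying anything.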

What’s the single biggest model monitoring failure examiners flag?

“Documented but not executed.” Teams write a beautiful monitoring plan and then can’t produce evidence of monthly runs, threshold breaches, or escalations. The plan becomes the finding. Build the plan around what you can actually run, not what looks impressive on paper.

Does this apply to generative AI and LLMs too?

Yes — and the monitoring is harder because outputs are open-ended. The NIST AI Risk Management Framework addresses this through its MEASURE function, and NIST AI 600-1 extends it to GenAI specifically — including hallucination monitoring and output quality tracking. The principle is the same: define what “drift” looks like for your use case, set thresholds, build escalation, prove execution.


Rebecca Leung

Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.
