AI Model Monitoring Template: Drift Detection, Thresholds, KRIs, and Evidence
TL;DR:
- Zillow lost over $500 million because nobody noticed their pricing model went off the rails — that’s a model monitoring failure, not a modeling failure
- SR 11-7 and OCC Bulletin 2011-12 are explicit: ongoing monitoring is required, not optional. Examiners will ask how you do it.
- You don’t need a $200K monitoring platform to start — you need defined thresholds, an escalation path, and proof you’re actually executing the plan
- This post gives you the monitoring framework, tier-based cadence, and 30/60/90 rollout that examiners expect to see
Who this is for: Model risk managers, compliance officers, and AI governance leads who need to defend a model monitoring program in their next exam — or build one before the exam comes.
Zillow Lost $500 Million Because No One Was Watching the Model
In 2021, Zillow’s home-pricing algorithm couldn’t keep up with a rapidly shifting housing market. The model was making purchase offers based on price forecasts months ahead, but the training data didn’t reflect 2021’s volatility. Zillow wrote down over $500 million, shut down iBuying entirely, and laid off 25% of its workforce.
The model didn’t fail because it was poorly built. It failed because nobody caught the drift in time. The monitoring framework either didn’t exist or wasn’t tied to decisions. The organization kept feeding model outputs into multimillion-dollar purchase calls until the loss was unrecoverable.
This is the gap regulators care about. SR 11-7 says it plainly: “Validation activities should continue on an ongoing basis after a model goes into use.” OCC Bulletin 2011-12 makes monitoring a first-class control. The EU AI Act Article 72 makes post-market monitoring a legal requirement. And every model risk examiner’s first follow-up after seeing your inventory is the same: “Show me how you’re monitoring these.”
If you’re running AI models in production and you don’t have a monitoring framework with defined thresholds, escalation paths, and evidence of execution — you’re not managing risk. You’re waiting for the loss event.
Need the framework to defend this in your next exam? The AI Risk Assessment Template gives you the model inventory, tiering structure, and monitoring documentation examiners expect — built for compliance and risk teams, not data scientists.
The Four Types of Drift Examiners Will Ask About
“Model drift” is a catch-all term, but the control depends on what’s actually drifting. Your monitoring framework needs to cover all four:
| Drift Type | What’s Changing | Plain-English Example |
|---|---|---|
| Data drift | Input feature distributions shift | Customer income distributions change after a recession |
| Concept drift | The relationship between inputs and outputs changes | What makes a “good” borrower changes post-pandemic |
| Feature drift | One input shifts independently of the others | A vendor changes how they encode zip codes |
| Prediction drift | The model’s output distribution shifts | Approval rate climbs from 62% to 78% with no policy change |
Most monitoring failures happen on concept drift. Data drift is loud — distributions change visibly. Concept drift is silent — the inputs look fine, but what they mean for the prediction has shifted. The Bank of England flagged exactly this during COVID: payment deferrals meant traditional delinquency signals stopped predicting default. The data hadn’t changed shape. The world had.
Your monitoring framework needs detection methods for each type, with thresholds tied to your model tiering. (If you don’t have model tiering yet, start here — without tiering you can’t justify your monitoring cadence.)
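For data and feature drift on numeric inputs, a two-sample Kolmogorov-Smirnov test is one common detection method. The sketch below is illustrative only: the column names, the 0.05 significance level, and the pandas/scipy stack are assumptions on my part, not requirements from SR 11-7 or any other guidance.

```python
# Illustrative sketch: two-sample KS test per numeric feature.
# Assumptions: pandas DataFrames, numeric columns, alpha = 0.05.
import pandas as pd
from scipy.stats import ks_2samp

def ks_feature_drift(baseline: pd.DataFrame, current: pd.DataFrame,
                     features: list[str], alpha: float = 0.05) -> pd.DataFrame:
    """Compare each feature's current distribution against the baseline window."""
    rows = []
    for col in features:
        stat, p_value = ks_2samp(baseline[col].dropna(), current[col].dropna())
        rows.append({
            "feature": col,
            "ks_statistic": round(float(stat), 4),
            "p_value": round(float(p_value), 4),
            "flag": p_value < alpha,  # small p-value: distributions differ
        })
    return pd.DataFrame(rows).sort_values("ks_statistic", ascending=False)

# Example (hypothetical column names):
# drift_report = ks_feature_drift(train_df, last_week_df, ["income", "dti", "utilization"])
```

On production-scale samples the KS test will flag statistically significant but practically trivial shifts, which is one reason most teams pair it with an effect-size measure like PSI, covered below.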
What to Actually Monitor (and How Often)
A monitoring framework isn’t useful if it drowns people in metrics. Here’s what matters, organized by cadence:
Daily / Real-Time
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Prediction distribution | Model output is shifting | PSI > 0.2 vs. 30-day baseline |
| Error rate / accuracy | Model is getting worse | > 2 standard deviations from baseline |
| Data completeness | Missing inputs | > 5% null rate on critical features |
| Volume anomalies | Unusual traffic patterns | > 3 standard deviations from expected |
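A minimal sketch of how the daily checks above can run as code, assuming a pandas scoring log. The 5% null-rate cutoff and the 2 and 3 standard-deviation cutoffs come straight from the table; every name and data shape here is an illustrative assumption.

```python
# Illustrative sketch of the daily checks in the table above.
# Thresholds mirror the table; names and data shapes are assumptions.
import pandas as pd

def daily_checks(today: pd.DataFrame, critical_features: list[str],
                 todays_error_rate: float, error_history: pd.Series,
                 volume_history: pd.Series) -> list[str]:
    alerts = []
    # Data completeness: > 5% null rate on any critical feature
    for col in critical_features:
        null_rate = today[col].isna().mean()
        if null_rate > 0.05:
            alerts.append(f"{col}: null rate {null_rate:.1%} exceeds 5%")
    # Error rate: > 2 standard deviations from the historical baseline
    if abs(todays_error_rate - error_history.mean()) > 2 * error_history.std():
        alerts.append("error rate > 2 std devs from baseline")
    # Volume anomaly: > 3 standard deviations from expected daily volume
    if abs(len(today) - volume_history.mean()) > 3 * volume_history.std():
        alerts.append("scoring volume > 3 std devs from expected")
    return alerts  # a non-empty list routes through the escalation matrix
```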
Weekly
| Metric | What It Tells You | Action Trigger |
|---|---|---|
| Feature-level PSI | Individual inputs drifting | Any feature PSI > 0.2 |
| Segment performance | Model underperforming for subgroups | > 10% degradation vs. overall |
| Fairness metrics | Bias emerging in production | Disparate impact ratio < 0.8 |
| Ground truth reconciliation | Predictions vs. actuals diverging | Divergence beyond tolerance once outcome labels arrive |
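For the fairness row above, the disparate impact ratio is the favorable-outcome rate for each group divided by the rate for a reference group, with anything below 0.8 flagged (the threshold commonly tied to the four-fifths rule). The sketch below assumes column names ("group", "approved") and a reference-group choice that you would replace with your own.

```python
# Illustrative sketch of the disparate impact ratio from the fairness row above.
# Column names ("group", "approved") and the reference group are assumptions.
import pandas as pd

def disparate_impact(scored: pd.DataFrame, group_col: str = "group",
                     outcome_col: str = "approved",
                     reference_group: str = "reference") -> pd.Series:
    """Favorable-outcome rate of each group divided by the reference group's rate."""
    rates = scored.groupby(group_col)[outcome_col].mean()
    return rates / rates[reference_group]  # flag any group below 0.8 for review
```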
Monthly / Quarterly
| Metric | What It Tells You | Action Trigger |
|---|---|---|
| Backtesting | Historical accuracy trending | Declining Gini, AUC, or KS |
| Stability across time windows | Long-term reliability | Consistent degradation 3+ months |
| Champion-challenger | Better alternatives exist | Challenger outperforms on key metrics |
The PSI threshold cheat sheet (the only statistical test most exam reports actually reference): PSI < 0.1 = stable; 0.1–0.2 = investigate; > 0.2 = significant drift, escalate. Run it on each input feature individually, not just on the prediction output. Feature-level PSI is what isolates which input is drifting and whether it’s a data quality issue or a real population shift.
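If you want to implement the cheat sheet directly, here is a minimal PSI sketch in plain Python. Ten quantile bins, the epsilon smoothing, and the function names are implementation choices I am assuming, not part of any regulatory definition of PSI.

```python
# Illustrative PSI sketch. Ten quantile bins and the epsilon smoothing are
# common implementation choices, not part of any regulatory definition.
import numpy as np
import pandas as pd

def psi(baseline: pd.Series, current: pd.Series, bins: int = 10) -> float:
    base = baseline.dropna()
    # Bin edges come from the baseline so both windows are bucketed identically
    edges = np.unique(np.quantile(base, np.linspace(0, 1, bins + 1)))
    # Clip so current values outside the baseline range land in the end bins
    curr = np.clip(current.dropna(), edges[0], edges[-1])
    base_pct = np.histogram(base, bins=edges)[0] / len(base)
    curr_pct = np.histogram(curr, bins=edges)[0] / len(curr)
    eps = 1e-6  # avoid log(0) and division by zero in sparse bins
    base_pct, curr_pct = base_pct + eps, curr_pct + eps
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

def psi_by_feature(baseline: pd.DataFrame, current: pd.DataFrame,
                   features: list[str]) -> pd.DataFrame:
    report = pd.DataFrame(
        [{"feature": f, "psi": round(psi(baseline[f], current[f]), 3)} for f in features]
    )
    # Status labels follow the cheat sheet: <0.1 stable, 0.1-0.2 investigate, >0.2 escalate
    report["status"] = pd.cut(report["psi"], [-np.inf, 0.1, 0.2, np.inf],
                              labels=["stable", "investigate", "escalate"])
    return report.sort_values("psi", ascending=False)
```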
Monitoring Cadence by Model Tier
Your monitoring frequency should match the model’s risk tier. This is the table examiners want to see in your monitoring plan:
| Model Tier | Examples | Monitoring Cadence | Validation Cadence |
|---|---|---|---|
| Tier 1 (Critical) | Credit decisioning, fraud, AML | Daily automated + weekly review | Annual + triggered |
| Tier 2 (Significant) | Pricing, customer segmentation, collections | Weekly automated + monthly review | Annual |
| Tier 3 (Standard) | Marketing propensity, operational efficiency | Monthly automated + quarterly review | Every 18–24 months |
| Tier 4 (Low) | Internal reporting, non-decision support | Quarterly automated | Every 2–3 years |
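One way to make the cadence table operational is to encode it as configuration that your scheduled monitoring jobs read, so cadence always traces back to the documented tier rather than ad hoc judgment. The tier keys and field names below are illustrative assumptions; the values restate the table.

```python
# Illustrative configuration of the cadence table above. Tier keys and field
# names are assumptions; values restate the table.
MONITORING_CADENCE = {
    "tier_1": {"automated": "daily",     "review": "weekly",    "validation": "annual + triggered"},
    "tier_2": {"automated": "weekly",    "review": "monthly",   "validation": "annual"},
    "tier_3": {"automated": "monthly",   "review": "quarterly", "validation": "every 18-24 months"},
    "tier_4": {"automated": "quarterly", "review": None,        "validation": "every 2-3 years"},
}

def cadence_for(model_tier: str) -> dict:
    """Look up the monitoring and validation cadence a model inherits from its tier."""
    return MONITORING_CADENCE[model_tier]
```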
Who owns what (and what an examiner will ask each one):
- Model owner / first line: Runs daily/weekly monitoring, escalates breaches, initiates recalibration. Examiner question: “Show me your last drift alert and what you did about it.”
- Model Risk Management / second line: Sets monitoring standards, reviews results, approves threshold changes. Examiner question: “Show me the documented thresholds and how they map to your tiering.”
- Internal Audit / third line: Audits whether the monitoring framework itself is working. Examiner question: “When was monitoring effectiveness last independently reviewed?”
When Drift Triggers Recalibration vs. Re-Validation vs. Rebuild
Not every drift event requires rebuilding from scratch. Use this decision framework; it's also the one to document in your monitoring plan, because examiners will ask "what's your trigger for re-validation?" (A code sketch of the triggers follows the criteria below.)
Recalibration (lighter touch):
- Prediction drift with stable feature distributions
- PSI 0.1–0.25 on outputs only
- Performance degradation < 5% from baseline
- Action: Adjust thresholds or score cutoffs. Document and get sign-off from model risk.
Re-Validation (full cycle):
- Feature-level PSI > 0.25 on multiple features
- Concept drift confirmed (ground truth diverges from predictions)
- Performance degradation > 10%
- Action: Full re-validation per SR 11-7 — conceptual soundness review, outcomes analysis, updated documentation.
Rebuild / Retrain:
- Multiple drift types compounding
- Structural break in the data (post-merger, post-pandemic, new regulation)
- Action: New model development cycle. Treat as a new model for governance purposes.
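Here is the sketch referenced above: the triggers expressed as a single function. The argument names and the structural-break flag are assumptions; the numeric cutoffs mirror the criteria listed above, and anything between the bands still goes to model risk for judgment.

```python
# Illustrative sketch of the recalibrate / re-validate / rebuild triggers above.
# Argument names and the structural_break flag are assumptions; cutoffs mirror the lists.
def drift_action(output_psi: float, max_feature_psi: float, features_over_025: int,
                 perf_degradation_pct: float, concept_drift_confirmed: bool,
                 structural_break: bool) -> str:
    # Rebuild: structural break in the data or compounding drift types
    if structural_break or (concept_drift_confirmed and features_over_025 >= 2):
        return "rebuild"
    # Re-validation: multiple drifting features, confirmed concept drift, or >10% degradation
    if features_over_025 >= 2 or concept_drift_confirmed or perf_degradation_pct > 10:
        return "re-validate"
    # Recalibration: output-only drift with stable features and <5% degradation
    if 0.1 <= output_psi <= 0.25 and max_feature_psi < 0.1 and perf_degradation_pct < 5:
        return "recalibrate"
    # Anything between the bands (e.g. 5-10% degradation) goes to model risk for judgment
    return "refer to model risk"
```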
If the model drifts so badly you need an emergency shutdown — not just recalibration — make sure you have a documented kill switch before that day comes. Trying to design one mid-incident is how you end up with a Zillow.
What Examiners Actually Want to See
When examiners assess your model risk management program, they’re not looking for the most sophisticated monitoring. They’re looking for defensible execution. Five things:
- Documented monitoring plan tied to your model tiering. Not a generic policy — a plan that names the model, the metrics, the thresholds, and the cadence.
- Defined thresholds for performance degradation and drift, mapped to the tier.
- Escalation procedures when thresholds are breached — who gets notified, who decides, who acts.
- Evidence of ongoing execution. Sample monitoring runs from the last 90 days. Not a plan that sits on a shelf.
- Outcomes analysis comparing predictions to actuals once ground truth is available.
The order matters. A team that has #4 and #5 but skipped #1 will fail an exam. A team that has #1 and #2 but no evidence of #4 will fail harder. The most common MRA (Matter Requiring Attention) we see in this space is "monitoring framework documented but not consistently executed."
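For item 4, execution evidence, one lightweight approach is to have every monitoring run append an immutable record that an examiner can sample. The JSONL file, field names, and function below are assumptions; many teams land the same record in a database or GRC tool instead.

```python
# Illustrative sketch of execution evidence: every monitoring run appends an
# audit record. File path and field names are assumptions; a database or GRC
# tool is a common alternative destination.
import json
from datetime import datetime, timezone
from pathlib import Path

def log_monitoring_run(model_id: str, metrics: dict, breaches: list[str],
                       action_taken: str, reviewer: str,
                       log_path: str = "monitoring_evidence.jsonl") -> None:
    record = {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "metrics": metrics,            # e.g. {"output_psi": 0.14, "auc": 0.71}
        "threshold_breaches": breaches,
        "action_taken": action_taken,  # "none", "escalated to MRM", "recalibration opened"
        "reviewer": reviewer,
    }
    with Path(log_path).open("a") as f:
        f.write(json.dumps(record) + "\n")
```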
30/60/90-Day Rollout
If you’re starting from scratch — or rebuilding after an MRA — this is the rollout sequence that’s defensible to model risk and feasible for a small team:
Days 1–30: Foundation
- Inventory all production models and their current monitoring state
- Define drift thresholds per model tier (PSI, KS, performance degradation)
- Implement automated PSI/KS checks on top 5 critical models
- Create escalation matrix: who gets notified at each threshold (see the configuration sketch after this list)
- Document the monitoring plan for regulatory readiness
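The escalation matrix item above can live as configuration next to the monitoring code, so notification and decision rights are versioned alongside the thresholds they serve. The severity levels, roles, and response windows below are illustrative assumptions.

```python
# Illustrative escalation matrix as configuration. Severity levels, roles, and
# response windows are assumptions to be replaced with your own tiering.
ESCALATION_MATRIX = {
    "investigate": {  # e.g. PSI 0.1-0.2 or a single data-quality breach
        "notify": ["model_owner"],
        "decide": "model_owner",
        "respond_within": "5 business days",
    },
    "escalate": {     # e.g. PSI > 0.2 or >10% segment degradation
        "notify": ["model_owner", "model_risk_management"],
        "decide": "model_risk_management",
        "respond_within": "2 business days",
    },
    "critical": {     # e.g. confirmed concept drift on a Tier 1 model
        "notify": ["model_owner", "model_risk_management", "cro_office"],
        "decide": "model_risk_management",
        "respond_within": "same day",
    },
}
```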
Days 31–60: Scale
- Extend automated monitoring to all Tier 1 and Tier 2 models
- Build a monitoring dashboard with automated alerts
- Implement fairness metric tracking (disparate impact ratios)
- Conduct the first monthly monitoring review meeting
- Set up champion-challenger framework for top models
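For the champion-challenger item above, the core check is scoring the same recent, outcome-labeled window with both models and comparing a key metric. AUC, the 0.01 materiality margin, and scikit-learn are assumptions on my part; use whichever metric your validation standard names.

```python
# Illustrative champion-challenger check. AUC, the 0.01 margin, and
# scikit-learn are assumptions; use the metric your validation standard names.
from sklearn.metrics import roc_auc_score

def champion_vs_challenger(y_true, champion_scores, challenger_scores,
                           margin: float = 0.01) -> dict:
    champ_auc = roc_auc_score(y_true, champion_scores)
    chall_auc = roc_auc_score(y_true, challenger_scores)
    return {
        "champion_auc": round(champ_auc, 4),
        "challenger_auc": round(chall_auc, 4),
        # A challenger win is a governance trigger (review, re-validation),
        # not an automatic swap into production
        "challenger_wins": chall_auc > champ_auc + margin,
    }
```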
Days 61–90: Operationalize
- Complete monitoring coverage for Tier 3 models
- Run the first triggered re-validation based on a real drift event (or a simulated one)
- Document and test automated retraining triggers
- Conduct a tabletop exercise: “model gone wrong” scenario
- Submit the monitoring framework for internal audit review
So What?
Model monitoring is the control that sits between “working model” and “$500 million loss.” Regulators are explicit about it. SR 11-7 demands ongoing validation. The OCC Comptroller’s Handbook makes monitoring a first-class control. The EU AI Act codifies it as a legal requirement.
The good news: you don’t need to build everything from scratch, and you don’t need an enterprise platform on day one. Start with PSI on your critical models, define thresholds that match your tiering, build the escalation path, and prove execution. Do that, and your next exam goes from defense to demonstration.
Ready to put this into a documented program your model risk team can defend? The AI Risk Assessment Template gives you the model inventory, tier definitions, and monitoring documentation structured the way examiners expect — built for risk and compliance teams, not data scientists.
FAQ
How often should I check for model drift?
It depends on the model’s risk tier. Tier 1 models (credit decisioning, fraud, AML) need daily automated monitoring with weekly review. Tier 3 and 4 models can be checked monthly or quarterly. The principle: monitoring cadence should match the magnitude of the decision the model is supporting.
What’s the difference between model drift and model degradation?
Drift is the change in inputs or in the input-to-output relationship. Degradation is the consequence — your model’s performance gets worse because of drift. You monitor for drift to prevent degradation. Drift is the disease; degradation is the symptom.
Do I need a dedicated platform like Fiddler, Arize, or Evidently to do this?
No — you can start with statistical tests (PSI, KS) implemented in Python or R. Purpose-built platforms make sense at scale or when you need real-time alerting and explainability. For most mid-size institutions, custom scripts on critical models plus a platform for the long tail is the practical approach. Don’t let “we don’t have the tooling” be the reason you skip it.
What’s the single biggest model monitoring failure examiners flag?
“Documented but not executed.” Teams write a beautiful monitoring plan and then can’t produce evidence of monthly runs, threshold breaches, or escalations. The plan becomes the finding. Build the plan around what you can actually run, not what looks impressive on paper.
Does this apply to generative AI and LLMs too?
Yes — and the monitoring is harder because outputs are open-ended. The NIST AI Risk Management Framework addresses this through its MEASURE function, and NIST AI 600-1 extends it to GenAI specifically — including hallucination monitoring and output quality tracking. The principle is the same: define what “drift” looks like for your use case, set thresholds, build escalation, prove execution.
Rebecca Leung
Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.