AI Risk

AI Model Monitoring Template: Drift Detection, Thresholds, KRIs, and Evidence

TL;DR:

  • Zillow lost over $500 million because nobody noticed their pricing model went off the rails — that’s a model monitoring failure, not a modeling failure
  • SR 11-7 and OCC Bulletin 2011-12 are explicit: ongoing monitoring is required, not optional. Examiners will ask how you do it.
  • You don’t need a $200K monitoring platform to start — you need defined thresholds, an escalation path, and proof you’re actually executing the plan
  • This post gives you the monitoring framework, tier-based cadence, and 30/60/90 rollout that examiners expect to see

Who this is for: Model risk managers, compliance officers, and AI governance leads who need to defend a model monitoring program in their next exam — or build one before the exam comes.

Zillow Lost $500 Million Because No One Was Watching the Model

In 2021, Zillow’s home-pricing algorithm couldn’t keep up with a rapidly shifting housing market. The model was making purchase offers based on price forecasts months ahead, but the training data didn’t reflect 2021’s volatility. Zillow wrote down over $500 million, shut down iBuying entirely, and laid off 25% of its workforce.

The model didn’t fail because it was poorly built. It failed because nobody caught the drift in time. The monitoring framework either didn’t exist or wasn’t tied to decisions. The organization kept feeding model outputs into multimillion-dollar purchase calls until the loss was unrecoverable.

This is the gap regulators care about. SR 11-7 says it plainly: “Validation activities should continue on an ongoing basis after a model goes into use.” OCC Bulletin 2011-12 makes monitoring a first-class control. The EU AI Act Article 72 makes post-market monitoring a legal requirement. And every model risk examiner’s first follow-up after seeing your inventory is the same: “Show me how you’re monitoring these.”

If you’re running AI models in production and you don’t have a monitoring framework with defined thresholds, escalation paths, and evidence of execution — you’re not managing risk. You’re waiting for the loss event.

Need the framework to defend this in your next exam? The AI Risk Assessment Template gives you the model inventory, tiering structure, and monitoring documentation examiners expect — built for compliance and risk teams, not data scientists.

The Four Types of Drift Examiners Will Ask About

“Model drift” is a catch-all term, but the control depends on what’s actually drifting. Your monitoring framework needs to cover all four:

| Drift Type | What's Changing | Plain-English Example |
|---|---|---|
| Data drift | Input feature distributions shift | Customer income distributions change after a recession |
| Concept drift | The relationship between inputs and outputs changes | What makes a "good" borrower changes post-pandemic |
| Feature drift | One input shifts independently of the others | A vendor changes how they encode zip codes |
| Prediction drift | The model's output distribution shifts | Approval rate climbs from 62% to 78% with no policy change |

Most monitoring failures happen on concept drift. Data drift is loud — distributions change visibly. Concept drift is silent — the inputs look fine, but what they mean for the prediction has shifted. The Bank of England flagged exactly this during COVID: payment deferrals meant traditional delinquency signals stopped predicting default. The data hadn’t changed shape. The world had.

Your monitoring framework needs detection methods for each type, with thresholds tied to your model tiering. (If you don’t have model tiering yet, start here — without tiering you can’t justify your monitoring cadence.)

What to Actually Monitor (and How Often)

A monitoring framework isn’t useful if it drowns people in metrics. Here’s what matters, organized by cadence:

Daily / Real-Time

| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Prediction distribution | Model output is shifting | PSI > 0.2 vs. 30-day baseline |
| Error rate / accuracy | Model is getting worse | > 2 standard deviations from baseline |
| Data completeness | Missing inputs | > 5% null rate on critical features |
| Volume anomalies | Unusual traffic patterns | > 3 standard deviations from expected |

Weekly

| Metric | What It Tells You | Action Trigger |
|---|---|---|
| Feature-level PSI | Individual inputs drifting | Any feature PSI > 0.2 |
| Segment performance | Model underperforming for subgroups | > 10% degradation vs. overall |
| Fairness metrics | Bias emerging in production | Disparate impact ratio < 0.8 |
| Ground truth reconciliation | Predictions vs. actuals diverging | When outcome labels become available |
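The disparate impact ratio in the weekly checks is simple to compute from production decisions. A minimal sketch (pure Python; the four-fifths threshold of 0.8 matches the table, and the input format is an assumption, not a standard API):

```python
from collections import defaultdict

def disparate_impact_ratios(outcomes, groups):
    """Favorable-outcome rate per group, divided by the highest group's rate.

    outcomes: iterable of 1 (favorable decision, e.g. approved) or 0
    groups:   iterable of group labels, one per record
    Any ratio below 0.8 breaches the four-fifths rule in the table above.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [favorable, total]
    for y, g in zip(outcomes, groups):
        counts[g][0] += y
        counts[g][1] += 1
    rates = {g: fav / tot for g, (fav, tot) in counts.items()}
    best = max(rates.values())
    return {g: r / best for g, r in rates.items()}
```

For example, approval rates of 75% and 25% across two groups produce ratios of 1.0 and 0.33, with the second group flagged for investigation.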

Monthly / Quarterly

| Metric | What It Tells You | Action Trigger |
|---|---|---|
| Backtesting | Historical accuracy trending | Declining Gini, AUC, or KS |
| Stability across time windows | Long-term reliability | Consistent degradation 3+ months |
| Champion-challenger | Better alternatives exist | Challenger outperforms on key metrics |

The PSI threshold cheat sheet (the only statistical test most exam reports actually reference): PSI < 0.1 = stable; 0.1–0.2 = investigate; > 0.2 = significant drift, escalate. Run it on each input feature individually, not just on the prediction output. Feature-level PSI is what isolates which input is drifting and whether it’s a data quality issue or a real population shift.
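PSI is straightforward to implement without a platform. A minimal sketch in Python (binning strategy and epsilon handling are implementation choices, not a standard):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a current sample.

    Bin edges come from the baseline's quantiles, so each bin holds roughly
    equal baseline mass; a small floor avoids log-of-zero on empty bins.
    Interpretation per the cheat sheet: < 0.1 stable, 0.1-0.2 investigate,
    > 0.2 significant drift.
    """
    eps = 1e-4
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # capture out-of-range values
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

Run this per feature against a rolling baseline (e.g. the trailing 30 days for daily checks) and alert on the thresholds above.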

Monitoring Cadence by Model Tier

Your monitoring frequency should match the model’s risk tier. This is the table examiners want to see in your monitoring plan:

| Model Tier | Examples | Monitoring Cadence | Validation Cadence |
|---|---|---|---|
| Tier 1 (Critical) | Credit decisioning, fraud, AML | Daily automated + weekly review | Annual + triggered |
| Tier 2 (Significant) | Pricing, customer segmentation, collections | Weekly automated + monthly review | Annual |
| Tier 3 (Standard) | Marketing propensity, operational efficiency | Monthly automated + quarterly review | Every 18–24 months |
| Tier 4 (Low) | Internal reporting, non-decision support | Quarterly automated | Every 2–3 years |

Who owns what (and what an examiner will ask each one):

  • Model owner / first line: Runs daily/weekly monitoring, escalates breaches, initiates recalibration. Examiner question: “Show me your last drift alert and what you did about it.”
  • Model Risk Management / second line: Sets monitoring standards, reviews results, approves threshold changes. Examiner question: “Show me the documented thresholds and how they map to your tiering.”
  • Internal Audit / third line: Audits whether the monitoring framework itself is working. Examiner question: “When was monitoring effectiveness last independently reviewed?”

When Drift Triggers Recalibration vs. Re-Validation vs. Rebuild

Not every drift event requires rebuilding from scratch. Use this decision framework — it’s also the one to document in your monitoring plan, because examiners will ask “what’s your trigger for re-validation?”

Recalibration (lighter touch):

  • Prediction drift with stable feature distributions
  • PSI 0.1–0.25 on outputs only
  • Performance degradation < 5% from baseline
  • Action: Adjust thresholds or score cutoffs. Document and get sign-off from model risk.

Re-Validation (full cycle):

  • Feature-level PSI > 0.25 on multiple features
  • Concept drift confirmed (ground truth diverges from predictions)
  • Performance degradation > 10%
  • Action: Full re-validation per SR 11-7 — conceptual soundness review, outcomes analysis, updated documentation.

Rebuild / Retrain:

  • Multiple drift types compounding
  • Structural break in the data (post-merger, post-pandemic, new regulation)
  • Action: New model development cycle. Treat as a new model for governance purposes.
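The decision framework above can be encoded directly so first-line reviewers apply it consistently. A hedged sketch (thresholds mirror this post's framework; the signal names and the "monitor" fall-through are illustrative, not a standard taxonomy):

```python
def drift_action(output_psi, feature_psis, perf_drop_pct,
                 concept_drift, structural_break):
    """Map drift signals to recalibrate / re-validate / rebuild.

    output_psi:       PSI on the prediction distribution
    feature_psis:     list of per-feature PSI values
    perf_drop_pct:    performance degradation vs. baseline, in percent
    concept_drift:    True if ground truth confirms inputs-to-outputs shift
    structural_break: True after a merger, pandemic, new regulation, etc.
    """
    drifting_features = sum(p > 0.25 for p in feature_psis)
    if structural_break or (concept_drift and drifting_features >= 2):
        return "rebuild"        # compounding drift or structural break
    if concept_drift or drifting_features >= 2 or perf_drop_pct > 10:
        return "re-validate"    # full SR 11-7 cycle
    if 0.1 <= output_psi <= 0.25 and max(feature_psis, default=0) <= 0.1 \
            and perf_drop_pct < 5:
        return "recalibrate"    # output drift only, features stable
    return "monitor"            # within tolerance; keep watching
```

Documenting the logic this explicitly is also what turns "what's your trigger for re-validation?" into a one-line answer.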

If the model drifts so badly you need an emergency shutdown — not just recalibration — make sure you have a documented kill switch before that day comes. Trying to design one mid-incident is how you end up with a Zillow.

What Examiners Actually Want to See

When examiners assess your model risk management program, they’re not looking for the most sophisticated monitoring. They’re looking for defensible execution. Five things:

  1. Documented monitoring plan tied to your model tiering. Not a generic policy — a plan that names the model, the metrics, the thresholds, and the cadence.
  2. Defined thresholds for performance degradation and drift, mapped to the tier.
  3. Escalation procedures when thresholds are breached — who gets notified, who decides, who acts.
  4. Evidence of ongoing execution. Sample monitoring runs from the last 90 days. Not a plan that sits on a shelf.
  5. Outcomes analysis comparing predictions to actuals once ground truth is available.

The order matters. A team that has #4 and #5 but skipped #1 will fail an exam. A team that has #1 and #2 but no evidence of #4 will fail harder. The most common MRA we see in this space is “monitoring framework documented but not consistently executed.”

30/60/90-Day Rollout

If you’re starting from scratch — or rebuilding after an MRA — this is the rollout sequence that’s defensible to model risk and feasible for a small team:

Days 1–30: Foundation

  • Inventory all production models and their current monitoring state
  • Define drift thresholds per model tier (PSI, KS, performance degradation)
  • Implement automated PSI/KS checks on top 5 critical models
  • Create escalation matrix: who gets notified at each threshold
  • Document the monitoring plan for regulatory readiness
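The escalation matrix from the foundation phase can start as something as simple as a lookup keyed on severity. A minimal sketch (role names like "model_owner" and "mrm" and the SLA hours are illustrative placeholders, not prescribed titles):

```python
def escalate(psi_value):
    """Map a PSI reading to severity, recipients, and a response SLA,
    following the stable / investigate / significant bands above."""
    if psi_value > 0.2:
        return {"severity": "significant",
                "notify": ["model_owner", "mrm"],
                "decide": "mrm", "sla_hours": 24}
    if psi_value >= 0.1:
        return {"severity": "investigate",
                "notify": ["model_owner"],
                "decide": "model_owner", "sla_hours": 72}
    return {"severity": "stable", "notify": [],
            "decide": None, "sla_hours": None}
```

Even this much, written down and versioned, answers the examiner's "who gets notified at each threshold" question with evidence rather than intent.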

Days 31–60: Scale

  • Extend automated monitoring to all Tier 1 and Tier 2 models
  • Build a monitoring dashboard with automated alerts
  • Implement fairness metric tracking (disparate impact ratios)
  • Conduct the first monthly monitoring review meeting
  • Set up champion-challenger framework for top models

Days 61–90: Operationalize

  • Complete monitoring coverage for Tier 3 models
  • Run the first triggered re-validation based on a real drift event (or simulated one)
  • Document and test automated retraining triggers
  • Conduct a tabletop exercise: “model gone wrong” scenario
  • Submit the monitoring framework for internal audit review

So What?

Model monitoring is the control that sits between “working model” and “$500 million loss.” Regulators are explicit about it. SR 11-7 demands ongoing validation. The OCC Comptroller’s Handbook makes monitoring a first-class control. The EU AI Act codifies it as a legal requirement.

The good news: you don’t need to build everything from scratch, and you don’t need an enterprise platform on day one. Start with PSI on your critical models, define thresholds that match your tiering, build the escalation path, and prove execution. Do that, and your next exam goes from defense to demonstration.

Ready to put this into a documented program your model risk team can defend? The AI Risk Assessment Template gives you the model inventory, tier definitions, and monitoring documentation structured the way examiners expect — built for risk and compliance teams, not data scientists.

FAQ

How often should I check for model drift?

It depends on the model’s risk tier. Tier 1 models (credit decisioning, fraud, AML) need daily automated monitoring with weekly review. Tier 3 and 4 models can be checked monthly or quarterly. The principle: monitoring cadence should match the magnitude of the decision the model is supporting.

What’s the difference between model drift and model degradation?

Drift is the change in inputs or in the input-to-output relationship. Degradation is the consequence — your model’s performance gets worse because of drift. You monitor for drift to prevent degradation. Drift is the disease; degradation is the symptom.

Do I need a dedicated platform like Fiddler, Arize, or Evidently to do this?

No — you can start with statistical tests (PSI, KS) implemented in Python or R. Purpose-built platforms make sense at scale or when you need real-time alerting and explainability. For most mid-size institutions, custom scripts on critical models plus a platform for the long tail is the practical approach. Don’t let “we don’t have the tooling” be the reason you skip it.
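As an example of "statistical tests in Python without a platform," the two-sample Kolmogorov-Smirnov statistic fits in a few lines of NumPy (this is a sketch of the standard KS statistic; in practice `scipy.stats.ks_2samp` also gives you a p-value):

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum gap between the two
    empirical CDFs, evaluated at every observed value."""
    a, b = np.sort(a), np.sort(b)
    all_values = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, all_values, side="right") / len(a)
    cdf_b = np.searchsorted(b, all_values, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

Pair this with the PSI check on critical models and you have a defensible statistical baseline before buying anything.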

What’s the single biggest model monitoring failure examiners flag?

“Documented but not executed.” Teams write a beautiful monitoring plan and then can’t produce evidence of monthly runs, threshold breaches, or escalations. The plan becomes the finding. Build the plan around what you can actually run, not what looks impressive on paper.

Does this apply to generative AI and LLMs too?

Yes — and the monitoring is harder because outputs are open-ended. The NIST AI Risk Management Framework addresses this through its MEASURE function, and NIST AI 600-1 extends it to GenAI specifically — including hallucination monitoring and output quality tracking. The principle is the same: define what “drift” looks like for your use case, set thresholds, build escalation, prove execution.


Rebecca Leung

Rebecca Leung has 8+ years of risk and compliance experience across first and second line roles at commercial banks, asset managers, and fintechs. Former management consultant advising financial institutions on risk strategy. Founder of RiskTemplates.
