EAAPL-OBS005 · Model Drift Detection

📊 Observability & Monitoring · 🏭 Field-tested in AU

Pattern ID: EAAPL-OBS005
Status: Proven
Complexity: High
Tags: observability model-risk alerting slo high-complexity
Version: 1.0.0
Last Reviewed: 2026-06-12

1. Executive Summary

AI models degrade silently. Unlike a server that crashes with a clear error, a drifting model continues returning HTTP 200 responses while its outputs become progressively less accurate, more biased, or less relevant. Data drift — changes in the statistical distribution of inputs — and concept drift — changes in the relationship between inputs and desired outputs — are the two primary mechanisms. Without continuous drift monitoring, organisations discover model degradation through business metric decline, customer complaints, or regulatory findings — all lagging indicators that allow harm to accumulate.

This pattern defines a continuous monitoring system for statistical drift in model inputs and outputs across production AI deployments. It covers:

Data drift detection using Kolmogorov-Smirnov tests for continuous features, chi-squared tests for categorical features, and Population Stability Index (PSI) for combined assessment
Concept drift detection through output distribution monitoring, accuracy on labeled holdout sets, and Jensen-Shannon divergence
Reference dataset management with versioning and seasonal adjustment
Drift severity classification (warning/alert/critical)
Automated retraining triggers on critical drift
Visualisation dashboards for per-feature drift over time
Integration with model registries
The critical distinction between benign drift (seasonal patterns, legitimate distribution shift) and harmful drift (data quality degradation, adversarial shift, world-state change invalidating model assumptions)

Target Audience: CIO, CTO, Chief Risk Officer, Head of AI/ML Engineering, Model Risk Manager
Time to Implement: 8–14 weeks

2. Problem Statement

Business Problem

Organisations deploy AI models and assume they will continue performing as they did in testing. They don't. The world changes, user behaviour changes, data pipelines evolve, and model performance degrades. Most organisations have no systematic mechanism for detecting this until a material business event forces attention — a regulatory finding, a wave of customer complaints, an unexpected decline in conversion or retention. At that point, the degradation may have been occurring for months.

Technical Problem

Drift detection requires statistical comparison of production data distributions against reference baselines, at scale, in near-real-time. The challenge is multi-dimensional: many ML models have hundreds or thousands of features; each requires its own statistical test; feature drift does not always imply output degradation; and the relationship between measured drift and model performance impact is non-linear. Additionally, distinguishing harmful drift from legitimate distribution changes (new product launches, seasonal patterns, geographic expansion) requires both statistical and domain knowledge.

Symptoms

Model deployed in January performs well; by June, customer satisfaction with AI features has quietly declined
Business metrics (task completion rate, recommendation click-through) declining with no engineering change attributed
RAG retrieval quality degraded because the vector index was built on stale embeddings but no monitoring detected this
Model retrained annually on schedule, not triggered by evidence of performance degradation
Regulatory review reveals model was trained on data no longer representative of current customer base

Cost of Inaction

Average enterprise AI model degrades measurably within 6 months of deployment without monitoring (Gartner 2024)
Regulatory findings for material models lacking performance monitoring (APRA CPG 234, EU AI Act Article 9)
Silent accuracy regression in credit scoring, fraud detection, or clinical triage has direct financial and safety consequences
Unnecessary scheduled retraining (without drift evidence) wastes compute and introduces regression risk from needless model changes

3. Context

When to Apply

Any production ML model with a defined performance baseline and ongoing inference traffic
AI systems where input distributions are expected to be stable (any significant change is an anomaly)
Models used for regulated decisions (credit, fraud, clinical, underwriting) requiring ongoing performance evidence
RAG systems where retrieval quality depends on embedding models that may become stale
Prerequisite: EAAPL-OBS001 provides the input/output data stream required for drift computation

When NOT to Apply

One-off batch models with no ongoing deployment
Purely generative tasks (creative writing) where output distribution monitoring is not meaningful
Models retrained continuously (online learning) where the model itself is always adapting — drift is by design

Prerequisites

Prerequisite	Required	Notes
EAAPL-OBS001 AI Telemetry Infrastructure	Required	Input feature logging and output logging required
Reference baseline dataset (labeled)	Required	Drift comparison requires a reference distribution
Model registry with versioned metadata	Required	Drift events must link to model versions
Statistical compute runtime	Required	Python scipy; PySpark for high-volume feature sets
Model performance ground truth mechanism	Strongly Recommended	Without labels, concept drift detection is indirect

Industry Applicability

Industry	Applicability	Primary Driver
Financial Services	Critical	APRA CPG 234, ASIC model risk, credit/fraud model degradation
Healthcare	Critical	Clinical model safety obligation; performance monitoring mandatory
Insurance	Critical	Underwriting model accuracy directly impacts financial outcomes
Retail / E-Commerce	High	Recommendation and personalisation models degrade with catalogue changes
Government	High	Decision-support models require ongoing performance evidence
Technology / SaaS	High	RAG freshness; NLP model drift with language evolution

4. Architecture Overview

The Model Drift Detection Architecture is a statistical monitoring system that operates asynchronously on the production inference data stream. It is composed of five functional layers: data capture, reference management, statistical analysis, severity classification, and action triggering.

Data Capture Layer

Every AI inference request logs the input features and the model output. For structured ML models, input features are captured in the telemetry log. For LLM and RAG systems, proxies are used: prompt token count, query type classification (derived from prompt metadata), retrieved document distribution, and output characteristics (length, entropy, sentiment score). The data capture layer feeds both real-time streaming analysis (for rapid drift detection on output distributions) and batch statistical analysis (for feature-level drift computation, which requires sufficient data volume for statistical power).
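As an illustration of the proxy signals described above for LLM and RAG systems, the following minimal sketch derives a few of them from a prompt and its output. The function names and the whitespace tokenisation are assumptions made for this example; a production logger would normally use the model's own tokenizer.

```python
import math
from collections import Counter


def token_entropy(text: str) -> float:
    """Shannon entropy (bits) of the whitespace-token distribution in text."""
    tokens = text.split()
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def llm_proxy_features(prompt: str, output: str) -> dict:
    """Hypothetical proxy record logged in place of structured input features."""
    return {
        "prompt_token_count": len(prompt.split()),  # whitespace tokens (assumption)
        "output_length": len(output.split()),
        "output_entropy": token_entropy(output),
    }
```

Each proxy can then be treated as a numerical feature and monitored with the same statistical tests as structured features.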

Reference Dataset Management

The reference dataset defines the expected distribution. It is not a static artifact — it must be managed actively. The reference is versioned: each model version has its own reference distribution. The reference is updated when a new model version is deployed (new baseline from evaluation data) or when a known distribution shift occurs (e.g., product launch changing customer demographic) and the shift is deemed legitimate. Reference datasets are stored in the model registry alongside model artifacts. A reference management API allows data scientists to approve reference updates; unapproved reference changes are blocked and alerted.

Statistical Analysis: Data Drift

Data drift is detected at the feature level. For each numerical feature, the Kolmogorov-Smirnov (KS) test compares the current production distribution against the reference distribution. KS test statistic and p-value are computed. A p-value < 0.05 with KS statistic > 0.10 indicates statistically significant drift. For categorical features, the chi-squared test compares observed vs. expected frequency distributions. For composite drift assessment, the Population Stability Index (PSI) is computed per feature: PSI = sum over bins of (actual% - expected%) × ln(actual% / expected%). PSI < 0.10 is stable (no concern), 0.10–0.25 is moderate drift (warning), > 0.25 is significant drift (alert). The overall drift index aggregates per-feature PSI scores weighted by feature importance.
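The tests described above can be sketched with NumPy and SciPy as follows. This is a minimal illustration, not a reference implementation: the function names, the quantile-based binning, and the `eps` floor for empty bins are assumptions, and the chi-squared check is written as a test of homogeneity on a two-row contingency table, which is one common way to compare two observed categorical samples.

```python
import numpy as np
from scipy import stats


def ks_drift(reference, current, alpha=0.05, min_stat=0.10):
    """Two-sample KS test; drift is flagged only when both thresholds are crossed."""
    result = stats.ks_2samp(reference, current)
    drifted = bool(result.pvalue < alpha and result.statistic > min_stat)
    return result.statistic, result.pvalue, drifted


def chi2_drift(ref_counts, cur_counts, alpha=0.05):
    """Chi-squared test on category counts (same category order in both vectors).

    Categories with zero count in both vectors must be dropped by the caller.
    """
    table = np.array([ref_counts, cur_counts])
    chi2, p_value, _dof, _expected = stats.chi2_contingency(table)
    return chi2, p_value, bool(p_value < alpha)


def psi(reference, current, bins=10, eps=1e-6):
    """PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)."""
    reference = np.asarray(reference)
    current = np.asarray(current)
    # Interior edges from reference quantiles; values beyond them land in the end bins.
    inner = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))[1:-1]
    expected = np.bincount(np.searchsorted(inner, reference), minlength=bins) / len(reference)
    actual = np.bincount(np.searchsorted(inner, current), minlength=bins) / len(current)
    # Floor the proportions so empty bins do not produce log(0) (assumption).
    expected = np.clip(expected, eps, None)
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))


def overall_drift_index(psi_by_feature, importance):
    """Importance-weighted mean of per-feature PSI scores."""
    total = sum(importance[name] for name in psi_by_feature)
    return sum(psi_by_feature[name] * importance[name] for name in psi_by_feature) / total
```

Binning on reference quantiles keeps each expected bin at roughly equal mass, which avoids near-empty reference bins dominating the PSI sum.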

Statistical Analysis: Concept Drift

Concept drift, where the relationship between inputs and the desired output changes, is harder to detect without labels. Three complementary approaches are used.

Output distribution monitoring: track the distribution of model outputs (predicted class distribution for classifiers; output length and vocabulary distribution for LLMs). Jensen-Shannon divergence between current and reference output distributions is computed. Significant shifts in output distribution without corresponding input drift suggest concept drift.

Accuracy on labeled holdout: a holdout set with human-labeled ground truth is evaluated periodically against the current model. Because an unchanged model scores identically on an unchanged dataset, the holdout needs regular additions of recently labeled production samples; declining accuracy on those samples while the input distribution is stable indicates concept drift.

Error rate trend monitoring: for models with feedback mechanisms, track error rate (user corrections, thumbs down, escalations) as a proxy for accuracy.
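A minimal sketch of the Jensen-Shannon comparison on output distributions, assuming the outputs have already been reduced to counts per class or per bucket (the function name is an assumption). Note that SciPy's `jensenshannon` returns the JS distance, which is the square root of the divergence, so the result is squared here.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon


def js_divergence(ref_counts, cur_counts) -> float:
    """JS divergence in bits (0 = identical, 1 = disjoint) between two count vectors."""
    # jensenshannon normalises both vectors to probabilities before comparing them.
    distance = jensenshannon(np.asarray(ref_counts, dtype=float),
                             np.asarray(cur_counts, dtype=float),
                             base=2)
    return float(distance ** 2)
```

For a classifier, `ref_counts` and `cur_counts` would be the predicted-class counts in the reference window and the current rolling window, listed in the same class order.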

Drift Severity Classification

Warning: PSI 0.10–0.25 on one or more features, or JS divergence increase of 0.05–0.10 on output. No immediate action; increased monitoring frequency; notify ML engineer.

Alert: PSI > 0.25 on important features, or JS divergence increase > 0.10, or holdout accuracy drop > 5%. Schedule retraining review within 2 weeks.

Critical: PSI > 0.50, or holdout accuracy drop > 10%, or error rate 2x baseline. Trigger automated retraining pipeline immediately; page ML engineer on-call; notify model risk manager.
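The severity rules above can be expressed as a small rule function. This is a sketch under stated assumptions: the `DriftSignals` container and its field names are hypothetical, accuracy drop is treated as a fraction (0.05 for 5%), and a PSI above 0.25 on a feature that is not flagged as important is left at warning level because the rules do not say otherwise.

```python
from dataclasses import dataclass


@dataclass
class DriftSignals:
    max_psi: float                 # highest PSI across all monitored features
    max_psi_important: float       # highest PSI across features flagged as important
    jsd_increase: float = 0.0      # rise in output JS divergence over baseline
    accuracy_drop: float = 0.0     # holdout accuracy drop as a fraction
    error_rate_ratio: float = 1.0  # current error rate divided by baseline


def classify_severity(s: DriftSignals) -> str:
    """Return stable, warning, alert, or critical; the most severe match wins."""
    if s.max_psi > 0.50 or s.accuracy_drop > 0.10 or s.error_rate_ratio >= 2.0:
        return "critical"
    if s.max_psi_important > 0.25 or s.jsd_increase > 0.10 or s.accuracy_drop > 0.05:
        return "alert"
    if s.max_psi >= 0.10 or s.jsd_increase >= 0.05:
        return "warning"
    return "stable"
```

Checking the critical rules first means a single severe signal is never masked by milder readings on other signals.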

Automated Retraining Trigger

Critical drift triggers the automated retraining pipeline. The trigger event is published to the model registry, which kicks off the organisation's standard model retraining workflow. The retraining pipeline uses the current production data (within the retention window) as training data, trains a new model version, evaluates against the holdout set, and if quality improves or is maintained, submits for deployment review. Human sign-off is required before the new model version is promoted to production.
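The trigger and the quality gate described above might be reduced to two small decisions, sketched below. The function names and return labels are hypothetical, and the routing of potentially benign drift to review follows the benign drift rule described elsewhere in this pattern.

```python
def should_trigger_retraining(severity: str, benign_status: str) -> bool:
    """Start retraining automatically only for critical drift with no known-event match."""
    return severity == "critical" and benign_status != "potentially_benign"


def candidate_next_step(candidate_score: float, production_score: float) -> str:
    """Gate a retrained candidate on the holdout set.

    Passing the gate only submits the candidate for review; promotion to
    production still requires human sign-off.
    """
    if candidate_score >= production_score:
        return "submit_for_deployment_review"
    return "reject_candidate"
```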

Benign vs. Harmful Drift Classification

Not all drift is harmful. Seasonal patterns (retail models drifting at Christmas), known distribution shifts from product changes (new customer segment acquired), or deliberate training data diversity expansion are benign. The benign drift classifier consults a calendar of known events (product launches, campaigns, data pipeline changes) and applies a rule: if drift onset correlates with a known event within a 3-day window, classify as potentially benign and route to ML engineer review rather than auto-triggering retraining.
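The 3-day correlation rule can be sketched as a simple calendar lookup. The event list format (name and date pairs) and the function name are assumptions for illustration; a real deployment would read from the organisation's event calendar API.

```python
from datetime import date, timedelta


def classify_drift_onset(onset: date, known_events, window_days: int = 3):
    """Return (label, matched event names) for a drift onset date.

    known_events is an iterable of (name, date) pairs from the events calendar.
    """
    window = timedelta(days=window_days)
    matches = [name for name, day in known_events if abs(onset - day) <= window]
    if matches:
        # Routed to ML engineer review rather than auto-triggering retraining.
        return "potentially_benign", matches
    return "unknown", []
```

Returning the matched event names gives the rationale that the benign drift classification log needs to record.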

5. Architecture Diagram

flowchart TD
    subgraph Capture["Data Capture"]
        A[Production Inference]
        B[(Feature Log Store)]
        C[Reference Dataset]
    end
    subgraph Analysis["Drift Analysis"]
        D[Data Drift Tests]
        E[Concept Drift Monitor]
        F[Benign Drift Classifier]
    end
    subgraph Action["Severity and Action"]
        G{Severity Classifier}
        H[Retraining Pipeline]
    end
    A --> B
    B --> D
    C --> D
    A --> E
    D --> F
    E --> F
    F --> G
    G -->|warning| I[Dashboard Alert]
    G -->|critical| H
    H --> J[Human Sign-Off]
    J -->|approved| C
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#fef9c3,stroke:#eab308
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f3e8ff,stroke:#a855f7
    style H fill:#fee2e2,stroke:#ef4444
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Feature Logger	SDK / Sidecar	Capture input features and output at inference time; forward to feature log store	Custom wrapper; Arize AI; WhyLabs; Evidently AI agent	Critical
Feature Log Store	Storage	Time-series storage of production inference features	ClickHouse, BigQuery, Apache Hudi/Delta Lake on S3	Critical
KS Test Processor	Batch Job	Kolmogorov-Smirnov test on numerical feature distributions	Python scipy.stats; PySpark on Databricks/Glue	High
Chi-Squared Processor	Batch Job	Chi-squared test on categorical feature distributions	Python scipy.stats; PySpark	High
PSI Calculator	Batch Job	Population Stability Index per feature and aggregate	Python/Spark; Evidently AI; WhyLabs	High
Jensen-Shannon Divergence Engine	Streaming + Batch	JS divergence on output distributions vs. reference	Python scipy; Flink streaming	High
Accuracy Monitor	Batch Job	Evaluate current model on fixed holdout set periodically	MLflow evaluation; custom Python script	High
Reference Dataset Store	Storage	Versioned reference distributions per model version	S3/GCS/Azure Blob + DVC; MLflow artifacts	Critical
Benign Drift Classifier	Service	Correlate drift onset with known events calendar	Custom rule engine + event calendar API	Medium
Retraining Pipeline Trigger	Integration	Publish critical drift event to retraining workflow	Airflow/Prefect sensor; MLflow webhook; Kubeflow trigger	High
Drift Dashboard	UI	Per-feature drift over time; severity summary; trend	Grafana; Evidently AI UI; WhyLabs; custom React app	Medium
Model Registry Integration	Integration	Link drift events to model versions; trigger review workflow	MLflow, SageMaker Model Registry, Vertex AI Model Registry	High

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Feature Logger	Captures input features and model output at every inference call	Feature log record with modelId, modelVersion, features{}, output, timestamp
2	Feature Log Store	Ingests and indexes feature records; enables time-window queries	Queryable time-series feature data
3	KS / Chi-Squared / PSI Processors	Run hourly batch analysis on 1-hour window vs. reference distribution	Per-feature drift scores; PSI values; test statistics and p-values
4	JSD Engine	Computes JS divergence on output distribution in rolling window	Output distribution divergence score
5	Accuracy Monitor	Evaluates model on holdout set (daily or per-deployment)	Accuracy, F1, or task-specific quality metrics
6	Severity Classifier	Applies severity rules to aggregate drift signals	Severity label: stable / warning / alert / critical
7	Benign Drift Classifier	Checks drift onset against known events calendar	Benign / unknown classification with rationale
8	Action Router	Routes by severity and benign classification to appropriate action	Alert, scheduled review, or retraining trigger
9	Retraining Pipeline (if critical)	Initiates model retraining on recent production data	New model version submitted for review
10	Human Reviewer	Reviews new model version quality; approves or rejects promotion	Approval decision; model registry updated
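The feature log record produced in step 1 could look like the sketch below. The field names `modelId`, `modelVersion`, `features`, `output`, and `timestamp` come from the table above; the dataclass, the ISO 8601 UTC timestamp, and the JSON-lines serialisation are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now() -> str:
    """Current time as an ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class FeatureLogRecord:
    modelId: str
    modelVersion: str
    features: dict
    output: object
    timestamp: str = field(default_factory=_utc_now)


def to_json_line(record: FeatureLogRecord) -> str:
    """Serialise one record as a single JSON line for the feature log store."""
    return json.dumps(asdict(record), sort_keys=True)
```

Carrying `modelVersion` on every record is what lets drift jobs select the matching reference distribution for each window.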

Error Flow

Error Scenario	Detection	Action	Recovery
Feature log store query times out	Batch job failure alert; lag metric	Alert ML platform; skip batch; run catch-up	Investigate store performance; run catch-up analysis
Reference distribution missing for new model version	Drift job raises missing reference error	Alert to ML engineer; skip drift computation until reference set	ML engineer creates reference on model deployment
Benign drift classifier incorrectly clears harmful drift	Accuracy holdout detects concurrent quality decline	Accuracy alert overrides benign classification; escalate	Investigate; tune benign classifier; enforce dual confirmation
Retraining pipeline fails	Pipeline failure alert; Airflow/Prefect failure	Alert ML engineer; manual retraining trigger	Fix pipeline; retry; monitor new model version
Holdout set becomes stale (labels no longer representative)	Holdout accuracy diverges from production feedback	Alert to ML team; schedule holdout refresh	Refresh holdout with new labels

8. Security Considerations

Authentication: Feature log store access requires service authentication. Drift analysis jobs authenticate via service accounts. Reference dataset store access is write-restricted to approved ML engineers; reads are available to drift analysis services.

Authorisation: Feature log data may contain sensitive model inputs (e.g., credit application features, health data proxies). Access to feature logs is restricted to ML engineers and data scientists with specific model ownership. All accesses are recorded in an audit log.

Secrets Management: Cloud storage credentials for the feature log store and the reference dataset store are held in a secrets manager. Retraining pipeline trigger credentials are rotated quarterly.

Data Classification: Feature logs are classified at the level of the most sensitive input feature (often Confidential for financial or health models). Reference datasets are classified as Internal. Drift event records are classified as Internal.

Encryption: Feature log data encrypted at rest (AES-256) and in transit (TLS 1.3). Reference dataset store encrypted. Long-term retention of feature logs for regulatory audit requires customer-managed encryption keys.

Auditability: Every reference dataset update is audited with requester, approver, timestamp, and rationale. Every retraining trigger event is logged immutably. Benign drift classifications are logged with the event calendar entry they matched.

OWASP LLM Top 10 Coverage

OWASP LLM Risk	Drift Detection Control	Implementation
LLM01 Prompt Injection	Prompt length and structure distribution drift detects systematic injection patterns	Distribution shift in prompt structure is drift signal
LLM02 Insecure Output Handling	Output distribution monitoring detects systematic output changes from injection	JSD alert if output distribution shifts toward unsafe patterns
LLM03 Training Data Poisoning	Feature distribution drift may indicate poisoned training affecting production distribution	Input drift concurrent with accuracy decline = poisoning signal
LLM04 Model Denial of Service	Token usage distribution drift detects abusive usage patterns	Token count distribution shift = anomaly signal
LLM05 Supply Chain Vulnerabilities	Unexpected model version in registry triggers investigation	Model version audit in drift monitoring
LLM06 Sensitive Information Disclosure	Input feature drift monitoring may detect feature set changes that introduce PII	Alert on new feature categories appearing in feature distribution
LLM07 Insecure Plugin Design	Tool call distribution monitoring detects shifts in tool usage patterns	Tool call frequency is a monitored distribution
LLM08 Excessive Agency	Agent action distribution drift detects scope expansion	Output action type distribution monitored
LLM09 Overreliance	Accuracy monitoring surfaces model quality degradation before users over-rely on degraded outputs	Accuracy SLO directly measures overreliance risk
LLM10 Model Theft	Unusual output volume distribution shift may indicate bulk extraction	Output volume distribution is a monitored signal

9. Governance Considerations

Responsible AI: Drift monitoring is the technical implementation of the principle that AI systems must perform as intended over their operational lifecycle, not only at deployment. Governance frameworks must mandate drift monitoring as a condition of continued production deployment for material models.

Model Risk Management: The drift event history is a key model risk management artefact. Material models must have a documented drift monitoring configuration reviewed by model risk. Critical drift events are Key Risk Indicator (KRI) breaches reported to the model risk committee.

Human Approval: All retraining decisions triggered by drift require human approval before deployment. The retraining pipeline produces a candidate model; an ML engineer and model risk reviewer approve promotion. Automated promotion without human review is not permitted for material models.

Policy: The model drift monitoring policy must define: which models require drift monitoring (materiality threshold), required monitoring frequency, reference dataset update criteria and approval process, drift severity thresholds, retraining trigger criteria, and escalation requirements for critical drift.

Traceability: Every drift event is linked to the model version, the reference dataset version, the statistical test result, and the action taken. This chain supports model risk management audit trails and regulatory evidence production.

Governance Artefacts

Artefact	Owner	Frequency	Format
Model Drift Monitoring Register	Model Risk Manager	Per model, updated on drift events	Registry with model, monitoring config, last assessment
Drift Event Log	ML Platform	Continuous	Immutable event store
Reference Dataset Approval Record	ML Engineering + Model Risk	Per update	Signed approval with rationale
Critical Drift KRI Report	Model Risk Manager	Monthly	Dashboard export + risk committee briefing
Retraining Decision Record	ML Engineering + Model Risk	Per retraining trigger	Signed decision with drift evidence and new model evaluation
Benign Drift Classification Log	ML Engineering	Per benign classification	Event log with matched calendar entry and rationale

10. Operational Considerations

Monitoring: The drift detection system is itself monitored. Batch job completion, processing lag, reference dataset freshness, and holdout evaluation frequency are all tracked. A drift detection system that hasn't run in 48 hours is as dangerous as a smoke alarm with a dead battery.

Logging: Drift analysis job logs are stored separately from AI application logs. Drift event records are immutable.

Incident Response: Critical drift triggers the AI incident management process (EAAPL-OBS004) with a P1 quality incident. ML engineer on-call is paged. The retraining trigger is a parallel action, not a substitute for incident response — the current model may need to be limited or disabled while retraining proceeds.

Disaster Recovery: Drift detection is not in the critical inference path. A 4-hour outage of the drift detection system is acceptable. The risk is undetected drift during the outage window. Batch jobs can run catch-up analysis when the system recovers.

Capacity Planning: Feature log storage grows with inference volume. For a model with 100 features and 1M daily requests, each log record is approximately 1–5KB; total daily storage is 1–5GB. Plan for 90-day retention in hot storage and 2-year retention in warm storage for regulatory audit.
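The storage arithmetic above can be reproduced with a short helper, which also makes it easy to rerun for other volumes. The function name is an assumption, and decimal units (1 GB = 1,000,000 KB) are used; compression and downsampling are ignored.

```python
def storage_gb(daily_requests: int, record_kb: float, retention_days: int = 1) -> float:
    """Raw feature log volume in GB for the given retention period."""
    return daily_requests * record_kb * retention_days / 1_000_000


# 1M daily requests at 1 to 5 KB per record:
daily = (storage_gb(1_000_000, 1), storage_gb(1_000_000, 5))           # 1 to 5 GB per day
hot = (storage_gb(1_000_000, 1, 90), storage_gb(1_000_000, 5, 90))     # 90 to 450 GB hot
warm = (storage_gb(1_000_000, 1, 730), storage_gb(1_000_000, 5, 730))  # 730 to 3650 GB warm
```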

SLO Table

SLO	Target	Measurement	Alert Threshold
Drift detection freshness	Analysis runs within 2 hours of schedule	Job completion timestamp	> 4 hours behind schedule
Critical drift alert time	< 30 minutes from breach to alert	Alert delivery timestamp vs. PSI breach	> 60 minutes
Holdout accuracy evaluation	Daily for material models	Evaluation job completion log	> 48 hours since last evaluation
Reference dataset freshness	Updated within 5 days of model version deployment	Reference update timestamp vs. model deployment	> 7 days stale

Disaster Recovery Table

Component	RTO	RPO	Recovery Approach
Feature Log Store	30 minutes	1 hour	Replicated storage; catch-up analysis on recovery
Drift Analysis Jobs	4 hours	4 hours (catch-up)	Re-run batch jobs for missed windows
Reference Dataset Store	30 minutes	Near-zero	Replicated object storage
Drift Dashboard	60 minutes	N/A (read-only)	Redeploy dashboard from version control

11. Cost Considerations

Cost Drivers

Driver	Description	Relative Cost
Feature log storage	1–5KB per inference record; 90-day hot retention	High at scale
Statistical analysis compute (Spark/Glue)	Hourly batch jobs on feature data; scales with feature count and volume	Medium
Holdout accuracy evaluation	Model inference cost on holdout set (daily)	Low to Medium
JSD streaming computation	Real-time output distribution monitoring; minimal compute	Low
Reference dataset storage	Relatively small; versioned reference distributions	Low

Scaling Risks: Feature log storage is the primary scaling cost. At 10M+ daily inferences with 100+ features, storage costs can exceed $10K/month without optimisation. Use columnar compression (Parquet) and aggressive downsampling for older data.

Optimisations:

Store feature summaries (histogram buckets) rather than raw feature values for high-volume models
Run statistical tests on stratified samples (10K records sufficient for KS test statistical power) rather than full population
Use serverless compute (AWS Glue, BigQuery) to eliminate idle compute costs between batch windows
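The first optimisation, storing histogram buckets instead of raw values, might be sketched as follows. The bucket edges are assumed to be fixed per feature when the reference is created, and the function names and `eps` floor are assumptions. Because bucket counts are additive, hourly summaries can be summed into longer windows without revisiting raw data.

```python
import numpy as np


def summarise(values, edges):
    """Counts per bucket; values outside the edges fall into the end buckets."""
    inner = np.asarray(edges)[1:-1]
    return np.bincount(np.searchsorted(inner, values), minlength=len(edges) - 1)


def psi_from_counts(ref_counts, cur_counts, eps=1e-6):
    """PSI computed from stored bucket counts rather than raw feature values."""
    expected = np.clip(np.asarray(ref_counts) / np.sum(ref_counts), eps, None)
    actual = np.clip(np.asarray(cur_counts) / np.sum(cur_counts), eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))
```

The trade-off is that tests needing raw samples, such as the KS test, can no longer be run on summarised windows, so this suits models where PSI on fixed buckets is an acceptable primary signal.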

Indicative Cost Range

Scale	Daily Inferences	Estimated Drift Detection Cost/Month
Small	10,000	$200–$600
Medium	500,000	$1,500–$4,000
Large	5,000,000	$5,000–$15,000
Enterprise	50,000,000+	$20,000–$60,000 (with summarisation optimisation)

12. Trade-Off Analysis

Approach Comparison

Approach	Pros	Cons	Best For
Full feature-level drift monitoring (KS + Chi-Sq + PSI)	Precise; identifies which feature is drifting; enables targeted remediation	High compute and storage; requires feature access (not always available for LLMs)	Structured ML models with well-defined feature sets; regulated decisions
Output-only distribution monitoring (JSD on outputs)	Minimal infrastructure; no feature logging required; applicable to LLMs	Detects that something changed but not what; concept drift only, not data drift	LLM and generative systems; quick-start implementation
Human-labeled holdout evaluation only	Highest accuracy; directly measures real performance	Slow (labels take time); samples a small fraction of production	High-risk decisions where detection accuracy is paramount; complement to automated methods

Architectural Tensions

Tension	Description	Resolution
Sensitivity vs. False Alarms	Low thresholds detect early drift but generate false alarms that erode trust	PSI 0.10 warning (no page), 0.25 alert (notify), 0.50 critical (page) — graduated response
Feature granularity vs. Cost	Per-feature monitoring is precise but expensive at scale	Monitor all features for regulated models; monitor key features only for lower-risk models
Detection speed vs. Statistical power	Very fast detection requires small windows with low statistical power	Accept 1-hour minimum window for KS/Chi-sq; use streaming output monitoring for faster preliminary signal
Automation vs. Human oversight	Automated retraining is fast but may introduce new problems	Automated trigger only; human must approve new model promotion

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Reference dataset not updated after model deployment	High	High (drift false alarms or misses)	Reference freshness SLO alert	Enforce reference update as deployment gate
Benign classifier clears actual harmful drift	Medium	High (harmful drift not actioned)	Accuracy holdout detects concurrent quality decline	Require dual signal for benign classification; accuracy must not decline
Feature logger adds significant latency	Low	High (performance degradation)	Feature logger latency SLO breach	Async logging; decouple from inference path
Statistical test fails with insufficient data	High (for low-volume models)	Medium (no drift detection)	Test failure log; minimum data check	Enforce minimum sample size before running tests; alert if sample size insufficient
Retraining pipeline introduces regression	Medium	High (new model worse)	New model evaluation before promotion	Holdout gate in retraining pipeline; manual sign-off required

Cascading Scenarios

Scenario 1: Reference dataset never updated after major product launch → all features show critical drift → retraining triggered repeatedly → new models trained on shifted distribution → performance remains poor → wasted retraining compute and model risk review cycles. Mitigation: approved reference update is required within 5 days of known distribution shifts.
Scenario 2: Feature logger fails silently → feature store contains stale data → drift detection runs on old data → no drift detected → actual drift goes undetected for weeks. Mitigation: feature store freshness SLO; alert if no new records ingested in 30 minutes.
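The freshness mitigation in Scenario 2 amounts to comparing the newest ingested record against a maximum gap. A minimal sketch, assuming the caller can query the latest ingestion timestamp from the feature store (the function name is hypothetical):

```python
from datetime import datetime, timedelta, timezone


def feature_store_is_stale(last_ingested_at: datetime,
                           now: datetime,
                           max_gap: timedelta = timedelta(minutes=30)) -> bool:
    """True when no new feature records have arrived within max_gap."""
    return now - last_ingested_at > max_gap
```

Running this check independently of the drift jobs matters, since a drift job reading stale data will complete successfully and report no drift.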

14. Regulatory Considerations

Regulation	Clause	Requirement	Drift Detection Implementation
APRA CPG 234	Section 6 (Model Risk)	Material models require ongoing performance monitoring and validation	Drift detection implements continuous monitoring; critical drift triggers validation workflow
APRA CPG 234	Section 8 (Model Review)	Annual (minimum) or event-triggered model review	Critical drift event triggers model review; documentation provided by this pattern
EU AI Act	Article 9.5 (Testing)	Ongoing testing to identify appropriate risk management measures for high-risk AI	Holdout evaluation and drift detection implement ongoing testing requirement
EU AI Act	Article 9.7 (Automatically Generated Logs)	High-risk AI must keep logs enabling verification of compliance	Drift event log with timestamps, metrics, and actions is the compliance verification record
ISO/IEC 42001	Clause 9.1 (Monitoring, Measurement, Analysis)	AI system performance must be continuously monitored against objectives	Drift detection implements ISO 42001 clause 9.1 at technical layer
NIST AI RMF	MANAGE 2.2	AI risk management includes monitoring for changes in performance over time	Drift monitoring directly implements NIST AI RMF MANAGE 2.2
Privacy Act 1988 (AU)	APP 11 (Security)	Model using PII must continue to protect it; model drift may expose new PII risks	Drift monitoring detects when input distribution shifts to include new PII categories

15. Reference Implementations

AWS

Feature Logger: Custom wrapper; AWS SageMaker Model Monitor (built-in feature capture)
Feature Log Store: Amazon S3 (Parquet) + AWS Glue Data Catalog; Amazon Redshift for queries
Drift Analysis: SageMaker Model Monitor scheduled monitoring jobs (built-in KS, chi-sq, PSI); AWS Glue ETL jobs
Reference Store: SageMaker baseline dataset (S3-backed)
Retraining Trigger: SageMaker Model Monitor alert → EventBridge → SageMaker Pipeline
Dashboard: SageMaker Model Monitor built-in dashboard; Amazon QuickSight custom dashboards
Registry: SageMaker Model Registry

Azure

Feature Logger: Azure Machine Learning Data Collector; custom wrapper
Feature Log Store: Azure Data Lake Storage Gen2 (Parquet) + Azure Synapse Analytics
Drift Analysis: Azure ML Data Drift monitoring (built-in); Azure Databricks jobs
Reference Store: Azure ML Dataset with versioning
Retraining Trigger: Azure ML Monitoring alert → Azure Event Grid → Azure ML Pipeline
Dashboard: Azure ML Studio monitoring dashboard; Power BI
Registry: Azure Machine Learning Model Registry

GCP

Feature Logger: Vertex AI Feature Store; custom wrapper
Feature Log Store: BigQuery (streaming insert); Cloud Storage (Parquet)
Drift Analysis: Vertex AI Model Monitoring (built-in skew/drift detection); Dataflow batch jobs
Reference Store: Vertex AI Dataset with versioning
Retraining Trigger: Vertex AI Model Monitoring alert → Cloud Pub/Sub → Vertex AI Pipeline
Dashboard: Vertex AI Model Monitoring dashboard; Looker
Registry: Vertex AI Model Registry

On-Premises

Feature Logger: Evidently AI (open source); custom wrapper with Kafka sink
Feature Log Store: Apache Hudi on HDFS/MinIO; ClickHouse for queries
Drift Analysis: Evidently AI reports scheduled via Airflow; custom PySpark jobs
Reference Store: MLflow artifacts; DVC
Retraining Trigger: Airflow sensor on drift metric; MLflow webhook
Dashboard: Evidently AI HTML reports; Grafana with drift metrics
Registry: MLflow Model Registry

16. Related Patterns

Pattern ID	Pattern Name	Relationship	Notes
EAAPL-OBS001	AI Telemetry Architecture	Foundation	Input/output logging infrastructure required
EAAPL-OBS003	Hallucination Detection	Sibling	Both monitor output quality; drift in hallucination rate is a concept drift signal
EAAPL-OBS004	AI Incident Management	Depends On	Critical drift triggers P1 quality incident in OBS004
EAAPL-OBS008	AI Performance Benchmarking	Sibling	Offline benchmarking complements online drift monitoring

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Rationale
Adoption Breadth	3	Adopted by mature ML organisations; cloud-native tools improving accessibility
Tooling Ecosystem	4	SageMaker Monitor, Vertex AI Monitoring, Evidently AI are mature; LLM drift tooling maturing
Operational Runbook Coverage	3	Statistical drift runbooks exist; benign drift classification is still manual/custom
Regulatory Evidence	4	APRA CPG 234 explicitly references model performance monitoring; EU AI Act adds momentum
Cost Predictability	3	Feature storage cost at scale can surprise teams without upfront capacity planning
Team Skill Availability	3	Statistical ML skills required; data scientist involvement needed for test interpretation

18. Revision History

Version	Date	Author	Changes
1.0.0	2026-06-12	EAAPL Working Group	Initial publication
