EAAPL-OBS005 · Model Drift Detection
Pattern ID: EAAPL-OBS005
Status: Proven
Complexity: High
Tags: observability model-risk alerting slo high-complexity
Version: 1.0.0
Last Reviewed: 2026-06-12
1. Executive Summary
AI models degrade silently. Unlike a server that crashes with a clear error, a drifting model continues returning HTTP 200 responses while its outputs become progressively less accurate, more biased, or less relevant. Data drift — changes in the statistical distribution of inputs — and concept drift — changes in the relationship between inputs and desired outputs — are the two primary mechanisms. Without continuous drift monitoring, organisations discover model degradation through business metric decline, customer complaints, or regulatory findings — all lagging indicators that allow harm to accumulate.
This pattern defines a continuous monitoring system for statistical drift in model inputs and outputs across production AI deployments. It covers:
- Data drift detection using Kolmogorov-Smirnov tests for continuous features, chi-squared tests for categorical features, and Population Stability Index (PSI) for combined assessment
- Concept drift detection through output distribution monitoring, accuracy on labeled holdout sets, and Jensen-Shannon divergence
- Reference dataset management with versioning and seasonal adjustment
- Drift severity classification (warning/alert/critical)
- Automated retraining triggers on critical drift
- Visualisation dashboards for per-feature drift over time
- Integration with model registries
- The critical distinction between benign drift (seasonal patterns, legitimate distribution shift) and harmful drift (data quality degradation, adversarial shift, world-state change invalidating model assumptions)
Target Audience: CIO, CTO, Chief Risk Officer, Head of AI/ML Engineering, Model Risk Manager
Time to Implement: 8–14 weeks
2. Problem Statement
Business Problem
Organisations deploy AI models and assume they will continue performing as they did in testing. They don't. The world changes, user behaviour changes, data pipelines evolve, and model performance degrades. Most organisations have no systematic mechanism for detecting this until a material business event forces attention — a regulatory finding, a wave of customer complaints, an unexpected decline in conversion or retention. At that point, the degradation may have been occurring for months.
Technical Problem
Drift detection requires statistical comparison of production data distributions against reference baselines, at scale, in near-real-time. The challenge is multi-dimensional: many ML models have hundreds or thousands of features; each requires its own statistical test; feature drift does not always imply output degradation; and the relationship between measured drift and model performance impact is non-linear. Additionally, distinguishing harmful drift from legitimate distribution changes (new product launches, seasonal patterns, geographic expansion) requires both statistical and domain knowledge.
Symptoms
- Model deployed in January performs well; by June, customer satisfaction with AI features has quietly declined
- Business metrics (task completion rate, recommendation click-through) decline with no engineering change that explains the drop
- RAG retrieval quality degraded because the vector index was built on stale embeddings but no monitoring detected this
- Model retrained annually on schedule, not triggered by evidence of performance degradation
- Regulatory review reveals model was trained on data no longer representative of current customer base
Cost of Inaction
- Average enterprise AI model degrades measurably within 6 months of deployment without monitoring (Gartner 2024)
- Regulatory findings for material models lacking performance monitoring (APRA CPG 234, EU AI Act Article 9)
- Silent accuracy regression in credit scoring, fraud detection, or clinical triage has direct financial and safety consequences
- Unnecessary scheduled retraining (without drift evidence) wastes compute and introduces regression risk from needless model changes
3. Context
When to Apply
- Any production ML model with a defined performance baseline and ongoing inference traffic
- AI systems where input distributions are expected to be stable (any significant change is an anomaly)
- Models used for regulated decisions (credit, fraud, clinical, underwriting) requiring ongoing performance evidence
- RAG systems where retrieval quality depends on embedding models that may become stale
- Prerequisite: EAAPL-OBS001 provides the input/output data stream required for drift computation
When NOT to Apply
- One-off batch models with no ongoing deployment
- Purely generative tasks (creative writing) where output distribution monitoring is not meaningful
- Models retrained continuously (online learning) where the model itself is always adapting — drift is by design
Prerequisites
| Prerequisite | Required | Notes |
|---|---|---|
| EAAPL-OBS001 AI Telemetry Infrastructure | Required | Input feature logging and output logging required |
| Reference baseline dataset (labeled) | Required | Drift comparison requires a reference distribution |
| Model registry with versioned metadata | Required | Drift events must link to model versions |
| Statistical compute runtime | Required | Python scipy; PySpark for high-volume feature sets |
| Model performance ground truth mechanism | Strongly Recommended | Without labels, concept drift detection is indirect |
Industry Applicability
| Industry | Applicability | Primary Driver |
|---|---|---|
| Financial Services | Critical | APRA CPG 234, ASIC model risk, credit/fraud model degradation |
| Healthcare | Critical | Clinical model safety obligation; performance monitoring mandatory |
| Insurance | Critical | Underwriting model accuracy directly impacts financial outcomes |
| Retail / E-Commerce | High | Recommendation and personalisation models degrade with catalogue changes |
| Government | High | Decision-support models require ongoing performance evidence |
| Technology / SaaS | High | RAG freshness; NLP model drift with language evolution |
4. Architecture Overview
The Model Drift Detection Architecture is a statistical monitoring system that operates asynchronously on the production inference data stream. It is composed of five functional layers: data capture, reference management, statistical analysis, severity classification, and action triggering.
Data Capture Layer
Every AI inference request logs the input features and the model output. For structured ML models, input features are captured in the telemetry log. For LLM and RAG systems, proxies are used: prompt token count, query type classification (derived from prompt metadata), retrieved document distribution, and output characteristics (length, entropy, sentiment score). The data capture layer feeds both real-time streaming analysis (for rapid drift detection on output distributions) and batch statistical analysis (for feature-level drift computation, which requires sufficient data volume for statistical power).
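The record captured at this layer can be sketched as a small data structure. The example below is a minimal illustration, not a prescribed schema: the field names follow the feature log record listed in the Primary Flow table (modelId, modelVersion, features, output, timestamp), while the helper `llm_proxy_record`, the whitespace tokenisation, and the word-level entropy measure are assumptions made for the sketch. A production logger would use the model's own tokenizer and would forward records asynchronously.

```python
import json
import math
from collections import Counter
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def shannon_entropy(tokens):
    """Shannon entropy (bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


@dataclass
class FeatureLogRecord:
    modelId: str
    modelVersion: str
    features: dict
    output: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def llm_proxy_record(model_id, model_version, prompt, output, query_type):
    """Build a record of proxy features for an LLM call (hypothetical helper).

    Whitespace splitting stands in for real tokenisation in this sketch.
    """
    output_tokens = output.split()
    features = {
        "prompt_token_count": len(prompt.split()),
        "query_type": query_type,
        "output_length": len(output_tokens),
        "output_entropy": shannon_entropy(output_tokens),
    }
    return FeatureLogRecord(model_id, model_version, features, output)


record = llm_proxy_record("support-bot", "1.4.0", "what is drift", "a b a b", "qa")
print(json.dumps(asdict(record)))
```

Logging derived proxies rather than raw prompts also keeps the feature log smaller and reduces the amount of sensitive text retained.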
Reference Dataset Management
The reference dataset defines the expected distribution. It is not a static artifact — it must be managed actively. The reference is versioned: each model version has its own reference distribution. The reference is updated when a new model version is deployed (new baseline from evaluation data) or when a known distribution shift occurs (e.g., product launch changing customer demographic) and the shift is deemed legitimate. Reference datasets are stored in the model registry alongside model artifacts. A reference management API allows data scientists to approve reference updates; unapproved reference changes are blocked and alerted.
Statistical Analysis: Data Drift
Data drift is detected at the feature level. For each numerical feature, the Kolmogorov-Smirnov (KS) test compares the current production distribution against the reference distribution. KS test statistic and p-value are computed. A p-value < 0.05 with KS statistic > 0.10 indicates statistically significant drift. For categorical features, the chi-squared test compares observed vs. expected frequency distributions. For composite drift assessment, the Population Stability Index (PSI) is computed per feature: PSI = sum over bins of (actual% - expected%) × ln(actual% / expected%). PSI < 0.10 is stable (no concern), 0.10–0.25 is moderate drift (warning), > 0.25 is significant drift (alert). The overall drift index aggregates per-feature PSI scores weighted by feature importance.
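The three tests above can be sketched with NumPy and SciPy, the runtime named in the prerequisites. This is a minimal sketch under stated assumptions: the function names are hypothetical, PSI bins are taken from reference deciles, a small `eps` floor avoids division by zero in empty bins, and the categorical test is run as a chi-squared test of homogeneity on a 2 x K contingency table (a variant that tolerates categories unseen in the reference) rather than a goodness-of-fit test.

```python
import numpy as np
from scipy import stats


def ks_drift(reference, current, alpha=0.05, min_statistic=0.10):
    """Two-sample KS test; flags drift only if p < alpha AND statistic > min_statistic."""
    result = stats.ks_2samp(reference, current)
    drifted = result.pvalue < alpha and result.statistic > min_statistic
    return float(result.statistic), float(result.pvalue), bool(drifted)


def chi2_drift(reference_counts, current_counts, alpha=0.05):
    """Chi-squared homogeneity test on category counts given as dicts."""
    categories = sorted(set(reference_counts) | set(current_counts))
    table = np.array([
        [reference_counts.get(c, 0) for c in categories],
        [current_counts.get(c, 0) for c in categories],
    ])
    chi2, p_value, _dof, _expected = stats.chi2_contingency(table)
    return float(chi2), float(p_value), bool(p_value < alpha)


def psi(reference, current, bins=10, eps=1e-6):
    """PSI = sum((actual% - expected%) * ln(actual% / expected%)) over bins."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Inner bin edges from reference quantiles; outer bins are open-ended.
    inner_edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    expected = np.bincount(
        np.searchsorted(inner_edges, reference, side="right"), minlength=bins
    ) / reference.size
    actual = np.bincount(
        np.searchsorted(inner_edges, current, side="right"), minlength=bins
    ) / current.size
    expected = np.clip(expected, eps, None)
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))


def overall_drift_index(psi_by_feature, importance):
    """Importance-weighted mean of per-feature PSI scores."""
    total_weight = sum(importance[f] for f in psi_by_feature)
    return sum(psi_by_feature[f] * importance[f] for f in psi_by_feature) / total_weight


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)
shifted = rng.normal(1.0, 1.0, 5000)  # synthetic one-sigma mean shift
print(ks_drift(reference, shifted), psi(reference, shifted))
```

Requiring both a small p-value and a minimum KS statistic matters at production volumes: with very large samples, trivially small shifts become statistically significant, so the effect-size threshold keeps the test from firing on differences with no practical impact.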
Statistical Analysis: Concept Drift
Concept drift, where the relationship between inputs and the desired output changes, is harder to detect without labels. Three complementary approaches are used:
- Output distribution monitoring: track the distribution of model outputs (predicted class distribution for classifiers; output length and vocabulary distribution for LLMs). Significant shifts in output distribution without corresponding input drift suggest concept drift. Jensen-Shannon divergence between current and reference output distributions is computed.
- Accuracy on labeled holdout: a static holdout set with human-labeled ground truth is evaluated periodically against the current model. Declining accuracy on a fixed holdout, while input distribution is stable, indicates concept drift.
- Error rate trend monitoring: for models with feedback mechanisms, track error rate (user corrections, thumbs down, escalations) as a proxy for accuracy.
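The Jensen-Shannon divergence used for output distribution monitoring can be sketched with the standard library alone. The example assumes a classifier whose outputs are discrete labels and uses base-2 logarithms, which bounds the divergence between 0 and 1; the pattern does not specify a logarithm base, so the base and the helper names here are assumptions.

```python
import math
from collections import Counter


def to_distribution(labels, categories):
    """Empirical probability of each category, in a fixed category order."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in categories]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q against their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


categories = ["approve", "refer", "decline"]
reference = to_distribution(["approve"] * 70 + ["refer"] * 20 + ["decline"] * 10, categories)
current = to_distribution(["approve"] * 50 + ["refer"] * 25 + ["decline"] * 25, categories)
print(round(js_divergence(reference, current), 4))
```

Because the midpoint distribution is non-zero wherever either input is non-zero, the divergence stays finite even when a class disappears from production output, which is one reason it suits streaming use better than raw KL divergence.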
Drift Severity Classification
- Warning: PSI 0.10–0.25 on one or more features, or JS divergence increase of 0.05–0.10 on output. No immediate action; increased monitoring frequency; notify ML engineer.
- Alert: PSI > 0.25 on important features, or JS divergence increase > 0.10, or holdout accuracy drop > 5%. Schedule retraining review within 2 weeks.
- Critical: PSI > 0.50, or holdout accuracy drop > 10%, or error rate 2x baseline. Trigger automated retraining pipeline immediately; page ML engineer on-call; notify model risk manager.
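The severity rules above can be expressed as a small rule function. The sketch below makes three assumptions that the text leaves open: accuracy drops are treated as absolute fractions (0.05 means five percentage points), "error rate 2x baseline" is read as a ratio of 2.0 or more, and a PSI above 0.25 on a feature not marked as important is treated as a warning. The `DriftSignals` container and its field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(str, Enum):
    STABLE = "stable"
    WARNING = "warning"
    ALERT = "alert"
    CRITICAL = "critical"


@dataclass
class DriftSignals:
    psi_by_feature: dict            # feature name -> PSI
    important_features: set         # features whose PSI can raise an alert
    js_increase: float = 0.0        # JS divergence increase over baseline
    holdout_accuracy_drop: float = 0.0  # absolute drop, e.g. 0.06 (assumption)
    error_rate_ratio: float = 1.0   # current error rate / baseline error rate


def classify(signals: DriftSignals) -> Severity:
    """Apply the graduated thresholds, checking the most severe tier first."""
    max_psi = max(signals.psi_by_feature.values(), default=0.0)
    max_important_psi = max(
        (v for k, v in signals.psi_by_feature.items()
         if k in signals.important_features),
        default=0.0,
    )
    if (max_psi > 0.50 or signals.holdout_accuracy_drop > 0.10
            or signals.error_rate_ratio >= 2.0):
        return Severity.CRITICAL
    if (max_important_psi > 0.25 or signals.js_increase > 0.10
            or signals.holdout_accuracy_drop > 0.05):
        return Severity.ALERT
    if max_psi >= 0.10 or signals.js_increase >= 0.05:
        return Severity.WARNING
    return Severity.STABLE


signals = DriftSignals({"income": 0.31, "age": 0.04}, {"income"}, js_increase=0.02)
print(classify(signals).value)
```

Evaluating tiers from most to least severe means a single critical signal cannot be masked by milder readings on other features.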
Automated Retraining Trigger
Critical drift triggers the automated retraining pipeline. The trigger event is published to the model registry, which kicks off the organisation's standard model retraining workflow. The retraining pipeline uses the current production data (within the retention window) as training data, trains a new model version, evaluates against the holdout set, and if quality improves or is maintained, submits for deployment review. Human sign-off is required before the new model version is promoted to production.
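The promotion logic at the end of the retraining pipeline can be sketched as a two-stage gate: a holdout check followed by mandatory human sign-off. The state names, the `tolerance` parameter, and the `Candidate` type are illustrative assumptions; the pattern itself only requires that quality "improves or is maintained" and that a human approves promotion.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    version: str
    holdout_accuracy: float


def passes_holdout_gate(candidate_accuracy, production_accuracy, tolerance=0.0):
    """Quality must improve or be maintained (within an optional tolerance)."""
    return candidate_accuracy >= production_accuracy - tolerance


def next_step(candidate, production_accuracy, human_approved=False):
    """Return the pipeline outcome for a retrained candidate model."""
    if not passes_holdout_gate(candidate.holdout_accuracy, production_accuracy):
        return "rejected"          # fails the holdout gate; never reaches review
    if not human_approved:
        return "pending_review"    # submitted for deployment review
    return "promote"               # both gates satisfied


candidate = Candidate(version="2.1.0", holdout_accuracy=0.91)
print(next_step(candidate, production_accuracy=0.89))
```

Keeping the human approval as an explicit argument, rather than a default path, makes it harder for an integration to promote a model without a recorded decision.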
Benign vs. Harmful Drift Classification
Not all drift is harmful. Seasonal patterns (retail models drifting at Christmas), known distribution shifts from product changes (new customer segment acquired), or deliberate training data diversity expansion are benign. The benign drift classifier consults a calendar of known events (product launches, campaigns, data pipeline changes) and applies a rule: if drift onset correlates with a known event within a 3-day window, classify as potentially benign and route to ML engineer review rather than auto-triggering retraining.
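The calendar rule can be sketched as follows. Two points are assumptions: the 3-day window is applied symmetrically around each known event, and a concurrent holdout accuracy decline overrides any calendar match, reflecting the dual-signal safeguard described in the Error Flow and Failure Modes tables. The `KnownEvent` type and the label strings are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class KnownEvent:
    name: str
    occurred_at: datetime


def classify_drift(drift_onset, calendar, window=timedelta(days=3),
                   accuracy_declined=False):
    """Return (label, rationale) for a drift event.

    A calendar match only marks drift as potentially benign and routes it
    to review; it never clears drift when accuracy is also declining.
    """
    if accuracy_declined:
        return ("unknown", "holdout accuracy declined; benign classification overridden")
    matches = [e for e in calendar if abs(drift_onset - e.occurred_at) <= window]
    if matches:
        closest = min(matches, key=lambda e: abs(drift_onset - e.occurred_at))
        return ("potentially_benign", f"matched calendar event: {closest.name}")
    return ("unknown", "no known event within window")


calendar = [KnownEvent("spring campaign launch", datetime(2026, 3, 10))]
print(classify_drift(datetime(2026, 3, 12), calendar))
```

Returning the matched event name alongside the label gives the Benign Drift Classification Log the rationale it needs for audit.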
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Feature Logger | SDK / Sidecar | Capture input features and output at inference time; forward to feature log store | Custom wrapper; Arize AI; WhyLabs; Evidently AI agent | Critical |
| Feature Log Store | Storage | Time-series storage of production inference features | ClickHouse, BigQuery, Apache Hudi/Delta Lake on S3 | Critical |
| KS Test Processor | Batch Job | Kolmogorov-Smirnov test on numerical feature distributions | Python scipy.stats; PySpark on Databricks/Glue | High |
| Chi-Squared Processor | Batch Job | Chi-squared test on categorical feature distributions | Python scipy.stats; PySpark | High |
| PSI Calculator | Batch Job | Population Stability Index per feature and aggregate | Python/Spark; Evidently AI; WhyLabs | High |
| Jensen-Shannon Divergence Engine | Streaming + Batch | JS divergence on output distributions vs. reference | Python scipy; Flink streaming | High |
| Accuracy Monitor | Batch Job | Evaluate current model on fixed holdout set periodically | MLflow evaluation; custom Python script | High |
| Reference Dataset Store | Storage | Versioned reference distributions per model version | S3/GCS/Azure Blob + DVC; MLflow artifacts | Critical |
| Benign Drift Classifier | Service | Correlate drift onset with known events calendar | Custom rule engine + event calendar API | Medium |
| Retraining Pipeline Trigger | Integration | Publish critical drift event to retraining workflow | Airflow/Prefect sensor; MLflow webhook; Kubeflow trigger | High |
| Drift Dashboard | UI | Per-feature drift over time; severity summary; trend | Grafana; Evidently AI UI; WhyLabs; custom React app | Medium |
| Model Registry Integration | Integration | Link drift events to model versions; trigger review workflow | MLflow, SageMaker Model Registry, Vertex AI Model Registry | High |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Feature Logger | Captures input features and model output at every inference call | Feature log record with modelId, modelVersion, features{}, output, timestamp |
| 2 | Feature Log Store | Ingests and indexes feature records; enables time-window queries | Queryable time-series feature data |
| 3 | KS / Chi-Squared / PSI Processors | Run hourly batch analysis on 1-hour window vs. reference distribution | Per-feature drift scores; PSI values; test statistics and p-values |
| 4 | JSD Engine | Computes JS divergence on output distribution in rolling window | Output distribution divergence score |
| 5 | Accuracy Monitor | Evaluates model on holdout set (daily or per-deployment) | Accuracy, F1, or task-specific quality metrics |
| 6 | Severity Classifier | Applies severity rules to aggregate drift signals | Severity label: stable / warning / alert / critical |
| 7 | Benign Drift Classifier | Checks drift onset against known events calendar | Benign / unknown classification with rationale |
| 8 | Action Router | Routes by severity and benign classification to appropriate action | Alert, scheduled review, or retraining trigger |
| 9 | Retraining Pipeline (if critical) | Initiates model retraining on recent production data | New model version submitted for review |
| 10 | Human Reviewer | Reviews new model version quality; approves or rejects promotion | Approval decision; model registry updated |
Error Flow
| Error Scenario | Detection | Action | Recovery |
|---|---|---|---|
| Feature log store query times out | Batch job failure alert; lag metric | Alert ML platform team; skip the affected batch window | Investigate store performance; run catch-up analysis for missed windows |
| Reference distribution missing for new model version | Drift job raises missing reference error | Alert to ML engineer; skip drift computation until reference set | ML engineer creates reference on model deployment |
| Benign drift classifier incorrectly clears harmful drift | Accuracy holdout detects concurrent quality decline | Accuracy alert overrides benign classification; escalate | Investigate; tune benign classifier; enforce dual confirmation |
| Retraining pipeline fails | Pipeline failure alert; Airflow/Prefect failure | Alert ML engineer; manual retraining trigger | Fix pipeline; retry; monitor new model version |
| Holdout set becomes stale (labels no longer representative) | Holdout accuracy diverges from production feedback | Alert to ML team; schedule holdout refresh | Refresh holdout with new labels |
8. Security Considerations
Authentication: Feature log store access requires service authentication. Drift analysis jobs authenticate via service accounts. Reference dataset store access is write-restricted to approved ML engineers; reads are available to drift analysis services.
Authorisation: Feature log data may contain sensitive model inputs (e.g., credit application features, health data proxies). Access to feature logs is restricted to ML engineers and data scientists with specific model ownership. Audit log of all accesses.
Secrets Management: Cloud storage credentials for feature log store and reference dataset store in secrets manager. Retraining pipeline trigger credentials rotated quarterly.
Data Classification: Feature logs are classified at the level of the most sensitive input feature (often Confidential for financial or health models). Reference datasets are classified as Internal. Drift event records are classified as Internal.
Encryption: Feature log data encrypted at rest (AES-256) and in transit (TLS 1.3). Reference dataset store encrypted. Long-term retention of feature logs for regulatory audit requires customer-managed encryption keys.
Auditability: Every reference dataset update is audited with requester, approver, timestamp, and rationale. Every retraining trigger event is logged immutably. Benign drift classifications are logged with the event calendar entry they matched.
OWASP LLM Top 10 Coverage
| OWASP LLM Risk | Drift Detection Control | Implementation |
|---|---|---|
| LLM01 Prompt Injection | Prompt length and structure distribution drift detects systematic injection patterns | Distribution shift in prompt structure is drift signal |
| LLM02 Insecure Output Handling | Output distribution monitoring detects systematic output changes from injection | JSD alert if output distribution shifts toward unsafe patterns |
| LLM03 Training Data Poisoning | Feature distribution drift may indicate poisoned training affecting production distribution | Input drift concurrent with accuracy decline = poisoning signal |
| LLM04 Model Denial of Service | Token usage distribution drift detects abusive usage patterns | Token count distribution shift = anomaly signal |
| LLM05 Supply Chain Vulnerabilities | Unexpected model version in registry triggers investigation | Model version audit in drift monitoring |
| LLM06 Sensitive Information Disclosure | Input feature drift monitoring may detect feature set changes that introduce PII | Alert on new feature categories appearing in feature distribution |
| LLM07 Insecure Plugin Design | Tool call distribution monitoring detects shifts in tool usage patterns | Tool call frequency is a monitored distribution |
| LLM08 Excessive Agency | Agent action distribution drift detects scope expansion | Output action type distribution monitored |
| LLM09 Overreliance | Accuracy monitoring surfaces model quality degradation before users over-rely on degraded outputs | Accuracy SLO directly measures overreliance risk |
| LLM10 Model Theft | Unusual output volume distribution shift may indicate bulk extraction | Output volume distribution is a monitored signal |
9. Governance Considerations
Responsible AI: Drift monitoring is the technical implementation of the principle that AI systems must perform as intended over their operational lifecycle, not only at deployment. Governance frameworks must mandate drift monitoring as a condition of continued production deployment for material models.
Model Risk Management: The drift event history is a key model risk management artefact. Material models must have a documented drift monitoring configuration reviewed by model risk. Critical drift events are Key Risk Indicator (KRI) breaches reported to the model risk committee.
Human Approval: All retraining decisions triggered by drift require human approval before deployment. The retraining pipeline produces a candidate model; an ML engineer and model risk reviewer approve promotion. Automated promotion without human review is not permitted for material models.
Policy: The model drift monitoring policy must define: which models require drift monitoring (materiality threshold), required monitoring frequency, reference dataset update criteria and approval process, drift severity thresholds, retraining trigger criteria, and escalation requirements for critical drift.
Traceability: Every drift event is linked to the model version, the reference dataset version, the statistical test result, and the action taken. This chain supports model risk management audit trails and regulatory evidence production.
Governance Artefacts
| Artefact | Owner | Frequency | Format |
|---|---|---|---|
| Model Drift Monitoring Register | Model Risk Manager | Per model, updated on drift events | Registry with model, monitoring config, last assessment |
| Drift Event Log | ML Platform | Continuous | Immutable event store |
| Reference Dataset Approval Record | ML Engineering + Model Risk | Per update | Signed approval with rationale |
| Critical Drift KRI Report | Model Risk Manager | Monthly | Dashboard export + risk committee briefing |
| Retraining Decision Record | ML Engineering + Model Risk | Per retraining trigger | Signed decision with drift evidence and new model evaluation |
| Benign Drift Classification Log | ML Engineering | Per benign classification | Event log with matched calendar entry and rationale |
10. Operational Considerations
Monitoring: The drift detection system is itself monitored. Batch job completion, processing lag, reference dataset freshness, and holdout evaluation frequency are all tracked. A drift detection system that hasn't run in 48 hours is as dangerous as a smoke alarm with a dead battery.
Logging: Drift analysis job logs stored separately from AI application logs. Drift event records are immutable.
Incident Response: Critical drift triggers the AI incident management process (EAAPL-OBS004) with a P1 quality incident. ML engineer on-call is paged. The retraining trigger is a parallel action, not a substitute for incident response — the current model may need to be limited or disabled while retraining proceeds.
Disaster Recovery: Drift detection is not in the critical inference path. A 4-hour outage of the drift detection system is acceptable. The risk is undetected drift during the outage window. Batch jobs can run catch-up analysis when the system recovers.
Capacity Planning: Feature log storage grows with inference volume. For a model with 100 features and 1M daily requests, each log record is approximately 1–5KB; total daily storage is 1–5GB. Plan for 90-day retention in hot storage and 2-year retention in warm storage for regulatory audit.
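The storage arithmetic above can be checked with a short calculation. The sketch assumes decimal units (1 GB = 1,000,000 KB) and uncompressed raw records; treating the 2-year warm tier as holding full records for 730 days is also an assumption, since summarised or downsampled data would be considerably smaller.

```python
def daily_storage_gb(daily_requests: int, record_kb: float) -> float:
    """Raw feature-log volume per day in GB (decimal units)."""
    return daily_requests * record_kb / 1_000_000


def retained_storage_gb(daily_gb: float, retention_days: int) -> float:
    """Total volume held for a given retention period."""
    return daily_gb * retention_days


low = daily_storage_gb(1_000_000, 1.0)    # 1 KB records
high = daily_storage_gb(1_000_000, 5.0)   # 5 KB records

print(f"daily: {low}-{high} GB")
print(f"hot (90 days): {retained_storage_gb(low, 90)}-{retained_storage_gb(high, 90)} GB")
print(f"warm (730 days): {retained_storage_gb(low, 730)}-{retained_storage_gb(high, 730)} GB")
```

For the stated example this gives 1 to 5 GB per day, 90 to 450 GB in hot storage, and roughly 0.7 to 3.7 TB over two years before compression.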
SLO Table
| SLO | Target | Measurement | Alert Threshold |
|---|---|---|---|
| Drift detection freshness | Analysis runs within 2 hours of schedule | Job completion timestamp | > 4 hours behind schedule |
| Critical drift alert time | < 30 minutes from breach to alert | Alert delivery timestamp vs. PSI breach | > 60 minutes |
| Holdout accuracy evaluation | Daily for material models | Evaluation job completion log | > 48 hours since last evaluation |
| Reference dataset freshness | Updated within 5 days of model version deployment | Reference update timestamp vs. model deployment | > 7 days stale |
Disaster Recovery Table
| Component | RTO | RPO | Recovery Approach |
|---|---|---|---|
| Feature Log Store | 30 minutes | 1 hour | Replicated storage; catch-up analysis on recovery |
| Drift Analysis Jobs | 4 hours | 4 hours (catch-up) | Re-run batch jobs for missed windows |
| Reference Dataset Store | 30 minutes | Near-zero | Replicated object storage |
| Drift Dashboard | 60 minutes | N/A (read-only) | Redeploy dashboard from version control |
11. Cost Considerations
Cost Drivers
| Driver | Description | Relative Cost |
|---|---|---|
| Feature log storage | 1–5KB per inference record; 90-day hot retention | High at scale |
| Statistical analysis compute (Spark/Glue) | Hourly batch jobs on feature data; scales with feature count and volume | Medium |
| Holdout accuracy evaluation | Model inference cost on holdout set (daily) | Low to Medium |
| JSD streaming computation | Real-time output distribution monitoring; minimal compute | Low |
| Reference dataset storage | Relatively small; versioned reference distributions | Low |
Scaling Risks: Feature log storage is the primary scaling cost. At 10M+ daily inferences with 100+ features, storage costs can exceed $10K/month without optimisation. Use columnar compression (Parquet) and aggressive downsampling for older data.
Optimisations:
- Store feature summaries (histogram buckets) rather than raw feature values for high-volume models
- Run statistical tests on stratified samples (10K records sufficient for KS test statistical power) rather than full population
- Use serverless compute (AWS Glue, BigQuery) to eliminate idle compute costs between batch windows
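The first optimisation, storing histogram buckets instead of raw values, can be sketched as below. Bucket counts from separate time windows can be added together, and PSI can be computed directly from the counts. The fixed bucket edges, the function names, and the `eps` floor for empty buckets are assumptions for the sketch; in practice the edges would be derived from the reference distribution and versioned with it.

```python
import bisect
import math


def bucket_counts(values, edges):
    """Count values per bucket; len(edges) + 1 buckets, outer buckets open-ended."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return counts


def merge_counts(a, b):
    """Combine summaries from two windows that share the same edges."""
    return [x + y for x, y in zip(a, b)]


def psi_from_counts(expected_counts, actual_counts, eps=1e-6):
    """PSI computed from bucket counts rather than raw feature values."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / expected_total, eps)
        a_pct = max(a / actual_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


edges = [1.0, 2.0]
hour_1 = bucket_counts([0.5, 1.5, 2.5, 2.5], edges)
hour_2 = bucket_counts([0.2, 0.7, 1.1, 2.9], edges)
print(hour_1, merge_counts(hour_1, hour_2), psi_from_counts(hour_1, hour_2))
```

The trade-off is resolution: a summary supports PSI and chi-squared style comparisons on the stored buckets, but tests that need the raw sample cannot be reconstructed from it exactly.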
Indicative Cost Range
| Scale | Daily Inferences | Estimated Drift Detection Cost/Month |
|---|---|---|
| Small | 10,000 | $200–$600 |
| Medium | 500,000 | $1,500–$4,000 |
| Large | 5,000,000 | $5,000–$15,000 |
| Enterprise | 50,000,000+ | $20,000–$60,000 (with summarisation optimisation) |
12. Trade-Off Analysis
Approach Comparison
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Full feature-level drift monitoring (KS + Chi-Sq + PSI) | Precise; identifies which feature is drifting; enables targeted remediation | High compute and storage; requires feature access (not always available for LLMs) | Structured ML models with well-defined feature sets; regulated decisions |
| Output-only distribution monitoring (JSD on outputs) | Minimal infrastructure; no feature logging required; applicable to LLMs | Detects that something changed but not what; concept drift only, not data drift | LLM and generative systems; quick-start implementation |
| Human-labeled holdout evaluation only | Highest accuracy; directly measures real performance | Slow (labels take time); samples a small fraction of production | High-risk decisions where detection accuracy is paramount; complement to automated methods |
Architectural Tensions
| Tension | Description | Resolution |
|---|---|---|
| Sensitivity vs. False Alarms | Low thresholds detect early drift but generate false alarms that erode trust | PSI 0.10 warning (no page), 0.25 alert (notify), 0.50 critical (page) — graduated response |
| Feature granularity vs. Cost | Per-feature monitoring is precise but expensive at scale | Monitor all features for regulated models; monitor key features only for lower-risk models |
| Detection speed vs. Statistical power | Very fast detection requires small windows with low statistical power | Accept 1-hour minimum window for KS/Chi-sq; use streaming output monitoring for faster preliminary signal |
| Automation vs. Human oversight | Automated retraining is fast but may introduce new problems | Automated trigger only; human must approve new model promotion |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Reference dataset not updated after model deployment | High | High (drift false alarms or misses) | Reference freshness SLO alert | Enforce reference update as deployment gate |
| Benign classifier clears actual harmful drift | Medium | High (harmful drift not actioned) | Accuracy holdout detects concurrent quality decline | Require dual signal for benign classification; accuracy must not decline |
| Feature logger adds significant latency | Low | High (performance degradation) | Feature logger latency SLO breach | Async logging; decouple from inference path |
| Statistical test fails with insufficient data | High (for low-volume models) | Medium (no drift detection) | Test failure log; minimum data check | Enforce minimum sample size before running tests; alert if sample size insufficient |
| Retraining pipeline introduces regression | Medium | High (new model worse) | New model evaluation before promotion | Holdout gate in retraining pipeline; manual sign-off required |
Cascading Scenarios
- Scenario 1: Reference dataset never updated after major product launch → all features show critical drift → retraining triggered repeatedly → new models trained on shifted distribution → performance remains poor → wasted retraining compute and model risk review cycles. Mitigation: approved reference update is required within 5 days of known distribution shifts.
- Scenario 2: Feature logger fails silently → feature store contains stale data → drift detection runs on old data → no drift detected → actual drift goes undetected for weeks. Mitigation: feature store freshness SLO; alert if no new records ingested in 30 minutes.
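The freshness check that mitigates Scenario 2 can be sketched as a comparison between the newest ingested record and the current time. The function names and the message format are hypothetical; the 30-minute threshold comes from the mitigation above.

```python
from datetime import datetime, timedelta, timezone


def ingestion_is_stale(last_record_at, now, max_gap=timedelta(minutes=30)):
    """True if no new feature record has arrived within max_gap."""
    return now - last_record_at > max_gap


def freshness_alert(model_id, last_record_at, now, max_gap=timedelta(minutes=30)):
    """Return an alert message when ingestion is stale, otherwise None."""
    if not ingestion_is_stale(last_record_at, now, max_gap):
        return None
    gap_minutes = int((now - last_record_at).total_seconds() // 60)
    return f"{model_id}: no feature records ingested for {gap_minutes} minutes"


now = datetime(2026, 6, 12, 12, 0, tzinfo=timezone.utc)
last_seen = datetime(2026, 6, 12, 11, 15, tzinfo=timezone.utc)
print(freshness_alert("credit-scoring", last_seen, now))
```

Running this check before each drift job, and skipping the job when it fails, prevents the analysis from reporting "no drift" on data that has simply stopped arriving. Low-traffic models may need a longer threshold to avoid false alarms during quiet periods.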
14. Regulatory Considerations
| Regulation | Clause | Requirement | Drift Detection Implementation |
|---|---|---|---|
| APRA CPG 234 | Section 6 (Model Risk) | Material models require ongoing performance monitoring and validation | Drift detection implements continuous monitoring; critical drift triggers validation workflow |
| APRA CPG 234 | Section 8 (Model Review) | Annual (minimum) or event-triggered model review | Critical drift event triggers model review; documentation provided by this pattern |
| EU AI Act | Article 9.5 (Testing) | Ongoing testing to identify appropriate risk management measures for high-risk AI | Holdout evaluation and drift detection implement ongoing testing requirement |
| EU AI Act | Article 12 (Record-Keeping) | High-risk AI must automatically record events (logs) that enable verification of compliance | Drift event log with timestamps, metrics, and actions is the compliance verification record |
| ISO/IEC 42001 | Clause 9.1 (Monitoring, Measurement, Analysis) | AI system performance must be continuously monitored against objectives | Drift detection implements ISO 42001 clause 9.1 at technical layer |
| NIST AI RMF | MANAGE 2.2 | AI risk management includes monitoring for changes in performance over time | Drift monitoring directly implements NIST AI RMF MANAGE 2.2 |
| Privacy Act 1988 (AU) | APP 11 (Security) | Model using PII must continue to protect it; model drift may expose new PII risks | Drift monitoring detects when input distribution shifts to include new PII categories |
15. Reference Implementations
AWS
- Feature Logger: Custom wrapper; AWS SageMaker Model Monitor (built-in feature capture)
- Feature Log Store: Amazon S3 (Parquet) + AWS Glue Data Catalog; Amazon Redshift for queries
- Drift Analysis: SageMaker Model Monitor scheduled monitoring jobs (built-in KS, chi-sq, PSI); AWS Glue ETL jobs
- Reference Store: SageMaker baseline dataset (S3-backed)
- Retraining Trigger: SageMaker Model Monitor alert → EventBridge → SageMaker Pipeline
- Dashboard: SageMaker Model Monitor built-in dashboard; Amazon QuickSight custom dashboards
- Registry: SageMaker Model Registry
Azure
- Feature Logger: Azure Machine Learning Data Collector; custom wrapper
- Feature Log Store: Azure Data Lake Storage Gen2 (Parquet) + Azure Synapse Analytics
- Drift Analysis: Azure ML Data Drift monitoring (built-in); Azure Databricks jobs
- Reference Store: Azure ML Dataset with versioning
- Retraining Trigger: Azure ML Monitoring alert → Azure Event Grid → Azure ML Pipeline
- Dashboard: Azure ML Studio monitoring dashboard; Power BI
- Registry: Azure Machine Learning Model Registry
GCP
- Feature Logger: Vertex AI Feature Store; custom wrapper
- Feature Log Store: BigQuery (streaming insert); Cloud Storage (Parquet)
- Drift Analysis: Vertex AI Model Monitoring (built-in skew/drift detection); Dataflow batch jobs
- Reference Store: Vertex AI Dataset with versioning
- Retraining Trigger: Vertex AI Model Monitoring alert → Cloud Pub/Sub → Vertex AI Pipeline
- Dashboard: Vertex AI Model Monitoring dashboard; Looker
- Registry: Vertex AI Model Registry
On-Premises
- Feature Logger: Evidently AI (open source); custom wrapper with Kafka sink
- Feature Log Store: Apache Hudi on HDFS/MinIO; ClickHouse for queries
- Drift Analysis: Evidently AI reports scheduled via Airflow; custom PySpark jobs
- Reference Store: MLflow artifacts; DVC
- Retraining Trigger: Airflow sensor on drift metric; MLflow webhook
- Dashboard: Evidently AI HTML reports; Grafana with drift metrics
- Registry: MLflow Model Registry
16. Related Patterns
| Pattern ID | Pattern Name | Relationship | Notes |
|---|---|---|---|
| EAAPL-OBS001 | AI Telemetry Architecture | Foundation | Input/output logging infrastructure required |
| EAAPL-OBS003 | Hallucination Detection | Sibling | Both monitor output quality; drift in hallucination rate is a concept drift signal |
| EAAPL-OBS004 | AI Incident Management | Depends On | Critical drift triggers P1 quality incident in OBS004 |
| EAAPL-OBS008 | AI Performance Benchmarking | Sibling | Offline benchmarking complements online drift monitoring |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Adoption Breadth | 3 | Adopted by mature ML organisations; cloud-native tools improving accessibility |
| Tooling Ecosystem | 4 | SageMaker Monitor, Vertex AI Monitoring, Evidently AI are mature; LLM drift tooling maturing |
| Operational Runbook Coverage | 3 | Statistical drift runbooks exist; benign drift classification is still manual/custom |
| Regulatory Evidence | 4 | APRA CPG 234 explicitly references model performance monitoring; EU AI Act adds momentum |
| Cost Predictability | 3 | Feature storage cost at scale can surprise teams without upfront capacity planning |
| Team Skill Availability | 3 | Statistical ML skills required; data scientist involvement needed for test interpretation |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-06-12 | EAAPL Working Group | Initial publication |