[EAAPL-PLT008] AI Experiment Tracking
Category: Platform Engineering
Sub-category: MLOps / Evaluation
Version: 1.1
Maturity: Proven
Tags: experiment-tracking, mlops, model-evaluation, a-b-testing, evaluation-datasets, metric-tracking, model-registry, promotion-decision
Regulatory Relevance: EU AI Act Article 9 (Risk Management), Article 17 (Quality Management), ISO 42001 Clause 9
1. Executive Summary
AI systems that cannot demonstrate systematic evaluation before production deployment are a governance liability. When a new model version, prompt change, or RAG configuration is promoted without structured comparison data, the organisation has no evidence that the change improves outcomes and no baseline against which to detect degradation. Regulators, auditors, and, increasingly, procurement processes require this evidence.
The AI Experiment Tracking pattern establishes the infrastructure for systematic, reproducible evaluation of all AI configuration changes. It covers the metadata schema for experiments (what was changed, what was measured, over what dataset), evaluation dataset management (the golden datasets that make metrics reproducible), metric computation and comparison (including human evaluation workflows for nuanced quality assessment), multi-run comparison dashboards, and the promotion decision audit trail that links every production configuration to the experiment that justified its deployment. This pattern transforms AI quality management from anecdotal to evidence-based, satisfying both engineering and regulatory audiences.
2. Problem Statement
Business Problem
AI feature quality degrades without detection because there is no systematic measurement. A model upgrade that performs better on marketing copy but worse on technical documentation goes undetected until customers complain. A prompt change that reduces cost also reduces accuracy in a way that manifests only in edge cases. Without experiment tracking, these regressions are invisible until they cause business impact.
Technical Problem
AI experiments are informal: engineers test a new model in a local notebook, eyeball a few outputs, and deploy. There is no reproducible evaluation framework, no standard metric set, no golden dataset, no comparison to baseline, and no record of the decision. When the promotion is later questioned, there is no evidence to review.
Symptoms
- Model or prompt changes deployed without documented evaluation results
- Different teams using different metrics to evaluate the same AI capability (no standard metric set)
- Golden datasets living on individual engineers' laptops or in ad hoc S3 buckets with no versioning
- Post-incident analysis unable to identify whether the AI change or a data change caused quality degradation
- Regulators or auditors requesting evaluation evidence for production AI systems; none available
Cost of Inaction
- AI quality regressions reaching production that structured evaluation would have caught
- Regulatory non-compliance due to absence of quality management documentation
- Duplicated evaluation effort across teams using incompatible methodologies
- Inability to demonstrate AI improvement over time to business stakeholders
3. Context
When to Apply
- Organisation has AI models or prompts in production that change over time
- Multiple teams evaluate AI changes with no standard methodology
- Regulatory obligations require quality management documentation for AI systems
- A/B testing of model configurations is needed for production traffic comparison
- AI programme stakeholders require evidence of continuous improvement
When NOT to Apply
- Static, never-changing AI system with no evaluation lifecycle
- One-time AI analysis project with no ongoing deployment: lightweight evaluation sufficient
- Research prototype: production-grade experiment tracking overhead not warranted
Prerequisites
- Evaluation datasets (golden datasets) for each AI use case. This is the prerequisite most organisations lack, and the datasets must be built before this pattern delivers its full value
- AI API Gateway for A/B traffic routing integration (PLT002/PLT003)
- Model Registry for linking experiment results to model versions (PLT001 Layer 2)
- Observability infrastructure for metric ingestion
Industry Applicability
| Industry | Applicability | Evaluation Priority |
| --- | --- | --- |
| Financial Services | Very High | Accuracy of AI-assisted decisions; fairness/bias; regulatory documentation |
| Healthcare | Very High | Clinical accuracy; safety; regulatory approval evidence |
| Technology / SaaS | High | Quality at scale; competitive differentiation through AI quality |
| Legal / Professional Services | High | Accuracy; consistency; professional responsibility evidence |
| Retail / E-commerce | Medium | User satisfaction; conversion metrics; content quality |
| Government | High | Fairness; accuracy; democratic accountability |
4. Architecture Overview
The AI Experiment Tracking system is the measurement and evidence layer for all AI quality decisions. It is structurally analogous to a scientific lab notebook for AI—each experiment has a defined setup, methodology, results, and conclusion—but operationalised at engineering scale with automation.
Experiment Metadata Schema defines what an experiment record contains. Every experiment must record: a unique experiment ID, the component under evaluation (model name + version, prompt name + version, or RAG configuration version), the baseline configuration it is being compared to, the evaluation dataset reference (name + version), the evaluation metrics computed, the evaluation execution timestamp, the person or automated system that executed the evaluation, and the promotion decision record (approved/rejected with reason and approver). This schema is the foundation of the audit trail; every production configuration must have a traceable experiment record.
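The fields listed above could be captured in a record type along the following lines. This is a minimal sketch, not a schema the pattern prescribes: the class names (`ExperimentRecord`, `ComponentRef`, `PromotionDecision`), the `pending` decision state, and the JSON serialisation are assumptions made for illustration. The sample values mirror the identifiers used in the Data Flow walkthrough.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Decision(str, Enum):
    PENDING = "pending"      # assumption: state before a reviewer has decided
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass(frozen=True)
class ComponentRef:
    """A versioned component under evaluation."""
    kind: str       # "model", "prompt", or "rag_config"
    name: str
    version: str


@dataclass
class PromotionDecision:
    decision: Decision = Decision.PENDING
    reason: Optional[str] = None
    approver: Optional[str] = None


@dataclass
class ExperimentRecord:
    experiment_id: str
    candidate: ComponentRef
    baseline: ComponentRef
    dataset_name: str
    dataset_version: str
    metrics: dict[str, float]
    executed_at: datetime
    executed_by: str            # person or automated system
    promotion: PromotionDecision = field(default_factory=PromotionDecision)

    def to_json(self) -> str:
        # default=str renders the datetime; the str-based Enum serialises as its value.
        return json.dumps(asdict(self), default=str, sort_keys=True)


record = ExperimentRecord(
    experiment_id="exp-20250612-001",
    candidate=ComponentRef("prompt", "customer-faq", "1.2.0"),
    baseline=ComponentRef("prompt", "customer-faq", "1.1.3"),
    dataset_name="faq-golden",
    dataset_version="3.2",
    metrics={"accuracy": 0.941, "factuality": 0.912},
    executed_at=datetime(2025, 6, 12, tzinfo=timezone.utc),
    executed_by="ci-pipeline",
)
print(record.to_json())
```

Keeping the promotion decision inside the experiment record means a single lookup by experiment ID returns both the evidence and the decision it supported.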
Evaluation Dataset Management is the most operationally demanding part of this pattern. Golden datasets for AI evaluation must be representative, version-controlled, and regularly maintained. A golden dataset consists of: input examples (representative queries or prompts), expected outputs (or reference outputs for similarity scoring), and metadata (creation date, curator, domain coverage statistics, known limitations). Datasets are stored in versioned object storage (S3, GCS) with a content-addressed hash ensuring reproducibility. A dataset update triggers re-evaluation of the current production configuration as a new baseline, ensuring metrics are always comparable on the same dataset version.
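One way to obtain a content-addressed hash is to digest a canonical serialisation of the examples. The sketch below assumes the dataset is a list of JSON-serialisable dictionaries and that example order is part of the dataset's identity; the function names and the sample examples are hypothetical.

```python
import hashlib
import json


def dataset_content_hash(examples: list[dict]) -> str:
    """SHA-256 over a canonical JSON encoding of each example, in order."""
    digest = hashlib.sha256()
    for example in examples:
        # sort_keys and fixed separators make the encoding independent of key order.
        canonical = json.dumps(
            example, sort_keys=True, separators=(",", ":"), ensure_ascii=False
        )
        digest.update(canonical.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def verify_dataset(examples: list[dict], expected_hash: str) -> bool:
    """Return True when the loaded examples match the hash pinned for this version."""
    return dataset_content_hash(examples) == expected_hash


golden = [
    {"input": "How do I reset my password?", "expected": "Use the reset link on the sign-in page."},
    {"input": "What is the refund window?", "expected": "30 days from delivery."},
]
pinned = dataset_content_hash(golden)
print(pinned, verify_dataset(golden, pinned))
```

Recording the hash in the experiment record alongside the dataset name and version lets a later re-run confirm it is evaluating against byte-identical content.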
Metric Computation Framework standardises the metrics computed across all experiments. Common AI quality metrics include: accuracy (for classification tasks), factuality score (for knowledge-retrieval tasks, computed via citation checking or reference comparison), format compliance rate (for structured output tasks), latency percentiles (P50, P95, P99 inference time), token efficiency (output tokens per quality unit), and for safety-critical applications: toxicity rate, bias score, and hallucination rate. Metric computation is automated in a standardised evaluation harness that can be invoked from CI/CD pipelines and from the experiment tracking service.
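Three of the metrics named above are simple enough to sketch with the standard library. The definitions here are assumptions chosen for illustration: accuracy as exact match after trimming whitespace, format compliance as "parses as a JSON object", and percentiles by the nearest-rank method. A real harness would substitute task-appropriate definitions.

```python
import json
import math


def accuracy(predictions: list[str], expected: list[str]) -> float:
    """Share of predictions that exactly match the reference (whitespace-trimmed)."""
    if len(predictions) != len(expected) or not expected:
        raise ValueError("predictions and expected must be non-empty and equal in length")
    correct = sum(p.strip() == e.strip() for p, e in zip(predictions, expected))
    return correct / len(expected)


def format_compliance_rate(outputs: list[str]) -> float:
    """Share of outputs that parse as a JSON object."""
    if not outputs:
        raise ValueError("outputs must be non-empty")

    def is_json_object(text: str) -> bool:
        try:
            return isinstance(json.loads(text), dict)
        except json.JSONDecodeError:
            return False

    return sum(is_json_object(o) for o in outputs) / len(outputs)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values or not 0 < pct <= 100:
        raise ValueError("values must be non-empty and 0 < pct <= 100")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


latencies_s = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 2.0, 3.0]
print(percentile(latencies_s, 50), percentile(latencies_s, 95))
```

Keeping each metric a pure function of raw outputs makes it easy to recompute metrics later from stored outputs without re-calling the model.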
Human Evaluation Workflow extends automated evaluation with human judgment for nuanced quality attributes that automated metrics cannot capture reliably: tone and brand voice consistency, logical coherence of long-form outputs, appropriateness of creative content, and clinical appropriateness in healthcare applications. Human evaluation is routed to qualified evaluators via a task queue; results are aggregated with inter-rater reliability scoring (Cohen's kappa) to ensure evaluation quality. Because human evaluation is constrained by cost and evaluator availability, the pattern defines when it is required (high-risk use cases, major version changes) and when automated evaluation is sufficient.
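Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies. The sketch below is a minimal implementation for exactly two raters labelling the same items; the function name and the sample labels are hypothetical. With more than two raters per item, a multi-rater statistic such as Fleiss' kappa would be needed instead.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


rater_1 = ["pass"] * 5 + ["fail"] * 5
rater_2 = ["pass"] * 4 + ["fail"] * 5 + ["pass"]
# The raters agree on 8 of 10 items and chance agreement is 0.5, so kappa is about 0.6.
print(cohens_kappa(rater_1, rater_2))
```

A low kappa on a batch is a signal to recalibrate evaluators or tighten the guidelines before the batch's scores are used in a promotion decision.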
Multi-Run Comparison Dashboard provides the analytical view across experiments. The dashboard allows comparison of metrics across all experiments for a given component, over time (to detect trend improvements or regressions), and against the current production baseline. Statistical significance testing (two-proportion z-test for classification metrics, t-test for continuous metrics) is applied automatically; the dashboard distinguishes between statistically significant improvements and noise.
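The two-proportion z-test mentioned above can be sketched with the standard library alone. This is an illustration rather than the pattern's reference implementation: the function name is hypothetical, and the counts (235 and 232 correct out of 250) are assumptions chosen to approximate the 94% versus 92.8% accuracy figures used in the Data Flow walkthrough.

```python
import math


def two_proportion_z_test(
    successes_a: int, n_a: int, successes_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both groups share one success rate."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # identical all-pass or all-fail groups: no evidence of a difference
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: candidate 235/250 correct, baseline 232/250 correct.
z_small, p_small = two_proportion_z_test(235, 250, 232, 250)
# Same proportions on a hypothetical dataset 40 times larger.
z_large, p_large = two_proportion_z_test(9400, 10000, 9280, 10000)
print(f"n=250: z={z_small:.2f}, p={p_small:.3f}; n=10000: z={z_large:.2f}, p={p_large:.5f}")
```

With these counts the p-value at 250 examples per arm is well above 0.05: a gap of about three correct answers is within noise under this test, and the same proportions become significant only at a far larger sample. Because candidate and baseline are scored on the same examples, a paired test such as McNemar's on per-example outcomes is often a better fit than the unpaired test shown here.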
Promotion Decision Audit Trail links the decision to deploy a new configuration to the experiment that justified it. When a platform team member promotes a model or prompt version to production via the Prompt Registry (PLT005) or Model Registry (PLT001), they must reference an experiment ID that documents the evaluation. This reference is recorded in the promotion record and in the production configuration's metadata. Auditors can trace any production AI configuration to the experiment evidence that justified its deployment.
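The enforcement rule described above can be expressed as a small gate in the promotion path. The sketch below is an assumption-laden illustration: `promote`, `PromotionRecord`, and `PromotionRejected` are hypothetical names, the in-memory `experiments` dictionary stands in for the experiment metadata DB, and the `component`, `candidate`, and `status` keys follow the record fields shown in the Data Flow walkthrough. It does not represent the Prompt Registry or Model Registry API.

```python
from dataclasses import dataclass
from typing import Optional


class PromotionRejected(Exception):
    """Raised when a promotion request lacks valid experiment evidence."""


@dataclass(frozen=True)
class PromotionRecord:
    component: str
    version: str
    experiment_id: str
    approver: str


def promote(
    component: str,
    version: str,
    experiment_id: Optional[str],
    approver: str,
    experiments: dict[str, dict],
) -> PromotionRecord:
    """Allow promotion only when a passing experiment evaluated this exact version."""
    experiment = experiments.get(experiment_id) if experiment_id else None
    if experiment is None:
        raise PromotionRejected("missing or unknown experiment ID")
    if (experiment["component"], experiment["candidate"]) != (component, version):
        raise PromotionRejected("experiment evaluated a different configuration")
    if experiment["status"] != "PASS":
        raise PromotionRejected(f"experiment status is {experiment['status']}")
    return PromotionRecord(component, version, experiment_id, approver)


experiments = {
    "exp-20250612-001": {"component": "customer-faq", "candidate": "v1.2.0", "status": "PASS"},
}
record = promote("customer-faq", "v1.2.0", "exp-20250612-001", "prompt-owner", experiments)
print(record)
```

Checking that the experiment's candidate matches the version being promoted prevents a valid experiment ID from being reused to justify a different configuration.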
5. Architecture Diagram
flowchart TD
subgraph Triggers["Experiment Triggers"]
A[CI Pipeline]
B[Manual Trigger]
end
subgraph Evaluation["Evaluation Service"]
C[Evaluation Runner]
D[Metric Computation]
E[Statistical Comparison]
end
subgraph Storage["Data Stores"]
F[(Golden Dataset Store)]
G[(Experiment Metadata DB)]
end
subgraph Outcome["Outcomes"]
H[Comparison Dashboard]
I[Promotion Audit Trail]
end
A --> C
B --> C
F --> C
C --> D
D --> E
E --> G
G --> H
G --> I
I --> J[Governance Report]
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#dbeafe,stroke:#3b82f6
style C fill:#f0fdf4,stroke:#22c55e
style D fill:#f0fdf4,stroke:#22c55e
style E fill:#f0fdf4,stroke:#22c55e
style F fill:#fef9c3,stroke:#eab308
style G fill:#fef9c3,stroke:#eab308
style H fill:#d1fae5,stroke:#10b981
style I fill:#d1fae5,stroke:#10b981
style J fill:#d1fae5,stroke:#10b981
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Experiment Scheduler | Service | Queue and prioritise evaluation jobs | Custom Celery queue, Temporal workflow | High |
| Evaluation Runner | Service | Execute prompts against dataset; collect raw outputs | Custom Python harness, Ragas, DeepEval | Critical |
| Metric Computation Engine | Service | Compute standard and custom metrics from raw outputs | Ragas (RAG metrics), custom evaluators | Critical |
| Evaluation Dataset Store | Service | Version-controlled storage for golden datasets | S3 + DVC (Data Version Control), GCS | Critical |
| Experiment Metadata DB | Service | Store experiment records with full schema | PostgreSQL, MongoDB | Critical |
| Metric Time Series DB | Service | Store metric values for trend analysis and comparison | Prometheus, ClickHouse, TimescaleDB | High |
| Human Evaluation Task Queue | Service | Route nuanced evaluations to human evaluators | Custom task queue, Label Studio, Scale AI | Medium |
| Inter-Rater Reliability Calculator | Service | Compute Cohen's kappa for human evaluation quality | Custom Python module | Medium |
| Statistical Comparison Engine | Service | Compute significance tests between candidate and baseline | Custom + scipy.stats | High |
| Comparison Dashboard | Service | Visualise experiment results and trends | Grafana, Metabase, custom React dashboard | High |
| Promotion Decision Recorder | Service | Link production promotion events to experiment IDs | Integration with Prompt Registry and Model Registry APIs | Critical |
| Governance Report Generator | Service | Produce quality management documentation for auditors | Custom, Jupyter notebook pipeline | Medium |
7. Data Flow
Primary Flow — Automated CI Evaluation on Prompt Change
| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | CI Pipeline | Detect prompt change in PR; trigger experiment for customer-faq-v1.2.0 vs baseline v1.1.3 | Experiment job created with ID exp-20250612-001 |
| 2 | Experiment Scheduler | Dequeue experiment job; load dataset faq-golden-v3.2 from S3 | Dataset loaded: 250 examples |
| 3 | Evaluation Runner | Execute all 250 examples against baseline v1.1.3 and candidate v1.2.0; collect outputs | 500 raw output records |
| 4 | Metric Computation | Compute: accuracy 94.1% (v1.2.0) vs 92.8% (v1.1.3); factuality 91.2% vs 90.1%; P95 latency 1.2s vs 1.4s | Metric delta: accuracy +1.3 pp, factuality +1.1 pp, latency -200ms |
| 5 | Statistical Comparison | Two-proportion z-test on accuracy: p=0.031 (< 0.05 threshold) → statistically significant improvement | p-value + confidence intervals |
| 6 | Experiment Record | Write to experiment DB: exp-20250612-001; component=customer-faq; candidate=v1.2.0; status=PASS; significant improvement | Experiment record written |
| 7 | CI Status | Post experiment results as PR comment; mark CI check as PASS | PR author sees metric comparison |
| 8 | Promotion | Prompt owner approves PR; references exp-20250612-001 in promotion request | Experiment ID recorded in promotion audit trail |
Error Flow
| Error | Detection | Response |
| --- | --- | --- |
| Evaluation dataset unavailable (S3 outage) | Dataset load failure | Queue the experiment; alert the dataset custodian; retry with backoff |
| Model API rate limit during evaluation | Runner error rate | Slow down evaluation; use batch API; emit warning |
| Statistical test inconclusive (p > 0.05) | Comparison engine | Mark experiment as INCONCLUSIVE; alert prompt owner; may require a larger dataset |
| Human evaluation SLA breach | Task queue age monitor | Escalate to evaluation team lead; unblock using an automated proxy metric |
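Two of the responses in the Error Flow table, retry with backoff and the INCONCLUSIVE status, can be sketched as follows. The names, the choice of transient exception types, the exponential delay schedule, and the `FAIL` status for a significant regression are all assumptions for illustration; the pattern itself names only PASS and INCONCLUSIVE.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: these built-in exceptions stand in for dataset-store outages.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, doubling the wait after each transient failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error so the job can be re-queued
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")


def classify_experiment(delta: float, p_value: float, alpha: float = 0.05) -> str:
    """Map a metric delta (candidate minus baseline) and p-value to a status."""
    if p_value >= alpha:
        return "INCONCLUSIVE"
    return "PASS" if delta > 0 else "FAIL"


print(classify_experiment(delta=0.013, p_value=0.20))
```

Injecting `sleep` as a parameter keeps the retry helper testable without real delays.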
8. Security Considerations
- Evaluation datasets may contain sensitive representative inputs; they must be classified and stored with the same access controls as production data
- Experiment results (including individual output comparisons) may reveal model behaviour on sensitive inputs; access to raw outputs must be restricted to the model owner and platform team
- Human evaluators must be bound by confidentiality agreements if evaluation involves sensitive content
OWASP LLM Controls
| OWASP LLM Risk | Experiment Tracking Control |
| --- | --- |
| LLM09 Overreliance | Factuality and hallucination metrics in evaluation suite provide evidence of reliability |
| LLM03 Training Data Poisoning | Evaluation dataset integrity checks (content hash verification) detect dataset tampering |
9. Governance Considerations
Quality Management System
- Experiment tracking provides the testing and validation evidence for the AI quality management system required by EU AI Act Article 17; every production configuration must have a traceable experiment record
- Promotion without experiment evidence is a governance policy violation; the promotion workflow enforces experiment ID as a required field
Governance Artefacts
| Artefact | Owner | Cadence | Location |
| --- | --- | --- | --- |
| Evaluation dataset catalogue | Data Team + Model Owner | Per dataset version | Dataset store + metadata DB |
| Experiment records | Model Owner | Per experiment | Experiment metadata DB |
| Promotion audit trail | Platform Team | Per promotion | Experiment DB + Registry |
| Quarterly quality report | Model Owner | Quarterly | Governance dashboard |
| Human evaluation guidelines | Model Owner | Annual | Internal wiki |
10. Operational Considerations
Monitoring
| Signal | Source | Alert Threshold | Owner |
| --- | --- | --- | --- |
| Evaluation pipeline failure rate | Job status | >5% failed experiments | Platform Team |
| Experiment queue depth | Scheduler metrics | >20 jobs pending >30 min | Platform Team |
| Dataset hash mismatch | Integrity check | Any mismatch | Security + Data Team |
| Production quality metric regression (vs. last deployment) | Production sampling | >2% drop on key metric | Model Owner + Platform Team |
SLOs
| SLO | Target | Window |
| --- | --- | --- |
| CI evaluation pipeline completion | <15 min for standard 250-example dataset | Per run |
| Experiment DB availability | 99.9% | Rolling 30 days |
| Production sampling metric freshness | <4 hours lag | Rolling 7 days |
Disaster Recovery
| Component | RPO | RTO | Strategy |
| --- | --- | --- | --- |
| Experiment metadata DB | 5 min | 30 min | Database replication |
| Dataset store | 0 (content-addressed) | 15 min | S3 cross-region replication |
| Metric time series DB | 1 hour | 30 min | Cross-region replication; recomputable from experiment DB |
11. Cost Considerations
Cost Drivers
| Driver | Description | Relative Weight |
| --- | --- | --- |
| Evaluation API calls | Running golden dataset through model on every PR/experiment | Medium: scales with dataset size and frequency |
| Human evaluation | Evaluator time for nuanced tasks | High: reserved for high-risk changes only |
| Dataset storage | Versioned datasets; relatively small at scale | Low |
| Experiment DB hosting | Low-volume OLTP database | Low |
Optimisations
- For evaluation runs whose target is an expensive model, use the cheapest model that produces comparable signals as a proxy
- Maintain a tiered evaluation strategy: fast automated eval (10 examples) for PR gate; comprehensive eval (250 examples) for promotion to staging; extended eval (1000+ examples) for production promotion of high-risk changes
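The tiered strategy above can be sketched as a stage-to-sample-size mapping with a seeded sampler. The stage names, the use of 1000 as the size for the "1000+" tier, and the fixed seed are assumptions for illustration. The fixed seed matters because it makes every PR-gate run draw the same subset, so results stay comparable across runs on the same dataset version.

```python
import random
from enum import Enum


class Stage(Enum):
    PR_GATE = "pr_gate"
    STAGING = "staging"
    PRODUCTION_HIGH_RISK = "production_high_risk"


# Sizes follow the tiers described above; 1000 is the floor of the "1000+" tier.
TIER_SIZES = {
    Stage.PR_GATE: 10,
    Stage.STAGING: 250,
    Stage.PRODUCTION_HIGH_RISK: 1000,
}


def select_examples(dataset: list[dict], stage: Stage, seed: int = 0) -> list[dict]:
    """Draw a reproducible sample sized for the stage, capped at the dataset size."""
    size = min(TIER_SIZES[stage], len(dataset))
    return random.Random(seed).sample(dataset, size)


dataset = [{"id": i, "input": f"question {i}"} for i in range(300)]
pr_subset = select_examples(dataset, Stage.PR_GATE)
print([example["id"] for example in pr_subset])
```

A purely random 10-example subset may miss rare but important cases; a stratified or hand-curated smoke set is a reasonable alternative for the PR gate.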
Indicative Cost Range
| Scale | Monthly Experiment Infra + Evaluation API Cost |
| --- | --- |
| Small (10 prompts, weekly changes) | $300–$1,000 |
| Medium (50 prompts, daily changes, 3 teams) | $2,000–$6,000 |
| Large (200+ prompts, multiple models, continuous evaluation) | $10,000–$30,000 |
12. Trade-Off Analysis
Evaluation Strategy Options
| Option | Description | Pros | Cons | Best For |
| --- | --- | --- | --- | --- |
| Automated Only | All evaluation via metric computation against golden dataset | Scalable; fast; cheap | Misses nuanced quality issues; only as good as golden dataset | High-volume, structured output tasks |
| Human Evaluation Only | All evaluation via human judges | Highest quality | Slow; expensive; not scalable; inter-rater inconsistency | Very high-risk, low-volume decisions |
| Automated + Human Gate | Automated for all changes; human required for high-risk/major changes | Balanced quality and scalability | Requires defining "high-risk" criterion carefully | Recommended default |
Metric Strategy Options
| Option | Description | Pros | Cons |
| --- | --- | --- | --- |
| Fixed Standard Metric Set | Same metrics for all use cases | Comparability; simplicity | May not capture use-case-specific quality |
| Per-Use-Case Custom Metrics | Metrics defined per use case | Highest relevance | Comparison across use cases harder; more maintenance |
| Composite Score | Weighted combination of multiple metrics | Single promotion threshold | Weight calibration difficult; masks individual metric issues |
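The masking risk of a composite score is easy to show with numbers. In the sketch below the weights, threshold, metric values, and function names are all hypothetical. The per-metric floors are one possible mitigation added for illustration, not something the pattern prescribes.

```python
def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the metrics named in weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(metrics[name] * weight for name, weight in weights.items()) / total_weight


def passes_gate(
    metrics: dict[str, float],
    weights: dict[str, float],
    threshold: float,
    floors: dict[str, float],
) -> bool:
    """Composite threshold plus per-metric floors, so one weak metric cannot hide."""
    if any(metrics[name] < floor for name, floor in floors.items()):
        return False
    return composite_score(metrics, weights) >= threshold


weights = {"accuracy": 0.5, "factuality": 0.3, "format_compliance": 0.2}
metrics = {"accuracy": 0.97, "factuality": 0.80, "format_compliance": 1.00}

# The composite clears a 0.90 threshold even though factuality is only 0.80.
print(composite_score(metrics, weights))
print(passes_gate(metrics, weights, threshold=0.90, floors={}))
print(passes_gate(metrics, weights, threshold=0.90, floors={"factuality": 0.85}))
```

Without floors the weak factuality score is averaged away by strong accuracy and format compliance; with a floor on factuality the same candidate is blocked.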
Architectural Tensions
| Tension | Option A | Option B | Resolution |
| --- | --- | --- | --- |
| Evaluation completeness vs. CI speed | Full evaluation on every PR | Fast smoke test on PR; full eval on merge | Tiered: 10-example fast check on PR; 250-example full on merge |
| Golden dataset size vs. cost | Large representative dataset | Minimal viable dataset | Start with 50–100 high-quality examples; expand to 250–500 as infrastructure matures |
| Automated statistical significance vs. business judgment | Block promotion on p>0.05 only | Block on any quality drop | Statistical gate for automated CI; human judgment gate for production promotion |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| Golden dataset staleness (distribution drift from production) | High over time | High: metrics look good but production quality is low | Production quality monitoring diverges from evaluation metrics | Regular dataset refresh cadence; production sampling for dataset maintenance |
| Evaluation runner hitting model rate limits | Medium | Medium: experiments fail or slow | Runner error rate | Use batch API; reduce parallelism; queue retry |
| Human evaluator bias | Medium | Medium: biased promotion decisions | Inter-rater reliability monitoring (low kappa) | Evaluator calibration sessions; blind evaluation protocols |
| Promotion without experiment reference | Low (if enforced) | High: governance gap | Promotion workflow audit | Enforce experiment ID as mandatory in promotion workflow; alert on bypass |
14. Regulatory Considerations
EU AI Act Articles 9 and 17
- Article 9 requires a risk management system for high-risk AI systems; experiment evaluation records provide documented evidence of the testing performed under it
- The Article 17 quality management system requirement is supported, though not fully satisfied, by the experiment tracking infrastructure and promotion audit trail, which document test and validation activity
- Technical documentation required by Article 11 must reference the evaluation methodology, datasets, and metrics used
ISO 42001 Clause 9 (Performance Evaluation)
- Experiment tracking directly implements Clause 9.1 (monitoring and measurement of AI system performance)
- Regular production quality sampling provides input to Clause 9.3 (management review)
NIST AI RMF MEASURE 2.3
- Metrics for tracking AI performance over time must be defined and measured; the experiment tracking framework and metric time series DB satisfy this requirement
15. Reference Implementations
AWS
| Component | AWS Service |
| --- | --- |
| Evaluation runner | SageMaker Processing Jobs or Lambda (batch) |
| Dataset store | S3 + DVC |
| Experiment metadata DB | Amazon RDS PostgreSQL |
| Metric time series | Amazon Timestream or CloudWatch |
| Dashboard | Amazon Managed Grafana |
Azure
| Component | Azure Service |
| --- | --- |
| Evaluation runner + experiment tracking | Azure ML Experiments |
| Dataset store | Azure ML Datasets + Azure Blob Storage |
| Dashboard | Azure ML Studio |
Open Source / SaaS
| Component | Technology |
| --- | --- |
| Evaluation framework | Ragas (RAG), DeepEval, HELM |
| Experiment tracking | MLflow, Weights & Biases, Comet ML |
| Human evaluation | Label Studio (open source), Scale AI, Labelbox |
On-Premises
| Component | Technology |
| --- | --- |
| Evaluation runner | Custom Python harness + Celery |
| Experiment tracking | MLflow self-hosted |
| Dataset store | MinIO (S3-compatible) + DVC |
16. Related Patterns
| Pattern ID | Name | Relationship |
| --- | --- | --- |
| EAAPL-PLT001 | Enterprise AI Platform | Parent: experiment tracking is Layer 4 |
| EAAPL-PLT005 | Prompt Version Control | Dependency: prompt promotions require experiment references |
| EAAPL-PLT003 | Model Routing | Integration: A/B routing data feeds experiment tracking |
| EAAPL-GOV001 | AI Governance Framework | Dependency: experiment records are primary governance evidence |
17. Maturity Assessment
Overall Maturity: Proven
Experiment tracking with MLflow, W&B, and Azure ML is production-proven. AI-specific evaluation frameworks (Ragas, DeepEval) are maturing rapidly. The promotion-decision audit trail link is the least standardised component.
Scoring Matrix
| Dimension | Score (1–5) | Rationale |
| --- | --- | --- |
| Pattern Completeness | 5 | All sections documented |
| Implementation Evidence | 4 | MLOps experiment tracking proven; LLM-specific evaluation less so |
| Tooling Maturity | 4 | Ragas/DeepEval maturing; human eval tooling stable |
| Regulatory Alignment | 5 | Strong EU AI Act / ISO 42001 alignment |
| Dataset Management Maturity | 3 | Golden dataset management is the common weak link in practice |
18. Revision History
| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2024-08-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2025-06-12 | EAAPL Working Group | Human evaluation workflow expanded; ISO 42001 Clause 9 alignment; production sampling guidance added |