AI Experiment Tracking

[EAAPL-PLT008] AI Experiment Tracking

Category: Platform Engineering
Sub-category: MLOps / Evaluation
Version: 1.1
Maturity: Proven
Tags: experiment-tracking, mlops, model-evaluation, a-b-testing, evaluation-datasets, metric-tracking, model-registry, promotion-decision
Regulatory Relevance: EU AI Act Article 9 (Risk Management), Article 17 (Quality Management), ISO 42001 Clause 9

1. Executive Summary

AI systems that cannot demonstrate systematic evaluation before production deployment are a governance liability. When a new model version, prompt change, or RAG configuration is promoted without structured comparison data, the organisation has no evidence that the change improves outcomes and no baseline against which to detect degradation. Regulators, auditors, and, increasingly, procurement processes require this evidence.

The AI Experiment Tracking pattern establishes the infrastructure for systematic, reproducible evaluation of all AI configuration changes. It covers the metadata schema for experiments (what was changed, what was measured, over what dataset), evaluation dataset management (the golden datasets that make metrics reproducible), metric computation and comparison (including human evaluation workflows for nuanced quality assessment), multi-run comparison dashboards, and the promotion decision audit trail that links every production configuration to the experiment that justified its deployment. This pattern transforms AI quality management from anecdotal to evidence-based, satisfying both engineering and regulatory audiences.

2. Problem Statement

Business Problem

AI feature quality degrades without detection because there is no systematic measurement. A model upgrade that performs better on marketing copy but worse on technical documentation goes undetected until customers complain. A prompt change that reduces cost also reduces accuracy in a way that only manifests at edge cases. Without experiment tracking, these regressions are invisible until they cause business impact.

Technical Problem

AI experiments are informal: engineers test a new model in a local notebook, eyeball a few outputs, and deploy. There is no reproducible evaluation framework, no standard metric set, no golden dataset, no comparison to baseline, and no record of the decision. When the promotion is later questioned, there is no evidence to review.

Symptoms

Model or prompt changes deployed without documented evaluation results
Different teams using different metrics to evaluate the same AI capability (no standard metric set)
Golden datasets living on individual engineers' laptops or in ad hoc S3 buckets with no versioning
Post-incident analysis unable to identify whether the AI change or a data change caused quality degradation
Regulators or auditors requesting evaluation evidence for production AI systems; none available

Cost of Inaction

AI quality regressions reaching production that structured evaluation would have caught
Regulatory non-compliance due to absence of quality management documentation
Duplicated evaluation effort across teams using incompatible methodologies
Inability to demonstrate AI improvement over time to business stakeholders

3. Context

When to Apply

Organisation has AI models or prompts in production that change over time
Multiple teams evaluate AI changes with no standard methodology
Regulatory obligations require quality management documentation for AI systems
A/B testing of model configurations is needed for production traffic comparison
AI programme stakeholders require evidence of continuous improvement

When NOT to Apply

Static, never-changing AI system with no evaluation lifecycle
One-time AI analysis project with no ongoing deployment: lightweight evaluation is sufficient
Research prototype: the overhead of production-grade experiment tracking is not warranted

Prerequisites

Evaluation datasets (golden datasets) for each AI use case. Most organisations lack this prerequisite, and the datasets must be built before this pattern fully delivers value
AI API Gateway for A/B traffic routing integration (PLT002/PLT003)
Model Registry for linking experiment results to model versions (PLT001 Layer 2)
Observability infrastructure for metric ingestion

Industry Applicability

Industry	Applicability	Evaluation Priority
Financial Services	Very High	Accuracy of AI-assisted decisions; fairness/bias; regulatory documentation
Healthcare	Very High	Clinical accuracy; safety; regulatory approval evidence
Technology / SaaS	High	Quality at scale; competitive differentiation through AI quality
Legal / Professional Services	High	Accuracy; consistency; professional responsibility evidence
Retail / E-commerce	Medium	User satisfaction; conversion metrics; content quality
Government	High	Fairness; accuracy; democratic accountability

4. Architecture Overview

The AI Experiment Tracking system is the measurement and evidence layer for all AI quality decisions. It is structurally analogous to a scientific lab notebook for AI—each experiment has a defined setup, methodology, results, and conclusion—but operationalised at engineering scale with automation.

Experiment Metadata Schema defines what an experiment record contains. Every experiment must record: a unique experiment ID, the component under evaluation (model name + version, prompt name + version, or RAG configuration version), the baseline configuration it is being compared to, the evaluation dataset reference (name + version), the evaluation metrics computed, the evaluation execution timestamp, the person or automated system that executed the evaluation, and the promotion decision record (approved/rejected with reason and approver). This schema is the foundation of the audit trail; every production configuration must have a traceable experiment record.
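
The fields listed above can be sketched as a typed record. This is a minimal illustration rather than a prescribed schema: the class and field names, the `ci-pipeline` executor, the `prompt-owner` approver, and the timestamp are assumptions; the IDs, versions, and accuracy value are taken from the data flow example in Section 7.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class PromotionDecision:
    approved: bool
    reason: str
    approver: str


@dataclass(frozen=True)
class ExperimentRecord:
    experiment_id: str
    component: str          # model, prompt, or RAG configuration name
    candidate_version: str
    baseline_version: str
    dataset_name: str
    dataset_version: str
    metrics: dict           # metric name -> value, e.g. {"accuracy": 0.941}
    executed_at: datetime
    executed_by: str        # person or automated system
    promotion: Optional[PromotionDecision] = None  # filled in once decided

    def is_promotable(self) -> bool:
        """Only a record with an approved decision can justify a deployment."""
        return self.promotion is not None and self.promotion.approved


record = ExperimentRecord(
    experiment_id="exp-20250612-001",
    component="customer-faq",
    candidate_version="v1.2.0",
    baseline_version="v1.1.3",
    dataset_name="faq-golden",
    dataset_version="v3.2",
    metrics={"accuracy": 0.941},
    executed_at=datetime(2025, 6, 12, tzinfo=timezone.utc),  # assumed timestamp
    executed_by="ci-pipeline",                                # hypothetical name
)

# The decision is attached later as a new immutable copy of the record.
approved = replace(
    record,
    promotion=PromotionDecision(True, "significant accuracy gain", "prompt-owner"),
)
```

Keeping the record immutable and attaching the decision as a new copy is one way to preserve the evaluation evidence unchanged once it has been written.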

Evaluation Dataset Management is the most operationally demanding part of this pattern. Golden datasets for AI evaluation must be representative, version-controlled, and regularly maintained. A golden dataset consists of: input examples (representative queries or prompts), expected outputs (or reference outputs for similarity scoring), and metadata (creation date, curator, domain coverage statistics, known limitations). Datasets are stored in versioned object storage (S3, GCS) with a content-addressed hash ensuring reproducibility. A dataset update triggers re-evaluation of the current production configuration as a new baseline, ensuring metrics are always comparable on the same dataset version.
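
One way to derive a content-addressed hash for a golden dataset is to hash a canonical serialisation of its examples. The sketch below assumes the dataset fits in memory as a list of JSON-serialisable dictionaries; the function names and the two sample examples are hypothetical.

```python
import hashlib
import json


def dataset_content_hash(examples: list[dict]) -> str:
    """Return the SHA-256 hex digest of the dataset's canonical JSON form."""
    # sort_keys makes the digest independent of key order inside each example;
    # the order of the examples themselves still matters.
    canonical = json.dumps(
        examples, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_dataset(examples: list[dict], expected_hash: str) -> bool:
    """Check a loaded dataset against the hash pinned in the experiment record."""
    return dataset_content_hash(examples) == expected_hash


golden = [
    {"input": "How do I reset my password?", "expected": "Use the reset link."},
    {"input": "What is the refund window?", "expected": "30 days from purchase."},
]
pinned = dataset_content_hash(golden)
```

Recording `pinned` alongside the dataset name and version lets a later run confirm that it is evaluating against exactly the same examples.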

Metric Computation Framework standardises the metrics computed across all experiments. Common AI quality metrics include: accuracy (for classification tasks), factuality score (for knowledge-retrieval tasks, computed via citation checking or reference comparison), format compliance rate (for structured output tasks), latency percentiles (P50, P95, P99 inference time), token efficiency (output tokens per quality unit), and for safety-critical applications: toxicity rate, bias score, and hallucination rate. Metric computation is automated in a standardised evaluation harness that can be invoked from CI/CD pipelines and from the experiment tracking service.
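
Three of the metrics named above can be sketched with the standard library alone. The assumptions here are loud ones: accuracy is exact string match after trimming, format compliance means "parses as a JSON object", and percentiles use the nearest-rank method. A real harness would likely plug in task-specific scorers.

```python
import json
import math


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy after trimming whitespace."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(predictions)


def format_compliance_rate(outputs: list[str]) -> float:
    """Fraction of outputs that parse as a JSON object."""
    def is_json_object(text: str) -> bool:
        try:
            return isinstance(json.loads(text), dict)
        except json.JSONDecodeError:
            return False

    if not outputs:
        raise ValueError("outputs must be non-empty")
    return sum(is_json_object(o) for o in outputs) / len(outputs)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for P95 latency."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```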

Human Evaluation Workflow extends automated evaluation with human judgment for nuanced quality attributes that automated metrics cannot capture reliably: tone and brand voice consistency, logical coherence of long-form outputs, appropriateness of creative content, and clinical appropriateness in healthcare applications. Human evaluation is routed to qualified evaluators via a task queue; results are aggregated with inter-rater reliability scoring (Cohen's kappa for two raters, or a multi-rater statistic such as Fleiss' kappa when more than two evaluators score each item) to ensure evaluation quality. Human evaluation is limited by cost and evaluator availability; the pattern defines when human evaluation is required (high-risk use cases, major version changes) versus when automated evaluation is sufficient.
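
For the two-rater case, Cohen's kappa compares observed agreement with the agreement expected by chance. The following is a minimal sketch with hypothetical labels; it assumes both raters scored the same items in the same order.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: the raters agree on 8 of 10 items.
rater_a = ["good"] * 5 + ["bad"] * 5
rater_b = ["good"] * 4 + ["bad"] * 5 + ["good"]
kappa = cohens_kappa(rater_a, rater_b)  # approximately 0.6
```

Here observed agreement is 0.8 and chance agreement is 0.5, so kappa is (0.8 - 0.5) / (1 - 0.5), about 0.6. The threshold below which evaluators are recalibrated is a policy choice that this pattern leaves to the model owner.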

Multi-Run Comparison Dashboard provides the analytical view across experiments. The dashboard allows comparison of metrics across all experiments for a given component, over time (to detect trend improvements or regressions), and against the current production baseline. Statistical significance testing (two-proportion z-test for classification metrics, t-test for continuous metrics) is applied automatically; the dashboard distinguishes between statistically significant improvements and noise.
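
The two-proportion z-test mentioned above can be sketched without external dependencies. The function name is hypothetical and the counts are invented for illustration; `scipy.stats` or a similar library would be a reasonable substitute in a real comparison engine.

```python
import math


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both proportions are equal."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # all-pass or all-fail in both arms carries no signal
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# Hypothetical counts: candidate 450/500 correct, baseline 420/500 correct.
z, p_value = two_proportion_z_test(450, 500, 420, 500)
significant = p_value < 0.05
```

The test treats the two runs as independent samples. When candidate and baseline are scored on the same examples, a paired test is a common alternative. A p-value at or above the threshold corresponds to the INCONCLUSIVE outcome in the error flow, where a larger dataset may be needed.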

Promotion Decision Audit Trail links the decision to deploy a new configuration to the experiment that justified it. When a platform team member promotes a model or prompt version to production via the Prompt Registry (PLT005) or Model Registry (PLT001), they must reference an experiment ID that documents the evaluation. This reference is recorded in the promotion record and in the production configuration's metadata. Auditors can trace any production AI configuration to the experiment evidence that justified its deployment.
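
The enforcement rule can be sketched as a guard in front of the promotion call. Everything here is illustrative: the function, the exception, and the dictionary shapes are assumptions, and a real implementation would call the Prompt Registry or Model Registry APIs rather than an in-memory list.

```python
class PromotionRejected(Exception):
    """Raised when a promotion request lacks valid experiment evidence."""


def record_promotion(audit_trail, experiments, component, version,
                     experiment_id, approver):
    """Append a promotion entry only if the referenced experiment backs it."""
    experiment = experiments.get(experiment_id)
    if experiment is None:
        raise PromotionRejected(f"unknown experiment ID: {experiment_id!r}")
    if (experiment["component"], experiment["candidate"]) != (component, version):
        raise PromotionRejected("experiment does not cover the version being promoted")
    if experiment["status"] != "PASS":
        raise PromotionRejected(f"experiment status is {experiment['status']}, not PASS")
    entry = {
        "component": component,
        "version": version,
        "experiment_id": experiment_id,
        "approver": approver,
    }
    audit_trail.append(entry)
    return entry


# Hypothetical in-memory stand-ins for the experiment DB and audit trail.
experiments = {
    "exp-20250612-001": {
        "component": "customer-faq", "candidate": "v1.2.0", "status": "PASS",
    }
}
trail = []
record_promotion(trail, experiments, "customer-faq", "v1.2.0",
                 "exp-20250612-001", "prompt-owner")
```

Checking that the experiment covers the exact version being promoted prevents a valid experiment ID from being reused to justify a different change.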

5. Architecture Diagram

flowchart TD
    subgraph Triggers["Experiment Triggers"]
        A[CI Pipeline]
        B[Manual Trigger]
    end
    subgraph Evaluation["Evaluation Service"]
        C[Evaluation Runner]
        D[Metric Computation]
        E[Statistical Comparison]
    end
    subgraph Storage["Data Stores"]
        F[(Golden Dataset Store)]
        G[(Experiment Metadata DB)]
    end
    subgraph Outcome["Outcomes"]
        H[Comparison Dashboard]
        I[Promotion Audit Trail]
    end
    A --> C
    B --> C
    F --> C
    C --> D
    D --> E
    E --> G
    G --> H
    G --> I
    I --> J[Governance Report]
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#d1fae5,stroke:#10b981
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Experiment Scheduler	Service	Queue and prioritise evaluation jobs	Custom Celery queue, Temporal workflow	High
Evaluation Runner	Service	Execute prompts against dataset; collect raw outputs	Custom Python harness, Ragas, DeepEval	Critical
Metric Computation Engine	Service	Compute standard and custom metrics from raw outputs	Ragas (RAG metrics), custom evaluators	Critical
Evaluation Dataset Store	Service	Version-controlled storage for golden datasets	S3 + DVC (Data Version Control), GCS	Critical
Experiment Metadata DB	Service	Store experiment records with full schema	PostgreSQL, MongoDB	Critical
Metric Time Series DB	Service	Store metric values for trend analysis and comparison	Prometheus, ClickHouse, TimescaleDB	High
Human Evaluation Task Queue	Service	Route nuanced evaluations to human evaluators	Custom task queue, Label Studio, Scale AI	Medium
Inter-Rater Reliability Calculator	Service	Compute Cohen's kappa for human evaluation quality	Custom Python module	Medium
Statistical Comparison Engine	Service	Compute significance tests between candidate and baseline	Custom + scipy.stats	High
Comparison Dashboard	Service	Visualise experiment results and trends	Grafana, Metabase, custom React dashboard	High
Promotion Decision Recorder	Service	Link production promotion events to experiment IDs	Integration with Prompt Registry and Model Registry APIs	Critical
Governance Report Generator	Service	Produce quality management documentation for auditors	Custom, Jupyter notebook pipeline	Medium

7. Data Flow

Primary Flow — Automated CI Evaluation on Prompt Change

Step	Actor	Action	Output
1	CI Pipeline	Detect prompt change in PR; trigger experiment for `customer-faq-v1.2.0` vs baseline `v1.1.3`	Experiment job created with ID `exp-20250612-001`
2	Experiment Scheduler	Dequeue experiment job; load dataset `faq-golden-v3.2` from S3	Dataset loaded: 250 examples
3	Evaluation Runner	Execute all 250 examples against baseline `v1.1.3` and candidate `v1.2.0`; collect outputs	500 raw output records
4	Metric Computation	Compute: accuracy 94.1% (v1.2.0) vs 92.8% (v1.1.3); factuality 91.2% vs 90.1%; P95 latency 1.2s vs 1.4s	Metric delta: accuracy +1.3 percentage points, factuality +1.1 percentage points, P95 latency -200ms
5	Statistical Comparison	Two-proportion z-test on accuracy: p=0.031 (< 0.05 threshold) → statistically significant improvement	p-value + confidence intervals
6	Experiment Record	Write to experiment DB: exp-20250612-001; component=customer-faq; candidate=v1.2.0; status=PASS; significant improvement	Experiment record written
7	CI Status	Post experiment results as PR comment; mark CI check as PASS	PR author sees metric comparison
8	Promotion	Prompt owner approves PR; references exp-20250612-001 in promotion request	Experiment ID recorded in promotion audit trail
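
Steps 4 to 7 can be condensed into a small summary function that a CI job might post as a PR comment. This is a sketch under assumptions: the function name and return shape are hypothetical, and the `FAIL` label for a significant regression is an assumption, since the flow above names only PASS and INCONCLUSIVE.

```python
def summarise_run(candidate: dict, baseline: dict, p_value: float,
                  primary: str = "accuracy", alpha: float = 0.05) -> dict:
    """Compute metric deltas in percentage points and a gate status."""
    deltas_pp = {
        name: round((candidate[name] - baseline[name]) * 100, 1)
        for name in candidate
        if name in baseline
    }
    if p_value >= alpha:
        status = "INCONCLUSIVE"      # difference may be noise
    elif deltas_pp[primary] > 0:
        status = "PASS"              # significant improvement on primary metric
    else:
        status = "FAIL"              # assumed label for a significant regression
    return {"status": status, "deltas_pp": deltas_pp, "p_value": p_value}


# Metric values and p-value as given in the flow table above.
summary = summarise_run(
    candidate={"accuracy": 0.941, "factuality": 0.912},
    baseline={"accuracy": 0.928, "factuality": 0.901},
    p_value=0.031,
)
```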

Error Flow

Error	Detection	Response
Evaluation dataset unavailable (S3 outage)	Dataset load failure	Experiment queued; alert dataset custodian; retry with backoff
Model API rate limit during evaluation	Runner error rate	Slow down evaluation; use batch API; emit warning
Statistical test inconclusive (p ≥ 0.05)	Comparison engine	Mark experiment as INCONCLUSIVE; alert prompt owner; may require a larger dataset
Human evaluation SLA breach	Task queue age monitor	Escalate to evaluation team lead; unblock using an automated proxy metric
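
The "retry with backoff" response for a dataset load failure can be sketched as follows. The helper name, attempt count, and delays are assumptions; `load` stands for whatever callable fetches the dataset, and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time


def load_with_backoff(load, attempts: int = 4, base_delay: float = 1.0,
                      sleep=time.sleep):
    """Call load(), retrying on OSError with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return load()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the scheduler
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Hypothetical loader that fails twice before succeeding.
calls = {"count": 0}

def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("dataset store unavailable")
    return ["example-1", "example-2"]


delays = []
dataset = load_with_backoff(flaky_load, sleep=delays.append)
```

`ConnectionError` is a subclass of `OSError`, so network failures are retried while programming errors propagate immediately.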

8. Security Considerations

Evaluation datasets may contain sensitive representative inputs; they must be classified and stored with the same access controls as production data
Experiment results (including individual output comparisons) may reveal model behaviour on sensitive inputs; access to raw outputs should be restricted to the model owner and platform team
Human evaluators must be bound by confidentiality agreements if evaluation involves sensitive content

OWASP LLM Controls

OWASP LLM Risk	Experiment Tracking Control
LLM09 Overreliance	Factuality and hallucination metrics in evaluation suite provide evidence of reliability
LLM03 Training Data Poisoning	Evaluation dataset integrity checks (content hash verification) detect dataset tampering

9. Governance Considerations

Quality Management System

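Experiment tracking provides core evidence for the quality management system that EU AI Act Article 17 requires of providers of high-risk AI systems, although it does not cover the whole of Article 17 by itself; every production configuration must have a traceable experiment record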
Promotion without experiment evidence is a governance policy violation; the promotion workflow enforces experiment ID as a required field

Governance Artefacts

Artefact	Owner	Cadence	Location
Evaluation dataset catalogue	Data Team + Model Owner	Per dataset version	Dataset store + metadata DB
Experiment records	Model Owner	Per experiment	Experiment metadata DB
Promotion audit trail	Platform Team	Per promotion	Experiment DB + Registry
Quarterly quality report	Model Owner	Quarterly	Governance dashboard
Human evaluation guidelines	Model Owner	Annual	Internal wiki

10. Operational Considerations

Monitoring

Signal	Source	Alert Threshold	Owner
Evaluation pipeline failure rate	Job status	>5% failed experiments	Platform Team
Experiment queue depth	Scheduler metrics	>20 jobs pending >30 min	Platform Team
Dataset hash mismatch	Integrity check	Any mismatch	Security + Data Team
Production quality metric regression (vs. last deployment)	Production sampling	>2% drop on key metric	Model Owner + Platform Team
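
Three of the thresholds in the table can be expressed as a simple check. This sketch assumes that the ">2% drop" is measured in percentage points on a metric stored as a fraction, which the table does not state explicitly; the function and alert names are hypothetical.

```python
def active_alerts(failed_runs: int, total_runs: int, dataset_hash_ok: bool,
                  deployed_metric: float, sampled_metric: float) -> list[str]:
    """Return the names of monitoring alerts that should currently fire."""
    alerts = []
    if total_runs and failed_runs / total_runs > 0.05:
        alerts.append("evaluation_pipeline_failure_rate")
    if not dataset_hash_ok:
        alerts.append("dataset_hash_mismatch")
    # Assumption: the 2% threshold means a drop of 2 percentage points.
    if (deployed_metric - sampled_metric) * 100 > 2.0:
        alerts.append("production_quality_regression")
    return alerts
```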

SLOs

SLO	Target	Window
CI evaluation pipeline completion	<15 min for standard 250-example dataset	Per run
Experiment DB availability	99.9%	Rolling 30 days
Production sampling metric freshness	<4 hours lag	Rolling 7 days

Disaster Recovery

Component	RPO	RTO	Strategy
Experiment metadata DB	5 min	30 min	Database replication
Dataset store	0 (content-addressed)	15 min	S3 cross-region replication
Metric time series DB	1 hour	30 min	Cross-region replication; recomputable from experiment DB

11. Cost Considerations

Cost Drivers

Driver	Description	Relative Weight
Evaluation API calls	Running the golden dataset through the model on every PR/experiment	Medium (scales with dataset size and frequency)
Human evaluation	Evaluator time for nuanced tasks	High (reserved for high-risk changes only)
Dataset storage	Versioned datasets; relatively small at scale	Low
Experiment DB hosting	Low-volume OLTP database	Low

Optimisations

Use the cheapest model that produces comparable signals for evaluation runs where the evaluation target is a more expensive model
Maintain a tiered evaluation strategy: fast automated eval (10 examples) for PR gate; comprehensive eval (250 examples) for promotion to staging; extended eval (1000+ examples) for production promotion of high-risk changes
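
The tiered strategy can be sketched as a mapping from pipeline stage to sample size, with a fixed seed so that each tier evaluates the same subset on every run. The stage names and the seed are assumptions, and 1000 is used as the lower bound of the "1000+" tier.

```python
import random

# Stage -> number of golden examples to evaluate (sizes from the text above).
TIER_SIZES = {"pr_gate": 10, "staging": 250, "production_high_risk": 1000}


def select_examples(dataset: list, stage: str, seed: int = 0) -> list:
    """Pick a reproducible subset of the golden dataset for a pipeline stage."""
    size = min(TIER_SIZES[stage], len(dataset))
    rng = random.Random(seed)  # fixed seed keeps the subset stable across runs
    return rng.sample(dataset, size)


golden = [f"example-{i}" for i in range(2000)]  # placeholder dataset
pr_subset = select_examples(golden, "pr_gate")
```

A stable subset matters because metrics from two runs are only comparable when they were computed on the same examples.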

Indicative Cost Range

Scale	Monthly Experiment Infra + Evaluation API Cost
Small (10 prompts, weekly changes)	$300–$1,000
Medium (50 prompts, daily changes, 3 teams)	$2,000–$6,000
Large (200+ prompts, multiple models, continuous evaluation)	$10,000–$30,000

12. Trade-Off Analysis

Evaluation Strategy Options

Option	Description	Pros	Cons	Best For
Automated Only	All evaluation via metric computation against golden dataset	Scalable; fast; cheap	Misses nuanced quality issues; only as good as golden dataset	High-volume, structured output tasks
Human Evaluation Only	All evaluation via human judges	Highest quality	Slow; expensive; not scalable; inter-rater inconsistency	Very high-risk, low-volume decisions
Automated + Human Gate	Automated for all changes; human required for high-risk/major changes	Balanced quality and scalability	Requires defining "high-risk" criterion carefully	Recommended default

Metric Strategy Options

Option	Description	Pros	Cons
Fixed Standard Metric Set	Same metrics for all use cases	Comparability; simplicity	May not capture use-case-specific quality
Per-Use-Case Custom Metrics	Metrics defined per use case	Highest relevance	Comparison across use cases harder; more maintenance
Composite Score	Weighted combination of multiple metrics	Single promotion threshold	Weight calibration difficult; masks individual metric issues
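
The masking risk of a composite score is easy to show with a worked example. The weights and metric values below are invented for illustration only.

```python
def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the metrics named in weights."""
    missing = set(weights) - set(metrics)
    if missing:
        raise KeyError(f"metrics missing for weights: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight


# Hypothetical weights and results.
weights = {"accuracy": 0.5, "factuality": 0.3, "format": 0.2}
baseline = {"accuracy": 0.90, "factuality": 0.90, "format": 0.90}
candidate = {"accuracy": 0.98, "factuality": 0.80, "format": 0.95}

baseline_score = composite_score(baseline, weights)    # about 0.90
candidate_score = composite_score(candidate, weights)  # about 0.92
```

The candidate scores higher overall even though factuality fell by ten percentage points, which is why a composite threshold is usually paired with per-metric floors.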

Architectural Tensions

Tension	Option A	Option B	Resolution
Evaluation completeness vs. CI speed	Full evaluation on every PR	Fast smoke test on PR; full eval on merge	Tiered: 10-example fast check on PR; 250-example full on merge
Golden dataset size vs. cost	Large representative dataset	Minimal viable dataset	Start with 50–100 high-quality examples; expand to 250–500 as infrastructure matures
Automated statistical significance vs. business judgment	Block promotion on p>0.05 only	Block on any quality drop	Statistical gate for automated CI; human judgment gate for production promotion

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Golden dataset staleness (distribution drift from production)	High over time	High — metrics look good but production quality is low	Production quality monitoring diverges from evaluation metrics	Regular dataset refresh cadence; production sampling for dataset maintenance
Evaluation runner hitting model rate limits	Medium	Medium — experiments fail or slow	Runner error rate	Use batch API; reduce parallelism; queue retry
Human evaluator bias	Medium	Medium — biased promotion decisions	Inter-rater reliability monitoring (low kappa)	Evaluator calibration sessions; blind evaluation protocols
Promotion without experiment reference	Low (if enforced)	High — governance gap	Promotion workflow audit	Enforce experiment ID as mandatory in promotion workflow; alert on bypass

14. Regulatory Considerations

EU AI Act Articles 9 and 17

Article 9 requires a risk management system for high-risk AI, including testing; experiment evaluation records provide evidence of that testing
Article 17's quality management system requirement is supported, though not fully satisfied, by the experiment tracking infrastructure and promotion audit trail
Technical documentation required by Article 11 must reference the evaluation methodology, datasets, and metrics used

ISO 42001 Clause 9 (Performance Evaluation)

Experiment tracking supports Clause 9.1 (monitoring, measurement, analysis and evaluation) for AI system performance
Regular production quality sampling provides input to Clause 9.3 (management review)

NIST AI RMF MEASURE 2.3

Metrics for tracking AI performance over time must be defined and measured; the experiment tracking framework and metric time series DB support this outcome. The NIST AI RMF is a voluntary framework, so this is alignment rather than a mandated requirement

15. Reference Implementations

AWS

Component	AWS Service
Evaluation runner	SageMaker Processing Jobs or Lambda (batch)
Dataset store	S3 + DVC
Experiment metadata DB	Amazon RDS PostgreSQL
Metric time series	Amazon Timestream or CloudWatch
Dashboard	Amazon Managed Grafana

Azure

Component	Azure Service
Evaluation runner + experiment tracking	Azure ML Experiments
Dataset store	Azure ML Datasets + Azure Blob Storage
Dashboard	Azure ML Studio

Open Source / SaaS

Component	Technology
Evaluation framework	Ragas (RAG), DeepEval, HELM
Experiment tracking	MLflow, Weights & Biases, Comet ML
Human evaluation	Label Studio (open source), Scale AI, Labelbox

On-Premises

Component	Technology
Evaluation runner	Custom Python harness + Celery
Experiment tracking	MLflow self-hosted
Dataset store	MinIO (S3-compatible) + DVC

16. Related Patterns

Pattern ID	Name	Relationship
EAAPL-PLT001	Enterprise AI Platform	Parent — experiment tracking is Layer 4
EAAPL-PLT005	Prompt Version Control	Dependency — prompt promotions require experiment references
EAAPL-PLT003	Model Routing	Integration — A/B routing data feeds experiment tracking
EAAPL-GOV001	AI Governance Framework	Dependency — experiment records are primary governance evidence

17. Maturity Assessment

Overall Maturity: Proven

Experiment tracking with MLflow, W&B, and Azure ML is production-proven. AI-specific evaluation frameworks (Ragas, DeepEval) are maturing rapidly. The promotion-decision audit trail link is the least standardised component.

Scoring Matrix

Dimension	Score (1–5)	Rationale
Pattern Completeness	5	All sections documented
Implementation Evidence	4	MLOps experiment tracking proven; LLM-specific evaluation less so
Tooling Maturity	4	Ragas/DeepEval maturing; human eval tooling stable
Regulatory Alignment	5	Strong EU AI Act / ISO 42001 alignment
Dataset Management Maturity	3	Golden dataset management is the common weak link in practice

18. Revision History

Version	Date	Author	Changes
1.0	2024-08-01	EAAPL Working Group	Initial publication
1.1	2025-06-12	EAAPL Working Group	Human evaluation workflow expanded; ISO 42001 Clause 9 alignment; production sampling guidance added
