EAAPL-OBS008 · AI Performance Benchmarking
Pattern ID: EAAPL-OBS008
Status: Proven
Complexity: Medium
Tags: observability model-risk slo llm medium-complexity
Version: 1.0.0
Last Reviewed: 2026-06-12
1. Executive Summary
AI system quality degrades silently between benchmarking events. A model that scored 87% accuracy in its pre-deployment evaluation may be delivering 72% accuracy in production six months later — and no alert has fired because there is no continuous quality measurement. Organisations run evaluations at deployment time but rarely maintain a living benchmark that tracks quality continuously, gates new deployments, and validates that offline evaluation scores actually predict production outcomes.
This pattern defines a continuous AI performance benchmarking system that operates across the full AI system lifecycle. It covers: golden dataset management with versioned, human-labeled ground truth; automated regression testing on every model or prompt deployment with quality gates; quality metric tracking over time (accuracy, F1, ROUGE, BERTScore, and LLM-as-judge scores depending on task type); A/B comparison of model versions with statistical significance testing; performance budget alerts that block promotion when latency regresses; benchmark-to-production correlation validation to ensure offline metrics predict production quality; and an executive quality scorecard delivering weekly insights on quality trend, incident count, and SLO attainment. Together, these capabilities give AI teams the equivalent of a CI/CD test suite and production monitoring combined — but for AI quality.
Target Audience: CIO, CTO, Head of AI/ML Engineering, Model Risk Manager
Time to Implement: 6–10 weeks
2. Problem Statement
Business Problem
Organisations invest heavily in AI model selection and evaluation at deployment time, then assume the decision holds indefinitely. It doesn't. Model providers update models, prompt templates drift, retrieval quality degrades, and world-state changes make yesterday's training data less relevant. Without continuous benchmarking, quality degradation is discovered through business outcomes — lower conversion, more escalations, customer complaints — that lag the technical root cause by weeks. Critically, organisations cannot demonstrate to regulators or auditors that their AI systems continue to perform as specified.
Technical Problem
AI quality measurement requires ground truth labels, which are expensive and time-consuming to produce and go stale quickly. Most organisations lack the infrastructure to: maintain a versioned, representative golden dataset; run evaluations automatically on every deployment; compare results statistically to distinguish real regression from measurement noise; and correlate offline benchmark scores with production outcomes. Without these capabilities, evaluation is a one-time event rather than a continuous control.
Symptoms
- AI evaluation happens once at deployment; next evaluation is when something breaks
- Different teams use different metrics for the same AI task; results are not comparable
- A prompt template change was deployed that reduced quality by 15%; this was discovered 3 weeks later through a customer satisfaction survey
- A cheaper model was adopted to reduce costs; whether quality was maintained is unknown
- Regulators request evidence of ongoing AI performance monitoring; the organisation has only deployment-time evaluation reports
Cost of Inaction
- Silent quality regressions accumulate over months, compounding the erosion of user trust
- APRA CPG 234 model risk management requires ongoing performance monitoring for material models
- EU AI Act Article 9.5 requires testing as part of risk management for high-risk AI
- Failed model substitution (cheaper model deployed, quality unmeasured) results in both cost and quality failures
- Without benchmark-to-production correlation data, engineering teams cannot trust that improving benchmark scores will improve production outcomes
3. Context
When to Apply
- Any production AI system where quality can be measured against ground truth or proxy metrics
- Systems where model or prompt changes are deployed more than once a month
- AI systems subject to regulatory performance monitoring obligations
- Before adopting a new model version or major prompt redesign (A/B evaluation required)
- Prerequisite: EAAPL-OBS001 provides production telemetry for calibration; EAAPL-OBS005 provides drift signals
When NOT to Apply
- Pure creative generation tasks with no objective quality measure (subjective evaluation only)
- Proof-of-concept systems with < 30-day lifespan
- Systems where ground truth is unavailable and proxy metrics are insufficient for quality assessment
Prerequisites
| Prerequisite | Required | Notes |
|---|---|---|
| Golden dataset with human-labeled ground truth | Required | Without ground truth, evaluation is proxy-only |
| Evaluation metric definition for the task type | Required | Must be agreed before benchmarking infrastructure is built |
| Model/prompt deployment pipeline with gates | Required | Benchmarking gates must be able to block promotion |
| EAAPL-OBS001 telemetry | Recommended | Production metric correlation requires production telemetry |
Industry Applicability
| Industry | Applicability | Primary Driver |
|---|---|---|
| Financial Services | Critical | APRA CPG 234 model validation, ASIC AI guidance |
| Healthcare | Critical | Clinical AI accuracy, safety, and ongoing validation obligations |
| Legal Services | High | Professional accuracy standards; liability management |
| Government | High | Public service quality; FOI and accountability |
| Technology / SaaS | High | SLA quality obligations; competitive differentiation |
| Retail / E-Commerce | Medium | Recommendation and search quality |
4. Architecture Overview
The AI Performance Benchmarking Architecture is a three-phase system: evaluation infrastructure (golden dataset + evaluation pipeline), deployment gating (regression testing on every change), and production correlation (validating that offline benchmarks predict production quality).
Golden Dataset Management
The golden dataset is the foundation. It consists of input samples paired with expected outputs (for generation tasks) or correctness labels (for classification tasks). The dataset is curated by subject matter experts to be representative of the production input distribution, covering common cases, edge cases, and adversarial inputs. It is version-controlled alongside model artifacts in the model registry. Key management principles: the golden dataset must be protected from training data contamination (model must not train on evaluation data); it must be regularly refreshed (quarterly for most tasks; monthly for rapidly evolving domains); each version must be approved by the model owner and a quality assurance lead; and the version used for each evaluation is recorded alongside the result. For LLM tasks, the golden dataset contains: prompt inputs, reference outputs, and evaluation criteria for LLM-as-judge assessment.
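The versioning and approval rules above can be sketched as a small manifest object. This is a minimal illustration, not a prescribed schema: the class names, field names, and the role labels `model_owner` and `qa_lead` are assumptions introduced for the example.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GoldenExample:
    prompt: str
    reference_output: str
    judge_criteria: tuple = ()  # criteria for LLM-as-judge assessment


@dataclass
class DatasetVersion:
    version: str
    examples: list
    approved_by: list = field(default_factory=list)  # hypothetical role labels

    def content_hash(self) -> str:
        # A deterministic hash of the content lets each evaluation result
        # record exactly which dataset content it was computed against.
        payload = json.dumps(
            [[e.prompt, e.reference_output, list(e.judge_criteria)]
             for e in self.examples]
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def is_approved(self) -> bool:
        # The pattern requires sign-off from the model owner and a QA lead.
        return {"model_owner", "qa_lead"} <= set(self.approved_by)
```

Storing the hash next to the version label guards against a version tag being reused for different content.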
Evaluation Metric Framework
Metrics are selected by task type. Classification: accuracy, precision, recall, F1. Retrieval/RAG: precision@k, recall@k, mean reciprocal rank (MRR), NDCG. Text generation (summarisation): ROUGE-1, ROUGE-2, ROUGE-L, BERTScore. Text generation (open-ended): LLM-as-judge score on criteria (relevance, accuracy, completeness, safety). Latency: p50, p95, p99 (offline benchmark; compared against production p99 SLO). All metrics have a defined baseline (from the current production model) and a deployment threshold (minimum acceptable score for promotion). The threshold is set at 97% of baseline for critical tasks, 95% for standard tasks.
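The threshold rule can be written down directly. The sketch below assumes every metric is higher-is-better (latency is handled separately by the performance budget); the function and dictionary names are illustrative.

```python
# Ratios from the pattern: 97% of baseline for critical tasks, 95% for standard.
THRESHOLD_RATIO = {"critical": 0.97, "standard": 0.95}


def deployment_threshold(baseline: float, criticality: str) -> float:
    """Minimum acceptable score for promotion."""
    return baseline * THRESHOLD_RATIO[criticality]


def quality_gate(candidate: dict, baseline: dict, criticality: str) -> dict:
    """Compare each candidate metric with its threshold (higher is better)."""
    per_metric = {
        name: candidate[name] >= deployment_threshold(score, criticality)
        for name, score in baseline.items()
    }
    return {"per_metric": per_metric, "passed": all(per_metric.values())}
```

For a baseline accuracy of 0.87 on a critical task the threshold is 0.87 × 0.97 = 0.8439, so a candidate at 0.85 passes and a candidate at 0.84 is blocked.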
Automated Regression Testing Pipeline
The benchmarking pipeline runs automatically on every model version change, prompt template change, or RAG retrieval configuration change. The pipeline steps: load the current golden dataset version; run the AI system under test (the candidate version) against all golden dataset inputs in parallel; compute all evaluation metrics; compare to baseline metrics stored in the model registry; evaluate against deployment thresholds; produce a benchmark report with pass/fail decision and metric breakdown; and block or allow promotion based on the gate decision. The pipeline runs in the CI/CD system and its output is a required check before production deployment.
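The "run in parallel, then compute metrics" steps might look like the following sketch. It assumes the system under test is exposed as a plain callable and that exact-match accuracy is the metric; both are simplifications, and the dictionary keys `input` and `expected` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(system_under_test, dataset, max_workers=8):
    """Run the candidate against every golden input and score exact matches."""
    inputs = [example["input"] for example in dataset]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so outputs stay
        # aligned with the dataset rows.
        outputs = list(pool.map(system_under_test, inputs))
    correct = sum(out == ex["expected"] for out, ex in zip(outputs, dataset))
    return {"accuracy": correct / len(dataset), "n": len(dataset)}
```

Threads suit this step when the candidate is called over a network API; a locally hosted model would more likely be batched instead.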
A/B Statistical Comparison
When two model versions are compared, statistical tests determine whether observed quality differences are real or within measurement noise. For proportion-type metrics (accuracy, success rate), a two-proportion z-test is used. For continuous metrics (latency, BERTScore), a Mann-Whitney U test (non-parametric) is used. The significance threshold is p < 0.05. A result is reported as "statistically significant improvement," "statistically significant regression," or "no significant difference." The effect size (Cohen's h or d) is also reported to distinguish statistically significant but practically negligible differences from meaningful improvements.
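A minimal standard-library sketch of the proportion case follows: a pooled two-proportion z-test with Cohen's h as the effect size. The function names are illustrative. For continuous metrics, `scipy.stats.mannwhitneyu` provides the Mann-Whitney U test and is not reproduced here.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for candidate B versus baseline A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # both samples all-correct or all-wrong: no evidence of change
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def cohens_h(p_a, p_b):
    """Effect size for a difference between two proportions."""
    return 2 * math.asin(math.sqrt(p_b)) - 2 * math.asin(math.sqrt(p_a))


def verdict(success_a, n_a, success_b, n_b, alpha=0.05):
    z, p_value = two_proportion_z_test(success_a, n_a, success_b, n_b)
    if p_value >= alpha:
        return "no significant difference"
    if z > 0:
        return "statistically significant improvement"
    return "statistically significant regression"
```

On a 1,000-example dataset, a drop from 870 to 720 correct is reported as a significant regression, while 870 versus 875 is reported as no significant difference.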
Performance Budget Enforcement
Latency is a quality dimension. The performance budget defines maximum acceptable latency for each AI endpoint. If the candidate version's p99 latency increases by more than 20% versus the production baseline, the promotion is blocked. This prevents quality optimisations that inadvertently degrade latency. Performance budgets are defined per endpoint and stored in the model registry alongside quality thresholds.
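The latency gate can be sketched as below. The nearest-rank percentile method is an assumption made for the example; the pattern does not say how p99 should be estimated.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_gate(candidate_latencies_ms, baseline_p99_ms, max_regression=0.20):
    """Block promotion when candidate p99 exceeds baseline p99 by more than 20%."""
    candidate_p99 = percentile(candidate_latencies_ms, 99)
    increase = (candidate_p99 - baseline_p99_ms) / baseline_p99_ms
    return {
        "candidate_p99_ms": candidate_p99,
        "relative_increase": increase,
        "passed": increase <= max_regression,
    }
```

With few samples the p99 estimate is dominated by one or two slow calls, so the gate is only as stable as the number of latency measurements behind it.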
Benchmark-to-Production Calibration
The critical validation question: does an improvement in offline benchmark score actually translate into a comparable improvement in production quality? The calibration pipeline tracks, for each model version promotion: the offline benchmark score before deployment and the production quality metric (estimated from production feedback, hallucination detection, user satisfaction) after deployment. Calibration is measured as the correlation between offline score change and production metric change. Well-calibrated evaluation (correlation > 0.7) means the benchmark can be trusted. Poor calibration (correlation < 0.4) means the golden dataset is not representative of production cases and needs revision.
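A small sketch of the calibration check, using Pearson correlation as listed in the components table. The label for the band between 0.4 and 0.7 is an assumption, since the pattern defines only the two outer thresholds.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def calibration_status(offline_deltas, production_deltas):
    """Classify calibration from per-promotion metric changes."""
    r = pearson(offline_deltas, production_deltas)
    if r > 0.7:
        status = "well calibrated"
    elif r < 0.4:
        status = "poorly calibrated: revise golden dataset"
    else:
        status = "indeterminate"  # assumed label; not defined by the pattern
    return r, status
```

Each pair of values corresponds to one model version promotion, so a correlation computed from only a handful of promotions should be read with caution.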
Executive Quality Scorecard
A weekly automated report delivers to CTO, AI engineering leads, and model risk manager: quality SLO attainment by AI system (% of week within quality thresholds), quality trend (improving/stable/degrading) by AI system, incident count by type (quality incidents from EAAPL-OBS004), deployment count and gate pass/fail rate, and top three quality risks. This replaces ad-hoc quality reporting with a systematic governance artefact.
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Golden Dataset Store | Storage | Versioned dataset with inputs, expected outputs, metadata | S3/GCS/Azure Blob + DVC; MLflow datasets; Hugging Face datasets | Critical |
| Contamination Checker | Service | Verify golden dataset inputs are not in training data | MinHash deduplication; exact match check against training data | Critical |
| Evaluation Pipeline | CI/CD Job | Run AI system against golden dataset; compute all metrics | Custom Python; Promptfoo; Ragas; LangChain Evaluator; Evals framework | Critical |
| Metric Compute Library | Library | Compute accuracy, F1, ROUGE, BERTScore, LLM-as-judge | HuggingFace evaluate; custom scorers; Python scikit-learn | Critical |
| LLM-as-Judge Service | Service | Open-ended quality evaluation using a judge LLM | GPT-4o, Claude 3.5 as judge; Prometheus judge model (open source) | High |
| Baseline Store | Storage | Metric baselines per model version per task; deployment thresholds | Model registry metadata; PostgreSQL | Critical |
| Statistical Test Engine | Library | A/B significance testing on metric distributions | Python scipy.stats; custom two-proportion z-test | High |
| Performance Budget Enforcer | CI/CD Gate | Block promotions violating latency budget | Custom gate in CI pipeline; integrated with benchmark pipeline | High |
| Calibration Pipeline | Batch Job | Correlate offline metrics with production signals | Custom Python Pearson correlation; quarterly batch job | Medium |
| Quality Metric History Store | Storage | Time-series quality metrics per model version; long-term trend data | InfluxDB; TimescaleDB; BigQuery | High |
| Executive Scorecard Generator | Batch Job | Weekly report generation; send to distribution list | Python report generator + SendGrid/SES; Looker/Power BI scheduled report | Medium |
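The Contamination Checker's job can be illustrated with exact Jaccard similarity over word shingles. This is a simplified stand-in for MinHash, which approximates the same similarity at much larger scale; the shingle size and the 0.8 threshold are assumptions chosen for the example.

```python
def shingles(text, k=3):
    """Set of k-word shingles from lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_contaminated(golden_inputs, training_texts, threshold=0.8):
    """Indices of golden inputs that closely match any training text."""
    training_sets = [shingles(text) for text in training_texts]
    flagged = []
    for index, text in enumerate(golden_inputs):
        candidate = shingles(text)
        if any(jaccard(candidate, t) >= threshold for t in training_sets):
            flagged.append(index)
    return flagged
```

The pairwise comparison is quadratic in corpus size, which is the practical reason the components table points to MinHash for real training sets.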
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | CI/CD Trigger | Model or prompt change merged; evaluation pipeline triggered | Pipeline job started |
| 2 | Dataset Loader | Loads current approved golden dataset version; verifies dataset integrity | Golden dataset loaded |
| 3 | AI System Under Test | Runs candidate model/prompt against all golden dataset inputs in parallel | Generated outputs for all inputs |
| 4 | Metric Compute Library | Computes all configured metrics: accuracy, F1, ROUGE, BERTScore, LLM-as-judge, latency | Metric scores per input + aggregate scores |
| 5 | Baseline Comparator | Loads production baseline from baseline store; computes delta for each metric | Metric deltas: improvement/regression percentage |
| 6 | Threshold Evaluator | Checks each metric against deployment threshold (97% of baseline for critical tasks, 95% for standard tasks) | Pass/fail per metric; overall pass/fail |
| 7 | Statistical Test Engine | Runs Mann-Whitney / z-test for each metric vs baseline distribution | p-value and effect size per metric; statistical significance verdict |
| 8 | Performance Budget Enforcer | Checks latency p99 delta against 20% budget | Latency gate pass/fail |
| 9 | Benchmark Report | Generates report with all metrics, deltas, significance results, gate decision | Benchmark report artefact linked to deployment |
| 10 | Quality History Store | Records metric scores with model version, dataset version, timestamp | Queryable quality history |
| 11 | Calibration Pipeline (quarterly) | Correlates offline scores with production feedback for recent model versions | Calibration report; alert if calibration poor |
Error Flow
| Error Scenario | Detection | Action | Recovery |
|---|---|---|---|
| Golden dataset contaminated | Contamination checker finds overlap with training data | Block evaluation; alert quality assurance team | Remove contaminated samples; re-approve clean dataset |
| LLM-as-judge service unavailable | Evaluation pipeline reports LLM-judge timeout | Retry with backoff; if retries are exhausted, fail pipeline and alert | Restore judge service; re-run evaluation |
| Statistical test insufficient samples | Test returns p-value warning for small sample size | Log warning; report result with insufficient power caveat | Increase golden dataset size for affected task |
| Baseline store missing for new task | Pipeline fails with missing baseline error | Use initial deployment score as baseline; set baseline in store | Manual baseline creation for new tasks |
| Calibration correlation poor (< 0.4) | Quarterly calibration pipeline alert | Notify dataset quality team; initiate golden dataset revision | Refresh golden dataset with current production-representative samples |
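The retry behaviour for an unavailable judge service could be sketched as follows. The `judge` callable, the use of `TimeoutError` as the failure signal, and the default delays are assumptions; a real client library will raise its own exception types.

```python
import time


def call_judge_with_retry(judge, payload, attempts=3, base_delay=1.0,
                          sleep=time.sleep):
    """Call the judge, retrying timeouts with exponential backoff."""
    for attempt in range(attempts):
        try:
            return judge(payload)
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # retries exhausted: let the pipeline fail and alert
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the function testable without real delays.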
8. Security Considerations
Authentication: Evaluation pipeline accesses golden dataset via service account. LLM-as-judge API key in secrets manager. Baseline store access restricted to evaluation service and model registry.
Authorisation: Golden dataset contains proprietary evaluation data; access restricted to AI engineering and model risk management. Benchmark reports are Internal; available to product and engineering leads.
Secrets Management: LLM-as-judge API keys rotated quarterly. Dataset encryption keys in KMS. Evaluation pipeline credentials managed by CI/CD platform secret management.
Data Classification: Golden dataset with human-labeled examples is classified as Confidential (contains curated IP). Benchmark reports are classified as Internal. Executive scorecard is classified as Internal.
Encryption: Golden dataset encrypted at rest (AES-256) and in transit (TLS 1.3). Benchmark reports stored with encryption. Golden dataset versions backed up with encryption.
Auditability: Every evaluation run is logged with: dataset version, model version, pipeline version, run timestamp, pass/fail, and the full metric report artefact. Dataset version changes are audited with approver and rationale.
OWASP LLM Top 10 Coverage
| OWASP LLM Risk | Benchmarking Control | Implementation |
|---|---|---|
| LLM01 Prompt Injection | Adversarial injection examples in golden dataset | Include injection test cases; injection resistance is a scored evaluation dimension |
| LLM02 Insecure Output Handling | Output safety metrics in evaluation | Safety gate in evaluation pipeline; unsafe outputs fail benchmark |
| LLM03 Training Data Poisoning | Benchmark accuracy regression detects poisoned model | Sudden accuracy drop in regression testing signals poisoning |
| LLM04 Model Denial of Service | Latency budget enforcement catches DoS-vulnerable models | Latency regression gate blocks slow models |
| LLM05 Supply Chain Vulnerabilities | Model version recorded with every benchmark; unexpected version fails gate | Model identity verification in evaluation pipeline |
| LLM06 Sensitive Information Disclosure | Output PII scan included in evaluation metrics | PII in generated outputs = benchmark failure |
| LLM07 Insecure Plugin Design | Tool use accuracy included in evaluation for agent models | Tool invocation correctness is a scored evaluation dimension |
| LLM08 Excessive Agency | Agent scope boundary tests in golden dataset | Scope violation = benchmark failure |
| LLM09 Overreliance | Accuracy thresholds enforce minimum quality before deployment | Below-threshold models cannot be promoted; prevents overreliance on degraded models |
| LLM10 Model Theft | Benchmark metrics are internal artefacts; access controlled | Benchmark results not published externally |
9. Governance Considerations
Responsible AI: Continuous benchmarking is the technical implementation of the principle that AI systems must meet quality obligations throughout their operational lifetime. The benchmark report for each deployment is the evidence base for responsible AI governance review.
Model Risk Management: For material AI models, every benchmark result is a model risk management artefact. The baseline store maintains the performance history required for model risk review. Quality regressions trigger model risk events.
Human Approval: New golden dataset versions require approval from model owner and quality assurance lead before use in evaluation. Threshold changes require model risk manager approval. Any deployment promoted under a waiver (manual override of a failed gate) requires documented VP-level approval.
Policy: The benchmarking policy must define: golden dataset refresh cadence, threshold setting methodology and approval, deployment gate criteria, waiver process for exceptional circumstances, and retention period for benchmark artefacts (minimum 7 years for regulated AI decisions).
Traceability: Every production AI decision is linked (via model version) to the benchmark report that demonstrated quality before deployment. This chain enables: "this decision was made by model version X, which had accuracy Y on the approved evaluation set at the time of deployment."
Governance Artefacts
| Artefact | Owner | Frequency | Format |
|---|---|---|---|
| Golden Dataset Version Register | AI Engineering + QA | Per version change | Version-controlled manifest with approval record |
| Benchmark Report per Deployment | AI Engineering | Per deployment | Automated report stored in model registry |
| Deployment Gate Waiver Log | AI Engineering + Model Risk | Per waiver | Signed waiver with rationale and risk acceptance |
| Calibration Correlation Report | ML Platform | Quarterly | Statistical analysis document |
| Executive Quality Scorecard | AI Platform | Weekly | Automated email report + dashboard |
| Annual Benchmarking Review | Model Risk + AI Engineering | Annual | Review of dataset representativeness, threshold calibration, metric relevance |
10. Operational Considerations
Monitoring: Evaluation pipeline availability and run time are monitored. Failed evaluations block deployment and raise alerts. The baseline store and golden dataset store are high-durability assets; their backup status is monitored.
Logging: Every evaluation run produces a structured log record: run ID, model version, dataset version, all metric scores, gate decisions, run duration. These records are immutable.
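One possible shape for the structured record is sketched below. The field names follow the list above; the added timestamp and the JSON serialisation are assumptions. A frozen dataclass only prevents in-process field reassignment, so immutability at rest still has to come from the log store.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvaluationRunRecord:
    run_id: str
    model_version: str
    dataset_version: str
    metric_scores: dict
    gate_decisions: dict
    run_duration_s: float
    timestamp: str


def new_record(model_version, dataset_version, metric_scores,
               gate_decisions, run_duration_s):
    return EvaluationRunRecord(
        run_id=str(uuid.uuid4()),
        model_version=model_version,
        dataset_version=dataset_version,
        metric_scores=metric_scores,
        gate_decisions=gate_decisions,
        run_duration_s=run_duration_s,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


def to_log_line(record):
    """Serialise the record as a single JSON line for a log pipeline."""
    return json.dumps(asdict(record), sort_keys=True)
```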
Incident Response: If the evaluation pipeline fails and a deployment is blocked, the on-call engineering team investigates the pipeline failure. If a production quality regression is detected (post-deployment), the incident management process (EAAPL-OBS004) is triggered.
Disaster Recovery: The golden dataset and baseline store are the most critical assets. They require RPO < 1 hour and RTO < 30 minutes. Evaluation pipeline can be rerun after recovery without data loss.
Capacity Planning: The evaluation pipeline must complete within the CI/CD timeout (typically 30–60 minutes). For large golden datasets (> 10K examples), parallelisation is required. With 100 parallel evaluation workers and 10K examples taking roughly 1 second each, wall-clock time is about 10,000 ÷ 100 = 100 seconds, well within typical CI/CD budgets.
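The capacity estimate is simple enough to keep as a helper when sizing worker pools. The sketch assumes uniform per-example latency and ignores scheduling overhead and provider rate limits, so it is a lower bound.

```python
import math


def estimated_wall_clock_s(n_examples, seconds_per_example, workers):
    """Lower-bound wall-clock time: each worker handles its share sequentially."""
    return math.ceil(n_examples / workers) * seconds_per_example


# The worked figure from the text: 10K examples, 1 s each, 100 workers.
ESTIMATE_S = estimated_wall_clock_s(10_000, 1.0, 100)
```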
SLO Table
| SLO | Target | Measurement | Alert Threshold |
|---|---|---|---|
| Evaluation pipeline completion time | < 30 minutes for datasets up to 10K examples | Pipeline run duration | > 60 minutes (blocks deployment) |
| Evaluation pipeline availability | > 99% | Pipeline health check | Any failure blocking a scheduled deployment |
| Benchmark result delivery | < 5 minutes after pipeline completion | Report generation timestamp | > 15 minutes (deployment blocked) |
| Weekly scorecard delivery | By 08:00 Monday AEST | Report send timestamp | Missed delivery triggers manual escalation |
Disaster Recovery Table
| Component | RTO | RPO | Recovery Approach |
|---|---|---|---|
| Golden Dataset Store | 30 minutes | 1 hour | Replicated object storage; version history |
| Baseline Store | 30 minutes | 1 hour | Database replication; daily backup |
| Evaluation Pipeline | 15 minutes | N/A (stateless) | Redeploy from CI/CD; re-run evaluation |
| Quality History Store | 30 minutes | 1 hour | Time-series DB replication |
11. Cost Considerations
Cost Drivers
| Driver | Description | Relative Cost |
|---|---|---|
| LLM-as-judge evaluation | One judge LLM call per evaluated example; scales with dataset size and evaluation frequency | High for large datasets |
| Golden dataset inference | Running candidate model on full golden dataset; cost scales with model and dataset size | Medium to High |
| Evaluation compute (CI/CD) | Parallelised evaluation job; ephemeral compute | Medium |
| Storage for benchmark artefacts | Benchmark reports, metric history, dataset versions | Low |
Scaling Risks: LLM-as-judge cost is proportional to dataset size × deployment frequency. A 10K-sample dataset with LLM-as-judge evaluation run on every PR can generate significant cost. Mitigation: use LLM-as-judge only on a stratified sample for PR-level checks; full dataset for release-level checks.
Optimisations:
- Stratified sampling for CI/CD checks (500 representative examples, not full dataset)
- Use open-source judge models (Prometheus, Llama 3.1 70B with judge prompt) for internal evaluation
- Cache evaluation results for unchanged inputs: if the prompt template for a component did not change, skip re-evaluation of the examples that exercise only that component
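The stratified sampling optimisation might be implemented along these lines. Proportional allocation with at least one example per stratum is an assumption; the pattern does not say how strata are defined or weighted, and the `key` function is hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(examples, key, total, seed=0):
    """Draw roughly `total` examples, proportionally from each stratum."""
    rng = random.Random(seed)  # fixed seed keeps PR-level runs comparable
    strata = defaultdict(list)
    for example in examples:
        strata[key(example)].append(example)
    sample = []
    for group in strata.values():
        # Proportional share, but never drop a stratum entirely.
        share = max(1, round(total * len(group) / len(examples)))
        sample.extend(rng.sample(group, min(share, len(group))))
    return sample
```

Because of rounding and the one-per-stratum floor, the result can differ from `total` by a few examples.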
Indicative Cost Range
| Scale | Deployments/Month | Golden Dataset Size | Estimated Benchmarking Cost/Month |
|---|---|---|---|
| Small | 5 | 1,000 examples | $200–$500 |
| Medium | 20 | 5,000 examples | $1,000–$3,000 |
| Large | 50 | 20,000 examples | $5,000–$15,000 |
| Enterprise | 200+ | 100,000 examples | $20,000–$60,000 |
12. Trade-Off Analysis
Approach Comparison
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Full golden dataset evaluation on every deployment | Comprehensive; catches all regression types; high confidence | Expensive; slow pipeline if large dataset; LLM-as-judge cost | Release-level deployments; regulated AI; material models |
| Stratified sample evaluation on PRs, full on releases | Balances cost and coverage; fast CI feedback; full confidence at release | PR-level checks may miss rare regression patterns | Most production AI systems; standard approach |
| Production monitoring only (no offline evaluation) | No evaluation infrastructure; uses real production quality signals | Hallucinations delivered to real users; no pre-deployment gate; only detects regressions post-harm | Not recommended; acceptable only for extremely low-risk AI |
Architectural Tensions
| Tension | Description | Resolution |
|---|---|---|
| Sensitivity vs. False Gates | Strict thresholds catch real regressions but may block valid improvements due to dataset variance | Use statistical significance testing so that natural measurement variance does not block promotion |
| Dataset freshness vs. Stability | Frequent dataset updates keep evaluation representative but change the baseline, making trend analysis harder | Version datasets carefully; maintain consistent baseline versions for trend analysis |
| LLM-as-judge accuracy vs. Cost | Human evaluation is most accurate; an LLM judge trades some accuracy for much lower cost (indicatively around 90% correlation with human ratings at about 1% of the cost) | Use LLM judge for CI/CD gates; human evaluation for periodic calibration and audit |
| Offline vs. Production quality | Offline benchmark may not reflect production distribution | Calibration pipeline validates correlation; poor calibration triggers dataset revision |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Golden dataset staleness (not refreshed) | High | High (benchmark passes but production degrades) | Calibration correlation drops below 0.4 | Mandatory quarterly refresh; calibration gate |
| Evaluation contamination (model trained on eval data) | Low | Critical (benchmark is meaningless; model overfits to eval) | Random holdout check; unexpectedly high scores | Remove contaminated data; retrain model; regenerate baseline |
| Metric gaming (optimise for benchmark, not production) | Medium | High (benchmark up, production neutral/down) | Calibration correlation decline | Use diverse metrics; include adversarial examples; monitor production |
| LLM-as-judge inconsistency | Medium | Medium (evaluation variance too high; statistical noise masks real changes) | Judge consistency test (same input twice → same score) | Fine-tune judge; use structured scoring rubric; ensemble judges |
| Baseline drift (baseline updated too frequently) | Medium | Medium (threshold always met; real regression masked) | Baseline update frequency audit | Policy: baseline update only on major model releases; not on incremental changes |
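The judge consistency test from the table (same input scored twice should give the same score) can be sketched as below. The `judge` callable returning a numeric score and the tolerance parameter are assumptions.

```python
def judge_consistency(judge, inputs, repeats=2, tolerance=0.0):
    """Fraction of inputs whose repeated judge scores agree within tolerance."""
    consistent = 0
    for item in inputs:
        scores = [judge(item) for _ in range(repeats)]
        if max(scores) - min(scores) <= tolerance:
            consistent += 1
    return consistent / len(inputs)
```

A low consistency rate suggests that judge variance, not model change, may be driving metric movement, which is the failure the table describes.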
Cascading Scenarios
- Scenario 1: Golden dataset not refreshed after product domain expansion → benchmark scores remain high on old examples → new domain inputs hallucinate regularly → production quality invisible to benchmark → regulatory audit finds undocumented AI accuracy degradation. Mitigation: dataset refresh mandated when product domain changes; calibration pipeline catches divergence.
- Scenario 2: LLM-as-judge upgraded to newer version → judge scoring calibration changed → all models appear to regress → multiple deployments blocked → engineers override gates manually → override waivers accumulate without review → actual regressions also pass via waiver. Mitigation: any judge version change requires baseline recalibration before gating resumes; waiver rate is tracked as a KPI, with an alert when it exceeds 10%.
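The waiver-rate KPI from Scenario 2 reduces to a small check. The record shape, with a boolean `waived` field per deployment, is hypothetical.

```python
def waiver_rate(deployments):
    """Share of deployments promoted via a manual gate override."""
    if not deployments:
        return 0.0
    return sum(1 for d in deployments if d["waived"]) / len(deployments)


def waiver_alert(deployments, max_rate=0.10):
    """True when the waiver rate exceeds the 10% alert level."""
    return waiver_rate(deployments) > max_rate
```

Evaluating this over a rolling window rather than all-time history would keep a recent burst of overrides from being diluted.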
14. Regulatory Considerations
| Regulation | Clause | Requirement | Benchmarking Implementation |
|---|---|---|---|
| APRA CPG 234 | Section 6.3 (Model Validation) | Material models require pre-deployment validation and ongoing performance assessment | Deployment gate benchmarking + quality history implements this requirement |
| APRA CPG 234 | Section 6.4 (Performance Monitoring) | Ongoing monitoring against defined performance standards | Quality SLO attainment in weekly scorecard |
| EU AI Act | Article 9.5 (Testing) | High-risk AI systems must be tested to ensure they meet requirements before deployment | Automated benchmark gate is the testing implementation |
| EU AI Act | Article 10 (Data and Data Governance) | Appropriate data governance for training, validation, testing datasets | Golden dataset governance: versioning, contamination check, approval workflow |
| ISO/IEC 42001 | Clause 9.1.3 (Evaluation of AI System Performance) | AI system performance must be evaluated at planned intervals | Continuous benchmarking with quarterly full evaluation fulfils this clause |
| NIST AI RMF | MEASURE 2.5 | AI system performance tracked over time and evaluated against baselines | Quality history store and trend analysis implement MEASURE 2.5 |
| Privacy Act 1988 (AU) | APP 3 / APP 11 | Golden datasets containing personal information require appropriate safeguards | Anonymise or synthesise golden dataset inputs containing PII; access controls |
15. Reference Implementations
AWS
- Golden Dataset Store: Amazon S3 with versioning and access logging; metadata in DynamoDB
- Evaluation Pipeline: AWS Step Functions orchestrating Lambda evaluation workers; or AWS Batch for heavy compute
- LLM-as-Judge: Amazon Bedrock (Claude 3.5 Haiku for cost efficiency)
- Metric Compute: AWS Lambda with Python evaluation libraries (HuggingFace evaluate, sacrebleu)
- Baseline Store: Amazon DynamoDB (per model version, per task, per metric)
- Quality History: Amazon Timestream (time-series); Amazon QuickSight dashboards
- CI/CD Integration: AWS CodePipeline gate; GitHub Actions workflow calling evaluation Lambda
- Scorecard Delivery: Amazon SES for email; CloudWatch Dashboards for real-time view
Azure
- Golden Dataset Store: Azure Blob Storage with versioning; metadata in Azure Cosmos DB
- Evaluation Pipeline: Azure Machine Learning Pipelines; Azure Container Instances
- LLM-as-Judge: Azure OpenAI Service (GPT-4o mini for cost efficiency)
- Metric Compute: Azure ML compute with Python evaluation libraries
- Baseline Store: Azure Cosmos DB
- Quality History: Azure Monitor Custom Metrics; Azure Data Lake + Synapse for history
- CI/CD Integration: Azure DevOps pipeline gate; GitHub Actions calling Azure ML
- Scorecard Delivery: Power BI Scheduled Report; SendGrid
GCP
- Golden Dataset Store: Google Cloud Storage; Vertex AI Datasets for governance
- Evaluation Pipeline: Vertex AI Pipelines; Cloud Run jobs
- LLM-as-Judge: Vertex AI Gemini 1.5 Flash (cost-optimised)
- Metric Compute: BigQuery ML; Cloud Run with Python evaluation libraries
- Baseline Store: BigQuery with versioned metric table
- Quality History: BigQuery (excellent for time-series aggregations)
- CI/CD Integration: Cloud Build step; GitHub Actions calling Cloud Build
- Scorecard Delivery: Looker Scheduled Delivery; Cloud Pub/Sub + Cloud Functions
On-Premises
- Golden Dataset Store: MinIO (S3-compatible); DVC for version control
- Evaluation Pipeline: Airflow DAG with parallel task operators; custom Python runner
- LLM-as-Judge: Self-hosted Llama 3.1 70B with judge system prompt (Prometheus format); or Ollama
- Metric Compute: Custom Python library; HuggingFace evaluate; scikit-learn
- Baseline Store: PostgreSQL with versioned metric schema
- Quality History: InfluxDB or TimescaleDB
- CI/CD Integration: Jenkins post-build step; GitHub Actions self-hosted runner
- Scorecard Delivery: Automated Markdown/PDF report via email (Postfix/Mailgun)
16. Related Patterns
| Pattern ID | Pattern Name | Relationship | Notes |
|---|---|---|---|
| EAAPL-OBS001 | AI Telemetry Architecture | Foundation | Production metric data used in benchmark-to-production calibration |
| EAAPL-OBS003 | Hallucination Detection | Sibling | Production hallucination rate is a key calibration signal for benchmark accuracy |
| EAAPL-OBS004 | AI Incident Management | Depends On | Benchmark gate failures and quality regressions trigger incident management |
| EAAPL-OBS005 | Model Drift Detection | Sibling | Drift detection triggers retraining; benchmarking gates new model before promotion |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Adoption Breadth | 3 | Deployment-time evaluation is widespread; continuous benchmarking still minority practice |
| Tooling Ecosystem | 4 | HuggingFace evaluate, Promptfoo, Ragas, Evals frameworks are mature; LLM judge scoring maturing |
| Operational Runbook Coverage | 4 | CI/CD quality gates well-understood in ML engineering; AI-specific runbooks established |
| Regulatory Evidence | 4 | APRA CPG 234 and EU AI Act both explicitly require ongoing performance assessment |
| Cost Predictability | 3 | LLM-as-judge cost at scale can be high; requires careful dataset size management |
| Team Skill Availability | 4 | ML evaluation skills broadly available; statistical significance testing requires data science background |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-06-12 | EAAPL Working Group | Initial publication |