EAAPL-MDL005 — Multi-Model Ensemble

Attribute	Value
Pattern ID	EAAPL-MDL005
Name	Multi-Model Ensemble
Maturity	Proven
Complexity	High
Tags	`llm` `model-risk` `fairness` `high-complexity`
Last Reviewed	2026-06-12
Owner	Enterprise AI Architecture Practice

1. Executive Summary

Multi-model ensemble combines the outputs of two or more AI models to produce decisions or responses that are more accurate, more robust, or more defensible than any individual model could provide alone. The pattern spans strategies from simple majority voting for classification tasks through weighted scoring for risk decisions to sophisticated mixture-of-experts routing where a meta-model directs each query to the most capable specialist model. Ensembles are justified when the cost of error is high (credit decisions, medical triage, fraud detection) and when no single model can achieve the required accuracy across the full input distribution. For CIOs, ensemble is an investment decision: 2–4× inference cost must be weighed against the business value of higher accuracy and the risk reduction of lower error rates. For CTOs, ensemble introduces architectural complexity — model agreement protocols, versioning of ensemble configurations, and handling of constituent model failures — that must be designed explicitly. For risk officers, ensemble raises a specific governance question: which model is "the decision-maker" for regulatory and accountability purposes. This pattern answers that question: the ensemble output is the decision; all constituent models are registered and their contributions to any specific decision are logged.

2. Problem Statement

2.1 Business Problem

High-stakes decisions — credit approvals, fraud flags, medical triage suggestions, content moderation verdicts — cannot tolerate the individual model error rates that are acceptable in lower-stakes contexts. A single LLM or classifier, however well-trained, has systematic blind spots: distribution gaps in training data, sensitivity to input phrasing variations, and degradation under adversarial conditions. Business leaders need a mechanism to achieve decision quality that is higher than any individual model, with confidence intervals they can defend to a board or regulator.

2.2 Technical Problem

Individual models exhibit correlated failures: they tend to fail on the same types of inputs because they were trained on similar data distributions. A naive ensemble of correlated models provides limited improvement. The technical challenge is constructing an ensemble from models that fail on complementary subsets of the input space, implementing the combination strategy correctly for the task type, and managing the increased infrastructure complexity of running multiple models per inference.

2.3 Symptoms

High-stakes models have error rates that are unacceptable for the business function but cannot be reduced by further training on available data.
Model outputs vary significantly with minor input rephrasing, indicating low robustness.
Adversarial testing reveals systematic failure modes not addressed by standard retraining.
The organisation has heterogeneous user populations where no single model performs well across all subgroups.

2.4 Cost of Inaction

Category	Indicative Impact
Decision Quality	Individual model error rates persist; high-stakes errors accumulate with business and customer impact
Robustness	Adversarial or distribution-shifted inputs cause systematic failures not caught by individual model monitoring
Fairness	Individual model subgroup performance gaps persist; ensemble can selectively address these
Regulatory	Inability to demonstrate reliability assurance for high-risk AI (EU AI Act Article 9/15)

3. Context

3.1 When to Apply

High-stakes classification or scoring tasks where the cost of individual errors is material.
Tasks where robustness against adversarial inputs is required.
Situations where the input distribution is heterogeneous and no single model excels across all subgroups.
Regulated decisions requiring demonstrable accuracy assurance.
Inference budget allows 2–4× per-query cost increase.

3.2 When NOT to Apply

Real-time latency-sensitive applications where ensemble adds unacceptable latency (for example, p99 latency requirements below 100 ms).
Tasks with limited training data where multiple diverse models cannot be built from the available corpus.
Cost-sensitive high-volume applications where the 2–4× cost multiplier is not justified by quality gains.
Tasks where the ensemble combination logic itself becomes a new failure mode more dangerous than a single model's errors.

3.3 Prerequisites

Prerequisite	Detail
Multiple independent models	At least 2 models with different architectures, training data, or inductive biases
Model versioning (EAAPL-MDL001)	All constituent models must be individually versioned and registered
Ensemble configuration versioning	The combination strategy and weights must themselves be versioned as an artefact
Agreement measurement infrastructure	Tooling to measure and log model agreement/disagreement on every inference
Compute budget approval	2–4× inference cost requires budget justification and approval

3.4 Industry Applicability

Industry	Applicability	Primary Driver
Financial Services	High	Credit scoring, fraud detection — decision quality and explainability
Healthcare	Critical	Diagnostic support — robustness against clinical edge cases
Insurance	High	Underwriting decisions — accuracy across heterogeneous risk profiles
Legal / Compliance	High	Contract analysis, compliance checks — consistency and accuracy
Government	High	Benefit eligibility, regulatory decisions — fairness and defensibility
Media / Content	Medium	Content moderation — handling ambiguous content with higher confidence

4. Architecture Overview

4.1 Ensemble Strategies

Majority Voting (Classification Tasks): Each constituent model produces a class prediction. The ensemble output is the class predicted by the majority of models. An odd number of models (typically 3 or 5) avoids ties for binary classification; with more than two classes, a no-majority outcome is still possible. Ties and no-majority outcomes are resolved by a designated tiebreaker model (usually the highest-accuracy model) or escalated to human review. Majority voting is model-agnostic and requires no training of an aggregation layer.
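The voting rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `ESCALATE` sentinel, the function name, and the choice to treat a plurality without a strict majority the same way as a tie are all assumptions made for the example.

```python
from collections import Counter
from typing import Optional, Sequence

# Hypothetical sentinel meaning "send this inference to human review".
ESCALATE = "ESCALATE_TO_HUMAN"


def majority_vote(predictions: Sequence[str],
                  tiebreaker_index: Optional[int] = None) -> str:
    """Return the class chosen by a strict majority of constituent models.

    When no class holds a strict majority, defer to the designated
    tiebreaker model if one is configured; otherwise escalate.
    """
    if not predictions:
        raise ValueError("at least one constituent prediction is required")
    top_class, top_votes = Counter(predictions).most_common(1)[0]
    if top_votes * 2 > len(predictions):  # strict majority
        return top_class
    if tiebreaker_index is not None:
        return predictions[tiebreaker_index]
    return ESCALATE
```

With three models and three distinct predictions the function escalates unless a tiebreaker index is supplied, which is the case an odd model count alone does not cover.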

Weighted Average (Scoring Tasks): Each constituent model produces a numeric score. The ensemble output is a weighted average of scores, where weights reflect each model's historical accuracy on the task. Weights are learned from a validation dataset and stored in the ensemble configuration artefact. Weights are updated when constituent model versions change. This strategy is appropriate for fraud scores, credit scores, and risk ratings.
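A weighted average of this kind might look like the following sketch. The model names and weights in the usage note are invented for illustration, and renormalising the weights over whichever models actually returned a score is an assumption of the example rather than something the pattern prescribes.

```python
from typing import Mapping


def weighted_score(scores: Mapping[str, float],
                   weights: Mapping[str, float]) -> float:
    """Weighted average of constituent scores.

    Weights are renormalised over the models present in `scores`, so the
    result stays on the same scale if a constituent is missing.
    """
    missing = set(scores) - set(weights)
    if missing:
        raise KeyError(f"no weight configured for: {sorted(missing)}")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights of the responding models must sum to > 0")
    return sum(scores[name] * weights[name] for name in scores) / total_weight
```

For example, weights of 0.5, 0.3 and 0.2 applied to scores of 0.8, 0.6 and 0.1 give 0.4 + 0.18 + 0.02 = 0.60.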

Stacking / Meta-Learning: A separate meta-model is trained to optimally combine the outputs of base models. The meta-model takes as input the predictions of all base models (and optionally the input features) and produces the final output. Stacking can learn non-linear combination strategies that outperform fixed weights. It requires a held-out training dataset and introduces an additional model (the meta-model) to the governance and versioning process.
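As a rough illustration of stacking, the sketch below trains a small logistic-regression meta-model on the probabilities emitted by the base models. It uses only the standard library and full-batch gradient descent to stay self-contained; the class name, hyperparameters, and the choice of logistic regression are assumptions for the example. A production meta-model would normally come from an established ML library and be trained on a properly held-out dataset.

```python
import math
from typing import List, Sequence


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class StackingMetaModel:
    """Logistic-regression meta-model over base-model probabilities."""

    def __init__(self, n_base_models: int) -> None:
        self.weights: List[float] = [0.0] * n_base_models
        self.bias: float = 0.0

    def predict_proba(self, base_outputs: Sequence[float]) -> float:
        z = self.bias + sum(w * x for w, x in zip(self.weights, base_outputs))
        return _sigmoid(z)

    def fit(self, rows: Sequence[Sequence[float]], labels: Sequence[int],
            epochs: int = 2000, lr: float = 1.0) -> None:
        """Full-batch gradient descent on the mean log-loss."""
        n = len(rows)
        for _ in range(epochs):
            grad_w = [0.0] * len(self.weights)
            grad_b = 0.0
            for x, y in zip(rows, labels):
                err = self.predict_proba(x) - y  # d(loss)/d(logit)
                grad_b += err
                for j, xj in enumerate(x):
                    grad_w[j] += err * xj
            self.bias -= lr * grad_b / n
            self.weights = [w - lr * g / n
                            for w, g in zip(self.weights, grad_w)]
```

Each training row holds one probability per base model, so the learned weights show how much the meta-model relies on each constituent. Because the meta-model has its own parameters, it is versioned and governed like any other model.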

Mixture of Experts (Router-Based): A router model classifies each incoming query into a type and directs it to the specialist model with highest expected performance on that type. Different specialist models may be trained on different subsets of the input space. The router itself is a model that must be versioned, evaluated, and monitored. Mixture of Experts reduces per-query cost compared to running all models on all queries.
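The routing step can be sketched as a lookup from query type to specialist, with a fallback for types the router does not recognise. The keyword-based `keyword_router` below is a toy stand-in for a real router model, and the query types and function names are hypothetical.

```python
from typing import Callable, Dict, Tuple

Model = Callable[[str], str]


def keyword_router(query: str) -> str:
    """Toy stand-in for the router model: label a query by keyword."""
    text = query.lower()
    if "invoice" in text or "payment" in text:
        return "billing"
    if "contract" in text or "clause" in text:
        return "legal"
    return "unknown"


def route_and_infer(query: str,
                    router: Callable[[str], str],
                    specialists: Dict[str, Model],
                    fallback: Model) -> Tuple[str, str]:
    """Send the query to the specialist for its type, else to the fallback.

    Returns the routed type alongside the output so both can be logged.
    """
    query_type = router(query)
    model = specialists.get(query_type, fallback)
    return query_type, model(query)
```

The fallback could be a general-purpose model or a call that runs every constituent, depending on the cost budget.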

4.2 Model Agreement Measurement

For every inference, the ensemble layer records the agreement level among constituent models. Agreement is defined per strategy: for classification, it is the fraction of models agreeing on the winner class; for scoring, it is the coefficient of variation of scores. Low agreement (< 60% for voting, high CoV for scoring) triggers: (1) for automated decisions below a confidence threshold — route to human review; (2) for high-stakes automated decisions — the ensemble output is flagged in the audit log for retrospective review.
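Both agreement measures are straightforward to compute. The sketch below assumes the population standard deviation for the coefficient of variation and uses the 60% voting threshold as a default; the function names are illustrative, and the scoring threshold is left to the caller because no numeric value is specified here.

```python
import math
from collections import Counter
from statistics import mean, pstdev
from typing import Sequence


def vote_agreement(predictions: Sequence[str]) -> float:
    """Fraction of constituent models that agree with the winning class."""
    top_votes = Counter(predictions).most_common(1)[0][1]
    return top_votes / len(predictions)


def score_cov(scores: Sequence[float]) -> float:
    """Coefficient of variation (population std / |mean|) of the scores."""
    mu = mean(scores)
    if mu == 0:
        return math.inf  # undefined; treat as maximal disagreement
    return pstdev(scores) / abs(mu)


def needs_review(predictions: Sequence[str],
                 min_agreement: float = 0.60) -> bool:
    """True when voting agreement falls below the configured threshold."""
    return vote_agreement(predictions) < min_agreement
```

With three models, a 2-of-3 vote gives agreement of about 0.67 and passes the 60% threshold, while a three-way split gives about 0.33 and is routed to review.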

Agreement measurement is a governance tool as well as a quality tool: if models consistently disagree on a particular input subtype, this signals a data distribution gap that should drive retraining.

4.3 Cost-Quality Tradeoff

Running 3 models per inference costs 3× the base inference cost plus the aggregation overhead. This is justified when: (a) the quality improvement measurably reduces costly downstream errors (e.g., each avoided false positive in fraud detection saves $X); (b) the regulatory requirement for accuracy assurance cannot be met by any single model; (c) the robustness requirement against adversarial inputs requires diversity. For mixture-of-experts, the cost premium is lower (typically 1.3–1.5×) because only one specialist model processes the full inference per query.
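Criterion (a) reduces to simple arithmetic: the value of avoided errors minus the extra inference spend. The function below is a sketch of that calculation, and every figure in the usage note is a made-up placeholder rather than a benchmark.

```python
def ensemble_net_benefit(requests: int,
                         base_cost_per_request: float,
                         cost_multiple: float,
                         single_error_rate: float,
                         ensemble_error_rate: float,
                         cost_per_error: float) -> float:
    """Value of avoided errors minus the extra inference cost.

    A positive result means the ensemble pays for itself over the period
    covered by `requests`; aggregation overhead is ignored for simplicity.
    """
    extra_inference_cost = requests * base_cost_per_request * (cost_multiple - 1.0)
    avoided_errors = requests * (single_error_rate - ensemble_error_rate)
    return avoided_errors * cost_per_error - extra_inference_cost
```

As a hypothetical example: 1,000,000 requests at $0.001 each, a 3× ensemble, an error rate falling from 2.0% to 1.5%, and $5 per error give 5,000 avoided errors worth $25,000 against $2,000 of extra inference, a net benefit of $23,000.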

4.4 Governance: Which Model Is the Decision-Maker

For regulatory and accountability purposes, the ensemble is the decision-maker. The ensemble output is the recorded decision. The ensemble configuration (model identifiers, versions, combination weights or router) is versioned per EAAPL-MDL001 and registered as an artefact in the Model Register. Every inference logs: the ensemble version, each constituent model version, each constituent model's output, the combination result, and the agreement score. This log is the regulatory evidence that a specific decision was made by a specific, versioned, approved ensemble — not by any individual constituent model in isolation.
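One possible shape for the per-inference record is sketched below with frozen dataclasses. The field names and the JSON serialisation are assumptions for the example; an actual schema would follow the organisation's audit log standard.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ConstituentResult:
    model_id: str
    model_version: str
    output: str


@dataclass(frozen=True)
class EnsembleAuditRecord:
    """One record per inference, covering the fields listed above."""
    request_id: str
    ensemble_version: str
    constituents: Tuple[ConstituentResult, ...]
    combined_output: str
    agreement_score: float

    def to_json(self) -> str:
        # sort_keys gives a stable serialisation, useful for hashing later.
        return json.dumps(asdict(self), sort_keys=True)
```

Freezing the dataclasses prevents fields from being reassigned after the record is created, which fits the intent of an immutable audit entry.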

4.5 Ensemble Configuration Versioning

The ensemble is itself a versioned artefact: MAJOR change when the combination strategy changes (e.g., voting → stacking); MINOR when constituent model versions are updated; PATCH when combination weights are recalibrated. Each ensemble version must be approved through the same approval workflow as individual models. A model card for the ensemble is required, noting the constituent models, combination strategy, evaluation results for the ensemble (not just individual models), and fairness analysis.
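The versioning rule maps directly onto a small helper. The change-type labels (`strategy`, `constituent_version`, `weights`) are hypothetical names chosen for this sketch.

```python
def bump_ensemble_version(version: str, change: str) -> str:
    """Apply the MAJOR/MINOR/PATCH rule to an ensemble version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "strategy":             # e.g. voting -> stacking
        return f"{major + 1}.0.0"
    if change == "constituent_version":  # a constituent model was updated
        return f"{major}.{minor + 1}.0"
    if change == "weights":              # combination weights recalibrated
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

Starting from 1.4.2, a strategy change yields 2.0.0, a constituent update yields 1.5.0, and a weight recalibration yields 1.4.3.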

5. Architecture Diagram

flowchart TD
    subgraph Input["Request Routing"]
        A[Inference Request]
        B{Routing Strategy}
    end
    subgraph Models["Constituent Models"]
        C[Model A]
        D[Model B]
        E[Model C]
    end
    subgraph Aggregation["Ensemble Aggregation"]
        F[Aggregation Layer]
        G{Agreement Check}
    end
    A --> B
    B -->|all models| C
    B -->|all models| D
    B -->|all models| E
    B -->|expert route| C
    C --> F
    D --> F
    E --> F
    F --> G
    G -->|high agreement| H[Ensemble Output]
    G -->|low agreement| I[Human Review Queue]
    F --> J[(Inference Audit Log)]
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f3e8ff,stroke:#a855f7
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f3e8ff,stroke:#a855f7
    style H fill:#d1fae5,stroke:#10b981
    style I fill:#fee2e2,stroke:#ef4444
    style J fill:#fef9c3,stroke:#eab308

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Constituent Models	Inference	Individual models producing outputs for aggregation; must be independently versioned	Any framework: PyTorch, TensorFlow, JAX, proprietary API	Critical
Expert Router	Inference	Classifies query type; routes to specialist model (MoE strategy only)	Lightweight classifier; LLM with routing prompt	High
Aggregation Layer	Service	Combines constituent outputs per strategy (voting, weighted, meta-model)	Custom Python service; FastAPI microservice	Critical
Agreement Monitor	Observability	Computes and logs agreement score; triggers human review on low agreement	Custom metric in aggregation layer; Prometheus counter	High
Human Review Queue	Integration	Receives low-agreement inferences for human expert review	SQS + Lambda, custom task queue, Jira Service Management	High
Ensemble Audit Log	Data Store	Records per-inference ensemble version, constituent outputs, agreement	DynamoDB, BigQuery, Elasticsearch	Critical
Ensemble Config Store	Platform Service	Versioned storage for ensemble configuration (weights, strategy, constituent versions)	Model Register (EAAPL-MDL001), Git, S3 versioned object	Critical

7. Data Flow

7.1 Primary Flow

Step	Actor	Action	Output
1	Calling System	Sends inference request to ensemble endpoint	Request received; ensemble version loaded from config store
2	Routing Layer	Applies routing strategy (all-models or expert-router)	Request dispatched to appropriate constituent model(s)
3	Constituent Models	Each model processes request independently	N independent outputs (class/score/text)
4	Aggregation Layer	Combines outputs per strategy; computes agreement score	Ensemble output + agreement score
5	Agreement Monitor	Evaluates agreement score against threshold	High: proceed to output; Low: route to human review
6	Audit Logger	Records ensemble version, constituent versions, all outputs, agreement score	Immutable audit log entry
7	Calling System	Receives ensemble output (and agreement score if consumer needs it)	Final decision/response delivered

7.2 Error Flow

Error Scenario	Detection	Recovery Action
One constituent model fails	Health check + error counter	Reduce ensemble to remaining models; alert; flag reduced ensemble in log
All constituent models fail	Total error rate spike	Return error to caller; trigger model rollback (EAAPL-MDL004)
Expert router misclassifies query type	Agreement analysis reveals systematic error	Route misclassified query type to all models as fallback; retrain router
Meta-model (stacking) inference failure	Error counter on meta-model endpoint	Fall back to simple majority vote; alert; flag in audit log
Human review queue backlog	Queue depth monitor	Increase human reviewer capacity; alert if queue exceeds SLA threshold

8. Security Considerations

8.1 Controls Summary

Domain	Control
Authentication	Each constituent model endpoint requires authenticated calls from aggregation layer service account
Authorisation	Aggregation layer service account scoped to inference-only; cannot modify model versions or config
Secrets	API keys for each constituent model (including third-party APIs) in secrets manager; scoped per model
Classification	Inference audit log classified at same level as the decision data (often CONFIDENTIAL)
Encryption	Inter-component communication via mTLS; audit log encrypted at rest
Auditability	Per-inference audit log is the primary accountability mechanism; must be tamper-evident
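One common way to make a log tamper-evident is a hash chain, in which each entry's hash covers the previous entry's hash. The sketch below illustrates that technique with an in-memory list; it is an assumed approach for the example, not a mechanism this pattern mandates, and a managed ledger or write-once storage could serve the same purpose.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder hash preceding the first entry


def _entry_hash(prev_hash: str, payload: Dict) -> str:
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: List[Dict], payload: Dict) -> Dict:
    """Append a payload, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    entry = {"prev_hash": prev_hash, "payload": payload,
             "hash": _entry_hash(prev_hash, payload)}
    chain.append(entry)
    return entry


def verify_chain(chain: List[Dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Altering any stored payload after the fact changes its recomputed hash, so `verify_chain` returns False for the modified log.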

8.2 OWASP LLM Top 10 Relevance

OWASP LLM Risk	Relevance	Mitigation
LLM01 Prompt Injection	High	Adversarial inputs that cause one model to fail may not cause all models to fail — ensemble diversity is a defence
LLM02 Insecure Output Handling	Medium	Aggregation layer must sanitise combined output; constituent model outputs must not be directly concatenated without sanitisation
LLM03 Training Data Poisoning	High	Ensemble resilience depends on constituent models being trained on different data; correlated poisoning attacks can affect all models
LLM04 Model Denial of Service	High	Running N models per inference amplifies DoS impact; circuit breaker per constituent model required
LLM05 Supply Chain Vulnerabilities	High	Each constituent model has its own supply chain; all must be independently verified per EAAPL-MDL001
LLM06 Sensitive Information Disclosure	Medium	Audit log contains all constituent outputs; access controls on log must match data classification
LLM07 Insecure Plugin Design	Medium	If constituent models use tools, all tool integrations must be individually secured
LLM08 Excessive Agency	Medium	Low-agreement threshold routing to human review is the primary excessive agency control
LLM09 Overreliance	High	Ensemble confidence score must be communicated to consumers; high agreement ≠ correctness
LLM10 Model Theft	Medium	Logged input/output pairs at scale can enable model extraction attacks; access controls on the audit log are essential

9. Governance Considerations

9.1 Responsible AI

Ensemble fairness analysis must be performed on the ensemble output, not just on individual constituent models. A constituent model with a demographic bias may be overridden by the ensemble — or may dominate the ensemble for that subgroup. Fairness evaluation uses disaggregated analysis on the ensemble output across relevant demographic subgroups.
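Disaggregated analysis can be as simple as computing the chosen quality metric per subgroup on the ensemble output and comparing the results. The sketch below uses accuracy and a max-minus-min gap; both the metric and the record layout are assumptions for the example, and a real fairness review would typically look at several metrics.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record: (subgroup, ensemble_prediction, true_label)
Record = Tuple[str, str, str]


def accuracy_by_subgroup(records: Iterable[Record]) -> Dict[str, float]:
    """Accuracy of the ensemble output, computed separately per subgroup."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for subgroup, prediction, label in records:
        total[subgroup] += 1
        correct[subgroup] += int(prediction == label)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group: Dict[str, float]) -> float:
    """Largest accuracy difference between any two subgroups."""
    return max(per_group.values()) - min(per_group.values())
```

Running the same functions on each constituent's predictions shows whether the ensemble narrows or widens a gap present in an individual model.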

9.2 Model Risk Management

The ensemble configuration is a model for MRM purposes. It requires its own validation, model card, and approval. A change to any constituent model version that triggers a MINOR version change in that constituent also triggers a MINOR version change in the ensemble — and requires re-validation of the ensemble on the full evaluation suite.

9.3 Human Approval Gates

Low-agreement threshold routing is the primary human-in-the-loop control. The threshold must be set based on the business cost of automated decisions vs the cost of human review. The threshold is part of the ensemble configuration and requires governance approval to change.

9.4 Governance Artefacts

Artefact	Owner	Frequency	Location
Ensemble Model Card	Model Owner	Per ensemble version	Model Register
Per-Inference Audit Log	Aggregation Layer	Continuous	Immutable data store
Agreement Distribution Report	Model Owner	Monthly	Model governance dashboard
Constituent Model Approval Record	AI Governance	Per constituent version	Model Register
Fairness Analysis Report	AI Governance	Quarterly	Governance artefact repository

10. Operational Considerations

10.1 SLOs

SLO	Target	Measurement Method
Ensemble inference latency p99	Defined per use case (typically 2–5× single model p99)	End-to-end timing from request to aggregated output
Agreement measurement completeness	100%	Audit log completeness check
Human review queue SLA	Defined per use case (e.g., 4 hours for high-stakes)	Queue age monitor
Constituent model availability	Each ≥ 99.5%	Per-model health check

10.2 Monitoring and Logging

Monitor: per-constituent model error rate and latency (independent health view); ensemble agreement distribution (alert if mean agreement drops — indicates constituent model drift); human review queue depth; audit log write success rate (100% required). Dashboard shows constituent model performance side-by-side plus ensemble output quality.

10.3 Incident Response

A constituent model failure reduces ensemble quality but does not necessarily cause a full service failure. The aggregation layer must detect constituent failures and operate in degraded mode (fewer constituent models) rather than failing completely. If the ensemble drops below a minimum constituent model count (usually N-1 out of N), it should fail closed rather than produce an unreliable output.
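The degraded-mode behaviour can be sketched as follows: collect whatever outputs are available, record which constituents failed, and fail closed when too few remain. The exception type, function name, and sequential calls are simplifications for the example; a real aggregation layer would usually call constituents concurrently with timeouts.

```python
from typing import Any, Callable, Dict, List, Tuple


class EnsembleUnavailable(RuntimeError):
    """Raised when too few constituent models responded (fail closed)."""


def collect_outputs(models: Dict[str, Callable[[Any], Any]],
                    request: Any,
                    min_models: int) -> Tuple[Dict[str, Any], List[str]]:
    """Call every constituent; tolerate failures down to `min_models`.

    Returns the successful outputs and the names of failed constituents so
    the reduced ensemble can be flagged in the audit log.
    """
    outputs: Dict[str, Any] = {}
    failed: List[str] = []
    for name, model in models.items():
        try:
            outputs[name] = model(request)
        except Exception:  # any constituent error counts as a failure
            failed.append(name)
    if len(outputs) < min_models:
        raise EnsembleUnavailable(
            f"only {len(outputs)} of {len(models)} constituents responded; "
            f"minimum is {min_models}")
    return outputs, failed
```

With three models and `min_models=2`, one failure yields a degraded but usable result, while two failures raise `EnsembleUnavailable`.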

10.4 Disaster Recovery

Scenario	RPO	RTO	Recovery Procedure
One constituent model unavailable	N/A	< 5 min	Aggregation layer switches to N-1 mode; alert; restore constituent
Aggregation layer unavailable	N/A	< 10 min	Fall back to highest-accuracy single model; degrade gracefully
Ensemble config store unavailable	N/A	< 15 min	Aggregation layer uses cached config; alert; restore config store

10.5 Capacity Planning

Total compute = sum of constituent model compute × parallelism factor. For 3 models: plan 3× inference compute. For MoE: plan 1.3× (router + average 1 specialist). Horizontal scaling applies to each constituent model independently — burst capacity must be provisioned per model.

11. Cost Considerations

11.1 Cost Drivers

Driver	Description	Relative Impact
Constituent model inference	N models per inference = N× base inference cost	Very High
Aggregation layer compute	Low-latency aggregation service; relatively cheap	Low
Audit log storage	Per-inference log with N constituent outputs; scales with request volume	Medium
Human review cost	Fully-loaded cost of human reviewers handling low-agreement queue	Medium-High
Meta-model training (stacking)	One-time training cost for stacking meta-model; periodic retraining	Medium

11.2 Scaling Risks

Cost scales linearly with traffic and with N (number of constituent models). For high-volume services, an ensemble of 3 large LLMs may be prohibitively expensive. The mixture-of-experts strategy reduces this scaling risk by routing each query to only one specialist model.

11.3 Optimisations

Use MoE routing to avoid running all constituent models on all queries.
Use smaller, specialised constituent models rather than N copies of a large general model.
Cache ensemble outputs for repeated queries (with appropriate TTL).
Run lower-cost constituent models in parallel; gate expensive models on initial disagreement.
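The last optimisation, gating an expensive model on disagreement, can be sketched as a two-stage cascade. In this illustration the cheap models are called one after another rather than in parallel, and the expensive model's output is taken as final when the cheap models disagree; both are simplifying assumptions.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]


def cascaded_predict(request: str,
                     cheap_models: Sequence[Model],
                     expensive_model: Model) -> Tuple[str, bool]:
    """Use the cheap models alone when they agree unanimously.

    Returns (output, escalated), where `escalated` records whether the
    expensive model was invoked, so the cost saving can be measured.
    """
    cheap_outputs = [model(request) for model in cheap_models]
    if len(set(cheap_outputs)) == 1:
        return cheap_outputs[0], False
    return expensive_model(request), True
```

The fraction of requests with `escalated` set to True determines the effective cost multiple of the cascade.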

11.4 Indicative Cost Range

Ensemble Type	Cost Multiple vs Single Model	Monthly Range (medium volume: 1M req/day)
3-model majority voting	3×	$30,000–$150,000
3-model weighted average	3×	$30,000–$150,000
Mixture of Experts (3 specialists)	1.3–1.5×	$13,000–$75,000
2-model agreement with human fallback	2.1×	$21,000–$105,000

12. Trade-Off Analysis

12.1 Ensemble Strategy Comparison

Strategy	Quality Gain	Latency Impact	Cost Multiple	Complexity	Regulatory Clarity	Best For
Majority Voting	Moderate	Parallel = same	3×	Low	High	Classification tasks; balanced inputs
Weighted Average	High	Parallel = same	3×	Medium	High	Scoring tasks; known model strengths
Stacking	Very High	+meta-model	3×+	High	Medium	High-accuracy requirement; complex inputs
Mixture of Experts	High	+router	1.3–1.5×	Very High	Medium	Cost-sensitive; well-segmented input types

12.2 Architectural Tensions

Tension	Description	Resolution
Accuracy vs Cost	More constituent models = higher accuracy but higher cost	Define minimum quality threshold; use minimum N models that achieves it
Independence vs Practicality	Ensemble benefit is maximised by independent constituent models; finding truly independent models is hard	Use different architectures, different training data, or different providers
Human Review vs Throughput	Low-agreement routing to human review protects quality but creates throughput bottleneck	Tier by business impact: only route high-stakes low-agreement cases to human; automate rest

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Correlated constituent model failures	Medium	High	Per-model error rates against ground truth rise together; agreement may stay high because the models fail on the same inputs	Investigate root cause; this signals a data distribution shift
Expert router systematic misclassification	Medium	High	Agreement analysis on router-directed queries	Retrain router; use all-models fallback during retraining
Meta-model overfits to constituent patterns	Medium	Medium	Ensemble quality regression on new input types	Retrain meta-model with more diverse validation data
Human review queue SLA breach	Medium	Medium	Queue depth monitor	Add reviewer capacity; escalate to automated fallback for excess
Constituent model version drift (one updated without ensemble re-evaluation)	Low	High	Quality regression in ensemble output	Mandatory ensemble re-evaluation on any constituent version change

13.1 Cascading Failure Scenarios

If all constituent models are served by the same inference infrastructure provider and that provider has a partial outage, all constituent models may fail simultaneously — despite architectural diversity. Mitigation: use constituent models from at least two independent infrastructure providers for high-criticality ensembles.

14. Regulatory Considerations

Regulation / Framework	Relevant Clause	How This Pattern Addresses It
EU AI Act (2024/1689)	Article 9 (Risk Management System) and Article 15 (Accuracy, Robustness and Cybersecurity) for high-risk AI	Ensemble is an accuracy/robustness measure; all constituent models must be individually validated
EU AI Act (2024/1689)	Article 12 (Record-Keeping): automatic logging for traceability of high-risk AI decisions	Per-inference audit log with all constituent outputs and ensemble version supports the Article 12 logging obligation
ISO 42001:2023	Annex A.6 (AI system life cycle), control A.6.2.4 (verification and validation)	Ensemble must be evaluated as a complete system, not just constituent models
NIST AI RMF (2023)	MEASURE 2.5 (AI system performance measurement)	Agreement score and ensemble quality metrics are the primary performance measurements
APRA CPS 234 (2019)	Paragraph 15 (Information security policy)	Each constituent model is an additional attack surface; independence must be governed
Privacy Act 1988 (Cth)	APP 11 (Security) / APP 5 (Notification of collection)	Inference audit log containing user data must be secured; data collection purpose must cover ensemble processing

15. Reference Implementations

15.1 AWS

Constituent Models: SageMaker Endpoints (separate endpoint per model); or Bedrock model invocations (multi-model).
Aggregation Layer: AWS Lambda (for lightweight aggregation); ECS/EKS container (for stateful meta-model).
Expert Router: Lightweight SageMaker Endpoint or Lambda with classification model.
Audit Log: DynamoDB (per-inference record) + Kinesis Firehose to S3 for bulk analysis.
Human Review Queue: SQS + AWS A2I (Augmented AI) for human review workflow.

15.2 Azure

Constituent Models: Azure ML Managed Endpoints; or Azure OpenAI Service (multi-deployment).
Aggregation Layer: Azure Functions (lightweight); Azure Container Apps (stateful meta-model).
Expert Router: Azure ML Endpoint with routing classifier.
Audit Log: Azure Cosmos DB (per-inference); Event Hubs to Azure Data Lake for analysis.
Human Review Queue: Azure Service Bus + Azure Logic Apps for human review routing.

15.3 GCP

Constituent Models: Vertex AI Endpoints (separate deployment per model); or Vertex AI model garden.
Aggregation Layer: Cloud Functions (lightweight); Cloud Run (stateful).
Expert Router: Vertex AI Endpoint with routing model.
Audit Log: BigQuery (per-inference, columnar for analysis efficiency).
Human Review Queue: Cloud Tasks + Cloud Functions for human review workflow.

15.4 On-Premises / Hybrid

Constituent Models: Triton Inference Server (multi-model on GPU); vLLM for LLM serving.
Aggregation Layer: FastAPI microservice on Kubernetes; Seldon Core ensemble orchestration.
Expert Router: Lightweight FastText/XGBoost classifier as sidecar.
Audit Log: Elasticsearch or PostgreSQL (per-inference); Kafka for streaming audit events.
Human Review Queue: Celery task queue; custom review UI.

16. Related Patterns

Pattern ID	Pattern Name	Relationship Type	Description
EAAPL-MDL001	Model Versioning	Prerequisite	Each constituent model and the ensemble configuration must be individually versioned
EAAPL-MDL002	Shadow Model Deployment	Related	Shadow testing should validate ensemble configuration vs single-model baseline
EAAPL-MDL006	Fine-Tuning Pipeline	Related	Constituent models may be fine-tuned specialisations; fine-tuning pipeline produces them
EAAPL-MDL008	Model Access Governance	Dependency	Each constituent model requires its own access governance; ensemble access is additive

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Rationale
Industry Adoption	4	Ensemble methods are standard in ML; LLM ensembles are emerging
Tooling Availability	3	Constituent model serving is mature; ensemble orchestration requires custom build
Standards Alignment	3	EU AI Act and ISO 42001 support ensemble approaches; specific guidance limited
Implementation Complexity	4 (high)	Managing N models, agreement logic, human review, and versioning is complex
Regulatory Acceptance	3	Accepted as an accuracy enhancement; regulatory clarity on accountability still developing

18. Revision History

Version	Date	Author	Summary of Changes
1.0	2026-06-12	Enterprise AI Architecture Practice	Initial publication
