EAAPLEnterprise AI Architecture Pattern Library
EAAPLLibraryModel Management
Proven
⇄ Compare

EAAPL-MDL006 — Fine-Tuning Pipeline

EAAPL-MDL006 — Fine-Tuning Pipeline

Attribute Value
Pattern ID EAAPL-MDL006
Name Fine-Tuning Pipeline
Maturity Proven
Complexity High
Tags llm privacy-act model-risk high-complexity
Last Reviewed 2026-06-12
Owner Enterprise AI Architecture Practice

1. Executive Summary

Fine-tuning is the process of adapting a pre-trained foundation model to an organisation's specific domain, language, and task requirements by continuing its training on curated enterprise data. The result is a model that combines the general capabilities of the foundation model with specialised performance on the organisation's particular use cases. This pattern defines the full production pipeline: from training data preparation (deduplication, PII removal, quality filtering, format conversion) through compute infrastructure provisioning, fine-tuning execution, monitoring, and evaluation before promotion. For CIOs, fine-tuning is a strategic investment — it produces a proprietary model asset that can outperform generic APIs on enterprise tasks at lower per-query cost. For CTOs, the pipeline architecture must be production-grade: reproducible, cost-controlled, privacy-compliant, and integrated with the model versioning and governance infrastructure (EAAPL-MDL001). For risk officers, fine-tuning on enterprise data creates a specific privacy obligation: consent, purpose limitation, and PII removal must be demonstrably satisfied before any data enters the training pipeline. Safety testing post-fine-tune is mandatory — fine-tuning can amplify biases or degrade alignment properties of the base model.


2. Problem Statement

2.1 Business Problem

Generic foundation models perform adequately on general tasks but underperform on domain-specific enterprise tasks: legal contract analysis, financial report interpretation, medical documentation, customer service in domain-specific language. The performance gap is often 15–40% on task-specific benchmarks. Organisations pay API costs for suboptimal performance. Fine-tuning closes this gap, producing a model asset that is both more accurate on enterprise tasks and more cost-efficient at scale.

2.2 Technical Problem

Fine-tuning at enterprise scale requires solving a set of non-trivial engineering problems simultaneously: sourcing and cleaning training data at sufficient quality and volume; provisioning and managing GPU compute infrastructure; executing distributed training efficiently; preventing overfitting on a narrow enterprise dataset; maintaining safety and alignment properties of the base model; and integrating the resulting model into the organisation's versioning, evaluation, and deployment infrastructure.

2.3 Symptoms

  • Generic API responses consistently miss domain-specific terminology, concepts, or procedures.
  • API costs are high and will not decrease as usage scales.
  • The organisation has a proprietary knowledge corpus that is not reflected in any available model.
  • Post-deployment user feedback consistently identifies the same domain-specific failure patterns.

2.4 Cost of Inaction

Category Indicative Impact
Quality Persistent domain performance gap; users supplement AI with manual correction
Cost API costs scale linearly with usage; fine-tuned models reduce per-query cost 50–80%
Strategic Proprietary knowledge is not monetised through a model asset; competitive differentiation lost
Privacy Risk Using raw enterprise data with third-party APIs without a proper privacy review creates regulatory exposure

3. Context

3.1 When to Apply

  • The organisation has a domain-specific task where generic models underperform.
  • A proprietary knowledge corpus (legal documents, clinical notes, financial reports, engineering documentation) is available for training.
  • Usage volume justifies the fine-tuning investment (typically > 1M queries/month before ROI is clear).
  • The organisation has or can provision GPU compute infrastructure.

3.2 When NOT to Apply

  • The domain gap can be addressed by prompt engineering or retrieval-augmented generation at lower cost and complexity.
  • The available training corpus is too small (< 1,000 high-quality examples for task-specific fine-tuning; < 100,000 for meaningful domain adaptation).
  • The training data cannot be cleared of PII or confidential information.
  • The organisation lacks the ML engineering expertise to execute and maintain a fine-tuning pipeline.

3.3 Prerequisites

Prerequisite Detail
Training data corpus Minimum viable corpus size per task type; rights to use data for model training
Privacy review Formal privacy impact assessment completed; PII removal process defined
GPU compute infrastructure Cloud GPU cluster or on-premises GPU servers with distributed training capability
Model versioning (EAAPL-MDL001) Fine-tuned model will be registered; versioning infrastructure must exist
Base model licence Licence for the base model permits fine-tuning and derivative model deployment

3.4 Industry Applicability

Industry Applicability Primary Driver
Financial Services High Domain-specific financial language; regulatory document understanding
Healthcare High Clinical documentation; medical coding; drug information extraction
Legal High Contract analysis; case law interpretation; compliance checking
Government High Policy document analysis; public service queries in specialised domain
Manufacturing Medium Technical documentation; maintenance manual interpretation
Retail / E-commerce Medium Product catalogue understanding; customer service domain adaptation

4. Architecture Overview

4.1 Training Data Preparation Pipeline

The data preparation pipeline is the most privacy-sensitive stage of fine-tuning. It executes sequentially:

Data Collection: Data is ingested from authorised enterprise sources — document management systems, ticketing systems, knowledge bases, transaction logs. A data collection manifest records: source system, collection date, volume, and legal basis for use (consent record reference, legitimate interest assessment reference, or data processing agreement reference).

Deduplication: Near-duplicate documents are identified using MinHash LSH or SimHash and removed. Exact duplicates are removed using content hash. Deduplication reduces memorisation risk (the model memorising and reproducing training examples verbatim) and improves training efficiency.

Quality Filtering: Documents are scored for quality using heuristics (length, language detection, vocabulary richness) and a lightweight quality classifier. Below-threshold documents are excluded. The quality threshold is tuned to balance data volume against quality for the specific task.

PII Removal: All training documents pass through a PII detection and removal stage using a named entity recognition model plus rule-based patterns. PII categories detected and masked: names, email addresses, phone numbers, government identifiers (TFN, ABN, Medicare), financial account numbers, dates of birth, physical addresses. Masking replaces PII with category tokens (e.g., [PERSON], [EMAIL]). PII removal is imperfect — a residual PII scan is run after removal and any remaining high-confidence PII detections are reported. The residual rate is logged and reviewed by the privacy function. An acceptable residual rate threshold (e.g., < 0.1% of documents containing residual PII) is defined in the privacy impact assessment.

Format Conversion: Documents are converted to the fine-tuning format for the target model (instruction-response pairs, completion format, or chat format). Prompt templates are applied. The conversion code is versioned and included in the artefact bundle.

Dataset Split: The prepared dataset is split into train (80%), validation (10%), and test (10%) sets. The test set is held out completely — it is not used for any training decision, only for final evaluation. The split is stratified if the dataset has class labels.

4.2 Compute Infrastructure

Fine-tuning large models requires GPU clusters. Infrastructure choices depend on model size: models < 7B parameters can be fine-tuned on a single A100 GPU; 7B–70B parameter models require multi-GPU nodes (8× A100 or equivalent); 70B+ parameter models require multi-node distributed training.

Spot instance strategy: Fine-tuning jobs are restartable (checkpoints written every N steps). Using spot/preemptible instances reduces compute cost 60–80%. Checkpoint-based restart means a spot interruption costs at most N steps of recomputation. The pipeline must implement automatic job restart on spot termination.

Distributed training coordination: For multi-GPU or multi-node training, use PyTorch FSDP or DeepSpeed ZeRO for memory-efficient distribution. The distributed training configuration is versioned as part of the training run configuration.

4.3 Fine-Tuning Techniques

Full fine-tuning: All model weights are updated. Produces highest quality but requires the most GPU memory and compute. Appropriate when the domain gap is large and compute budget allows.

LoRA (Low-Rank Adaptation): A small number of low-rank weight matrices are trained while base model weights are frozen. Memory reduction of 4–8×; quality within 2–5% of full fine-tuning for most tasks. The LoRA adapters are the primary artefact (base model + LoRA adapter = fine-tuned model). Recommended for most enterprise fine-tuning scenarios.

QLoRA: LoRA on a quantised (4-bit) base model. Enables fine-tuning of very large models on consumer-grade GPU hardware. Quality penalty of approximately 1–3% vs LoRA on full-precision base. Use when compute budget is highly constrained.

Selection criteria: Start with QLoRA for experimentation; graduate to LoRA for production; use full fine-tuning only when LoRA quality ceiling is insufficient for the business requirement.

4.4 Training Monitoring

During training, monitor every N steps: training loss and validation loss (divergence indicates overfitting); gradient norms (exploding gradients indicate learning rate issue); learning rate schedule adherence; GPU utilisation; cost per step (extrapolated to total run cost). Alert if: validation loss increases for 3 consecutive checkpoints (overfitting signal), gradient norms exceed 10 (training instability), or extrapolated total cost exceeds budget by 20%.

4.5 Safety Testing Post-Fine-Tune

Fine-tuning can degrade the alignment properties of the base model. Post-fine-tune safety testing is mandatory and must pass before the model can be registered. Tests include: harmlessness evaluation (does the model generate harmful content in response to adversarial prompts — test against a standard adversarial prompt set); bias amplification check (compare demographic bias metrics of fine-tuned model vs base model on standard fairness benchmarks); alignment regression test (compare responses to alignment-relevant queries between base and fine-tuned model). Any regression on safety or alignment metrics relative to the base model is a blocking condition — the fine-tuned model cannot be promoted until the regression is resolved.

4.6 Privacy Review for Training Data

The privacy review is not a one-time gate — it is a continuous obligation. For each fine-tuning run, the privacy function must confirm: (1) all training data sources have valid consent records or legitimate interest assessments; (2) PII removal has been executed and the residual PII rate is within the approved threshold; (3) the purpose of the fine-tuning is within the scope of the consent or LIA under which data was collected (purpose limitation); (4) data processing agreements with any third-party data processors used in the pipeline are current. The privacy review record is included in the artefact bundle.


5. Architecture Diagram

ARCHITECTURE DIAGRAM
flowchart TD subgraph DataPrep["Data Preparation"] A[Enterprise Data Sources] B[PII Removal] C{Privacy Review} end subgraph Training["Training Pipeline"] D[GPU Cluster] E[Fine-Tuning Run] F[Training Monitor] end subgraph Evaluation["Evaluation and Registration"] G[Benchmark + Safety Tests] H[Model Register] end A -->|dedup + quality filter| B B --> C C -->|approved| D C -->|residual PII| A D --> E E --> F F -->|healthy| G F -->|instability| E G -->|pass| H G -->|fail| I[Reject + Root Cause] style A fill:#dbeafe,stroke:#3b82f6 style B fill:#f0fdf4,stroke:#22c55e style C fill:#f3e8ff,stroke:#a855f7 style D fill:#f0fdf4,stroke:#22c55e style E fill:#f0fdf4,stroke:#22c55e style F fill:#fef9c3,stroke:#eab308 style G fill:#f0fdf4,stroke:#22c55e style H fill:#d1fae5,stroke:#10b981 style I fill:#fee2e2,stroke:#ef4444

6. Components

Component Type Responsibility Technology Options Criticality
Data Ingestion Pipeline Stage Collects data from enterprise sources; records legal basis Apache Airflow, Prefect, custom ETL Critical
PII Detector/Remover Pipeline Stage Detects and masks PII using NER + rules Microsoft Presidio, spaCy NER, AWS Comprehend, custom Critical
Quality Filter Pipeline Stage Scores and filters low-quality training examples fastText language classifier; custom quality scorer High
Training Orchestrator Infrastructure Provisions compute; launches and monitors training jobs; handles restarts AWS SageMaker Training Jobs, Azure ML Compute, Vertex AI Training Critical
Fine-Tuning Framework Framework Executes the actual weight update computation Hugging Face PEFT (LoRA/QLoRA), DeepSpeed, Axolotl Critical
Evaluation Harness Platform Service Runs benchmark + safety tests post-training Eleuther LM Eval Harness, custom evaluation suite Critical
Artefact Bundler Pipeline Stage Packages weights, config, preprocessing code, eval results per EAAPL-MDL001 Custom CI/CD step High

7. Data Flow

7.1 Primary Flow

Step Actor Action Output
1 Data Engineer Initiates data collection from approved enterprise sources Raw dataset + data collection manifest
2 Deduplication Stage Removes near-duplicates and exact duplicates Deduplicated dataset
3 Quality Filter Scores documents; removes below-threshold examples Quality-filtered dataset
4 PII Remover Detects and masks PII; produces residual scan report PII-masked dataset + residual rate report
5 Privacy Function Reviews residual rate; confirms purpose limitation and consent coverage Privacy review sign-off (or rejection with required remediation)
6 Format Converter Transforms to fine-tuning format; applies prompt templates Training-ready dataset in model-specific format
7 Training Orchestrator Provisions GPU compute; launches training job; monitors; handles restarts Training run in progress
8 Fine-Tuning Framework Updates weights (LoRA/full) using training data; checkpoints periodically Model weights + training metrics + training run record
9 Evaluation Harness Runs benchmark + safety + bias evaluation on fine-tuned model Evaluation results JSON
10 Artefact Bundler Assembles bundle per EAAPL-MDL001; registers as new MINOR version Versioned model artefact in Model Register

7.2 Error Flow

Error Scenario Detection Recovery Action
PII residual rate exceeds threshold Residual scan report check Halt pipeline; remediate identified documents; re-run PII removal
Training instability (loss divergence) Training monitor gradient norm alert Stop training run; investigate learning rate/batch size; restart
Spot instance interruption Orchestrator interruption handler Automatic job restart from last checkpoint; log interruption event
Evaluation fails safety threshold Safety test result evaluation Reject version; mandatory root cause analysis; adjust fine-tuning approach
Compute cost exceeds budget Cost tracking in training monitor Alert; human decision to continue or stop; log budget overrun

8. Security Considerations

8.1 Controls Summary

Domain Control
Authentication Training pipeline service account with narrow scope; no access to production serving infrastructure
Authorisation Training data source access gated by data classification and privacy review sign-off
Secrets Any API keys used in data collection or training stored in secrets manager; never in training data
Classification Fine-tuned model classified based on sensitivity of training data — if trained on CONFIDENTIAL data, model is CONFIDENTIAL
Encryption Training data encrypted at rest throughout pipeline; GPU cluster communication encrypted via TLS; artefact encrypted at rest
Auditability Full data lineage recorded: source → dedup → quality filter → PII removal → training run → model version

8.2 OWASP LLM Top 10 Relevance

OWASP LLM Risk Relevance Mitigation
LLM01 Prompt Injection Medium Fine-tuned model may be more susceptible to domain-specific injection — post-fine-tune adversarial testing required
LLM02 Insecure Output Handling Medium Fine-tuned model may learn output patterns from training data; output validation at inference layer
LLM03 Training Data Poisoning Critical Enterprise data sources are attack surface; data collection manifest enables detection; quality filter catches anomalous data
LLM04 Model Denial of Service Low Training pipeline concern, not serving concern
LLM05 Supply Chain Vulnerabilities High Base model licence and provenance verified per EAAPL-MDL001; fine-tuning framework dependencies pinned and scanned
LLM06 Sensitive Information Disclosure Critical Training data PII removal is the primary control; residual rate monitoring; memorisation testing post-fine-tune
LLM07 Insecure Plugin Design Low Pipeline does not use plugins in training context
LLM08 Excessive Agency Low Training pipeline concern is data, not agency
LLM09 Overreliance Medium Domain-specific fine-tuned models may create overreliance; model card must document limitations clearly
LLM10 Model Theft High Fine-tuned model is a proprietary IP asset; artefact store access controls and export restrictions apply

9. Governance Considerations

9.1 Responsible AI

Fine-tuning on enterprise data can amplify biases present in that data. Fairness analysis is mandatory post-fine-tune: compare demographic performance metrics of fine-tuned model vs base model. If fine-tuning increases performance disparity for any protected subgroup, the training data must be audited and balanced before the version can be promoted.

9.2 Model Risk Management

A fine-tuned model is a MINOR version change from the base model (per EAAPL-MDL001). It requires full model validation including benchmark evaluation, safety testing, and fairness evaluation. The training data provenance is part of the model risk record — any future question about the model's behaviour can be investigated by examining the training data.

9.3 Human Approval Gates

Privacy review sign-off (privacy function) is required before training begins. Evaluation pass gates model registration. Final production promotion requires the standard approval workflow (EAAPL-MDL001).

9.4 Governance Artefacts

Artefact Owner Frequency Location
Data Collection Manifest Data Engineer Per training run Artefact bundle + data catalogue
Privacy Review Record Privacy Function Per training run Privacy register + artefact bundle
PII Residual Rate Report Data Engineer Per data preparation Artefact bundle
Training Run Record ML Engineer Per training run Model Register + MLflow
Safety Testing Report AI Governance Per fine-tuned version Model governance record

10. Operational Considerations

10.1 SLOs

SLO Target Measurement Method
Data preparation pipeline duration < 4 hours (for datasets < 10GB) Pipeline end-to-end timing
Training job restart on spot interruption < 5 minutes Orchestrator restart timing from interruption event
Evaluation pipeline duration < 2 hours Evaluation harness end-to-end timing
PII residual rate < 0.1% Residual scan report metric

10.2 Monitoring and Logging

Key metrics monitored during training: training loss per step, validation loss per checkpoint, gradient norm per step, GPU utilisation per node, cost per step (accumulated and extrapolated). Alerts configured for: validation loss divergence (3 consecutive checkpoints increasing), gradient norm > 10, spot interruption (informational — triggers automatic restart), and cost overrun.

10.3 Incident Response

Training pipeline incidents include: data pipeline failure (data loss or corruption), PII residual rate breach, training instability, and cost overrun. Each triggers a halt-and-investigate workflow. The privacy function is notified immediately for any PII-related incident. Training runs that fail mid-job do not produce registered model versions — partial artefacts are discarded.

10.4 Disaster Recovery

Scenario RPO RTO Recovery Procedure
Training data store unavailable Last snapshot 4 hours Restore from snapshot; rerun from last stage with data intact
GPU cluster unavailable Last checkpoint 2 hours Provision alternate cluster; restart from checkpoint
PII-contaminated data discovered post-training Immediate alert Manual Quarantine model version; investigate training data; re-run with clean data

10.5 Capacity Planning

GPU planning: estimate training time as (dataset tokens × 6 × model parameters) / (GPU FLOPS × GPU efficiency). For a 7B model fine-tuned on 1B tokens on 8× A100 80GB: approximately 8 hours. Add 20% for evaluation and overhead. For cost estimation: A100 spot price is approximately $1.5–2.5/hour; 8 GPUs × 10 hours = $120–$200 per training run. Budget multiple runs for hyperparameter tuning.


11. Cost Considerations

11.1 Cost Drivers

Driver Description Relative Impact
GPU compute for training Primary cost; scales with model size, dataset size, and number of training runs Very High
Data preparation compute CPU-heavy pipeline: dedup, quality filter, PII removal Medium
Training data storage Processed training dataset storage; artefact bundle storage Low
Engineering labour ML engineering time to build, maintain, and tune the pipeline High
Evaluation compute GPU time for post-training evaluation Medium

11.2 Scaling Risks

Training cost is highly sensitive to model size and dataset size. A 2× model size increase typically causes 4× training cost increase (quadratic relationship for attention). Budget for multiple training runs — first runs are often hyperparameter explorations that do not produce production models.

11.3 Optimisations

  • Start with QLoRA to minimise experimentation cost; graduate to LoRA once the training approach is validated.
  • Use spot/preemptible instances for all training runs (checkpoint restarts make this safe).
  • Implement curriculum learning: start with easier examples to accelerate early training convergence.
  • Cache the base model in the training cluster warm storage — loading from cold storage adds 30–60 minutes per run for large models.

11.4 Indicative Cost Range

Model Size Technique Dataset Size Approximate Training Cost Runs to Production
7B parameters QLoRA 100M tokens $50–$200 3–5 runs
13B parameters LoRA 1B tokens $500–$2,000 3–5 runs
70B parameters LoRA 5B tokens $5,000–$20,000 2–3 runs
70B+ parameters Full FT 10B tokens $50,000–$200,000 1–2 runs

12. Trade-Off Analysis

12.1 Fine-Tuning Technique Comparison

Technique GPU Memory Quality vs Full FT Cost Multiple Training Stability Production Complexity Best For
Full fine-tuning Very High Baseline (1.0×) 1.0× Medium Low (standard weights) Largest performance gain; ample GPU
LoRA Low ~0.97× 0.25× High Low (adapter merge) Most production scenarios
QLoRA Very Low ~0.95× 0.1× High Low (adapter merge) Budget-constrained; experimentation
Prefix Tuning Low ~0.90× 0.1× High Medium (prefix at inference) Minimal-change constraints

12.2 Architectural Tensions

Tension Description Resolution
Data Volume vs Privacy Risk More training data improves quality; more data increases PII exposure surface Invest in PII removal pipeline quality; set conservative residual rate threshold
Cost vs Quality More training compute and larger datasets improve quality; cost is bounded Define quality target first; find minimum compute that achieves it; use QLoRA for experiments
Domain Specialisation vs Generality Heavy domain fine-tuning improves domain performance but may degrade general capability Maintain base model for general tasks; use fine-tuned model only for target domain

13. Failure Modes

Failure Likelihood Impact Detection Recovery
PII in training data memorised by model Medium Critical Post-training memorisation probe test Reject model version; improve PII removal; retrain
Catastrophic forgetting of base capabilities Medium High Benchmark regression vs base model Reduce fine-tuning learning rate; use regularisation (EWC)
Bias amplification in fine-tuned model Medium High Fairness benchmark comparison Audit and balance training data; add debiasing training examples
Training cost overrun (> 3× budget) Medium Medium Cost monitor alert Stop run; analyse checkpoint quality; decision to continue or abort
Alignment regression (safety degradation) Low Critical Safety test failure Reject version; add safety examples to training set; retrain

13.1 Cascading Failure Scenarios

If the PII removal pipeline has a systematic failure (e.g., a NER model update causes a regression in detection recall), an entire training run may proceed with elevated PII in the training data. The fine-tuned model may then memorise and reproduce that PII in inference responses. Mitigation: PII removal pipeline is separately version-controlled and tested; a canary test on a synthetic PII-seeded document is run at the start of each data preparation job to validate detection recall before processing production data.


14. Regulatory Considerations

Regulation / Framework Relevant Clause How This Pattern Addresses It
Privacy Act 1988 (Cth) APP 3 (Collection) / APP 6 (Use and Disclosure) / APP 11 (Security) Data collection manifest records consent/legal basis; purpose limitation check; PII removal addresses APP 11
EU AI Act (2024/1689) Article 10 (Data Governance) — training data quality and governance PII removal, quality filtering, deduplication, and provenance manifest directly address Article 10
EU AI Act (2024/1689) Article 9 (Risk Management) — testing and validation of high-risk AI Safety testing, bias evaluation, and benchmark evaluation satisfy Article 9 pre-deployment testing
ISO 42001:2023 Clause 8.3 (AI system design and development) — data quality Data preparation pipeline documents and enforces data quality per Clause 8.3
NIST AI RMF (2023) MAP 1.5 (Organisational risk tolerance for training data) / MEASURE 2.6 (Bias evaluation) Privacy review and bias evaluation directly address MAP 1.5 and MEASURE 2.6
APRA CPS 234 (2019) Paragraph 15 (Information security policy) — data used in model training Training data encryption, access controls, and audit trail satisfy Paragraph 15

15. Reference Implementations

15.1 AWS

  • Data Preparation: AWS Glue for ETL; Amazon Comprehend for PII detection; S3 for versioned dataset storage.
  • Compute: SageMaker Training Jobs; EC2 P4d/P5 spot instances for direct training; EFA networking for multi-node.
  • Fine-Tuning Framework: Hugging Face PEFT (LoRA/QLoRA) on SageMaker; SageMaker distributed training library for large models.
  • Training Monitor: SageMaker Experiments; CloudWatch custom metrics for loss/gradient tracking.
  • Evaluation: SageMaker Processing Jobs running Eleuther LM Eval Harness.

15.2 Azure

  • Data Preparation: Azure Data Factory for ETL; Azure AI Language (PII detection); Azure Data Lake Storage.
  • Compute: Azure Machine Learning Compute Clusters (NC series); spot (low-priority) VM policy.
  • Fine-Tuning Framework: Hugging Face PEFT on Azure ML; DeepSpeed for distributed training.
  • Training Monitor: Azure ML Experiment Tracking; Azure Monitor custom metrics.
  • Evaluation: Azure ML Pipelines running evaluation scripts; Azure OpenAI for evaluation assistance.

15.3 GCP

  • Data Preparation: Cloud Dataflow (Apache Beam) for ETL; Cloud Natural Language API for PII detection; Cloud Storage.
  • Compute: Vertex AI Custom Training; A2/A3 VM (spot) for GPU training; TPU pods for very large models.
  • Fine-Tuning Framework: Hugging Face PEFT on Vertex AI; Vertex AI model garden fine-tuning for supported models.
  • Training Monitor: Vertex AI Experiments; Cloud Monitoring custom metrics.
  • Evaluation: Vertex AI Pipelines; BigQuery for evaluation results storage.

15.4 On-Premises / Hybrid

  • Data Preparation: Apache Spark (on-prem cluster) for ETL; self-hosted Presidio for PII detection; MinIO for dataset storage.
  • Compute: On-premises GPU cluster (NVIDIA A100/H100); Kubernetes with GPU operator.
  • Fine-Tuning Framework: Hugging Face PEFT + Axolotl; DeepSpeed ZeRO for multi-GPU/multi-node.
  • Training Monitor: MLflow Tracking (self-hosted); Prometheus + Grafana for GPU metrics.
  • Evaluation: Custom evaluation harness; self-hosted LM Eval.

Pattern ID Pattern Name Relationship Type Description
EAAPL-MDL001 Model Versioning Produces Fine-tuning pipeline produces new MINOR version artefacts registered per MDL001
EAAPL-MDL002 Shadow Model Deployment Next Step Fine-tuned model candidates enter shadow testing before production promotion
EAAPL-MDL007 Model Compression and Optimisation Related Fine-tuned models are often subsequently quantised/compressed for production cost
EAAPL-MDL005 Multi-Model Ensemble Related Fine-tuned specialist models are natural candidates for mixture-of-experts ensembles

17. Maturity Assessment

Overall Maturity: Proven

Dimension Score (1–5) Rationale
Industry Adoption 4 LoRA/QLoRA fine-tuning is widely adopted in enterprise; pipelines are maturing
Tooling Availability 4 Hugging Face PEFT, Axolotl, cloud training services are production-ready
Standards Alignment 4 EU AI Act Article 10, Privacy Act, ISO 42001 all addressed explicitly
Implementation Complexity 4 (high) Full pipeline including privacy, safety testing, and governance is complex
Regulatory Acceptance 3 Privacy and AI governance regulators accept the approach; specific audit evidence requirements still being established

18. Revision History

Version Date Author Summary of Changes
1.0 2026-06-12 Enterprise AI Architecture Practice Initial publication
← Back to LibraryMore Model Management