EAAPL-MDL006 — Fine-Tuning Pipeline

Attribute	Value
Pattern ID	EAAPL-MDL006
Name	Fine-Tuning Pipeline
Maturity	Proven
Complexity	High
Tags	`llm` `privacy-act` `model-risk` `high-complexity`
Last Reviewed	2026-06-12
Owner	Enterprise AI Architecture Practice

1. Executive Summary

Fine-tuning is the process of adapting a pre-trained foundation model to an organisation's specific domain, language, and task requirements by continuing its training on curated enterprise data. The result is a model that combines the general capabilities of the foundation model with specialised performance on the organisation's particular use cases. This pattern defines the full production pipeline: from training data preparation (deduplication, PII removal, quality filtering, format conversion) through compute infrastructure provisioning, fine-tuning execution, monitoring, and evaluation before promotion. For CIOs, fine-tuning is a strategic investment — it produces a proprietary model asset that can outperform generic APIs on enterprise tasks at lower per-query cost. For CTOs, the pipeline architecture must be production-grade: reproducible, cost-controlled, privacy-compliant, and integrated with the model versioning and governance infrastructure (EAAPL-MDL001). For risk officers, fine-tuning on enterprise data creates a specific privacy obligation: consent, purpose limitation, and PII removal must be demonstrably satisfied before any data enters the training pipeline. Safety testing post-fine-tune is mandatory — fine-tuning can amplify biases or degrade alignment properties of the base model.

2. Problem Statement

2.1 Business Problem

Generic foundation models perform adequately on general tasks but underperform on domain-specific enterprise tasks: legal contract analysis, financial report interpretation, medical documentation, customer service in domain-specific language. The performance gap is often 15–40% on task-specific benchmarks. Organisations pay API costs for suboptimal performance. Fine-tuning closes this gap, producing a model asset that is both more accurate on enterprise tasks and more cost-efficient at scale.

2.2 Technical Problem

Fine-tuning at enterprise scale requires solving a set of non-trivial engineering problems simultaneously: sourcing and cleaning training data at sufficient quality and volume; provisioning and managing GPU compute infrastructure; executing distributed training efficiently; preventing overfitting on a narrow enterprise dataset; maintaining safety and alignment properties of the base model; and integrating the resulting model into the organisation's versioning, evaluation, and deployment infrastructure.

2.3 Symptoms

Generic API responses consistently miss domain-specific terminology, concepts, or procedures.
API costs are high and will not decrease as usage scales.
The organisation has a proprietary knowledge corpus that is not reflected in any available model.
Post-deployment user feedback consistently identifies the same domain-specific failure patterns.

2.4 Cost of Inaction

Category	Indicative Impact
Quality	Persistent domain performance gap; users supplement AI with manual correction
Cost	API costs scale linearly with usage; fine-tuned models can reduce per-query cost by an indicative 50–80%
Strategic	Proprietary knowledge is not monetised through a model asset; competitive differentiation lost
Privacy Risk	Using raw enterprise data with third-party APIs without a proper privacy review creates regulatory exposure

3. Context

3.1 When to Apply

The organisation has a domain-specific task where generic models underperform.
A proprietary knowledge corpus (legal documents, clinical notes, financial reports, engineering documentation) is available for training.
Usage volume justifies the fine-tuning investment (typically > 1M queries/month before ROI is clear).
The organisation has or can provision GPU compute infrastructure.

3.2 When NOT to Apply

The domain gap can be addressed by prompt engineering or retrieval-augmented generation at lower cost and complexity.
The available training corpus is too small (< 1,000 high-quality examples for task-specific fine-tuning; < 100,000 for meaningful domain adaptation).
The training data cannot be cleared of PII or confidential information.
The organisation lacks the ML engineering expertise to execute and maintain a fine-tuning pipeline.

3.3 Prerequisites

Prerequisite	Detail
Training data corpus	Minimum viable corpus size per task type; rights to use data for model training
Privacy review	Formal privacy impact assessment completed; PII removal process defined
GPU compute infrastructure	Cloud GPU cluster or on-premises GPU servers with distributed training capability
Model versioning (EAAPL-MDL001)	Fine-tuned model will be registered; versioning infrastructure must exist
Base model licence	Licence for the base model permits fine-tuning and derivative model deployment

3.4 Industry Applicability

Industry	Applicability	Primary Driver
Financial Services	High	Domain-specific financial language; regulatory document understanding
Healthcare	High	Clinical documentation; medical coding; drug information extraction
Legal	High	Contract analysis; case law interpretation; compliance checking
Government	High	Policy document analysis; public service queries in specialised domain
Manufacturing	Medium	Technical documentation; maintenance manual interpretation
Retail / E-commerce	Medium	Product catalogue understanding; customer service domain adaptation

4. Architecture Overview

4.1 Training Data Preparation Pipeline

The data preparation pipeline is the most privacy-sensitive stage of fine-tuning. It executes sequentially:

Data Collection: Data is ingested from authorised enterprise sources — document management systems, ticketing systems, knowledge bases, transaction logs. A data collection manifest records: source system, collection date, volume, and legal basis for use (consent record reference, legitimate interest assessment reference, or data processing agreement reference).
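
A manifest entry of this kind can be sketched as a small validated record. This is a minimal illustration only: the class name `ManifestEntry`, its field names, the `LEGAL_BASES` values, and the reference format are assumptions, not part of this pattern or of EAAPL-MDL001.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical labels for the three legal bases named in the text.
LEGAL_BASES = {"consent", "legitimate_interest", "data_processing_agreement"}


@dataclass(frozen=True)
class ManifestEntry:
    """One row of a data collection manifest (illustrative field names)."""
    source_system: str
    collection_date: date
    document_count: int
    legal_basis: str
    legal_basis_reference: str  # e.g. a consent record or LIA identifier

    def __post_init__(self):
        # Refuse entries that cannot be traced to a recorded legal basis.
        if self.legal_basis not in LEGAL_BASES:
            raise ValueError(f"unknown legal basis: {self.legal_basis!r}")
        if not self.legal_basis_reference.strip():
            raise ValueError("legal basis reference must not be empty")
        if self.document_count < 0:
            raise ValueError("document count must be non-negative")


entry = ManifestEntry(
    source_system="ticketing",
    collection_date=date(2026, 1, 15),
    document_count=1200,
    legal_basis="consent",
    legal_basis_reference="CR-0042",  # hypothetical reference
)
```

Rejecting an entry at construction time means a source without a documented legal basis cannot silently enter later stages.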

Deduplication: Near-duplicate documents are identified using MinHash LSH or SimHash and removed. Exact duplicates are removed using content hash. Deduplication reduces memorisation risk (the model memorising and reproducing training examples verbatim) and improves training efficiency.
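
The two deduplication passes can be illustrated with the standard library alone. This is a sketch under stated assumptions: it uses character 5-shingles, 64 salted hash functions, and a 0.8 similarity threshold (all illustrative values), and it compares signatures pairwise rather than with LSH banding, so it shows the MinHash estimate but would not scale to a large corpus as written.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set:
    """Character k-shingles of lower-cased, whitespace-normalised text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def minhash_signature(text: str, num_hashes: int = 64) -> list:
    """Minimum hash value per salted hash function over the shingle set."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: list, threshold: float = 0.8) -> list:
    """Drop exact duplicates by content hash, then near-duplicates by MinHash."""
    seen_hashes = set()
    kept = []  # (document, signature) pairs retained so far
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        signature = minhash_signature(doc)
        if any(estimated_jaccard(signature, s) >= threshold for _, s in kept):
            continue  # near-duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append((doc, signature))
    return [doc for doc, _ in kept]


corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # exact duplicate
    "The quick brown fox jumps over the lazy dog!",  # near-duplicate
    "Completely different content about GPU clusters.",
]
unique_docs = deduplicate(corpus)
```

In a production pipeline the pairwise comparison would normally be replaced by LSH bucketing so that only candidate pairs are compared.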

Quality Filtering: Documents are scored for quality using heuristics (length, language detection, vocabulary richness) and a lightweight quality classifier. Below-threshold documents are excluded. The quality threshold is tuned to balance data volume against quality for the specific task.
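
The heuristic part of the quality score might look like the following sketch. The three signals, their equal weighting, the `min_words` value, and the 0.7 threshold are all assumptions chosen for illustration; language detection and the lightweight classifier mentioned above are omitted.

```python
def quality_score(text: str, min_words: int = 20) -> float:
    """Average of three heuristics, each in [0, 1] (illustrative weighting)."""
    words = text.split()
    if not words:
        return 0.0
    # Length: documents shorter than min_words are penalised proportionally.
    length_score = min(1.0, len(words) / min_words)
    # Vocabulary richness: ratio of distinct words to total words.
    vocab_score = len({w.lower() for w in words}) / len(words)
    # Share of characters that are letters or whitespace (penalises debris).
    alpha_score = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    return (length_score + vocab_score + alpha_score) / 3


def filter_by_quality(docs: list, threshold: float = 0.7) -> list:
    """Keep documents whose heuristic score meets the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]


good = ("The maintenance manual describes how to replace the hydraulic pump "
        "seal, inspect the housing for corrosion, and record the work order "
        "in the asset register.")
repetitive = "buy buy buy buy buy buy"
kept_docs = filter_by_quality([good, repetitive])
```

The threshold would be tuned per task, as the text notes, by inspecting what falls on either side of it.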

PII Removal: All training documents pass through a PII detection and removal stage using a named entity recognition model plus rule-based patterns. PII categories detected and masked: names, email addresses, phone numbers, government identifiers (TFN, ABN, Medicare), financial account numbers, dates of birth, physical addresses. Masking replaces PII with category tokens (e.g., [PERSON], [EMAIL]). PII removal is imperfect — a residual PII scan is run after removal and any remaining high-confidence PII detections are reported. The residual rate is logged and reviewed by the privacy function. An acceptable residual rate threshold (e.g., < 0.1% of documents containing residual PII) is defined in the privacy impact assessment.
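
The rule-based half of this stage, together with the residual scan and threshold check, can be sketched as below. The regular expressions are simplified illustrations (an email pattern and an Australian-style phone pattern), the residual patterns are deliberately broader than the masking patterns so that the scan is an independent check, and the 0.001 threshold mirrors the example figure in the text. NER-based detection of names and addresses is not shown.

```python
import re

# Masking patterns: category token -> simplified, illustrative regex.
MASK_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"(?<!\d)(?:\+?61[ -]?|0)[2-478](?:[ -]?\d){8}(?!\d)"),
}

# Residual scan: broader patterns used only to flag documents for review.
RESIDUAL_PATTERNS = [
    re.compile(r"\S+@\S+"),           # anything that still looks like an email
    re.compile(r"(?:\d[ -]?){9,}"),   # long digit runs (IDs, account numbers)
]


def mask_pii(text: str) -> str:
    """Replace each detected span with its category token, e.g. [EMAIL]."""
    for label, pattern in MASK_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def residual_rate(masked_docs: list) -> float:
    """Fraction of documents that still trigger a residual pattern."""
    if not masked_docs:
        return 0.0
    flagged = sum(
        1 for doc in masked_docs
        if any(p.search(doc) for p in RESIDUAL_PATTERNS)
    )
    return flagged / len(masked_docs)


def run_pii_stage(docs: list, max_residual_rate: float = 0.001):
    """Mask all documents and report whether the residual rate is acceptable."""
    masked = [mask_pii(d) for d in docs]
    rate = residual_rate(masked)
    return masked, rate, rate <= max_residual_rate


masked_docs, rate, within_threshold = run_pii_stage([
    "Contact jane.doe@example.com or 0412 345 678.",
    "Account 123456789012 overdue.",
])
```

Here the second document is not masked by either rule but is caught by the residual scan, so the stage reports a rate above the threshold and the pipeline would halt for remediation.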

Format Conversion: Documents are converted to the fine-tuning format for the target model (instruction-response pairs, completion format, or chat format). Prompt templates are applied. The conversion code is versioned and included in the artefact bundle.

Dataset Split: The prepared dataset is split into train (80%), validation (10%), and test (10%) sets. The test set is held out completely — it is not used for any training decision, only for final evaluation. The split is stratified if the dataset has class labels.
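
A stratified 80/10/10 split can be sketched with the standard library. The function name `stratified_split`, the `label_of` callback, and the fixed seed are illustrative assumptions; integer arithmetic is used for the cut points so the split sizes are reproducible.

```python
import random
from collections import defaultdict


def stratified_split(examples: list, label_of, seed: int = 13):
    """Split into train/validation/test (80/10/10) within each label group."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_label = defaultdict(list)
    for example in examples:
        by_label[label_of(example)].append(example)

    train, validation, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = len(group) * 80 // 100
        n_val = len(group) * 10 // 100
        train += group[:n_train]
        validation += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder is held out
    return train, validation, test


examples = [("a", i) for i in range(100)] + [("b", i) for i in range(50)]
train_set, val_set, test_set = stratified_split(examples, lambda e: e[0])
```

Because each label is split separately, the held-out test set keeps the same class proportions as the full dataset.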

4.2 Compute Infrastructure

Fine-tuning large models requires GPU clusters. Infrastructure choices depend on model size and on the fine-tuning technique. As an indicative guide: models < 7B parameters can be fine-tuned on a single A100 GPU; 7B–70B parameter models require multi-GPU nodes (8× A100 or equivalent); 70B+ parameter models require multi-node distributed training. The parameter-efficient techniques in 4.3 (LoRA, QLoRA) reduce these requirements.

Spot instance strategy: Fine-tuning jobs are restartable (checkpoints written every N steps). Using spot/preemptible instances reduces compute cost 60–80%. Checkpoint-based restart means a spot interruption costs at most N steps of recomputation. The pipeline must implement automatic job restart on spot termination.
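
The checkpoint-and-resume behaviour can be simulated without any training framework. In this sketch the "training step" is a placeholder comment, the checkpoint is a small JSON file, and `interrupt_at` is a hypothetical parameter that stands in for a spot termination; none of these names come from a real orchestrator API.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or step 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(total_steps: int, ckpt_path: str, every_n: int, interrupt_at=None) -> int:
    """Run (or resume) training; returns the step the run resumed from."""
    resumed_from = load_checkpoint(ckpt_path)["step"]
    for step in range(resumed_from + 1, total_steps + 1):
        if step == interrupt_at:
            raise InterruptedError(f"spot instance reclaimed at step {step}")
        # ... one optimiser step would run here ...
        if step % every_n == 0 or step == total_steps:
            save_checkpoint(ckpt_path, {"step": step})
    return resumed_from


def demo():
    """Interrupt a 50-step run at step 25, then restart it automatically."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "checkpoint.json")
        try:
            train(50, path, every_n=10, interrupt_at=25)
        except InterruptedError:
            pass  # the orchestrator would relaunch the job here
        resumed_from = train(50, path, every_n=10)
        return resumed_from, load_checkpoint(path)["step"]
```

In the demo the interruption at step 25 loses only the work since the checkpoint at step 20, which is within the bound of N = 10 steps described above.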

Distributed training coordination: For multi-GPU or multi-node training, use PyTorch FSDP or DeepSpeed ZeRO for memory-efficient distribution. The distributed training configuration is versioned as part of the training run configuration.

4.3 Fine-Tuning Techniques

Full fine-tuning: All model weights are updated. Produces highest quality but requires the most GPU memory and compute. Appropriate when the domain gap is large and compute budget allows.

LoRA (Low-Rank Adaptation): A small number of low-rank weight matrices are trained while base model weights are frozen. Memory reduction of 4–8×; quality within 2–5% of full fine-tuning for most tasks. The LoRA adapters are the primary artefact (base model + LoRA adapter = fine-tuned model). Recommended for most enterprise fine-tuning scenarios.

QLoRA: LoRA on a quantised (4-bit) base model. Enables fine-tuning of very large models on consumer-grade GPU hardware. Quality penalty of approximately 1–3% vs LoRA on full-precision base. Use when compute budget is highly constrained.

Selection criteria: Start with QLoRA for experimentation; graduate to LoRA for production; use full fine-tuning only when LoRA quality ceiling is insufficient for the business requirement.
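
The reason LoRA trains so few weights follows from its construction: for a frozen weight matrix of shape d_out × d_in, it learns two low-rank factors of shapes d_out × r and r × d_in. The arithmetic below counts trainable parameters for one matrix; the dimensions and rank are illustrative, and the result is a parameter count, not the overall memory reduction quoted above (which also depends on optimiser state and activations).

```python
def full_params(d_out: int, d_in: int) -> int:
    """Weights updated by full fine-tuning of one d_out x d_in matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights in the two LoRA factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of the matrix's parameter count that LoRA actually trains."""
    return lora_trainable_params(d_out, d_in, rank) / full_params(d_out, d_in)


# Illustrative case: a 4096 x 4096 projection with rank 8.
adapter_params = lora_trainable_params(4096, 4096, 8)  # 65,536
fraction = trainable_fraction(4096, 4096, 8)            # about 0.39%
```

Raising the rank increases adapter capacity linearly in parameter count, which is one of the first hyperparameters to explore when the LoRA quality ceiling looks insufficient.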

4.4 Training Monitoring

During training, monitor every N steps: training loss and validation loss (divergence indicates overfitting); gradient norms (exploding gradients indicate learning rate issue); learning rate schedule adherence; GPU utilisation; cost per step (extrapolated to total run cost). Alert if: validation loss increases for 3 consecutive checkpoints (overfitting signal), gradient norms exceed 10 (training instability), or extrapolated total cost exceeds budget by 20%.
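
The three alert conditions can be expressed as a single check run at each checkpoint. This is a sketch: the function and alert names are hypothetical, cost is extrapolated linearly from steps completed, and "3 consecutive increases" is read as four successive validation losses that each exceed the previous one.

```python
def extrapolated_cost(cost_so_far: float, steps_done: int, total_steps: int) -> float:
    """Project total run cost assuming constant cost per step."""
    return cost_so_far / steps_done * total_steps


def training_alerts(val_losses: list, grad_norm: float, cost_so_far: float,
                    steps_done: int, total_steps: int, budget: float) -> list:
    """Return the names of any alert conditions that currently hold."""
    alerts = []
    recent = val_losses[-4:]  # four values give three consecutive comparisons
    if len(recent) == 4 and all(b > a for a, b in zip(recent, recent[1:])):
        alerts.append("overfitting")
    if grad_norm > 10:
        alerts.append("instability")
    if extrapolated_cost(cost_so_far, steps_done, total_steps) > 1.2 * budget:
        alerts.append("cost_overrun")
    return alerts


unhealthy = training_alerts([1.0, 0.9, 0.95, 1.0, 1.1], grad_norm=12.0,
                            cost_so_far=65.0, steps_done=500,
                            total_steps=1000, budget=100.0)
healthy = training_alerts([1.0, 0.9, 0.8, 0.85], grad_norm=2.0,
                          cost_so_far=50.0, steps_done=500,
                          total_steps=1000, budget=100.0)
```

A single rise in validation loss, as in the healthy example, does not trigger the overfitting alert; only a sustained trend does.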

4.5 Safety Testing Post-Fine-Tune

Fine-tuning can degrade the alignment properties of the base model. Post-fine-tune safety testing is mandatory and must pass before the model can be registered. Tests include: harmlessness evaluation (does the model generate harmful content in response to adversarial prompts — test against a standard adversarial prompt set); bias amplification check (compare demographic bias metrics of fine-tuned model vs base model on standard fairness benchmarks); alignment regression test (compare responses to alignment-relevant queries between base and fine-tuned model). Any regression on safety or alignment metrics relative to the base model is a blocking condition — the fine-tuned model cannot be promoted until the regression is resolved.
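
The blocking rule, where any regression against the base model prevents registration, can be sketched as a comparison of metric dictionaries. The metric names are hypothetical, every metric is assumed to be a rate where higher is worse, and the `tolerance` parameter (default zero, matching "any regression") is an assumption added to allow for measurement noise if the governance function chooses to permit it.

```python
def safety_regressions(base: dict, tuned: dict, tolerance: float = 0.0) -> list:
    """Names of metrics where the fine-tuned model is worse than the base model.

    All metrics are treated as 'higher is worse' rates.
    """
    missing = set(base) - set(tuned)
    if missing:
        # A metric that was not evaluated cannot be assumed to have passed.
        raise KeyError(f"fine-tuned model missing metrics: {sorted(missing)}")
    return sorted(
        name for name, base_value in base.items()
        if tuned[name] > base_value + tolerance
    )


def can_register(base: dict, tuned: dict) -> bool:
    """The version may be registered only if no metric has regressed."""
    return not safety_regressions(base, tuned)


base_metrics = {"harmful_response_rate": 0.02, "bias_gap": 0.05,
                "alignment_miss_rate": 0.01}
tuned_metrics = {"harmful_response_rate": 0.02, "bias_gap": 0.07,
                 "alignment_miss_rate": 0.01}
regressed = safety_regressions(base_metrics, tuned_metrics)
```

Returning the list of regressed metrics, rather than a bare pass or fail, gives the root cause analysis a starting point.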

4.6 Privacy Review for Training Data

The privacy review is not a one-time gate — it is a continuous obligation. For each fine-tuning run, the privacy function must confirm: (1) all training data sources have valid consent records or legitimate interest assessments; (2) PII removal has been executed and the residual PII rate is within the approved threshold; (3) the purpose of the fine-tuning is within the scope of the consent or LIA under which data was collected (purpose limitation); (4) data processing agreements with any third-party data processors used in the pipeline are current. The privacy review record is included in the artefact bundle.

5. Architecture Diagram

flowchart TD
    subgraph DataPrep["Data Preparation"]
        A[Enterprise Data Sources]
        B[PII Removal]
        C{Privacy Review}
    end
    subgraph Training["Training Pipeline"]
        D[GPU Cluster]
        E[Fine-Tuning Run]
        F[Training Monitor]
    end
    subgraph Evaluation["Evaluation and Registration"]
        G[Benchmark + Safety Tests]
        H[Model Register]
    end
    A -->|dedup + quality filter| B
    B --> C
    C -->|approved| D
    C -->|residual PII| A
    D --> E
    E --> F
    F -->|healthy| G
    F -->|instability| E
    G -->|pass| H
    G -->|fail| I[Reject + Root Cause]
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#d1fae5,stroke:#10b981
    style I fill:#fee2e2,stroke:#ef4444

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Data Ingestion	Pipeline Stage	Collects data from enterprise sources; records legal basis	Apache Airflow, Prefect, custom ETL	Critical
PII Detector/Remover	Pipeline Stage	Detects and masks PII using NER + rules	Microsoft Presidio, spaCy NER, AWS Comprehend, custom	Critical
Quality Filter	Pipeline Stage	Scores and filters low-quality training examples	fastText language classifier; custom quality scorer	High
Training Orchestrator	Infrastructure	Provisions compute; launches and monitors training jobs; handles restarts	AWS SageMaker Training Jobs, Azure ML Compute, Vertex AI Training	Critical
Fine-Tuning Framework	Framework	Executes the actual weight update computation	Hugging Face PEFT (LoRA/QLoRA), DeepSpeed, Axolotl	Critical
Evaluation Harness	Platform Service	Runs benchmark + safety tests post-training	Eleuther LM Eval Harness, custom evaluation suite	Critical
Artefact Bundler	Pipeline Stage	Packages weights, config, preprocessing code, eval results per EAAPL-MDL001	Custom CI/CD step	High

7. Data Flow

7.1 Primary Flow

Step	Actor	Action	Output
1	Data Engineer	Initiates data collection from approved enterprise sources	Raw dataset + data collection manifest
2	Deduplication Stage	Removes near-duplicates and exact duplicates	Deduplicated dataset
3	Quality Filter	Scores documents; removes below-threshold examples	Quality-filtered dataset
4	PII Remover	Detects and masks PII; produces residual scan report	PII-masked dataset + residual rate report
5	Privacy Function	Reviews residual rate; confirms purpose limitation and consent coverage	Privacy review sign-off (or rejection with required remediation)
6	Format Converter	Transforms to fine-tuning format; applies prompt templates	Training-ready dataset in model-specific format
7	Training Orchestrator	Provisions GPU compute; launches training job; monitors; handles restarts	Training run in progress
8	Fine-Tuning Framework	Updates weights (LoRA/full) using training data; checkpoints periodically	Model weights + training metrics + training run record
9	Evaluation Harness	Runs benchmark + safety + bias evaluation on fine-tuned model	Evaluation results JSON
10	Artefact Bundler	Assembles bundle per EAAPL-MDL001; registers as new MINOR version	Versioned model artefact in Model Register

7.2 Error Flow

Error Scenario	Detection	Recovery Action
PII residual rate exceeds threshold	Residual scan report check	Halt pipeline; remediate identified documents; re-run PII removal
Training instability (loss divergence)	Training monitor gradient norm alert	Stop training run; investigate learning rate/batch size; restart
Spot instance interruption	Orchestrator interruption handler	Automatic job restart from last checkpoint; log interruption event
Evaluation fails safety threshold	Safety test result evaluation	Reject version; mandatory root cause analysis; adjust fine-tuning approach
Compute cost exceeds budget	Cost tracking in training monitor	Alert; human decision to continue or stop; log budget overrun

8. Security Considerations

8.1 Controls Summary

Domain	Control
Authentication	Training pipeline service account with narrow scope; no access to production serving infrastructure
Authorisation	Training data source access gated by data classification and privacy review sign-off
Secrets	Any API keys used in data collection or training stored in secrets manager; never in training data
Classification	Fine-tuned model classified based on sensitivity of training data — if trained on CONFIDENTIAL data, model is CONFIDENTIAL
Encryption	Training data encrypted at rest throughout pipeline; GPU cluster communication encrypted via TLS; artefact encrypted at rest
Auditability	Full data lineage recorded: source → dedup → quality filter → PII removal → training run → model version

8.2 OWASP LLM Top 10 Relevance

OWASP LLM Risk	Relevance	Mitigation
LLM01 Prompt Injection	Medium	Fine-tuned model may be more susceptible to domain-specific injection — post-fine-tune adversarial testing required
LLM02 Insecure Output Handling	Medium	Fine-tuned model may learn output patterns from training data; output validation at inference layer
LLM03 Training Data Poisoning	Critical	Enterprise data sources are attack surface; data collection manifest enables detection; quality filter catches anomalous data
LLM04 Model Denial of Service	Low	Serving concern, not a training pipeline concern
LLM05 Supply Chain Vulnerabilities	High	Base model licence and provenance verified per EAAPL-MDL001; fine-tuning framework dependencies pinned and scanned
LLM06 Sensitive Information Disclosure	Critical	Training data PII removal is the primary control; residual rate monitoring; memorisation testing post-fine-tune
LLM07 Insecure Plugin Design	Low	Pipeline does not use plugins in training context
LLM08 Excessive Agency	Low	Training pipeline concern is data, not agency
LLM09 Overreliance	Medium	Domain-specific fine-tuned models may create overreliance; model card must document limitations clearly
LLM10 Model Theft	High	Fine-tuned model is a proprietary IP asset; artefact store access controls and export restrictions apply

9. Governance Considerations

9.1 Responsible AI

Fine-tuning on enterprise data can amplify biases present in that data. Fairness analysis is mandatory post-fine-tune: compare demographic performance metrics of fine-tuned model vs base model. If fine-tuning increases performance disparity for any protected subgroup, the training data must be audited and balanced before the version can be promoted.
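
One way to operationalise the disparity check is to measure each subgroup's gap to the best-performing subgroup, before and after fine-tuning. This is a sketch under assumptions: accuracy is used as the performance metric, the subgroup names are placeholders, and a small epsilon guards against floating-point noise. The appropriate fairness metric for a given use case is a governance decision that this example does not settle.

```python
def gaps_to_best(accuracy_by_group: dict) -> dict:
    """Each subgroup's shortfall relative to the best-performing subgroup."""
    best = max(accuracy_by_group.values())
    return {group: best - acc for group, acc in accuracy_by_group.items()}


def widened_gaps(base_acc: dict, tuned_acc: dict, eps: float = 1e-9) -> list:
    """Subgroups whose gap to the best subgroup grew after fine-tuning."""
    base_gaps = gaps_to_best(base_acc)
    tuned_gaps = gaps_to_best(tuned_acc)
    return sorted(
        group for group in base_gaps
        if tuned_gaps[group] > base_gaps[group] + eps
    )


base_accuracy = {"group_a": 0.80, "group_b": 0.78}
tuned_accuracy = {"group_a": 0.90, "group_b": 0.82}
flagged_groups = widened_gaps(base_accuracy, tuned_accuracy)
```

In this example both subgroups improve in absolute terms, yet the gap for `group_b` widens from about 0.02 to about 0.08, which is the situation that triggers the training data audit.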

9.2 Model Risk Management

A fine-tuned model is a MINOR version change from the base model (per EAAPL-MDL001). It requires full model validation including benchmark evaluation, safety testing, and fairness evaluation. The training data provenance is part of the model risk record — any future question about the model's behaviour can be investigated by examining the training data.

9.3 Human Approval Gates

Privacy review sign-off (privacy function) is required before training begins. Evaluation pass gates model registration. Final production promotion requires the standard approval workflow (EAAPL-MDL001).

9.4 Governance Artefacts

Artefact	Owner	Frequency	Location
Data Collection Manifest	Data Engineer	Per training run	Artefact bundle + data catalogue
Privacy Review Record	Privacy Function	Per training run	Privacy register + artefact bundle
PII Residual Rate Report	Data Engineer	Per data preparation	Artefact bundle
Training Run Record	ML Engineer	Per training run	Model Register + MLflow
Safety Testing Report	AI Governance	Per fine-tuned version	Model governance record

10. Operational Considerations

10.1 SLOs

SLO	Target	Measurement Method
Data preparation pipeline duration	< 4 hours (for datasets < 10GB)	Pipeline end-to-end timing
Training job restart on spot interruption	< 5 minutes	Orchestrator restart timing from interruption event
Evaluation pipeline duration	< 2 hours	Evaluation harness end-to-end timing
PII residual rate	< 0.1%	Residual scan report metric

10.2 Monitoring and Logging

Key metrics monitored during training: training loss per step, validation loss per checkpoint, gradient norm per step, GPU utilisation per node, cost per step (accumulated and extrapolated). Alerts configured for: validation loss divergence (3 consecutive checkpoints increasing), gradient norm > 10, spot interruption (informational — triggers automatic restart), and cost overrun.

10.3 Incident Response

Training pipeline incidents include: data pipeline failure (data loss or corruption), PII residual rate breach, training instability, and cost overrun. Each triggers a halt-and-investigate workflow. The privacy function is notified immediately for any PII-related incident. Training runs that fail mid-job do not produce registered model versions — partial artefacts are discarded.

10.4 Disaster Recovery

Scenario	RPO	RTO	Recovery Procedure
Training data store unavailable	Last snapshot	4 hours	Restore from snapshot; rerun from last stage with data intact
GPU cluster unavailable	Last checkpoint	2 hours	Provision alternate cluster; restart from checkpoint
PII-contaminated data discovered post-training	Immediate alert	Manual	Quarantine model version; investigate training data; re-run with clean data

10.5 Capacity Planning

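GPU planning: estimate training time in seconds as (6 × model parameters × dataset tokens) / (number of GPUs × peak FLOPS per GPU × achieved utilisation). The factor of 6 is the usual approximation for the forward and backward pass of full fine-tuning. For a 7B model fine-tuned on 1B tokens on 8× A100 80GB: approximately 8 hours, which assumes BF16 peak throughput of 312 TFLOPS per GPU at just under 60% utilisation; lower utilisation lengthens the run proportionally. Add 20% for evaluation and overhead, giving roughly 10 hours. For cost estimation: A100 spot price is approximately $1.5–2.5 per GPU-hour; 8 GPUs × 10 hours = 80 GPU-hours, or $120–$200 per training run. Budget multiple runs for hyperparameter tuning.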
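
The estimate above can be reproduced with a few lines of arithmetic. The utilisation value of 0.58 is an assumption chosen so that the result matches the 8-hour figure; 312 TFLOPS is the published dense BF16 peak for an A100. The function names are illustrative.

```python
A100_BF16_PEAK_FLOPS = 312e12  # dense BF16 tensor-core peak per GPU


def training_hours(params: float, tokens: float, n_gpus: int,
                   utilisation: float,
                   peak_flops_per_gpu: float = A100_BF16_PEAK_FLOPS) -> float:
    """Training time in hours from the 6 x parameters x tokens estimate."""
    total_flops = 6 * params * tokens
    sustained_flops = n_gpus * peak_flops_per_gpu * utilisation
    return total_flops / sustained_flops / 3600


def run_cost_range(n_gpus: int, hours: float,
                   low_rate: float = 1.5, high_rate: float = 2.5) -> tuple:
    """Low and high cost in dollars for a run, given per-GPU-hour spot prices."""
    gpu_hours = n_gpus * hours
    return gpu_hours * low_rate, gpu_hours * high_rate


# 7B parameters, 1B tokens, 8 GPUs, assumed 58% utilisation.
hours = training_hours(params=7e9, tokens=1e9, n_gpus=8, utilisation=0.58)
# 8 GPUs for 10 hours (8 hours plus roughly 20% overhead).
low_cost, high_cost = run_cost_range(n_gpus=8, hours=10)
```

Rerunning the estimate at 40% utilisation gives closer to 12 hours, which shows how sensitive the budget is to achieved throughput.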

11. Cost Considerations

11.1 Cost Drivers

Driver	Description	Relative Impact
GPU compute for training	Primary cost; scales with model size, dataset size, and number of training runs	Very High
Data preparation compute	CPU-heavy pipeline: dedup, quality filter, PII removal	Medium
Training data storage	Processed training dataset storage; artefact bundle storage	Low
Engineering labour	ML engineering time to build, maintain, and tune the pipeline	High
Evaluation compute	GPU time for post-training evaluation	Medium

11.2 Scaling Risks

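Training cost is highly sensitive to model size and dataset size. For a fixed dataset, training compute grows roughly in proportion to parameter count, consistent with the estimate in 10.5, but cost often grows faster in practice because larger models need more GPU memory, more GPUs, and multi-node communication. Attention cost grows quadratically with sequence length rather than with model size, so long-context training data raises cost further. Budget for multiple training runs: first runs are often hyperparameter explorations that do not produce production models.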

11.3 Optimisations

Start with QLoRA to minimise experimentation cost; graduate to LoRA once the training approach is validated.
Use spot/preemptible instances for all training runs (checkpoint restarts make this safe).
Implement curriculum learning: start with easier examples to accelerate early training convergence.
Cache the base model in the training cluster warm storage — loading from cold storage adds 30–60 minutes per run for large models.

11.4 Indicative Cost Range

Model Size	Technique	Dataset Size	Approximate Training Cost	Runs to Production
7B parameters	QLoRA	100M tokens	$50–$200	3–5 runs
13B parameters	LoRA	1B tokens	$500–$2,000	3–5 runs
70B parameters	LoRA	5B tokens	$5,000–$20,000	2–3 runs
70B+ parameters	Full FT	10B tokens	$50,000–$200,000	1–2 runs

12. Trade-Off Analysis

12.1 Fine-Tuning Technique Comparison

Technique	GPU Memory	Quality vs Full FT	Cost Multiple	Training Stability	Production Complexity	Best For
Full fine-tuning	Very High	Baseline (1.0×)	1.0×	Medium	Low (standard weights)	Largest performance gain; ample GPU
LoRA	Low	~0.97×	0.25×	High	Low (adapter merge)	Most production scenarios
QLoRA	Very Low	~0.95×	0.1×	High	Low (adapter merge)	Budget-constrained; experimentation
Prefix Tuning	Low	~0.90×	0.1×	High	Medium (prefix at inference)	Minimal-change constraints

12.2 Architectural Tensions

Tension	Description	Resolution
Data Volume vs Privacy Risk	More training data improves quality; more data increases PII exposure surface	Invest in PII removal pipeline quality; set conservative residual rate threshold
Cost vs Quality	More training compute and larger datasets improve quality; cost is bounded	Define quality target first; find minimum compute that achieves it; use QLoRA for experiments
Domain Specialisation vs Generality	Heavy domain fine-tuning improves domain performance but may degrade general capability	Maintain base model for general tasks; use fine-tuned model only for target domain

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
PII in training data memorised by model	Medium	Critical	Post-training memorisation probe test	Reject model version; improve PII removal; retrain
Catastrophic forgetting of base capabilities	Medium	High	Benchmark regression vs base model	Reduce fine-tuning learning rate; use regularisation such as elastic weight consolidation (EWC)
Bias amplification in fine-tuned model	Medium	High	Fairness benchmark comparison	Audit and balance training data; add debiasing training examples
Training cost overrun (> 3× budget)	Medium	Medium	Cost monitor alert	Stop run; analyse checkpoint quality; decision to continue or abort
Alignment regression (safety degradation)	Low	Critical	Safety test failure	Reject version; add safety examples to training set; retrain

13.1 Cascading Failure Scenarios

If the PII removal pipeline has a systematic failure (e.g., a NER model update causes a regression in detection recall), an entire training run may proceed with elevated PII in the training data. The fine-tuned model may then memorise and reproduce that PII in inference responses. Mitigation: PII removal pipeline is separately version-controlled and tested; a canary test on a synthetic PII-seeded document is run at the start of each data preparation job to validate detection recall before processing production data.
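
The canary test can be sketched as a recall check against documents seeded with synthetic PII. Everything specific here is an assumption for illustration: the detector interface (a callable returning the set of detected strings), the regex stand-in for the real NER-plus-rules detector, the synthetic values, and the 0.99 minimum recall.

```python
import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(r"(?<!\d)(?:\+?61[ -]?|0)[2-478](?:[ -]?\d){8}(?!\d)")

# Synthetic documents paired with the PII values planted in them.
CANARIES = [
    ("Email canary.user@example.org about the claim.", ["canary.user@example.org"]),
    ("Call 0400 000 000 after 5pm.", ["0400 000 000"]),
]


def regex_detector(text: str) -> set:
    """Stand-in for the production detector: returns detected PII strings."""
    return {m.group(0) for p in (EMAIL_RE, PHONE_RE) for m in p.finditer(text)}


def email_only_detector(text: str) -> set:
    """Simulates a regressed detector that has lost phone detection."""
    return {m.group(0) for m in EMAIL_RE.finditer(text)}


def canary_recall(detect, canaries: list) -> float:
    """Fraction of planted PII values that the detector finds."""
    planted = found = 0
    for text, values in canaries:
        detected = detect(text)
        planted += len(values)
        found += sum(1 for v in values if v in detected)
    return found / planted


def assert_canary_passes(detect, canaries: list, min_recall: float = 0.99) -> float:
    """Raise before any production data is processed if recall has regressed."""
    recall = canary_recall(detect, canaries)
    if recall < min_recall:
        raise RuntimeError(f"PII canary recall {recall:.2f} below {min_recall}")
    return recall
```

Running `assert_canary_passes` as the first step of each data preparation job stops a regressed detector before it touches production data, rather than after a model has been trained on its output.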

14. Regulatory Considerations

Regulation / Framework	Relevant Clause	How This Pattern Addresses It
Privacy Act 1988 (Cth)	APP 3 (Collection) / APP 6 (Use and Disclosure) / APP 11 (Security)	Data collection manifest records consent/legal basis; purpose limitation check; PII removal addresses APP 11
EU AI Act (2024/1689)	Article 10 (Data Governance) — training data quality and governance	PII removal, quality filtering, deduplication, and provenance manifest directly address Article 10
EU AI Act (2024/1689)	Article 9 (Risk Management) — testing and validation of high-risk AI	Safety testing, bias evaluation, and benchmark evaluation satisfy Article 9 pre-deployment testing
ISO 42001:2023	Clause 8.3 (AI system design and development) — data quality	Data preparation pipeline documents and enforces data quality per Clause 8.3
NIST AI RMF (2023)	MAP 1.5 (Organisational risk tolerance for training data) / MEASURE 2.11 (Fairness and bias evaluation)	Privacy review and bias evaluation directly address MAP 1.5 and MEASURE 2.11
APRA CPS 234 (2019)	Paragraph 15 (Information security policy) — data used in model training	Training data encryption, access controls, and audit trail satisfy Paragraph 15

15. Reference Implementations

15.1 AWS

Data Preparation: AWS Glue for ETL; Amazon Comprehend for PII detection; S3 for versioned dataset storage.
Compute: SageMaker Training Jobs; EC2 P4d/P5 spot instances for direct training; EFA networking for multi-node.
Fine-Tuning Framework: Hugging Face PEFT (LoRA/QLoRA) on SageMaker; SageMaker distributed training library for large models.
Training Monitor: SageMaker Experiments; CloudWatch custom metrics for loss/gradient tracking.
Evaluation: SageMaker Processing Jobs running Eleuther LM Eval Harness.

15.2 Azure

Data Preparation: Azure Data Factory for ETL; Azure AI Language (PII detection); Azure Data Lake Storage.
Compute: Azure Machine Learning Compute Clusters (NC series); spot (low-priority) VM policy.
Fine-Tuning Framework: Hugging Face PEFT on Azure ML; DeepSpeed for distributed training.
Training Monitor: Azure ML Experiment Tracking; Azure Monitor custom metrics.
Evaluation: Azure ML Pipelines running evaluation scripts; Azure OpenAI for evaluation assistance.

15.3 GCP

Data Preparation: Cloud Dataflow (Apache Beam) for ETL; Cloud Natural Language API for PII detection; Cloud Storage.
Compute: Vertex AI Custom Training; A2/A3 VM (spot) for GPU training; TPU pods for very large models.
Fine-Tuning Framework: Hugging Face PEFT on Vertex AI; Vertex AI model garden fine-tuning for supported models.
Training Monitor: Vertex AI Experiments; Cloud Monitoring custom metrics.
Evaluation: Vertex AI Pipelines; BigQuery for evaluation results storage.

15.4 On-Premises / Hybrid

Data Preparation: Apache Spark (on-prem cluster) for ETL; self-hosted Presidio for PII detection; MinIO for dataset storage.
Compute: On-premises GPU cluster (NVIDIA A100/H100); Kubernetes with GPU operator.
Fine-Tuning Framework: Hugging Face PEFT + Axolotl; DeepSpeed ZeRO for multi-GPU/multi-node.
Training Monitor: MLflow Tracking (self-hosted); Prometheus + Grafana for GPU metrics.
Evaluation: Custom evaluation harness; self-hosted LM Eval.

16. Related Patterns

Pattern ID	Pattern Name	Relationship Type	Description
EAAPL-MDL001	Model Versioning	Produces	Fine-tuning pipeline produces new MINOR version artefacts registered per MDL001
EAAPL-MDL002	Shadow Model Deployment	Next Step	Fine-tuned model candidates enter shadow testing before production promotion
EAAPL-MDL007	Model Compression and Optimisation	Related	Fine-tuned models are often subsequently quantised/compressed for production cost
EAAPL-MDL005	Multi-Model Ensemble	Related	Fine-tuned specialist models are natural candidates for mixture-of-experts ensembles

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Rationale
Industry Adoption	4	LoRA/QLoRA fine-tuning is widely adopted in enterprise; pipelines are maturing
Tooling Availability	4	Hugging Face PEFT, Axolotl, cloud training services are production-ready
Standards Alignment	4	EU AI Act Article 10, Privacy Act, ISO 42001 all addressed explicitly
Implementation Complexity	4 (high)	Full pipeline including privacy, safety testing, and governance is complex
Regulatory Acceptance	3	Privacy and AI governance regulators accept the approach; specific audit evidence requirements still being established

18. Revision History

Version	Date	Author	Summary of Changes
1.0	2026-06-12	Enterprise AI Architecture Practice	Initial publication
