EAAPL-MDL006 — Fine-Tuning Pipeline
| Attribute | Value |
|---|---|
| Pattern ID | EAAPL-MDL006 |
| Name | Fine-Tuning Pipeline |
| Maturity | Proven |
| Complexity | High |
| Tags | llm privacy-act model-risk high-complexity |
| Last Reviewed | 2026-06-12 |
| Owner | Enterprise AI Architecture Practice |
1. Executive Summary
Fine-tuning is the process of adapting a pre-trained foundation model to an organisation's specific domain, language, and task requirements by continuing its training on curated enterprise data. The result is a model that combines the general capabilities of the foundation model with specialised performance on the organisation's particular use cases. This pattern defines the full production pipeline: from training data preparation (deduplication, PII removal, quality filtering, format conversion) through compute infrastructure provisioning, fine-tuning execution, monitoring, and evaluation before promotion. For CIOs, fine-tuning is a strategic investment — it produces a proprietary model asset that can outperform generic APIs on enterprise tasks at lower per-query cost. For CTOs, the pipeline architecture must be production-grade: reproducible, cost-controlled, privacy-compliant, and integrated with the model versioning and governance infrastructure (EAAPL-MDL001). For risk officers, fine-tuning on enterprise data creates a specific privacy obligation: consent, purpose limitation, and PII removal must be demonstrably satisfied before any data enters the training pipeline. Safety testing post-fine-tune is mandatory — fine-tuning can amplify biases or degrade alignment properties of the base model.
2. Problem Statement
2.1 Business Problem
Generic foundation models perform adequately on general tasks but underperform on domain-specific enterprise tasks: legal contract analysis, financial report interpretation, medical documentation, customer service in domain-specific language. The performance gap is often 15–40% on task-specific benchmarks. Organisations pay API costs for suboptimal performance. Fine-tuning closes this gap, producing a model asset that is both more accurate on enterprise tasks and more cost-efficient at scale.
2.2 Technical Problem
Fine-tuning at enterprise scale requires solving a set of non-trivial engineering problems simultaneously: sourcing and cleaning training data at sufficient quality and volume; provisioning and managing GPU compute infrastructure; executing distributed training efficiently; preventing overfitting on a narrow enterprise dataset; maintaining safety and alignment properties of the base model; and integrating the resulting model into the organisation's versioning, evaluation, and deployment infrastructure.
2.3 Symptoms
- Generic API responses consistently miss domain-specific terminology, concepts, or procedures.
- API costs are high and will not decrease as usage scales.
- The organisation has a proprietary knowledge corpus that is not reflected in any available model.
- Post-deployment user feedback consistently identifies the same domain-specific failure patterns.
2.4 Cost of Inaction
| Category | Indicative Impact |
|---|---|
| Quality | Persistent domain performance gap; users supplement AI with manual correction |
| Cost | API costs scale linearly with usage; fine-tuned models reduce per-query cost 50–80% |
| Strategic | Proprietary knowledge is not monetised through a model asset; competitive differentiation lost |
| Privacy Risk | Using raw enterprise data with third-party APIs without a proper privacy review creates regulatory exposure |
3. Context
3.1 When to Apply
- The organisation has a domain-specific task where generic models underperform.
- A proprietary knowledge corpus (legal documents, clinical notes, financial reports, engineering documentation) is available for training.
- Usage volume justifies the fine-tuning investment (typically > 1M queries/month before ROI is clear).
- The organisation has or can provision GPU compute infrastructure.
3.2 When NOT to Apply
- The domain gap can be addressed by prompt engineering or retrieval-augmented generation at lower cost and complexity.
- The available training corpus is too small (< 1,000 high-quality examples for task-specific fine-tuning; < 100,000 for meaningful domain adaptation).
- The training data cannot be cleared of PII or confidential information.
- The organisation lacks the ML engineering expertise to execute and maintain a fine-tuning pipeline.
3.3 Prerequisites
| Prerequisite | Detail |
|---|---|
| Training data corpus | Minimum viable corpus size per task type; rights to use data for model training |
| Privacy review | Formal privacy impact assessment completed; PII removal process defined |
| GPU compute infrastructure | Cloud GPU cluster or on-premises GPU servers with distributed training capability |
| Model versioning (EAAPL-MDL001) | Fine-tuned model will be registered; versioning infrastructure must exist |
| Base model licence | Licence for the base model permits fine-tuning and derivative model deployment |
3.4 Industry Applicability
| Industry | Applicability | Primary Driver |
|---|---|---|
| Financial Services | High | Domain-specific financial language; regulatory document understanding |
| Healthcare | High | Clinical documentation; medical coding; drug information extraction |
| Legal | High | Contract analysis; case law interpretation; compliance checking |
| Government | High | Policy document analysis; public service queries in specialised domain |
| Manufacturing | Medium | Technical documentation; maintenance manual interpretation |
| Retail / E-commerce | Medium | Product catalogue understanding; customer service domain adaptation |
4. Architecture Overview
4.1 Training Data Preparation Pipeline
The data preparation pipeline is the most privacy-sensitive stage of fine-tuning. It executes sequentially:
Data Collection: Data is ingested from authorised enterprise sources — document management systems, ticketing systems, knowledge bases, transaction logs. A data collection manifest records: source system, collection date, volume, and legal basis for use (consent record reference, legitimate interest assessment reference, or data processing agreement reference).
Deduplication: Near-duplicate documents are identified using MinHash LSH or SimHash and removed. Exact duplicates are removed using content hash. Deduplication reduces memorisation risk (the model memorising and reproducing training examples verbatim) and improves training efficiency.
Quality Filtering: Documents are scored for quality using heuristics (length, language detection, vocabulary richness) and a lightweight quality classifier. Below-threshold documents are excluded. The quality threshold is tuned to balance data volume against quality for the specific task.
PII Removal: All training documents pass through a PII detection and removal stage using a named entity recognition model plus rule-based patterns. PII categories detected and masked: names, email addresses, phone numbers, government identifiers (TFN, ABN, Medicare), financial account numbers, dates of birth, physical addresses. Masking replaces PII with category tokens (e.g., [PERSON], [EMAIL]). PII removal is imperfect — a residual PII scan is run after removal and any remaining high-confidence PII detections are reported. The residual rate is logged and reviewed by the privacy function. An acceptable residual rate threshold (e.g., < 0.1% of documents containing residual PII) is defined in the privacy impact assessment.
Format Conversion: Documents are converted to the fine-tuning format for the target model (instruction-response pairs, completion format, or chat format). Prompt templates are applied. The conversion code is versioned and included in the artefact bundle.
Dataset Split: The prepared dataset is split into train (80%), validation (10%), and test (10%) sets. The test set is held out completely — it is not used for any training decision, only for final evaluation. The split is stratified if the dataset has class labels.
4.2 Compute Infrastructure
Fine-tuning large models requires GPU clusters. Infrastructure choices depend on model size: models < 7B parameters can be fine-tuned on a single A100 GPU; 7B–70B parameter models require multi-GPU nodes (8× A100 or equivalent); 70B+ parameter models require multi-node distributed training.
Spot instance strategy: Fine-tuning jobs are restartable (checkpoints written every N steps). Using spot/preemptible instances reduces compute cost 60–80%. Checkpoint-based restart means a spot interruption costs at most N steps of recomputation. The pipeline must implement automatic job restart on spot termination.
Distributed training coordination: For multi-GPU or multi-node training, use PyTorch FSDP or DeepSpeed ZeRO for memory-efficient distribution. The distributed training configuration is versioned as part of the training run configuration.
4.3 Fine-Tuning Techniques
Full fine-tuning: All model weights are updated. Produces highest quality but requires the most GPU memory and compute. Appropriate when the domain gap is large and compute budget allows.
LoRA (Low-Rank Adaptation): A small number of low-rank weight matrices are trained while base model weights are frozen. Memory reduction of 4–8×; quality within 2–5% of full fine-tuning for most tasks. The LoRA adapters are the primary artefact (base model + LoRA adapter = fine-tuned model). Recommended for most enterprise fine-tuning scenarios.
QLoRA: LoRA on a quantised (4-bit) base model. Enables fine-tuning of very large models on consumer-grade GPU hardware. Quality penalty of approximately 1–3% vs LoRA on full-precision base. Use when compute budget is highly constrained.
Selection criteria: Start with QLoRA for experimentation; graduate to LoRA for production; use full fine-tuning only when LoRA quality ceiling is insufficient for the business requirement.
4.4 Training Monitoring
During training, monitor every N steps: training loss and validation loss (divergence indicates overfitting); gradient norms (exploding gradients indicate learning rate issue); learning rate schedule adherence; GPU utilisation; cost per step (extrapolated to total run cost). Alert if: validation loss increases for 3 consecutive checkpoints (overfitting signal), gradient norms exceed 10 (training instability), or extrapolated total cost exceeds budget by 20%.
4.5 Safety Testing Post-Fine-Tune
Fine-tuning can degrade the alignment properties of the base model. Post-fine-tune safety testing is mandatory and must pass before the model can be registered. Tests include: harmlessness evaluation (does the model generate harmful content in response to adversarial prompts — test against a standard adversarial prompt set); bias amplification check (compare demographic bias metrics of fine-tuned model vs base model on standard fairness benchmarks); alignment regression test (compare responses to alignment-relevant queries between base and fine-tuned model). Any regression on safety or alignment metrics relative to the base model is a blocking condition — the fine-tuned model cannot be promoted until the regression is resolved.
4.6 Privacy Review for Training Data
The privacy review is not a one-time gate — it is a continuous obligation. For each fine-tuning run, the privacy function must confirm: (1) all training data sources have valid consent records or legitimate interest assessments; (2) PII removal has been executed and the residual PII rate is within the approved threshold; (3) the purpose of the fine-tuning is within the scope of the consent or LIA under which data was collected (purpose limitation); (4) data processing agreements with any third-party data processors used in the pipeline are current. The privacy review record is included in the artefact bundle.
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Data Ingestion | Pipeline Stage | Collects data from enterprise sources; records legal basis | Apache Airflow, Prefect, custom ETL | Critical |
| PII Detector/Remover | Pipeline Stage | Detects and masks PII using NER + rules | Microsoft Presidio, spaCy NER, AWS Comprehend, custom | Critical |
| Quality Filter | Pipeline Stage | Scores and filters low-quality training examples | fastText language classifier; custom quality scorer | High |
| Training Orchestrator | Infrastructure | Provisions compute; launches and monitors training jobs; handles restarts | AWS SageMaker Training Jobs, Azure ML Compute, Vertex AI Training | Critical |
| Fine-Tuning Framework | Framework | Executes the actual weight update computation | Hugging Face PEFT (LoRA/QLoRA), DeepSpeed, Axolotl | Critical |
| Evaluation Harness | Platform Service | Runs benchmark + safety tests post-training | Eleuther LM Eval Harness, custom evaluation suite | Critical |
| Artefact Bundler | Pipeline Stage | Packages weights, config, preprocessing code, eval results per EAAPL-MDL001 | Custom CI/CD step | High |
7. Data Flow
7.1 Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Data Engineer | Initiates data collection from approved enterprise sources | Raw dataset + data collection manifest |
| 2 | Deduplication Stage | Removes near-duplicates and exact duplicates | Deduplicated dataset |
| 3 | Quality Filter | Scores documents; removes below-threshold examples | Quality-filtered dataset |
| 4 | PII Remover | Detects and masks PII; produces residual scan report | PII-masked dataset + residual rate report |
| 5 | Privacy Function | Reviews residual rate; confirms purpose limitation and consent coverage | Privacy review sign-off (or rejection with required remediation) |
| 6 | Format Converter | Transforms to fine-tuning format; applies prompt templates | Training-ready dataset in model-specific format |
| 7 | Training Orchestrator | Provisions GPU compute; launches training job; monitors; handles restarts | Training run in progress |
| 8 | Fine-Tuning Framework | Updates weights (LoRA/full) using training data; checkpoints periodically | Model weights + training metrics + training run record |
| 9 | Evaluation Harness | Runs benchmark + safety + bias evaluation on fine-tuned model | Evaluation results JSON |
| 10 | Artefact Bundler | Assembles bundle per EAAPL-MDL001; registers as new MINOR version | Versioned model artefact in Model Register |
7.2 Error Flow
| Error Scenario | Detection | Recovery Action |
|---|---|---|
| PII residual rate exceeds threshold | Residual scan report check | Halt pipeline; remediate identified documents; re-run PII removal |
| Training instability (loss divergence) | Training monitor gradient norm alert | Stop training run; investigate learning rate/batch size; restart |
| Spot instance interruption | Orchestrator interruption handler | Automatic job restart from last checkpoint; log interruption event |
| Evaluation fails safety threshold | Safety test result evaluation | Reject version; mandatory root cause analysis; adjust fine-tuning approach |
| Compute cost exceeds budget | Cost tracking in training monitor | Alert; human decision to continue or stop; log budget overrun |
8. Security Considerations
8.1 Controls Summary
| Domain | Control |
|---|---|
| Authentication | Training pipeline service account with narrow scope; no access to production serving infrastructure |
| Authorisation | Training data source access gated by data classification and privacy review sign-off |
| Secrets | Any API keys used in data collection or training stored in secrets manager; never in training data |
| Classification | Fine-tuned model classified based on sensitivity of training data — if trained on CONFIDENTIAL data, model is CONFIDENTIAL |
| Encryption | Training data encrypted at rest throughout pipeline; GPU cluster communication encrypted via TLS; artefact encrypted at rest |
| Auditability | Full data lineage recorded: source → dedup → quality filter → PII removal → training run → model version |
8.2 OWASP LLM Top 10 Relevance
| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM01 Prompt Injection | Medium | Fine-tuned model may be more susceptible to domain-specific injection — post-fine-tune adversarial testing required |
| LLM02 Insecure Output Handling | Medium | Fine-tuned model may learn output patterns from training data; output validation at inference layer |
| LLM03 Training Data Poisoning | Critical | Enterprise data sources are attack surface; data collection manifest enables detection; quality filter catches anomalous data |
| LLM04 Model Denial of Service | Low | Training pipeline concern, not serving concern |
| LLM05 Supply Chain Vulnerabilities | High | Base model licence and provenance verified per EAAPL-MDL001; fine-tuning framework dependencies pinned and scanned |
| LLM06 Sensitive Information Disclosure | Critical | Training data PII removal is the primary control; residual rate monitoring; memorisation testing post-fine-tune |
| LLM07 Insecure Plugin Design | Low | Pipeline does not use plugins in training context |
| LLM08 Excessive Agency | Low | Training pipeline concern is data, not agency |
| LLM09 Overreliance | Medium | Domain-specific fine-tuned models may create overreliance; model card must document limitations clearly |
| LLM10 Model Theft | High | Fine-tuned model is a proprietary IP asset; artefact store access controls and export restrictions apply |
9. Governance Considerations
9.1 Responsible AI
Fine-tuning on enterprise data can amplify biases present in that data. Fairness analysis is mandatory post-fine-tune: compare demographic performance metrics of fine-tuned model vs base model. If fine-tuning increases performance disparity for any protected subgroup, the training data must be audited and balanced before the version can be promoted.
9.2 Model Risk Management
A fine-tuned model is a MINOR version change from the base model (per EAAPL-MDL001). It requires full model validation including benchmark evaluation, safety testing, and fairness evaluation. The training data provenance is part of the model risk record — any future question about the model's behaviour can be investigated by examining the training data.
9.3 Human Approval Gates
Privacy review sign-off (privacy function) is required before training begins. Evaluation pass gates model registration. Final production promotion requires the standard approval workflow (EAAPL-MDL001).
9.4 Governance Artefacts
| Artefact | Owner | Frequency | Location |
|---|---|---|---|
| Data Collection Manifest | Data Engineer | Per training run | Artefact bundle + data catalogue |
| Privacy Review Record | Privacy Function | Per training run | Privacy register + artefact bundle |
| PII Residual Rate Report | Data Engineer | Per data preparation | Artefact bundle |
| Training Run Record | ML Engineer | Per training run | Model Register + MLflow |
| Safety Testing Report | AI Governance | Per fine-tuned version | Model governance record |
10. Operational Considerations
10.1 SLOs
| SLO | Target | Measurement Method |
|---|---|---|
| Data preparation pipeline duration | < 4 hours (for datasets < 10GB) | Pipeline end-to-end timing |
| Training job restart on spot interruption | < 5 minutes | Orchestrator restart timing from interruption event |
| Evaluation pipeline duration | < 2 hours | Evaluation harness end-to-end timing |
| PII residual rate | < 0.1% | Residual scan report metric |
10.2 Monitoring and Logging
Key metrics monitored during training: training loss per step, validation loss per checkpoint, gradient norm per step, GPU utilisation per node, cost per step (accumulated and extrapolated). Alerts configured for: validation loss divergence (3 consecutive checkpoints increasing), gradient norm > 10, spot interruption (informational — triggers automatic restart), and cost overrun.
10.3 Incident Response
Training pipeline incidents include: data pipeline failure (data loss or corruption), PII residual rate breach, training instability, and cost overrun. Each triggers a halt-and-investigate workflow. The privacy function is notified immediately for any PII-related incident. Training runs that fail mid-job do not produce registered model versions — partial artefacts are discarded.
10.4 Disaster Recovery
| Scenario | RPO | RTO | Recovery Procedure |
|---|---|---|---|
| Training data store unavailable | Last snapshot | 4 hours | Restore from snapshot; rerun from last stage with data intact |
| GPU cluster unavailable | Last checkpoint | 2 hours | Provision alternate cluster; restart from checkpoint |
| PII-contaminated data discovered post-training | Immediate alert | Manual | Quarantine model version; investigate training data; re-run with clean data |
10.5 Capacity Planning
GPU planning: estimate training time as (dataset tokens × 6 × model parameters) / (GPU FLOPS × GPU efficiency). For a 7B model fine-tuned on 1B tokens on 8× A100 80GB: approximately 8 hours. Add 20% for evaluation and overhead. For cost estimation: A100 spot price is approximately $1.5–2.5/hour; 8 GPUs × 10 hours = $120–$200 per training run. Budget multiple runs for hyperparameter tuning.
11. Cost Considerations
11.1 Cost Drivers
| Driver | Description | Relative Impact |
|---|---|---|
| GPU compute for training | Primary cost; scales with model size, dataset size, and number of training runs | Very High |
| Data preparation compute | CPU-heavy pipeline: dedup, quality filter, PII removal | Medium |
| Training data storage | Processed training dataset storage; artefact bundle storage | Low |
| Engineering labour | ML engineering time to build, maintain, and tune the pipeline | High |
| Evaluation compute | GPU time for post-training evaluation | Medium |
11.2 Scaling Risks
Training cost is highly sensitive to model size and dataset size. A 2× model size increase typically causes 4× training cost increase (quadratic relationship for attention). Budget for multiple training runs — first runs are often hyperparameter explorations that do not produce production models.
11.3 Optimisations
- Start with QLoRA to minimise experimentation cost; graduate to LoRA once the training approach is validated.
- Use spot/preemptible instances for all training runs (checkpoint restarts make this safe).
- Implement curriculum learning: start with easier examples to accelerate early training convergence.
- Cache the base model in the training cluster warm storage — loading from cold storage adds 30–60 minutes per run for large models.
11.4 Indicative Cost Range
| Model Size | Technique | Dataset Size | Approximate Training Cost | Runs to Production |
|---|---|---|---|---|
| 7B parameters | QLoRA | 100M tokens | $50–$200 | 3–5 runs |
| 13B parameters | LoRA | 1B tokens | $500–$2,000 | 3–5 runs |
| 70B parameters | LoRA | 5B tokens | $5,000–$20,000 | 2–3 runs |
| 70B+ parameters | Full FT | 10B tokens | $50,000–$200,000 | 1–2 runs |
12. Trade-Off Analysis
12.1 Fine-Tuning Technique Comparison
| Technique | GPU Memory | Quality vs Full FT | Cost Multiple | Training Stability | Production Complexity | Best For |
|---|---|---|---|---|---|---|
| Full fine-tuning | Very High | Baseline (1.0×) | 1.0× | Medium | Low (standard weights) | Largest performance gain; ample GPU |
| LoRA | Low | ~0.97× | 0.25× | High | Low (adapter merge) | Most production scenarios |
| QLoRA | Very Low | ~0.95× | 0.1× | High | Low (adapter merge) | Budget-constrained; experimentation |
| Prefix Tuning | Low | ~0.90× | 0.1× | High | Medium (prefix at inference) | Minimal-change constraints |
12.2 Architectural Tensions
| Tension | Description | Resolution |
|---|---|---|
| Data Volume vs Privacy Risk | More training data improves quality; more data increases PII exposure surface | Invest in PII removal pipeline quality; set conservative residual rate threshold |
| Cost vs Quality | More training compute and larger datasets improve quality; cost is bounded | Define quality target first; find minimum compute that achieves it; use QLoRA for experiments |
| Domain Specialisation vs Generality | Heavy domain fine-tuning improves domain performance but may degrade general capability | Maintain base model for general tasks; use fine-tuned model only for target domain |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| PII in training data memorised by model | Medium | Critical | Post-training memorisation probe test | Reject model version; improve PII removal; retrain |
| Catastrophic forgetting of base capabilities | Medium | High | Benchmark regression vs base model | Reduce fine-tuning learning rate; use regularisation (EWC) |
| Bias amplification in fine-tuned model | Medium | High | Fairness benchmark comparison | Audit and balance training data; add debiasing training examples |
| Training cost overrun (> 3× budget) | Medium | Medium | Cost monitor alert | Stop run; analyse checkpoint quality; decision to continue or abort |
| Alignment regression (safety degradation) | Low | Critical | Safety test failure | Reject version; add safety examples to training set; retrain |
13.1 Cascading Failure Scenarios
If the PII removal pipeline has a systematic failure (e.g., a NER model update causes a regression in detection recall), an entire training run may proceed with elevated PII in the training data. The fine-tuned model may then memorise and reproduce that PII in inference responses. Mitigation: PII removal pipeline is separately version-controlled and tested; a canary test on a synthetic PII-seeded document is run at the start of each data preparation job to validate detection recall before processing production data.
14. Regulatory Considerations
| Regulation / Framework | Relevant Clause | How This Pattern Addresses It |
|---|---|---|
| Privacy Act 1988 (Cth) | APP 3 (Collection) / APP 6 (Use and Disclosure) / APP 11 (Security) | Data collection manifest records consent/legal basis; purpose limitation check; PII removal addresses APP 11 |
| EU AI Act (2024/1689) | Article 10 (Data Governance) — training data quality and governance | PII removal, quality filtering, deduplication, and provenance manifest directly address Article 10 |
| EU AI Act (2024/1689) | Article 9 (Risk Management) — testing and validation of high-risk AI | Safety testing, bias evaluation, and benchmark evaluation satisfy Article 9 pre-deployment testing |
| ISO 42001:2023 | Clause 8.3 (AI system design and development) — data quality | Data preparation pipeline documents and enforces data quality per Clause 8.3 |
| NIST AI RMF (2023) | MAP 1.5 (Organisational risk tolerance for training data) / MEASURE 2.6 (Bias evaluation) | Privacy review and bias evaluation directly address MAP 1.5 and MEASURE 2.6 |
| APRA CPS 234 (2019) | Paragraph 15 (Information security policy) — data used in model training | Training data encryption, access controls, and audit trail satisfy Paragraph 15 |
15. Reference Implementations
15.1 AWS
- Data Preparation: AWS Glue for ETL; Amazon Comprehend for PII detection; S3 for versioned dataset storage.
- Compute: SageMaker Training Jobs; EC2 P4d/P5 spot instances for direct training; EFA networking for multi-node.
- Fine-Tuning Framework: Hugging Face PEFT (LoRA/QLoRA) on SageMaker; SageMaker distributed training library for large models.
- Training Monitor: SageMaker Experiments; CloudWatch custom metrics for loss/gradient tracking.
- Evaluation: SageMaker Processing Jobs running Eleuther LM Eval Harness.
15.2 Azure
- Data Preparation: Azure Data Factory for ETL; Azure AI Language (PII detection); Azure Data Lake Storage.
- Compute: Azure Machine Learning Compute Clusters (NC series); spot (low-priority) VM policy.
- Fine-Tuning Framework: Hugging Face PEFT on Azure ML; DeepSpeed for distributed training.
- Training Monitor: Azure ML Experiment Tracking; Azure Monitor custom metrics.
- Evaluation: Azure ML Pipelines running evaluation scripts; Azure OpenAI for evaluation assistance.
15.3 GCP
- Data Preparation: Cloud Dataflow (Apache Beam) for ETL; Cloud Natural Language API for PII detection; Cloud Storage.
- Compute: Vertex AI Custom Training; A2/A3 VM (spot) for GPU training; TPU pods for very large models.
- Fine-Tuning Framework: Hugging Face PEFT on Vertex AI; Vertex AI model garden fine-tuning for supported models.
- Training Monitor: Vertex AI Experiments; Cloud Monitoring custom metrics.
- Evaluation: Vertex AI Pipelines; BigQuery for evaluation results storage.
15.4 On-Premises / Hybrid
- Data Preparation: Apache Spark (on-prem cluster) for ETL; self-hosted Presidio for PII detection; MinIO for dataset storage.
- Compute: On-premises GPU cluster (NVIDIA A100/H100); Kubernetes with GPU operator.
- Fine-Tuning Framework: Hugging Face PEFT + Axolotl; DeepSpeed ZeRO for multi-GPU/multi-node.
- Training Monitor: MLflow Tracking (self-hosted); Prometheus + Grafana for GPU metrics.
- Evaluation: Custom evaluation harness; self-hosted LM Eval.
16. Related Patterns
| Pattern ID | Pattern Name | Relationship Type | Description |
|---|---|---|---|
| EAAPL-MDL001 | Model Versioning | Produces | Fine-tuning pipeline produces new MINOR version artefacts registered per MDL001 |
| EAAPL-MDL002 | Shadow Model Deployment | Next Step | Fine-tuned model candidates enter shadow testing before production promotion |
| EAAPL-MDL007 | Model Compression and Optimisation | Related | Fine-tuned models are often subsequently quantised/compressed for production cost |
| EAAPL-MDL005 | Multi-Model Ensemble | Related | Fine-tuned specialist models are natural candidates for mixture-of-experts ensembles |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Industry Adoption | 4 | LoRA/QLoRA fine-tuning is widely adopted in enterprise; pipelines are maturing |
| Tooling Availability | 4 | Hugging Face PEFT, Axolotl, cloud training services are production-ready |
| Standards Alignment | 4 | EU AI Act Article 10, Privacy Act, ISO 42001 all addressed explicitly |
| Implementation Complexity | 4 (high) | Full pipeline including privacy, safety testing, and governance is complex |
| Regulatory Acceptance | 3 | Privacy and AI governance regulators accept the approach; specific audit evidence requirements still being established |
18. Revision History
| Version | Date | Author | Summary of Changes |
|---|---|---|---|
| 1.0 | 2026-06-12 | Enterprise AI Architecture Practice | Initial publication |