EAAPL-MDL007 — Model Compression and Optimisation
| Attribute | Value |
|---|---|
| Pattern ID | EAAPL-MDL007 |
| Name | Model Compression and Optimisation |
| Maturity | Proven |
| Complexity | High |
| Tags | cost-optimisation llm inference high-complexity |
| Last Reviewed | 2026-06-12 |
| Owner | Enterprise AI Architecture Practice |
1. Executive Summary
Model compression reduces the computational footprint of an AI model (memory usage, inference latency, and cost per query) through techniques including quantisation, knowledge distillation, and pruning. For large language models, compression is not optional at enterprise scale: a 70B-parameter model at 16-bit precision requires about 140GB of GPU memory for its weights alone; INT8 quantisation reduces this to about 70GB with approximately 1% quality loss; INT4 reduces it to about 35GB with 2–5% quality loss. The right compression strategy can reduce inference costs by 50–80% while maintaining quality within acceptable business thresholds. For CIOs, compression is a cost governance tool: without it, LLM serving costs scale prohibitively. For CTOs, compression is an inference engineering discipline in which each technique involves trade-offs that must be evaluated against the specific quality requirements of the application. For risk officers, compression introduces a material model risk: a compressed model is a different model from its uncompressed predecessor. If the original model was validated for regulatory purposes (EU AI Act high-risk, APRA model risk), the compressed version requires re-validation. This pattern provides a benchmarking protocol and re-validation framework to address that requirement.
2. Problem Statement
2.1 Business Problem
LLM inference costs are the primary cost driver for AI applications at scale. A large language model generating 1,000 tokens costs $0.001–$0.03 per query depending on the model. At 10M queries/day, that is $10,000–$300,000 per day. Without compression, organisations either accept these costs, rate-limit usage below business requirements, or limit deployment to use cases where the cost-value ratio is clearly positive. Compression unlocks broader deployment by reducing per-query costs to a level where a wider range of use cases is economically viable.
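The daily figures above are straightforward multiplication, and the same arithmetic shows what a given cost reduction is worth. The sketch below is illustrative only: the function names `daily_inference_cost` and `cost_after_compression` are hypothetical, and a flat per-query cost is an assumption.

```python
def daily_inference_cost(cost_per_query_usd: float, queries_per_day: int) -> float:
    """Daily inference spend in USD, assuming a flat cost per query."""
    return cost_per_query_usd * queries_per_day


def cost_after_compression(daily_cost_usd: float, cost_reduction: float) -> float:
    """Daily spend after applying a fractional cost reduction (0.5 means 50%)."""
    if not 0.0 <= cost_reduction <= 1.0:
        raise ValueError("cost_reduction must be between 0 and 1")
    return daily_cost_usd * (1.0 - cost_reduction)


# Per-query range quoted above ($0.001 to $0.03) at 10M queries/day.
low = daily_inference_cost(0.001, 10_000_000)
high = daily_inference_cost(0.03, 10_000_000)
print(f"Uncompressed: ${low:,.0f} to ${high:,.0f} per day")

# Effect of the 50-80% reduction range on the upper figure.
print(f"At 50% reduction: ${cost_after_compression(high, 0.5):,.0f} per day")
print(f"At 80% reduction: ${cost_after_compression(high, 0.8):,.0f} per day")
```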
2.2 Technical Problem
Foundation models are trained in high precision (BF16 or FP32) for numerical stability during gradient descent. Inference does not require the same precision. The excess precision represents wasted memory bandwidth and compute cycles. However, naively reducing precision degrades model quality — the compression must be calibrated against representative input data to minimise quality loss for the specific application's input distribution.
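The memory figures quoted in the Executive Summary follow from parameter count multiplied by bytes per parameter. A minimal sketch, assuming 1 GB = 10^9 bytes and counting weights only (KV cache, activations, and quantisation metadata such as scales are deliberately excluded); the names `BYTES_PER_PARAM` and `weight_memory_gb` are illustrative:

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("bf16", "int8", "int4"):
    print(f"70B @ {precision}: {weight_memory_gb(70e9, precision):.0f} GB")
```

Real deployments need headroom above these figures, so treat the result as a lower bound when sizing GPUs.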
2.3 Symptoms
- Model serving infrastructure is GPU-constrained by model memory requirements, not compute throughput.
- Inference cost is the primary objection to broader AI deployment within the organisation.
- Model responses have acceptable latency at low load but degrade significantly under peak load due to memory bandwidth limitations.
- The organisation is paying for a larger GPU tier than it would need if the model were compressed.
2.4 Cost of Inaction
| Category | Indicative Impact |
|---|---|
| Cost | 50–80% overpay on inference compute vs compressed equivalents |
| Deployment | Memory constraints limit model deployment to high-cost GPU instances only |
| Scale | Cannot serve peak load without vertical GPU scaling at prohibitive cost |
| Competitiveness | Competitors using compressed models serve the same quality at lower cost or higher volume |
3. Context
3.1 When to Apply
- Models serving high-query-volume applications where inference cost is a primary concern.
- Models that require deployment on memory-constrained infrastructure (edge devices, smaller cloud instances).
- After fine-tuning (EAAPL-MDL006) — the fine-tuned model is the compression target.
- When latency reduction is required and the bottleneck is memory bandwidth rather than compute.
3.2 When NOT to Apply
- Safety-critical outputs: Models making high-stakes decisions (medical diagnosis, credit decisions, safety systems) where even 1% quality degradation is unacceptable — evaluate compression benefits against quality risk carefully.
- High-precision numerical tasks: Models producing numerical outputs (financial calculations, measurements) where precision matters — quantisation error accumulates in numerical computations.
- Models that are already at acceptable cost and latency targets — do not compress without clear business justification.
- Models subject to active regulatory examination where re-validation cost exceeds compression savings.
3.3 Prerequisites
| Prerequisite | Detail |
|---|---|
| Baseline model artefact | Registered, production model version to be compressed (EAAPL-MDL001) |
| Evaluation suite | Full benchmark suite representing production input distribution for quality measurement |
| Calibration dataset | Representative sample of production inputs (1,000–10,000 examples) for quantisation calibration |
| Quality threshold definition | Pre-agreed acceptable quality degradation threshold (e.g., ≤ 2% on primary metric) |
3.4 Industry Applicability
| Industry | Applicability | Primary Driver |
|---|---|---|
| Technology Platforms | Critical | API serving cost; latency; scale |
| Financial Services | High | Cost governance; on-premises deployment security requirement |
| Healthcare | High | Edge deployment for clinical tools; cost management |
| Retail / E-commerce | High | High-volume recommendation; cost-sensitive application economics |
| Government | High | On-premises / sovereign cloud deployment; cost per citizen interaction |
| Manufacturing | Medium | Edge device deployment for on-site inference |
4. Architecture Overview
4.1 Quantisation
Quantisation converts model weights from high-precision floating point (BF16/FP32) to lower-precision integer or reduced-precision float representations.
INT8 (Post-Training Quantisation): Reduces each weight from a 16-bit float to an 8-bit integer. Memory reduction: 2×. Quality loss: approximately 0.5–1% on typical LLM benchmarks. Inference speedup: 1.5–2× on hardware with INT8 acceleration (NVIDIA A100 and H100, and some consumer GPUs). Suitable for most production LLM deployments where minor quality loss is acceptable. Calibration typically uses a representative sample of 512–2,048 inputs, which can be drawn from the calibration dataset listed in Section 3.3.
INT4 / GPTQ / AWQ: Reduces weights to 4-bit. Memory reduction: 4× vs FP16. Quality loss: 2–5% depending on model size (larger models tolerate quantisation better). Inference speedup: 2–4× on compatible hardware. The AutoGPTQ and AutoAWQ libraries implement calibrated 4-bit quantisation that minimises perplexity increase. Use for deployments where memory constraint is the primary driver and 2–5% quality loss is acceptable.
When NOT to quantise: As stated, avoid INT4 for safety-critical, high-precision numerical, or regulated decisions where re-validation cost or quality risk is unacceptable. Always measure quality on the full evaluation suite — do not rely on perplexity as a proxy for task-specific quality.
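To make the mechanics concrete, the following is a minimal sketch of the simplest form of weight quantisation: symmetric, per-tensor INT8 with an absmax scale. It is not how GPTQ or AWQ work (those use calibration data to choose what to preserve), and production libraries quantise per channel or per group; the function names are illustrative.

```python
def quantise_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor INT8 quantisation using an absmax scale.

    Returns the integer codes and the scale needed to recover approximate floats.
    """
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric INT8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantise(codes: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.42, -1.30, 0.05, 0.88, -0.27]
codes, scale = quantise_int8(weights)
restored = dequantise(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.5f}, worst round-trip error={worst_error:.5f}")
```

The round-trip error per weight is bounded by half the scale, which is why a single large outlier weight (a large absmax) degrades precision for every other weight in the tensor.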
4.2 Knowledge Distillation
Knowledge distillation trains a smaller "student" model to mimic the behaviour of a larger "teacher" model. The student is trained on the teacher's output distribution (soft targets) rather than hard labels. This transfers the teacher's generalisation to a smaller architecture.
Architecture requirement: The student architecture is defined before distillation begins. The student is typically 10–30% the size of the teacher (e.g., a 7B student from a 70B teacher). The student must have sufficient capacity to capture the knowledge — a student that is too small cannot learn the teacher's full capability regardless of training budget.
Dataset requirement: Distillation requires a large, diverse dataset — typically the original training data or a similarly broad corpus. The teacher generates soft probability distributions over vocabulary for each training example; the student is trained to match these distributions. Dataset size requirements are similar to the original training run.
Quality ceiling: In practice the teacher sets the ceiling, and a student should not be expected to exceed the teacher's quality on the distilled tasks. Empirically, a well-distilled student achieves 90–95% of the teacher's task performance at 20–30% of the inference cost. For tasks where the teacher's quality exceeds the requirement, this is highly cost-efficient.
Use cases: Distillation is the right approach when the organisation needs a model that is dramatically smaller (not just 2–4× reduction from quantisation) and can invest in the training cost of the distillation run.
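The soft-target objective described above is commonly written as a temperature-scaled KL divergence between the teacher and student distributions. The sketch below shows that loss for a single token position in plain Python; the default temperature of 2.0 is an assumption for illustration, and a real training loop would compute this over batches in a tensor framework and often mix it with a hard-label loss.

```python
import math


def log_softmax(logits: list[float], temperature: float) -> list[float]:
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(
    teacher_logits: list[float],
    student_logits: list[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


teacher = [4.0, 1.5, 0.2]
print(distillation_loss(teacher, [3.8, 1.6, 0.1]))  # student close to teacher
print(distillation_loss(teacher, [0.2, 1.5, 4.0]))  # student far from teacher
```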
4.3 Pruning
Pruning removes weights or neurons that contribute least to model outputs, reducing model size.
Structured pruning removes entire neurons, attention heads, or layers — the resulting model has fewer parameters and can be directly accelerated by standard hardware without special kernels. Quality loss is higher than unstructured pruning for the same size reduction.
Unstructured pruning removes individual weights based on magnitude or gradient importance — the resulting sparse weight tensor requires sparse matrix multiplication support for inference speedup. On standard GPU hardware without sparse acceleration, unstructured pruning reduces model size but does not necessarily reduce inference latency.
Industry evidence for LLM pruning: Research and production evidence (Meta, Google, Microsoft) indicates that LLMs tolerate moderate pruning (up to 20% of parameters) with < 2% quality loss on most benchmarks. Beyond 30% pruning, quality loss becomes significant and recovery via fine-tuning is required. Pruning is less mature than quantisation as a production technique for LLMs as of 2026 — prefer quantisation for production deployments unless there is a specific architectural reason for pruning.
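As an illustration of the unstructured variant, here is a minimal sketch of magnitude-based pruning on a flat list of weights. The function name `magnitude_prune` is hypothetical, and real pruning operates per layer on tensors and is usually followed by recovery fine-tuning at higher sparsity levels.

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    num_to_drop = int(len(weights) * sparsity)
    if num_to_drop == 0:
        return list(weights)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:num_to_drop])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


pruned = magnitude_prune([0.5, -0.1, 0.03, -0.9, 0.2], sparsity=0.4)
print(pruned)
```

The output has the same length as the input, which is the point made above: without sparse kernels, the zeros still occupy positions in dense matrix multiplications, so latency does not necessarily improve.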
4.4 ONNX Export for Inference Optimisation
ONNX (Open Neural Network Exchange) is a cross-platform model representation that enables deployment with optimised runtimes (ONNX Runtime, TensorRT, OpenVINO) that provide hardware-specific inference acceleration independent of the training framework.
Cross-platform benefit: An ONNX-exported model can run on CPUs, GPUs, and specialised hardware accelerators without rewriting serving code. This is particularly valuable for hybrid infrastructure (some workloads on GPU, some on CPU).
TensorRT acceleration: NVIDIA TensorRT converts ONNX models to optimised inference engines for NVIDIA GPUs. For transformer models, TensorRT can provide 2–4× lower latency than PyTorch inference on the same hardware, with near-equivalent quality (subject to the numerical-precision caveat under Limitations).
Limitations: Not all model operations are ONNX-exportable without custom operators. Verify the full model graph exports correctly before relying on ONNX in production. ONNX export must be validated against the quality threshold — graph-level optimisations can occasionally affect numerical precision.
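One way to catch export-induced numerical drift early is to run the same inputs through the native model and the exported model and compare the outputs element-wise. The sketch below shows only the comparison step, in plain Python on already-computed output vectors; the function name and the tolerance defaults are assumptions, and this check complements rather than replaces the full evaluation suite.

```python
def outputs_match(
    reference: list[float],
    candidate: list[float],
    atol: float = 1e-4,
    rtol: float = 1e-3,
) -> bool:
    """True if every candidate value is within atol + rtol * |reference| of the reference."""
    if len(reference) != len(candidate):
        return False
    return all(
        abs(r - c) <= atol + rtol * abs(r)
        for r, c in zip(reference, candidate)
    )


# Hypothetical logits from the native runtime and from the exported graph.
native_logits = [2.3101, -0.4420, 0.0031]
exported_logits = [2.3103, -0.4419, 0.0030]
print(outputs_match(native_logits, exported_logits))
```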
4.5 Benchmarking Protocol Before and After Compression
The benchmarking protocol must satisfy one requirement: statistical confidence that quality degradation is within the pre-agreed threshold. The protocol:
- Run the full evaluation suite on the baseline model, recording all metric values.
- Apply the compression technique to produce the compressed model artefact.
- Run the full evaluation suite on the compressed model.
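- Compute the quality degradation for each metric: (baseline_score - compressed_score) / baseline_score. This form assumes higher scores are better; for lower-is-better metrics such as perplexity, reverse the sign of the numerator.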
- Apply a statistical significance test (bootstrap confidence interval on evaluation metrics).
- Compare degradation to the pre-agreed threshold.
- If all metrics are within threshold: record evaluation results and proceed to PATCH version registration.
- If any metric exceeds threshold: reject compression; document result; explore alternative technique or adjust compression parameters.
The evaluation must use the full evaluation suite, not a sample. Compression techniques can cause disproportionate degradation on rare input types or long-tail tasks that are underrepresented in small sample evaluations.
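The degradation and significance steps of the protocol can be sketched as follows. This is one possible reading, not a prescribed implementation: it assumes a higher-is-better metric with per-example scores available for both models on the same examples, uses a paired percentile bootstrap, and applies a conservative rule that the upper bound of the interval must sit within the threshold. All function names and defaults are illustrative.

```python
import random


def relative_degradation(baseline_score: float, compressed_score: float) -> float:
    """(baseline - compressed) / baseline for a higher-is-better metric."""
    return (baseline_score - compressed_score) / baseline_score


def bootstrap_degradation_ci(
    baseline_scores: list[float],
    compressed_scores: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the relative degradation of the mean score.

    Examples are resampled in pairs so both models are scored on the same draw.
    """
    if len(baseline_scores) != len(compressed_scores):
        raise ValueError("score lists must be paired and of equal length")
    rng = random.Random(seed)
    n = len(baseline_scores)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        base_mean = sum(baseline_scores[i] for i in idx) / n
        comp_mean = sum(compressed_scores[i] for i in idx) / n
        if base_mean > 0:  # skip degenerate resamples with a zero baseline
            stats.append((base_mean - comp_mean) / base_mean)
    if not stats:
        raise ValueError("baseline mean was zero in every resample")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


def within_threshold(ci: tuple[float, float], threshold: float) -> bool:
    """Conservative gate: the upper CI bound must not exceed the threshold."""
    return ci[1] <= threshold
```

A typical use would call `bootstrap_degradation_ci` once per metric and reject the compression run if `within_threshold` is false for any of them, matching the all-metrics rule in the protocol.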
4.6 Compliance Consideration for Regulated Models
If the original model was validated under the EU AI Act, APRA prudential standards (such as CPS 234), or any other regulatory framework, the compressed model requires re-validation. Compression changes the model's weights and internal representations, so the compressed model should be treated as a distinct model for validation purposes. The re-validation must demonstrate that the compressed model's regulatory-relevant performance (accuracy, fairness, explainability) has not been materially degraded by compression. The re-validation evidence must be attached to the compressed model's PATCH version artefact bundle.
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Calibration Dataset Store | Data Store | Stores representative production samples for quantisation calibration | S3, Azure Blob, GCS; curated sample per model | High |
| Quantisation Engine | Pipeline Stage | Applies INT8/INT4 post-training quantisation with calibration | AutoAWQ, AutoGPTQ, bitsandbytes, TensorRT PTQ | High |
| Distillation Training | Pipeline Stage | Trains student model using teacher soft targets | Hugging Face Transformers Trainer with a custom loss, custom PyTorch training loop | High |
| Pruning Engine | Pipeline Stage | Identifies and removes low-importance weights/structures | llm-pruner, custom structured pruning | Medium |
| ONNX Exporter | Pipeline Stage | Exports model to ONNX; validates graph completeness | torch.onnx.export, Optimum, TensorRT | Medium |
| Evaluation Harness | Platform Service | Runs full benchmark suite on both baseline and compressed models | Eleuther LM Eval Harness, custom evaluation suite | Critical |
| Compression Result Store | Data Store | Records benchmark results for all compression experiments | MLflow, Model Register (EAAPL-MDL001), S3 | High |
7. Data Flow
7.1 Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | ML Engineer | Selects compression technique based on target (memory/latency/cost) and model type | Compression configuration document |
| 2 | Calibration Stage | Samples production inference logs; prepares calibration dataset | Calibration dataset (1K–10K representative examples) |
| 3 | Compression Engine | Applies technique (quantisation/distillation/pruning/ONNX) | Compressed model artefact |
| 4 | Evaluation Harness | Runs full benchmark suite on baseline model | Baseline evaluation results |
| 5 | Evaluation Harness | Runs full benchmark suite on compressed model | Compressed evaluation results |
| 6 | Benchmark Comparator | Computes degradation per metric; statistical significance test | Degradation report: within threshold or exceeds threshold |
| 7 | Regulated Model Check | Determines if model requires regulatory re-validation | Re-validation required or not required |
| 8 | Regulatory Re-validation | If required: runs regulatory-specific evaluation suite | Re-validation pass or fail |
| 9 | Artefact Bundler | Packages compressed model + evaluation results; registers as PATCH version | PATCH version in Model Register |
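Steps 6 to 9 of the flow amount to a small decision procedure. The sketch below encodes it under stated assumptions: the `CompressionRun` type, its field names, and the returned status strings are all hypothetical, and the human approval gate from Section 9.3 is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompressionRun:
    within_threshold: bool                       # outcome of step 6
    regulated: bool                              # outcome of step 7
    revalidation_passed: Optional[bool] = None   # outcome of step 8, if run


def gate_decision(run: CompressionRun) -> str:
    """Return the next action for a compression run."""
    if not run.within_threshold:
        return "reject"
    if run.regulated:
        if run.revalidation_passed is None:
            return "await_revalidation"
        if not run.revalidation_passed:
            return "reject"
    return "register_patch"


print(gate_decision(CompressionRun(within_threshold=True, regulated=False)))
print(gate_decision(CompressionRun(within_threshold=True, regulated=True)))
```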
7.2 Error Flow
| Error Scenario | Detection | Recovery Action |
|---|---|---|
| Quality metric exceeds degradation threshold | Benchmark comparator threshold check | Document result; try alternative compression parameters; try alternative technique |
| Calibration dataset not representative | High perplexity on production inputs post-compression | Recollect calibration dataset from broader sample; re-quantise |
| ONNX export graph validation fails | ONNX validator error | Debug unsupported operations; add custom ONNX operators or skip ONNX |
| Re-validation failure for regulated model | Regulatory evaluation suite threshold | Reject compressed version; increase calibration quality; revert to baseline |
| Distillation student underfits | Student eval significantly below threshold | Increase student capacity; extend training; revise architecture |
8. Security Considerations
8.1 Controls Summary
| Domain | Control |
|---|---|
| Authentication | Compression pipeline service account with same scope as training pipeline; no broader access |
| Authorisation | Compressed model PATCH version requires same approval workflow as other version changes |
| Secrets | Calibration data may contain real user inputs — same classification and access controls as production data |
| Classification | Compressed model artefact classified at same level as baseline (compression does not reduce sensitivity) |
| Encryption | Calibration dataset and compressed artefact encrypted at rest and in transit |
| Auditability | Compression technique, parameters, calibration dataset version, and benchmark results all logged to artefact bundle |
8.2 OWASP LLM Top 10 Relevance
| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM01 Prompt Injection | Medium | Compressed models may have different robustness characteristics to adversarial inputs — include adversarial inputs in evaluation |
| LLM02 Insecure Output Handling | Low | Output handling is unchanged by compression |
| LLM03 Training Data Poisoning | Low | Compression does not introduce new training data; calibration dataset is not training data |
| LLM04 Model Denial of Service | Medium | Compression may change memory usage patterns — validate that compressed model DoS thresholds are understood |
| LLM05 Supply Chain Vulnerabilities | High | Compression libraries (AutoGPTQ, TensorRT) are dependencies with their own supply chain risks; pin versions |
| LLM06 Sensitive Information Disclosure | Medium | Compressed models may memorise training data differently — include memorisation probe in evaluation |
| LLM07 Insecure Plugin Design | Low | Not affected by compression |
| LLM08 Excessive Agency | Low | Not affected by compression |
| LLM09 Overreliance | High | If compression degrades quality at the tail of the distribution, users may over-rely on a model performing worse than they expect |
| LLM10 Model Theft | Medium | Compressed models are smaller and easier to exfiltrate; artefact access controls remain critical |
9. Governance Considerations
9.1 Responsible AI
Compression can disproportionately affect model quality for minority subgroups: a model that performs similarly overall may show larger quality degradation for underrepresented groups. Fairness evaluation after compression is mandatory: compare subgroup performance metrics (where applicable) between the baseline and compressed models. Degradation beyond the agreed threshold for any subgroup is a blocking condition.
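A disaggregated check of this kind can be sketched as below. It assumes a higher-is-better metric already computed per subgroup for both models; the function name, the subgroup labels, and the scores are hypothetical.

```python
def subgroup_violations(
    baseline: dict[str, float],
    compressed: dict[str, float],
    threshold: float,
) -> dict[str, float]:
    """Return subgroups whose relative degradation exceeds the threshold."""
    violations = {}
    for group, base_score in baseline.items():
        degradation = (base_score - compressed[group]) / base_score
        if degradation > threshold:
            violations[group] = degradation
    return violations


# Hypothetical per-subgroup accuracy before and after compression.
baseline_scores = {"group_a": 0.90, "group_b": 0.80}
compressed_scores = {"group_a": 0.89, "group_b": 0.72}
blocked = subgroup_violations(baseline_scores, compressed_scores, threshold=0.02)
print(blocked or "no subgroup exceeds the threshold")
```

In this made-up example the overall average moves only a few points, yet `group_b` degrades by about 10%, which is the situation the blocking condition is meant to catch.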
9.2 Model Risk Management
A compressed model is a PATCH version change per EAAPL-MDL001. For models registered in the MRM framework, the PATCH version requires a model risk validation event demonstrating that the compression has not materially changed risk-relevant model behaviour. This is a lighter-weight validation than a MINOR version but is not zero.
9.3 Human Approval Gates
Compression results (benchmark degradation report) must be reviewed and approved by the model owner before the PATCH version is registered. For regulated models, the AI Governance function reviews the re-validation evidence. No compression technique may be applied silently — every compression is a versioned, approved change.
9.4 Governance Artefacts
| Artefact | Owner | Frequency | Location |
|---|---|---|---|
| Compression Configuration Record | ML Engineer | Per compression run | Artefact bundle |
| Benchmark Degradation Report | ML Engineer | Per compression run | Artefact bundle + Model Register |
| Calibration Dataset Reference | ML Engineer | Per compression run | Artefact bundle (dataset version hash) |
| Regulatory Re-validation Record | AI Governance | Per regulated model | Model governance record |
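The per-run artefacts in the table can be captured in a single serialisable record. The following is a sketch only: the `CompressionRecord` schema, its field names, and the hashing scheme are assumptions, not a format defined by this pattern or by EAAPL-MDL001.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CompressionRecord:
    baseline_version: str
    technique: str
    parameters: dict
    calibration_dataset_sha256: str
    benchmark_results: dict


def dataset_hash(examples: list[str]) -> str:
    """SHA-256 over the calibration examples in order, newline-delimited."""
    digest = hashlib.sha256()
    for example in examples:
        digest.update(example.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def to_json(record: CompressionRecord) -> str:
    """Serialise the record with sorted keys so diffs between runs are stable."""
    return json.dumps(asdict(record), sort_keys=True, indent=2)


record = CompressionRecord(
    baseline_version="1.4.0",  # hypothetical version string
    technique="int8_ptq",
    parameters={"calibration_samples": 1024},
    calibration_dataset_sha256=dataset_hash(["example input 1", "example input 2"]),
    benchmark_results={"primary_metric_degradation": 0.008},
)
print(to_json(record))
```

Storing the dataset hash rather than the data keeps user inputs out of the artefact bundle while still letting an auditor confirm which calibration set was used.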
10. Operational Considerations
10.1 SLOs
| SLO | Target | Measurement Method |
|---|---|---|
| Compression pipeline duration | < 4 hours (INT8); < 24 hours (distillation) | Pipeline end-to-end timing |
| Benchmark completeness (vs baseline) | 100% of metrics | Benchmark result coverage check |
| Quality degradation threshold | Defined per application | Benchmark comparator report |
| Inference latency improvement post-compression | ≥ 20% (p99) | Load test comparison baseline vs compressed |
10.2 Monitoring and Logging
Post-deployment monitoring for compressed models mirrors production monitoring with one addition: a monthly quality drift check. If the quality metric for the compressed model drifts over time at a rate faster than the baseline model, this may indicate that the compression has created sensitivity to data distribution shift that was not present in the original model.
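The monthly drift check can be reduced to comparing trend slopes. The sketch below assumes one quality measurement per month for each model, and that the baseline model's metric is still being tracked somewhere (for example through periodic offline evaluation), which this pattern does not specify. Function names and the sample values are illustrative.

```python
def slope(values: list[float]) -> float:
    """Least-squares slope of values against their index (one point per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def drifting_faster(
    baseline_history: list[float],
    compressed_history: list[float],
    tolerance: float = 0.0,
) -> bool:
    """True if the compressed model's metric declines faster than the baseline's."""
    return slope(compressed_history) < slope(baseline_history) - tolerance


# Hypothetical monthly quality scores.
baseline_monthly = [0.90, 0.90, 0.89, 0.89]
compressed_monthly = [0.89, 0.88, 0.86, 0.85]
print(drifting_faster(baseline_monthly, compressed_monthly))
```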
10.3 Incident Response
If a compressed model shows unexpected quality degradation in production (post-deployment), the standard rollback procedure (EAAPL-MDL004) applies: rollback to the last-known-good version (the uncompressed or previous PATCH). The incident triggers a review of the compression benchmark — was the quality degradation visible in the evaluation and accepted, or was it not detected?
10.4 Disaster Recovery
| Scenario | RPO | RTO | Recovery Procedure |
|---|---|---|---|
| Compressed model quality failure | N/A | < 5 min | Rollback to uncompressed version per EAAPL-MDL004 |
| Compression pipeline failure | N/A | 4 hours | Restart pipeline from last successful stage; no production impact |
10.5 Capacity Planning
Compression reduces memory requirements, which enables deployment on smaller (cheaper) GPU instances or increases the number of models per GPU instance. After compression, re-evaluate the serving infrastructure sizing: for example, if INT8 quantisation lets a model move from an A100-80GB to an A100-40GB, the GPU memory that must be provisioned halves. Model this before procurement to capture the cost saving.
11. Cost Considerations
11.1 Cost Drivers
| Driver | Description | Relative Impact |
|---|---|---|
| Compression compute | GPU time for calibration-based quantisation; CPU for post-processing | Low-Medium |
| Distillation training compute | Full training run for student model; comparable to fine-tuning compute | High |
| Evaluation compute | Running full benchmark suite on both baseline and compressed models | Medium |
| Engineering labour | Selecting, configuring, and validating compression; pipeline development | Medium |
| Regulatory re-validation | Additional evaluation cost for regulated models | Medium |
11.2 Scaling Risks
Distillation training cost scales with student model size and training dataset size — cost is comparable to fine-tuning. For organisations doing distillation of very large teachers (70B+), budget accordingly. Quantisation compute is much lower (typically < 1 hour for INT8 calibration of a 7B model).
11.3 Optimisations
- Start with INT8 post-training quantisation: highest quality-to-effort ratio; no training required.
- Use AutoAWQ or AutoGPTQ for INT4 before considering distillation: 4-bit quantisation typically loses 2–5% quality (Section 4.1) against 5–10% for distillation (Section 12.1), at a fraction of the compute.
- Profile the actual memory and latency bottleneck before choosing a technique: if the bottleneck is compute (not memory bandwidth), quantisation may have limited benefit.
11.4 Indicative Cost Range and Savings
| Technique | Compression Cost (One-Time) | Inference Cost Reduction | Break-Even Volume (queries/month) |
|---|---|---|---|
| INT8 PTQ | $50–$500 | 30–50% | ~100K queries/month |
| INT4 GPTQ/AWQ | $100–$1,000 | 50–70% | ~200K queries/month |
| Distillation | $1,000–$50,000 | 60–80% (smaller model) | ~1M queries/month |
| ONNX + TensorRT | $200–$2,000 (export + validation) | 30–50% latency improvement | ~500K queries/month |
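A break-even volume can be estimated by dividing the one-time compression cost by the saving per query. The sketch below shows that calculation; it is not necessarily how the table's figures were derived, and the input values in the usage lines are illustrative assumptions.

```python
def break_even_queries(
    one_time_cost_usd: float,
    baseline_cost_per_query_usd: float,
    cost_reduction: float,
) -> float:
    """Number of queries after which cumulative savings cover the one-time cost."""
    saving_per_query = baseline_cost_per_query_usd * cost_reduction
    if saving_per_query <= 0:
        raise ValueError("compression must produce a positive per-query saving")
    return one_time_cost_usd / saving_per_query


# Illustrative inputs: $500 one-time cost, $0.01 per query, 50% reduction.
queries = break_even_queries(500.0, 0.01, 0.5)
print(f"Break-even after about {queries:,.0f} queries")
```

Because the saving scales with the baseline per-query cost, cheaper baseline models need proportionally more volume to justify the same compression effort.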
12. Trade-Off Analysis
12.1 Compression Technique Comparison
| Technique | Memory Reduction | Quality Loss | Compute Cost | Training Required | Regulatory Complexity | Best For |
|---|---|---|---|---|---|---|
| INT8 PTQ | 2× | ~1% | Negligible | No | Low (PATCH re-eval) | Most production LLMs |
| INT4 GPTQ/AWQ | 4× | 2–5% | < 1 hour | No | Medium | Memory-constrained; cost-sensitive |
| Knowledge Distillation | 3–10× | 5–10% | High (training) | Yes | High (new model) | Very large cost reduction needed |
| Structured Pruning | 1.2–1.5× | 2–5% | Medium | Often (fine-tune after) | Medium | LLM-specific architectural optimisation |
| ONNX + TensorRT | None (latency only) | < 0.5% | Medium | No | Low | Latency reduction on NVIDIA hardware |
12.2 Architectural Tensions
| Tension | Description | Resolution |
|---|---|---|
| Cost vs Quality | More aggressive compression saves more money but risks quality below threshold | Define quality threshold first; find the most aggressive compression within threshold |
| Speed to Value vs Thoroughness | Full benchmark on large model takes hours; teams want quick validation | Implement tiered evaluation: fast eval (key metrics only) for go/no-go; full eval before registration |
| Regulatory Certainty vs Cost | Re-validation adds cost and time; skipping it adds regulatory risk | Tier re-validation by model risk classification; PATCH re-validation is lighter than MINOR |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Calibration dataset not representative | Medium | High | Perplexity spike on production tail inputs | Re-collect calibration; re-quantise; extend evaluation coverage |
| Quality degradation exceeds threshold for a specific subgroup | Medium | High | Disaggregated benchmark analysis | Accept subgroup degradation only with documented risk acceptance; otherwise re-compress |
| ONNX export produces incorrect numerical outputs | Low | High | Numerical validation test in evaluation | Debug graph export; skip ONNX for this model; use native runtime |
| Compression library version regression | Low | Medium | CI build test comparing quantised model quality | Pin compression library versions; test on version upgrade |
| Compressed model fails in production but passed evaluation | Low | Medium | Production monitoring quality drift | Rollback; expand evaluation to cover production distribution gap |
13.1 Cascading Failure Scenarios
If a compressed model is deployed without adequate quality regression testing and serves as the base for a subsequent fine-tuning run (EAAPL-MDL006), the fine-tuned model inherits the quality degradation of the compressed base. Mitigation: fine-tuning pipelines must use uncompressed (full-precision) base models; compression is always a post-training PATCH applied to the final model, never a step applied to the base model before fine-tuning.
14. Regulatory Considerations
| Regulation / Framework | Relevant Clause | How This Pattern Addresses It |
|---|---|---|
| EU AI Act (2024/1689) | Article 9 (Risk Management) — changes to high-risk AI require re-assessment | Compressed model is a PATCH version requiring re-validation; process is documented |
| EU AI Act (2024/1689) | Article 15 (Accuracy, Robustness) — compression cannot materially degrade accuracy | Benchmark degradation report quantifies accuracy impact; threshold enforces Article 15 |
| ISO 42001:2023 | Clause 8.4 (Verification and validation) — all lifecycle changes require validation | Benchmarking protocol before and after compression is the validation event |
| NIST AI RMF (2023) | MANAGE 1.3 (Responses to risks from AI system changes) | Compression is a documented change with risk evidence (benchmark report) |
| APRA CPS 234 (2019) | Paragraph 15 (Change management for information assets including AI models) | PATCH version with benchmark evidence satisfies APRA change management requirement |
| Privacy Act 1988 (Cth) | APP 11 (Security) — calibration dataset security | Calibration dataset contains real user inputs; must be secured at production data classification |
15. Reference Implementations
15.1 AWS
- Quantisation: Amazon SageMaker Neo (hardware-aware optimisation); bitsandbytes on SageMaker Processing Job; AutoGPTQ on GPU instance.
- Distillation: SageMaker Training Jobs (same pipeline as fine-tuning, student architecture).
- ONNX Export: SageMaker Processing Job with PyTorch ONNX export; Triton Inference Server on SageMaker for TensorRT.
- Evaluation: SageMaker Processing Jobs for benchmark runs.
- Artefact Storage: S3 (PATCH version bundle).
15.2 Azure
- Quantisation: Azure ML with bitsandbytes/AutoAWQ compute job; Microsoft NNI (Neural Network Intelligence, an open-source toolkit) for hardware-aware compression.
- Distillation: Azure ML Training (same pipeline as fine-tuning).
- ONNX Export: Azure ML with Optimum (Hugging Face); ONNX Runtime inference on Azure.
- Evaluation: Azure ML Pipelines for benchmark runs.
- Artefact Storage: Azure Blob Storage (PATCH version bundle).
15.3 GCP
- Quantisation: Vertex AI Custom Training with bitsandbytes/AutoAWQ; Vertex AI Model Garden compression tools.
- Distillation: Vertex AI Training (student model).
- ONNX Export: Vertex AI Custom Training with Optimum; TensorRT on T4/A100 VMs.
- Evaluation: Vertex AI Pipelines.
- Artefact Storage: GCS (PATCH version bundle).
15.4 On-Premises / Hybrid
- Quantisation: bitsandbytes, AutoGPTQ, AutoAWQ on on-premises GPU; llama.cpp for GGUF quantisation on CPU-capable inference.
- Distillation: Standard PyTorch training on GPU cluster (same as fine-tuning pipeline).
- ONNX Export: torch.onnx.export + Optimum; TensorRT for NVIDIA GPU optimisation.
- Evaluation: Eleuther LM Eval Harness on self-hosted compute.
- Artefact Storage: MinIO S3-compatible storage.
16. Related Patterns
| Pattern ID | Pattern Name | Relationship Type | Description |
|---|---|---|---|
| EAAPL-MDL001 | Model Versioning | Produces | Compression produces a new PATCH version; versioning infrastructure records it |
| EAAPL-MDL006 | Fine-Tuning Pipeline | Predecessor | Fine-tuned model is typically the input to compression; compression is post-training |
| EAAPL-MDL002 | Shadow Model Deployment | Next Step | Compressed model candidates enter shadow testing to validate production quality |
| EAAPL-MDL004 | Model Rollback | Safety Net | Rollback to uncompressed version is the recovery when compressed model fails in prod |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Industry Adoption | 4 | INT8/INT4 quantisation is production-standard; GPTQ/AWQ widely deployed |
| Tooling Availability | 4 | AutoGPTQ, AutoAWQ, bitsandbytes, TensorRT are production-ready libraries |
| Standards Alignment | 3 | EU AI Act re-validation requirement is clear; specific evaluation standards still developing |
| Implementation Complexity | 4 (high) | Calibration, evaluation protocol, regulatory re-validation add complexity |
| Regulatory Acceptance | 3 | Accepted as a model change; re-validation process requirements still being established by regulators |
18. Revision History
| Version | Date | Author | Summary of Changes |
|---|---|---|---|
| 1.0 | 2026-06-12 | Enterprise AI Architecture Practice | Initial publication |