EAAPL-MDL007 — Model Compression and Optimisation

Attribute	Value
Pattern ID	EAAPL-MDL007
Name	Model Compression and Optimisation
Maturity	Proven
Complexity	High
Tags	`cost-optimisation` `llm` `inference` `high-complexity`
Last Reviewed	2026-06-12
Owner	Enterprise AI Architecture Practice

1. Executive Summary

Model compression reduces the computational footprint of an AI model (memory usage, inference latency, and cost per query) through techniques including quantisation, knowledge distillation, and pruning. For large language models, compression is not optional at enterprise scale: a 70B parameter model in 16-bit precision requires 140GB of GPU memory for its weights alone; INT8 quantisation reduces this to 70GB with approximately 1% quality loss; INT4 reduces it to 35GB with 2–5% quality loss. The right compression strategy can reduce inference costs by 50–80% while maintaining quality within acceptable business thresholds.

For CIOs, compression is a cost governance tool: without it, LLM serving costs scale prohibitively. For CTOs, compression is an inference engineering discipline: each technique involves trade-offs that must be evaluated against the specific quality requirements of the application. For risk officers, compression introduces a material model risk: a compressed model is a different model from its uncompressed predecessor. If the original model was validated for regulatory purposes (EU AI Act high-risk, APRA model risk), the compressed version requires re-validation. This pattern provides a benchmarking protocol and re-validation framework to address that requirement.
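The memory figures above follow from simple arithmetic on parameter count and bits per weight. The sketch below reproduces them; the function name is illustrative, and it counts weights only (KV cache, activations, and runtime overhead are assumed to be out of scope).

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


# 70B parameters at 16-bit, INT8, and INT4 precision.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```

Real deployments need headroom beyond these figures, so treat the result as a lower bound when sizing GPU instances.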

2. Problem Statement

2.1 Business Problem

LLM inference costs are the primary cost driver for AI applications at scale. A large language model generating 1,000 tokens costs $0.001–$0.03 per query depending on the model. At 10M queries/day, that is $10,000–$300,000 per day. Without compression, organisations either accept these costs, rate-limit usage below business requirements, or limit deployment to use cases where the cost-value ratio is clearly positive. Compression unlocks broader deployment by reducing per-query costs to a level where a wider range of use cases is economically viable.
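The daily cost range can be checked with a few lines of arithmetic. This is a minimal sketch using the per-query prices and query volume quoted above together with the 50–80% reduction range from the Executive Summary; the function names are illustrative.

```python
def daily_inference_cost(queries_per_day: int, cost_per_query: float) -> float:
    """Daily spend in dollars for a given volume and unit cost."""
    return queries_per_day * cost_per_query


def cost_after_compression(daily_cost: float, reduction: float) -> float:
    """Daily spend after a fractional cost reduction (0.5 means 50%)."""
    return daily_cost * (1.0 - reduction)


QUERIES_PER_DAY = 10_000_000
for unit_cost in (0.001, 0.03):
    base = daily_inference_cost(QUERIES_PER_DAY, unit_cost)
    best = cost_after_compression(base, 0.8)
    worst = cost_after_compression(base, 0.5)
    print(f"${unit_cost}/query: ${base:,.0f}/day, "
          f"${best:,.0f} to ${worst:,.0f}/day after compression")
```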

2.2 Technical Problem

Foundation models are trained in high precision (BF16 or FP32) for numerical stability during gradient descent. Inference does not require the same precision. The excess precision represents wasted memory bandwidth and compute cycles. However, naively reducing precision degrades model quality — the compression must be calibrated against representative input data to minimise quality loss for the specific application's input distribution.

2.3 Symptoms

Model serving infrastructure is GPU-constrained by model memory requirements, not compute throughput.
Inference cost is the primary objection to broader AI deployment within the organisation.
Model responses have acceptable latency at low load but degrade significantly under peak load due to memory bandwidth limitations.
The organisation is paying for a larger GPU tier than would be needed if the model were compressed.

2.4 Cost of Inaction

Category	Indicative Impact
Cost	50–80% overpay on inference compute vs compressed equivalents
Deployment	Memory constraints limit model deployment to high-cost GPU instances only
Scale	Cannot serve peak load without vertical GPU scaling at prohibitive cost
Competitiveness	Competitors using compressed models serve the same quality at lower cost or higher volume

3. Context

3.1 When to Apply

Models serving high-query-volume applications where inference cost is a primary concern.
Models that require deployment on memory-constrained infrastructure (edge devices, smaller cloud instances).
After fine-tuning (EAAPL-MDL006) — the fine-tuned model is the compression target.
When latency reduction is required and the bottleneck is memory bandwidth rather than compute.

3.2 When NOT to Apply

Safety-critical outputs: Models making high-stakes decisions (medical diagnosis, credit decisions, safety systems) where even 1% quality degradation is unacceptable — evaluate compression benefits against quality risk carefully.
High-precision numerical tasks: Models producing numerical outputs (financial calculations, measurements) where precision matters — quantisation error accumulates in numerical computations.
Models that are already at acceptable cost and latency targets — do not compress without clear business justification.
Models subject to active regulatory examination where re-validation cost exceeds compression savings.

3.3 Prerequisites

Prerequisite	Detail
Baseline model artefact	Registered, production model version to be compressed (EAAPL-MDL001)
Evaluation suite	Full benchmark suite representing production input distribution for quality measurement
Calibration dataset	Representative sample of production inputs (1,000–10,000 examples) for quantisation calibration
Quality threshold definition	Pre-agreed acceptable quality degradation threshold (e.g., ≤ 2% on primary metric)

3.4 Industry Applicability

Industry	Applicability	Primary Driver
Technology Platforms	Critical	API serving cost; latency; scale
Financial Services	High	Cost governance; on-premises deployment security requirement
Healthcare	High	Edge deployment for clinical tools; cost management
Retail / E-commerce	High	High-volume recommendation; cost-sensitive application economics
Government	High	On-premises / sovereign cloud deployment; cost per citizen interaction
Manufacturing	Medium	Edge device deployment for on-site inference

4. Architecture Overview

4.1 Quantisation

Quantisation converts model weights from high-precision floating point (BF16/FP32) to lower-precision integer or reduced-precision float representations.

INT8 (Post-Training Quantisation): Reduces each weight from a 16-bit float to an 8-bit integer. Memory reduction: 2×. Quality loss: approximately 0.5–1% on typical LLM benchmarks. Inference speedup: 1.5–2× on hardware with INT8 acceleration (for example NVIDIA A100, H100, and recent consumer GPUs). Suitable for most production LLM deployments where minor quality loss is acceptable. Calibration requires a representative sample of 512–2,048 inputs.

INT4 / GPTQ / AWQ: Reduces weights to 4-bit. Memory reduction: 4× vs FP16. Quality loss: 2–5% depending on model size (larger models tolerate quantisation better). Inference speedup: 2–4× on compatible hardware. The AutoGPTQ and AutoAWQ libraries implement calibrated 4-bit quantisation that minimises perplexity increase. Use for deployments where memory constraint is the primary driver and 2–5% quality loss is acceptable.

When NOT to quantise: As noted in Section 3.2, avoid INT4 for safety-critical, high-precision numerical, or regulated decisions where re-validation cost or quality risk is unacceptable. Always measure quality on the full evaluation suite; do not rely on perplexity as a proxy for task-specific quality.
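To make the mechanics concrete, the following is a minimal sketch of symmetric absmax INT8 quantisation on a small list of values. It is a simplification under stated assumptions: production libraries such as those named above typically use per-channel or per-group scales and use calibration data in more sophisticated ways, and the function names here are illustrative.

```python
def calibrate_scale(values: list[float]) -> float:
    """Symmetric absmax scale: the largest magnitude maps to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127 if max_abs > 0 else 1.0


def quantise(values: list[float], scale: float) -> list[int]:
    """Round each value to the nearest INT8 step, clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantise(quantised: list[int], scale: float) -> list[float]:
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in quantised]


weights = [0.024, -0.5, 0.313, 1.27, -1.0]
scale = calibrate_scale(weights)
codes = quantise(weights, scale)
restored = dequantise(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max rounding error = {max_error:.4f}")
```

The rounding error per value is bounded by half a quantisation step; a wider value range (larger scale) therefore means coarser steps, which is why representative calibration inputs matter.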

4.2 Knowledge Distillation

Knowledge distillation trains a smaller "student" model to mimic the behaviour of a larger "teacher" model. The student is trained on the teacher's output distribution (soft targets) rather than hard labels. This transfers the teacher's generalisation to a smaller architecture.

Architecture requirement: The student architecture is defined before distillation begins. The student is typically 10–30% the size of the teacher (e.g., a 7B student from a 70B teacher). The student must have sufficient capacity to capture the knowledge — a student that is too small cannot learn the teacher's full capability regardless of training budget.

Dataset requirement: Distillation requires a large, diverse dataset — typically the original training data or a similarly broad corpus. The teacher generates soft probability distributions over vocabulary for each training example; the student is trained to match these distributions. Dataset size requirements are similar to the original training run.

Quality ceiling: Knowledge distillation has a quality ceiling: the student cannot exceed the teacher's quality. Empirically, a well-distilled student achieves 90–95% of the teacher's task performance at 20–30% of the inference cost. For tasks where the teacher's quality exceeds the requirement, this is highly cost-efficient.

Use cases: Distillation is the right approach when the organisation needs a model that is dramatically smaller (not just 2–4× reduction from quantisation) and can invest in the training cost of the distillation run.
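The soft-target objective described above can be sketched as a KL divergence between temperature-softened teacher and student distributions. This is one common formulation, not necessarily the one a given training framework uses; the temperature value and the function names are assumptions for illustration.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: list[float],
                      student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]
print(distillation_loss(teacher, [3.5, 1.2, 0.4]))  # small: student is close
print(distillation_loss(teacher, [0.5, 1.0, 4.0]))  # large: student disagrees
```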

4.3 Pruning

Pruning removes weights or neurons that contribute least to model outputs, reducing model size.

Structured pruning removes entire neurons, attention heads, or layers — the resulting model has fewer parameters and can be directly accelerated by standard hardware without special kernels. Quality loss is higher than unstructured pruning for the same size reduction.

Unstructured pruning removes individual weights based on magnitude or gradient importance — the resulting sparse weight tensor requires sparse matrix multiplication support for inference speedup. On standard GPU hardware without sparse acceleration, unstructured pruning reduces model size but does not necessarily reduce inference latency.

Industry evidence for LLM pruning: Research and production evidence (Meta, Google, Microsoft) indicates that LLMs tolerate moderate pruning (up to 20% of parameters) with < 2% quality loss on most benchmarks. Beyond 30% pruning, quality loss becomes significant and recovery via fine-tuning is required. Pruning is less mature than quantisation as a production technique for LLMs as of 2026 — prefer quantisation for production deployments unless there is a specific architectural reason for pruning.
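As an illustration of unstructured magnitude pruning, the sketch below zeroes the smallest-magnitude fraction of a flat weight list. It is a toy under stated assumptions: real pruning operates on tensors, may use gradient-based importance instead of magnitude, and the function name is hypothetical.

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Zero the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6, 0.1, -0.3]
print(magnitude_prune(weights, 0.2))
```

Note that the output list has the same length as the input: the zeros only save memory or latency if the storage format and kernels exploit sparsity, which is the hardware caveat described above.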

4.4 ONNX Export for Inference Optimisation

ONNX (Open Neural Network Exchange) is a cross-platform model representation that enables deployment with optimised runtimes (ONNX Runtime, TensorRT, OpenVINO) that provide hardware-specific inference acceleration independent of the training framework.

Cross-platform benefit: An ONNX-exported model can run on CPUs, GPUs, and specialised hardware accelerators without rewriting serving code. This is particularly valuable for hybrid infrastructure (some workloads on GPU, some on CPU).

TensorRT acceleration: NVIDIA TensorRT converts ONNX models to optimised inference engines for NVIDIA GPUs. For transformer models, TensorRT provides 2–4× latency reduction vs PyTorch inference on the same hardware, with equivalent quality.

Limitations: Not all model operations are ONNX-exportable without custom operators. Verify the full model graph exports correctly before relying on ONNX in production. ONNX export must be validated against the quality threshold — graph-level optimisations can occasionally affect numerical precision.
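One way to check numerical precision after export is to compare outputs from the original runtime and the exported runtime on the same inputs. The sketch below shows only the comparison step on plain lists of floats; how the two output vectors are produced is left to the caller, and the tolerance values and function names are illustrative assumptions.

```python
def max_abs_diff(reference: list[float], candidate: list[float]) -> float:
    """Largest element-wise absolute difference between two output vectors."""
    if len(reference) != len(candidate):
        raise ValueError("output shapes differ")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def outputs_match(reference: list[float], candidate: list[float],
                  rtol: float = 1e-3, atol: float = 1e-5) -> bool:
    """True when every element is within atol + rtol * |reference|."""
    if len(reference) != len(candidate):
        return False
    return all(abs(r - c) <= atol + rtol * abs(r)
               for r, c in zip(reference, candidate))


baseline_logits = [2.3010, -0.4771, 0.9031]
exported_logits = [2.3012, -0.4770, 0.9030]
print(outputs_match(baseline_logits, exported_logits),
      max_abs_diff(baseline_logits, exported_logits))
```

A parity check of this kind complements, and does not replace, the full evaluation suite run against the quality threshold.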

4.5 Benchmarking Protocol Before and After Compression

The benchmarking protocol must satisfy one requirement: statistical confidence that quality degradation is within the pre-agreed threshold. The protocol:

Run the full evaluation suite on the baseline model, recording all metric values.
Apply the compression technique to produce the compressed model artefact.
Run the full evaluation suite on the compressed model.
Compute the quality degradation for each metric: (baseline_score - compressed_score) / baseline_score.
Apply a statistical significance test (bootstrap confidence interval on evaluation metrics).
Compare degradation to the pre-agreed threshold.
If all metrics are within threshold: record evaluation results and proceed to PATCH version registration.
If any metric exceeds threshold: reject compression; document result; explore alternative technique or adjust compression parameters.

The evaluation must use the full evaluation suite, not a sample. Compression techniques can cause disproportionate degradation on rare input types or long-tail tasks that are underrepresented in small sample evaluations.
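The degradation formula and bootstrap step in the protocol can be sketched as follows. This is a minimal illustration under stated assumptions: scores are per-example values where higher is better, the same examples are scored on both models (paired resampling), and the function names, resample count, and confidence level are illustrative choices.

```python
import random


def relative_degradation(baseline_score: float, compressed_score: float) -> float:
    """(baseline_score - compressed_score) / baseline_score."""
    return (baseline_score - compressed_score) / baseline_score


def bootstrap_degradation_ci(baseline: list[float], compressed: list[float],
                             n_resamples: int = 2000, alpha: float = 0.05,
                             seed: int = 0) -> tuple[float, float]:
    """Paired bootstrap confidence interval for degradation of the mean score."""
    rng = random.Random(seed)
    n = len(baseline)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        b = sum(baseline[i] for i in idx) / n
        c = sum(compressed[i] for i in idx) / n
        stats.append(relative_degradation(b, c))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def within_threshold(baseline: list[float], compressed: list[float],
                     threshold: float) -> bool:
    """Pass only if the upper confidence bound is inside the threshold."""
    _, upper = bootstrap_degradation_ci(baseline, compressed)
    return upper <= threshold
```

Gating on the upper bound of the interval, rather than on the point estimate, is a conservative design choice: a noisy evaluation that merely happens to land under the threshold does not pass.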

4.6 Compliance Consideration for Regulated Models

If the original model was validated under EU AI Act, APRA CPS 234 model risk management, or any other regulatory framework, the compressed model requires re-validation. Compression changes the model's weights and internal representations — regulators treat the compressed model as a distinct model for validation purposes. The re-validation must demonstrate that the compressed model's regulatory-relevant performance (accuracy, fairness, explainability) has not been materially degraded by compression. The re-validation evidence must be attached to the compressed model's PATCH version artefact bundle.

5. Architecture Diagram

flowchart TD
    subgraph Input["Compression Input"]
        A[Baseline Model]
        B[Calibration Dataset]
    end
    subgraph Compression["Compression Techniques"]
        C{Strategy Selection}
        D[Quantisation INT8/INT4]
        E[Knowledge Distillation]
        F[ONNX Export]
    end
    subgraph Validation["Validation Gate"]
        G[Full Evaluation Suite]
        H{Quality Threshold Met?}
        I[Regulatory Re-validation]
    end
    A --> C
    B --> C
    C -->|quantise| D
    C -->|distil| E
    C -->|inference opt| F
    D --> G
    E --> G
    F --> G
    G --> H
    H -->|within threshold| I
    H -->|exceeds threshold| J[Reject Compression]
    I -->|pass| K[Register PATCH Version]
    I -->|fail| J
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#f3e8ff,stroke:#a855f7
    style I fill:#fef9c3,stroke:#eab308
    style J fill:#fee2e2,stroke:#ef4444
    style K fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Calibration Dataset Store	Data Store	Stores representative production samples for quantisation calibration	S3, Azure Blob, GCS; curated sample per model	High
Quantisation Engine	Pipeline Stage	Applies INT8/INT4 post-training quantisation with calibration	AutoAWQ, AutoGPTQ, bitsandbytes, TensorRT PTQ	High
Distillation Training	Pipeline Stage	Trains student model using teacher soft targets	Hugging Face PEFT, custom PyTorch training loop	High
Pruning Engine	Pipeline Stage	Identifies and removes low-importance weights/structures	llm-pruner, custom structured pruning	Medium
ONNX Exporter	Pipeline Stage	Exports model to ONNX; validates graph completeness	torch.onnx.export, Optimum, TensorRT	Medium
Evaluation Harness	Platform Service	Runs full benchmark suite on both baseline and compressed models	Eleuther LM Eval Harness, custom evaluation suite	Critical
Compression Result Store	Data Store	Records benchmark results for all compression experiments	MLflow, Model Register (EAAPL-MDL001), S3	High

7. Data Flow

7.1 Primary Flow

Step	Actor	Action	Output
1	ML Engineer	Selects compression technique based on target (memory/latency/cost) and model type	Compression configuration document
2	Calibration Stage	Samples production inference logs; prepares calibration dataset	Calibration dataset (1K–10K representative examples)
3	Compression Engine	Applies technique (quantisation/distillation/pruning/ONNX)	Compressed model artefact
4	Evaluation Harness	Runs full benchmark suite on baseline model	Baseline evaluation results
5	Evaluation Harness	Runs full benchmark suite on compressed model	Compressed evaluation results
6	Benchmark Comparator	Computes degradation per metric; statistical significance test	Degradation report: within threshold or exceeds threshold
7	Regulated Model Check	Determines if model requires regulatory re-validation	Re-validation required or not required
8	Regulatory Re-validation	If required: runs regulatory-specific evaluation suite	Re-validation pass or fail
9	Artefact Bundler	Packages compressed model + evaluation results; registers as PATCH version	PATCH version in Model Register
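The decision logic in steps 6 to 9 can be expressed as a small gate function. This is a sketch of the flow in the table above, not a prescribed implementation; the enum and function names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    REGISTER_PATCH = "register_patch"
    REJECT = "reject"


def compression_gate(within_threshold: bool, regulated: bool,
                     revalidation_passed: Optional[bool] = None) -> Outcome:
    """Decide whether a compressed model may be registered as a PATCH version."""
    if not within_threshold:
        return Outcome.REJECT
    if regulated:
        if revalidation_passed is None:
            # A regulated model cannot proceed without a re-validation result.
            raise ValueError("regulated model requires a re-validation result")
        return Outcome.REGISTER_PATCH if revalidation_passed else Outcome.REJECT
    return Outcome.REGISTER_PATCH


print(compression_gate(within_threshold=True, regulated=True,
                       revalidation_passed=True))
```

Raising an error when the re-validation result is missing, instead of defaulting to pass, keeps a regulated model from being registered by omission.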

7.2 Error Flow

Error Scenario	Detection	Recovery Action
Quality metric exceeds degradation threshold	Benchmark comparator threshold check	Document result; try alternative compression parameters; try alternative technique
Calibration dataset not representative	High perplexity on production inputs post-compression	Recollect calibration dataset from broader sample; re-quantise
ONNX export graph validation fails	ONNX validator error	Debug unsupported operations; add custom ONNX operators or skip ONNX
Re-validation failure for regulated model	Regulatory evaluation suite threshold	Reject compressed version; increase calibration quality; revert to baseline
Distillation student underfits	Student eval significantly below threshold	Increase student capacity; extend training; revise architecture

8. Security Considerations

8.1 Controls Summary

Domain	Control
Authentication	Compression pipeline service account with same scope as training pipeline; no broader access
Authorisation	Compressed model PATCH version requires same approval workflow as other version changes
Secrets	Calibration data may contain real user inputs — same classification and access controls as production data
Classification	Compressed model artefact classified at same level as baseline (compression does not reduce sensitivity)
Encryption	Calibration dataset and compressed artefact encrypted at rest and in transit
Auditability	Compression technique, parameters, calibration dataset version, and benchmark results all logged to artefact bundle

8.2 OWASP LLM Top 10 Relevance

OWASP LLM Risk	Relevance	Mitigation
LLM01 Prompt Injection	Medium	Compressed models may have different robustness characteristics to adversarial inputs — include adversarial inputs in evaluation
LLM02 Insecure Output Handling	Low	Output handling is unchanged by compression
LLM03 Training Data Poisoning	Low	Compression does not introduce new training data; calibration dataset is not training data
LLM04 Model Denial of Service	Medium	Compression may change memory usage patterns — validate that compressed model DoS thresholds are understood
LLM05 Supply Chain Vulnerabilities	High	Compression libraries (AutoGPTQ, TensorRT) are dependencies with their own supply chain risks; pin versions
LLM06 Sensitive Information Disclosure	Medium	Compressed models may memorise training data differently — include memorisation probe in evaluation
LLM07 Insecure Plugin Design	Low	Not affected by compression
LLM08 Excessive Agency	Low	Not affected by compression
LLM09 Overreliance	High	If compression degrades quality at the tail of the distribution, users may over-rely on a model performing worse than they expect
LLM10 Model Theft	Medium	Compressed models are smaller and easier to exfiltrate; artefact access controls remain critical

9. Governance Considerations

9.1 Responsible AI

Compression can disproportionately impact model quality for minority subgroups: a model that performs similarly overall may show larger quality degradation for underrepresented groups. Fairness evaluation after compression is mandatory: compare subgroup performance metrics (where applicable) between the baseline and the compressed model. Degradation exceeding the threshold for any single subgroup is a blocking condition, even when the overall metric is within threshold.
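A disaggregated check of this kind can be sketched as below. The group labels and scores are made-up illustrative values, the function name is hypothetical, and scores are assumed to be higher-is-better.

```python
def subgroup_violations(baseline: dict[str, float],
                        compressed: dict[str, float],
                        threshold: float) -> dict[str, float]:
    """Return relative degradation for every subgroup exceeding the threshold."""
    violations = {}
    for group, base_score in baseline.items():
        degradation = (base_score - compressed[group]) / base_score
        if degradation > threshold:
            violations[group] = degradation
    return violations


baseline_scores = {"overall": 0.90, "group_a": 0.88, "group_b": 0.80}
compressed_scores = {"overall": 0.89, "group_a": 0.875, "group_b": 0.76}

blocked = subgroup_violations(baseline_scores, compressed_scores, threshold=0.02)
print(blocked)  # a non-empty result is a blocking condition
```

In this example the overall metric degrades by roughly 1%, inside a 2% threshold, while one subgroup degrades by about 5%, which is the pattern the fairness check exists to catch.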

9.2 Model Risk Management

A compressed model is a PATCH version change per EAAPL-MDL001. For models registered in the MRM framework, the PATCH version requires a model risk validation event demonstrating that the compression has not materially changed risk-relevant model behaviour. This validation is lighter-weight than that required for a MINOR version, but it cannot be skipped.

9.3 Human Approval Gates

Compression results (benchmark degradation report) must be reviewed and approved by the model owner before the PATCH version is registered. For regulated models, the AI Governance function reviews the re-validation evidence. No compression technique may be applied silently — every compression is a versioned, approved change.

9.4 Governance Artefacts

Artefact	Owner	Frequency	Location
Compression Configuration Record	ML Engineer	Per compression run	Artefact bundle
Benchmark Degradation Report	ML Engineer	Per compression run	Artefact bundle + Model Register
Calibration Dataset Reference	ML Engineer	Per compression run	Artefact bundle (dataset version hash)
Regulatory Re-validation Record	AI Governance	Per regulated model	Model governance record
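As an illustration of what a compression entry in the artefact bundle might contain, the sketch below builds a record with a dataset version hash and serialises it to JSON. The field names, version strings, and values are hypothetical and are not a schema defined by this pattern or by EAAPL-MDL001.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CompressionRecord:
    baseline_version: str
    technique: str
    parameters: dict
    calibration_dataset_sha256: str
    degradation_by_metric: dict
    approved_by: str


def dataset_hash(examples: list[str]) -> str:
    """Content hash of the calibration examples, in order."""
    digest = hashlib.sha256()
    for example in examples:
        digest.update(example.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


record = CompressionRecord(
    baseline_version="2.3.0",            # hypothetical version numbers
    technique="int8_ptq",
    parameters={"calibration_samples": 1024},
    calibration_dataset_sha256=dataset_hash(["example input 1", "example input 2"]),
    degradation_by_metric={"primary_metric": 0.008},
    approved_by="model-owner@example.com",
)
bundle_json = json.dumps(asdict(record), sort_keys=True, indent=2)
print(bundle_json)
```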

10. Operational Considerations

10.1 SLOs

SLO	Target	Measurement Method
Compression pipeline duration	< 4 hours (INT8); < 24 hours (distillation)	Pipeline end-to-end timing
Benchmark completeness (vs baseline)	100% of metrics	Benchmark result coverage check
Quality degradation threshold	Defined per application	Benchmark comparator report
Inference latency improvement post-compression	≥ 20% (p99)	Load test comparison baseline vs compressed

10.2 Monitoring and Logging

Post-deployment monitoring for compressed models mirrors production monitoring with one addition: a monthly quality drift check. If the quality metric for the compressed model drifts over time at a rate faster than the baseline model, this may indicate that the compression has created sensitivity to data distribution shift that was not present in the original model.
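One simple way to compare drift rates is to fit a least-squares slope to each model's monthly quality metric. The sketch below assumes a reference series for the baseline model is available (for example from a shadow deployment or periodic offline evaluation), which the pattern does not specify; the function names, tolerance, and sample values are illustrative.

```python
def slope(values: list[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def drifts_faster(baseline_monthly: list[float],
                  compressed_monthly: list[float],
                  tolerance: float = 0.0) -> bool:
    """True when the compressed model's metric declines faster than baseline."""
    return slope(compressed_monthly) < slope(baseline_monthly) - tolerance


baseline_monthly = [0.90, 0.90, 0.89, 0.89]
compressed_monthly = [0.89, 0.88, 0.86, 0.85]
print(slope(baseline_monthly), slope(compressed_monthly),
      drifts_faster(baseline_monthly, compressed_monthly))
```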

10.3 Incident Response

If a compressed model shows unexpected quality degradation in production (post-deployment), the standard rollback procedure (EAAPL-MDL004) applies: rollback to the last-known-good version (the uncompressed or previous PATCH). The incident triggers a review of the compression benchmark — was the quality degradation visible in the evaluation and accepted, or was it not detected?

10.4 Disaster Recovery

Scenario	RPO	RTO	Recovery Procedure
Compressed model quality failure	N/A	< 5 min	Rollback to uncompressed version per EAAPL-MDL004
Compression pipeline failure	N/A	4 hours	Restart pipeline from last successful stage; no production impact

10.5 Capacity Planning

Compression reduces memory requirements, which enables deployment on smaller (cheaper) GPU instances or increases the number of models per GPU instance. After compression, re-evaluate the serving infrastructure sizing. For example, if INT8 quantisation allows a model to move from an A100-80GB to an A100-40GB instance, the GPU memory provisioned per replica is halved. Model this before procurement to capture the cost saving.

11. Cost Considerations

11.1 Cost Drivers

Driver	Description	Relative Impact
Compression compute	GPU time for calibration-based quantisation; CPU for post-processing	Low-Medium
Distillation training compute	Full training run for student model; comparable to fine-tuning compute	High
Evaluation compute	Running full benchmark suite on both baseline and compressed models	Medium
Engineering labour	Selecting, configuring, and validating compression; pipeline development	Medium
Regulatory re-validation	Additional evaluation cost for regulated models	Medium

11.2 Scaling Risks

Distillation training cost scales with student model size and training dataset size — cost is comparable to fine-tuning. For organisations doing distillation of very large teachers (70B+), budget accordingly. Quantisation compute is much lower (typically < 1 hour for INT8 calibration of a 7B model).

11.3 Optimisations

Start with INT8 post-training quantisation: highest quality-to-effort ratio; no training required.
Use AutoAWQ or AutoGPTQ for INT4 before considering distillation: per Section 12.1 they typically retain more quality (2–5% loss vs 5–10% for distillation) at dramatically lower compute cost, although the size reduction is smaller.
Profile the actual memory and latency bottleneck before choosing a technique: if the bottleneck is compute (not memory bandwidth), quantisation may have limited benefit.

11.4 Indicative Cost Range and Savings

Technique	Compression Cost (One-Time)	Inference Cost Reduction	Break-Even Volume (queries/month)
INT8 PTQ	$50–$500	30–50%	~100K queries/month
INT4 GPTQ/AWQ	$100–$1,000	50–70%	~200K queries/month
Distillation	$1,000–$50,000	60–80% (smaller model)	~1M queries/month
ONNX + TensorRT	$200–$2,000 (export + validation)	30–50% latency improvement	~500K queries/month
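The relationship between one-time compression cost, query volume, and payback time can be sketched as follows. The inputs below are illustrative values picked from within the ranges in this document; the table does not state the per-query price or payback horizon behind its break-even column, so this is not a reproduction of those figures.

```python
def monthly_saving(queries_per_month: int, cost_per_query: float,
                   reduction: float) -> float:
    """Dollars saved per month for a fractional inference cost reduction."""
    return queries_per_month * cost_per_query * reduction


def break_even_months(one_time_cost: float, queries_per_month: int,
                      cost_per_query: float, reduction: float) -> float:
    """Months of serving needed to recover the one-time compression cost."""
    saving = monthly_saving(queries_per_month, cost_per_query, reduction)
    if saving <= 0:
        raise ValueError("no saving: compression never pays back")
    return one_time_cost / saving


# Illustrative INT8 PTQ case: $500 one-time cost, 100K queries/month,
# $0.01 per query, 40% inference cost reduction.
months = break_even_months(500.0, 100_000, 0.01, 0.40)
print(f"Break-even after {months:.2f} months")
```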

12. Trade-Off Analysis

12.1 Compression Technique Comparison

Technique	Memory Reduction	Quality Loss	Compute Cost	Training Required	Regulatory Complexity	Best For
INT8 PTQ	2×	~1%	Negligible	No	Low (PATCH re-eval)	Most production LLMs
INT4 GPTQ/AWQ	4×	2–5%	< 1 hour	No	Medium	Memory-constrained; cost-sensitive
Knowledge Distillation	5–20×	5–10%	High (training)	Yes	High (new model)	Very large cost reduction needed
Structured Pruning	1.2–1.5×	2–5%	Medium	Often (fine-tune after)	Medium	LLM-specific architectural optimisation
ONNX + TensorRT	0 (latency only)	< 0.5%	Medium	No	Low	Latency reduction on NVIDIA hardware

12.2 Architectural Tensions

Tension	Description	Resolution
Cost vs Quality	More aggressive compression saves more money but risks quality below threshold	Define quality threshold first; find the most aggressive compression within threshold
Speed to Value vs Thoroughness	Full benchmark on large model takes hours; teams want quick validation	Implement tiered evaluation: fast eval (key metrics only) for go/no-go; full eval before registration
Regulatory Certainty vs Cost	Re-validation adds cost and time; skipping it adds regulatory risk	Tier re-validation by model risk classification; PATCH re-validation is lighter than MINOR

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Calibration dataset not representative	Medium	High	Perplexity spike on production tail inputs	Re-collect calibration; re-quantise; extend evaluation coverage
Quality degradation exceeds threshold for a specific subgroup	Medium	High	Disaggregated benchmark analysis	Accept subgroup degradation only with documented risk acceptance; otherwise re-compress
ONNX export produces incorrect numerical outputs	Low	High	Numerical validation test in evaluation	Debug graph export; skip ONNX for this model; use native runtime
Compression library version regression	Low	Medium	CI build test comparing quantised model quality	Pin compression library versions; test on version upgrade
Compressed model fails in production but passed evaluation	Low	Medium	Production monitoring quality drift	Rollback; expand evaluation to cover production distribution gap

13.1 Cascading Failure Scenarios

If a compressed model is deployed without adequate quality regression testing and serves as the base for a subsequent fine-tuning run (EAAPL-MDL006), the fine-tuned model inherits the quality degradation of the compressed base. Mitigation: fine-tuning pipelines must use uncompressed (full precision) base models; compression is always a post-training PATCH applied to the final model, not a pre-training base model step.

14. Regulatory Considerations

Regulation / Framework	Relevant Clause	How This Pattern Addresses It
EU AI Act (2024/1689)	Article 9 (Risk Management) — changes to high-risk AI require re-assessment	Compressed model is a PATCH version requiring re-validation; process is documented
EU AI Act (2024/1689)	Article 15 (Accuracy, Robustness) — compression cannot materially degrade accuracy	Benchmark degradation report quantifies accuracy impact; threshold enforces Article 15
ISO 42001:2023	Annex A, control A.6.2.4 (AI system verification and validation): lifecycle changes require validation	Benchmarking protocol before and after compression is the validation event
NIST AI RMF (2023)	MANAGE 1.3 (Responses to risks from AI system changes)	Compression is a documented change with risk evidence (benchmark report)
APRA CPS 234 (2019)	Paragraph 15 (Change management for information assets including AI models)	PATCH version with benchmark evidence satisfies APRA change management requirement
Privacy Act 1988 (Cth)	APP 11 (Security) — calibration dataset security	Calibration dataset contains real user inputs; must be secured at production data classification

15. Reference Implementations

15.1 AWS

Quantisation: AWS SageMaker Neo (hardware-aware optimisation); bitsandbytes on SageMaker Processing Job; AutoGPTQ on GPU instance.
Distillation: SageMaker Training Jobs (same pipeline as fine-tuning, student architecture).
ONNX Export: SageMaker Processing Job with PyTorch ONNX export; Triton Inference Server on SageMaker for TensorRT.
Evaluation: SageMaker Processing Jobs for benchmark runs.
Artefact Storage: S3 (PATCH version bundle).

15.2 Azure

Quantisation: Azure ML with bitsandbytes/AutoAWQ compute job; Microsoft NNI (Neural Network Intelligence) for hardware-aware compression.
Distillation: Azure ML Training (same pipeline as fine-tuning).
ONNX Export: Azure ML with Optimum (Hugging Face); ONNX Runtime inference on Azure.
Evaluation: Azure ML Pipelines for benchmark runs.
Artefact Storage: Azure Blob Storage (PATCH version bundle).

15.3 GCP

Quantisation: Vertex AI Custom Training with bitsandbytes/AutoAWQ; GCP Model Garden compression tools.
Distillation: Vertex AI Training (student model).
ONNX Export: Vertex AI Custom Training with Optimum; TensorRT on T4/A100 VMs.
Evaluation: Vertex AI Pipelines.
Artefact Storage: GCS (PATCH version bundle).

15.4 On-Premises / Hybrid

Quantisation: bitsandbytes, AutoGPTQ, AutoAWQ on on-premises GPU; llama.cpp for GGUF quantisation where inference must run on CPU.
Distillation: Standard PyTorch training on GPU cluster (same as fine-tuning pipeline).
ONNX Export: torch.onnx.export + Optimum; TensorRT for NVIDIA GPU optimisation.
Evaluation: Eleuther LM Eval Harness on self-hosted compute.
Artefact Storage: MinIO S3-compatible storage.

16. Related Patterns

Pattern ID	Pattern Name	Relationship Type	Description
EAAPL-MDL001	Model Versioning	Produces	Compression produces a new PATCH version; versioning infrastructure records it
EAAPL-MDL006	Fine-Tuning Pipeline	Predecessor	Fine-tuned model is typically the input to compression; compression is post-training
EAAPL-MDL002	Shadow Model Deployment	Next Step	Compressed model candidates enter shadow testing to validate production quality
EAAPL-MDL004	Model Rollback	Safety Net	Rollback to uncompressed version is the recovery when compressed model fails in prod

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Rationale
Industry Adoption	4	INT8/INT4 quantisation is production-standard; GPTQ/AWQ widely deployed
Tooling Availability	4	AutoGPTQ, AutoAWQ, bitsandbytes, TensorRT are production-ready libraries
Standards Alignment	3	EU AI Act re-validation requirement is clear; specific evaluation standards still developing
Implementation Complexity	4 (high)	Calibration, evaluation protocol, regulatory re-validation add complexity
Regulatory Acceptance	3	Accepted as a model change; re-validation process requirements still being established by regulators

18. Revision History

Version	Date	Author	Summary of Changes
1.0	2026-06-12	Enterprise AI Architecture Practice	Initial publication
