EAAPL: Enterprise AI Architecture Pattern Library

EAAPL-MDL007 — Model Compression and Optimisation

Attribute | Value
Pattern ID | EAAPL-MDL007
Name | Model Compression and Optimisation
Maturity | Proven
Complexity | High
Tags | cost-optimisation, llm, inference, high-complexity
Last Reviewed | 2026-06-12
Owner | Enterprise AI Architecture Practice

1. Executive Summary

Model compression reduces the computational footprint of an AI model (memory usage, inference latency, and cost per query) through techniques including quantisation, knowledge distillation, and pruning. For large language models, compression is not optional at enterprise scale: the weights of a 70B-parameter model at 16-bit precision occupy about 140GB of GPU memory; INT8 quantisation reduces this to 70GB with approximately 1% quality loss; INT4 reduces it to 35GB with 2–5% quality loss. The right compression strategy can reduce inference costs by 50–80% while maintaining quality within acceptable business thresholds.

For CIOs, compression is a cost governance tool: without it, LLM serving costs scale prohibitively. For CTOs, compression is an inference engineering discipline in which each technique involves trade-offs that must be evaluated against the specific quality requirements of the application. For risk officers, compression introduces a material model risk: a compressed model is a different model from its uncompressed predecessor. If the original model was validated for regulatory purposes (EU AI Act high-risk, APRA model risk), the compressed version requires re-validation. This pattern provides a benchmarking protocol and re-validation framework to address that requirement.


2. Problem Statement

2.1 Business Problem

LLM inference costs are the primary cost driver for AI applications at scale. A large language model generating 1,000 tokens costs $0.001–$0.03 per query depending on the model. At 10M queries/day, that is $10,000–$300,000 per day. Without compression, organisations either accept these costs, rate-limit usage below business requirements, or limit deployment to use cases where the cost-value ratio is clearly positive. Compression unlocks broader deployment by reducing per-query costs to a level where a wider range of use cases is economically viable.
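
The arithmetic behind these figures can be sketched in a few lines. The query volume and per-query cost range come from the paragraph above, and the 50–80% reduction range comes from the Executive Summary; the function names are illustrative only.

```python
def daily_inference_cost(queries_per_day: int, cost_per_query: float) -> float:
    """Daily spend is simply query volume multiplied by per-query cost."""
    return queries_per_day * cost_per_query


def after_reduction(cost: float, reduction: float) -> float:
    """Cost remaining after a fractional reduction (0.5 means 50% cheaper)."""
    return cost * (1.0 - reduction)


QUERIES_PER_DAY = 10_000_000      # volume quoted in the text
COST_PER_QUERY = (0.001, 0.03)    # USD per query, range quoted in the text

baseline_low, baseline_high = (
    daily_inference_cost(QUERIES_PER_DAY, c) for c in COST_PER_QUERY
)

# Best case in the text: 80% reduction; conservative case: 50% reduction.
compressed_best = after_reduction(baseline_high, 0.80)
compressed_conservative = after_reduction(baseline_high, 0.50)

print(f"baseline: ${baseline_low:,.0f} to ${baseline_high:,.0f} per day")
print(f"upper bound after 50% reduction: ${compressed_conservative:,.0f} per day")
print(f"upper bound after 80% reduction: ${compressed_best:,.0f} per day")
```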

2.2 Technical Problem

Foundation models are trained in high precision (BF16 or FP32) for numerical stability during gradient descent. Inference does not require the same precision. The excess precision represents wasted memory bandwidth and compute cycles. However, naively reducing precision degrades model quality — the compression must be calibrated against representative input data to minimise quality loss for the specific application's input distribution.

2.3 Symptoms

  • Model serving infrastructure is GPU-constrained by model memory requirements, not compute throughput.
  • Inference cost is the primary objection to broader AI deployment within the organisation.
  • Model responses have acceptable latency at low load but degrade significantly under peak load due to memory bandwidth limitations.
  • The organisation is paying for a larger GPU tier than would be needed if the model were compressed.

2.4 Cost of Inaction

Category | Indicative Impact
Cost | 50–80% overpay on inference compute vs compressed equivalents
Deployment | Memory constraints limit model deployment to high-cost GPU instances only
Scale | Cannot serve peak load without vertical GPU scaling at prohibitive cost
Competitiveness | Competitors using compressed models serve the same quality at lower cost or higher volume

3. Context

3.1 When to Apply

  • Models serving high-query-volume applications where inference cost is a primary concern.
  • Models that require deployment on memory-constrained infrastructure (edge devices, smaller cloud instances).
  • After fine-tuning (EAAPL-MDL006) — the fine-tuned model is the compression target.
  • When latency reduction is required and the bottleneck is memory bandwidth rather than compute.

3.2 When NOT to Apply

  • Safety-critical outputs: Models making high-stakes decisions (medical diagnosis, credit decisions, safety systems) where even 1% quality degradation is unacceptable — evaluate compression benefits against quality risk carefully.
  • High-precision numerical tasks: Models producing numerical outputs (financial calculations, measurements) where precision matters — quantisation error accumulates in numerical computations.
  • Models that are already at acceptable cost and latency targets — do not compress without clear business justification.
  • Models subject to active regulatory examination where re-validation cost exceeds compression savings.

3.3 Prerequisites

Prerequisite | Detail
Baseline model artefact | Registered, production model version to be compressed (EAAPL-MDL001)
Evaluation suite | Full benchmark suite representing production input distribution for quality measurement
Calibration dataset | Representative sample of production inputs (1,000–10,000 examples) for quantisation calibration
Quality threshold definition | Pre-agreed acceptable quality degradation threshold (e.g., ≤ 2% on primary metric)

3.4 Industry Applicability

Industry | Applicability | Primary Driver
Technology Platforms | Critical | API serving cost; latency; scale
Financial Services | High | Cost governance; on-premises deployment security requirement
Healthcare | High | Edge deployment for clinical tools; cost management
Retail / E-commerce | High | High-volume recommendation; cost-sensitive application economics
Government | High | On-premises / sovereign cloud deployment; cost per citizen interaction
Manufacturing | Medium | Edge device deployment for on-site inference

4. Architecture Overview

4.1 Quantisation

Quantisation converts model weights from high-precision floating point (BF16/FP32) to lower-precision integer or reduced-precision float representations.

INT8 (Post-Training Quantisation): Reduces each weight from a 16-bit float to an 8-bit integer. Memory reduction: 2×. Quality loss: approximately 0.5–1% on typical LLM benchmarks. Inference speedup: 1.5–2× on hardware with INT8 acceleration (for example NVIDIA A100 and H100, and some consumer GPUs). Suitable for most production LLM deployments where minor quality loss is acceptable. Calibration requires a representative sample of 512–2,048 inputs.
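
To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantisation (the "absmax" scheme). This is an illustrative assumption, not the method used by any specific library named in this pattern: production tools work per channel or per group and use calibration data, which this toy example omits. The weight values are made up.

```python
def quantise_int8(weights):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127].

    The scale is chosen so the largest-magnitude weight maps to +/-127.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantised = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantised, scale


def dequantise(quantised, scale):
    """Recover approximate float weights from the integer representation."""
    return [q * scale for q in quantised]


weights = [0.42, -1.27, 0.003, 0.88, -0.51]   # hypothetical weight values
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

# Rounding error per weight is bounded by half a quantisation step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 values:", q)
print("scale:", scale, "max round-trip error:", max_error)
```

Note that the small weight 0.003 collapses to zero: values well below one quantisation step are lost, which is one reason quality must be measured on the full evaluation suite rather than assumed.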

INT4 / GPTQ / AWQ: Reduces weights to 4-bit. Memory reduction: 4× vs FP16. Quality loss: 2–5% depending on model size (larger models tolerate quantisation better). Inference speedup: 2–4× on compatible hardware. The AutoGPTQ and AutoAWQ libraries implement calibrated 4-bit quantisation that minimises perplexity increase. Use for deployments where memory constraint is the primary driver and 2–5% quality loss is acceptable.
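
The memory figures quoted in this pattern follow from parameter count multiplied by bits per weight. The sketch below reproduces them, assuming 1 GB = 10^9 bytes and counting weights only; the KV cache, activations, and quantisation metadata (scales, zero points) are deliberately ignored.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Weights-only memory footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_70B = 70e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(PARAMS_70B, bits):.0f} GB")
```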

When NOT to quantise: As stated, avoid INT4 for safety-critical, high-precision numerical, or regulated decisions where re-validation cost or quality risk is unacceptable. Always measure quality on the full evaluation suite — do not rely on perplexity as a proxy for task-specific quality.

4.2 Knowledge Distillation

Knowledge distillation trains a smaller "student" model to mimic the behaviour of a larger "teacher" model. The student is trained on the teacher's output distribution (soft targets) rather than hard labels. This transfers the teacher's generalisation to a smaller architecture.

Architecture requirement: The student architecture is defined before distillation begins. The student is typically 10–30% the size of the teacher (e.g., a 7B student from a 70B teacher). The student must have sufficient capacity to capture the knowledge — a student that is too small cannot learn the teacher's full capability regardless of training budget.

Dataset requirement: Distillation requires a large, diverse dataset — typically the original training data or a similarly broad corpus. The teacher generates soft probability distributions over vocabulary for each training example; the student is trained to match these distributions. Dataset size requirements are similar to the original training run.
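
Matching the teacher's distribution is usually expressed as a divergence between the two softened distributions. The sketch below computes a KL divergence for a single token position in pure Python. The temperature parameter and the example logits are assumptions for illustration; the source does not specify a loss formulation, and real training code would use a framework such as PyTorch over batches.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    Some formulations also multiply by temperature**2; that factor is
    omitted here for simplicity.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.5, 0.2, -1.0]      # hypothetical logits over a 4-token vocabulary
close_student = [3.8, 1.6, 0.1, -0.9]
poor_student = [0.0, 0.0, 0.0, 0.0]  # uniform prediction

print("close student loss:", distillation_loss(teacher, close_student))
print("poor student loss: ", distillation_loss(teacher, poor_student))
```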

Quality ceiling: Knowledge distillation has a quality ceiling: the student cannot exceed the teacher's quality. Empirically, a well-distilled student achieves 90–95% of the teacher's task performance at 20–30% of the inference cost. For tasks where the teacher's quality exceeds the requirement, this is highly cost-efficient.

Use cases: Distillation is the right approach when the organisation needs a model that is dramatically smaller (not just 2–4× reduction from quantisation) and can invest in the training cost of the distillation run.

4.3 Pruning

Pruning removes weights or neurons that contribute least to model outputs, reducing model size.

Structured pruning removes entire neurons, attention heads, or layers — the resulting model has fewer parameters and can be directly accelerated by standard hardware without special kernels. Quality loss is higher than unstructured pruning for the same size reduction.

Unstructured pruning removes individual weights based on magnitude or gradient importance — the resulting sparse weight tensor requires sparse matrix multiplication support for inference speedup. On standard GPU hardware without sparse acceleration, unstructured pruning reduces model size but does not necessarily reduce inference latency.
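
A minimal sketch of magnitude-based unstructured pruning on a flat list of weights follows. The weight values and the 20% sparsity level are illustrative assumptions. Note that the output has the same length as the input: the zeros still occupy memory and compute unless a sparse storage format and sparse kernels are used, which is the limitation described above.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    num_to_drop = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:num_to_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.2, -0.02, 0.6, 0.15, -0.4]
pruned = magnitude_prune(weights, sparsity=0.2)

print(pruned)
print("zeroed:", sum(1 for w in pruned if w == 0.0), "of", len(pruned))
```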

Industry evidence for LLM pruning: Research and production evidence (Meta, Google, Microsoft) indicates that LLMs tolerate moderate pruning (up to 20% of parameters) with < 2% quality loss on most benchmarks. Beyond 30% pruning, quality loss becomes significant and recovery via fine-tuning is required. Pruning is less mature than quantisation as a production technique for LLMs as of 2026 — prefer quantisation for production deployments unless there is a specific architectural reason for pruning.

4.4 ONNX Export for Inference Optimisation

ONNX (Open Neural Network Exchange) is a cross-platform model representation that enables deployment with optimised runtimes (ONNX Runtime, TensorRT, OpenVINO) that provide hardware-specific inference acceleration independent of the training framework.

Cross-platform benefit: An ONNX-exported model can run on CPUs, GPUs, and specialised hardware accelerators without rewriting serving code. This is particularly valuable for hybrid infrastructure (some workloads on GPU, some on CPU).

TensorRT acceleration: NVIDIA TensorRT converts ONNX models to optimised inference engines for NVIDIA GPUs. For transformer models, TensorRT provides 2–4× latency reduction vs PyTorch inference on the same hardware, with equivalent quality.

Limitations: Not all model operations are ONNX-exportable without custom operators. Verify the full model graph exports correctly before relying on ONNX in production. ONNX export must be validated against the quality threshold — graph-level optimisations can occasionally affect numerical precision.
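
One simple way to check for numerical drift after export is to compare the outputs of the native model and the exported model on the same inputs. The sketch below assumes both sets of outputs have already been collected as flat lists of floats; the tolerance value and the sample numbers are hypothetical, and this check complements rather than replaces the full evaluation suite.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two output vectors."""
    if len(reference) != len(candidate):
        raise ValueError("output vectors must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def outputs_match(reference, candidate, atol=1e-4):
    """True when every element agrees within the absolute tolerance."""
    return max_abs_diff(reference, candidate) <= atol


# Hypothetical logits from the native runtime and from the exported graph.
native_logits = [0.1234, -2.5, 3.75]
exported_logits = [0.12341, -2.50002, 3.75003]

print("max abs diff:", max_abs_diff(native_logits, exported_logits))
print("within 1e-4:", outputs_match(native_logits, exported_logits))
```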

4.5 Benchmarking Protocol Before and After Compression

The benchmarking protocol must satisfy one requirement: statistical confidence that quality degradation is within the pre-agreed threshold. The protocol:

  1. Run the full evaluation suite on the baseline model, recording all metric values.
  2. Apply the compression technique to produce the compressed model artefact.
  3. Run the full evaluation suite on the compressed model.
  4. Compute the quality degradation for each metric: (baseline_score - compressed_score) / baseline_score.
  5. Apply a statistical significance test (bootstrap confidence interval on evaluation metrics).
  6. Compare degradation to the pre-agreed threshold.
  7. If all metrics are within threshold: record evaluation results and proceed to PATCH version registration.
  8. If any metric exceeds threshold: reject compression; document result; explore alternative technique or adjust compression parameters.

The evaluation must use the full evaluation suite, not a sample. Compression techniques can cause disproportionate degradation on rare input types or long-tail tasks that are underrepresented in small sample evaluations.
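
Steps 4 to 6 of the protocol can be sketched as follows. The degradation formula is the one given in step 4. The paired bootstrap over per-example scores, and the decision rule (accept only if the upper confidence bound is within the threshold), are assumptions about how step 5 might be implemented. The per-example scores are synthetic.

```python
import random


def relative_degradation(baseline_score, compressed_score):
    """(baseline_score - compressed_score) / baseline_score, as in step 4."""
    return (baseline_score - compressed_score) / baseline_score


def bootstrap_degradation_ci(baseline, compressed, n_resamples=2000,
                             alpha=0.05, seed=0):
    """Paired bootstrap confidence interval for relative degradation.

    `baseline` and `compressed` hold per-example scores for the same
    evaluation examples, in the same order.
    """
    if len(baseline) != len(compressed) or not baseline:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    n = len(baseline)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        b = sum(baseline[i] for i in idx) / n
        c = sum(compressed[i] for i in idx) / n
        if b > 0:  # skip degenerate resamples where the ratio is undefined
            stats.append(relative_degradation(b, c))
    if not stats:
        raise ValueError("baseline scores are all zero")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


def within_threshold(baseline, compressed, threshold):
    """Accept only if the upper confidence bound is within the threshold."""
    _, upper = bootstrap_degradation_ci(baseline, compressed)
    return upper <= threshold


# Synthetic per-example correctness: baseline 90/100, compressed 89/100.
baseline = [1] * 90 + [0] * 10
compressed = [1] * 89 + [0] * 11

print("point estimate:", relative_degradation(0.90, 0.89))
print("95% CI:", bootstrap_degradation_ci(baseline, compressed))
```

With only 100 examples the interval is wide relative to a 2% threshold, which illustrates why the protocol insists on the full evaluation suite rather than a small sample.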

4.6 Compliance Consideration for Regulated Models

If the original model was validated under EU AI Act, APRA CPS 234 model risk management, or any other regulatory framework, the compressed model requires re-validation. Compression changes the model's weights and internal representations — regulators treat the compressed model as a distinct model for validation purposes. The re-validation must demonstrate that the compressed model's regulatory-relevant performance (accuracy, fairness, explainability) has not been materially degraded by compression. The re-validation evidence must be attached to the compressed model's PATCH version artefact bundle.


5. Architecture Diagram

flowchart TD
    subgraph Input["Compression Input"]
        A[Baseline Model]
        B[Calibration Dataset]
    end
    subgraph Compression["Compression Techniques"]
        C{Strategy Selection}
        D[Quantisation INT8/INT4]
        E[Knowledge Distillation]
        F[ONNX Export]
    end
    subgraph Validation["Validation Gate"]
        G[Full Evaluation Suite]
        H{Quality Threshold Met?}
        I[Regulatory Re-validation]
    end
    A --> C
    B --> C
    C -->|quantise| D
    C -->|distil| E
    C -->|inference opt| F
    D --> G
    E --> G
    F --> G
    G --> H
    H -->|within threshold| I
    H -->|exceeds threshold| J[Reject Compression]
    I -->|pass| K[Register PATCH Version]
    I -->|fail| J
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#f3e8ff,stroke:#a855f7
    style I fill:#fef9c3,stroke:#eab308
    style J fill:#fee2e2,stroke:#ef4444
    style K fill:#d1fae5,stroke:#10b981

6. Components

Component | Type | Responsibility | Technology Options | Criticality
Calibration Dataset Store | Data Store | Stores representative production samples for quantisation calibration | S3, Azure Blob, GCS; curated sample per model | High
Quantisation Engine | Pipeline Stage | Applies INT8/INT4 post-training quantisation with calibration | AutoAWQ, AutoGPTQ, bitsandbytes, TensorRT PTQ | High
Distillation Training | Pipeline Stage | Trains student model using teacher soft targets | Hugging Face PEFT, custom PyTorch training loop | High
Pruning Engine | Pipeline Stage | Identifies and removes low-importance weights/structures | llm-pruner, custom structured pruning | Medium
ONNX Exporter | Pipeline Stage | Exports model to ONNX; validates graph completeness | torch.onnx.export, Optimum, TensorRT | Medium
Evaluation Harness | Platform Service | Runs full benchmark suite on both baseline and compressed models | Eleuther LM Eval Harness, custom evaluation suite | Critical
Compression Result Store | Data Store | Records benchmark results for all compression experiments | MLflow, Model Register (EAAPL-MDL001), S3 | High

7. Data Flow

7.1 Primary Flow

Step | Actor | Action | Output
1 | ML Engineer | Selects compression technique based on target (memory/latency/cost) and model type | Compression configuration document
2 | Calibration Stage | Samples production inference logs; prepares calibration dataset | Calibration dataset (1K–10K representative examples)
3 | Compression Engine | Applies technique (quantisation/distillation/pruning/ONNX) | Compressed model artefact
4 | Evaluation Harness | Runs full benchmark suite on baseline model | Baseline evaluation results
5 | Evaluation Harness | Runs full benchmark suite on compressed model | Compressed evaluation results
6 | Benchmark Comparator | Computes degradation per metric; statistical significance test | Degradation report: within threshold or exceeds threshold
7 | Regulated Model Check | Determines if model requires regulatory re-validation | Re-validation required or not required
8 | Regulatory Re-validation | If required: runs regulatory-specific evaluation suite | Re-validation pass or fail
9 | Artefact Bundler | Packages compressed model + evaluation results; registers as PATCH version | PATCH version in Model Register

7.2 Error Flow

Error Scenario | Detection | Recovery Action
Quality metric exceeds degradation threshold | Benchmark comparator threshold check | Document result; try alternative compression parameters; try alternative technique
Calibration dataset not representative | High perplexity on production inputs post-compression | Recollect calibration dataset from broader sample; re-quantise
ONNX export graph validation fails | ONNX validator error | Debug unsupported operations; add custom ONNX operators or skip ONNX
Re-validation failure for regulated model | Regulatory evaluation suite threshold | Reject compressed version; increase calibration quality; revert to baseline
Distillation student underfits | Student eval significantly below threshold | Increase student capacity; extend training; revise architecture

8. Security Considerations

8.1 Controls Summary

Domain | Control
Authentication | Compression pipeline service account with same scope as training pipeline; no broader access
Authorisation | Compressed model PATCH version requires same approval workflow as other version changes
Secrets | Calibration data may contain real user inputs: same classification and access controls as production data
Classification | Compressed model artefact classified at same level as baseline (compression does not reduce sensitivity)
Encryption | Calibration dataset and compressed artefact encrypted at rest and in transit
Auditability | Compression technique, parameters, calibration dataset version, and benchmark results all logged to artefact bundle

8.2 OWASP LLM Top 10 Relevance

OWASP LLM Risk | Relevance | Mitigation
LLM01 Prompt Injection | Medium | Compressed models may have different robustness characteristics to adversarial inputs; include adversarial inputs in evaluation
LLM02 Insecure Output Handling | Low | Output handling is unchanged by compression
LLM03 Training Data Poisoning | Low | Compression does not introduce new training data; calibration dataset is not training data
LLM04 Model Denial of Service | Medium | Compression may change memory usage patterns; validate that compressed model DoS thresholds are understood
LLM05 Supply Chain Vulnerabilities | High | Compression libraries (AutoGPTQ, TensorRT) are dependencies with their own supply chain risks; pin versions
LLM06 Sensitive Information Disclosure | Medium | Compressed models may memorise training data differently; include memorisation probe in evaluation
LLM07 Insecure Plugin Design | Low | Not affected by compression
LLM08 Excessive Agency | Low | Not affected by compression
LLM09 Overreliance | High | If compression degrades quality at the tail of the distribution, users may over-rely on a model performing worse than they expect
LLM10 Model Theft | Medium | Compressed models are smaller and easier to exfiltrate; artefact access controls remain critical

9. Governance Considerations

9.1 Responsible AI

Compression can disproportionately impact model quality for minority subgroups: a model that performs similarly overall may show larger quality degradation for underrepresented groups. Fairness evaluation after compression is mandatory: compare subgroup performance metrics (where applicable) between the baseline and compressed models. Quality degradation exceeding the threshold for any subgroup is a blocking condition.
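
A disaggregated check of this kind might look like the sketch below. The subgroup names, scores, and the 2% threshold are hypothetical, and the sketch assumes a single metric where higher is better and baseline scores are positive.

```python
def subgroup_degradation(baseline_scores, compressed_scores):
    """Relative degradation per subgroup: (baseline - compressed) / baseline."""
    return {
        group: (baseline_scores[group] - compressed_scores[group])
        / baseline_scores[group]
        for group in baseline_scores
    }


def blocking_subgroups(baseline_scores, compressed_scores, threshold):
    """Subgroups whose degradation exceeds the threshold (any one blocks release)."""
    degradation = subgroup_degradation(baseline_scores, compressed_scores)
    return sorted(g for g, d in degradation.items() if d > threshold)


# Hypothetical per-subgroup accuracy before and after compression.
baseline_scores = {"group_a": 0.90, "group_b": 0.80, "group_c": 0.85}
compressed_scores = {"group_a": 0.89, "group_b": 0.76, "group_c": 0.845}

blocked = blocking_subgroups(baseline_scores, compressed_scores, threshold=0.02)
print("blocking subgroups:", blocked)
```

In this made-up example the overall picture looks acceptable for two groups, but group_b degrades by 5% and would block the release.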

9.2 Model Risk Management

A compressed model is a PATCH version change per EAAPL-MDL001. For models registered in the MRM framework, the PATCH version requires a model risk validation event demonstrating that the compression has not materially changed risk-relevant model behaviour. This is a lighter-weight validation than a MINOR version but is not zero.

9.3 Human Approval Gates

Compression results (benchmark degradation report) must be reviewed and approved by the model owner before the PATCH version is registered. For regulated models, the AI Governance function reviews the re-validation evidence. No compression technique may be applied silently — every compression is a versioned, approved change.

9.4 Governance Artefacts

Artefact | Owner | Frequency | Location
Compression Configuration Record | ML Engineer | Per compression run | Artefact bundle
Benchmark Degradation Report | ML Engineer | Per compression run | Artefact bundle + Model Register
Calibration Dataset Reference | ML Engineer | Per compression run | Artefact bundle (dataset version hash)
Regulatory Re-validation Record | AI Governance | Per regulated model | Model governance record
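
As an illustration of the first and third artefacts, the sketch below derives a dataset version hash and packages it into a configuration record. The record fields, the hashing scheme (SHA-256 over length-prefixed UTF-8 records), and all example values are assumptions; the pattern does not prescribe a schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def dataset_version_hash(examples):
    """SHA-256 over the examples in order.

    Each record is length-prefixed so that ["ab", "c"] and ["a", "bc"]
    produce different hashes.
    """
    digest = hashlib.sha256()
    for example in examples:
        data = example.encode("utf-8")
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


@dataclass(frozen=True)
class CompressionConfigRecord:
    """Hypothetical schema for the Compression Configuration Record."""
    model_version: str
    technique: str
    parameters: dict
    calibration_dataset_hash: str


calibration_examples = ["example prompt one", "example prompt two"]  # placeholder data

record = CompressionConfigRecord(
    model_version="1.4.0",            # hypothetical baseline version
    technique="int8-ptq",
    parameters={"calibration_samples": len(calibration_examples)},
    calibration_dataset_hash=dataset_version_hash(calibration_examples),
)

print(json.dumps(asdict(record), sort_keys=True, indent=2))
```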

10. Operational Considerations

10.1 SLOs

SLO | Target | Measurement Method
Compression pipeline duration | < 4 hours (INT8); < 24 hours (distillation) | Pipeline end-to-end timing
Benchmark completeness (vs baseline) | 100% of metrics | Benchmark result coverage check
Quality degradation threshold | Defined per application | Benchmark comparator report
Inference latency improvement post-compression | ≥ 20% (p99) | Load test comparison baseline vs compressed

10.2 Monitoring and Logging

Post-deployment monitoring for compressed models mirrors production monitoring with one addition: a monthly quality drift check. If the quality metric for the compressed model drifts over time at a rate faster than the baseline model, this may indicate that the compression has created sensitivity to data distribution shift that was not present in the original model.
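
One way to compare drift rates is to fit a straight line to each model's monthly quality scores and compare the slopes. The sketch below does this with an ordinary least-squares slope. It assumes the baseline model is still evaluated monthly on the same data, that higher scores are better, and that a linear trend is an adequate summary; the scores shown are invented.

```python
def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def drifting_faster(baseline_monthly, compressed_monthly, tolerance=0.0):
    """True when the compressed model's quality falls faster than the baseline's."""
    return slope(compressed_monthly) < slope(baseline_monthly) - tolerance


# Hypothetical monthly quality scores over four months.
baseline_monthly = [0.90, 0.90, 0.89, 0.89]
compressed_monthly = [0.89, 0.88, 0.86, 0.85]

print("baseline slope:  ", slope(baseline_monthly))
print("compressed slope:", slope(compressed_monthly))
print("compressed drifting faster:", drifting_faster(baseline_monthly, compressed_monthly))
```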

10.3 Incident Response

If a compressed model shows unexpected quality degradation in production (post-deployment), the standard rollback procedure (EAAPL-MDL004) applies: rollback to the last-known-good version (the uncompressed or previous PATCH). The incident triggers a review of the compression benchmark — was the quality degradation visible in the evaluation and accepted, or was it not detected?

10.4 Disaster Recovery

Scenario | RPO | RTO | Recovery Procedure
Compressed model quality failure | N/A | < 5 min | Rollback to uncompressed version per EAAPL-MDL004
Compression pipeline failure | N/A | 4 hours | Restart pipeline from last successful stage; no production impact

10.5 Capacity Planning

Compression reduces memory requirements, which enables deployment on smaller (cheaper) GPU instances or increases the number of models that fit on each GPU instance. After compression, re-evaluate the serving infrastructure sizing: for example, if INT8 quantisation allows a model to move from an A100-80GB to an A100-40GB, the GPU memory that must be provisioned halves. Model this before procurement to capture the cost saving.


11. Cost Considerations

11.1 Cost Drivers

Driver | Description | Relative Impact
Compression compute | GPU time for calibration-based quantisation; CPU for post-processing | Low-Medium
Distillation training compute | Full training run for student model; comparable to fine-tuning compute | High
Evaluation compute | Running full benchmark suite on both baseline and compressed models | Medium
Engineering labour | Selecting, configuring, and validating compression; pipeline development | Medium
Regulatory re-validation | Additional evaluation cost for regulated models | Medium

11.2 Scaling Risks

Distillation training cost scales with student model size and training dataset size — cost is comparable to fine-tuning. For organisations doing distillation of very large teachers (70B+), budget accordingly. Quantisation compute is much lower (typically < 1 hour for INT8 calibration of a 7B model).

11.3 Optimisations

  • Start with INT8 post-training quantisation: highest quality-to-effort ratio; no training required.
  • Use AutoAWQ or AutoGPTQ for INT4 before considering distillation — they achieve close to distillation quality with dramatically less compute.
  • Profile the actual memory and latency bottleneck before choosing a technique: if the bottleneck is compute (not memory bandwidth), quantisation may have limited benefit.

11.4 Indicative Cost Range and Savings

Technique | Compression Cost (One-Time) | Inference Cost Reduction | Break-Even Volume (queries/month)
INT8 PTQ | $50–$500 | 30–50% | ~100K
INT4 GPTQ/AWQ | $100–$1,000 | 50–70% | ~200K
Distillation | $1,000–$50,000 | 60–80% (smaller model) | ~1M
ONNX + TensorRT | $200–$2,000 (export + validation) | 30–50% latency improvement | ~500K
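
A break-even volume can be estimated by dividing the one-time compression cost by the saving per query. The sketch below shows the calculation with hypothetical inputs; it is not an attempt to reproduce the table's figures, which depend on per-query costs and overheads that the table does not state.

```python
import math


def break_even_queries(one_time_cost, baseline_cost_per_query, reduction):
    """Number of queries after which cumulative savings cover the one-time cost."""
    saving_per_query = baseline_cost_per_query * reduction
    if saving_per_query <= 0:
        raise ValueError("compression must produce a positive saving per query")
    return math.ceil(one_time_cost / saving_per_query)


# Hypothetical inputs: $500 one-time cost, $0.01 per query, 40% cost reduction.
queries = break_even_queries(
    one_time_cost=500.0,
    baseline_cost_per_query=0.01,
    reduction=0.40,
)
print(f"break-even after roughly {queries:,} queries")
```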

12. Trade-Off Analysis

12.1 Compression Technique Comparison

Technique | Memory Reduction | Quality Loss | Compute Cost | Training Required | Regulatory Complexity | Best For
INT8 PTQ | 2× | ~1% | Negligible | No | Low (PATCH re-eval) | Most production LLMs
INT4 GPTQ/AWQ | 4× | 2–5% | < 1 hour | No | Medium | Memory-constrained; cost-sensitive
Knowledge Distillation | 5–20× | 5–10% | High (training) | Yes | High (new model) | Very large cost reduction needed
Structured Pruning | 1.2–1.5× | 2–5% | Medium | Often (fine-tune after) | Medium | LLM-specific architectural optimisation
ONNX + TensorRT | 0 (latency only) | < 0.5% | Medium | No | Low | Latency reduction on NVIDIA hardware

12.2 Architectural Tensions

Tension | Description | Resolution
Cost vs Quality | More aggressive compression saves more money but risks quality below threshold | Define quality threshold first; find the most aggressive compression within threshold
Speed to Value vs Thoroughness | Full benchmark on large model takes hours; teams want quick validation | Implement tiered evaluation: fast eval (key metrics only) for go/no-go; full eval before registration
Regulatory Certainty vs Cost | Re-validation adds cost and time; skipping it adds regulatory risk | Tier re-validation by model risk classification; PATCH re-validation is lighter than MINOR

13. Failure Modes

Failure | Likelihood | Impact | Detection | Recovery
Calibration dataset not representative | Medium | High | Perplexity spike on production tail inputs | Re-collect calibration; re-quantise; extend evaluation coverage
Quality degradation exceeds threshold for a specific subgroup | Medium | High | Disaggregated benchmark analysis | Accept subgroup degradation only with documented risk acceptance; otherwise re-compress
ONNX export produces incorrect numerical outputs | Low | High | Numerical validation test in evaluation | Debug graph export; skip ONNX for this model; use native runtime
Compression library version regression | Low | Medium | CI build test comparing quantised model quality | Pin compression library versions; test on version upgrade
Compressed model fails in production but passed evaluation | Low | Medium | Production monitoring quality drift | Rollback; expand evaluation to cover production distribution gap

13.1 Cascading Failure Scenarios

If a compressed model is deployed without adequate quality regression testing and serves as the base for a subsequent fine-tuning run (EAAPL-MDL006), the fine-tuned model inherits the quality degradation of the compressed base. Mitigation: fine-tuning pipelines must use uncompressed (full precision) base models; compression is always a post-training PATCH applied to the final model, not a pre-training base model step.


14. Regulatory Considerations

Regulation / Framework | Relevant Clause | How This Pattern Addresses It
EU AI Act (2024/1689) | Article 9 (Risk Management): changes to high-risk AI require re-assessment | Compressed model is a PATCH version requiring re-validation; process is documented
EU AI Act (2024/1689) | Article 15 (Accuracy, Robustness): compression cannot materially degrade accuracy | Benchmark degradation report quantifies accuracy impact; threshold enforces Article 15
ISO 42001:2023 | Clause 8.4 (Verification and validation): all lifecycle changes require validation | Benchmarking protocol before and after compression is the validation event
NIST AI RMF (2023) | MANAGE 1.3 (Responses to risks from AI system changes) | Compression is a documented change with risk evidence (benchmark report)
APRA CPS 234 (2019) | Paragraph 15 (Change management for information assets including AI models) | PATCH version with benchmark evidence satisfies APRA change management requirement
Privacy Act 1988 (Cth) | APP 11 (Security): calibration dataset security | Calibration dataset contains real user inputs; must be secured at production data classification

15. Reference Implementations

15.1 AWS

  • Quantisation: AWS SageMaker Neo (hardware-aware optimisation); bitsandbytes on SageMaker Processing Job; AutoGPTQ on GPU instance.
  • Distillation: SageMaker Training Jobs (same pipeline as fine-tuning, student architecture).
  • ONNX Export: SageMaker Processing Job with PyTorch ONNX export; Triton Inference Server on SageMaker for TensorRT.
  • Evaluation: SageMaker Processing Jobs for benchmark runs.
  • Artefact Storage: S3 (PATCH version bundle).

15.2 Azure

  • Quantisation: Azure ML with bitsandbytes/AutoAWQ compute job; Azure Neural Network Intelligence for hardware-aware compression.
  • Distillation: Azure ML Training (same pipeline as fine-tuning).
  • ONNX Export: Azure ML with Optimum (Hugging Face); ONNX Runtime inference on Azure.
  • Evaluation: Azure ML Pipelines for benchmark runs.
  • Artefact Storage: Azure Blob Storage (PATCH version bundle).

15.3 GCP

  • Quantisation: Vertex AI Custom Training with bitsandbytes/AutoAWQ; GCP Model Garden compression tools.
  • Distillation: Vertex AI Training (student model).
  • ONNX Export: Vertex AI Custom Training with Optimum; TensorRT on T4/A100 VMs.
  • Evaluation: Vertex AI Pipelines.
  • Artefact Storage: GCS (PATCH version bundle).

15.4 On-Premises / Hybrid

  • Quantisation: bitsandbytes, AutoGPTQ, AutoAWQ on on-premises GPU; llama.cpp for GGUF quantisation on CPU-capable inference.
  • Distillation: Standard PyTorch training on GPU cluster (same as fine-tuning pipeline).
  • ONNX Export: torch.onnx.export + Optimum; TensorRT for NVIDIA GPU optimisation.
  • Evaluation: Eleuther LM Eval Harness on self-hosted compute.
  • Artefact Storage: MinIO S3-compatible storage.

16. Related Patterns

Pattern ID | Pattern Name | Relationship Type | Description
EAAPL-MDL001 | Model Versioning | Produces | Compression produces a new PATCH version; versioning infrastructure records it
EAAPL-MDL006 | Fine-Tuning Pipeline | Predecessor | Fine-tuned model is typically the input to compression; compression is post-training
EAAPL-MDL002 | Shadow Model Deployment | Next Step | Compressed model candidates enter shadow testing to validate production quality
EAAPL-MDL004 | Model Rollback | Safety Net | Rollback to uncompressed version is the recovery when compressed model fails in prod

17. Maturity Assessment

Overall Maturity: Proven

Dimension | Score (1–5) | Rationale
Industry Adoption | 4 | INT8/INT4 quantisation is production-standard; GPTQ/AWQ widely deployed
Tooling Availability | 4 | AutoGPTQ, AutoAWQ, bitsandbytes, TensorRT are production-ready libraries
Standards Alignment | 3 | EU AI Act re-validation requirement is clear; specific evaluation standards still developing
Implementation Complexity | 4 (high) | Calibration, evaluation protocol, regulatory re-validation add complexity
Regulatory Acceptance | 3 | Accepted as a model change; re-validation process requirements still being established by regulators

18. Revision History

Version | Date | Author | Summary of Changes
1.0 | 2026-06-12 | Enterprise AI Architecture Practice | Initial publication