EAAPL-SEC006 | Proven

AI Output Filtering

🔐 AI Security | APRA CPS234 | EU AI Act | 🏭 Field-tested in AU

[EAAPL-SEC006] AI Output Filtering

Category: Security / Response Validation
Sub-category: Post-Generation Content Controls
Version: 1.2
Maturity: Proven
Tags: output-safety, content-filtering, pii-detection, hallucination-flagging, guardrails, harmful-content, schema-validation
Regulatory Relevance: Australian Privacy Act, EU AI Act Art. 13 & 14, APRA CPS234, NIST AI RMF MANAGE 2.4, OWASP LLM02

1. Executive Summary

AI Output Filtering is the post-generation inspection and transformation pipeline that evaluates every LLM response before it is delivered to users or downstream systems. Where input controls (EAAPL-SEC002, SEC005) reduce the probability of harmful outputs, output filtering is the last line of defence — catching what gets through.

For business leaders, this pattern addresses a direct liability risk: organisations that deploy user-facing AI without output controls are accepting that every possible model output will reach end users unreviewed. LLMs produce harmful content, leak PII, give incorrect advice in domains requiring professional qualifications (financial, medical, legal), and hallucinate facts. These are not rare edge cases but a statistical property of the technology. Output filtering transforms this from an accepted risk to a managed one.

The pattern provides: PII leak detection (catching personal information present in model responses), harmful content filtering (blocking responses that violate content policy), hallucination flagging (scoring response confidence for downstream use), domain guardrails (blocking financial, medical, and legal advice in applications not licensed to provide it), and output schema validation (ensuring responses conform to expected structure for automated processing). It operates inline with minimal latency impact and supports both hard blocks and soft flags, allowing organisations to calibrate the balance between safety and user experience.

2. Problem Statement

Business Problem

LLMs deployed without output filtering expose organisations to three material risks simultaneously:

Data leakage: Models can regurgitate PII from their training data or from context window contents, leaking information about other users, producing outputs that identify a person directly, or generating content that becomes identifying when combined with other information.
Harmful content: Models can generate content that is harmful, offensive, or inappropriate, even with well-designed system prompts. Safety fine-tuning reduces but does not eliminate this risk.
Professional liability: Models can produce advice in regulated domains (financial, medical, legal) that the organisation is not licensed to provide. A chatbot that says "you should definitely sell those shares" creates liability regardless of the disclaimer in the system prompt.

Technical Problem

Model safety training reduces the probability of harmful output; it is not an absolute guarantee. Adversarial inputs (prompt injection, jailbreaks), unusual prompt combinations, and model version changes can all cause safety training to fail for specific inputs. Output filtering provides a deterministic second layer that operates independently of the model's internal safety training.

Symptoms

Model producing inappropriate content reaching users before detection (social media reports, support tickets).
PII from one user appearing in another user's conversation (context contamination).
Model giving specific financial/medical advice in a use case where that is inappropriate.
Model output format deviating from expected schema, causing downstream application errors.
No mechanism to detect hallucinated facts before they reach users.

Cost of Inaction

Dimension	Impact
Legal	Organisation liable for AI-generated harmful content, medical advice, or financial advice
Regulatory	Privacy breach from PII leakage; financial services regulatory action for unlicensed financial advice
Reputational	Public incident involving AI-generated harmful content causes significant brand damage
Operational	Downstream automation breaks when model output does not match expected schema
User Safety	Harmful content or dangerous advice reaches vulnerable users

3. Context

When to Apply

All user-facing AI applications generating freeform text responses.
AI systems that provide responses in regulated domains (financial services, healthcare, legal, education).
Applications where PII from multiple users may be present in the context window.
Automated pipelines where model outputs feed downstream systems and schema conformance is critical.
Agentic systems where model outputs become tool call arguments or actions.

When NOT to Apply

Fully offline, isolated systems with no external users.
Applications where all outputs are reviewed by a human before any action is taken (the human IS the output filter).
Developer tooling (code generation, code explanation) where content policy enforcement would create excessive friction for legitimate use.

Prerequisites

Prerequisite	Detail
AI Gateway (EAAPL-SEC001)	Output filter deployed as a response stage in the gateway
Content Policy Definitions	Organisation's content policy codified into detectable violation categories
Domain Guardrail Rules	Per-use-case rules defining which advice domains are prohibited
PII Detection Library	Microsoft Presidio or equivalent (shared with SEC005)

Industry Applicability

Industry	Applicability	Key Driver
Financial Services	Critical	AFSL obligations; unlicensed financial advice risk
Healthcare	Critical	Clinical advice liability; PHI in model outputs
Legal	Critical	Unlicensed legal advice; privilege leakage
E-commerce / Retail	High	Customer PII leakage; harmful content in customer service
Education	High	Content appropriate for minor users; factual accuracy
Government	High	Citizen data protection; official advice accuracy

4. Architecture Overview

The output filtering pipeline receives the raw model response immediately after generation — before it is returned to the AI Gateway's response path. The pipeline is designed for low-latency parallel operation: multiple filtering stages execute concurrently, with a final aggregation step that combines results and determines the composite disposition.
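
The concurrent execution described above can be sketched with `asyncio`. This is a minimal illustration, not the pattern's reference implementation: the stage functions (`pii_check`, `slow_check`), the `StageResult` type, the 50 ms timeout, and the choice to treat a timed-out stage as FLAG are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class StageResult:
    stage: str
    verdict: str            # "ALLOW", "FLAG", "SANITISE" or "BLOCK"
    timed_out: bool = False

async def run_stage(name, check, text, timeout_s):
    """Run one filter stage; a timeout is reported instead of raised."""
    try:
        verdict = await asyncio.wait_for(check(text), timeout=timeout_s)
        return StageResult(name, verdict)
    except asyncio.TimeoutError:
        # Assumption: a timed-out stage is surfaced as FLAG; policy may differ.
        return StageResult(name, "FLAG", timed_out=True)

async def run_pipeline(text, stages, timeout_s=0.05):
    """Start every stage concurrently and collect results in stage order."""
    tasks = [run_stage(name, check, text, timeout_s) for name, check in stages.items()]
    return await asyncio.gather(*tasks)

# Hypothetical stand-in stages, for illustration only.
async def pii_check(text):
    return "SANITISE" if "@" in text else "ALLOW"

async def slow_check(text):
    await asyncio.sleep(1.0)    # simulates an overloaded classifier
    return "ALLOW"

results = asyncio.run(
    run_pipeline("Contact jane@example.com", {"pii": pii_check, "slow": slow_check})
)
for r in results:
    print(r)
```

Total wall-clock time is bounded by the slowest stage or the timeout, whichever is shorter, rather than by the sum of the stages.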

PII Leak Detection

The first stage scans the model output for PII using the same detection engine as the input sanitisation pipeline (EAAPL-SEC005). Model outputs can contain PII in several ways: the model may hallucinate plausible-sounding personal details, it may reproduce PII from its training data, or it may leak PII from the context window (e.g., information about user A appearing in a response to user B due to shared conversation history). Detected PII triggers one of: inline redaction (PII replaced with type label), partial block (affected sentence removed), or full response block depending on the severity configuration.
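
As a rough sketch of the inline redaction action (PII replaced with a type label), the example below uses two regular expressions as a stand-in for a real detection engine. The patterns, the `PII_PATTERNS` table, and the `redact_pii` helper are hypothetical; an NER-based engine such as Presidio would replace them in practice.

```python
import re

# Hypothetical patterns for illustration; real detection needs an NER engine.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b04\d{2} ?\d{3} ?\d{3}\b"),   # AU mobile, local format only
}

def redact_pii(text):
    """Replace each detected entity with its type label; return text and findings."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        def repl(match, label=label):
            found.append((label, match.group(0)))
            return f"<{label}>"
        text = pattern.sub(repl, text)
    return text, found

redacted, entities = redact_pii("Call 0412 345 678 or email jane.doe@example.com.")
print(redacted)
print(entities)
```

The list of findings is what a severity configuration would inspect to choose between redaction, partial block, and full block.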

Harmful Content Classification

A multi-label content classifier evaluates the response against the organisation's content policy categories: hate speech, violence, self-harm, illegal activity facilitation, adult content, and harassment. The classifier returns probability scores per category. Responses exceeding thresholds for any enabled category are blocked. The classifier is typically a compact transformer model (for example DeBERTa or DistilBERT fine-tuned on content safety datasets) running on CPU with 20–40ms latency. For applications with strict content requirements (children's platforms, consumer health applications), thresholds are set more conservatively.
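
The thresholding step can be illustrated as follows. The classifier itself is out of scope here; the sketch assumes it has already produced per-category probabilities. The threshold values and both profile names are invented for the example, not recommended settings.

```python
# Hypothetical threshold profiles; real values come from policy tuning.
DEFAULT_THRESHOLDS = {"hate_speech": 0.80, "violence": 0.80, "self_harm": 0.70}
CHILDRENS_PLATFORM = {"hate_speech": 0.40, "violence": 0.40, "self_harm": 0.30}

def violated_categories(scores, thresholds):
    """Return enabled categories whose score exceeds the configured threshold."""
    return sorted(
        category
        for category, limit in thresholds.items()
        if scores.get(category, 0.0) > limit
    )

scores = {"hate_speech": 0.05, "violence": 0.55, "self_harm": 0.10}
print(violated_categories(scores, DEFAULT_THRESHOLDS))
print(violated_categories(scores, CHILDRENS_PLATFORM))
```

The same scores pass under the default profile and are blocked under the stricter one, which is how per-application conservatism is expressed without retraining the classifier.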

Domain Guardrails

This stage enforces use-case-specific content restrictions. For each application registered with the output filter, a guardrail configuration specifies prohibited advice domains and the filter action when they are detected:

Financial services application without AFSL: block responses containing specific financial recommendations ("you should buy/sell X").
Healthcare information site (not a medical service): block specific clinical diagnoses or treatment recommendations.
Legal information platform: block specific legal advice (as opposed to legal information).

Domain detection uses a combination of classifier models and pattern rules. Detected violations are replaced with appropriate alternative text (e.g., "For specific financial advice, please consult a licensed financial adviser.").
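
The pattern-rule half of domain detection might look like the sketch below, which replaces any sentence containing a buy/sell recommendation with the alternative text quoted above. The regular expression and the `apply_financial_guardrail` helper are hypothetical and cover only one phrasing; the classifier half is not shown.

```python
import re

# Hypothetical rule: catches only "you should (definitely) buy/sell ...".
FINANCIAL_ADVICE = re.compile(r"\byou should (?:definitely )?(?:buy|sell)\b", re.IGNORECASE)
REPLACEMENT = "For specific financial advice, please consult a licensed financial adviser."

def apply_financial_guardrail(response):
    """Swap offending sentences for the replacement text; report whether any fired."""
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    kept, triggered = [], False
    for sentence in sentences:
        if FINANCIAL_ADVICE.search(sentence):
            triggered = True
            if REPLACEMENT not in kept:     # insert the replacement only once
                kept.append(REPLACEMENT)
        else:
            kept.append(sentence)
    return " ".join(kept), triggered

text, fired = apply_financial_guardrail(
    "Shares can be volatile. You should definitely sell those shares."
)
print(text)
```

A rule this narrow is easy to paraphrase around, which is why the pattern pairs rules with a classifier.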

Hallucination Scoring

For RAG-based applications, the output is scored for grounding against the retrieved source documents. Sentences in the response are checked for: presence in the source corpus, factual consistency with sources, and citation accuracy. A hallucination risk score is attached to the response as metadata. Downstream systems can use this score to: flag responses for human review, present a confidence indicator to users, or trigger automatic regeneration with different parameters. The hallucination detection architecture is elaborated in EAAPL-OBS003.
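
To make the per-sentence scoring concrete, here is a deliberately simple sketch that uses token overlap with the retrieved passages as a proxy for grounding. It is not an NLI model and will miss paraphrases; the helper names and the choice of the maximum sentence risk as the overall score are assumptions for illustration.

```python
import re

def _tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def sentence_risk(sentence, sources):
    """1.0 minus the best token-overlap ratio against any source passage."""
    tokens = _tokens(sentence)
    if not tokens:
        return 0.0
    best = max((len(tokens & _tokens(src)) / len(tokens) for src in sources), default=0.0)
    return 1.0 - best

def hallucination_scores(response, sources):
    """Return (per-sentence risks, overall risk); overall = worst sentence (assumption)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    per_sentence = [sentence_risk(s, sources) for s in sentences]
    return per_sentence, max(per_sentence, default=0.0)

sources = ["The refund window is 30 days from delivery."]
per_sentence, overall = hallucination_scores(
    "The refund window is 30 days. Refunds are paid in bitcoin.", sources
)
print(per_sentence, overall)
```

The second sentence shares no tokens with the source, so it receives the highest risk and drives the overall score that downstream systems would act on.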

Output Schema Validation

For applications that expect structured model outputs (JSON, XML, specific formats), the response is validated against the registered output schema. Schema validation failures indicate either model output quality issues or adversarial manipulation of output format. Responses that fail validation trigger regeneration (up to a configured retry limit) before falling back to an error response.
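
The validate-then-retry loop could be sketched as below, using only the standard library. The two-field schema, the `generate` callable, and the shape of the returned dictionaries are assumptions for the example; a production validator would more likely use JSON Schema or Pydantic.

```python
import json

# Hypothetical registered schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "confidence": float}

def validate(raw):
    """Return the parsed object if it matches the schema, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data

def generate_validated(generate, max_retries=2):
    """Call generate() until output validates or the retry limit is reached."""
    for attempt in range(1 + max_retries):
        data = validate(generate())
        if data is not None:
            return {"status": "ok", "data": data, "attempts": attempt + 1}
    return {"status": "error", "reason": "schema_validation_failed",
            "attempts": 1 + max_retries}

# Simulated model: first output is malformed, second is valid.
outputs = iter(["not json", '{"answer": "42", "confidence": 0.9}'])
result = generate_validated(lambda: next(outputs))
print(result)
```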

Response Aggregation and Decision

All parallel filter stages emit verdicts and confidence scores. The aggregation layer applies the configured response policy: BLOCK (return error to caller), SANITISE (return modified response with violations removed), FLAG (return response with metadata flags for downstream handling), or ALLOW. The aggregation policy is configurable per application and per content category, allowing fine-grained calibration of the tradeoff between safety and user experience.
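
One plausible way to combine verdicts is "most severe action wins", sketched below. The severity ordering, the example policy, and the decision to block on categories missing from the policy are assumptions; the pattern itself only requires that the policy be configurable per application and per category.

```python
# Assumed ordering: a higher number overrides a lower one.
SEVERITY = {"ALLOW": 0, "FLAG": 1, "SANITISE": 2, "BLOCK": 3}

def aggregate(violations, policy):
    """Map each violated category to its configured action; return the most severe."""
    # Assumption: a category with no configured action is blocked.
    actions = [policy.get(category, "BLOCK") for category in violations]
    return max(actions, key=SEVERITY.__getitem__, default="ALLOW")

# Hypothetical per-application policy.
policy = {"pii": "SANITISE", "hallucination": "FLAG", "hate_speech": "BLOCK"}

print(aggregate([], policy))
print(aggregate(["hallucination", "pii"], policy))
print(aggregate(["pii", "hate_speech"], policy))
```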

5. Architecture Diagram

flowchart TD
    subgraph Input["Model Output"]
        A[Raw Model Response]
    end
    subgraph Filters["Parallel Filter Stages"]
        B[PII Leak Detector]
        C[Harmful Content Classifier]
        D[Domain Guardrails Engine]
        E[Hallucination Scorer]
    end
    subgraph Decision["Aggregation + Delivery"]
        F[Aggregation Layer]
        G{Composite Disposition}
        H[Audit Log]
    end
    A --> B
    A --> C
    A --> D
    A --> E
    B --> F
    C --> F
    D --> F
    E --> F
    F --> G
    G -->|block| I[Error Response]
    G -->|sanitise / allow| J[Caller Response]
    G --> H
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f3e8ff,stroke:#a855f7
    style H fill:#fef9c3,stroke:#eab308
    style I fill:#fee2e2,stroke:#ef4444
    style J fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
PII Leak Detector	NLP	Scans model output for PII; redacts or blocks	Microsoft Presidio, Amazon Comprehend PII detection, custom NER	Critical
Harmful Content Classifier	ML Classifier	Multi-label classification for content policy violations	Perspective API, Detoxify, Azure AI Content Safety, custom DeBERTa	Critical
Domain Guardrails Engine	Rule + ML	Detects prohibited advice domain content; applies configured response	Custom classifier + rules, Azure AI Content Safety categories	High
Hallucination Scorer	ML + Comparison	Scores output against source documents for grounding; produces confidence metadata	NLI model (DeBERTa), custom sentence-overlap scoring, TruLens	High
Output Schema Validator	Validation	Validates structured output conforms to expected schema; triggers retry on failure	JSON Schema validator, Pydantic, custom validator	High
Aggregation Layer	Decision Engine	Combines filter stage verdicts; applies response policy	Custom rule engine; OPA	Critical
Response Sanitiser	Transformation	Applies inline redactions and replacements to produce sanitised response	Custom text processor, Presidio anonymiser	High
Output Audit Logger	Compliance	Records all filter decisions and violations in immutable audit log	Same pipeline as AI Gateway audit log	Critical
Policy Configuration Store	Configuration	Per-application content policy thresholds, guardrail rules, schema definitions	Git-versioned YAML, database	Critical

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	AI Gateway	Routes raw model response to output filter pipeline	Raw response text + metadata (model, application, trace_id)
2	PII Detector	Scans response for PII entities; records detected entities	Detection results: entity type, span, confidence
3	Harmful Content Classifier	Classifies response against content policy categories; returns per-category probability scores	Classification scores per category
4	Domain Guardrails Engine	Evaluates response against use-case guardrail rules	Guardrail violations with rule IDs
5	Hallucination Scorer	Scores response sentences for grounding against source documents (if RAG context provided)	Hallucination risk score (0–1) per sentence; overall score
6	Schema Validator	Validates response conforms to expected output schema	VALID or INVALID with violation details
7	Aggregation Layer	Combines all stage results against configured response policy	Composite disposition: BLOCK / SANITISE / FLAG / ALLOW
8	Response Sanitiser	If SANITISE: applies inline PII redactions and guardrail replacements	Modified response text
9	Output Audit Logger	Records: application, trace_id, disposition, violations, scores, response_hash	Immutable audit record
10	AI Gateway	Returns final response (clean, sanitised, flagged, or error) to caller	Response with filter metadata headers
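
The audit record in step 9 can be sketched as follows. Storing a SHA-256 digest in `response_hash` rather than the response text is an assumption about intent (it keeps leaked PII out of the log while still allowing later matching); the `timestamp` field and the `build_audit_record` helper are likewise illustrative.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(application, trace_id, disposition, violations, scores, response_text):
    """Assemble one audit entry; the response is stored only as a SHA-256 digest."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),   # assumed extra field
        "application": application,
        "trace_id": trace_id,
        "disposition": disposition,
        "violations": sorted(violations),
        "scores": scores,
        "response_hash": hashlib.sha256(response_text.encode("utf-8")).hexdigest(),
    }

record = build_audit_record(
    application="support-bot",          # hypothetical application name
    trace_id="trace-123",               # hypothetical trace id
    disposition="SANITISE",
    violations=["pii"],
    scores={"pii": 0.97},
    response_text="Call <PHONE> for help.",
)
audit_line = json.dumps(record, sort_keys=True)   # one JSON line per decision
print(audit_line)
```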

Error Flow

Error	Handling	Status	Alert
PII detected in output (high severity)	BLOCK or SANITISE per policy	200 (sanitised) / 502 (blocked)	Privacy: PII leakage in model output
Harmful content above threshold	BLOCK with replacement message	200 (safe default message)	Security: content policy violation
Domain guardrail violation (financial/medical advice)	SANITISE with disclaimer replacement	200 (sanitised)	Compliance: guardrail triggered
Schema validation failure + retry exhausted	Return error	502	Warning: output schema failure
Hallucination score > 0.8	FLAG response with metadata	200 (flagged)	Info: high hallucination risk
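
The status codes in the table above can be read as a small mapping from filter outcome to HTTP response. In this sketch the event names, the wording of the safe default message, and the `X-Output-Filter-Flags` header are hypothetical; only the status codes and the 0.8 threshold come from the table.

```python
SAFE_DEFAULT_MESSAGE = "Sorry, I can't help with that."   # hypothetical wording

def to_http(event, response_text, hallucination_score=0.0):
    """Return (status, body, headers) for one filter outcome."""
    headers = {}
    if event in ("pii_blocked", "schema_retry_exhausted"):
        return 502, None, headers
    if event == "harmful_content":
        # Blocked content is replaced by a safe default and still returns 200.
        return 200, SAFE_DEFAULT_MESSAGE, headers
    if hallucination_score > 0.8:
        headers["X-Output-Filter-Flags"] = "hallucination-risk"   # hypothetical header
    return 200, response_text, headers

print(to_http("pii_blocked", "raw text"))
print(to_http("harmful_content", "raw text"))
print(to_http("ok", "grounded answer", hallucination_score=0.9))
```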

8. Security Considerations

Authentication & Authorisation

Output filter is an internal pipeline component; not directly accessible by external callers.
Per-application policy configuration protected by role-based access: only AI Platform administrators can modify thresholds.

OWASP LLM Top 10 Coverage

OWASP LLM Risk	Output Filter Mitigation	Coverage
LLM01: Prompt Injection	Catches consequences of successful injection (unexpected output content, structural anomalies)	Medium
LLM02: Insecure Output Handling	Core purpose: validates and sanitises outputs before delivery — directly addresses this risk	Critical
LLM03: Training Data Poisoning	Detects PII or unexpected content that may be a symptom of poisoned or leaked training data	Medium
LLM04: Model Denial of Service	Not applicable	None
LLM05: Supply Chain Vulnerabilities	Schema validation detects deviations that may result from unexpected model behaviour	Low
LLM06: Sensitive Information Disclosure	PII leak detection directly addresses this risk	Critical
LLM07: Insecure Plugin Design	Output validation of tool call arguments prevents injection via tool outputs	High
LLM08: Excessive Agency	Detecting and blocking commands embedded in model outputs prevents downstream agent manipulation	High
LLM09: Overreliance	Hallucination flagging + domain guardrails provide signals that mitigate overreliance	High
LLM10: Model Theft	Not directly applicable	None

9. Governance Considerations

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Content Policy Configuration	AI Governance	Reviewed quarterly; updated as needed	Documents threshold settings per application and rationale
Output Filter Violation Report	Compliance / AI Governance	Weekly	Trend analysis of violation types; identifies model behaviour changes
Hallucination Score Distribution	AI Quality	Monthly	Monitors RAG quality; triggers investigation if scores trend upward
Domain Guardrail Trigger Log	Legal / Compliance	Monthly	Evidence of guardrail effectiveness for regulatory purposes
PII Leakage Incident Register	Privacy Team	Continuous; monthly review	Records all PII detections in model outputs for Privacy Act compliance

10. Operational Considerations

SLOs

SLO	Target	Measurement
Output filter latency p99	<60ms (parallel execution)	Filter entry → aggregation span
PII false-negative rate	<0.5% of outputs with PII	Monthly sampled audit
Harmful content false-positive rate	<1% of legitimate responses blocked	Monthly sampled user feedback
Schema validation accuracy	100% (no non-conformant outputs passed)	Schema validation metric
Filter availability	99.9% (fail-closed if unavailable)	Health check monitoring
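
The fail-closed behaviour in the availability SLO amounts to never returning an unfiltered response when the filter cannot run. A minimal sketch, assuming the filter is reachable through a single callable and that a BLOCK result is represented as a dictionary (both hypothetical):

```python
def filter_or_fail_closed(filter_fn, response_text):
    """Run the filter; if it raises for any reason, block instead of passing through."""
    try:
        return filter_fn(response_text)
    except Exception:
        # Fail closed: the unfiltered response text is never returned.
        return {"disposition": "BLOCK", "text": None, "reason": "filter_unavailable"}

def healthy_filter(text):
    return {"disposition": "ALLOW", "text": text, "reason": None}

def unavailable_filter(text):
    raise ConnectionError("classifier endpoint unreachable")   # simulated outage

print(filter_or_fail_closed(healthy_filter, "hello"))
print(filter_or_fail_closed(unavailable_filter, "hello"))
```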

11. Cost Considerations

Cost Drivers

Cost Driver	Description	Relative Impact
Content classifier inference	CPU/GPU compute for multi-label classification	High
NLI model for hallucination scoring	Expensive relative to other stages; skip for non-RAG applications	High
PII detection (shared with SEC005)	Amortised if same service as input pipeline	Medium
Response regeneration retries	Schema failure retries consume additional model tokens	Low–Medium

Indicative Cost Range

Scale	Monthly Cost (USD)	Notes
Small (< 1M responses/day)	$400–$1,000	CPU inference; shared Presidio service
Medium (1M–10M responses/day)	$2,000–$8,000	GPU inference for classifier + NLI; autoscaling
Large (> 10M responses/day)	$10,000–$35,000	GPU cluster; skip NLI for non-RAG paths

12. Trade-Off Analysis

Option Comparison

Option	Description	Pros	Cons	Best For
A: Model safety training only	Rely on provider's safety fine-tuning; no output filter	Zero latency; zero cost	No control; no audit; safety training bypassed by adversarial inputs	Not recommended for production user-facing applications
B: Rule-based output filter	Pattern matching on outputs for known violation strings	Fast; deterministic; low cost	Easily bypassed by paraphrasing; cannot detect nuanced content	Low-risk internal tools; supplement to other controls
C: Full parallel classifier pipeline (this pattern)	Parallel ML classifiers for PII, content, domain, hallucination	Comprehensive coverage; parallel = low latency; per-application configurability	ML ops burden; cost at scale; false positives require tuning	Production user-facing applications; regulated use cases
D: Human review	All AI outputs reviewed by human before delivery	Perfect accuracy for human judgment	Impossible at scale; not practical for real-time AI	Highest-risk use cases only (e.g., regulatory filings)

Architectural Tensions

Tension	Trade-Off
Safety vs User Experience	Overly aggressive filtering blocks legitimate responses and degrades user experience. Resolution: start permissive; tighten thresholds as violation patterns are understood; use SANITISE before BLOCK wherever possible.
Latency vs Coverage	More filters = more latency. Parallel execution mitigates this but does not eliminate it. Resolution: run all filters in parallel; time out slow filters and return their result with degraded confidence rather than blocking the response.
Hallucination Detection vs Cost	NLI-based hallucination scoring is computationally expensive. Resolution: enable only for RAG-based applications; skip for non-grounded generation.

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Content classifier model degradation	Medium	High (harmful content reaches users)	Weekly automated test suite against known violation examples	Rollback to previous model version; emergency rule-based backup
PII false negative in output	Medium	High (Privacy Act risk)	Post-hoc audit sampling	Improve NER model; add specific pattern for missed entity type
Hallucination scorer false positives	Medium	Medium (legitimate responses flagged; user confusion)	User feedback; high FLAG rate	Adjust hallucination threshold; review NLI model
Output filter latency spike (classifier overloaded)	Medium	Medium (gateway SLO breach)	Filter latency metric	Autoscale classifier instances; reduce parallelism to prioritise PII+harmful
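
The weekly automated test suite used to detect classifier degradation could be as simple as measuring the detection rate on a fixed set of known violation examples and comparing it to a floor. The example strings, the toy classifier, and the 0.95 minimum below are placeholders, not values from this pattern.

```python
def detection_rate(classify, known_violations):
    """Fraction of known violation examples that the classifier flags."""
    if not known_violations:
        raise ValueError("known_violations must not be empty")
    caught = sum(1 for text in known_violations if classify(text))
    return caught / len(known_violations)

def should_roll_back(rate, minimum=0.95):
    """Assumed rule: roll back when the detection rate falls below the minimum."""
    return rate < minimum

# Placeholder examples and a toy keyword classifier, for illustration only.
KNOWN_VIOLATIONS = ["violation alpha", "violation beta", "paraphrased bad content"]

def toy_classifier(text):
    return "violation" in text

rate = detection_rate(toy_classifier, KNOWN_VIOLATIONS)
print(rate, should_roll_back(rate))
```

Here the toy classifier misses the paraphrased example, so the rate falls below the floor and the check recommends rollback.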

14. Regulatory Considerations

Regulation	Requirement	Implementation
Australian Privacy Act — APP 6	Use or disclose personal information only for the primary purpose of collection	PII detection prevents inadvertent disclosure via AI output
EU AI Act Art. 13 (Transparency)	High-risk AI system outputs must be interpretable to users	Hallucination scoring provides interpretability metadata
EU AI Act Art. 14 (Human Oversight)	High-risk AI systems must allow human oversight	FLAG disposition routes outputs for human review
AFSL obligations (ASIC RG 255)	Licensed financial advice must be appropriate; unlicensed advice is prohibited	Domain guardrails block specific financial recommendations in unlicensed applications
AHPRA/Medical Board	AI must not provide individual clinical diagnoses	Domain guardrails block specific clinical advice in non-clinical applications

15. Reference Implementations

AWS

Component	AWS Service
Content classifier	Amazon Comprehend custom classification or SageMaker (DeBERTa fine-tuned)
PII detection	Amazon Comprehend PII Detection
Hallucination scoring	SageMaker NLI endpoint
Domain guardrails	Lambda (custom rule engine) + Bedrock Guardrails
Output audit log	Kinesis Firehose → S3 Object Lock

Azure

Component	Azure Service
Content classifier (harmful content)	Azure AI Content Safety
PII detection	Azure AI Language (PII detection)
Hallucination scoring	Azure AI Content Safety groundedness detection, or a self-hosted NLI model
Output audit log	Event Hub → Immutable Blob

On-Premises

Component	Technology
Content classifier	DeBERTa fine-tuned on OWASP/safety dataset, ONNX Runtime
PII detection	Presidio (shared with SEC005)
Hallucination scoring	HHEM (Vectara's Hughes Hallucination Evaluation Model, available on Hugging Face)
Domain guardrails	OPA rules + custom classifier
Output audit log	Kafka → Elasticsearch

16. Related Patterns

Pattern	ID	Relationship
AI Gateway	EAAPL-SEC001	Output filter is a response stage in the gateway
Prompt Firewall	EAAPL-SEC002	Defence pair: SEC002 at input; SEC006 at output
LLM Input Sanitisation	EAAPL-SEC005	Input/output pair for PII governance
Hallucination Detection	EAAPL-OBS003	SEC006 hallucination scoring is the runtime detection; OBS003 is the monitoring pattern
AI Data Classification	EAAPL-SEC009	Output classification labels applied by SEC009 feed SEC006 disposition decisions

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Rationale
Pattern definition clarity	5	Clear stages and decision model
Technology availability	4	Strong commercial options (Azure Content Safety); hallucination scoring tooling maturing
Industry adoption	4	Widely recognised requirement; implementation completeness varies
Regulatory alignment	5	Directly addresses EU AI Act Art. 13/14, Privacy Act, and domain-specific obligations
Operational tooling	4	Good commercial tooling; hallucination scoring requires custom integration

18. Revision History

Version	Date	Author	Changes
1.0	2024-03-15	Security Architecture Team	Initial pattern definition
1.1	2024-07-01	Security Architecture Team	Added domain guardrails for financial/medical/legal; hallucination scoring integration
1.2	2025-01-20	Security Architecture Team	Updated OWASP mapping; added Australian regulatory context; expanded failure modes
