EAAPL: Enterprise AI Architecture Pattern Library

AI Output Filtering

🔐 AI Security · APRA CPS234 · EU AI Act · 🏭 Field-tested in AU

[EAAPL-SEC006] AI Output Filtering

Category: Security / Response Validation
Sub-category: Post-Generation Content Controls
Version: 1.2
Maturity: Proven
Tags: output-safety, content-filtering, pii-detection, hallucination-flagging, guardrails, harmful-content, schema-validation
Regulatory Relevance: Australian Privacy Act, EU AI Act Art. 13 & 14, APRA CPS234, NIST AI RMF MANAGE 2.4, OWASP LLM02


1. Executive Summary

AI Output Filtering is the post-generation inspection and transformation pipeline that evaluates every LLM response before it is delivered to users or downstream systems. Where input controls (EAAPL-SEC002, SEC005) reduce the probability of harmful outputs, output filtering is the last line of defence — catching what gets through.

For business leaders, this pattern addresses a direct liability risk: organisations that deploy user-facing AI without output controls are accepting that any model output may reach end users unreviewed. LLMs produce harmful content, leak PII, give incorrect advice in domains requiring professional qualifications (financial, medical, legal), and hallucinate facts, not as rare edge cases but as a statistical property of the technology. Output filtering transforms this from an accepted risk to a managed one.

The pattern provides: PII leak detection (catching personal information present in model responses), harmful content filtering (blocking responses that violate content policy), hallucination flagging (scoring response confidence for downstream use), domain guardrails (blocking financial, medical, and legal advice in applications not licensed to provide it), and output schema validation (ensuring responses conform to expected structure for automated processing). It operates inline with minimal latency impact and supports both hard blocks and soft flags, allowing organisations to calibrate the balance between safety and user experience.


2. Problem Statement

Business Problem

LLMs deployed without output filtering expose organisations to three material risks simultaneously:

  1. Data leakage: Models can regurgitate PII from their training data or from context window contents, leaking information about other users, producing directly identifying output, or generating content that becomes identifying when combined with other available information.
  2. Harmful content: Models can generate content that is harmful, offensive, or inappropriate, even with well-designed system prompts. Safety fine-tuning reduces but does not eliminate this risk.
  3. Professional liability: Models can produce advice in regulated domains (financial, medical, legal) that the organisation is not licensed to provide. A chatbot that says "you should definitely sell those shares" creates liability regardless of the disclaimer in the system prompt.

Technical Problem

Model safety training reduces the probability of harmful output; it is not an absolute guarantee. Adversarial inputs (prompt injection, jailbreaks), unusual prompt combinations, and model version changes can all cause safety training to fall short for specific inputs. Output filtering provides a deterministic second layer that operates independently of the model's internal safety training.

Symptoms

  • Model producing inappropriate content reaching users before detection (social media reports, support tickets).
  • PII from one user appearing in another user's conversation (context contamination).
  • Model giving specific financial/medical advice in a use case where that is inappropriate.
  • Model output format deviating from expected schema, causing downstream application errors.
  • No mechanism to detect hallucinated facts before they reach users.

Cost of Inaction

| Dimension | Impact |
|---|---|
| Legal | Organisation liable for AI-generated harmful content, medical advice, or financial advice |
| Regulatory | Privacy breach from PII leakage; financial services regulatory action for unlicensed financial advice |
| Reputational | Public incident involving AI-generated harmful content causes significant brand damage |
| Operational | Downstream automation breaks when model output does not match expected schema |
| User Safety | Harmful content or dangerous advice reaches vulnerable users |

3. Context

When to Apply

  • All user-facing AI applications generating freeform text responses.
  • AI systems that provide responses in regulated domains (financial services, healthcare, legal, education).
  • Applications where PII from multiple users may be present in the context window.
  • Automated pipelines where model outputs feed downstream systems and schema conformance is critical.
  • Agentic systems where model outputs become tool call arguments or actions.

When NOT to Apply

  • Fully offline, isolated systems with no external users.
  • Applications where all outputs are reviewed by a human before any action is taken (the human IS the output filter).
  • Developer tooling (code generation, code explanation) where content policy enforcement would create excessive friction for legitimate use.

Prerequisites

| Prerequisite | Detail |
|---|---|
| AI Gateway (EAAPL-SEC001) | Output filter deployed as a response stage in the gateway |
| Content Policy Definitions | Organisation's content policy codified into detectable violation categories |
| Domain Guardrail Rules | Per-use-case rules defining which advice domains are prohibited |
| PII Detection Library | Microsoft Presidio or equivalent (shared with SEC005) |

Industry Applicability

| Industry | Applicability | Key Driver |
|---|---|---|
| Financial Services | Critical | AFSL obligations; unlicensed financial advice risk |
| Healthcare | Critical | Clinical advice liability; PHI in model outputs |
| Legal | Critical | Unlicensed legal advice; privilege leakage |
| E-commerce / Retail | High | Customer PII leakage; harmful content in customer service |
| Education | High | Content appropriate for minor users; factual accuracy |
| Government | High | Citizen data protection; official advice accuracy |

4. Architecture Overview

The output filtering pipeline receives the raw model response immediately after generation — before it is returned to the AI Gateway's response path. The pipeline is designed for low-latency parallel operation: multiple filtering stages execute concurrently, with a final aggregation step that combines results and determines the composite disposition.
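The concurrent fan-out described above can be sketched as follows. This is a minimal illustration using a thread pool, not the pattern's reference implementation; `run_stages_in_parallel`, `pii_stage`, and `content_stage` are hypothetical names, and the stage bodies are trivial stand-ins for real detectors.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stages_in_parallel(response_text, stages):
    """Run every filter stage on the same response concurrently.

    `stages` maps a stage name to a callable taking the response text.
    Returns {stage_name: stage_result}.
    """
    with ThreadPoolExecutor(max_workers=len(stages)) as pool:
        futures = {name: pool.submit(fn, response_text) for name, fn in stages.items()}
        # .result() blocks until that stage has finished.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-in stages; real stages would call the detectors.
def pii_stage(text):
    return {"verdict": "SANITISE" if "@" in text else "ALLOW"}


def content_stage(text):
    return {"verdict": "ALLOW"}


results = run_stages_in_parallel(
    "Contact me at jane@example.com",
    {"pii": pii_stage, "harmful_content": content_stage},
)
```

With this shape, total latency is roughly that of the slowest stage rather than the sum of all stages, which is the property the pipeline relies on.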

PII Leak Detection

The first stage scans the model output for PII using the same detection engine as the input sanitisation pipeline (EAAPL-SEC005). Model outputs can contain PII in several ways: the model may hallucinate plausible-sounding personal details, it may reproduce PII from its training data, or it may leak PII from the context window (e.g., information about user A appearing in a response to user B due to shared conversation history). Detected PII triggers one of: inline redaction (PII replaced with type label), partial block (affected sentence removed), or full response block depending on the severity configuration.
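As an illustration of the inline redaction action (PII replaced with a type label), here is a simplified regex-based sketch. It is a stand-in for a real detection engine such as Presidio, not a use of its API; the two patterns, the `PII_PATTERNS` table, and the `redact_pii` helper are assumptions made for the example and would miss most real-world PII.

```python
import re

# Illustrative patterns only; a production detector needs NER and many more entity types.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "AU_PHONE": re.compile(r"(?:\+61\s?|0)4\d{2}\s?\d{3}\s?\d{3}"),
}


def redact_pii(text):
    """Return (redacted_text, detections); spans refer to the original text."""
    detections = []
    for entity_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            detections.append({"type": entity_type, "span": match.span()})
    redacted = text
    for entity_type, pattern in PII_PATTERNS.items():
        # Inline redaction: replace each match with its type label.
        redacted = pattern.sub(f"<{entity_type}>", redacted)
    return redacted, detections


redacted, detections = redact_pii("Reach Jane at jane.doe@example.com or 0412 345 678.")
```

The detections list carries entity type and span so the aggregation layer can decide between redaction, partial block, and full block according to the severity configuration.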

Harmful Content Classification

A multi-label content classifier evaluates the response against the organisation's content policy categories: hate speech, violence, self-harm, illegal activity facilitation, adult content, and harassment. The classifier returns probability scores per category. Responses exceeding thresholds for any enabled category are blocked. The classifier is typically a compact transformer model (such as DeBERTa or DistilBERT fine-tuned on content safety datasets) running on CPU with 20–40ms latency. For applications with strict content requirements (children's platforms, consumer health applications), thresholds are set more conservatively.
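The per-category threshold check might look like the sketch below. The category keys, the threshold values, and the `violated_categories` helper are illustrative assumptions; the scores would come from whichever classifier is deployed.

```python
# Illustrative thresholds; real values are set per application in the policy store.
GENERAL_THRESHOLDS = {"hate_speech": 0.8, "violence": 0.8, "self_harm": 0.5}
STRICT_THRESHOLDS = {"hate_speech": 0.3, "violence": 0.3, "self_harm": 0.05}


def violated_categories(scores, thresholds):
    """Return the enabled categories whose score exceeds its threshold."""
    return sorted(
        category
        for category, threshold in thresholds.items()
        if scores.get(category, 0.0) > threshold
    )


# Hypothetical classifier output for one response.
scores = {"hate_speech": 0.02, "violence": 0.91, "self_harm": 0.10}

general = violated_categories(scores, GENERAL_THRESHOLDS)
strict = violated_categories(scores, STRICT_THRESHOLDS)
```

The same scores yield one violation under the general profile and two under the stricter one, which is how a children's platform and an internal tool can share a classifier while differing in policy.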

Domain Guardrails

This stage enforces use-case-specific content restrictions. For each application registered with the output filter, a guardrail configuration specifies prohibited advice domains and the filter action when they are detected:

  • Financial services application without AFSL: block responses containing specific financial recommendations ("you should buy/sell X").
  • Healthcare information site (not a medical service): block specific clinical diagnoses or treatment recommendations.
  • Legal information platform: block specific legal advice (as opposed to legal information).

Domain detection uses a combination of classifier models and pattern rules. Detected violations are replaced with appropriate alternative text (e.g., "For specific financial advice, please consult a licensed financial adviser.").
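The pattern-rule half of domain detection can be sketched as below. The single regex and the `apply_financial_guardrail` helper are hypothetical and deliberately narrow; as Section 12 notes for rule-based filters, patterns like this are easily bypassed by paraphrasing, which is why the pattern pairs them with classifier models.

```python
import re

# Illustrative rule for "you should buy/sell X" style recommendations.
FINANCIAL_ADVICE = re.compile(r"\byou should (?:definitely )?(?:buy|sell)\b", re.IGNORECASE)
REPLACEMENT = "For specific financial advice, please consult a licensed financial adviser."


def apply_financial_guardrail(response):
    """Replace sentences matching the rule; return (text, violated)."""
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    kept, violated = [], False
    for sentence in sentences:
        if FINANCIAL_ADVICE.search(sentence):
            violated = True
            if REPLACEMENT not in kept:  # insert the alternative text only once
                kept.append(REPLACEMENT)
        else:
            kept.append(sentence)
    return " ".join(kept), violated


text, violated = apply_financial_guardrail(
    "Shares can be volatile. You should definitely sell those shares."
)
```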

Hallucination Scoring

For RAG-based applications, the output is scored for grounding against the retrieved source documents. Sentences in the response are checked for: presence in the source corpus, factual consistency with sources, and citation accuracy. A hallucination risk score is attached to the response as metadata. Downstream systems can use this score to: flag responses for human review, present a confidence indicator to users, or trigger automatic regeneration with different parameters. The hallucination detection architecture is elaborated in EAAPL-OBS003.
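A very rough sketch of per-sentence grounding scoring follows, using lexical overlap with the retrieved sources. This is only the "sentence-overlap" option listed in the components table, not NLI-based consistency checking; the tokenisation, the scoring formula, and the choice of the worst sentence as the overall score are assumptions for illustration.

```python
import re


def _tokens(text):
    """Lower-cased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def sentence_risk(sentence, sources):
    """Risk in [0, 1]: 1 minus the best fraction of sentence words found in any one source."""
    words = _tokens(sentence)
    if not words:
        return 0.0
    best_support = max(
        (len(words & _tokens(source)) / len(words) for source in sources),
        default=0.0,
    )
    return 1.0 - best_support


def score_response(response, sources):
    """Return (per_sentence_risks, overall_risk); overall is the worst sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    per_sentence = [sentence_risk(s, sources) for s in sentences]
    return per_sentence, max(per_sentence, default=0.0)


sources = ["The refund window is 30 days from delivery."]
per_sentence, overall = score_response(
    "The refund window is 30 days. Refunds are paid in gold coins.", sources
)
```

Lexical overlap cannot detect a sentence that reuses source words to state the opposite, so a production scorer would typically add an entailment model on top of a cheap pre-filter like this one.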

Output Schema Validation

For applications that expect structured model outputs (JSON, XML, specific formats), the response is validated against the registered output schema. Schema validation failures indicate either model output quality issues or adversarial manipulation of output format. Invalid schemas trigger response regeneration (up to a configured retry limit) before falling back to an error response.
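The validate, regenerate, then fall back loop can be sketched with the standard library alone. The two-field schema, the `validate` and `generate_validated` helpers, and the `generate` callable (which stands in for a model call) are hypothetical; a real deployment would more likely use a JSON Schema validator or Pydantic as listed in the components table.

```python
import json

# Hypothetical registered schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "confidence": float}


def validate(raw):
    """Return (parsed, None) if valid, else (None, error_message)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "top-level value must be an object"
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            return None, f"field '{field}' missing or not {expected.__name__}"
    return data, None


def generate_validated(generate, max_retries=2):
    """Call `generate` until its output validates or the retry limit is reached."""
    last_error = None
    for _ in range(1 + max_retries):
        data, last_error = validate(generate())
        if data is not None:
            return {"status": "ok", "data": data}
    return {"status": "error", "detail": last_error}


# Simulated model: first attempt is malformed, second is valid.
attempts = iter(["not json", '{"answer": "42", "confidence": 0.9}'])
result = generate_validated(lambda: next(attempts))
```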

Response Aggregation and Decision

All parallel filter stages emit verdicts and confidence scores. The aggregation layer applies the configured response policy: BLOCK (return error to caller), SANITISE (return modified response with violations removed), FLAG (return response with metadata flags for downstream handling), or ALLOW. The aggregation policy is configurable per application and per content category, allowing fine-grained calibration of the tradeoff between safety and user experience.
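One plausible way to combine stage verdicts is to map each violated category to its configured action and take the most severe. The severity ordering (BLOCK over SANITISE over FLAG over ALLOW), the example policy, and the choice to block on unknown categories are assumptions for this sketch; the source only states that the policy is configurable per application and per category.

```python
# Assumed ordering: higher number wins when several categories are violated.
SEVERITY = {"ALLOW": 0, "FLAG": 1, "SANITISE": 2, "BLOCK": 3}

# Hypothetical per-application policy: violated category -> action.
SUPPORT_BOT_POLICY = {
    "pii": "SANITISE",
    "harmful_content": "BLOCK",
    "domain_guardrail": "SANITISE",
    "hallucination": "FLAG",
}


def aggregate(violations, policy):
    """Return the composite disposition for a list of violated categories."""
    # Categories missing from the policy are blocked (a fail-closed assumption).
    actions = [policy.get(category, "BLOCK") for category in violations]
    return max(actions, key=SEVERITY.__getitem__, default="ALLOW")


disposition = aggregate(["pii", "hallucination"], SUPPORT_BOT_POLICY)
```

Keeping the ordering in one table makes it easy to review, and the per-category policy is where an application chooses SANITISE over BLOCK for a given category.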


5. Architecture Diagram

ARCHITECTURE DIAGRAM
flowchart TD
    subgraph Input["Model Output"]
        A[Raw Model Response]
    end
    subgraph Filters["Parallel Filter Stages"]
        B[PII Leak Detector]
        C[Harmful Content Classifier]
        D[Domain Guardrails Engine]
        E[Hallucination Scorer]
    end
    subgraph Decision["Aggregation + Delivery"]
        F[Aggregation Layer]
        G{Composite Disposition}
        H[Audit Log]
    end
    A --> B
    A --> C
    A --> D
    A --> E
    B --> F
    C --> F
    D --> F
    E --> F
    F --> G
    G -->|block| I[Error Response]
    G -->|sanitise / allow| J[Caller Response]
    G --> H
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f3e8ff,stroke:#a855f7
    style H fill:#fef9c3,stroke:#eab308
    style I fill:#fee2e2,stroke:#ef4444
    style J fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| PII Leak Detector | NLP | Scans model output for PII; redacts or blocks | Microsoft Presidio, AWS Comprehend PII, custom NER | Critical |
| Harmful Content Classifier | ML Classifier | Multi-label classification for content policy violations | Perspective API, Detoxify, Azure AI Content Safety, custom DeBERTa | Critical |
| Domain Guardrails Engine | Rule + ML | Detects prohibited advice domain content; applies configured response | Custom classifier + rules, Azure AI Content Safety categories | High |
| Hallucination Scorer | ML + Comparison | Scores output against source documents for grounding; produces confidence metadata | NLI model (DeBERTa), custom sentence-overlap scoring, TruLens | High |
| Output Schema Validator | Validation | Validates structured output conforms to expected schema; triggers retry on failure | JSON Schema validator, Pydantic, custom validator | High |
| Aggregation Layer | Decision Engine | Combines filter stage verdicts; applies response policy | Custom rule engine; OPA | Critical |
| Response Sanitiser | Transformation | Applies inline redactions and replacements to produce sanitised response | Custom text processor, Presidio anonymiser | High |
| Output Audit Logger | Compliance | Records all filter decisions and violations in immutable audit log | Same pipeline as AI Gateway audit log | Critical |
| Policy Configuration Store | Configuration | Per-application content policy thresholds, guardrail rules, schema definitions | Git-versioned YAML, database | Critical |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | AI Gateway | Routes raw model response to output filter pipeline | Raw response text + metadata (model, application, trace_id) |
| 2 | PII Detector | Scans response for PII entities; records detected entities | Detection results: entity type, span, confidence |
| 3 | Harmful Content Classifier | Classifies response against content policy categories; returns per-category probability scores | Classification scores per category |
| 4 | Domain Guardrails Engine | Evaluates response against use-case guardrail rules | Guardrail violations with rule IDs |
| 5 | Hallucination Scorer | Scores response sentences for grounding against source documents (if RAG context provided) | Hallucination risk score (0–1) per sentence; overall score |
| 6 | Schema Validator | Validates response conforms to expected output schema | VALID or INVALID with violation details |
| 7 | Aggregation Layer | Combines all stage results against configured response policy | Composite disposition: BLOCK / SANITISE / FLAG / ALLOW |
| 8 | Response Sanitiser | If SANITISE: applies inline PII redactions and guardrail replacements | Modified response text |
| 9 | Output Audit Logger | Records: application, trace_id, disposition, violations, scores, response_hash | Immutable audit record |
| 10 | AI Gateway | Returns final response (clean, sanitised, flagged, or error) to caller | Response with filter metadata headers |
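The audit record from step 9 could be assembled along these lines. The field names follow the list in the table; the `build_audit_record` helper, the added timestamp field, and the use of SHA-256 for `response_hash` are assumptions. Storing a hash rather than the response text keeps any leaked PII out of the log while still allowing a stored response to be matched to its record.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(application, trace_id, disposition, violations, scores, response_text):
    """Build one audit entry; the raw response text is hashed, never stored."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "application": application,
        "trace_id": trace_id,
        "disposition": disposition,
        "violations": list(violations),
        "scores": dict(scores),
        "response_hash": hashlib.sha256(response_text.encode("utf-8")).hexdigest(),
    }


record = build_audit_record(
    application="support-bot",   # hypothetical application name
    trace_id="trace-123",        # hypothetical trace identifier
    disposition="SANITISE",
    violations=["pii"],
    scores={"pii": 0.97},
    response_text="Call Jane on 0412 345 678",
)
serialised = json.dumps(record, sort_keys=True)
```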

Error Flow

| Error | Handling | Status | Alert |
|---|---|---|---|
| PII detected in output (high severity) | BLOCK or SANITISE per policy | 200 (sanitised) / 502 (blocked) | Privacy: PII leakage in model output |
| Harmful content above threshold | BLOCK with replacement message | 200 (safe default message) | Security: content policy violation |
| Domain guardrail violation (financial/medical advice) | SANITISE with disclaimer replacement | 200 (sanitised) | Compliance: guardrail triggered |
| Schema validation failure + retry exhausted | Return error | 502 | Warning: output schema failure |
| Hallucination score > 0.8 | FLAG response with metadata | 200 (flagged) | Info: high hallucination risk |

8. Security Considerations

Authentication & Authorisation

  • Output filter is an internal pipeline component; not directly accessible by external callers.
  • Per-application policy configuration protected by role-based access: only AI Platform administrators can modify thresholds.

OWASP LLM Top 10 Coverage

| OWASP LLM Risk | Output Filter Mitigation | Coverage |
|---|---|---|
| LLM01: Prompt Injection | Catches consequences of successful injection (unexpected output content, structural anomalies) | Medium |
| LLM02: Insecure Output Handling | Core purpose: validates and sanitises outputs before delivery, directly addressing this risk | Critical |
| LLM03: Training Data Poisoning | Detects PII or unexpected content in outputs that may indicate poisoned or memorised training data | Medium |
| LLM04: Model Denial of Service | Not applicable | None |
| LLM05: Supply Chain Vulnerabilities | Schema validation detects deviations that may result from unexpected model behaviour | Low |
| LLM06: Sensitive Information Disclosure | PII leak detection directly addresses this risk | Critical |
| LLM07: Insecure Plugin Design | Output validation of tool call arguments prevents injection via tool outputs | High |
| LLM08: Excessive Agency | Detecting and blocking commands embedded in model outputs prevents downstream agent manipulation | High |
| LLM09: Overreliance | Hallucination flagging + domain guardrails provide signals that mitigate overreliance | High |
| LLM10: Model Theft | Not directly applicable | None |

9. Governance Considerations

Governance Artefacts

| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Content Policy Configuration | AI Governance | Reviewed quarterly; updated as needed | Documents threshold settings per application and rationale |
| Output Filter Violation Report | Compliance / AI Governance | Weekly | Trend analysis of violation types; identifies model behaviour changes |
| Hallucination Score Distribution | AI Quality | Monthly | Monitors RAG quality; triggers investigation if scores trend upward |
| Domain Guardrail Trigger Log | Legal / Compliance | Monthly | Evidence of guardrail effectiveness for regulatory purposes |
| PII Leakage Incident Register | Privacy Team | Continuous; monthly review | Records all PII detections in model outputs for Privacy Act compliance |

10. Operational Considerations

SLOs

| SLO | Target | Measurement |
|---|---|---|
| Output filter latency p99 | <60ms (parallel execution) | Filter entry → aggregation span |
| PII false-negative rate | <0.5% of outputs with PII | Monthly sampled audit |
| Harmful content false-positive rate | <1% of legitimate responses blocked | Monthly sampled user feedback |
| Schema validation accuracy | 100% (no non-conformant outputs passed) | Schema validation metric |
| Filter availability | 99.9% (fail-closed if unavailable) | Health check monitoring |
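The fail-closed behaviour in the availability SLO can be sketched as a thin wrapper: if the filter cannot produce a disposition, the response is blocked rather than passed through. The `filter_or_fail_closed` name, the result shape, and the reason string are hypothetical.

```python
def filter_or_fail_closed(run_filter, response_text):
    """Return the filter's result, or a BLOCK disposition if the filter fails."""
    try:
        return run_filter(response_text)
    except Exception as exc:  # broad on purpose: any filter failure must block
        return {
            "disposition": "BLOCK",
            "reason": "output_filter_unavailable",
            "error": type(exc).__name__,
        }


def unavailable_filter(text):
    # Simulates the filter service being unreachable.
    raise ConnectionError("classifier endpoint unreachable")


def healthy_filter(text):
    return {"disposition": "ALLOW"}


blocked = filter_or_fail_closed(unavailable_filter, "some model output")
allowed = filter_or_fail_closed(healthy_filter, "some model output")
```

Failing closed trades availability for safety: every filter outage becomes a user-visible error, which is why the 99.9% target applies to the filter itself.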

11. Cost Considerations

Cost Drivers

| Cost Driver | Description | Relative Impact |
|---|---|---|
| Content classifier inference | CPU/GPU compute for multi-label classification | High |
| NLI model for hallucination scoring | Expensive relative to other stages; skip for non-RAG applications | High |
| PII detection (shared with SEC005) | Amortised if same service as input pipeline | Medium |
| Response regeneration retries | Schema failure retries consume additional model tokens | Low–Medium |

Indicative Cost Range

| Scale | Monthly Cost (USD) | Notes |
|---|---|---|
| Small (< 1M responses/day) | $400–$1,000 | CPU inference; shared Presidio service |
| Medium (1M–10M responses/day) | $2,000–$8,000 | GPU inference for classifier + NLI; autoscaling |
| Large (> 10M responses/day) | $10,000–$35,000 | GPU cluster; skip NLI for non-RAG paths |

12. Trade-Off Analysis

Option Comparison

| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| A: Model safety training only | Rely on provider's safety fine-tuning; no output filter | Zero latency; zero cost | No control; no audit; safety training bypassed by adversarial inputs | Not recommended for production user-facing applications |
| B: Rule-based output filter | Pattern matching on outputs for known violation strings | Fast; deterministic; low cost | Easily bypassed by paraphrasing; cannot detect nuanced content | Low-risk internal tools; supplement to other controls |
| C: Full parallel classifier pipeline (this pattern) | Parallel ML classifiers for PII, content, domain, hallucination | Comprehensive coverage; parallel = low latency; per-application configurability | ML ops burden; cost at scale; false positives require tuning | Production user-facing applications; regulated use cases |
| D: Human review | All AI outputs reviewed by human before delivery | Full human judgement applied to every output | Does not scale; not practical for real-time AI | Highest-risk use cases only (e.g., regulatory filings) |

Architectural Tensions

| Tension | Trade-Off |
|---|---|
| Safety vs User Experience | Overly aggressive filtering blocks legitimate responses and degrades user experience. Resolution: start permissive; tighten thresholds as violation patterns are understood; use SANITISE before BLOCK wherever possible. |
| Latency vs Coverage | More filters = more latency. Parallel execution mitigates this but does not eliminate it. Resolution: run all filters in parallel; timeout-exit slow filters with degraded confidence rather than blocking the response. |
| Hallucination Detection vs Cost | NLI-based hallucination scoring is computationally expensive. Resolution: enable only for RAG-based applications; skip for non-grounded generation. |
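The "timeout-exit slow filters with degraded confidence" resolution from the Latency vs Coverage row can be sketched with `asyncio`. The stage functions, the `UNKNOWN` verdict, the `degraded` flag, and the timeout value are assumptions for illustration; which stages are allowed to time out rather than fail closed is a policy decision the source leaves to configuration.

```python
import asyncio


async def run_with_timeout(name, stage, text, timeout_s):
    """Run one stage; on timeout, return a degraded verdict instead of waiting."""
    try:
        return name, await asyncio.wait_for(stage(text), timeout=timeout_s)
    except asyncio.TimeoutError:
        return name, {"verdict": "UNKNOWN", "degraded": True}


async def run_all(stages, text, timeout_s=0.06):
    """Run all stages concurrently, each bounded by the same timeout."""
    pairs = await asyncio.gather(
        *(run_with_timeout(name, stage, text, timeout_s) for name, stage in stages.items())
    )
    return dict(pairs)


# Hypothetical stages: one answers immediately, one is far too slow.
async def fast_stage(text):
    return {"verdict": "ALLOW", "degraded": False}


async def slow_stage(text):
    await asyncio.sleep(1.0)
    return {"verdict": "ALLOW", "degraded": False}


results = asyncio.run(
    run_all({"pii": fast_stage, "hallucination": slow_stage}, "hello", timeout_s=0.05)
)
```

The degraded marker lets the aggregation layer and the audit log distinguish "checked and clean" from "not checked in time".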

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Content classifier model degradation | Medium | High (harmful content reaches users) | Weekly automated test suite against known violation examples | Rollback to previous model version; emergency rule-based backup |
| PII false negative in output | Medium | High (Privacy Act risk) | Post-hoc audit sampling | Improve NER model; add specific pattern for missed entity type |
| Hallucination scorer false positives | Medium | Medium (legitimate responses flagged; user confusion) | User feedback; high FLAG rate | Adjust hallucination threshold; review NLI model |
| Output filter latency spike (classifier overloaded) | Medium | Medium (gateway SLO breach) | Filter latency metric | Autoscale classifier instances; reduce parallelism to prioritise PII+harmful |

14. Regulatory Considerations

| Regulation | Requirement | Implementation |
|---|---|---|
| Australian Privacy Act, APP 6 | Use or disclose personal information only for the primary purpose of collection | PII detection prevents inadvertent disclosure via AI output |
| EU AI Act Art. 13 (Transparency) | High-risk AI system outputs must be interpretable to users | Hallucination scoring provides interpretability metadata |
| EU AI Act Art. 14 (Human Oversight) | High-risk AI systems must allow human oversight | FLAG disposition routes outputs for human review |
| AFSL obligations (ASIC RG 255) | Licensed financial advice must be appropriate; unlicensed advice is prohibited | Domain guardrails block specific financial recommendations in unlicensed applications |
| AHPRA/Medical Board | AI must not provide individual clinical diagnoses | Domain guardrails block specific clinical advice in non-clinical applications |

15. Reference Implementations

AWS

| Component | AWS Service |
|---|---|
| Content classifier | AWS Comprehend Custom Classifier or SageMaker (DeBERTa fine-tuned) |
| PII detection | Amazon Comprehend PII Detection |
| Hallucination scoring | SageMaker NLI endpoint |
| Domain guardrails | Lambda (custom rule engine) + Bedrock Guardrails |
| Output audit log | Kinesis Firehose → S3 Object Lock |

Azure

| Component | Azure Service |
|---|---|
| Content classifier + PII + harmful | Azure AI Content Safety (all-in-one) |
| Hallucination scoring | Azure AI Language (NLI) |
| Output audit log | Event Hub → Immutable Blob |

On-Premises

| Component | Technology |
|---|---|
| Content classifier | DeBERTa fine-tuned on OWASP/safety dataset, ONNX Runtime |
| PII detection | Presidio (shared with SEC005) |
| Hallucination scoring | HHEM (Vectara's Hughes Hallucination Evaluation Model, available on Hugging Face) |
| Domain guardrails | OPA rules + custom classifier |
| Output audit log | Kafka → Elasticsearch |

16. Related Patterns

| Pattern | ID | Relationship |
|---|---|---|
| AI Gateway | EAAPL-SEC001 | Output filter is a response stage in the gateway |
| Prompt Firewall | EAAPL-SEC002 | Defence pair: SEC002 at input; SEC006 at output |
| LLM Input Sanitisation | EAAPL-SEC005 | Input/output pair for PII governance |
| Hallucination Detection | EAAPL-OBS003 | SEC006 hallucination scoring is the runtime detection; OBS003 is the monitoring pattern |
| AI Data Classification | EAAPL-SEC009 | Output classification labels applied by SEC009 feed SEC006 disposition decisions |

17. Maturity Assessment

Overall Maturity: Proven

| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Pattern definition clarity | 5 | Clear stages and decision model |
| Technology availability | 4 | Strong commercial options (Azure Content Safety); hallucination scoring tooling maturing |
| Industry adoption | 4 | Widely recognised requirement; implementation completeness varies |
| Regulatory alignment | 5 | Directly addresses EU AI Act Art. 13/14, Privacy Act, and domain-specific obligations |
| Operational tooling | 4 | Good commercial tooling; hallucination scoring requires custom integration |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-03-15 | Security Architecture Team | Initial pattern definition |
| 1.1 | 2024-07-01 | Security Architecture Team | Added domain guardrails for financial/medical/legal; hallucination scoring integration |
| 1.2 | 2025-01-20 | Security Architecture Team | Updated OWASP mapping; added Australian regulatory context; expanded failure modes |