[EAAPL-SEC006] AI Output Filtering
Category: Security / Response Validation
Sub-category: Post-Generation Content Controls
Version: 1.2
Maturity: Proven
Tags: output-safety content-filtering pii-detection hallucination-flagging guardrails harmful-content schema-validation
Regulatory Relevance: Australian Privacy Act, EU AI Act Art. 13 & 14, APRA CPS 234, NIST AI RMF MANAGE 2.4, OWASP LLM02
1. Executive Summary
AI Output Filtering is the post-generation inspection and transformation pipeline that evaluates every LLM response before it is delivered to users or downstream systems. Where input controls (EAAPL-SEC002, SEC005) reduce the probability of harmful outputs, output filtering is the last line of defence — catching what gets through.
For business leaders, this pattern addresses a direct liability risk: organisations that deploy user-facing AI without output controls are accepting that any output the model produces will reach end users unreviewed. LLMs produce harmful content, leak PII, give incorrect advice in domains requiring professional qualifications (financial, medical, legal), and hallucinate facts. These are not rare edge cases but a statistical property of the technology. Output filtering transforms this from an accepted risk to a managed one.
The pattern provides: PII leak detection (catching personal information present in model responses), harmful content filtering (blocking responses that violate content policy), hallucination flagging (scoring response confidence for downstream use), domain guardrails (blocking financial, medical, and legal advice in applications not licensed to provide it), and output schema validation (ensuring responses conform to expected structure for automated processing). It operates inline with minimal latency impact and supports both hard blocks and soft flags, allowing organisations to calibrate the balance between safety and user experience.
2. Problem Statement
Business Problem
LLMs deployed without output filtering expose organisations to three material risks simultaneously:
- Data leakage: Models can regurgitate PII from their training data or from context window contents. This can leak information about other users, produce personally identifiable outputs, or generate content that becomes identifying when combined with other information.
- Harmful content: Models can generate content that is harmful, offensive, or inappropriate, even with well-designed system prompts. Safety fine-tuning reduces but does not eliminate this risk.
- Professional liability: Models can produce advice in regulated domains (financial, medical, legal) that the organisation is not licensed to provide. A chatbot that says "you should definitely sell those shares" creates liability regardless of the disclaimer in the system prompt.
Technical Problem
Model safety training reduces the probability of harmful output but does not guarantee its absence. Adversarial inputs (prompt injection, jailbreaks), unusual prompt combinations, and model version changes can all cause safety training to fail for specific inputs. Output filtering adds a second layer of control that operates independently of the model's internal safety training, with rules and thresholds that the organisation configures itself.
Symptoms
- Inappropriate model content reaches users before it is detected (discovered through social media reports or support tickets).
- PII from one user appearing in another user's conversation (context contamination).
- Model giving specific financial/medical advice in a use case where that is inappropriate.
- Model output format deviating from expected schema, causing downstream application errors.
- No mechanism to detect hallucinated facts before they reach users.
Cost of Inaction
| Dimension | Impact |
|---|---|
| Legal | Organisation liable for AI-generated harmful content, medical advice, or financial advice |
| Regulatory | Privacy breach from PII leakage; financial services regulatory action for unlicensed financial advice |
| Reputational | Public incident involving AI-generated harmful content causes significant brand damage |
| Operational | Downstream automation breaks when model output does not match expected schema |
| User Safety | Harmful content or dangerous advice reaches vulnerable users |
3. Context
When to Apply
- All user-facing AI applications generating freeform text responses.
- AI systems that provide responses in regulated domains (financial services, healthcare, legal, education).
- Applications where PII from multiple users may be present in the context window.
- Automated pipelines where model outputs feed downstream systems and schema conformance is critical.
- Agentic systems where model outputs become tool call arguments or actions.
When NOT to Apply
- Fully offline, isolated systems with no external users.
- Applications where all outputs are reviewed by a human before any action is taken (the human IS the output filter).
- Developer tooling (code generation, code explanation) where content policy enforcement would create excessive friction for legitimate use.
Prerequisites
| Prerequisite | Detail |
|---|---|
| AI Gateway (EAAPL-SEC001) | Output filter deployed as a response stage in the gateway |
| Content Policy Definitions | Organisation's content policy codified into detectable violation categories |
| Domain Guardrail Rules | Per-use-case rules defining which advice domains are prohibited |
| PII Detection Library | Microsoft Presidio or equivalent (shared with SEC005) |
Industry Applicability
| Industry | Applicability | Key Driver |
|---|---|---|
| Financial Services | Critical | AFSL obligations; unlicensed financial advice risk |
| Healthcare | Critical | Clinical advice liability; PHI in model outputs |
| Legal | Critical | Unlicensed legal advice; privilege leakage |
| E-commerce / Retail | High | Customer PII leakage; harmful content in customer service |
| Education | High | Content appropriate for minor users; factual accuracy |
| Government | High | Citizen data protection; official advice accuracy |
4. Architecture Overview
The output filtering pipeline receives the raw model response immediately after generation — before it is returned to the AI Gateway's response path. The pipeline is designed for low-latency parallel operation: multiple filtering stages execute concurrently, with a final aggregation step that combines results and determines the composite disposition.
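The concurrent execution described above can be sketched with `asyncio`. This is a minimal illustration, not the pattern's reference implementation: the stage functions, the 50 ms timeout, and the choice to turn a timed-out stage into a degraded `FLAG` verdict are all assumptions made for the example.

```python
import asyncio

async def run_stage(name, stage, response, timeout_s):
    """Run one filter stage; on timeout, report a degraded verdict instead of raising."""
    try:
        verdict = await asyncio.wait_for(stage(response), timeout=timeout_s)
        return name, verdict
    except asyncio.TimeoutError:
        # Assumed policy: a slow stage yields a degraded FLAG rather than an error.
        return name, {"disposition": "FLAG", "degraded": True}

async def run_filters(response, stages, timeout_s=0.05):
    """Execute every stage concurrently and collect verdicts keyed by stage name."""
    results = await asyncio.gather(
        *(run_stage(name, stage, response, timeout_s) for name, stage in stages.items())
    )
    return dict(results)

# Hypothetical stand-in stages; real ones would call the PII, content, and NLI services.
async def fast_stage(response):
    return {"disposition": "ALLOW", "degraded": False}

async def slow_stage(response):
    await asyncio.sleep(0.5)
    return {"disposition": "ALLOW", "degraded": False}

verdicts = asyncio.run(run_filters("hello", {"pii": fast_stage, "nli": slow_stage}))
```

Because the stages run concurrently, total latency is bounded by the slowest stage or the timeout, whichever is smaller, rather than by the sum of all stages.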
PII Leak Detection
This stage scans the model output for PII using the same detection engine as the input sanitisation pipeline (EAAPL-SEC005). Model outputs can contain PII in several ways: the model may hallucinate plausible-sounding personal details, it may reproduce PII from its training data, or it may leak PII from the context window (e.g., information about user A appearing in a response to user B due to shared conversation history). Depending on the severity configuration, detected PII triggers one of three actions: inline redaction (PII replaced with a type label), partial block (the affected sentence is removed), or full response block.
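Inline redaction with a type label can be illustrated as follows. This is a simplified sketch: the two regular expressions are illustrative stand-ins for a real detection engine such as Presidio, and the function name and label format are assumptions.

```python
import re

# Illustrative patterns only; a production detector would use an NER-based engine.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def redact_pii(text):
    """Return (redacted_text, detected_entity_types)."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            # Replace every match with its type label, e.g. <EMAIL>.
            text = pattern.sub(f"<{label}>", text)
    return text, found

redacted, entities = redact_pii("Contact jane@example.com or +61 400 123 456.")
```

The list of detected entity types is what a severity configuration would inspect to choose between redaction, partial block, and full block.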
Harmful Content Classification
A multi-label content classifier evaluates the response against the organisation's content policy categories: hate speech, violence, self-harm, illegal activity facilitation, adult content, and harassment. The classifier returns a probability score per category, and a response that exceeds the threshold for any enabled category is blocked. The classifier is typically a compact transformer model (for example DeBERTa or DistilBERT fine-tuned on content safety datasets) running on CPU with 20–40ms latency. For applications with strict content requirements (children's platforms, consumer health applications), thresholds are set more conservatively.
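The per-category threshold check might look like the sketch below. The category names, threshold values, and scores are invented for illustration; in practice the scores would come from the classifier and the thresholds from the per-application policy.

```python
# Hypothetical thresholds; categories absent from the mapping are treated as disabled.
DEFAULT_THRESHOLDS = {"hate": 0.5, "violence": 0.5, "self_harm": 0.3}

def violated_categories(scores, thresholds=DEFAULT_THRESHOLDS):
    """Return the enabled categories whose score exceeds its threshold."""
    return sorted(
        category
        for category, limit in thresholds.items()
        if scores.get(category, 0.0) > limit
    )

scores = {"hate": 0.02, "violence": 0.61, "self_harm": 0.10}
blocked = violated_categories(scores)
# A stricter profile, e.g. for a children's platform, lowers every limit.
strict = violated_categories(scores, {"hate": 0.2, "violence": 0.2, "self_harm": 0.05})
```

With the default profile only `violence` is reported; under the stricter profile the same scores also trip `self_harm`.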
Domain Guardrails
This stage enforces use-case-specific content restrictions. For each application registered with the output filter, a guardrail configuration specifies prohibited advice domains and the filter action when they are detected:
- Financial services application without AFSL: block responses containing specific financial recommendations ("you should buy/sell X").
- Healthcare information site (not a medical service): block specific clinical diagnoses or treatment recommendations.
- Legal information platform: block specific legal advice (as opposed to legal information).
Domain detection uses a combination of classifier models and pattern rules. Detected violations are replaced with appropriate alternative text (e.g., "For specific financial advice, please consult a licensed financial adviser.").
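The pattern-rule half of domain detection can be sketched as below. The regular expression and the sentence splitter are deliberately naive assumptions; a real deployment would pair rules like this with a classifier, as described above.

```python
import re

# Hypothetical rule for direct buy/sell/hold recommendations.
FINANCIAL_ADVICE = re.compile(
    r"\byou should (?:definitely )?(?:buy|sell|hold)\b", re.IGNORECASE
)
FALLBACK = "For specific financial advice, please consult a licensed financial adviser."

def apply_financial_guardrail(response):
    """Replace sentences containing a direct recommendation with the fallback text."""
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    violated = False
    kept = []
    for sentence in sentences:
        if FINANCIAL_ADVICE.search(sentence):
            violated = True
            if FALLBACK not in kept:  # insert the fallback only once
                kept.append(FALLBACK)
        else:
            kept.append(sentence)
    return " ".join(kept), violated

text, violated = apply_financial_guardrail(
    "Shares fell today. You should definitely sell those shares."
)
```

General information ("Shares fell today.") passes through unchanged, while the specific recommendation is replaced.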
Hallucination Scoring
For RAG-based applications, the output is scored for grounding against the retrieved source documents. Sentences in the response are checked for: presence in the source corpus, factual consistency with sources, and citation accuracy. A hallucination risk score is attached to the response as metadata. Downstream systems can use this score to: flag responses for human review, present a confidence indicator to users, or trigger automatic regeneration with different parameters. The hallucination detection architecture is elaborated in EAAPL-OBS003.
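As a rough illustration of sentence-level grounding, the sketch below scores each sentence by token overlap with the retrieved sources. Token overlap is a much weaker signal than an NLI model and is used here only to keep the example self-contained; the function names and the choice to report the worst sentence as the overall score are assumptions.

```python
import re

def tokens(text):
    """Lower-cased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def hallucination_risk(response, sources):
    """Score each sentence as 1 - (best token overlap with any source passage)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    source_tokens = [tokens(s) for s in sources]
    per_sentence = []
    for sentence in sentences:
        words = tokens(sentence)
        if not words:
            continue
        support = max(
            (len(words & src) / len(words) for src in source_tokens), default=0.0
        )
        per_sentence.append((sentence, 1.0 - support))
    # Assumed aggregation: the overall risk is that of the least grounded sentence.
    overall = max((risk for _, risk in per_sentence), default=0.0)
    return per_sentence, overall

sources = ["The refund window is 30 days from delivery."]
per_sentence, overall = hallucination_risk(
    "The refund window is 30 days. Shipping is free worldwide.", sources
)
```

Here the first sentence is fully supported by the source, while the second shares only one token with it and so carries most of the risk.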
Output Schema Validation
For applications that expect structured model outputs (JSON, XML, specific formats), the response is validated against the registered output schema. Schema validation failures indicate either model output quality issues or adversarial manipulation of output format. Invalid outputs trigger response regeneration (up to a configured retry limit) before falling back to an error response.
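The validate-then-retry loop can be sketched with the standard library alone. The two-field schema and the `generate` callable are hypothetical; a real implementation would more likely use a JSON Schema validator or Pydantic model registered per application.

```python
import json

REQUIRED_FIELDS = {"answer": str, "confidence": float}  # hypothetical schema

def validate(raw):
    """Return (parsed, errors); errors is empty when the output conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]
    errors = [
        f"field '{name}' missing or not {kind.__name__}"
        for name, kind in REQUIRED_FIELDS.items()
        if not isinstance(data.get(name), kind)
    ]
    return (data if not errors else None), errors

def generate_validated(generate, max_retries=2):
    """Call generate() until its output validates or retries are exhausted."""
    errors = []
    for _ in range(max_retries + 1):
        data, errors = validate(generate())
        if data is not None:
            return {"status": "VALID", "data": data}
    return {"status": "INVALID", "errors": errors}

# Hypothetical model stub: the first output is truncated, the second conforms.
outputs = iter(['{"answer": "42"', '{"answer": "42", "confidence": 0.9}'])
result = generate_validated(lambda: next(outputs))
```

Each retry consumes additional model tokens, so the retry limit is a cost control as well as a latency control.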
Response Aggregation and Decision
All parallel filter stages emit verdicts and confidence scores. The aggregation layer applies the configured response policy: BLOCK (return error to caller), SANITISE (return modified response with violations removed), FLAG (return response with metadata flags for downstream handling), or ALLOW. The aggregation policy is configurable per application and per content category, allowing fine-grained calibration of the tradeoff between safety and user experience.
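One simple aggregation policy is "most severe verdict wins". The sketch below assumes that ordering (ALLOW < FLAG < SANITISE < BLOCK); the pattern itself leaves the policy configurable per application and per category, so treat this as one possible rule rather than the prescribed one.

```python
# Assumed severity ordering; a configurable policy could override it per application.
SEVERITY = {"ALLOW": 0, "FLAG": 1, "SANITISE": 2, "BLOCK": 3}

def aggregate(stage_dispositions):
    """Return the most severe disposition proposed by any stage (ALLOW if none)."""
    return max(stage_dispositions.values(), key=SEVERITY.__getitem__, default="ALLOW")

composite = aggregate({"pii": "SANITISE", "content": "ALLOW", "hallucination": "FLAG"})
```

In this example the PII stage's `SANITISE` outranks the hallucination stage's `FLAG`, so the response is delivered with redactions applied.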
5. Architecture Diagram
flowchart TD
subgraph Input["Model Output"]
A[Raw Model Response]
end
subgraph Filters["Parallel Filter Stages"]
B[PII Leak Detector]
C[Harmful Content Classifier]
D[Domain Guardrails Engine]
E[Hallucination Scorer]
end
subgraph Decision["Aggregation + Delivery"]
F[Aggregation Layer]
G{Composite Disposition}
H[Audit Log]
end
A --> B
A --> C
A --> D
A --> E
B --> F
C --> F
D --> F
E --> F
F --> G
G -->|block| I[Error Response]
G -->|sanitise / allow| J[Caller Response]
G --> H
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#f0fdf4,stroke:#22c55e
style C fill:#f0fdf4,stroke:#22c55e
style D fill:#f0fdf4,stroke:#22c55e
style E fill:#f0fdf4,stroke:#22c55e
style F fill:#f0fdf4,stroke:#22c55e
style G fill:#f3e8ff,stroke:#a855f7
style H fill:#fef9c3,stroke:#eab308
style I fill:#fee2e2,stroke:#ef4444
style J fill:#d1fae5,stroke:#10b981
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| PII Leak Detector | NLP | Scans model output for PII; redacts or blocks | Microsoft Presidio, Amazon Comprehend PII detection, custom NER | Critical |
| Harmful Content Classifier | ML Classifier | Multi-label classification for content policy violations | Perspective API, Detoxify, Azure AI Content Safety, custom DeBERTa | Critical |
| Domain Guardrails Engine | Rule + ML | Detects prohibited advice domain content; applies configured response | Custom classifier + rules, Azure AI Content Safety categories | High |
| Hallucination Scorer | ML + Comparison | Scores output against source documents for grounding; produces confidence metadata | NLI model (DeBERTa), custom sentence-overlap scoring, TruLens | High |
| Output Schema Validator | Validation | Validates structured output conforms to expected schema; triggers retry on failure | JSON Schema validator, Pydantic, custom validator | High |
| Aggregation Layer | Decision Engine | Combines filter stage verdicts; applies response policy | Custom rule engine; OPA | Critical |
| Response Sanitiser | Transformation | Applies inline redactions and replacements to produce sanitised response | Custom text processor, Presidio anonymiser | High |
| Output Audit Logger | Compliance | Records all filter decisions and violations in immutable audit log | Same pipeline as AI Gateway audit log | Critical |
| Policy Configuration Store | Configuration | Per-application content policy thresholds, guardrail rules, schema definitions | Git-versioned YAML, database | Critical |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | AI Gateway | Routes raw model response to output filter pipeline | Raw response text + metadata (model, application, trace_id) |
| 2 | PII Detector | Scans response for PII entities; records detected entities | Detection results: entity type, span, confidence |
| 3 | Harmful Content Classifier | Classifies response against content policy categories; returns per-category probability scores | Classification scores per category |
| 4 | Domain Guardrails Engine | Evaluates response against use-case guardrail rules | Guardrail violations with rule IDs |
| 5 | Hallucination Scorer | Scores response sentences for grounding against source documents (if RAG context provided) | Hallucination risk score (0–1) per sentence; overall score |
| 6 | Schema Validator | Validates response conforms to expected output schema | VALID or INVALID with violation details |
| 7 | Aggregation Layer | Combines all stage results against configured response policy | Composite disposition: BLOCK / SANITISE / FLAG / ALLOW |
| 8 | Response Sanitiser | If SANITISE: applies inline PII redactions and guardrail replacements | Modified response text |
| 9 | Output Audit Logger | Records: application, trace_id, disposition, violations, scores, response_hash | Immutable audit record |
| 10 | AI Gateway | Returns final response (clean, sanitised, flagged, or error) to caller | Response with filter metadata headers |
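The audit record written in step 9 could be assembled as in the sketch below. The field names follow the table; the `timestamp` field, the use of SHA-256 for `response_hash`, and the example values are assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(application, trace_id, disposition, violations, scores, response_text):
    """Assemble an audit record that stores a hash of the response, not the text."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed extra field
        "application": application,
        "trace_id": trace_id,
        "disposition": disposition,
        "violations": violations,
        "scores": scores,
        # SHA-256 is an assumption; the pattern only names a response_hash field.
        "response_hash": hashlib.sha256(response_text.encode("utf-8")).hexdigest(),
    }

record = build_audit_record(
    "support-bot", "trace-123", "SANITISE",
    ["PII:EMAIL"], {"hallucination": 0.12}, "Contact <EMAIL>.",
)
line = json.dumps(record, sort_keys=True)  # one JSON line per decision
```

Hashing the response lets an auditor confirm which text a decision applied to without the log itself becoming a second store of potentially sensitive content.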
Error Flow
| Error | Handling | Status | Alert |
|---|---|---|---|
| PII detected in output (high severity) | BLOCK or SANITISE per policy | 200 (sanitised) / 502 (blocked) | Privacy: PII leakage in model output |
| Harmful content above threshold | BLOCK with replacement message | 200 (safe default message) | Security: content policy violation |
| Domain guardrail violation (financial/medical advice) | SANITISE with disclaimer replacement | 200 (sanitised) | Compliance: guardrail triggered |
| Schema validation failure + retry exhausted | Return error | 502 | Warning: output schema failure |
| Hallucination score > 0.8 | FLAG response with metadata | 200 (flagged) | Info: high hallucination risk |
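The status mapping in the error-flow table can be expressed as a small function. This is a sketch under assumptions: the `reason` strings, the wording of the safe default message, and the blocked-response body are hypothetical.

```python
SAFE_DEFAULT = "Sorry, I can't help with that request."  # hypothetical wording

def to_http(disposition, reason, body):
    """Map a composite disposition to (status_code, body) following the error-flow table."""
    if disposition == "BLOCK":
        if reason == "harmful_content":
            # Harmful content is replaced with a safe default and still returns 200.
            return 200, SAFE_DEFAULT
        # Blocked PII and exhausted schema retries surface as 502.
        return 502, "Response blocked by output filter."
    # SANITISE, FLAG, and ALLOW all deliver a body with status 200.
    return 200, body

status, payload = to_http("BLOCK", "schema_failure", "{bad json")
```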
8. Security Considerations
Authentication & Authorisation
- Output filter is an internal pipeline component; not directly accessible by external callers.
- Per-application policy configuration protected by role-based access: only AI Platform administrators can modify thresholds.
OWASP LLM Top 10 Coverage
| OWASP LLM Risk | Output Filter Mitigation | Coverage |
|---|---|---|
| LLM01: Prompt Injection | Catches consequences of successful injection (unexpected output content, structural anomalies) | Medium |
| LLM02: Insecure Output Handling | Core purpose: validates and sanitises outputs before delivery, directly addressing this risk | Critical |
| LLM03: Training Data Poisoning | Detects PII or unexpected content that may indicate training data leakage | Medium |
| LLM04: Model Denial of Service | Not applicable | None |
| LLM05: Supply Chain Vulnerabilities | Schema validation detects deviations that may result from unexpected model behaviour | Low |
| LLM06: Sensitive Information Disclosure | PII leak detection directly addresses this risk | Critical |
| LLM07: Insecure Plugin Design | Output validation of tool call arguments prevents injection via tool outputs | High |
| LLM08: Excessive Agency | Detecting and blocking commands embedded in model outputs prevents downstream agent manipulation | High |
| LLM09: Overreliance | Hallucination flagging + domain guardrails provide signals that mitigate overreliance | High |
| LLM10: Model Theft | Not directly applicable | None |
9. Governance Considerations
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Content Policy Configuration | AI Governance | Reviewed quarterly; updated as needed | Documents threshold settings per application and rationale |
| Output Filter Violation Report | Compliance / AI Governance | Weekly | Trend analysis of violation types; identifies model behaviour changes |
| Hallucination Score Distribution | AI Quality | Monthly | Monitors RAG quality; triggers investigation if scores trend upward |
| Domain Guardrail Trigger Log | Legal / Compliance | Monthly | Evidence of guardrail effectiveness for regulatory purposes |
| PII Leakage Incident Register | Privacy Team | Continuous; monthly review | Records all PII detections in model outputs for Privacy Act compliance |
10. Operational Considerations
SLOs
| SLO | Target | Measurement |
|---|---|---|
| Output filter latency p99 | <60ms (parallel execution) | Filter entry → aggregation span |
| PII false-negative rate | <0.5% of outputs with PII | Monthly sampled audit |
| Harmful content false-positive rate | <1% of legitimate responses blocked | Monthly sampled user feedback |
| Schema validation accuracy | 100% (no non-conformant schemas passed) | Schema validation metric |
| Filter availability | 99.9% (fail-closed if unavailable) | Health check monitoring |
11. Cost Considerations
Cost Drivers
| Cost Driver | Description | Relative Impact |
|---|---|---|
| Content classifier inference | CPU/GPU compute for multi-label classification | High |
| NLI model for hallucination scoring | Expensive relative to other stages; skip for non-RAG applications | High |
| PII detection (shared with SEC005) | Amortised if same service as input pipeline | Medium |
| Response regeneration retries | Schema failure retries consume additional model tokens | Low–Medium |
Indicative Cost Range
| Scale | Monthly Cost (USD) | Notes |
|---|---|---|
| Small (< 1M responses/day) | $400–$1,000 | CPU inference; shared Presidio service |
| Medium (1M–10M responses/day) | $2,000–$8,000 | GPU inference for classifier + NLI; autoscaling |
| Large (> 10M responses/day) | $10,000–$35,000 | GPU cluster; skip NLI for non-RAG paths |
12. Trade-Off Analysis
Option Comparison
| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| A: Model safety training only | Rely on provider's safety fine-tuning; no output filter | Zero latency; zero cost | No control; no audit; safety training bypassed by adversarial inputs | Not recommended for production user-facing applications |
| B: Rule-based output filter | Pattern matching on outputs for known violation strings | Fast; deterministic; low cost | Easily bypassed by paraphrasing; cannot detect nuanced content | Low-risk internal tools; supplement to other controls |
| C: Full parallel classifier pipeline (this pattern) | Parallel ML classifiers for PII, content, domain, hallucination | Comprehensive coverage; parallel = low latency; per-application configurability | ML ops burden; cost at scale; false positives require tuning | Production user-facing applications; regulated use cases |
| D: Human review | All AI outputs reviewed by human before delivery | Applies human judgement to every output | Impossible at scale; not practical for real-time AI | Highest-risk use cases only (e.g., regulatory filings) |
Architectural Tensions
| Tension | Trade-Off |
|---|---|
| Safety vs User Experience | Overly aggressive filtering blocks legitimate responses and degrades user experience. Resolution: start permissive; tighten thresholds as violation patterns are understood; use SANITISE before BLOCK wherever possible. |
| Latency vs Coverage | More filters = more latency. Parallel execution mitigates this but does not eliminate it. Resolution: run all filters in parallel; timeout-exit slow filters with degraded confidence rather than blocking the response. |
| Hallucination Detection vs Cost | NLI-based hallucination scoring is computationally expensive. Resolution: enable only for RAG-based applications; skip for non-grounded generation. |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Content classifier model degradation | Medium | High (harmful content reaches users) | Weekly automated test suite against known violation examples | Rollback to previous model version; emergency rule-based backup |
| PII false negative in output | Medium | High (Privacy Act risk) | Post-hoc audit sampling | Improve NER model; add specific pattern for missed entity type |
| Hallucination scorer false positives | Medium | Medium (legitimate responses flagged; user confusion) | User feedback; high FLAG rate | Adjust hallucination threshold; review NLI model |
| Output filter latency spike (classifier overloaded) | Medium | Medium (gateway SLO breach) | Filter latency metric | Autoscale classifier instances; reduce parallelism to prioritise PII+harmful |
14. Regulatory Considerations
| Regulation | Requirement | Implementation |
|---|---|---|
| Australian Privacy Act, APP 6 | Use or disclose personal information only for the primary purpose of collection | PII detection prevents inadvertent disclosure via AI output |
| EU AI Act Art. 13 (Transparency) | High-risk AI system outputs must be interpretable to users | Hallucination scoring provides interpretability metadata |
| EU AI Act Art. 14 (Human Oversight) | High-risk AI systems must allow human oversight | FLAG disposition routes outputs for human review |
| AFSL obligations (ASIC RG 255) | Licensed financial advice must be appropriate; unlicensed advice is prohibited | Domain guardrails block specific financial recommendations in unlicensed applications |
| AHPRA/Medical Board | AI must not provide individual clinical diagnoses | Domain guardrails block specific clinical advice in non-clinical applications |
15. Reference Implementations
AWS
| Component | AWS Service |
|---|---|
| Content classifier | Amazon Comprehend custom classification or SageMaker (DeBERTa fine-tuned) |
| PII detection | Amazon Comprehend PII Detection |
| Hallucination scoring | SageMaker NLI endpoint |
| Domain guardrails | Lambda (custom rule engine) + Bedrock Guardrails |
| Output audit log | Kinesis Firehose → S3 Object Lock |
Azure
| Component | Azure Service |
|---|---|
| Content classifier + PII + harmful | Azure AI Content Safety (all-in-one) |
| Hallucination scoring | Azure AI Language (NLI) |
| Output audit log | Event Hub → Immutable Blob |
On-Premises
| Component | Technology |
|---|---|
| Content classifier | DeBERTa fine-tuned on OWASP/safety dataset, ONNX Runtime |
| PII detection | Presidio (shared with SEC005) |
| Hallucination scoring | HHEM (Vectara's Hughes Hallucination Evaluation Model, available on Hugging Face) |
| Domain guardrails | OPA rules + custom classifier |
| Output audit log | Kafka → Elasticsearch |
16. Related Patterns
| Pattern | ID | Relationship |
|---|---|---|
| AI Gateway | EAAPL-SEC001 | Output filter is a response stage in the gateway |
| Prompt Firewall | EAAPL-SEC002 | Defence pair: SEC002 at input; SEC006 at output |
| LLM Input Sanitisation | EAAPL-SEC005 | Input/output pair for PII governance |
| Hallucination Detection | EAAPL-OBS003 | SEC006 hallucination scoring is the runtime detection; OBS003 is the monitoring pattern |
| AI Data Classification | EAAPL-SEC009 | Output classification labels applied by SEC009 feed SEC006 disposition decisions |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Pattern definition clarity | 5 | Clear stages and decision model |
| Technology availability | 4 | Strong commercial options (Azure Content Safety); hallucination scoring tooling maturing |
| Industry adoption | 4 | Widely recognised requirement; implementation completeness varies |
| Regulatory alignment | 5 | Directly addresses EU AI Act Art. 13/14, Privacy Act, and domain-specific obligations |
| Operational tooling | 4 | Good commercial tooling; hallucination scoring requires custom integration |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-03-15 | Security Architecture Team | Initial pattern definition |
| 1.1 | 2024-07-01 | Security Architecture Team | Added domain guardrails for financial/medical/legal; hallucination scoring integration |
| 1.2 | 2025-01-20 | Security Architecture Team | Updated OWASP mapping; added Australian regulatory context; expanded failure modes |