EAAPL-SEC002 · Proven

Prompt Firewall

🔐 AI Security · APRA CPS234 · EU AI Act · 🏭 Field-tested in AU

[EAAPL-SEC002] Prompt Firewall

Category: Security / Threat Prevention
Sub-category: Adversarial Input Defence
Version: 1.3
Maturity: Proven
Tags: prompt-injection, jailbreak, input-validation, content-policy, nlp-security, classifier, defence-in-depth
Regulatory Relevance: APRA CPS234, EU AI Act Art. 9 & 15, OWASP LLM01, NIST AI RMF MANAGE 1.3

1. Executive Summary

The Prompt Firewall is an inline defensive layer that inspects every user input and system-constructed prompt before it reaches a large language model. It detects and blocks prompt injection attacks, jailbreak attempts, policy violations, and adversarial instructions that seek to override the model's intended behaviour or extract sensitive information.

For business stakeholders, the risk is concrete: a single successful prompt injection can cause an AI-powered application to ignore its system instructions, impersonate another user, exfiltrate data from its context window, or generate harmful content. Each of these outcomes carries regulatory, reputational, and financial consequences. A prompt firewall blocks the large majority of known attack patterns (Section 10 sets a target of more than 99% of a test attack suite) and, through semantic analysis, significantly reduces the success rate of novel attacks.

Unlike perimeter firewalls that operate on network packets, a prompt firewall operates on natural language. That requires a combination of rule-based detection (fast, deterministic, low false-positive), semantic similarity analysis (catches paraphrased attacks), and ML classifiers (catches novel attack classes). The pattern is deployed as an inline middleware stage, typically within the AI Gateway (EAAPL-SEC001). It typically adds 20–50ms of latency, rising to roughly 50–80ms when all three layers run synchronously (see Option C in Section 12), and provides a material reduction in successful prompt injection incidents.

2. Problem Statement

Business Problem

Organisations deploying AI assistants — customer service bots, internal productivity tools, code generation assistants — face an attack vector with no analogue in traditional software: natural language manipulation of the AI's behaviour. An attacker does not need to find a SQL injection vulnerability or exploit a buffer overflow. They need only craft a message that convinces the model to ignore its instructions, impersonate another user, or disclose information it should not.

High-profile incidents have demonstrated that even production LLM deployments from major vendors are vulnerable to prompt injection. The business consequences include: leakage of system prompts (containing proprietary logic or sensitive context), data exfiltration from the context window (e.g., previous conversation turns containing other users' data), generation of policy-violating content that causes regulatory exposure, and denial of service through resource-exhausting prompts.

Technical Problem

LLMs process user input and system instructions in the same channel (the prompt). Unlike a database with parameterised queries, which cleanly separates query logic from data, an LLM cannot inherently distinguish between "authorised instruction" and "adversarial instruction embedded in user data." Any user-controlled text that reaches the model's context window is potentially an attack surface.

Prompt injection attacks take multiple forms: direct injection (attacker directly sends malicious instructions), indirect injection (attacker embeds malicious instructions in documents or web pages that the AI retrieves), jailbreaking (persuasion-based attempts to bypass safety training), role-play exploitation (convincing the model it is a different, unconstrained entity), and token manipulation (using special characters, encoding tricks, or unusual spacing to bypass simple pattern matching).

Symptoms

AI application generating content that violates its stated purpose (e.g., a coding assistant generating phishing emails).
System prompt contents appearing in model responses.
Users reporting that the AI "acted differently" after an unusual input.
AI application performing actions it was not instructed to perform by the application (in agentic contexts).
Sudden spikes in content policy violations in output filtering logs.

Cost of Inaction

Dimension	Impact
Regulatory	Disclosure of system prompt containing proprietary logic or PII; potential Privacy Act breach if user data exfiltrated from context
Reputational	Public demonstration of AI jailbreak attracts media attention; erodes user trust in AI-powered product
Financial	Regulatory fines; remediation costs; potential liability for AI-generated harmful content
Security	System prompt exfiltration reveals application architecture; can be used to craft more targeted attacks
Operational	Model abuse through resource-exhausting prompts drives API cost spikes and degraded availability for legitimate users

3. Context

When to Apply

Any AI application that accepts user-generated text as input to an LLM.
AI applications operating in adversarial environments (public-facing, customer-facing, or accessible by untrusted internal users).
Agentic systems where LLMs can invoke tools, APIs, or execute code — the consequences of injection are significantly higher.
Applications where the system prompt contains sensitive instructions, proprietary logic, or confidential context.
Regulated use cases where policy-violating outputs carry compliance risk.

When NOT to Apply

Fully internal, developer-only AI tools where all users are trusted and the threat model does not include insider adversaries.
Batch processing pipelines where inputs come exclusively from trusted, validated internal sources with no user-controlled content.
Scenarios where the latency overhead (20–50ms) is prohibitive and alternative controls (strong output filtering) provide acceptable coverage.

Prerequisites

Prerequisite	Detail
AI Gateway (EAAPL-SEC001)	Firewall is ideally deployed as a stage within the gateway; can also be deployed as an application-level middleware
Classifier Model	A fine-tuned text classifier or embedding similarity model for semantic analysis
Policy Definitions	Organisation's AI Acceptable Use Policy codified into firewall rules
Attack Pattern Library	Maintained library of known prompt injection and jailbreak patterns
Observability Stack	Logging and alerting infrastructure for firewall events

Industry Applicability

Industry	Applicability	Key Driver
Financial Services	Critical	Regulatory exposure from AI-assisted advice; system prompt exfiltration risk
Healthcare	Critical	Protected health information in context window; safety-critical AI outputs
Government	Critical	Classified information protection; adversarial nation-state threat actors
E-commerce / Retail	High	Customer-facing AI with promotional/pricing logic in system prompt
Technology / SaaS	High	Public-facing AI features; developer tools vulnerable to supply chain injection
Education	Medium	Minor users; content policy enforcement

4. Architecture Overview

The Prompt Firewall is a multi-stage detection pipeline that processes every prompt before it reaches the LLM. The pipeline is designed around one principle: layered defence, in which each successive layer costs more to run but catches attacks that the previous layer misses. Fast, cheap checks run first; expensive, more accurate checks run only when the cheap checks are inconclusive.

Layer 1: Pattern Matching (Deterministic)

The first layer operates on character and token sequences. It applies a library of regular expressions and exact-match patterns derived from a continuously updated catalogue of known injection strings, jailbreak templates, and policy-violating phrases. This layer executes in microseconds and catches most low-sophistication attacks and known jailbreak variants. The pattern library is maintained as a versioned configuration artefact, updated through a CI/CD pipeline that incorporates patterns from public jailbreak repositories (JailbreakChat, LLM Security research) and internal incident findings.
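As an illustration of Layer 1, the following is a minimal sketch assuming a tiny in-memory rule set. The rule IDs and regular expressions are hypothetical examples, not entries from any real pattern library, and Python's `re` module stands in for a high-throughput engine such as Hyperscan. The input is normalised first so that simple encoding and spacing tricks do not bypass the string match.

```python
import re
import unicodedata

# Hypothetical sample rules; a real library would be loaded from the
# versioned configuration artefact, not hard-coded.
PATTERNS = {
    "INJ-001": re.compile(r"ignore (all |any )?(previous|prior|above) instructions"),
    "INJ-002": re.compile(r"(reveal|print|show) (your |the )?system prompt"),
    "JB-001": re.compile(r"you are now (dan|an unrestricted ai)"),
}


def normalise(text: str) -> str:
    """Fold Unicode look-alikes, drop format characters, collapse spacing."""
    text = unicodedata.normalize("NFKC", text)
    # Category "Cf" covers zero-width and other invisible format characters.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return re.sub(r"\s+", " ", text).strip().lower()


def match_patterns(prompt: str) -> list[str]:
    """Return the IDs of all rules that match the normalised prompt."""
    cleaned = normalise(prompt)
    return [rule_id for rule_id, rx in PATTERNS.items() if rx.search(cleaned)]
```

Calling `match_patterns` on a benign question returns an empty list, while a prompt containing a known injection phrase returns the matching rule IDs, which can then be recorded as reason codes.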

Layer 2: Semantic Analysis (Vector Similarity)

Pattern matching is defeated by paraphrasing. An attacker who knows the patterns can rephrase an injection attack to avoid any string-match. The semantic layer addresses this by embedding the input into a vector space and computing similarity against a library of known malicious embeddings. A cosine similarity threshold (typically 0.85) triggers a block. This layer catches paraphrased attacks and novel variants that share semantic intent with known attacks. It adds approximately 10–20ms using a lightweight embedding model (e.g., sentence-transformers/all-MiniLM-L6-v2 running on CPU). The embedding library is updated with new malicious examples whenever a new attack pattern is identified in the wild.
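The similarity check can be sketched as follows. This is a simplified illustration, not a production design: toy 3-dimensional vectors stand in for real embeddings, the attack IDs are hypothetical, and a brute-force scan replaces the ANN index that a deployed system would use. Only the 0.85 threshold comes from the text above.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # the typical threshold quoted above


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_check(embedding, malicious_library):
    """Return (blocked, best_score, best_id) using a brute-force scan."""
    best_id, best_score = None, -1.0
    for attack_id, attack_vec in malicious_library.items():
        score = cosine_similarity(embedding, attack_vec)
        if score > best_score:
            best_id, best_score = attack_id, score
    return best_score >= SIMILARITY_THRESHOLD, best_score, best_id


# Toy vectors; real entries would be embeddings of known attack prompts.
LIBRARY = {"ATK-override": [0.9, 0.1, 0.0], "ATK-roleplay": [0.0, 0.2, 0.9]}
```

Returning the nearest attack ID alongside the score keeps each block traceable to the library entry that triggered it.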

Layer 3: ML Classifier (Probabilistic)

The third layer applies a fine-tuned binary classifier trained specifically to distinguish legitimate prompts from adversarial ones. Unlike the semantic layer, which measures distance to known attacks, the classifier learns decision boundaries from a labelled dataset of benign and malicious prompts, including novel attack types. This layer provides the highest accuracy but also the highest latency (30–80ms on CPU, 5–15ms on GPU). For latency-sensitive applications, this layer can run asynchronously: the request is allowed through with monitoring, and the classifier's decision is stored. If the classifier assigns a high probability of injection, the stored decision triggers a retrospective review and feeds an update to the pattern library.
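The asynchronous mode might look roughly like the sketch below. Everything here is an assumption made for illustration: `classify` is a stand-in for the fine-tuned model, the 0.8 review threshold is hypothetical, and an in-process list plays the role of the analyst review queue.

```python
from concurrent.futures import ThreadPoolExecutor

REVIEW_THRESHOLD = 0.8  # hypothetical score that triggers retrospective review
review_queue: list[dict] = []  # stand-in for a real review queue
_executor = ThreadPoolExecutor(max_workers=2)


def classify(prompt: str) -> float:
    """Stand-in for the fine-tuned classifier; returns P(injection)."""
    return 0.95 if "ignore previous instructions" in prompt.lower() else 0.05


def _score_and_record(trace_id: str, prompt: str) -> float:
    score = classify(prompt)
    if score >= REVIEW_THRESHOLD:
        review_queue.append({"trace_id": trace_id, "classifier_score": score})
    return score


def submit_async(trace_id: str, prompt: str):
    """Let the request proceed now; score it in the background."""
    return _executor.submit(_score_and_record, trace_id, prompt)
```

The caller forwards the prompt to the LLM without waiting on the returned future; the review queue is what connects a late high score back to the original request through its `trace_id`.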

Policy Enforcement Layer

Beyond injection detection, the firewall enforces content policies: does this input request content that violates the organisation's AI Acceptable Use Policy? This includes checks for: requests for content involving minors, attempts to obtain detailed instructions for illegal activities, requests that target specific individuals, and use-case-specific policy violations (e.g., a financial assistant being asked to produce stock tips). Policy checks use a combination of pattern matching and classifier models trained on the specific policy domain.

Allow/Deny List Management

The firewall maintains per-application allow lists (patterns that should never be blocked regardless of classifier score — e.g., legitimate security research applications) and deny lists (patterns that should always be blocked). Allow lists are critical for preventing false positives in legitimate use cases; they require governance review before addition to prevent allow list abuse.
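A minimal sketch of per-application overrides is shown below. The application ID and both patterns are hypothetical, and the precedence rule (deny wins over allow when both match) is an assumption; the pattern itself does not specify how conflicts are resolved.

```python
import re

# Hypothetical per-application overrides; real entries would be stored in a
# configuration service and added only after governance review.
OVERRIDES = {
    "sec-research-app": {
        "allow": [re.compile(r"^analy[sz]e this jailbreak sample:", re.I)],
        "deny": [re.compile(r"\bBEGIN ADMIN OVERRIDE\b", re.I)],
    }
}


def apply_overrides(app_id: str, prompt: str, disposition: str) -> str:
    """Apply deny rules, then allow rules, else keep the pipeline's decision."""
    rules = OVERRIDES.get(app_id, {})
    if any(rx.search(prompt) for rx in rules.get("deny", [])):
        return "BLOCK"  # assumption: deny takes precedence over allow
    if any(rx.search(prompt) for rx in rules.get("allow", [])):
        return "ALLOW"
    return disposition
```

Scoping the lookup by `app_id` means an allow-list entry for a security research tool has no effect on a customer service bot.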

Sanitisation Path

Not all suspicious inputs result in a block. For inputs that are ambiguous (e.g., a high pattern-match score but low semantic similarity), the firewall can sanitise: stripping suspicious instruction sequences while preserving the legitimate intent of the input. Sanitisation is logged and flagged for review to identify attack pattern evolution.
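The sanitisation path could be approximated with regex substitution, as in the sketch below. The two patterns are hypothetical examples, and a real sanitiser would need far broader coverage; the point is only to show stripping the instruction fragment while returning what was removed so that it can be logged for review.

```python
import re

# Hypothetical instruction fragments to strip.
SUSPICIOUS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions[.!]?", re.I),
    re.compile(r"disregard (your|the) system prompt[.!]?", re.I),
]


def sanitise(prompt: str) -> tuple[str, list[str]]:
    """Return the cleaned prompt and the list of fragments that were removed."""
    removed: list[str] = []
    cleaned = prompt
    for rx in SUSPICIOUS:
        removed.extend(m.group(0) for m in rx.finditer(cleaned))
        cleaned = rx.sub("", cleaned)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned, removed
```

An empty `removed` list means the prompt passed through unchanged, so the caller can decide whether a sanitisation event needs to be logged at all.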

5. Architecture Diagram

flowchart TD
    subgraph Input["Prompt Input"]
        A[User Prompt]
    end
    subgraph Firewall["Detection Layers"]
        B{Pattern Match}
        C{Semantic Similarity}
        D{ML Classifier}
        E{Policy Check}
    end
    subgraph Outcome["Decision + Feedback"]
        F[Block + Alert]
        G[Allow to LLM]
        H[Event Log]
    end
    A --> B
    B -->|known pattern| F
    B -->|no match| C
    C -->|high score| F
    C -->|low score| D
    D -->|high confidence| F
    D -->|low confidence| E
    E -->|violation| F
    E -->|ok| G
    F --> H
    G --> H
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f3e8ff,stroke:#a855f7
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#f3e8ff,stroke:#a855f7
    style F fill:#fee2e2,stroke:#ef4444
    style G fill:#d1fae5,stroke:#10b981
    style H fill:#fef9c3,stroke:#eab308

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Pattern Matching Engine	Rule Engine	Deterministic check against library of known injection strings and regex patterns	Hyperscan, PCRE2, re2, custom trie-based matcher	High
Embedding Service	ML Inference	Converts input to vector representation for semantic similarity comparison	sentence-transformers, OpenAI Embeddings, Cohere Embed (local deployment preferred)	High
Malicious Embedding Library	Vector Store	Pre-computed embeddings of known attack prompts; indexed for ANN search	FAISS, hnswlib, Pinecone (local), ChromaDB	High
ML Classifier	ML Inference	Fine-tuned binary classifier for injection/jailbreak detection	DistilBERT fine-tuned, DeBERTa, custom logistic regression on embeddings	High
Policy Rule Engine	Rule Engine	Evaluates content policy rules against prompt content	OPA, custom rule DSL, AWS Comprehend Custom Classifier	High
Pattern Library	Configuration	Versioned library of known attack patterns (regex, exact match, fuzzy match)	Git-versioned YAML/JSON, updated via CI/CD	Critical
Allow/Deny List Manager	Configuration	Per-application overrides for firewall decisions	Key-value store (Redis), configuration service	Medium
Sanitisation Engine	Transformation	Strips suspicious instruction fragments while preserving legitimate intent	Custom NLP, regex substitution	Medium
Firewall Event Logger	Observability	Structured logging of all firewall events (blocks, allows, sanitisations) for security review	Kafka, Fluentd, CloudWatch Logs	Critical
Feedback Pipeline	ML Operations	Routes flagged inputs to analyst review; feeds confirmed attacks into retraining	Label Studio, Prodigy, custom review UI	Medium

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Application / AI Gateway	Submits assembled prompt (system + user turn) to firewall entry point	Prompt text submitted for inspection
2	Pattern Matching Engine	Applies all regex and exact-match patterns from pattern library; records match details if found	MATCH or NO_MATCH with match details
3	Embedding Service	Converts prompt to vector embedding using local embedding model	Embedding vector (e.g., 384 dimensions)
4	Similarity Search	Computes cosine similarity against malicious embedding library using ANN index	Nearest-neighbour distance and similarity score
5	ML Classifier	Runs fine-tuned classifier on prompt (synchronous when the upstream similarity score from step 4 is >0.60; asynchronous below that threshold)	Probability score: P(injection), P(jailbreak), P(policy_violation)
6	Policy Rule Engine	Evaluates content policy rules against prompt; applies use-case-specific deny rules	POLICY_PASS or POLICY_VIOLATION with rule ID
7	Decision Aggregator	Combines results from all layers; determines final action (BLOCK, SANITISE, ALLOW+WATCH, ALLOW)	Final disposition with reason codes
8	Firewall Event Logger	Writes structured event record regardless of disposition	Audit log entry with: disposition, reason codes, model scores, timestamp, trace_id
9	Response to Caller	Returns disposition to AI Gateway / application	ALLOW (forward prompt), BLOCK (return 400), SANITISE (return modified prompt)
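Step 7 of the flow above, the Decision Aggregator, can be sketched as a small pure function. This is one possible policy, not the pattern's definitive logic: only the 0.85 similarity threshold comes from the document, the other three thresholds are hypothetical, and treating a pattern hit with very low semantic similarity as a SANITISE case follows the ambiguity example given in the Sanitisation Path section.

```python
from dataclasses import dataclass, field
from typing import Optional

SEMANTIC_BLOCK = 0.85    # typical Layer 2 threshold from this pattern
SEMANTIC_LOW = 0.30      # hypothetical: below this a pattern hit is ambiguous
CLASSIFIER_BLOCK = 0.90  # hypothetical
CLASSIFIER_WATCH = 0.60  # hypothetical


@dataclass
class LayerResults:
    pattern_ids: list = field(default_factory=list)
    semantic_score: float = 0.0
    classifier_score: float = 0.0
    policy_rule_id: Optional[str] = None


def aggregate(r: LayerResults):
    """Return (disposition, reason_codes) for the combined layer results."""
    pattern_codes = [f"pattern:{p}" for p in r.pattern_ids]
    reasons = []
    if r.policy_rule_id:
        reasons.append(f"policy:{r.policy_rule_id}")
    if r.semantic_score >= SEMANTIC_BLOCK:
        reasons.append("semantic")
    if r.classifier_score >= CLASSIFIER_BLOCK:
        reasons.append("classifier")
    if reasons:
        return "BLOCK", reasons + pattern_codes
    if pattern_codes:
        if r.semantic_score < SEMANTIC_LOW:
            return "SANITISE", pattern_codes
        return "BLOCK", pattern_codes
    if r.classifier_score >= CLASSIFIER_WATCH:
        return "ALLOW+WATCH", ["classifier_medium"]
    return "ALLOW", []
```

Keeping the aggregator free of I/O makes the disposition logic easy to unit-test and to replay against logged scores when thresholds are retuned.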

Error Flow

Error Condition	Firewall Behaviour	Disposition	Alert
Embedding service unavailable	Skip Layer 2; proceed with Layers 1, 3, and policy	ALLOW with degraded confidence flag	Warning alert: Layer 2 unavailable
Classifier unavailable	Skip Layer 3; proceed with Layers 1, 2, and policy	ALLOW with degraded confidence flag	Warning alert: Layer 3 unavailable
Pattern library stale (>24h without update)	Continue with cached library	Stale library flag on all decisions	Alert: pattern library update required
Firewall latency > 200ms (SLA breach)	Log timeout; fail-open (ALLOW) to protect availability	ALLOW with timeout flag for async review	SLA breach alert
All detection layers unavailable	Fail closed: BLOCK all requests	BLOCK	Critical alert: firewall fully unavailable
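The error-flow table above can be expressed as a short function. This is a sketch under stated assumptions: the layer names used as dictionary keys are hypothetical, the policy engine is counted as a detection layer, and the fail-closed check is evaluated before the timeout check. The 200ms limit is the SLA breach value from the table.

```python
def availability_outcome(layers_up: dict, elapsed_ms: float, sla_ms: float = 200.0):
    """Return (forced_disposition_or_None, flags).

    layers_up maps layer names (e.g. "pattern", "semantic", "classifier",
    "policy") to True when the layer is healthy.
    """
    if not any(layers_up.values()):
        # No detection capability left: fail closed.
        return "BLOCK", ["all_layers_unavailable"]
    if elapsed_ms > sla_ms:
        # Latency breach: fail open and flag for asynchronous review.
        return "ALLOW", ["timeout"]
    flags = sorted(f"{name}_unavailable" for name, up in layers_up.items() if not up)
    if flags:
        flags.append("degraded_confidence")
    # None means the normal layer results decide the disposition.
    return None, flags
```

A `None` disposition tells the caller to proceed with whichever layers are healthy and attach the returned flags to the logged decision.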

8. Security Considerations

Authentication & Authorisation

The firewall service itself must be accessible only from authorised callers (AI Gateway, application middleware). mTLS or API key authentication prevents direct access.
Pattern library and classifier model updates are authorised through a signed artefact pipeline — an attacker who can modify the pattern library can blind the firewall to specific attacks.

Secrets Management

If the firewall uses a cloud embedding API (e.g., OpenAI Embeddings for the similarity layer), the API key must be managed per EAAPL-SEC008. Preferably, use a locally-deployed embedding model to avoid sending potentially sensitive prompt content to an external embedding provider.

Data Classification

Prompts processed by the firewall may contain sensitive data (PII, confidential context). The firewall should not log full prompt content at INFO level; log only truncated indicators, hashes, or anonymised representations unless explicitly configured for full-content logging under a controlled data handling agreement.

Encryption

All firewall service communication over TLS 1.3.
Firewall event logs encrypted at rest.
Classifier model weights stored in encrypted object storage; access audited.

False Positive Management

False positives (blocking legitimate inputs) should be treated as a security defect, not a minor inconvenience. High false-positive rates push users to route around the firewall and push teams to disable it. Maintain a false-positive rate below 0.5% of legitimate traffic.

Auditability

Every firewall decision is logged with full reasoning: which layers triggered, what scores were returned, which pattern matched. This supports both security operations (investigating incidents) and model improvement (identifying false positives).

OWASP LLM Top 10 Coverage

OWASP LLM Risk	Prompt Firewall Mitigation	Coverage
LLM01: Prompt Injection	Primary purpose: detect and block direct and indirect prompt injection	Critical
LLM02: Insecure Output Handling	Prevents injection attacks that cause unsafe outputs at the source	High (upstream of output)
LLM03: Training Data Poisoning	Out of scope for this pattern	None
LLM04: Model Denial of Service	Detects resource-exhausting prompt patterns (extremely long nested instructions)	Medium
LLM05: Supply Chain Vulnerabilities	Pattern library update pipeline must be secured against supply chain attack	Medium
LLM06: Sensitive Information Disclosure	Blocks prompts crafted to elicit disclosure of system prompt or context window contents	High
LLM07: Insecure Plugin Design	Blocks injection attacks targeting agentic tool call triggering	High
LLM08: Excessive Agency	Blocks prompts that attempt to expand model's scope of action beyond intended permissions	High
LLM09: Overreliance	Out of scope	None
LLM10: Model Theft	Blocks prompts designed to extract model training data or behaviour through systematic querying	Medium

9. Governance Considerations

Responsible AI

The prompt firewall enforces the organisation's AI Acceptable Use Policy at the input layer. Policy rules must be reviewed by the AI Ethics and Governance function before deployment to ensure they do not introduce discriminatory filtering (e.g., blocking inputs in non-English languages disproportionately).

Model Risk Management

The classifier model used in Layer 3 is itself an AI model and subject to model risk management: it must be validated on representative samples of legitimate traffic before deployment, and its false-positive and false-negative rates must be documented.

Human Approval

ALLOW+WATCH dispositions (medium-confidence suspicious inputs that were allowed through) must be reviewed by a security analyst within 24 hours. Confirmed injections trigger classifier retraining.

Traceability

Every block event is traceable to the specific pattern, embedding similarity score, or classifier score that triggered it. This supports appeals processes (a user who believes their input was wrongly blocked can request a review) and regulatory enquiries.

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Pattern Library Release Notes	Security Team	With each library update	Documents new patterns added, patterns retired, false-positive corrections
Classifier Validation Report	AI Risk Team	Quarterly; with each model update	Documents FPR, FNR, precision, recall on validation dataset
Firewall Policy Review	AI Governance	Quarterly	Reviews policy rules for AUP alignment, discriminatory impact assessment
False Positive Trend Report	AI Platform Team	Monthly	Tracks FPR trend; triggers tuning if >0.5%
Security Incident Log	Security Operations	Continuous	Record of all BLOCK events with confirmed/unconfirmed injection classification

10. Operational Considerations

Monitoring

Real-time dashboard: block rate by layer (Pattern / Semantic / Classifier / Policy), false-positive rate (from analyst review), latency per layer, classifier confidence distribution.
Alerting: block rate spike (>10× baseline) = possible coordinated attack; FPR spike = classifier degradation; layer unavailability = degraded defence posture.

SLOs

SLO	Target	Measurement
Firewall decision latency p99	<80ms (synchronous path)	Span: firewall_entry → firewall_decision
False-positive rate	<0.5% of legitimate traffic	Monthly analyst review sample
Pattern library freshness	<24h since last update check	Library update timestamp metric
Detection rate for known attacks	>99% of test attack suite blocked	Weekly automated red-team test suite
Firewall availability	99.9% (fail-open if unavailable)	Synthetic health checks

Logging

Structured JSON. Mandatory fields: trace_id, disposition, layer_triggered, pattern_id (if pattern match), semantic_score, classifier_score, policy_rule_id, latency_ms, input_hash, timestamp_utc.
Full input content logged only at AUDIT level under controlled access; standard logs contain only hash and truncated prefix.
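A standard-level log record with the mandatory fields listed above could be built as in the following sketch. The function name, the `input_prefix` field name, and the 16-character prefix length are assumptions; the remaining field names come from the mandatory field list.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_event(trace_id, prompt, disposition, layer_triggered=None,
                pattern_id=None, semantic_score=None, classifier_score=None,
                policy_rule_id=None, latency_ms=0.0, prefix_len=16):
    """Build a standard-level event: hash and truncated prefix, no full input."""
    return {
        "trace_id": trace_id,
        "disposition": disposition,
        "layer_triggered": layer_triggered,
        "pattern_id": pattern_id,
        "semantic_score": semantic_score,
        "classifier_score": classifier_score,
        "policy_rule_id": policy_rule_id,
        "latency_ms": latency_ms,
        "input_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "input_prefix": prompt[:prefix_len],  # hypothetical field name
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }


def to_log_line(event: dict) -> str:
    """Serialise the event as a single JSON line."""
    return json.dumps(event, sort_keys=True)
```

Even a short prefix can contain personal data, so the prefix length is worth reviewing against the data classification guidance in Section 8.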

Incident Management

Block rate spike → automated alert to Security Operations.
Confirmed novel injection technique → Security Operations escalates to threat intelligence team; pattern library update initiated within 4 hours.
Classifier false-positive spike → immediate escalation to AI Platform team; temporary threshold relaxation if FPR >2%.
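The alerting and escalation thresholds in this section reduce to a few comparisons, sketched below. The function names and action labels are hypothetical; the 10× block-rate factor, the 0.5% tuning trigger, and the 2% relaxation trigger are the values stated in this document.

```python
def false_positive_rate(false_positives: int, legitimate_total: int) -> float:
    """FPR over a reviewed sample of legitimate traffic."""
    if legitimate_total == 0:
        return 0.0
    return false_positives / legitimate_total


def fpr_action(rate: float) -> str:
    """Map an observed FPR to the escalation step described above."""
    if rate > 0.02:
        return "RELAX_THRESHOLD"  # temporary relaxation, escalate to AI Platform team
    if rate > 0.005:
        return "TUNE"  # above the 0.5% SLO
    return "OK"


def block_rate_spike(current_rate: float, baseline_rate: float, factor: float = 10.0) -> bool:
    """True when the block rate exceeds the baseline by more than `factor`."""
    return baseline_rate > 0 and current_rate > factor * baseline_rate
```

For example, 8 false positives in a reviewed sample of 1,000 legitimate prompts is a rate of 0.8%, which exceeds the SLO but not the relaxation trigger.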

DR

Scenario	RTO	Recovery
Layer 3 classifier unavailable	0 (fail-open without Layer 3)	Deploy classifier to backup endpoint; alert
Embedding service unavailable	0 (fail-open without Layer 2)	Restore embedding service; alert
Pattern library corruption	15min	Rollback to previous version via artefact registry
Complete firewall service failure	0 (fail-open; alert)	Immediate recovery required; escalate to P1

Capacity

Pattern matching: CPU-bound, scales linearly with rule count × request rate. 10,000 patterns at 1,000 req/s: ~2 CPU cores.
Embedding inference: 30ms/request on single CPU core; 8 cores handles ~260 req/s; GPU (T4): ~5ms/request → 200 req/s/GPU.
Classifier inference: similar to embedding; can be batched for throughput.
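The capacity figures above follow from simple arithmetic, reproduced in the sketch below. It assumes one request is processed at a time per worker (no batching) and ignores queueing and overhead, so the results are rough planning estimates rather than benchmarks.

```python
import math


def throughput_per_worker(latency_ms: float) -> float:
    """Requests per second one worker sustains when serving one at a time."""
    return 1000.0 / latency_ms


def workers_needed(target_rps: float, latency_ms: float) -> int:
    """Workers (CPU cores or GPUs) required for a target request rate."""
    return math.ceil(target_rps * latency_ms / 1000.0)


cpu_rps_8_cores = 8 * throughput_per_worker(30)  # about 267 req/s, quoted as ~260
gpu_rps = throughput_per_worker(5)               # 200 req/s per GPU
```

By the same arithmetic, sustaining 1,000 req/s of embedding inference at 30ms per request would need about 30 CPU cores, or 5 GPUs at 5ms per request.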

11. Cost Considerations

Cost Drivers

Cost Driver	Description	Relative Impact
ML inference compute	GPU or CPU instances for embedding model + classifier	High
Pattern library maintenance	Security engineer time to curate, test, and release pattern updates	Medium
Classifier retraining	Periodic retraining on new labelled examples; GPU compute for training	Medium
False-positive review	Analyst time to review ALLOW+WATCH decisions	Low–Medium
Embedding model licensing	If using commercial embedding API (OpenAI, Cohere)	Medium (eliminated with local deployment)

Scaling Risks

Classifier inference becomes a bottleneck at high request rates if running on CPU. Provision GPU inference early.
Embedding library grows with each new attack pattern added; ANN search latency increases. Prune stale embeddings and monitor search latency.

Optimisations

Deploy embedding and classifier models as shared services (not per-application) to amortise GPU cost.
Cache pattern matching results for identical inputs (hash-based deduplication) — many attackers repeat the same payload.
Run Layer 3 classifier asynchronously for low-risk inputs to reduce synchronous path latency and allow CPU inference to be sufficient.
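The hash-based deduplication mentioned above might be implemented as a small LRU cache, as in this sketch. The class name and the idea of including the pattern library version in the cache key (so that a library update invalidates old results) are assumptions added for illustration.

```python
import hashlib
from collections import OrderedDict


class DecisionCache:
    """Small LRU cache keyed on a hash of (library version, prompt)."""

    def __init__(self, library_version: str, max_entries: int = 10_000):
        self.library_version = library_version
        self.max_entries = max_entries
        self._entries: OrderedDict = OrderedDict()

    def _key(self, prompt: str) -> str:
        material = f"{self.library_version}\x00{prompt}".encode("utf-8")
        return hashlib.sha256(material).hexdigest()

    def get_or_compute(self, prompt: str, compute):
        """Return the cached result, or call compute(prompt) and store it."""
        key = self._key(prompt)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        result = compute(prompt)
        self._entries[key] = result
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return result
```

Storing only the hash as the key also means the cache holds no raw prompt text, which fits the logging guidance in Section 10.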

Indicative Cost Range

Scale	Monthly AWS Cost (USD)	Notes
Small (< 500K req/day)	$300–$800	2 CPU inference instances (c6i.2xlarge), ElastiCache for embedding cache
Medium (500K–10M req/day)	$1,500–$5,000	1–2 g4dn.xlarge GPU instances, load balanced; auto-scaling
Large (> 10M req/day)	$10,000–$30,000	GPU inference cluster (g4dn.12xlarge × N); model server (Triton)

12. Trade-Off Analysis

Option Comparison

Option	Description	Pros	Cons	Best For
A: Rule-only firewall	Layer 1 (pattern matching) only	Extremely fast (<1ms); zero ML dependencies; deterministic	Defeated by paraphrasing; requires manual pattern maintenance; cannot detect novel attacks	Low-risk internal tools; latency-critical scenarios
B: Semantic + Rule firewall	Layers 1 + 2 (pattern + embedding similarity)	Catches paraphrased attacks; moderate latency (20–30ms); no classifier training cost	Does not generalise to truly novel attack classes; embedding library requires curation	Most production use cases; balanced cost/protection
C: Full three-layer firewall	Layers 1 + 2 + 3 (pattern + embedding + classifier)	Highest detection rate; generalises to novel attacks; continuous improvement via feedback	Highest latency (50–80ms sync); ML ops burden (classifier maintenance); GPU cost	High-risk, public-facing AI applications; regulated use cases
D: Cloud-native content safety	Azure AI Content Safety, AWS Bedrock Guardrails, Google Cloud DLP	Low operational burden; managed SLAs; continuously updated by provider	Limited customisation; sends prompt content to external service (data residency risk); may not cover all injection types	Cloud-committed organisations; non-sensitive content

Architectural Tensions

Tension	Trade-Off
Detection Rate vs Latency	More detection layers = higher accuracy but higher latency. Resolution: async Layer 3 for medium-confidence inputs; sync only for high-confidence suspects.
Sensitivity vs False Positives	Lowering classifier thresholds catches more attacks but blocks more legitimate inputs. Resolution: tune thresholds against organisation-specific traffic using A/B shadow mode before enforcing.
Centralisation vs Application Context	A shared gateway-level firewall lacks application-specific context (e.g., a coding assistant has different legitimate input patterns than a customer service bot). Resolution: per-application allow lists and policy profiles configurable in the shared firewall.
Local vs Cloud Embedding	Local deployment protects data residency; cloud embedding APIs are faster to deploy and continuously updated. Resolution: default to local; allow cloud only for non-sensitive use cases with contractual data processing agreements.
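The shadow-mode tuning mentioned under Sensitivity vs False Positives can be illustrated with a short evaluation over labelled shadow traffic. This is a sketch under assumptions: the function names are hypothetical, samples are `(classifier_score, is_attack)` pairs produced by analyst labelling, and the 0.5% budget mirrors the false-positive SLO.

```python
def evaluate_threshold(samples, threshold: float) -> dict:
    """Compute FPR and detection rate if blocking at score >= threshold."""
    benign = sum(1 for _, is_attack in samples if not is_attack)
    attacks = sum(1 for _, is_attack in samples if is_attack)
    false_pos = sum(1 for s, is_attack in samples if not is_attack and s >= threshold)
    true_pos = sum(1 for s, is_attack in samples if is_attack and s >= threshold)
    return {
        "threshold": threshold,
        "fpr": false_pos / benign if benign else 0.0,
        "detection_rate": true_pos / attacks if attacks else 0.0,
    }


def pick_threshold(samples, candidates, max_fpr: float = 0.005):
    """Lowest candidate threshold whose FPR stays within budget, else None."""
    for t in sorted(candidates):
        if evaluate_threshold(samples, t)["fpr"] <= max_fpr:
            return t
    return None
```

Choosing the lowest threshold that stays within the false-positive budget maximises detection without exceeding the agreed impact on legitimate users.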

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Pattern library not updated (stale patterns)	Medium	High (missed novel attack variants)	Pattern library age metric > 24h → alert	Automated CI/CD pipeline for pattern library updates; runbook for manual update
Classifier model drift (degraded accuracy over time)	Medium	High (increased FNR for evolved attack styles)	Weekly automated red-team test suite; FNR trend	Quarterly retraining; rollback to previous model version
Embedding library too large (ANN search latency spike)	Low	Medium (latency SLO breach)	ANN search latency metric	Prune stale embeddings; increase ANN index resources
False positive spike (legitimate inputs blocked)	Medium	High (user experience degradation; firewall bypass attempts)	FPR metric from analyst review	Threshold relaxation; allow list additions; root cause investigation
Layer 1 + 2 both fail simultaneously	Very Low	Critical (reliance on Layer 3 only or fail-open)	Layer health metrics	Multi-AZ deployment; independent failure domains for each layer
Adversarial evasion of all three layers	Low	Critical (successful injection reaching LLM)	Anomalous LLM output patterns (caught by SEC006 output filter)	Output filter provides second defence; incident response; pattern library update

Cascading Failure

If the firewall fails open (allowing all traffic) during a targeted attack, the LLM's output filter (EAAPL-SEC006) becomes the last line of defence. Output filters are less effective at preventing injection (they can only catch the consequences, not the attack itself). Ensure output filtering is independently deployed and does not share failure domains with the input firewall.

14. Regulatory Considerations

Regulation	Requirement	Prompt Firewall Implementation
APRA CPS234 §21	Controls must be commensurate with vulnerability and threat environment	Three-layer detection architecture with continuous pattern updates matches threat-proportionate control requirement
EU AI Act Art. 9 (Risk Management)	High-risk AI systems must implement appropriate risk management	Prompt firewall directly implements input risk management for high-risk AI use cases
EU AI Act Art. 15 (Robustness & Accuracy)	High-risk AI systems must be resilient against attempts to alter outputs	Explicit jailbreak and injection defence addresses robustness requirement
Australian Privacy Act 1988	Prevent unauthorised access to personal information	Blocking injection attacks that attempt to exfiltrate personal information from context window
NIST AI RMF MANAGE 1.3	Responses to identified risks are monitored and adjusted	Feedback loop from analyst review to classifier retraining implements continuous risk management
ISO/IEC 42001 §8.4 (AI System Operation)	Monitor AI system inputs and outputs	Firewall event log provides required input monitoring artefact

15. Reference Implementations

AWS

Component	AWS Service
Pattern matching	Lambda (custom Hyperscan-based filter) triggered from API Gateway
Embedding service	SageMaker endpoint (sentence-transformers) or Bedrock Titan Embeddings
Similarity search	OpenSearch k-NN index
Classifier	SageMaker endpoint (fine-tuned DeBERTa)
Policy rules	AWS Bedrock Guardrails (content filtering) + custom Lambda rules
Event logging	CloudWatch Logs + Kinesis Firehose → S3

Azure

Component	Azure Service
Pattern + classifier	Azure AI Content Safety (prompt shield) + custom APIM policy
Embedding	Azure OpenAI `text-embedding-ada-002` (or local via AKS)
Similarity search	Azure AI Search with vector search
Policy rules	Azure AI Content Safety content filters
Event logging	Azure Monitor → Log Analytics → Immutable storage

GCP

Component	GCP Service
Pattern matching	Cloud Functions (custom) + Sensitive Data Protection (DLP)
Embedding	Vertex AI Text Embeddings
Similarity search	Vertex AI Vector Search
Classifier	Vertex AI custom model endpoint
Event logging	Cloud Logging → BigQuery → Cloud Storage

On-Premises

Component	Technology
Pattern matching	Hyperscan library in Go/Rust service
Embedding	Sentence-transformers on GPU server (NVIDIA T4)
Similarity search	FAISS (Facebook AI Similarity Search)
Classifier	ONNX Runtime + fine-tuned DeBERTa
Policy rules	OPA (Open Policy Agent) with custom Rego rules
Event logging	Kafka → Elasticsearch

16. Related Patterns

Pattern	ID	Relationship
AI Gateway	EAAPL-SEC001	Parent pattern: prompt firewall deployed as a stage within the AI Gateway
LLM Input Sanitisation	EAAPL-SEC005	Complementary: SEC005 handles PII/schema validation; SEC002 handles adversarial intent detection
AI Output Filtering	EAAPL-SEC006	Defence-in-depth pair: SEC002 blocks at input; SEC006 catches consequences at output
Adversarial Input Defence	EAAPL-SEC010	Extends SEC002 to handle adversarial ML attacks beyond prompt injection
AI Data Classification	EAAPL-SEC009	Classification labels inform SEC002 policy rules (higher-sensitivity data = stricter injection detection threshold)
Secure Tool Invocation	EAAPL-SEC004	SEC002 blocks injection attacks targeting tool call manipulation; SEC004 enforces safe execution after the prompt passes the firewall

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Rationale
Pattern definition clarity	5	Well-defined scope and detection pipeline
Technology availability	4	Strong OSS options; cloud-native solutions emerging; GPU inference required for full pipeline
Industry adoption	3	Adopted by security-mature AI teams; not yet universal; underestimated by many organisations
Attack landscape coverage	4	Covers known attack classes well; novel attacks remain a challenge
Operational tooling	3	Pattern library management and classifier MLOps require custom tooling investment
Regulatory alignment	4	Strong alignment with EU AI Act robustness requirements; increasingly referenced in financial services guidance
Community knowledge	3	Growing body of research (OWASP LLM, academic); practitioner knowledge still developing

18. Revision History

Version	Date	Author	Changes
1.0	2024-02-10	Security Architecture Team	Initial pattern definition
1.1	2024-05-15	Security Architecture Team	Added indirect injection detection; expanded Layer 2 semantic analysis detail
1.2	2024-08-20	Security Architecture Team	Updated OWASP LLM Top 10 mapping to 2024 edition; added agentic context guidance
1.3	2025-01-10	Security Architecture Team	Added async Layer 3 mode; updated cost guidance; added cloud-native option (Option D)
