EAAPLEnterprise AI Architecture Pattern Library
EAAPLLibraryAI SecurityEAAPL-SEC002
EAAPL-SEC002Proven
⇄ Compare

Prompt Firewall

🔐 AI SecurityAPRA CPS234EU AI Act🏭 Field-tested in AU

[EAAPL-SEC002] Prompt Firewall

Category: Security / Threat Prevention Sub-category: Adversarial Input Defence Version: 1.3 Maturity: Proven Tags: prompt-injection jailbreak input-validation content-policy nlp-security classifier defence-in-depth Regulatory Relevance: APRA CPS234, EU AI Act Art. 9 & 15, OWASP LLM01, NIST AI RMF MANAGE 1.3


1. Executive Summary

The Prompt Firewall is an inline defensive layer that inspects every user input and system-constructed prompt before it reaches a large language model. It detects and blocks prompt injection attacks, jailbreak attempts, policy violations, and adversarial instructions that seek to override the model's intended behaviour or extract sensitive information.

For business stakeholders, the risk is concrete: a single successful prompt injection can cause an AI-powered application to ignore its system instructions, impersonate another user, exfiltrate data from its context window, or generate harmful content — all of which carry regulatory, reputational, and financial consequences. A prompt firewall reduces this risk to near zero for known attack patterns and significantly degrades the success rate of novel attacks through semantic analysis.

Unlike perimeter firewalls that operate on network packets, a prompt firewall operates on natural language — requiring a combination of rule-based detection (fast, deterministic, low false-positive), semantic similarity analysis (catches paraphrased attacks), and ML classifiers (catches novel attack classes). The pattern is deployed as an inline middleware stage, typically within the AI Gateway (EAAPL-SEC001), and adds 20–50ms of latency while providing a material reduction in successful prompt injection incidents.


2. Problem Statement

Business Problem

Organisations deploying AI assistants — customer service bots, internal productivity tools, code generation assistants — face an attack vector with no analogue in traditional software: natural language manipulation of the AI's behaviour. An attacker does not need to find a SQL injection vulnerability or exploit a buffer overflow. They need only craft a message that convinces the model to ignore its instructions, impersonate another user, or disclose information it should not.

High-profile incidents have demonstrated that even production LLM deployments from major vendors are vulnerable to prompt injection. The business consequences include: leakage of system prompts (containing proprietary logic or sensitive context), data exfiltration from the context window (e.g., previous conversation turns containing other users' data), generation of policy-violating content that causes regulatory exposure, and denial of service through resource-exhausting prompts.

Technical Problem

LLMs process user input and system instructions in the same channel (the prompt). Unlike a database that cleanly separates queries from data, an LLM cannot inherently distinguish between "authorised instruction" and "adversarial instruction embedded in user data." Any user-controlled text that reaches the model's context window is potentially an attack surface.

Prompt injection attacks take multiple forms: direct injection (attacker directly sends malicious instructions), indirect injection (attacker embeds malicious instructions in documents or web pages that the AI retrieves), jailbreaking (persuasion-based attempts to bypass safety training), role-play exploitation (convincing the model it is a different, unconstrained entity), and token manipulation (using special characters, encoding tricks, or unusual spacing to bypass simple pattern matching).

Symptoms

  • AI application generating content that violates its stated purpose (e.g., a coding assistant generating phishing emails).
  • System prompt contents appearing in model responses.
  • Users reporting that the AI "acted differently" after an unusual input.
  • AI application performing actions it was not instructed to perform by the application (in agentic contexts).
  • Sudden spikes in content policy violations in output filtering logs.

Cost of Inaction

Dimension Impact
Regulatory Disclosure of system prompt containing proprietary logic or PII; potential Privacy Act breach if user data exfiltrated from context
Reputational Public demonstration of AI jailbreak attracts media attention; erodes user trust in AI-powered product
Financial Regulatory fines; remediation costs; potential liability for AI-generated harmful content
Security System prompt exfiltration reveals application architecture; can be used to craft more targeted attacks
Operational Model abuse through resource-exhausting prompts drives API cost spikes and degraded availability for legitimate users

3. Context

When to Apply

  • Any AI application that accepts user-generated text as input to an LLM.
  • AI applications operating in adversarial environments (public-facing, customer-facing, or accessible by untrusted internal users).
  • Agentic systems where LLMs can invoke tools, APIs, or execute code — the consequences of injection are significantly higher.
  • Applications where the system prompt contains sensitive instructions, proprietary logic, or confidential context.
  • Regulated use cases where policy-violating outputs carry compliance risk.

When NOT to Apply

  • Fully internal, developer-only AI tools where all users are trusted and the threat model does not include insider adversaries.
  • Batch processing pipelines where inputs come exclusively from trusted, validated internal sources with no user-controlled content.
  • Scenarios where the latency overhead (20–50ms) is prohibitive and alternative controls (strong output filtering) provide acceptable coverage.

Prerequisites

Prerequisite Detail
AI Gateway (EAAPL-SEC001) Firewall is ideally deployed as a stage within the gateway; can also be deployed as an application-level middleware
Classifier Model A fine-tuned text classifier or embedding similarity model for semantic analysis
Policy Definitions Organisation's AI Acceptable Use Policy codified into firewall rules
Attack Pattern Library Maintained library of known prompt injection and jailbreak patterns
Observability Stack Logging and alerting infrastructure for firewall events

Industry Applicability

Industry Applicability Key Driver
Financial Services Critical Regulatory exposure from AI-assisted advice; system prompt exfiltration risk
Healthcare Critical Protected health information in context window; safety-critical AI outputs
Government Critical Classified information protection; adversarial nation-state threat actors
E-commerce / Retail High Customer-facing AI with promotional/pricing logic in system prompt
Technology / SaaS High Public-facing AI features; developer tools vulnerable to supply chain injection
Education Medium Minor users; content policy enforcement

4. Architecture Overview

The Prompt Firewall is a multi-stage detection pipeline that processes every prompt before it reaches the LLM. The pipeline architecture is designed around a fundamental principle: layered defence with increasing cost and decreasing false-positive rate at each layer. Fast, cheap checks run first; expensive, accurate checks run only when cheap checks are inconclusive.

Layer 1: Pattern Matching (Deterministic)

The first layer operates on character and token sequences. It applies a library of regular expressions and exact-match patterns derived from a constantly updated catalogue of known injection strings, jailbreak templates, and policy-violating phrases. This layer executes in microseconds and catches the vast majority of script-kiddie attacks and known jailbreak variants. The pattern library is maintained as a versioned configuration artefact, updated through a CI/CD pipeline that incorporates patterns from public jailbreak repositories (JailbreakChat, LLM Security research) and internal incident findings.

Layer 2: Semantic Analysis (Vector Similarity)

Pattern matching is defeated by paraphrasing. An attacker who knows the patterns can rephrase an injection attack to avoid any string-match. The semantic layer addresses this by embedding the input into a vector space and computing similarity against a library of known malicious embeddings. A cosine similarity threshold (typically 0.85) triggers a block. This layer catches paraphrased attacks and novel variants that share semantic intent with known attacks. It adds approximately 10–20ms using a lightweight embedding model (e.g., sentence-transformers/all-MiniLM-L6-v2 running on CPU). The embedding library is updated with new malicious examples whenever a new attack pattern is identified in the wild.

Layer 3: ML Classifier (Probabilistic)

The third layer applies a fine-tuned binary classifier trained specifically to distinguish legitimate prompts from adversarial ones. Unlike the semantic layer which measures distance to known attacks, the classifier learns decision boundaries from a labelled dataset of benign and malicious prompts — including novel attack types. This layer provides the highest accuracy but also the highest latency (30–80ms on CPU, 5–15ms on GPU). For latency-sensitive applications, this layer runs asynchronously: the request is allowed through with monitoring, but a definitive classifier decision is stored and used to update the pattern library and trigger retrospective review if the classifier scores high probability of injection.

Policy Enforcement Layer

Beyond injection detection, the firewall enforces content policies: does this input request content that violates the organisation's AI Acceptable Use Policy? This includes checks for: requests for content involving minors, attempts to obtain detailed instructions for illegal activities, requests that target specific individuals, and use-case-specific policy violations (e.g., a financial assistant being asked to produce stock tips). Policy checks use a combination of pattern matching and classifier models trained on the specific policy domain.

Allow/Deny List Management

The firewall maintains per-application allow lists (patterns that should never be blocked regardless of classifier score — e.g., legitimate security research applications) and deny lists (patterns that should always be blocked). Allow lists are critical for preventing false positives in legitimate use cases; they require governance review before addition to prevent allow list abuse.

Sanitisation Path

Not all suspicious inputs result in a block. For inputs that are ambiguous (e.g., a high pattern-match score but low semantic similarity), the firewall can sanitise: stripping suspicious instruction sequences while preserving the legitimate intent of the input. Sanitisation is logged and flagged for review to identify attack pattern evolution.


5. Architecture Diagram

ARCHITECTURE DIAGRAM
flowchart TD subgraph Input["Prompt Input"] A[User Prompt] end subgraph Firewall["Detection Layers"] B{Pattern Match} C{Semantic Similarity} D{ML Classifier} E{Policy Check} end subgraph Outcome["Decision + Feedback"] F[Block + Alert] G[Allow to LLM] H[Event Log] end A --> B B -->|known pattern| F B -->|no match| C C -->|high score| F C -->|low score| D D -->|high confidence| F D -->|low confidence| E E -->|violation| F E -->|ok| G F --> H G --> H style A fill:#dbeafe,stroke:#3b82f6 style B fill:#f3e8ff,stroke:#a855f7 style C fill:#f3e8ff,stroke:#a855f7 style D fill:#f3e8ff,stroke:#a855f7 style E fill:#f3e8ff,stroke:#a855f7 style F fill:#fee2e2,stroke:#ef4444 style G fill:#d1fae5,stroke:#10b981 style H fill:#fef9c3,stroke:#eab308

6. Components

Component Type Responsibility Technology Options Criticality
Pattern Matching Engine Rule Engine Deterministic check against library of known injection strings and regex patterns Hyperscan, PCRE2, re2, custom trie-based matcher High
Embedding Service ML Inference Converts input to vector representation for semantic similarity comparison sentence-transformers, OpenAI Embeddings, Cohere Embed (local deployment preferred) High
Malicious Embedding Library Vector Store Pre-computed embeddings of known attack prompts; indexed for ANN search FAISS, hnswlib, Pinecone (local), ChromaDB High
ML Classifier ML Inference Fine-tuned binary classifier for injection/jailbreak detection DistilBERT fine-tuned, DeBERTa, custom logistic regression on embeddings High
Policy Rule Engine Rule Engine Evaluates content policy rules against prompt content OPA, custom rule DSL, AWS Comprehend Custom Classifier High
Pattern Library Configuration Versioned library of known attack patterns (regex, exact match, fuzzy match) Git-versioned YAML/JSON, updated via CI/CD Critical
Allow/Deny List Manager Configuration Per-application overrides for firewall decisions Key-value store (Redis), configuration service Medium
Sanitisation Engine Transformation Strips suspicious instruction fragments while preserving legitimate intent Custom NLP, regex substitution Medium
Firewall Event Logger Observability Structured logging of all firewall events (blocks, allows, sanitisations) for security review Kafka, Fluentd, CloudWatch Logs Critical
Feedback Pipeline ML Operations Routes flagged inputs to analyst review; feeds confirmed attacks into retraining Label Studio, Prodigy, custom review UI Medium

7. Data Flow

Primary Flow

Step Actor Action Output
1 Application / AI Gateway Submits assembled prompt (system + user turn) to firewall entry point Prompt text submitted for inspection
2 Pattern Matching Engine Applies all regex and exact-match patterns from pattern library; records match details if found MATCH or NO_MATCH with match details
3 Embedding Service Converts prompt to vector embedding using local embedding model Embedding vector (e.g., 384 dimensions)
4 Similarity Search Computes cosine similarity against malicious embedding library using ANN index Nearest-neighbour distance and similarity score
5 ML Classifier Runs fine-tuned classifier on prompt (synchronous for score >0.60; asynchronous below threshold) Probability score: P(injection), P(jailbreak), P(policy_violation)
6 Policy Rule Engine Evaluates content policy rules against prompt; applies use-case-specific deny rules POLICY_PASS or POLICY_VIOLATION with rule ID
7 Decision Aggregator Combines results from all layers; determines final action (BLOCK, SANITISE, ALLOW+WATCH, ALLOW) Final disposition with reason codes
8 Firewall Event Logger Writes structured event record regardless of disposition Audit log entry with: disposition, reason codes, model scores, timestamp, trace_id
9 Response to Caller Returns disposition to AI Gateway / application ALLOW (forward prompt), BLOCK (return 400), SANITISE (return modified prompt)

Error Flow

Error Condition Firewall Behaviour Disposition Alert
Embedding service unavailable Skip Layer 2; proceed with Layers 1, 3, and policy ALLOW with degraded confidence flag Warning alert: Layer 2 unavailable
Classifier unavailable Skip Layer 3; proceed with Layers 1 and 2 ALLOW with degraded confidence flag Warning alert: Layer 3 unavailable
Pattern library stale (>24h without update) Continue with cached library Stale library flag on all decisions Alert: pattern library update required
Firewall latency > 200ms (SLA breach) Log timeout; fail-open (ALLOW) to protect availability ALLOW with timeout flag for async review SLA breach alert
All detection layers unavailable Fail closed: BLOCK all requests BLOCK Critical alert: firewall fully unavailable

8. Security Considerations

Authentication & Authorisation

  • The firewall service itself must be accessible only from authorised callers (AI Gateway, application middleware). mTLS or API key authentication prevents direct access.
  • Pattern library and classifier model updates are authorised through a signed artefact pipeline — an attacker who can modify the pattern library can blind the firewall to specific attacks.

Secrets Management

  • If the firewall uses a cloud embedding API (e.g., OpenAI Embeddings for the similarity layer), the API key must be managed per EAAPL-SEC008. Preferably, use a locally-deployed embedding model to avoid sending potentially sensitive prompt content to an external embedding provider.

Data Classification

  • Prompts processed by the firewall may contain sensitive data (PII, confidential context). The firewall should not log full prompt content at INFO level; log only truncated indicators, hashes, or anonymised representations unless explicitly configured for full-content logging under a controlled data handling agreement.

Encryption

  • All firewall service communication over TLS 1.3.
  • Firewall event logs encrypted at rest.
  • Classifier model weights stored in encrypted object storage; access audited.

False Positive Management

  • False positives (blocking legitimate inputs) are a security misconfiguration, not a minor inconvenience. High false-positive rates cause users to route around the firewall or disable it. Maintain false-positive rate <0.5% of legitimate traffic.

Auditability

  • Every firewall decision is logged with full reasoning: which layers triggered, what scores were returned, which pattern matched. This supports both security operations (investigating incidents) and model improvement (identifying false positives).

OWASP LLM Top 10 Coverage

OWASP LLM Risk Prompt Firewall Mitigation Coverage
LLM01: Prompt Injection Primary purpose: detect and block direct and indirect prompt injection Critical
LLM02: Insecure Output Handling Prevents injection attacks that cause unsafe outputs at the source High (upstream of output)
LLM03: Training Data Poisoning Out of scope for this pattern None
LLM04: Model Denial of Service Detects resource-exhausting prompt patterns (extremely long nested instructions) Medium
LLM05: Supply Chain Vulnerabilities Pattern library update pipeline must be secured against supply chain attack Medium
LLM06: Sensitive Information Disclosure Blocks prompts crafted to elicit disclosure of system prompt or context window contents High
LLM07: Insecure Plugin Design Blocks injection attacks targeting agentic tool call triggering High
LLM08: Excessive Agency Blocks prompts that attempt to expand model's scope of action beyond intended permissions High
LLM09: Overreliance Out of scope None
LLM10: Model Theft Blocks prompts designed to extract model training data or behaviour through systematic querying Medium

9. Governance Considerations

Responsible AI

  • The prompt firewall enforces the organisation's AI Acceptable Use Policy at the input layer. Policy rules must be reviewed by the AI Ethics and Governance function before deployment to ensure they do not introduce discriminatory filtering (e.g., blocking inputs in non-English languages disproportionately).

Model Risk Management

  • The classifier model used in Layer 3 is itself an AI model and subject to model risk management: it must be validated on representative samples of legitimate traffic before deployment, and its false-positive and false-negative rates must be documented.

Human Approval

  • ALLOW+WATCH dispositions (medium-confidence suspicious inputs that were allowed through) must be reviewed by a security analyst within 24 hours. Confirmed injections trigger classifier retraining.

Traceability

  • Every block event is traceable to the specific pattern, embedding similarity score, or classifier score that triggered it. This supports appeals processes (a user who believes their input was wrongly blocked can request a review) and regulatory enquiries.

Governance Artefacts

Artefact Owner Frequency Purpose
Pattern Library Release Notes Security Team With each library update Documents new patterns added, patterns retired, false-positive corrections
Classifier Validation Report AI Risk Team Quarterly; with each model update Documents FPR, FNR, precision, recall on validation dataset
Firewall Policy Review AI Governance Quarterly Reviews policy rules for AUP alignment, discriminatory impact assessment
False Positive Trend Report AI Platform Team Monthly Tracks FPR trend; triggers tuning if >0.5%
Security Incident Log Security Operations Continuous Record of all BLOCK events with confirmed/unconfirmed injection classification

10. Operational Considerations

Monitoring

  • Real-time dashboard: block rate by layer (Pattern / Semantic / Classifier / Policy), false-positive rate (from analyst review), latency per layer, classifier confidence distribution.
  • Alerting: block rate spike (>10× baseline) = possible coordinated attack; FPR spike = classifier degradation; layer unavailability = degraded defence posture.

SLOs

SLO Target Measurement
Firewall decision latency p99 <80ms (synchronous path) Span: firewall_entry → firewall_decision
False-positive rate <0.5% of legitimate traffic Monthly analyst review sample
Pattern library freshness <24h since last update check Library update timestamp metric
Detection rate for known attacks >99% of test attack suite blocked Weekly automated red-team test suite
Firewall availability 99.9% (fail-open if unavailable) Synthetic health checks

Logging

  • Structured JSON. Mandatory fields: trace_id, disposition, layer_triggered, pattern_id (if pattern match), semantic_score, classifier_score, policy_rule_id, latency_ms, input_hash, timestamp_utc.
  • Full input content logged only at AUDIT level under controlled access; standard logs contain only hash and truncated prefix.

Incident Management

  • Block rate spike → automated alert to Security Operations.
  • Confirmed novel injection technique → Security Operations escalates to threat intelligence team; pattern library update initiated within 4 hours.
  • Classifier false-positive spike → immediate escalation to AI Platform team; temporary threshold relaxation if FPR >2%.

DR

Scenario RTO Recovery
Layer 3 classifier unavailable 0 (fail-open without Layer 3) Deploy classifier to backup endpoint; alert
Embedding service unavailable 0 (fail-open without Layer 2) Restore embedding service; alert
Pattern library corruption 15min Rollback to previous version via artefact registry
Complete firewall service failure 0 (fail-open; alert) Immediate recovery required; escalate to P1

Capacity

  • Pattern matching: CPU-bound, scales linearly with rule count × request rate. 10,000 patterns at 1,000 req/s: ~2 CPU cores.
  • Embedding inference: 30ms/request on single CPU core; 8 cores handles ~260 req/s; GPU (T4): ~5ms/request → 200 req/s/GPU.
  • Classifier inference: similar to embedding; can be batched for throughput.

11. Cost Considerations

Cost Drivers

Cost Driver Description Relative Impact
ML inference compute GPU or CPU instances for embedding model + classifier High
Pattern library maintenance Security engineer time to curate, test, and release pattern updates Medium
Classifier retraining Periodic retraining on new labelled examples; GPU compute for training Medium
False-positive review Analyst time to review ALLOW+WATCH decisions Low–Medium
Embedding model licensing If using commercial embedding API (OpenAI, Cohere) Medium (eliminated with local deployment)

Scaling Risks

  • Classifier inference becomes a bottleneck at high request rates if running on CPU. Provision GPU inference early.
  • Embedding library grows with each new attack pattern added; ANN search latency increases. Prune stale embeddings and monitor search latency.

Optimisations

  • Deploy embedding and classifier models as shared services (not per-application) to amortise GPU cost.
  • Cache pattern matching results for identical inputs (hash-based deduplication) — many attackers repeat the same payload.
  • Run Layer 3 classifier asynchronously for low-risk inputs to reduce synchronous path latency and allow CPU inference to be sufficient.

Indicative Cost Range

Scale Monthly AWS Cost (USD) Notes
Small (< 500K req/day) $300–$800 2 CPU inference instances (c6i.2xlarge), ElastiCache for embedding cache
Medium (500K–10M req/day) $1,500–$5,000 1–2 g4dn.xlarge GPU instances, load balanced; auto-scaling
Large (> 10M req/day) $10,000–$30,000 GPU inference cluster (g4dn.12xlarge × N); model server (Triton)

12. Trade-Off Analysis

Option Comparison

Option Description Pros Cons Best For
A: Rule-only firewall Layer 1 (pattern matching) only Extremely fast (<1ms); zero ML dependencies; deterministic Defeated by paraphrasing; requires manual pattern maintenance; cannot detect novel attacks Low-risk internal tools; latency-critical scenarios
B: Semantic + Rule firewall Layers 1 + 2 (pattern + embedding similarity) Catches paraphrased attacks; moderate latency (20–30ms); no classifier training cost Does not generalise to truly novel attack classes; embedding library requires curation Most production use cases; balanced cost/protection
C: Full three-layer firewall Layers 1 + 2 + 3 (pattern + embedding + classifier) Highest detection rate; generalises to novel attacks; continuous improvement via feedback Highest latency (50–80ms sync); ML ops burden (classifier maintenance); GPU cost High-risk, public-facing AI applications; regulated use cases
D: Cloud-native content safety Azure AI Content Safety, AWS Bedrock Guardrails, Google Cloud DLP Low operational burden; managed SLAs; continuously updated by provider Limited customisation; sends prompt content to external service (data residency risk); may not cover all injection types Cloud-committed organisations; non-sensitive content

Architectural Tensions

Tension Trade-Off
Detection Rate vs Latency More detection layers = higher accuracy but higher latency. Resolution: async Layer 3 for medium-confidence inputs; sync only for high-confidence suspects.
Sensitivity vs False Positives Lowering classifier thresholds catches more attacks but blocks more legitimate inputs. Resolution: tune thresholds against organisation-specific traffic using A/B shadow mode before enforcing.
Centralisation vs Application Context A shared gateway-level firewall lacks application-specific context (e.g., a coding assistant has different legitimate input patterns than a customer service bot). Resolution: per-application allow lists and policy profiles configurable in the shared firewall.
Local vs Cloud Embedding Local deployment protects data residency; cloud embedding APIs are faster to deploy and continuously updated. Resolution: default to local; allow cloud only for non-sensitive use cases with contractual data processing agreements.

13. Failure Modes

Failure Likelihood Impact Detection Recovery
Pattern library not updated (stale patterns) Medium High (missed novel attack variants) Pattern library age metric > 24h → alert Automated CI/CD pipeline for pattern library updates; runbook for manual update
Classifier model drift (degraded accuracy over time) Medium High (increased FNR for evolved attack styles) Weekly automated red-team test suite; FNR trend Quarterly retraining; rollback to previous model version
Embedding library too large (ANN search latency spike) Low Medium (latency SLO breach) ANN search latency metric Prune stale embeddings; increase ANN index resources
False positive spike (legitimate inputs blocked) Medium High (user experience degradation; firewall bypass attempts) FPR metric from analyst review Threshold relaxation; allow list additions; root cause investigation
Layer 1 + 2 both fail simultaneously Very Low Critical (reliance on Layer 3 only or fail-open) Layer health metrics Multi-AZ deployment; independent failure domains for each layer
Adversarial evasion of all three layers Low Critical (successful injection reaching LLM) Anomalous LLM output patterns (caught by SEC006 output filter) Output filter provides second defence; incident response; pattern library update

Cascading Failure

If the firewall fails open (allowing all traffic) during a targeted attack, the LLM's output filter (EAAPL-SEC006) becomes the last line of defence. Output filters are less effective at preventing injection (they can only catch the consequences, not the attack itself). Ensure output filtering is independently deployed and does not share failure domains with the input firewall.


14. Regulatory Considerations

Regulation Requirement Prompt Firewall Implementation
APRA CPS234 §21 Controls must be commensurate with vulnerability and threat environment Three-layer detection architecture with continuous pattern updates matches threat-proportionate control requirement
EU AI Act Art. 9 (Risk Management) High-risk AI systems must implement appropriate risk management Prompt firewall directly implements input risk management for high-risk AI use cases
EU AI Act Art. 15 (Robustness & Accuracy) High-risk AI systems must be resilient against attempts to alter outputs Explicit jailbreak and injection defence addresses robustness requirement
Australian Privacy Act 1988 Prevent unauthorised access to personal information Blocking injection attacks that attempt to exfiltrate personal information from context window
NIST AI RMF MANAGE 1.3 Responses to identified risks are monitored and adjusted Feedback loop from analyst review to classifier retraining implements continuous risk management
ISO/IEC 42001 §8.4 (AI System Operation) Monitor AI system inputs and outputs Firewall event log provides required input monitoring artefact

15. Reference Implementations

AWS

Component AWS Service
Pattern matching Lambda (custom Hyperscan-based filter) triggered from API Gateway
Embedding service SageMaker endpoint (sentence-transformers) or Bedrock Titan Embeddings
Similarity search OpenSearch k-NN index
Classifier SageMaker endpoint (fine-tuned DeBERTa)
Policy rules AWS Bedrock Guardrails (content filtering) + custom Lambda rules
Event logging CloudWatch Logs + Kinesis Firehose → S3

Azure

Component Azure Service
Pattern + classifier Azure AI Content Safety (prompt shield) + custom APIM policy
Embedding Azure OpenAI text-embedding-ada-002 (or local via AKS)
Similarity search Azure AI Search with vector search
Policy rules Azure AI Content Safety content filters
Event logging Azure Monitor → Log Analytics → Immutable storage

GCP

Component GCP Service
Pattern matching Cloud Functions (custom) + Sensitive Data Protection (DLP)
Embedding Vertex AI Text Embeddings
Similarity search Vertex AI Vector Search
Classifier Vertex AI custom model endpoint
Event logging Cloud Logging → BigQuery → Cloud Storage

On-Premises

Component Technology
Pattern matching Hyperscan library in Go/Rust service
Embedding Sentence-transformers on GPU server (NVIDIA T4)
Similarity search FAISS (Facebook AI Similarity Search)
Classifier ONNX Runtime + fine-tuned DeBERTa
Policy rules OPA (Open Policy Agent) with custom Rego rules
Event logging Kafka → Elasticsearch

Pattern ID Relationship
AI Gateway EAAPL-SEC001 Parent pattern: prompt firewall deployed as a stage within the AI Gateway
LLM Input Sanitisation EAAPL-SEC005 Complementary: SEC005 handles PII/schema validation; SEC002 handles adversarial intent detection
AI Output Filtering EAAPL-SEC006 Defence-in-depth pair: SEC002 blocks at input; SEC006 catches consequences at output
Adversarial Input Defence EAAPL-SEC010 Extends SEC002 to handle adversarial ML attacks beyond prompt injection
AI Data Classification EAAPL-SEC009 Classification labels inform SEC002 policy rules (higher-sensitivity data = stricter injection detection threshold)
Secure Tool Invocation EAAPL-SEC004 SEC002 blocks injection attacks targeting tool call manipulation; SEC004 enforces safe execution after the prompt passes the firewall

17. Maturity Assessment

Overall Maturity: Proven

Dimension Score (1–5) Rationale
Pattern definition clarity 5 Well-defined scope and detection pipeline
Technology availability 4 Strong OSS options; cloud-native solutions emerging; GPU inference required for full pipeline
Industry adoption 3 Adopted by security-mature AI teams; not yet universal; underestimated by many organisations
Attack landscape coverage 4 Covers known attack classes well; novel attacks remain a challenge
Operational tooling 3 Pattern library management and classifier MLOps require custom tooling investment
Regulatory alignment 4 Strong alignment with EU AI Act robustness requirements; increasingly referenced in financial services guidance
Community knowledge 3 Growing body of research (OWASP LLM, academic); practitioner knowledge still developing

18. Revision History

Version Date Author Changes
1.0 2024-02-10 Security Architecture Team Initial pattern definition
1.1 2024-05-15 Security Architecture Team Added indirect injection detection; expanded Layer 2 semantic analysis detail
1.2 2024-08-20 Security Architecture Team Updated OWASP LLM Top 10 mapping to 2024 edition; added agentic context guidance
1.3 2025-01-10 Security Architecture Team Added async Layer 3 mode; updated cost guidance; added cloud-native option (Option D)
← Back to LibraryMore AI Security