EAAPLEnterprise AI Architecture Pattern Library
EAAPLLibraryAgentic AIEAAPL-AGT006
EAAPL-AGT006Proven
⇄ Compare

Reflexive Agent

🤖 Agentic AIEU AI ActISO/IEC 42001

[EAAPL-AGT006] Reflexive Agent

Category: Agentic AI Sub-category: Quality Assurance Architecture Version: 1.1 Maturity: Emerging Tags: self-critique, reflection, quality-gate, generate-critique-revise, anti-loop, cost-control, output-quality Regulatory Relevance: EU AI Act (Art. 9, 15), ISO 42001 §8.4, NIST AI RMF (MEASURE 2.5)


1. Executive Summary

The Reflexive Agent Pattern defines an architecture in which an AI agent evaluates the quality of its own outputs through a structured generate-critique-revise cycle before returning results to the calling system. By adding an explicit self-evaluation step to the standard agent loop, organisations achieve measurable improvements in output quality — particularly for high-stakes knowledge work tasks like contract drafting, regulatory analysis, and clinical documentation — without requiring manual human review of every output.

For CIO/CTO audiences: this pattern is the AI equivalent of a professional practice quality review. A lawyer reviews their own memo before sending it; a radiologist performs a double-read on ambiguous scans. The Reflexive Agent embeds that review step into the automated workflow, catching errors and quality gaps before they reach users or downstream systems. The trade-off is cost: reflection requires additional LLM inference calls. This pattern defines the governance around when reflection is worth the cost, how to prevent reflection cycles from running indefinitely, and how to integrate reflection with human oversight. For high-stakes, low-volume tasks, the quality improvement easily justifies the cost. For high-volume, low-stakes tasks, reflection should be applied selectively based on confidence scoring.


2. Problem Statement

Business Problem

AI agents deployed for high-stakes knowledge work (legal drafting, medical documentation, financial analysis) produce outputs that are factually incorrect, structurally incomplete, or inconsistent with organisational standards at rates that are unacceptable for direct use without review. Manual review by human experts is the only existing quality gate, but it is expensive and creates the bottleneck that undermines the productivity value of automation.

Technical Problem

A standard agent loop generates outputs without any internal mechanism to evaluate their quality relative to the task objective. The model produces the most probable next token; it has no objective function that penalises factual errors, logical inconsistencies, or failure to meet specified quality criteria. Adding an external evaluation step after the loop completes catches errors too late — the full generation cost has already been incurred for an output that may require significant revision.

Symptoms of Absence

  • Agent outputs for high-stakes tasks require expert human review of every output, negating the productivity benefit
  • Quality is inconsistent and unpredictable — excellent outputs and poor outputs arrive with no distinguishing signal
  • No feedback loop: the agent does not learn from its quality failures within or across tasks
  • High escalation rate to human review even when outputs are clearly adequate

Cost of Inaction

  • Quality Risk: Unreviewed poor-quality outputs from agents performing regulated tasks create compliance and liability exposure
  • Operational: Expert review bottleneck grows with agent usage volume, offsetting scale benefits
  • Competitive: Peers who implement reflection achieve demonstrably better output quality and can deploy agents in higher-stakes domains

3. Context

When to Apply

  • Output quality has material business or compliance consequences (legal, medical, financial, regulatory)
  • The task type has clear, articulable quality criteria that can be expressed in a critique prompt
  • Task volume is moderate (the additional LLM cost per task is justified by quality improvement)
  • The target quality improvement is measurable (a quality benchmark exists or can be created)
  • Tasks where partial output correction is faster than full regeneration

When NOT to Apply

  • High-volume, low-stakes tasks where reflection cost exceeds quality improvement value
  • Tasks with no articulable quality criteria (purely subjective outputs)
  • Real-time tasks with hard latency constraints incompatible with multi-pass generation
  • Tasks where the initial output quality is already above the acceptance threshold (waste of compute)

Prerequisites

  • EAAPL-AGT001 (Single Agent Pattern) baseline
  • Defined quality rubric for the task type (criteria for the critique prompt)
  • Quality threshold parameter (minimum acceptable quality score)
  • Anti-loop detection (max revision iteration limit)
  • Cost tracking per reflection cycle

Industry Applicability

Industry Task Type Quality Criteria Reflection Value
Legal Services Contract drafting, clause review Accuracy, completeness, consistency with precedents Very High
Healthcare Clinical summary, discharge letter Clinical accuracy, completeness, safety Very High
Financial Services Analyst reports, regulatory disclosures Factual accuracy, regulatory compliance, clarity High
Technology Code generation, technical documentation Correctness, security, completeness High
Consulting Executive reports, strategy documents Logical consistency, evidence support, clarity Medium

4. Architecture Overview

The Reflexive Agent Pattern extends the standard agent loop (EAAPL-AGT001) by inserting a critique-revise sub-loop between the initial output generation and the final result delivery. The sub-loop has its own termination conditions and cost controls independent of the outer loop.

Why separate the critic from the generator? The same model that generates an output has a well-documented tendency to fail to critique its own errors — it is drawn toward confirming its own output rather than challenging it. Two strategies address this. First, the critique is prompted with an explicitly adversarial persona ("You are a strict expert reviewer. Identify all factual errors, logical gaps, and failures to meet the stated criteria"). Second, in higher-investment implementations, a separate model instance (or a different model entirely) performs the critique, reducing the correlation between generator and critic errors.

Generate Phase The initial generation follows the standard agent loop. The generate phase produces a candidate output — a document, analysis, code, or other artifact — and a confidence score (either model-produced or estimated from the output structure and completeness).

Confidence Gating Before entering the reflection sub-loop, a confidence gate evaluates whether reflection is needed. If the initial output's confidence score exceeds the configured "auto-accept threshold," the output is returned without reflection. This is the primary cost optimisation: for the majority of tasks where the initial output is clearly adequate, no additional inference calls are made. The threshold is tuned per task type based on observed quality distributions.

Critique Phase The Critique Engine receives the candidate output and the task objective (original instruction + quality rubric). It executes an LLM inference call with an adversarial reviewer persona. The critique prompt is carefully designed to produce structured output: a list of specific issues (each with a category: factual error / logical gap / missing requirement / style violation / inconsistency) and an overall quality score (0–100). The critique prompt is the most important engineering artefact in this pattern — vague critique prompts produce vague, unhelpful critique that does not guide revision.

Quality Gate The Quality Gate evaluates the critique output. If the quality score meets or exceeds the acceptance threshold and no critical issues are flagged, the output is accepted and returned. If issues are present, the Revision Engine is invoked.

Revision Phase The Revision Engine receives the original output, the original task instruction, and the structured critique. It invokes an LLM to produce a revised output that addresses the specific issues identified in the critique. The revision prompt is targeted: "Revise the following draft to address these specific issues: [critique issues]. Do not change content that was not flagged as an issue." This targeted revision approach is more efficient than full regeneration and preserves the valid portions of the initial output.

Anti-Loop Detection and Cost Control The reflection sub-loop enforces a hard maximum of N critique-revise cycles (default: 3). If the output has not reached the acceptance threshold after N cycles, the best output produced so far (highest quality score across all iterations) is returned with a reflection metadata flag indicating that the quality threshold was not reached. This prevents infinite reflection loops from running up unbounded inference costs. The total cost of all reflection cycles is tracked and reported; a per-task reflection cost ceiling can trigger early termination.

Reflection Memory Critique outputs from completed tasks are written to the agent's episodic memory store (EAAPL-AGT002) with the task type, initial quality score, final quality score, and the specific issues identified. The Memory Consolidation Engine processes these records to update the semantic memory with task-type-specific quality learnings. Over time, the generator's prompting is improved based on accumulated knowledge of the most common quality failures for each task type — reducing the number of reflection cycles needed and improving first-pass quality.


5. Architecture Diagram

ARCHITECTURE DIAGRAM
flowchart TD subgraph Input["Task Input"] A[Task + Quality Rubric] end subgraph Core["Generate-Critique-Revise Loop"] B[Generate Phase] C{Confidence Gate} D[Critique Engine] E{Quality Gate} F[Revision Engine] end subgraph Output["Output Layer"] G[Accepted Output] H[Best Output Warning] I[(Reflection Memory)] end A --> B B --> C C -->|above threshold| G C -->|below threshold| D D --> E E -->|accepted| G E -->|max cycles hit| H E -->|revise| F F --> D G --> I H --> I style A fill:#dbeafe,stroke:#3b82f6 style B fill:#f0fdf4,stroke:#22c55e style C fill:#f3e8ff,stroke:#a855f7 style D fill:#f0fdf4,stroke:#22c55e style E fill:#f3e8ff,stroke:#a855f7 style F fill:#f0fdf4,stroke:#22c55e style G fill:#d1fae5,stroke:#10b981 style H fill:#fee2e2,stroke:#ef4444 style I fill:#fef9c3,stroke:#eab308

6. Components

Component Type Responsibility Technology Options Criticality
Generate Phase Agent Loop Standard agent execution producing candidate output EAAPL-AGT001 implementation Critical
Confidence Gate Quality Control Evaluates initial output confidence; gates reflection entry Model logprobs; heuristic scoring; LLM confidence prompt High
Critique Engine AI Component Generates structured critique using adversarial reviewer prompt Separate LLM instance (same or different model); critique-tuned prompt Critical
Quality Gate Logic Component Evaluates critique quality score vs. acceptance threshold; decides accept/revise/escalate Custom logic; configurable threshold per task type Critical
Revision Engine AI Component Produces targeted revision addressing specific critique issues LLM with revision-focused prompt Critical
Best Output Tracker State Tracks the highest-quality output produced across reflection cycles In-memory; part of loop state High
Anti-Loop Controller Safety Enforces maximum cycle limit; triggers fallback to best output Counter in loop state; configurable max N Critical
Reflection Cost Monitor Governance Tracks cumulative token cost of critique + revision calls; enforces cost ceiling Custom; EAAPL-AGT010 integration High
Reflection Memory Writer Learning Writes critique outcomes to episodic memory for future learning EAAPL-AGT002 memory write API Medium
Quality Score Time Series Observability Tracks quality scores per task type over time; detects drift Metrics platform; Grafana; custom analytics Medium

7. Data Flow

Full Reflection Cycle

Step Actor Action Output
1 Task System Submits task with quality_rubric: list of acceptance criteria, quality_threshold (e.g., 85/100) Task + quality config
2 Generate Phase Executes standard agent loop; produces candidate output and confidence score Candidate: {output_text, confidence: 0.72}
3 Confidence Gate Compares confidence (0.72) to auto-accept threshold (e.g., 0.90): below threshold; enter reflection Reflection triggered
4 Critique Engine Sends critique prompt: [adversarial_persona] Review this draft against [quality_rubric]. Output JSON: {issues: [{category, description, severity}], quality_score: int} Structured critique: {issues: [{factual_error: ...}, {missing_req: ...}], quality_score: 71}
5 Quality Gate Quality score 71 < acceptance threshold 85; cycle count 1 < max 3; continue Revise
6 Revision Engine Sends revision prompt with original output + critique issues Revised output
7 Best Output Tracker Revised output quality estimated; compare to prior best Updated best candidate
8 Critique Engine (cycle 2) Critiques revised output Critique: {issues: [{minor_style: ...}], quality_score: 89}
9 Quality Gate Score 89 ≥ threshold 85; accept Accept
10 Output Returns accepted output with metadata: {output, reflection_cycles: 2, final_quality_score: 89, issues_resolved: 2} Final output
11 Reflection Memory Writer Writes: task_type, initial_score, final_score, issues_resolved, cycle_count Memory record

Error Flow

Error Detection Recovery
Critique engine returns malformed JSON JSON parse error Retry critique call with explicit JSON schema instruction; max 2 retries
Revision does not improve quality score Quality Gate detects same or lower score Increment cycle counter; if max reached, return best output; log plateau
Reflection cost budget exceeded Cost Monitor Immediately return best output with status: reflection_budget_exceeded
LLM provider timeout during critique Timeout exception Return current best output with status: critique_timeout

8. Security Considerations

Prompt Injection in Critique

  • The critique prompt injects the candidate output as content — if the candidate output contains injected instructions, the critique LLM could be manipulated
  • Mitigation: the critique prompt wrapper clearly delineates the content being reviewed from the critic's instructions; content is wrapped in explicit delimiters (XML tags or similar); output validation on critique output before Quality Gate evaluation

OWASP LLM Top 10

OWASP LLM Risk Reflection Applicability Mitigation
LLM01 Prompt Injection Candidate output injected into critique context Content delimiters; output validation on critique JSON
LLM09 Overreliance Quality score could create false confidence in flawed output Quality score is advisory metadata; high-stakes outputs always include reflection metadata for human reference; quality score ≠ accuracy guarantee
LLM08 Excessive Agency Reflection cycles could be exploited to iteratively refine harmful outputs Quality rubric includes safety criteria; critique is instructed to flag safety violations as terminal issues; safety-flagged outputs are rejected regardless of quality score
LLM04 DoS Infinite reflection loops exhaust inference budget Hard cycle limit; cost ceiling enforcement; anti-loop controller

9. Governance Considerations

Quality Rubric Governance

  • Quality rubrics are owned by domain subject matter experts (legal team owns legal rubrics, clinical leads own clinical rubrics)
  • Rubrics are versioned and change-managed; changes require impact assessment on existing task benchmarks
  • Acceptance thresholds are set and reviewed by the domain owner, not by engineering

Model Risk Management

  • Reflection quality scores are not objective ground truth; they are model judgments subject to model limitations
  • Quality scores must be validated against human expert assessments on a held-out benchmark before being used as primary quality gatekeepers
  • For highest-stakes tasks, model reflection quality scores are advisory only; human review remains the final gate

Governance Artefacts

Artefact Owner Frequency Purpose
Quality Rubric Register Domain SME + AI Platform Per task type; on change Documents acceptance criteria per task type and threshold justification
Reflection Quality Benchmark ML Engineering Monthly Compares model quality scores to human assessments; validates rubric effectiveness
Quality Score Distribution Report Operations Monthly Per-task-type quality score distributions; identifies degradation
Reflection Cost Report FinOps Monthly Average reflection cost per task type; ROI analysis vs. quality improvement

10. Operational Considerations

SLOs

SLO Target Window Alert
Reflection cycle p95 latency ≤ 30s per cycle 1-hour rolling > 60s triggers P2
Auto-accept rate (no reflection needed) ≥ 60% of tasks 24-hour rolling < 40% indicates prompt quality issue; P3
Quality acceptance rate (within max cycles) ≥ 90% 24-hour rolling < 80% triggers P2; quality rubric review
Average reflection cycles per accepted output ≤ 1.5 24-hour rolling > 2.5 indicates poor initial generation

Monitoring

  • Quality score distribution per task type: trending toward lower initial scores indicates prompt degradation
  • Reflection cycle count distribution: bimodal (0 cycles or ≥2 cycles) may indicate confidence gate miscalibration
  • Cost per reflection cycle per task type: anomaly detection for cost spikes

11. Cost Considerations

Cost Drivers

Scenario Additional Token Cost vs. No Reflection Quality Benefit
60% auto-accept, 40% need 1 reflection cycle +40% (approx) High — issues caught in 40% of cases
60% auto-accept, 30% need 1 cycle, 10% need 2 cycles +60% (approx) Very High
20% auto-accept, 80% need 2 cycles +200% (approx) Very High but expensive — optimise generation

Optimisations

  • Use a smaller, faster model for the critique step and the full model only for revision (model routing)
  • Cache common critique patterns and their resolutions as procedural memories to reduce iteration count
  • Tune confidence gate threshold upward (be more selective about what triggers reflection) if auto-accept rate is too low

Indicative Cost Range (per 1,000 tasks)

Task Type Without Reflection With Reflection (1.5 avg cycles) Quality Improvement
Contract clause review $20–50 $35–85 +20–35% quality score
Clinical documentation $15–40 $28–72 +25–40% quality score
Technical documentation $10–30 $16–48 +15–25% quality score

12. Trade-Off Analysis

Reflection Implementation Options

Option Quality Improvement Cost Complexity Best For
A: Same-model adversarial critique (Recommended) High Medium Low Most production deployments
B: Separate critic model Very High High Medium Highest-stakes domains (legal, clinical)
C: Constitutional AI-style (self-correction via principles) Medium–High Medium Low When critique rubric is stable and articulable as principles
D: Multi-agent debate (see EAAPL-MAG005) Very High Very High High High-stakes decisions where structured debate adds unique value

Architectural Tensions

Tension Left Pole Right Pole Balance
Quality vs. Latency Maximum reflection cycles for best quality Single pass for lowest latency Risk-tiered: async reflection for background tasks; 1-cycle max for interactive
Critique specificity vs. Prompt complexity Highly detailed rubric; specific critique Simple rubric; general critique Start with 5–10 specific criteria; iterate based on quality benchmark results
Auto-accept rate vs. Quality coverage High threshold: most outputs go through reflection Low threshold: rarely reflects; risk of poor quality Tune threshold per task type to balance cost and quality

13. Failure Modes

Failure Mode Likelihood Impact Detection Recovery
Critique affirms rather than challenges (sycophancy) High (model tendency) High — reflection adds cost but not quality Quality score does not improve across cycles Strengthen adversarial persona in critique prompt; validate critique calibration on benchmark
Revision makes output worse (regression) Medium High — quality degrades Best output tracker catches if revision score < prior best Return prior best output; log regression; review revision prompt
Max cycles reached with sub-threshold quality Medium Medium — partial quality improvement Quality warning flag in output metadata Route to human review queue; log for rubric improvement
Reflection cost exceeds budget on complex task Low–Medium Medium — unexpected cost Cost monitor Truncate reflection; return best output; alert
Critique hallucinates non-existent issues Medium Medium — unnecessary revision Validate critique against original for false positives Human audit of critique quality on sample; rubric refinement

14. Regulatory Considerations

EU AI Act

  • Art. 9 (Risk Management): reflection quality scores provide evidence of quality management for high-risk AI systems; must be preserved in the task audit log
  • Art. 15 (Accuracy and Robustness): the reflection cycle directly implements the requirement for AI systems to remain accurate and robust; quality benchmark validation satisfies the measurement requirement

ISO 42001

  • §8.4: The reflection mechanism and quality monitoring are part of the AI system's operational quality management lifecycle

NIST AI RMF

  • MEASURE 2.5: The quality score time series and benchmark validation implement the AI performance measurement requirement

15. Reference Implementations

AWS

Component Service
Generate + Critique + Revise Amazon Bedrock (Claude 3 Sonnet for generation; Claude 3 Haiku for critique)
Confidence Gate Custom Lambda function evaluating model response metadata
Quality Score Tracking Amazon CloudWatch custom metrics

Azure

Component Service
Generate + Critique + Revise Azure OpenAI Service (GPT-4o for generation; GPT-4o-mini for critique)
Reflection Orchestration Azure Durable Functions (sub-orchestration for reflection sub-loop)

On-Premises

Component Technology
Generate + Critique + Revise vLLM serving Llama 3.1 70B (generation); Llama 3.1 8B (critique)
Reflection Orchestration LangGraph with custom reflection node

Pattern ID Relationship Type Notes
Single Agent Pattern EAAPL-AGT001 Extends Reflection sub-loop extends the Reflect phase of the base agent loop
Stateful Agent Memory EAAPL-AGT002 Integrates With Critique outcomes are written to episodic memory for learning
Agent Cost Governance EAAPL-AGT010 Integrates With Reflection cost is tracked and controlled under the cost governance pattern
Debate Agent EAAPL-MAG005 Related Debate is an alternative quality mechanism using multiple agents; this pattern uses self-critique
Human-in-the-Loop Agent EAAPL-MAG003 Peer Outputs that fail reflection after max cycles are escalated to human review

17. Maturity Assessment

Overall Maturity: Emerging

Dimension Score (1–5) Evidence
Research Foundation 5 Constitutional AI, Self-Refine, Reflexion papers provide strong academic foundation
Production Deployment 3 Deployed in specialised high-stakes applications; general production tooling still maturing
Quality Measurement 3 Quality benchmark methodology developing; no standard evaluation framework yet
Cost Optimisation 3 Model routing for critique maturing; confidence gate calibration still domain-specific
Framework Support 3 LangGraph supports reflection nodes; general framework support growing

18. Revision History

Version Date Author Changes
1.0 2024-07-01 Architecture Board Initial publication
1.1 2025-02-15 ML Engineering Added model routing for critique; anti-loop cost ceiling; quality benchmark methodology
← Back to LibraryMore Agentic AI