[EAAPL-AGT006] Reflexive Agent
Category: Agentic AI
Sub-category: Quality Assurance Architecture
Version: 1.1
Maturity: Emerging
Tags: self-critique, reflection, quality-gate, generate-critique-revise, anti-loop, cost-control, output-quality
Regulatory Relevance: EU AI Act (Art. 9, 15), ISO 42001 §8.4, NIST AI RMF (MEASURE 2.5)
1. Executive Summary
The Reflexive Agent Pattern defines an architecture in which an AI agent evaluates the quality of its own outputs through a structured generate-critique-revise cycle before returning results to the calling system. By adding an explicit self-evaluation step to the standard agent loop, organisations achieve measurable improvements in output quality — particularly for high-stakes knowledge work tasks like contract drafting, regulatory analysis, and clinical documentation — without requiring manual human review of every output.
For CIO/CTO audiences: this pattern is the AI equivalent of a professional practice quality review. A lawyer reviews their own memo before sending it; a radiologist performs a double-read on ambiguous scans. The Reflexive Agent embeds that review step into the automated workflow, catching errors and quality gaps before they reach users or downstream systems. The trade-off is cost: reflection requires additional LLM inference calls. This pattern defines the governance around when reflection is worth the cost, how to prevent reflection cycles from running indefinitely, and how to integrate reflection with human oversight. For high-stakes, low-volume tasks, the quality improvement easily justifies the cost. For high-volume, low-stakes tasks, reflection should be applied selectively based on confidence scoring.
2. Problem Statement
Business Problem
AI agents deployed for high-stakes knowledge work (legal drafting, medical documentation, financial analysis) produce outputs that are factually incorrect, structurally incomplete, or inconsistent with organisational standards at rates that are unacceptable for direct use without review. Manual review by human experts is the only existing quality gate, but it is expensive and creates the bottleneck that undermines the productivity value of automation.
Technical Problem
A standard agent loop generates outputs without any internal mechanism to evaluate their quality relative to the task objective. The model produces the most probable next token; it has no objective function that penalises factual errors, logical inconsistencies, or failure to meet specified quality criteria. Adding an external evaluation step after the loop completes catches errors too late — the full generation cost has already been incurred for an output that may require significant revision.
Symptoms of Absence
- Agent outputs for high-stakes tasks require expert human review of every output, negating the productivity benefit
- Quality is inconsistent and unpredictable — excellent outputs and poor outputs arrive with no distinguishing signal
- No feedback loop: the agent does not learn from its quality failures within or across tasks
- High escalation rate to human review even when outputs are clearly adequate
Cost of Inaction
- Quality Risk: Unreviewed poor-quality outputs from agents performing regulated tasks create compliance and liability exposure
- Operational: Expert review bottleneck grows with agent usage volume, offsetting scale benefits
- Competitive: Peers who implement reflection achieve demonstrably better output quality and can deploy agents in higher-stakes domains
3. Context
When to Apply
- Output quality has material business or compliance consequences (legal, medical, financial, regulatory)
- The task type has clear, articulable quality criteria that can be expressed in a critique prompt
- Task volume is moderate (the additional LLM cost per task is justified by quality improvement)
- The target quality improvement is measurable (a quality benchmark exists or can be created)
- Correcting part of an output is faster than regenerating it in full
When NOT to Apply
- High-volume, low-stakes tasks where reflection cost exceeds quality improvement value
- Tasks with no articulable quality criteria (purely subjective outputs)
- Real-time tasks with hard latency constraints incompatible with multi-pass generation
- Tasks where the initial output quality is already above the acceptance threshold (waste of compute)
Prerequisites
- EAAPL-AGT001 (Single Agent Pattern) baseline
- Defined quality rubric for the task type (criteria for the critique prompt)
- Quality threshold parameter (minimum acceptable quality score)
- Anti-loop detection (max revision iteration limit)
- Cost tracking per reflection cycle
Industry Applicability
| Industry | Task Type | Quality Criteria | Reflection Value |
| --- | --- | --- | --- |
| Legal Services | Contract drafting, clause review | Accuracy, completeness, consistency with precedents | Very High |
| Healthcare | Clinical summary, discharge letter | Clinical accuracy, completeness, safety | Very High |
| Financial Services | Analyst reports, regulatory disclosures | Factual accuracy, regulatory compliance, clarity | High |
| Technology | Code generation, technical documentation | Correctness, security, completeness | High |
| Consulting | Executive reports, strategy documents | Logical consistency, evidence support, clarity | Medium |
4. Architecture Overview
The Reflexive Agent Pattern extends the standard agent loop (EAAPL-AGT001) by inserting a critique-revise sub-loop between the initial output generation and the final result delivery. The sub-loop has its own termination conditions and cost controls independent of the outer loop.
Why separate the critic from the generator?
A model asked to review its own output has a well-documented tendency to confirm that output rather than challenge it, so it often misses its own errors. Two strategies address this. First, the critique is prompted with an explicitly adversarial persona ("You are a strict expert reviewer. Identify all factual errors, logical gaps, and failures to meet the stated criteria"). Second, in higher-investment implementations, a separate model instance (or a different model entirely) performs the critique, which reduces the correlation between generator and critic errors.
Generate Phase
The initial generation follows the standard agent loop. The generate phase produces a candidate output — a document, analysis, code, or other artifact — and a confidence score (either model-produced or estimated from the output structure and completeness).
Confidence Gating
Before entering the reflection sub-loop, a confidence gate evaluates whether reflection is needed. If the initial output's confidence score exceeds the configured "auto-accept threshold," the output is returned without reflection. This is the primary cost optimisation: for the majority of tasks where the initial output is clearly adequate, no additional inference calls are made. The threshold is tuned per task type based on observed quality distributions.
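A minimal sketch of the confidence gate follows. The threshold values, the task-type keys, and the `Candidate` type are illustrative assumptions, not part of the pattern specification; real thresholds would come from the observed quality distributions mentioned above.

```python
from dataclasses import dataclass

# Hypothetical per-task-type auto-accept thresholds (assumed values).
AUTO_ACCEPT_THRESHOLDS = {
    "contract_clause_review": 0.90,
    "technical_documentation": 0.80,
}
# Conservative fallback so that unknown task types almost always reflect.
DEFAULT_THRESHOLD = 0.95


@dataclass
class Candidate:
    output_text: str
    confidence: float  # 0.0 to 1.0, model-produced or estimated


def needs_reflection(candidate: Candidate, task_type: str) -> bool:
    """Return True when the candidate must enter the critique-revise sub-loop."""
    threshold = AUTO_ACCEPT_THRESHOLDS.get(task_type, DEFAULT_THRESHOLD)
    # Only confidence strictly above the threshold skips reflection.
    return candidate.confidence <= threshold
```

With these assumed values, a contract clause candidate with confidence 0.72 enters reflection, while one with confidence 0.93 is returned directly.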
Critique Phase
The Critique Engine receives the candidate output and the task objective (original instruction + quality rubric). It executes an LLM inference call with an adversarial reviewer persona. The critique prompt is carefully designed to produce structured output: a list of specific issues (each with a category: factual error / logical gap / missing requirement / style violation / inconsistency) and an overall quality score (0–100). The critique prompt is the most important engineering artefact in this pattern — vague critique prompts produce vague, unhelpful critique that does not guide revision.
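The structured critique contract can be sketched as a prompt builder plus a strict parser. This is an illustration under assumptions: the snake_case category spellings, the prompt wording beyond the quoted persona, and the function names are hypothetical, and the LLM call itself is left out.

```python
import json
from dataclasses import dataclass

# Categories named in the pattern text; the snake_case spellings are assumptions.
CATEGORIES = {"factual_error", "logical_gap", "missing_requirement",
              "style_violation", "inconsistency"}


@dataclass
class Critique:
    issues: list        # list of dicts: {category, description, severity}
    quality_score: int  # 0 to 100


def build_critique_prompt(draft: str, instruction: str, rubric: list) -> str:
    """Assemble an adversarial review prompt that asks for JSON-only output."""
    criteria = "\n".join(f"- {c}" for c in rubric)
    return (
        "You are a strict expert reviewer. Identify all factual errors, "
        "logical gaps, and failures to meet the stated criteria.\n\n"
        f"Task instruction:\n{instruction}\n\n"
        f"Quality criteria:\n{criteria}\n\n"
        f"Draft under review:\n{draft}\n\n"
        "Respond with JSON only, in the form "
        '{"issues": [{"category": "...", "description": "...", '
        '"severity": "..."}], "quality_score": 0}. '
        f"Allowed categories: {', '.join(sorted(CATEGORIES))}."
    )


def parse_critique(raw: str) -> Critique:
    """Validate the critic's reply; raise ValueError on any schema violation."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    try:
        score, issues = data["quality_score"], data["issues"]
        bad = [i for i in issues if i["category"] not in CATEGORIES]
    except (KeyError, TypeError) as exc:
        raise ValueError(f"critique does not match schema: {exc!r}") from exc
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 100:
        raise ValueError(f"quality_score must be an integer 0-100: {score!r}")
    if bad:
        raise ValueError(f"unknown issue categories: {bad}")
    return Critique(issues=issues, quality_score=score)
```

Rejecting anything outside the schema keeps vague or free-form critiques from reaching the Quality Gate.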
Quality Gate
The Quality Gate evaluates the critique output. If the quality score meets or exceeds the acceptance threshold and no critical issues are flagged, the output is accepted and returned, even if minor issues remain. Otherwise, the Revision Engine is invoked, provided the cycle limit and cost ceiling have not been reached.
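The gate decision can be sketched as a pure function. The `"critical"` severity label, the `Decision` names, and the dict shape of an issue are assumptions made for illustration.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REVISE = "revise"
    FALLBACK_TO_BEST = "fallback_to_best"


def quality_gate(quality_score, issues, threshold, cycle, max_cycles,
                 critical_severity="critical"):
    """Decide what happens after a critique.

    issues is a list of dicts with a "severity" key (assumed shape);
    cycle is the 1-based number of the critique just completed.
    """
    has_critical = any(i.get("severity") == critical_severity for i in issues)
    if quality_score >= threshold and not has_critical:
        return Decision.ACCEPT
    if cycle >= max_cycles:
        # Cycle limit reached: the caller returns the best output seen so far.
        return Decision.FALLBACK_TO_BEST
    return Decision.REVISE
```

Note that a score above the threshold does not override a critical issue: such an output is still sent for revision.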
Revision Phase
The Revision Engine receives the original output, the original task instruction, and the structured critique. It invokes an LLM to produce a revised output that addresses the specific issues identified in the critique. The revision prompt is targeted: "Revise the following draft to address these specific issues: [critique issues]. Do not change content that was not flagged as an issue." This targeted revision approach is more efficient than full regeneration and preserves the valid portions of the initial output.
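As an illustration, the targeted revision prompt could be assembled as follows. The function name and the issue dict shape (`category`, `description`) are assumptions; the two instruction sentences are the ones quoted above.

```python
def build_revision_prompt(instruction: str, draft: str, issues: list) -> str:
    """Build a revision prompt that lists only the flagged issues."""
    if not issues:
        raise ValueError("revision requires at least one critique issue")
    numbered = "\n".join(
        f"{n}. [{issue['category']}] {issue['description']}"
        for n, issue in enumerate(issues, start=1)
    )
    return (
        f"Original task:\n{instruction}\n\n"
        "Revise the following draft to address these specific issues:\n"
        f"{numbered}\n\n"
        "Do not change content that was not flagged as an issue.\n\n"
        f"Draft:\n{draft}"
    )
```

Numbering the issues makes it easier to check, in the next critique, which of them the revision actually resolved.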
Anti-Loop Detection and Cost Control
The reflection sub-loop enforces a hard maximum of N critique-revise cycles (default: 3). If the output has not reached the acceptance threshold after N cycles, the best output produced so far (highest quality score across all iterations) is returned with a reflection metadata flag indicating that the quality threshold was not reached. This prevents infinite reflection loops from running up unbounded inference costs. The total cost of all reflection cycles is tracked and reported; a per-task reflection cost ceiling can trigger early termination.
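The control flow described above can be sketched as a bounded loop. This is one possible reading, not a definitive implementation: `critique_fn` and `revise_fn` are hypothetical callables standing in for the LLM calls, a cycle is counted per critique call, and the status strings other than `reflection_budget_exceeded` are assumed names.

```python
from dataclasses import dataclass


@dataclass
class ReflectionResult:
    output: str
    quality_score: int
    cycles: int   # number of critique calls made
    status: str   # accepted | max_cycles_reached | reflection_budget_exceeded
    cost: float   # cumulative cost of critique and revision calls


def reflect(draft, critique_fn, revise_fn, threshold=85, max_cycles=3,
            cost_ceiling=float("inf")):
    """Run the critique-revise sub-loop with a cycle limit and a cost ceiling.

    critique_fn(text) -> (score, issues, cost)
    revise_fn(text, issues) -> (new_text, cost)
    """
    if max_cycles < 1:
        raise ValueError("max_cycles must be at least 1")
    current, total_cost = draft, 0.0
    best_output, best_score = draft, -1

    def over_budget(cycle):
        return ReflectionResult(best_output, best_score, cycle,
                                "reflection_budget_exceeded", total_cost)

    for cycle in range(1, max_cycles + 1):
        score, issues, cost = critique_fn(current)
        total_cost += cost
        if score > best_score:  # Best Output Tracker
            best_output, best_score = current, score
        if score >= threshold:
            return ReflectionResult(current, score, cycle, "accepted", total_cost)
        if cycle == max_cycles:
            break  # a further revision could not be scored, so skip it
        if total_cost >= cost_ceiling:
            return over_budget(cycle)
        current, cost = revise_fn(current, issues)
        total_cost += cost
        if total_cost >= cost_ceiling:
            return over_budget(cycle)
    return ReflectionResult(best_output, best_score, max_cycles,
                            "max_cycles_reached", total_cost)
```

Because the best-scoring output is tracked separately from the current draft, a revision that lowers the score never replaces an earlier, better candidate in the fallback path.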
Reflection Memory
Critique outputs from completed tasks are written to the agent's episodic memory store (EAAPL-AGT002) with the task type, initial quality score, final quality score, and the specific issues identified. The Memory Consolidation Engine processes these records to update the semantic memory with task-type-specific quality learnings. Over time, the generator's prompting is improved based on accumulated knowledge of the most common quality failures for each task type — reducing the number of reflection cycles needed and improving first-pass quality.
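One simple way to turn stored critique records into task-type-specific learnings is to count issue categories per task type. The sketch below assumes a record shape (`task_type`, `issue_categories`) that is hypothetical; the EAAPL-AGT002 memory API is not shown.

```python
from collections import Counter, defaultdict


def common_failures(records, top_n=3):
    """Return the most frequent issue categories for each task type.

    records: iterable of dicts with "task_type" and "issue_categories"
    (a list of category strings); this shape is an assumption.
    """
    by_type = defaultdict(Counter)
    for record in records:
        by_type[record["task_type"]].update(record["issue_categories"])
    return {
        task_type: [category for category, _ in counts.most_common(top_n)]
        for task_type, counts in by_type.items()
    }
```

The resulting lists could, for example, be appended to the generator prompt as "common pitfalls" for that task type, which is one way to realise the first-pass improvement described above.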
5. Architecture Diagram
    flowchart TD
        subgraph Input["Task Input"]
            A[Task + Quality Rubric]
        end
        subgraph Core["Generate-Critique-Revise Loop"]
            B[Generate Phase]
            C{Confidence Gate}
            D[Critique Engine]
            E{Quality Gate}
            F[Revision Engine]
        end
        subgraph Output["Output Layer"]
            G[Accepted Output]
            H[Best Output Warning]
            I[(Reflection Memory)]
        end
        A --> B
        B --> C
        C -->|above threshold| G
        C -->|below threshold| D
        D --> E
        E -->|accepted| G
        E -->|max cycles hit| H
        E -->|revise| F
        F --> D
        G --> I
        H --> I
        style A fill:#dbeafe,stroke:#3b82f6
        style B fill:#f0fdf4,stroke:#22c55e
        style C fill:#f3e8ff,stroke:#a855f7
        style D fill:#f0fdf4,stroke:#22c55e
        style E fill:#f3e8ff,stroke:#a855f7
        style F fill:#f0fdf4,stroke:#22c55e
        style G fill:#d1fae5,stroke:#10b981
        style H fill:#fee2e2,stroke:#ef4444
        style I fill:#fef9c3,stroke:#eab308
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Generate Phase | Agent Loop | Standard agent execution producing candidate output | EAAPL-AGT001 implementation | Critical |
| Confidence Gate | Quality Control | Evaluates initial output confidence; gates reflection entry | Model logprobs; heuristic scoring; LLM confidence prompt | High |
| Critique Engine | AI Component | Generates structured critique using adversarial reviewer prompt | Separate LLM instance (same or different model); critique-tuned prompt | Critical |
| Quality Gate | Logic Component | Evaluates critique quality score vs. acceptance threshold; decides accept/revise/escalate | Custom logic; configurable threshold per task type | Critical |
| Revision Engine | AI Component | Produces targeted revision addressing specific critique issues | LLM with revision-focused prompt | Critical |
| Best Output Tracker | State | Tracks the highest-quality output produced across reflection cycles | In-memory; part of loop state | High |
| Anti-Loop Controller | Safety | Enforces maximum cycle limit; triggers fallback to best output | Counter in loop state; configurable max N | Critical |
| Reflection Cost Monitor | Governance | Tracks cumulative token cost of critique + revision calls; enforces cost ceiling | Custom; EAAPL-AGT010 integration | High |
| Reflection Memory Writer | Learning | Writes critique outcomes to episodic memory for future learning | EAAPL-AGT002 memory write API | Medium |
| Quality Score Time Series | Observability | Tracks quality scores per task type over time; detects drift | Metrics platform; Grafana; custom analytics | Medium |
7. Data Flow
Full Reflection Cycle
| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Task System | Submits task with quality_rubric: list of acceptance criteria, quality_threshold (e.g., 85/100) | Task + quality config |
| 2 | Generate Phase | Executes standard agent loop; produces candidate output and confidence score | Candidate: {output_text, confidence: 0.72} |
| 3 | Confidence Gate | Compares confidence (0.72) to auto-accept threshold (e.g., 0.90): below threshold; enter reflection | Reflection triggered |
| 4 | Critique Engine | Sends critique prompt: [adversarial_persona] Review this draft against [quality_rubric]. Output JSON: {issues: [{category, description, severity}], quality_score: int} | Structured critique: {issues: [{factual_error: ...}, {missing_req: ...}], quality_score: 71} |
| 5 | Quality Gate | Quality score 71 < acceptance threshold 85; cycle count 1 < max 3; continue | Revise |
| 6 | Revision Engine | Sends revision prompt with original output + critique issues | Revised output |
| 7 | Best Output Tracker | Revised output quality estimated; compare to prior best | Updated best candidate |
| 8 | Critique Engine (cycle 2) | Critiques revised output | Critique: {issues: [{minor_style: ...}], quality_score: 89} |
| 9 | Quality Gate | Score 89 ≥ threshold 85; accept | Accept |
| 10 | Output | Returns accepted output with metadata: {output, reflection_cycles: 2, final_quality_score: 89, issues_resolved: 2} | Final output |
| 11 | Reflection Memory Writer | Writes: task_type, initial_score, final_score, issues_resolved, cycle_count | Memory record |
Error Flow
| Error | Detection | Recovery |
| --- | --- | --- |
| Critique engine returns malformed JSON | JSON parse error | Retry critique call with explicit JSON schema instruction; max 2 retries |
| Revision does not improve quality score | Quality Gate detects same or lower score | Increment cycle counter; if max reached, return best output; log plateau |
| Reflection cost budget exceeded | Cost Monitor | Immediately return best output with status: reflection_budget_exceeded |
| LLM provider timeout during critique | Timeout exception | Return current best output with status: critique_timeout |
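The malformed-JSON recovery path can be sketched as follows. `call_llm` is a hypothetical callable (prompt string in, raw reply string out), and the wording of the schema reminder is an assumption; only the limit of 2 retries comes from the table above.

```python
import json

# Assumed wording for the explicit schema instruction added on retry.
SCHEMA_REMINDER = (
    "\n\nYour previous reply was not valid JSON. Reply with a single JSON object "
    'of the form {"issues": [...], "quality_score": <integer 0-100>} and nothing else.'
)


def critique_with_retry(call_llm, prompt, max_retries=2):
    """Return the parsed critique dict, or None once retries are exhausted."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):  # first attempt plus max_retries retries
        raw = call_llm(attempt_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict) and all(k in data for k in ("issues", "quality_score")):
            return data
        # Retry with the explicit schema instruction appended.
        attempt_prompt = prompt + SCHEMA_REMINDER
    return None
```

A `None` result signals the caller to fall back to the current best output rather than keep spending on critique calls.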
8. Security Considerations
Prompt Injection in Critique
- The critique prompt injects the candidate output as content — if the candidate output contains injected instructions, the critique LLM could be manipulated
- Mitigation: the critique prompt wrapper clearly delineates the content being reviewed from the critic's instructions; content is wrapped in explicit delimiters (XML tags or similar); output validation on critique output before Quality Gate evaluation
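A minimal sketch of the delimiter mitigation, assuming XML-style tags: escaping angle brackets in the candidate output means injected text cannot close the wrapper tag early and pose as critic instructions. The tag name and function name are illustrative.

```python
from html import escape


def wrap_untrusted(content: str, tag: str = "candidate_output") -> str:
    """Wrap untrusted content in a delimiter tag it cannot break out of."""
    # escape() with quote=False replaces &, < and > only.
    return f"<{tag}>\n{escape(content, quote=False)}\n</{tag}>"
```

One trade-off of this design is that the critic sees escaped markup (for example `&lt;div&gt;`), which may matter when the candidate output is itself code or HTML; delimiting is also only one layer, and the output validation named above still applies.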
OWASP LLM Top 10
| OWASP LLM Risk | Reflection Applicability | Mitigation |
| --- | --- | --- |
| LLM01 Prompt Injection | Candidate output injected into critique context | Content delimiters; output validation on critique JSON |
| LLM09 Overreliance | Quality score could create false confidence in flawed output | Quality score is advisory metadata; high-stakes outputs always include reflection metadata for human reference; quality score ≠ accuracy guarantee |
| LLM08 Excessive Agency | Reflection cycles could be exploited to iteratively refine harmful outputs | Quality rubric includes safety criteria; critique is instructed to flag safety violations as terminal issues; safety-flagged outputs are rejected regardless of quality score |
| LLM04 DoS | Infinite reflection loops exhaust inference budget | Hard cycle limit; cost ceiling enforcement; anti-loop controller |
9. Governance Considerations
Quality Rubric Governance
- Quality rubrics are owned by domain subject matter experts (legal team owns legal rubrics, clinical leads own clinical rubrics)
- Rubrics are versioned and change-managed; changes require impact assessment on existing task benchmarks
- Acceptance thresholds are set and reviewed by the domain owner, not by engineering
Model Risk Management
- Reflection quality scores are not objective ground truth; they are model judgments subject to model limitations
- Quality scores must be validated against human expert assessments on a held-out benchmark before being used as primary quality gatekeepers
- For highest-stakes tasks, model reflection quality scores are advisory only; human review remains the final gate
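Validation against human expert assessments could start with a simple agreement report such as the sketch below. The metric choice (mean absolute error, decision agreement, false-accept rate) and all names are assumptions rather than a prescribed benchmark methodology.

```python
def calibration_report(model_scores, human_scores, threshold=85):
    """Compare model quality scores with human scores on a held-out benchmark."""
    if not model_scores or len(model_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = list(zip(model_scores, human_scores))
    n = len(pairs)
    return {
        # Average distance between model and human scores.
        "mean_abs_error": sum(abs(m - h) for m, h in pairs) / n,
        # Share of items where both would make the same accept/reject call.
        "decision_agreement": sum((m >= threshold) == (h >= threshold) for m, h in pairs) / n,
        # Share of items the model would accept but the human would reject.
        "false_accept_rate": sum(m >= threshold and h < threshold for m, h in pairs) / n,
    }
```

The false-accept rate is the figure most relevant to the risk described above, since it measures how often a model score alone would let a sub-standard output through.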
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
| --- | --- | --- | --- |
| Quality Rubric Register | Domain SME + AI Platform | Per task type; on change | Documents acceptance criteria per task type and threshold justification |
| Reflection Quality Benchmark | ML Engineering | Monthly | Compares model quality scores to human assessments; validates rubric effectiveness |
| Quality Score Distribution Report | Operations | Monthly | Per-task-type quality score distributions; identifies degradation |
| Reflection Cost Report | FinOps | Monthly | Average reflection cost per task type; ROI analysis vs. quality improvement |
10. Operational Considerations
SLOs
| SLO | Target | Window | Alert |
| --- | --- | --- | --- |
| Reflection cycle p95 latency | ≤ 30s per cycle | 1-hour rolling | > 60s triggers P2 |
| Auto-accept rate (no reflection needed) | ≥ 60% of tasks | 24-hour rolling | < 40% indicates prompt quality issue; P3 |
| Quality acceptance rate (within max cycles) | ≥ 90% | 24-hour rolling | < 80% triggers P2; quality rubric review |
| Average reflection cycles per accepted output | ≤ 1.5 | 24-hour rolling | > 2.5 indicates poor initial generation |
Monitoring
- Quality score distribution per task type: trending toward lower initial scores indicates prompt degradation
- Reflection cycle count distribution: bimodal (0 cycles or ≥2 cycles) may indicate confidence gate miscalibration
- Cost per reflection cycle per task type: anomaly detection for cost spikes
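The rate-based SLOs above can be computed from per-task reflection records, as in this sketch. The record shape (`cycles`, `accepted`) is hypothetical, and two interpretations are assumed: the acceptance rate is taken over all tasks, and the average cycle count is taken over accepted outputs that actually entered reflection.

```python
from statistics import mean

# SLO targets from the table above.
TARGETS = {"auto_accept_rate": 0.60, "acceptance_rate": 0.90, "avg_cycles": 1.5}


def slo_snapshot(records):
    """Compute SLO metrics for one window and list any breached targets.

    records: list of dicts with "cycles" (int) and "accepted" (bool).
    """
    if not records:
        raise ValueError("no records in window")
    total = len(records)
    reflected = [r["cycles"] for r in records if r["accepted"] and r["cycles"] > 0]
    metrics = {
        "auto_accept_rate": sum(r["cycles"] == 0 for r in records) / total,
        "acceptance_rate": sum(bool(r["accepted"]) for r in records) / total,
        "avg_cycles": mean(reflected) if reflected else 0.0,
    }
    breaches = [
        name for name in ("auto_accept_rate", "acceptance_rate")
        if metrics[name] < TARGETS[name]
    ]
    if metrics["avg_cycles"] > TARGETS["avg_cycles"]:
        breaches.append("avg_cycles")
    return {**metrics, "breaches": breaches}
```

The latency SLO is omitted here because it needs per-cycle timing data rather than per-task outcomes.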
11. Cost Considerations
Cost Drivers
| Scenario | Additional Token Cost vs. No Reflection | Quality Benefit |
| --- | --- | --- |
| 60% auto-accept, 40% need 1 reflection cycle | +40% (approx) | High: issues caught in 40% of cases |
| 60% auto-accept, 30% need 1 cycle, 10% need 2 cycles | +60% (approx) | Very High |
| 20% auto-accept, 80% need 2 cycles | +200% (approx) | Very High but expensive; optimise generation |
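A back-of-envelope model for these scenarios is sketched below. It assumes every reflection cycle costs a fixed multiple of the initial generation (`cycle_cost_ratio`, a hypothetical parameter); the source does not state the cost model behind its approximate figures.

```python
def reflection_overhead(cycle_mix, cycle_cost_ratio=1.0):
    """Expected extra token cost as a fraction of the no-reflection cost.

    cycle_mix maps number of reflection cycles to the share of tasks that
    need that many cycles; shares must sum to 1.
    cycle_cost_ratio is the assumed cost of one cycle relative to the
    initial generation.
    """
    if abs(sum(cycle_mix.values()) - 1.0) > 1e-9:
        raise ValueError("task shares must sum to 1")
    expected_cycles = sum(n * share for n, share in cycle_mix.items())
    return expected_cycles * cycle_cost_ratio


scenarios = {
    "60% auto, 40% x1": {0: 0.6, 1: 0.4},
    "60% auto, 30% x1, 10% x2": {0: 0.6, 1: 0.3, 2: 0.1},
    "20% auto, 80% x2": {0: 0.2, 2: 0.8},
}
for name, mix in scenarios.items():
    print(f"{name}: +{reflection_overhead(mix):.0%}")
```

With a ratio of 1.0 this linear model matches the first row (+40%) but gives lower values for the multi-cycle rows (+50% and +160%). The table's higher approximations would be consistent with later cycles costing more than the first, for example because critique and revision prompts grow with accumulated context, but that explanation is an assumption.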
Optimisations
- Use a smaller, faster model for the critique step and the full model only for revision (model routing)
- Cache common critique patterns and their resolutions as procedural memories to reduce iteration count
- Lower the auto-accept threshold (be more selective about what triggers reflection) if the auto-accept rate is too low, provided benchmark results show that quality holds at the new threshold
Indicative Cost Range (per 1,000 tasks)
| Task Type | Without Reflection | With Reflection (1.5 avg cycles) | Quality Improvement |
| --- | --- | --- | --- |
| Contract clause review | $20–50 | $35–85 | +20–35% quality score |
| Clinical documentation | $15–40 | $28–72 | +25–40% quality score |
| Technical documentation | $10–30 | $16–48 | +15–25% quality score |
12. Trade-Off Analysis
Reflection Implementation Options
| Option | Quality Improvement | Cost | Complexity | Best For |
| --- | --- | --- | --- | --- |
| A: Same-model adversarial critique (Recommended) | High | Medium | Low | Most production deployments |
| B: Separate critic model | Very High | High | Medium | Highest-stakes domains (legal, clinical) |
| C: Constitutional AI-style (self-correction via principles) | Medium–High | Medium | Low | When critique rubric is stable and articulable as principles |
| D: Multi-agent debate (see EAAPL-MAG005) | Very High | Very High | High | High-stakes decisions where structured debate adds unique value |
Architectural Tensions
| Tension | Left Pole | Right Pole | Balance |
| --- | --- | --- | --- |
| Quality vs. Latency | Maximum reflection cycles for best quality | Single pass for lowest latency | Risk-tiered: async reflection for background tasks; 1-cycle max for interactive |
| Critique specificity vs. Prompt complexity | Highly detailed rubric; specific critique | Simple rubric; general critique | Start with 5–10 specific criteria; iterate based on quality benchmark results |
| Auto-accept rate vs. Quality coverage | High threshold: most outputs go through reflection | Low threshold: rarely reflects; risk of poor quality | Tune threshold per task type to balance cost and quality |
13. Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| Critique affirms rather than challenges (sycophancy) | High (model tendency) | High: reflection adds cost but not quality | Quality score does not improve across cycles | Strengthen adversarial persona in critique prompt; validate critique calibration on benchmark |
| Revision makes output worse (regression) | Medium | High: quality degrades | Best output tracker catches if revision score < prior best | Return prior best output; log regression; review revision prompt |
| Max cycles reached with sub-threshold quality | Medium | Medium: partial quality improvement | Quality warning flag in output metadata | Route to human review queue; log for rubric improvement |
| Reflection cost exceeds budget on complex task | Low–Medium | Medium: unexpected cost | Cost monitor | Truncate reflection; return best output; alert |
| Critique hallucinates non-existent issues | Medium | Medium: unnecessary revision | Validate critique against original for false positives | Human audit of critique quality on sample; rubric refinement |
14. Regulatory Considerations
EU AI Act
- Art. 9 (Risk Management): reflection quality scores provide evidence of quality management for high-risk AI systems; must be preserved in the task audit log
- Art. 15 (Accuracy and Robustness): the reflection cycle supports the requirement for AI systems to remain accurate and robust; quality benchmark validation provides evidence for the measurement requirement
ISO 42001
- §8.4: The reflection mechanism and quality monitoring are part of the AI system's operational quality management lifecycle
NIST AI RMF
- MEASURE 2.5: The quality score time series and benchmark validation implement the AI performance measurement requirement
15. Reference Implementations
AWS
| Component | Service |
| --- | --- |
| Generate + Critique + Revise | Amazon Bedrock (Claude 3 Sonnet for generation; Claude 3 Haiku for critique) |
| Confidence Gate | Custom Lambda function evaluating model response metadata |
| Quality Score Tracking | Amazon CloudWatch custom metrics |
Azure
| Component | Service |
| --- | --- |
| Generate + Critique + Revise | Azure OpenAI Service (GPT-4o for generation; GPT-4o-mini for critique) |
| Reflection Orchestration | Azure Durable Functions (sub-orchestration for reflection sub-loop) |
On-Premises
| Component | Technology |
| --- | --- |
| Generate + Critique + Revise | vLLM serving Llama 3.1 70B (generation); Llama 3.1 8B (critique) |
| Reflection Orchestration | LangGraph with custom reflection node |
16. Related Patterns
| Pattern | ID | Relationship Type | Notes |
| --- | --- | --- | --- |
| Single Agent Pattern | EAAPL-AGT001 | Extends | Reflection sub-loop extends the Reflect phase of the base agent loop |
| Stateful Agent Memory | EAAPL-AGT002 | Integrates With | Critique outcomes are written to episodic memory for learning |
| Agent Cost Governance | EAAPL-AGT010 | Integrates With | Reflection cost is tracked and controlled under the cost governance pattern |
| Debate Agent | EAAPL-MAG005 | Related | Debate is an alternative quality mechanism using multiple agents; this pattern uses self-critique |
| Human-in-the-Loop Agent | EAAPL-MAG003 | Peer | Outputs that fail reflection after max cycles are escalated to human review |
17. Maturity Assessment
Overall Maturity: Emerging
| Dimension | Score (1–5) | Evidence |
| --- | --- | --- |
| Research Foundation | 5 | Constitutional AI, Self-Refine, Reflexion papers provide strong academic foundation |
| Production Deployment | 3 | Deployed in specialised high-stakes applications; general production tooling still maturing |
| Quality Measurement | 3 | Quality benchmark methodology developing; no standard evaluation framework yet |
| Cost Optimisation | 3 | Model routing for critique maturing; confidence gate calibration still domain-specific |
| Framework Support | 3 | LangGraph supports reflection nodes; general framework support growing |
18. Revision History
| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2024-07-01 | Architecture Board | Initial publication |
| 1.1 | 2025-02-15 | ML Engineering | Added model routing for critique; anti-loop cost ceiling; quality benchmark methodology |