Hybrid Intelligence Pattern
Pattern ID: EAAPL-HIL008
Status: Proven
Tags: human-oversight explainability accountability high-complexity
Version: 1.0
Last Updated: 2026-06-12
1. Executive Summary
The Hybrid Intelligence Pattern defines an architecture for systematically allocating each component of a complex task to the agent — human or AI — that can perform it best. Rather than treating AI as an optional add-on to human workflows or humans as a fallback for AI failures, it decomposes the task into sub-components, classifies each by suitability criteria, and designs explicit handoff protocols that transfer context between agents with minimal friction and maximal fidelity.
This pattern addresses the most common failure mode in enterprise AI deployments: asking a single agent (usually AI alone or human-plus-AI suggestions in a sidebar) to perform all components of a task, ignoring that AI excels at pattern-matched high-volume processing while humans excel at novel, ethical, and ambiguous reasoning. The pattern covers task decomposition methodology; interface design for human-AI collaboration; handoff protocols; cognitive load management; trust calibration to prevent both automation bias and underuse; and performance measurement that compares hybrid intelligence against human-only and AI-only baselines. CIOs and CTOs gain a framework for continuous optimisation of human-AI task allocation, delivering measurable quality and efficiency gains that neither AI nor humans can achieve independently.
2. Problem Statement
Business Problem
Enterprises deploying AI into complex knowledge work face a paradox: AI is too often given the entire task (producing poor quality on the hard parts) or too little of the task (producing limited efficiency gains). The optimal allocation — AI takes what it does well, humans take the rest — is rarely designed deliberately. It emerges ad hoc, inconsistently across teams, and without any mechanism for measurement or improvement.
Technical Problem
Task decomposition requires identifying sub-task types and classifying them by AI suitability criteria. Without this classification, engineers default to giving AI the whole input and presenting the whole output to humans for review — which does not actually leverage AI's strengths or protect against its weaknesses. The handoff between AI and human sub-tasks is an engineering problem that is frequently underspecified: what context transfers, in what format, at what latency, and with what guarantees of completeness?
Symptoms
- AI is used for whole-task processing but humans override a high percentage of AI outputs
- Human reviewers cannot articulate which parts of the AI output they are checking vs rubber-stamping
- Task completion time with AI assistance is not significantly faster than without it, despite AI being deployed
- No measurement exists of whether hybrid performance exceeds human-only or AI-only performance
- Human trust in AI varies wildly across team members: some always accept AI outputs; others always override
Cost of Inaction
- Sub-optimal allocation leaves efficiency gains unrealised and quality gains uncaptured
- Automation bias (over-trusting AI) and underuse bias (overriding AI without cause) coexist in the same team, producing inconsistent quality
- Without performance measurement, it is impossible to demonstrate AI ROI or to improve the allocation over time
3. Context
When to Apply
- Complex knowledge work tasks with multiple identifiable sub-components
- Domains where some sub-tasks are AI-suitable (high volume, pattern-matched) and others are human-suitable (novel, ethical, ambiguous)
- Regulated environments where accountability for specific decision components must be attributable to a named human
- Teams mature enough to run performance measurement and iterate on human-AI allocation
When NOT to Apply
- Simple uniform tasks where decomposition does not reveal meaningfully different sub-components
- Latency-critical tasks where the handoff protocol overhead is architecturally prohibitive
- Organisations that lack the operational maturity to manage the handoff protocol and performance measurement
Prerequisites
- Complex task with identifiable, separable sub-components
- AI system capable of processing at least some sub-components reliably
- Human expert workforce available for high-suitability-human sub-components
- Performance measurement infrastructure (quality metrics, timing, outcome tracking)
Industry Applicability
| Industry | Complex Task | AI-Suitable Sub-Tasks | Human-Suitable Sub-Tasks |
|---|---|---|---|
| Legal | Contract review | Clause identification, standard risk flagging, precedent matching | Novel risk assessment, negotiation recommendation, client advice |
| Healthcare | Clinical documentation | ICD coding of standard diagnoses, medication reconciliation | Complex comorbidity assessment, patient communication, ethical decisions |
| Financial Services | Credit assessment | Document data extraction, fraud signal scoring, policy compliance check | Final credit judgment, exception handling, relationship context |
| Insurance | Claims processing | Document classification, damage estimation from photos, fraud scoring | Coverage interpretation disputes, large loss adjustment, litigation management |
| HR | Candidate assessment | Resume parsing, skills matching, compliance screening | Interview quality assessment, culture fit judgment, hiring decision |
| Research | Literature review | Paper retrieval, citation extraction, structured data extraction | Synthesis, hypothesis generation, expert interpretation |
4. Architecture Overview
The Hybrid Intelligence Pattern requires implementing six capabilities in integrated sequence.
Capability 1 — Task Decomposition Framework. The first step is to decompose the target task into sub-components and classify each by suitability using four criteria: volume and repetition (AI-suitable if the sub-task type recurs at high volume with consistent patterns); variance and novelty (human-suitable if novel cases are common and require contextual judgment); ethical and accountability requirements (always human if the sub-task produces a judgment requiring personal accountability — e.g. "should we approve this patient's request for surgery?"); and error tolerance (human-required if errors have high consequence and are hard to reverse). Each sub-task is classified on a 2×2 matrix: AI confidence (can the AI perform this reliably?) versus consequence of error (how bad is an undetected AI error?). Sub-tasks in the high-confidence, low-consequence quadrant are AI-automated; sub-tasks in the low-confidence or high-consequence quadrants involve human judgment.
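The 2×2 classification described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `SubTask` fields, the `Allocation` labels, and both numeric thresholds are assumptions introduced here, since the pattern does not fix any cut-off values.

```python
from dataclasses import dataclass
from enum import Enum


class Allocation(Enum):
    AI_AUTOMATED = "ai_automated"
    HUMAN_JUDGMENT = "human_judgment"


@dataclass(frozen=True)
class SubTask:
    name: str
    ai_confidence: float       # 0.0-1.0: how reliably the AI performs this sub-task type
    error_consequence: float   # 0.0-1.0: severity of an undetected AI error
    requires_accountability: bool = False  # judgment needing a named accountable human


# Hypothetical cut-offs; tune per domain and record them in the allocation decision record.
CONFIDENCE_THRESHOLD = 0.9
CONSEQUENCE_THRESHOLD = 0.5


def classify(sub_task: SubTask) -> Allocation:
    """Place a sub-task on the AI confidence vs consequence-of-error matrix."""
    # Accountability requirements always route to a human, whatever the scores say.
    if sub_task.requires_accountability:
        return Allocation.HUMAN_JUDGMENT
    high_confidence = sub_task.ai_confidence >= CONFIDENCE_THRESHOLD
    low_consequence = sub_task.error_consequence < CONSEQUENCE_THRESHOLD
    if high_confidence and low_consequence:
        return Allocation.AI_AUTOMATED
    # Low confidence or high consequence: human judgment is involved.
    return Allocation.HUMAN_JUDGMENT
```

Only the high-confidence, low-consequence quadrant is automated; every other combination falls through to human judgment, which keeps the default conservative.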
Capability 2 — Interface Design for Human-AI Collaboration. The collaboration interface is not a blank canvas with AI suggestions floating in a sidebar. That design combines anchoring bias with cognitive overload and tends to produce poor outcomes. The recommended design is a structured presentation: the AI takes the first pass and produces output in a defined schema; the human reviews that output, edits specific fields, and approves. The interface explicitly marks which sub-tasks were AI-completed (with a confidence indicator) and which require human action. AI-completed fields are pre-filled and editable; human-required fields are empty and mandatory. This design communicates clearly what has been done, what needs review, and what the human must decide, without requiring the human to re-read the full source input from scratch.
Capability 3 — Handoff Protocol. When a task component transitions from AI to human (or from one AI component to another), the handoff must be specified: what context transfers (previous sub-task outputs, source documents, AI confidence for each field, retrieved evidence, constraints that apply to this sub-task); what format (structured JSON schema shared by AI output and human input form); what latency is expected (synchronous for human-waiting workflows; async for batch); and what happens if the handoff is incomplete or invalid (validation at the receiving side; rejection with retry request). The handoff message schema is the contract between AI and human components of the workflow. It must be versioned: changes to the schema require migration of in-flight tasks.
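A receiving-side validation step for the handoff contract might look like the sketch below. It uses only the standard library for brevity; a JSON Schema validator would be an equally reasonable choice. The field names follow the `handoff_package` and `ai_outputs` shapes listed in the Data Flow section, while `schema_version`, its value, and the error message wording are assumptions added for illustration.

```python
from typing import Any

SCHEMA_VERSION = "1.0"  # hypothetical version label for the handoff contract

# Top-level fields and their expected Python types.
REQUIRED_FIELDS: dict[str, type] = {
    "schema_version": str,
    "task_id": str,
    "completed_sub_tasks": list,
    "pending_human_sub_tasks": list,
    "context_docs": list,
    "constraints": list,
}
REQUIRED_AI_OUTPUT_FIELDS = ("sub_task_id", "result", "confidence", "evidence")


def validate_handoff(package: dict[str, Any]) -> list[str]:
    """Return validation errors; an empty list means the package is accepted."""
    errors: list[str] = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in package:
            errors.append(f"missing field: {name}")
        elif not isinstance(package[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    if errors:
        return errors  # structural problems make deeper checks meaningless
    if package["schema_version"] != SCHEMA_VERSION:
        errors.append(f"unsupported schema_version: {package['schema_version']}")
    for index, item in enumerate(package["completed_sub_tasks"]):
        if not isinstance(item, dict):
            errors.append(f"completed_sub_tasks[{index}] is not an object")
            continue
        for key in REQUIRED_AI_OUTPUT_FIELDS:
            if key not in item:
                errors.append(f"completed_sub_tasks[{index}] missing {key}")
        confidence = item.get("confidence")
        if isinstance(confidence, (int, float)) and not 0.0 <= confidence <= 1.0:
            errors.append(f"completed_sub_tasks[{index}] confidence out of range")
    return errors
```

A non-empty result corresponds to the "rejection with retry request" branch: the sender receives the error list and can regenerate the affected sub-task output.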
Capability 4 — Cognitive Load Management. The human reviewer's time is the bottleneck in any hybrid intelligence system. Cognitive load must be actively managed: present only the minimum necessary information on the primary view; allow drill-down for supporting evidence; pre-process and structure AI output to eliminate information the human does not need to review (e.g. do not show a human reviewer the full 100-page document if the AI has already extracted the 5 relevant clauses they need to judge); use progressive disclosure (show the highest-priority item first; additional items accessible on demand). Time-on-review is tracked per sub-task type; unexpectedly short review times (potential rubber-stamping) and unexpectedly long review times (interface complexity issue) are both flagged for investigation.
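The review-time flagging described above could be approximated by comparing each review against the typical time for its sub-task type. The sketch below uses the median of past review times as "typical"; the ratios `0.25` and `3.0`, the flag labels, and the function name are all illustrative assumptions rather than values taken from this pattern.

```python
from statistics import median
from typing import Sequence


def flag_review_time(
    time_spent_ms: float,
    history_ms: Sequence[float],
    short_ratio: float = 0.25,   # hypothetical: under 25% of typical looks like rubber-stamping
    long_ratio: float = 3.0,     # hypothetical: over 3x typical suggests an interface problem
) -> str:
    """Classify one review time against the history for the same sub-task type."""
    if not history_ms:
        return "insufficient_history"
    typical = median(history_ms)
    if time_spent_ms < short_ratio * typical:
        return "possible_rubber_stamping"
    if time_spent_ms > long_ratio * typical:
        return "possible_interface_issue"
    return "normal"
```

A flag is a prompt for investigation, not a verdict: a short review can be legitimate on a trivially correct item, which is why both outcomes are labelled "possible".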
Capability 5 — Trust Calibration. Human trust in AI varies by individual and over time. Both extremes are harmful: over-trust (automation bias) produces rubber-stamping; under-trust (automation resistance) eliminates AI efficiency gains. Trust calibration involves: tracking each human's agreement rate with AI on each sub-task type; comparing their agreement rate to the AI's actual accuracy rate on that sub-task type; if agreement rate significantly exceeds accuracy rate, flag automation bias and notify supervisor; if agreement rate significantly falls below accuracy rate, investigate whether AI performance on this sub-task type is genuinely poor (justified under-trust) or whether individual bias is driving over-riding. Trust calibration data feeds the task allocation framework: if a human systematically re-does AI sub-tasks from scratch, the AI is not adding value for that component with that human and the allocation should be reconsidered.
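The comparison at the heart of trust calibration, a reviewer's agreement rate against the AI's measured accuracy on the same sub-task type, can be sketched as follows. The record shape, the signal labels, and the default tolerance of 0.15 (read here as absolute percentage points) are assumptions for illustration; the AI accuracy figure is expected to come from an independent sampled audit.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ReviewRecord:
    reviewer_id: str
    sub_task_type: str
    accepted_ai_output: bool  # True if the reviewer left the AI value unchanged


def agreement_rate(
    records: Iterable[ReviewRecord], reviewer_id: str, sub_task_type: str
) -> Optional[float]:
    """Share of AI outputs this reviewer accepted for one sub-task type."""
    relevant = [
        r for r in records
        if r.reviewer_id == reviewer_id and r.sub_task_type == sub_task_type
    ]
    if not relevant:
        return None  # no reviews yet, so nothing to calibrate
    return sum(r.accepted_ai_output for r in relevant) / len(relevant)


def trust_signal(agreement: float, ai_accuracy: float, tolerance: float = 0.15) -> str:
    """Compare agreement with audited AI accuracy and name the calibration state."""
    gap = agreement - ai_accuracy
    if gap > tolerance:
        return "automation_bias"          # accepts more often than the AI is right
    if gap < -tolerance:
        return "under_trust_investigate"  # may be justified if AI accuracy is poor
    return "calibrated"
```

For example, a reviewer who accepts 19 of 20 AI outputs on a sub-task type where audited AI accuracy is 0.70 would be flagged for automation bias.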
Capability 6 — Performance Measurement. The hybrid intelligence system must be benchmarked against human-only and AI-only baselines to demonstrate value and identify optimisation opportunities. Metrics measured: task completion time (hybrid vs human-only vs AI-only); output quality (human expert evaluation of random samples from each condition); error rate on downstream outcomes (regulatory findings, claims outcomes, customer complaints); human cognitive load (reported workload rating, time on task); and cost per completed task. The measurement must be run on sufficiently large samples to achieve statistical power. Results are reviewed quarterly; significant deviations from expected performance trigger an allocation review.
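One way to check whether a difference in error rate between two conditions (for example hybrid versus human-only) is more than sampling noise is a two-proportion z-test. The sketch below is one possible choice of test, not one mandated by the pattern, and it assumes independent samples that are large enough for the normal approximation to hold.

```python
import math


def two_proportion_z_test(
    errors_a: int, n_a: int, errors_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in error rates a - b."""
    p_a = errors_a / n_a
    p_b = errors_b / n_b
    # Pooled error rate under the null hypothesis that both conditions are equal.
    pooled = (errors_a + errors_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        return 0.0, 1.0  # no errors (or all errors) in both samples: no evidence of difference
    z = (p_a - p_b) / standard_error
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A negative `z` with a small p-value means condition `a` has a credibly lower error rate than condition `b`. Time and cost metrics are continuous rather than proportions and would need a different test.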
5. Architecture Diagram
flowchart TD
subgraph Decomposition["Task Decomposition"]
A[Complex Task Input]
B[Task Decomposer]
C{Sub-task Classifier}
end
subgraph Processing["Processing Layer"]
D[AI Processing Engine]
E[Human Action Queue]
F[Handoff Package Builder]
end
subgraph Assembly["Assembly and Learning"]
G[Collaboration Interface]
H[Output Assembler]
I[Trust Calibration Monitor]
end
A --> B
B --> C
C -->|AI-suitable| D
C -->|human-suitable| E
D --> F
E --> G
F --> G
G --> H
H --> I
I -->|bias alert| G
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#f0fdf4,stroke:#22c55e
style C fill:#f3e8ff,stroke:#a855f7
style D fill:#f0fdf4,stroke:#22c55e
style E fill:#f0fdf4,stroke:#22c55e
style F fill:#f0fdf4,stroke:#22c55e
style G fill:#f0fdf4,stroke:#22c55e
style H fill:#d1fae5,stroke:#10b981
style I fill:#fee2e2,stroke:#ef4444
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Task Decomposer | Application Service | Parse task input; identify sub-tasks; route to AI or human queue | Rules-based router; LLM-based decomposer; domain-specific parser | Critical |
| AI Processing Engine | ML Serving | Execute AI-suitable sub-tasks; return structured output with confidence | LLM (Claude, GPT-4), fine-tuned classifier, extraction model | Critical |
| Handoff Package Builder | Application Service | Assemble context package for human review; validate against schema | Python microservice; JSON Schema validation | High |
| Collaboration Interface | Web Application | Present structured AI output; highlight human-required tasks; capture edits | Custom React app; task-specific form design | Critical |
| Human Action Queue | Durable Queue | Hold human-required sub-tasks; manage assignment and SLA | PostgreSQL queue; Temporal workflow | High |
| Output Assembler | Application Service | Merge AI and human sub-task outputs into final task output | Python microservice | High |
| Trust Calibration Monitor | Analytics Service | Track individual agreement rates vs AI accuracy; detect bias | Python analytics job; BI dashboard | High |
| Performance Measurement Service | Analytics Service | Compare hybrid performance to baselines; compute quality, speed, cost metrics | Python analytics; Jupyter notebooks; BI tool | Medium |
| Allocation Optimiser | Analytics + Advisory | Recommend sub-task reallocation based on performance data | Python analysis; human decision required for changes | Medium |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Source System | Submits complex task | task_id, task_type, input_payload |
| 2 | Task Decomposer | Identifies sub-tasks; classifies each | sub_tasks[]: {sub_task_id, type, classification, input_slice} |
| 3 | AI Processing Engine | Processes AI-suitable sub-tasks in parallel | ai_outputs[]: {sub_task_id, result, confidence, evidence[], processing_time_ms} |
| 4 | Handoff Package Builder | Assembles AI outputs + human-required sub-tasks + context into collaboration package | handoff_package: {task_id, completed_sub_tasks[], pending_human_sub_tasks[], context_docs[], constraints[]} |
| 5 | Schema Validator | Validates handoff package completeness | valid: true/false; validation_errors[] |
| 6 | Collaboration Interface | Presents structured package to reviewer | UI rendered; review_started_at timestamp |
| 7 | Human Reviewer | Edits AI sub-task outputs; completes human-required sub-tasks | reviewed_sub_tasks[]: {sub_task_id, final_value, was_ai_modified, modification_reason, time_spent_ms} |
| 8 | Output Assembler | Merges AI and human contributions; produces final task output | final_output: {task_id, sub_task_outputs[], human_contribution_map{}, ai_contribution_map{}} |
| 9 | Trust Calibration Monitor | Updates agreement rate for reviewer across sub-task types | trust_metrics per reviewer per sub-task_type |
| 10 | Performance Measurement | Records quality, time, and cost metrics | performance_record linked to task_id |
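Step 8, where the Output Assembler merges contributions and records who did what, can be sketched as below. The input and output field names follow the shapes in the table above; the three contribution labels and the function name are hypothetical and chosen only for this example.

```python
from typing import Any


def assemble_output(
    task_id: str,
    ai_outputs: list[dict[str, Any]],
    reviewed_sub_tasks: list[dict[str, Any]],
) -> dict[str, Any]:
    """Merge reviewed sub-tasks into a final output with contribution maps."""
    ai_ids = {output["sub_task_id"] for output in ai_outputs}
    sub_task_outputs: list[dict[str, Any]] = []
    human_contribution_map: dict[str, str] = {}
    ai_contribution_map: dict[str, str] = {}

    for review in reviewed_sub_tasks:
        sub_task_id = review["sub_task_id"]
        sub_task_outputs.append(
            {"sub_task_id": sub_task_id, "final_value": review["final_value"]}
        )
        if sub_task_id not in ai_ids:
            # No AI output existed: the human completed this sub-task.
            human_contribution_map[sub_task_id] = "human_completed"
        elif review.get("was_ai_modified", False):
            # The AI drafted it, but the reviewer changed the value.
            human_contribution_map[sub_task_id] = "ai_drafted_human_modified"
        else:
            # The AI value survived review unchanged.
            ai_contribution_map[sub_task_id] = "ai_completed_human_confirmed"

    return {
        "task_id": task_id,
        "sub_task_outputs": sub_task_outputs,
        "human_contribution_map": human_contribution_map,
        "ai_contribution_map": ai_contribution_map,
    }
```

Keeping modified AI drafts in the human map reflects the accountability framing of the pattern: once a reviewer changes a value, the final content is theirs.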
Error Flow
| Error Condition | Detected By | Recovery Action | Notification |
|---|---|---|---|
| AI sub-task fails (error or timeout) | AI Processing Engine | Mark sub-task as human-required; add to human action queue | Collaboration interface shows "AI unavailable for this item" |
| Handoff package schema validation failure | Handoff Package Builder | Retry AI sub-task with explicit structured output instruction; escalate to human if second attempt fails | ML Ops alert; human takes over affected sub-task |
| Human reviewer times out on task | SLA Manager | Re-assign or escalate to supervisor | Operations manager notification |
| Trust calibration detects automation bias | Trust Calibration Monitor | Supervisor notification; optional mandatory re-training of reviewer on task guidelines | Supervisor; HR if persistent |
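The first row of the error flow, where an AI failure converts the sub-task into a human-required item, can be sketched as a thin wrapper around the AI call. The in-memory `deque` stands in for the durable Human Action Queue, and the `ai_call` callable, the function name, and the added fields are assumptions for illustration only.

```python
from collections import deque
from typing import Any, Callable, Optional


def process_with_fallback(
    sub_task: dict[str, Any],
    ai_call: Callable[[dict[str, Any]], dict[str, Any]],
    human_queue: deque,
) -> Optional[dict[str, Any]]:
    """Run the AI on a sub-task; on any failure, hand the sub-task to a human."""
    try:
        return ai_call(sub_task)
    except Exception as exc:  # deliberately broad in this sketch; includes TimeoutError
        human_queue.append(
            {
                **sub_task,
                "classification": "human_required",
                "interface_note": "AI unavailable for this item",
                "failure_reason": type(exc).__name__,  # kept for later model review
            }
        )
        return None
```

A production version would narrow the exception types and write to a durable queue, so that a crash between the failure and the enqueue cannot lose the sub-task.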
8. Security Considerations
Authentication and Authorisation
- Collaboration interface requires SSO + MFA
- Sub-task authority levels enforced: certain high-consequence sub-tasks (final approval, regulatory referral) may only be completed by senior-level reviewers
- AI processing service accounts have read access to input data and write access only to AI output fields; cannot write human-required fields
Secrets Management
- AI model API keys stored in secrets manager; rotated quarterly
- Source system integration credentials stored in secrets manager
Data Classification
- Full task input may contain sensitive data (PII, financial, health); collaboration interface presents only the minimum data slice required for each sub-task
- AI processing engine should not receive sub-task inputs containing data irrelevant to that sub-task (data minimisation at decomposition layer)
Encryption
- All task data encrypted at rest and in transit
- Handoff packages containing sensitive data encrypted with envelope encryption; human reviewer decrypts with their session key
Auditability
- Every sub-task assignment, AI output, human action, and final output logged with full provenance
- Human-AI contribution map is part of the permanent task record
OWASP LLM Top 10 Considerations
| OWASP LLM Risk | Applicability | Mitigation |
|---|---|---|
| LLM01: Prompt Injection | High: task input data is passed to LLM for sub-task processing | Sanitise task input before inclusion in LLM prompts; use structured output schemas to limit injection surface |
| LLM02: Insecure Output Handling | High: AI outputs are pre-filled into human review forms | Validate and sanitise AI output against sub-task schema before rendering in interface |
| LLM03: Training Data Poisoning | Medium: human edits may feed training | Validate training data provenance; authority-level filter on edits used as training |
| LLM04: Model Denial of Service | Low | Rate limiting on AI processing engine |
| LLM05: Supply Chain Vulnerabilities | Medium: third-party LLM providers | Approved provider list; output validation |
| LLM06: Sensitive Information Disclosure | High: LLM may leak training data in sub-task outputs | Structured output schemas limiting output surface; PII detection on AI outputs before display |
| LLM07: Insecure Plugin Design | Medium: if AI sub-tasks use tool calls | Apply tool call security controls; minimum-permission tool access |
| LLM08: Excessive Agency | High: AI takes first pass on multiple sub-tasks | Human final review of all AI outputs is mandatory; no AI sub-task auto-applies to the final output without human confirmation |
| LLM09: Overreliance | Critical: hybrid design can seduce human into rubber-stamping AI outputs | Trust calibration monitoring; review time tracking; automation bias alerts |
| LLM10: Model Theft | Medium | Access controls on AI output logs which reveal model capabilities |
9. Governance Considerations
Responsible AI
- Task decomposition must be reviewed for bias: is the AI systematically assigned sub-tasks where errors would disproportionately affect protected groups without adequate human oversight?
- Performance measurement must include fairness analysis: does hybrid performance degrade for cases involving protected group attributes compared to the overall baseline?
Model Risk Management
- AI sub-task components are models subject to model risk management; each must be registered, validated, and subject to ongoing monitoring
- Changes to sub-task allocation (moving a sub-task from human to AI) are model risk events requiring sign-off
Human Approval Gates
- Allocation changes (reclassifying a sub-task from human-suitable to AI-suitable) require performance evidence and Model Risk approval
- Quarterly performance review must confirm hybrid performance exceeds human-only baseline; if not, allocation is reviewed
Policy Compliance
- Accountability map must identify which human is accountable for which sub-tasks in the final output
- For regulated tasks, the accountable human must have the qualifications required by regulation to perform that sub-task
Traceability
- Final task output must be traceable to: AI sub-task outputs (with model versions and confidence); human sub-task inputs (with reviewer identity); and any modifications made during human review
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Sub-task Allocation Decision Record | ML Ops + Domain Lead | Per allocation change | Document decision to assign sub-task to AI vs human with supporting performance evidence |
| Hybrid Performance Report | ML Ops | Quarterly | Compare hybrid vs baselines on quality, speed, cost; include fairness analysis |
| Trust Calibration Report | Operations Manager | Monthly | Individual and aggregate agreement rates; automation bias flags and resolutions |
| Human-AI Contribution Map | Compliance | Per task type, annual review | Document which sub-tasks are AI-completed vs human-completed for regulatory reporting |
10. Operational Considerations
Monitoring
| Metric | SLO | Alert Threshold | Owner |
|---|---|---|---|
| Task end-to-end completion time | Baseline × 1.2 (hybrid should be faster than human-only) | > Baseline × 1.5 | Operations |
| AI sub-task accuracy (sampled audit) | > Sub-task accuracy SLA | > 5% relative drop | ML Ops |
| Human review completion time per sub-task | < defined SLA per sub-task type | > 150% SLA | Operations Manager |
| Trust calibration (individual agreement rate) | Within ±15% of AI accuracy rate | Outside ±25% | Supervisor |
| Handoff package validation success rate | > 99% | < 98% | ML Ops |
| Output quality score (expert sample rating) | > defined quality bar | > 5% drop from baseline | Quality lead |
Logging
- Full task processing log with AI and human contribution details
- Time-on-review per sub-task per reviewer
- All trust calibration metrics stored with rolling history
Incident Response
- AI processing failure on critical sub-task: immediately assign to human; log AI failure for model review
- Trust calibration automation bias alert: supervisor investigation within 48 hours
- Performance report shows hybrid below human-only baseline: emergency allocation review within 2 weeks
Disaster Recovery
| Component | RTO | RPO | Strategy |
|---|---|---|---|
| AI Processing Engine | 15 min | 0 (stateless) | Multi-AZ; all sub-tasks fall back to human queue |
| Collaboration Interface | 30 min | N/A | Multi-AZ |
| Task and Sub-task Store | 30 min | 5 min | PostgreSQL synchronous standby |
Capacity Planning
- Human reviewer capacity must handle peak volume on all human-required sub-tasks plus AI failure rate spillover
- AI processing must scale to handle task volume within the handoff latency SLO
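The reviewer capacity requirement above reduces to simple arithmetic: peak workload from human-required sub-tasks, plus the share of AI sub-tasks expected to spill over on failure, divided by what one reviewer can handle. The sketch below assumes a single average handling time per sub-task and a hypothetical figure of 50 productive minutes per reviewer-hour; neither value comes from this pattern.

```python
import math


def required_reviewers(
    peak_tasks_per_hour: float,
    human_sub_tasks_per_task: float,
    ai_sub_tasks_per_task: float,
    ai_failure_rate: float,               # fraction of AI sub-tasks that fall back to humans
    minutes_per_sub_task: float,          # assumed average human handling time
    productive_minutes_per_hour: float = 50.0,  # hypothetical utilisation figure
) -> int:
    """Estimate reviewers needed at peak, including AI failure spillover."""
    spillover_per_task = ai_sub_tasks_per_task * ai_failure_rate
    workload_minutes = (
        peak_tasks_per_hour
        * (human_sub_tasks_per_task + spillover_per_task)
        * minutes_per_sub_task
    )
    return math.ceil(workload_minutes / productive_minutes_per_hour)
```

With 100 tasks per hour at peak, 3 human and 5 AI sub-tasks per task, a 2% AI failure rate, and 4 minutes per sub-task, the workload is 100 × 3.1 × 4 = 1,240 reviewer-minutes per hour, which rounds up to 25 reviewers. Setting `ai_failure_rate` to 1.0 gives the staffing needed if the AI Processing Engine is fully unavailable.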
11. Cost Considerations
Cost Drivers
| Driver | Description | Relative Weight |
|---|---|---|
| Human Reviewer Labour | For human-required sub-tasks; reduced compared to human-only baseline by AI handling high-volume sub-tasks | High (but lower than human-only) |
| AI Processing | Per sub-task token cost × volume; LLM-based sub-tasks most expensive | Medium |
| Interface Development | Custom collaboration interface development; most significant one-time cost | High (one-time) |
| Trust Calibration and Performance Measurement | Analytics infrastructure; low ongoing cost | Low |
Scaling Risks
- If AI accuracy on any sub-task type falls below its SLA, that sub-task reverts to human handling, increasing labour cost
- LLM token costs scale with task complexity; complex sub-task prompts at high volume can become significant
Optimisations
- Fine-tune AI on domain-specific data to improve accuracy and reduce token usage per sub-task
- Cache AI outputs for identical or near-identical sub-task inputs (reduces cost at high volume)
- Progressive automation: start with AI as a pre-filler; as confidence in AI accuracy grows, graduate sub-tasks to AI-automated
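The caching optimisation can be sketched for the identical-input case by keying on a hash of the canonicalised sub-task input. The class and method names are hypothetical, the in-memory dictionary stands in for whatever store a deployment uses, and near-identical inputs are out of scope here because they would need a similarity measure rather than an exact hash.

```python
import hashlib
import json
from typing import Any, Callable


class SubTaskCache:
    """Exact-match cache for AI sub-task outputs (illustrative, in-memory)."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(sub_task_type: str, model_version: str, input_slice: Any) -> str:
        # sort_keys makes the JSON canonical, so key order in the input does not matter.
        canonical = json.dumps(
            {"type": sub_task_type, "model": model_version, "input": input_slice},
            sort_keys=True,
            separators=(",", ":"),
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get_or_compute(
        self,
        sub_task_type: str,
        model_version: str,
        input_slice: Any,
        compute: Callable[[Any], Any],
    ) -> Any:
        cache_key = self.key(sub_task_type, model_version, input_slice)
        if cache_key in self._store:
            self.hits += 1
            return self._store[cache_key]
        self.misses += 1
        result = compute(input_slice)  # the (expensive) AI call
        self._store[cache_key] = result
        return result
```

Including the model version in the key means a model upgrade naturally invalidates old entries. Cached outputs can carry the same sensitive data as the inputs, so the cache would fall under the same encryption and access controls as the task store.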
Indicative Cost Range
| Deployment Scale | Human-Only Monthly Cost | AI-Only Monthly Cost | Hybrid Monthly Cost | Hybrid Saving vs Human-Only |
|---|---|---|---|---|
| Small (1K tasks/month) | $50,000 | $5,000 | $30,000 | 40% |
| Medium (10K tasks/month) | $400,000 | $40,000 | $200,000 | 50% |
| Large (100K tasks/month) | $3,000,000 | $300,000 | $1,200,000 | 60% |
12. Trade-Off Analysis
Decomposition Granularity Options
| Granularity | Human Oversight Precision | Handoff Complexity | Cognitive Load | Recommended |
|---|---|---|---|---|
| Coarse (2–3 large sub-tasks) | Low: large AI blocks with limited human check points | Low | Low | Use for mature, well-calibrated domains where AI reliability is high |
| Medium (5–10 sub-tasks) | High: humans review at multiple precise checkpoints | Medium | Medium | Default recommendation; balances oversight and complexity |
| Fine (>10 sub-tasks) | Very High: humans check AI at every step | High | High: cognitive overload risk | Use only for highest-stakes regulated tasks; requires strong interface design to manage load |
Architectural Tensions
| Tension | Option A | Option B | Resolution Guidance |
|---|---|---|---|
| AI-first vs human-first for ambiguous sub-tasks | AI takes first pass; human edits | Human takes first pass; AI validates | For efficiency: AI-first is 30–50% faster. For quality on novel tasks: human-first avoids anchoring. Default to AI-first; switch to human-first when anchoring is measured as a problem |
| Collaboration interface richness vs simplicity | Full evidence display for every sub-task | Minimal display with drill-down | Always default to minimal display; provide drill-down for every field. Decision-makers should never need to read 40 pages to approve a sub-task |
| Strict allocation vs adaptive allocation | Fixed allocation: every task follows the same human/AI split | Adaptive: confidence-based routing adjusts allocation per-instance | Adaptive delivers better efficiency but higher complexity. Start with fixed allocation; add confidence-based routing for mature deployments |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| AI anchoring bias in collaboration interface | High | High: human does not exercise independent judgment | Time-on-review monitoring; override rate monitoring | Interface redesign; present AI output after human initial assessment |
| Sub-task classification error (human-required task allocated to AI) | Medium | Critical for high-stakes tasks | Quality audit of AI-completed sub-tasks; outcome monitoring | Immediate reallocation; retroactive review of affected task outputs |
| Handoff package incomplete (missing context) | Medium | Medium: human makes sub-optimal decision without full context | Schema validation failure rate; human review time anomalously high | Improve handoff package builder; add missing context sources |
| Performance measurement baseline contamination | Low | High: incorrect performance comparison; wrong allocation decisions | Performance measurement methodology review | Re-run baseline measurement with clean experimental design |
| Trust calibration data lag (calibration update is slow) | Medium | Medium: automation bias persists undetected for weeks | Trust calibration update frequency monitoring | Increase calibration update frequency; near-real-time for high-volume deployments |
Cascading Failure Scenario
- AI accuracy on a key sub-task degrades silently → human agreement rate on that sub-task remains high (automation bias) → performance audit reveals outcome quality decline → source traced to AI sub-task degradation not detected because humans were not providing independent oversight
- Mitigation: AI sub-task accuracy monitored independently of human agreement rate (sampled expert audit of AI outputs); trust calibration uses AI accuracy data not just human agreement rate
14. Regulatory Considerations
| Regulation | Specific Clause | Requirement | Implementation |
|---|---|---|---|
| EU AI Act | Article 14: Human oversight | High-risk AI systems require meaningful human oversight at key decision points | Hybrid design explicitly maps human oversight to each high-stakes sub-task; human-AI contribution map documents this |
| EU AI Act | Article 13: Transparency | AI system must enable humans to understand outputs | AI confidence and evidence per sub-task are presented in collaboration interface |
| EU AI Act | Article 9: Risk management | AI system risk includes sub-task misallocation | Sub-task classification review; authority level controls for high-stakes sub-tasks |
| APRA CPS 230 | §50: Material operational risk | Complex AI-assisted workflows are operational risk | Performance measurement demonstrates hybrid reliability vs baseline |
| Privacy Act 1988 (Australia) | APP 3: Minimisation | Task decomposition allows data minimisation: each sub-task receives only the data slice it needs | Sub-task-level data minimisation is a key design principle of this pattern |
| ISO 42001:2023 | §8.4: AI system operation | Operational controls must maintain performance | Performance measurement and allocation optimisation are the operational controls |
| NIST AI RMF | MAP 3.5: Task suitability | AI is deployed only for tasks where it is suitable | Task decomposition framework is the formal task suitability assessment |
| GDPR Article 22 | Automated individual decision-making | Solely automated decisions with significant effects require human involvement | Hybrid design ensures human involvement at all high-consequence decision sub-tasks |
15. Reference Implementations
AWS
- AI Processing: Amazon Bedrock (Claude 3.5 Sonnet for reasoning sub-tasks; Nova Lite for extraction)
- Task Orchestration: AWS Step Functions for sub-task routing and handoff state machine
- Human Action Queue: Amazon SQS FIFO + Amazon Connect Tasks for assignment
- Collaboration Interface: Custom React on Amplify; Amazon Lex for simple conversational sub-tasks
- Performance Measurement: Amazon QuickSight dashboard reading from Redshift analytics store
Azure
- AI Processing: Azure OpenAI (GPT-4o for reasoning; GPT-4o-mini for extraction)
- Task Orchestration: Azure Durable Functions for sub-task state machine
- Human Action Queue: Azure Service Bus + Microsoft Teams Adaptive Cards for lightweight review
- Collaboration Interface: Power Apps or custom React on Static Web Apps
- Performance Measurement: Azure Synapse Analytics + Power BI
GCP
- AI Processing: Vertex AI Gemini (Pro for reasoning; Flash for extraction)
- Task Orchestration: Workflows or Cloud Composer (Airflow) for sub-task routing
- Human Action Queue: Cloud Tasks + Cloud Run for human action API
- Collaboration Interface: Custom React on Firebase Hosting
- Performance Measurement: BigQuery + Looker Studio
On-Premises / Private Cloud
- AI Processing: vLLM serving Llama 3 or Mistral on Kubernetes; fine-tuned sub-task models
- Task Orchestration: Temporal for durable sub-task state machine
- Human Action Queue: PostgreSQL-backed queue with priority ordering
- Collaboration Interface: Custom React on Kubernetes
- Performance Measurement: Airflow + dbt + Grafana
16. Related Patterns
| Pattern | ID | Relationship | Notes |
|---|---|---|---|
| Collaborative AI Decision | EAAPL-HIL004 | Specialisation: collaborative decision is a two-sub-task hybrid (AI recommendation + human judgment) | Hybrid intelligence is the generalisation of collaborative decision to N sub-tasks |
| Human Escalation Pattern | EAAPL-HIL003 | Complementary: escalation handles cases where AI sub-task confidence is below threshold | Confidence-based sub-task escalation is compatible with hybrid architecture |
| AI Confidence Threshold Routing | EAAPL-HIL005 | Dependency: sub-task allocation can be confidence-adaptive | Threshold routing applies at sub-task level in adaptive hybrid deployments |
| Annotation and Feedback Loop | EAAPL-HIL007 | Complementary: human sub-task completions are annotation data for AI sub-task models | Human inputs on hybrid tasks feed annotation store for sub-task model improvement |
| Supervisor Agent | EAAPL-MAG002 | Complementary: supervisor agent can orchestrate hybrid intelligence workflow | Agent supervisor can be the orchestration layer for AI and human sub-tasks |
| Human Override Pattern | EAAPL-HIL006 | Dependency: human reviewers must be able to override any AI sub-task output | Override is embedded in the collaboration interface for every AI-completed sub-task field |
17. Maturity Assessment
Overall Maturity Level: Proven
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Technical Maturity | 4 | Task decomposition and handoff protocols are well-understood; trust calibration tooling is less mature |
| Operational Maturity | 3 | Managing human-AI task allocation dynamically requires significant operational discipline; most organisations have not formalised this |
| Governance Maturity | 4 | EU AI Act Article 14 and accountability requirements drive adoption; sub-task accountability mapping satisfies governance needs |
| Tooling Ecosystem | 3 | No purpose-built hybrid intelligence platforms; implemented from components (workflow engines, LLM APIs, collaboration tools) |
| Enterprise Adoption | 3 | Widely adopted in concept; formally implemented with performance measurement and trust calibration is less common |
| Risk Profile | Medium-High | Highest risk is automation bias within the hybrid design; mitigated by trust calibration and performance measurement |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-06-12 | EAAPL Working Group | Initial publication covering task decomposition framework, collaboration interface design, handoff protocol, cognitive load management, trust calibration, and performance measurement |