EAAPL: Enterprise AI Architecture Pattern Library

Annotation and Feedback Loop


Pattern ID: EAAPL-HIL007
Status: Proven
Tags: human-oversight, model-risk, medium-complexity
Version: 1.0
Last Updated: 2026-06-12


1. Executive Summary

The Annotation and Feedback Loop pattern defines the end-to-end architecture for collecting structured human annotations on AI inputs and outputs, managing annotator quality, storing labels with full provenance, and routing validated labels back to model training. It is the operational backbone underlying all other human-in-the-loop feedback mechanisms. Where the Active Learning Loop (EAAPL-HIL002) addresses which items to annotate, this pattern addresses how to annotate them — the annotator management system, quality assurance framework, data storage schema, and ingestion pipeline that transform human judgment into model-trainable data.

The pattern covers: annotation task design with clear guidelines and uncertainty protocols; annotator management, including onboarding calibration tests, ongoing quality monitoring, and bias detection; quality assurance through golden datasets, adjudication, and inter-annotator agreement (IAA) thresholds; a detailed feedback storage schema; validation and deduplication in the ingestion pipeline; dataset versioning; and closed-loop verification that new models trained on annotations are tested on a held-out set before promotion. CIOs and CTOs gain a structured, auditable annotation operation that produces high-quality training data, supports compliance with the training data governance requirements of EU AI Act Article 10, and converts the operational cost of human review into a compounding strategic asset.


2. Problem Statement

Business Problem

Organisations deploying AI at scale need human-labelled data to train and retrain models. Without a structured annotation operation, labelling is ad hoc: different teams use different instructions, quality varies widely, there is no record of who labelled what or how reliably, and the resulting training data has unknown quality. Models trained on poor-quality annotations can perform worse than models trained on no new data, in which case the annotation effort degrades model quality instead of improving it.

Technical Problem

Annotation is not a solved problem. Human annotators disagree, make errors, develop biases over time, and game easy QA mechanisms. Inter-annotator agreement on complex enterprise tasks (legal clause classification, clinical coding, nuanced sentiment) is often below acceptable thresholds without deliberate intervention. Storing annotations without provenance makes it impossible to trace model errors back to labelling decisions or to exclude low-quality annotators' labels from training.

Symptoms

  • No formal annotation guidelines exist; different annotators interpret tasks differently
  • Annotator accuracy is not monitored; poor-quality annotators continue labelling indefinitely
  • Training data schema does not record who labelled what or when; provenance is lost
  • Adjudication for disagreements is informal and inconsistently applied
  • New model versions are trained on newly annotated data and deployed without verifying they improve on held-out data from the same annotation batch

Cost of Inaction

  • Model trained on poor-quality annotations performs worse in production than the previous version
  • Regulatory examination of training data governance (EU AI Act Article 10) reveals no quality controls
  • Annotator bias — systematic mislabelling by a demographic group or individual — corrupts training data without detection
  • Annotation effort is wasted: high cost, zero benefit

3. Context

When to Apply

  • Any organisation running ongoing model training with human-labelled data
  • Teams adding human feedback collection to production AI systems
  • Regulated environments requiring documented training data quality controls
  • Projects using third-party annotation vendors who must be quality-managed

When NOT to Apply

  • Pure generative model fine-tuning with RLHF using preference ranking rather than categorical labels (requires a specialised variant of this pattern)
  • One-off labelling projects where ongoing quality management is not cost-justified

Prerequisites

  • A defined annotation task with a finite label taxonomy
  • Access to an annotation workforce (internal, outsourced, or crowdsourced)
  • A training pipeline that can consume new validated labels

Industry Applicability

| Industry | Annotation Task Type | Label Taxonomy Example | Annotator Source |
| --- | --- | --- | --- |
| Financial Services | Transaction intent classification | 12 transaction categories + anomaly flag | Compliance analysts |
| Healthcare | Clinical note coding | ICD-10/CPT code sets | Certified clinical coders |
| Insurance | Claims document classification | Claim type, fraud indicator, priority | Claims staff |
| Legal | Contract clause risk flagging | Risk level (None/Low/Medium/High) + clause type | Paralegals |
| Media | Content moderation | Safe/Restricted/Removed + reason codes | Trust and Safety team |
| Retail | Product attribute extraction | Structured attribute taxonomy | Category management team |

4. Architecture Overview

The Annotation and Feedback Loop architecture has six stages that must operate together to produce reliable training data.

Stage 1 — Annotation Task Design. The annotation task must be fully specified before any annotator touches a single item. The task specification includes: a clear one-paragraph description of the labelling objective; a label taxonomy with definitions for every category and explicit boundary cases; positive examples (items where each label clearly applies); negative examples (items where the label seems applicable but does not apply — these are the most important for quality); an uncertainty protocol defining what annotators should do when they genuinely cannot decide (flag as ambiguous rather than guess — guesses on ambiguous items produce unreliable labels that damage training data quality); and a time-per-annotation target to discourage rushing. The task specification is reviewed by at least one domain expert before annotation begins.

Stage 2 — Annotator Onboarding and Calibration. Every new annotator completes a structured onboarding process: they read the task specification and guidelines; they complete a calibration test of 30–50 items with known correct answers; they review their results with explanations for any errors; and they must achieve a minimum accuracy threshold (typically 85% on the calibration set) before being permitted to annotate production items. Annotators who fail the calibration test can retry after reviewing the guidance. Annotators who fail three calibration attempts are not cleared for the task.
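The calibration gate described in Stage 2 can be sketched as a small scoring function. This is a minimal illustration, not a prescribed implementation: the names `score_calibration` and `CalibrationResult` are hypothetical, and the 85% threshold and three-attempt limit are taken from the text above.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.85  # typical minimum calibration accuracy (from the text)
MAX_ATTEMPTS = 3       # three failed attempts means not cleared (from the text)


@dataclass
class CalibrationResult:
    status: str      # "cleared", "retry", or "not_cleared"
    accuracy: float


def score_calibration(answers: dict, key: dict, attempt: int) -> CalibrationResult:
    """Score one calibration attempt against the known-correct answers.

    `answers` and `key` map item_id -> label; unanswered items count as wrong.
    `attempt` is 1 for the first try.
    """
    if not key:
        raise ValueError("calibration key must not be empty")
    correct = sum(1 for item_id, label in key.items() if answers.get(item_id) == label)
    accuracy = correct / len(key)
    if accuracy >= PASS_THRESHOLD:
        return CalibrationResult("cleared", accuracy)
    if attempt >= MAX_ATTEMPTS:
        return CalibrationResult("not_cleared", accuracy)
    return CalibrationResult("retry", accuracy)
```

An annotator who scores 36 of 40 on a first attempt would be cleared; one who scores 30 of 40 would be asked to review the guidance and retry.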

Stage 3 — Ongoing Quality Monitoring. Golden dataset items (items with known correct answers verified by a senior domain expert) are seeded into the annotation queue at a rate of 5–10% of items. Annotators are not told which items are golden. The system tracks each annotator's accuracy on golden items on a rolling basis. If an annotator's rolling golden-set accuracy drops below 80%, their account is suspended and their recent work is queued for re-annotation. Peer agreement monitoring runs continuously: for items labelled by more than one annotator, Cohen's Kappa is computed for each annotator pair over the items they share. If a specific annotator's pairwise Kappa with all peers drops below 0.65 over a 7-day window, the annotator is flagged for review. Bias detection computes the label distribution per annotator and compares it to the population distribution; annotators whose distribution is systematically skewed are investigated.
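Two of the Stage 3 checks, pairwise Cohen's Kappa and rolling golden-set accuracy, can be sketched with the standard library alone. This is an illustrative sketch: `cohen_kappa` and `GoldenTracker` are hypothetical names, and the window size and minimum sample count are assumptions not stated in the pattern. In production, scikit-learn's `cohen_kappa_score` could replace the hand-written function.

```python
from collections import Counter, deque


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's Kappa for two annotators over the same ordered list of items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


class GoldenTracker:
    """Rolling accuracy on golden items for a single annotator."""

    def __init__(self, window: int = 50, threshold: float = 0.80):
        self.results = deque(maxlen=window)  # window size is an assumption
        self.threshold = threshold           # 80% suspension threshold from the text

    def record(self, annotator_label: str, golden_label: str) -> None:
        self.results.append(annotator_label == golden_label)

    def accuracy(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 1.0

    def should_suspend(self, min_samples: int = 10) -> bool:
        # Wait for a minimum number of golden items before acting (assumption).
        return len(self.results) >= min_samples and self.accuracy() < self.threshold
```

Computing Kappa per annotator pair over shared items, rather than per item, is what makes the 0.65 review threshold meaningful: a single item only yields agree or disagree.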

Stage 4 — Adjudication. Items where annotator disagreement exceeds the threshold are routed to an adjudication queue. The adjudicator (a senior domain expert) reviews all annotations, their reasoning, and the original item, and provides a definitive label with a mandatory written reasoning explanation. Adjudicated labels are stored separately from directly agreed labels and are considered higher quality (eligible for use in evaluation sets). Recurring adjudication on the same label category is a signal that the task specification needs clarification for that category.
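A routing rule for Stage 4 might look like the following sketch. The function name `route_item` and the default `min_agreement` of 1.0 (unanimity) are assumptions for illustration; the pattern leaves the disagreement threshold to the task owner.

```python
from collections import Counter
from typing import Optional, Tuple


def route_item(labels: list, min_agreement: float = 1.0) -> Tuple[str, Optional[str]]:
    """Decide whether an item's labels agree or need adjudication.

    Returns (route, label): route is "agreed", "adjudicate", or
    "needs_more_annotations"; label is set only when the route is "agreed".
    """
    if len(labels) < 2:
        return ("needs_more_annotations", None)
    top_label, top_count = Counter(labels).most_common(1)[0]
    # Share of annotators who chose the most common label.
    if top_count / len(labels) >= min_agreement:
        return ("agreed", top_label)
    return ("adjudicate", None)
```

With two annotators and the default threshold, any disagreement goes to the adjudication queue; with three annotators and `min_agreement=0.66`, a two-to-one split would be accepted.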

Stage 5 — Feedback Storage Schema. The annotation store captures: annotation_id (UUID, primary key); item_id (link to the original data item); annotator_id (pseudonymised for privacy); task_version_id (link to the task specification version used — critical for reproducibility); timestamp (annotation completion time); label (the annotation value); annotator_confidence (the annotator's self-rated confidence: certain/probable/uncertain); reasoning (optional free-text explanation, required for adjudication); time_spent_ms (elapsed time from task display to submission); is_golden (boolean flag for golden items, visible only to QA team); adjudication_id (null for non-adjudicated; link to adjudication record if applicable); and quality_flags (array of any quality concerns flagged during QA). This schema enables full provenance tracing from any model version back to the specific annotator and task version that produced each training label.
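The Stage 5 schema can be expressed as an immutable record type. The sketch below mirrors the field names listed above; the class name `Annotation`, the Python types, and the validation rules in `__post_init__` are assumptions, and a real deployment would define the equivalent table in PostgreSQL or Delta Lake.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, Tuple

CONFIDENCE_LEVELS = ("certain", "probable", "uncertain")


@dataclass(frozen=True)  # frozen: records are append-only, never edited in place
class Annotation:
    item_id: str
    annotator_id: str            # pseudonymised identifier
    task_version_id: str         # task specification version used
    label: str
    annotator_confidence: str    # one of CONFIDENCE_LEVELS
    time_spent_ms: int
    reasoning: Optional[str] = None
    is_golden: bool = False      # visible only to the QA team
    adjudication_id: Optional[str] = None
    quality_flags: Tuple[str, ...] = ()
    annotation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self):
        if self.annotator_confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence: {self.annotator_confidence!r}")
        if self.time_spent_ms < 0:
            raise ValueError("time_spent_ms must be non-negative")
        # Reasoning is optional in general but required for adjudicated records.
        if self.adjudication_id is not None and not self.reasoning:
            raise ValueError("reasoning is required for adjudicated annotations")
```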

Stage 6 — Ingestion Pipeline to Training. Validated labels flow from the annotation store through a four-step ingestion pipeline: validation (check that the label is in the allowed taxonomy, that confidence is set, and that time_spent_ms is within the expected range; flag outliers for review); deduplication (if an item has been annotated multiple times, apply a majority vote, or a vote weighted by annotator quality score, to produce a canonical label); dataset versioning (each ingestion run produces a named, immutable dataset version in the training data store; versions are append-only and never overwritten); and training data store update (the new version is registered in the dataset registry and made available to the training pipeline). The training pipeline trains the challenger model on the new dataset version. Before any model is promoted to production, it is evaluated on a held-out set sampled from the same annotation batch (same distribution as the training data but not seen during training). This closed-loop verification catches cases where annotation quality is too low to support training: if the labels are noisy, the model will not improve on held-out data from the same batch.
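The first three ingestion steps can be sketched as plain functions. Everything here is illustrative: the function names, the dictionary record shape (borrowed from the `raw_annotation` fields), the default quality score of 1.0, and the content-hash versioning scheme are assumptions rather than part of the pattern.

```python
import hashlib
import json
from collections import defaultdict


def validate(record: dict, taxonomy: set, min_ms: int, max_ms: int) -> list:
    """Return a list of validation problems; an empty list means the record passes."""
    problems = []
    if record.get("label") not in taxonomy:
        problems.append("label_not_in_taxonomy")
    if not record.get("confidence"):
        problems.append("confidence_missing")
    ms = record.get("time_spent_ms")
    if ms is None or not (min_ms <= ms <= max_ms):
        problems.append("time_spent_outlier")  # flagged for review, not auto-rejected
    return problems


def canonical_labels(records: list, quality: dict) -> dict:
    """Deduplicate by item using a vote weighted by annotator quality score."""
    votes = defaultdict(lambda: defaultdict(float))
    for r in records:
        votes[r["item_id"]][r["label"]] += quality.get(r["annotator_id"], 1.0)
    # sorted() makes tie-breaking deterministic (alphabetically first label wins).
    return {item: max(sorted(tally), key=tally.get) for item, tally in votes.items()}


def dataset_version(labels: dict) -> str:
    """Derive an immutable version id from the dataset content (hypothetical scheme)."""
    payload = json.dumps(labels, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]
```

Deriving the version id from content means that re-running ingestion over the same validated labels yields the same version, which fits the idempotent, append-only behaviour the pattern asks for.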


5. Architecture Diagram

flowchart TD
    subgraph Collection["Annotation Collection"]
        A[Items for Annotation]
        B[Annotation Queue]
        C[Annotator Pool]
    end
    subgraph QA["Quality Assurance"]
        D[IAA Scorer]
        E[Adjudication Queue]
        F[(Annotation Store)]
    end
    subgraph Training["Model Training"]
        G[Ingestion Pipeline]
        H[Training Pipeline]
        I{Closed-Loop Verification}
    end
    A --> B
    B --> C
    C --> D
    D -->|agreement met| F
    D -->|disagreement| E
    E --> F
    F --> G
    G --> H
    H --> I
    I -->|improvement confirmed| A
    I -->|no improvement| E
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#fee2e2,stroke:#ef4444
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#f3e8ff,stroke:#a855f7

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Annotation Queue | Durable Queue | Hold items for annotation; assign to annotators; manage golden item seeding | PostgreSQL queue table, Label Studio task queue, Scale AI project | Critical |
| Annotation Interface | Web Application | Present item with full context, task spec, guidelines; capture label + metadata | Label Studio (self-hosted or SaaS), Labelbox, Scale AI, Prolific, custom React | Critical |
| Annotator Management Service | Application Service | Track annotator onboarding status, calibration results, golden-set accuracy, bias metrics | Custom service backed by PostgreSQL | High |
| IAA Scorer | Quality Service | Compute inter-annotator agreement for each item | Python scikit-learn (cohen_kappa_score), custom Krippendorff Alpha | Critical |
| Golden Set Manager | Quality Service | Seed golden items into queue; compute and track annotator accuracy on golden items | Custom service; golden items stored in separate sealed table | Critical |
| Adjudication Interface | Web Application | Present disagreeing annotations to adjudicator; capture definitive label + reasoning | Custom interface or Label Studio with review mode | High |
| Annotation Store | Data Store | Persist annotations with full schema; append-only | PostgreSQL; Delta Lake | Critical |
| Ingestion Pipeline | ETL | Validate → deduplicate → version → load to training data store | Airflow DAG; dbt for transformation; MLflow datasets | High |
| Training Data Store | Data Store | Hold versioned immutable training datasets | S3 + Delta Lake; Vertex AI Dataset; Azure ML Dataset | Critical |
| Closed-Loop Verifier | ML Evaluation Service | Evaluate challenger on held-out set from same annotation batch; produce improvement report | Python evaluation job; MLflow tracking | Critical |
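The Golden Set Manager's seeding responsibility can be illustrated with a short sketch. The function name `seed_golden_items` and the default rate of 7% (inside the 5–10% range the pattern gives) are assumptions; queue entries are modelled as dictionaries carrying the `is_golden` flag from the queue item structure.

```python
import random


def seed_golden_items(items: list, golden_items: list, rate: float = 0.07, rng=None) -> list:
    """Return a shuffled queue in which roughly `rate` of the entries are golden.

    `items` and `golden_items` are lists of dicts (item_id, content, priority).
    The is_golden flag must be stripped before an entry is shown to an annotator.
    """
    if not 0 < rate < 1:
        raise ValueError("rate must be strictly between 0 and 1")
    rng = rng or random.Random()
    # Solve g / (n + g) = rate for g, capped by the golden items available.
    wanted = round(len(items) * rate / (1 - rate))
    chosen = rng.sample(golden_items, min(wanted, len(golden_items)))
    queue = [dict(item, is_golden=False) for item in items]
    queue += [dict(item, is_golden=True) for item in chosen]
    rng.shuffle(queue)  # golden items must not be identifiable by position
    return queue
```

Passing a seeded `random.Random` instance makes the queue reproducible for tests; in production the default unseeded generator avoids predictable golden positions.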

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Data Pipeline / Active Learning Selector | Pushes items to annotation queue | queue_item{item_id, content, priority, is_golden} |
| 2 | Annotation Interface | Assigns item to annotator; presents with task spec | task_displayed_at timestamp |
| 3 | Annotator | Reviews item; selects label; sets confidence; adds optional reasoning; submits | raw_annotation{item_id, annotator_id, label, confidence, reasoning, time_spent_ms} |
| 4 | IAA Scorer | After a minimum of 2 annotations per item: scores agreement for the item and updates pairwise Kappa | iaa_score, agreement: true/false |
| 5a | Quality Validator | For agreed items: validates label, confidence, time_spent_ms; checks golden accuracy | validated_annotation or quality_flag |
| 5b | Adjudication Queue | For disagreed items: creates adjudication task | adjudication_task{item_id, annotations[]} |
| 6 | Adjudicator | Reviews and provides definitive label with reasoning | adjudication_record{item_id, label, reasoning, adjudicator_id} |
| 7 | Annotation Store | Persists annotation with full schema | annotation_id, full annotation record |
| 8 | Ingestion Pipeline | Validates, deduplicates, versions, loads | Dataset version N in training data store |
| 9 | Training Pipeline | Trains challenger on new dataset version | Challenger model artefact |
| 10 | Closed-Loop Verifier | Evaluates challenger on held-out set | Improvement report: accuracy delta, held-out accuracy |
| 11 | Model Registry | On confirmed improvement: registers challenger | Updated champion or pending A/B test |
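Step 10 of the flow above can be sketched as a comparison of champion and challenger predictions on the held-out set. The function names and the `min_delta` parameter are hypothetical, and accuracy is used only because the improvement report names it; a real verifier would likely add per-class metrics and a significance check.

```python
def accuracy(predictions: list, gold: list) -> float:
    """Fraction of predictions that match the held-out labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def verify_challenger(champion_preds: list, challenger_preds: list,
                      gold: list, min_delta: float = 0.0) -> dict:
    """Build the improvement report for one training cycle.

    Promotion is recommended only when the challenger beats the champion
    by more than `min_delta` on the same held-out items.
    """
    champion = accuracy(champion_preds, gold)
    challenger = accuracy(challenger_preds, gold)
    delta = challenger - champion
    return {
        "champion_accuracy": champion,
        "held_out_accuracy": challenger,
        "accuracy_delta": delta,
        "promote": delta > min_delta,
    }
```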

Error Flow

| Error Condition | Detected By | Recovery Action | Notification |
| --- | --- | --- | --- |
| Annotator accuracy below threshold on golden set | Golden Set Manager | Suspend annotator; queue their recent work for re-annotation | Annotation manager; annotator receives re-calibration task |
| IAA consistently below threshold for a label category | IAA Scorer trend report | Trigger task specification review; pause annotation of that category | Annotation manager; domain expert |
| Ingestion pipeline validation failure (invalid label value) | Ingestion validator | Quarantine affected batch; log validation error; notify QA team | QA team; ML Ops |
| Closed-loop verification shows no improvement | Closed-Loop Verifier | Halt model promotion; trigger annotation quality review | ML Ops; Model Risk Officer |
| Adjudication queue backlog exceeds 500 items | Queue depth monitor | Alert annotation manager; prioritise adjudication sprint | Annotation manager |

8. Security Considerations

Authentication and Authorisation

  • Annotators authenticate via SSO; annotation interface sessions expire after 30 minutes of inactivity
  • Golden set items visible only to QA administrators, not annotators (seeding would be ineffective if annotators knew which items were golden)
  • Annotation store write access restricted to annotation interface service account; no direct annotator access to the database
  • Adjudication interface accessible only to designated senior annotators with elevated RBAC role

Secrets Management

  • Annotation platform API keys (for SaaS platforms like Scale AI, Labelbox) stored in secrets manager
  • Training data store access credentials stored in secrets manager; rotated every 90 days

Data Classification

  • Annotation items inherit the classification of the source data; items containing PII require de-identification before annotation where feasible
  • For tasks requiring PII annotation (e.g. named entity recognition on real names), annotators sign specific NDA and PII handling agreement; access is logged and audited
  • Annotator IDs pseudonymised in training data store; mapping table access restricted to QA and HR

Encryption

  • Annotation store encrypted at rest (AES-256); annotator PII (email, name) stored in encrypted HR system, not in annotation store
  • All data in transit encrypted (TLS 1.3)

Auditability

  • Every annotation event logged with annotator_id (pseudonymised), item_id, timestamp, task_version_id
  • Adjudication decisions logged with full annotation context and adjudicator_id
  • Dataset version provenance traceable from training data store back to annotation_ids

OWASP LLM Top 10 Considerations

| OWASP LLM Risk | Applicability | Mitigation |
| --- | --- | --- |
| LLM01: Prompt Injection | Low: annotation interface is human-driven | N/A |
| LLM02: Insecure Output Handling | Low: annotation outputs are categorical labels | Validate label values against taxonomy; sanitise free-text reasoning |
| LLM03: Training Data Poisoning | High: adversarial annotators could deliberately mislabel to degrade model | Golden set monitoring; IAA thresholds; bias detection; closed-loop verification rejects poisoned batches |
| LLM04: Model Denial of Service | Low | N/A |
| LLM05: Supply Chain Vulnerabilities | Medium: third-party annotation platforms (Scale AI, Labelbox) process sensitive data | Security and privacy assessment of annotation vendors; DPA; penetration testing |
| LLM06: Sensitive Information Disclosure | High: annotation items may contain sensitive data accessible to annotators | Data minimisation; annotator NDA; PII de-identification where feasible |
| LLM07: Insecure Plugin Design | Low | N/A |
| LLM08: Excessive Agency | Low: annotations are human judgments, not AI autonomy | N/A |
| LLM09: Overreliance | Medium: if annotators defer to AI-assisted labelling tools, label independence is compromised | Annotator guidelines explicitly prohibit using external AI tools; interface should not show AI suggestions before annotator's initial label |
| LLM10: Model Theft | Medium: high-quality annotated dataset is a significant IP asset | Access controls on training data store; restrict export; watermark datasets |

9. Governance Considerations

Responsible AI

  • Annotator cohort diversity: monitor whether annotator pool introduces demographic bias; compare label distributions across annotator demographic groups (where known and with consent)
  • Task specification bias audit: have task specifications reviewed by fairness expert before deployment to identify instruction language that may systematically bias labelling against protected groups
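The label-distribution comparison mentioned above (per annotator or per consented demographic group) can be sketched with total variation distance. The choice of distance measure, the 0.2 threshold, and the function names are assumptions for illustration; the pattern does not prescribe a specific statistic.

```python
from collections import Counter


def distribution(labels: list) -> dict:
    """Relative frequency of each label in a non-empty list."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two label distributions (0 = identical)."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def skewed_groups(labels_by_group: dict, threshold: float = 0.2) -> list:
    """Return the groups whose label distribution deviates from the pooled one.

    `labels_by_group` maps an annotator id (or group key) to the labels it produced.
    The pooled distribution includes every group, including the one under test.
    """
    pooled = [label for labels in labels_by_group.values() for label in labels]
    pooled_dist = distribution(pooled)
    return sorted(
        group for group, labels in labels_by_group.items()
        if labels and total_variation(distribution(labels), pooled_dist) > threshold
    )
```

A flagged group is a prompt for investigation, not proof of bias: the skew may reflect a different item mix routed to that annotator or group.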

Model Risk Management

  • Annotation batch quality report reviewed by Model Risk before training begins on any batch
  • Closed-loop verification report required before champion promotion; Model Risk Officer signs off on each promotion

Human Approval Gates

  • Task specification changes require domain expert and Model Risk review; changing the specification mid-batch invalidates existing annotations (affected items must be re-annotated under the new spec)
  • Golden set additions or changes require QA team approval; golden set is a controlled asset

Policy Compliance

  • Annotators must complete mandatory training on data handling, PII, and annotation ethics before being onboarded
  • Third-party annotation vendor agreements must include: data processing addendum, security assessment, audit rights, right to terminate and retrieve data

Traceability

  • Every model version traceable to: dataset version → annotation batch → individual annotation_ids → annotator_ids (pseudonymised) → task_version_id (guidelines used)
  • Full trace available for EU AI Act Article 10 training data documentation

Governance Artefacts

| Artefact | Owner | Frequency | Purpose |
| --- | --- | --- | --- |
| Annotator Quality Report | Annotation Manager | Weekly | Golden-set accuracy, IAA trends, suspension events |
| Annotation Batch Quality Report | QA Team | Per batch | IAA summary, adjudication rate, validation failure rate |
| Closed-Loop Verification Report | ML Ops | Per training cycle | Challenger improvement on held-out set |
| Dataset Version Provenance Certificate | Data Governance | Per dataset version | Certify lawful basis, annotator cohort, task spec version |
| Annotation Vendor Security Assessment | Security / Legal | Annually | Confirm annotation vendor meets data handling requirements |

10. Operational Considerations

Monitoring

| Metric | SLO | Alert Threshold | Owner |
| --- | --- | --- | --- |
| Annotation queue depth | < 2× annotator daily capacity | > 3× daily capacity | Annotation Manager |
| Average IAA (Kappa) across active tasks | > 0.70 | < 0.60 for any task over a 7-day rolling window | Annotation Manager |
| Golden set annotator accuracy (average) | > 0.85 | < 0.80 for any active annotator | QA Team |
| Adjudication queue backlog | < 100 items | > 500 items | Annotation Manager |
| Ingestion pipeline success rate | > 99% | Any failure | ML Ops |
| Closed-loop verification pass rate | > 80% of batches show improvement | 3 or more consecutive batches without improvement | Model Risk Officer |
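The closed-loop verification metric can be evaluated from a simple history of batch outcomes. This sketch assumes the alert is meant to fire after three or more consecutive batches without improvement, which is an interpretation of the alert threshold; the function names are hypothetical.

```python
def trailing_failures(improved_history: list) -> int:
    """Count the most recent consecutive batches that showed no improvement."""
    count = 0
    for improved in reversed(improved_history):
        if improved:
            break
        count += 1
    return count


def closed_loop_status(improved_history: list, slo: float = 0.80, alert_after: int = 3) -> dict:
    """Evaluate the pass-rate SLO and the consecutive-failure alert.

    `improved_history` is ordered oldest to newest; True means the batch's
    challenger improved on the held-out set.
    """
    if not improved_history:
        raise ValueError("at least one batch outcome is required")
    pass_rate = sum(improved_history) / len(improved_history)
    return {
        "pass_rate": pass_rate,
        "slo_met": pass_rate > slo,
        "alert": trailing_failures(improved_history) >= alert_after,
    }
```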

Logging

  • All annotation events logged with full schema; retained 7 years
  • Ingestion pipeline runs logged with dataset version, record counts, validation error counts
  • Adjudication decisions logged with full annotation context

Incident Response

  • Annotator quality failure: suspend within 1 hour of detection; re-annotation scheduled within 5 business days
  • IAA collapse on a task: pause annotation of that task; convene domain expert review within 48 hours
  • Closed-loop verification failure: no model promotion; annotation quality investigation within 5 business days

Disaster Recovery

| Component | RTO | RPO | Strategy |
| --- | --- | --- | --- |
| Annotation Queue | 1 hour | 30 min | PostgreSQL synchronous standby |
| Annotation Store | 4 hours | 15 min | PostgreSQL with continuous WAL archiving |
| Training Data Store | 4 hours | 1 hour | Object storage replication; versioned, immutable |
| Ingestion Pipeline | 8 hours | N/A (re-runnable) | Idempotent pipeline; re-process from annotation store |

Capacity Planning

  • Annotator headcount must be sized to process annotation queue within 48 hours at target throughput
  • Adjudication capacity must scale with IAA quality: lower IAA = more adjudication work; model adjudication volume from historical IAA rates
  • Training data store grows permanently; plan for 5–10 years of annotation accumulation

11. Cost Considerations

Cost Drivers

| Driver | Description | Relative Weight |
| --- | --- | --- |
| Annotator Labour | Per-item cost × volume; dominant cost driver | Very High |
| Adjudication Labour | Senior expert time; typically 10–25% of items | High |
| Annotation Platform Licensing | SaaS per-seat or per-item pricing; or open-source hosting costs | Medium |
| QA Operations | Staff time for golden set management, annotator quality review | Medium |
| Storage | Annotation store + training data store; grows permanently | Low |
| Training Compute | Not a direct annotation cost; scales with dataset size | Medium |

Scaling Risks

  • Without active learning selection (EAAPL-HIL002), annotation volume scales linearly with data volume regardless of marginal value
  • Low IAA tasks require disproportionate adjudication effort: a task with 40% adjudication rate (IAA below threshold for 40% of items) is 3× more expensive per confirmed label than a task with 10% adjudication rate
  • Task specification ambiguity is the largest cost multiplier: invest in task design to reduce adjudication costs

Optimisations

  • Invest heavily in task specification quality: every 10% improvement in IAA reduces adjudication cost by 40–60%
  • Use active learning selection to annotate only the highest-value items
  • Use adjudicated items to improve task specification over time: recurring adjudication on the same label type reveals specification ambiguity
  • Pre-annotation with model suggestions (shown AFTER annotator's initial label) can reduce annotation time per item by 20–30%

Indicative Cost Range

| Scale | Monthly Annotation Volume | Annotation Cost/Item | Adjudication Rate | Total Monthly Cost |
| --- | --- | --- | --- | --- |
| Small (5K items/month) | 5,000 | $2–$5 | 15% | $12,500–$30,000 |
| Medium (50K items/month) | 50,000 | $1–$3 | 12% | $56,000–$168,000 |
| Large (500K items/month) | 500,000 | $0.50–$2 | 10% | $275,000–$1.1M |
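The Medium and Large totals above are consistent with a simple model in which adjudicated items add their adjudication rate as an uplift on base annotation cost. That model is an inference from the figures, not something the pattern states, and the function names below are hypothetical.

```python
def monthly_cost(volume: int, cost_per_item: float, adjudication_rate: float) -> float:
    """Base annotation cost plus an uplift for the share of items adjudicated.

    Assumes an adjudicated item costs one extra unit of per-item cost.
    """
    return volume * cost_per_item * (1 + adjudication_rate)


def monthly_cost_range(volume: int, low: float, high: float, adjudication_rate: float) -> tuple:
    """Return (low, high) total monthly cost rounded to whole dollars."""
    return (
        round(monthly_cost(volume, low, adjudication_rate)),
        round(monthly_cost(volume, high, adjudication_rate)),
    )
```

Under this model `monthly_cost_range(50_000, 1.0, 3.0, 0.12)` gives (56000, 168000) and `monthly_cost_range(500_000, 0.5, 2.0, 0.10)` gives (275000, 1100000), matching the table. The same model gives (11500, 28750) for the Small tier, slightly below the table's $12,500–$30,000, which may reflect fixed costs that the model does not capture.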

12. Trade-Off Analysis

Annotator Sourcing Options

| Source | Quality | Cost | Scalability | Domain Knowledge | Recommended Use Case |
| --- | --- | --- | --- | --- | --- |
| Internal subject-matter experts | Very High | Very High | Low | Excellent | Complex regulated tasks (clinical, legal, compliance); golden set creation |
| Internal operations staff | High | High | Medium | Good | Operational tasks within their domain |
| Managed labelling vendors (Scale AI, Surge) | Medium-High | Medium | High | Low-Medium | General annotation at volume; quality depends on briefing quality |
| Crowdsourcing (Mechanical Turk, Prolific) | Low-Medium | Low | Very High | Very Low | Simple, unambiguous annotation tasks only; high adjudication overhead |
| Automated (LLM-based pre-annotation) | Medium | Very Low | Very High | Depends on model | Pre-annotation to accelerate human review; never as sole annotator |

Architectural Tensions

| Tension | Option A | Option B | Resolution Guidance |
| --- | --- | --- | --- |
| Annotation speed vs independence (anchoring) | Show model prediction to annotator to speed up agreement | Never show model prediction until after annotator's initial label | For training data: always annotate independently first; model suggestion can be shown as reference AFTER initial label is submitted |
| IAA threshold strictness vs adjudication cost | Strict (Kappa > 0.80): high-quality labels, very high adjudication cost | Lenient (Kappa > 0.60): lower quality, lower cost | Domain-calibrated: regulated tasks require Kappa > 0.75; standard tasks Kappa > 0.65; simple tasks Kappa > 0.60 |
| Single annotator with golden set QA vs dual annotator | Single annotator: 2× throughput, lower cost | Dual annotator: IAA measurement, higher quality | Dual annotator for model training labels; single annotator with dense golden set for high-volume operational annotation where IAA overhead is unjustified |

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| Task specification ambiguity causes low IAA | High | High: high adjudication costs; noisy training data | IAA monitoring on first 200 items of a new task | Pause task; revise specification; re-annotate first batch under new spec |
| Annotator collusion (annotators share answers) | Low | Critical: IAA appears high but labels are not independent | Suspicious IAA improvement without calibration improvement; IP address / submission timing analysis | Forensic investigation; remove colluding annotators; re-annotate affected batch |
| Golden set staleness (same items for > 6 months, answers memorised) | Medium | High: golden set QA becomes ineffective | Annotator accuracy suspiciously high (> 0.97) on golden set | Rotate golden set items; suspend suspicious annotators pending investigation |
| Closed-loop verification failure (model does not improve) | Medium | Medium: annotation batch wasted; model not promoted | Closed-loop verifier run | Annotation quality investigation; may need to discard batch or re-annotate under revised spec |
| Dataset version mis-used in training (wrong version selected) | Low | High: model trained on incorrect data | Dataset version tracking in training pipeline with validation | MLflow/registry version pinning; pipeline validation step checking expected version |

Cascading Failure Scenario

  • Task specification ambiguity → low IAA → high adjudication rate → adjudication backlog → annotations delayed → training pipeline starved → model not retrained for 3 months → model degrades silently in production
  • Mitigation: IAA monitoring on first 200 items fires within 24 hours of task launch; automatic task pause if IAA below threshold prevents backlog accumulation

14. Regulatory Considerations

| Regulation | Specific Clause | Requirement | Implementation |
| --- | --- | --- | --- |
| EU AI Act | Article 10 §3: Training data quality | Training data must be subject to data governance practices, examined for errors and biases | IAA monitoring, golden set QA, bias detection, and closed-loop verification collectively support Article 10 §3 |
| EU AI Act | Article 10 §2(f): Data governance | Training data governance must include examination with regard to possible biases | Annotator bias detection; demographic analysis of label distributions; fairness testing of trained models |
| EU AI Act | Article 12: Record keeping | High-risk AI systems must allow automatic recording of events (logs) over their lifetime | Annotation event logs, the full annotation provenance schema, and the dataset version registry support Article 12 record keeping |
| APRA CPS 234 | §36: Integrity of information | Training data must be protected from unauthorised modification | Append-only annotation store; access controls; audit logging |
| Privacy Act 1988 (Australia) | APP 11: Security of personal information | Personal information in annotation items must be protected | Encryption; access controls; de-identification where feasible; annotator NDA |
| ISO 42001:2023 | §8.3: Data for AI systems | AI systems must address data quality and relevance | Annotation quality controls, IAA, and closed-loop verification support ISO 42001 §8.3 |
| NIST AI RMF | MAP 1.5: Training data assessment | Training data must be assessed for quality and representativeness | Annotation batch quality report; IAA metrics; annotator diversity monitoring |
| GDPR | Article 5(1)(d): Data accuracy | Personal data must be accurate; steps must be taken to correct inaccurate data | Annotation quality controls prevent introduction of inaccurate labels into training data |

15. Reference Implementations

AWS

  • Annotation Interface: Amazon SageMaker Ground Truth (managed annotation with workforce management)
  • Annotation Queue: SageMaker Ground Truth project queue or Amazon SQS for custom interface
  • IAA Scoring: Lambda function triggered by SQS or SageMaker callback
  • Annotation Store: Amazon RDS PostgreSQL
  • Ingestion Pipeline: AWS Glue job reading from RDS; writing to S3 as Parquet with Delta Lake
  • Training Data Store: Amazon S3 with AWS Glue Data Catalog
  • Closed-Loop Verifier: SageMaker Processing Job

Azure

  • Annotation Interface: Azure ML Data Labeling (managed) or Label Studio on Azure Container Apps
  • Annotation Store: Azure SQL Database
  • Ingestion Pipeline: Azure Data Factory pipeline; writing to Azure Data Lake Storage Gen2
  • Training Data Store: Azure ML Dataset with versioning
  • Closed-Loop Verifier: Azure ML Evaluation step in Azure ML Pipeline

GCP

  • Annotation Interface: Vertex AI Data Labeling Service or Label Studio on Cloud Run
  • Annotation Store: Cloud SQL PostgreSQL or Firestore
  • Ingestion Pipeline: Cloud Dataflow or Cloud Composer (Airflow)
  • Training Data Store: Google Cloud Storage + BigQuery for analytics
  • Closed-Loop Verifier: Vertex AI Evaluation step in Vertex AI Pipeline

On-Premises / Private Cloud

  • Annotation Interface: Label Studio (self-hosted on Kubernetes); open-source, full-featured
  • Annotation Store: PostgreSQL with full schema; pgaudit for append-only enforcement
  • IAA Scoring: Python microservice computing Cohen's Kappa via scikit-learn
  • Ingestion Pipeline: Airflow DAG with dbt transformations
  • Training Data Store: MinIO (S3-compatible) with Delta Lake; MLflow Dataset Registry
  • Closed-Loop Verifier: Python evaluation job in Airflow; results logged to MLflow

16. Related Patterns

| Pattern | ID | Relationship | Notes |
| --- | --- | --- | --- |
| Active Learning Loop | EAAPL-HIL002 | Complementary: active learning determines which items to annotate; this pattern governs how | Active learning feeds the annotation queue; this pattern manages what happens inside the queue |
| Human Escalation Pattern | EAAPL-HIL003 | Complementary: expert resolutions from escalation are high-quality annotation items | Resolved escalations can be routed to the annotation store as training labels |
| Collaborative AI Decision | EAAPL-HIL004 | Complementary: human overrides from collaborative decisions are annotation signals | Override records feed annotation ingestion pipeline |
| Human Override Pattern | EAAPL-HIL006 | Complementary: override events are natural annotation items | Override records with reason codes are annotation-quality training data |
| Hybrid Intelligence Pattern | EAAPL-HIL008 | Dependency: hybrid intelligence requires well-designed annotation to measure human vs AI accuracy | Annotation quality determines the accuracy of human-AI performance comparison |
| Supervisor Agent | EAAPL-MAG002 | Loosely related: supervisor agent quality review produces annotation-quality feedback | Agent supervisor outputs can be routed to annotation store for model improvement |

17. Maturity Assessment

Overall Maturity Level: Proven

| Dimension | Score (1–5) | Rationale |
| --- | --- | --- |
| Technical Maturity | 5 | Annotation platforms (Label Studio, Scale AI, Labelbox), IAA algorithms, and ML pipelines are mature |
| Operational Maturity | 3 | Annotator management and quality operations are organisationally complex; most enterprises under-invest in QA operations |
| Governance Maturity | 4 | EU AI Act Article 10 directly requires training data governance; this pattern is one way to implement it |
| Tooling Ecosystem | 5 | Multiple mature open-source and commercial annotation platforms; strong ML framework support |
| Enterprise Adoption | 4 | Widely adopted in financial services and healthcare; quality management practices (golden set, bias detection) less mature outside ML-first organisations |
| Risk Profile | Medium | Primary risk is annotation quality degradation without detection; controlled with golden set monitoring and closed-loop verification |

18. Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2026-06-12 | EAAPL Working Group | Initial publication covering task design, annotator management, quality assurance, feedback storage schema, ingestion pipeline, and closed-loop verification |