Annotation and Feedback Loop

Pattern ID: EAAPL-HIL007
Status: Proven
Tags: human-oversight, model-risk, medium-complexity
Version: 1.0
Last Updated: 2026-06-12

1. Executive Summary

The Annotation and Feedback Loop pattern defines the end-to-end architecture for collecting structured human annotations on AI inputs and outputs, managing annotator quality, storing labels with full provenance, and routing validated labels back to model training. It is the operational backbone underlying all other human-in-the-loop feedback mechanisms. Where the Active Learning Loop (EAAPL-HIL002) addresses which items to annotate, this pattern addresses how to annotate them — the annotator management system, quality assurance framework, data storage schema, and ingestion pipeline that transform human judgment into model-trainable data.

The pattern covers annotation task design with clear guidelines and uncertainty protocols; annotator management including onboarding calibration tests, ongoing quality monitoring, and bias detection; quality assurance through golden datasets, adjudication, and inter-annotator agreement thresholds; a detailed feedback storage schema; validation and deduplication in the ingestion pipeline; dataset versioning; and closed-loop verification that new models trained on annotations are tested on a held-out set before promotion. CIOs and CTOs gain a structured, auditable annotation operation that produces high-quality training data, supports compliance with EU AI Act Article 10 training data governance requirements, and converts the operational cost of human review into a compounding strategic asset.

2. Problem Statement

Business Problem

Organisations deploying AI at scale need human-labelled data to train and retrain models. Without a structured annotation operation, labelling is ad hoc: different teams use different instructions, quality varies widely, there is no record of who labelled what or how reliably, and the resulting training data has unknown quality. Models trained on poor-quality annotations can perform worse than models trained on no new data at all, in which case the annotation effort degrades model quality instead of improving it.

Technical Problem

Annotation is not a solved problem. Human annotators disagree, make errors, develop biases over time, and game easy QA mechanisms. Inter-annotator agreement on complex enterprise tasks (legal clause classification, clinical coding, nuanced sentiment) is often below acceptable thresholds without deliberate intervention. Storing annotations without provenance makes it impossible to trace model errors back to labelling decisions or to exclude low-quality annotators' labels from training.

Symptoms

No formal annotation guidelines exist; different annotators interpret tasks differently
Annotator accuracy is not monitored; poor-quality annotators continue labelling indefinitely
Training data schema does not record who labelled what or when; provenance is lost
Adjudication for disagreements is informal and inconsistently applied
New model versions are trained on newly annotated data and deployed without verifying they improve on held-out data from the same annotation batch

Cost of Inaction

Model trained on poor-quality annotations performs worse in production than the previous version
Regulatory examination of training data governance (EU AI Act Article 10) reveals no quality controls
Annotator bias — systematic mislabelling by a demographic group or individual — corrupts training data without detection
Annotation effort is wasted: high cost, zero benefit

3. Context

When to Apply

Any organisation running ongoing model training with human-labelled data
Teams adding human feedback collection to production AI systems
Regulated environments requiring documented training data quality controls
Projects using third-party annotation vendors who must be quality-managed

When NOT to Apply

Pure generative model fine-tuning with RLHF using preference ranking rather than categorical labels (requires a specialised variant of this pattern)
One-off labelling projects where ongoing quality management is not cost-justified

Prerequisites

A defined annotation task with a finite label taxonomy
Access to an annotation workforce (internal, outsourced, or crowdsourced)
A training pipeline that can consume new validated labels

Industry Applicability

Industry	Annotation Task Type	Label Taxonomy Example	Annotator Source
Financial Services	Transaction intent classification	12 transaction categories + anomaly flag	Compliance analysts
Healthcare	Clinical note coding	ICD-10/CPT code sets	Certified clinical coders
Insurance	Claims document classification	Claim type, fraud indicator, priority	Claims staff
Legal	Contract clause risk flagging	Risk level (None/Low/Medium/High) + clause type	Paralegals
Media	Content moderation	Safe/Restricted/Removed + reason codes	Trust and Safety team
Retail	Product attribute extraction	Structured attribute taxonomy	Category management team

4. Architecture Overview

The Annotation and Feedback Loop architecture has six stages that must operate together to produce reliable training data.

Stage 1 — Annotation Task Design. The annotation task must be fully specified before any annotator touches a single item. The task specification includes: a clear one-paragraph description of the labelling objective; a label taxonomy with definitions for every category and explicit boundary cases; positive examples (items where each label clearly applies); negative examples (items where the label seems applicable but does not apply — these are the most important for quality); an uncertainty protocol defining what annotators should do when they genuinely cannot decide (flag as ambiguous rather than guess — guesses on ambiguous items produce unreliable labels that damage training data quality); and a time-per-annotation target to discourage rushing. The task specification is reviewed by at least one domain expert before annotation begins.

Stage 2 — Annotator Onboarding and Calibration. Every new annotator completes a structured onboarding process: they read the task specification and guidelines; they complete a calibration test of 30–50 items with known correct answers; they review their results with explanations for any errors; and they must achieve a minimum accuracy threshold (typically 85% on the calibration set) before being permitted to annotate production items. Annotators who fail the calibration test can retry after reviewing the guidance. Annotators who fail three calibration attempts are not cleared for the task.
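The calibration gate described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function and class names are hypothetical, and the 85% threshold and three-attempt limit are taken from the text as defaults that a real deployment would tune per task.

```python
from dataclasses import dataclass

MIN_CALIBRATION_ACCURACY = 0.85  # "typically 85%" in the text; tune per task
MAX_ATTEMPTS = 3                 # three failed attempts means not cleared


@dataclass
class CalibrationResult:
    status: str      # "cleared", "retry", or "not_cleared"
    accuracy: float


def score_attempt(answers: dict[str, str], answer_key: dict[str, str]) -> float:
    """Fraction of calibration items labelled correctly.

    Items missing from `answers` count as wrong.
    """
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    correct = sum(1 for item_id, gold in answer_key.items()
                  if answers.get(item_id) == gold)
    return correct / len(answer_key)


def evaluate_calibration(answers: dict[str, str],
                         answer_key: dict[str, str],
                         prior_failed_attempts: int) -> CalibrationResult:
    """Decide whether an annotator is cleared, may retry, or is not cleared."""
    accuracy = score_attempt(answers, answer_key)
    if accuracy >= MIN_CALIBRATION_ACCURACY:
        return CalibrationResult("cleared", accuracy)
    failed = prior_failed_attempts + 1
    status = "not_cleared" if failed >= MAX_ATTEMPTS else "retry"
    return CalibrationResult(status, accuracy)
```

An annotator who scores 36 of 40 is cleared; one who scores 30 of 40 may retry unless this is their third failed attempt.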

Stage 3 — Ongoing Quality Monitoring. Golden dataset items (items with known correct answers verified by a senior domain expert) are seeded into the annotation queue at a rate of 5–10% of items. Annotators are not told which items are golden. The system tracks each annotator's accuracy on golden items on a rolling basis. If an annotator's rolling golden-set accuracy drops below 80%, their account is suspended and their recent work is queued for re-annotation. Peer agreement monitoring runs continuously. Because Cohen's Kappa is defined over a set of items rather than a single item, it is computed for each pair of annotators across the items that both have labelled. If a specific annotator's pairwise Kappa with all peers drops below 0.65 over a 7-day window, this annotator is flagged for review. Bias detection computes the label distribution per annotator and compares it to the population distribution; annotators whose distribution is systematically skewed are investigated.
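The peer agreement check can be illustrated with a dependency-free sketch. Cohen's Kappa is computed here from its standard definition (observed agreement corrected for chance agreement); scikit-learn's `cohen_kappa_score` could be used instead. The aggregation rule (mean Kappa across peers), the `min_shared` cut-off, and all function names are assumptions made for illustration; the text above specifies only the 0.65 threshold and the 7-day window, and windowing is left to the caller.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

KAPPA_FLAG_THRESHOLD = 0.65  # from the text above


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's Kappa for two annotators over the same ordered items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and the same length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations: dict[str, dict[str, str]],
                        min_shared: int = 30) -> dict[str, float]:
    """Mean Kappa of each annotator against every peer with enough shared items.

    `annotations` maps annotator_id -> {item_id: label}, already restricted
    to the monitoring window by the caller.
    """
    scores: dict[str, list[float]] = {a: [] for a in annotations}
    for a, b in combinations(annotations, 2):
        shared = sorted(annotations[a].keys() & annotations[b].keys())
        if len(shared) < min_shared:
            continue  # too few shared items for a meaningful estimate
        kappa = cohen_kappa([annotations[a][i] for i in shared],
                            [annotations[b][i] for i in shared])
        scores[a].append(kappa)
        scores[b].append(kappa)
    return {a: mean(ks) for a, ks in scores.items() if ks}


def flag_annotators(annotations: dict[str, dict[str, str]],
                    min_shared: int = 30) -> list[str]:
    """Annotators whose mean pairwise Kappa falls below the threshold."""
    means = mean_pairwise_kappa(annotations, min_shared)
    return sorted(a for a, k in means.items() if k < KAPPA_FLAG_THRESHOLD)
```

Using the mean across peers keeps one noisy pairing from flagging an otherwise consistent annotator; a stricter reading of the rule would flag only when every pairwise value is below the threshold.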

Stage 4 — Adjudication. Items where annotator disagreement exceeds the threshold are routed to an adjudication queue. The adjudicator (a senior domain expert) reviews all annotations, their reasoning, and the original item, and provides a definitive label with a mandatory written reasoning explanation. Adjudicated labels are stored separately from directly agreed labels and are considered higher quality (eligible for use in evaluation sets). Recurring adjudication on the same label category is a signal that the task specification needs clarification for that category.

Stage 5 — Feedback Storage Schema. The annotation store captures: annotation_id (UUID, primary key); item_id (link to the original data item); annotator_id (pseudonymised for privacy); task_version_id (link to the task specification version used — critical for reproducibility); timestamp (annotation completion time); label (the annotation value); annotator_confidence (the annotator's self-rated confidence: certain/probable/uncertain); reasoning (optional free-text explanation, required for adjudication); time_spent_ms (elapsed time from task display to submission); is_golden (boolean flag for golden items, visible only to QA team); adjudication_id (null for non-adjudicated; link to adjudication record if applicable); and quality_flags (array of any quality concerns flagged during QA). This schema enables full provenance tracing from any model version back to the specific annotator and task version that produced each training label.
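The schema above maps naturally onto a typed record. The following is a sketch only: the field names come from the text, while the Python types, the `Confidence` enum, the defaults, and the two validation rules are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class Confidence(str, Enum):
    CERTAIN = "certain"
    PROBABLE = "probable"
    UNCERTAIN = "uncertain"


@dataclass(frozen=True)  # frozen mirrors the append-only store: no in-place edits
class AnnotationRecord:
    item_id: str                      # link to the original data item
    annotator_id: str                 # pseudonymised identifier
    task_version_id: str              # task specification version used
    label: str
    annotator_confidence: Confidence
    time_spent_ms: int                # task display to submission
    reasoning: Optional[str] = None   # required when adjudicated
    is_golden: bool = False           # must never be exposed to annotators
    adjudication_id: Optional[UUID] = None
    quality_flags: tuple[str, ...] = ()
    annotation_id: UUID = field(default_factory=uuid4)
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self) -> None:
        if self.time_spent_ms < 0:
            raise ValueError("time_spent_ms must be non-negative")
        if self.adjudication_id is not None and not self.reasoning:
            raise ValueError("reasoning is required for adjudicated annotations")
```

Making the record immutable in application code complements, but does not replace, append-only enforcement in the database.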

Stage 6 — Ingestion Pipeline to Training. Validated labels flow from the annotation store through a four-step ingestion pipeline: validation (check that the label is in the allowed taxonomy, that confidence is set, and that time_spent_ms is within the expected range; outliers are flagged for review); deduplication (if an item has been annotated multiple times, apply a majority vote, optionally weighted by annotator quality score, to produce a canonical label); dataset versioning (each ingestion run produces a named, immutable dataset version in the training data store; versions are appended, never overwritten); and training data store update (the new version is registered in the dataset registry and made available to the training pipeline). The training pipeline trains the challenger model on the new dataset version. Before any model is promoted to production, it is evaluated on a held-out set sampled from the same annotation batch (same distribution as the training data but not seen during training). This closed-loop verification catches cases where annotation quality is too low to support training: if the labels are noisy, the model will not improve on held-out data from the same batch.
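The first three ingestion steps can be sketched in a few functions. Everything here is illustrative: the record layout is a plain dict, the time bounds and the default quality weight of 0.5 are hypothetical, and naming dataset versions by content hash is an assumption, not something the pattern prescribes.

```python
import hashlib
import json
from collections import defaultdict

ALLOWED_CONFIDENCE = {"certain", "probable", "uncertain"}


def validate(record: dict, taxonomy: set[str],
             min_ms: int = 1_000, max_ms: int = 600_000) -> list[str]:
    """Return problem codes for one annotation; an empty list means it passed.

    The time bounds are hypothetical defaults; outliers are flagged for
    review, not silently dropped.
    """
    problems = []
    if record.get("label") not in taxonomy:
        problems.append("label_not_in_taxonomy")
    if record.get("confidence") not in ALLOWED_CONFIDENCE:
        problems.append("confidence_missing")
    if not min_ms <= record.get("time_spent_ms", -1) <= max_ms:
        problems.append("time_spent_outlier")
    return problems


def canonical_labels(records: list[dict],
                     quality: dict[str, float]) -> dict[str, str]:
    """Quality-weighted vote per item; unknown annotators get weight 0.5."""
    votes: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for r in records:
        votes[r["item_id"]][r["label"]] += quality.get(r["annotator_id"], 0.5)
    # Ties resolve to the alphabetically first label so runs are repeatable;
    # a production pipeline might send ties to adjudication instead.
    return {item: max(sorted(tally), key=tally.get)
            for item, tally in votes.items()}


def dataset_version_id(labels: dict[str, str]) -> str:
    """Content-derived name: identical label sets always get the same id."""
    payload = json.dumps(labels, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]
```

A content-derived version id makes re-running the pipeline on unchanged annotations idempotent, which fits the re-runnable recovery strategy listed for the ingestion pipeline under Disaster Recovery.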

5. Architecture Diagram

flowchart TD
    subgraph Collection["Annotation Collection"]
        A[Items for Annotation]
        B[Annotation Queue]
        C[Annotator Pool]
    end
    subgraph QA["Quality Assurance"]
        D[IAA Scorer]
        E[Adjudication Queue]
        F[(Annotation Store)]
    end
    subgraph Training["Model Training"]
        G[Ingestion Pipeline]
        H[Training Pipeline]
        I{Closed-Loop Verification}
    end
    A --> B
    B --> C
    C --> D
    D -->|agreement met| F
    D -->|disagreement| E
    E --> F
    F --> G
    G --> H
    H --> I
    I -->|improvement confirmed| A
    I -->|no improvement| E
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#fee2e2,stroke:#ef4444
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#f3e8ff,stroke:#a855f7

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Annotation Queue	Durable Queue	Hold items for annotation; assign to annotators; manage golden item seeding	PostgreSQL queue table, Label Studio task queue, Scale AI project	Critical
Annotation Interface	Web Application	Present item with full context, task spec, guidelines; capture label + metadata	Label Studio (self-hosted or SaaS), Labelbox, Scale AI, Prolific, custom React	Critical
Annotator Management Service	Application Service	Track annotator onboarding status, calibration results, golden-set accuracy, bias metrics	Custom service backed by PostgreSQL	High
IAA Scorer	Quality Service	Check label agreement on each item; compute inter-annotator agreement (Kappa) per annotator pair across shared items	Python scikit-learn (cohen_kappa_score); Krippendorff's alpha (custom implementation)	Critical
Golden Set Manager	Quality Service	Seed golden items into queue; compute and track annotator accuracy on golden items	Custom service; golden items stored in separate sealed table	Critical
Adjudication Interface	Web Application	Present disagreeing annotations to adjudicator; capture definitive label + reasoning	Custom interface or Label Studio with review mode	High
Annotation Store	Data Store	Persist annotations with full schema; append-only	PostgreSQL; Delta Lake	Critical
Ingestion Pipeline	ETL	Validate → deduplicate → version → load to training data store	Airflow DAG; dbt for transformation; MLflow datasets	High
Training Data Store	Data Store	Hold versioned immutable training datasets	S3 + Delta Lake; Vertex AI Dataset; Azure ML Dataset	Critical
Closed-Loop Verifier	ML Evaluation Service	Evaluate challenger on held-out set from same annotation batch; produce improvement report	Python evaluation job; MLflow tracking	Critical
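The golden item seeding performed by the Annotation Queue and Golden Set Manager can be sketched as below. The function name, the queue entry layout, and the 7% default are hypothetical; the only constraint taken from the pattern is that golden items make up roughly 5–10% of what annotators see and are indistinguishable from ordinary items.

```python
import random
from typing import Optional


def seed_golden_items(items: list[str], golden_pool: list[str],
                      rate: float = 0.07,
                      rng: Optional[random.Random] = None) -> list[dict]:
    """Mix golden items into a batch so they form about `rate` of the queue.

    The `is_golden` flag is for the QA service only and must be stripped
    before an entry is shown in the annotation interface.
    """
    if not 0 < rate < 1:
        raise ValueError("rate must be between 0 and 1")
    rng = rng or random.Random()
    # Solve g / (n + g) = rate for g, capped by the golden items available.
    n_golden = min(round(len(items) * rate / (1 - rate)), len(golden_pool))
    chosen = rng.sample(golden_pool, n_golden)
    queue = ([{"item_id": i, "is_golden": False} for i in items]
             + [{"item_id": g, "is_golden": True} for g in chosen])
    rng.shuffle(queue)  # random positions, so golden items cannot be guessed
    return queue
```

For example, seeding 930 production items at a 7% rate adds 70 golden items, giving a queue of 1,000.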

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Data Pipeline / Active Learning Selector	Pushes items to annotation queue	queue_item{item_id, content, priority, is_golden}
2	Annotation Interface	Assigns item to annotator; presents with task spec	task_displayed_at timestamp
3	Annotator	Reviews item; selects label; sets confidence; adds optional reasoning; submits	raw_annotation{item_id, annotator_id, label, confidence, reasoning, time_spent_ms}
4	IAA Scorer	After a minimum of 2 annotations per item: checks label agreement on the item and updates annotator-pair Kappa	iaa_score, agreement: true/false
5a	Quality Validator	For agreed items: validates label, confidence, time_spent_ms; checks golden accuracy	validated_annotation or quality_flag
5b	Adjudication Queue	For disagreed items: creates adjudication task	adjudication_task{item_id, annotations[]}
6	Adjudicator	Reviews and provides definitive label with reasoning	adjudication_record{item_id, label, reasoning, adjudicator_id}
7	Annotation Store	Persists annotation with full schema	annotation_id, full annotation record
8	Ingestion Pipeline	Validates, deduplicates, versions, loads	Dataset version N in training data store
9	Training Pipeline	Trains challenger on new dataset version	Challenger model artefact
10	Closed-Loop Verifier	Evaluates challenger on held-out set	Improvement report: accuracy delta, held-out accuracy
11	Model Registry	On confirmed improvement: registers challenger	Updated champion or pending A/B test
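Steps 9 to 11 hinge on a held-out comparison between the champion and the challenger. The sketch below assumes accuracy as the metric and treats a model as any callable from input to label; both are simplifications, and the function names, the 20% hold-out fraction, and the `min_delta` parameter are hypothetical.

```python
import random
from typing import Callable, Iterable, Sequence


def split_batch(item_ids: Iterable, holdout_fraction: float = 0.2,
                seed: int = 0) -> tuple[list, list]:
    """Split one annotation batch into (train, held-out) ids, reproducibly."""
    ids = sorted(item_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * holdout_fraction)
    return ids[cut:], ids[:cut]


def accuracy(predict: Callable, examples: Sequence[tuple]) -> float:
    """Share of (input, label) examples the model predicts correctly."""
    if not examples:
        raise ValueError("held-out set must not be empty")
    return sum(predict(x) == y for x, y in examples) / len(examples)


def closed_loop_report(champion: Callable, challenger: Callable,
                       holdout: Sequence[tuple],
                       min_delta: float = 0.0) -> dict:
    """Compare both models on the held-out set and decide on promotion."""
    champion_acc = accuracy(champion, holdout)
    challenger_acc = accuracy(challenger, holdout)
    delta = challenger_acc - champion_acc
    return {
        "champion_accuracy": champion_acc,
        "challenger_accuracy": challenger_acc,
        "accuracy_delta": delta,
        "promote": delta > min_delta,  # no improvement means no promotion
    }
```

Setting `min_delta` above zero guards against promoting on differences that are within evaluation noise; the right margin depends on held-out set size.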

Error Flow

Error Condition	Detected By	Recovery Action	Notification
Annotator accuracy below threshold on golden set	Golden Set Manager	Suspend annotator; queue their recent work for re-annotation	Annotation manager; annotator receives re-calibration task
IAA consistently below threshold for a label category	IAA Scorer trend report	Trigger task specification review; pause annotation of that category	Annotation manager; domain expert
Ingestion pipeline validation failure (invalid label value)	Ingestion validator	Quarantine affected batch; log validation error; notify QA team	QA team; ML Ops
Closed-loop verification shows no improvement	Closed-Loop Verifier	Halt model promotion; trigger annotation quality review	ML Ops; Model Risk Officer
Adjudication queue backlog exceeds 500 items	Queue depth monitor	Alert annotation manager; prioritise adjudication sprint	Annotation manager
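The first error condition in the table, an annotator falling below the golden-set threshold, depends on a rolling accuracy calculation. A minimal sketch follows; the class name, the window size, and the minimum number of observations are assumptions, while the 80% suspension threshold comes from the architecture overview.

```python
from collections import defaultdict, deque


class GoldenAccuracyTracker:
    """Rolling golden-set accuracy per annotator."""

    def __init__(self, window: int = 50, threshold: float = 0.80,
                 min_observations: int = 20):
        self.threshold = threshold
        self.min_observations = min_observations
        # One fixed-length history per annotator; old results fall off.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def record(self, annotator_id: str, correct: bool) -> bool:
        """Store one golden-item result; return True if suspension is due."""
        history = self._history[annotator_id]
        history.append(bool(correct))
        accuracy = sum(history) / len(history)
        # Require a minimum sample so one early miss does not suspend anyone.
        return len(history) >= self.min_observations and accuracy < self.threshold
```

A caller would invoke `record` each time an annotator submits a golden item and, on `True`, suspend the account and queue recent work for re-annotation.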

8. Security Considerations

Authentication and Authorisation

Annotators authenticate via SSO; annotation interface sessions expire after 30 minutes of inactivity
Golden set items visible only to QA administrators, not annotators (seeding would be ineffective if annotators knew which items were golden)
Annotation store write access restricted to annotation interface service account; no direct annotator access to the database
Adjudication interface accessible only to designated senior annotators with elevated RBAC role

Secrets Management

Annotation platform API keys (for SaaS platforms like Scale AI, Labelbox) stored in secrets manager
Training data store access credentials stored in secrets manager; rotated every 90 days

Data Classification

Annotation items inherit the classification of the source data; items containing PII require de-identification before annotation where feasible
For tasks requiring PII annotation (e.g. named entity recognition on real names), annotators sign specific NDA and PII handling agreement; access is logged and audited
Annotator IDs pseudonymised in training data store; mapping table access restricted to QA and HR

Encryption

Annotation store encrypted at rest (AES-256); annotator PII (email, name) stored in encrypted HR system, not in annotation store
All data in transit encrypted (TLS 1.3)

Auditability

Every annotation event logged with annotator_id (pseudonymised), item_id, timestamp, task_version_id
Adjudication decisions logged with full annotation context and adjudicator_id
Dataset version provenance traceable from training data store back to annotation_ids

OWASP LLM Top 10 Considerations

OWASP LLM Risk	Applicability	Mitigation
LLM01: Prompt Injection	Low — annotation interface is human-driven	N/A
LLM02: Insecure Output Handling	Low — annotation outputs are categorical labels	Validate label values against taxonomy; sanitise free-text reasoning
LLM03: Training Data Poisoning	High — adversarial annotators could deliberately mislabel to degrade model	Golden set monitoring; IAA thresholds; bias detection; closed-loop verification rejects poisoned batches
LLM04: Model Denial of Service	Low	N/A
LLM05: Supply Chain Vulnerabilities	Medium — third-party annotation platforms (Scale AI, Labelbox) process sensitive data	Security and privacy assessment of annotation vendors; DPA; penetration testing
LLM06: Sensitive Information Disclosure	High — annotation items may contain sensitive data accessible to annotators	Data minimisation; annotator NDA; PII de-identification where feasible
LLM07: Insecure Plugin Design	Low	N/A
LLM08: Excessive Agency	Low — annotations are human judgments, not AI autonomy	N/A
LLM09: Overreliance	Medium — if annotators defer to AI-assisted labelling tools, label independence is compromised	Annotator guidelines explicitly prohibit using external AI tools; interface should not show AI suggestions before annotator's initial label
LLM10: Model Theft	Medium — high-quality annotated dataset is a significant IP asset	Access controls on training data store; restrict export; watermark datasets

9. Governance Considerations

Responsible AI

Annotator cohort diversity: monitor whether annotator pool introduces demographic bias; compare label distributions across annotator demographic groups (where known and with consent)
Task specification bias audit: have task specifications reviewed by fairness expert before deployment to identify instruction language that may systematically bias labelling against protected groups
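Comparing label distributions across annotator groups, as the first item above suggests, can be sketched with total variation distance. The choice of distance measure, the 0.2 threshold, and the function names are assumptions for illustration; the pattern does not prescribe a specific statistic.

```python
from collections import Counter


def label_distribution(labels: list[str]) -> dict[str, float]:
    """Relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0 for identical distributions, 1 for disjoint."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0))
                     for k in p.keys() | q.keys())


def skewed_groups(labels_by_group: dict[str, list[str]],
                  threshold: float = 0.2) -> dict[str, float]:
    """Groups whose label mix differs from the pooled mix by more than threshold.

    A group key can be an individual annotator or a consented demographic
    cohort. The threshold is a hypothetical starting point, not a standard.
    """
    pooled = label_distribution(
        [label for labels in labels_by_group.values() for label in labels])
    distances = {group: total_variation(label_distribution(labels), pooled)
                 for group, labels in labels_by_group.items() if labels}
    return {g: d for g, d in distances.items() if d > threshold}
```

A large distance is a prompt for investigation, not proof of bias: groups that were assigned a different mix of items will legitimately produce a different mix of labels.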

Model Risk Management

Annotation batch quality report reviewed by Model Risk before training begins on any batch
Closed-loop verification report required before champion promotion; Model Risk Officer signs off on each promotion

Human Approval Gates

Task specification changes require domain expert and Model Risk review; changing the specification mid-batch invalidates existing annotations (must be annotated under the new spec)
Golden set additions or changes require QA team approval; golden set is a controlled asset

Policy Compliance

Annotators must complete mandatory training on data handling, PII, and annotation ethics before being onboarded
Third-party annotation vendor agreements must include: data processing addendum, security assessment, audit rights, right to terminate and retrieve data

Traceability

Every model version traceable to: dataset version → annotation batch → individual annotation_ids → annotator_ids (pseudonymised) → task_version_id (guidelines used)
Full trace available for EU AI Act Article 10 training data documentation

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Annotator Quality Report	Annotation Manager	Weekly	Golden-set accuracy, IAA trends, suspension events
Annotation Batch Quality Report	QA Team	Per batch	IAA summary, adjudication rate, validation failure rate
Closed-Loop Verification Report	ML Ops	Per training cycle	Challenger improvement on held-out set
Dataset Version Provenance Certificate	Data Governance	Per dataset version	Certify lawful basis, annotator cohort, task spec version
Annotation Vendor Security Assessment	Security / Legal	Annually	Confirm annotation vendor meets data handling requirements

10. Operational Considerations

Monitoring

Metric	SLO	Alert Threshold	Owner
Annotation queue depth	< 2x annotator daily capacity	> 3x daily capacity	Annotation Manager
Average IAA (Kappa) across active tasks	> 0.70	< 0.60 for any task on 7-day rolling	Annotation Manager
Golden set annotator accuracy (average)	> 0.85	< 0.80 for any active annotator	QA Team
Adjudication queue backlog	< 100 items	> 500 items	Annotation Manager
Ingestion pipeline success rate	> 99%	Any failure	ML Ops
Closed-loop verification pass rate	> 80% of batches show improvement	3 or more consecutive batches without improvement	Model Risk Officer

Logging

All annotation events logged with full schema; retained 7 years
Ingestion pipeline runs logged with dataset version, record counts, validation error counts
Adjudication decisions logged with full annotation context

Incident Response

Annotator quality failure: suspend within 1 hour of detection; re-annotation scheduled within 5 business days
IAA collapse on a task: pause annotation of that task; convene domain expert review within 48 hours
Closed-loop verification failure: no model promotion; annotation quality investigation within 5 business days

Disaster Recovery

Component	RTO	RPO	Strategy
Annotation Queue	1 hour	30 min	PostgreSQL synchronous standby
Annotation Store	4 hours	15 min	PostgreSQL with continuous WAL archiving
Training Data Store	4 hours	1 hour	Object storage replication; versioned, immutable
Ingestion Pipeline	8 hours	N/A (re-runnable)	Idempotent pipeline; re-process from annotation store

Capacity Planning

Annotator headcount must be sized to process annotation queue within 48 hours at target throughput
Adjudication capacity must scale with IAA quality: lower IAA = more adjudication work; model adjudication volume from historical IAA rates
Training data store grows permanently; plan for 5–10 years of annotation accumulation
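The first two sizing rules reduce to simple arithmetic, sketched here. The parameter names, the assumption of a fixed number of productive hours per annotator per day, and the use of a historical disagreement rate as the adjudication forecast are all illustrative simplifications.

```python
import math


def annotators_needed(queue_items: int, annotations_per_item: int,
                      seconds_per_annotation: float,
                      productive_hours_per_day: float,
                      clearance_days: float = 2.0) -> int:
    """Headcount required to clear the queue within `clearance_days`.

    The default of 2 days corresponds to the 48-hour target.
    """
    total_seconds = queue_items * annotations_per_item * seconds_per_annotation
    capacity_per_annotator = productive_hours_per_day * 3600 * clearance_days
    return math.ceil(total_seconds / capacity_per_annotator)


def expected_adjudications(items: int,
                           historical_disagreement_rate: float) -> int:
    """Forecast adjudication volume from the historical disagreement rate."""
    return round(items * historical_disagreement_rate)
```

For instance, 10,000 queued items with dual annotation at 60 seconds each, and 6 productive hours per annotator per day, need 28 annotators to clear within 48 hours.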

11. Cost Considerations

Cost Drivers

Driver	Description	Relative Weight
Annotator Labour	Per-item cost × volume; dominant cost driver	Very High
Adjudication Labour	Senior expert time; typically 10–25% of items	High
Annotation Platform Licensing	SaaS per-seat or per-item pricing; or open-source hosting costs	Medium
QA Operations	Staff time for golden set management, annotator quality review	Medium
Storage	Annotation store + training data store; grows permanently	Low
Training Compute	Not a direct annotation cost; scales with dataset size	Medium

Scaling Risks

Without active learning selection (EAAPL-HIL002), annotation volume scales linearly with data volume regardless of marginal value
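Low IAA tasks require disproportionate adjudication effort: because adjudication consumes senior expert time, a task with a 40% adjudication rate (IAA below threshold for 40% of items) can be around 3× more expensive per confirmed label than a task with a 10% adjudication rate when adjudication labour dominates the per-item cost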
Task specification ambiguity is the largest cost multiplier: invest in task design to reduce adjudication costs

Optimisations

Invest heavily in task specification quality: every 10% improvement in IAA reduces adjudication cost by 40–60%
Use active learning selection to annotate only the highest-value items
Use adjudicated items to improve task specification over time: recurring adjudication on the same label type reveals specification ambiguity
Pre-annotation with model suggestions (shown AFTER annotator's initial label) can reduce annotation time per item by 20–30%

Indicative Cost Range

Scale	Monthly Annotation Volume	Annotation Cost/Item	Adjudication Rate	Total Monthly Cost
Small (5K items/month)	5,000	$2–$5	15%	$11,500–$28,750
Medium (50K items/month)	50,000	$1–$3	12%	$56,000–$168,000
Large (500K items/month)	500,000	$0.50–$2	10%	$275,000–$1.1M
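One reading of the table, assumed here and not stated by the source, prices adjudication as a surcharge proportional to the adjudication rate: total = volume × cost per item × (1 + adjudication rate). The sketch below applies that model; the function name is hypothetical, and real adjudication cost depends on senior expert rates, so these figures stay indicative.

```python
def monthly_cost(volume: int, cost_per_item: float,
                 adjudication_rate: float) -> float:
    """Indicative monthly cost under the assumed surcharge model."""
    if not 0 <= adjudication_rate <= 1:
        raise ValueError("adjudication_rate must be between 0 and 1")
    return round(volume * cost_per_item * (1 + adjudication_rate), 2)


# Medium scale: 50,000 items per month at $1 to $3 per item, 12% adjudication.
low = monthly_cost(50_000, 1.0, 0.12)
high = monthly_cost(50_000, 3.0, 0.12)
print(f"${low:,.0f} to ${high:,.0f}")
```

Under this assumption the sketch reproduces the Medium row of the table.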

12. Trade-Off Analysis

Annotator Sourcing Options

Source	Quality	Cost	Scalability	Domain Knowledge	Recommended Use Case
Internal subject-matter experts	Very High	Very High	Low	Excellent	Complex regulated tasks (clinical, legal, compliance); golden set creation
Internal operations staff	High	High	Medium	Good	Operational tasks within their domain
Managed labelling vendors (Scale AI, Surge)	Medium-High	Medium	High	Low-Medium	General annotation at volume; quality depends on briefing quality
Crowdsourcing (Mechanical Turk, Prolific)	Low-Medium	Low	Very High	Very Low	Simple, unambiguous annotation tasks only; high adjudication overhead
Automated (LLM-based pre-annotation)	Medium	Very Low	Very High	Depends on model	Pre-annotation to accelerate human review; never as sole annotator

Architectural Tensions

Tension	Option A	Option B	Resolution Guidance
Annotation speed vs independence (anchoring)	Show model prediction to annotator to speed up agreement	Never show model prediction until after annotator's initial label	For training data: always annotate independently first; model suggestion can be shown as reference AFTER initial label is submitted
IAA threshold strictness vs adjudication cost	Strict (Kappa > 0.80): high-quality labels, very high adjudication cost	Lenient (Kappa > 0.60): lower quality, lower cost	Domain-calibrated: regulated tasks require Kappa > 0.75; standard tasks Kappa > 0.65; simple tasks Kappa > 0.60
Single annotator with golden set QA vs dual annotator	Single annotator: 2× throughput, lower cost	Dual annotator: IAA measurement, higher quality	Dual annotator for model training labels; single annotator with dense golden set for high-volume operational annotation where IAA overhead is unjustified

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Task specification ambiguity causes low IAA	High	High — high adjudication costs; noisy training data	IAA monitoring on first 200 items of a new task	Pause task; revise specification; re-annotate first batch under new spec
Annotator collusion (annotators share answers)	Low	Critical — IAA appears high but labels are not independent	Suspicious IAA improvement without calibration improvement; IP address / submission timing analysis	Forensic investigation; remove colluding annotators; re-annotate affected batch
Golden set staleness (same items for > 6 months, answers memorised)	Medium	High — golden set QA becomes ineffective	Annotator accuracy suspiciously high (>0.97) on golden set	Rotate golden set items; suspend suspicious annotators pending investigation
Closed-loop verification failure (model does not improve)	Medium	Medium — annotation batch wasted; model not promoted	Closed-loop verifier run	Annotation quality investigation; may need to discard batch or re-annotate under revised spec
Dataset version misused in training (wrong version selected)	Low	High — model trained on incorrect data	Dataset version tracking in training pipeline with validation	MLflow/registry version pinning; pipeline validation step checking expected version

Cascading Failure Scenario

Task specification ambiguity → low IAA → high adjudication rate → adjudication backlog → annotations delayed → training pipeline starved → model not retrained for 3 months → model degrades silently in production
Mitigation: IAA monitoring on first 200 items fires within 24 hours of task launch; automatic task pause if IAA below threshold prevents backlog accumulation

14. Regulatory Considerations

Regulation	Specific Clause	Requirement	Implementation
EU AI Act	Article 10 §3 — Data set quality	Training, validation and testing data sets must be relevant, sufficiently representative and, to the best extent possible, free of errors and complete	IAA monitoring, golden set QA, bias detection, and closed-loop verification collectively provide evidence towards Article 10 §3
EU AI Act	Article 10 §2(f) — Data governance	Training data governance must include examination with regard to possible biases	Annotator bias detection; demographic analysis of label distributions; fairness testing of trained models
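EU AI Act	Article 12 — Record keeping	High-risk AI systems must allow automatic recording of events (logs) over their lifetime; training data sets, their provenance, and labelling procedures are documented under Article 11 and Annex IV	Annotation event logs, the annotation provenance schema, and the dataset version registry support Article 12 record keeping and the Annex IV technical documentation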
APRA CPS 234	§36 — Integrity of information	Training data must be protected from unauthorised modification	Append-only annotation store; access controls; audit logging
Privacy Act 1988 (Australia)	APP 11 — Security of personal information	Personal information in annotation items must be protected	Encryption; access controls; de-identification where feasible; annotator NDA
ISO 42001:2023	§8.3 — Data for AI systems	AI systems must address data quality and relevance	Annotation quality controls, IAA, and closed-loop verification satisfy ISO 42001 §8.3
NIST AI RMF	MAP 1.5 — Training data assessment	Training data must be assessed for quality and representativeness	Annotation batch quality report; IAA metrics; annotator diversity monitoring
GDPR Article 5(1)(d)	Data accuracy	Personal data must be accurate; steps must be taken to correct inaccurate data	Annotation quality controls prevent introduction of inaccurate labels into training data

15. Reference Implementations

AWS

Annotation Interface: Amazon SageMaker Ground Truth (managed annotation with workforce management)
Annotation Queue: SageMaker Ground Truth project queue or Amazon SQS for custom interface
IAA Scoring: Lambda function triggered by SQS or SageMaker callback
Annotation Store: Amazon RDS PostgreSQL
Ingestion Pipeline: AWS Glue job reading from RDS; writing to S3 as Parquet with Delta Lake
Training Data Store: Amazon S3 with AWS Glue Data Catalog
Closed-Loop Verifier: SageMaker Processing Job

Azure

Annotation Interface: Azure ML Data Labeling (managed) or Label Studio on Azure Container Apps
Annotation Store: Azure SQL Database
Ingestion Pipeline: Azure Data Factory pipeline; writing to Azure Data Lake Storage Gen2
Training Data Store: Azure ML Dataset with versioning
Closed-Loop Verifier: Azure ML Evaluation step in Azure ML Pipeline

GCP

Annotation Interface: Vertex AI Data Labeling Service or Label Studio on Cloud Run
Annotation Store: Cloud SQL PostgreSQL or Firestore
Ingestion Pipeline: Cloud Dataflow or Cloud Composer (Airflow)
Training Data Store: Google Cloud Storage + BigQuery for analytics
Closed-Loop Verifier: Vertex AI Evaluation step in Vertex AI Pipeline

On-Premises / Private Cloud

Annotation Interface: Label Studio (self-hosted on Kubernetes); open-source, full-featured
Annotation Store: PostgreSQL with full schema; append-only enforced by withholding UPDATE and DELETE privileges from the service role; pgaudit for audit logging
IAA Scoring: Python microservice computing Cohen's Kappa via scikit-learn
Ingestion Pipeline: Airflow DAG with dbt transformations
Training Data Store: MinIO (S3-compatible) with Delta Lake; MLflow Dataset Registry
Closed-Loop Verifier: Python evaluation job in Airflow; results logged to MLflow

16. Related Patterns

Pattern	ID	Relationship	Notes
Active Learning Loop	EAAPL-HIL002	Complementary — active learning determines which items to annotate; this pattern governs how	Active learning feeds the annotation queue; this pattern manages what happens inside the queue
Human Escalation Pattern	EAAPL-HIL003	Complementary — expert resolutions from escalation are high-quality annotation items	Resolved escalations can be routed to the annotation store as training labels
Collaborative AI Decision	EAAPL-HIL004	Complementary — human overrides from collaborative decisions are annotation signals	Override records feed annotation ingestion pipeline
Human Override Pattern	EAAPL-HIL006	Complementary — override events are natural annotation items	Override records with reason codes are annotation-quality training data
Hybrid Intelligence Pattern	EAAPL-HIL008	Dependency — hybrid intelligence requires well-designed annotation to measure human vs AI accuracy	Annotation quality determines the accuracy of human-AI performance comparison
Supervisor Agent	EAAPL-MAG002	Loosely related — supervisor agent quality review produces annotation-quality feedback	Agent supervisor outputs can be routed to annotation store for model improvement

17. Maturity Assessment

Overall Maturity Level: Proven

Dimension	Score (1–5)	Rationale
Technical Maturity	5	Annotation platforms (Label Studio, Scale AI, Labelbox), IAA algorithms, and ML pipelines are mature
Operational Maturity	3	Annotator management and quality operations are organisationally complex; most enterprises under-invest in QA operations
Governance Maturity	4	EU AI Act Article 10 directly requires training data governance; this pattern is one way to implement it
Tooling Ecosystem	5	Multiple mature open-source and commercial annotation platforms; strong ML framework support
Enterprise Adoption	4	Widely adopted in financial services and healthcare; quality management practices (golden set, bias detection) less mature outside ML-first organisations
Risk Profile	Medium	Primary risk is annotation quality degradation without detection; controlled with golden set monitoring and closed-loop verification

18. Revision History

Version	Date	Author	Changes
1.0	2026-06-12	EAAPL Working Group	Initial publication covering task design, annotator management, quality assurance, feedback storage schema, ingestion pipeline, and closed-loop verification
