
EAAPL-KNW006: Corpus Quality Assurance

Pattern ID: EAAPL-KNW006
Status: Proven
Complexity: Medium
Tags: observability, model-risk, traceability, medium-complexity
Version: 1.0
Last Updated: 2026-06-12

1. Executive Summary

Corpus Quality Assurance (CQA) is the automated pipeline that evaluates the fitness of documents for inclusion in an AI knowledge corpus — before ingestion and on a continuous basis after ingestion. It is the quality control function that stands between raw enterprise content and the retrieval systems that power AI answers.

This pattern defines six quality dimensions — completeness, accuracy, duplication, staleness, coverage, and structure — and specifies the automated measurement, threshold gating, and alerting mechanisms for each. It also covers the quality trend dashboard and the escalation path when automated quality assurance cannot make a determination.

For CIOs and CTOs, the core argument is: AI answer quality cannot exceed corpus quality. Teams investing in LLM selection, prompt engineering, and retrieval architecture while neglecting corpus quality are optimising the wrong variable. A well-tuned AI system on a poor corpus will outperform a poorly tuned system on a good corpus in the short term, but the corpus quality deficit compounds — AI answers degrade as documents age, duplicate, and diverge, while the AI model remains fixed. CQA is the ongoing quality investment that protects AI answer quality as the corpus grows and ages.

Implementation is medium complexity. Unlike knowledge graph or semantic layer patterns, CQA does not require new data infrastructure — it adds a quality measurement and gating layer to existing document ingestion pipelines.

2. Problem Statement

2.1 Business Problem

Enterprise knowledge corpora degrade without active management. Documents become outdated as policies and products change, but the old versions remain in the retrieval index — the AI continues citing superseded information. Duplicate documents accumulate as the same content is ingested from multiple sources with minor variations, causing inconsistent retrieval. Poorly formatted or truncated documents produce low-quality retrieval chunks that confuse rather than inform the LLM. The business consequence is AI answers that become less reliable over time, eroding user trust in proportion to the corpus quality deficit.

2.2 Technical Problem

Retrieval quality in RAG systems is directly determined by the quality of the documents retrieved. Standard vector similarity search has no quality awareness: a low-quality, outdated document with high semantic similarity to a query will be retrieved in preference to a high-quality, current document with slightly lower similarity. Without quality scores attached to documents and factored into retrieval ranking, quality degradation is invisible to the retrieval algorithm.
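A minimal sketch of what factoring quality into ranking could look like, assuming a linear blend of similarity and a stored composite quality score. The `Hit` type, the `rerank` function, and the `quality_weight` parameter are hypothetical names for illustration; this pattern does not prescribe a blending formula.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    similarity: float  # cosine similarity from the vector index, 0..1
    quality: float     # composite quality score from the quality store, 0..1


def rerank(hits: list[Hit], quality_weight: float = 0.3) -> list[Hit]:
    """Order hits by a linear blend of similarity and quality (assumed formula)."""
    def blended(hit: Hit) -> float:
        return (1 - quality_weight) * hit.similarity + quality_weight * hit.quality

    return sorted(hits, key=blended, reverse=True)


hits = [
    Hit("policy-v1-outdated", similarity=0.91, quality=0.20),
    Hit("policy-v3-current", similarity=0.88, quality=0.95),
]
ranked = rerank(hits)
```

With the default weight, the current document ranks first despite its slightly lower similarity. With `quality_weight=0.0` the ordering falls back to pure similarity, which is the quality-blind behaviour described above.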

2.3 Symptoms

AI cites version-superseded documents (e.g., a policy withdrawn 18 months ago)
Same question receives different answers on different days because duplicate documents with conflicting content are retrieved inconsistently
Truncated or corrupted documents appear in retrieval results; LLM produces incoherent answers for those queries
No metric exists for corpus health; the team does not know if quality is improving or declining
Coverage gaps are discovered reactively (users complain the AI "doesn't know" about a topic) rather than proactively

2.4 Cost of Inaction

AI adoption reversal: business units that experience repeated quality failures disengage and revert to manual research
Regulatory risk: AI answers based on superseded regulatory or compliance documents produce incorrect guidance with potential legal consequences
Compounding quality debt: the longer quality management is deferred, the more documents require remediation, and the larger the quality remediation project becomes
Lost insight: coverage gaps mean entire knowledge domains are unrepresented in AI answers — the system is not aware of what it doesn't know

3. Context

3.1 When to Apply

Any production RAG system with >500 documents — below this scale, manual review is feasible; above it, automation is necessary
Corpora with multiple document sources and types of varying quality (the diversity creates the quality variance that requires automated management)
Domains with regulatory compliance implications where document currency is essential (compliance, legal, product, medical)
Organisations with high document update velocity — quality degrades fastest where content changes frequently
As a companion to EAAPL-KNW003 (AI Knowledge Corpus Management) — CQA is the quality assurance function; KNW003 is the lifecycle management function

3.2 When NOT to Apply

Single-source corpora from a single authoritative owner with manual review already in place — CQA overhead is not justified
Pure experimental/prototype deployments where AI answer quality is not yet a business concern
Corpora updated in batch by a controlled process with built-in quality controls upstream — additional CQA layer may be redundant

3.3 Prerequisites

Document metadata standard: at minimum, source system, author, effective date, expiry date, document type
Document ingestion pipeline with an interception point where quality checks can be executed before final ingestion
Storage for quality scores and quality history per document
Alerting infrastructure for quality threshold violations

3.4 Industry Applicability

Industry	Applicability	Primary Quality Risk	Key Quality Dimension
Financial Services	Critical	Superseded regulatory documents	Staleness + Accuracy
Healthcare	Critical	Outdated clinical guidelines, drug information	Staleness + Accuracy
Legal	High	Superseded case law, outdated legislation references	Staleness + Completeness
Government	High	Policy version conflicts, outdated service information	Staleness + Duplication
Technology	Medium	Outdated product documentation, deprecated API references	Staleness + Completeness
Retail / CPG	Medium	Obsolete product specs, superseded compliance certs	Staleness + Duplication

4. Architecture Overview

The Corpus Quality Assurance architecture comprises two operational phases: Pre-Ingestion Quality Gating and Post-Ingestion Continuous Quality Monitoring, unified by a shared quality score store and health dashboard.

4.1 Pre-Ingestion Quality Gating Pipeline

Every document submitted for corpus ingestion passes through six quality checks in sequence:

Completeness Check. The completeness scorer evaluates whether the document is whole and self-contained. Checks include: minimum word count for the document type; absence of truncation indicators (sentences that end abruptly, missing conclusion sections, "Page X of Y" indicators suggesting missing pages); presence of expected structural elements for the document type (a policy document without a "Scope" or "Effective Date" section is flagged as incomplete); broken internal references (citations to sections that don't exist in the document). Completeness score: 0–1.
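The completeness checks can be sketched as a small scorer. This is an illustration only: the section list, the word minimum, and the equal weighting of the three checks are assumptions, and the truncation test is a crude heuristic rather than the full set of indicators listed above.

```python
# Hypothetical per-type requirements; real values would come from configuration.
REQUIRED_SECTIONS = {"policy": ["Scope", "Effective Date"]}
MIN_WORDS = {"policy": 200}


def completeness_score(text: str, doc_type: str) -> float:
    """Average of three pass/fail checks, giving a score between 0 and 1."""
    checks = []
    # 1. Minimum word count for the document type.
    checks.append(len(text.split()) >= MIN_WORDS.get(doc_type, 100))
    # 2. Expected structural elements are present (case-insensitive match).
    required = REQUIRED_SECTIONS.get(doc_type, [])
    checks.append(all(section.lower() in text.lower() for section in required))
    # 3. Crude truncation heuristic: the text ends with terminal punctuation.
    checks.append(text.rstrip().endswith((".", "!", "?")))
    return sum(checks) / len(checks)


full_doc = "Scope: all staff. Effective Date: 2026-01-01. " + "word " * 200 + "End."
truncated_doc = "Scope: all staff. This policy applies to"
```

Here `full_doc` passes all three checks and `truncated_doc` fails all three; a production scorer would likely weight the checks differently per document type.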

Accuracy Validation. Accuracy validation operates at two tiers. Tier 1 (automated): for documents in high-stakes domains, automated fact-claim extraction identifies specific factual assertions (percentages, thresholds, named entities with specific attributes) that can be cross-checked against a trusted reference source (a regulatory database, a product master data system, an authoritative ontology). Claims that contradict the reference source reduce the accuracy score. Tier 2 (human): a statistically sampled proportion of documents from high-stakes domains is routed to a human accuracy reviewer, a domain expert who validates a representative sample of claims against primary sources. The human sampling rate is configurable per domain (typically 2–10% for high-stakes domains; 0.1–0.5% for informational domains).
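One possible way to implement the Tier 2 sampling is deterministic, hash-based selection, so that the same document always gets the same routing decision. This is an assumption made for the sketch; plain random sampling would satisfy the text equally well. The rates below are hypothetical values chosen from within the quoted ranges.

```python
import hashlib

# Hypothetical rates within the ranges quoted above.
SAMPLING_RATE = {"high_stakes": 0.05, "informational": 0.002}


def route_to_human_review(doc_id: str, domain_tier: str) -> bool:
    """Select a fixed fraction of documents by hashing the document ID."""
    rate = SAMPLING_RATE[domain_tier]
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the decision depends only on the document ID and the configured rate, re-running the pipeline does not change which documents are sampled.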

Duplication Detection. Exact duplication is detected via cryptographic hash (SHA-256 of document content, normalised for whitespace and formatting). Near-duplicate detection uses cosine similarity of document-level embeddings: documents with similarity above a configurable threshold (default 0.95) are flagged as near-duplicates. For near-duplicate pairs, the deduplication strategy is configurable: reject the newer document (preserve canonical version), merge metadata (combine source attributions), or route to human review to determine which is authoritative. Exact duplicates are always rejected silently.
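The two detection stages can be sketched as follows. The normalisation here only collapses whitespace, which is an assumption about what "normalised for whitespace and formatting" covers, and the embeddings are passed in as plain lists so the example does not depend on any particular embedding library. `classify_duplicate` is a hypothetical helper name.

```python
import hashlib
import math
import re


def content_hash(text: str) -> str:
    """SHA-256 of the text after collapsing runs of whitespace."""
    normalised = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_duplicate(new_hash, new_vec, corpus, threshold=0.95):
    """corpus: list of (doc_id, hash, embedding). Returns (kind, matching doc_id)."""
    # Cheap exact check first: identical hash means identical normalised content.
    for doc_id, doc_hash, _ in corpus:
        if doc_hash == new_hash:
            return "exact", doc_id
    # Then the near-duplicate check on document-level embeddings.
    for doc_id, _, vec in corpus:
        if cosine_similarity(new_vec, vec) > threshold:
            return "near", doc_id
    return "none", None
```

The linear scan is for clarity only; at corpus scale the similarity lookup would normally go through the vector index rather than a Python loop.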

Staleness Evaluation. The staleness scorer evaluates document freshness relative to domain-specific maximum age thresholds. Thresholds are configured per document type and domain: regulatory instruments (12 months), internal policies (6 months), product technical specifications (3 months), market data summaries (1 week). The staleness score decays from 1.0 (fully fresh) to 0.0 (at maximum age) and goes negative once the document is past expiry; a document past expiry cannot be ingested. The score factors in not just age but also the velocity of change in the domain: regulatory areas with a recent burst of amendments require more frequent refresh.
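A minimal sketch of the age-based part of the score, assuming linear decay and treating "past expiry" as "older than the maximum age". Both are assumptions: the text does not fix the decay curve, and the domain-velocity adjustment is left out. Month thresholds are approximated in days.

```python
from datetime import date

# Maximum ages from the thresholds above, approximated in days.
MAX_AGE_DAYS = {
    "regulatory_instrument": 365,
    "internal_policy": 180,
    "product_spec": 90,
    "market_data_summary": 7,
}


def staleness_score(effective: date, today: date, doc_type: str) -> float:
    """1.0 when new, 0.0 at maximum age, negative once older (linear decay assumed)."""
    age_days = (today - effective).days
    return 1.0 - age_days / MAX_AGE_DAYS[doc_type]


def can_ingest(score: float) -> bool:
    """A document whose score has gone negative is past expiry and is blocked."""
    return score >= 0.0
```

For example, a product specification that took effect 45 days ago scores 0.5, and a market data summary that is 14 days old scores -1.0 and is blocked.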

Structural Integrity Check. The structural checker validates that the document can be processed by the downstream chunking pipeline. Checks: document is machine-readable (not a scanned image without OCR text layer); character encoding is valid UTF-8; no binary artefacts that would confuse chunking; minimum retrievable text content (>100 words of coherent prose). A structurally invalid document cannot produce useful retrieval chunks.
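Three of these checks are simple enough to sketch with the standard library. The NUL-byte test is only a rough proxy for binary artefacts, and the sketch does not attempt OCR-layer detection or any judgement of whether the prose is coherent; `structural_check` is a hypothetical function name.

```python
def structural_check(raw: bytes, min_words: int = 100) -> tuple[bool, str]:
    """Return (passed, reason) for encoding, binary artefacts, and text volume."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False, "invalid UTF-8"
    # NUL characters are used here as a simple proxy for binary artefacts.
    if "\x00" in text:
        return False, "binary artefacts"
    # The text must contain more than min_words whitespace-separated words.
    if len(text.split()) <= min_words:
        return False, "insufficient text"
    return True, "ok"
```

Returning a reason alongside the pass/fail flag matches the "Pass/fail with specific failure reason" output listed for this step in the data flow.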

Coverage Assessment. The coverage assessor checks whether the document adds genuine value to the corpus by mapping its content to the knowledge ontology. If the document's topic is already represented by at least N high-quality, current documents (where N is a configurable per-topic target), the new document's incremental value is low and it is deprioritised or queued for later ingestion. If the document covers a topic with fewer than N such documents, it is flagged as a coverage gap filler and prioritised.
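The prioritisation rule reduces to a count lookup once topics have been assigned. In this sketch the topic assignment itself is assumed to have happened upstream, `topic_counts` is assumed to count only high-quality, current documents, and the default of 3 for N is a hypothetical value.

```python
from collections import Counter


def coverage_decision(doc_topics: list[str], topic_counts: Counter, n: int = 3) -> str:
    """Prioritise the document if any of its topics has fewer than n documents."""
    if any(topic_counts[topic] < n for topic in doc_topics):
        return "prioritise"
    return "deprioritise"


topic_counts = Counter({"expense-policy": 7, "data-retention": 1})
```

A `Counter` returns 0 for topics it has never seen, so a document on a completely new topic is treated as a gap filler without special handling.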

Composite Quality Score and Gate. The six dimension scores are combined into a composite quality score using configurable weights per document type. Documents above the high-quality threshold are auto-ingested. Documents in the middle band enter a quality review queue, and the document owner is notified with specific dimension-level feedback. Documents below the minimum threshold are rejected with a detailed rejection report.
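A minimal sketch of the weighted combination and the three-way gate. The weights and both thresholds are hypothetical placeholders, not values recommended by this pattern, and the structural pass/fail result is assumed to be mapped to 1.0 or 0.0 before scoring.

```python
# Hypothetical weights for one document type; real values are configuration.
WEIGHTS = {
    "completeness": 0.20,
    "accuracy": 0.25,
    "duplication": 0.15,
    "staleness": 0.20,
    "structure": 0.10,
    "coverage": 0.10,
}
HIGH_THRESHOLD = 0.80     # assumed auto-ingest threshold
MINIMUM_THRESHOLD = 0.50  # assumed rejection threshold


def composite_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of the dimension scores, normalised by the weight total."""
    total = sum(weights.values())
    return sum(weights[dim] * scores[dim] for dim in weights) / total


def gate(score: float) -> str:
    """Route a document by composite score into one of three outcomes."""
    if score >= HIGH_THRESHOLD:
        return "auto_ingest"
    if score >= MINIMUM_THRESHOLD:
        return "review_queue"
    return "reject"
```

Normalising by the weight total keeps the composite in the 0 to 1 range even if a per-type weight matrix does not sum exactly to 1.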

4.2 Post-Ingestion Continuous Quality Monitoring

Quality degrades after ingestion as time passes and the broader knowledge landscape changes. The continuous monitoring layer runs scheduled jobs to re-evaluate the active corpus.

Freshness Monitor runs daily: re-scores all documents' staleness dimension; flags documents approaching the pre-expiry warning threshold; triggers automated expiry at the hard threshold.

Recall Probe runs weekly: executes a golden query set against the active corpus; measures recall@k for each query category; a decline in recall for a specific category indicates that the documents covering that topic have degraded in quality or have been removed.
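The recall@k measurement itself is a small calculation. The sketch below assumes each golden query carries a category label and a set of known relevant document IDs; the tuple layout and function names are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def recall_by_category(results: list[tuple[str, list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """results: (category, retrieved_ids, relevant_ids) per golden query."""
    per_category: dict[str, list[float]] = {}
    for category, retrieved, relevant in results:
        per_category.setdefault(category, []).append(recall_at_k(retrieved, relevant, k))
    # Mean recall@k per query category.
    return {cat: sum(vals) / len(vals) for cat, vals in per_category.items()}
```

Comparing the per-category means against the previous run is what surfaces a decline in a specific topic area.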

Duplication Drift Monitor runs weekly: checks for near-duplicates introduced since the last run; newly ingested documents are compared against the existing corpus; cross-document contradictions (same topic, conflicting factual claims) are detected and flagged.

Coverage Gap Monitor runs monthly: maps the active corpus against the knowledge ontology; identifies topics with declining document counts (documents aged out without replacement); generates a prioritised ingestion backlog for the content management team.

4.3 Quality Score Store and History

All quality scores (pre-ingestion and post-ingestion re-evaluation) are stored per document with timestamps. This enables quality trend analysis: is a domain's average quality improving or declining? Are specific document types consistently failing particular quality dimensions? The quality history also supports root cause analysis when AI answer quality issues are investigated.

5. Architecture Diagram


flowchart TD
    subgraph Submission["Document Submission"]
        DOC[Incoming Document]
        META[Document Metadata\nType · Source · Dates]
    end
    subgraph PreIngest["Pre-Ingestion Quality Gate"]
        COMP[Completeness\nScorer]
        ACC[Accuracy\nValidator]
        DUP[Duplication\nDetector]
        STALE[Staleness\nEvaluator]
        STRUCT[Structural\nIntegrity Check]
        COV[Coverage\nAssessor]
        DOC --> COMP --> ACC --> DUP --> STALE --> STRUCT --> COV
    end
    subgraph Scoring["Composite Scoring + Gate"]
        CSCORE[Composite Quality\nScore Calculation]
        GATE{Score\nThreshold}
        AUTO[Auto-Ingest\nHigh Quality]
        REVIEW[Quality Review\nQueue with Feedback]
        REJECT[Rejection\nReport to Owner]
        COV --> CSCORE --> GATE
        GATE -->|High| AUTO
        GATE -->|Medium| REVIEW
        GATE -->|Low| REJECT
    end
    subgraph Store["Quality Score Store"]
        QHIST[Quality History\nPer Document × Dimension × Timestamp]
        AUTO --> QHIST
        REVIEW -->|Owner resolves| AUTO
    end
    subgraph Corpus["Active Corpus + Vector Index"]
        ACTIVE[(Active Corpus\nDocument Store)]
        VECIDX[(Vector Index\nActive Chunks)]
        AUTO --> ACTIVE
        ACTIVE --> VECIDX
    end
    subgraph Continuous["Continuous Monitoring"]
        FRESH[Daily Freshness\nMonitor]
        RECALL[Weekly Recall\nProbe - Golden Set]
        DUPMON[Weekly Duplication\nDrift Monitor]
        COVMON[Monthly Coverage\nGap Monitor]
        ACTIVE --> FRESH & DUPMON
        VECIDX --> RECALL
        ACTIVE --> COVMON
    end
    subgraph Dashboard["Quality Health Dashboard"]
        TREND[Quality Score\nTrend by Domain]
        COVMAP[Coverage Map\nOntology vs Corpus]
        QUEUE[Review Queue\nDepth + SLA Status]
        RECALLT[Recall Trend\nby Query Category]
        FRESH & RECALL & DUPMON & COVMON --> QHIST
        QHIST --> TREND & COVMAP & QUEUE & RECALLT
    end
    META --> STALE
    META --> COV

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Completeness Scorer	Processing	Evaluate document wholeness: structure, word count, truncation	Custom Python scorer; readability libraries; document structure parser	High
Accuracy Validator	AI + Human workflow	Automated fact-claim extraction + reference cross-check; human sampling for high-stakes domains	Custom NLP pipeline; spaCy; LLM-based claim extractor; human review workflow	High
Duplication Detector	Processing	Exact hash deduplication; near-duplicate embedding similarity	SHA-256 hash; document embedding (Sentence Transformers); cosine similarity threshold	High
Staleness Evaluator	Processing	Age-based freshness scoring with domain-specific thresholds; expiry enforcement	Custom scorer; metadata date arithmetic; domain threshold configuration	Critical
Structural Integrity Checker	Processing	Validate machine-readability, encoding, minimum text content	PyMuPDF (PDF validation), python-magic (format detection), character encoding detection	High
Coverage Assessor	Processing	Map document to ontology; assess incremental value against existing corpus coverage	Topic modelling (BERTopic); ontology lookup; document count per topic	Medium
Composite Score Calculator	Processing	Weighted combination of six dimension scores; apply threshold decision	Custom Python service; configurable weight matrix per document type	Critical
Quality Review Queue	Workflow	Route medium-quality documents to owners; track remediation SLA	Custom workflow app; Jira integration; email notification	High
Quality Score Store	Storage	Persist quality scores with history; support trend analysis	PostgreSQL with time-series extension; InfluxDB for metrics; DynamoDB	High
Freshness Monitor	Scheduler	Daily re-evaluation of staleness scores; expiry flagging	Kubernetes CronJob; AWS Lambda; Airflow DAG	Critical
Recall Probe	Scheduler	Weekly golden query set recall@k measurement	Custom Python evaluation job; Ragas framework	High
Duplication Drift Monitor	Scheduler	Weekly scan for newly introduced duplicates	Custom job using document embedding similarity	Medium
Coverage Gap Monitor	Scheduler	Monthly ontology coverage analysis	Custom analytics job; ontology API integration	Medium
Quality Health Dashboard	Observability	Unified view of all quality dimensions across domains	Grafana + custom metrics; Tableau; Metabase	Medium

7. Data Flow

7.1 Primary Data Flow — Pre-Ingestion Quality Gate

Step	Actor	Action	Output
1	Document Source	Submits document + metadata to quality gate API	Document file + metadata
2	Completeness Scorer	Evaluates structural completeness, word count, truncation	Completeness score 0–1
3	Accuracy Validator	Extracts factual claims; cross-checks against reference sources	Accuracy score 0–1; list of unverified claims
4	Duplication Detector	Computes hash; compares against existing corpus embeddings	Duplicate flag (exact/near/none); similar document IDs if near-duplicate
5	Staleness Evaluator	Computes age against domain threshold; returns freshness score	Staleness score (1.0 fresh, 0.0 at maximum age, negative past expiry); expiry flag if past threshold
6	Structural Integrity Checker	Validates encoding, format, minimum text	Pass/fail with specific failure reason
7	Coverage Assessor	Maps to ontology; counts existing documents on same topic	Incremental coverage value score; topic assignments
8	Composite Calculator	Applies dimension weights; computes composite score	Composite quality score 0–1
9	Quality Gate	Routes document by composite score	Auto-ingest / review queue / reject
10	Quality Score Store	Persists all dimension scores and composite with document ID and timestamp	Quality record written
11	Active Corpus	Document chunked, embedded, ingested into vector index	Corpus updated

7.2 Error Flow

Error	Detection	Recovery	Escalation
Reference source unavailable (accuracy validator cannot cross-check)	HTTP timeout / API error from reference source	Fall back to reduced accuracy check (claim extraction only, no cross-check); flag document for human accuracy review	Alert operations; reference source SLA breach
Embedding generation failure (duplication detector)	Embedder exception	Retry ×3; skip similarity deduplication (still run hash deduplication); flag for re-check on next batch run	Alert ingestion pipeline team
Review queue SLA breach (documents not remediated within SLA)	Automated SLA monitoring job	Escalation notification to domain data steward and corpus governance	Corpus governance intervention; temporary threshold adjustment if volume overwhelms capacity
Staleness expiry with no replacement document	Freshness monitor flags; no new version ingested	Remove from active corpus; generate gap alert in coverage dashboard	Content management team notified to source replacement
Recall probe decline (quality issue not caught by pre-ingestion)	Recall@k below threshold	Identify recently ingested documents; trigger retrospective quality audit	Review quality gate thresholds; investigate specific query category failures

8. Security Considerations

8.1 Authentication and Authorisation

The quality gate API authenticates document submissions using the same source authentication mechanism as the corpus management pipeline. Quality scores are internal operational data and are accessible to corpus administrators, data stewards, and the AI platform engineering team. Quality history records for a specific document are accessible to the document owner. The quality review queue interface requires MFA-enabled SSO.

8.2 Secrets Management

Reference source API credentials (for accuracy cross-checking), embedding model API keys (for duplication detection), and quality score database credentials are stored in a secrets vault with standard rotation.

8.3 Data Classification

Quality scores are metadata and carry the same classification as the document they describe. A quality report for a Confidential document is itself Confidential. The quality health dashboard aggregates are typically Internal classification (no individual document details).

8.4 Encryption

Quality score store: encrypted at rest (AES-256). Data in transit: TLS 1.3. Accuracy reviewer interface: HTTPS-only with session token management. Documents processed by the quality pipeline are processed in-memory only where possible; no sensitive content written to intermediary disk storage.

8.5 Auditability

All quality gate decisions are logged: document ID, timestamp, submitter identity, all dimension scores, composite score, gate decision (auto-ingest/review/reject), rejection reason if applicable. For documents entering the review queue, the reviewer's identity, decision, and timestamp are logged. This audit trail enables investigation of any document's quality history.

8.6 OWASP LLM Top 10 Mapping

OWASP LLM Risk	Relevance	Mitigation
LLM01 Prompt Injection	Adversarial documents could embed prompt injection content; quality gate is the first line of defence	Structural integrity check rejects documents with instruction-like patterns; content sanitisation before quality scoring
LLM02 Insecure Output Handling	Quality gate LLM components (accuracy validator) could produce unsafe outputs	Structured output format for all LLM quality assessments; no free-form output from quality LLMs
LLM03 Training Data Poisoning	Low-quality or adversarial documents that pass the quality gate pollute the corpus	Quality gate is the primary control; post-ingestion recall probe detects degradation from poisoned documents
LLM04 Model Denial of Service	Adversarially complex documents (extremely large, pathological encoding) could exhaust quality gate compute	Maximum document size limit; processing timeout per quality check; reject documents exceeding limits
LLM05 Supply Chain Vulnerabilities	Reference sources used for accuracy validation could be compromised	Reference source authentication; cross-check against multiple independent reference sources for critical claims
LLM06 Sensitive Information Disclosure	Accuracy validation process exposes document content to external reference APIs	On-premises or private reference sources for sensitive domains; no external API calls for Restricted documents
LLM07 Insecure Plugin Design	Reference source connectors in accuracy validator	Source connector allowlist; input validation; read-only connector access
LLM08 Excessive Agency	Quality gate automation could auto-reject valid documents at scale	Quality gate generates recommendations; escalation to human reviewer for borderline decisions
LLM09 Overreliance	Teams trust quality scores as absolute measures of document quality	Quality scores are indicators, not guarantees; human review programme for high-stakes domains; score interpretation guidance
LLM10 Model Theft	Quality scoring models encode domain knowledge	Quality model artefacts access-controlled; no external API exposure of quality scoring logic

9. Governance Considerations

9.1 Responsible AI

Quality thresholds are value judgements encoded as configuration. A high completeness threshold that rejects documents with non-standard formatting may systematically exclude content from certain sources or geographies that format differently. Quality threshold calibration should include a bias audit: do the thresholds disproportionately exclude any legitimate document types, sources, or domain perspectives? Results are reviewed annually.

9.2 Model Risk Management

The accuracy validator's claim extraction and reference cross-checking component is a model. Its false negative rate (claims it fails to flag as inaccurate) determines the probability of inaccurate facts reaching the corpus. This model is subject to model risk management: model card documenting training data, precision/recall on validation set per claim type, known failure modes, and refresh schedule. The human accuracy review sampling programme provides an independent validation signal.

9.3 Human Approval Gates

All documents in the quality review queue require human action: either the document owner remediates the quality issue and resubmits, or the document is permanently rejected. Rejected documents cannot be automatically re-submitted — re-submission requires the owner to explicitly acknowledge the original rejection reason. Human accuracy reviewers for high-stakes domains complete competency validation before being assigned review tasks.

9.4 Policy Ownership

Quality threshold policy (minimum dimension scores, composite weights, human review sampling rates, domain-specific freshness schedules) is owned by the Corpus Governance Board. Quality threshold changes require a 10-business-day review period and simulation of the impact on the existing corpus (what percentage of currently active documents would fail under the new thresholds). Threshold changes that would invalidate >5% of the active corpus require executive approval.
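The impact simulation can be sketched as a pass over the stored composite scores. This assumes the proposed change is a new minimum composite threshold and that current scores are available per document; the function and field names are hypothetical.

```python
def simulate_threshold_change(active_scores: dict[str, float], new_minimum: float) -> dict:
    """Report which active documents would fail under a proposed minimum score."""
    failing = [doc_id for doc_id, score in active_scores.items() if score < new_minimum]
    percent = 100.0 * len(failing) / len(active_scores) if active_scores else 0.0
    return {
        "failing_docs": failing,
        "percent_invalidated": percent,
        # Changes invalidating more than 5% of the corpus need executive approval.
        "needs_executive_approval": percent > 5.0,
    }
```

A change to dimension-level thresholds or weights would need the per-dimension history rather than the composite alone, but the reporting shape would be the same.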

9.5 Traceability

Every document in the active corpus has a complete quality history: all dimension scores at each quality gate evaluation, all continuous monitoring scores, all human review decisions, and the current quality score. This history supports both root cause analysis (why did AI answer quality degrade in domain X?) and compliance reporting (confirm that all documents in the corpus met quality standards at ingestion).

9.6 Governance Artefacts

Artefact	Owner	Frequency	Location
Quality threshold policy	Corpus Governance Board	Annual review; ad-hoc for significant incidents	Policy management system
Accuracy validator model card	ML Engineering	Per model version	ML model registry
Quality bias audit report	Data Governance	Annual	Data governance platform
Human accuracy review report	Domain Data Stewards	Monthly	Governance dashboard
Coverage gap prioritised backlog	Content Management + Domain Stewards	Monthly	Content management system
Quality health monthly report	Corpus Operations	Monthly	Governance dashboard

10. Operational Considerations

10.1 Monitoring and SLOs

Metric	SLO Target	Alerting Threshold	Tool
Pre-ingestion gate latency	≤15 min per document (automated checks)	>60 min for any document in automated pipeline	Pipeline monitoring
Quality review queue clearance	100% cleared within 3 business days	Any item >2 business days	Workflow SLA alert
Active corpus average quality score	≥0.78 composite across all domains	<0.70 in any domain	Quality dashboard
Stale document rate	<3% of active corpus	>8%	Daily freshness monitor metric
Recall@5 on golden query set	≥0.88	<0.82	Weekly recall probe
Duplication rate (near-duplicates in active corpus)	<2%	>5%	Weekly duplication monitor
Human accuracy review false negative rate	<1% on sampled documents	>2% in quarterly audit	Quality audit programme

10.2 Logging

All quality gate events are logged as structured JSON: {document_id, timestamp, source, dimension_scores{}, composite_score, gate_decision, rejection_reason, reviewer_id}. Continuous monitoring events: {run_id, timestamp, check_type, documents_evaluated, alerts_generated}. Recall probe results: {run_id, timestamp, query_category, recall_at_k, threshold, pass_fail}. Log retention: 90 days operational; 7 years archive.
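As an illustration, a quality gate event with the fields listed above could be serialised like this. The helper name, the ISO 8601 timestamp format, and the use of `null` for absent optional fields are assumptions, not part of the pattern.

```python
import json
from datetime import datetime, timezone


def gate_event(document_id, source, dimension_scores, composite_score,
               gate_decision, rejection_reason=None, reviewer_id=None) -> str:
    """Build one structured JSON log line for a quality gate decision."""
    event = {
        "document_id": document_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "dimension_scores": dimension_scores,
        "composite_score": composite_score,
        "gate_decision": gate_decision,
        "rejection_reason": rejection_reason,  # None serialises to null
        "reviewer_id": reviewer_id,
    }
    return json.dumps(event, sort_keys=True)


line = gate_event("doc-42", "sharepoint", {"completeness": 0.9}, 0.84, "auto_ingest")
```

Emitting one JSON object per line keeps the events easy to ship to a log aggregator and to query by field.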

10.3 Incident Management

P1: Recall probe shows recall@5 below 0.70. Response: immediate investigation of recently ingested documents; potential rollback of the affected ingestion batch.
P2: Average corpus quality score below threshold in a critical domain (compliance, medical). Response: same-day quality audit; halt new ingestion until the root cause is identified.
P3: Review queue SLA breach or single-domain coverage gap. Response: next business day.

10.4 Disaster Recovery

Scenario	RTO	RPO	Recovery Procedure
Quality gate service unavailable	30 min (container restart; stateless)	N/A (stateless)	Restart; documents queued during outage re-processed
Quality score store unavailable	1 hour (replica promotion)	5 min	Promote read replica; validate recent score retrieval
Reference source unavailable (accuracy check)	N/A (degraded mode)	N/A	Fall back to accuracy-check-disabled mode; flag all documents in this period for human review
Quality pipeline misconfiguration (wrong thresholds)	2 hours (configuration rollback)	Last configuration version	Roll back threshold configuration; re-evaluate documents processed under wrong configuration

10.5 Capacity Planning

Quality gate processing is CPU-intensive for large documents (structural parsing, embedding generation for deduplication). At high ingestion rates (>1,000 documents per day), parallelise quality gate workers with a job queue. The quality score store grows at approximately 1 KB per quality evaluation per document; a corpus of 100,000 documents with monthly re-evaluation accumulates ~1.2 GB per year — manageable.
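The storage estimate can be reproduced with a short calculation, using decimal units (1 GB = 1,000,000 KB) and the 1 KB per evaluation figure quoted above. The function name is hypothetical.

```python
def annual_store_growth_gb(documents: int, evaluations_per_year: int,
                           kb_per_evaluation: float = 1.0) -> float:
    """Yearly growth of the quality score store in decimal gigabytes."""
    return documents * evaluations_per_year * kb_per_evaluation / 1_000_000


monthly = annual_store_growth_gb(100_000, 12)   # monthly re-evaluation
daily = annual_store_growth_gb(100_000, 365)    # one record per daily re-score
```

Monthly re-evaluation gives about 1.2 GB per year, matching the figure above. If every daily freshness re-score were persisted as its own record, the same corpus would accumulate about 36.5 GB per year under the same 1 KB assumption, so the retention granularity of monitoring scores is worth deciding explicitly.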

11. Cost Considerations

11.1 Cost Drivers

Cost Driver	Description	Typical Range
Embedding generation (deduplication)	Per-document embedding for near-duplicate detection	$0.0001–$0.001 per document
Accuracy validator LLM calls	Claim extraction per document in high-stakes domains	$0.01–$0.10 per document in high-stakes domains
Reference source API costs	External API calls for claim cross-checking	Variable; $0–$0.05 per cross-checked claim
Human accuracy reviewer labour	Domain expert time for sampled human review	$15–$75 per reviewed document depending on domain complexity
Quality score store	PostgreSQL-equivalent; modest size	$100–$500/month
Continuous monitoring compute	Scheduled jobs (freshness, recall, deduplication, coverage)	$200–$1,000/month

11.2 Scaling Risks

Human accuracy review is the primary cost scaling risk: if the high-stakes document volume grows and sampling rates are maintained, review labour grows proportionally
Accuracy validator LLM cost can be significant for large, complex documents with many factual claims — restrict deep accuracy validation to genuinely high-stakes domains
Recall probe cost scales with golden query set size and retrieval computation — keep golden set to 200–500 representative queries

11.3 Optimisations

Hash deduplication is effectively free (CPU-only), so always run it before embedding-based near-duplicate detection
Tier accuracy validation by document classification: Restricted documents get full claim extraction and human review; Internal documents get automated checks only
Cache quality scores for documents that have not changed between evaluation runs (hash-based change detection)
Use smaller embedding models for deduplication (the absolute embedding values matter less than the similarity ranking)
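The hash-based change detection mentioned in the caching optimisation can be sketched as below. `QualityScoreCache` is a hypothetical in-memory class; a real deployment would keep the hash alongside the score in the quality score store. Time-dependent dimensions such as staleness should not be cached this way, since they change even when the content does not.

```python
import hashlib
from typing import Callable


class QualityScoreCache:
    """Reuse a content-derived score while the document bytes are unchanged."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[str, float]] = {}

    def get_or_compute(self, doc_id: str, content: bytes,
                       scorer: Callable[[bytes], float]) -> float:
        digest = hashlib.sha256(content).hexdigest()
        cached = self._entries.get(doc_id)
        # Cache hit only if the stored hash matches the current content.
        if cached is not None and cached[0] == digest:
            return cached[1]
        score = scorer(content)
        self._entries[doc_id] = (digest, score)
        return score
```

The scorer is only invoked when the content hash changes, which is where the saving on embedding and LLM calls comes from.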

11.4 Indicative Cost Ranges

Corpus Scale	Monthly QA Infrastructure Cost	Annual Total (incl. human review)
Small (10K docs, 500 new/month)	$300–$800	$20,000–$60,000
Medium (100K docs, 5K new/month)	$2,000–$6,000	$100,000–$300,000
Large (1M+ docs, 50K new/month)	$15,000–$50,000	$500,000–$2,000,000

12. Trade-Off Analysis

12.1 Quality Gate Strictness Options

Option	Strengths	Weaknesses	Best For
Strict gate (high thresholds, manual review for borderline)	Maximum corpus quality; low false-positive rate (bad docs in corpus)	Slow ingestion; review queue bottleneck; risk of under-populated corpus	High-stakes domains (medical, legal, regulatory) where quality > coverage
Permissive gate (lower thresholds, auto-ingest most content)	Fast ingestion; high corpus coverage	Higher false-positive rate; lower average quality; more cleanup required	Informational domains where coverage > quality; very high document volume
Adaptive gate (thresholds calibrated per document type and domain)	Optimised quality/coverage trade-off per domain	Complex configuration; requires ongoing calibration	Recommended for most enterprise deployments with diverse document types

12.2 Accuracy Validation Approaches

Approach	Accuracy	Cost	Speed	Best For
Automated claim extraction + reference cross-check	Medium — reference source may not cover all claims	Medium	Fast (minutes)	Domains with authoritative machine-readable reference sources
LLM-based plausibility check (no reference source)	Low — LLM may hallucinate; cannot substitute for ground truth	Low	Very fast	Quick screening; flag obviously wrong claims for human review
Human domain expert review (full document)	Highest	High	Slow (hours–days)	Critical high-stakes documents; small volume
Statistical sampling with human review	High for sampled documents; inference to corpus quality	Medium	Manageable at scale	Enterprise-scale quality assurance programme

12.3 Architectural Tensions

Tension	Option A	Option B	Recommended Resolution
Quality gate latency vs. thoroughness	Fast checks only (structural + hash dedup) for near-real-time ingestion	Full six-dimension check for maximum quality assurance	Tiered: fast checks for real-time ingestion path; deep checks in parallel async job; gate on fast checks immediately, gate on deep checks within 15 minutes
Centralised vs. distributed quality assessment	Single centralised CQA service for all corpus types	Domain-specific quality services with domain-tuned thresholds	Centralised framework and tooling; domain-configurable thresholds and reference sources within the framework
Automated vs. human quality decisions	Fully automated quality gating	Human review for all documents	Automation for clear cases (score far above or below threshold); human for borderline band; human sampling for quality audit
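The recommended resolution for automated versus human decisions amounts to a three-way routing rule around the threshold. A minimal sketch, assuming a single composite score on a 0.0 to 1.0 scale; the `band` width of 0.05 is a hypothetical value, not one specified by this pattern:

```python
from enum import Enum


class Decision(Enum):
    AUTO_ACCEPT = "auto_accept"
    HUMAN_REVIEW = "human_review"
    AUTO_REJECT = "auto_reject"


def route(score: float, threshold: float, band: float = 0.05) -> Decision:
    """Automate clear cases and send the borderline band to human review."""
    if band < 0:
        raise ValueError("band must be non-negative")
    if score >= threshold + band:
        return Decision.AUTO_ACCEPT   # clearly above threshold
    if score < threshold - band:
        return Decision.AUTO_REJECT   # clearly below threshold
    return Decision.HUMAN_REVIEW      # within the borderline band
```

Widening the band shifts work to the review queue, and narrowing it shifts risk to the automated gate, so the band width is itself a setting that would sit under the same governance as the threshold.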

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Threshold misconfiguration (too lenient)	Medium	High — low-quality documents enter active corpus	Recall probe degradation; user-reported quality issues	Retrospective quality audit; purge documents below actual threshold; recalibrate thresholds
Threshold misconfiguration (too strict)	Medium	Medium — legitimate documents excluded; corpus coverage gaps	Coverage gap monitor; domain steward reports missing knowledge	Threshold recalibration; re-submit previously rejected documents
Accuracy validator high false negative rate	Medium	High — inaccurate facts enter corpus	Human accuracy review sampling detects discrepancy	Model retraining; increase human review sampling rate until model is fixed
Staleness monitor missed expiry	Low	High — outdated documents remain active	User-reported stale AI answers; periodic manual audit	Manual sweep of domain; fix monitoring job; add regression test
Coverage gap not detected (ontology coverage map outdated)	Medium	Medium — AI doesn't know about topic	User complaints; ontology gap identified manually	Ontology refresh; update coverage monitor; proactive gap-filling ingestion
Recall probe false positives (golden set stale)	Medium	Medium — alerts on correct behaviour; team distrust of monitoring	Golden set review identifies outdated expected results	Quarterly golden set refresh with domain expert validation
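Several of the detection mechanisms in this table depend on a recall probe run against a golden set. One way such a probe could be computed is sketched below; `retrieve` stands in for whatever retrieval call the deployment exposes, and the golden set structure is an assumption made for illustration.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    """Fraction of expected document IDs that appear in the top-k results."""
    if not expected:
        raise ValueError("expected set must not be empty")
    return len(set(retrieved[:k]) & expected) / len(expected)


def mean_recall(
    golden_set: list[tuple[str, set[str]]],   # (query, expected document IDs)
    retrieve: Callable[[str], list[str]],     # hypothetical retrieval function
    k: int = 5,
) -> float:
    """Average recall@k across all golden-set queries."""
    if not golden_set:
        raise ValueError("golden set must not be empty")
    scores = [recall_at_k(retrieve(query), expected, k) for query, expected in golden_set]
    return sum(scores) / len(scores)
```

A drop in this mean can indicate corpus degradation, but as the last row notes, it can equally indicate a stale golden set, so the expected IDs need the same periodic review as the corpus.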

13.1 Cascading Failure Scenarios

Scenario 1: Quality Gate Configuration Drift. Over 18 months, quality thresholds are relaxed in small increments to keep pace with ingestion volume pressures. No single change is significant enough to trigger governance review. Average corpus quality declines from 0.82 to 0.68, and AI answer quality declines with it. Detection occurs when a business unit escalates multiple AI errors in a high-profile project. Root cause analysis reveals the threshold drift. Resolution requires: threshold reset to original values; corpus-wide retrospective quality re-evaluation; purge of documents now below threshold; 3-month remediation project.
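The drift in Scenario 1 goes unnoticed because each change is judged only against the previous value. A guard that also compares the current threshold with the original baseline could look like the following sketch; the tolerance values are hypothetical and would be set by the policy owner.

```python
def drift_report(
    history: list[float],          # threshold values over time, oldest first
    max_step: float = 0.02,        # assumed tolerance for any single relaxation
    max_cumulative: float = 0.05,  # assumed tolerance for total relaxation from baseline
) -> dict:
    """Flag both large single relaxations and small ones that add up over time."""
    if len(history) < 2:
        raise ValueError("need at least two threshold values")
    steps = [later - earlier for earlier, later in zip(history, history[1:])]
    # Relaxations are decreases, so negate the steps to get positive magnitudes.
    largest_relaxation = max(0.0, max(-step for step in steps))
    cumulative_relaxation = history[0] - history[-1]
    return {
        "largest_single_relaxation": largest_relaxation,
        "cumulative_relaxation": cumulative_relaxation,
        "step_alert": largest_relaxation > max_step,
        "cumulative_alert": cumulative_relaxation > max_cumulative,
    }
```

In a case like the scenario, every individual step stays under the per-change tolerance while the cumulative check fires, which is the signal that should trigger governance review.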

Scenario 2: Reference Source Compromised. The external regulatory database used as the accuracy validation reference source is compromised, and incorrect threshold values are injected for a compliance domain. The accuracy validator cross-checks document claims against the now-incorrect reference and accepts inaccurate documents. The AI begins providing incorrect compliance guidance. Detection: the compliance team notices that AI answers contradict known regulatory requirements. Resolution: immediate removal of affected documents from the corpus; reference source integrity investigation; temporary switch to human accuracy review until the reference source is validated clean; addition of a cross-check against a secondary reference source.
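The secondary cross-check in the Scenario 2 resolution can be sketched as a rule that only trusts a reference value when two independent sources agree. The dictionaries below stand in for real reference source lookups, and the outcome labels are illustrative assumptions.

```python
from typing import Mapping


def validate_claim(
    claim_key: str,
    claimed_value: str,
    primary: Mapping[str, str],    # hypothetical primary reference lookup
    secondary: Mapping[str, str],  # hypothetical independent secondary lookup
) -> str:
    """Validate a document claim only when both reference sources agree."""
    primary_value = primary.get(claim_key)
    secondary_value = secondary.get(claim_key)
    if primary_value is None or secondary_value is None:
        return "human_review"        # claim not covered by both sources
    if primary_value != secondary_value:
        return "reference_conflict"  # sources disagree: investigate integrity
    return "accepted" if claimed_value == primary_value else "rejected"
```

A disagreement between the sources is treated as an integrity signal about the references themselves rather than a verdict on the document, so the claim is held back instead of being accepted against a possibly compromised value.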

14. Regulatory Considerations

Regulation	Relevant Clause	Requirement	How CQA Addresses It
APRA CPS 234	§20 (Classification of Information Assets)	Information assets classified by criticality and sensitivity	Quality scorer classifies each document by domain criticality; quality thresholds calibrated to criticality
APRA CPS 230	§33 (Information Management Obligations)	Framework for managing information quality in material systems	CQA is the documented quality management framework for AI knowledge assets
Australian Privacy Act 1988	APP 10 (Quality of Personal Information)	Take reasonable steps to ensure personal information is accurate, up-to-date, complete	Accuracy and staleness dimensions directly address APP 10 for any corpus documents containing personal information
EU AI Act	Article 10(2) (Data Governance)	Training, validation and testing data sets must be subject to appropriate data governance and management practices	Six-dimension quality gate + continuous monitoring constitutes documented data governance practices
EU AI Act	Article 10(3) (Data Quality)	Data sets used in high-risk AI must be relevant, sufficiently representative and, to the best extent possible, free of errors and complete	Accuracy validator + human review sampling address the accuracy requirement; completeness and staleness dimensions address completeness and currency
ISO/IEC 42001	Annex A, control A.7.4 (Quality of Data for AI Systems)	Organisations must address data quality in AI system lifecycle management	CQA pipeline is the data quality management implementation for the AI knowledge corpus
NIST AI RMF	MEASURE 2.5 (AI Data Quality)	Identify and measure AI system data quality risks and limitations	Quality score dimensions, trend monitoring, and coverage gap analysis directly address this measure

15. Reference Implementations

15.1 AWS

Component	AWS Service
Quality gate pipeline	AWS Step Functions (orchestration) + Lambda (individual quality checks)
Structural integrity check	Lambda + PyMuPDF/python-magic
Document embedding (deduplication)	Amazon Bedrock Titan Embeddings
Similarity search (deduplication)	OpenSearch k-NN
Accuracy reference source	Custom Lambda + external API or Amazon Kendra (knowledge source)
Quality score store	Amazon RDS PostgreSQL
Review queue	SQS + custom React UI + SES email notifications
Continuous monitoring	EventBridge Scheduler + Lambda
Dashboard	Amazon Managed Grafana

15.2 Azure

Component	Azure Service
Quality gate pipeline	Azure Logic Apps + Azure Functions
Document embedding (deduplication)	Azure OpenAI Embeddings
Similarity search	Azure AI Search (vector)
Accuracy validation	Azure AI Language + custom reference source API
Quality score store	Azure SQL Database
Review queue	Azure Service Bus + Power Apps
Continuous monitoring	Azure Functions with Timer trigger
Dashboard	Azure Monitor + Grafana

15.3 GCP

Component	GCP Service
Quality gate pipeline	Cloud Workflows + Cloud Functions
Document embedding	Vertex AI Embeddings
Similarity search	Vertex AI Vector Search
Quality score store	Cloud SQL PostgreSQL
Continuous monitoring	Cloud Scheduler + Cloud Functions
Dashboard	Cloud Monitoring + Grafana

15.4 On-Premises

Component	Technology
Quality gate pipeline	Apache Airflow DAG
Structural integrity	Python + PyMuPDF + chardet
Document embedding	Sentence Transformers on GPU
Deduplication similarity	Qdrant or pgvector for similarity search
Accuracy validation	spaCy NLP + custom reference source API
Quality score store	PostgreSQL
Review queue	Custom Flask app + email notifications
Dashboard	Prometheus + Grafana

16. Related Patterns

Pattern ID	Pattern Name	Relationship Type	Notes
EAAPL-KNW003	AI Knowledge Corpus Management	Complementary	KNW003 is the lifecycle management pattern; KNW006 is the quality assurance function within that lifecycle
EAAPL-KNW004	Vector Database Management	Downstream	CQA governs document quality before it enters the vector index; vector DB recall monitoring provides a quality feedback signal
EAAPL-KNW001	Enterprise Knowledge Graph	Complementary	Coverage gap analysis uses the EKG ontology as the coverage target; KNW001 defines what topics the corpus should cover
EAAPL-KNW005	Knowledge Graph for Explainability	Supporting	Explanation quality is constrained by corpus quality; CQA ensures the corpus facts used in explanations are accurate and current
EAAPL-GOV002	AI Model Risk Management	Supporting	Document classifier and accuracy validator are models subject to model risk management
EAAPL-OPS001	AI Observability	Complementary	Quality health dashboard is part of the broader AI observability framework

17. Maturity Assessment

Overall Maturity Label: Proven

Dimension	Score (1–5)	Rationale
Technology readiness	4	All component technologies (NLP libraries, embedding models, workflow tools, monitoring platforms) are production-proven; the integration pattern is well-established
Organisational capability	3	Requires data quality engineering skills and domain expert involvement for threshold calibration; achievable for most organisations with a data governance function
Standards availability	3	No specific CQA standard for AI corpora; draws on data quality standards (ISO 8000, DAMA DMBOK) with AI-specific adaptations
Vendor ecosystem	4	All major cloud providers offer component services; multiple open-source options; some emerging specialised corpus management vendors
Case evidence	4	Well-documented in library science and content management; AI-specific implementations growing rapidly with RAG adoption
Regulatory alignment	5	EU AI Act Article 10 data governance requirements and APP 10 are directly addressed; strongest regulatory coverage of the knowledge management patterns
Overall	3.8 / 5	Proven with strong regulatory alignment and accessible technology; primary investment is in calibrating thresholds and establishing human review processes for high-stakes domains

18. Revision History

Version	Date	Author	Changes
1.0	2026-06-12	EAAPL Editorial Board	Initial publication — covers six quality dimensions (completeness, accuracy, duplication, staleness, coverage, structure), pre-ingestion gating, continuous monitoring, human review programme, quality trend dashboard, and regulatory mapping
