
EAAPL-KNW006: Corpus Quality Assurance

Pattern ID: EAAPL-KNW006
Status: Proven
Complexity: Medium
Tags: observability, model-risk, traceability, medium-complexity
Version: 1.0
Last Updated: 2026-06-12


1. Executive Summary

Corpus Quality Assurance (CQA) is the automated pipeline that evaluates the fitness of documents for inclusion in an AI knowledge corpus — before ingestion and on a continuous basis after ingestion. It is the quality control function that stands between raw enterprise content and the retrieval systems that power AI answers.

This pattern defines six quality dimensions — completeness, accuracy, duplication, staleness, coverage, and structure — and specifies the automated measurement, threshold gating, and alerting mechanisms for each. It also covers the quality trend dashboard and the escalation path when automated quality assurance cannot make a determination.

For CIOs and CTOs, the core argument is that AI answer quality cannot exceed corpus quality. Teams that invest in LLM selection, prompt engineering, and retrieval architecture while neglecting corpus quality are optimising the wrong variable. A well-tuned AI system on a poor corpus may outperform a poorly tuned system on a good corpus in the short term, but the corpus quality deficit compounds: answers degrade as documents age, duplicate, and diverge, while the AI model itself remains fixed. CQA is the ongoing quality investment that protects AI answer quality as the corpus grows and ages.

Implementation is medium complexity. Unlike knowledge graph or semantic layer patterns, CQA does not require new data infrastructure — it adds a quality measurement and gating layer to existing document ingestion pipelines.


2. Problem Statement

2.1 Business Problem

Enterprise knowledge corpora degrade without active management. Documents become outdated as policies and products change, but the old versions remain in the retrieval index — the AI continues citing superseded information. Duplicate documents accumulate as the same content is ingested from multiple sources with minor variations, causing inconsistent retrieval. Poorly formatted or truncated documents produce low-quality retrieval chunks that confuse rather than inform the LLM. The business consequence is AI answers that become less reliable over time, eroding user trust in proportion to the corpus quality deficit.

2.2 Technical Problem

Retrieval quality in RAG systems is directly determined by the quality of the documents retrieved. Standard vector similarity search has no quality awareness: a low-quality, outdated document with high semantic similarity to a query will be retrieved in preference to a high-quality, current document with slightly lower similarity. Without quality scores attached to documents and factored into retrieval ranking, quality degradation is invisible to the retrieval algorithm.
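One way to make ranking quality-aware is to blend the similarity score with the stored document quality score. The sketch below is illustrative only: the `Candidate` type, the `alpha` weight, and the linear blend are assumptions made for this example and are not prescribed by the pattern.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    similarity: float  # cosine similarity to the query, 0..1
    quality: float     # composite quality score from the quality store, 0..1


def rerank(candidates: list[Candidate], alpha: float = 0.8) -> list[Candidate]:
    """Order candidates by a linear blend of similarity and quality.

    alpha is a hypothetical tuning weight; alpha=1.0 reproduces plain
    similarity ranking with no quality awareness.
    """
    def blended(c: Candidate) -> float:
        return alpha * c.similarity + (1.0 - alpha) * c.quality

    return sorted(candidates, key=blended, reverse=True)


stale = Candidate("policy-v1", similarity=0.92, quality=0.20)
current = Candidate("policy-v2", similarity=0.88, quality=0.95)
ranked = rerank([stale, current])
```

With these example values the current document ranks first once quality is blended in, whereas pure similarity ranking (`alpha=1.0`) would return the outdated one first.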

2.3 Symptoms

  • AI cites version-superseded documents (e.g., a policy withdrawn 18 months ago)
  • Same question receives different answers on different days because duplicate documents with conflicting content are retrieved inconsistently
  • Truncated or corrupted documents appear in retrieval results; LLM produces incoherent answers for those queries
  • No metric exists for corpus health; the team does not know if quality is improving or declining
  • Coverage gaps are discovered reactively (users complain the AI "doesn't know" about a topic) rather than proactively

2.4 Cost of Inaction

  • AI adoption reversal: business units that experience repeated quality failures disengage and revert to manual research
  • Regulatory risk: AI answers based on superseded regulatory or compliance documents produce incorrect guidance with potential legal consequences
  • Compounding quality debt: the longer quality management is deferred, the more documents require remediation, and the larger the quality remediation project becomes
  • Lost insight: coverage gaps mean entire knowledge domains are unrepresented in AI answers — the system is not aware of what it doesn't know

3. Context

3.1 When to Apply

  • Any production RAG system with >500 documents — below this scale, manual review is feasible; above it, automation is necessary
  • Corpora with multiple document sources and types of varying quality (the diversity creates the quality variance that requires automated management)
  • Domains with regulatory compliance implications where document currency is essential (compliance, legal, product, medical)
  • Organisations with high document update velocity — quality degrades fastest where content changes frequently
  • As a companion to EAAPL-KNW003 (AI Knowledge Corpus Management) — CQA is the quality assurance function; KNW003 is the lifecycle management function

3.2 When NOT to Apply

  • Single-source corpora from a single authoritative owner with manual review already in place — CQA overhead is not justified
  • Pure experimental/prototype deployments where AI answer quality is not yet a business concern
  • Corpora updated in batch by a controlled process with built-in quality controls upstream — additional CQA layer may be redundant

3.3 Prerequisites

  • Document metadata standard: at minimum, source system, author, effective date, expiry date, document type
  • Document ingestion pipeline with an interception point where quality checks can be executed before final ingestion
  • Storage for quality scores and quality history per document
  • Alerting infrastructure for quality threshold violations

3.4 Industry Applicability

| Industry | Applicability | Primary Quality Risk | Key Quality Dimension |
|---|---|---|---|
| Financial Services | Critical | Superseded regulatory documents | Staleness + Authority |
| Healthcare | Critical | Outdated clinical guidelines, drug information | Staleness + Accuracy |
| Legal | High | Superseded case law, outdated legislation references | Staleness + Completeness |
| Government | High | Policy version conflicts, outdated service information | Staleness + Duplication |
| Technology | Medium | Outdated product documentation, deprecated API references | Staleness + Completeness |
| Retail / CPG | Medium | Obsolete product specs, superseded compliance certs | Staleness + Duplication |

4. Architecture Overview

The Corpus Quality Assurance architecture comprises two operational phases: Pre-Ingestion Quality Gating and Post-Ingestion Continuous Quality Monitoring, unified by a shared quality score store and health dashboard.

4.1 Pre-Ingestion Quality Gating Pipeline

Every document submitted for corpus ingestion passes through six quality checks in sequence:

Completeness Check. The completeness scorer evaluates whether the document is whole and self-contained. Checks include: minimum word count for the document type; absence of truncation indicators (sentences that end abruptly, missing conclusion sections, "Page X of Y" indicators suggesting missing pages); presence of expected structural elements for the document type (a policy document without a "Scope" or "Effective Date" section is flagged as incomplete); broken internal references (citations to sections that don't exist in the document). Completeness score: 0–1.
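A minimal sketch of a completeness scorer along these lines follows. It covers only three of the checks described (minimum word count, an abrupt ending, and expected sections) and weights them equally; the function name, the equal weighting, and the punctuation heuristic for truncation are assumptions for illustration.

```python
import re


def completeness_score(text: str, min_words: int, required_sections: list[str]) -> float:
    """Return the fraction of completeness checks passed, in 0..1."""
    checks = []
    # Minimum word count for the document type.
    checks.append(len(text.split()) >= min_words)
    # Truncation heuristic: the text should end with sentence-final punctuation.
    checks.append(bool(re.search(r"[.!?]\s*$", text)))
    # Expected structural elements, matched case-insensitively as substrings.
    lowered = text.lower()
    for section in required_sections:
        checks.append(section.lower() in lowered)
    return sum(checks) / len(checks)


policy = "Scope: all staff. Effective Date: 2026-01-01. Staff must complete training."
truncated = "Scope: all staff. Staff must complete the"

full = completeness_score(policy, 5, ["Scope", "Effective Date"])
partial = completeness_score(truncated, 5, ["Scope", "Effective Date"])
```

A production scorer would likely parse document structure rather than rely on substring matching, and would add the broken-reference check described above.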

Accuracy Validation. Accuracy validation operates at two tiers. Tier 1 (automated): for documents in high-stakes domains, automated fact-claim extraction identifies specific factual assertions (percentages, thresholds, named entities with specific attributes) that can be cross-checked against a trusted reference source (a regulatory database, a product master data system, an authoritative ontology). Claims that contradict the reference source reduce the accuracy score. Tier 2 (human): a statistically sampled proportion of documents from high-stakes domains are routed to a human accuracy reviewer — a domain expert who validates a representative sample of claims against primary sources. The human sampling rate is configurable per domain (typically 2–10% for high-stakes; 0.1–0.5% for informational domains).
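The configurable sampling rate for Tier 2 review could be implemented in several ways. The sketch below assumes a deterministic, hash-based selection so that the same document always receives the same routing decision; the function name and the hashing approach are illustrative choices, not part of the pattern.

```python
import hashlib


def needs_human_review(doc_id: str, sampling_rate: float) -> bool:
    """Select roughly `sampling_rate` of documents for human accuracy review.

    The document ID is hashed to a value in [0, 1); the document is selected
    when that value falls below the sampling rate.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sampling_rate


# Example: a 5% sampling rate for a high-stakes domain.
selected = sum(needs_human_review(f"doc-{i}", 0.05) for i in range(10_000))
```

Deterministic selection makes the routing decision reproducible in audits; random sampling would serve equally well where reproducibility is not needed.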

Duplication Detection. Exact duplication is detected via cryptographic hash (SHA-256 of document content, normalised for whitespace and formatting). Near-duplicate detection uses cosine similarity of document-level embeddings: documents with similarity above a configurable threshold (default 0.95) are flagged as near-duplicates. For near-duplicate pairs, the deduplication strategy is configurable: reject the newer document (preserve canonical version), merge metadata (combine source attributions), or route to human review to determine which is authoritative. Exact duplicates are always rejected silently.
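The two-stage check can be sketched as follows. This is a simplified illustration: embeddings are passed in as plain lists (a real pipeline would obtain them from an embedding model), the corpus is an in-memory list of hypothetical `(doc_id, content_hash, embedding)` tuples, and normalisation covers whitespace only.

```python
import hashlib
import math
import re


def content_hash(text: str) -> str:
    """SHA-256 of the text after collapsing runs of whitespace."""
    normalised = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_duplicate(text, embedding, corpus, threshold=0.95):
    """Return (flag, matching_doc_ids) where flag is 'exact', 'near' or 'none'."""
    h = content_hash(text)
    # Stage 1: exact match on the normalised content hash.
    exact = [doc_id for doc_id, ch, _ in corpus if ch == h]
    if exact:
        return "exact", exact
    # Stage 2: near-duplicate match on embedding similarity above the threshold.
    near = [doc_id for doc_id, _, emb in corpus
            if cosine_similarity(embedding, emb) > threshold]
    return ("near", near) if near else ("none", [])


corpus = [
    ("a", content_hash("Hello  world"), [1.0, 0.0]),
    ("b", content_hash("Other"), [0.0, 1.0]),
]
```

Comparing against every stored embedding is linear in corpus size; at scale the comparison would normally go through an approximate nearest-neighbour index.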

Staleness Evaluation. The staleness scorer evaluates document freshness relative to domain-specific maximum age thresholds. Thresholds are configured per document type and domain: regulatory instruments (12 months), internal policies (6 months), product technical specifications (3 months), market data summaries (1 week). The staleness score decays from 1.0 (fully fresh) to 0.0 (at maximum age) and goes negative once the document is past expiry; a document past expiry cannot be ingested. The score factors in not only age but also the velocity of change in the domain: regulatory areas with a recent burst of amendments require more frequent refresh.
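The decay described above can be sketched as a simple function of document age. The text does not specify the decay curve, so a linear decay is assumed here; the month-to-day conversions and the type keys are also assumptions, and the domain change-velocity adjustment is omitted.

```python
from datetime import date

# Maximum ages from the text, converted to approximate day counts (assumption).
MAX_AGE_DAYS = {
    "regulatory_instrument": 365,
    "internal_policy": 182,
    "product_spec": 91,
    "market_data_summary": 7,
}


def staleness_score(effective: date, today: date, doc_type: str) -> float:
    """1.0 when fresh, 0.0 at maximum age, negative once past expiry."""
    age_days = (today - effective).days
    return 1.0 - age_days / MAX_AGE_DAYS[doc_type]


def can_ingest(score: float) -> bool:
    """Documents past expiry (negative score) are blocked from ingestion."""
    return score >= 0.0


fresh = staleness_score(date(2026, 1, 1), date(2026, 1, 1), "market_data_summary")
at_limit = staleness_score(date(2026, 1, 1), date(2026, 1, 8), "market_data_summary")
expired = staleness_score(date(2026, 1, 1), date(2026, 1, 15), "market_data_summary")
```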

Structural Integrity Check. The structural checker validates that the document can be processed by the downstream chunking pipeline. Checks: document is machine-readable (not a scanned image without OCR text layer); character encoding is valid UTF-8; no binary artefacts that would confuse chunking; minimum retrievable text content (>100 words of coherent prose). A structurally invalid document cannot produce useful retrieval chunks.
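A minimal sketch of the encoding and minimum-content checks is shown below. It assumes text has already been extracted to bytes, treats NUL characters as the only binary artefact, and counts words without judging coherence; OCR-layer detection is out of scope for this example.

```python
def structural_check(raw: bytes, min_words: int = 100) -> tuple[bool, str]:
    """Return (passed, reason) for the basic structural checks."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False, "invalid UTF-8 encoding"
    if "\x00" in text:
        return False, "binary artefacts (NUL characters) present"
    # The text requires more than `min_words` words of content.
    if len(text.split()) <= min_words:
        return False, f"not more than {min_words} words of text"
    return True, "ok"


ok_result = structural_check(("word " * 101).encode("utf-8"))
short_result = structural_check(("word " * 100).encode("utf-8"))
bad_encoding = structural_check(b"\xff\xfe not utf-8")
```

Returning the failure reason alongside the flag matches the pass/fail output with a specific failure reason listed in the data flow.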

Coverage Assessment. The coverage assessor checks whether the document adds genuine value to the corpus by mapping its content to the knowledge ontology. If the document's topic is already represented by ≥N high-quality, current documents, the new document's incremental value is low and it is deprioritised or queued for later ingestion. If the document covers a topic with <N representations, it is flagged as a coverage gap filler and prioritised.
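The threshold logic can be sketched as below, assuming topic assignment has already happened upstream and that `topic_counts` holds the number of high-quality, current documents per topic. Treating a document as a gap filler when any one of its topics is under-represented is an assumption of this example.

```python
def assess_coverage(doc_topics: list[str], topic_counts: dict[str, int], n: int):
    """Return (decision, gap_topics) for a candidate document.

    A topic counts as a coverage gap when fewer than `n` documents cover it.
    """
    gap_topics = [t for t in doc_topics if topic_counts.get(t, 0) < n]
    decision = "prioritise" if gap_topics else "deprioritise"
    return decision, gap_topics


counts = {"leave-policy": 7, "expense-policy": 1}
well_covered = assess_coverage(["leave-policy"], counts, n=3)
fills_gap = assess_coverage(["leave-policy", "travel-policy"], counts, n=3)
```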

Composite Quality Score and Gate. The six dimension scores are combined into a composite quality score using weights that are configurable per document type. Documents above the high-quality threshold are auto-ingested. Documents in the middle band enter a quality review queue, and the document owner is notified with specific dimension-level feedback. Documents below the minimum threshold are rejected with a detailed rejection report.
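The weighting and three-way gate can be sketched as follows. The weights and the 0.80 / 0.50 thresholds are hypothetical placeholders, since the pattern leaves them to configuration; the structural check's pass/fail result is assumed to be mapped to 1.0 or 0.0 before weighting.

```python
# Hypothetical weights for one document type; real values come from configuration.
POLICY_WEIGHTS = {
    "completeness": 0.20,
    "accuracy": 0.25,
    "duplication": 0.10,
    "staleness": 0.25,
    "structure": 0.10,
    "coverage": 0.10,
}


def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the dimension scores, normalised by total weight."""
    total = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total


def gate(score: float, high: float = 0.80, low: float = 0.50) -> str:
    """Route a document by composite score (thresholds are assumptions)."""
    if score >= high:
        return "auto_ingest"
    if score >= low:
        return "review_queue"
    return "reject"


scores = {dim: 0.9 for dim in POLICY_WEIGHTS}
decision = gate(composite_score(scores, POLICY_WEIGHTS))
```

A weighted average lets a strong dimension offset a weak one; deployments that want hard floors on individual dimensions would add per-dimension minimum checks before the composite is computed.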

4.2 Post-Ingestion Continuous Quality Monitoring

Quality degrades after ingestion as time passes and the broader knowledge landscape changes. The continuous monitoring layer runs scheduled jobs to re-evaluate the active corpus.

Freshness Monitor runs daily: re-scores all documents' staleness dimension; flags documents approaching the pre-expiry warning threshold; triggers automated expiry at the hard threshold.

Recall Probe runs weekly: executes a golden query set against the active corpus; measures recall@k for each query category; a decline in recall for a specific category indicates that the documents covering that topic have degraded in quality or have been removed.
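For reference, recall@k per query category can be computed as in the sketch below. The input shape, an iterable of hypothetical `(category, retrieved_ids, relevant_ids)` tuples, is an assumption; the retrieval call itself is out of scope.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def recall_by_category(results, k: int = 5) -> dict[str, float]:
    """Average recall@k per query category across the golden query set."""
    per_category: dict[str, list[float]] = {}
    for category, retrieved, relevant in results:
        per_category.setdefault(category, []).append(recall_at_k(retrieved, relevant, k))
    return {cat: sum(vals) / len(vals) for cat, vals in per_category.items()}


golden_results = [
    ("hr", ["a", "b"], {"a"}),
    ("hr", ["x"], {"y"}),
    ("legal", ["l1"], {"l1"}),
]
summary = recall_by_category(golden_results)
```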

Duplication Drift Monitor runs weekly: checks for near-duplicates introduced since the last run; newly ingested documents are compared against the existing corpus; cross-document contradictions (same topic, conflicting factual claims) are detected and flagged.

Coverage Gap Monitor runs monthly: maps the active corpus against the knowledge ontology; identifies topics with declining document counts (documents aged out without replacement); generates a prioritised ingestion backlog for the content management team.

4.3 Quality Score Store and History

All quality scores (pre-ingestion and post-ingestion re-evaluation) are stored per document with timestamps. This enables quality trend analysis: is a domain's average quality improving or declining? Are specific document types consistently failing particular quality dimensions? The quality history also supports root cause analysis when AI answer quality issues are investigated.


5. Architecture Diagram

flowchart TD
    subgraph Submission["Document Submission"]
        DOC[Incoming Document]
        META[Document Metadata\nType · Source · Dates]
    end
    subgraph PreIngest["Pre-Ingestion Quality Gate"]
        COMP[Completeness\nScorer]
        ACC[Accuracy\nValidator]
        DUP[Duplication\nDetector]
        STALE[Staleness\nEvaluator]
        STRUCT[Structural\nIntegrity Check]
        COV[Coverage\nAssessor]
        DOC --> COMP --> ACC --> DUP --> STALE --> STRUCT --> COV
    end
    subgraph Scoring["Composite Scoring + Gate"]
        CSCORE[Composite Quality\nScore Calculation]
        GATE{Score\nThreshold}
        AUTO[Auto-Ingest\nHigh Quality]
        REVIEW[Quality Review\nQueue with Feedback]
        REJECT[Rejection\nReport to Owner]
        COV --> CSCORE --> GATE
        GATE -->|High| AUTO
        GATE -->|Medium| REVIEW
        GATE -->|Low| REJECT
    end
    subgraph Store["Quality Score Store"]
        QHIST[Quality History\nPer Document × Dimension × Timestamp]
        AUTO --> QHIST
        REVIEW -->|Owner resolves| AUTO
    end
    subgraph Corpus["Active Corpus + Vector Index"]
        ACTIVE[(Active Corpus\nDocument Store)]
        VECIDX[(Vector Index\nActive Chunks)]
        AUTO --> ACTIVE
        ACTIVE --> VECIDX
    end
    subgraph Continuous["Continuous Monitoring"]
        FRESH[Daily Freshness\nMonitor]
        RECALL[Weekly Recall\nProbe — Golden Set]
        DUPMON[Weekly Duplication\nDrift Monitor]
        COVMON[Monthly Coverage\nGap Monitor]
        ACTIVE --> FRESH & DUPMON
        VECIDX --> RECALL
        ACTIVE --> COVMON
    end
    subgraph Dashboard["Quality Health Dashboard"]
        TREND[Quality Score\nTrend by Domain]
        COVMAP[Coverage Map\nOntology vs Corpus]
        QUEUE[Review Queue\nDepth + SLA Status]
        RECALLT[Recall Trend\nby Query Category]
        FRESH & RECALL & DUPMON & COVMON --> QHIST
        QHIST --> TREND & COVMAP & QUEUE & RECALLT
    end
    META --> STALE
    META --> COV

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Completeness Scorer | Processing | Evaluate document wholeness: structure, word count, truncation | Custom Python scorer; readability libraries; document structure parser | High |
| Accuracy Validator | AI + Human workflow | Automated fact-claim extraction + reference cross-check; human sampling for high-stakes domains | Custom NLP pipeline; spaCy; LLM-based claim extractor; human review workflow | High |
| Duplication Detector | Processing | Exact hash deduplication; near-duplicate embedding similarity | SHA-256 hash; document embedding (Sentence Transformers); cosine similarity threshold | High |
| Staleness Evaluator | Processing | Age-based freshness scoring with domain-specific thresholds; expiry enforcement | Custom scorer; metadata date arithmetic; domain threshold configuration | Critical |
| Structural Integrity Checker | Processing | Validate machine-readability, encoding, minimum text content | PyMuPDF (PDF validation), python-magic (format detection), character encoding detection | High |
| Coverage Assessor | Processing | Map document to ontology; assess incremental value against existing corpus coverage | Topic modelling (BERTopic); ontology lookup; document count per topic | Medium |
| Composite Score Calculator | Processing | Weighted combination of six dimension scores; apply threshold decision | Custom Python service; configurable weight matrix per document type | Critical |
| Quality Review Queue | Workflow | Route medium-quality documents to owners; track remediation SLA | Custom workflow app; Jira integration; email notification | High |
| Quality Score Store | Storage | Persist quality scores with history; support trend analysis | PostgreSQL with time-series extension; InfluxDB for metrics; DynamoDB | High |
| Freshness Monitor | Scheduler | Daily re-evaluation of staleness scores; expiry flagging | Kubernetes CronJob; AWS Lambda; Airflow DAG | Critical |
| Recall Probe | Scheduler | Weekly golden query set recall@k measurement | Custom Python evaluation job; Ragas framework | High |
| Duplication Drift Monitor | Scheduler | Weekly scan for newly introduced duplicates | Custom job using document embedding similarity | Medium |
| Coverage Gap Monitor | Scheduler | Monthly ontology coverage analysis | Custom analytics job; ontology API integration | Medium |
| Quality Health Dashboard | Observability | Unified view of all quality dimensions across domains | Grafana + custom metrics; Tableau; Metabase | Medium |

7. Data Flow

7.1 Primary Data Flow — Pre-Ingestion Quality Gate

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Document Source | Submits document + metadata to quality gate API | Document file + metadata |
| 2 | Completeness Scorer | Evaluates structural completeness, word count, truncation | Completeness score 0–1 |
| 3 | Accuracy Validator | Extracts factual claims; cross-checks against reference sources | Accuracy score 0–1; list of unverified claims |
| 4 | Duplication Detector | Computes hash; compares against existing corpus embeddings | Duplicate flag (exact/near/none); similar document IDs if near-duplicate |
| 5 | Staleness Evaluator | Computes age against domain threshold; returns freshness score | Staleness score 0–1; expiry flag if past threshold |
| 6 | Structural Integrity Checker | Validates encoding, format, minimum text | Pass/fail with specific failure reason |
| 7 | Coverage Assessor | Maps to ontology; counts existing documents on same topic | Incremental coverage value score; topic assignments |
| 8 | Composite Calculator | Applies dimension weights; computes composite score | Composite quality score 0–1 |
| 9 | Quality Gate | Routes document by composite score | Auto-ingest / review queue / reject |
| 10 | Quality Score Store | Persists all dimension scores and composite with document ID and timestamp | Quality record written |
| 11 | Active Corpus | Document chunked, embedded, ingested into vector index | Corpus updated |

7.2 Error Flow

| Error | Detection | Recovery | Escalation |
|---|---|---|---|
| Reference source unavailable (accuracy validator cannot cross-check) | HTTP timeout / API error from reference source | Fall back to reduced accuracy check (claim extraction only, no cross-check); flag document for human accuracy review | Alert operations; reference source SLA breach |
| Embedding generation failure (duplication detector) | Embedder exception | Retry ×3; skip similarity deduplication (still run hash deduplication); flag for re-check on next batch run | Alert ingestion pipeline team |
| Review queue SLA breach (documents not remediated within SLA) | Automated SLA monitoring job | Escalation notification to domain data steward and corpus governance | Corpus governance intervention; temporary threshold adjustment if volume overwhelms capacity |
| Staleness expiry with no replacement document | Freshness monitor flags; no new version ingested | Remove from active corpus; generate gap alert in coverage dashboard | Content management team notified to source replacement |
| Recall probe decline (quality issue not caught by pre-ingestion) | Recall@k below threshold | Identify recently ingested documents; trigger retrospective quality audit | Review quality gate thresholds; investigate specific query category failures |

8. Security Considerations

8.1 Authentication and Authorisation

The quality gate API authenticates document submissions using the same source authentication mechanism as the corpus management pipeline. Quality scores are internal operational data and are accessible to corpus administrators, data stewards, and the AI platform engineering team. Quality history records for a specific document are accessible to the document owner. The quality review queue interface requires MFA-enabled SSO.

8.2 Secrets Management

Reference source API credentials (for accuracy cross-checking), embedding model API keys (for duplication detection), and quality score database credentials are stored in a secrets vault with standard rotation.

8.3 Data Classification

Quality scores are metadata and carry the same classification as the document they describe. A quality report for a Confidential document is itself Confidential. The quality health dashboard aggregates are typically Internal classification (no individual document details).

8.4 Encryption

Quality score store: encrypted at rest (AES-256). Data in transit: TLS 1.3. Accuracy reviewer interface: HTTPS-only with session token management. Documents processed by the quality pipeline are processed in-memory only where possible; no sensitive content written to intermediary disk storage.

8.5 Auditability

All quality gate decisions are logged: document ID, timestamp, submitter identity, all dimension scores, composite score, gate decision (auto-ingest/review/reject), rejection reason if applicable. For documents entering the review queue, the reviewer's identity, decision, and timestamp are logged. This audit trail enables investigation of any document's quality history.

8.6 OWASP LLM Top 10 Mapping

| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM01 Prompt Injection | Adversarial documents could embed prompt injection content; quality gate is the first line of defence | Structural integrity check rejects documents with instruction-like patterns; content sanitisation before quality scoring |
| LLM02 Insecure Output Handling | Quality gate LLM components (accuracy validator) could produce unsafe outputs | Structured output format for all LLM quality assessments; no free-form output from quality LLMs |
| LLM03 Training Data Poisoning | Low-quality or adversarial documents that pass the quality gate pollute the corpus | Quality gate is the primary control; post-ingestion recall probe detects degradation from poisoned documents |
| LLM04 Model Denial of Service | Adversarially complex documents (extremely large, pathological encoding) could exhaust quality gate compute | Maximum document size limit; processing timeout per quality check; reject documents exceeding limits |
| LLM05 Supply Chain Vulnerabilities | Reference sources used for accuracy validation could be compromised | Reference source authentication; cross-check against multiple independent reference sources for critical claims |
| LLM06 Sensitive Information Disclosure | Accuracy validation process exposes document content to external reference APIs | On-premises or private reference sources for sensitive domains; no external API calls for Restricted documents |
| LLM07 Insecure Plugin Design | Reference source connectors in accuracy validator | Source connector allowlist; input validation; read-only connector access |
| LLM08 Excessive Agency | Quality gate automation could auto-reject valid documents at scale | Quality gate generates recommendations; escalation to human reviewer for borderline decisions |
| LLM09 Overreliance | Teams trust quality scores as absolute measures of document quality | Quality scores are indicators, not guarantees; human review programme for high-stakes domains; score interpretation guidance |
| LLM10 Model Theft | Quality scoring models encode domain knowledge | Quality model artefacts access-controlled; no external API exposure of quality scoring logic |

9. Governance Considerations

9.1 Responsible AI

Quality thresholds are value judgements encoded as configuration. A high completeness threshold that rejects documents with non-standard formatting may systematically exclude content from certain sources or geographies that format differently. Quality threshold calibration should include a bias audit: do the thresholds disproportionately exclude any legitimate document types, sources, or domain perspectives? Results are reviewed annually.

9.2 Model Risk Management

The accuracy validator's claim extraction and reference cross-checking component is a model. Its false negative rate (claims it fails to flag as inaccurate) determines the probability of inaccurate facts reaching the corpus. This model is subject to model risk management: model card documenting training data, precision/recall on validation set per claim type, known failure modes, and refresh schedule. The human accuracy review sampling programme provides an independent validation signal.

9.3 Human Approval Gates

All documents in the quality review queue require human action: either the document owner remediates the quality issue and resubmits, or the document is permanently rejected. Rejected documents cannot be automatically re-submitted — re-submission requires the owner to explicitly acknowledge the original rejection reason. Human accuracy reviewers for high-stakes domains complete competency validation before being assigned review tasks.

9.4 Policy Ownership

Quality threshold policy (minimum dimension scores, composite weights, human review sampling rates, domain-specific freshness schedules) is owned by the Corpus Governance Board. Quality threshold changes require a 10-business-day review period and simulation of the impact on the existing corpus (what percentage of currently active documents would fail under the new thresholds). Threshold changes that would invalidate >5% of the active corpus require executive approval.
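The impact simulation required before a threshold change can be sketched as a pass over the current composite scores. The function name and return shape are illustrative; the 5% approval trigger comes from the policy above.

```python
def simulate_threshold_change(current_scores: list[float], new_minimum: float) -> dict:
    """Estimate what share of the active corpus would fail a proposed minimum."""
    if not current_scores:
        return {"failing_pct": 0.0, "needs_executive_approval": False}
    failing = sum(1 for s in current_scores if s < new_minimum)
    failing_pct = 100.0 * failing / len(current_scores)
    return {
        "failing_pct": failing_pct,
        # Policy: changes invalidating more than 5% need executive approval.
        "needs_executive_approval": failing_pct > 5.0,
    }


# Example corpus: 94 documents scoring 0.9 and 6 scoring 0.6.
impact = simulate_threshold_change([0.9] * 94 + [0.6] * 6, new_minimum=0.7)
```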

9.5 Traceability

Every document in the active corpus has a complete quality history: all dimension scores at each quality gate evaluation, all continuous monitoring scores, all human review decisions, and the current quality score. This history supports both root cause analysis (why did AI answer quality degrade in domain X?) and compliance reporting (confirm that all documents in the corpus met quality standards at ingestion).

9.6 Governance Artefacts

| Artefact | Owner | Frequency | Location |
|---|---|---|---|
| Quality threshold policy | Corpus Governance Board | Annual review; ad-hoc for significant incidents | Policy management system |
| Accuracy validator model card | ML Engineering | Per model version | ML model registry |
| Quality bias audit report | Data Governance | Annual | Data governance platform |
| Human accuracy review report | Domain Data Stewards | Monthly | Governance dashboard |
| Coverage gap prioritised backlog | Content Management + Domain Stewards | Monthly | Content management system |
| Quality health monthly report | Corpus Operations | Monthly | Governance dashboard |

10. Operational Considerations

10.1 Monitoring and SLOs

| Metric | SLO Target | Alerting Threshold | Tool |
|---|---|---|---|
| Pre-ingestion gate processing time | ≤15 min per document (automated checks) | >60 min for any document in automated pipeline | Pipeline monitoring |
| Quality review queue clearance | 100% cleared within 3 business days | Any item >2 business days | Workflow SLA alert |
| Active corpus average quality score | ≥0.78 composite across all domains | <0.70 in any domain | Quality dashboard |
| Stale document rate | <3% of active corpus | >8% | Daily freshness monitor metric |
| Recall@5 on golden query set | ≥0.88 | <0.82 | Weekly recall probe |
| Duplication rate (near-duplicates in active corpus) | <2% | >5% | Weekly duplication monitor |
| Human accuracy review false negative rate | <1% on sampled documents | >2% in quarterly audit | Quality audit programme |

10.2 Logging

All quality gate events are logged as structured JSON: {document_id, timestamp, source, dimension_scores{}, composite_score, gate_decision, rejection_reason, reviewer_id}. Continuous monitoring events: {run_id, timestamp, check_type, documents_evaluated, alerts_generated}. Recall probe results: {run_id, timestamp, query_category, recall_at_k, threshold, pass_fail}. Log retention: 90 days operational; 7 years archive.
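A quality gate event with the fields listed above could be emitted as in this sketch. The function name and the choice of ISO 8601 UTC timestamps are assumptions; the field names follow the schema given in the text.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def gate_event(document_id: str, source: str, dimension_scores: dict,
               composite_score: float, gate_decision: str,
               rejection_reason: Optional[str] = None,
               reviewer_id: Optional[str] = None) -> str:
    """Serialise one quality gate decision as a single-line JSON log record."""
    event = {
        "document_id": document_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "dimension_scores": dimension_scores,
        "composite_score": composite_score,
        "gate_decision": gate_decision,
        "rejection_reason": rejection_reason,
        "reviewer_id": reviewer_id,
    }
    return json.dumps(event, sort_keys=True)


line = gate_event("doc-123", "sharepoint", {"completeness": 0.4, "staleness": -0.2},
                  0.31, "reject", rejection_reason="past expiry")
```

Keeping `rejection_reason` and `reviewer_id` present but null, rather than omitting them, gives every record the same keys and simplifies downstream queries.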

10.3 Incident Management

  • P1: Recall probe shows recall@5 below 0.70. Investigate recently ingested documents immediately; consider rolling back the most recent batch ingestion.
  • P2: Average corpus quality score below threshold in a critical domain (compliance, medical). Run a same-day quality audit and halt new ingestion until the root cause is identified.
  • P3: Review queue SLA breach, or a coverage gap in a single domain. Respond by the next business day.

10.4 Disaster Recovery

| Scenario | RTO | RPO | Recovery Procedure |
|---|---|---|---|
| Quality gate service unavailable | 30 min (container restart; stateless) | N/A (stateless) | Restart; documents queued during outage re-processed |
| Quality score store unavailable | 1 hour (replica promotion) | 5 min | Promote read replica; validate recent score retrieval |
| Reference source unavailable (accuracy check) | N/A (degraded mode) | N/A | Fall back to accuracy-check-disabled mode; flag all documents in this period for human review |
| Quality pipeline misconfiguration (wrong thresholds) | 2 hours (configuration rollback) | Last configuration version | Roll back threshold configuration; re-evaluate documents processed under wrong configuration |

10.5 Capacity Planning

Quality gate processing is CPU-intensive for large documents (structural parsing, embedding generation for deduplication). At high ingestion rates (>1,000 documents per day), parallelise quality gate workers with a job queue. The quality score store grows at approximately 1 KB per quality evaluation per document; a corpus of 100,000 documents with monthly re-evaluation accumulates ~1.2 GB per year — manageable.
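The storage estimate above is simple arithmetic and can be re-run for other corpus sizes or evaluation cadences. The helper below is a sketch that assumes decimal units (1 GB = 1,000,000 KB) and the approximate 1 KB per evaluation stated in the text.

```python
def annual_store_growth_gb(documents: int, evaluations_per_year: int,
                           kb_per_evaluation: float = 1.0) -> float:
    """Approximate yearly growth of the quality score store in decimal GB."""
    return documents * evaluations_per_year * kb_per_evaluation / 1_000_000


monthly = annual_store_growth_gb(100_000, 12)   # monthly re-evaluation
daily = annual_store_growth_gb(100_000, 365)    # if every daily re-score were persisted
```

For 100,000 documents this gives about 1.2 GB per year at a monthly cadence and about 36.5 GB per year if a full record were written for every daily re-score, which is still modest.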


11. Cost Considerations

11.1 Cost Drivers

| Cost Driver | Description | Typical Range |
|---|---|---|
| Embedding generation (deduplication) | Per-document embedding for near-duplicate detection | $0.0001–$0.001 per document |
| Accuracy validator LLM calls | Claim extraction per document in high-stakes domains | $0.01–$0.10 per document in high-stakes domains |
| Reference source API costs | External API calls for claim cross-checking | Variable; $0–$0.05 per cross-checked claim |
| Human accuracy reviewer labour | Domain expert time for sampled human review | $15–$75 per reviewed document depending on domain complexity |
| Quality score store | PostgreSQL-equivalent; modest size | $100–$500/month |
| Continuous monitoring compute | Scheduled jobs (freshness, recall, deduplication, coverage) | $200–$1,000/month |

11.2 Scaling Risks

  • Human accuracy review is the primary cost scaling risk: if the high-stakes document volume grows and sampling rates are maintained, review labour grows proportionally
  • Accuracy validator LLM cost can be significant for large, complex documents with many factual claims — restrict deep accuracy validation to genuinely high-stakes domains
  • Recall probe cost scales with golden query set size and retrieval computation — keep golden set to 200–500 representative queries

11.3 Optimisations

  • Hash deduplication is free (CPU-only) — always run before embedding-based near-duplicate detection
  • Tier accuracy validation by document classification: Restricted documents get full claim extraction and human review; Internal documents get automated checks only
  • Cache quality scores for documents that have not changed between evaluation runs (hash-based change detection)
  • Use smaller embedding models for deduplication (the absolute embedding values matter less than the similarity ranking)
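The hash-based caching optimisation can be sketched as a thin wrapper around any content-dependent scorer. The class and its in-memory dictionary are illustrative assumptions; a real deployment would keep the hash alongside the score in the quality score store.

```python
import hashlib
from typing import Callable


class QualityScoreCache:
    """Re-run a scorer only when the document content hash has changed."""

    def __init__(self, scorer: Callable[[str], float]):
        self._scorer = scorer
        self._cache: dict[str, tuple[str, float]] = {}  # doc_id -> (hash, score)

    def score(self, doc_id: str, content: str) -> float:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        cached = self._cache.get(doc_id)
        if cached is not None and cached[0] == digest:
            return cached[1]  # content unchanged: reuse the stored score
        value = self._scorer(content)
        self._cache[doc_id] = (digest, value)
        return value


calls = []


def example_scorer(text: str) -> float:
    # Stand-in for an expensive scorer; records each invocation.
    calls.append(text)
    return 0.5


cache = QualityScoreCache(example_scorer)
cache.score("d1", "abc")
cache.score("d1", "abc")    # served from cache
cache.score("d1", "abcd")   # content changed, scorer runs again
```

This applies only to dimensions that depend on content alone. Staleness depends on the current date, so it still needs re-scoring even when the content is unchanged.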

11.4 Indicative Cost Ranges

| Corpus Scale | Monthly QA Infrastructure Cost | Annual Total (incl. human review) |
|---|---|---|
| Small (10K docs, 500 new/month) | $300–$800 | $20,000–$60,000 |
| Medium (100K docs, 5K new/month) | $2,000–$6,000 | $100,000–$300,000 |
| Large (1M+ docs, 50K new/month) | $15,000–$50,000 | $500,000–$2,000,000 |

12. Trade-Off Analysis

12.1 Quality Gate Strictness Options

| Option | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Strict gate (high thresholds, manual review for borderline) | Maximum corpus quality; low false-positive rate (bad docs in corpus) | Slow ingestion; review queue bottleneck; risk of under-populated corpus | High-stakes domains (medical, legal, regulatory) where quality > coverage |
| Permissive gate (lower thresholds, auto-ingest most content) | Fast ingestion; high corpus coverage | Higher false-positive rate; lower average quality; more cleanup required | Informational domains where coverage > quality; very high document volume |
| Adaptive gate (thresholds calibrated per document type and domain) | Optimised quality/coverage trade-off per domain | Complex configuration; requires ongoing calibration | Recommended for most enterprise deployments with diverse document types |

12.2 Accuracy Validation Approaches

| Approach | Accuracy | Cost | Speed | Best For |
|---|---|---|---|---|
| Automated claim extraction + reference cross-check | Medium: reference source may not cover all claims | Medium | Fast (minutes) | Domains with authoritative machine-readable reference sources |
| LLM-based plausibility check (no reference source) | Low: LLM may hallucinate; cannot substitute for ground truth | Low | Very fast | Quick screening; flag obviously wrong claims for human review |
| Human domain expert review (full document) | Highest | High | Slow (hours–days) | Critical high-stakes documents; small volume |
| Statistical sampling with human review | High for sampled documents; inference to corpus quality | Medium | Manageable at scale | Enterprise-scale quality assurance programme |
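
For the statistical sampling approach, the "inference to corpus quality" step can be illustrated with a confidence interval on the reviewed sample. The sketch below uses the Wilson score interval, which is one common choice and not something this pattern mandates; the function name and the default 95% z-value are assumptions for the example.

```python
import math
from typing import Tuple


def wilson_interval(accurate: int, sampled: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the corpus accuracy rate, given a reviewed random sample.

    accurate: number of sampled documents that reviewers judged accurate
    sampled:  total number of documents reviewed
    z:        normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    if not 0 <= accurate <= sampled:
        raise ValueError("accurate must be between 0 and sampled")
    p = accurate / sampled
    denom = 1 + z * z / sampled
    centre = (p + z * z / (2 * sampled)) / denom
    half_width = z * math.sqrt(p * (1 - p) / sampled + z * z / (4 * sampled * sampled)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

The interval only supports a corpus-level claim if the sample is drawn at random from the population being described, for example per domain or per document classification tier.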

12.3 Architectural Tensions

| Tension | Option A | Option B | Recommended Resolution |
|---|---|---|---|
| Quality gate latency vs. thoroughness | Fast checks only (structural + hash dedup) for near-real-time ingestion | Full six-dimension check for maximum quality assurance | Tiered: fast checks for real-time ingestion path; deep checks in parallel async job; gate on fast checks immediately, gate on deep checks within 15 minutes |
| Centralised vs. distributed quality assessment | Single centralised CQA service for all corpus types | Domain-specific quality services with domain-tuned thresholds | Centralised framework and tooling; domain-configurable thresholds and reference sources within the framework |
| Automated vs. human quality decisions | Fully automated quality gating | Human review for all documents | Automation for clear cases (score far above or below threshold); human for borderline band; human sampling for quality audit |
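
The recommended resolution for automated versus human decisions amounts to a three-way routing rule around the threshold. A minimal sketch follows, assuming a single composite score in the range 0 to 1; the `band` half-width of 0.05 is a hypothetical default, and the audit sampling of auto-accepted documents is not shown.

```python
from enum import Enum


class Decision(Enum):
    AUTO_ACCEPT = "auto_accept"
    HUMAN_REVIEW = "human_review"
    AUTO_REJECT = "auto_reject"


def route(score: float, threshold: float, band: float = 0.05) -> Decision:
    """Route a document by composite quality score.

    Scores at least `band` above the threshold are accepted automatically,
    scores at least `band` below it are rejected automatically, and anything
    in between goes to the human review queue.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if band < 0:
        raise ValueError("band must be non-negative")
    if score >= threshold + band:
        return Decision.AUTO_ACCEPT
    if score <= threshold - band:
        return Decision.AUTO_REJECT
    return Decision.HUMAN_REVIEW
```

Widening `band` sends more documents to reviewers and narrowing it sends fewer, so the parameter is effectively a dial between review cost and the risk of an unreviewed borderline decision.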

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Threshold misconfiguration (too lenient) | Medium | High: low-quality documents enter active corpus | Recall probe degradation; user-reported quality issues | Retrospective quality audit; purge documents below the intended threshold; recalibrate thresholds |
| Threshold misconfiguration (too strict) | Medium | Medium: legitimate documents excluded; corpus coverage gaps | Coverage gap monitor; domain steward reports missing knowledge | Threshold recalibration; re-submit previously rejected documents |
| Accuracy validator has a high false-negative rate (inaccurate claims pass as accurate) | Medium | High: inaccurate facts enter corpus | Human accuracy review sampling detects discrepancy | Model retraining; increase human review sampling rate until model is fixed |
| Staleness monitor missed expiry | Low | High: outdated documents remain active | User-reported stale AI answers; periodic manual audit | Manual sweep of domain; fix monitoring job; add regression test |
| Coverage gap not detected (ontology coverage map outdated) | Medium | Medium: AI doesn't know about topic | User complaints; ontology gap identified manually | Ontology refresh; update coverage monitor; proactive gap-filling ingestion |
| Recall probe false positives (golden set stale) | Medium | Medium: alerts on correct behaviour; team distrust of monitoring | Golden set review identifies outdated expected results | Quarterly golden set refresh with domain expert validation |
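
Several rows above rely on the recall probe as a detection signal. As an illustration only, a probe could compute mean recall@k over a golden set of queries and expected document IDs; the `retrieve` callback, the value of `k`, and the `alert_below` level are assumptions for this sketch and not values defined by the pattern.

```python
from typing import Callable, Dict, List, Set, Tuple


def recall_at_k(retrieved: List[str], expected: Set[str], k: int) -> float:
    """Fraction of the expected document IDs that appear in the top-k retrieved IDs."""
    if not expected:
        raise ValueError("expected set must not be empty")
    hits = expected.intersection(retrieved[:k])
    return len(hits) / len(expected)


def run_recall_probe(
    golden_set: Dict[str, Set[str]],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
    alert_below: float = 0.8,
) -> Tuple[float, bool]:
    """Return (mean recall@k, should_alert) over every query in the golden set."""
    if not golden_set:
        raise ValueError("golden set must not be empty")
    recalls = [recall_at_k(retrieve(query), expected, k) for query, expected in golden_set.items()]
    mean_recall = sum(recalls) / len(recalls)
    return mean_recall, mean_recall < alert_below
```

If the golden set itself is stale, this probe raises alerts on correct retrieval behaviour, which is the failure mode in the last row of the table.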

13.1 Cascading Failure Scenarios

Scenario 1: Quality Gate Configuration Drift. Over 18 months, quality thresholds are relaxed in small increments to keep pace with ingestion volume pressures. No single change is significant enough to trigger governance review. Average corpus quality declines from 0.82 to 0.68. AI answer quality declines proportionally. Detection occurs when a business unit escalates multiple AI errors in a high-profile project. Root cause analysis reveals the threshold drift. Resolution requires: threshold reset to original values; corpus-wide retrospective quality re-evaluation; purge of documents now below threshold; 3-month remediation project.
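
One way to catch the drift in Scenario 1 is to evaluate each threshold change against the originally approved baseline as well as against the previous value. The sketch below assumes a hypothetical change history for a single threshold; both limits are illustrative and would need to be set by the governance function.

```python
from typing import Sequence


def needs_governance_review(
    baseline: float,
    history: Sequence[float],
    per_change_limit: float = 0.03,
    cumulative_limit: float = 0.05,
) -> bool:
    """Flag a threshold history for review.

    baseline: the originally approved threshold
    history:  successive threshold values after each change, oldest first
    A review is triggered if any single relaxation exceeds `per_change_limit`,
    or if the total relaxation from the baseline exceeds `cumulative_limit`.
    """
    previous = baseline
    for value in history:
        if previous - value > per_change_limit:
            return True  # one large relaxation
        previous = value
    # Many small relaxations can still add up to a large one.
    return baseline - previous > cumulative_limit
```

With these example limits, four successive relaxations of 0.02 each would pass a per-change check individually but would be flagged by the cumulative check.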

Scenario 2: Reference Source Compromised. The external regulatory database used as the accuracy validation reference source is compromised and begins serving incorrect threshold values for a compliance domain. The accuracy validator cross-checks document claims against the now-incorrect reference and accepts inaccurate documents. The AI begins providing incorrect compliance guidance. Detection: the compliance team notices that AI answers contradict known regulatory requirements. Resolution: immediately remove affected documents from the corpus; investigate reference source integrity; switch temporarily to human accuracy review until the reference source is validated clean; add a cross-check against a secondary reference source.
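
The secondary cross-check named in the Scenario 2 resolution could work along these lines. This is a sketch under simplifying assumptions: claims are reduced to key/value pairs, both reference sources are plain mappings, and the `Verdict` names are invented for the example.

```python
from enum import Enum
from typing import Mapping


class Verdict(Enum):
    CONFIRMED = "confirmed"
    CONTRADICTED = "contradicted"
    NEEDS_HUMAN_REVIEW = "needs_human_review"


def cross_check(
    claim_key: str,
    claimed_value: str,
    primary: Mapping[str, str],
    secondary: Mapping[str, str],
) -> Verdict:
    """Validate a claim only when two independent reference sources agree with each other."""
    primary_value = primary.get(claim_key)
    secondary_value = secondary.get(claim_key)
    # If either source lacks the fact, or the sources disagree, neither can be
    # trusted automatically, so the claim is escalated rather than decided.
    if primary_value is None or secondary_value is None or primary_value != secondary_value:
        return Verdict.NEEDS_HUMAN_REVIEW
    return Verdict.CONFIRMED if claimed_value == primary_value else Verdict.CONTRADICTED
```

A compromised primary source then produces a spike in escalations instead of silently accepted documents, which also gives the monitoring layer something to alert on.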


14. Regulatory Considerations

| Regulation | Relevant Clause | Requirement | How CQA Addresses It |
|---|---|---|---|
| APRA CPS 234 | §15(c) (Classification of Information Assets) | Information assets classified by criticality and sensitivity | Quality scorer classifies each document by domain criticality; quality thresholds calibrated to criticality |
| APRA CPS 230 | §33 (Information Management Obligations) | Framework for managing information quality in material systems | CQA is the documented quality management framework for AI knowledge assets |
| Australian Privacy Act 1988 | APP 10 (Quality of Personal Information) | Take reasonable steps to ensure personal information is accurate, up-to-date, complete | Accuracy and staleness dimensions directly address APP 10 for any corpus documents containing personal information |
| EU AI Act | Article 10(3) (Data Governance) | Training and knowledge data must be subject to data governance practices covering quality | Six-dimension quality gate + continuous monitoring constitutes documented data governance practices |
| EU AI Act | Article 10(2)(f) (Data Quality) | Data governance practices must address quality and accuracy of data used in high-risk AI | Accuracy validator + human review sampling satisfy accuracy requirement; staleness monitor satisfies currency requirement |
| ISO/IEC 42001 | §8.2.3 (Data Quality) | Organisations must address data quality in AI system lifecycle management | CQA pipeline is the data quality management implementation for the AI knowledge corpus |
| NIST AI RMF | MEASURE 2.5 (AI Data Quality) | Identify and measure AI system data quality risks and limitations | Quality score dimensions, trend monitoring, and coverage gap analysis directly address this measure |

15. Reference Implementations

15.1 AWS

| Component | AWS Service |
|---|---|
| Quality gate pipeline | AWS Step Functions (orchestration) + Lambda (individual quality checks) |
| Structural integrity check | Lambda + PyMuPDF/python-magic |
| Document embedding (deduplication) | Amazon Bedrock Titan Embeddings |
| Similarity search (deduplication) | OpenSearch k-NN |
| Accuracy reference source | Custom Lambda + external API or Amazon Kendra (knowledge source) |
| Quality score store | Amazon RDS PostgreSQL |
| Review queue | SQS + custom React UI + SES email notifications |
| Continuous monitoring | EventBridge Scheduler + Lambda |
| Dashboard | Amazon Managed Grafana |

15.2 Azure

| Component | Azure Service |
|---|---|
| Quality gate pipeline | Azure Logic Apps + Azure Functions |
| Document embedding (deduplication) | Azure OpenAI Embeddings |
| Similarity search | Azure AI Search (vector) |
| Accuracy validation | Azure AI Language + custom reference source API |
| Quality score store | Azure SQL Database |
| Review queue | Azure Service Bus + Power Apps |
| Continuous monitoring | Azure Functions with Timer trigger |
| Dashboard | Azure Monitor + Grafana |

15.3 GCP

| Component | GCP Service |
|---|---|
| Quality gate pipeline | Cloud Workflows + Cloud Functions |
| Document embedding | Vertex AI Embeddings |
| Similarity search | Vertex AI Vector Search |
| Quality score store | Cloud SQL PostgreSQL |
| Continuous monitoring | Cloud Scheduler + Cloud Functions |
| Dashboard | Cloud Monitoring + Grafana |

15.4 On-Premises

| Component | Technology |
|---|---|
| Quality gate pipeline | Apache Airflow DAG |
| Structural integrity | Python + PyMuPDF + chardet |
| Document embedding | Sentence Transformers on GPU |
| Deduplication similarity | Qdrant or pgvector for similarity search |
| Accuracy validation | spaCy NLP + custom reference source API |
| Quality score store | PostgreSQL |
| Review queue | Custom Flask app + email notifications |
| Dashboard | Prometheus + Grafana |

16. Related Patterns

| Pattern ID | Pattern Name | Relationship Type | Notes |
|---|---|---|---|
| EAAPL-KNW003 | AI Knowledge Corpus Management | Complementary | KNW003 is the lifecycle management pattern; KNW006 is the quality assurance function within that lifecycle |
| EAAPL-KNW004 | Vector Database Management | Downstream | CQA governs document quality before it enters the vector index; vector DB recall monitoring provides a quality feedback signal |
| EAAPL-KNW001 | Enterprise Knowledge Graph | Complementary | Coverage gap analysis uses the EKG ontology as the coverage target; KNW001 defines what topics the corpus should cover |
| EAAPL-KNW005 | Knowledge Graph for Explainability | Supporting | Explanation quality is constrained by corpus quality; CQA ensures the corpus facts used in explanations are accurate and current |
| EAAPL-GOV002 | AI Model Risk Management | Supporting | Document classifier and accuracy validator are models subject to model risk management |
| EAAPL-OPS001 | AI Observability | Complementary | Quality health dashboard is part of the broader AI observability framework |

17. Maturity Assessment

Overall Maturity Label: Proven

| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Technology readiness | 4 | All component technologies (NLP libraries, embedding models, workflow tools, monitoring platforms) are production-proven; the integration pattern is well-established |
| Organisational capability | 3 | Requires data quality engineering skills and domain expert involvement for threshold calibration; achievable for most organisations with a data governance function |
| Standards availability | 3 | No specific CQA standard for AI corpora; draws on data quality standards (ISO 8000, DAMA DMBOK) with AI-specific adaptations |
| Vendor ecosystem | 4 | All major cloud providers offer component services; multiple open-source options; some emerging specialised corpus management vendors |
| Case evidence | 4 | Well-documented in library science and content management; AI-specific implementations growing rapidly with RAG adoption |
| Regulatory alignment | 5 | EU AI Act Article 10 data governance requirements and APP 10 are directly addressed; strongest regulatory coverage of the knowledge management patterns |
| Overall | 3.8 / 5 | Proven with strong regulatory alignment and accessible technology; primary investment is in calibrating thresholds and establishing human review processes for high-stakes domains |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-06-12 | EAAPL Editorial Board | Initial publication: covers six quality dimensions (completeness, accuracy, duplication, staleness, coverage, structure), pre-ingestion gating, continuous monitoring, human review programme, quality trend dashboard, and regulatory mapping |