EAAPL-KNW006: Corpus Quality Assurance
Pattern ID: EAAPL-KNW006
Status: Proven
Complexity: Medium
Tags: observability model-risk traceability medium-complexity
Version: 1.0
Last Updated: 2026-06-12
1. Executive Summary
Corpus Quality Assurance (CQA) is the automated pipeline that evaluates the fitness of documents for inclusion in an AI knowledge corpus — before ingestion and on a continuous basis after ingestion. It is the quality control function that stands between raw enterprise content and the retrieval systems that power AI answers.
This pattern defines six quality dimensions — completeness, accuracy, duplication, staleness, coverage, and structure — and specifies the automated measurement, threshold gating, and alerting mechanisms for each. It also covers the quality trend dashboard and the escalation path when automated quality assurance cannot make a determination.
For CIOs and CTOs, the core argument is: AI answer quality cannot exceed corpus quality. Teams investing in LLM selection, prompt engineering, and retrieval architecture while neglecting corpus quality are optimising the wrong variable. A well-tuned AI system on a poor corpus will outperform a poorly tuned system on a good corpus in the short term, but the corpus quality deficit compounds — AI answers degrade as documents age, duplicate, and diverge, while the AI model remains fixed. CQA is the ongoing quality investment that protects AI answer quality as the corpus grows and ages.
Implementation is medium complexity. Unlike knowledge graph or semantic layer patterns, CQA does not require new data infrastructure — it adds a quality measurement and gating layer to existing document ingestion pipelines.
2. Problem Statement
2.1 Business Problem
Enterprise knowledge corpora degrade without active management. Documents become outdated as policies and products change, but the old versions remain in the retrieval index — the AI continues citing superseded information. Duplicate documents accumulate as the same content is ingested from multiple sources with minor variations, causing inconsistent retrieval. Poorly formatted or truncated documents produce low-quality retrieval chunks that confuse rather than inform the LLM. The business consequence is AI answers that become less reliable over time, eroding user trust in proportion to the corpus quality deficit.
2.2 Technical Problem
Retrieval quality in RAG systems is directly determined by the quality of the documents retrieved. Standard vector similarity search has no quality awareness: a low-quality, outdated document with high semantic similarity to a query will be retrieved in preference to a high-quality, current document with slightly lower similarity. Without quality scores attached to documents and factored into retrieval ranking, quality degradation is invisible to the retrieval algorithm.
2.3 Symptoms
- AI cites version-superseded documents (e.g., a policy withdrawn 18 months ago)
- Same question receives different answers on different days because duplicate documents with conflicting content are retrieved inconsistently
- Truncated or corrupted documents appear in retrieval results; LLM produces incoherent answers for those queries
- No metric exists for corpus health; the team does not know if quality is improving or declining
- Coverage gaps are discovered reactively (users complain the AI "doesn't know" about a topic) rather than proactively
2.4 Cost of Inaction
- AI adoption reversal: business units that experience repeated quality failures disengage and revert to manual research
- Regulatory risk: AI answers based on superseded regulatory or compliance documents produce incorrect guidance with potential legal consequences
- Compounding quality debt: the longer quality management is deferred, the more documents require remediation, and the larger the quality remediation project becomes
- Lost insight: coverage gaps mean entire knowledge domains are unrepresented in AI answers — the system is not aware of what it doesn't know
3. Context
3.1 When to Apply
- Any production RAG system with >500 documents — below this scale, manual review is feasible; above it, automation is necessary
- Corpora with multiple document sources and types of varying quality (the diversity creates the quality variance that requires automated management)
- Domains with regulatory compliance implications where document currency is essential (compliance, legal, product, medical)
- Organisations with high document update velocity — quality degrades fastest where content changes frequently
- As a companion to EAAPL-KNW003 (AI Knowledge Corpus Management) — CQA is the quality assurance function; KNW003 is the lifecycle management function
3.2 When NOT to Apply
- Single-source corpora from a single authoritative owner with manual review already in place — CQA overhead is not justified
- Pure experimental/prototype deployments where AI answer quality is not yet a business concern
- Corpora updated in batch by a controlled process with built-in quality controls upstream — additional CQA layer may be redundant
3.3 Prerequisites
- Document metadata standard: at minimum, source system, author, effective date, expiry date, document type
- Document ingestion pipeline with an interception point where quality checks can be executed before final ingestion
- Storage for quality scores and quality history per document
- Alerting infrastructure for quality threshold violations
3.4 Industry Applicability
| Industry | Applicability | Primary Quality Risk | Key Quality Dimension |
|---|---|---|---|
| Financial Services | Critical | Superseded regulatory documents | Staleness + Authority |
| Healthcare | Critical | Outdated clinical guidelines, drug information | Staleness + Accuracy |
| Legal | High | Superseded case law, outdated legislation references | Staleness + Completeness |
| Government | High | Policy version conflicts, outdated service information | Staleness + Duplication |
| Technology | Medium | Outdated product documentation, deprecated API references | Staleness + Completeness |
| Retail / CPG | Medium | Obsolete product specs, superseded compliance certs | Staleness + Duplication |
4. Architecture Overview
The Corpus Quality Assurance architecture comprises two operational phases: Pre-Ingestion Quality Gating and Post-Ingestion Continuous Quality Monitoring, unified by a shared quality score store and health dashboard.
4.1 Pre-Ingestion Quality Gating Pipeline
Every document submitted for corpus ingestion passes through six quality checks in sequence:
Completeness Check. The completeness scorer evaluates whether the document is whole and self-contained. Checks include: minimum word count for the document type; absence of truncation indicators (sentences that end abruptly, missing conclusion sections, "Page X of Y" indicators suggesting missing pages); presence of expected structural elements for the document type (a policy document without a "Scope" or "Effective Date" section is flagged as incomplete); broken internal references (citations to sections that don't exist in the document). Completeness score: 0–1.
Accuracy Validation. Accuracy validation operates at two tiers. Tier 1 (automated): for documents in high-stakes domains, automated fact-claim extraction identifies specific factual assertions (percentages, thresholds, named entities with specific attributes) that can be cross-checked against a trusted reference source (a regulatory database, a product master data system, an authoritative ontology). Claims that contradict the reference source reduce the accuracy score. Tier 2 (human): a statistically sampled proportion of documents from high-stakes domains are routed to a human accuracy reviewer — a domain expert who validates a representative sample of claims against primary sources. The human sampling rate is configurable per domain (typically 2–10% for high-stakes; 0.1–0.5% for informational domains).
Duplication Detection. Exact duplication is detected via cryptographic hash (SHA-256 of document content, normalised for whitespace and formatting). Near-duplicate detection uses cosine similarity of document-level embeddings: documents with similarity above a configurable threshold (default 0.95) are flagged as near-duplicates. For near-duplicate pairs, the deduplication strategy is configurable: reject the newer document (preserve canonical version), merge metadata (combine source attributions), or route to human review to determine which is authoritative. Exact duplicates are always rejected silently.
Staleness Evaluation. The staleness scorer evaluates document freshness relative to domain-specific maximum age thresholds. Threshold configuration is per document type and domain: regulatory instruments (12 months), internal policies (6 months), product technical specifications (3 months), market data summaries (1 week). The staleness score decays from 1.0 (fully fresh) to 0.0 (at maximum age) and goes negative (below 0) when past expiry — a document past expiry cannot be ingested. The score factors in not just age but also the velocity of change in the domain: regulatory areas with a recent burst of amendments require more frequent refresh.
Structural Integrity Check. The structural checker validates that the document can be processed by the downstream chunking pipeline. Checks: document is machine-readable (not a scanned image without OCR text layer); character encoding is valid UTF-8; no binary artefacts that would confuse chunking; minimum retrievable text content (>100 words of coherent prose). A structurally invalid document cannot produce useful retrieval chunks.
Coverage Assessment. The coverage assessor checks whether the document adds genuine value to the corpus by mapping its content to the knowledge ontology. If the document's topic is already represented by ≥N high-quality, current documents, the new document's incremental value is low and it is deprioritised or queued for later ingestion. If the document covers a topic with <N representations, it is flagged as a coverage gap filler and prioritised.
Composite Quality Score and Gate. Each of the six dimension scores is combined into a composite quality score with configurable weights per document type. Documents above the high-quality threshold are auto-ingested. Documents in the middle band enter a quality review queue where a document owner is notified with specific dimension-level feedback. Documents below the minimum threshold are rejected with a detailed rejection report.
4.2 Post-Ingestion Continuous Quality Monitoring
Quality degrades after ingestion as time passes and the broader knowledge landscape changes. The continuous monitoring layer runs scheduled jobs to re-evaluate the active corpus.
Freshness Monitor runs daily: re-scores all documents' staleness dimension; flags documents approaching the pre-expiry warning threshold; triggers automated expiry at the hard threshold.
Recall Probe runs weekly: executes a golden query set against the active corpus; measures recall@k for each query category; a decline in recall for a specific category indicates that the documents covering that topic have degraded in quality or have been removed.
Duplication Drift Monitor runs weekly: checks for near-duplicates introduced since the last run; newly ingested documents are compared against the existing corpus; cross-document contradictions (same topic, conflicting factual claims) are detected and flagged.
Coverage Gap Monitor runs monthly: maps the active corpus against the knowledge ontology; identifies topics with declining document counts (documents aged out without replacement); generates a prioritised ingestion backlog for the content management team.
4.3 Quality Score Store and History
All quality scores (pre-ingestion and post-ingestion re-evaluation) are stored per document with timestamps. This enables quality trend analysis: is a domain's average quality improving or declining? Are specific document types consistently failing particular quality dimensions? The quality history also supports root cause analysis when AI answer quality issues are investigated.
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Completeness Scorer | Processing | Evaluate document wholeness: structure, word count, truncation | Custom Python scorer; readability libraries; document structure parser | High |
| Accuracy Validator | AI + Human workflow | Automated fact-claim extraction + reference cross-check; human sampling for high-stakes domains | Custom NLP pipeline; spaCy; LLM-based claim extractor; human review workflow | High |
| Duplication Detector | Processing | Exact hash deduplication; near-duplicate embedding similarity | SHA-256 hash; document embedding (Sentence Transformers); cosine similarity threshold | High |
| Staleness Evaluator | Processing | Age-based freshness scoring with domain-specific thresholds; expiry enforcement | Custom scorer; metadata date arithmetic; domain threshold configuration | Critical |
| Structural Integrity Checker | Processing | Validate machine-readability, encoding, minimum text content | PyMuPDF (PDF validation), python-magic (format detection), character encoding detection | High |
| Coverage Assessor | Processing | Map document to ontology; assess incremental value against existing corpus coverage | Topic modelling (BERTopic); ontology lookup; document count per topic | Medium |
| Composite Score Calculator | Processing | Weighted combination of six dimension scores; apply threshold decision | Custom Python service; configurable weight matrix per document type | Critical |
| Quality Review Queue | Workflow | Route medium-quality documents to owners; track remediation SLA | Custom workflow app; Jira integration; email notification | High |
| Quality Score Store | Storage | Persist quality scores with history; support trend analysis | PostgreSQL with time-series extension; InfluxDB for metrics; DynamoDB | High |
| Freshness Monitor | Scheduler | Daily re-evaluation of staleness scores; expiry flagging | Kubernetes CronJob; AWS Lambda; Airflow DAG | Critical |
| Recall Probe | Scheduler | Weekly golden query set recall@k measurement | Custom Python evaluation job; Ragas framework | High |
| Duplication Drift Monitor | Scheduler | Weekly scan for newly introduced duplicates | Custom job using document embedding similarity | Medium |
| Coverage Gap Monitor | Scheduler | Monthly ontology coverage analysis | Custom analytics job; ontology API integration | Medium |
| Quality Health Dashboard | Observability | Unified view of all quality dimensions across domains | Grafana + custom metrics; Tableau; Metabase | Medium |
7. Data Flow
7.1 Primary Data Flow — Pre-Ingestion Quality Gate
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Document Source | Submits document + metadata to quality gate API | Document file + metadata |
| 2 | Completeness Scorer | Evaluates structural completeness, word count, truncation | Completeness score 0–1 |
| 3 | Accuracy Validator | Extracts factual claims; cross-checks against reference sources | Accuracy score 0–1; list of unverified claims |
| 4 | Duplication Detector | Computes hash; compares against existing corpus embeddings | Duplicate flag (exact/near/none); similar document IDs if near-duplicate |
| 5 | Staleness Evaluator | Computes age against domain threshold; returns freshness score | Staleness score 0–1; expiry flag if past threshold |
| 6 | Structural Integrity Checker | Validates encoding, format, minimum text | Pass/fail with specific failure reason |
| 7 | Coverage Assessor | Maps to ontology; counts existing documents on same topic | Incremental coverage value score; topic assignments |
| 8 | Composite Calculator | Applies dimension weights; computes composite score | Composite quality score 0–1 |
| 9 | Quality Gate | Routes document by composite score | Auto-ingest / review queue / reject |
| 10 | Quality Score Store | Persists all dimension scores and composite with document ID and timestamp | Quality record written |
| 11 | Active Corpus | Document chunked, embedded, ingested into vector index | Corpus updated |
7.2 Error Flow
| Error | Detection | Recovery | Escalation |
|---|---|---|---|
| Reference source unavailable (accuracy validator cannot cross-check) | HTTP timeout / API error from reference source | Fall back to reduced accuracy check (claim extraction only, no cross-check); flag document for human accuracy review | Alert operations; reference source SLA breach |
| Embedding generation failure (duplication detector) | Embedder exception | Retry ×3; skip similarity deduplication (still run hash deduplication); flag for re-check on next batch run | Alert ingestion pipeline team |
| Review queue SLA breach (documents not remediated within SLA) | Automated SLA monitoring job | Escalation notification to domain data steward and corpus governance | Corpus governance intervention; temporary threshold adjustment if volume overwhelms capacity |
| Staleness expiry with no replacement document | Freshness monitor flags; no new version ingested | Remove from active corpus; generate gap alert in coverage dashboard | Content management team notified to source replacement |
| Recall probe decline (quality issue not caught by pre-ingestion) | Recall@k below threshold | Identify recently ingested documents; trigger retrospective quality audit | Review quality gate thresholds; investigate specific query category failures |
8. Security Considerations
8.1 Authentication and Authorisation
The quality gate API authenticates document submissions using the same source authentication mechanism as the corpus management pipeline. Quality scores are internal operational data and are accessible to corpus administrators, data stewards, and the AI platform engineering team. Quality history records for a specific document are accessible to the document owner. The quality review queue interface requires MFA-enabled SSO.
8.2 Secrets Management
Reference source API credentials (for accuracy cross-checking), embedding model API keys (for duplication detection), and quality score database credentials are stored in a secrets vault with standard rotation.
8.3 Data Classification
Quality scores are metadata and carry the same classification as the document they describe. A quality report for a Confidential document is itself Confidential. The quality health dashboard aggregates are typically Internal classification (no individual document details).
8.4 Encryption
Quality score store: encrypted at rest (AES-256). Data in transit: TLS 1.3. Accuracy reviewer interface: HTTPS-only with session token management. Documents processed by the quality pipeline are processed in-memory only where possible; no sensitive content written to intermediary disk storage.
8.5 Auditability
All quality gate decisions are logged: document ID, timestamp, submitter identity, all dimension scores, composite score, gate decision (auto-ingest/review/reject), rejection reason if applicable. For documents entering the review queue, the reviewer's identity, decision, and timestamp are logged. This audit trail enables investigation of any document's quality history.
8.6 OWASP LLM Top 10 Mapping
| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM01 Prompt Injection | Adversarial documents could embed prompt injection content; quality gate is the first line of defence | Structural integrity check rejects documents with instruction-like patterns; content sanitisation before quality scoring |
| LLM02 Insecure Output Handling | Quality gate LLM components (accuracy validator) could produce unsafe outputs | Structured output format for all LLM quality assessments; no free-form output from quality LLMs |
| LLM03 Training Data Poisoning | Low-quality or adversarial documents that pass the quality gate pollute the corpus | Quality gate is the primary control; post-ingestion recall probe detects degradation from poisoned documents |
| LLM04 Model Denial of Service | Adversarially complex documents (extremely large, pathological encoding) could exhaust quality gate compute | Maximum document size limit; processing timeout per quality check; reject documents exceeding limits |
| LLM05 Supply Chain Vulnerabilities | Reference sources used for accuracy validation could be compromised | Reference source authentication; cross-check against multiple independent reference sources for critical claims |
| LLM06 Sensitive Information Disclosure | Accuracy validation process exposes document content to external reference APIs | On-premises or private reference sources for sensitive domains; no external API calls for Restricted documents |
| LLM07 Insecure Plugin Design | Reference source connectors in accuracy validator | Source connector allowlist; input validation; read-only connector access |
| LLM08 Excessive Agency | Quality gate automation could auto-reject valid documents at scale | Quality gate generates recommendations; escalation to human reviewer for borderline decisions |
| LLM09 Overreliance | Teams trust quality scores as absolute measures of document quality | Quality scores are indicators, not guarantees; human review programme for high-stakes domains; score interpretation guidance |
| LLM10 Model Theft | Quality scoring models encode domain knowledge | Quality model artefacts access-controlled; no external API exposure of quality scoring logic |
9. Governance Considerations
9.1 Responsible AI
Quality thresholds are value judgements encoded as configuration. A high completeness threshold that rejects documents with non-standard formatting may systematically exclude content from certain sources or geographies that format differently. Quality threshold calibration should include a bias audit: do the thresholds disproportionately exclude any legitimate document types, sources, or domain perspectives? Results are reviewed annually.
9.2 Model Risk Management
The accuracy validator's claim extraction and reference cross-checking component is a model. Its false negative rate (claims it fails to flag as inaccurate) determines the probability of inaccurate facts reaching the corpus. This model is subject to model risk management: model card documenting training data, precision/recall on validation set per claim type, known failure modes, and refresh schedule. The human accuracy review sampling programme provides an independent validation signal.
9.3 Human Approval Gates
All documents in the quality review queue require human action: either the document owner remediates the quality issue and resubmits, or the document is permanently rejected. Rejected documents cannot be automatically re-submitted — re-submission requires the owner to explicitly acknowledge the original rejection reason. Human accuracy reviewers for high-stakes domains complete competency validation before being assigned review tasks.
9.4 Policy Ownership
Quality threshold policy (minimum dimension scores, composite weights, human review sampling rates, domain-specific freshness schedules) is owned by the Corpus Governance Board. Quality threshold changes require a 10-business-day review period and simulation of the impact on the existing corpus (what percentage of currently active documents would fail under the new thresholds). Threshold changes that would invalidate >5% of the active corpus require executive approval.
9.5 Traceability
Every document in the active corpus has a complete quality history: all dimension scores at each quality gate evaluation, all continuous monitoring scores, all human review decisions, and the current quality score. This history supports both root cause analysis (why did AI answer quality degrade in domain X?) and compliance reporting (confirm that all documents in the corpus met quality standards at ingestion).
9.6 Governance Artefacts
| Artefact | Owner | Frequency | Location |
|---|---|---|---|
| Quality threshold policy | Corpus Governance Board | Annual review; ad-hoc for significant incidents | Policy management system |
| Accuracy validator model card | ML Engineering | Per model version | ML model registry |
| Quality bias audit report | Data Governance | Annual | Data governance platform |
| Human accuracy review report | Domain Data Stewards | Monthly | Governance dashboard |
| Coverage gap prioritised backlog | Content Management + Domain Stewards | Monthly | Content management system |
| Quality health monthly report | Corpus Operations | Monthly | Governance dashboard |
10. Operational Considerations
10.1 Monitoring and SLOs
| Metric | SLO Target | Alerting Threshold | Tool |
|---|---|---|---|
| Pre-ingestion gate throughput | ≤15 min per document (automated checks) | >60 min for any document in automated pipeline | Pipeline monitoring |
| Quality review queue clearance | 100% cleared within 3 business days | Any item >2 business days | Workflow SLA alert |
| Active corpus average quality score | ≥0.78 composite across all domains | <0.70 in any domain | Quality dashboard |
| Stale document rate | <3% of active corpus | >8% | Daily freshness monitor metric |
| Recall@5 on golden query set | ≥0.88 | <0.82 | Weekly recall probe |
| Duplication rate (near-duplicates in active corpus) | <2% | >5% | Weekly duplication monitor |
| Human accuracy review false negative rate | <1% on sampled documents | >2% in quarterly audit | Quality audit programme |
10.2 Logging
All quality gate events are logged as structured JSON: {document_id, timestamp, source, dimension_scores{}, composite_score, gate_decision, rejection_reason, reviewer_id}. Continuous monitoring events: {run_id, timestamp, check_type, documents_evaluated, alerts_generated}. Recall probe results: {run_id, timestamp, query_category, recall_at_k, threshold, pass_fail}. Log retention: 90 days operational; 7 years archive.
10.3 Incident Management
P1: Recall probe shows recall@5 below 0.70 — immediate investigation of recently ingested documents; potential recall of batch ingestion. P2: Average corpus quality score below threshold in a critical domain (compliance, medical) — same-day quality audit; halt of new ingestion until root cause identified. P3: Review queue SLA breach; single domain coverage gap — next business day response.
10.4 Disaster Recovery
| Scenario | RTO | RPO | Recovery Procedure |
|---|---|---|---|
| Quality gate service unavailable | 30 min (container restart; stateless) | N/A (stateless) | Restart; documents queued during outage re-processed |
| Quality score store unavailable | 1 hour (replica promotion) | 5 min | Promote read replica; validate recent score retrieval |
| Reference source unavailable (accuracy check) | N/A (degraded mode) | N/A | Fall back to accuracy-check-disabled mode; flag all documents in this period for human review |
| Quality pipeline misconfiguration (wrong thresholds) | 2 hours (configuration rollback) | Last configuration version | Roll back threshold configuration; re-evaluate documents processed under wrong configuration |
10.5 Capacity Planning
Quality gate processing is CPU-intensive for large documents (structural parsing, embedding generation for deduplication). At high ingestion rates (>1,000 documents per day), parallelise quality gate workers with a job queue. The quality score store grows at approximately 1 KB per quality evaluation per document; a corpus of 100,000 documents with monthly re-evaluation accumulates ~1.2 GB per year — manageable.
11. Cost Considerations
11.1 Cost Drivers
| Cost Driver | Description | Typical Range |
|---|---|---|
| Embedding generation (deduplication) | Per-document embedding for near-duplicate detection | $0.0001–$0.001 per document |
| Accuracy validator LLM calls | Claim extraction per document in high-stakes domains | $0.01–$0.10 per document in high-stakes domains |
| Reference source API costs | External API calls for claim cross-checking | Variable; $0–$0.05 per cross-checked claim |
| Human accuracy reviewer labour | Domain expert time for sampled human review | $15–$75 per reviewed document depending on domain complexity |
| Quality score store | PostgreSQL-equivalent; modest size | $100–$500/month |
| Continuous monitoring compute | Scheduled jobs (freshness, recall, deduplication, coverage) | $200–$1,000/month |
11.2 Scaling Risks
- Human accuracy review is the primary cost scaling risk: if the high-stakes document volume grows and sampling rates are maintained, review labour grows proportionally
- Accuracy validator LLM cost can be significant for large, complex documents with many factual claims — restrict deep accuracy validation to genuinely high-stakes domains
- Recall probe cost scales with golden query set size and retrieval computation — keep golden set to 200–500 representative queries
11.3 Optimisations
- Hash deduplication is free (CPU-only) — always run before embedding-based near-duplicate detection
- Tier accuracy validation by document classification: Restricted documents get full claim extraction and human review; Internal documents get automated checks only
- Cache quality scores for documents that have not changed between evaluation runs (hash-based change detection)
- Use smaller embedding models for deduplication (the absolute embedding values matter less than the similarity ranking)
11.4 Indicative Cost Ranges
| Corpus Scale | Monthly QA Infrastructure Cost | Annual Total (incl. human review) |
|---|---|---|
| Small (10K docs, 500 new/month) | $300–$800 | $20,000–$60,000 |
| Medium (100K docs, 5K new/month) | $2,000–$6,000 | $100,000–$300,000 |
| Large (1M+ docs, 50K new/month) | $15,000–$50,000 | $500,000–$2,000,000 |
12. Trade-Off Analysis
12.1 Quality Gate Strictness Options
| Option | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Strict gate (high thresholds, manual review for borderline) | Maximum corpus quality; low false-positive rate (bad docs in corpus) | Slow ingestion; review queue bottleneck; risk of under-populated corpus | High-stakes domains (medical, legal, regulatory) where quality > coverage |
| Permissive gate (lower thresholds, auto-ingest most content) | Fast ingestion; high corpus coverage | Higher false-positive rate; lower average quality; more cleanup required | Informational domains where coverage > quality; very high document volume |
| Adaptive gate (thresholds calibrated per document type and domain) | Optimised quality/coverage trade-off per domain | Complex configuration; requires ongoing calibration | Recommended for most enterprise deployments with diverse document types |
12.2 Accuracy Validation Approaches
| Approach | Accuracy | Cost | Speed | Best For |
|---|---|---|---|---|
| Automated claim extraction + reference cross-check | Medium — reference source may not cover all claims | Medium | Fast (minutes) | Domains with authoritative machine-readable reference sources |
| LLM-based plausibility check (no reference source) | Low — LLM may hallucinate; cannot substitute for ground truth | Low | Very fast | Quick screening; flag obviously wrong claims for human review |
| Human domain expert review (full document) | Highest | High | Slow (hours–days) | Critical high-stakes documents; small volume |
| Statistical sampling with human review | High for sampled documents; inference to corpus quality | Medium | Manageable at scale | Enterprise-scale quality assurance programme |
12.3 Architectural Tensions
| Tension | Option A | Option B | Recommended Resolution |
|---|---|---|---|
| Quality gate latency vs. thoroughness | Fast checks only (structural + hash dedup) for near-real-time ingestion | Full six-dimension check for maximum quality assurance | Tiered: fast checks for real-time ingestion path; deep checks in parallel async job; gate on fast checks immediately, gate on deep checks within 15 minutes |
| Centralised vs. distributed quality assessment | Single centralised CQA service for all corpus types | Domain-specific quality services with domain-tuned thresholds | Centralised framework and tooling; domain-configurable thresholds and reference sources within the framework |
| Automated vs. human quality decisions | Fully automated quality gating | Human review for all documents | Automation for clear cases (score far above or below threshold); human for borderline band; human sampling for quality audit |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Threshold misconfiguration (too lenient) | Medium | High — low-quality documents enter active corpus | Recall probe degradation; user-reported quality issues | Retrospective quality audit; purge documents below actual threshold; recalibrate thresholds |
| Threshold misconfiguration (too strict) | Medium | Medium — legitimate documents excluded; corpus coverage gaps | Coverage gap monitor; domain steward reports missing knowledge | Threshold recalibration; re-submit previously rejected documents |
| Accuracy validator high false negative rate | Medium | High — inaccurate facts enter corpus | Human accuracy review sampling detects discrepancy | Model retraining; increase human review sampling rate until model is fixed |
| Staleness monitor missed expiry | Low | High — outdated documents remain active | User-reported stale AI answers; periodic manual audit | Manual sweep of domain; fix monitoring job; add regression test |
| Coverage gap not detected (ontology coverage map outdated) | Medium | Medium — AI doesn't know about topic | User complaints; ontology gap identified manually | Ontology refresh; update coverage monitor; proactive gap-filling ingestion |
| Recall probe false positives (golden set stale) | Medium | Medium — alerts on correct behaviour; team distrust of monitoring | Golden set review identifies outdated expected results | Quarterly golden set refresh with domain expert validation |
13.1 Cascading Failure Scenarios
Scenario 1: Quality Gate Configuration Drift. Over 18 months, quality thresholds are relaxed in small increments to keep pace with ingestion volume pressures. No single change is significant enough to trigger governance review. Average corpus quality declines from 0.82 to 0.68. AI answer quality declines proportionally. Detection occurs when a business unit escalates multiple AI errors in a high-profile project. Root cause analysis reveals the threshold drift. Resolution requires: threshold reset to original values; corpus-wide retrospective quality re-evaluation; purge of documents now below threshold; 3-month remediation project.
Scenario 2: Reference Source Compromised. The external regulatory database used as an accuracy validation reference source is compromised and injects incorrect threshold values for a compliance domain. The accuracy validator cross-checks document claims against the now-incorrect reference and accepts inaccurate documents. AI begins providing incorrect compliance guidance. Detection: compliance team notices AI answers contradict known regulatory requirements. Resolution: immediate removal of affected documents from corpus; reference source integrity investigation; temporary switch to human accuracy review until reference source is validated clean; add cross-check against secondary reference source.
14. Regulatory Considerations
| Regulation | Relevant Clause | Requirement | How CQA Addresses It |
|---|---|---|---|
| APRA CPS 234 | §15(c) (Classification of Information Assets) | Information assets classified by criticality and sensitivity | Quality scorer classifies each document by domain criticality; quality thresholds calibrated to criticality |
| APRA CPS 230 | §33 (Information Management Obligations) | Framework for managing information quality in material systems | CQA is the documented quality management framework for AI knowledge assets |
| Australian Privacy Act 1988 | APP 10 (Quality of Personal Information) | Take reasonable steps to ensure personal information is accurate, up-to-date, complete | Accuracy and staleness dimensions directly address APP 10 for any corpus documents containing personal information |
| EU AI Act | Article 10(3) (Data Governance) | Training and knowledge data must be subject to data governance practices covering quality | Six-dimension quality gate + continuous monitoring constitutes documented data governance practices |
| EU AI Act | Article 10(2)(f) (Data Quality) | Data governance practices must address quality and accuracy of data used in high-risk AI | Accuracy validator + human review sampling satisfy accuracy requirement; staleness monitor satisfies currency requirement |
| ISO/IEC 42001 | §8.2.3 (Data Quality) | Organisations must address data quality in AI system lifecycle management | CQA pipeline is the data quality management implementation for the AI knowledge corpus |
| NIST AI RMF | MEASURE 2.5 (AI Data Quality) | Identify and measure AI system data quality risks and limitations | Quality score dimensions, trend monitoring, and coverage gap analysis directly address this measure |
15. Reference Implementations
15.1 AWS
| Component | AWS Service |
|---|---|
| Quality gate pipeline | AWS Step Functions (orchestration) + Lambda (individual quality checks) |
| Structural integrity check | Lambda + PyMuPDF/python-magic |
| Document embedding (deduplication) | Amazon Bedrock Titan Embeddings |
| Similarity search (deduplication) | OpenSearch k-NN |
| Accuracy reference source | Custom Lambda + external API or Amazon Kendra (knowledge source) |
| Quality score store | Amazon RDS PostgreSQL |
| Review queue | SQS + custom React UI + SES email notifications |
| Continuous monitoring | EventBridge Scheduler + Lambda |
| Dashboard | Amazon Managed Grafana |
15.2 Azure
| Component | Azure Service |
|---|---|
| Quality gate pipeline | Azure Logic Apps + Azure Functions |
| Document embedding (deduplication) | Azure OpenAI Embeddings |
| Similarity search | Azure AI Search (vector) |
| Accuracy validation | Azure AI Language + custom reference source API |
| Quality score store | Azure SQL Database |
| Review queue | Azure Service Bus + Power Apps |
| Continuous monitoring | Azure Functions with Timer trigger |
| Dashboard | Azure Monitor + Grafana |
15.3 GCP
| Component | GCP Service |
|---|---|
| Quality gate pipeline | Cloud Workflows + Cloud Functions |
| Document embedding | Vertex AI Embeddings |
| Similarity search | Vertex AI Vector Search |
| Quality score store | Cloud SQL PostgreSQL |
| Continuous monitoring | Cloud Scheduler + Cloud Functions |
| Dashboard | Cloud Monitoring + Grafana |
15.4 On-Premises
| Component | Technology |
|---|---|
| Quality gate pipeline | Apache Airflow DAG |
| Structural integrity | Python + PyMuPDF + chardet |
| Document embedding | Sentence Transformers on GPU |
| Deduplication similarity | Qdrant or pgvector for similarity search |
| Accuracy validation | spaCy NLP + custom reference source API |
| Quality score store | PostgreSQL |
| Review queue | Custom Flask app + email notifications |
| Dashboard | Prometheus + Grafana |
16. Related Patterns
| Pattern ID | Pattern Name | Relationship Type | Notes |
|---|---|---|---|
| EAAPL-KNW003 | AI Knowledge Corpus Management | Complementary | KNW003 is the lifecycle management pattern; KNW006 is the quality assurance function within that lifecycle |
| EAAPL-KNW004 | Vector Database Management | Downstream | CQA governs document quality before it enters the vector index; vector DB recall monitoring provides a quality feedback signal |
| EAAPL-KNW001 | Enterprise Knowledge Graph | Complementary | Coverage gap analysis uses the EKG ontology as the coverage target; KNW001 defines what topics the corpus should cover |
| EAAPL-KNW005 | Knowledge Graph for Explainability | Supporting | Explanation quality is constrained by corpus quality; CQA ensures the corpus facts used in explanations are accurate and current |
| EAAPL-GOV002 | AI Model Risk Management | Supporting | Document classifier and accuracy validator are models subject to model risk management |
| EAAPL-OPS001 | AI Observability | Complementary | Quality health dashboard is part of the broader AI observability framework |
17. Maturity Assessment
Overall Maturity Label: Proven
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Technology readiness | 4 | All component technologies (NLP libraries, embedding models, workflow tools, monitoring platforms) are production-proven; the integration pattern is well-established |
| Organisational capability | 3 | Requires data quality engineering skills and domain expert involvement for threshold calibration; achievable for most organisations with a data governance function |
| Standards availability | 3 | No specific CQA standard for AI corpora; draws on data quality standards (ISO 8000, DAMA DMBOK) with AI-specific adaptations |
| Vendor ecosystem | 4 | All major cloud providers offer component services; multiple open-source options; some emerging specialised corpus management vendors |
| Case evidence | 4 | Well-documented in library science and content management; AI-specific implementations growing rapidly with RAG adoption |
| Regulatory alignment | 5 | EU AI Act Article 10 data governance requirements and APP 10 are directly addressed; strongest regulatory coverage of the knowledge management patterns |
| Overall | 3.8 / 5 | Proven with strong regulatory alignment and accessible technology; primary investment is in calibrating thresholds and establishing human review processes for high-stakes domains |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-06-12 | EAAPL Editorial Board | Initial publication — covers six quality dimensions (completeness, accuracy, duplication, staleness, coverage, structure), pre-ingestion gating, continuous monitoring, human review programme, quality trend dashboard, and regulatory mapping |