EAAPL-KNW003: AI Knowledge Corpus Management

Pattern ID: EAAPL-KNW003
Status: Proven
Complexity: Medium
Tags: rag, traceability, observability, medium-complexity
Version: 1.0
Last Updated: 2026-06-12

1. Executive Summary

The AI Knowledge Corpus Management pattern defines the complete operational lifecycle for the document collection that powers Retrieval Augmented Generation (RAG) systems. Unlike a document repository, a managed corpus is a governed, versioned, quality-scored knowledge asset with controlled ingestion, continuous freshness monitoring, and point-in-time traceability.

Without corpus management, enterprise RAG systems degrade silently: outdated policies become embedded context for AI answers, PII-containing documents enter the retrieval pool without screening, and there is no way to reconstruct which corpus version produced a specific AI response six months ago. This pattern closes all three gaps.

For CIOs and CTOs, the business case is straightforward: a managed corpus is the difference between an AI system that is a liability (uncontrolled, unauditable, inconsistent) and one that is a managed enterprise asset (versioned, governed, explainable). Financial services, healthcare, and government organisations operating under AI regulation cannot deploy RAG without it.

Operational benefits include: reduced hallucination rates from higher-quality source documents, compliance-ready audit trails, and systematic identification of knowledge gaps that drive content investment decisions. Implementation complexity is medium — corpus management does not require graph databases or complex NLP pipelines, but it does require disciplined workflow and tooling.

2. Problem Statement

2.1 Business Problem

Enterprise RAG systems are frequently deployed with ad-hoc corpus construction: SharePoint libraries, email attachments, and wiki exports are bulk-ingested without governance. The resulting AI answers reflect the quality of the corpus — which is to say, inconsistent, outdated, and sometimes incorrect. Business users who discover that an AI answer was based on a superseded policy document or an unapproved draft lose confidence in the system permanently.

2.2 Technical Problem

RAG systems have no built-in mechanism for corpus versioning, document expiry, or quality gating. The vector store ingests whatever it receives. When a document is updated, stale embeddings may persist in the index alongside new ones, producing contradictory retrieval results. There is no standard mechanism for associating a specific AI response with the corpus snapshot that produced it, making post-hoc investigation of AI answers impossible.

2.3 Symptoms

AI answers cite policies that have been superseded or withdrawn
PII (names, account numbers, health records) appears in AI responses sourced from ingested documents
Different users receive contradictory AI answers to the same question over time (different corpus states)
Unable to investigate a specific AI response to identify which documents contributed to it
Knowledge gaps discovered reactively (users ask a question AI cannot answer) rather than proactively managed
No metric for corpus health — teams do not know whether the corpus is getting better or worse over time

2.4 Cost of Inaction

Regulatory sanctions for AI systems that cannot demonstrate auditable, controlled knowledge sources
Reputational damage from AI answers based on unauthorised, draft, or withdrawn documents
PII breach risk from unscreened document ingestion
Compounding knowledge debt: corpus quality degrades over time without active management, and recovery becomes increasingly expensive

3. Context

3.1 When to Apply

Any production RAG system where answers influence business decisions or customer interactions
Environments with regulatory requirements for AI explainability and auditability
Organisations with multiple document sources and content types of varying quality and authority
Deployments where corpus freshness materially affects answer accuracy (compliance, product, regulatory domains)
Systems where the same corpus serves multiple AI applications — governance ensures consistent behaviour across all consumers

3.2 When NOT to Apply

Internal prototype RAG systems used only by the development team for experimentation
Single-source corpora with a single owner who manually manages content — full corpus management overhead is disproportionate
Real-time ingestion use cases where every document must be available within seconds — quality gating introduces latency incompatible with this requirement

3.3 Prerequisites

Document management system or content repository with API access
Metadata standard for documents: at minimum, source system, owner, effective date, expiry date, classification
PII scanning capability (existing DLP tools or a dedicated library)
Vector database in use or planned for RAG

3.4 Industry Applicability

Industry	Applicability	Primary Use Case
Financial Services	Critical	Regulatory corpus (prudential standards, internal policies), product disclosure documents
Healthcare	Critical	Clinical guidelines, drug information, regulatory submissions
Legal / Professional Services	High	Case law, regulatory updates, internal precedent library
Government	High	Legislative corpus, policy library, citizen services knowledge
Technology	Medium	Product documentation, support knowledge bases, internal engineering standards
Retail / CPG	Medium	Product specifications, compliance certifications, supplier documentation

4. Architecture Overview

The AI Knowledge Corpus Management architecture is organised into five stages that form a continuous lifecycle: Ingestion Governance, Quality Gating, Versioned Storage, Freshness Management, and Health Monitoring.

4.1 Ingestion Governance

Before a document enters the corpus, it passes through an approval workflow. The workflow begins with source authentication: only documents from approved source systems or submitted by authorised document owners are accepted. Unapproved sources are rejected with a reason code logged in the rejection registry.
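
The source authentication step can be pictured as an allowlist check that records every rejection. The following is a minimal sketch only: the source names, the `UNAPPROVED_SOURCE` reason code, and the in-memory `RejectionRegistry` are hypothetical stand-ins for whatever register and registry an implementation actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical approved source register (illustrative values only).
APPROVED_SOURCES = {"sharepoint-policies", "regulatory-feed"}


@dataclass
class RejectionRegistry:
    """In-memory stand-in for the rejection registry."""
    entries: list = field(default_factory=list)

    def log(self, document_id: str, source: str, reason_code: str) -> None:
        self.entries.append({
            "document_id": document_id,
            "source": source,
            "reason_code": reason_code,
            "rejected_at": datetime.now(timezone.utc).isoformat(),
        })


def authenticate_source(document_id: str, source: str,
                        registry: RejectionRegistry) -> bool:
    """Accept documents from approved sources; log a reason code otherwise."""
    if source in APPROVED_SOURCES:
        return True
    registry.log(document_id, source, "UNAPPROVED_SOURCE")  # assumed code name
    return False
```

In practice the register would live in the corpus management system rather than in code, so that the Corpus Governance Board can change it without a deployment.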

The approved document then undergoes automated screening:

(1) Document classification: an ML classifier assigns a data sensitivity label (Public, Internal, Confidential, Restricted). Documents classified above the permitted threshold for the corpus are quarantined pending review.
(2) PII screening: a named entity recognition model identifies personal information such as names, account numbers, health identifiers, and addresses. PII-containing documents are either redacted (if the corpus permits redacted versions) or rejected entirely.
(3) Format and completeness check: validates that the document is machine-readable, not truncated, and meets minimum length and structure requirements.

Documents passing all automated screens enter a human approval queue for any document type designated as requiring manual review (e.g., all policy documents, all external regulatory updates). Low-risk document types (internal product FAQs, approved template-based content) can be auto-approved if automated screening passes.
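
The screening and approval routing described in this section can be summarised as a single decision function. This is a sketch under stated assumptions: the outcome strings, the `MANUAL_REVIEW_TYPES` names, and the idea that the classifier and PII screener results arrive as plain arguments are all illustrative, not part of the pattern.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical document types designated for manual review.
MANUAL_REVIEW_TYPES = {"policy", "external_regulatory_update"}


def route_document(doc_type: str, sensitivity: Sensitivity,
                   corpus_max: Sensitivity, has_pii: bool,
                   redaction_permitted: bool, format_ok: bool) -> str:
    """Return the next step for a document after automated screening."""
    if sensitivity > corpus_max:
        return "quarantine"        # above the permitted threshold for this corpus
    if has_pii and not redaction_permitted:
        return "reject"            # corpus does not accept redacted versions
    if not format_ok:
        return "reject"            # unreadable, truncated, or too short
    # If PII was found and redaction is permitted, the redacted copy continues.
    if doc_type in MANUAL_REVIEW_TYPES:
        return "human_review"
    return "auto_approve"
```

Keeping the routing in one pure function makes the policy easy to unit-test whenever the governance board changes which document types require manual review.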

4.2 Quality Gating

Approved documents are scored on five quality dimensions before ingestion into the active corpus:

Completeness (0–1): Is the document complete? Heuristics include minimum word count, presence of expected section headings, absence of "TODO" or "DRAFT" markers, and valid internal references.
Accuracy (0–1): Spot-checked via a sample-based human review programme; for high-stakes domains, automated fact verification against trusted reference sources.
Readability (0–1): Flesch-Kincaid readability score normalised for the target domain; documents with very poor readability tend to chunk badly and degrade retrieval and answer quality.
Authority (0–1): Is this document from an authoritative source for its topic? Regulatory documents from the regulator score higher than secondary commentary.
Freshness (0–1): How recently was the document authored or last reviewed? The score decays according to a domain-specific freshness schedule (see §4.4).

The composite quality score is computed as a weighted average of these five dimensions, with weights configurable per document type. Documents below the minimum quality threshold are rejected into a "quality remediation" queue where the document owner is notified to improve the document and resubmit.
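
As a worked illustration of the weighted average, the sketch below computes a composite score and applies a threshold. The weights, scores, and threshold values are made up for the example; only the five dimension names come from the text above.

```python
DIMENSIONS = ("completeness", "accuracy", "readability", "authority", "freshness")


def composite_score(scores: dict, weights: dict) -> float:
    """Weighted average of the five quality dimensions (each scored 0 to 1)."""
    for dim in DIMENSIONS:
        if dim not in scores or dim not in weights:
            raise ValueError(f"missing score or weight for {dim!r}")
        if not 0.0 <= scores[dim] <= 1.0:
            raise ValueError(f"{dim} score must be between 0 and 1")
    total_weight = sum(weights[dim] for dim in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total means weights need not be pre-normalised.
    return sum(scores[dim] * weights[dim] for dim in DIMENSIONS) / total_weight


def passes_gate(scores: dict, weights: dict, threshold: float) -> bool:
    """True if the document may proceed; False routes it to remediation."""
    return composite_score(scores, weights) >= threshold


# Illustrative weights for a hypothetical "policy" document type.
policy_weights = {"completeness": 0.2, "accuracy": 0.3, "readability": 0.1,
                  "authority": 0.2, "freshness": 0.2}
example_scores = {"completeness": 0.9, "accuracy": 0.8, "readability": 0.6,
                  "authority": 1.0, "freshness": 0.5}
```

With these example numbers the composite is 0.18 + 0.24 + 0.06 + 0.20 + 0.10 = 0.78, so the document would pass a 0.75 threshold and fail a 0.80 one.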

4.3 Versioned Storage

Every document ingested into the corpus is stored with a unique version identifier. The corpus itself is snapshotted at each deployment event — when a new version of the AI application using the corpus is deployed, the current corpus state is captured as a named snapshot. This enables point-in-time reconstruction: given an AI response produced on a specific date, the corpus snapshot at that time can be retrieved and the exact documents that would have been retrieved can be identified.

Document updates create new versions; old versions are retained in cold storage (not in the active retrieval index). A document's lineage record shows: all versions, the ingestion date of each version, the quality score at each version, and whether each version was active (in the retrieval index) at any point.
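
Point-in-time reconstruction relies on each version recording when it was active in the retrieval index. A minimal sketch, assuming a hypothetical `DocumentVersion` record with half-open activity intervals (the field names are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class DocumentVersion:
    document_id: str
    version_id: str
    active_from: datetime
    active_until: Optional[datetime] = None  # None means still active


def active_version_at(lineage: Iterable[DocumentVersion], document_id: str,
                      when: datetime) -> Optional[DocumentVersion]:
    """Return the version that was in the retrieval index at `when`, if any."""
    for version in lineage:
        if version.document_id != document_id:
            continue
        started = version.active_from <= when
        not_ended = version.active_until is None or when < version.active_until
        if started and not_ended:
            return version
    return None
```

Using the interval `[active_from, active_until)` means that at the instant of a version switch exactly one version is considered active, which avoids ambiguous audit answers.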

4.4 Freshness Management

Each document domain is assigned a maximum acceptable age before the document must be reviewed or refreshed:

Regulatory documents: 12 months
Internal policies: 6 months
Product specifications: 3 months
Market data or news summaries: 1 week

A scheduled freshness audit job runs daily and computes each document's "freshness score" based on its age relative to its domain's maximum. Documents approaching expiry (in the final 20% of the maximum age, that is, older than 80% of it) trigger an automated notification to the document owner requesting review. Documents past expiry are flagged as "stale" and either removed from the active retrieval index automatically (for low-authority documents) or quarantined pending mandatory human review (for high-authority documents). A stale document is never silently retained in the active index.
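
The daily audit logic can be sketched as follows. Several things here are assumptions rather than part of the pattern: the linear decay of the freshness score, the day counts used to approximate "12 months", "6 months", and "3 months", the reading of "within 20%" as the final 20% of the maximum age, and the 0.7 authority cut-off separating high- from low-authority documents.

```python
from datetime import date, timedelta

# Approximate day counts for the domain schedule (assumed conversions).
MAX_AGE_DAYS = {
    "regulatory": 365,
    "internal_policy": 182,
    "product_spec": 91,
    "market_data": 7,
}


def freshness_score(last_reviewed: date, domain: str, today: date) -> float:
    """Linear decay from 1.0 (just reviewed) to 0.0 (at or past maximum age)."""
    age_days = (today - last_reviewed).days
    return max(0.0, 1.0 - age_days / MAX_AGE_DAYS[domain])


def freshness_status(last_reviewed: date, domain: str, today: date,
                     warn_fraction: float = 0.20) -> str:
    """Classify a document as fresh, approaching_expiry, or stale."""
    age_days = (today - last_reviewed).days
    max_age = MAX_AGE_DAYS[domain]
    if age_days > max_age:
        return "stale"
    if age_days >= (1.0 - warn_fraction) * max_age:
        return "approaching_expiry"  # notify the document owner
    return "fresh"


def stale_action(authority_score: float, high_authority: float = 0.7) -> str:
    """Stale documents are never left in the index: quarantine or remove."""
    if authority_score >= high_authority:
        return "quarantine_for_review"
    return "remove_from_index"
```

For example, a market data summary reviewed six days ago falls into the warning window (6 is at least 80% of 7 days), while one reviewed eight days ago is stale.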

4.5 Health Monitoring

The corpus health dashboard provides a real-time view of corpus state: total documents by domain, average quality score per domain, coverage map (which knowledge domains are represented and with what depth), ingestion rate (documents per day/week), obsolescence queue depth, and rejection rate by rejection reason. Coverage gap analysis uses the ontology (if integrated with EAAPL-KNW001 or KNW002) to identify knowledge domains that hold fewer documents than a configured minimum.
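
At its simplest, coverage gap analysis is a count of active documents per domain compared with a minimum. The sketch below assumes the expected domain list is supplied by the caller (for instance from an ontology); the domain names are invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def coverage_gaps(document_domains: Iterable[str],
                  expected_domains: Iterable[str],
                  min_docs: int) -> Dict[str, int]:
    """Return {domain: document_count} for every domain below the minimum."""
    counts = Counter(document_domains)
    # Counter returns 0 for domains with no documents at all.
    return {d: counts[d] for d in expected_domains if counts[d] < min_docs}


active_docs = ["lending", "lending", "lending", "privacy"]
gaps = coverage_gaps(active_docs, ["lending", "privacy", "complaints"], min_docs=2)
```

Domains that are expected but entirely absent show up with a count of zero, which is usually the most important gap to surface.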

5. Architecture Diagram

flowchart TD
    subgraph Ingestion["Ingestion Governance"]
        A[Document Sources]
        B[Source Auth + PII Screen]
        C[Rejection Registry]
    end
    subgraph Quality["Quality Gating"]
        D[Human Approval + Quality Scorer]
        E{Quality Threshold}
        F[Remediation Queue]
    end
    subgraph Storage["Storage and Index"]
        G[(Versioned Document Store)]
        H[Chunker + Embedder]
        I[(Vector Database)]
        J[Corpus Health Dashboard]
    end
    A --> B
    B -->|rejected| C
    B -->|approved| D
    D -->|rejected| C
    D --> E
    E -->|below threshold| F
    E -->|pass| G
    G --> H
    H --> I
    G --> J
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#fee2e2,stroke:#ef4444
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f3e8ff,stroke:#a855f7
    style F fill:#fee2e2,stroke:#ef4444
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#fef9c3,stroke:#eab308
    style J fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Source Authenticator	Gateway	Validate document sources against approved source register; reject unapproved submissions	Custom API gateway, SharePoint webhook validation, S3 bucket policy	High
Document Classifier	AI/Processing	Assign data sensitivity labels using ML classification	AWS Comprehend, Azure AI Content Safety, custom fine-tuned BERT model	High
PII Screener	AI/Processing	Detect and redact PII using NER	Microsoft Presidio, spaCy + custom PII model, AWS Comprehend PII	High
Quality Scorer	Processing	Compute multi-dimension quality scores; apply domain-specific weighting	Custom Python scoring service; readability libraries; Flesch-Kincaid	High
Human Approval Workflow	Workflow	Route documents requiring manual review; track SLA compliance	Custom React workflow app, Jira Service Management, ServiceNow	Medium
Document Store	Storage	Versioned document storage with lineage and metadata	S3/Azure Blob/GCS with versioning enabled; custom metadata database (PostgreSQL)	Critical
Corpus Snapshot Engine	Storage	Capture corpus state at deployment events; enable point-in-time lookup	Custom snapshotting job; immutable snapshot store (S3 Object Lock)	High
Chunker and Embedder	Processing	Split documents into retrieval chunks; generate embeddings	LangChain text splitters, LlamaIndex, OpenAI Embeddings, Sentence Transformers	Critical
Vector Database	Storage	Active retrieval index; serves RAG queries	Pinecone, Weaviate, Qdrant, pgvector, Amazon OpenSearch, Azure AI Search	Critical
Freshness Audit Job	Scheduler	Daily evaluation of all documents against domain freshness schedules	cron job (Kubernetes CronJob or Lambda), Apache Airflow	High
Coverage Gap Analyser	Analytics	Identify under-served knowledge domains based on ontology coverage targets	Custom analytics job querying document metadata store	Medium
Corpus Health Dashboard	Observability	Real-time display of corpus health metrics across all domains	Grafana + custom metrics, Tableau, Superset	Medium

7. Data Flow

7.1 Primary Data Flow — Document Ingestion

Step	Actor	Action	Output
1	Document Source	Submits document via approved API or webhook	Document file + submission metadata
2	Source Authenticator	Validates source identity against approved source register	Approved or rejected with reason code
3	Document Classifier	Classifies document sensitivity	Sensitivity label attached to document metadata
4	PII Screener	Scans for personal information; redacts if permitted by corpus policy	Clean document or quarantine flag
5	Completeness Checker	Validates format, minimum length, structural integrity	Pass or fail with specific failure reason
6	Human Approval Queue	Routes policy-designated document types to manual review	Approved or rejected by reviewer
7	Quality Scorer	Computes five-dimension quality score	Composite quality score + dimension scores
8	Quality Gate	Applies minimum threshold per document type	Proceed to store or route to remediation queue
9	Document Store	Stores document with versioning; assigns version ID	Document stored with lineage record
10	Chunker and Embedder	Splits into chunks; generates embeddings	Chunk list with embeddings
11	Vector Database	Upserts embeddings; retires any older version embeddings for same document	Active corpus updated
12	Corpus Snapshot	Records current corpus state in snapshot log	Snapshot metadata updated
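
Step 11 is where stale embeddings are prevented from lingering next to new ones. The sketch below uses a plain dictionary as a stand-in for the vector database; the key format and record fields are assumptions, and a real store would do the same thing with a metadata filter on `document_id` and `version_id`.

```python
from typing import Dict, List


def upsert_document(index: Dict[str, dict], document_id: str,
                    version_id: str, embeddings: List[List[float]]) -> int:
    """Write chunks for the new version, then retire all older versions.

    Returns the number of retired chunk records.
    """
    # Write the new version first so the document is never absent from the index.
    for position, vector in enumerate(embeddings):
        index[f"{document_id}:{version_id}:{position}"] = {
            "document_id": document_id,
            "version_id": version_id,
            "embedding": vector,
        }
    # Then remove every chunk of the same document from any other version.
    stale_keys = [key for key, record in index.items()
                  if record["document_id"] == document_id
                  and record["version_id"] != version_id]
    for key in stale_keys:
        del index[key]
    return len(stale_keys)
```

Writing before deleting trades a brief window in which both versions are retrievable for never having a gap; an implementation that cannot tolerate mixed results could build the new version in a staging namespace and swap instead.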

7.2 Error Flow

Error	Detection	Recovery	Escalation
Source authentication failure	Authenticator rejects unknown source	Log rejection; notify submitter with reason	Submitter contacts document governance to register source
PII detected, no redaction policy	PII screener identifies PII; corpus policy prohibits redacted documents	Quarantine document; notify document owner to remove PII at source	Data governance review; legal review if regulatory implications
Quality score below threshold	Quality scorer produces score below minimum	Route to remediation queue; document owner notified with specific improvement guidance	Escalate if remediation queue exceeds SLA
Chunking failure (encoding issues, corrupt PDF)	Chunker exception	Retry with fallback chunking strategy; manual extraction if retry fails	Alert ingestion operations team
Embedding API failure	Embedder throws exception	Retry with exponential backoff; use fallback embedding model if primary unavailable	P2 incident; monitor embedding queue depth
Freshness expiry with no owner response	Freshness audit flags document; no owner response within SLA	Automatically remove from active index after escalation period	Corpus governance team takes ownership action
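
The recovery for an embedding API failure (retry with exponential backoff, then a fallback model) can be sketched as below. The function and parameter names are hypothetical, and the embedders are passed in as plain callables so the sketch does not assume any particular provider's client library.

```python
import time
from typing import Callable, List, Optional, Tuple

Embedder = Callable[[str], List[float]]


def embed_with_retry(text: str, primary: Embedder,
                     fallback: Optional[Embedder] = None,
                     max_attempts: int = 4, base_delay: float = 0.5,
                     sleep: Callable[[float], None] = time.sleep
                     ) -> Tuple[List[float], str]:
    """Try the primary embedder with exponential backoff, then the fallback.

    Returns the vector and which model produced it ("primary" or "fallback").
    """
    for attempt in range(max_attempts):
        try:
            return primary(text), "primary"
        except Exception:  # broad on purpose for the sketch; narrow in real code
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    if fallback is not None:
        return fallback(text), "fallback"
    raise RuntimeError("embedding failed after retries and no fallback is configured")
```

The model label is returned deliberately: vectors from different embedding models are not comparable, so chunks embedded by a fallback model need to be tagged and re-embedded (or indexed separately) once the primary is available again.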

8. Security Considerations

8.1 Authentication and Authorisation

Document submission endpoints require authenticated API calls (OAuth 2.0 or API key with source-registration). The corpus management admin interface (approval workflow, quality dashboard, corpus configuration) requires MFA-enabled SSO with role-based access: Document Reviewer, Corpus Administrator, Read-Only Observer. The vector database serving RAG queries requires service-to-service authentication.

8.2 Secrets Management

Document source API credentials, embedding model API keys, and vector database credentials are stored in a secrets vault with 90-day rotation. The PII screener model endpoint credentials are treated as high-sensitivity and stored with additional access controls.

8.3 Data Classification

Corpus documents are classified at ingestion. The vector database namespace or collection is partitioned by classification level. AI applications have access only to namespaces at or below their authorised classification. Documents reclassified to a higher level after ingestion are automatically migrated to the appropriate namespace and removed from previously accessible namespaces.
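
The namespace partitioning rule can be illustrated with a small sketch. The lowercase level names mirror the four sensitivity labels from §4.1; the dictionary of sets is a stand-in for vector database namespaces, and the function names are illustrative.

```python
from typing import Dict, List, Set

# Ordered from least to most sensitive.
LEVELS = ["public", "internal", "confidential", "restricted"]


def accessible_namespaces(app_clearance: str) -> List[str]:
    """Namespaces at or below the application's authorised classification."""
    return LEVELS[: LEVELS.index(app_clearance) + 1]


def reclassify(namespaces: Dict[str, Set[str]], document_id: str,
               new_level: str) -> None:
    """Move a document to its new namespace and remove it from all others."""
    for documents in namespaces.values():
        documents.discard(document_id)
    namespaces[new_level].add(document_id)


def visible_documents(namespaces: Dict[str, Set[str]],
                      app_clearance: str) -> Set[str]:
    """All documents an application with this clearance may retrieve."""
    visible: Set[str] = set()
    for level in accessible_namespaces(app_clearance):
        visible |= namespaces[level]
    return visible
```

After `reclassify(ns, "doc-1", "confidential")`, an application cleared only for `internal` no longer sees `doc-1`, which is the behaviour the paragraph above requires.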

8.4 Encryption

Document store: server-side encryption with customer-managed keys (CMK).
Vector database: encryption at rest and in transit.
PII screener processing: in-memory only; no PII written to intermediary storage.
Corpus snapshots: encrypted with the same CMK as the document store.

8.5 Auditability

A complete audit trail is maintained for every document: submission event, source authentication result, each screening result, quality score, approval/rejection decision with reviewer identity, ingestion event, all version transitions, freshness flags, and removal events. This trail enables full reconstruction of the corpus state at any historical point in time, which is the foundation for regulatory AI audit responses.

8.6 OWASP LLM Top 10 Mapping

OWASP LLM Risk	Relevance	Mitigation
LLM01 Prompt Injection	Malicious documents could embed instruction text that manipulates the RAG LLM	Document content sanitisation (strip instruction-like patterns); RAG prompt template hardening
LLM02 Insecure Output Handling	Malicious content in retrieved chunks can propagate into LLM output that downstream systems then trust	Content safety filter on retrieved chunks before LLM inclusion
LLM03 Training Data Poisoning	Malicious document ingested into corpus poisons retrieval results	Source authentication; approval workflow; anomaly detection on new documents from established sources
LLM04 Model Denial of Service	Extremely large documents or adversarial chunking patterns could exhaust compute	Maximum document size limit; chunking timeout; rate limiting on submission API
LLM05 Supply Chain Vulnerabilities	Embedding model or PII screener dependencies could be compromised	Dependency pinning; model integrity verification; vendor security assessments
LLM06 Sensitive Information Disclosure	Confidential documents ingested without proper classification leaking via retrieval	Mandatory classification screening; classification-scoped vector namespaces
LLM07 Insecure Plugin Design	Document source connectors could be exploited to inject unauthorised documents	Source authentication; webhook signature validation; allowlist of approved source systems
LLM08 Excessive Agency	Corpus management automation has write access to vector database	Principle of least privilege: automation writes only to staging namespace; human approval required for production promotion
LLM09 Overreliance	AI answers from stale corpus presented as current	Freshness score surfaced in retrieval metadata; staleness warning in AI response when citing old documents
LLM10 Model Theft	Corpus represents significant intellectual property investment	Access-controlled retrieval API; no bulk export; watermarking for premium corpus content
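
The webhook signature validation listed as a mitigation for LLM07 is commonly implemented with an HMAC over the request body. The sketch below uses Python's standard `hmac` module; the choice of SHA-256, the hex encoding, and the function names are assumptions, since each source system defines its own signing scheme.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (what the sender would attach)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_payload(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, signature_hex)
```

The signature must be computed over the raw bytes received, before any JSON parsing or re-serialisation, otherwise legitimate requests can fail verification.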

9. Governance Considerations

9.1 Responsible AI

The corpus is an encoding of the organisation's knowledge and, implicitly, its values and perspectives. Selective ingestion can introduce systematic bias: if compliance documents from one jurisdiction dominate, AI answers will reflect that jurisdiction's standards. A quarterly domain coverage audit reviews not just quantity but representativeness: are all relevant geographies, business units, and perspectives adequately represented?

9.2 Model Risk Management

The document classifier (sensitivity labelling) and PII screener are models subject to model risk management. Each has a model card documenting training data, precision/recall on validation sets, known failure modes (e.g., the classifier may misclassify novel document types), and a scheduled review cycle. A misclassification leading to a Confidential document being accessible in a Public corpus is a model risk event requiring root cause analysis.

9.3 Human Approval Gates

Policy-designated document types require human approval before ingestion. The designated document types include: all external regulatory and legal documents; all documents relating to product claims, compliance assertions, or customer commitments; any document flagged by automated screening for borderline PII or sensitivity classification. Human reviewers complete mandatory training on the corpus acceptance criteria before being granted reviewer access.

9.4 Policy Ownership

Corpus policy (which sources are approved, which document types require manual review, quality thresholds, freshness schedules by domain) is owned by the Corpus Governance Board — a cross-functional body including the CDO, Legal, Compliance, and representatives from each major knowledge domain. Policy changes are documented with rationale and reviewed quarterly.

9.5 Traceability

Every AI response produced by a RAG system using this corpus can be traced to the specific document chunks retrieved, the document versions those chunks came from, the corpus snapshot active at the time of the query, and the full ingestion and quality history of each source document. This traceability chain satisfies the core regulatory requirement for AI decision auditability in financial services and healthcare.

9.6 Governance Artefacts

Artefact	Owner	Frequency	Location
Corpus acceptance policy	Corpus Governance Board	Annual review; ad-hoc for regulatory changes	Policy management system
Approved source register	Corpus Governance Board	Updated per new source request	Corpus management system
Domain freshness schedule	Domain Data Stewards	Annual review	Corpus configuration
Document classifier model card	ML Engineering	Per model version	ML model registry
PII screener model card	ML Engineering	Per model version	ML model registry
Corpus health monthly report	Corpus Operations	Monthly	Governance dashboard
Corpus snapshot index	Engineering	Per deployment event	Immutable snapshot store

10. Operational Considerations

10.1 Monitoring and SLOs

Metric	SLO Target	Alerting Threshold	Tool
Ingestion pipeline latency (submission to active index)	≤30 min for auto-approved documents	>2 hours for any document in pipeline	Airflow/workflow monitoring
Human approval queue clearance	100% cleared within 3 business days	Any item >2 days	Workflow SLA alert
Active corpus document count (expected range)	Within ±10% of target range per domain	Outside ±20%	Custom Grafana metric
Stale document rate (% of active corpus past expiry)	<2%	>5%	Daily freshness job metric
PII screener false negative rate (on test set)	<0.5% on golden PII test set	>1% on weekly test run	Automated test job
Corpus quality score (average across active corpus)	≥0.75 composite score	<0.70	Health dashboard

10.2 Logging

All ingestion events are logged with: document_id, source, submission_timestamp, classifier_result, pii_result, quality_score, approval_decision, ingestion_timestamp, version_id. Retrieval events (which documents were retrieved for which query) are logged by the RAG system referencing document_id and version_id. Log retention: 90 days operational; 7 years archive.
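
One way to keep the ingestion log consistent is to define the record as a typed structure and serialise it as one JSON object per line. The field names below are the ones listed above; the example values, the string types chosen for timestamps and results, and the JSON-lines format are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class IngestionEvent:
    document_id: str
    source: str
    submission_timestamp: str   # ISO 8601 string (assumed representation)
    classifier_result: str
    pii_result: str
    quality_score: float
    approval_decision: str
    ingestion_timestamp: str
    version_id: str


def to_log_line(event: IngestionEvent) -> str:
    """Serialise the event as a single JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


example_event = IngestionEvent(
    document_id="doc-1", source="sharepoint-policies",
    submission_timestamp="2026-06-12T09:00:00Z", classifier_result="Internal",
    pii_result="clean", quality_score=0.82, approval_decision="auto_approved",
    ingestion_timestamp="2026-06-12T09:12:00Z", version_id="v3",
)
```

Because retrieval events reference the same `document_id` and `version_id`, the two logs can be joined later to reconstruct which document versions contributed to a given answer.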

10.3 Incident Management

P1: PII-containing document confirmed active in retrieval index. Response: immediate removal, PII breach assessment, regulatory notification if required.
P2: Corpus health score drops below threshold; freshness backlog exceeds 5%. Response: same-day investigation and remediation plan.
P3: Single domain coverage gap identified; document owner non-responsive to freshness alert. Response: next business day follow-up.

10.4 Disaster Recovery

Scenario	RTO	RPO	Recovery Procedure
Vector database corruption	2 hours	Last corpus snapshot (max 1 hour if snapshots are hourly)	Rebuild vector index from document store using last snapshot as the corpus definition
Document store unavailability	4 hours	5 min (S3 replication)	Fail over to cross-region replica; validate document count and metadata integrity
Ingestion pipeline failure	30 min	0 (documents re-submitted from source queue)	Restart pipeline; replay from dead letter queue
Accidental mass document deletion	1 hour	0 (document store versioning retains deleted versions)	Restore deleted documents from version history; rebuild vector index

10.5 Capacity Planning

Vector index storage grows at approximately 1–5 KB per chunk (depending on vector dimensions and metadata). A corpus of 100,000 documents with an average of 50 chunks per document produces 5 million vector records, or roughly 5–25 GB of index storage at that per-chunk size. Plan for 3× storage headroom for re-indexing operations (maintaining the old index while building the new one). Embedding generation is the primary compute cost during bulk ingestion.
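
The sizing arithmetic is simple enough to keep as a small helper. This sketch uses decimal units (1 GB = 1,000,000 KB); the function name is illustrative, and the inputs are the figures quoted in this section.

```python
from typing import Tuple


def index_sizing(documents: int, chunks_per_document: int,
                 kb_per_chunk: float, headroom: float = 3.0
                 ) -> Tuple[int, float, float]:
    """Return (vector_records, raw_index_gb, provisioned_gb_with_headroom)."""
    records = documents * chunks_per_document
    raw_gb = records * kb_per_chunk / 1_000_000
    return records, raw_gb, raw_gb * headroom


low = index_sizing(100_000, 50, kb_per_chunk=1)    # (5000000, 5.0, 15.0)
high = index_sizing(100_000, 50, kb_per_chunk=5)   # (5000000, 25.0, 75.0)
```

So 100,000 documents at 50 chunks each yield 5 million records, 5–25 GB of raw index, and 15–75 GB once the 3× re-indexing headroom is included.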

11. Cost Considerations

11.1 Cost Drivers

Cost Driver	Description	Typical Range
Embedding API costs	Per-token cost for generating embeddings at ingestion and for queries	$0.0001–$0.001 per 1,000 tokens
Vector database hosting	Managed vector DB service or self-hosted infrastructure	$500–$10,000/month depending on corpus size and query volume
PII screener compute	NLP model inference per document screened	$0.001–$0.005 per document
Document classifier compute	ML classification per document	$0.0005–$0.002 per document
Human approval labour	Reviewer time for manual document approvals	Depends on volume and document type mix; 15–30 min per complex document
Storage (document store + vector index)	Scales with corpus size	$100–$2,000/month for 100K–1M documents

11.2 Scaling Risks

Bulk ingestion events (regulatory corpus refresh, large legacy document library import) can generate spike costs for embedding generation — batch and rate-limit large imports
Human approval bottleneck at scale: if document volume grows faster than reviewer capacity, the ingestion SLA degrades and corpus freshness suffers
Vector database re-indexing after embedding model upgrades requires a complete re-embedding of the corpus — cost and time must be planned for each model version change

11.3 Optimisations

Deduplicate near-identical documents before embedding to avoid storing redundant vectors
Use smaller, cheaper embedding models for low-stakes document types; reserve premium embedding models for high-authority documents
Batch ingestion during off-peak hours to benefit from lower spot compute pricing
Cache embeddings for documents that have not changed between refreshes — only re-embed when document content changes
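
Embedding caching can be keyed on a hash of the document content, so unchanged text is never re-embedded. The sketch below is deliberately simple: the normalisation (collapse whitespace, lowercase) only catches exact duplicates up to spacing and case, not genuinely near-identical documents, which would need a similarity technique such as MinHash. Class and function names are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List


def content_hash(text: str) -> str:
    """SHA-256 of the text after collapsing whitespace and lowercasing."""
    normalised = " ".join(text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class EmbeddingCache:
    """Calls the embedder only for content it has not seen before."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the embedder was actually called

    def get(self, text: str) -> List[float]:
        key = content_hash(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

The cache must be invalidated whenever the embedding model changes, since cached vectors from the old model are not comparable with new ones; including the model name in the cache key is one way to do that.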

11.4 Indicative Cost Ranges

Corpus Scale	Monthly Infrastructure Cost	Annual Total (incl. governance labour)
Small (10K documents)	$500–$2,000	$50,000–$150,000
Medium (100K documents)	$3,000–$12,000	$200,000–$500,000
Large (1M+ documents)	$15,000–$60,000	$800,000–$2,500,000

12. Trade-Off Analysis

12.1 Ingestion Approach Options

Option	Strengths	Weaknesses	Best For
Strict manual approval for all documents	Maximum quality and governance control	Very slow ingestion; backlog risk; labour-intensive at scale	High-stakes domains (regulatory, legal, medical) with low document volume
Risk-based tiered approval (manual for high-risk, auto for low-risk)	Balance of speed and control; approvals focused where risk is highest	Requires reliable risk classification; auto-approved documents may contain errors	Most enterprise use cases — the recommended approach
Full automation with retrospective audit	Fast ingestion; no approval bottleneck	Quality and PII risks until retrospective audit catches issues; regulatory risk	Only for low-stakes internal knowledge bases with homogeneous, trusted sources

12.2 Corpus Versioning Strategies

Option	Strengths	Weaknesses	Best For
Continuous live corpus (no explicit versioning)	Always current; simple; no snapshot overhead	Cannot reconstruct past corpus state; no point-in-time audit capability	Low-stakes RAG; no regulatory requirement for auditability
Deployment-event snapshots (this pattern)	Matches AI answer to corpus state at deployment; audit-ready	Answers between snapshots use mixed corpus versions; snapshot storage cost	Regulated use cases; AI systems with infrequent releases
Immutable versioned corpus (new version per ingestion)	Complete audit trail; maximum traceability	Storage cost grows rapidly; complexity in managing version transitions	Highest-stakes domains (medical, legal regulatory) where every answer must be fully reproducible

12.3 Architectural Tensions

Tension	Option A	Option B	Recommended Resolution
Freshness vs. quality	Maximise freshness (low quality bar, fast ingestion)	Maximise quality (high bar, risk of stale approved documents)	Domain-calibrated: regulatory/compliance requires both (escalate if quality + freshness cannot both be met); informational domains prioritise freshness
Coverage breadth vs. quality depth	Ingest broadly from many sources at lower quality threshold	Restrict to fewer high-quality authoritative sources	Start narrow with authoritative sources; expand coverage deliberately as governance capacity allows
Centralised vs. domain-distributed corpus	Single corpus for all AI applications — maximum consistency	Domain-owned corpora per business unit — domain autonomy	Central governance framework (shared standards, tooling, oversight); domain-managed content within the framework

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
PII in active corpus (screening miss)	Low	Critical — privacy breach; regulatory sanction	User-reported AI response containing PII; retrospective audit	Immediate removal; breach assessment; root cause in PII screener
Stale corpus not refreshed (owner unresponsive)	Medium	High — AI answers based on outdated facts	Freshness audit flags; users report incorrect answers	Escalate to corpus governance; assign surrogate owner; remove document if no resolution
Approval queue backlog (reviewers overloaded)	High	Medium — ingestion SLA missed; corpus coverage degrades	Queue depth metric exceeds threshold	Temporary approval threshold relaxation for low-risk document types; engage additional reviewers
Duplicate documents with contradictory content	Medium	Medium — AI retrieves conflicting chunks	Duplicate detection job; inconsistent AI answers	Deduplication review; identify authoritative version; remove or consolidate duplicates
Embedding model deprecation (provider retires model)	Medium	High — entire corpus must be re-embedded	Provider deprecation notice	Planned re-embedding project; test new model recall on golden query set before production cutover
Corpus quality score trend decline	Medium	Medium — gradual AI answer quality degradation	Health dashboard quality trend metric	Investigation of domains with declining scores; source quality improvement; enhanced screening

13.1 Cascading Failure Scenarios

Scenario 1: Regulatory Document Expiry Cascade. A regulatory update requires immediate replacement of 50+ policy documents. The document owners submit new versions simultaneously. The approval queue floods. SLA misses. Reviewers approve documents without full review to clear the backlog. Several documents with errors or inconsistencies are approved and ingested. AI answers begin reflecting the new (partially incorrect) policy content. Detection: increased user-reported answer errors. Resolution: recall affected documents; engage compliance review of all batch-approved documents; implement split approval workflow for bulk regulatory updates.

Scenario 2: Embedding Model Upgrade Failure. An embedding model upgrade doubles retrieval quality on the test set. The corpus is re-embedded with the new model and the previous vector index is retired. Post-deployment monitoring shows that 15% of query categories now return no relevant results: these were edge cases that the old model handled well but the new one misses. The old model is no longer available. Resolution: restore the corpus from snapshot using the old embeddings while emergency fine-tuning is performed; implement A/B shadow evaluation before any future model upgrades.
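
The failure in Scenario 2 is one that an aggregate test-set score can hide, because a large average gain may mask categories that collapse. The sketch below illustrates one way a shadow evaluation might gate a cutover by comparing recall per query category on a golden query set. The `GoldenQuery` type, the function names, and the 0.05 tolerance are hypothetical; the retrieval results are assumed to be collected separately from the old and new indexes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenQuery:
    query_id: str
    category: str
    relevant_ids: frozenset[str]  # documents a correct retrieval should return


def recall_by_category(
    golden: list[GoldenQuery], retrieved: dict[str, list[str]]
) -> dict[str, float]:
    """Average per-query recall within each query category."""
    per_category: dict[str, list[float]] = defaultdict(list)
    for query in golden:
        if not query.relevant_ids:
            continue  # recall is undefined without any relevant documents
        hits = set(retrieved.get(query.query_id, [])) & query.relevant_ids
        per_category[query.category].append(len(hits) / len(query.relevant_ids))
    return {cat: sum(vals) / len(vals) for cat, vals in per_category.items()}


def regressed_categories(
    golden: list[GoldenQuery],
    old_results: dict[str, list[str]],
    new_results: dict[str, list[str]],
    tolerance: float = 0.05,
) -> list[str]:
    """Categories where the new model's recall falls below the old model's by more than the tolerance."""
    old = recall_by_category(golden, old_results)
    new = recall_by_category(golden, new_results)
    return sorted(cat for cat in old if new[cat] < old[cat] - tolerance)


if __name__ == "__main__":
    # Hypothetical golden set and retrieval results for illustration only.
    golden = [
        GoldenQuery("q1", "leave", frozenset({"d1", "d2"})),
        GoldenQuery("q2", "tax", frozenset({"d3"})),
    ]
    old = {"q1": ["d1"], "q2": ["d3"]}
    new = {"q1": ["d1", "d2"], "q2": ["d9"]}
    print(regressed_categories(golden, old, new))  # prints ['tax']
```

In this arrangement the previous index would be retired only once the regression list is empty, which keeps a rollback path available while both models run side by side.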

14. Regulatory Considerations

Regulation	Relevant Clause	Requirement	How Corpus Management Addresses It
APRA CPS 234	§15 (Information Asset Identification)	Information assets must be identified and classified	Every corpus document has a classification label; classification determines access scope
APRA CPS 230	§33 (Information Management)	Documented information management framework for material systems	Corpus governance policy, approved source register, and domain steward ownership constitute the framework
Australian Privacy Act 1988	APP 11.1 (Security of Personal Information)	Take reasonable steps to protect personal information	PII screening at ingestion; classification-scoped access; audit trail for PII-containing document events
EU AI Act	Article 10 (Data and Data Governance)	Training, validation, testing data must be subject to appropriate data governance	Corpus quality scoring, versioning, and provenance documentation satisfy data governance documentation requirements
EU GDPR	Article 17 (Right to Erasure)	Data subjects can request deletion of personal data	Document version history enables identification and removal of all versions containing a specific individual's data
ISO/IEC 42001	§8.2 (AI System Lifecycle)	Organisations must manage the AI system lifecycle including knowledge resources	Corpus lifecycle management (ingestion → quality gating → freshness → retirement) documents how knowledge resources are managed across this lifecycle
NIST AI RMF	MEASURE 2.5 (AI Risk Measurement)	Identify and measure data quality risks	Quality scoring dimensions and corpus health dashboard directly address this requirement
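
For the GDPR Article 17 row, the step of identifying every stored version that contains a specific individual's data might look like the sketch below. It assumes the version history can be read as plain-text records and uses case-insensitive substring matching on identifiers supplied in the erasure request. The `DocumentVersion` type and `versions_mentioning` function are hypothetical, and a production system would more likely match on entities recorded by the PII screener.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentVersion:
    doc_id: str
    version: int
    text: str


def versions_mentioning(
    history: list[DocumentVersion], identifiers: list[str]
) -> list[tuple[str, int]]:
    """Return (doc_id, version) for every stored version containing any identifier."""
    # Blank identifiers are dropped so an empty string cannot match everything.
    needles = [s.strip().casefold() for s in identifiers if s.strip()]
    matches = []
    for record in history:
        haystack = record.text.casefold()
        if any(needle in haystack for needle in needles):
            matches.append((record.doc_id, record.version))
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical version history for illustration only.
    history = [
        DocumentVersion("hr-7", 1, "Contact Jane.Doe@example.com for leave queries"),
        DocumentVersion("hr-7", 2, "Contact the HR desk for leave queries"),
        DocumentVersion("fin-2", 1, "Approved by jane.doe@example.com"),
    ]
    print(versions_mentioning(history, ["jane.doe@example.com"]))
```

Searching superseded versions as well as the active one matters here, because an earlier version may still hold the personal data after the current version has been cleaned.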

15. Reference Implementations

15.1 AWS

Component	AWS Service
Document storage (versioned)	S3 with versioning + Object Lock (WORM for audit)
Document classification	Amazon Comprehend custom classifier
PII screening	Amazon Comprehend PII detection
Human approval workflow	AWS Step Functions + custom React UI
Embedding generation	Amazon Bedrock Titan Embeddings
Vector database	Amazon OpenSearch with vector engine
Freshness audit job	AWS Lambda + EventBridge scheduler
Health dashboard	Amazon Managed Grafana

15.2 Azure

Component	Azure Service
Document storage (versioned)	Azure Blob Storage with versioning + immutability policies
Document classification + PII screening	Azure AI Content Safety + Azure AI Language
Human approval workflow	Azure Logic Apps + Power Apps
Embedding generation	Azure OpenAI Embeddings
Vector database	Azure AI Search
Freshness audit job	Azure Functions + Timer trigger
Health dashboard	Azure Monitor + Grafana

15.3 GCP

Component	GCP Service
Document storage	Cloud Storage with object versioning
Document classification	Vertex AI custom classifier
PII screening	Cloud DLP
Embedding generation	Vertex AI Embeddings
Vector database	Vertex AI Vector Search
Health dashboard	Google Cloud Monitoring + Grafana

15.4 On-Premises

Component	Technology
Document storage	MinIO (S3-compatible) with versioning
Document classification + PII	Hugging Face classification models; Microsoft Presidio for PII
Human approval workflow	Custom Flask/Django app; Jira integration
Embedding generation	Sentence Transformers on GPU servers
Vector database	Qdrant or Weaviate self-hosted
Health dashboard	Prometheus + Grafana

16. Related Patterns

Pattern ID	Pattern Name	Relationship Type	Notes
EAAPL-KNW001	Enterprise Knowledge Graph	Complementary	Corpus documents feed NLP extraction into the knowledge graph; ontology provides domain coverage map for gap analysis
EAAPL-KNW002	Semantic Data Layer	Upstream	Semantic layer ontology defines the knowledge domains the corpus should cover
EAAPL-KNW004	Vector Database Management	Dependency	Corpus management governs content; vector DB management governs the storage and retrieval infrastructure
EAAPL-KNW006	Corpus Quality Assurance	Extension	KNW006 provides the detailed automated QA pipeline that implements the quality gating step in this pattern
EAAPL-RAG001	Retrieval Augmented Generation	Consumer	RAG systems are the primary consumers of the managed corpus
EAAPL-GOV003	AI Data Lifecycle Management	Parent	Corpus management is an application of AI data lifecycle principles

17. Maturity Assessment

Overall Maturity Label: Proven

Dimension	Score (1–5)	Rationale
Technology readiness	4	Document stores, PII scanners, vector databases, and workflow tools are all production-proven and widely deployed
Organisational capability	3	Requires content governance discipline; most organisations with a data governance function can implement with moderate uplift
Standards availability	3	No industry-standard corpus management specification; patterns derived from library science, content management, and RAG practitioner experience
Vendor ecosystem	4	All major cloud providers offer the component services; multiple open-source options for self-hosted deployment
Case evidence	4	Well-documented implementations in financial services, healthcare, and legal; growing body of practitioner experience
Regulatory alignment	5	Directly addresses the data governance, explainability, and auditability requirements of EU AI Act, APRA, and GDPR
Overall	3.8 / 5	Proven pattern with strong regulatory alignment and accessible technology; primary uplift needed in content governance discipline
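
The overall score is consistent with an unweighted mean of the six dimension scores. The table does not state how it was derived, so the equal weighting in this short check is an assumption.

```python
from statistics import mean

# Dimension scores copied from the maturity table above.
dimension_scores = {
    "Technology readiness": 4,
    "Organisational capability": 3,
    "Standards availability": 3,
    "Vendor ecosystem": 4,
    "Case evidence": 4,
    "Regulatory alignment": 5,
}

# Assumption: every dimension carries equal weight.
overall = round(mean(dimension_scores.values()), 1)
print(f"Overall: {overall} / 5")  # prints "Overall: 3.8 / 5"
```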

18. Revision History

Version	Date	Author	Changes
1.0	2026-06-12	EAAPL Editorial Board	Initial publication — covers ingestion governance, quality gating, versioned storage, freshness management, corpus health monitoring, and point-in-time traceability
