EAAPL-KNW002: Semantic Data Layer
Pattern ID: EAAPL-KNW002
Status: Proven
Complexity: High
Tags: knowledge-graph llm traceability high-complexity
Version: 1.0
Last Updated: 2026-06-12
1. Executive Summary
The Semantic Data Layer (SDL) is a governed translation layer that sits between an enterprise's raw data sources and its AI applications. It maps enterprise data to a shared business ontology, enabling natural language queries to be translated into precise, governed data access without requiring AI applications to understand the underlying physical data model.
The SDL solves a critical enterprise AI problem: LLMs given raw schema access (column names, table structures) produce inconsistent query interpretations because the same business concept — "revenue," "active customer," "exposure" — is defined differently across systems. The SDL establishes a single authoritative definition for every business term and maps each source system to it.
For CIOs and CTOs, the SDL delivers three compounding benefits: (1) AI applications become source-system-agnostic, so migrations and system changes do not break AI behaviour; (2) the business glossary enforces consistent AI answers because all AI applications share the same term definitions; (3) semantic caching — re-using translated queries for equivalent natural language questions — reduces LLM API costs by 30–70% in high-volume deployments.
The pattern is most valuable in organisations with ≥5 source systems, cross-domain AI use cases, and active data governance programmes. Implementation requires 3–6 months to reach production maturity.
2. Problem Statement
2.1 Business Problem
Enterprise data is physically distributed across ERP, CRM, data warehouse, operational databases, and SaaS platforms. Each system defines business concepts independently: "customer" in Salesforce may be a legal entity, while "customer" in the billing system may be an individual account, and "customer" in the analytics warehouse may be a household cluster. AI applications trained to answer business questions across these systems produce inconsistent and sometimes contradictory answers because they resolve the same term differently depending on which system they access.
2.2 Technical Problem
LLMs translating natural language questions to SQL or SPARQL queries against raw schemas frequently misinterpret column names, join conditions, and aggregation logic. Without a semantic layer, prompt engineering must embed physical schema details and business rules directly into every AI application — creating brittle integrations that break when schemas change and accumulate contradictory business rule definitions across applications.
2.3 Symptoms
- Two AI applications return different "total revenue" figures for the same time period
- AI-generated SQL queries fail intermittently due to schema changes in source systems
- Business analysts must manually verify every AI-generated data answer against known benchmarks
- Adding a new data source requires updating every AI application's prompt separately
- Data governance cannot locate where a specific business definition is operationalised in AI systems
2.4 Cost of Inaction
- Trust collapse: business users stop relying on AI-generated data insights within weeks of launch when inconsistencies are discovered
- Compliance exposure: regulatory reporting generated by AI applications with inconsistent term definitions produces incorrect submissions
- Engineering debt: each new AI application rebuilds the same business rule logic independently, creating N maintenance obligations
- Data migration risk: any source system migration risks breaking all AI applications that rely on physical schema knowledge
3. Context
3.1 When to Apply
- ≥5 source systems that must be queryable by AI applications using shared business terminology
- Active enterprise data governance programme with a business glossary in progress or completed
- Cross-domain AI use cases (e.g., finance + operations + customer) that require consistent term definitions
- High query volume AI deployments where semantic caching can deliver measurable cost reduction
- Regulatory reporting requirements that demand consistent definitions across AI outputs
3.2 When NOT to Apply
- Single source system AI applications — the abstraction overhead is not justified
- Organisations without a data governance programme — SDL without ontology ownership degrades to an unmaintained mapping layer
- Real-time streaming AI use cases where query translation latency is unacceptable
- Early MVP/PoC phases: validate the AI value proposition first, then add the semantic governance layer once production use is confirmed
3.3 Prerequisites
- Business glossary with ≥80% coverage of key business terms used in target AI use cases
- Data catalogue with documented source system schemas and ownership
- Data steward function with clear domain ownership responsibilities
- API or direct connection access to all source systems that will be mapped to the semantic layer
3.4 Industry Applicability
| Industry | Applicability | Primary Use Case |
|---|---|---|
| Financial Services | Critical | Regulatory reporting consistency, risk metric definitions, customer exposure calculation |
| Healthcare | High | Clinical terminology standardisation (SNOMED, LOINC mapping), patient data access |
| Retail / CPG | High | Product taxonomy, sales metrics consistency, customer segmentation definitions |
| Manufacturing | High | Product hierarchy, BOM relationships, operational KPI definitions |
| Telecommunications | High | Network entity relationships, service definitions, customer hierarchy |
| Government | High | Policy term consistency, inter-agency data sharing, citizen service definitions |
4. Architecture Overview
The Semantic Data Layer is structured into five functional layers that together form a pipeline from business definition to data retrieval.
4.1 Business Glossary Foundation
The business glossary is the authoritative source of business term definitions. It precedes the SDL and must be governed independently. Each glossary entry specifies: term name, canonical definition, synonyms, related terms, owning business domain, and the data steward responsible for maintaining the definition. The SDL treats the business glossary as read-only input — it does not own definitions, it operationalises them.
4.2 Ontology Layer
The ontology translates the business glossary into a machine-readable formal specification using OWL (Web Ontology Language) or a property graph schema. Business entities become ontology classes. Relationships between entities become ontology properties. Business metrics and derived measures become calculated properties with defined formulas. The ontology is maintained by the ontology governance committee and versioned in source control. Schema changes go through a formal change control process with impact analysis.
4.3 Semantic Mapping Layer
The semantic mapping layer connects the ontology to the physical source systems. For each ontology class and property, a mapping definition specifies: source system, schema/table/column path, transformation logic (type casts, aggregations, filters), validity constraints, and effective date range. Mappings are authored by data engineers in collaboration with domain data stewards. They are stored in a mapping registry — a versioned repository of mapping definitions that can be audited and rolled back.
Automated mapping suggestions use LLM-assisted column name analysis to propose initial mappings for human review, accelerating the mapping authoring process. All automated suggestions require human validation before activation. Mapping confidence is tracked: manually authored and validated mappings are marked HIGH confidence; LLM-suggested and human-validated are MEDIUM; any auto-activated mappings would be LOW (not permitted in production).
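As a minimal sketch of how a mapping registry entry and the confidence rule above might be represented, assuming a simple in-memory model: the class names, field names, and the example ERP column path are hypothetical and are not prescribed by this pattern.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Confidence(Enum):
    HIGH = "HIGH"      # manually authored and validated
    MEDIUM = "MEDIUM"  # LLM-suggested, then human-validated
    LOW = "LOW"        # auto-activated; not permitted in production


@dataclass(frozen=True)
class MappingDefinition:
    concept: str                  # ontology class or property
    source_system: str
    column_path: str              # schema.table.column
    transformation: str           # cast, aggregation, or filter expression
    effective_from: date
    effective_to: Optional[date]  # None means open-ended
    confidence: Confidence
    version: int


def can_activate(mapping: MappingDefinition, on: date) -> bool:
    """Return True if the mapping may serve production queries on the given date."""
    if mapping.confidence is Confidence.LOW:
        return False
    if on < mapping.effective_from:
        return False
    return mapping.effective_to is None or on <= mapping.effective_to


# Hypothetical example entry (illustrative values only)
gross_revenue_erp = MappingDefinition(
    concept="FinancialMetric.GrossRevenue",
    source_system="erp",
    column_path="fin.invoice_lines.gross_amount",
    transformation="SUM(gross_amount)",
    effective_from=date(2025, 1, 1),
    effective_to=None,
    confidence=Confidence.MEDIUM,
    version=3,
)
```

Keeping the activation rule in one function means the production gate on LOW-confidence mappings is enforced in a single place rather than in each caller.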
4.4 Natural Language to Query Translation
When an AI application or end user poses a natural language question, the SDL's query translation component processes it in three stages.
Semantic Disambiguation resolves ambiguous terms by referencing the ontology. If the question uses "revenue," disambiguation resolves it to the canonical FinancialMetric.GrossRevenue definition, including its precise calculation formula and applicable source systems. The disambiguated intent is represented as a structured semantic query.
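The disambiguation step can be illustrated with a simple synonym lookup. This is a sketch under stated assumptions: the synonym table, the `SemanticIntent` type, and the longest-match rule are hypothetical, and a production module would more likely combine vector similarity with ontology lookup as listed in the Components table.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ontology excerpt: canonical concept -> accepted synonyms
ONTOLOGY_SYNONYMS = {
    "FinancialMetric.GrossRevenue": ["gross revenue", "revenue", "total revenue"],
    "Customer.ActiveCustomer": ["active customer", "active customers"],
}


@dataclass(frozen=True)
class SemanticIntent:
    concept: str       # canonical ontology concept
    matched_term: str  # the phrase in the question that resolved to it


def disambiguate(question: str) -> Optional[SemanticIntent]:
    """Resolve the longest synonym found in the question to its canonical concept."""
    text = question.lower()
    best: Optional[SemanticIntent] = None
    for concept, synonyms in ONTOLOGY_SYNONYMS.items():
        for term in synonyms:
            if term in text and (best is None or len(term) > len(best.matched_term)):
                best = SemanticIntent(concept=concept, matched_term=term)
    return best  # None signals an unmapped term
```

Returning `None` for an unmapped term gives the caller a clear signal to respond with "term not understood" rather than guessing at a concept.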
Query Generation translates the structured semantic intent into an executable query (SQL, SPARQL, GraphQL, or a graph traversal) against the appropriate source system. The generated query uses the mapping definitions to navigate from ontology concepts to physical schema paths. Query templates for common patterns (aggregations, time-series, entity lookups) are pre-verified by data engineers and reused wherever possible to avoid LLM query hallucination.
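A template-based generator might look like the following sketch. The mapping entry, table and column names, and the `sum_over_period` template are hypothetical; the `?` placeholders assume a DB-API driver using the qmark parameter style, and other drivers use different placeholders.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PhysicalMapping:
    table: str
    column: str
    date_column: str


# Hypothetical mapping registry excerpt: ontology concept -> physical schema path
MAPPINGS = {
    "FinancialMetric.GrossRevenue": PhysicalMapping(
        table="fin.invoice_lines", column="gross_amount", date_column="invoice_date"
    ),
}

# Pre-verified templates; identifiers come only from the registry, never from user text
TEMPLATES = {
    "sum_over_period": (
        "SELECT SUM({column}) AS value FROM {table} "
        "WHERE {date_column} >= ? AND {date_column} < ?"
    ),
}


def generate_query(concept: str, pattern: str, start: str, end: str) -> Tuple[str, tuple]:
    """Fill a verified template from the mapping and bind user values as parameters."""
    mapping = MAPPINGS[concept]
    sql = TEMPLATES[pattern].format(
        table=mapping.table, column=mapping.column, date_column=mapping.date_column
    )
    return sql, (start, end)
```

Because user-supplied values are only ever bound as parameters, the template path involves no LLM call and leaves no room for free-form SQL.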
Result Enrichment annotates the query result with semantic metadata: which ontology concepts were queried, which source systems were accessed, which mapping versions were used, and the data freshness timestamp. This metadata is returned to the calling application and can be surfaced to end users or logged for audit.
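The enrichment step amounts to wrapping the raw rows in a metadata envelope. The sketch below assumes hypothetical `MappingRef` and `EnrichedResult` types; the field names are illustrative rather than a defined SDL contract.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MappingRef:
    concept: str
    source_system: str
    version: int


@dataclass(frozen=True)
class EnrichedResult:
    rows: list
    concepts: list          # ontology concepts queried
    source_systems: list    # source systems accessed
    mapping_versions: dict  # concept -> mapping version used
    data_as_of: datetime    # data freshness timestamp


def enrich(rows: list, mappings_used: list, data_as_of: datetime) -> EnrichedResult:
    """Attach semantic provenance to a raw result set."""
    return EnrichedResult(
        rows=rows,
        concepts=sorted({m.concept for m in mappings_used}),
        source_systems=sorted({m.source_system for m in mappings_used}),
        mapping_versions={m.concept: m.version for m in mappings_used},
        data_as_of=data_as_of,
    )
```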
4.5 Semantic Caching Layer
Translated queries (NL input → structured query) are cached using a dual-key strategy: (1) exact match on the normalised NL question string; (2) semantic equivalence via embedding similarity comparison against cached question embeddings. When a semantically equivalent question is detected, the cached translated query is returned without re-invoking the LLM translation step.
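The dual-key lookup can be sketched as follows. The embedding here is a toy bag-of-words stand-in, used only so the example runs without external services; a real deployment would use an embedding model and a vector index. The 0.8 similarity threshold and all names are assumptions.

```python
import math
import re
from collections import Counter
from typing import Optional


def normalise(question: str) -> str:
    """Lower-case and strip punctuation so trivially different strings match exactly."""
    return " ".join(re.findall(r"[a-z0-9]+", question.lower()))


def toy_embed(question: str) -> Counter:
    """Stand-in embedding: token counts. Replace with a real embedding model."""
    return Counter(normalise(question).split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self._threshold = threshold
        self._exact = {}    # normalised question -> translated query
        self._entries = []  # (embedding, translated query)

    def put(self, question: str, translated_query: str) -> None:
        self._exact[normalise(question)] = translated_query
        self._entries.append((toy_embed(question), translated_query))

    def get(self, question: str) -> Optional[str]:
        # Key 1: exact match on the normalised question string
        key = normalise(question)
        if key in self._exact:
            return self._exact[key]
        # Key 2: nearest cached embedding, accepted only above the threshold
        embedding = toy_embed(question)
        best_score, best_query = 0.0, None
        for cached_embedding, query in self._entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_query = score, query
        return best_query if best_score >= self._threshold else None
```

The threshold trades hit rate against the risk of returning a translation for a question that only looks similar, so it is worth tuning against the golden question set.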
Cache invalidation is triggered by: ontology changes (any change affecting the query's concept set); mapping changes for the source systems accessed; cache TTL expiry (configurable per domain based on data freshness requirements). Cache hit rates of 30–70% are typical in production deployments with diverse but patterned question sets.
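The three invalidation triggers can be sketched against a small in-memory store. The entry layout and method names are hypothetical; the point is that each entry records the concepts and source systems it depends on, so change events can evict only the affected entries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheEntry:
    translated_query: str
    concepts: frozenset        # ontology concepts the query depends on
    source_systems: frozenset  # source systems the query accesses
    created_at: float          # epoch seconds
    ttl_seconds: float         # configured per domain


class TranslationCache:
    def __init__(self):
        self._entries = {}

    def put(self, question_key: str, entry: CacheEntry) -> None:
        self._entries[question_key] = entry

    def get(self, question_key: str, now: float) -> Optional[str]:
        entry = self._entries.get(question_key)
        if entry is None:
            return None
        if now - entry.created_at >= entry.ttl_seconds:  # TTL expiry
            del self._entries[question_key]
            return None
        return entry.translated_query

    def _evict(self, predicate) -> int:
        stale = [key for key, entry in self._entries.items() if predicate(entry)]
        for key in stale:
            del self._entries[key]
        return len(stale)

    def on_ontology_change(self, changed_concepts) -> int:
        """Evict entries whose concept set overlaps the changed concepts."""
        changed = frozenset(changed_concepts)
        return self._evict(lambda entry: bool(entry.concepts & changed))

    def on_mapping_change(self, source_system: str) -> int:
        """Evict entries that access the source system whose mapping changed."""
        return self._evict(lambda entry: source_system in entry.source_systems)
```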
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Business Glossary | Governance | Authoritative business term definitions; owned by data governance, read by SDL | Collibra, Alation, Atlan, Microsoft Purview, custom metadata store | Critical |
| Ontology Engine | Governance | OWL or property graph schema; formal machine-readable term definitions; version control | Protégé (OWL), custom JSON-LD registry, dbt semantic layer, AtScale | High |
| Mapping Registry | Storage | Versioned ontology-to-physical-schema mappings; source authoring and change history | Custom PostgreSQL registry, dbt metrics layer, Cube.dev semantic layer | Critical |
| LLM Mapping Suggestion | AI | Propose initial mappings via column name/description analysis | OpenAI GPT-4o, Anthropic Claude, custom fine-tuned model | Medium |
| Query Translation Engine | Processing | NL → ontology intent → executable query generation | LangChain SQL agent, Vanna.ai, Microsoft Semantic Kernel, custom | Critical |
| Semantic Disambiguation Module | Processing | Resolve NL terms to canonical ontology concepts; handle synonyms and context | Vector similarity + ontology lookup, LLM with ontology context injection | High |
| Semantic Cache | Storage | Cache NL queries and their translated forms; semantic equivalence matching | Redis + pgvector, Weaviate, custom embedding cache | Medium |
| Result Enricher | Processing | Annotate query results with semantic metadata and provenance | Custom middleware layer | High |
| Impact Analyser | Governance | Detect source system schema changes; assess impact on active mappings | Custom schema diff tool, Monte Carlo data observability, Great Expectations | Medium |
7. Data Flow
7.1 Primary Data Flow — Natural Language Query to Result
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | End User / AI App | Submits natural language question | NL question string |
| 2 | Semantic Cache | Checks exact and semantic match against cache | Cache hit → use cached translated query and skip to step 6; miss → continue |
| 3 | Semantic Disambiguation | Resolves NL terms against ontology; identifies concept intent | Structured semantic intent with resolved ontology URIs |
| 4 | Mapping Registry | Looks up physical schema paths for resolved concepts | Mapping definitions for each ontology concept |
| 5 | Query Generator | Produces executable query from semantic intent + mappings | SQL / SPARQL / GraphQL query |
| 6 | Source System | Executes query; returns raw result set | Raw data result |
| 7 | Result Enricher | Annotates result with ontology concept labels, source system, mapping version, freshness | Enriched result set with semantic metadata |
| 8 | Semantic Cache | Stores NL → query mapping with embedding for future hits | Cache entry written |
| 9 | Calling Application | Receives enriched result | Data answer with full semantic provenance |
7.2 Error Flow
| Error | Detection | Recovery | Escalation |
|---|---|---|---|
| Ontology term not found (unmapped NL term) | Disambiguation returns null mapping | Return "term not understood" with closest suggestions; log unmapped term | Ontology backlog: data steward creates new term |
| Source system query failure | Query executor exception | Retry ×2; return partial result with availability note; log failure | Alert data engineering; flag source system health |
| Mapping mismatch (schema drift in source) | Result validation fails post-enrichment; impact analyser detects column removal | Deactivate affected mapping; return "data unavailable" with reason; route to steward | Immediate data steward notification for mapping repair |
| Semantic cache stale (post-ontology change) | Cache invalidation job triggered by ontology change event | Flush affected cache entries; force re-translation | Operational log; no escalation required if automated |
| Query translation hallucination (LLM produces invalid SQL) | SQL validation before execution; query explainer check | Reject invalid query; fall back to template-based generation if available | Log hallucination instance; flag for translation model review |
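
The last row of the error flow, validating LLM-generated SQL before execution and falling back to a template, can be sketched as below. This example uses SQLite's `EXPLAIN` against an empty in-memory copy of the schema as a stand-in parser; that choice, the keyword list, and the function names are assumptions, and a real deployment would validate against the target system's own dialect.

```python
import sqlite3
from typing import Optional

# Conservative keyword screen; a column literally named "update" would be rejected too
FORBIDDEN = ("insert", "update", "delete", "drop", "alter", "create")


def is_safe_select(sql: str, schema_ddl: str) -> bool:
    """Accept only a single, read-only SELECT that compiles against the schema."""
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:  # more than one statement
        return False
    tokens = stripped.lower().split()
    if not tokens or tokens[0] != "select":
        return False
    if any(word in tokens for word in FORBIDDEN):
        return False
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)          # empty tables, structure only
        conn.execute("EXPLAIN " + stripped)     # compiles the statement; raises on bad SQL
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


def choose_query(llm_sql: str, template_sql: Optional[str], schema_ddl: str) -> Optional[str]:
    """Use the LLM query if valid; otherwise fall back to the template, if any."""
    if is_safe_select(llm_sql, schema_ddl):
        return llm_sql
    return template_sql
```

A `None` return from `choose_query` means neither path produced a usable query, which is the point at which the hallucination instance would be logged and the request rejected.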
8. Security Considerations
8.1 Authentication and Authorisation
The SDL query API enforces attribute-based access control (ABAC): the calling application's identity determines which ontology concepts and source systems it is permitted to query. A concept-level permission model prevents an AI application authorised for "customer contact information" from accessing "customer financial information" even if those concepts share a source table. OAuth 2.0 client credentials flow is used for service-to-service authentication.
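A simplified sketch of the concept-level check follows. It reduces the caller's attributes to a set of granted ontology concepts, which is a deliberate simplification of full ABAC; the caller identities and concept names are hypothetical.

```python
# Hypothetical grants: caller identity -> ontology concepts it may query
CONCEPT_GRANTS = {
    "support-assistant": frozenset({"Customer.ContactInformation"}),
    "finance-copilot": frozenset(
        {"Customer.ContactInformation", "Customer.FinancialInformation"}
    ),
}


def denied_concepts(caller_id: str, requested) -> set:
    """Return the requested concepts the caller is not permitted to query."""
    granted = CONCEPT_GRANTS.get(caller_id, frozenset())  # unknown callers get nothing
    return set(requested) - granted


def authorise(caller_id: str, requested) -> bool:
    """Permit the query only if every requested concept is granted."""
    return not denied_concepts(caller_id, requested)
```

Because the check operates on concepts rather than tables, two concepts that share a source table can still carry different permissions.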
8.2 Secrets Management
Source system connection credentials are stored in a secrets vault. The SDL query execution engine retrieves credentials at query time using short-lived dynamic secrets where the source system supports it (e.g., database IAM authentication). Connection strings are never stored in the mapping registry or logged.
8.3 Data Classification
Each ontology concept is tagged with a data classification level inherited from the most sensitive source system attribute mapped to it. Query results inherit the highest classification level of any concept in the query. Results above a calling application's authorised classification are blocked at the result enricher with an access denied response and audit log entry.
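The inheritance rule is a maximum taken twice: once over the attributes mapped to a concept, and once over the concepts in a query. The sketch below assumes a four-level classification scheme; the level names are illustrative and not defined by this pattern.

```python
from enum import IntEnum


class Classification(IntEnum):
    # Hypothetical levels, ordered from least to most sensitive
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def concept_classification(attribute_levels) -> Classification:
    """A concept inherits the most sensitive level of its mapped attributes."""
    return max(attribute_levels)


def result_classification(concept_levels) -> Classification:
    """A result inherits the highest level of any concept in the query."""
    return max(concept_levels)


def is_blocked(result_level: Classification, caller_clearance: Classification) -> bool:
    """Block results classified above the caller's authorised level."""
    return result_level > caller_clearance
```

Both functions raise `ValueError` on an empty input rather than defaulting to a low level, so an unmapped or unclassified concept cannot silently pass as public.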
8.4 Encryption
All inter-component communication uses TLS 1.3. Semantic cache entries (NL questions, their embeddings, and the translated queries) are encrypted at rest. The mapping registry is encrypted at rest. Query logs, which may contain sensitive terms, are encrypted and access-restricted to authorised operations teams.
8.5 Auditability
Every NL query, resolved semantic intent, generated physical query, and result return is logged with: caller identity, timestamp, ontology concepts accessed, source systems queried, mapping versions used, and data classification of the result. These logs provide a complete lineage record: from the AI application's question to the physical data rows accessed, with every translation step documented.
8.6 OWASP LLM Top 10 Mapping
| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM01 Prompt Injection | Adversarial NL query designed to inject SQL or manipulate query generator | Parameterised query generation (LLM produces intent, not raw SQL); SQL injection prevention at execution layer |
| LLM02 Insecure Output Handling | LLM-generated query passed directly to database execution | Generated query validated by SQL parser before execution; reject queries with DML statements (INSERT/UPDATE/DELETE) |
| LLM03 Training Data Poisoning | Mapping suggestion LLM trained on data with incorrect mappings | Human validation required for all LLM-suggested mappings; training data provenance tracked |
| LLM04 Model Denial of Service | Adversarially complex NL queries generating expensive database queries | Query cost estimation before execution; maximum query cost limit; rate limiting per caller |
| LLM05 Supply Chain Vulnerabilities | Query translation LLM dependency could be compromised | Pinned model versions; model integrity verification; ability to swap translation model without architecture change |
| LLM06 Sensitive Information Disclosure | SDL translates NL query that inadvertently exposes restricted data | ABAC at concept level; classification check before result return; field-level masking for PII concepts |
| LLM07 Insecure Plugin Design | Source system connectors as plugins to the query engine | Connector authentication validated; schema-scoped connector permissions; connector code review |
| LLM08 Excessive Agency | Query translation agent could autonomously execute DML if not constrained | Read-only database connections for SDL execution; DML blocked at connection level |
| LLM09 Overreliance | Business users over-trust AI-generated data answers from SDL | Semantic metadata surfaced with every result: data freshness, mapping confidence, source system |
| LLM10 Model Theft | Ontology and mapping registry encode proprietary business logic | Access-controlled APIs; mapping registry not exposed externally; no bulk export endpoints |
9. Governance Considerations
9.1 Responsible AI
The SDL makes AI data access deterministic and governed, which is itself a responsible AI control. However, the ontology encodes business choices (e.g., which revenue calculation formula is canonical) that may disadvantage certain business units or stakeholders. An ontology review process must include representation from all affected business domains. Decisions that favour one domain's definition over another must be documented with rationale.
9.2 Model Risk Management
The query translation LLM is a model risk management artefact. Its performance is measured on a golden question set (questions with known correct SQL outputs). Precision and recall on the golden set are monitored per query category. A model validation report is produced when the translation model is upgraded or its prompt is substantially changed.
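A golden-set evaluation job could be as simple as the following sketch. It measures per-category exact-match accuracy after light normalisation, which is a simplification of the precision and recall monitoring described above; the types, the normalisation rule, and the stub translator in the usage note are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class GoldenQuestion:
    category: str      # e.g. "aggregation", "time-series", "entity-lookup"
    question: str
    expected_sql: str  # known correct SQL output


def normalise_sql(sql: str) -> str:
    """Ignore case, whitespace, and trailing semicolons (also lower-cases literals)."""
    return " ".join(sql.lower().replace(";", " ").split())


def accuracy_by_category(
    golden: Iterable[GoldenQuestion], translate: Callable[[str], str]
) -> Dict[str, float]:
    """Fraction of golden questions per category whose translation matches exactly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in golden:
        total[item.category] += 1
        if normalise_sql(translate(item.question)) == normalise_sql(item.expected_sql):
            correct[item.category] += 1
    return {category: correct[category] / total[category] for category in total}
```

In use, `translate` would wrap the query translation engine; comparing result sets instead of SQL text would be a more forgiving check, since two different queries can be equally correct.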
9.3 Human Approval Gates
All ontology changes require data steward approval from the affected domain plus sign-off from the ontology governance committee. Mapping changes for source systems that feed regulatory reporting require a secondary approval from the compliance team. The mapping staging environment allows testing NL queries against a proposed mapping change before production activation, with results compared against a golden answer set.
9.4 Policy Ownership
The business glossary is owned by the Chief Data Officer's organisation. Ontology is jointly owned by the data architecture function and domain data stewards. Mapping definitions are owned by the data engineering function with domain data steward sign-off. Query translation model prompts and configuration are owned by the AI engineering function. Changes in any of these domains trigger impact analysis in downstream layers.
9.5 Traceability
The SDL maintains a complete provenance record for every query result: which NL question was asked → which ontology concepts were resolved → which mappings were used (with version) → which source tables/columns were accessed → which rows were returned → what the data freshness was at time of query. This provenance record satisfies regulatory requirements for AI decision auditability and supports data lineage documentation in the data catalogue.
9.6 Governance Artefacts
| Artefact | Owner | Frequency | Location |
|---|---|---|---|
| Business glossary | CDO / Domain Data Stewards | Continuously maintained | Data governance platform (Collibra/Alation) |
| Ontology version history | Data Architecture + Governance Committee | Per change | Version-controlled ontology repository |
| Mapping registry | Data Engineering + Domain Stewards | Per change | Versioned mapping registry database |
| Golden question set | Data Engineering + Business Analysts | Quarterly refresh | Test suite in CI/CD pipeline |
| Translation model performance report | AI Engineering | Per model version | ML model registry |
| Query audit log | Operations | Continuous | Immutable audit log store |
10. Operational Considerations
10.1 Monitoring and SLOs
| Metric | SLO Target | Alerting Threshold | Tool |
|---|---|---|---|
| NL query end-to-end latency p95 | ≤3 seconds (cache miss); ≤200ms (cache hit) | >5s p95 over 5 min | Prometheus + Grafana |
| Semantic cache hit rate | ≥40% in steady-state production | <20% over 1 hour | Custom metric |
| Translation accuracy on golden set | ≥90% precision on validated golden questions | <85% precision | Scheduled evaluation job |
| Mapping coverage (% active ontology concepts with valid mappings) | ≥95% coverage | <90% coverage | Data quality dashboard |
| Source system query failure rate | <1% of translated queries fail execution | >5% failure rate | Query execution metrics |
| Stale mapping alert rate | 0 unacknowledged stale mapping alerts >24 hours | Any unacknowledged alert >24h | Incident management tool |
10.2 Logging
Query logs are structured JSON: {timestamp, caller_id, nl_question_hash, resolved_concepts, mappings_used, source_systems, query_execution_ms, result_row_count, data_classification, cache_hit}. NL question text is hashed in operational logs; raw text is stored in the separate audit log (access-restricted). Audit logs are immutable and retained per regulatory requirements.
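A sketch of building one operational log line with the fields listed above. SHA-256 is an assumed choice of hash, and the function name and argument order are hypothetical; the raw question text is deliberately kept out of the record.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def build_query_log(
    caller_id: str,
    nl_question: str,
    resolved_concepts: list,
    mappings_used: list,
    source_systems: list,
    query_execution_ms: int,
    result_row_count: int,
    data_classification: str,
    cache_hit: bool,
    timestamp: Optional[datetime] = None,
) -> str:
    """Return one structured JSON log line; the NL question is stored only as a hash."""
    ts = timestamp or datetime.now(timezone.utc)
    record = {
        "timestamp": ts.isoformat(),
        "caller_id": caller_id,
        "nl_question_hash": hashlib.sha256(nl_question.encode("utf-8")).hexdigest(),
        "resolved_concepts": resolved_concepts,
        "mappings_used": mappings_used,
        "source_systems": source_systems,
        "query_execution_ms": query_execution_ms,
        "result_row_count": result_row_count,
        "data_classification": data_classification,
        "cache_hit": cache_hit,
    }
    return json.dumps(record, sort_keys=True)
```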
10.3 Incident Management
A P1 incident is declared when the SDL query API is unavailable or when the translation accuracy rate drops below 75% on the golden set. The on-call data engineering team has a 15-minute response SLA. A P2 incident covers mapping staleness affecting regulatory reporting concepts — 2-hour response SLA. Mapping outages affecting non-critical domains are P3 with next-business-day response.
10.4 Disaster Recovery
| Scenario | RTO | RPO | Recovery Procedure |
|---|---|---|---|
| Query translation service failure | 5 min (restart; stateless) | N/A (stateless) | Container restart; validate with health check query |
| Mapping registry unavailable | 30 min | 5 min (replica promotion) | Promote read replica; validate mapping count |
| Semantic cache corruption | 15 min | 0 (cache is reconstructable) | Flush cache; warm from query log replay |
| Business glossary platform outage | SDL continues with cached ontology snapshot | Last cached snapshot (max 1 hour) | SDL reads ontology snapshot; alert data governance to restore glossary platform |
10.5 Capacity Planning
Query translation is compute-intensive on cache misses because each miss invokes the LLM. Cache hit rates towards the upper end of the typical 30–70% range keep the average compute cost manageable. Plan for bursty LLM API quota: cache miss spikes occur when new question types are introduced (e.g., a new AI application launches). Semantic cache storage grows at approximately 1 KB per cached entry; 1 million cached entries require ~1 GB, which is negligible.
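The capacity arithmetic can be made explicit with two small helpers. The function names are hypothetical, and the query volume, hit rate, and per-miss cost in the usage note are illustrative inputs, not measurements.

```python
def monthly_translation_cost(
    queries_per_month: int, cache_hit_rate: float, cost_per_miss: float
) -> float:
    """LLM translation cost: only cache misses invoke the LLM."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return queries_per_month * (1.0 - cache_hit_rate) * cost_per_miss


def cache_storage_gb(entries: int, bytes_per_entry: int = 1024) -> float:
    """Approximate cache size in decimal gigabytes."""
    return entries * bytes_per_entry / 1e9
```

For example, 1,000,000 queries per month at a 40% hit rate and an assumed $0.01 per miss gives 600,000 misses, or about $6,000 per month, and 1,000,000 entries at 1 KB each occupy about 1 GB. `bytes_per_entry` is a parameter because the size of a stored embedding varies with the embedding model.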
11. Cost Considerations
11.1 Cost Drivers
| Cost Driver | Description | Typical Range |
|---|---|---|
| LLM API calls for query translation | Per-query LLM cost for cache misses; depends on cache hit rate | $0.002–$0.02 per cache miss query |
| Semantic cache infrastructure | Vector similarity cache (Redis + embedding index) | $500–$3,000/month |
| Mapping registry database | PostgreSQL or equivalent; modest size but high availability required | $200–$1,000/month |
| Data steward and ontology governance labour | The dominant ongoing cost — human expertise to maintain mappings | 2–5 FTE (shared across data governance programme) |
| Source system query compute | Depends on source system pricing; SDL adds query overhead | Variable; monitor via query cost analysis |
| Business glossary platform | Commercial platforms (Collibra, Alation) carry significant licence cost | $50,000–$500,000/year depending on scale |
11.2 Scaling Risks
- LLM translation cost for unique queries grows linearly without cache; organisations with highly diverse question sets see lower cache hit rates
- Ontology and mapping maintenance labour scales with organisational complexity, not query volume — a large organisation needs proportionally more data stewards regardless of query load
- Source system schema drift creates mapping maintenance burden that grows with the number of source systems and their change velocity
11.3 Optimisations
- Semantic caching is the single highest-ROI optimisation — invest in cache quality and TTL tuning before any other cost reduction effort
- Template-based query generation for the most common query patterns avoids LLM invocation entirely for those patterns
- Lightweight open-source embedding models can replace commercial embeddings for cache similarity matching at substantially lower cost
- Shared ontology governance across AI programmes (not SDL-specific) distributes the data steward cost across multiple value streams
11.4 Indicative Cost Ranges
| Deployment Scale | Monthly Infrastructure Cost | Annual Total (incl. governance labour) |
|---|---|---|
| Single domain, 3 source systems | $2,000–$5,000 | $150,000–$300,000 |
| Multi-domain, 10 source systems | $8,000–$20,000 | $500,000–$1,200,000 |
| Enterprise-wide, 50+ source systems | $30,000–$80,000 | $2,000,000–$5,000,000 |
12. Trade-Off Analysis
12.1 Semantic Layer Technology Options
| Option | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Custom SDL (this pattern) | Maximum control; integrates with existing data catalogue; extensible | High build and maintenance cost; requires strong data engineering capability | Large enterprises with diverse source systems and active data governance |
| dbt Semantic Layer | Native integration with dbt-managed data warehouse; strong SQL ecosystem | Limited to SQL sources; no NL query translation built in; weak ontology expressiveness | Data warehouse-centric organisations already using dbt |
| Cube.dev / AtScale | Managed semantic layer; built-in caching; BI tool integration | Commercial; primarily metric-focused; limited relationship graph expressiveness | Analytics-heavy use cases; BI + AI hybrid access patterns |
| Microsoft Fabric Semantic Model | Deep Azure/Power BI integration; enterprise support | Azure lock-in; Power BI-centric; limited graph relationship support | Microsoft-native organisations |
12.2 Architectural Tensions
| Tension | Option A | Option B | Recommended Resolution |
|---|---|---|---|
| Ontology completeness vs. maintenance burden | Complete ontology covering all business terms before any AI deployment | Minimal ontology covering only terms needed for current AI use cases | Incremental ontology: start with the concepts needed for the first 2–3 AI use cases; expand as demand requires; never build ahead of usage |
| Query translation accuracy vs. latency | Thorough multi-step LLM translation with disambiguation for accuracy | Single-pass template matching for low latency | Hybrid: templates for high-frequency, well-understood query patterns; LLM for novel queries; cache bridges the gap |
| Semantic cache freshness vs. hit rate | Short TTL for freshness (lower hit rate, higher cost) | Long TTL for high hit rate (risk of stale results) | Domain-calibrated TTL: fast-changing operational metrics get short TTL; stable reference data gets long TTL |
| Centralised vs. federated SDL | Single centralised SDL for enterprise-wide consistency | Domain-federated semantic layers with federation standards | Centralised ontology and business glossary; domain-federated mapping registries for source system ownership |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Ontology definition conflict (two domains define same term differently) | High | High — SDL produces contradictory answers depending on resolver | Data steward conflict reports; inconsistent AI answers | Ontology governance committee arbitration; canonical definition documented; alternate terms for domain-specific variants |
| Mapping staleness (source schema change breaks mapping) | High | High for affected concepts; scoped to specific queries | Impact analyser detects schema drift; query failures for affected mappings | Mapping repair by data engineer; automated schema drift alerts minimise time to detection |
| Semantic cache poisoning (incorrect translation cached) | Low | Medium — affects all queries hitting that cache entry | Golden set regression; user-reported incorrect answers | Flush affected cache entries; identify root cause (hallucination or mapping error); fix translation |
| Translation LLM unavailability | Medium | High if no fallback — all cache-miss queries fail | LLM API health check; query failure rate spike | Fallback to template-only translation for known query patterns; queue novel queries for retry |
| Business glossary platform outage | Low | Medium — SDL continues with snapshot; new glossary updates not reflected | Glossary platform health check | SDL operates from last cached ontology snapshot; alert data governance; acceptable degradation for max 4 hours |
13.1 Cascading Failure Scenarios
Scenario 1: Mass Mapping Invalidation. A source ERP system undergoes a major version upgrade, changing 40% of table/column names. The impact analyser flags 312 mapping invalidations simultaneously. The human review queue floods beyond capacity. The SDL switches to "degraded mode" — only serving queries against concepts with valid mappings, returning "data temporarily unavailable" for others. Resolution requires a war room with data engineering and ERP administrators; a mapping batch repair tool is executed to accelerate the re-mapping process.
Scenario 2: Ontology Terminology Change Cascade. The data governance committee renames a core concept ("Client" → "Customer") to align with a new CRM system. The SDL flushes all cache entries containing the old term. All AI applications must update their prompts to use the new term. In the interim, AI applications asking about "clients" receive no results because the old ontology term is deprecated. The lesson: ontology term renames require a deprecation period where both old and new terms are accepted, with a migration window before the old term is removed.
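The deprecation period from Scenario 2 can be sketched as an alias table consulted during term resolution. The `TermAlias` type, the function signature, and the example dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class TermAlias:
    old_term: str
    new_term: str
    removed_on: date  # end of the migration window


def resolve_term(
    term: str, canonical_terms: set, aliases: Iterable[TermAlias], today: date
) -> Tuple[Optional[str], Optional[str]]:
    """Return (canonical term or None, deprecation warning or None)."""
    if term in canonical_terms:
        return term, None
    for alias in aliases:
        if alias.old_term == term and today < alias.removed_on:
            warning = (
                f"'{term}' is deprecated; use '{alias.new_term}' "
                f"before {alias.removed_on.isoformat()}"
            )
            return alias.new_term, warning
    return None, None  # unknown term, or the migration window has closed
```

During the window, callers asking about the old term still receive results along with a warning they can surface or log, which gives application owners time to update their prompts.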
14. Regulatory Considerations
| Regulation | Relevant Clause | Requirement | How SDL Addresses It |
|---|---|---|---|
| APRA CPS 230 | §36–§38 (Service Continuity) | Critical data services must have documented availability and recovery plans | SDL availability SLOs, DR procedures, and degraded-mode operation documented |
| APRA CPS 234 | §15 (Information Asset Management) | Information assets classified proportionate to sensitivity | Data classification on every ontology concept and query result |
| Australian Privacy Act 1988 | APP 6 (Use or Disclosure) | Personal information only used for the purpose it was collected | ABAC at concept level prevents AI applications from accessing personal data outside their authorised purpose |
| EU AI Act | Article 13 (Transparency) | High-risk AI decisions must be explainable | Semantic metadata on every query result provides the translation chain: NL → concept → physical data |
| EU GDPR | Article 5(1)(b) (Purpose Limitation) | Data only processed for specified, explicit, legitimate purposes | Purpose-scoped access control enforced at the ontology concept level |
| ISO/IEC 42001 | §8.4 (AI system transparency) | Organisations must document AI system data inputs and transformations | Mapping registry + query audit log provides full input documentation |
| NIST AI RMF | MAP 2.2 (AI Risk Characterisation) | Risks from AI data access characterised and documented | Mapping confidence levels and data classification labels quantify data access risk |
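The purpose-scoped, concept-level access control referenced in the APP 6 and GDPR rows can be sketched as a default-deny check. This is a minimal illustration under stated assumptions: the types (`ConceptPolicy`, `AccessRequest`), the function `is_permitted`, and the example purposes and classifications are hypothetical, and a production ABAC engine would evaluate many more attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptPolicy:
    classification: str                 # e.g. "internal", "personal" (illustrative labels)
    allowed_purposes: frozenset[str]    # purposes for which the concept may be queried


@dataclass(frozen=True)
class AccessRequest:
    application: str
    purpose: str    # the purpose the calling AI application is authorised for
    concept: str    # ontology concept the translated query touches


def is_permitted(request: AccessRequest, policies: dict[str, ConceptPolicy]) -> bool:
    """Allow access only when the request's purpose is listed for the concept."""
    policy = policies.get(request.concept)
    if policy is None:
        return False  # default deny: unclassified concepts are never served
    return request.purpose in policy.allowed_purposes


policies = {
    "CustomerContactDetails": ConceptPolicy("personal", frozenset({"customer-service"})),
    "Revenue": ConceptPolicy("internal", frozenset({"finance-reporting", "customer-service"})),
}
```

Because the check runs against ontology concepts rather than physical tables, the same policy applies regardless of which source system ultimately answers the query.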
15. Reference Implementations
15.1 AWS
| Component | AWS Service |
|---|---|
| Ontology / mapping registry | Aurora PostgreSQL with custom schema |
| Query translation LLM | Amazon Bedrock (Claude or Titan) |
| Semantic cache | Amazon ElastiCache for Redis + custom embedding index |
| Business glossary | AWS Glue Data Catalog (limited) or third-party Collibra on EC2 |
| Source system connectivity | Amazon Athena (data lake), RDS direct connection, Redshift |
| Monitoring | CloudWatch + Amazon Managed Service for Prometheus / Amazon Managed Grafana |
| Access control | AWS IAM + Lake Formation fine-grained access |
15.2 Azure
| Component | Azure Service |
|---|---|
| Ontology / mapping registry | Azure SQL Database |
| Query translation LLM | Azure OpenAI Service (GPT-4o) |
| Semantic cache | Azure Cache for Redis + Azure AI Search |
| Business glossary | Microsoft Purview Data Catalog |
| Source system connectivity | Azure Synapse Analytics, Azure SQL, Fabric OneLake |
| Monitoring | Azure Monitor + Grafana |
| Access control | Microsoft Entra ID (formerly Azure AD) with Azure ABAC + Purview data policies |
15.3 GCP
| Component | GCP Service |
|---|---|
| Ontology / mapping registry | Cloud SQL PostgreSQL |
| Query translation LLM | Vertex AI Gemini |
| Semantic cache | Memorystore for Redis + Vertex AI Vector Search |
| Business glossary | Dataplex Data Catalog |
| Source system connectivity | BigQuery, Cloud SQL, AlloyDB |
| Monitoring | Cloud Monitoring + Grafana |
15.4 On-Premises
| Component | Technology |
|---|---|
| Ontology / mapping registry | PostgreSQL + custom API layer |
| Query translation | Self-hosted Ollama (Llama 3.x) or another on-premises LLM runtime |
| Semantic cache | Redis Enterprise + pgvector |
| Business glossary | Collibra on-prem or open-source Amundsen/DataHub |
| Source connectivity | Direct JDBC/ODBC; Airbyte for data movement |
16. Related Patterns
| Pattern ID | Pattern Name | Relationship Type | Notes |
|---|---|---|---|
| EAAPL-KNW001 | Enterprise Knowledge Graph | Complementary | SDL provides the semantic interface to the knowledge graph; together they create governed NL-to-knowledge access |
| EAAPL-KNW003 | AI Knowledge Corpus Management | Upstream | Corpus documents are richer when the semantic layer provides entity and term context for ingestion |
| EAAPL-KNW006 | Corpus Quality Assurance | Supporting | Quality assurance validates that corpus documents use terms consistently with the SDL ontology |
| EAAPL-RAG002 | Text-to-SQL | Specialisation | Text-to-SQL is a simpler version of the SDL concept — SDL adds ontology governance and multi-source abstraction |
| EAAPL-GOV001 | AI Data Governance | Dependency | SDL is an implementation of AI data governance principles — requires a functioning data governance programme |
| EAAPL-SEC001 | AI Data Access Control | Supporting | SDL's ABAC implementation is an application of the AI data access control pattern |
17. Maturity Assessment
Overall Maturity Label: Proven
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Technology readiness | 4 | NL-to-SQL translation is production-proven; semantic caching is well-understood; managed semantic layers from dbt/Cube are commercial grade |
| Organisational capability | 2 | Requires mature data governance including a business glossary and data steward function — rare below large enterprise level |
| Standards availability | 3 | OWL/RDF/SPARQL are mature; property graph query standards (GQL) are emerging; semantic layer API standards are fragmented |
| Vendor ecosystem | 4 | Multiple commercial semantic layer products; multiple LLM options for translation; strong open-source tooling |
| Case evidence | 3 | Strong evidence in analytics-heavy domains (BI semantic layers); AI-specific SDL evidence is growing but less documented |
| Regulatory alignment | 5 | SDL directly addresses regulatory transparency, purpose limitation, and auditability requirements for AI data access |
| Overall | 3.5 / 5 | Proven with strong regulatory alignment; primary constraint is the prerequisite data governance programme maturity |
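The table does not state how the overall score is derived. As a check, an unweighted mean of the six dimension scores reproduces the 3.5 figure; the equal weighting is an assumption, not a documented scoring rule.

```python
from statistics import mean

# Dimension scores copied from the maturity table above.
scores = {
    "Technology readiness": 4,
    "Organisational capability": 2,
    "Standards availability": 3,
    "Vendor ecosystem": 4,
    "Case evidence": 3,
    "Regulatory alignment": 5,
}

# Assumption: every dimension carries equal weight.
overall = mean(scores.values())
print(f"Overall: {overall:.1f} / 5")
```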
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-06-12 | EAAPL Editorial Board | Initial publication — covers ontology governance, semantic mapping, NL query translation, semantic caching, business glossary integration, and mapping validation |