EAAPL-OBS001 · AI Telemetry Architecture

Pattern ID: EAAPL-OBS001
Status: Proven
Complexity: Medium
Tags: observability, slo, llm, audit-logging, medium-complexity
Version: 1.0.0
Last Reviewed: 2026-06-12

1. Executive Summary

AI systems present unique observability challenges that traditional APM tooling does not address. Unlike deterministic software, AI systems exhibit probabilistic behaviour, token-based economics, and latency profiles driven by external model APIs rather than internal compute. Without purpose-built telemetry, engineering teams operate blind: they cannot distinguish a model quality regression from a latency spike, cannot allocate AI costs to business units, and cannot demonstrate regulatory compliance when auditors request evidence of AI system behaviour.

This pattern defines a comprehensive telemetry architecture for AI systems covering metrics, structured logs, and distributed traces. It establishes an AI-specific metric taxonomy — token usage, prompt vs completion token splits, cache hit rates, per-model latency percentiles, error rates by error class, and cost per request — alongside a structured logging schema and distributed tracing conventions built on OpenTelemetry GenAI semantic conventions. The outcome is full operational visibility: SLO attainment dashboards, cost attribution by team and product, compliance audit trails, and the data foundation required for hallucination detection, drift monitoring, and incident management patterns that extend this baseline.

Target Audience: CIO, CTO, Platform Engineering Lead, AI Engineering Lead
Time to Implement: 6–10 weeks

2. Problem Statement

Business Problem

Organisations deploying AI are unable to answer basic operational questions: Which team is driving our AI spend? Is our AI improving or degrading over time? If a customer complains our AI gave wrong advice, can we reconstruct what happened? Without answers, AI budgets spiral, quality degrades silently, and regulatory exposure grows.

Technical Problem

Traditional APM tools (Datadog, New Relic, Dynatrace) capture HTTP latency and error rates but lack primitives for token counts, prompt versions, cache hit ratios, model-specific error codes, or cost attribution at request level. AI pipelines span multiple components — prompt assembly, vector retrieval, LLM API, output filtering — yet appear as a single opaque HTTP call in conventional traces.

Symptoms

AI costs attributed to a single cost centre with no per-team or per-feature breakdown
Latency alerts fire on p99 spikes but root cause (LLM API vs. vector DB vs. prompt assembly) is unknown
Incident post-mortems cannot reconstruct the exact prompt, model version, and context that produced a harmful output
Prompt engineering teams release updates with no visibility into production impact
Cache infrastructure exists but cache hit rate is unknown and uncorrelated to cost

Cost of Inaction

AI cost overruns of 40–200% are common when per-request cost tagging is absent (Gartner 2025)
Mean time to diagnose AI quality incidents exceeds 8 hours without structured AI telemetry
Regulatory audit findings when AI decision logs cannot be reconstructed (APRA, EU AI Act)
Repeated model regressions shipped to production due to absence of quality regression gates

3. Context

When to Apply

Any production AI system processing > 1,000 requests/day
AI systems handling financial, health, legal, or safety-relevant decisions
Organisations with multiple teams sharing AI infrastructure
Systems where LLM API cost exceeds $500/month (above this level, cost visibility typically has a positive ROI)
Before deploying dependent patterns: EAAPL-OBS002 (Prompt Monitoring), EAAPL-OBS003 (Hallucination Detection)

When NOT to Apply

Internal proof-of-concept with < 30-day planned lifespan
Single-team, single-use-case AI tools where cost allocation is irrelevant
Air-gapped environments where telemetry egress is prohibited (use on-premises collector variant)

Prerequisites

Prerequisite	Required	Notes
OpenTelemetry SDK in application stack	Required	Java, Python, Node.js, Go SDKs available
Centralised log aggregation (e.g., OpenSearch, Splunk, Loki)	Required	Must support JSON structured logs
Metrics platform (Prometheus, Datadog, CloudWatch)	Required	Must support custom metric dimensions
Distributed trace backend (Jaeger, Tempo, X-Ray, Cloud Trace)	Required	Must accept OTLP protocol
AI API cost data access	Required	Provider billing API or webhook for per-request cost
Secret management for API keys	Required	Keys must not appear in telemetry data

Industry Applicability

Industry	Applicability	Primary Driver
Financial Services	High	APRA CPS 234, audit trail requirements, cost allocation
Healthcare	High	Privacy Act, clinical AI accountability, safety
Government	High	Mandatory AI transparency, FOI obligations
Retail / E-Commerce	Medium	Cost optimisation, personalisation quality
Technology / SaaS	High	Multi-tenant cost attribution, SLO management
Legal Services	High	Professional liability, output auditability
Manufacturing	Medium	Predictive maintenance quality tracking

4. Architecture Overview

The AI Telemetry Architecture is structured as a four-layer system: instrumentation, collection, storage, and consumption. Each layer has distinct responsibilities and technology choices.

Instrumentation Layer

Every AI pipeline component is instrumented using the OpenTelemetry SDK with GenAI semantic conventions. The instrumentation layer is responsible for emitting three signal types: metrics (counters, histograms, gauges), structured logs (JSON with a canonical AI log schema), and traces (spans with AI-specific span attributes). Instrumentation is implemented as middleware or interceptors on the AI SDK layer — not in application business logic — so that telemetry concerns do not leak into product code. The canonical approach is an AI client wrapper that intercepts every LLM call, records span start and end, attaches token counts from the API response, calculates cost using a cost coefficient table, and emits all three signal types atomically.
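The wrapper idea can be sketched in a few lines. This is a minimal illustration using only the standard library, not the pattern's reference implementation: `instrumented`, `EMITTED`, `fake_model`, and the `usage` response shape are all hypothetical names, and a real wrapper would hand its records to the OpenTelemetry SDK rather than append to a list.

```python
import time
from typing import Any, Callable, Dict, List

# Hypothetical in-memory sink standing in for the OpenTelemetry SDK exporters.
EMITTED: List[Dict[str, Any]] = []

ModelCall = Callable[[str], Dict[str, Any]]


def instrumented(model_id: str, call_model: ModelCall) -> ModelCall:
    """Wrap an LLM call so telemetry is recorded outside business logic."""

    def wrapper(prompt: str) -> Dict[str, Any]:
        start = time.perf_counter()
        record: Dict[str, Any] = {"modelId": model_id, "errorCode": None}
        try:
            response = call_model(prompt)
            # Token counts come from the provider response, never from the prompt text.
            usage = response.get("usage", {})
            record["inputTokens"] = usage.get("input_tokens", 0)
            record["outputTokens"] = usage.get("output_tokens", 0)
            return response
        except Exception as exc:
            record["errorCode"] = type(exc).__name__
            raise
        finally:
            # Runs on success and on failure, so every call leaves a record.
            record["latencyMs"] = (time.perf_counter() - start) * 1000.0
            EMITTED.append(record)

    return wrapper


def fake_model(prompt: str) -> Dict[str, Any]:
    """Stand-in for a provider SDK call; the response shape is an assumption."""
    return {"text": "ok", "usage": {"input_tokens": len(prompt.split()), "output_tokens": 1}}


ask = instrumented("example-model", fake_model)
ask("hello telemetry world")
```

Because the record is written in the `finally` block, failed calls are captured as well as successful ones, which is what the completeness SLO later in this pattern relies on.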

Metric Taxonomy

The AI metric taxonomy distinguishes three categories. Throughput metrics track requests per second per model, tokens per minute (split prompt vs. completion), and concurrent in-flight requests. Quality metrics track cache hit rate, error rate by error type (rate limit, context length exceeded, content filter, timeout, model error), and model availability (percentage of requests receiving non-error responses). Economics metrics track cost per request (USD), cost per 1K tokens by model, and cumulative cost by dimension (team, product, feature, environment). All metrics carry consistent dimension labels: model_id, model_version, environment, team, product, use_case, and error_type where applicable.
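One way to keep dimension labels consistent is to validate them at the point of emission. The sketch below is an assumption-laden illustration: the in-memory `Counter` stands in for a real metrics client, and the metric names (`ai_requests_total` and so on) are hypothetical. Only the label names come from the taxonomy above.

```python
from collections import Counter
from typing import Dict, Tuple

# Dimension labels named in the taxonomy; error_type applies only to error metrics.
REQUIRED_LABELS = ("model_id", "model_version", "environment", "team", "product", "use_case")
OPTIONAL_LABELS = ("error_type",)

LabelKey = Tuple[Tuple[str, str], ...]
_series: Counter = Counter()  # hypothetical stand-in for a metrics backend


def label_key(labels: Dict[str, str]) -> LabelKey:
    """Reject label sets that are incomplete or carry unapproved dimensions."""
    allowed = REQUIRED_LABELS + OPTIONAL_LABELS
    missing = [name for name in REQUIRED_LABELS if name not in labels]
    unknown = [name for name in labels if name not in allowed]
    if missing or unknown:
        raise ValueError(f"missing labels {missing}, unknown labels {unknown}")
    return tuple(sorted(labels.items()))


def increment(metric: str, labels: Dict[str, str], amount: float = 1.0) -> None:
    _series[(metric, label_key(labels))] += amount


def read(metric: str, labels: Dict[str, str]) -> float:
    return _series[(metric, label_key(labels))]


BASE = {
    "model_id": "example-model",
    "model_version": "2026-01",
    "environment": "prod",
    "team": "payments",
    "product": "assistant",
    "use_case": "summarise",
}
increment("ai_requests_total", BASE)
increment("ai_prompt_tokens_total", BASE, 1200)
increment("ai_completion_tokens_total", BASE, 300)
increment("ai_errors_total", {**BASE, "error_type": "RATE_LIMIT"})
```

Rejecting unknown labels at emission time is a deliberate choice here: it stops a high-cardinality dimension such as `userId` from reaching the metrics backend by accident.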

Structured Log Schema

Every LLM request produces a structured log record in a canonical JSON schema. The schema includes: requestId (UUID v4), traceId (W3C trace context), spanId, timestamp (ISO 8601 UTC), modelId, modelVersion, promptVersion, promptHash (SHA-256 of prompt template, not content), inputTokens, outputTokens, cachedTokens, latencyMs (total), latencyBreakdown (object with promptAssemblyMs, vectorRetrievalMs, llmApiMs, outputFilterMs), cacheHit (boolean), userId (hashed), tenantId, toolCalls (array of tool name and latency), errorCode (null or error type string), costUsd, and environment. Prompt content, which may contain PII, is not logged by default; logging it requires explicit opt-in with data classification controls.
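A subset of this schema can be expressed as a dataclass to show how the hashed fields are derived. This is a sketch only: it omits traceId, spanId, latencyBreakdown, and toolCalls for brevity, and the keyed hash for `userId` (with the hypothetical `USER_HASH_KEY`) is one possible reading of "hashed", not something the schema prescribes.

```python
import hashlib
import hmac
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

USER_HASH_KEY = b"example-key-from-secrets-manager"  # hypothetical; never hard-code in practice


def template_hash(template: str) -> str:
    """SHA-256 of the prompt template only, so no variable content is hashed."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


def hash_user_id(raw_user_id: str) -> str:
    """Keyed hash so raw user identifiers never reach the log store."""
    return hmac.new(USER_HASH_KEY, raw_user_id.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass
class AiLogRecord:
    modelId: str
    modelVersion: str
    promptVersion: str
    promptHash: str
    inputTokens: int
    outputTokens: int
    latencyMs: float
    userId: str
    tenantId: str
    environment: str
    cachedTokens: int = 0
    cacheHit: bool = False
    errorCode: Optional[str] = None
    costUsd: Optional[float] = None
    requestId: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


TEMPLATE = "Summarise the following document for {audience}:\n{document}"

record = AiLogRecord(
    modelId="example-model",
    modelVersion="2026-01",
    promptVersion="v3",
    promptHash=template_hash(TEMPLATE),
    inputTokens=1200,
    outputTokens=300,
    latencyMs=842.5,
    userId=hash_user_id("user-42"),
    tenantId="tenant-a",
    environment="prod",
)
line = json.dumps(asdict(record), sort_keys=True)
```

The serialised `line` contains the template hash and the hashed user identifier but neither the template text nor the raw user ID.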

Distributed Trace Architecture

Traces propagate via W3C Trace Context headers through every component in the AI pipeline. The trace begins at the API gateway where the initial span is created. Child spans are created for: authentication check, rate limit evaluation, prompt assembly, vector store retrieval (one span per retrieval call), LLM API call (one span per model invocation, supporting multi-turn and agentic patterns), output safety filter, and response serialisation. Each span carries AI-specific span attributes per OpenTelemetry GenAI conventions.
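To make the propagation step concrete, the sketch below builds and parses the W3C `traceparent` header by hand. In practice the OpenTelemetry SDK's propagators do this; the helper names here are hypothetical, and only version `00` of the header is handled as a simplification.

```python
import re
import secrets
from typing import NamedTuple, Optional

# traceparent: version "00", 32-hex trace-id, 16-hex parent-id, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def new_root_context(sampled: bool = True) -> SpanContext:
    """Created once, at the API gateway, for each incoming request."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def child_of(parent: SpanContext) -> SpanContext:
    """Child spans keep the trace ID and sampling decision but get a new span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_traceparent(ctx: SpanContext) -> str:
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[SpanContext]:
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 1))


root = new_root_context()
llm_span = child_of(root)
header = to_traceparent(llm_span)
```

A downstream component that receives `header` can call `parse_traceparent` and then `child_of` to continue the same trace, which is how the per-component spans listed above end up in one waterfall.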

Collection Architecture

The OpenTelemetry Collector runs as a sidecar or daemon set, receiving signals via OTLP (gRPC and HTTP). The collector pipeline includes processors for PII scrubbing (remove prompt content from logs before storage unless classified as approved), attribute enrichment (add cost calculations, environment tags), sampling (100% error traces, 10% success traces, 1% high-volume paths), and batching for efficiency. The collector fans out to multiple backends: metrics to Prometheus or CloudWatch, logs to OpenSearch or Splunk, traces to Jaeger or Tempo.
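The PII scrubbing step can be illustrated with a small function that redacts any string field matching a PII pattern. The two regular expressions below (email addresses and 13 to 16 digit card-like numbers) are illustrative assumptions only; a production scrubber would use a fuller detector such as the NER-based tools named in the Components section, and `scrub_record` is a hypothetical name.

```python
import re
from typing import Any, Dict, List, Tuple

# Illustrative patterns only; they are neither complete nor tuned for false positives.
PII_PATTERNS = (
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),  # email address
    re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),                        # card-like number
)

REDACTED = "[REDACTED]"


def scrub_record(record: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return a scrubbed copy of the record and the names of redacted fields.

    Any string field that matches a PII pattern has its whole value replaced,
    so the raw value never reaches storage. Non-string fields pass through.
    """
    clean: Dict[str, Any] = {}
    redacted_fields: List[str] = []
    for key, value in record.items():
        if isinstance(value, str) and any(p.search(value) for p in PII_PATTERNS):
            clean[key] = REDACTED
            redacted_fields.append(key)
        else:
            clean[key] = value
    return clean, redacted_fields


raw = {
    "requestId": "req-001",
    "errorDetail": "delivery failed for jane.doe@example.com",
    "inputTokens": 12,
}
clean, redacted = scrub_record(raw)
```

The list of redacted field names is what would feed the separate PII detection audit stream; it records that a detection happened without carrying the detected value.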

Storage Tiering

Hot storage (0–30 days): full-resolution metrics at 15-second granularity, all structured logs, all sampled traces. Warm storage (30–90 days): 1-minute metric aggregates, logs retained for compliance. Cold storage (90 days–7 years): aggregated metrics, compliance-required logs, legal hold traces. Retention periods are configurable per data classification and regulatory jurisdiction.
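As a small illustration of the tiering rule, the function below maps a record's age to a tier. It assumes the default retention periods stated above, treats the shared boundaries (day 30 and day 90) as belonging to the younger tier, and approximates seven years as 7 x 365 days; the `storage_tier` name and the `expired` state are hypothetical.

```python
from datetime import timedelta

HOT_DAYS = 30
WARM_DAYS = 90
COLD_DAYS = 7 * 365  # approximation of the 7-year upper bound


def storage_tier(age: timedelta) -> str:
    """Map the age of a telemetry record to its default storage tier."""
    days = age.total_seconds() / 86400
    if days < 0:
        raise ValueError("age must not be negative")
    if days <= HOT_DAYS:
        return "hot"
    if days <= WARM_DAYS:
        return "warm"
    if days <= COLD_DAYS:
        return "cold"
    return "expired"  # eligible for destruction under the retention policy
```

Since retention is configurable per data classification and jurisdiction, a real implementation would look these thresholds up rather than hold them as constants.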

5. Architecture Diagram

flowchart TD
    subgraph Pipeline["AI Request Pipeline"]
        A[API Gateway]
        B[LLM Client Wrapper]
        C[Output Safety Filter]
    end
    subgraph Collection["OTel Collection"]
        D[OTel Collector]
        E[PII Scrubber + Enrichment]
    end
    subgraph Storage["Signal Storage"]
        F[(Metrics Store)]
        G[(Log Store)]
        H[(Trace Backend)]
    end
    A --> B
    B --> C
    A -->|spans| D
    B -->|tokens + cost + spans| D
    C -->|filter spans| D
    D --> E
    E -->|metrics| F
    E -->|logs| G
    E -->|traces| H
    F --> I[Dashboards + Alerts]
    G --> J[Audit Reports]
    H --> J
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#fef9c3,stroke:#eab308
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
AI Client Wrapper	SDK Library	Intercepts all LLM API calls; emits metrics, logs, spans; calculates cost	Custom wrapper over OpenAI SDK, Anthropic SDK, Bedrock SDK	Critical
OpenTelemetry Collector	Infrastructure	Receives OTLP signals; processes (PII scrub, enrich, sample); fans out to backends	OTel Collector Contrib, AWS ADOT, Grafana Alloy	Critical
PII Scrubber Processor	Collector Processor	Detects and removes PII from log fields before storage	Presidio, custom regex processor, OTel transform processor	Critical
Metrics Backend	Storage	Time-series storage for AI metrics; supports high-cardinality dimensions	Prometheus + Thanos, Datadog, CloudWatch, InfluxDB	Critical
Log Aggregation	Storage	Structured JSON log storage; supports full-text and field queries	OpenSearch, Splunk, Loki, Azure Monitor Logs	Critical
Trace Backend	Storage	Distributed trace storage; supports trace search and waterfall UI	Jaeger, Grafana Tempo, AWS X-Ray, Google Cloud Trace	High
Cost Database	Storage	Per-request cost records with full dimension tagging	PostgreSQL with TimescaleDB, ClickHouse, BigQuery	High
Dashboarding	Consumption	SLO dashboards, cost dashboards, quality dashboards	Grafana, CloudWatch Dashboards, Datadog Dashboards	High
Alert Manager	Consumption	Route metric threshold alerts to on-call channels	Alertmanager, PagerDuty, OpsGenie, CloudWatch Alarms	Critical
Audit Report Engine	Consumption	Generate compliance audit reports from log data	Scheduled queries + PDF/CSV export; Splunk reports	High
Cost Attribution Engine	Consumption	Aggregate cost by team/product/feature; send to FinOps	Custom aggregation service; AWS Cost and Usage Report integration	Medium
Sampling Controller	Collector Processor	Dynamic sampling rates by traffic type and error status	OpenTelemetry tail-based sampling processor	Medium

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	API Gateway	Receives request; creates root span with W3C trace context; injects trace headers	Root span, request log entry
2	AI Client Wrapper	Intercepts LLM call; records start timestamp; logs requestId, userId, tenantId, promptVersion	Span started, pre-call log record
3	Prompt Assembly Service	Retrieves prompt template; assembles with context; records assemblyMs	Assembled prompt, span with promptHash attribute
4	Vector Store Client	Executes similarity search; records retrievalMs, document count, top-k scores	Retrieved context chunks, retrieval span
5	LLM API Client	Sends request to model provider; awaits response; records inputTokens, outputTokens, cachedTokens, llmApiMs	Raw completion, LLM span with token attributes
6	Cost Calculator	Applies cost coefficient (USD per 1K tokens) to token counts; calculates requestCostUsd	costUsd field populated
7	Output Safety Filter	Evaluates output for policy violations; records filter decision, filterMs	Filtered output, filter span
8	AI Client Wrapper	Closes span; emits complete structured log record; increments metrics counters and histograms	Completed trace, log record, metric increments
9	OTel Collector	Receives OTLP signals; applies PII scrub, enrichment, sampling	Clean signals ready for storage
10	Backends	Persist metrics, logs, traces to respective storage systems	Queryable telemetry data
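Step 6 of the flow is simple arithmetic and can be sketched as follows. The coefficient values and the `example-model` entry are hypothetical, not real provider prices. The sketch also ignores any discount for cached tokens, and returns `None` for an unknown model in line with the cost calculation failure case in the error flow.

```python
import logging
from decimal import Decimal
from typing import Dict, Optional

logger = logging.getLogger("ai.cost")

# Hypothetical coefficients in USD per 1K tokens; real values come from provider pricing.
COST_PER_1K: Dict[str, Dict[str, Decimal]] = {
    "example-model": {"input": Decimal("0.0030"), "output": Decimal("0.0150")},
}


def request_cost_usd(model_id: str, input_tokens: int, output_tokens: int) -> Optional[Decimal]:
    """Apply per-1K-token coefficients; return None if the model has no entry."""
    rates = COST_PER_1K.get(model_id)
    if rates is None:
        # Do not fail the request: log a warning and leave costUsd null.
        logger.warning("no cost coefficient for model %s", model_id)
        return None
    total = Decimal(input_tokens) * rates["input"] + Decimal(output_tokens) * rates["output"]
    return total / Decimal(1000)


# 1,200 input tokens and 300 output tokens:
# (1200 * 0.0030 + 300 * 0.0150) / 1000 = (3.6 + 4.5) / 1000 = 0.0081 USD
cost = request_cost_usd("example-model", 1200, 300)
```

`Decimal` is used instead of `float` so that per-request costs sum cleanly when aggregated by team or product.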

Error Flow

Error Scenario	Detection	Action	Recovery
LLM API rate limit (429)	AI Client Wrapper catches HTTP 429	Log errorCode=RATE_LIMIT; increment error_rate counter with error_type label; create error span	Retry with exponential backoff; alert if rate > threshold
Context length exceeded	HTTP 400 with context_length_exceeded code	Log errorCode=CONTEXT_LENGTH; increment counter; record inputTokens that caused overflow	Truncation fallback; alert prompt engineering team
OTel Collector unavailable	SDK cannot connect to collector endpoint	Buffer signals in memory (up to configured limit); retry; drop oldest if buffer full	Collector restart; no data loss for short outages
PII detected in log field	PII scrubber processor identifies PII pattern	Replace field value with [REDACTED]; log PII detection event to separate audit stream	Alert data governance team; do not store raw PII
Cost calculation failure	Cost coefficient table missing model entry	Log warning; record costUsd=null; continue request	Add model to cost table; backfill cost estimates
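The detection and recovery columns for the first two rows can be sketched as an error classifier plus a backoff schedule. `RATE_LIMIT` and `CONTEXT_LENGTH` are the codes used in the table; `TIMEOUT`, `CONTENT_FILTER`, and `MODEL_ERROR` are assumed names for the remaining error types in the metric taxonomy, and the provider code strings are assumptions about the response body.

```python
from typing import List, Optional


def classify_error(
    status: Optional[int],
    provider_code: Optional[str] = None,
    timed_out: bool = False,
) -> Optional[str]:
    """Map a provider response to an errorCode value, or None on success."""
    if timed_out:
        return "TIMEOUT"
    if status is not None and 200 <= status < 300:
        return None
    if status == 429:
        return "RATE_LIMIT"
    if status == 400 and provider_code == "context_length_exceeded":
        return "CONTEXT_LENGTH"
    if provider_code == "content_filter":
        return "CONTENT_FILTER"
    return "MODEL_ERROR"


def backoff_delays(attempts: int, base_seconds: float = 0.5, cap_seconds: float = 8.0) -> List[float]:
    """Exponential backoff schedule: base * 2^n, capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * (2 ** n)) for n in range(attempts)]


RETRYABLE = {"RATE_LIMIT", "TIMEOUT"}  # assumption: only these are worth retrying

error_code = classify_error(429)
delays = backoff_delays(6) if error_code in RETRYABLE else []
```

The schedule is deterministic here to keep the example readable; retry logic in production commonly adds random jitter so that many clients do not retry in lockstep.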

8. Security Considerations

Authentication: All telemetry endpoints require mutual TLS (mTLS) between instrumented services and OTel Collector. Collector-to-backend connections use service account credentials stored in a secrets manager, never in environment variables or config files.

Authorisation: Telemetry data access follows least-privilege. Engineers access dashboards via SSO with RBAC. Cost data restricted to team leads and FinOps. Audit logs accessible only to compliance, legal, and senior engineering. Trace data containing AI outputs restricted to AI engineering and incident responders.

Secrets Management: AI API keys (OpenAI, Anthropic, Bedrock) are rotated quarterly and stored in HashiCorp Vault or AWS Secrets Manager. Keys never appear in logs, spans, or metrics. The AI Client Wrapper retrieves keys at runtime from the secrets manager; keys are never interpolated into log messages.

Data Classification: Prompt content is classified as Internal (employee prompts) or Confidential (customer data in context). Prompt content logging requires explicit classification approval. Default: prompt content is NOT logged; only promptHash (SHA-256 of template, no variable content) is recorded.

Encryption: All telemetry data encrypted in transit (TLS 1.3) and at rest (AES-256). Log archives in cold storage use envelope encryption with customer-managed keys.

Auditability: Every access to telemetry data is itself logged. Dashboard queries, report exports, and direct database queries produce access log records. Audit logs are write-once and stored in a separate immutable log store.

OWASP LLM Top 10 Coverage

OWASP LLM Risk	Telemetry Control	Implementation
LLM01 Prompt Injection	Prompt content anomaly detection in monitoring layer	Alert on prompts containing injection signatures; log injection attempts
LLM02 Insecure Output Handling	Output safety filter spans record filter decisions	Track filter bypass rate; alert on zero-filter-hits anomaly
LLM03 Training Data Poisoning	Input distribution drift metrics	PSI alerts on input feature drift that may indicate poisoning
LLM04 Model Denial of Service	Token usage metrics, rate limit error rate	Alert on token-per-minute spikes; rate limit enforcement at gateway
LLM05 Supply Chain Vulnerabilities	Model version tracking in all telemetry records	Detect unexpected model version changes in production
LLM06 Sensitive Information Disclosure	PII scrubber processor; prompt content classification	Alert on PII detection in prompts; restrict access to AI output logs
LLM07 Insecure Plugin Design	Tool call spans record tool name, inputs, outputs	Audit all tool invocations; alert on unexpected tool calls
LLM08 Excessive Agency	Tool call frequency metrics; scope metrics	Alert on tool call rate exceeding expected bounds per workflow
LLM09 Overreliance	User feedback metrics; hallucination rate metric	Track downstream outcomes; surface low-confidence responses
LLM10 Model Theft	API key access logs; unusual exfiltration patterns	Alert on bulk completions from single key; anomalous output volume

9. Governance Considerations

Responsible AI: Telemetry data is the primary evidence base for AI governance. The telemetry architecture must retain sufficient data to answer: Who made this AI call? What was the prompt context? What model and version responded? What was the output? Was it filtered? All these questions must be answerable from the audit trail.

Model Risk Management: Model version is a mandatory dimension on all telemetry. When a model version changes, telemetry enables before/after comparison on quality, cost, and error metrics. This supports the model risk management process for material model changes.

Human Approval: Access to raw AI telemetry (especially prompt logs and output logs) requires approval from the data governance committee. Automated systems may read aggregated metrics; raw records require human approval for access.

Policy: Telemetry data retention policies must be documented and approved by legal and compliance. Minimum retention for regulatory purposes is 7 years for financial services (APRA), 5 years for healthcare (Privacy Act). Destruction schedules must be enforceable and audited.

Traceability: Every AI-influenced decision must be traceable from the business outcome back to the specific requestId, modelId, promptVersion, and context. This traceability chain is the foundation for regulatory audit and legal discovery.

Governance Artefacts

Artefact	Owner	Frequency	Format
AI Telemetry Schema Registry	Platform Engineering	Updated on schema change	JSON Schema + changelog
Data Retention Policy	Legal / Compliance	Annual review	Policy document
Cost Attribution Report	FinOps + AI Platform	Monthly	Automated dashboard + PDF export
Model Version Change Log	AI Engineering	Per deployment	Linked to deployment record
PII Detection Incident Log	Data Governance	Per incident	Incident ticket + remediation record
Telemetry Access Audit Report	Security	Quarterly	Automated export from access log

10. Operational Considerations

Monitoring: The telemetry system itself must be monitored. Collector pipeline health (signal throughput, drop rate, processing lag), backend storage capacity and ingestion rate, and alert delivery reliability are all first-class operational metrics.

Logging: Collector and backend logs are separate from AI application logs. They are stored in an operations log store, not the AI audit log store, to prevent circular dependencies.

Incident Response: If the telemetry system degrades, AI systems continue operating but enter a "blind flight" state. Runbooks must define escalation criteria and manual verification procedures for operating without telemetry.

Disaster Recovery: The OTel Collector runs in active-active pairs. Log and metric backends use replication. Trace data has a lower durability requirement (losing recent traces in a DR event is acceptable); log and metric data require an RPO of 1 hour or less.

Capacity Planning: Token-based workloads can have very bursty telemetry volumes. Capacity planning must account for peak token volume, not average. A 10x burst capacity buffer is the minimum recommendation.

SLO Table

SLO	Target	Measurement	Alert Threshold
Telemetry signal delivery lag	< 30 seconds p99	Collector processing lag metric	> 60 seconds for 5 minutes
Log query response time	< 5 seconds for 24h queries	Log backend query latency	> 10 seconds sustained
Alert delivery time	< 2 minutes from threshold breach	Alert delivery timestamp vs. breach timestamp	> 5 minutes
Telemetry data completeness	> 99.9% of AI requests have log record	Correlation of AI request count vs. log record count	< 99% for 1 hour
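The completeness SLO in the last row reduces to a ratio check. The sketch below assumes both counts cover the same one-hour window, so a result of `alert` corresponds to the "< 99% for 1 hour" threshold; the function names and status strings are hypothetical.

```python
TARGET = 0.999       # SLO target: more than 99.9% of AI requests have a log record
ALERT_BELOW = 0.99   # alert threshold: below 99% over the evaluation window


def completeness(request_count: int, log_record_count: int) -> float:
    """Fraction of AI requests that have a matching log record."""
    if request_count < 0 or log_record_count < 0:
        raise ValueError("counts must not be negative")
    if request_count == 0:
        return 1.0  # no traffic means nothing is missing
    # Cap at 1.0 in case duplicate log records inflate the count.
    return min(1.0, log_record_count / request_count)


def slo_status(ratio: float) -> str:
    if ratio < ALERT_BELOW:
        return "alert"
    if ratio <= TARGET:
        return "below_target"
    return "meeting_target"


# 100,000 requests with 99,950 log records gives 0.9995, which meets the target.
status = slo_status(completeness(100_000, 99_950))
```

The gap between the target and the alert threshold is intentional in the table: a window between 99% and 99.9% consumes error budget without paging anyone.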

Disaster Recovery Table

Component	RTO	RPO	Recovery Approach
OTel Collector	5 minutes	Near-zero (active-active)	Auto-failover to standby collector
Metrics Backend	15 minutes	1 hour	Prometheus TSDB snapshot restore
Log Aggregation	30 minutes	1 hour	Index restore from object storage
Trace Backend	60 minutes	4 hours	Partial data acceptable; restore from object storage
Cost Database	30 minutes	1 hour	PostgreSQL streaming replication

11. Cost Considerations

Cost Drivers

Driver	Description	Relative Cost
Log ingestion volume	Token counts and full request metadata generate ~2KB per AI request at scale	High
Trace storage	Full distributed traces with AI span attributes are 5–10x larger than typical traces	Medium-High
Metrics cardinality	High-cardinality dimensions (userId, tenantId per metric) can cause metric explosion	High if uncontrolled
PII scrubber compute	Regex and NER-based PII detection adds ~5ms per log record; scales with volume	Medium
Cold storage archival	7-year retention of log data at enterprise scale	Medium (decreasing)

Scaling Risks: Metrics cardinality is the primary cost scaling risk. Adding userId as a label on high-volume metrics multiplies series count by user count. Use aggregated dimensions (user_tier, not userId) for metrics; reserve userId for log records.
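The multiplication effect is easy to see with a worked example. The label cardinalities below are made-up figures chosen for illustration, and the product is an upper bound, since not every label combination occurs in practice.

```python
import math
from typing import Dict


def series_count(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct label values."""
    return math.prod(label_cardinalities.values())


# Hypothetical cardinalities for a single metric.
base = {"model_id": 5, "environment": 3, "team": 20}

with_tier = {**base, "user_tier": 4}        # aggregated dimension
with_user = {**base, "userId": 50_000}      # raw, high-cardinality dimension

base_series = series_count(base)            # 5 * 3 * 20 = 300
tier_series = series_count(with_tier)       # 300 * 4 = 1,200
user_series = series_count(with_user)       # 300 * 50,000 = 15,000,000
```

Moving from `user_tier` to `userId` in this example takes one metric from about a thousand potential series to fifteen million, which is why the per-user detail belongs in log records instead.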

Optimisations:

Reduce log verbosity for non-error paths (omit latency breakdown for successful sub-100ms calls)
Use adaptive sampling: 100% error, 10% normal, 1% for high-volume cached paths
Compress log archives (gzip achieves 80–90% compression on JSON logs)
Use metric aggregation at collector before forwarding to reduce series count
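The adaptive sampling rates above can be sketched as a deterministic keep-or-drop decision keyed on the trace ID. This is a simplified stand-in for the collector's tail-based sampling processor, not a description of how that processor works; `keep_trace` and its parameters are hypothetical names.

```python
import hashlib

ERROR_RATE = 1.0          # keep 100% of error traces
NORMAL_RATE = 0.10        # keep 10% of normal traces
HIGH_VOLUME_RATE = 0.01   # keep 1% of high-volume cached paths


def sample_rate(is_error: bool, high_volume_cached: bool) -> float:
    """Errors always win; high-volume cached paths get the lowest rate."""
    if is_error:
        return ERROR_RATE
    if high_volume_cached:
        return HIGH_VOLUME_RATE
    return NORMAL_RATE


def keep_trace(trace_id: str, is_error: bool = False, high_volume_cached: bool = False) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # uniform value in [0, 1)
    return bucket < sample_rate(is_error, high_volume_cached)


kept = sum(keep_trace(f"trace-{i}") for i in range(10_000))
```

Hashing the trace ID instead of drawing a random number means every collector instance reaches the same decision for a given trace, so spans from one request are kept or dropped together.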

Indicative Cost Range

Scale	AI Requests/Day	Estimated Telemetry Cost/Month
Small	10,000	$200–$500
Medium	500,000	$2,000–$5,000
Large	5,000,000	$10,000–$25,000
Enterprise	50,000,000+	$50,000–$150,000 (with optimisation)

12. Trade-Off Analysis

Approach Comparison

Approach	Pros	Cons	Best For
Full-resolution logging (log every request in full)	Complete audit trail; defensible to regulators; full debugging capability	High storage cost; PII risk requires scrubbing; complex access control	Financial services, healthcare, regulated industries
Sampled trace + aggregated metrics only	80% lower telemetry cost; simpler; no PII in trace storage	Cannot reconstruct specific requests; insufficient for audit; gaps in debugging	Internal tools, low-risk AI features, cost-sensitive environments
Vendor-managed observability (Datadog, New Relic AI monitoring)	Faster time to value; managed infrastructure; built-in AI dashboards	Vendor lock-in; data residency concerns; limited schema control; high cost at scale	Organisations without dedicated platform engineering capability

Architectural Tensions

Tension	Description	Resolution
Completeness vs. Privacy	Full logging enables audit but risks PII exposure	PII scrubber at collector; log metadata not content; prompt hash not prompt text
Low latency vs. Synchronous telemetry	Synchronous telemetry adds latency to AI calls	Async emit to collector; fire-and-forget from AI Client Wrapper; batch at collector
Cardinality vs. Granularity	High-cardinality metrics enable granular analysis but explode cost	Use granular dimensions in logs; aggregated dimensions in metrics
Retention vs. Cost	Long retention enables trend analysis but storage cost is linear	Tiered storage with automated downsampling; aggregate old data before archival

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
OTel Collector crash	Low	High (blind flight)	Collector health check alert; signal delivery lag alert	Auto-restart via process supervisor; failover to standby
Metrics backend capacity exhaustion	Medium	High (no SLO visibility)	Storage utilisation alert at 80%	Increase storage; enable metric retention reduction; emergency cardinality reduction
Log storage ingestion lag	Medium	Medium (delayed audit trail)	Ingestion lag metric > 5 minutes	Scale log ingestion nodes; enable priority queuing
PII scrubber misconfiguration allows PII through	Low	Critical (data breach risk)	Regular PII audit scans on stored logs	Immediate: quarantine affected log segment; notify privacy officer; remediate and rescrub
Cost table missing model entry	Medium	Low (cost blind for new model)	Alert on null costUsd in logs	Add model to cost table; backfill estimates from provider billing API
High cardinality metric explosion	Medium	High (metrics backend OOM)	Series count growth rate alert	Emergency: drop high-cardinality labels; increase backend capacity

Cascading Scenarios

Scenario 1: Log backend ingestion fails → Audit trail gaps → Regulatory non-compliance finding → Mandatory remediation. Mitigation: dual-write to backup log store with 24-hour buffer.
Scenario 2: Collector PII scrubber disabled for performance → PII accumulates in logs → Data breach → Privacy Act notification. Mitigation: PII scrubbing is non-bypassable; performance gains must be found elsewhere.

14. Regulatory Considerations

Regulation	Clause	Requirement	Telemetry Implementation
APRA CPS 230	Para 53–57 (Operational Risk Management)	Critical AI systems require documented monitoring and incident management	SLO dashboards, incident detection alerts, audit trail per this pattern
APRA CPS 234	Para 36–37 (Information Security Incident Response)	Security events must be detected, logged, and reported within defined timeframes	Security-relevant telemetry (injection attempts, PII leaks) alert within SLO
Privacy Act 1988 (AU)	APP 11 (Security of Personal Information)	PII in AI systems must be protected and access controlled	PII scrubber, access-controlled log store, data retention limits
EU AI Act	Article 12 (Record-keeping for high-risk AI)	High-risk AI systems must log inputs and outputs to enable post-hoc review	Structured log schema with requestId linkable to outcome; 7-year retention for high-risk
EU AI Act	Article 9 (Risk Management System)	Continuous monitoring of AI system performance	Drift monitoring (EAAPL-OBS005), quality metrics, SLO attainment
ISO/IEC 42001	Clause 9.1 (Monitoring, measurement, analysis)	AI management system requires performance monitoring	This pattern implements the technical layer; the governance layer is covered in Section 9 (Governance Considerations)
NIST AI RMF	GOVERN 1.7, MANAGE 2.2	AI risks must be tracked, monitored, and reported with defined metrics	Metric taxonomy, SLO table, incident integration per this pattern

15. Reference Implementations

AWS

AI Client Wrapper: AWS SDK for Bedrock with custom interceptor; OpenTelemetry Java/Python/Node SDK
OTel Collector: AWS Distro for OpenTelemetry (ADOT) on ECS or EKS
Metrics: Amazon CloudWatch with custom namespace AI/Inference
Logs: Amazon CloudWatch Logs with structured JSON; export to S3 for cold storage
Traces: AWS X-Ray with OpenTelemetry SDK
Cost: AWS Cost and Usage Report + custom Lambda aggregation
Dashboards: Amazon CloudWatch Dashboards + Amazon Managed Grafana

Azure

AI Client Wrapper: Azure SDK for OpenAI with Application Insights telemetry initialiser
OTel Collector: Azure Monitor OpenTelemetry Distro on AKS
Metrics: Azure Monitor Metrics with custom dimensions
Logs: Azure Monitor Logs (Log Analytics Workspace)
Traces: Azure Application Insights distributed tracing
Cost: Azure Cost Management API + Power BI
Dashboards: Azure Managed Grafana or Azure Workbooks

GCP

AI Client Wrapper: Google Cloud AI Platform SDK with custom interceptor
OTel Collector: OpenTelemetry Collector on GKE with Cloud Operations exporter
Metrics: Google Cloud Monitoring with custom metrics
Logs: Google Cloud Logging with structured JSON
Traces: Google Cloud Trace
Cost: Google Cloud Billing API + BigQuery export
Dashboards: Google Cloud Monitoring Dashboards + Looker

On-Premises

AI Client Wrapper: OpenTelemetry SDK (language-native)
OTel Collector: OpenTelemetry Collector Contrib (self-hosted)
Metrics: Prometheus + Thanos for long-term storage
Logs: OpenSearch (Elasticsearch alternative) or Loki
Traces: Jaeger or Grafana Tempo
Cost: Custom aggregation service querying AI API billing endpoints
Dashboards: Grafana (open source)

16. Related Patterns

Pattern ID	Pattern Name	Relationship	Notes
EAAPL-OBS002	Prompt Monitoring	Extends	Uses structured log schema from this pattern; adds prompt-specific anomaly detection
EAAPL-OBS003	Hallucination Detection	Depends On	Requires trace and log data from this pattern to link detections to specific requests
EAAPL-OBS004	AI Incident Management	Depends On	Alert rules and runbooks reference metrics and SLOs defined here
EAAPL-OBS005	Model Drift Detection	Depends On	Input/output distribution metrics sourced from this telemetry layer
EAAPL-OBS006	AI Cost Observability	Extends	Builds cost attribution and FinOps layer on cost metrics defined here
EAAPL-OBS007	Distributed AI Tracing	Extends	Detailed trace architecture; trace collection infrastructure shared with this pattern
EAAPL-OBS008	AI Performance Benchmarking	Depends On	Golden dataset regression uses telemetry data for baseline comparison

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Rationale
Adoption Breadth	4	Widely adopted at organisations with mature platform engineering practices
Tooling Ecosystem	5	OpenTelemetry, Prometheus, and Grafana are mature; OpenTelemetry GenAI semantic conventions have been available since 2024 and continue to evolve
Operational Runbook Coverage	4	Standard runbooks exist; AI-specific runbooks still organisation-specific
Regulatory Evidence	4	Used by APRA-regulated entities; audit findings confirm pattern adequacy
Cost Predictability	3	Cardinality-driven cost surprises remain common; requires active management
Team Skill Availability	4	OpenTelemetry skills broadly available; AI-specific extensions require training

18. Revision History

Version	Date	Author	Changes
1.0.0	2026-06-12	EAAPL Working Group	Initial publication
