EAAPL: Enterprise AI Architecture Pattern Library

EAAPL-OBS001 · AI Telemetry Architecture


Pattern ID: EAAPL-OBS001
Status: Proven
Complexity: Medium
Tags: observability, slo, llm, audit-logging, medium-complexity
Version: 1.0.0
Last Reviewed: 2026-06-12


1. Executive Summary

AI systems present unique observability challenges that traditional APM tooling does not address. Unlike deterministic software, AI systems exhibit probabilistic behaviour, token-based economics, and latency profiles driven by external model APIs rather than internal compute. Without purpose-built telemetry, engineering teams operate blind: they cannot distinguish a model quality regression from a latency spike, cannot allocate AI costs to business units, and cannot demonstrate regulatory compliance when auditors request evidence of AI system behaviour.

This pattern defines a comprehensive telemetry architecture for AI systems covering metrics, structured logs, and distributed traces. It establishes an AI-specific metric taxonomy — token usage, prompt vs completion token splits, cache hit rates, per-model latency percentiles, error rates by error class, and cost per request — alongside a structured logging schema and distributed tracing conventions built on OpenTelemetry GenAI semantic conventions. The outcome is full operational visibility: SLO attainment dashboards, cost attribution by team and product, compliance audit trails, and the data foundation required for hallucination detection, drift monitoring, and incident management patterns that extend this baseline.

Target Audience: CIO, CTO, Platform Engineering Lead, AI Engineering Lead
Time to Implement: 6–10 weeks


2. Problem Statement

Business Problem

Organisations deploying AI are unable to answer basic operational questions: Which team is driving our AI spend? Is our AI improving or degrading over time? If a customer complains our AI gave wrong advice, can we reconstruct what happened? Without answers, AI budgets spiral, quality degrades silently, and regulatory exposure grows.

Technical Problem

Traditional APM tools (Datadog, New Relic, Dynatrace) capture HTTP latency and error rates but lack primitives for token counts, prompt versions, cache hit ratios, model-specific error codes, or cost attribution at request level. AI pipelines span multiple components — prompt assembly, vector retrieval, LLM API, output filtering — yet appear as a single opaque HTTP call in conventional traces.

Symptoms

  • AI costs attributed to a single cost centre with no per-team or per-feature breakdown
  • Latency alerts fire on p99 spikes but root cause (LLM API vs. vector DB vs. prompt assembly) is unknown
  • Incident post-mortems cannot reconstruct the exact prompt, model version, and context that produced a harmful output
  • Prompt engineering teams release updates with no visibility into production impact
  • Cache infrastructure exists but cache hit rate is unknown and uncorrelated to cost

Cost of Inaction

  • AI cost overruns of 40–200% are common when per-request cost tagging is absent (Gartner 2025)
  • Mean time to diagnose AI quality incidents exceeds 8 hours without structured AI telemetry
  • Regulatory audit findings when AI decision logs cannot be reconstructed (APRA, EU AI Act)
  • Repeated model regressions shipped to production due to absence of quality regression gates

3. Context

When to Apply

  • Any production AI system processing > 1,000 requests/day
  • AI systems handling financial, health, legal, or safety-relevant decisions
  • Organisations with multiple teams sharing AI infrastructure
  • Systems where LLM API cost exceeds $500/month (above this level, cost visibility has positive ROI)
  • Before deploying dependent patterns: EAAPL-OBS002 (Prompt Monitoring), EAAPL-OBS003 (Hallucination Detection)

When NOT to Apply

  • Internal proof-of-concept with < 30-day planned lifespan
  • Single-team, single-use-case AI tools where cost allocation is irrelevant
  • Air-gapped environments where telemetry egress is prohibited (use on-premises collector variant)

Prerequisites

| Prerequisite | Required | Notes |
| --- | --- | --- |
| OpenTelemetry SDK in application stack | Required | Java, Python, Node.js, Go SDKs available |
| Centralised log aggregation (e.g., OpenSearch, Splunk, Loki) | Required | Must support JSON structured logs |
| Metrics platform (Prometheus, Datadog, CloudWatch) | Required | Must support custom metric dimensions |
| Distributed trace backend (Jaeger, Tempo, X-Ray, Cloud Trace) | Required | Must accept OTLP protocol |
| AI API cost data access | Required | Provider billing API or webhook for per-request cost |
| Secret management for API keys | Required | Keys must not appear in telemetry data |

Industry Applicability

| Industry | Applicability | Primary Driver |
| --- | --- | --- |
| Financial Services | High | APRA CPS234, audit trail requirements, cost allocation |
| Healthcare | High | Privacy Act, clinical AI accountability, safety |
| Government | High | Mandatory AI transparency, FOI obligations |
| Retail / E-Commerce | Medium | Cost optimisation, personalisation quality |
| Technology / SaaS | High | Multi-tenant cost attribution, SLO management |
| Legal Services | High | Professional liability, output auditability |
| Manufacturing | Medium | Predictive maintenance quality tracking |

4. Architecture Overview

The AI Telemetry Architecture is structured as a four-layer system: instrumentation, collection, storage, and consumption. Each layer has distinct responsibilities and technology choices.

Instrumentation Layer

Every AI pipeline component is instrumented using the OpenTelemetry SDK with GenAI semantic conventions. The instrumentation layer is responsible for emitting three signal types: metrics (counters, histograms, gauges), structured logs (JSON with a canonical AI log schema), and traces (spans with AI-specific span attributes). Instrumentation is implemented as middleware or interceptors on the AI SDK layer — not in application business logic — so that telemetry concerns do not leak into product code. The canonical approach is an AI client wrapper that intercepts every LLM call, records span start and end, attaches token counts from the API response, calculates cost using a cost coefficient table, and emits all three signal types atomically.
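The wrapper approach described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the OpenTelemetry SDK; the names `instrumented_call`, `LlmResponse`, `COST_PER_1K`, and the model name and coefficients are hypothetical. Note that the prompt itself is deliberately not written to the emitted record.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical cost coefficients (USD per 1K tokens); real values come from provider pricing.
COST_PER_1K: Dict[str, Dict[str, float]] = {
    "example-model": {"input": 0.003, "output": 0.015},
}


@dataclass
class LlmResponse:
    """Minimal stand-in for a provider response that carries token usage."""
    text: str
    input_tokens: int
    output_tokens: int


def calculate_cost(model_id: str, input_tokens: int, output_tokens: int) -> Optional[float]:
    """Return the request cost in USD, or None when the model has no coefficient."""
    rates = COST_PER_1K.get(model_id)
    if rates is None:
        return None
    cost = input_tokens / 1000 * rates["input"] + output_tokens / 1000 * rates["output"]
    return round(cost, 6)


def instrumented_call(
    model_id: str,
    prompt: str,
    call_model: Callable[[str, str], LlmResponse],
    emit: Callable[[dict], None],
) -> LlmResponse:
    """Wrap one LLM call and emit a single telemetry record, on success or failure."""
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    response: Optional[LlmResponse] = None
    error_code: Optional[str] = None
    try:
        response = call_model(model_id, prompt)
        return response
    except Exception as exc:
        # Record the error class, then let the caller handle the exception.
        error_code = type(exc).__name__
        raise
    finally:
        ok = response is not None
        emit({
            "requestId": request_id,
            "modelId": model_id,
            "inputTokens": response.input_tokens if ok else None,
            "outputTokens": response.output_tokens if ok else None,
            "latencyMs": round((time.perf_counter() - start) * 1000, 3),
            "errorCode": error_code,
            "costUsd": calculate_cost(model_id, response.input_tokens, response.output_tokens) if ok else None,
        })
```

Because the record is emitted in the `finally` block, product code that calls `instrumented_call` never touches telemetry directly, which is the separation of concerns the pattern asks for.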

Metric Taxonomy

The AI metric taxonomy distinguishes three categories. Throughput metrics track requests per second per model, tokens per minute (split prompt vs. completion), and concurrent in-flight requests. Quality metrics track cache hit rate, error rate by error type (rate limit, context length exceeded, content filter, timeout, model error), and model availability (percentage of requests receiving non-error responses). Economics metrics track cost per request (USD), cost per 1K tokens by model, and cumulative cost by dimension (team, product, feature, environment). All metrics carry consistent dimension labels: model_id, model_version, environment, team, product, use_case, and error_type where applicable.
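One way to keep dimension labels consistent is to reject any metric update that omits a mandatory label. The sketch below is an in-memory illustration only; `MetricRegistry` and the metric name are hypothetical, and a real deployment would use the metrics SDK of the chosen backend.

```python
from collections import Counter
from typing import Dict

# Mandatory dimensions taken from the taxonomy above; error_type is optional.
REQUIRED_LABELS = ("model_id", "model_version", "environment", "team", "product", "use_case")


class MetricRegistry:
    """In-memory counter store that enforces the shared label set."""

    def __init__(self) -> None:
        self.counters: Counter = Counter()

    def increment(self, name: str, labels: Dict[str, str], value: float = 1.0) -> None:
        missing = [key for key in REQUIRED_LABELS if key not in labels]
        if missing:
            raise ValueError(f"metric {name!r} is missing labels: {missing}")
        # Sort the labels so that the same label set always maps to the same series.
        series_key = (name, tuple(sorted(labels.items())))
        self.counters[series_key] += value
```

Enforcing the label set at the point of emission means dashboards and cost reports can always group by the same dimensions.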

Structured Log Schema

Every LLM request produces a structured log record in a canonical JSON schema. The schema includes:

  • requestId (UUID v4)
  • traceId (W3C trace context) and spanId
  • timestamp (ISO 8601 UTC)
  • modelId, modelVersion, and promptVersion
  • promptHash (SHA-256 of the prompt template, not the content)
  • inputTokens, outputTokens, and cachedTokens
  • latencyMs (total) and latencyBreakdown (object with promptAssemblyMs, vectorRetrievalMs, llmApiMs, outputFilterMs)
  • cacheHit (boolean)
  • userId (hashed) and tenantId
  • toolCalls (array of tool name and latency)
  • errorCode (null or error type string)
  • costUsd
  • environment

Prompt content is not logged by default because it may contain PII; logging it requires explicit opt-in with data classification controls.
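A record following this schema could be assembled as shown below. This is a sketch under stated assumptions: the function name `build_log_record` is hypothetical, total latency is taken as the sum of the breakdown, and the user ID is hashed with plain SHA-256 for brevity (a keyed hash would usually be preferable in production).

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone
from typing import Dict, List, Optional


def sha256_hex(value: str) -> str:
    """Hex-encoded SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_log_record(
    *,
    trace_id: str,
    span_id: str,
    model_id: str,
    model_version: str,
    prompt_version: str,
    prompt_template: str,
    input_tokens: int,
    output_tokens: int,
    cached_tokens: int,
    latency_breakdown: Dict[str, float],
    cache_hit: bool,
    user_id: str,
    tenant_id: str,
    cost_usd: Optional[float],
    environment: str,
    tool_calls: Optional[List[dict]] = None,
    error_code: Optional[str] = None,
) -> dict:
    """Build one canonical log record; prompt content is never included."""
    return {
        "requestId": str(uuid.uuid4()),
        "traceId": trace_id,
        "spanId": span_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "modelId": model_id,
        "modelVersion": model_version,
        "promptVersion": prompt_version,
        # Hash the template only, so variable (possibly personal) content is excluded.
        "promptHash": sha256_hex(prompt_template),
        "inputTokens": input_tokens,
        "outputTokens": output_tokens,
        "cachedTokens": cached_tokens,
        # Assumption: total latency is the sum of the per-stage timings.
        "latencyMs": sum(latency_breakdown.values()),
        "latencyBreakdown": latency_breakdown,
        "cacheHit": cache_hit,
        "userId": sha256_hex(user_id),
        "tenantId": tenant_id,
        "toolCalls": tool_calls or [],
        "errorCode": error_code,
        "costUsd": cost_usd,
        "environment": environment,
    }
```

Serialising the result with `json.dumps(record)` yields one JSON line that a log aggregator can index field by field.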

Distributed Trace Architecture

Traces propagate via W3C Trace Context headers through every component in the AI pipeline. The trace begins at the API gateway where the initial span is created. Child spans are created for: authentication check, rate limit evaluation, prompt assembly, vector store retrieval (one span per retrieval call), LLM API call (one span per model invocation, supporting multi-turn and agentic patterns), output safety filter, and response serialisation. Each span carries AI-specific span attributes per OpenTelemetry GenAI conventions.
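Propagation relies on the W3C `traceparent` header, which has the form `version-traceid-parentid-flags`. The sketch below generates a root header and derives a child header that keeps the trace ID but gets a new span ID. The helper names are hypothetical, and in practice the OpenTelemetry SDK propagators do this work.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent span-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Create a root traceparent value, as the API gateway would for a new request."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)     # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"  # lowest bit is the sampled flag
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent: str) -> str:
    """Derive the header a component sends downstream: same trace ID, new span ID."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"invalid traceparent: {parent!r}")
    version, trace_id, _parent_span_id, flags = match.groups()
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Each pipeline component (prompt assembly, vector retrieval, LLM call, output filter) would call the equivalent of `child_traceparent` so that all of its spans share one trace ID.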

Collection Architecture

The OpenTelemetry Collector runs as a sidecar or daemon set, receiving signals via OTLP (gRPC and HTTP). The collector pipeline includes processors for PII scrubbing (remove prompt content from logs before storage unless classified as approved), attribute enrichment (add cost calculations, environment tags), sampling (100% error traces, 10% success traces, 1% high-volume paths), and batching for efficiency. The collector fans out to multiple backends: metrics to Prometheus or CloudWatch, logs to OpenSearch or Splunk, traces to Jaeger or Tempo.
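The sampling policy above (100% of error traces, 10% of successful traces, 1% of high-volume paths) can be illustrated with a deterministic, hash-based decision. This is a simplified sketch, not the collector's tail-based sampling processor; `keep_trace` and `SAMPLE_RATES` are hypothetical names.

```python
import hashlib

# Rates taken from the collection policy described above.
SAMPLE_RATES = {"error": 1.0, "success": 0.10, "high_volume": 0.01}


def keep_trace(trace_id: str, is_error: bool, high_volume: bool = False) -> bool:
    """Decide whether to keep a trace; the same trace ID always gets the same answer."""
    if is_error:
        return True  # error traces are always retained
    rate = SAMPLE_RATES["high_volume"] if high_volume else SAMPLE_RATES["success"]
    # Map the trace ID to a stable value in [0, 1) and compare it with the rate.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hashing the trace ID rather than drawing a random number means every collector instance reaches the same decision for a given trace, so traces are not half-kept.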

Storage Tiering

Hot storage (0–30 days): full-resolution metrics at 15-second granularity, all structured logs, all sampled traces. Warm storage (30–90 days): 1-minute metric aggregates, logs retained for compliance. Cold storage (90 days–7 years): aggregated metrics, compliance-required logs, legal hold traces. Retention periods are configurable per data classification and regulatory jurisdiction.
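The tier boundaries can be expressed as a small lookup. The sketch assumes inclusive upper bounds (day 30 is still hot, day 90 is still warm) and treats anything older than seven years as expired; both are illustrative assumptions, since the source leaves retention configurable per classification and jurisdiction.

```python
def storage_tier(age_days: int) -> str:
    """Map the age of a telemetry record to its storage tier."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 30:
        return "hot"      # full-resolution metrics, all logs, all sampled traces
    if age_days <= 90:
        return "warm"     # 1-minute aggregates, compliance logs
    if age_days <= 7 * 365:
        return "cold"     # aggregated metrics, compliance logs, legal-hold traces
    return "expired"      # assumption: eligible for destruction after 7 years
```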


5. Architecture Diagram

flowchart TD
    subgraph Pipeline["AI Request Pipeline"]
        A[API Gateway]
        B[LLM Client Wrapper]
        C[Output Safety Filter]
    end
    subgraph Collection["OTel Collection"]
        D[OTel Collector]
        E[PII Scrubber + Enrichment]
    end
    subgraph Storage["Signal Storage"]
        F[(Metrics Store)]
        G[(Log Store)]
        H[(Trace Backend)]
    end
    A --> B
    B --> C
    A -->|spans| D
    B -->|tokens + cost + spans| D
    C -->|filter spans| D
    D --> E
    E -->|metrics| F
    E -->|logs| G
    E -->|traces| H
    F --> I[Dashboards + Alerts]
    G --> J[Audit Reports]
    H --> J
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#fef9c3,stroke:#eab308
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| AI Client Wrapper | SDK Library | Intercepts all LLM API calls; emits metrics, logs, spans; calculates cost | Custom wrapper over OpenAI SDK, Anthropic SDK, Bedrock SDK | Critical |
| OpenTelemetry Collector | Infrastructure | Receives OTLP signals; processes (PII scrub, enrich, sample); fans out to backends | OTel Collector Contrib, AWS ADOT, Grafana Alloy | Critical |
| PII Scrubber Processor | Collector Processor | Detects and removes PII from log fields before storage | Presidio, custom regex processor, OTel transform processor | Critical |
| Metrics Backend | Storage | Time-series storage for AI metrics; supports high-cardinality dimensions | Prometheus + Thanos, Datadog, CloudWatch, InfluxDB | Critical |
| Log Aggregation | Storage | Structured JSON log storage; supports full-text and field queries | OpenSearch, Splunk, Loki, Azure Monitor Logs | Critical |
| Trace Backend | Storage | Distributed trace storage; supports trace search and waterfall UI | Jaeger, Grafana Tempo, AWS X-Ray, Google Cloud Trace | High |
| Cost Database | Storage | Per-request cost records with full dimension tagging | PostgreSQL with TimescaleDB, ClickHouse, BigQuery | High |
| Dashboarding | Consumption | SLO dashboards, cost dashboards, quality dashboards | Grafana, CloudWatch Dashboards, Datadog Dashboards | High |
| Alert Manager | Consumption | Route metric threshold alerts to on-call channels | Alertmanager, PagerDuty, OpsGenie, CloudWatch Alarms | Critical |
| Audit Report Engine | Consumption | Generate compliance audit reports from log data | Scheduled queries + PDF/CSV export; Splunk reports | High |
| Cost Attribution Engine | Consumption | Aggregate cost by team/product/feature; send to FinOps | Custom aggregation service; AWS Cost and Usage Report integration | Medium |
| Sampling Controller | Collector Processor | Dynamic sampling rates by traffic type and error status | OpenTelemetry tail-based sampling processor | Medium |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | API Gateway | Receives request; creates root span with W3C trace context; injects trace headers | Root span, request log entry |
| 2 | AI Client Wrapper | Intercepts LLM call; records start timestamp; logs requestId, userId, tenantId, promptVersion | Span started, pre-call log record |
| 3 | Prompt Assembly Service | Retrieves prompt template; assembles with context; records assemblyMs | Assembled prompt, span with promptHash attribute |
| 4 | Vector Store Client | Executes similarity search; records retrievalMs, document count, top-k scores | Retrieved context chunks, retrieval span |
| 5 | LLM API Client | Sends request to model provider; awaits response; records inputTokens, outputTokens, cachedTokens, llmApiMs | Raw completion, LLM span with token attributes |
| 6 | Cost Calculator | Applies cost coefficient (USD per 1K tokens) to token counts; calculates requestCostUsd | costUsd field populated |
| 7 | Output Safety Filter | Evaluates output for policy violations; records filter decision, filterMs | Filtered output, filter span |
| 8 | AI Client Wrapper | Closes span; emits complete structured log record; increments metrics counters and histograms | Completed trace, log record, metric increments |
| 9 | OTel Collector | Receives OTLP signals; applies PII scrub, enrichment, sampling | Clean signals ready for storage |
| 10 | Backends | Persist metrics, logs, traces to respective storage systems | Queryable telemetry data |

Error Flow

| Error Scenario | Detection | Action | Recovery |
| --- | --- | --- | --- |
| LLM API rate limit (429) | AI Client Wrapper catches HTTP 429 | Log errorCode=RATE_LIMIT; increment error_rate counter with error_type label; create error span | Retry with exponential backoff; alert if rate > threshold |
| Context length exceeded | HTTP 400 with context_length_exceeded code | Log errorCode=CONTEXT_LENGTH; increment counter; record inputTokens that caused overflow | Truncation fallback; alert prompt engineering team |
| OTel Collector unavailable | SDK cannot connect to collector endpoint | Buffer signals in memory (up to configured limit); retry; drop oldest if buffer full | Collector restart; no data loss for short outages |
| PII detected in log field | PII scrubber processor identifies PII pattern | Replace field value with [REDACTED]; log PII detection event to separate audit stream | Alert data governance team; do not store raw PII |
| Cost calculation failure | Cost coefficient table missing model entry | Log warning; record costUsd=null; continue request | Add model to cost table; backfill cost estimates |
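The first two error scenarios map provider responses onto stable `errorCode` values, and the rate-limit case retries with exponential backoff. A minimal sketch follows. Only `RATE_LIMIT` and `CONTEXT_LENGTH` come from the table above; the `CONTENT_FILTER`, `TIMEOUT`, and `MODEL_ERROR` codes, the `content_filter` provider code, and the backoff parameters are illustrative assumptions.

```python
from typing import List, Optional


def classify_error(status_code: int, provider_code: Optional[str] = None) -> str:
    """Map an HTTP status and optional provider error code to a telemetry errorCode."""
    if status_code == 429:
        return "RATE_LIMIT"
    if status_code == 400 and provider_code == "context_length_exceeded":
        return "CONTEXT_LENGTH"
    if provider_code == "content_filter":   # hypothetical provider code
        return "CONTENT_FILTER"
    if status_code in (408, 504):           # assumption: treat these as timeouts
        return "TIMEOUT"
    return "MODEL_ERROR"


def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 8.0) -> List[float]:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * 2**attempt) for attempt in range(max_retries)]
```

In practice a random jitter is usually added to each delay so that many clients do not retry at the same instant; it is left out here to keep the schedule deterministic.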

8. Security Considerations

Authentication: All telemetry endpoints require mutual TLS (mTLS) between instrumented services and OTel Collector. Collector-to-backend connections use service account credentials stored in a secrets manager, never in environment variables or config files.

Authorisation: Telemetry data access follows least-privilege. Engineers access dashboards via SSO with RBAC. Cost data restricted to team leads and FinOps. Audit logs accessible only to compliance, legal, and senior engineering. Trace data containing AI outputs restricted to AI engineering and incident responders.

Secrets Management: AI API keys (OpenAI, Anthropic, Bedrock) are rotated quarterly and stored in HashiCorp Vault or AWS Secrets Manager. Keys never appear in logs, spans, or metrics. The AI Client Wrapper retrieves keys at runtime from the secrets manager; keys are never interpolated into log messages.

Data Classification: Prompt content is classified as Internal (employee prompts) or Confidential (customer data in context). Prompt content logging requires explicit classification approval. Default: prompt content is NOT logged; only promptHash (SHA-256 of template, no variable content) is recorded.

Encryption: All telemetry data encrypted in transit (TLS 1.3) and at rest (AES-256). Log archives in cold storage use envelope encryption with customer-managed keys.

Auditability: Every access to telemetry data is itself logged. Dashboard queries, report exports, and direct database queries produce access log records. Audit logs are write-once and stored in a separate immutable log store.

OWASP LLM Top 10 Coverage

| OWASP LLM Risk | Telemetry Control | Implementation |
| --- | --- | --- |
| LLM01 Prompt Injection | Prompt content anomaly detection in monitoring layer | Alert on prompts containing injection signatures; log injection attempts |
| LLM02 Insecure Output Handling | Output safety filter spans record filter decisions | Track filter bypass rate; alert on zero-filter-hits anomaly |
| LLM03 Training Data Poisoning | Input distribution drift metrics | PSI alerts on input feature drift that may indicate poisoning |
| LLM04 Model Denial of Service | Token usage metrics, rate limit error rate | Alert on token-per-minute spikes; rate limit enforcement at gateway |
| LLM05 Supply Chain Vulnerabilities | Model version tracking in all telemetry records | Detect unexpected model version changes in production |
| LLM06 Sensitive Information Disclosure | PII scrubber processor; prompt content classification | Alert on PII detection in prompts; restrict access to AI output logs |
| LLM07 Insecure Plugin Design | Tool call spans record tool name, inputs, outputs | Audit all tool invocations; alert on unexpected tool calls |
| LLM08 Excessive Agency | Tool call frequency metrics; scope metrics | Alert on tool call rate exceeding expected bounds per workflow |
| LLM09 Overreliance | User feedback metrics; hallucination rate metric | Track downstream outcomes; surface low-confidence responses |
| LLM10 Model Theft | API key access logs; unusual exfiltration patterns | Alert on bulk completions from single key; anomalous output volume |

9. Governance Considerations

Responsible AI: Telemetry data is the primary evidence base for AI governance. The telemetry architecture must retain sufficient data to answer: Who made this AI call? What was the prompt context? What model and version responded? What was the output? Was it filtered? All these questions must be answerable from the audit trail.

Model Risk Management: Model version is a mandatory dimension on all telemetry. When a model version changes, telemetry enables before/after comparison on quality, cost, and error metrics. This supports the model risk management process for material model changes.

Human Approval: Access to raw AI telemetry (especially prompt logs and output logs) requires approval from the data governance committee. Automated systems may read aggregated metrics; raw records require human approval for access.

Policy: Telemetry data retention policies must be documented and approved by legal and compliance. Minimum retention for regulatory purposes is 7 years for financial services (APRA), 5 years for healthcare (Privacy Act). Destruction schedules must be enforceable and audited.

Traceability: Every AI-influenced decision must be traceable from the business outcome back to the specific requestId, modelId, promptVersion, and context. This traceability chain is the foundation for regulatory audit and legal discovery.

Governance Artefacts

| Artefact | Owner | Frequency | Format |
| --- | --- | --- | --- |
| AI Telemetry Schema Registry | Platform Engineering | Updated on schema change | JSON Schema + changelog |
| Data Retention Policy | Legal / Compliance | Annual review | Policy document |
| Cost Attribution Report | FinOps + AI Platform | Monthly | Automated dashboard + PDF export |
| Model Version Change Log | AI Engineering | Per deployment | Linked to deployment record |
| PII Detection Incident Log | Data Governance | Per incident | Incident ticket + remediation record |
| Telemetry Access Audit Report | Security | Quarterly | Automated export from access log |

10. Operational Considerations

Monitoring: The telemetry system itself must be monitored. Collector pipeline health (signal throughput, drop rate, processing lag), backend storage capacity and ingestion rate, and alert delivery reliability are all first-class operational metrics.

Logging: Collector and backend logs are separate from AI application logs. They are stored in an operations log store, not the AI audit log store, to prevent circular dependencies.

Incident Response: If the telemetry system degrades, AI systems continue operating but enter a "blind flight" state. Runbooks must define escalation criteria and manual verification procedures for operating without telemetry.

Disaster Recovery: The OTel Collector runs in active-active pairs. Log and metric backends use replication. Trace data has lower durability requirements (losing recent traces in a DR event is acceptable); log and metric data require an RPO of at most 1 hour.

Capacity Planning: Token-based workloads can have very bursty telemetry volumes. Capacity planning must account for peak token volume, not average. A 10x burst capacity buffer is the minimum recommendation.
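As a worked example of the burst rule, the sketch below sizes log ingestion from an average daily request count. It assumes roughly 2 KB of log data per request (the figure quoted under Cost Drivers, taken here as 2,000 bytes) and applies the 10x buffer to the average rate; the function name is hypothetical.

```python
def required_ingest_bytes_per_second(
    avg_requests_per_day: float,
    bytes_per_request: int = 2000,   # assumption: ~2 KB of log data per AI request
    burst_factor: float = 10.0,      # minimum burst buffer recommended above
) -> float:
    """Log ingestion bandwidth to provision, in bytes per second."""
    avg_requests_per_second = avg_requests_per_day / 86_400  # seconds per day
    return avg_requests_per_second * bytes_per_request * burst_factor
```

For 864,000 requests per day (10 requests per second on average), this gives 10 × 2,000 × 10 = 200,000 bytes per second of provisioned ingestion capacity.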

SLO Table

| SLO | Target | Measurement | Alert Threshold |
| --- | --- | --- | --- |
| Telemetry signal delivery lag | < 30 seconds p99 | Collector processing lag metric | > 60 seconds for 5 minutes |
| Log query response time | < 5 seconds for 24h queries | Log backend query latency | > 10 seconds sustained |
| Alert delivery time | < 2 minutes from threshold breach | Alert delivery timestamp vs. breach timestamp | > 5 minutes |
| Telemetry data completeness | > 99.9% of AI requests have log record | Correlation of AI request count vs. log record count | < 99% for 1 hour |
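The completeness SLO in the last row reduces to a ratio check. The sketch below evaluates one measurement window; it ignores the "for 1 hour" duration condition, which an alerting system would apply on top, and the function name and status strings are hypothetical.

```python
def completeness_status(
    request_count: int,
    log_record_count: int,
    target: float = 0.999,      # SLO: more than 99.9% of requests have a log record
    alert_below: float = 0.99,  # alert threshold: below 99%
) -> str:
    """Classify telemetry completeness for one measurement window."""
    if request_count == 0:
        return "ok"  # nothing to measure in this window
    ratio = log_record_count / request_count
    if ratio < alert_below:
        return "alert"
    if ratio > target:
        return "ok"
    return "below_target"
```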

Disaster Recovery Table

| Component | RTO | RPO | Recovery Approach |
| --- | --- | --- | --- |
| OTel Collector | 5 minutes | Near-zero (active-active) | Auto-failover to standby collector |
| Metrics Backend | 15 minutes | 1 hour | Prometheus TSDB snapshot restore |
| Log Aggregation | 30 minutes | 1 hour | Index restore from object storage |
| Trace Backend | 60 minutes | 4 hours | Partial data acceptable; restore from object storage |
| Cost Database | 30 minutes | 1 hour | PostgreSQL streaming replication |

11. Cost Considerations

Cost Drivers

| Driver | Description | Relative Cost |
| --- | --- | --- |
| Log ingestion volume | Token counts and full request metadata generate ~2KB per AI request at scale | High |
| Trace storage | Full distributed traces with AI span attributes are 5–10x larger than typical traces | Medium-High |
| Metrics cardinality | High-cardinality dimensions (userId, tenantId per metric) can cause metric explosion | High if uncontrolled |
| PII scrubber compute | Regex and NER-based PII detection adds ~5ms per log record; scales with volume | Medium |
| Cold storage archival | 7-year retention of log data at enterprise scale | Medium (decreasing) |

Scaling Risks: Metrics cardinality is the primary cost scaling risk. Adding userId as a label on high-volume metrics multiplies series count by user count. Use aggregated dimensions (user_tier, not userId) for metrics; reserve userId for log records.
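The multiplication effect is easy to see with a worst-case series count, which is the product of the number of distinct values per label. The label cardinalities below are made-up illustrative numbers, and `series_count` is a hypothetical helper.

```python
from math import prod
from typing import Dict


def series_count(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Illustrative cardinalities only.
base = {"model_id": 5, "environment": 3, "team": 20, "error_type": 6}
with_user_tier = {**base, "user_tier": 4}
with_user_id = {**base, "userId": 100_000}

print(series_count(base))            # 1800
print(series_count(with_user_tier))  # 7200
print(series_count(with_user_id))    # 180000000
```

With these numbers, an aggregated `user_tier` label multiplies the series count by 4, while a raw `userId` label multiplies it by 100,000.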

Optimisations:

  • Reduce log verbosity for non-error paths (omit latency breakdown for successful sub-100ms calls)
  • Use adaptive sampling: 100% error, 10% normal, 1% for high-volume cached paths
  • Compress log archives (gzip achieves 80–90% compression on JSON logs)
  • Use metric aggregation at collector before forwarding to reduce series count

Indicative Cost Range

| Scale | AI Requests/Day | Estimated Telemetry Cost/Month |
| --- | --- | --- |
| Small | 10,000 | $200–$500 |
| Medium | 500,000 | $2,000–$5,000 |
| Large | 5,000,000 | $10,000–$25,000 |
| Enterprise | 50,000,000+ | $50,000–$150,000 (with optimisation) |

12. Trade-Off Analysis

Approach Comparison

| Approach | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Full-resolution logging (log every request complete) | Complete audit trail; regulatory defensible; full debugging capability | High storage cost; PII risk requires scrubbing; complex access control | Financial services, healthcare, regulated industries |
| Sampled trace + aggregated metrics only | 80% lower telemetry cost; simpler; no PII in trace storage | Cannot reconstruct specific requests; insufficient for audit; gaps in debugging | Internal tools, low-risk AI features, cost-sensitive environments |
| Vendor-managed observability (Datadog, New Relic AI monitoring) | Faster time to value; managed infrastructure; built-in AI dashboards | Vendor lock-in; data residency concerns; limited schema control; high cost at scale | Organisations without dedicated platform engineering capability |

Architectural Tensions

| Tension | Description | Resolution |
| --- | --- | --- |
| Completeness vs. Privacy | Full logging enables audit but risks PII exposure | PII scrubber at collector; log metadata not content; prompt hash not prompt text |
| Low latency vs. Synchronous telemetry | Synchronous telemetry adds latency to AI calls | Async emit to collector; fire-and-forget from AI Client Wrapper; batch at collector |
| Cardinality vs. Granularity | High-cardinality metrics enable granular analysis but explode cost | Use granular dimensions in logs; aggregated dimensions in metrics |
| Retention vs. Cost | Long retention enables trend analysis but storage cost is linear | Tiered storage with automated downsampling; aggregate old data before archival |

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| OTel Collector crash | Low | High (blind flight) | Collector health check alert; signal delivery lag alert | Auto-restart via process supervisor; failover to standby |
| Metrics backend capacity exhaustion | Medium | High (no SLO visibility) | Storage utilisation alert at 80% | Increase storage; enable metric retention reduction; emergency cardinality reduction |
| Log storage ingestion lag | Medium | Medium (delayed audit trail) | Ingestion lag metric > 5 minutes | Scale log ingestion nodes; enable priority queuing |
| PII scrubber misconfiguration allows PII through | Low | Critical (data breach risk) | Regular PII audit scans on stored logs | Immediate: quarantine affected log segment; notify privacy officer; remediate and rescrub |
| Cost table missing model entry | Medium | Low (cost blind for new model) | Alert on null costUsd in logs | Add model to cost table; backfill estimates from provider billing API |
| High cardinality metric explosion | Medium | High (metrics backend OOM) | Series count growth rate alert | Emergency: drop high-cardinality labels; increase backend capacity |

Cascading Scenarios

  • Scenario 1: Log backend ingestion fails → Audit trail gaps → Regulatory non-compliance finding → Mandatory remediation. Mitigation: dual-write to backup log store with 24-hour buffer.
  • Scenario 2: Collector PII scrubber disabled for performance → PII accumulates in logs → Data breach → Privacy Act notification. Mitigation: PII scrubbing is non-bypassable; performance optimisation must find alternative.
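The scrubbing step that Scenario 2 treats as non-bypassable can be sketched as a pass over every string field of a log record, replacing any value that matches a PII pattern with `[REDACTED]` and returning a list of detections for the separate audit stream. The two regular expressions are deliberately simple, hypothetical examples; production scrubbers typically combine broader patterns with NER-based detection.

```python
import re
from typing import Dict, List, Tuple

# Simplified, illustrative patterns only; real deployments need a much wider set.
PII_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "long_digit_run": re.compile(r"\b\d{13,16}\b"),  # e.g. a card-like number
}


def scrub_record(record: dict) -> Tuple[dict, List[str]]:
    """Return a scrubbed copy of the record and a list of 'field:kind' detections."""
    scrubbed = dict(record)
    detections: List[str] = []
    for field, value in record.items():
        if not isinstance(value, str):
            continue  # only string fields are scanned in this sketch
        for kind, pattern in PII_PATTERNS.items():
            if pattern.search(value):
                scrubbed[field] = "[REDACTED]"          # replace the whole field value
                detections.append(f"{field}:{kind}")    # goes to the audit stream
                break
    return scrubbed, detections
```

The input record is left unmodified, so the caller decides what happens to the raw value; only the scrubbed copy should reach storage.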

14. Regulatory Considerations

| Regulation | Clause | Requirement | Telemetry Implementation |
| --- | --- | --- | --- |
| APRA CPS 230 | Para 53–57 (Operational Risk Management) | Critical AI systems require documented monitoring and incident management | SLO dashboards, incident detection alerts, audit trail per this pattern |
| APRA CPS 234 | Para 36–37 (Information Security Incident Response) | Security events must be detected, logged, and reported within defined timeframes | Security-relevant telemetry (injection attempts, PII leaks) alert within SLO |
| Privacy Act 1988 (AU) | APP 11 (Security of Personal Information) | PII in AI systems must be protected and access controlled | PII scrubber, access-controlled log store, data retention limits |
| EU AI Act | Article 12 (Record-keeping for high-risk AI) | High-risk AI systems must log inputs and outputs to enable post-hoc review | Structured log schema with requestId linkable to outcome; 7-year retention for high-risk |
| EU AI Act | Article 9 (Risk Management System) | Continuous monitoring of AI system performance | Drift monitoring (EAAPL-OBS005), quality metrics, SLO attainment |
| ISO/IEC 42001 | Clause 9.1 (Monitoring, measurement, analysis) | AI management system requires performance monitoring | This pattern implements the technical layer; the governance layer is covered in the Governance Considerations section of this pattern (EAAPL-OBS001) |
| NIST AI RMF | GOVERN 1.7, MANAGE 2.2 | AI risks must be tracked, monitored, and reported with defined metrics | Metric taxonomy, SLO table, incident integration per this pattern |

15. Reference Implementations

AWS

  • AI Client Wrapper: AWS SDK for Bedrock with custom interceptor; OpenTelemetry Java/Python/Node SDK
  • OTel Collector: AWS Distro for OpenTelemetry (ADOT) on ECS or EKS
  • Metrics: Amazon CloudWatch with custom namespace AI/Inference
  • Logs: Amazon CloudWatch Logs with structured JSON; export to S3 for cold storage
  • Traces: AWS X-Ray with OpenTelemetry SDK
  • Cost: AWS Cost and Usage Report + custom Lambda aggregation
  • Dashboards: Amazon CloudWatch Dashboards + Amazon Managed Grafana

Azure

  • AI Client Wrapper: Azure SDK for OpenAI with Application Insights telemetry initialiser
  • OTel Collector: Azure Monitor OpenTelemetry Distro on AKS
  • Metrics: Azure Monitor Metrics with custom dimensions
  • Logs: Azure Monitor Logs (Log Analytics Workspace)
  • Traces: Azure Application Insights distributed tracing
  • Cost: Azure Cost Management API + Power BI
  • Dashboards: Azure Managed Grafana or Azure Workbooks

GCP

  • AI Client Wrapper: Google Cloud AI Platform SDK with custom interceptor
  • OTel Collector: OpenTelemetry Collector on GKE with Cloud Operations exporter
  • Metrics: Google Cloud Monitoring with custom metrics
  • Logs: Google Cloud Logging with structured JSON
  • Traces: Google Cloud Trace
  • Cost: Google Cloud Billing API + BigQuery export
  • Dashboards: Google Cloud Monitoring Dashboards + Looker

On-Premises

  • AI Client Wrapper: OpenTelemetry SDK (language-native)
  • OTel Collector: OpenTelemetry Collector Contrib (self-hosted)
  • Metrics: Prometheus + Thanos for long-term storage
  • Logs: OpenSearch (Elasticsearch alternative) or Loki
  • Traces: Jaeger or Grafana Tempo
  • Cost: Custom aggregation service querying AI API billing endpoints
  • Dashboards: Grafana (open source)

16. Related Patterns

| Pattern ID | Pattern Name | Relationship | Notes |
| --- | --- | --- | --- |
| EAAPL-OBS002 | Prompt Monitoring | Extends | Uses structured log schema from this pattern; adds prompt-specific anomaly detection |
| EAAPL-OBS003 | Hallucination Detection | Depends On | Requires trace and log data from this pattern to link detections to specific requests |
| EAAPL-OBS004 | AI Incident Management | Depends On | Alert rules and runbooks reference metrics and SLOs defined here |
| EAAPL-OBS005 | Model Drift Detection | Depends On | Input/output distribution metrics sourced from this telemetry layer |
| EAAPL-OBS006 | AI Cost Observability | Extends | Builds cost attribution and FinOps layer on cost metrics defined here |
| EAAPL-OBS007 | Distributed AI Tracing | Extends | Detailed trace architecture; trace collection infrastructure shared with this pattern |
| EAAPL-OBS008 | AI Performance Benchmarking | Depends On | Golden dataset regression uses telemetry data for baseline comparison |

17. Maturity Assessment

Overall Maturity: Proven

| Dimension | Score (1–5) | Rationale |
| --- | --- | --- |
| Adoption Breadth | 4 | Widely adopted at organisations with mature platform engineering practices |
| Tooling Ecosystem | 5 | OpenTelemetry, Prometheus, Grafana are mature; GenAI semantic conventions available since 2024 (confirm their current stability status before pinning attribute names) |
| Operational Runbook Coverage | 4 | Standard runbooks exist; AI-specific runbooks still organisation-specific |
| Regulatory Evidence | 4 | Used by APRA-regulated entities; audit findings confirm pattern adequacy |
| Cost Predictability | 3 | Cardinality-driven cost surprises remain common; requires active management |
| Team Skill Availability | 4 | OpenTelemetry skills broadly available; AI-specific extensions require training |

18. Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0.0 | 2026-06-12 | EAAPL Working Group | Initial publication |