EAAPL (Enterprise AI Architecture Pattern Library)
EAAPL-OBS004 · AI Incident Management

📊 Observability & Monitoring · 🏭 Field-tested in AU

Pattern ID: EAAPL-OBS004
Status: Proven
Complexity: Medium
Tags: observability, alerting, slo, apra-cps230, medium-complexity
Version: 1.0.0
Last Reviewed: 2026-06-12


1. Executive Summary

AI system failures are qualitatively different from traditional software failures. An AI system can be operationally available (HTTP 200) while actively delivering harmful, inaccurate, or biased outputs. Traditional incident management frameworks — built around availability and error rate — are blind to quality, safety, and compliance failures that represent the greatest risk in AI deployments.

This pattern defines the operational incident management lifecycle for AI systems, covering detection, triage, escalation, response, and post-incident review. It establishes a six-category AI incident taxonomy (Availability, Quality, Security, Cost, Compliance, and Data) with severity classifications, MTTD and MTTR targets by severity, and automated detection rules for each incident type drawn from the telemetry architecture defined in EAAPL-OBS001. It specifies integration with PagerDuty, OpsGenie, and ServiceNow; AI-specific runbook templates; post-incident review processes; and APRA CPS 230 incident notification obligations that apply when AI systems support critical operations. The pattern is designed to provide evidence to regulators and auditors that AI incidents are systematically detected, managed, and learned from.

Target Audience: CIO, CTO, Head of Platform Engineering, Chief Risk Officer
Time to Implement: 4–8 weeks


2. Problem Statement

Business Problem

When AI systems deliver harmful outputs, organisations face a compounding problem: they often don't know an incident occurred until a user complains; they cannot determine scope (how many users were affected); they cannot reconstruct what happened; and they have no playbook for response. APRA CPS 230 requires financial institutions to detect and manage operational disruptions — but most AI incident policies do not account for quality and compliance failures that technically aren't "outages."

Technical Problem

Existing incident management tooling is wired to infrastructure and HTTP metrics: error rate, latency, availability. AI-specific failure modes — hallucination spike, accuracy regression, prompt injection attack, PII leak in output, cost budget breach, vector DB corruption — produce none of these signals. They appear as business metric degradation (lower NPS, higher escalation rate, customer churn) weeks after the incident began.

Symptoms

  • AI incidents discovered through customer complaints or NPS drop, not monitoring alerts
  • No classification system for AI incidents; all AI issues are handled as ad-hoc engineering tasks
  • Post-incident review template asks "what was the error rate?" rather than "what was the hallucination rate?"
  • Regulatory notification assessment for AI incidents is ad-hoc; no documented criteria
  • Mean time to resolve AI quality incidents is measured in days, not hours
  • No ownership for AI cost incidents — budget overruns attributed to "AI costs increased" with no root cause

Cost of Inaction

  • APRA CPS 230 paragraph 53 requires financial institutions to detect and respond to operational disruptions; AI failures that affect service delivery qualify
  • Each undetected AI quality incident erodes user trust; trust erosion has a compounding effect on retention
  • Risk of regulatory enforcement action when AI incidents occur without a documented response procedure
  • Budget overruns from undetected AI cost incidents: a cost incident left undetected for 72 hours typically costs 3–5x what it would have cost to alert on and correct promptly

3. Context

When to Apply

  • Any production AI system in a regulated industry (APRA, EU AI Act, Privacy Act)
  • AI systems in organisations with existing ITIL or incident management frameworks that need AI-specific extensions
  • AI systems processing > 1,000 requests/day where manual monitoring is not scalable
  • Prerequisite: EAAPL-OBS001 telemetry must provide the metric and log stream for automated detection

When NOT to Apply

  • Internal proof-of-concept systems with < 30-day lifespan and no regulatory exposure
  • AI systems where the parent application already has comprehensive incident management covering AI-specific failure modes

Prerequisites

Prerequisite | Required | Notes
EAAPL-OBS001 AI Telemetry Infrastructure | Required | Alert rules depend on metrics and logs from this pattern
Incident management platform (PagerDuty / OpsGenie / ServiceNow) | Required | Alert routing and on-call schedule management
Defined on-call schedule for AI platform | Required | Someone must receive P1 alerts at 3am
APRA CPS 230 Business Services mapping | Conditional | Required if AI supports APRA-critical business services

Industry Applicability

Industry | Applicability | Primary Driver
Financial Services | Critical | APRA CPS 230 notification obligations, regulatory audit
Healthcare | Critical | Clinical AI failures have patient safety consequences
Government | High | Public service delivery obligations; ministerial reporting
Legal Services | High | Professional liability incidents require documented response
Retail / E-Commerce | Medium | Cost and quality incidents affect revenue
Technology / SaaS | High | SLA obligations; multi-tenant blast radius

4. Architecture Overview

The AI Incident Management Architecture is a layered system operating on top of the telemetry infrastructure established by EAAPL-OBS001. It introduces AI-specific detection logic, a structured taxonomy and severity framework, integration with existing incident management platforms, AI-specific runbooks, and a specialised post-incident review process.

AI Incident Taxonomy

Six incident categories are defined:

  • Availability: model API outages, vector database unavailability, and inference service failures. Traditional monitoring covers these well; the AI-specific extension is tracking which downstream AI features are degraded during third-party model API outages, since these are external dependencies outside the organisation's control.
  • Quality: hallucination rate spikes, accuracy regressions, output safety filter bypass, and significant latency degradation affecting user experience. Quality incidents are the most novel category and require the AI-specific telemetry from EAAPL-OBS001 and EAAPL-OBS003 to detect.
  • Security: detected prompt injection attacks, PII leaked in model output, jailbreak attempts, and unusual API key usage patterns.
  • Cost: budget threshold breaches, per-request cost spikes, and unexpected token usage patterns.
  • Compliance: regulatory threshold breaches (e.g., AI-influenced decision rate exceeding regulatory limits), audit trail failures, and human oversight bypass.
  • Data: vector database corruption, training data tampering, and retrieval index degradation.

Severity Classification

  • P0 (Critical): AI system causing immediate user harm or regulatory breach. Examples: PII data appearing in AI outputs at scale; AI clinical decision support providing systematically wrong guidance; security breach via prompt injection. Response time: page on-call immediately; incident commander assigned within 10 minutes.
  • P1 (High): AI system significantly degraded; SLO breach sustained; material quality regression. Examples: hallucination rate > 3x baseline; model API down > 15 minutes; cost budget 100% exceeded. Response time: page on-call; response within 15 minutes.
  • P2 (Medium): AI system partially degraded; SLO at risk; quality regression not yet material. Examples: latency p99 > 2x baseline; hallucination rate elevated but below P1 threshold; partial feature degradation. Response time: alert to on-call channel; response within 1 hour.
  • P3 (Low): Minor degradation or early warning signal. Examples: single model error; cost at 80% of budget; drift warning. Response time: create ticket; address by the next working day.
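The severity examples above can be read as an ordered set of checks, most severe first. The following is a minimal sketch of that idea, not the pattern's actual classifier: the `Signal` fields and the `classify` function are hypothetical names, "budget 100% exceeded" is assumed to mean spend above 100% of budget, and "elevated" hallucination rate is assumed to mean more than 2x baseline.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    P0 = 0  # Critical
    P1 = 1  # High
    P2 = 2  # Medium
    P3 = 3  # Low


@dataclass
class Signal:
    """Hypothetical summary of the telemetry behind one alert."""
    pii_in_output: bool = False
    injection_breach: bool = False
    hallucination_ratio: float = 1.0  # current rate / baseline rate
    api_down_minutes: float = 0.0
    budget_used_pct: float = 0.0
    latency_p99_ratio: float = 1.0    # current p99 / baseline p99


def classify(s: Signal) -> Severity:
    # P0: immediate user harm or regulatory breach.
    if s.pii_in_output or s.injection_breach:
        return Severity.P0
    # P1: hallucination > 3x baseline, API down > 15 minutes, budget exceeded.
    if s.hallucination_ratio > 3.0 or s.api_down_minutes > 15 or s.budget_used_pct > 100:
        return Severity.P1
    # P2: latency p99 > 2x baseline, or hallucination elevated but below P1.
    # The 2x "elevated" cut-off is an assumption, not stated in the text.
    if s.latency_p99_ratio > 2.0 or s.hallucination_ratio > 2.0:
        return Severity.P2
    # P3: anything else that raised an alert (e.g. cost at 80% of budget).
    return Severity.P3
```

Checking the most severe conditions first means a single alert carrying several signals always lands at its highest applicable severity.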

Automated Detection Rules

Detection rules are defined as alert conditions on the metrics and logs from EAAPL-OBS001. Each incident type has a specific detection rule:

  • Availability: HTTP error rate > 5% for 5 minutes on model API endpoint.
  • Quality: hallucination rate (from EAAPL-OBS003) > 2x 7-day baseline for 30 minutes.
  • Security: injection attempt count > 10 per minute (from EAAPL-OBS002).
  • Cost: hourly spend > 150% of rolling 7-day hourly average.
  • Compliance: PII detection event count > 0 in output stream (zero tolerance).
  • Data: vector retrieval success rate < 95% for 15 minutes.
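As an illustration, the six rules can be expressed as data plus one evaluation loop. This is a sketch under assumptions: the metric names (`http_error_rate`, `hallucination_baseline_7d`, and so on) are hypothetical, telemetry is assumed to arrive as one metrics dictionary per minute, and rules without a stated duration are assumed to fire on a single breaching sample.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    category: str
    breached: Callable[[Metrics], bool]
    sustain_minutes: int  # how long the condition must hold before firing


# Thresholds come from the rule list above; metric names are hypothetical.
RULES = [
    Rule("Availability", lambda m: m["http_error_rate"] > 0.05, 5),
    Rule("Quality", lambda m: m["hallucination_rate"] > 2 * m["hallucination_baseline_7d"], 30),
    Rule("Security", lambda m: m["injection_attempts_per_min"] > 10, 1),
    Rule("Cost", lambda m: m["hourly_spend"] > 1.5 * m["hourly_spend_avg_7d"], 1),
    Rule("Compliance", lambda m: m["pii_output_events"] > 0, 1),
    Rule("Data", lambda m: m["vector_retrieval_success_rate"] < 0.95, 15),
]


def evaluate(samples: List[Metrics]) -> List[str]:
    """Return categories whose rule held for its whole sustain window.

    `samples` holds one metrics dict per minute, oldest first.
    """
    fired = []
    for rule in RULES:
        window = samples[-rule.sustain_minutes:]
        if len(window) == rule.sustain_minutes and all(rule.breached(m) for m in window):
            fired.append(rule.category)
    return fired
```

In practice these conditions would more likely live in the alert engine's own rule language (for example Prometheus alerting rules); the sketch only shows the threshold-plus-duration shape each rule shares.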

Escalation Architecture

PagerDuty/OpsGenie is configured with AI-specific services and escalation policies. P0 and P1 alerts page the AI platform on-call engineer and notify the engineering manager and relevant product owner. Compliance and security incidents additionally notify the CISO, privacy officer, and compliance team. Cost incidents notify the FinOps team and department head in addition to engineering on-call. ServiceNow integration creates an incident ticket for every P1+ alert with AI-specific fields: incident_category (from taxonomy), model_id, affected_use_case, estimated_user_impact, regulatory_notification_required.
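The routing described above could be sketched as two small functions. This is illustrative only: the target names such as `ai-platform-on-call` are hypothetical, and it assumes the compliance, security, and cost notifications apply at every severity, which the text does not state explicitly. The ticket field names are the ones listed above; `regulatory_notification_required` is left as a pending value because that decision is made by a person.

```python
from typing import Dict, Optional, Set


def notification_targets(category: str, severity: str) -> Set[str]:
    """Who is notified for an incident of this category and severity."""
    targets: Set[str] = set()
    if severity in ("P0", "P1"):
        targets |= {"ai-platform-on-call", "engineering-manager", "product-owner"}
    if category in ("Compliance", "Security"):
        targets |= {"ciso", "privacy-officer", "compliance-team"}
    if category == "Cost":
        targets |= {"finops", "department-head", "ai-platform-on-call"}
    return targets


def build_ticket(alert: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Build the AI-specific ticket fields for a P1+ alert, else None."""
    if alert["severity"] not in ("P0", "P1"):
        return None
    return {
        "incident_category": alert["category"],
        "model_id": alert["model_id"],
        "affected_use_case": alert["use_case"],
        "estimated_user_impact": alert.get("estimated_user_impact", "unknown"),
        # Never decided automatically; a compliance officer fills this in.
        "regulatory_notification_required": "pending-assessment",
    }
```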

AI-Specific Runbook Templates

Generic runbooks that ask engineers to check CPU and memory are insufficient for AI incidents. AI runbooks include: current model version and recent changes; recent prompt template deployments; vector database status; token usage and cost trend; hallucination rate and quality metrics; third-party model provider status page check; and regulatory notification assessment decision tree.


5. Architecture Diagram

flowchart TD
    subgraph Detection["Detection Layer"]
        A[AI Alert Engine]
        B[Telemetry Signals]
    end
    subgraph Triage["Triage and Response"]
        C{Severity Classifier}
        D[Incident Commander]
        E[AI Runbook]
    end
    subgraph Resolution["Mitigation and Close"]
        F{Mitigation Type}
        G[Rollback or Scale]
        H[Regulatory Notification]
    end
    B --> A
    A --> C
    C -->|P0/P1| D
    C -->|P2/P3| I[Channel Alert or Ticket]
    D --> E
    E --> F
    F -->|model/infra| G
    F -->|compliance/security| H
    G --> J[Verify Recovery]
    J --> K[Post-Incident Review]
    style B fill:#dbeafe,stroke:#3b82f6
    style A fill:#f0fdf4,stroke:#22c55e
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#fef9c3,stroke:#eab308
    style F fill:#f3e8ff,stroke:#a855f7
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#fee2e2,stroke:#ef4444
    style I fill:#dbeafe,stroke:#3b82f6
    style J fill:#f0fdf4,stroke:#22c55e
    style K fill:#d1fae5,stroke:#10b981

6. Components

Component | Type | Responsibility | Technology Options | Criticality
AI Alert Engine | Service | Evaluate AI-specific alert rules against telemetry; create alert events | Prometheus Alertmanager, Datadog Monitors, CloudWatch Alarms | Critical
Severity Classifier | Logic | Apply taxonomy + threshold rules to classify incident severity | Rules engine embedded in alert manager; custom Lambda/Cloud Function | Critical
Incident Management Platform | SaaS / On-Prem | On-call routing, escalation policies, incident timeline | PagerDuty, OpsGenie, ServiceNow | Critical
AI Runbook Library | Documentation + Automation | AI-specific diagnostic and mitigation procedures; automated diagnostic scripts | Confluence / Notion + runbook automation (PagerDuty Runbook Automation, Rundeck) | High
Regulatory Notification Workflow | Process + Tool | Decision tree for APRA/Privacy Act notification; draft notifications | ServiceNow GRC; custom decision tree in runbook | Critical
Post-Incident Review Template | Process | AI-specific PIR structure covering model/prompt/data/infra dimensions | Confluence template; Google Docs template | High
Incident Dashboard | UI | Real-time incident status; active incident list; MTTD/MTTR metrics | PagerDuty status page; Grafana incident dashboard | Medium
Communication Templates | Documentation | Stakeholder and customer communications for AI incidents | Pre-approved templates by incident type; legal-reviewed | High
Learning Actions Tracker | Workflow | Track PIR action items to closure; prevent repeat incidents | JIRA; ServiceNow; GitHub Issues | Medium

7. Data Flow

Primary Flow

Step | Actor | Action | Output
1 | AI Alert Engine | Evaluates alert rules against incoming telemetry stream | Alert event with: incident_type, severity, affected_model, metric_value, threshold
2 | Severity Classifier | Applies severity classification rules; determines P0–P3 | Classified alert with severity, incident_category, estimated_impact
3 | Incident Management Platform | Routes alert to on-call via PagerDuty/OpsGenie; creates ServiceNow ticket | Paged engineer; incident ticket created
4 | Incident Commander | Acknowledges incident; activates AI-specific runbook; assigns responders | Incident timeline started; runbook checklist initiated
5 | Responder Team | Diagnoses root cause using runbook: check model version, prompt changes, data health | Root cause identified (model / prompt / data / infrastructure)
6 | Regulatory Assessment | Uses decision tree to determine APRA CPS 230 / Privacy Act notification requirement | Notification decision + documented rationale
7 | Mitigation | Implements appropriate mitigation: rollback, scale, block, escalate to vendor | Mitigation action logged in incident timeline
8 | Verification | Monitors recovery metrics; confirms metrics returning to normal | Recovery confirmed; incident resolved
9 | Post-Incident Review | Within 5 business days for P1+; AI-specific PIR template completed | PIR document; learning actions in JIRA

Error Flow

Error Scenario | Detection | Action | Recovery
Alert engine unavailable | Health check on alert engine; missed alert test (synthetic canary metric) | Escalate manually; page on-call via backup channel (email/SMS) | Restore alert engine; verify synthetic canary passes
False positive P1 alert (metric spike from deployment) | Incident commander review during triage identifies correlated deployment | Downgrade severity; resolve; adjust alert threshold | Review and tune alert sensitivity
Regulatory notification deadline missed | Compliance calendar alert; APRA CPS 230 72-hour window | Notify APRA with explanation of delay; internal escalation to CCO | Enforce regulatory assessment within 2 hours of P0/P1 incident declaration
Incident manager unresponsive | Secondary on-call escalation after 10 minutes | Page secondary; notify engineering manager | Review on-call schedule; ensure 24/7 coverage
Mitigation makes incident worse (bad rollback) | Metrics worsen after mitigation | Re-declare incident at previous or higher severity | Roll forward to last known good; escalate to vendor
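The time limits mentioned in the two flow tables (regulatory assessment within 2 hours of a P0/P1 declaration, the 72-hour APRA window, and a post-incident review within 5 business days) can be turned into calendar deadlines when an incident is declared. The sketch below is one possible way to do that; the function names are hypothetical, it assumes every clock starts at the declaration time, and it treats Monday to Friday as business days without accounting for public holidays.

```python
from datetime import datetime, timedelta
from typing import Dict


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by whole business days (Mon-Fri); public holidays are ignored."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def incident_deadlines(declared_at: datetime, severity: str) -> Dict[str, datetime]:
    """Deadlines for a P0/P1 incident; lower severities carry none here."""
    if severity not in ("P0", "P1"):
        return {}
    return {
        # Assumption: all clocks start when the incident is declared.
        "regulatory_assessment_due": declared_at + timedelta(hours=2),
        "apra_notification_latest": declared_at + timedelta(hours=72),
        "pir_due": add_business_days(declared_at, 5),
    }
```

Feeding these timestamps into the compliance calendar gives the "deadline missed" detection in the error flow something concrete to alert on.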

8. Security Considerations

Authentication: Incident management platform access restricted to authorised engineering, security, and compliance staff via SSO. PagerDuty/OpsGenie API tokens stored in secrets manager. Alert engine webhooks use HMAC signatures for authentication.
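A minimal sketch of HMAC webhook verification using only the Python standard library is shown here. It assumes HMAC-SHA256 over the raw request body with a hex-encoded signature; the actual header name, encoding, and algorithm depend on the sending platform and are not specified by this pattern.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

The receiver should verify against the exact bytes received, before any JSON parsing, since re-serialising the payload can change the byte sequence and invalidate the signature.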

Authorisation: Compliance and security incidents have restricted visibility. PII incident details are restricted to the privacy officer, CISO, legal, and specific engineering responders. Incident timelines are not publicly accessible.

Secrets Management: Any credentials used in automated remediation runbooks (e.g., model API keys for rollback) are stored in a secrets manager, with an audit trail for break-glass access.

Data Classification: Incident records containing AI output examples or prompt content are classified as Confidential. Incident records for security incidents are classified as Restricted. All incident records are retained for 7 years.

Encryption: Incident management platform data encrypted at rest and in transit. On-call contact information protected at Restricted level.

Auditability: Every incident action (acknowledgement, assignment, escalation, resolution) is timestamped and immutable. Regulatory notification decisions have documented rationale and are retained permanently.

OWASP LLM Top 10 Coverage

OWASP LLM Risk | Incident Management Control | Implementation
LLM01 Prompt Injection | Security incident category; automated detection and P1+ classification | Injection incidents have dedicated runbook; SOC notification
LLM02 Insecure Output Handling | Output safety incidents trigger quality incident; P0 if PII in output | PII-in-output is zero-tolerance P0 incident
LLM03 Training Data Poisoning | Data incident category; detected via accuracy regression | Data incident runbook includes training data integrity check
LLM04 Model Denial of Service | Availability incident; token abuse triggers cost incident | Availability + cost incidents have auto-scaling mitigation runbook
LLM05 Supply Chain Vulnerabilities | Model version change triggers an incident review; unexpected version = P1 | Model version check in every quality/availability runbook
LLM06 Sensitive Information Disclosure | Compliance incident P0 on PII in output; Privacy Act notification assessment | Immediate Privacy Act notification decision tree
LLM07 Insecure Plugin Design | Tool call anomalies feed security incident stream | Tool abuse detection escalates to security incident
LLM08 Excessive Agency | Agentic AI runaway actions trigger security/availability incident | Agent action scope violation = P1 security incident
LLM09 Overreliance | Accuracy regression = quality incident; threshold breach triggers review | Quality incidents include assessment of downstream over-reliance harm
LLM10 Model Theft | API key abuse patterns trigger security incident | Unusual API key usage = P1 security incident with key rotation runbook

9. Governance Considerations

Responsible AI: The incident management process is a primary governance control. Every AI incident is documented, root-caused, and acted upon. The PIR process includes explicit assessment of whether the incident represents a systematic AI risk requiring model, prompt, or policy change.

Model Risk Management: P1+ quality incidents that affect material AI models are automatically flagged for model risk management review. The PIR feeds into the model risk register.

Human Approval: Escalation to executive or regulatory notification requires human decision — no automated regulatory notification. The decision tree provides the framework; the compliance officer makes the call.

Policy: AI incident management policy must define: incident category definitions, severity thresholds, response time SLOs, on-call responsibilities, PIR obligations, and regulatory notification criteria. Policy reviewed annually and after every P0 incident.

Traceability: Every incident is traceable from the alert (metric/log event) through triage, mitigation, and learning actions to the specific change that prevented recurrence. This chain is the evidence base for regulatory audit.

Governance Artefacts

Artefact | Owner | Frequency | Format
AI Incident Register | AI Platform / Risk | Continuous | ServiceNow CMDB; monthly report
Post-Incident Review Documents | Incident Commander | Within 5 business days of P1+ | Structured document; linked to incident ticket
APRA CPS 230 Notification Log | Compliance Officer | Per notification obligation | Formal document; APRA portal submission
MTTD/MTTR Trend Report | Platform Engineering | Monthly | Dashboard + executive summary
On-Call Runbook Maintenance Log | Platform Engineering | Quarterly | Runbook review and update record
Incident Learning Actions Tracker | Engineering Leads | Per PIR | JIRA board; closure tracking

10. Operational Considerations

Monitoring: The incident management system itself is monitored. Alert engine availability, alert delivery time, on-call response time, and notification pipeline reliability are all tracked. A synthetic canary metric fires every 5 minutes; if it doesn't trigger the expected alert within 2 minutes, a meta-alert fires.
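The canary check described above amounts to a deadline comparison. The following sketch assumes a watchdog process that knows when each canary fired and when (if ever) the corresponding alert was observed; `meta_alert_needed` is a hypothetical name, and the 2-minute limit is the one stated in the text.

```python
from datetime import datetime, timedelta
from typing import Optional

ALERT_DEADLINE = timedelta(minutes=2)  # canary must produce an alert within this


def meta_alert_needed(
    canary_fired_at: datetime,
    alert_seen_at: Optional[datetime],
    now: datetime,
) -> bool:
    """True when the canary's alert is missing or arrived after the deadline."""
    deadline = canary_fired_at + ALERT_DEADLINE
    if alert_seen_at is not None and alert_seen_at <= deadline:
        return False  # pipeline delivered the alert in time
    # Either still waiting (before the deadline) or the alert is late/missing.
    return now > deadline
```

Running the watchdog on infrastructure separate from the alert engine keeps a single outage from silencing both the alerts and the check on them.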

Logging: All incident management events are logged to an immutable audit store. PagerDuty/OpsGenie export incident timelines to the log store daily.

Incident Response: On-call engineers receive AI-specific training quarterly. Tabletop exercises simulate P0 scenarios (PII leak at scale, model API outage during peak, prompt injection attack) to validate runbooks before real incidents occur.

Disaster Recovery: Incident management platform (PagerDuty/OpsGenie) is a third-party SaaS with > 99.9% availability. Backup notification is via email and SMS directly to on-call. Alert rules are version-controlled and can be re-applied to a new instance.

Capacity Planning: On-call staffing must be planned for AI systems that expand scope. Adding a new high-risk AI feature may add 2–3 additional incident types requiring runbook development and on-call training.

SLO Table (MTTD and MTTR Targets)

Severity | MTTD Target | MTTR Target | Alert Delivery SLO
P0 (Critical) | < 5 minutes | < 2 hours | < 2 minutes from detection to page
P1 (High) | < 15 minutes | < 8 hours | < 5 minutes from detection to alert
P2 (Medium) | < 1 hour | < 24 hours | < 15 minutes from detection to channel alert
P3 (Low) | < 4 hours | < 5 business days | < 1 hour from detection to ticket
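To report against these targets, MTTD and MTTR can be computed per severity from incident timestamps. This is a sketch under assumptions: the `Incident` record is hypothetical, MTTD is measured from the estimated start of the failure to detection, MTTR from detection to resolution, and P3 is omitted because its MTTR target is expressed in business days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List

# (MTTD target, MTTR target) per severity, from the SLO table above.
TARGETS = {
    "P0": (timedelta(minutes=5), timedelta(hours=2)),
    "P1": (timedelta(minutes=15), timedelta(hours=8)),
    "P2": (timedelta(hours=1), timedelta(hours=24)),
}


@dataclass
class Incident:
    severity: str
    started_at: datetime   # estimated start of the failure
    detected_at: datetime
    resolved_at: datetime


def slo_report(incidents: List[Incident]) -> Dict[str, dict]:
    report = {}
    for sev, (mttd_target, mttr_target) in TARGETS.items():
        group = [i for i in incidents if i.severity == sev]
        if not group:
            continue
        mttd = timedelta(seconds=mean((i.detected_at - i.started_at).total_seconds() for i in group))
        mttr = timedelta(seconds=mean((i.resolved_at - i.detected_at).total_seconds() for i in group))
        report[sev] = {
            "mttd": mttd,
            "mttr": mttr,
            "mttd_met": mttd < mttd_target,
            "mttr_met": mttr < mttr_target,
        }
    return report
```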

Disaster Recovery Table

Component | RTO | RPO | Recovery Approach
PagerDuty / OpsGenie | < 5 minutes (SaaS HA) | N/A | Vendor HA; email/SMS backup
AI Alert Engine | 10 minutes | N/A (stateless rules) | Auto-restart; rules version-controlled
Incident Ticket Store (ServiceNow) | 30 minutes | 1 hour | ServiceNow HA; regular backup
Post-Incident Review Documents | 24 hours | 24 hours | Confluence / Google Drive; cloud backup

11. Cost Considerations

Cost Drivers

Driver | Description | Relative Cost
Incident Management Platform (PagerDuty/OpsGenie) | Per-user SaaS subscription | Medium
On-call engineering time | Incident response, PIR, learning actions | High (human labour)
Alert engine compute | Rule evaluation on telemetry stream | Low
ServiceNow integration | Enterprise ITSM licensing | High at enterprise scale
False positive alert investigation | Engineer time investigating non-incidents | Medium if uncontrolled

Scaling Risks: Alert fatigue is the primary risk. If too many P3 alerts fire for minor fluctuations, engineers begin ignoring alerts. Maintain false positive rate < 10% at each severity level — review alert thresholds monthly.

Optimisations:

  • Consolidate low-severity alerts into a daily digest instead of individual notifications
  • Auto-resolve P3 tickets if metrics self-recover within 30 minutes without intervention
  • Use composite alert rules (multiple conditions AND) for P1 to reduce false positives
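Two of the optimisations above, composite rules and P3 auto-resolution, can be sketched in a few lines. The example is illustrative: `all_of`, `p1_quality`, and `should_auto_resolve` are hypothetical names, and the second condition in the composite rule (a minimum number of sampled requests) is an assumed example of a corroborating signal, not one the pattern prescribes.

```python
from datetime import timedelta
from typing import Callable, Dict

Condition = Callable[[Dict[str, float]], bool]


def all_of(*conditions: Condition) -> Condition:
    """Composite rule: fires only when every condition holds (logical AND)."""
    return lambda metrics: all(c(metrics) for c in conditions)


# Hypothetical P1 quality rule: a large hallucination spike AND enough
# sampled requests for the spike to be statistically meaningful.
p1_quality = all_of(
    lambda m: m["hallucination_ratio"] > 3.0,
    lambda m: m["sampled_requests"] >= 200,
)


def should_auto_resolve(severity: str, recovered_after: timedelta, human_intervened: bool) -> bool:
    """Auto-resolve only P3 tickets that self-recovered within 30 minutes."""
    return (
        severity == "P3"
        and not human_intervened
        and recovered_after <= timedelta(minutes=30)
    )
```

Requiring a second, independent condition trades a little detection latency for fewer pages on noisy single-metric spikes.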

Indicative Cost Range

Scale | AI Incidents/Month | Estimated Incident Management Cost/Month
Small | 5–20 | $2,000–$5,000 (mostly on-call time)
Medium | 20–100 | $8,000–$20,000
Large | 100–500 | $25,000–$60,000
Enterprise | 500+ | $50,000–$150,000 (dedicated AI SRE team)

12. Trade-Off Analysis

Approach Comparison

Approach | Pros | Cons | Best For
AI-specific incident taxonomy + dedicated runbooks | Precise, actionable; regulator-defensible; enables metrics | Implementation overhead; requires AI-specific on-call training | Regulated industries; mature AI deployments; teams with dedicated platform engineering
Generic ITIL incident management (no AI extension) | Leverages existing tooling and processes; no incremental training | Cannot detect quality/compliance/cost incidents; insufficient for AI risk | Low-risk AI features as a temporary measure only
Vendor-managed AI monitoring (Datadog AI, Arize AI) | Faster time to value; managed alerting; some AI-specific detection built-in | Vendor lock-in; limited taxonomy customisation; regulatory defensibility weaker | Organisations lacking platform engineering capacity

Architectural Tensions

Tension | Description | Resolution
Alert sensitivity vs. Alert fatigue | Sensitive alerts catch real incidents early but generate false positives; engineers tune them out | Monthly alert tuning reviews; target false positive rate < 10% per severity
Speed vs. Completeness | Fast incident response means mitigating before root cause is known; may make things worse | Define "stabilise then diagnose" protocol: mitigate blast radius first, then root cause
Automation vs. Human judgment | Automated mitigation (rollback) is faster but may be wrong | Automated mitigation only for lower severities (P2 and P3); P0 requires human decision; all mitigations logged
Transparency vs. Liability | Detailed PIRs are good governance but create documentation that may be discoverable | Legal review of PIR template; privilege consideration for legally significant incidents

13. Failure Modes

Failure | Likelihood | Impact | Detection | Recovery
Alert rule misconfigured; incident not detected | Medium | Critical (missed P0/P1) | Synthetic canary test; periodic alert rule audit | Monthly alert rule validation with synthetic test events
On-call engineer lacks AI expertise to diagnose | Medium | High (MTTR extended) | On-call response time SLO breach | AI-specific on-call training; runbook automation for common diagnoses
Regulatory notification missed within APRA window | Low | Critical (regulatory sanction) | Compliance calendar alert; escalation trigger | Immediate APRA contact with explanation; internal RCA
PIR action items not completed | High | Medium (repeat incidents) | Action item ageing report; 30-day escalation | Weekly review of open PIR actions; escalation to engineering lead
Alert storm masks P0 amid P3 noise | Medium | High (P0 buried) | P0 count anomaly; human escalation from field | P0 alerting on separate, high-priority channel not shared with P3 noise

Cascading Scenarios

  • Scenario 1: AI model API outage → availability incident declared → quality incidents (hallucinations on degraded fallback model) not declared separately → quality SLO breach undetected during incident → post-incident review reveals quality degradation affected 50K users. Mitigation: availability and quality incidents are tracked independently; quality monitoring continues during availability incidents.
  • Scenario 2: Cost incident P2 ignored as "not urgent" → daily budget exceeded by 300% → monthly budget depleted in week 1 → AI features disabled for remainder of month → revenue impact. Mitigation: Cost incidents have business escalation path; P2 cost incidents require FinOps notification within 1 hour.

14. Regulatory Considerations

Regulation | Clause | Requirement | AI Incident Management Implementation
APRA CPS 230 | Para 53 (Operational Risk Management) | Critical service disruptions must be identified, managed, and reported | AI availability and quality incidents affecting critical services mapped to CPS 230 notification
APRA CPS 230 | Para 57 (Incident Response) | Documented incident response procedures for material operational disruptions | This pattern provides AI-specific incident response procedures
APRA CPS 230 | Para 61 (Notification) | APRA notification within 24 hours for severe disruptions; 72 hours for material | Regulatory notification decision tree in every P0/P1 runbook
APRA CPS 234 | Para 36 (Cyber Incident Response) | Information security incidents detected and responded to within defined timeframes | Security incidents (injection, PII leak) have MTTD < 5 minutes, MTTR < 2 hours
Privacy Act 1988 (AU) | NDB Scheme (Part IIIC) | Eligible data breaches (including AI-driven) notified to OAIC and affected individuals | PII-in-output P0 incident triggers NDB assessment, to be completed within 30 days; notification as soon as practicable once an eligible data breach is confirmed
EU AI Act | Article 73 (Reporting of Serious Incidents) | Providers of high-risk AI must report serious incidents to national authorities | Serious AI incidents (physical/psychological harm) reported per Article 73; incident log maintained
ISO/IEC 42001 | Clause 10.2 (Nonconformity and Corrective Action) | Nonconformities must be corrected and root causes addressed | PIR process directly implements corrective action requirement
NIST AI RMF | MANAGE 3.2 | Documented procedures for AI incidents including recovery and learning | This pattern implements NIST MANAGE 3.2 in full

15. Reference Implementations

AWS

  • Alert Engine: CloudWatch Alarms + EventBridge rules; custom Lambda for AI-specific composite rules
  • Incident Platform: PagerDuty with AWS EventBridge integration; ServiceNow with AWS Service Management Connector
  • Runbook Automation: AWS Systems Manager Automation documents; PagerDuty Runbook Automation
  • Incident Log: AWS DynamoDB (incident events); Amazon S3 (PIR documents)
  • Regulatory Notification: AWS Step Functions for notification decision workflow
  • Dashboard: Amazon QuickSight MTTD/MTTR dashboard; CloudWatch operational dashboard

Azure

  • Alert Engine: Azure Monitor Alerts; Azure Logic Apps for composite alert rules
  • Incident Platform: PagerDuty with Azure Monitor integration; ServiceNow with Azure DevOps
  • Runbook Automation: Azure Automation Runbooks; ITSM Connector for ServiceNow
  • Incident Log: Azure Cosmos DB (events); Azure Blob Storage (documents)
  • Regulatory Notification: Power Automate workflow for notification decision tree
  • Dashboard: Azure Monitor Workbooks; Power BI MTTD/MTTR reports

GCP

  • Alert Engine: Cloud Monitoring Alerting; Cloud Functions for composite rules
  • Incident Platform: PagerDuty with Google Cloud integration; ServiceNow
  • Runbook Automation: Cloud Run jobs; PagerDuty Runbook Automation
  • Incident Log: Firestore (events); Cloud Storage (documents)
  • Regulatory Notification: Cloud Workflows for notification decision workflow
  • Dashboard: Looker; Cloud Monitoring dashboards

On-Premises

  • Alert Engine: Prometheus Alertmanager with custom AI alert rules; Grafana alerting
  • Incident Platform: OpsGenie (SaaS); Jira Service Management (self-hosted)
  • Runbook Automation: Rundeck; Ansible playbooks for common mitigations
  • Incident Log: PostgreSQL (events); Confluence (documents)
  • Regulatory Notification: Manual workflow with compliance team; tracked in GRC tool
  • Dashboard: Grafana operational dashboard

16. Related Patterns

Pattern ID | Pattern Name | Relationship | Notes
EAAPL-OBS001 | AI Telemetry Architecture | Foundation | All detection rules consume metrics and logs from this pattern
EAAPL-OBS002 | Prompt Monitoring | Feeds Into | Security incidents (injection, PII) detected by OBS002; routed here
EAAPL-OBS003 | Hallucination Detection | Feeds Into | Quality incidents triggered by hallucination rate alerts from OBS003
EAAPL-OBS005 | Model Drift Detection | Feeds Into | Drift alerts generate P2/P1 quality incidents in this framework
EAAPL-OBS006 | AI Cost Observability | Feeds Into | Cost incidents triggered by OBS006 budget alerts

17. Maturity Assessment

Overall Maturity: Proven

Dimension | Score (1–5) | Rationale
Adoption Breadth | 3 | AI-specific incident management adopted by regulated industries; general market still maturing
Tooling Ecosystem | 4 | PagerDuty/OpsGenie/ServiceNow mature; AI-specific alert rules and runbooks are custom
Operational Runbook Coverage | 3 | Generic runbooks well-established; AI-specific runbooks require custom development
Regulatory Evidence | 5 | APRA CPS 230 and Privacy Act NDB scheme are mature and well-understood obligations
Cost Predictability | 4 | Incident management platform costs are predictable; on-call labour costs are variable
Team Skill Availability | 4 | SRE/incident management skills broadly available; AI-specific extensions require training

18. Revision History

Version | Date | Author | Changes
1.0.0 | 2026-06-12 | EAAPL Working Group | Initial publication