EAAPL-OBS004 · AI Incident Management

Observability & Monitoring · Field-tested in AU

Pattern ID: EAAPL-OBS004
Status: Proven
Complexity: Medium
Tags: observability alerting slo apra-cps230 medium-complexity
Version: 1.0.0
Last Reviewed: 2026-06-12

1. Executive Summary

AI system failures are qualitatively different from traditional software failures. An AI system can be operationally available (HTTP 200) while actively delivering harmful, inaccurate, or biased outputs. Traditional incident management frameworks — built around availability and error rate — are blind to quality, safety, and compliance failures that represent the greatest risk in AI deployments.

This pattern defines the operational incident management lifecycle for AI systems, covering detection, triage, escalation, response, and post-incident review. It establishes a six-category AI incident taxonomy (Availability, Quality, Security, Cost, Compliance, and Data) with severity classifications, MTTD and MTTR targets by severity, and automated detection rules for each incident type drawn from the telemetry architecture defined in EAAPL-OBS001. It specifies integration with PagerDuty, OpsGenie, and ServiceNow; AI-specific runbook templates; post-incident review processes; and APRA CPS 230 incident notification obligations that apply when AI systems support critical operations. The pattern is designed to provide evidence to regulators and auditors that AI incidents are systematically detected, managed, and learned from.

Target Audience: CIO, CTO, Head of Platform Engineering, Chief Risk Officer
Time to Implement: 4–8 weeks

2. Problem Statement

Business Problem

When AI systems deliver harmful outputs, organisations face a compounding problem: they often don't know an incident occurred until a user complains; they cannot determine scope (how many users were affected); they cannot reconstruct what happened; and they have no playbook for response. APRA CPS 230 requires financial institutions to detect and manage operational disruptions — but most AI incident policies do not account for quality and compliance failures that technically aren't "outages."

Technical Problem

Existing incident management tooling is wired to infrastructure and HTTP metrics: error rate, latency, availability. AI-specific failure modes — hallucination spike, accuracy regression, prompt injection attack, PII leak in output, cost budget breach, vector DB corruption — produce none of these signals. They appear as business metric degradation (lower NPS, higher escalation rate, customer churn) weeks after the incident began.

Symptoms

AI incidents discovered through customer complaints or NPS drop, not monitoring alerts
No classification system for AI incidents; all AI issues are handled as ad-hoc engineering tasks
Post-incident review template asks "what was the error rate?" rather than "what was the hallucination rate?"
Regulatory notification assessment for AI incidents is ad-hoc; no documented criteria
Mean time to resolve AI quality incidents is measured in days, not hours
No ownership for AI cost incidents — budget overruns attributed to "AI costs increased" with no root cause

Cost of Inaction

APRA CPS 230 paragraph 53 requires financial institutions to detect and respond to operational disruptions; AI failures that affect service delivery qualify
Each undetected AI quality incident erodes user trust; trust erosion has a compounding effect on retention
Exposure to regulatory enforcement action when AI incidents occur with no documented response procedure
Budget overruns from undetected AI cost incidents: a cost incident left undetected for 72 hours typically costs 3–5x what it would have cost had it been alerted on and corrected promptly

3. Context

When to Apply

Any production AI system in a regulated industry (APRA, EU AI Act, Privacy Act)
AI systems in organisations with existing ITIL or incident management frameworks that need AI-specific extensions
AI systems processing > 1,000 requests/day where manual monitoring is not scalable
Prerequisite: EAAPL-OBS001 telemetry must provide the metric and log stream for automated detection

When NOT to Apply

Internal proof-of-concept systems with < 30-day lifespan and no regulatory exposure
AI systems where the parent application already has comprehensive incident management covering AI-specific failure modes

Prerequisites

Prerequisite	Required	Notes
EAAPL-OBS001 AI Telemetry Infrastructure	Required	Alert rules depend on metrics and logs from this pattern
Incident management platform (PagerDuty / OpsGenie / ServiceNow)	Required	Alert routing and on-call schedule management
Defined on-call schedule for AI platform	Required	Someone must receive P1 alerts at 3am
APRA CPS 230 Business Services mapping	Conditional	Required if AI supports APRA-critical business services

Industry Applicability

Industry	Applicability	Primary Driver
Financial Services	Critical	APRA CPS 230 notification obligations, regulatory audit
Healthcare	Critical	Clinical AI failures have patient safety consequence
Government	High	Public service delivery obligations; ministerial reporting
Legal Services	High	Professional liability incidents require documented response
Retail / E-Commerce	Medium	Cost and quality incidents affect revenue
Technology / SaaS	High	SLA obligations; multi-tenant blast radius

4. Architecture Overview

The AI Incident Management Architecture is a layered system operating on top of the telemetry infrastructure established by EAAPL-OBS001. It introduces AI-specific detection logic, a structured taxonomy and severity framework, integration with existing incident management platforms, AI-specific runbooks, and a specialised post-incident review process.

AI Incident Taxonomy

Six incident categories are defined.

Availability incidents cover model API outages, vector database unavailability, and inference service failures. Traditional monitoring covers these well; the AI-specific extension is tracking which downstream AI features are degraded during third-party model API outages, since these are external dependencies outside the organisation's control.
Quality incidents cover hallucination rate spikes, accuracy regressions, output safety filter bypass, and significant latency degradation affecting user experience. Quality incidents are the most novel category and require the AI-specific telemetry from EAAPL-OBS001 and EAAPL-OBS003 to detect.
Security incidents cover detected prompt injection attacks, PII leaked in model output, jailbreak attempts, and unusual API key usage patterns.
Cost incidents cover budget threshold breaches, per-request cost spikes, and unexpected token usage patterns.
Compliance incidents cover regulatory threshold breaches (e.g., AI-influenced decision rate exceeding regulatory limits), audit trail failures, and human oversight bypass.
Data incidents cover vector database corruption, training data tampering, and retrieval index degradation.

Severity Classification

P0 (Critical): AI system causing immediate user harm or regulatory breach. Examples: PII appearing in AI outputs at scale; AI clinical decision support providing systematically wrong guidance; security breach via prompt injection. Response: page on-call immediately; incident commander assigned within 10 minutes.
P1 (High): AI system significantly degraded; SLO breach sustained; material quality regression. Examples: hallucination rate > 3x baseline; model API down > 15 minutes; cost budget fully consumed (100%). Response: page on-call; response within 15 minutes.
P2 (Medium): AI system partially degraded; SLO at risk; quality regression not yet material. Examples: latency p99 > 2x baseline; hallucination rate elevated but below P1 threshold; partial feature degradation. Response: alert to on-call channel; response within 1 hour.
P3 (Low): Minor degradation or early warning signal. Examples: single model error; cost at 80% of budget; drift warning. Response: create ticket; address on the next working day.
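The severity examples above can be turned into a small rules function. The sketch below is illustrative only: the `Signals` fields, the `ELEVATED_HALLUCINATION_RATIO` cut-off, and the reading of the cost example as "100% of budget consumed" are assumptions, not part of the pattern.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    P0 = 0  # lower number = more severe
    P1 = 1
    P2 = 2
    P3 = 3


@dataclass(frozen=True)
class Signals:
    """Hypothetical snapshot of one AI system's current state."""
    pii_in_output: bool = False
    hallucination_ratio: float = 1.0    # current rate / baseline rate
    model_api_down_minutes: float = 0.0
    latency_p99_ratio: float = 1.0      # current p99 / baseline p99
    budget_used_fraction: float = 0.0   # 1.0 means 100% of budget
    drift_warning: bool = False


# Assumed cut-off for "elevated but below P1"; the pattern does not fix a number.
ELEVATED_HALLUCINATION_RATIO = 1.5


def classify(s: Signals) -> Optional[Severity]:
    """Return the most severe matching level, or None if nothing matches."""
    if s.pii_in_output:
        return Severity.P0
    if (s.hallucination_ratio > 3.0
            or s.model_api_down_minutes > 15
            or s.budget_used_fraction >= 1.0):
        return Severity.P1
    if (s.latency_p99_ratio > 2.0
            or s.hallucination_ratio > ELEVATED_HALLUCINATION_RATIO):
        return Severity.P2
    if s.budget_used_fraction >= 0.8 or s.drift_warning:
        return Severity.P3
    return None
```

Checking the most severe level first means a system that matches several examples at once is always classified at the highest applicable severity.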

Automated Detection Rules

Detection rules are defined as alert conditions on the metrics and logs from EAAPL-OBS001. Each incident type has a specific detection rule.

Availability: HTTP error rate > 5% for 5 minutes on model API endpoint.
Quality: hallucination rate (from EAAPL-OBS003) > 2x 7-day baseline for 30 minutes.
Security: injection attempt count > 10 per minute (from EAAPL-OBS002).
Cost: hourly spend > 150% of rolling 7-day hourly average.
Compliance: PII detection event count > 0 in output stream (zero tolerance).
Data: vector retrieval success rate < 95% for 15 minutes.
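One way to express these six rules is as data plus a "sustained for N samples" check. This is a minimal sketch, assuming one sample per minute (one per hour for spend); the metric names and the `Rule` structure are hypothetical, and a real deployment would express the same conditions in its alerting platform's own rule language.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Rule:
    category: str
    metric: str                               # hypothetical metric name
    breached: Callable[[float, float], bool]  # (value, baseline) -> bool
    sustained_samples: int                    # consecutive breaching samples required


# One sample per minute is assumed, except hourly_spend (one sample per hour).
RULES: List[Rule] = [
    Rule("availability", "http_error_rate", lambda v, b: v > 0.05, 5),
    Rule("quality", "hallucination_rate", lambda v, b: v > 2 * b, 30),
    Rule("security", "injection_attempts_per_min", lambda v, b: v > 10, 1),
    Rule("cost", "hourly_spend", lambda v, b: v > 1.5 * b, 1),
    Rule("compliance", "pii_output_events", lambda v, b: v > 0, 1),
    Rule("data", "vector_retrieval_success_rate", lambda v, b: v < 0.95, 15),
]


def fires(rule: Rule, samples: Sequence[float], baseline: float = 0.0) -> bool:
    """True when the most recent `sustained_samples` values all breach."""
    if len(samples) < rule.sustained_samples:
        return False
    recent = samples[-rule.sustained_samples:]
    return all(rule.breached(v, baseline) for v in recent)


def evaluate(samples: Dict[str, Sequence[float]],
             baselines: Dict[str, float]) -> List[str]:
    """Return the categories whose rule currently fires."""
    return [
        r.category for r in RULES
        if fires(r, samples.get(r.metric, []), baselines.get(r.metric, 0.0))
    ]
```

Requiring every sample in the window to breach keeps a single noisy data point from opening an incident, while the zero-tolerance compliance rule still fires on one event.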

Escalation Architecture

PagerDuty/OpsGenie is configured with AI-specific services and escalation policies. P0 and P1 alerts page the AI platform on-call engineer and notify the engineering manager and relevant product owner. Compliance and security incidents additionally notify the CISO, privacy officer, and compliance team. Cost incidents notify the FinOps team and department head in addition to engineering on-call. ServiceNow integration creates an incident ticket for every P1+ alert with AI-specific fields: incident_category (from taxonomy), model_id, affected_use_case, estimated_user_impact, regulatory_notification_required.
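The routing described above can be sketched as a lookup keyed on category and severity. The target names below are placeholders, the function names are hypothetical, and applying the category-specific notifications at every severity is an assumption; only the ticket field names come from the pattern.

```python
from typing import Dict, List, Optional

# Placeholder recipient identifiers, not real PagerDuty/OpsGenie objects.
BASE_P0_P1 = ["ai-platform-oncall", "engineering-manager", "product-owner"]
EXTRA_BY_CATEGORY: Dict[str, List[str]] = {
    "compliance": ["ciso", "privacy-officer", "compliance-team"],
    "security": ["ciso", "privacy-officer", "compliance-team"],
    "cost": ["finops-team", "department-head"],
}


def notify_targets(category: str, severity: str) -> List[str]:
    """Return who is notified for an incident of this category and severity."""
    targets: List[str] = []
    if severity in ("P0", "P1"):
        targets += BASE_P0_P1
    elif severity == "P2":
        targets.append("ai-oncall-channel")
    targets += EXTRA_BY_CATEGORY.get(category, [])
    return targets


def build_ticket(category: str, severity: str, model_id: str,
                 affected_use_case: str, estimated_user_impact: int,
                 regulatory_notification_required: bool) -> Optional[dict]:
    """Build the AI-specific ticket fields; only P0/P1 create a ticket here."""
    if severity not in ("P0", "P1"):
        return None
    return {
        "incident_category": category,
        "model_id": model_id,
        "affected_use_case": affected_use_case,
        "estimated_user_impact": estimated_user_impact,
        "regulatory_notification_required": regulatory_notification_required,
    }
```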

AI-Specific Runbook Templates

Generic runbooks that ask engineers to check CPU and memory are insufficient for AI incidents. AI runbooks include: current model version and recent changes; recent prompt template deployments; vector database status; token usage and cost trend; hallucination rate and quality metrics; third-party model provider status page check; and regulatory notification assessment decision tree.
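A runbook of this kind can be tracked as a fixed checklist so that no diagnostic step is silently skipped. The sketch below is one possible shape; the check identifiers and the `RunbookRun` class are hypothetical names derived from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Check identifiers derived from the runbook contents listed above.
AI_RUNBOOK_CHECKS: List[str] = [
    "model_version_and_recent_changes",
    "recent_prompt_template_deployments",
    "vector_database_status",
    "token_usage_and_cost_trend",
    "hallucination_rate_and_quality_metrics",
    "model_provider_status_page",
    "regulatory_notification_assessment",
]


@dataclass
class RunbookRun:
    """Records findings against each runbook check for one incident."""
    incident_id: str
    findings: Dict[str, str] = field(default_factory=dict)

    def record(self, check: str, finding: str) -> None:
        if check not in AI_RUNBOOK_CHECKS:
            raise ValueError(f"unknown runbook check: {check}")
        self.findings[check] = finding

    def pending(self) -> List[str]:
        """Checks not yet completed, in runbook order."""
        return [c for c in AI_RUNBOOK_CHECKS if c not in self.findings]
```

Keeping the regulatory notification assessment as an explicit check means an incident cannot be closed out with that step left undone without it showing up in `pending()`.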

5. Architecture Diagram

flowchart TD
    subgraph Detection["Detection Layer"]
        A[AI Alert Engine]
        B[Telemetry Signals]
    end
    subgraph Triage["Triage and Response"]
        C{Severity Classifier}
        D[Incident Commander]
        E[AI Runbook]
    end
    subgraph Resolution["Mitigation and Close"]
        F{Mitigation Type}
        G[Rollback or Scale]
        H[Regulatory Notification]
    end
    B --> A
    A --> C
    C -->|P0/P1| D
    C -->|P2/P3| I[Channel Alert or Ticket]
    D --> E
    E --> F
    F -->|model/infra| G
    F -->|compliance/security| H
    G --> J[Verify Recovery]
    J --> K[Post-Incident Review]
    style B fill:#dbeafe,stroke:#3b82f6
    style A fill:#f0fdf4,stroke:#22c55e
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#fef9c3,stroke:#eab308
    style F fill:#f3e8ff,stroke:#a855f7
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#fee2e2,stroke:#ef4444
    style I fill:#dbeafe,stroke:#3b82f6
    style J fill:#f0fdf4,stroke:#22c55e
    style K fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
AI Alert Engine	Service	Evaluate AI-specific alert rules against telemetry; create alert events	Prometheus Alertmanager, Datadog Monitors, CloudWatch Alarms	Critical
Severity Classifier	Logic	Apply taxonomy + threshold rules to classify incident severity	Rules engine embedded in alert manager; custom Lambda/Cloud Function	Critical
Incident Management Platform	SaaS / On-Prem	On-call routing, escalation policies, incident timeline	PagerDuty, OpsGenie, ServiceNow	Critical
AI Runbook Library	Documentation + Automation	AI-specific diagnostic and mitigation procedures; automated diagnostic scripts	Confluence / Notion + runbook automation (PagerDuty Runbook Automation, Rundeck)	High
Regulatory Notification Workflow	Process + Tool	Decision tree for APRA/Privacy Act notification; draft notifications	ServiceNow GRC; custom decision tree in runbook	Critical
Post-Incident Review Template	Process	AI-specific PIR structure covering model/prompt/data/infra dimensions	Confluence template; Google Docs template	High
Incident Dashboard	UI	Real-time incident status; active incident list; MTTD/MTTR metrics	PagerDuty status page; Grafana incident dashboard	Medium
Communication Templates	Documentation	Stakeholder and customer communications for AI incidents	Pre-approved templates by incident type; legal-reviewed	High
Learning Actions Tracker	Workflow	Track PIR action items to closure; prevent repeat incidents	JIRA; ServiceNow; GitHub Issues	Medium

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	AI Alert Engine	Evaluates alert rules against incoming telemetry stream	Alert event with: incident_type, severity, affected_model, metric_value, threshold
2	Severity Classifier	Applies severity classification rules; determines P0–P3	Classified alert with severity, incident_category, estimated_impact
3	Incident Management Platform	Routes alert to on-call via PagerDuty/OpsGenie; creates ServiceNow ticket	Paged engineer; incident ticket created
4	Incident Commander	Acknowledges incident; activates AI-specific runbook; assigns responders	Incident timeline started; runbook checklist initiated
5	Responder Team	Diagnoses root cause using runbook: check model version, prompt changes, data health	Root cause identified (model / prompt / data / infrastructure)
6	Regulatory Assessment	Uses decision tree to determine APRA CPS 230 / Privacy Act notification requirement	Notification decision + documented rationale
7	Mitigation	Implements appropriate mitigation: rollback, scale, block, escalate to vendor	Mitigation action logged in incident timeline
8	Verification	Monitors recovery metrics; confirms metrics returning to normal	Recovery confirmed; incident resolved
9	Post-Incident Review	Within 5 business days for P1+; AI-specific PIR template completed	PIR document; learning actions in JIRA

Error Flow

Error Scenario	Detection	Action	Recovery
Alert engine unavailable	Health check on alert engine; missed alert test (synthetic canary metric)	Escalate manually; page on-call via backup channel (email/SMS)	Restore alert engine; verify synthetic canary passes
False positive P1 alert (metric spike from deployment)	Incident commander review during triage identifies correlated deployment	Downgrade severity; resolve; adjust alert threshold	Review and tune alert sensitivity
Regulatory notification deadline missed	Compliance calendar alert; APRA CPS 230 72-hour window	Notify APRA with explanation of delay; internal escalation to CCO	Enforce regulatory assessment within 2 hours of P0/P1 incident declaration
Incident manager unresponsive	Secondary on-call escalation after 10 minutes	Page secondary; notify engineering manager	Review on-call schedule; ensure 24/7 coverage
Mitigation makes incident worse (bad rollback)	Metrics worsen after mitigation	Re-declare incident at previous or higher severity	Roll forward to last known good; escalate to vendor

8. Security Considerations

Authentication: Incident management platform access restricted to authorised engineering, security, and compliance staff via SSO. PagerDuty/OpsGenie API tokens stored in secrets manager. Alert engine webhooks use HMAC signatures for authentication.
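As an illustration of HMAC-signed webhooks, the receiver can recompute the signature over the raw request body and compare it in constant time. This sketch assumes HMAC-SHA256 with a hex-encoded signature carried in a request header; the algorithm, the encoding, and the function names are assumptions, since the pattern only states that HMAC signatures are used.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw webhook body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time check that the received signature matches the body."""
    expected = sign(secret, body)
    # Compare as bytes so a non-ASCII header value fails cleanly.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

The signature should be computed over the exact bytes received, before any JSON parsing, because re-serialising the payload can change whitespace or key order and break the comparison.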

Authorisation: Compliance and security incidents have restricted visibility. PII incident details are restricted to privacy officer, CISO, legal, and specific engineering respondents. Incident timelines are not publicly accessible.

Secrets Management: Any credentials used in automated remediation runbooks (e.g., model API keys for rollback) stored in secrets manager with break-glass access audit trail.

Data Classification: Incident records containing AI output examples or prompt content are classified as Confidential. Incident records for security incidents are classified as Restricted. All incident records are retained for 7 years.

Encryption: Incident management platform data encrypted at rest and in transit. On-call contact information protected at Restricted level.

Auditability: Every incident action (acknowledgement, assignment, escalation, resolution) is timestamped and immutable. Regulatory notification decisions have documented rationale and are retained permanently.

OWASP LLM Top 10 Coverage

OWASP LLM Risk	Incident Management Control	Implementation
LLM01 Prompt Injection	Security incident category; automated detection and P1+ classification	Injection incidents have dedicated runbook; SOC notification
LLM02 Insecure Output Handling	Output safety incidents trigger quality incident; P0 if PII in output	PII-in-output is zero-tolerance P0 incident
LLM03 Training Data Poisoning	Data incident category; detected via accuracy regression	Data incident runbook includes training data integrity check
LLM04 Model Denial of Service	Availability incident; token abuse triggers cost incident	Availability + cost incidents have auto-scaling mitigation runbook
LLM05 Supply Chain Vulnerabilities	Model version change triggers an incident review; unexpected version = P1	Model version check in every quality/availability runbook
LLM06 Sensitive Information Disclosure	Compliance incident P0 on PII in output; Privacy Act notification assessment	Immediate Privacy Act notification decision tree
LLM07 Insecure Plugin Design	Tool call anomalies feed security incident stream	Tool abuse detection escalates to security incident
LLM08 Excessive Agency	Agentic AI runaway actions trigger security/availability incident	Agent action scope violation = P1 security incident
LLM09 Overreliance	Accuracy regression = quality incident; threshold breach triggers review	Quality incidents include assessment of downstream over-reliance harm
LLM10 Model Theft	API key abuse patterns trigger security incident	Unusual API key usage = P1 security incident with key rotation runbook

9. Governance Considerations

Responsible AI: The incident management process is a primary governance control. Every AI incident is documented, root-caused, and acted upon. The PIR process includes explicit assessment of whether the incident represents a systematic AI risk requiring model, prompt, or policy change.

Model Risk Management: P1+ quality incidents that affect material AI models are automatically flagged for model risk management review. The PIR feeds into the model risk register.

Human Approval: Escalation to executive or regulatory notification requires human decision — no automated regulatory notification. The decision tree provides the framework; the compliance officer makes the call.

Policy: AI incident management policy must define: incident category definitions, severity thresholds, response time SLOs, on-call responsibilities, PIR obligations, and regulatory notification criteria. Policy reviewed annually and after every P0 incident.

Traceability: Every incident is traceable from the alert (metric/log event) through triage, mitigation, and learning actions to the specific change that prevented recurrence. This chain is the evidence base for regulatory audit.

Governance Artefacts

Artefact	Owner	Frequency	Format
AI Incident Register	AI Platform / Risk	Continuous	ServiceNow CMDB; monthly report
Post-Incident Review Documents	Incident Commander	Within 5 business days of P1+	Structured document; linked to incident ticket
APRA CPS 230 Notification Log	Compliance Officer	Per notification obligation	Formal document; APRA portal submission
MTTD/MTTR Trend Report	Platform Engineering	Monthly	Dashboard + executive summary
On-Call Runbook Maintenance Log	Platform Engineering	Quarterly	Runbook review and update record
Incident Learning Actions Tracker	Engineering Leads	Per PIR	JIRA board; closure tracking

10. Operational Considerations

Monitoring: The incident management system itself is monitored. Alert engine availability, alert delivery time, on-call response time, and notification pipeline reliability are all tracked. A synthetic canary metric fires every 5 minutes; if it doesn't trigger the expected alert within 2 minutes, a meta-alert fires.
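The canary check can be sketched as a comparison between the times the synthetic metric fired and the times the corresponding alerts arrived. The function and parameter names are hypothetical; the 2-minute deadline is the figure given above.

```python
from datetime import datetime, timedelta
from typing import List, Sequence

ALERT_DEADLINE = timedelta(minutes=2)  # expected alert must arrive within this


def missed_canaries(canary_fired_at: Sequence[datetime],
                    alerts_received_at: Sequence[datetime],
                    now: datetime) -> List[datetime]:
    """Return canary firings whose expected alert never arrived in time.

    A non-empty result is the condition for raising the meta-alert.
    """
    missed = []
    for fired in canary_fired_at:
        deadline = fired + ALERT_DEADLINE
        if now < deadline:
            continue  # still inside the allowed delivery window
        if not any(fired <= a <= deadline for a in alerts_received_at):
            missed.append(fired)
    return missed
```

The meta-alert itself should travel over a path that does not depend on the alert engine being tested, otherwise a failed engine would also suppress the warning about its own failure.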

Logging: All incident management events are logged to an immutable audit store. PagerDuty/OpsGenie export incident timelines to the log store daily.

Incident Response: On-call engineers receive AI-specific training quarterly. Tabletop exercises simulate P0 scenarios (PII leak at scale, model API outage during peak, prompt injection attack) to validate runbooks before real incidents occur.

Disaster Recovery: Incident management platform (PagerDuty/OpsGenie) is a third-party SaaS with > 99.9% availability. Backup notification is via email and SMS directly to on-call. Alert rules are version-controlled and can be re-applied to a new instance.

Capacity Planning: On-call staffing must be planned for AI systems that expand scope. Adding a new high-risk AI feature may add 2–3 additional incident types requiring runbook development and on-call training.

SLO Table (MTTD and MTTR Targets)

Severity	MTTD Target	MTTR Target	Alert Delivery SLO
P0 (Critical)	< 5 minutes	< 2 hours	< 2 minutes from detection to page
P1 (High)	< 15 minutes	< 8 hours	< 5 minutes from detection to alert
P2 (Medium)	< 1 hour	< 24 hours	< 15 minutes from detection to channel alert
P3 (Low)	< 4 hours	< 5 business days	< 1 hour from detection to ticket
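Compliance with these targets can be computed from incident timestamps. The sketch below assumes MTTD is measured from fault start to detection and MTTR from detection to resolution; definitions vary between organisations, and the record keys are hypothetical. P3 is omitted because its MTTR target is expressed in business days, which needs a working calendar.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List

# Targets copied from the SLO table above (P0 to P2 only).
TARGETS = {
    "P0": {"mttd": timedelta(minutes=5), "mttr": timedelta(hours=2)},
    "P1": {"mttd": timedelta(minutes=15), "mttr": timedelta(hours=8)},
    "P2": {"mttd": timedelta(hours=1), "mttr": timedelta(hours=24)},
}


def slo_report(incidents: List[dict]) -> Dict[str, dict]:
    """Mean time to detect and resolve per severity, compared with targets."""
    grouped: Dict[str, List[dict]] = {}
    for inc in incidents:
        if inc["severity"] in TARGETS:
            grouped.setdefault(inc["severity"], []).append(inc)

    report = {}
    for sev, group in grouped.items():
        mttd = timedelta(seconds=mean(
            (i["detected_at"] - i["started_at"]).total_seconds() for i in group))
        mttr = timedelta(seconds=mean(
            (i["resolved_at"] - i["detected_at"]).total_seconds() for i in group))
        report[sev] = {
            "mttd": mttd,
            "mttr": mttr,
            "mttd_met": mttd < TARGETS[sev]["mttd"],
            "mttr_met": mttr < TARGETS[sev]["mttr"],
        }
    return report
```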

Disaster Recovery Table

Component	RTO	RPO	Recovery Approach
PagerDuty / OpsGenie	< 5 minutes (SaaS HA)	N/A	Vendor HA; email/SMS backup
AI Alert Engine	10 minutes	N/A (stateless rules)	Auto-restart; rules version-controlled
Incident Ticket Store (ServiceNow)	30 minutes	1 hour	ServiceNow HA; regular backup
Post-Incident Review Documents	24 hours	24 hours	Confluence / Google Drive; cloud backup

11. Cost Considerations

Cost Drivers

Driver	Description	Relative Cost
Incident Management Platform (PagerDuty/OpsGenie)	Per-user SaaS subscription	Medium
On-call engineering time	Incident response, PIR, learning actions	High (human labour)
Alert engine compute	Rule evaluation on telemetry stream	Low
ServiceNow integration	Enterprise ITSM licensing	High at enterprise scale
False positive alert investigation	Engineer time investigating non-incidents	Medium if uncontrolled

Scaling Risks: Alert fatigue is the primary risk. If too many P3 alerts fire for minor fluctuations, engineers begin ignoring alerts. Maintain false positive rate < 10% at each severity level — review alert thresholds monthly.

Optimisations:

Consolidate low-severity alerts into a daily digest instead of individual notifications
Auto-resolve P3 tickets if metrics self-recover within 30 minutes without intervention
Use composite alert rules (multiple conditions AND) for P1 to reduce false positives
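The composite-rule and false-positive-rate ideas above can be sketched together. The conditions combined in `composite_p1` (in particular the minimum request volume of 200) are illustrative assumptions; the 10% limit is the target stated in this section.

```python
from typing import Dict, Iterable, List, Tuple


def composite_p1(hallucination_ratio: float, sustained_minutes: int,
                 requests_in_window: int, min_requests: int = 200) -> bool:
    """Page as P1 only when every condition holds (logical AND)."""
    return all((
        hallucination_ratio > 3.0,           # P1 example threshold
        sustained_minutes >= 30,             # not a momentary spike
        requests_in_window >= min_requests,  # enough traffic to trust the rate
    ))


def false_positive_rates(alerts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """alerts: (severity, was_false_positive) pairs from incident review."""
    totals: Dict[str, int] = {}
    false_positives: Dict[str, int] = {}
    for severity, is_fp in alerts:
        totals[severity] = totals.get(severity, 0) + 1
        false_positives[severity] = false_positives.get(severity, 0) + int(is_fp)
    return {sev: false_positives[sev] / totals[sev] for sev in totals}


def needs_tuning(rates: Dict[str, float], limit: float = 0.10) -> List[str]:
    """Severities at or above the false positive limit."""
    return sorted(sev for sev, rate in rates.items() if rate >= limit)
```

Running `needs_tuning` as part of the monthly threshold review gives a concrete list of severities whose rules are due for adjustment.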

Indicative Cost Range

Scale	AI Incidents/Month	Estimated Incident Management Cost/Month
Small	5–20	$2,000–$5,000 (mostly on-call time)
Medium	20–100	$8,000–$20,000
Large	100–500	$25,000–$60,000
Enterprise	500+	$50,000–$150,000 (dedicated AI SRE team)

12. Trade-Off Analysis

Approach Comparison

Approach	Pros	Cons	Best For
AI-specific incident taxonomy + dedicated runbooks	Precise, actionable; regulator-defensible; enables metrics	Implementation overhead; requires AI-specific on-call training	Regulated industries; mature AI deployments; teams with dedicated platform engineering
Generic ITIL incident management (no AI extension)	Leverages existing tooling and processes; no incremental training	Cannot detect quality/compliance/cost incidents; insufficient for AI risk	Low-risk AI features as a temporary measure only
Vendor-managed AI monitoring (Datadog AI, Arize AI)	Faster time to value; managed alerting; some AI-specific detection built-in	Vendor lock-in; limited taxonomy customisation; regulatory defensibility weaker	Organisations lacking platform engineering capacity

Architectural Tensions

Tension	Description	Resolution
Alert sensitivity vs. Alert fatigue	Sensitive alerts catch real incidents early but generate false positives; engineers tune them out	Monthly alert tuning reviews; target false positive rate < 10% per severity
Speed vs. Completeness	Fast incident response means mitigating before root cause is known; may make things worse	Define "stabilise then diagnose" protocol: mitigate blast radius first, then root cause
Automation vs. Human judgment	Automated mitigation (rollback) is faster but may be wrong	Automated mitigation only for lower severities (P2/P3); P0 requires human decision; all mitigations logged
Transparency vs. Liability	Detailed PIRs are good governance but create documentation that may be discoverable	Legal review of PIR template; privilege consideration for legally significant incidents

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Alert rule misconfigured; incident not detected	Medium	Critical (missed P0/P1)	Synthetic canary test; periodic alert rule audit	Monthly alert rule validation with synthetic test events
On-call engineer lacks AI expertise to diagnose	Medium	High (MTTR extended)	On-call response time SLO breach	AI-specific on-call training; runbook automation for common diagnoses
Regulatory notification missed within APRA window	Low	Critical (regulatory sanction)	Compliance calendar alert; escalation trigger	Immediate APRA contact with explanation; internal RCA
PIR action items not completed	High	Medium (repeat incidents)	Action item ageing report; 30-day escalation	Weekly review of open PIR actions; escalation to engineering lead
Alert storm masks P0 amid P3 noise	Medium	High (P0 buried)	P0 count anomaly; human escalation from field	P0 alerting on separate, high-priority channel not shared with P3 noise

Cascading Scenarios

Scenario 1: AI model API outage → availability incident declared → quality incidents (hallucinations on degraded fallback model) not declared separately → quality SLO breach undetected during incident → post-incident review reveals quality degradation affected 50K users. Mitigation: availability and quality incidents are independent; quality monitoring continues during availability incidents.
Scenario 2: Cost incident P2 ignored as "not urgent" → daily budget exceeded by 300% → monthly budget depleted in week 1 → AI features disabled for remainder of month → revenue impact. Mitigation: Cost incidents have business escalation path; P2 cost incidents require FinOps notification within 1 hour.
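The arithmetic behind Scenario 2 can be made explicit with a simple run-rate projection. This sketch reads "exceeded by 300%" as spending four times the daily budget, which is an interpretation; the function name and the dollar figures are hypothetical.

```python
def days_until_depleted(monthly_budget: float, spent_so_far: float,
                        daily_spend: float) -> float:
    """Days of budget remaining at the current daily spend rate."""
    if daily_spend <= 0:
        return float("inf")
    remaining = max(monthly_budget - spent_so_far, 0.0)
    return remaining / daily_spend


# A 30,000 monthly budget implies 1,000 per day; spending 4,000 per day
# (300% over the daily budget) exhausts the month's budget in 7.5 days.
projected = days_until_depleted(30_000, 0, 4_000)
```

A projection like this gives a P2 cost alert a concrete business consequence, which supports the FinOps escalation path described in the mitigation.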

14. Regulatory Considerations

Regulation	Clause	Requirement	AI Incident Management Implementation
APRA CPS 230	Para 53 (Operational Risk Management)	Critical service disruptions must be identified, managed, and reported	AI availability and quality incidents affecting critical services mapped to CPS 230 notification
APRA CPS 230	Para 57 (Incident Response)	Documented incident response procedures for material operational disruptions	This pattern provides AI-specific incident response procedures
APRA CPS 230	Para 61 (Notification)	APRA notification within 24 hours for severe disruptions; 72 hours for material	Regulatory notification decision tree in every P0/P1 runbook
APRA CPS 234	Para 36 (Cyber Incident Response)	Information security incidents detected and responded to within defined timeframes	Security incidents (injection, PII leak) have MTTD < 5 minutes, MTTR < 2 hours
Privacy Act 1988 (AU)	NDB Scheme (Part IIIC)	Eligible data breaches (including AI-driven) notified to OAIC and affected individuals	PII-in-output P0 incident triggers NDB assessment; suspected breaches assessed within 30 days and notified as soon as practicable
EU AI Act	Article 73 (Reporting of Serious Incidents)	Providers of high-risk AI must report serious incidents to national authorities	Serious AI incidents (physical/psychological harm) reported per Article 73; incident log maintained
ISO/IEC 42001	Clause 10.2 (Nonconformity and Corrective Action)	Nonconformities must be corrected and root causes addressed	PIR process directly implements corrective action requirement
NIST AI RMF	MANAGE 4.3	Documented procedures for AI incidents including recovery and learning	This pattern implements NIST MANAGE 4.3
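Notification windows are easier to honour when the deadlines are computed at the moment an incident is declared. The sketch below uses the 2-hour internal assessment target from the Error Flow table and the 24-hour and 72-hour APRA windows as stated in the table above; treat them as configuration to be confirmed by the compliance team. The names are hypothetical, and the code only tracks deadlines: the notification decision remains a human one.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Set

# Windows as stated in this pattern; confirm against current obligations.
WINDOWS: Dict[str, timedelta] = {
    "internal_regulatory_assessment": timedelta(hours=2),
    "apra_severe_disruption": timedelta(hours=24),
    "apra_material_disruption": timedelta(hours=72),
}


def deadlines(declared_at: datetime) -> Dict[str, datetime]:
    """Due time for each obligation, counted from incident declaration."""
    return {name: declared_at + window for name, window in WINDOWS.items()}


def overdue(declared_at: datetime, now: datetime,
            completed: Set[str]) -> List[str]:
    """Obligations past their due time that have not been completed."""
    return sorted(
        name for name, due in deadlines(declared_at).items()
        if name not in completed and now > due
    )
```

Using timezone-aware timestamps (for example `datetime.now(timezone.utc)`) avoids ambiguity when responders and regulators are in different time zones.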

15. Reference Implementations

AWS

Alert Engine: CloudWatch Alarms + EventBridge rules; custom Lambda for AI-specific composite rules
Incident Platform: PagerDuty with AWS EventBridge integration; ServiceNow with AWS Service Management Connector
Runbook Automation: AWS Systems Manager Automation documents; PagerDuty Runbook Automation
Incident Log: AWS DynamoDB (incident events); Amazon S3 (PIR documents)
Regulatory Notification: AWS Step Functions for notification decision workflow
Dashboard: Amazon QuickSight MTTD/MTTR dashboard; CloudWatch operational dashboard

Azure

Alert Engine: Azure Monitor Alerts; Azure Logic Apps for composite alert rules
Incident Platform: PagerDuty with Azure Monitor integration; ServiceNow with Azure DevOps
Runbook Automation: Azure Automation Runbooks; ITSM Connector for ServiceNow
Incident Log: Azure Cosmos DB (events); Azure Blob Storage (documents)
Regulatory Notification: Power Automate workflow for notification decision tree
Dashboard: Azure Monitor Workbooks; Power BI MTTD/MTTR reports

GCP

Alert Engine: Cloud Monitoring Alerting; Cloud Functions for composite rules
Incident Platform: PagerDuty with Google Cloud integration; ServiceNow
Runbook Automation: Cloud Run jobs; PagerDuty Runbook Automation
Incident Log: Firestore (events); Cloud Storage (documents)
Regulatory Notification: Cloud Workflows for notification decision workflow
Dashboard: Looker; Cloud Monitoring dashboards

On-Premises

Alert Engine: Prometheus Alertmanager with custom AI alert rules; Grafana alerting
Incident Platform: OpsGenie (SaaS); Jira Service Management (self-hosted)
Runbook Automation: Rundeck; Ansible playbooks for common mitigations
Incident Log: PostgreSQL (events); Confluence (documents)
Regulatory Notification: Manual workflow with compliance team; tracked in GRC tool
Dashboard: Grafana operational dashboard

16. Related Patterns

Pattern ID	Pattern Name	Relationship	Notes
EAAPL-OBS001	AI Telemetry Architecture	Foundation	All detection rules consume metrics and logs from this pattern
EAAPL-OBS002	Prompt Monitoring	Feeds Into	Security incidents (injection, PII) detected by OBS002; routed here
EAAPL-OBS003	Hallucination Detection	Feeds Into	Quality incidents triggered by hallucination rate alerts from OBS003
EAAPL-OBS005	Model Drift Detection	Feeds Into	Drift alerts generate P2/P1 quality incidents in this framework
EAAPL-OBS006	AI Cost Observability	Feeds Into	Cost incidents triggered by OBS006 budget alerts

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Rationale
Adoption Breadth	3	AI-specific incident management adopted by regulated industries; general market still maturing
Tooling Ecosystem	4	PagerDuty/OpsGenie/ServiceNow mature; AI-specific alert rules and runbooks are custom
Operational Runbook Coverage	3	Generic runbooks well-established; AI-specific runbooks require custom development
Regulatory Evidence	5	APRA CPS 230 and Privacy Act NDB scheme are mature and well-understood obligations
Cost Predictability	4	Incident management platform costs are predictable; on-call labour costs are variable
Team Skill Availability	4	SRE/incident management skills broadly available; AI-specific extensions require training

18. Revision History

Version	Date	Author	Changes
1.0.0	2026-06-12	EAAPL Working Group	Initial publication
