EAAPL-OBS004 · AI Incident Management
Pattern ID: EAAPL-OBS004
Status: Proven
Complexity: Medium
Tags: observability alerting slo apra-cps230 medium-complexity
Version: 1.0.0
Last Reviewed: 2026-06-12
1. Executive Summary
AI system failures are qualitatively different from traditional software failures. An AI system can be operationally available (HTTP 200) while actively delivering harmful, inaccurate, or biased outputs. Traditional incident management frameworks — built around availability and error rate — are blind to quality, safety, and compliance failures that represent the greatest risk in AI deployments.
This pattern defines the operational incident management lifecycle for AI systems, covering detection, triage, escalation, response, and post-incident review. It establishes a six-category AI incident taxonomy (Availability, Quality, Security, Cost, Compliance, and Data) with severity classifications, MTTD and MTTR targets by severity, and automated detection rules for each incident type drawn from the telemetry architecture defined in EAAPL-OBS001. It specifies integration with PagerDuty, OpsGenie, and ServiceNow; AI-specific runbook templates; post-incident review processes; and APRA CPS 230 incident notification obligations that apply when AI systems support critical operations. The pattern is designed to provide evidence to regulators and auditors that AI incidents are systematically detected, managed, and learned from.
Target Audience: CIO, CTO, Head of Platform Engineering, Chief Risk Officer
Time to Implement: 4–8 weeks
2. Problem Statement
Business Problem
When AI systems deliver harmful outputs, organisations face a compounding problem: they often don't know an incident occurred until a user complains; they cannot determine scope (how many users were affected); they cannot reconstruct what happened; and they have no playbook for response. APRA CPS 230 requires financial institutions to detect and manage operational disruptions — but most AI incident policies do not account for quality and compliance failures that technically aren't "outages."
Technical Problem
Existing incident management tooling is wired to infrastructure and HTTP metrics: error rate, latency, availability. AI-specific failure modes — hallucination spike, accuracy regression, prompt injection attack, PII leak in output, cost budget breach, vector DB corruption — produce none of these signals. They appear as business metric degradation (lower NPS, higher escalation rate, customer churn) weeks after the incident began.
Symptoms
- AI incidents discovered through customer complaints or NPS drop, not monitoring alerts
- No classification system for AI incidents; all AI issues are handled as ad-hoc engineering tasks
- Post-incident review template asks "what was the error rate?" rather than "what was the hallucination rate?"
- Regulatory notification assessment for AI incidents is ad-hoc; no documented criteria
- Mean time to resolve AI quality incidents is measured in days, not hours
- No ownership for AI cost incidents — budget overruns attributed to "AI costs increased" with no root cause
Cost of Inaction
- APRA CPS 230 paragraph 53 requires financial institutions to detect and respond to operational disruptions; AI failures that affect service delivery qualify
- Each undetected AI quality incident erodes user trust; trust erosion has a compounding effect on retention
- Exposure to regulatory enforcement action when an AI incident occurs and no documented response procedure exists
- Budget overruns from undetected AI cost incidents: a typical cost incident left undetected for 72 hours costs 3–5x what it would have cost had it been alerted on and corrected promptly
3. Context
When to Apply
- Any production AI system in a regulated industry (APRA, EU AI Act, Privacy Act)
- AI systems in organisations with existing ITIL or incident management frameworks that need AI-specific extensions
- AI systems processing > 1,000 requests/day where manual monitoring is not scalable
- Prerequisite: EAAPL-OBS001 telemetry must provide the metric and log stream for automated detection
When NOT to Apply
- Internal proof-of-concept systems with < 30-day lifespan and no regulatory exposure
- AI systems where the parent application already has comprehensive incident management covering AI-specific failure modes
Prerequisites
| Prerequisite | Required | Notes |
|---|---|---|
| EAAPL-OBS001 AI Telemetry Infrastructure | Required | Alert rules depend on metrics and logs from this pattern |
| Incident management platform (PagerDuty / OpsGenie / ServiceNow) | Required | Alert routing and on-call schedule management |
| Defined on-call schedule for AI platform | Required | Someone must receive P1 alerts at 3am |
| APRA CPS 230 Business Services mapping | Conditional | Required if AI supports APRA-critical business services |
Industry Applicability
| Industry | Applicability | Primary Driver |
|---|---|---|
| Financial Services | Critical | APRA CPS 230 notification obligations, regulatory audit |
| Healthcare | Critical | Clinical AI failures have patient safety consequences |
| Government | High | Public service delivery obligations; ministerial reporting |
| Legal Services | High | Professional liability incidents require documented response |
| Retail / E-Commerce | Medium | Cost and quality incidents affect revenue |
| Technology / SaaS | High | SLA obligations; multi-tenant blast radius |
4. Architecture Overview
The AI Incident Management Architecture is a layered system operating on top of the telemetry infrastructure established by EAAPL-OBS001. It introduces AI-specific detection logic, a structured taxonomy and severity framework, integration with existing incident management platforms, AI-specific runbooks, and a specialised post-incident review process.
AI Incident Taxonomy
Six incident categories are defined:
- Availability: model API outages, vector database unavailability, and inference service failures. Traditional monitoring covers these well; the AI-specific extension is tracking which downstream AI features are degraded during third-party model API outages, since these are external dependencies outside the organisation's control.
- Quality: hallucination rate spikes, accuracy regressions, output safety filter bypass, and significant latency degradation affecting user experience. Quality incidents are the most novel category and require the AI-specific telemetry from EAAPL-OBS001 and EAAPL-OBS003 to detect.
- Security: detected prompt injection attacks, PII leaked in model output, jailbreak attempts, and unusual API key usage patterns.
- Cost: budget threshold breaches, per-request cost spikes, and unexpected token usage patterns.
- Compliance: regulatory threshold breaches (e.g., AI-influenced decision rate exceeding regulatory limits), audit trail failures, and human oversight bypass.
- Data: vector database corruption, training data tampering, and retrieval index degradation.
Severity Classification
Each severity level has defining criteria, examples, and a response expectation:
- P0 (Critical): AI system causing immediate user harm or regulatory breach. Examples: PII appearing in AI outputs at scale; AI clinical decision support providing systematically wrong guidance; security breach via prompt injection. Response: page on-call immediately; incident commander assigned within 10 minutes.
- P1 (High): AI system significantly degraded; SLO breach sustained; material quality regression. Examples: hallucination rate > 3x baseline; model API down > 15 minutes; cost budget fully consumed (100%). Response: page on-call; response within 15 minutes.
- P2 (Medium): AI system partially degraded; SLO at risk; quality regression not yet material. Examples: latency p99 > 2x baseline; hallucination rate elevated but below the P1 threshold; partial feature degradation. Response: alert to on-call channel; response within 1 hour.
- P3 (Low): Minor degradation or early warning signal. Examples: single model error; cost at 80% of budget; drift warning. Response: create ticket; address by the next working day.
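The example thresholds in the severity classification can be written as a small rule function. The following is a minimal sketch rather than the pattern's reference classifier: the `Signals` field names are hypothetical, the P2 hallucination threshold of 2x baseline is an assumption (the text only says "elevated"), and the P1 cost example is read as the budget being fully consumed.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    P0 = 0
    P1 = 1
    P2 = 2
    P3 = 3


@dataclass
class Signals:
    """Hypothetical snapshot of signals for one AI system (field names are illustrative)."""
    pii_in_output: bool = False
    hallucination_ratio: float = 1.0    # current rate / baseline rate
    api_down_minutes: float = 0.0
    budget_used_fraction: float = 0.0   # 1.0 means 100% of budget consumed
    latency_p99_ratio: float = 1.0      # current p99 / baseline p99


def classify(s: Signals) -> Optional[Severity]:
    """Return the most severe matching level, or None if no example threshold is met."""
    if s.pii_in_output:
        return Severity.P0
    # Assumption: "cost budget 100%" is read as budget fully consumed.
    if s.hallucination_ratio > 3 or s.api_down_minutes > 15 or s.budget_used_fraction >= 1.0:
        return Severity.P1
    # Assumption: "elevated" hallucination rate means above 2x baseline.
    if s.latency_p99_ratio > 2 or s.hallucination_ratio > 2:
        return Severity.P2
    if s.budget_used_fraction >= 0.8:
        return Severity.P3
    return None
```

Checking the most severe level first means a system that matches several examples at once is always classified at the highest applicable severity.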
Automated Detection Rules
Detection rules are defined as alert conditions on the metrics and logs from EAAPL-OBS001. Each incident category has a specific detection rule:
- Availability: HTTP error rate > 5% for 5 minutes on the model API endpoint.
- Quality: hallucination rate (from EAAPL-OBS003) > 2x the 7-day baseline for 30 minutes.
- Security: injection attempt count > 10 per minute (from EAAPL-OBS002).
- Cost: hourly spend > 150% of the rolling 7-day hourly average.
- Compliance: PII detection event count > 0 in the output stream (zero tolerance).
- Data: vector retrieval success rate < 95% for 15 minutes.
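To illustrate how such rules might be evaluated, here is a hedged sketch covering the fixed-threshold rules and the baseline-relative cost rule. It assumes one metric sample per minute and invents the `Rule`, `fires`, and `cost_spike` names for illustration; in practice these conditions would normally live in the alert engine's own rule syntax.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Rule:
    category: str
    window_minutes: int                  # breach must persist for this many minutes
    breached: Callable[[float], bool]    # test applied to each per-minute sample


# Thresholds come from the detection rules; one sample per minute is an assumption.
RULES = {
    "availability": Rule("Availability", 5, lambda error_rate: error_rate > 0.05),
    "security": Rule("Security", 1, lambda injections: injections > 10),
    "compliance": Rule("Compliance", 1, lambda pii_events: pii_events > 0),
    "data": Rule("Data", 15, lambda success_rate: success_rate < 0.95),
}


def fires(rule: Rule, samples: Sequence[float]) -> bool:
    """True if the newest `window_minutes` samples all breach the threshold."""
    window = samples[-rule.window_minutes:]
    return len(window) == rule.window_minutes and all(rule.breached(v) for v in window)


def cost_spike(current_hour_spend: float, hourly_spend_last_7_days: Sequence[float]) -> bool:
    """Baseline-relative rule: hourly spend above 150% of the rolling 7-day hourly average."""
    baseline = sum(hourly_spend_last_7_days) / len(hourly_spend_last_7_days)
    return current_hour_spend > 1.5 * baseline
```

Requiring the whole window to breach, rather than any single sample, is what keeps a one-minute blip from raising an availability or data incident.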
Escalation Architecture
PagerDuty/OpsGenie is configured with AI-specific services and escalation policies. P0 and P1 alerts page the AI platform on-call engineer and notify the engineering manager and relevant product owner. Compliance and security incidents additionally notify the CISO, privacy officer, and compliance team. Cost incidents notify the FinOps team and department head in addition to engineering on-call. ServiceNow integration creates an incident ticket for every P1+ alert with AI-specific fields: incident_category (from taxonomy), model_id, affected_use_case, estimated_user_impact, regulatory_notification_required.
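The routing described above can be sketched as a lookup from severity and category to notification targets, alongside the AI-specific ticket fields. This is an illustrative sketch only: the group names are hypothetical placeholders for real PagerDuty/OpsGenie targets, the field types are assumptions, and no vendor API is called.

```python
from dataclasses import dataclass
from typing import Set

# Hypothetical group names; map these to real on-call targets in the paging platform.
PAGE_ON_P0_P1 = {"ai-platform-oncall", "engineering-manager", "product-owner"}
EXTRA_BY_CATEGORY = {
    "compliance": {"ciso", "privacy-officer", "compliance-team"},
    "security": {"ciso", "privacy-officer", "compliance-team"},
    "cost": {"finops", "department-head", "ai-platform-oncall"},
}


@dataclass
class IncidentTicket:
    """AI-specific ticket fields listed above; the types are assumptions."""
    incident_category: str
    model_id: str
    affected_use_case: str
    estimated_user_impact: str
    regulatory_notification_required: bool


def recipients(severity: str, category: str) -> Set[str]:
    """Combine severity-based paging with category-based notifications."""
    targets: Set[str] = set()
    if severity in ("P0", "P1"):
        targets |= PAGE_ON_P0_P1
    targets |= EXTRA_BY_CATEGORY.get(category.lower(), set())
    return targets


def needs_ticket(severity: str) -> bool:
    """A ticket is created for every P1+ alert, i.e. P0 and P1."""
    return severity in ("P0", "P1")
```

Keeping the category extras separate from severity-based paging means a low-severity cost incident still reaches FinOps even though nobody is paged.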
AI-Specific Runbook Templates
Generic runbooks that ask engineers to check CPU and memory are insufficient for AI incidents. AI runbooks include: current model version and recent changes; recent prompt template deployments; vector database status; token usage and cost trend; hallucination rate and quality metrics; third-party model provider status page check; and regulatory notification assessment decision tree.
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| AI Alert Engine | Service | Evaluate AI-specific alert rules against telemetry; create alert events | Prometheus Alertmanager, Datadog Monitors, CloudWatch Alarms | Critical |
| Severity Classifier | Logic | Apply taxonomy + threshold rules to classify incident severity | Rules engine embedded in alert manager; custom Lambda/Cloud Function | Critical |
| Incident Management Platform | SaaS / On-Prem | On-call routing, escalation policies, incident timeline | PagerDuty, OpsGenie, ServiceNow | Critical |
| AI Runbook Library | Documentation + Automation | AI-specific diagnostic and mitigation procedures; automated diagnostic scripts | Confluence / Notion + runbook automation (PagerDuty Runbook Automation, Rundeck) | High |
| Regulatory Notification Workflow | Process + Tool | Decision tree for APRA/Privacy Act notification; draft notifications | ServiceNow GRC; custom decision tree in runbook | Critical |
| Post-Incident Review Template | Process | AI-specific PIR structure covering model/prompt/data/infra dimensions | Confluence template; Google Docs template | High |
| Incident Dashboard | UI | Real-time incident status; active incident list; MTTD/MTTR metrics | PagerDuty status page; Grafana incident dashboard | Medium |
| Communication Templates | Documentation | Stakeholder and customer communications for AI incidents | Pre-approved templates by incident type; legal-reviewed | High |
| Learning Actions Tracker | Workflow | Track PIR action items to closure; prevent repeat incidents | JIRA; ServiceNow; GitHub Issues | Medium |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | AI Alert Engine | Evaluates alert rules against incoming telemetry stream | Alert event with: incident_type, severity, affected_model, metric_value, threshold |
| 2 | Severity Classifier | Applies severity classification rules; determines P0–P3 | Classified alert with severity, incident_category, estimated_impact |
| 3 | Incident Management Platform | Routes alert to on-call via PagerDuty/OpsGenie; creates ServiceNow ticket | Paged engineer; incident ticket created |
| 4 | Incident Commander | Acknowledges incident; activates AI-specific runbook; assigns responders | Incident timeline started; runbook checklist initiated |
| 5 | Responder Team | Diagnoses root cause using runbook: check model version, prompt changes, data health | Root cause identified (model / prompt / data / infrastructure) |
| 6 | Regulatory Assessment | Uses decision tree to determine APRA CPS 230 / Privacy Act notification requirement | Notification decision + documented rationale |
| 7 | Mitigation | Implements appropriate mitigation: rollback, scale, block, escalate to vendor | Mitigation action logged in incident timeline |
| 8 | Verification | Monitors recovery metrics; confirms metrics returning to normal | Recovery confirmed; incident resolved |
| 9 | Post-Incident Review | Within 5 business days for P1+; AI-specific PIR template completed | PIR document; learning actions in JIRA |
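Step 9 sets a PIR deadline of 5 business days for P1+ incidents. A minimal sketch of computing that due date follows; it assumes the clock starts on the resolution date, treats Monday to Friday as business days, and ignores public holidays. The `pir_due_date` name is hypothetical.

```python
from datetime import date, timedelta

PIR_BUSINESS_DAYS = 5


def add_business_days(start: date, days: int) -> date:
    """Advance `days` business days (Mon-Fri) from `start`; holidays are not considered."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def pir_due_date(resolved_on: date) -> date:
    """Due date for the post-incident review of a P1+ incident."""
    return add_business_days(resolved_on, PIR_BUSINESS_DAYS)
```

For an incident resolved on Friday 12 June 2026, `pir_due_date(date(2026, 6, 12))` returns Friday 19 June 2026.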
Error Flow
| Error Scenario | Detection | Action | Recovery |
|---|---|---|---|
| Alert engine unavailable | Health check on alert engine; missed alert test (synthetic canary metric) | Escalate manually; page on-call via backup channel (email/SMS) | Restore alert engine; verify synthetic canary passes |
| False positive P1 alert (metric spike from deployment) | Incident commander review during triage identifies correlated deployment | Downgrade severity; resolve; adjust alert threshold | Review and tune alert sensitivity |
| Regulatory notification deadline missed | Compliance calendar alert; APRA CPS 230 72-hour window | Notify APRA with explanation of delay; internal escalation to CCO | Enforce regulatory assessment within 2 hours of P0/P1 incident declaration |
| Incident manager unresponsive | Secondary on-call escalation after 10 minutes | Page secondary; notify engineering manager | Review on-call schedule; ensure 24/7 coverage |
| Mitigation makes incident worse (bad rollback) | Metrics worsen after mitigation | Re-declare incident at previous or higher severity | Roll forward to last known good; escalate to vendor |
8. Security Considerations
Authentication: Incident management platform access restricted to authorised engineering, security, and compliance staff via SSO. PagerDuty/OpsGenie API tokens stored in secrets manager. Alert engine webhooks use HMAC signatures for authentication.
Authorisation: Compliance and security incidents have restricted visibility. PII incident details are restricted to privacy officer, CISO, legal, and specific engineering respondents. Incident timelines are not publicly accessible.
Secrets Management: Any credentials used in automated remediation runbooks (e.g., model API keys for rollback) stored in secrets manager with break-glass access audit trail.
Data Classification: Incident records containing AI output examples or prompt content are classified as Confidential. Incident records for security incidents are classified as Restricted. All incident records are retained for 7 years.
Encryption: Incident management platform data encrypted at rest and in transit. On-call contact information protected at Restricted level.
Auditability: Every incident action (acknowledgement, assignment, escalation, resolution) is timestamped and immutable. Regulatory notification decisions have documented rationale and are retained permanently.
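The HMAC webhook authentication mentioned under Authentication can be sketched with the Python standard library. This is an illustrative example, not any vendor's scheme: the choice of SHA-256, hex encoding, and signing the raw request body are assumptions, and each platform documents its own header name and signing format.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw webhook body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not reveal how many leading characters of a forged signature were correct.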
OWASP LLM Top 10 Coverage
| OWASP LLM Risk | Incident Management Control | Implementation |
|---|---|---|
| LLM01 Prompt Injection | Security incident category; automated detection and P1+ classification | Injection incidents have dedicated runbook; SOC notification |
| LLM02 Insecure Output Handling | Output safety incidents trigger quality incident; P0 if PII in output | PII-in-output is zero-tolerance P0 incident |
| LLM03 Training Data Poisoning | Data incident category; detected via accuracy regression | Data incident runbook includes training data integrity check |
| LLM04 Model Denial of Service | Availability incident; token abuse triggers cost incident | Availability + cost incidents have auto-scaling mitigation runbook |
| LLM05 Supply Chain Vulnerabilities | Any model version change triggers an incident review; an unexpected version change is a P1 | Model version check in every quality/availability runbook |
| LLM06 Sensitive Information Disclosure | Compliance incident P0 on PII in output; Privacy Act notification assessment | Immediate Privacy Act notification decision tree |
| LLM07 Insecure Plugin Design | Tool call anomalies feed security incident stream | Tool abuse detection escalates to security incident |
| LLM08 Excessive Agency | Agentic AI runaway actions trigger security/availability incident | Agent action scope violation = P1 security incident |
| LLM09 Overreliance | Accuracy regression = quality incident; threshold breach triggers review | Quality incidents include assessment of downstream over-reliance harm |
| LLM10 Model Theft | API key abuse patterns trigger security incident | Unusual API key usage = P1 security incident with key rotation runbook |
9. Governance Considerations
Responsible AI: The incident management process is a primary governance control. Every AI incident is documented, root-caused, and acted upon. The PIR process includes explicit assessment of whether the incident represents a systematic AI risk requiring model, prompt, or policy change.
Model Risk Management: P1+ quality incidents that affect material AI models are automatically flagged for model risk management review. The PIR feeds into the model risk register.
Human Approval: Executive escalation and regulatory notification both require a human decision; regulatory notifications are never sent automatically. The decision tree provides the framework; the compliance officer makes the call.
Policy: AI incident management policy must define: incident category definitions, severity thresholds, response time SLOs, on-call responsibilities, PIR obligations, and regulatory notification criteria. Policy reviewed annually and after every P0 incident.
Traceability: Every incident is traceable from the alert (metric/log event) through triage, mitigation, and learning actions to the specific change that prevented recurrence. This chain is the evidence base for regulatory audit.
Governance Artefacts
| Artefact | Owner | Frequency | Format |
|---|---|---|---|
| AI Incident Register | AI Platform / Risk | Continuous | ServiceNow CMDB; monthly report |
| Post-Incident Review Documents | Incident Commander | Within 5 business days of P1+ | Structured document; linked to incident ticket |
| APRA CPS 230 Notification Log | Compliance Officer | Per notification obligation | Formal document; APRA portal submission |
| MTTD/MTTR Trend Report | Platform Engineering | Monthly | Dashboard + executive summary |
| On-Call Runbook Maintenance Log | Platform Engineering | Quarterly | Runbook review and update record |
| Incident Learning Actions Tracker | Engineering Leads | Per PIR | JIRA board; closure tracking |
10. Operational Considerations
Monitoring: The incident management system itself is monitored. Alert engine availability, alert delivery time, on-call response time, and notification pipeline reliability are all tracked. A synthetic canary metric fires every 5 minutes; if it doesn't trigger the expected alert within 2 minutes, a meta-alert fires.
Logging: All incident management events are logged to an immutable audit store. PagerDuty/OpsGenie export incident timelines to the log store daily.
Incident Response: On-call engineers receive AI-specific training quarterly. Tabletop exercises simulate P0 scenarios (PII leak at scale, model API outage during peak, prompt injection attack) to validate runbooks before real incidents occur.
Disaster Recovery: Incident management platform (PagerDuty/OpsGenie) is a third-party SaaS with > 99.9% availability. Backup notification is via email and SMS directly to on-call. Alert rules are version-controlled and can be re-applied to a new instance.
Capacity Planning: On-call staffing must be revisited whenever an AI system expands in scope. Adding a new high-risk AI feature may add 2–3 additional incident types requiring runbook development and on-call training.
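The synthetic canary check described under Monitoring amounts to a deadline test on each canary firing. The sketch below shows that logic under stated assumptions: timestamps for the canary and the resulting alert are available to a watchdog process, and the `meta_alert_needed` name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

ALERT_DEADLINE = timedelta(minutes=2)  # expected alert must arrive within 2 minutes


def meta_alert_needed(canary_fired_at: datetime,
                      alert_seen_at: Optional[datetime],
                      now: datetime) -> bool:
    """True when the canary's expected alert is overdue or arrived after the deadline."""
    deadline = canary_fired_at + ALERT_DEADLINE
    if alert_seen_at is None:
        # No alert yet: only raise the meta-alert once the deadline has passed.
        return now > deadline
    return alert_seen_at > deadline
```

The watchdog should run outside the alert engine it is checking; otherwise an alert engine outage would silence the meta-alert as well.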
SLO Table (MTTD and MTTR Targets)
| Severity | MTTD Target | MTTR Target | Alert Delivery SLO |
|---|---|---|---|
| P0 (Critical) | < 5 minutes | < 2 hours | < 2 minutes from detection to page |
| P1 (High) | < 15 minutes | < 8 hours | < 5 minutes from detection to alert |
| P2 (Medium) | < 1 hour | < 24 hours | < 15 minutes from detection to channel alert |
| P3 (Low) | < 4 hours | < 5 business days | < 1 hour from detection to ticket |
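The MTTD and MTTR targets can be checked against incident records with a short calculation. The sketch below makes two assumptions the table does not state: MTTD is measured from incident start to detection, and MTTR from detection to resolution. The `Incident` record is hypothetical, and P3 is omitted because its MTTR target is expressed in business days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional

# (MTTD target, MTTR target) per severity, taken from the SLO table.
TARGETS = {
    "P0": (timedelta(minutes=5), timedelta(hours=2)),
    "P1": (timedelta(minutes=15), timedelta(hours=8)),
    "P2": (timedelta(hours=1), timedelta(hours=24)),
}


@dataclass
class Incident:
    severity: str
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime


def mean_delta(deltas: Iterable[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty collection of durations."""
    items = list(deltas)
    return sum(items, timedelta()) / len(items)


def slo_report(incidents: List[Incident], severity: str) -> Optional[dict]:
    """Mean time to detect and resolve for one severity, compared with its targets."""
    subset = [i for i in incidents if i.severity == severity]
    if not subset:
        return None
    mttd = mean_delta(i.detected_at - i.started_at for i in subset)
    mttr = mean_delta(i.resolved_at - i.detected_at for i in subset)
    mttd_target, mttr_target = TARGETS[severity]
    return {"mttd": mttd, "mttr": mttr,
            "mttd_met": mttd < mttd_target, "mttr_met": mttr < mttr_target}
```

Because the targets are strict ("<"), a mean that lands exactly on the target counts as a miss.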
Disaster Recovery Table
| Component | RTO | RPO | Recovery Approach |
|---|---|---|---|
| PagerDuty / OpsGenie | < 5 minutes (SaaS HA) | N/A | Vendor HA; email/SMS backup |
| AI Alert Engine | 10 minutes | N/A (stateless rules) | Auto-restart; rules version-controlled |
| Incident Ticket Store (ServiceNow) | 30 minutes | 1 hour | ServiceNow HA; regular backup |
| Post-Incident Review Documents | 24 hours | 24 hours | Confluence / Google Drive; cloud backup |
11. Cost Considerations
Cost Drivers
| Driver | Description | Relative Cost |
|---|---|---|
| Incident Management Platform (PagerDuty/OpsGenie) | Per-user SaaS subscription | Medium |
| On-call engineering time | Incident response, PIR, learning actions | High (human labour) |
| Alert engine compute | Rule evaluation on telemetry stream | Low |
| ServiceNow integration | Enterprise ITSM licensing | High at enterprise scale |
| False positive alert investigation | Engineer time investigating non-incidents | Medium if uncontrolled |
Scaling Risks: Alert fatigue is the primary risk. If too many P3 alerts fire for minor fluctuations, engineers begin ignoring alerts. Maintain false positive rate < 10% at each severity level — review alert thresholds monthly.
Optimisations:
- Consolidate low-severity alerts into a daily digest instead of individual notifications
- Auto-resolve P3 tickets if metrics self-recover within 30 minutes without intervention
- Use composite alert rules (multiple conditions AND) for P1 to reduce false positives
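The false positive target and the optimisations above reduce to a few small checks. The following sketch is illustrative: the function names are hypothetical, and the individual conditions fed into a composite rule are left to the implementer.

```python
from datetime import timedelta
from typing import Mapping, Optional

FALSE_POSITIVE_TARGET = 0.10               # keep the rate below 10% per severity
AUTO_RESOLVE_WINDOW = timedelta(minutes=30)


def composite_fires(conditions: Mapping[str, bool]) -> bool:
    """AND of all named conditions; an empty rule never fires."""
    return bool(conditions) and all(conditions.values())


def false_positive_rate(alerts_fired: int, false_positives: int) -> float:
    """Share of fired alerts later judged to be non-incidents."""
    return false_positives / alerts_fired if alerts_fired else 0.0


def needs_tuning(alerts_fired: int, false_positives: int) -> bool:
    """True when the false positive rate has reached or passed the 10% target."""
    return false_positive_rate(alerts_fired, false_positives) >= FALSE_POSITIVE_TARGET


def auto_resolve(severity: str, recovered_after: Optional[timedelta], intervened: bool) -> bool:
    """P3 tickets close automatically if metrics recover within 30 minutes untouched."""
    return (severity == "P3" and not intervened
            and recovered_after is not None
            and recovered_after <= AUTO_RESOLVE_WINDOW)
```

A composite P1 rule might, for example, combine a threshold breach with a minimum request volume so that a ratio computed over a handful of requests cannot page anyone; the second condition is an assumption, not something the pattern prescribes.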
Indicative Cost Range
| Scale | AI Incidents/Month | Estimated Incident Management Cost/Month |
|---|---|---|
| Small | 5–20 | $2,000–$5,000 (mostly on-call time) |
| Medium | 20–100 | $8,000–$20,000 |
| Large | 100–500 | $25,000–$60,000 |
| Enterprise | 500+ | $50,000–$150,000 (dedicated AI SRE team) |
12. Trade-Off Analysis
Approach Comparison
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| AI-specific incident taxonomy + dedicated runbooks | Precise, actionable; regulator-defensible; enables metrics | Implementation overhead; requires AI-specific on-call training | Regulated industries; mature AI deployments; teams with dedicated platform engineering |
| Generic ITIL incident management (no AI extension) | Leverages existing tooling and processes; no incremental training | Cannot detect quality/compliance/cost incidents; insufficient for AI risk | Low-risk AI features as a temporary measure only |
| Vendor-managed AI monitoring (Datadog AI, Arize AI) | Faster time to value; managed alerting; some AI-specific detection built-in | Vendor lock-in; limited taxonomy customisation; regulatory defensibility weaker | Organisations lacking platform engineering capacity |
Architectural Tensions
| Tension | Description | Resolution |
|---|---|---|
| Alert sensitivity vs. Alert fatigue | Sensitive alerts catch real incidents early but generate false positives; engineers tune them out | Monthly alert tuning reviews; target false positive rate < 10% per severity |
| Speed vs. Completeness | Fast incident response means mitigating before root cause is known; may make things worse | Define "stabilise then diagnose" protocol: mitigate blast radius first, then root cause |
| Automation vs. Human judgment | Automated mitigation (rollback) is faster but may be wrong | Automated mitigation only for P2 and lower severity (P2/P3); P0 requires human decision; all mitigations logged |
| Transparency vs. Liability | Detailed PIRs are good governance but create documentation that may be discoverable | Legal review of PIR template; privilege consideration for legally significant incidents |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Alert rule misconfigured; incident not detected | Medium | Critical (missed P0/P1) | Synthetic canary test; periodic alert rule audit | Monthly alert rule validation with synthetic test events |
| On-call engineer lacks AI expertise to diagnose | Medium | High (MTTR extended) | On-call response time SLO breach | AI-specific on-call training; runbook automation for common diagnoses |
| Regulatory notification missed within APRA window | Low | Critical (regulatory sanction) | Compliance calendar alert; escalation trigger | Immediate APRA contact with explanation; internal RCA |
| PIR action items not completed | High | Medium (repeat incidents) | Action item ageing report; 30-day escalation | Weekly review of open PIR actions; escalation to engineering lead |
| Alert storm masks P0 amid P3 noise | Medium | High (P0 buried) | P0 count anomaly; human escalation from field | P0 alerting on separate, high-priority channel not shared with P3 noise |
Cascading Scenarios
- Scenario 1: AI model API outage → availability incident declared → quality incidents (hallucinations on degraded fallback model) not declared separately → quality SLO breach undetected during incident → post-incident review reveals quality degradation affected 50K users. Mitigation: availability and quality incidents are independent; quality monitoring continues during availability incidents.
- Scenario 2: Cost incident P2 ignored as "not urgent" → daily budget exceeded by 300% → monthly budget depleted in week 1 → AI features disabled for remainder of month → revenue impact. Mitigation: Cost incidents have business escalation path; P2 cost incidents require FinOps notification within 1 hour.
14. Regulatory Considerations
| Regulation | Clause | Requirement | AI Incident Management Implementation |
|---|---|---|---|
| APRA CPS 230 | Para 53 (Operational Risk Management) | Critical service disruptions must be identified, managed, and reported | AI availability and quality incidents affecting critical services mapped to CPS 230 notification |
| APRA CPS 230 | Para 57 (Incident Response) | Documented incident response procedures for material operational disruptions | This pattern provides AI-specific incident response procedures |
| APRA CPS 230 | Para 61 (Notification) | APRA notification within 24 hours for severe disruptions; 72 hours for material | Regulatory notification decision tree in every P0/P1 runbook |
| APRA CPS 234 | Para 36 (Cyber Incident Response) | Information security incidents detected and responded to within defined timeframes | Security incidents (injection, PII leak) have MTTD < 5 minutes, MTTR < 2 hours |
| Privacy Act 1988 (AU) | NDB Scheme (Part IIIC) | Eligible data breaches (including AI-driven) notified to OAIC and affected individuals | PII-in-output P0 incident triggers NDB assessment, to be completed within 30 days; notification as soon as practicable once an eligible data breach is confirmed |
| EU AI Act | Article 73 (Reporting of Serious Incidents) | Providers of high-risk AI must report serious incidents to national authorities | Serious AI incidents (physical/psychological harm) reported per Article 73; incident log maintained |
| ISO/IEC 42001 | Clause 10.2 (Nonconformity and Corrective Action) | Nonconformities must be corrected and root causes addressed | PIR process directly implements corrective action requirement |
| NIST AI RMF | MANAGE 4.3 | Documented procedures for AI incidents including recovery and learning | This pattern implements NIST MANAGE 4.3 in full |
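The notification windows in the table can be turned into concrete deadlines for the compliance calendar. The sketch below uses the 24-hour and 72-hour windows as stated in this document and the 2-hour internal assessment target from the Error Flow table; it assumes the clock starts at incident declaration, and the windows should be confirmed against the current text of the standard before use. The function and key names are hypothetical.

```python
from datetime import datetime, timedelta

# Windows as stated in this pattern; verify against the current regulatory text.
NOTIFICATION_WINDOW = {
    "severe": timedelta(hours=24),
    "material": timedelta(hours=72),
}
ASSESSMENT_WINDOW = timedelta(hours=2)  # internal target for P0/P1 incidents


def notification_deadlines(declared_at: datetime, disruption_class: str) -> dict:
    """Deadlines for the regulatory assessment and, if required, the notification.

    Assumption: the clock starts when the incident is declared.
    """
    return {
        "assessment_due": declared_at + ASSESSMENT_WINDOW,
        "notification_due": declared_at + NOTIFICATION_WINDOW[disruption_class],
    }
```

This only computes dates; consistent with the human approval requirement, the decision to notify remains with the compliance officer.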
15. Reference Implementations
AWS
- Alert Engine: CloudWatch Alarms + EventBridge rules; custom Lambda for AI-specific composite rules
- Incident Platform: PagerDuty with AWS EventBridge integration; ServiceNow with AWS Service Management Connector
- Runbook Automation: AWS Systems Manager Automation documents; PagerDuty Runbook Automation
- Incident Log: Amazon DynamoDB (incident events); Amazon S3 (PIR documents)
- Regulatory Notification: AWS Step Functions for notification decision workflow
- Dashboard: Amazon QuickSight MTTD/MTTR dashboard; CloudWatch operational dashboard
Azure
- Alert Engine: Azure Monitor Alerts; Azure Logic Apps for composite alert rules
- Incident Platform: PagerDuty with Azure Monitor integration; ServiceNow with Azure DevOps
- Runbook Automation: Azure Automation Runbooks; ITSM Connector for ServiceNow
- Incident Log: Azure Cosmos DB (events); Azure Blob Storage (documents)
- Regulatory Notification: Power Automate workflow for notification decision tree
- Dashboard: Azure Monitor Workbooks; Power BI MTTD/MTTR reports
GCP
- Alert Engine: Cloud Monitoring Alerting; Cloud Functions for composite rules
- Incident Platform: PagerDuty with Google Cloud integration; ServiceNow
- Runbook Automation: Cloud Run jobs; PagerDuty Runbook Automation
- Incident Log: Firestore (events); Cloud Storage (documents)
- Regulatory Notification: Cloud Workflows for notification decision workflow
- Dashboard: Looker; Cloud Monitoring dashboards
On-Premises
- Alert Engine: Prometheus Alertmanager with custom AI alert rules; Grafana alerting
- Incident Platform: OpsGenie (SaaS); Jira Service Management (self-hosted)
- Runbook Automation: Rundeck; Ansible playbooks for common mitigations
- Incident Log: PostgreSQL (events); Confluence (documents)
- Regulatory Notification: Manual workflow with compliance team; tracked in GRC tool
- Dashboard: Grafana operational dashboard
16. Related Patterns
| Pattern ID | Pattern Name | Relationship | Notes |
|---|---|---|---|
| EAAPL-OBS001 | AI Telemetry Architecture | Foundation | All detection rules consume metrics and logs from this pattern |
| EAAPL-OBS002 | Prompt Monitoring | Feeds Into | Security incidents (injection, PII) detected by OBS002; routed here |
| EAAPL-OBS003 | Hallucination Detection | Feeds Into | Quality incidents triggered by hallucination rate alerts from OBS003 |
| EAAPL-OBS005 | Model Drift Detection | Feeds Into | Drift alerts generate P2/P1 quality incidents in this framework |
| EAAPL-OBS006 | AI Cost Observability | Feeds Into | Cost incidents triggered by OBS006 budget alerts |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Adoption Breadth | 3 | AI-specific incident management adopted by regulated industries; general market still maturing |
| Tooling Ecosystem | 4 | PagerDuty/OpsGenie/ServiceNow mature; AI-specific alert rules and runbooks are custom |
| Operational Runbook Coverage | 3 | Generic runbooks well-established; AI-specific runbooks require custom development |
| Regulatory Evidence | 5 | APRA CPS 230 and Privacy Act NDB scheme are mature and well-understood obligations |
| Cost Predictability | 4 | Incident management platform costs are predictable; on-call labour costs are variable |
| Team Skill Availability | 4 | SRE/incident management skills broadly available; AI-specific extensions require training |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-06-12 | EAAPL Working Group | Initial publication |