[EAAPL-SEC008] Secrets Management for AI
Category: Security / Credential Management
Sub-category: API Key and Token Lifecycle
Version: 2.0
Maturity: Mature
Tags: secrets-management api-keys vault dynamic-secrets rotation audit credential-hygiene
Regulatory Relevance: APRA CPS234 §22, ISO 27001 A.9.4, NIST CSF PR.AC-1, EU AI Act Art. 9, SOC 2 CC6.1
1. Executive Summary
Secrets Management for AI addresses one of the most prevalent and consequential security failures in enterprise AI deployments: the mishandling of model API keys, service tokens, and credentials used by AI systems. Leaked or improperly managed API keys for commercial LLM providers (OpenAI, Anthropic, Azure OpenAI) give attackers the ability to generate costs at the organisation's expense, access data sent in API calls, and exfiltrate model outputs — with no attribution back to the attacker.
The business risk is material and immediate. A single leaked sk- OpenAI API key has been used to generate tens of thousands of dollars in API charges within hours of exposure. AI API keys embedded in mobile applications, JavaScript bundles, GitHub repositories, or CI/CD pipeline logs represent standing vulnerabilities that can be exploited at any time, from anywhere, with no prior access to the organisation's infrastructure.
This pattern establishes the complete lifecycle for AI credentials: vault storage and retrieval (no secrets in code or environment variables), dynamic secret generation where supported, automated rotation, granular access control, and comprehensive audit logging of every secret access event. For VITE_-prefixed or client-side secret exposure in particular, this pattern defines the architectural controls that prevent secrets from reaching user-accessible environments — a critical failure mode in frontend AI applications.
2. Problem Statement
Business Problem
AI model API keys are high-value credentials: they grant direct access to model inference, with billing charged to the owner. They are also widely mishandled — treated as configuration values rather than secrets, hard-coded in source code, embedded in build artifacts, and shared across teams without access controls. Every insecure placement is a potential financial and security exposure.
Beyond cost, AI API keys may be used to submit requests that the organisation would not authorise — generating harmful content, extracting information, or probing model capabilities in ways that create legal and reputational risk, with the charges appearing on the organisation's bill.
Technical Problem
Common failure modes in AI credential management:
- Hard-coded in source code: OPENAI_API_KEY = "sk-..." in application code, committed to Git history.
- Environment variables in containers: Accessible to any process in the container; visible in orchestrator logs; leaked in crash dumps.
- VITE_-prefixed secrets in frontend builds: Build tools like Vite bundle VITE_* environment variables into client-side JavaScript, making them accessible to every user who loads the page.
- Shared keys across environments: Same key used in dev, staging, and production, so a leaked dev key gives production access.
- No rotation: API keys never rotated; a key compromised 6 months ago may still be valid.
- No audit: No record of which systems use which keys; key compromise cannot be scoped.
- Broad-permission keys: Using an OpenAI "all models" key for an application that only needs GPT-3.5-turbo.
Symptoms
- AI API keys in Git history (discoverable via git log -S "sk-").
- VITE_OPENAI_API_KEY in production JavaScript bundles.
- Unexplained spikes in model API spend (possible key misuse).
- No centralised record of which applications hold which API keys.
- Application secrets in CI/CD pipeline logs.
- Keys with no expiry date and no rotation history.
Cost of Inaction
| Dimension | Impact |
|---|---|
| Financial | Unauthorised API usage; $10K–$100K+ costs generated by a single leaked key |
| Security | Attacker can query model with organisation's key; results and charges attributed to organisation |
| Regulatory | APRA CPS234 requires access control for information assets — unmanaged API keys violate this |
| Data | API call contents may be logged by provider — attacker using leaked key can submit data that appears in provider audit logs attributed to the organisation |
| Operational | Key rotation requires emergency application redeployments; no rotation history means incident scoping is impossible |
3. Context
When to Apply
- Any application, service, or pipeline that holds credentials for AI model providers.
- CI/CD pipelines that need model API access for testing or evaluation.
- Frontend or mobile applications that call AI APIs — the frontend must NEVER hold model provider credentials directly.
- Multi-environment deployments (dev/staging/prod) needing credential isolation.
- Teams managing more than one AI model provider integration.
When NOT to Apply
- Single-developer local development with credentials that never leave the developer's machine and are isolated to their personal API account.
Prerequisites
| Prerequisite | Detail |
|---|---|
| Vault Infrastructure | HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault |
| Identity Provider | OIDC/SAML IdP for human user authentication to vault |
| Workload Identity | Kubernetes Service Accounts or cloud-managed identity for application-level vault access |
| AI Gateway (EAAPL-SEC001) | All AI API calls proxied through gateway; credentials held only by gateway |
Industry Applicability
| Industry | Applicability | Key Driver |
|---|---|---|
| All industries | Critical | Universal risk — AI API keys are credentials regardless of industry |
| Financial Services | Critical | APRA CPS234 explicit credential management requirements |
| Healthcare | Critical | Credentials protecting PHI-containing AI pipelines |
| Government | Critical | Classified system access controls |
| Technology / SaaS | High | Developer tooling and CI/CD credential exposure risk |
4. Architecture Overview
The secrets management architecture for AI systems is built on a single foundational principle: model provider credentials must never be present outside the vault and the AI Gateway's runtime memory. Every other location — source code, environment variables, build artifacts, logs, client bundles — is prohibited.
Vault as Single Source of Truth
All AI credentials are stored in a centralised vault. The vault is the only place where credentials exist at rest. Applications do not store credentials — they request them at runtime. This separation is enforced through:
- Architecture policy: no secret values in application configuration files or source code.
- Automated scanning: CI/CD pipeline executes a secret-grep step that fails the build if any known secret pattern (OpenAI sk-, Anthropic sk-ant-, sbp_ for Supabase) appears in source files, build artifacts, or environment variable names with VITE_ prefix.
- Vault access control: applications can only read the specific secrets they need, verified by workload identity.
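The secret-grep step above can be sketched as follows. This is a minimal illustration, not the gate itself: the prefixes come from the list above, but the character classes, the minimum lengths, and the names `SECRET_PATTERNS`, `Finding`, and `scan_text` are assumptions and do not reflect provider-documented key formats. A production gate would more likely wrap TruffleHog, GitLeaks, or detect-secrets.

```python
import re
from typing import List, NamedTuple

# Illustrative patterns only: the prefixes are taken from the pattern text,
# the character classes and minimum lengths are assumptions.
SECRET_PATTERNS = {
    "anthropic": re.compile(r"sk-ant-[A-Za-z0-9_\-]{20,}"),
    "openai": re.compile(r"sk-(?!ant-)[A-Za-z0-9_\-]{20,}"),
    "supabase": re.compile(r"sbp_[A-Za-z0-9]{20,}"),
}


class Finding(NamedTuple):
    line_no: int
    kind: str


def scan_text(text: str) -> List[Finding]:
    """Return one finding per (line, pattern) match; the secret value is never echoed."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append(Finding(line_no, kind))
    return findings
```

A CI step could call `scan_text` on each changed file and fail the build when the returned list is non-empty, reporting only the line number and pattern kind so that the log itself does not leak the key.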
Dynamic Secrets Where Supported
Some AI providers support dynamic or short-lived credential generation:
- AWS Bedrock: Uses IAM roles via assume_role; credentials are time-limited by STS.
- Azure OpenAI: Uses Azure Managed Identity; credentials retrieved from Azure AD, never static keys.
- Anthropic, OpenAI: Do not currently support dynamic credentials; static API keys must be managed and rotated by the organisation.
For providers requiring static keys, vault rotation is the control: vault stores the key, tracks its age, and triggers rotation workflows. Rotation involves: generating a new key via the provider's API, updating vault, confirming applications are using the new key, and revoking the old key.
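The four rotation steps can be sketched as an ordered workflow. This is a minimal sketch under stated assumptions: the function `rotate_static_key` and its four callables are hypothetical stand-ins for the provider's key-management API and the vault client, neither of which is specified in this pattern.

```python
from typing import Callable, Dict


def rotate_static_key(
    create_key: Callable[[], Dict[str, str]],    # hypothetical: returns {"id": ..., "value": ...}
    write_to_vault: Callable[[str], None],       # hypothetical vault write
    consumers_confirmed: Callable[[str], bool],  # hypothetical: True once all consumers use the key id
    revoke_key: Callable[[str], None],           # hypothetical provider-side revocation
    old_key_id: str,
) -> str:
    """Run the four rotation steps in order; leave the old key active if confirmation fails."""
    new_key = create_key()                       # 1. generate a new key at the provider
    write_to_vault(new_key["value"])             # 2. update vault with the new value
    if not consumers_confirmed(new_key["id"]):   # 3. confirm applications have switched
        raise RuntimeError("consumers not confirmed; old key left active")
    revoke_key(old_key_id)                       # 4. revoke the old key only after confirmation
    return new_key["id"]
```

The ordering is the design point: revocation is the last step, so a failed confirmation leaves two valid keys for a short period rather than zero.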
Key Scoping and Least Privilege
Every AI application is issued credentials scoped to its specific requirements:
- A customer service bot that only uses gpt-4o-mini is issued a key with rate limits and, where supported, model access restrictions.
- A batch processing pipeline with no human interaction is issued a separate key from the real-time serving key, isolating blast radius if either is compromised.
- Separate keys for dev, staging, and production environments. Dev key has the smallest spending limit; production key has the strictest access controls.
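One way to check the scoping rules above against a secret inventory is sketched below. The `KeyAssignment` record and the `shared_keys` helper are hypothetical names introduced for illustration; the inventory format is an assumption, and only key identifiers (never key values) are handled.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class KeyAssignment:
    key_id: str        # provider-side key identifier, never the key value
    application: str
    environment: str   # e.g. "dev", "staging", "prod"


def shared_keys(assignments: Iterable[KeyAssignment]) -> Dict[str, List[str]]:
    """Map each key_id used by more than one application/environment pair to its users."""
    users: Dict[str, Set[str]] = defaultdict(set)
    for a in assignments:
        users[a.key_id].add(f"{a.application}/{a.environment}")
    return {key: sorted(who) for key, who in users.items() if len(who) > 1}
```

An empty result means every key is held by exactly one application in exactly one environment, which is the blast-radius property the list describes.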
Frontend Architecture — The Critical Rule
Frontend applications (React, Vue, Angular) and mobile applications must NEVER hold model provider API keys. The technical reason: any value in a JavaScript bundle or mobile binary is publicly accessible — to every user, to security researchers, and to attackers. The architecture must be:
Browser/Mobile → Application Backend → AI Gateway → Model Provider
The application backend holds the session context, authenticates the user, enforces application-level access controls, and proxies requests to the AI Gateway (which holds provider credentials). The browser never sees a model provider API key.
Build systems that use VITE_, REACT_APP_, EXPO_PUBLIC_, or similar prefixes bundle the prefixed variable into client-side code. No secret should ever have these prefixes. CI/CD must fail on any VITE_OPENAI_, REACT_APP_ANTHROPIC_, or similar pattern.
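The CI check on client-side prefixes can be sketched as a name-based filter. The prefixes are the ones named above; the `SECRET_HINTS` heuristic and the function name `client_exposed_secrets` are assumptions made for this example, and a real gate would tune the hint list to the organisation's naming conventions.

```python
import re
from typing import Iterable, List

# Prefixes named in the text above; bundlers inline variables that carry them.
CLIENT_PREFIXES = ("VITE_", "REACT_APP_", "EXPO_PUBLIC_")

# Hypothetical naming heuristic for "this variable holds a secret".
SECRET_HINTS = re.compile(r"(OPENAI|ANTHROPIC|API_KEY|SECRET|TOKEN)", re.IGNORECASE)


def client_exposed_secrets(env_names: Iterable[str]) -> List[str]:
    """Return env var names that are both client-bundled and secret-looking."""
    return sorted(
        name
        for name in env_names
        if name.upper().startswith(CLIENT_PREFIXES) and SECRET_HINTS.search(name)
    )
```

The check inspects variable names only, so it can run before the build without ever reading secret values; any non-empty result should fail the pipeline.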
Audit Logging
Every secret read from vault generates an immutable audit record: which application, which secret path, which accessor (human or workload identity), timestamp, and IP/network source. This enables:
- Scoping of a credential compromise (which systems accessed the key in the last 90 days).
- Detection of anomalous access patterns (secret accessed at 3am from an unusual IP).
- Compliance evidence (APRA CPS234 requires audit trails for access to critical information assets).
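Compromise scoping over audit records might look like the sketch below. The `AuditRecord` fields mirror the attributes listed above, but the record layout and the helper `accessors_in_window` are assumptions; an actual vault audit log has its own schema and would normally be queried through a SIEM.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Set


@dataclass(frozen=True)
class AuditRecord:
    application: str
    secret_path: str
    accessor: str        # human or workload identity
    timestamp: datetime
    source_ip: str


def accessors_in_window(
    records: Iterable[AuditRecord], secret_path: str, now: datetime, days: int = 90
) -> Set[str]:
    """Return every accessor that read secret_path within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return {
        r.accessor
        for r in records
        if r.secret_path == secret_path and cutoff <= r.timestamp <= now
    }
```

The returned set is the list of identities to investigate when the key at that path is suspected compromised.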
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Vault | Secrets Store | Authoritative store for all AI credentials; access control; audit logging | HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, GCP Secret Manager | Critical |
| Vault Access Policies | Policy | Defines which workloads can access which secret paths | HashiCorp Vault Policies, AWS IAM, Azure RBAC | Critical |
| Dynamic Credential Provider | Identity | Generates time-limited credentials for supported providers (AWS Bedrock, Azure OpenAI) | AWS STS, Azure Managed Identity, Vault AWS Secrets Engine | High |
| Rotation Orchestrator | Automation | Automates key rotation: generate, update vault, confirm, revoke old | Vault Agent rotation, AWS Secrets Manager auto-rotation, custom Lambda/Function | High |
| Secret Grep CI Gate | CI/CD Security | Scans source code and build artifacts for secret patterns; fails build on detection | TruffleHog, GitLeaks, detect-secrets, custom regex-based gate | Critical |
| Frontend Secret Enforcer | Build Security | Fails build if VITE_*, REACT_APP_*, or similar prefix is applied to a secret | Custom CI step, ESLint plugin for env var naming conventions | Critical |
| Vault Audit Log | Compliance | Immutable, append-only record of all vault access events | Vault Audit Backend → Kafka → S3 Object Lock | Critical |
| Anomaly Detector | Security | Monitors vault access patterns for anomalous events | Vault + Splunk/Datadog SIEM integration; custom access pattern alert | High |
| Secret Version Manager | Lifecycle | Tracks secret versions; enables rollback; maintains rotation history | Vault versioned KV, AWS Secrets Manager versioning | High |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | AI Gateway startup | Authenticates to Vault using Kubernetes Service Account OIDC token | Vault token with gateway's policy attached |
| 2 | AI Gateway | Reads model provider API key from vault path /ai/providers/{provider}/keys/{application} | API key in runtime memory; never written to disk |
| 3 | Vault | Records audit event: gateway workload identity, secret path, timestamp | Immutable audit record |
| 4 | AI Gateway | Holds key in memory; sets timer for renewal before expiry (at 75% of lease TTL) | Key available for model API calls |
| 5 | AI Gateway | Makes model API call using key stored in memory; key value never appears in logs | Successful model API call |
| 6 | Rotation Orchestrator | At rotation schedule (or vault lease expiry): generates new key via provider API | New key value |
| 7 | Rotation Orchestrator | Writes new key to vault; updates version | New version active in vault |
| 8 | AI Gateway | On next renewal cycle: reads new key; begins using new key | Seamless key rotation |
| 9 | Rotation Orchestrator | After confirmation that all consumers have switched: revokes old key via provider API | Old key invalidated |
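Steps 4, 5, and 8 of the flow above can be sketched as an in-memory key holder. This is a simplified illustration: the class `InMemoryKeyHolder` and its `fetch` callable are hypothetical, and renewal here happens lazily on the next access after 75% of the TTL has elapsed rather than through a background timer.

```python
import time
from typing import Callable, Optional, Tuple


class InMemoryKeyHolder:
    """Keeps a key in process memory and re-reads it once 75% of the lease TTL has elapsed."""

    RENEW_FRACTION = 0.75

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],  # hypothetical vault read: (key_value, ttl_seconds)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self._key: Optional[str] = None
        self._renew_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._key is None or now >= self._renew_at:
            # Re-read from vault; this is also how a rotated key is picked up.
            self._key, ttl = self._fetch()
            self._renew_at = now + ttl * self.RENEW_FRACTION
        return self._key

    def __repr__(self) -> str:
        # Never expose the key value through logging or debugging output.
        return "InMemoryKeyHolder(<redacted>)"
```

Injecting the clock keeps the renewal logic testable, and the redacted `__repr__` reflects the requirement that the key value never appears in logs.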
Error Flow
| Error | Handling | Alert |
|---|---|---|
| Secret grep gate detects API key in source | Build fails; PR blocked; developer notified | Security: API key pattern in source code |
| VITE_* prefix on secret detected | Build fails immediately | Critical: client-side secret exposure risk |
| Vault unavailable at gateway startup | Gateway fails to start; P1 alert | Critical: secrets infrastructure unavailable |
| Vault lease expiry before renewal | Gateway evicts expired key; new requests fail until key renewed | High: key expiry |
| Rotation fails (provider API error) | Alert rotation team; maintain old key temporarily; retry with backoff | High: rotation failure |
| Anomalous vault access (unusual hours/IP) | SIEM alert; potential credential investigation | Security: anomalous credential access |
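The "retry with backoff" handling for a failed rotation can be sketched as below. The helper name `retry_with_backoff`, the default attempt count, and the delay values are assumptions chosen for illustration; the pattern does not prescribe them.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op until it succeeds, doubling the wait after each failure up to max_delay."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error so an alert can fire
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

While retries are in progress the old key stays in place, which matches the "maintain old key temporarily" handling in the table.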
8. Security Considerations
Authentication & Authorisation
- Applications authenticate to Vault using workload identity (Kubernetes SA, AWS IAM role, Azure Managed Identity) — no human-visible credentials required to access vault.
- Human operators access vault via SSO + MFA; break-glass access with dual approval and automatic alerting.
- Vault policies grant access to specific secret paths only — an application cannot read secrets outside its namespace.
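The path-scoping rule in the last bullet can be illustrated with a deny-by-default check. This is a conceptual sketch only, not Vault's policy language or evaluation logic: the `POLICIES` mapping, the workload identity names, and `can_read` are hypothetical.

```python
from typing import Dict, List

# Hypothetical workload identities and the secret paths they may read.
# A trailing "/" means "anything under this prefix"; otherwise the match is exact.
POLICIES: Dict[str, List[str]] = {
    "ai-gateway-prod": ["ai/providers/openai/keys/", "ai/providers/anthropic/keys/"],
    "eval-pipeline-ci": ["ai/providers/openai/keys/eval-pipeline"],
}


def can_read(identity: str, secret_path: str) -> bool:
    """Deny by default; allow only exact paths or paths under an allowed prefix."""
    if ".." in secret_path.split("/"):
        return False  # refuse path traversal segments outright
    for allowed in POLICIES.get(identity, []):
        if allowed.endswith("/"):
            if secret_path.startswith(allowed):
                return True
        elif secret_path == allowed:
            return True
    return False
```

Unknown identities get an empty policy list and are therefore denied, which is the behaviour the bullet describes for reads outside an application's namespace.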
Secrets Management (Meta)
- The vault itself requires protection: vault unseal keys are split using Shamir's secret sharing (e.g., 3-of-5) and distributed to security officers.
- Vault root token is generated once, used to configure vault, and then revoked; normal operations use lower-privileged tokens only.
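To show what a 3-of-5 split means in practice, the sketch below implements textbook Shamir secret sharing over a prime field. It is for illustration only: Vault performs the split itself during initialisation, and the field size (`PRIME`) and function names here are assumptions, not Vault internals.

```python
import secrets
from typing import List, Tuple

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this


def split_secret(secret: int, threshold: int = 3, shares: int = 5) -> List[Tuple[int, int]]:
    """Split secret into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner's rule, modulo PRIME
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: List[Tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 recovers the constant term."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total
```

Each security officer would hold one `(x, y)` point; any three points passed to `recover_secret` reproduce the original value, while fewer than three do not determine it.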
Data Classification
- Secret paths in vault are classified: ai/providers/openai/prod-keys is RESTRICTED. Any access to a RESTRICTED path outside business hours generates an alert.
Encryption
- All vault secrets encrypted at rest with AES-256-GCM; encryption keys managed by vault's built-in seal (or externally via AWS KMS, Azure Key Vault HSM).
- All communication to vault over TLS 1.3.
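- Vault audit log: audit devices hash sensitive values in each entry with HMAC-SHA256; tamper-evidence relies on shipping entries to append-only storage (see the Vault Audit Log component).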
OWASP LLM Top 10 Coverage
| OWASP LLM Risk | Secrets Management Mitigation | Coverage |
|---|---|---|
| LLM01: Prompt Injection | Not applicable | None |
| LLM02: Insecure Output Handling | Not applicable | None |
| LLM03: Training Data Poisoning | Not applicable | None |
| LLM04: Model Denial of Service | Scoped keys with provider-side rate limits prevent unlimited model use | Medium |
| LLM05: Supply Chain Vulnerabilities | Vault-managed credentials prevent supply chain compromise of credential stores | High |
| LLM06: Sensitive Information Disclosure | API keys never in logs or client code eliminates a key exfiltration vector | High |
| LLM07: Insecure Plugin Design | Tool-specific credentials scoped per tool reduce blast radius | Medium |
| LLM08: Excessive Agency | Credential scoping limits what model provider capabilities an application can access | Medium |
| LLM09: Overreliance | Not applicable | None |
| LLM10: Model Theft | Credentials not exposed to end users; cannot be used to systematically query model | High |
9. Governance Considerations
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Secret Inventory | Security Team | Updated with each new integration | Complete inventory of all AI credentials, their owners, and their rotation schedules |
| Rotation Schedule | AI Platform | Monthly review | Ensures all static keys are on rotation schedule |
| Vault Access Audit Report | Security Operations | Monthly | Identifies anomalous access patterns |
| CI Gate Violation Log | DevSecOps | Continuous | Records all build failures due to secret exposure detection |
| Key Compromise Response Runbook | Security Team | Reviewed quarterly | Step-by-step response to detected key compromise |
10. Operational Considerations
SLOs
| SLO | Target | Measurement |
|---|---|---|
| Secret retrieval latency (p99) | <50ms | Vault read latency metric |
| Key rotation success rate | >99.9% | Rotation job success/failure metric |
| Time from compromise detection to revocation | <15min | Elapsed time from detection alert to provider-side revocation, per key compromise incident |
| Vault availability | 99.99% | Vault health check uptime |
| Secret grep CI gate execution time | <30s | CI pipeline step timing |
Incident Management
Key Compromise Response (15-minute target):
- Receive alert (automated detection or developer report).
- Immediately revoke key via provider API.
- Generate and vault new key.
- Verify all consumers have picked up new key (vault lease renewal cycle or forced restart).
- Scope the incident: review vault audit log for all accesses using the compromised key path; review model provider's usage logs for anomalous queries.
- File incident report; update runbook.
11. Cost Considerations
Cost Drivers
| Cost Driver | Description | Relative Impact |
|---|---|---|
| Vault infrastructure | HashiCorp Vault Enterprise or cloud-native equivalent | Medium |
| Rotation automation | Engineering for rotation workflows | Medium (one-time) |
| CI secret scanning | Adds 10–30s to build pipeline; negligible compute cost | Very Low |
| Audit log storage | Vault audit log grows with access volume | Low |
Indicative Cost Range
| Scale | Monthly Cost (USD) | Notes |
|---|---|---|
| Small | $200–$600 | Cloud-native secrets manager (AWS Secrets Manager ~$0.40/secret/month + $0.05/10K API calls) |
| Medium | $800–$2,500 | HashiCorp Vault Enterprise (or HCP Vault) |
| Large | $3,000–$10,000 | HashiCorp Vault Enterprise; dedicated HSM for seal; multi-region HA |
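The per-secret and per-call rates quoted in the Small row can be turned into a quick estimate. The function name and the example volumes are hypothetical; the calculation covers only those two charges, not the other contributors to the ranges in the table.

```python
def secrets_manager_monthly_cost(
    secret_count: int,
    api_calls: int,
    per_secret: float = 0.40,      # USD per secret per month, as quoted above
    per_10k_calls: float = 0.05,   # USD per 10,000 API calls, as quoted above
) -> float:
    """Estimate the monthly storage plus API-call charge in USD."""
    return round(secret_count * per_secret + (api_calls / 10_000) * per_10k_calls, 2)


# Example: 50 secrets read 2,000,000 times in a month.
estimate = secrets_manager_monthly_cost(50, 2_000_000)  # 50 * 0.40 + 200 * 0.05 = 30.0
```

The API-call term grows with how often consumers re-read secrets, so holding keys in memory between renewals keeps this charge small.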
12. Trade-Off Analysis
Option Comparison
| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| A: Environment variables only | Secrets in container env vars | Simple; widely supported | Visible in orchestrator; leaked in crash dumps; no rotation | Development only — never production |
| B: Cloud-native secrets manager | AWS Secrets Manager / Azure Key Vault | Managed; auto-rotation supported; low operational overhead | Vendor lock-in; per-secret cost at scale | Cloud-committed; small–medium secret count |
| C: HashiCorp Vault (this pattern) | Self-hosted or HCP Vault | Full-featured; dynamic secrets; multi-cloud; FIPS 140-2 | Operational complexity; self-hosted has ops burden | Enterprise; regulated; multi-cloud |
| D: CI/CD secret injection only | Secrets injected at deployment; not held at runtime | No runtime vault dependency | Secrets in CI logs risk; no dynamic rotation; not suitable for long-running services | Short-lived batch jobs only |
Architectural Tensions
| Tension | Trade-Off |
|---|---|
| Rotation Frequency vs Stability | More frequent rotation reduces exposure window but increases rotation failure risk and operational complexity. Resolution: rotate every 90 days for static keys; dynamic credentials where possible. |
| Dynamic vs Static Credentials | Dynamic credentials are safer but require provider API support. Resolution: use dynamic where available (AWS Bedrock, Azure OpenAI via MI); manage rotation for static (OpenAI, Anthropic). |
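The 90-day resolution for static keys can be checked mechanically against rotation history. The helper names and the inventory shape (key name mapped to last rotation time) are assumptions for this sketch.

```python
from datetime import datetime, timedelta
from typing import Dict, List

ROTATION_INTERVAL = timedelta(days=90)  # resolution stated in the table above


def rotation_due(
    last_rotated: datetime, now: datetime, interval: timedelta = ROTATION_INTERVAL
) -> bool:
    """True when the key has reached or passed its rotation interval."""
    return now - last_rotated >= interval


def overdue_keys(last_rotation_by_key: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the names of all keys that are due for rotation, sorted for stable reporting."""
    return sorted(k for k, ts in last_rotation_by_key.items() if rotation_due(ts, now))
```

Running `overdue_keys` on a schedule gives the Rotation Orchestrator its work list and produces evidence for the monthly rotation schedule review.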
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Key leaked via Git history | High (industry-wide) | Critical | Secret scanning on push; TruffleHog in CI | Immediately revoke; force-push removal from history; rotate all secrets in affected repo |
| VITE_* secret in production bundle | High (common mistake) | Critical | CI gate on build; runtime CSP violation detector | Emergency redeployment; revoke exposed key; new key with correct prefix-free naming |
| Vault HA failure | Low | Critical | Vault health metrics | Vault HA cluster (3-node Raft); multi-AZ |
| Rotation failure (provider API down) | Medium | Medium | Rotation job failure alert | Retry with backoff; extend key validity if possible; manual rotation runbook |
| Secret path misconfiguration (wrong app reads wrong key) | Low | High | Vault audit log anomaly | Vault policy fix; immediate key rotation |
14. Regulatory Considerations
| Regulation | Requirement | Implementation |
|---|---|---|
| APRA CPS234 §22 | Manage access to systems according to information sensitivity | Vault policies + workload identity implement CPS234 §22 access management |
| ISO 27001 A.9.4 (System and Application Access Control) | Prevent unauthorised access to systems and applications | Vault access control + MFA for human access |
| SOC 2 CC6.1 | Logical access controls | Vault policies + audit log provide evidence for CC6.1 |
| NIST CSF PR.AC-1 | Identities and credentials are managed for authorised devices and users | Vault secret lifecycle management implements PR.AC-1 |
| GDPR Art. 32 | Appropriate technical security measures | API key management is a technical security measure protecting AI systems that may process personal data |
15. Reference Implementations
AWS
| Component | AWS Service |
|---|---|
| Secret storage | AWS Secrets Manager (auto-rotation for supported services) |
| Dynamic credentials | AWS IAM + STS for Bedrock; no static key needed |
| Application access | IAM Roles for Service Accounts (IRSA) for EKS |
| CI gate | GitHub Actions + detect-secrets; CodeBuild phase |
| Audit | CloudTrail + CloudWatch Logs |
| Rotation | Secrets Manager Lambda rotation function |
Azure
| Component | Azure Service |
|---|---|
| Secret storage | Azure Key Vault |
| Dynamic credentials | Azure Managed Identity for Azure OpenAI (no static key) |
| Application access | AKS Workload Identity + Key Vault CSI driver |
| Audit | Azure Monitor + Azure AD Audit Logs |
| Rotation | Key Vault rotation policies |
On-Premises
| Component | Technology |
|---|---|
| Secret storage | HashiCorp Vault (Raft HA) |
| Dynamic credentials | Vault AWS Secrets Engine (for cloud providers) |
| Application access | Kubernetes Auth Method (OIDC JWT) |
| CI gate | TruffleHog in Jenkins/GitLab CI; pre-commit hooks on developer machines |
| Audit | Vault Audit Backend → Kafka → Elasticsearch |
| Rotation | Vault Agent + custom rotation scripts |
16. Related Patterns
| Pattern | ID | Relationship |
|---|---|---|
| AI Gateway | EAAPL-SEC001 | Gateway is the primary runtime holder of model provider credentials |
| Model Isolation | EAAPL-SEC003 | Model isolation's secret sidecar depends on SEC008 vault infrastructure |
| Secure Tool Invocation | EAAPL-SEC004 | Per-tool JIT credentials are issued from the vault infrastructure in SEC008 |
| Zero-Trust AI Pipeline | EAAPL-SEC007 | JIT access pillar of SEC007 is underpinned by SEC008 vault |
| AI Data Classification | EAAPL-SEC009 | Secret classification levels stored in vault metadata |
17. Maturity Assessment
Overall Maturity: Mature
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Pattern definition clarity | 5 | Well-understood problem with clear, proven solutions |
| Technology availability | 5 | Vault, Secrets Manager, Key Vault are all production-ready, battle-tested |
| Industry adoption | 4 | Vault and cloud-native secrets managers widely adopted; AI-specific guidance less common |
| CI secret scanning | 5 | TruffleHog, GitLeaks, detect-secrets are mature tools |
| Regulatory alignment | 5 | Directly maps to APRA CPS234, ISO 27001, SOC 2 |
| Developer experience | 3 | Vault integration requires application-level code changes; friction point for adoption |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-01-10 | Security Architecture Team | Initial pattern definition |
| 1.1 | 2024-04-25 | Security Architecture Team | Added frontend/VITE_ critical guidance; expanded CI gate detail |
| 2.0 | 2025-02-15 | Security Architecture Team | Major revision: added dynamic credential architecture; APRA mapping; key compromise runbook; updated to reflect production incidents |