EAAPL (Enterprise AI Architecture Pattern Library)

Privacy by Design for AI Data

🗄️ Data Architecture · EU AI Act · ISO/IEC 42001 · 🏭 Field-tested in AU · 21 signals · Q2 2026

[EAAPL-DAT005] Privacy by Design for AI Data

Category: Data Architecture
Sub-category: Privacy Engineering / AI Data Pipelines
Version: 1.2
Maturity: Proven
Tags: privacy-by-design, data-minimisation, pseudonymisation, consent-management, machine-unlearning, purpose-limitation
Regulatory Relevance: GDPR Articles 5/17/25, Privacy Act (Australia) APP 1–13, EU AI Act Article 10, ISO 27701, ISO 42001 §6.1, NIST AI RMF GOVERN-1.7


1. Executive Summary

Privacy-by-design mandates that privacy controls are embedded into AI data pipelines at every stage — not bolted on retrospectively. For AI systems, this is uniquely challenging: the same model that must protect personal data also depends on it for predictive accuracy. This pattern resolves that tension through five privacy engineering techniques applied systematically across the AI data lifecycle.

Data minimisation at collection ensures only data strictly necessary for the AI purpose is ingested. Purpose limitation controls enforce that data collected for one AI use case cannot be used for another without explicit governance approval. Pseudonymisation pipelines protect identity while preserving the statistical relationships models need. Consent management integration gates data use on current consent status. Machine unlearning mechanisms respond to right-to-erasure requests by identifying and selectively removing the influence of specific individuals' data from trained models.

Organisations that implement this pattern reduce regulatory exposure for AI programmes by 60–80% (measured by privacy impact assessment findings), enable AI use cases on sensitive data that would otherwise be prohibited, and establish a defensible posture for privacy regulator inquiries.

Target audience: Chief Privacy Officers, Data Protection Officers, AI Architects, ML Platform leads.


2. Problem Statement

Business Problem

AI programmes operating on personal data face escalating regulatory scrutiny. A single privacy incident in an AI system — training on consent-withdrawn data, using data outside stated purpose, or failing to honour erasure requests — can result in regulatory fines, reputational damage, and programme shutdown.

Technical Problem

  • Most AI pipelines treat privacy as a compliance checkbox (PII masking before data warehouse load) rather than an architectural property.
  • Purpose limitation is not technically enforced: once data is in a data lake or feature store, downstream AI uses are not validated against the original consent scope.
  • Right-to-erasure requests are handled for operational databases, but the privacy team cannot identify whether a data subject's records were used in AI training.
  • Consent management systems are siloed from AI training pipelines; consent withdrawal does not propagate to feature stores or retrain queues.
  • Pseudonymisation is often implemented inconsistently across teams — some pipelines re-identify pseudonymous records by joining with other datasets.

Symptoms

  • Privacy Impact Assessment finds AI programme is using customer data beyond consented purposes.
  • Data subject requests right-to-erasure; organisation cannot determine if subject data was used in AI training.
  • Consent withdrawal not reflected in AI inference; system continues to make predictions using withdrawn-consent data.
  • Regulatory audit finds data minimisation principle not applied to AI training data.
  • Privacy review is a deployment gate that consistently delays AI releases by 4–8 weeks.

Cost of Inaction

Dimension | Impact
Regulatory | GDPR Art. 83: fines up to €20M or 4% of global annual turnover, whichever is higher; Privacy Act civil penalties
Reputational | Consumer trust erosion if an AI privacy incident is publicised
Operational | Retroactive privacy remediation in production AI is 10–30× more expensive than design-time
Legal | Class action exposure for AI profiling on unlawfully processed data

3. Context

When to Apply

  • Any AI system processing personal information about identifiable individuals.
  • AI systems in jurisdictions subject to GDPR, Australian Privacy Act, CCPA, or equivalent.
  • AI systems where data subjects have consent rights or erasure rights.
  • Organisations building AI on sensitive personal data (health, financial, employment, location).
  • New AI programme initiation (most effective when applied at design stage).

When NOT to Apply

  • AI systems processing entirely synthetic or fully anonymised data (not personal information by definition).
  • AI systems operating only on clearly non-personal data (e.g., pure sensor telemetry with no individual linkage).
  • Research AI under specific statutory exemption from privacy regulation (specific carve-outs apply and must be legally verified).

Prerequisites

Prerequisite | Minimum Viable | Preferred
Privacy Impact Assessment capability | Ad hoc privacy review | Systematic DPIA process with templates
Consent management system | Simple opt-in/out database | Full consent management platform (OneTrust, Didomi)
Data catalogue | Spreadsheet | DataHub/Atlan with PII tagging
Pseudonymisation key management | Shared key per system | Dedicated KMS with per-subject keys
Legal counsel | Internal privacy counsel | DPO + external specialist counsel

Industry Applicability

Industry | Applicability | Driver
Healthcare | Critical | Sensitive health data; clinical AI; patient consent
Financial Services | Critical | Credit/fraud AI on personal financial data; APRA + GDPR
Retail | High | Customer purchase + behavioural data; personalisation AI
Telecommunications | High | Call records; location data; churn AI
Government | High | Citizen data; mandatory Privacy Act compliance
HR / Recruitment | Critical | Employment AI; GDPR Art. 9 special category data

4. Architecture Overview

Design Philosophy

Privacy-by-design for AI is implemented as a set of pipeline controls that operate at defined points in the data lifecycle, enforced architecturally rather than by policy alone. The five controls are applied in sequence from data collection through to model serving, and the architecture includes a dedicated Privacy Propagation Bus that carries consent and erasure signals across all pipeline stages.

Control 1 — Data Minimisation at Collection. Before any personal data enters the AI data pipeline, a minimisation gate evaluates whether each data element is strictly necessary for the declared AI purpose. This is enforced by a purpose-mapped schema: only fields with a documented necessity justification for the specific AI use case are allowed to pass. This is implemented as a schema enforcement step in the ingestion pipeline, not as a manual approval gate — the schema definition itself codifies the minimisation decision.
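A minimal sketch of a purpose-mapped schema acting as the minimisation gate, under stated assumptions: the purpose name `churn_prediction`, the field names, and the `minimise` helper are all hypothetical and only illustrate the idea that the schema itself encodes the minimisation decision.

```python
from dataclasses import dataclass

# Hypothetical purpose-mapped schema: field name -> documented necessity justification.
# The purpose and field names are illustrative, not taken from the pattern.
PURPOSE_SCHEMAS = {
    "churn_prediction": {
        "customer_id": "join key for feature assembly",
        "tenure_months": "primary churn predictor",
        "plan_type": "segment-level churn behaviour",
    },
}


@dataclass
class GateResult:
    record: dict   # fields allowed through the gate
    dropped: list  # names of fields removed by the gate


def minimise(record: dict, purpose: str) -> GateResult:
    """Keep only fields with a documented justification for `purpose`."""
    if purpose not in PURPOSE_SCHEMAS:
        # Unknown purpose: nothing is justified, so nothing may pass.
        raise KeyError(f"no purpose-mapped schema for {purpose!r}")
    allowed = PURPOSE_SCHEMAS[purpose]
    kept = {k: v for k, v in record.items() if k in allowed}
    dropped = sorted(k for k in record if k not in allowed)
    return GateResult(record=kept, dropped=dropped)


raw = {
    "customer_id": "C-1001",
    "tenure_months": 18,
    "plan_type": "prepaid",
    "date_of_birth": "1990-04-02",
    "home_address": "1 Example St",
}
result = minimise(raw, "churn_prediction")
print(result.record)
print(result.dropped)
```

Because the allow-list lives in the schema, adding a field to a purpose becomes a reviewable schema change rather than a per-run approval.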

Control 2 — Purpose Limitation. Each data element is tagged with the AI purpose(s) for which it may be used. The Feature Composition Service (or equivalent) enforces that features used in training or inference are only assembled from data elements whose purpose tags include the current AI use case. This is enforced by the governance plane (OPA policies) at feature query time. Attempting to use a feature outside its declared purpose scope generates a policy violation and audit log entry.
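The purpose check at feature assembly can be sketched as follows. The pattern enforces this with OPA policies; the Python below is only a stand-in for that decision logic, and the feature names, purpose names, and `assemble_features` function are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Hypothetical purpose tags per feature; names are illustrative only.
FEATURE_PURPOSES = {
    "tenure_months": {"churn_prediction", "credit_scoring"},
    "avg_monthly_spend": {"churn_prediction"},
    "health_claim_count": {"claims_triage"},
}

audit_log = []  # stand-in for an append-only audit sink


def assemble_features(requested, use_case):
    """Return the requested features if all are permitted for `use_case`.

    Every decision is logged; untagged features are denied by default.
    """
    approved = []
    for name in requested:
        allowed = use_case in FEATURE_PURPOSES.get(name, set())
        audit_log.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "feature": name,
            "use_case": use_case,
            "decision": "allow" if allowed else "deny",
        })
        if allowed:
            approved.append(name)
    denied = [n for n in requested if n not in approved]
    if denied:
        raise PermissionError(f"purpose violation for {use_case}: {denied}")
    return approved


print(assemble_features(["tenure_months", "avg_monthly_spend"], "churn_prediction"))
```

Denying untagged features by default is a design choice in this sketch: a feature with no recorded purpose cannot be shown to fall within any consent scope.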

Control 3 — Pseudonymisation Pipeline. Identifiers (name, email, account number, national ID) are replaced with pseudonyms using a keyed hash function. The pseudonymisation key is managed by a dedicated KMS, separate from the data pipeline. Importantly, the same subject receives the same pseudonym consistently across all datasets in the pipeline, preserving join-ability for model training, while preventing direct identification. Re-identification requires access to the KMS pseudonymisation key, which is restricted, audited, and rotated annually.
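A minimal sketch of keyed pseudonymisation with HMAC-SHA256, using only the Python standard library. The `pseudonymise` helper, the normalisation step (trim and lower-case), and the demo key are assumptions for illustration; in the pattern the key is held by the KMS, and a deployment that must keep the key inside the KMS would compute the MAC there instead of in application code.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, key: bytes) -> str:
    """Return a hex pseudonym for `identifier` using HMAC-SHA256."""
    # Normalise so that trivially different spellings of one identifier
    # map to the same pseudonym (an assumption of this sketch).
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(key, normalised, hashlib.sha256).hexdigest()


# Placeholder key for illustration only; never hard-code a real key.
DEMO_KEY = b"demo-key-not-for-production"

a = pseudonymise("alice@example.com", DEMO_KEY)
b = pseudonymise(" Alice@Example.com", DEMO_KEY)
c = pseudonymise("bob@example.com", DEMO_KEY)

print(a == b)  # same subject, same pseudonym, so joins are preserved
print(a == c)  # different subjects receive different pseudonyms
```

Because HMAC is one-way, pseudonyms produced under an old key cannot be converted to a new key directly; re-pseudonymising after rotation has to start again from the source identifiers.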

Control 4 — Consent Management Integration. The pipeline subscribes to consent change events from the consent management platform. When a data subject withdraws consent for AI use, a propagation event is published to the Privacy Propagation Bus. Downstream services (feature store, training pipeline, inference service) subscribe and respond: the feature store marks the subject's features as consent-withdrawn; the inference service stops making predictions for that subject; the training pipeline flags the subject's records for exclusion from the next training run.
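The propagation behaviour described above can be sketched with an in-memory publish/subscribe stand-in. The `PrivacyBus` class, the event type string, and the event fields are hypothetical; a real deployment would use a durable bus such as those listed under Components.

```python
from collections import defaultdict

WITHDRAWN = "consent.withdrawn"  # hypothetical event type name


class PrivacyBus:
    """In-memory stand-in for the Privacy Propagation Bus."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, event):
        # Deliver synchronously to every subscriber of this event type.
        for handler in self._handlers[event_type]:
            handler(event)


class ConsentSubscriber:
    """Shared behaviour: remember (subject, purpose) pairs with withdrawn consent."""

    def __init__(self, bus):
        self.withdrawn = set()
        bus.subscribe(WITHDRAWN, self.on_withdrawal)

    def on_withdrawal(self, event):
        self.withdrawn.add((event["subject"], event["purpose"]))


class FeatureStore(ConsentSubscriber):
    def is_usable(self, subject, purpose):
        return (subject, purpose) not in self.withdrawn


class InferenceService(ConsentSubscriber):
    def predict(self, subject, purpose):
        if (subject, purpose) in self.withdrawn:
            return None  # decline rather than predict
        return 0.5  # placeholder score


class TrainingPipeline(ConsentSubscriber):
    def next_training_set(self, records, purpose):
        # Exclude withdrawn subjects from the next training run.
        return [r for r in records if (r["subject"], purpose) not in self.withdrawn]


bus = PrivacyBus()
store, inference, training = FeatureStore(bus), InferenceService(bus), TrainingPipeline(bus)
records = [{"subject": "p-001", "x": 1.0}, {"subject": "p-002", "x": 2.0}]

bus.publish(WITHDRAWN, {"subject": "p-001", "purpose": "churn_prediction"})
print(inference.predict("p-001", "churn_prediction"))
print(training.next_training_set(records, "churn_prediction"))
```

Consent is tracked per subject and per purpose here, so a withdrawal for one AI use case does not silently block another that the subject still permits.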

Control 5 — Machine Unlearning (Right-to-Erasure). When a data subject requests erasure, the system must first assess whether their data was used in AI training. If it was, removing their influence from the model is often infeasible without some retraining. The pattern implements three tiers: (a) record deletion for subjects whose data has not yet been used in training; (b) targeted unlearning for subjects whose data was used in a recent model version, either approximately via gradient reversal or by shard-level retraining under SISA (Sharded, Isolated, Sliced, and Aggregated) training; (c) full retraining for high-risk cases where targeted unlearning is insufficient. The SISA training architecture partitions training data into shards and trains an isolated constituent model on each, so erasure requires retraining only the affected shard rather than the full dataset, reducing unlearning cost by 60–80%.
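The shard-level retraining idea can be sketched with a toy model. This is a simplification under stated assumptions: the per-shard "model" is just the mean of the shard's values, shards are assigned by hashing the pseudonym, and the slicing and checkpointing parts of SISA are omitted. `ShardedEnsemble`, `shard_of`, and `NUM_SHARDS` are hypothetical names.

```python
import hashlib
from statistics import mean

NUM_SHARDS = 4  # illustrative shard count


def shard_of(pseudonym: str) -> int:
    """Deterministically map a pseudonym to a shard index."""
    digest = hashlib.sha256(pseudonym.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class ShardedEnsemble:
    def __init__(self, records):
        # records: iterable of (pseudonym, value) pairs
        self.shards = {i: [] for i in range(NUM_SHARDS)}
        for pseudonym, value in records:
            self.shards[shard_of(pseudonym)].append((pseudonym, value))
        self.models = {}
        self.retrain_count = 0
        for i in self.shards:
            self._train(i)

    def _train(self, i):
        # Toy constituent model: the shard mean (None for an empty shard).
        values = [v for _, v in self.shards[i]]
        self.models[i] = mean(values) if values else None
        self.retrain_count += 1

    def predict(self):
        # Aggregate by averaging the constituent models that exist.
        trained = [m for m in self.models.values() if m is not None]
        return mean(trained)

    def erase(self, pseudonym) -> bool:
        """Remove a subject and retrain only their shard; return True if retrained."""
        i = shard_of(pseudonym)
        remaining = [(p, v) for p, v in self.shards[i] if p != pseudonym]
        if len(remaining) == len(self.shards[i]):
            return False  # subject not present, nothing to unlearn
        self.shards[i] = remaining
        self._train(i)
        return True


records = [(f"p-{n:03d}", n) for n in range(40)]
ensemble = ShardedEnsemble(records)
print(ensemble.erase("p-007"), ensemble.retrain_count)
```

Only one constituent model is retrained per erased subject, which is where the cost saving over a full retrain comes from.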


5. Architecture Diagram

flowchart TD
    subgraph Input["Data Collection Controls"]
        A[Raw Data Source]
        B[Minimisation Gate]
        C[Pseudonymisation Service]
    end
    subgraph Pipeline["AI Data Pipeline"]
        D[Consent-Gated Feature Store]
        E[Training Pipeline]
        F[Inference Service]
    end
    subgraph Rights["Erasure and Consent"]
        G[Consent Management Platform]
        H[Erasure Request Handler]
    end
    A --> B
    B --> C
    C --> D
    G -->|consent withdrawal| D
    G -->|withdrawal signal| F
    D --> E
    E --> F
    H -->|shard retrain| E
    H -->|delete record| D
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f3e8ff,stroke:#a855f7
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#fef9c3,stroke:#eab308
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#d1fae5,stroke:#10b981
    style G fill:#f3e8ff,stroke:#a855f7
    style H fill:#fee2e2,stroke:#ef4444

6. Components

Component | Type | Responsibility | Technology Options | Criticality
Minimisation Gate | Processing | Enforces purpose-mapped schema; rejects fields not necessary for declared AI purpose | Custom pipeline step; dbt meta enforcement; Great Expectations custom rules | Critical
Pseudonymisation Service | Processing | HMAC-SHA256 keyed pseudonymisation; consistent pseudonyms across datasets | Custom Python service; HashiCorp Vault transit engine; AWS Encryption SDK | Critical
KMS (Pseudonymisation Keys) | Infrastructure | Manages pseudonymisation keys; enforces rotation; audits key usage | AWS KMS, Azure Key Vault, Google Cloud KMS, HashiCorp Vault | Critical
Purpose Tag Engine | Processing | Tags data elements with permitted AI purposes; enforced at feature query time | Custom OPA policy; Collibra data governance; Atlan purpose tags | Critical
Governance Plane (Purpose Enforcement) | Processing | OPA policies enforce purpose limitation at feature assembly | Open Policy Agent (OPA), custom middleware | Critical
Consent Management Platform | SaaS / On-Prem | Manages data subject consent; emits consent change events | OneTrust, Didomi, TrustArc, custom | High
Privacy Propagation Bus | Messaging | Carries consent withdrawal and erasure events to all pipeline subscribers | Apache Kafka, AWS EventBridge, Google Pub/Sub | Critical
Consent State Store | Storage | Current consent status per data subject per purpose; queried by inference service | Redis, DynamoDB, PostgreSQL | Critical
SISA Shard Retrainer | Processing | Retrains only the affected shard when erasure is requested | Custom PyTorch/TF training harness with shard management | High
Erasure Request Receiver | Service | Receives data subject erasure requests; triggers impact assessment | Custom API, OneTrust DSR module | High

7. Data Flow

Primary Flow

Step | Actor | Action | Output
1 | Data source | Provides raw data including personal information | Raw personal data
2 | Minimisation Gate | Validates each field against purpose-mapped schema | Minimised dataset (only necessary fields)
3 | Pseudonymisation Service | Replaces identifiers with HMAC-SHA256 pseudonyms using KMS-managed key | Pseudonymised dataset
4 | Purpose Tag Engine | Tags each data element with permitted AI purposes | Purpose-tagged pseudonymised dataset
5 | Governance Plane | At feature assembly, enforces purpose tags match current AI use case | Approved feature set for declared use case
6 | Feature Store | Stores features with consent status metadata per subject | Consent-aware feature store
7 | Training Pipeline | Reads features; excludes records with withdrawn consent; trains model on SISA shards | Model trained only on consented data
8 | Inference Service | At inference request, checks consent state for subject; proceeds or declines | Prediction (if consented) or privacy-compliant decline
9 | Consent Management Platform | Emits consent withdrawal event | Privacy Propagation Bus event
10 | Downstream subscribers (Feature Store, Inference Service) | Update consent state | Subject excluded from future training and inference
11 | Data subject | Requests erasure | Erasure event triggers impact assessment
12 | SISA Shard Retrainer | Identifies subject's shard; retrains affected shard | Updated model without subject's influence

Error Flow

Error Condition | Trigger | Response | Recovery
Consent State Store unavailable | Service down | Inference service defaults to consent-required (deny prediction) as a fail-safe | Restore consent state store; replay buffered consent events
Privacy Propagation Bus lag | High event volume | Consent withdrawal propagation delayed | Monitor bus lag; alert if >5 minutes; pause new training runs during lag
Erasure request for subject in non-SISA model | Subject in old full-batch trained model | Full retrain required; subject data excluded | Schedule full retrain; log erasure obligation; confirm completion
Pseudonymisation key rotation causes join failure | Key rotated without re-pseudonymising historical data | Feature join fails across rotated key boundary | Re-pseudonymise historical data before key rotation; test joins after rotation
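The fail-safe response to an unavailable or stale consent store can be sketched as a deny-by-default check. The `may_predict` function, the `lookup` callable, the entry shape, and the choice of 300 seconds as the staleness limit (borrowed from the 5-minute lag alert) are assumptions of this sketch.

```python
import time

MAX_STALENESS_SECONDS = 300  # assumption: reuse the 5-minute propagation threshold


class ConsentStoreUnavailable(Exception):
    """Raised by the lookup when the consent state store cannot be reached."""


def may_predict(lookup, subject, purpose, now=None):
    """Return True only when consent is positively confirmed and fresh.

    `lookup(subject, purpose)` is expected to return either None or a dict
    such as {"granted": bool, "synced_at": float}; this shape is hypothetical.
    """
    now = time.time() if now is None else now
    try:
        entry = lookup(subject, purpose)
    except ConsentStoreUnavailable:
        return False  # store down: deny
    if entry is None:
        return False  # no record: deny
    if now - entry["synced_at"] > MAX_STALENESS_SECONDS:
        return False  # state too old to trust: deny
    return bool(entry["granted"])


def store_down(subject, purpose):
    raise ConsentStoreUnavailable()


print(may_predict(store_down, "p-001", "churn_prediction"))
```

Every failure path returns a denial, so an outage reduces availability of predictions but cannot produce a prediction for a subject whose consent status is unknown.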

8. Security Considerations

Authentication & Authorisation

  • Pseudonymisation key access is restricted to the Pseudonymisation Service's service identity; human access requires a break-glass procedure with dual approval.
  • Consent State Store access restricted to feature store, inference service, and consent platform.

Secrets Management

  • Pseudonymisation keys managed in dedicated KMS; never stored in code, config, or pipeline artefacts.
  • KMS key usage audited; every pseudonymisation operation logged with purpose.

Data Classification

  • Pseudonymised data is classified as Restricted while the KMS key exists; it may be reclassified as Internal once the key is destroyed.
  • Consent state classified as Confidential; contains information about subjects' privacy choices.

Encryption

  • All personal data encrypted at rest (AES-256) and in transit (TLS 1.3).
  • Pseudonymisation key encrypted in KMS; key never exposed in plaintext outside KMS.

Auditability

  • Every purpose limitation check logged: data element + purpose claim + policy decision.
  • Erasure request handling fully audited: request received → impact assessed → action taken → completion confirmed.
  • Consent propagation events logged with timestamp; used to prove timely propagation.
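The four-stage erasure audit trail listed above could be recorded along these lines. The `ErasureCase` class and `Stage` enum are hypothetical names; the sketch only shows that stages advance strictly in order and that each transition leaves a trail entry.

```python
from enum import Enum


class Stage(Enum):
    REQUEST_RECEIVED = 1
    IMPACT_ASSESSED = 2
    ACTION_TAKEN = 3
    COMPLETION_CONFIRMED = 4


class ErasureCase:
    """Tracks one erasure request through the audited stages, in order."""

    def __init__(self, case_id: str):
        self.case_id = case_id
        self.stage = Stage.REQUEST_RECEIVED
        self.trail = [(Stage.REQUEST_RECEIVED, "request received")]

    def advance(self, note: str) -> Stage:
        if self.stage is Stage.COMPLETION_CONFIRMED:
            raise ValueError(f"case {self.case_id} is already closed")
        # Stages cannot be skipped: move to the next one only.
        self.stage = Stage(self.stage.value + 1)
        self.trail.append((self.stage, note))
        return self.stage


case = ErasureCase("DSR-0001")
case.advance("subject found in one training shard")
case.advance("affected shard retrained")
case.advance("completion confirmed to data subject")
for stage, note in case.trail:
    print(stage.name, "-", note)
```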

OWASP LLM Top 10 Mapping

OWASP LLM Risk | Relevance | Mitigation
LLM06: Sensitive Information Disclosure | PII in training data may surface in model outputs | Pseudonymisation before training; output PII scanning
LLM01: Prompt Injection | Adversarial prompts attempting to extract PII from LLM trained on personal data | Purpose limitation controls scope of LLM training data; output monitoring
LLM02: Insecure Output Handling | AI predictions revealing pseudonymous subject attributes | Purpose limitation on output data; output classification
LLM09: Overreliance | Privacy staff trusting pseudonymisation alone as sufficient protection | Defence-in-depth: pseudonymisation + differential privacy (DP) + purpose limitation + consent

9. Governance Considerations

Responsible AI

  • Data minimisation directly reduces AI bias risk by preventing use of data elements that may introduce proxies for protected attributes.
  • Consent management ensures AI predictions are only made for subjects who have given informed consent.

Model Risk Management

  • Machine unlearning completeness is a model risk metric: what percentage of erasure requests have been fulfilled within regulatory timeframe?
  • SISA architecture must be documented in model risk documentation: which shards, shard size, retraining SLA.

Human Approval Checkpoints

  • Purpose extension (using data element for new AI use case) requires Privacy Officer approval.
  • Special category data (health, political opinion, ethnicity) processing for AI requires Data Protection Impact Assessment (DPIA) and DPO sign-off.
  • Full model retrain triggered by erasure must be reviewed by ML lead and Privacy Officer before deployment.

Governance Artefacts

Artefact | Owner | Cadence | Purpose
Purpose-Mapped Schema Registry | Privacy + Data Engineering | On change | Documents permitted uses per data element; basis for OPA policies
Consent Propagation Audit Log | Privacy Platform | Continuous | Proves consent withdrawal was propagated within required timeframe
Erasure Fulfilment Register | DPO | Per request | Tracks erasure requests; impact assessment outcome; fulfilment action; confirmation
DPIA for AI Use Case | DPO | Per new high-risk AI use case | Structured privacy impact assessment; legal basis; mitigation measures
Pseudonymisation Key Audit Log | Security | Continuous | Records every key usage; supports re-identification prohibition enforcement

10. Operational Considerations

Monitoring

Metric | Alert Threshold | Tooling
Consent propagation lag (withdrawal to feature store update) | >5 minutes | Kafka consumer lag metrics
Erasure request fulfilment SLA | >30 days (GDPR) / >30 days (Privacy Act) | DSR management system
Purpose limitation violation rate | Any violation | OPA audit log + Grafana
Minimisation gate rejection rate (unexpected spike) | >10% above baseline | Pipeline metrics
Pseudonymisation service error rate | >0.1% | Service health metrics

SLOs

SLO | Target | Measurement
Consent withdrawal propagation | <5 minutes to feature store + inference service | Event timestamp comparison
Erasure request fulfilment | <30 days GDPR / <30 days Privacy Act | DSR register
SISA shard retrain completion | <24 hours from erasure trigger | Training pipeline logs
Minimisation gate latency overhead | <100ms per record batch | Pipeline timing
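Two of the targets above reduce to timestamp arithmetic, sketched here with the standard library. The function names are hypothetical, and the limits simply restate the 5-minute and 30-day figures from the tables.

```python
from datetime import datetime, timedelta, timezone

PROPAGATION_LIMIT = timedelta(minutes=5)  # consent withdrawal propagation
ERASURE_LIMIT = timedelta(days=30)        # erasure request fulfilment


def propagation_alert(withdrawn_at: datetime, applied_at: datetime) -> bool:
    """True when the lag from withdrawal to downstream update exceeds the limit."""
    return applied_at - withdrawn_at > PROPAGATION_LIMIT


def erasure_due(received_at: datetime) -> datetime:
    """Latest fulfilment time for an erasure request received at `received_at`."""
    return received_at + ERASURE_LIMIT


def erasure_overdue(received_at: datetime, now: datetime) -> bool:
    return now > erasure_due(received_at)


withdrawn = datetime(2025, 3, 1, 9, 0, 0, tzinfo=timezone.utc)
applied = withdrawn + timedelta(minutes=7)
print(propagation_alert(withdrawn, applied))

received = datetime(2025, 3, 1, tzinfo=timezone.utc)
print(erasure_due(received).date())
```

Using timezone-aware timestamps throughout avoids comparing event times recorded by services in different regions.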

11. Cost Considerations

Cost Drivers

Cost Driver | Typical Range | Notes
Consent Management Platform | $1,000–$8,000/month | OneTrust/Didomi SaaS; scales with consent volume
Pseudonymisation Service | $100–$500/month | Lightweight compute; KMS key operations cost
SISA architecture overhead | 20–40% training cost increase | More training jobs (per-shard); offset by faster erasure
Privacy Propagation Bus | $100–$1,000/month | Kafka/EventBridge; scales with consent event volume
Engineering (privacy controls maintenance) | 0.5–1 FTE | Ongoing consent integration, purpose tag management

Optimisations

  • Batch consent withdrawal processing during off-peak hours to reduce Kafka throughput requirements, but only where batching still meets the <5 minute propagation SLO.
  • Use SISA architecture to reduce full-retrain costs for erasure; size shards to balance erasure cost vs. training overhead.
  • Implement lazy pseudonymisation: pseudonymise at feature extraction time, not at ingestion, to avoid re-pseudonymising historical data on key rotation.

Indicative Cost Range

Scale | Monthly Cost | Basis
Small (1–3 AI use cases, <100K data subjects) | $2,000–$6,000 | Lightweight consent platform + basic SISA
Medium (5–10 use cases, 1M data subjects) | $6,000–$20,000 | Full consent platform + SISA + privacy bus
Large (20+ use cases, 10M+ data subjects) | $20,000–$80,000 | Enterprise consent + high-throughput bus + automated SISA

12. Trade-Off Analysis

Option Comparison

Option | Pros | Cons | Recommended When
A: Full privacy-by-design pipeline (this pattern) | Regulatory-grade; defensible posture; enables sensitive data AI | High setup cost; SISA adds training complexity | Regulated industry; sensitive personal data; GDPR/Privacy Act obligation
B: Point-in-time anonymisation before AI ingestion | Simpler; one-time effort | Cannot honour ongoing consent withdrawal; destroys join-ability | Static, historical datasets; no ongoing consent management needed
C: Pseudonymisation only (no purpose limitation/consent propagation) | Simple; preserves some privacy | Incomplete: doesn't address purpose limitation or right-to-erasure | Partial compliance in low-risk contexts only
D: Compliance-as-documentation (policy without technical enforcement) | Near-zero cost | Fails regulatory audit; high breach risk; unenforceable | Only for pilot/experimental AI with no personal data

Architectural Tensions

Tension | Trade-Off | Resolution
Pseudonymisation consistency vs. key rotation | Consistent pseudonyms needed for joins; key rotation breaks consistency | Re-pseudonymise historical data on rotation; test joins before rotation completes
SISA shard granularity vs. privacy guarantee | Large shards = faster training; smaller shards = faster erasure | Size shards based on expected erasure request frequency; 1,000–10,000 records/shard typical
Consent enforcement strictness vs. inference availability | Strict consent check → some subjects get no predictions → business value loss | Communicate consent requirements to subjects; design graceful decline UX
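The shard-sizing tension can be made concrete with a small calculation. Assuming equal-size shards and erasure requests that fall uniformly at random across shards and are processed together in one batch (both simplifications), the expected fraction of shards that must be retrained for k erasures over S shards is 1 − (1 − 1/S)^k. The function name is hypothetical.

```python
def expected_retrain_fraction(num_shards: int, erasures: int) -> float:
    """Expected fraction of shards retrained for one batch of erasure requests.

    Assumes equal-size shards and erasures spread uniformly at random.
    """
    if num_shards < 1 or erasures < 0:
        raise ValueError("num_shards must be >= 1 and erasures >= 0")
    return 1.0 - (1.0 - 1.0 / num_shards) ** erasures


# Compare shard counts for the same batch of 20 erasure requests.
for shards in (10, 100, 1000):
    fraction = expected_retrain_fraction(shards, 20)
    print(f"{shards} shards: {fraction:.3f} of the dataset retrained")
```

More shards lower the retrained fraction for a given batch of requests, at the price of more training jobs to manage, which is the overhead noted under Cost Drivers.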

13. Failure Modes

Failure | Likelihood | Impact | Detection | Recovery
Consent withdrawal not propagated (bus failure) | Medium | Critical: inference continues for withdrawn subject | Bus lag monitoring; consent state verification | Fail-safe: inference denies if consent state stale >threshold
Purpose limitation bypass (governance plane down) | Low | High: data used outside permitted purpose | Governance plane health check; pipeline alert on bypass | Fail-safe: block feature assembly if governance plane unavailable
Erasure request not fulfilled within statutory period | Medium | High: GDPR/Privacy Act violation | DSR register SLA monitoring | Escalation to DPO; accelerated retrain; regulatory notification if SLA breached
Re-identification via auxiliary data join | Low | Critical: pseudonymisation circumvented | Quarterly re-identification risk assessment | Restrict auxiliary data availability; enforce purpose-mapped schema
SISA retrain produces worse model | Medium | Medium: erasure degrades model quality | TSTR validation after shard retrain | Accept quality trade-off (privacy obligation overrides quality preference); document in model card

14. Regulatory Considerations

Regulation | Article/Clause | Requirement | Pattern Response
GDPR | Article 5(1)(b) | Purpose limitation | OPA-enforced purpose tags per data element
GDPR | Article 5(1)(c) | Data minimisation | Minimisation gate at collection
GDPR | Article 17 | Right to erasure | SISA architecture + erasure fulfilment register
GDPR | Article 25 | Privacy by design and by default | This entire pattern is the Article 25 implementation
GDPR | Article 35 | Data Protection Impact Assessment for high-risk AI | DPIA governance artefact per use case
Privacy Act (Australia) | APP 1/3/6/11 | Privacy management; collection; use; security | Full pattern addresses all four APPs
EU AI Act | Article 10(5) | Processing of sensitive data for bias detection | Purpose limitation + consent gates allow narrow exception
ISO 27701 | §7.4 | Privacy controls in information systems | Controls 1–5 map to ISO 27701 privacy controls
ISO 42001 | §6.1 | AI risk management (privacy dimension) | DPIA process maps to ISO 42001 risk assessment

15. Reference Implementations

AWS

Component | AWS Service
Minimisation gate | AWS Glue schema enforcement + custom Lambda
Pseudonymisation | AWS Encryption SDK + AWS KMS
Purpose tags | Lake Formation column tags + OPA on Lambda
Consent platform | OneTrust (SaaS) + EventBridge integration
Privacy bus | Amazon EventBridge
Consent state store | Amazon ElastiCache (Redis)
SISA retraining | SageMaker Training Jobs (shard-partitioned)

Azure

Component | Azure Service
Pseudonymisation | Azure Key Vault + custom Python service
Purpose tags | Azure Purview sensitivity labels + OPA on AKS
Consent platform | Didomi (SaaS) + Azure Service Bus
Privacy bus | Azure Service Bus
SISA retraining | Azure ML Training (shard-partitioned)

GCP

Component | GCP Service
Pseudonymisation | Cloud DLP de-identification + Cloud KMS
Purpose tags | Dataplex policy tags + OPA on Cloud Run
Consent platform | OneTrust / custom on Cloud Run
Privacy bus | Google Pub/Sub
SISA retraining | Vertex AI Training

On-Premises

Component | Technology
Pseudonymisation | HashiCorp Vault transit secrets engine
Purpose tags | OPA + custom purpose registry
Consent platform | Custom PostgreSQL + Kafka integration
Privacy bus | Apache Kafka
SISA retraining | PyTorch Lightning (shard management) on Kubernetes

16. Related Patterns

Pattern | ID | Relationship | Notes
Synthetic Data Generation | EAAPL-DAT004 | Complements | Synthetic data reduces need for real personal data in AI training
Data Lineage for AI | EAAPL-DAT003 | Enables | Lineage identifies all models trained on data subject to erasure request
AI Training Data Governance | EAAPL-DAT007 | Depends on | Consent records are training data governance artefacts
AI Data Mesh Integration | EAAPL-DAT001 | Complements | Consent scope enforcement is a governance plane responsibility
Human Approval Gateway | EAAPL-HIL001 | Complements | DPIA review and SISA retrain decisions are human approval gates

17. Maturity Assessment

Overall Maturity: Proven — Pseudonymisation, consent management, and purpose limitation are mature practices. Machine unlearning (SISA) is a more recent but production-proven technique, particularly in EU GDPR-regulated environments.

Dimension | Score (1–5) | Notes
Architectural clarity | 5 | Five controls clearly defined with implementation detail
Tooling maturity | 4 | KMS, consent platforms mature; SISA tooling maturing
Regulatory alignment | 5 | Strongest GDPR Art. 17/25 alignment of any pattern
Operational complexity | 3 | SISA training management and consent propagation require ongoing attention
Cost efficiency | 3 | Consent platform + SISA overhead significant; offset by regulatory risk reduction
Security | 5 | KMS-managed pseudonymisation; fail-safe consent enforcement

18. Revision History

Version | Date | Author | Changes
1.0 | 2023-12-01 | EAAPL Working Group | Initial publication
1.1 | 2024-06-15 | EAAPL Working Group | Added SISA machine unlearning architecture; EU AI Act Art. 10(5)
1.2 | 2025-03-01 | EAAPL Working Group | Updated Privacy Act (Australia) alignment; expanded failure modes