EAAPL-DAT005 · Proven

Privacy by Design for AI Data

🗄️ Data Architecture · EU AI Act · ISO/IEC 42001 · 🏭 Field-tested in AU · ↑ 21 signals · Q2 2026

[EAAPL-DAT005] Privacy by Design for AI Data

Category: Data Architecture
Sub-category: Privacy Engineering / AI Data Pipelines
Version: 1.2
Maturity: Proven
Tags: privacy-by-design, data-minimisation, pseudonymisation, consent-management, machine-unlearning, purpose-limitation
Regulatory Relevance: GDPR Articles 5/17/25, Privacy Act (Australia) APP 1–13, EU AI Act Article 10, ISO 27701, ISO 42001 §6.1, NIST AI RMF GOVERN-1.7

1. Executive Summary

Privacy-by-design mandates that privacy controls are embedded into AI data pipelines at every stage — not bolted on retrospectively. For AI systems, this is uniquely challenging: the same model that must protect personal data also depends on it for predictive accuracy. This pattern resolves that tension through five privacy engineering techniques applied systematically across the AI data lifecycle.

Data minimisation at collection ensures only data strictly necessary for the AI purpose is ingested. Purpose limitation controls enforce that data collected for one AI use case cannot be used for another without explicit governance approval. Pseudonymisation pipelines protect identity while preserving the statistical relationships models need. Consent management integration gates data use on current consent status. Machine unlearning mechanisms respond to right-to-erasure requests by identifying and selectively removing the influence of specific individuals' data from trained models.

Organisations that implement this pattern reduce regulatory exposure for AI programmes by 60–80% (measured by privacy impact assessment findings), enable AI use cases on sensitive data that would otherwise be prohibited, and establish a defensible posture for privacy regulator inquiries.

Target audience: Chief Privacy Officers, Data Protection Officers, AI Architects, ML Platform leads.

2. Problem Statement

Business Problem

AI programmes operating on personal data face escalating regulatory scrutiny. A single privacy incident in an AI system — training on consent-withdrawn data, using data outside stated purpose, or failing to honour erasure requests — can result in regulatory fines, reputational damage, and programme shutdown.

Technical Problem

Most AI pipelines treat privacy as a compliance checkbox (PII masking before data warehouse load) rather than an architectural property.
Purpose limitation is not technically enforced: once data is in a data lake or feature store, downstream AI uses are not validated against the original consent scope.
Right-to-erasure requests are handled for operational databases but the privacy team cannot identify whether a data subject's records were used in AI training.
Consent management systems are siloed from AI training pipelines; consent withdrawal does not propagate to feature stores or retrain queues.
Pseudonymisation is often implemented inconsistently across teams — some pipelines re-identify pseudonymous records by joining with other datasets.

Symptoms

Privacy Impact Assessment finds AI programme is using customer data beyond consented purposes.
Data subject requests right-to-erasure; organisation cannot determine if subject data was used in AI training.
Consent withdrawal not reflected in AI inference; system continues to make predictions using withdrawn-consent data.
Regulatory audit finds data minimisation principle not applied to AI training data.
Privacy review is a deployment gate that consistently delays AI releases by 4–8 weeks.

Cost of Inaction

Dimension	Impact
Regulatory	GDPR Art. 83: fines up to €20M or 4% of global annual turnover, whichever is higher; Privacy Act civil penalties
Reputational	Consumer trust erosion if AI privacy incident publicised
Operational	Retroactive privacy remediation in production AI is 10–30× more expensive than design-time
Legal	Class action exposure for AI profiling on unlawfully processed data

3. Context

When to Apply

Any AI system processing personal information about identifiable individuals.
AI systems in jurisdictions subject to GDPR, Australian Privacy Act, CCPA, or equivalent.
AI systems where data subjects have consent rights or erasure rights.
Organisations building AI on sensitive personal data (health, financial, employment, location).
New AI programme initiation (most effective when applied at design stage).

When NOT to Apply

AI systems processing entirely synthetic or fully anonymised data (not personal information by definition).
AI systems operating only on clearly non-personal data (e.g., pure sensor telemetry with no individual linkage).
Research AI under specific statutory exemption from privacy regulation (specific carve-outs apply and must be legally verified).

Prerequisites

Prerequisite	Minimum Viable	Preferred
Privacy Impact Assessment capability	Ad hoc privacy review	Systematic DPIA process with templates
Consent management system	Simple opt-in/out database	Full consent management platform (OneTrust, Didomi)
Data catalogue	Spreadsheet	DataHub/Atlan with PII tagging
Pseudonymisation key management	Shared key per system	Dedicated KMS with per-subject keys
Legal counsel	Internal privacy counsel	DPO + external specialist counsel

Industry Applicability

Industry	Applicability	Driver
Healthcare	Critical	Sensitive health data; clinical AI; patient consent
Financial Services	Critical	Credit/fraud AI on personal financial data; APRA + GDPR
Retail	High	Customer purchase + behavioural data; personalisation AI
Telecommunications	High	Call records; location data; churn AI
Government	High	Citizen data; mandatory Privacy Act compliance
HR / Recruitment	Critical	Employment AI; GDPR Art. 9 special category data

4. Architecture Overview

Design Philosophy

Privacy-by-design for AI is implemented as a set of pipeline controls that operate at defined points in the data lifecycle, enforced architecturally rather than by policy alone. The five controls are applied in sequence from data collection through to model serving, and the architecture includes a dedicated Privacy Propagation Bus that carries consent and erasure signals across all pipeline stages.

Control 1 — Data Minimisation at Collection. Before any personal data enters the AI data pipeline, a minimisation gate evaluates whether each data element is strictly necessary for the declared AI purpose. This is enforced by a purpose-mapped schema: only fields with a documented necessity justification for the specific AI use case are allowed to pass. This is implemented as a schema enforcement step in the ingestion pipeline, not as a manual approval gate — the schema definition itself codifies the minimisation decision.
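
A minimal sketch of such a gate, assuming the purpose-mapped schema is held as a plain dictionary. The purpose name, field names, and the `minimise` helper are hypothetical and only illustrate the deny-by-default behaviour described above.

```python
from dataclasses import dataclass

# Hypothetical purpose-mapped schema: purpose -> {field: necessity justification}.
PURPOSE_SCHEMA = {
    "churn_prediction": {
        "customer_id": "join key for feature assembly",
        "tenure_months": "primary churn predictor",
        "plan_type": "segments churn behaviour",
    },
}


@dataclass
class GateResult:
    record: dict   # fields allowed through the gate
    dropped: list  # names of fields rejected as unnecessary


def minimise(record: dict, purpose: str) -> GateResult:
    """Keep only fields with a documented justification for this purpose."""
    allowed = PURPOSE_SCHEMA.get(purpose)
    if allowed is None:
        # No declared schema means nothing may pass (deny by default).
        raise ValueError(f"no purpose-mapped schema for {purpose!r}")
    kept = {k: v for k, v in record.items() if k in allowed}
    dropped = sorted(k for k in record if k not in allowed)
    return GateResult(record=kept, dropped=dropped)


raw = {
    "customer_id": "C-1",
    "tenure_months": 18,
    "plan_type": "pro",
    "date_of_birth": "1990-01-01",
    "home_address": "1 Example St",
}
result = minimise(raw, "churn_prediction")
print(result.record)   # only the three justified fields
print(result.dropped)  # fields that never enter the pipeline
```

Recording the dropped field names, rather than silently discarding them, is one way to feed the rejection-rate metric a pipeline might monitor.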

Control 2 — Purpose Limitation. Each data element is tagged with the AI purpose(s) for which it may be used. The Feature Composition Service (or equivalent) enforces that features used in training or inference are only assembled from data elements whose purpose tags include the current AI use case. This is enforced by the governance plane (OPA policies) at feature query time. Attempting to use a feature outside its declared purpose scope generates a policy violation and audit log entry.
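
The check performed at feature query time can be sketched as follows. The pattern enforces this with OPA policies; the Python below is only a stand-in for the decision logic, not Rego, and the tag table, feature names, and `assemble_features` function are hypothetical.

```python
# Hypothetical purpose tags: feature name -> set of permitted AI purposes.
PURPOSE_TAGS = {
    "tenure_months": {"churn_prediction", "credit_scoring"},
    "avg_monthly_spend": {"churn_prediction"},
    "health_flag": {"clinical_triage"},
}

audit_log = []  # stand-in for the real audit log sink


def assemble_features(requested, use_case):
    """Return the requested features only if each is tagged for use_case."""
    violations = [
        f for f in requested if use_case not in PURPOSE_TAGS.get(f, set())
    ]
    # Every decision is logged, whether it is an allow or a deny.
    audit_log.append({
        "use_case": use_case,
        "requested": list(requested),
        "violations": violations,
        "decision": "deny" if violations else "allow",
    })
    if violations:
        raise PermissionError(f"purpose violation for {use_case}: {violations}")
    return list(requested)


allowed = assemble_features(["tenure_months", "avg_monthly_spend"], "churn_prediction")

try:
    assemble_features(["tenure_months", "health_flag"], "churn_prediction")
    denied = False
except PermissionError:
    denied = True

print(allowed, denied, audit_log[-1]["violations"])
```

Untagged features are treated as having no permitted purposes, so an unknown feature is denied rather than allowed.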

Control 3 — Pseudonymisation Pipeline. Identifiers (name, email, account number, national ID) are replaced with pseudonyms using a keyed hash function. The pseudonymisation key is managed by a dedicated KMS, separate from the data pipeline. Importantly, the same subject receives the same pseudonym consistently across all datasets in the pipeline, preserving join-ability for model training, while preventing direct identification. Re-identification requires access to the KMS pseudonymisation key, which is restricted, audited, and rotated annually.
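
A keyed-hash pseudonym of this kind can be sketched with Python's standard `hmac` module. The hard-coded demo key and the normalisation step (trim and lowercase) are assumptions for illustration; in the pattern the key is held by the KMS and never appears in code.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, key: bytes) -> str:
    """Return a stable HMAC-SHA256 pseudonym for an identifier."""
    # Assumed normalisation so that trivially different spellings of the
    # same identifier map to the same pseudonym.
    normalised = identifier.strip().lower()
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


# Demo key only. In the pattern this is fetched from (or used inside) the KMS.
demo_key = b"demo-key-not-for-production"
other_key = b"a-different-demo-key"

p1 = pseudonymise("alice@example.com", demo_key)
p2 = pseudonymise(" Alice@Example.com ", demo_key)
p3 = pseudonymise("alice@example.com", other_key)

print(p1 == p2)  # same subject, same key: pseudonyms match, so joins still work
print(p1 == p3)  # same subject, different key: pseudonyms differ
```

The second comparison is why key rotation breaks joins across the rotation boundary unless historical data is re-pseudonymised.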

Control 4 — Consent Management Integration. The pipeline subscribes to consent change events from the consent management platform. When a data subject withdraws consent for AI use, a propagation event is published to the Privacy Propagation Bus. Downstream services (feature store, training pipeline, inference service) subscribe and respond: the feature store marks the subject's features as consent-withdrawn; the inference service stops making predictions for that subject; the training pipeline flags the subject's records for exclusion from the next training run.
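
The fan-out described above can be sketched with an in-memory publish/subscribe class standing in for the Privacy Propagation Bus. The topic name, event fields, and the three subscriber sets are hypothetical; a real deployment would use a durable broker rather than synchronous callbacks.

```python
from collections import defaultdict


class PrivacyBus:
    """Toy in-memory stand-in for the Privacy Propagation Bus."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Deliver the event to every subscriber of the topic, in order.
        for handler in self._subscribers[topic]:
            handler(event)


# Hypothetical per-service state updated when consent is withdrawn.
withdrawn_features = set()    # feature store: subjects marked consent-withdrawn
inference_blocklist = set()   # inference service: subjects to decline
training_exclusions = set()   # training pipeline: exclude from next run

bus = PrivacyBus()
bus.subscribe("consent.withdrawn", lambda e: withdrawn_features.add(e["subject"]))
bus.subscribe("consent.withdrawn", lambda e: inference_blocklist.add(e["subject"]))
bus.subscribe("consent.withdrawn", lambda e: training_exclusions.add(e["subject"]))

bus.publish("consent.withdrawn", {"subject": "p_7f3a", "purpose": "churn_prediction"})
print(withdrawn_features, inference_blocklist, training_exclusions)
```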

Control 5 — Machine Unlearning (Right-to-Erasure). When a data subject requests erasure, the system must assess whether their data was used in AI training. If it was, removing its influence from the model is often infeasible without some retraining. The pattern implements three tiers: (a) record deletion for subjects whose data has not yet been used in training; (b) targeted unlearning for subjects whose data was used in a recent model version, either by retraining only the affected shard under SISA (Sharded, Isolated, Sliced, and Aggregated) training or by approximate methods such as gradient reversal; (c) full retraining for high-risk cases where targeted unlearning is insufficient. The SISA training architecture partitions training data into shards and trains an isolated constituent model on each, so erasure requires retraining only the affected shard rather than the full dataset, reducing unlearning cost by 60–80%.
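
The shard-level retraining idea can be sketched with a toy ensemble. This is a simplification under stated assumptions: the "model" for each shard is just the mean of its values, slicing (checkpointing within a shard) is omitted, and the class and function names are hypothetical.

```python
import hashlib
from statistics import mean

NUM_SHARDS = 4


def shard_of(subject_id: str) -> int:
    """Stable shard assignment (the built-in hash() is salted per process)."""
    digest = hashlib.sha256(subject_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


def train_shard(records: dict) -> float:
    """Toy constituent model: the mean of the shard's values."""
    return mean(records.values()) if records else 0.0


class SisaEnsemble:
    def __init__(self, data: dict):
        self.shards = [dict() for _ in range(NUM_SHARDS)]
        for subject, value in data.items():
            self.shards[shard_of(subject)][subject] = value
        # One isolated model per shard.
        self.models = [train_shard(s) for s in self.shards]
        self.retrain_count = 0

    def erase(self, subject: str) -> int:
        """Delete the subject's record and retrain only the affected shard."""
        idx = shard_of(subject)
        if self.shards[idx].pop(subject, None) is not None:
            self.models[idx] = train_shard(self.shards[idx])
            self.retrain_count += 1
        return idx

    def predict(self) -> float:
        """Aggregate the constituent models (here: a simple average)."""
        return mean(self.models)


data = {f"subject-{i}": float(i) for i in range(20)}
ensemble = SisaEnsemble(data)
before = list(ensemble.models)
affected = ensemble.erase("subject-7")
untouched = [i for i in range(NUM_SHARDS) if i != affected]
print(affected, ensemble.retrain_count, untouched)
```

Only one constituent model is rebuilt per erasure; the other shard models are reused unchanged, which is where the cost saving comes from.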

5. Architecture Diagram

flowchart TD
    subgraph Input["Data Collection Controls"]
        A[Raw Data Source]
        B[Minimisation Gate]
        C[Pseudonymisation Service]
    end
    subgraph Pipeline["AI Data Pipeline"]
        D[Consent-Gated Feature Store]
        E[Training Pipeline]
        F[Inference Service]
    end
    subgraph Rights["Erasure and Consent"]
        G[Consent Management Platform]
        H[Erasure Request Handler]
    end
    A --> B
    B --> C
    C --> D
    G -->|consent withdrawal| D
    G -->|withdrawal signal| F
    D --> E
    E --> F
    H -->|shard retrain| E
    H -->|delete record| D
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f3e8ff,stroke:#a855f7
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#fef9c3,stroke:#eab308
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#d1fae5,stroke:#10b981
    style G fill:#f3e8ff,stroke:#a855f7
    style H fill:#fee2e2,stroke:#ef4444

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Minimisation Gate	Processing	Enforces purpose-mapped schema; rejects fields not necessary for declared AI purpose	Custom pipeline step; dbt meta enforcement; Great Expectations custom rules	Critical
Pseudonymisation Service	Processing	HMAC-SHA256 keyed pseudonymisation; consistent pseudonyms across datasets	Custom Python service; HashiCorp Vault transit engine; AWS Encryption SDK	Critical
KMS (Pseudonymisation Keys)	Infrastructure	Manages pseudonymisation keys; enforces rotation; audits key usage	AWS KMS, Azure Key Vault, Google Cloud KMS, HashiCorp Vault	Critical
Purpose Tag Engine	Processing	Tags data elements with permitted AI purposes; enforced at feature query time	Custom OPA policy; Collibra data governance; Atlan purpose tags	Critical
Governance Plane (Purpose Enforcement)	Processing	OPA policies enforce purpose limitation at feature assembly	Open Policy Agent (OPA), custom middleware	Critical
Consent Management Platform	SaaS / On-Prem	Manages data subject consent; emits consent change events	OneTrust, Didomi, TrustArc, custom	High
Privacy Propagation Bus	Messaging	Carries consent withdrawal and erasure events to all pipeline subscribers	Apache Kafka, AWS EventBridge, Google Pub/Sub	Critical
Consent State Store	Storage	Current consent status per data subject per purpose; queried by inference service	Redis, DynamoDB, PostgreSQL	Critical
SISA Shard Retrainer	Processing	Retrains only the affected shard when erasure is requested	Custom PyTorch/TF training harness with shard management	High
Erasure Request Receiver	Service	Intake data subject erasure requests; triggers impact assessment	Custom API, OneTrust DSR module	High

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Data source	Provides raw data including personal information	Raw personal data
2	Minimisation Gate	Validates each field against purpose-mapped schema	Minimised dataset (only necessary fields)
3	Pseudonymisation Service	Replaces identifiers with HMAC-SHA256 pseudonyms using KMS-managed key	Pseudonymised dataset
4	Purpose Tag Engine	Tags each data element with permitted AI purposes	Purpose-tagged pseudonymised dataset
5	Governance Plane	At feature assembly, enforces purpose tags match current AI use case	Approved feature set for declared use case
6	Feature Store	Stores features with consent status metadata per subject	Consent-aware feature store
7	Training Pipeline	Reads features; excludes records with withdrawn consent; trains model on SISA shards	Model trained only on consented data
8	Inference Service	At inference request, checks consent state for subject; proceeds or declines	Prediction (if consented) or privacy-compliant decline
9	Consent Management Platform	Emits consent withdrawal event	Privacy Propagation Bus event
10	Downstream subscribers	Feature Store + Inference Service update consent state	Subject excluded from future training and inference
11	Erasure Request Receiver	Receives the data subject's erasure request	Erasure event triggers impact assessment
12	SISA Shard Retrainer	Identifies subject's shard; retrains affected shard	Updated model without subject's influence

Error Flow

Error Condition	Trigger	Response	Recovery
Consent State Store unavailable	Service down	Inference service defaults to consent-required (deny prediction) — fail-safe	Restore consent state store; replay buffered consent events
Privacy Propagation Bus lag	High event volume	Consent withdrawal propagation delayed	Monitor bus lag; alert if >5 minutes; pause new training runs during lag
Erasure request for subject in non-SISA model	Subject in old full-batch trained model	Full retrain required; subject data excluded	Schedule full retrain; log erasure obligation; confirm completion
Pseudonymisation key rotation causes join failure	Key rotated without re-pseudonymising historical data	Feature join fails across rotated key boundary	Re-pseudonymise historical data before key rotation; test joins after rotation
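
The fail-safe behaviour in the first row of the table above (deny prediction when the Consent State Store is unavailable) can be sketched as below. The store class, exception, and response shape are hypothetical stand-ins, not the API of any particular product.

```python
class ConsentStoreUnavailable(Exception):
    """Raised when the consent state cannot be read."""


class InMemoryConsentStore:
    """Toy stand-in for the Consent State Store."""

    def __init__(self, state, available=True):
        self.state = state          # {(subject, purpose): bool}
        self.available = available  # flip to False to simulate an outage

    def has_consent(self, subject, purpose):
        if not self.available:
            raise ConsentStoreUnavailable()
        # Unknown subjects are treated as not consented.
        return self.state.get((subject, purpose), False)


def predict_if_consented(store, subject, purpose, model):
    """Run the model only when consent is positively confirmed."""
    try:
        consented = store.has_consent(subject, purpose)
    except ConsentStoreUnavailable:
        consented = False  # fail closed: no consent evidence, no prediction
    if not consented:
        return {"status": "declined", "prediction": None}
    return {"status": "ok", "prediction": model(subject)}


store = InMemoryConsentStore({("p_1", "churn_prediction"): True})
toy_model = lambda subject: 0.42  # placeholder for a real model call

ok = predict_if_consented(store, "p_1", "churn_prediction", toy_model)
unknown = predict_if_consented(store, "p_2", "churn_prediction", toy_model)
store.available = False
outage = predict_if_consented(store, "p_1", "churn_prediction", toy_model)
print(ok["status"], unknown["status"], outage["status"])
```

Failing closed trades inference availability for privacy assurance: during an outage even consented subjects are declined.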

8. Security Considerations

Authentication & Authorisation

Pseudonymisation key access is restricted to the Pseudonymisation Service's service identity; human access requires a break-glass procedure with dual approval.
Consent State Store access restricted to feature store, inference service, and consent platform.

Secrets Management

Pseudonymisation keys managed in dedicated KMS; never stored in code, config, or pipeline artefacts.
KMS key usage audited; every pseudonymisation operation logged with purpose.

Data Classification

Pseudonymised data classified as Restricted until KMS key destroyed; reclassifiable to Internal post key destruction.
Consent state classified as Confidential; contains information about subjects' privacy choices.

Encryption

All personal data encrypted at rest (AES-256) and in transit (TLS 1.3).
Pseudonymisation key encrypted in KMS; key never exposed in plaintext outside KMS.

Auditability

Every purpose limitation check logged: data element + purpose claim + policy decision.
Erasure request handling fully audited: request received → impact assessed → action taken → completion confirmed.
Consent propagation events logged with timestamp; used to prove timely propagation.
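
The four-stage erasure audit trail listed above can be sketched as an ordered log that rejects skipped or out-of-order stages. The stage identifiers and the `ErasureAuditTrail` class are hypothetical names chosen for illustration.

```python
from datetime import datetime, timezone

# Stages mirror: request received, impact assessed, action taken, completion confirmed.
STAGES = [
    "request_received",
    "impact_assessed",
    "action_taken",
    "completion_confirmed",
]


class ErasureAuditTrail:
    def __init__(self, request_id: str):
        self.request_id = request_id
        self.entries = []

    def record(self, stage: str, detail: str = "") -> None:
        """Append the next stage; refuse anything out of sequence."""
        position = len(self.entries)
        expected = STAGES[position] if position < len(STAGES) else None
        if stage != expected:
            raise ValueError(f"expected stage {expected!r}, got {stage!r}")
        self.entries.append({
            "stage": stage,
            "detail": detail,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    @property
    def complete(self) -> bool:
        return len(self.entries) == len(STAGES)


trail = ErasureAuditTrail("DSR-1")
trail.record("request_received")
trail.record("impact_assessed", "subject found in one training shard")
trail.record("action_taken", "affected shard retrained")
trail.record("completion_confirmed")
print(trail.complete, [e["stage"] for e in trail.entries])
```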

OWASP LLM Top 10 Mapping

OWASP LLM Risk	Relevance	Mitigation
LLM06: Sensitive Information Disclosure	PII in training data may surface in model outputs	Pseudonymisation before training; output PII scanning
LLM01: Prompt Injection	Adversarial prompts attempting to extract PII from LLM trained on personal data	Purpose limitation controls scope of LLM training data; output monitoring
LLM02: Insecure Output Handling	AI predictions revealing pseudonymous subject attributes	Purpose limitation on output data; output classification
LLM09: Overreliance	Privacy staff trusting pseudonymisation alone as sufficient protection	Defence-in-depth: pseudonymisation + differential privacy (DP) + purpose limitation + consent

9. Governance Considerations

Responsible AI

Data minimisation directly reduces AI bias risk by preventing use of data elements that may introduce proxies for protected attributes.
Consent management ensures AI predictions are only made for subjects who have given informed consent.

Model Risk Management

Machine unlearning completeness is a model risk metric: what percentage of erasure requests have been fulfilled within regulatory timeframe?
SISA architecture must be documented in model risk documentation: which shards, shard size, retraining SLA.

Human Approval Checkpoints

Purpose extension (using data element for new AI use case) requires Privacy Officer approval.
Special category data (health, political opinion, ethnicity) processing for AI requires Data Protection Impact Assessment (DPIA) and DPO sign-off.
Full model retrain triggered by erasure must be reviewed by ML lead and Privacy Officer before deployment.

Governance Artefacts

Artefact	Owner	Cadence	Purpose
Purpose-Mapped Schema Registry	Privacy + Data Engineering	On change	Documents permitted uses per data element; basis for OPA policies
Consent Propagation Audit Log	Privacy Platform	Continuous	Proves consent withdrawal was propagated within required timeframe
Erasure Fulfilment Register	DPO	Per request	Tracks erasure requests; impact assessment outcome; fulfilment action; confirmation
DPIA for AI Use Case	DPO	Per new high-risk AI use case	Structured privacy impact assessment; legal basis; mitigation measures
Pseudonymisation Key Audit Log	Security	Continuous	Records every key usage; supports re-identification prohibition enforcement

10. Operational Considerations

Monitoring

Metric	Alert Threshold	Tooling
Consent propagation lag (withdrawal to feature store update)	>5 minutes	Kafka consumer lag metrics
Erasure request fulfilment SLA	>30 days (GDPR and Privacy Act)	DSR management system
Purpose limitation violation rate	Any violation	OPA audit log + Grafana
Minimisation gate rejection rate (unexpected spike)	>10% above baseline	Pipeline metrics
Pseudonymisation service error rate	>0.1%	Service health metrics

SLOs

SLO	Target	Measurement
Consent withdrawal propagation	<5 minutes to feature store + inference service	Event timestamp comparison
Erasure request fulfilment	<30 days (GDPR and Privacy Act)	DSR register
SISA shard retrain completion	<24 hours from erasure trigger	Training pipeline logs
Minimisation gate latency overhead	<100ms per record batch	Pipeline timing
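
The "event timestamp comparison" used to measure the consent propagation SLO can be sketched as follows. The event field names (`withdrawn_at`, `applied_at`) and the sample timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta

PROPAGATION_SLO = timedelta(minutes=5)  # target from the SLO table


def propagation_lag(event: dict) -> timedelta:
    """Time between consent withdrawal and the downstream state update."""
    return event["applied_at"] - event["withdrawn_at"]


def slo_breaches(events: list) -> list:
    """Return the subjects whose withdrawal took longer than the SLO."""
    return [e["subject"] for e in events if propagation_lag(e) > PROPAGATION_SLO]


t0 = datetime(2025, 3, 1, 9, 0, 0)
events = [
    {"subject": "p_1", "withdrawn_at": t0, "applied_at": t0 + timedelta(seconds=40)},
    {"subject": "p_2", "withdrawn_at": t0, "applied_at": t0 + timedelta(minutes=7)},
    {"subject": "p_3", "withdrawn_at": t0, "applied_at": t0 + timedelta(minutes=5)},
]

breaches = slo_breaches(events)
print(breaches)
```

In this sketch a lag of exactly five minutes is not counted as a breach; whether the boundary is inclusive is a policy choice to settle with the privacy team.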

11. Cost Considerations

Cost Drivers

Cost Driver	Typical Range	Notes
Consent Management Platform	$1,000–$8,000/month	OneTrust/Didomi SaaS; scales with consent volume
Pseudonymisation Service	$100–$500/month	Lightweight compute; KMS key operations cost
SISA architecture overhead	20–40% training cost increase	More training jobs (per-shard); offset by faster erasure
Privacy Propagation Bus	$100–$1,000/month	Kafka/EventBridge; scales with consent event volume
Engineering (privacy controls maintenance)	0.5–1 FTE	Ongoing consent integration, purpose tag management

Optimisations

Batch consent withdrawal processing during off-peak hours to reduce Kafka throughput requirements.
Use SISA architecture to reduce full-retrain costs for erasure; size shards to balance erasure cost vs. training overhead.
Implement lazy pseudonymisation: pseudonymise at feature extraction time, not at ingestion, to avoid re-pseudonymising historical data on key rotation.

Indicative Cost Range

Scale	Monthly Cost	Basis
Small (1–3 AI use cases, <100K data subjects)	$2,000–$6,000	Lightweight consent platform + basic SISA
Medium (5–10 use cases, 1M data subjects)	$6,000–$20,000	Full consent platform + SISA + privacy bus
Large (20+ use cases, 10M+ data subjects)	$20,000–$80,000	Enterprise consent + high-throughput bus + automated SISA

12. Trade-Off Analysis

Option Comparison

Option	Pros	Cons	Recommended When
A: Full privacy-by-design pipeline (this pattern)	Regulatory-grade; defensible posture; enables sensitive data AI	High setup cost; SISA adds training complexity	Regulated industry; sensitive personal data; GDPR/Privacy Act obligation
B: Point-in-time anonymisation before AI ingestion	Simpler; one-time effort	Cannot honour ongoing consent withdrawal; destroys join-ability	Static, historical datasets; no ongoing consent management needed
C: Pseudonymisation only (no purpose limitation/consent propagation)	Simple; preserves some privacy	Incomplete: doesn't address purpose limitation or right-to-erasure	Partial compliance in low-risk contexts only
D: Compliance-as-documentation (policy without technical enforcement)	Near-zero cost	Fails regulatory audit; high breach risk; unenforceable	Only for pilot/experimental AI with no personal data

Architectural Tensions

Tension	Trade-Off	Resolution
Pseudonymisation consistency vs. key rotation	Consistent pseudonyms needed for joins; key rotation breaks consistency	Re-pseudonymise historical data on rotation; test joins before rotation completes
SISA shard granularity vs. privacy guarantee	Large shards = faster training; smaller shards = faster erasure	Size shards based on expected erasure request frequency; 1,000–10,000 records/shard typical
Consent enforcement strictness vs. inference availability	Strict consent check → some subjects get no predictions → business value loss	Communicate consent requirements to subjects; design graceful decline UX
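
The shard-size tension above can be made concrete with simple arithmetic. This sketch assumes equally sized shards, one erasure affecting exactly one shard, and retraining cost proportional to the number of records retrained; it ignores slicing and aggregation overhead. The `shard_plan` helper is hypothetical.

```python
import math


def shard_plan(total_records: int, shard_size: int) -> dict:
    """Estimate the retraining work triggered by a single erasure."""
    num_shards = math.ceil(total_records / shard_size)
    retrained = min(shard_size, total_records)
    return {
        "num_shards": num_shards,
        "records_retrained_per_erasure": retrained,
        "fraction_of_full_retrain": retrained / total_records,
    }


# 1M records at the upper end of the typical range (10,000 records/shard).
large_shards = shard_plan(1_000_000, 10_000)
# Same data at the lower end (1,000 records/shard): more jobs, cheaper erasure.
small_shards = shard_plan(1_000_000, 1_000)

print(large_shards)
print(small_shards)
```

Smaller shards cut the work per erasure but multiply the number of training jobs, which is the overhead side of the trade-off.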

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Consent withdrawal not propagated (bus failure)	Medium	Critical — inference continues for withdrawn subject	Bus lag monitoring; consent state verification	Fail-safe: inference denies if consent state stale >threshold
Purpose limitation bypass (governance plane down)	Low	High — data used outside permitted purpose	Governance plane health check; pipeline alert on bypass	Fail-safe: block feature assembly if governance plane unavailable
Erasure request not fulfilled within statutory period	Medium	High — GDPR/Privacy Act violation	DSR register SLA monitoring	Escalation to DPO; accelerated retrain; regulatory notification if SLA breached
Re-identification via auxiliary data join	Low	Critical — pseudonymisation circumvented	Quarterly re-identification risk assessment	Restrict auxiliary data availability; enforce purpose-mapped schema
SISA retrain produces worse model	Medium	Medium — erasure degrades model quality	Hold-out validation of the aggregated model after shard retrain	Accept quality trade-off (privacy obligation overrides quality preference); document in model card

14. Regulatory Considerations

Regulation	Article/Clause	Requirement	Pattern Response
GDPR	Article 5(1)(b)	Purpose limitation	OPA-enforced purpose tags per data element
GDPR	Article 5(1)(c)	Data minimisation	Minimisation gate at collection
GDPR	Article 17	Right to erasure	SISA architecture + erasure fulfilment register
GDPR	Article 25	Privacy by design and by default	This entire pattern is the Article 25 implementation
GDPR	Article 35	Data Protection Impact Assessment for high-risk AI	DPIA governance artefact per use case
Privacy Act (Australia)	APP 1/3/6/11	Privacy management; collection; use; security	Full pattern addresses all four APPs
EU AI Act	Article 10(5)	Processing of sensitive data for bias detection	Purpose limitation + consent gates allow narrow exception
ISO 27701	§7.4	Privacy controls in information systems	Controls 1–5 map to ISO 27701 privacy controls
ISO 42001	§6.1	AI risk management (privacy dimension)	DPIA process maps to ISO 42001 risk assessment

15. Reference Implementations

AWS

Component	AWS Service
Minimisation gate	AWS Glue schema enforcement + custom Lambda
Pseudonymisation	AWS Encryption SDK + AWS KMS
Purpose tags	Lake Formation column tags + OPA on Lambda
Consent platform	OneTrust (SaaS) + EventBridge integration
Privacy bus	Amazon EventBridge
Consent state store	Amazon ElastiCache (Redis)
SISA retraining	SageMaker Training Jobs (shard-partitioned)

Azure

Component	Azure Service
Pseudonymisation	Azure Key Vault + custom Python service
Purpose tags	Azure Purview sensitivity labels + OPA on AKS
Consent platform	Didomi (SaaS) + Azure Service Bus
Privacy bus	Azure Service Bus
SISA retraining	Azure ML Training (shard-partitioned)

GCP

Component	GCP Service
Pseudonymisation	Cloud DLP de-identification + Cloud KMS
Purpose tags	Dataplex policy tags + OPA on Cloud Run
Consent platform	OneTrust / custom on Cloud Run
Privacy bus	Google Pub/Sub
SISA retraining	Vertex AI Training

On-Premises

Component	Technology
Pseudonymisation	HashiCorp Vault transit secrets engine
Purpose tags	OPA + custom purpose registry
Consent platform	Custom PostgreSQL + Kafka integration
Privacy bus	Apache Kafka
SISA retraining	PyTorch Lightning (shard management) on Kubernetes

16. Related Patterns

Pattern	ID	Relationship	Notes
Synthetic Data Generation	EAAPL-DAT004	Complements	Synthetic data reduces need for real personal data in AI training
Data Lineage for AI	EAAPL-DAT003	Enables	Lineage identifies all models trained on data subject to erasure request
AI Training Data Governance	EAAPL-DAT007	Depends on	Consent records are training data governance artefacts
AI Data Mesh Integration	EAAPL-DAT001	Complements	Consent scope enforcement is a governance plane responsibility
Human Approval Gateway	EAAPL-HIL001	Complements	DPIA review and SISA retrain decisions are human approval gates

17. Maturity Assessment

Overall Maturity: Proven — Pseudonymisation, consent management, and purpose limitation are mature practices. Machine unlearning (SISA) is a more recent but production-proven technique, particularly in EU GDPR-regulated environments.

Dimension	Score (1–5)	Notes
Architectural clarity	5	Five controls clearly defined with implementation detail
Tooling maturity	4	KMS, consent platforms mature; SISA tooling maturing
Regulatory alignment	5	Strongest GDPR Art. 17/25 alignment of any pattern
Operational complexity	3	SISA training management and consent propagation require ongoing attention
Cost efficiency	3	Consent platform + SISA overhead significant; offset by regulatory risk reduction
Security	5	KMS-managed pseudonymisation; fail-safe consent enforcement

18. Revision History

Version	Date	Author	Changes
1.0	2023-12-01	EAAPL Working Group	Initial publication
1.1	2024-06-15	EAAPL Working Group	Added SISA machine unlearning architecture; EU AI Act Art. 10(5)
1.2	2025-03-01	EAAPL Working Group	Updated Privacy Act (Australia) alignment; expanded failure modes
