[EAAPL-DAT004] Synthetic Data Generation

Category: Data Architecture
Sub-category: Synthetic Data / Privacy-Preserving AI
Version: 1.2
Maturity: Proven
Tags: synthetic-data, GAN, VAE, differential-privacy, k-anonymity, privacy-validation, utility-validation
Regulatory Relevance: GDPR Article 5, Privacy Act Australia APP 3/11, EU AI Act Article 10, APRA CPS 234, ISO 42001 §8.4

1. Executive Summary

Many of the most valuable enterprise AI use cases — fraud detection, clinical risk prediction, credit underwriting — require training on data that organisations cannot freely share: patient records, account transactions, personally identifiable information. Synthetic data generation creates statistically faithful, privacy-preserving datasets that can be used for AI training, testing, and sharing without exposing real individuals.

This pattern defines a production synthetic data generation pipeline covering three generation techniques (GAN/VAE/statistical), privacy validation (differential privacy, k-anonymity, membership inference testing), utility validation (statistical fidelity, downstream model performance parity), and the regulatory framework for accepting synthetic data in lieu of real data for AI training and testing.

Organisations that implement this pattern have unlocked AI use cases previously blocked by privacy constraints, accelerated model development cycle times by 40–60% through unrestricted test data availability, and reduced privacy incident risk in development and testing environments.

Target audience: Chief Privacy Officers, Chief Data Officers, ML Platform leads, Data Science leads.

2. Problem Statement

Business Problem

AI programmes are blocked or slowed by the inability to use real data outside production environments. Development and test environments cannot receive production data because of privacy regulation; third-party data science partners cannot access customer data; and cross-border data transfer restrictions prevent AI development in global teams.

Technical Problem

Real patient/customer/financial records cannot legally be used in development, test, or partner-facing environments.
Data anonymisation (masking, tokenisation) degrades statistical relationships, destroying the signal AI models need.
Small datasets for rare events (fraud, rare diseases) are insufficient for model training.
Class imbalance in real data requires augmentation techniques that preserve statistical properties.
Testing AI edge cases with real data creates production data exposure in test environments.

Symptoms

AI development cycle stalled waiting for "data access approval" that may never come.
Test environments using obviously fake data that does not reflect real statistical patterns, so models that pass testing fail in production.
Third-party data science partners blocked from receiving any data.
Rare event classes (fraud, rare diseases) under-represented in training data, degrading model recall.
Privacy incidents caused by real production data in development environments.

Cost of Inaction

Dimension	Impact
Velocity	AI use cases delayed 6–18 months waiting for data access approval
Privacy risk	Real PII in dev/test environments creates regulatory exposure and breach risk
Model quality	Artificially balanced or anonymised datasets produce worse models
Competitive	Data-rich competitors accelerate AI while your organisation waits for approvals

3. Context

When to Apply

AI training or testing requires data that cannot be shared due to privacy, regulatory, or contractual restrictions.
Small training datasets need augmentation (class imbalance; rare events).
Development/test environments need representative data without PII.
Third-party partners (model vendors, data scientists) need data to work with.
Cross-border transfer restrictions prevent sharing real data across jurisdictions.

When NOT to Apply

Real data is freely available and shareable (no privacy constraint) — synthetic data adds cost and validation overhead with no benefit.
The AI use case requires exact real-world records (e.g., training on specific known fraud patterns that must be preserved exactly).
Synthetic data utility validation cannot be performed (no access to real data even for validation).
The risk of generated data containing memorised real records is unacceptable (very small original datasets).

Prerequisites

Prerequisite	Minimum Viable	Preferred
Source data access	Sample of real data for training generator	Full production dataset with proper access controls
Generation tooling	SDV (Synthetic Data Vault), Faker + statistical	Dedicated GAN/VAE pipeline; enterprise synthetic data platform
Privacy validation	k-anonymity check	Differential privacy budget tracking + membership inference testing
Utility validation	Basic statistical comparison	Downstream model performance parity testing
Legal sign-off	Privacy team review	External privacy counsel opinion for regulatory-sensitive use cases

Industry Applicability

Industry	Applicability	Driver
Healthcare	Critical	Patient privacy; clinical AI training data scarcity
Financial Services	High	PCI DSS; APRA; customer data privacy; fraud model training
Insurance	High	Actuarial data privacy; claims data restrictions
Government	High	Privacy Act; sensitive citizen data
Retail	Medium	Customer purchase history; personalisation model testing
Telecommunications	Medium	Call records; network data; churn model development

4. Architecture Overview

Design Philosophy

Synthetic data generation is not a single technique; it is a pipeline with four stages: generation, privacy validation, utility validation, and certified publication. Skipping any stage creates either privacy risk (insufficiently private synthetic data) or utility failure (synthetic data that yields models whose performance falls short of models trained on real data).

Generation Techniques. The pattern supports three generation approaches, selected based on data type and privacy requirements:

Statistical/parametric synthesis uses marginal and joint distributions estimated from real data (for example, SDV's GaussianCopula synthesizer). It is computationally cheap and interpretable but may not capture complex non-linear feature dependencies. Best for tabular data with moderate complexity.
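To make the statistical approach concrete, the following is a minimal sketch of Gaussian copula sampling for two numeric columns, using only the Python standard library. It is an illustration under stated assumptions, not SDV's implementation: the function names, the two-column restriction, and the toy age/income data are all hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for CDF and inverse CDF


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def normal_scores(values):
    """Map each value to a standard-normal score via its rank (empirical CDF)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        scores[idx] = STD_NORMAL.inv_cdf((rank + 0.5) / n)  # stays inside (0, 1)
    return scores


def empirical_quantile(sorted_values, u):
    """Value at quantile u in (0, 1) of an already sorted column."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def sample_gaussian_copula(col_a, col_b, n_samples, seed=0):
    """Draw synthetic (a, b) pairs that keep the marginals and the dependence."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        # Second normal draw correlated with the first at level rho.
        z2 = rho * z1 + math.sqrt(max(0.0, 1.0 - rho * rho)) * rng.gauss(0.0, 1.0)
        rows.append((empirical_quantile(sorted_a, STD_NORMAL.cdf(z1)),
                     empirical_quantile(sorted_b, STD_NORMAL.cdf(z2))))
    return rows


# Toy "real" data (hypothetical): income rises with age, plus noise.
_rng = random.Random(42)
real_age = [_rng.uniform(20, 70) for _ in range(500)]
real_income = [1000 * a + _rng.gauss(0, 8000) for a in real_age]

synthetic = sample_gaussian_copula(real_age, real_income, n_samples=500, seed=1)
synthetic_corr = pearson([a for a, _ in synthetic], [b for _, b in synthetic])
```

Note that this sketch reuses observed values for the marginals, so individual cell values from the real data appear in the output. A production generator would typically fit parametric or smoothed marginals, and the result would still need to pass the privacy validation described in this pattern.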

Variational Autoencoders (VAE) learn a continuous latent space representation of the data and sample from it. VAEs are effective for tabular data with complex correlations and naturally support conditional generation (generate samples with a specific class label). They are faster to train than GANs and produce more stable outputs.

Generative Adversarial Networks (GAN) — specifically CTGAN and TabularGAN — train a generator/discriminator pair to produce synthetic records indistinguishable from real. GANs produce the highest-fidelity synthetic data but are prone to mode collapse (under-representing some data regions) and are computationally expensive. Best for high-fidelity requirements where computational budget is available.

Privacy Validation — Three Layers. No single privacy metric is sufficient:

k-anonymity and l-diversity check that every combination of quasi-identifier values is shared by at least k records, and that each such group contains at least l distinct sensitive values, so that no synthetic record singles out an identifiable individual in the original dataset. These checks are necessary but not sufficient.
Differential privacy (DP) provides a mathematical privacy guarantee: the probability of producing any given output changes by at most a factor of e^ε (plus a small failure probability δ in the (ε, δ) variant that DP-SGD provides) whether or not any single individual's record was included in the training data. DP is applied during GAN/VAE training through DP-SGD, which clips per-example gradients and adds calibrated noise to each update. The privacy budget (ε) is tracked and reported; typical production thresholds range from ε ≤ 1 (high privacy) to ε ≤ 10 (moderate privacy).
Membership inference attack testing trains an adversarial classifier to determine whether a specific real record was in the training data. On a balanced set of member and non-member records, attack accuracy near 50% (random chance) indicates that the synthetic data leaks little membership information. Attack accuracy significantly above 50% indicates that the synthetic data leaks real record membership.
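The membership inference layer can be illustrated with a deliberately simple attack. The sketch below substitutes a distance-to-closest-record threshold attack for the trained adversarial classifier described above; that substitution, the function names, and the toy two-dimensional records are assumptions made for illustration only.

```python
import math
import random


def nearest_distance(record, synthetic):
    """Euclidean distance from one real record to its closest synthetic record."""
    return min(math.dist(record, s) for s in synthetic)


def membership_attack_accuracy(members, non_members, synthetic):
    """Guess 'member' when a record sits unusually close to the synthetic data.

    members     : real records used to train the generator
    non_members : real records held out from generator training
    Returns attack accuracy on the combined (ideally balanced) set.
    """
    scored = [(nearest_distance(r, synthetic), True) for r in members]
    scored += [(nearest_distance(r, synthetic), False) for r in non_members]
    distances = sorted(d for d, _ in scored)
    threshold = distances[len(distances) // 2]  # median distance as the cut-off
    correct = sum((d < threshold) == is_member for d, is_member in scored)
    return correct / len(scored)


_rng = random.Random(7)


def draw(n):
    """Toy numeric records (hypothetical data)."""
    return [(_rng.gauss(0, 1), _rng.gauss(0, 1)) for _ in range(n)]


members, non_members = draw(100), draw(100)

# A generator that memorised its training data copies members verbatim.
leaky_accuracy = membership_attack_accuracy(members, non_members, list(members))

# A generator that only learned the distribution emits fresh draws.
independent_accuracy = membership_attack_accuracy(members, non_members, draw(100))
```

In this toy setup the memorising generator is fully exposed by the attack, while the independent generator leaves the attacker close to random guessing. In the pipeline, the measured accuracy would be compared against the 55% target used by the privacy gate.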

Utility Validation. Privacy and utility trade off: more privacy noise reduces data fidelity. Utility validation measures this trade-off across three dimensions:

Statistical fidelity: Compare marginal distributions (KS test), pairwise correlations, and higher-order statistics between real and synthetic datasets.
Train-on-Synthetic, Test-on-Real (TSTR): Train an AI model on synthetic data; evaluate on real data. Compare AUC/F1 against a model trained on real data. A TSTR performance ratio ≥ 0.90 indicates high utility.
Train-on-Real, Test-on-Synthetic (TRTS): Train on real, test on synthetic — validates that synthetic data represents the same distribution as real.
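The first two utility dimensions can be sketched as follows. This is a simplified illustration, assuming a single numeric feature, a midpoint-threshold classifier, and accuracy in place of AUC/F1; all function names and the toy data are hypothetical.

```python
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def fit_threshold(rows):
    """Toy classifier: threshold halfway between the two class means.

    rows is a list of (feature, label) pairs with labels 0 or 1, and the
    positive class is assumed to have the larger feature values.
    """
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, rows):
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)


def tstr_ratio(real_train, synthetic_train, real_test):
    """Score of the synthetic-trained model divided by the real-trained model,
    both evaluated on the same held-out real data."""
    real_score = accuracy(fit_threshold(real_train), real_test)
    synthetic_score = accuracy(fit_threshold(synthetic_train), real_test)
    return synthetic_score / real_score


_rng = random.Random(0)


def draw_labelled(n):
    """Toy data: class 1 is centred at 3.0, class 0 at 0.0."""
    rows = []
    for _ in range(n):
        label = _rng.randint(0, 1)
        rows.append((_rng.gauss(3.0 * label, 1.0), label))
    return rows


real_train, real_test = draw_labelled(400), draw_labelled(400)
synthetic_train = draw_labelled(400)  # stands in for a good generator's output

fidelity_gap = ks_statistic([x for x, _ in real_train],
                            [x for x, _ in synthetic_train])
ratio = tstr_ratio(real_train, synthetic_train, real_test)
```

A small KS statistic per feature and a TSTR ratio at or above 0.90 are the signals the utility gate looks for. A real harness would repeat this per feature and with the actual downstream model type.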

A synthetic dataset is certified for AI training only when both privacy validation and utility validation pass their thresholds.
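The certification rule can be expressed as a pair of gates. The sketch below is one possible encoding, assuming the thresholds quoted in this pattern (ε ≤ 10, attack accuracy ≤ 55%, TSTR ratio ≥ 0.90); the class names, field names, and the minimum k of 5 are hypothetical and not specified by the pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationResult:
    min_group_size: int      # smallest quasi-identifier group (k)
    epsilon: float           # cumulative privacy budget spent
    attack_accuracy: float   # membership inference attack accuracy
    tstr_ratio: float        # train-on-synthetic, test-on-real ratio


@dataclass(frozen=True)
class Thresholds:
    min_k: int = 5                     # hypothetical default
    max_epsilon: float = 10.0          # moderate-privacy ceiling
    max_attack_accuracy: float = 0.55
    min_tstr_ratio: float = 0.90


def evaluate_gates(result, thresholds=Thresholds()):
    """Return pass/fail for each gate plus the reasons for any failure."""
    privacy_failures = []
    if result.min_group_size < thresholds.min_k:
        privacy_failures.append("k-anonymity below minimum")
    if result.epsilon > thresholds.max_epsilon:
        privacy_failures.append("privacy budget exceeded")
    if result.attack_accuracy > thresholds.max_attack_accuracy:
        privacy_failures.append("membership inference accuracy too high")

    utility_failures = []
    if result.tstr_ratio < thresholds.min_tstr_ratio:
        utility_failures.append("TSTR ratio below minimum")

    return {
        "privacy_pass": not privacy_failures,
        "utility_pass": not utility_failures,
        "certified": not privacy_failures and not utility_failures,
        "failures": privacy_failures + utility_failures,
    }


good = evaluate_gates(ValidationResult(8, 3.0, 0.52, 0.94))
leaky = evaluate_gates(ValidationResult(8, 3.0, 0.71, 0.97))
noisy = evaluate_gates(ValidationResult(8, 0.5, 0.50, 0.78))
```

Keeping the failure reasons alongside the boolean outcome lets the pipeline decide whether to regenerate with more noise (privacy failure) or to revisit the privacy-utility trade-off (utility failure).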

5. Architecture Diagram

flowchart TD
    subgraph Input["Source Data"]
        A[Real Restricted Dataset]
        B{Generator Selection}
    end
    subgraph Validation["Privacy and Utility Validation"]
        C[Privacy Validation]
        D{Privacy Gate}
        E[Utility Validation]
        F{Utility Gate}
    end
    subgraph Output["Certified Publication"]
        G[(Synthetic Data Catalogue)]
        H[Approved Consumers]
    end
    A --> B
    B --> C
    C --> D
    D -->|fail| B
    D -->|pass| E
    E --> F
    F -->|fail| B
    F -->|pass| G
    G --> H
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f3e8ff,stroke:#a855f7
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f3e8ff,stroke:#a855f7
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Data Profiler	Processing	Analyses source data distributions, correlations, and data types to configure generator	YData Profiling, Pandas Profiling, custom	High
Statistical Generator	ML Model	Parametric synthesis using marginal + joint distributions	SDV (GaussianCopula, CopulaGAN), Faker	Medium
VAE Generator	ML Model	Latent space sampling for complex tabular data	Custom PyTorch VAE, SDV TVAE	High
GAN Generator (with DP-SGD)	ML Model	High-fidelity synthesis with differential privacy training	CTGAN + Opacus DP-SGD, Gretel.ai, YData	High
k-Anonymity / l-Diversity Checker	Processing	Tests that no synthetic record is uniquely re-identifiable	Custom Python + ARX library	Critical
Differential Privacy Budget Tracker	Processing	Accounts for total privacy cost (ε); validates DP-SGD parameters	Opacus, Google DP library, custom tracker	Critical
Membership Inference Attack Tester	Processing	Adversarial attack simulation to test privacy leakage	Adversarial Robustness Toolbox (ART), custom	High
Statistical Fidelity Validator	Processing	KS test, chi-squared, correlation comparison between real and synthetic	scipy, custom Python, SDV metrics	High
TSTR / TRTS Validator	Processing	Downstream model performance comparison	Custom ML evaluation harness	Critical
Synthetic Data Certificate	Artefact	Machine-readable certificate with privacy + utility scores, usage policy	JSON schema, stored in data catalogue	High
Synthetic Data Catalogue	Storage + Discovery	Governs synthetic dataset publication, access control, expiry	DataHub, Atlan, custom catalogue	High
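As an illustration of the Synthetic Data Certificate component, the sketch below builds a machine-readable certificate as JSON. The pattern does not define the schema, so every field name, the 90-day validity period, and the example values are assumptions.

```python
import json
from datetime import date, timedelta


def build_certificate(dataset_id, generator, privacy, utility, usage_policy,
                      issued, valid_days=90):
    """Assemble a certificate record ready to store in the data catalogue."""
    return {
        "dataset_id": dataset_id,
        "generator": generator,
        "privacy": privacy,
        "utility": utility,
        "usage_policy": usage_policy,
        "issued": issued.isoformat(),
        "expires": (issued + timedelta(days=valid_days)).isoformat(),
    }


certificate = build_certificate(
    dataset_id="claims-synth-v3",            # hypothetical identifier
    generator={"type": "VAE", "dp_sgd": True},
    privacy={"epsilon": 3.0, "min_k": 8, "attack_accuracy": 0.52},
    utility={"tstr_ratio": 0.94, "trts_ratio": 0.96},
    usage_policy={"tier": "internal", "permitted_use": ["model-training"]},
    issued=date(2025, 3, 1),
)

certificate_json = json.dumps(certificate, indent=2, sort_keys=True)
```

Serialising with sorted keys keeps the stored certificate stable across runs, which helps when the validation results are stored immutably alongside a dataset version.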

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Data Profiler	Analyses real dataset; extracts statistical profile	Data profile (distributions, correlations, data types)
2	Generator Selection	Evaluates data profile + privacy requirements → selects generation approach	Generator configuration
3	Synthetic Generator	Trains on real data (with DP-SGD if required); generates synthetic dataset	Raw synthetic dataset
4	k-Anonymity Checker	Tests uniqueness of synthetic records against real dataset	k-anonymity score; l-diversity score
5	DP Budget Tracker	Verifies DP-SGD parameters; computes cumulative ε budget	Privacy budget report (ε value)
6	Membership Inference Tester	Trains attack classifier; measures attack accuracy	Attack accuracy score (target: ≤55%)
7	Privacy Gate	Evaluates all three privacy checks; passes or rejects	Pass/fail + privacy validation report
8	Statistical Fidelity Validator	Compares distributions, correlations between real and synthetic	Statistical fidelity scores per feature
9	TSTR / TRTS Validator	Trains models on synthetic/real; evaluates cross-performance	TSTR ratio; TRTS ratio
10	Utility Gate	Evaluates utility metrics; passes or triggers regeneration	Pass/fail + utility validation report
11	Certification	Generates Synthetic Dataset Certificate; publishes to catalogue	Certified synthetic dataset with usage policy
12	Approved Consumer	Accesses synthetic dataset via catalogue; uses for AI training/testing	AI model trained on synthetic data
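Step 4 of the primary flow can be sketched as two small checks: the smallest quasi-identifier group in the synthetic data, and the share of synthetic rows that reproduce a real row exactly. This is a minimal illustration, not the ARX library's approach; the function names, column names, and toy rows are hypothetical.

```python
from collections import Counter


def min_group_size(rows, quasi_identifiers):
    """Smallest number of rows sharing one combination of quasi-identifiers.

    A dataset is k-anonymous for any k up to this value.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


def exact_match_rate(synthetic_rows, real_rows):
    """Fraction of synthetic rows that are verbatim copies of a real row."""
    real = {tuple(sorted(r.items())) for r in real_rows}
    hits = sum(tuple(sorted(s.items())) in real for s in synthetic_rows)
    return hits / len(synthetic_rows)


real_rows = [
    {"age_band": "30-39", "postcode": "2000", "diagnosis": "A"},
    {"age_band": "30-39", "postcode": "2000", "diagnosis": "B"},
    {"age_band": "40-49", "postcode": "3000", "diagnosis": "A"},
]
synthetic_rows = [
    {"age_band": "30-39", "postcode": "2000", "diagnosis": "A"},  # copies a real row
    {"age_band": "30-39", "postcode": "2000", "diagnosis": "C"},
    {"age_band": "40-49", "postcode": "3000", "diagnosis": "B"},
    {"age_band": "40-49", "postcode": "3000", "diagnosis": "C"},
]

k = min_group_size(synthetic_rows, ["age_band", "postcode"])
copied = exact_match_rate(synthetic_rows, real_rows)
```

Exact matches are not always a leak for low-cardinality data, where common combinations occur naturally, so a real checker would weigh matches against how rare the combination is in the source data.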

Error Flow

Error Condition	Trigger	Response	Recovery
Privacy gate failure (membership inference >55%)	Attack accuracy too high	Synthetic dataset rejected; regenerate with higher DP noise (lower ε)	Increase DP-SGD noise multiplier; regenerate; re-run privacy validation
Utility gate failure (TSTR ratio <0.90)	Synthetic data too noisy for useful AI training	Synthetic dataset rejected; privacy-utility trade-off re-evaluated	Increase training epochs (under DP-SGD, additional epochs consume more privacy budget); adjust ε budget; consider less strict privacy target
Mode collapse (GAN produces limited variety)	Generator produces repetitive records	GAN training failure detected by diversity metric	Switch to VAE generator; adjust GAN hyperparameters
Source data access revoked before validation	Real data access removed mid-pipeline	Pipeline paused; cannot complete utility validation	Resume with new data access grant; or use previously validated synthetic version
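The recovery action "increase DP-SGD noise multiplier" refers to the noise added in each DP-SGD update. The sketch below shows the core of one such update in plain Python: clip each per-example gradient to a maximum L2 norm, sum, add Gaussian noise scaled by the noise multiplier, and average. It is a teaching sketch with hypothetical function names, not the Opacus implementation.

```python
import math
import random


def clip(gradient, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]


def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Noisy average gradient for one DP-SGD step.

    Noise standard deviation is noise_multiplier * max_norm, so raising the
    multiplier strengthens privacy (lower epsilon) at the cost of a noisier
    update.
    """
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients (norms 5.0 and 0.5)

# With the noise multiplier at zero, only clipping is applied.
no_noise = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=0.0,
                           rng=random.Random(0))

# With noise enabled, the same batch yields a perturbed update.
with_noise = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1,
                             rng=random.Random(0))
```

Clipping bounds how much any single record can move the model, and the noise hides what remains of that influence. Converting the noise multiplier, sampling rate, and number of steps into a reported ε is the job of a privacy accountant and is not shown here.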

8. Security Considerations

Authentication & Authorisation

Real source data access for generator training is highly restricted; access logged and time-limited.
Synthetic dataset access controlled by usage policy in catalogue; different tiers for internal/partner/public.

Secrets Management

Source data credentials for generator training stored in secrets manager; not retained beyond training session.
Generator model artefacts access-controlled; a trained generator can be used to generate more synthetic data and must be treated as sensitive.

Data Classification

Generator model (trained on real data) classified at least as Confidential — it encodes statistical properties of real data.
Certified synthetic datasets classified per usage policy; may be Internal or Shareable depending on privacy validation.

Encryption

Source data encrypted at rest during generator training; access keys in KMS.
Synthetic datasets encrypted at rest; encryption may be relaxed for low-sensitivity certified datasets per policy.

Auditability

All access to source data for generation logged.
Synthetic dataset access logged per usage policy.
Privacy and utility validation results stored immutably with dataset version.

OWASP LLM Top 10 Mapping

OWASP LLM Risk	Relevance	Mitigation
LLM06: Sensitive Information Disclosure	Generator memorises and reproduces real records	Membership inference attack testing; DP-SGD bounds the influence of any single record, limiting memorisation
LLM03: Training Data Poisoning	Synthetic data with adversarial patterns used to poison AI model	Statistical fidelity validation; TSTR validation catches adversarial deviations
LLM04: Model Denial of Service	Generator attacked to produce malformed synthetic data	Input validation on generation requests; rate limiting

9. Governance Considerations

Responsible AI

Synthetic data must preserve demographic representation; if real data is biased, synthetic data may amplify bias.
Bias audit required as part of utility validation: compare demographic distributions in real vs. synthetic.
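One simple way to compare demographic distributions is the total variation distance between category proportions in the real and synthetic data. The sketch below assumes a single categorical attribute; the function names and toy values are hypothetical, and any acceptance threshold would need to be set by the organisation.

```python
from collections import Counter


def proportions(values):
    """Share of each category in a list of categorical values."""
    counts = Counter(values)
    return {category: n / len(values) for category, n in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute differences in category proportions.

    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    p, q = proportions(real_values), proportions(synthetic_values)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


real_gender = ["F", "F", "M", "M"]
synthetic_gender = ["F", "M", "M", "M"]  # under-represents "F"

gap = total_variation_distance(real_gender, synthetic_gender)
```

A category that appears in the real data but is missing from the synthetic data contributes its full real proportion to the distance, so complete loss of a demographic group is surfaced rather than hidden.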

Model Risk Management

Models trained on synthetic data must be validated on real data before production deployment.
TSTR ratio ≥ 0.90 is minimum bar; risk committee may require higher threshold for high-risk AI.

Human Approval Checkpoints

Privacy Officer must approve Synthetic Dataset Certificate before publication to external partners.
Legal counsel review required for cross-border synthetic data sharing.
Risk committee approval required for synthetic data used in high-risk AI (EU AI Act Annex III).

Governance Artefacts

Artefact	Owner	Cadence	Purpose
Synthetic Dataset Certificate	Privacy / ML Platform	Per generation run	Privacy + utility scores; ε budget; usage policy; expiry date
Privacy Validation Report	Privacy Team	Per generation run	k-anonymity, DP budget, membership inference test results
Utility Validation Report	ML Platform	Per generation run	Statistical fidelity; TSTR/TRTS ratios
Usage Policy Record	Privacy Officer	Per publication	Permitted use cases; sharing permissions; expiry; approved consumers
Generator Model Audit Log	ML Platform	Continuous	Who trained/used which generator; source data access log

10. Operational Considerations

Monitoring

Metric	Alert Threshold	Tooling
Membership inference attack accuracy	>55%	Validation pipeline output
TSTR ratio	<0.90	Validation pipeline output
DP budget cumulative ε	>configured threshold	Budget tracker
Synthetic dataset expiry	30 days before expiry	Catalogue alert
Generator training compute cost	>budget threshold	Cloud cost alert
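The cumulative ε metric can be tracked with a small ledger. The sketch below assumes basic sequential composition, in which the ε values of successive generation runs against the same source data simply add up; this is a conservative upper bound, and dedicated accountants can report tighter figures. The class and method names are hypothetical.

```python
class PrivacyBudgetExceeded(Exception):
    """Raised when a run would push cumulative epsilon past the limit."""


class PrivacyBudgetTracker:
    def __init__(self, max_epsilon):
        self.max_epsilon = max_epsilon
        self.runs = []  # list of (run_id, epsilon) already accounted for

    @property
    def spent(self):
        """Cumulative epsilon under basic sequential composition."""
        return sum(epsilon for _, epsilon in self.runs)

    def record(self, run_id, epsilon):
        """Account for one generation run, refusing it if over budget."""
        if self.spent + epsilon > self.max_epsilon:
            raise PrivacyBudgetExceeded(
                f"run {run_id} needs epsilon={epsilon}, "
                f"only {self.max_epsilon - self.spent} remains"
            )
        self.runs.append((run_id, epsilon))
        return self.spent


tracker = PrivacyBudgetTracker(max_epsilon=1.0)
tracker.record("run-001", 0.5)
tracker.record("run-002", 0.25)

try:
    tracker.record("run-003", 0.5)  # would bring the total to 1.25
    blocked = False
except PrivacyBudgetExceeded:
    blocked = True
```

Refusing the run before training starts, rather than alerting afterwards, avoids spending compute on a dataset that could never pass the privacy gate.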

SLOs

SLO	Target	Measurement
Synthetic dataset generation + validation	<24 hours end-to-end	Pipeline execution time
Synthetic dataset catalogue availability	99.9%	Availability monitor
Privacy validation completion	<4 hours	Validation pipeline time

Logging

All generation runs logged with source dataset version, generator type, privacy parameters, validation results.
Retained 7 years for regulatory compliance.

Incident Management

Privacy gate failure with external partner data → P1; Privacy Officer notified immediately.
Unexpected source data access to generate synthetic data → P1 security incident.

Disaster Recovery

Component	RTO	RPO	Strategy
Synthetic Data Catalogue	4 hours	24 hours	Database backup; synthetic datasets re-generatable
Generator Model Artefacts	8 hours	24 hours	Artefact store backup; can retrain if lost
Validation Pipeline	2 hours	N/A	Stateless; redeploy from IaC

11. Cost Considerations

Cost Drivers

Cost Driver	Typical Range	Notes
GAN/VAE training compute	$100–$5,000 per run	GPU compute; scales with dataset size; amortised across many generation runs
Privacy validation compute	$50–$500 per run	Membership inference attack training
Synthetic data storage	$10–$200/month	Modest; synthetic datasets typically smaller than real
Enterprise platform licence	$2,000–$20,000/month	Gretel.ai, Mostly AI, YData enterprise
Legal / privacy review	$5,000–$20,000 per use case	One-time for new use case type; ongoing for regulatory changes

Optimisations

Use open-source SDV or CTGAN for initial synthetic data; move to enterprise platform only when scale demands.
Cache trained generators; regenerate synthetic data without retraining if source distribution unchanged.
Run membership inference testing on a sample rather than full synthetic dataset.

Indicative Cost Range

Scale	Monthly Cost	Basis
Small (1–3 use cases, monthly generation)	$500–$3,000	SDV OSS + custom validation + light storage
Medium (5–10 use cases, weekly generation)	$3,000–$15,000	CTGAN + validation pipeline + Gretel.ai OSS
Large (20+ use cases, daily generation, external sharing)	$15,000–$60,000	Enterprise platform + legal + comprehensive validation

12. Trade-Off Analysis

Option Comparison

Option	Pros	Cons	Recommended When
A: Full privacy-validated synthetic data pipeline (this pattern)	Mathematically sound privacy; high utility; regulatory-acceptable	High setup cost; DP reduces data utility; requires real data for generator training	Regulated industry; external data sharing; high-risk AI training
B: Statistical anonymisation (masking/tokenisation)	Simple; no generator training needed	Destroys statistical relationships; models trained on anonymised data perform poorly	Low-complexity AI; non-statistical test data
C: Rule-based test data generation (Faker)	Zero privacy risk; instant	No statistical fidelity; useless for ML model training	Functional software testing only; not ML
D: Commercial synthetic data platform (Mostly AI, Gretel)	Best-in-class fidelity and privacy; legal opinion packages	High cost; vendor dependency	Enterprise at scale; legal opinion needed; limited internal ML capacity

Architectural Tensions

Tension	Trade-Off	Resolution
Privacy (low ε) vs. Utility (high TSTR ratio)	More DP noise → better privacy → worse utility	Tune ε per use case risk level; accept lower TSTR ratio for high-risk cases
Generation fidelity vs. training speed	GANs produce best synthetic data but are slow and unstable	Use VAE by default; GAN only when TSTR ratio requirement is very high
Internal generation vs. external platform	Internal = control and cost; external = better fidelity and legal opinion	Use internal for mature use cases; external for new/sensitive use cases

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Generator memorises outlier real records	Medium	High — privacy breach	Membership inference test	Retrain with higher DP noise; or exclude outliers from training
GAN mode collapse — synthetic data under-represents minority class	High	Medium — model trained on synthetic data misses minority class	Statistical fidelity check on class distribution	Switch to conditional VAE; oversample minority class in real data before generation
Synthetic data used beyond approved use case	Medium	High — privacy and legal violation	Usage policy enforcement in catalogue	Usage policy automated enforcement; access revocation on violation
Utility degrades after real data distribution shift	Medium	Medium — TSTR ratio drops; old synthetic data used for new models	Periodic re-validation of existing synthetic datasets	Trigger regeneration on source distribution drift detection

14. Regulatory Considerations

Regulation	Requirement	Pattern Response
GDPR Article 5(1)(b)	Purpose limitation — data used only for specified purposes	Synthetic dataset usage policy enforces purpose limitation
GDPR Recital 26	Data from which individuals can still be re-identified is not anonymous	Membership inference testing + k-anonymity checks provide evidence that the synthetic data is effectively anonymised
Privacy Act (Australia) APP 3	Collection of personal information limitation	Synthetic data reduces real data collection requirements in AI development
Privacy Act (Australia) APP 11	Security of personal information	DP-SGD limits memorisation of real records; generator model access controlled
EU AI Act Article 10(2)(f)	Examination of data sets for possible biases	Bias distribution comparison in utility validation
EU AI Act Article 10(5)	Sensitive attribute processing for bias detection/correction	Utility validation includes demographic distribution comparison
APRA CPS 234	Information security, including the integrity of information assets	Privacy + utility validation certificates provide attestation of synthetic data integrity
ISO 42001 §8.4	Data governance for AI	Synthetic Dataset Certificate is a documented governance artefact

15. Reference Implementations

AWS

Component	AWS Service
Generator training compute	SageMaker Training Jobs (GPU)
Generator type	CTGAN on SageMaker + Opacus DP-SGD
Privacy validation	SageMaker Processing Jobs
Synthetic data storage	S3
Catalogue	AWS Glue Data Catalog + custom certificate store in DynamoDB

Azure

Component	Azure Service
Generator training	Azure ML Compute (GPU)
DP framework	Opacus or SmartNoise on Azure ML
Privacy / utility validation	Azure ML Pipelines
Synthetic data storage	ADLS Gen2
Catalogue	Azure Purview

GCP

Component	GCP Service
Generator training	Vertex AI Custom Training (GPU)
DP framework	Google DP library + Opacus
Validation	Vertex AI Pipelines
Storage	GCS
Catalogue	Google Dataplex

On-Premises

Component	Technology
Generator	CTGAN + Opacus on GPU Kubernetes
Validation	Custom Python pipeline on Kubernetes
Storage	MinIO
Catalogue	OpenMetadata or DataHub

16. Related Patterns

Pattern	ID	Relationship	Notes
Privacy by Design for AI Data	EAAPL-DAT005	Complements	Synthetic data is a privacy-by-design technique
AI Training Data Governance	EAAPL-DAT007	Depends on	Synthetic datasets must be governed in training data registry
Data Quality for AI	EAAPL-DAT002	Complements	Utility validation aligns with quality dimension of training data
Active Learning Loop	EAAPL-HIL002	Complements	Synthetic data augments rare-class samples for annotation
Fine-Tuning Pipeline	EAAPL-MDL006	Enables	Synthetic data enables fine-tuning where real data is restricted

17. Maturity Assessment

Overall Maturity: Proven — Core synthetic data generation techniques (CTGAN, VAE, SDV) are mature and production-proven. Differential privacy integration (Opacus) is mature. Regulatory acceptance of DP-validated synthetic data is growing but jurisdiction-specific.

Dimension	Score (1–5)	Notes
Architectural clarity	4	Generation pipeline well-defined; DP parameter tuning remains specialist skill
Tooling maturity	4	CTGAN/VAE/SDV mature; enterprise platforms (Mostly AI) mature
Regulatory alignment	4	Strong GDPR alignment; EU AI Act acceptance emerging
Operational complexity	3	DP parameter tuning requires expertise; GAN training unstable
Cost efficiency	4	OSS stack cost-effective; amortised across many use cases
Security	4	DP-SGD limits memorisation; generator access controls required

18. Revision History

Version	Date	Author	Changes
1.0	2023-10-01	EAAPL Working Group	Initial publication
1.1	2024-04-15	EAAPL Working Group	Added DP-SGD framework; membership inference testing detail
1.2	2025-03-01	EAAPL Working Group	Added EU AI Act Article 10(5) alignment; updated enterprise platform options
