EAAPL: Enterprise AI Architecture Pattern Library

Synthetic Data Generation

🗄️ Data Architecture | EU AI Act | ISO/IEC 42001

[EAAPL-DAT004] Synthetic Data Generation

Category: Data Architecture
Sub-category: Synthetic Data / Privacy-Preserving AI
Version: 1.2
Maturity: Proven
Tags: synthetic-data, GAN, VAE, differential-privacy, k-anonymity, privacy-validation, utility-validation
Regulatory Relevance: GDPR Article 5, Privacy Act Australia APP 3/11, EU AI Act Article 10, APRA CPS 234, ISO/IEC 42001 Annex A.7


1. Executive Summary

Many of the most valuable enterprise AI use cases — fraud detection, clinical risk prediction, credit underwriting — require training on data that organisations cannot freely share: patient records, account transactions, personally identifiable information. Synthetic data generation creates statistically faithful, privacy-preserving datasets that can be used for AI training, testing, and sharing without exposing real individuals.

This pattern defines a production synthetic data generation pipeline covering three generation techniques (GAN/VAE/statistical), privacy validation (differential privacy, k-anonymity, membership inference testing), utility validation (statistical fidelity, downstream model performance parity), and the regulatory framework for accepting synthetic data in lieu of real data for AI training and testing.

Organisations that implement this pattern have unlocked AI use cases previously blocked by privacy constraints, shortened model development cycles by 40–60% through unrestricted test data availability, and reduced privacy incident risk in development and testing environments.

Target audience: Chief Privacy Officers, Chief Data Officers, ML Platform leads, Data Science leads.


2. Problem Statement

Business Problem

AI programmes are blocked or slowed by the inability to use real data outside production environments. Development and test environments cannot receive production data because of privacy regulation; third-party data science partners cannot access customer data; and cross-border data transfer restrictions prevent AI development in global teams.

Technical Problem

  • Real patient/customer/financial records cannot legally be used in development, test, or partner-facing environments.
  • Data anonymisation (masking, tokenisation) degrades statistical relationships, destroying the signal AI models need.
  • Small datasets for rare events (fraud, rare diseases) are insufficient for model training.
  • Class imbalance in real data requires augmentation techniques that preserve statistical properties.
  • Testing AI edge cases with real data creates production data exposure in test environments.

Symptoms

  • AI development cycle stalled waiting for "data access approval" that may never come.
  • Test environments use obviously fake data that does not reflect real statistical patterns, so models that pass testing fail in production.
  • Third-party data science partners blocked from receiving any data.
  • Rare event classes (fraud, rare diseases) under-represented in training data, degrading model recall.
  • Privacy incidents caused by real production data in development environments.

Cost of Inaction

Dimension | Impact
Velocity | AI use cases delayed 6–18 months waiting for data access approval
Privacy risk | Real PII in dev/test environments creates regulatory exposure and breach risk
Model quality | Artificially balanced or anonymised datasets produce worse models
Competitive | Data-rich competitors accelerate AI while your organisation waits for approvals

3. Context

When to Apply

  • AI training or testing requires data that cannot be shared due to privacy, regulatory, or contractual restrictions.
  • Small training datasets need augmentation (class imbalance; rare events).
  • Development/test environments need representative data without PII.
  • Third-party partners (model vendors, data scientists) need data to work with.
  • Cross-border transfer restrictions prevent sharing real data across jurisdictions.

When NOT to Apply

  • Real data is freely available and shareable (no privacy constraint) — synthetic data adds cost and validation overhead with no benefit.
  • The AI use case requires exact real-world records (e.g., training on specific known fraud patterns that must be preserved exactly).
  • Synthetic data utility validation cannot be performed (no access to real data even for validation).
  • The risk of generated data containing memorised real records is unacceptable (very small original datasets).

Prerequisites

Prerequisite | Minimum Viable | Preferred
Source data access | Sample of real data for training generator | Full production dataset with proper access controls
Generation tooling | SDV (Synthetic Data Vault), Faker + statistical | Dedicated GAN/VAE pipeline; enterprise synthetic data platform
Privacy validation | k-anonymity check | Differential privacy budget tracking + membership inference testing
Utility validation | Basic statistical comparison | Downstream model performance parity testing
Legal sign-off | Privacy team review | External privacy counsel opinion for regulatory-sensitive use cases

Industry Applicability

Industry | Applicability | Driver
Healthcare | Critical | Patient privacy; clinical AI training data scarcity
Financial Services | High | PCI DSS; APRA; customer data privacy; fraud model training
Insurance | High | Actuarial data privacy; claims data restrictions
Government | High | Privacy Act; sensitive citizen data
Retail | Medium | Customer purchase history; personalisation model testing
Telecommunications | Medium | Call records; network data; churn model development

4. Architecture Overview

Design Philosophy

Synthetic data generation is not a single technique but a pipeline with four stages: generation, privacy validation, utility validation, and certified publication. Skipping any stage creates either privacy risk (insufficiently private synthetic data) or utility failure (synthetic data that does not produce models performing on par with models trained on real data).

Generation Techniques. The pattern supports three generation approaches, selected based on data type and privacy requirements:

Statistical/parametric synthesis uses marginal and joint distributions estimated from real data (SDV's GaussianCopula, CTGAN-light). It is computationally cheap and interpretable but may not capture complex non-linear feature dependencies. Best for tabular data with moderate complexity.
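To make the statistical approach concrete, here is a minimal two-column Gaussian copula sketch using only the Python standard library. It is an illustration under stated assumptions, not SDV's implementation: the function name `fit_and_sample_copula` is hypothetical, marginals are taken from the empirical distribution, and only one pairwise correlation is modelled. Because it re-emits observed column values, it offers no privacy protection on its own.

```python
import random
from statistics import NormalDist


def fit_and_sample_copula(rows, n_samples, seed=0):
    """Two-column Gaussian copula sketch (hypothetical helper).

    Keeps each column's empirical marginal and reproduces the
    correlation between the columns' normal scores.
    """
    rng = random.Random(seed)
    std = NormalDist()
    n = len(rows)
    cols = list(zip(*rows))

    def normal_scores(col):
        # Rank-transform into (0, 1), then map to standard-normal quantiles.
        order = sorted(range(n), key=lambda i: col[i])
        scores = [0.0] * n
        for rank, i in enumerate(order):
            scores[i] = std.inv_cdf((rank + 0.5) / n)
        return scores

    z1, z2 = normal_scores(cols[0]), normal_scores(cols[1])
    denom = (sum(a * a for a in z1) * sum(b * b for b in z2)) ** 0.5
    rho = sum(a * b for a, b in zip(z1, z2)) / denom
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot

    sorted_cols = [sorted(c) for c in cols]

    def to_marginal(z, sorted_col):
        # Inverse empirical CDF: observed value at quantile Phi(z).
        idx = min(int(std.cdf(z) * n), n - 1)
        return sorted_col[idx]

    samples = []
    for _ in range(n_samples):
        a = rng.gauss(0.0, 1.0)
        b = rho * a + (1.0 - rho * rho) ** 0.5 * rng.gauss(0.0, 1.0)
        samples.append((to_marginal(a, sorted_cols[0]),
                        to_marginal(b, sorted_cols[1])))
    return samples, rho
```

The sketch captures only monotone pairwise dependence, which illustrates why this family of methods can miss complex non-linear feature dependencies.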

Variational Autoencoders (VAE) learn a continuous latent space representation of the data and sample from it. VAEs are effective for tabular data with complex correlations and naturally support conditional generation (generate samples with a specific class label). They are faster to train than GANs and produce more stable outputs.

Generative Adversarial Networks (GAN) — specifically CTGAN and TabularGAN — train a generator/discriminator pair to produce synthetic records indistinguishable from real. GANs produce the highest-fidelity synthetic data but are prone to mode collapse (under-representing some data regions) and are computationally expensive. Best for high-fidelity requirements where computational budget is available.

Privacy Validation — Three Layers. No single privacy metric is sufficient:

  1. k-anonymity and l-diversity check that every combination of quasi-identifier values is shared by at least k records and that each such group contains at least l distinct sensitive values, so that no synthetic record singles out an identifiable individual in the original dataset. These are necessary but not sufficient.
  2. Differential privacy (DP) provides a mathematical privacy guarantee: the probability of producing any given synthetic dataset changes by at most a factor of e^ε (plus a small failure probability δ) whether or not any individual record was included in the training data. DP is applied during GAN/VAE training through DP-SGD, which clips each per-sample gradient and adds calibrated Gaussian noise to the gradient updates. The privacy budget (ε) is tracked and reported; typical production thresholds are ε ≤ 1 (high privacy) to ε ≤ 10 (moderate privacy).
  3. Membership inference attack testing trains an adversarial classifier to determine whether a specific real record was in the training data. Evaluated on a balanced set of member and non-member records, attack accuracy near 50% (random chance) indicates strong privacy. If attack accuracy is significantly above 50%, the synthetic data leaks real record membership.
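The DP-SGD mechanism in the second layer can be sketched as a single update step. This is a minimal pure-Python illustration, not Opacus code: the function `dp_sgd_step` and its parameters are hypothetical names, and it does not compute ε, which in practice is left to a privacy accountant in a DP library.

```python
import math
import random


def dp_sgd_step(weights, per_sample_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update (illustrative sketch).

    1. Clip every per-sample gradient to L2 norm <= clip_norm.
    2. Sum the clipped gradients.
    3. Add Gaussian noise with std = noise_multiplier * clip_norm.
    4. Average over the batch and take a gradient step.
    """
    dim = len(weights)
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            total[j] += grad[j] * scale

    sigma = noise_multiplier * clip_norm
    batch_size = len(per_sample_grads)
    noisy_mean = [(total[j] + rng.gauss(0.0, sigma)) / batch_size
                  for j in range(dim)]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]
```

Clipping bounds how much any single record can move the model, and the noise scale is tied to that bound; a larger `noise_multiplier` corresponds to a lower ε for the same number of training steps.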

Utility Validation. Privacy and utility trade off: more privacy noise reduces data fidelity. Utility validation measures this trade-off across three dimensions:

  1. Statistical fidelity: Compare marginal distributions (KS test), pairwise correlations, and higher-order statistics between real and synthetic datasets.
  2. Train-on-Synthetic, Test-on-Real (TSTR): Train an AI model on synthetic data; evaluate on real data. Compare AUC/F1 against a model trained on real data and evaluated on the same real test set. A TSTR performance ratio (synthetic-trained score divided by real-trained score) ≥ 0.90 indicates high utility.
  3. Train-on-Real, Test-on-Synthetic (TRTS): Train on real, test on synthetic — validates that synthetic data represents the same distribution as real.
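For the statistical fidelity dimension, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical CDFs of a real and a synthetic column. The following is a small standard-library sketch (the function name `ks_statistic` is illustrative; in practice `scipy.stats.ks_2samp` also returns a p-value).

```python
def ks_statistic(real, synthetic):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap
```

A value of 0 means the two samples have identical empirical distributions; a value of 1 means they do not overlap at all.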

A synthetic dataset is certified for AI training only when both privacy validation and utility validation pass their thresholds.
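The certification rule can be sketched as a single gate function. The thresholds for ε, attack accuracy, and TSTR ratio are the ones quoted in this pattern; the `min_k` default of 5 is an assumption for illustration only, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    k_anonymity: int       # smallest equivalence-class size
    epsilon: float         # cumulative DP budget spent
    mia_accuracy: float    # membership inference attack accuracy (0..1)
    tstr_ratio: float      # TSTR score / real-trained score


def certify(result, min_k=5, max_epsilon=10.0,
            max_mia_accuracy=0.55, min_tstr_ratio=0.90):
    """Return (passed, failures); the dataset passes only if every check does."""
    failures = []
    if result.k_anonymity < min_k:
        failures.append("k-anonymity below minimum")
    if result.epsilon > max_epsilon:
        failures.append("epsilon above budget")
    if result.mia_accuracy > max_mia_accuracy:
        failures.append("membership inference accuracy too high")
    if result.tstr_ratio < min_tstr_ratio:
        failures.append("TSTR ratio below threshold")
    return (not failures, failures)
```

Returning the list of failed checks, rather than a bare boolean, lets the pipeline decide whether to regenerate with more noise (privacy failure) or revisit the privacy-utility trade-off (utility failure).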


5. Architecture Diagram

flowchart TD
    subgraph Input["Source Data"]
        A[Real Restricted Dataset]
        B{Generator Selection}
    end
    subgraph Validation["Privacy and Utility Validation"]
        C[Privacy Validation]
        D{Privacy Gate}
        E[Utility Validation]
        F{Utility Gate}
    end
    subgraph Output["Certified Publication"]
        G[(Synthetic Data Catalogue)]
        H[Approved Consumers]
    end
    A --> B
    B --> C
    C --> D
    D -->|fail| B
    D -->|pass| E
    E --> F
    F -->|fail| B
    F -->|pass| G
    G --> H
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f3e8ff,stroke:#a855f7
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f3e8ff,stroke:#a855f7
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#d1fae5,stroke:#10b981

6. Components

Component | Type | Responsibility | Technology Options | Criticality
Data Profiler | Processing | Analyses source data distributions, correlations, and data types to configure generator | YData Profiling (formerly Pandas Profiling), custom | High
Statistical Generator | ML Model | Parametric synthesis using marginal + joint distributions | SDV (GaussianCopula, CopulaGAN), Faker | Medium
VAE Generator | ML Model | Latent space sampling for complex tabular data | Custom PyTorch VAE, SDV TVAE | High
GAN Generator (with DP-SGD) | ML Model | High-fidelity synthesis with differential privacy training | CTGAN + Opacus DP-SGD, Gretel.ai, YData | High
k-Anonymity / l-Diversity Checker | Processing | Tests that no synthetic record is uniquely re-identifiable | Custom Python + ARX library | Critical
Differential Privacy Budget Tracker | Processing | Accounts for total privacy cost (ε); validates DP-SGD parameters | Opacus, Google DP library, custom tracker | Critical
Membership Inference Attack Tester | Processing | Adversarial attack simulation to test privacy leakage | Adversarial Robustness Toolbox (ART), custom | High
Statistical Fidelity Validator | Processing | KS test, chi-squared, correlation comparison between real and synthetic | scipy, custom Python, SDV metrics | High
TSTR / TRTS Validator | Processing | Downstream model performance comparison | Custom ML evaluation harness | Critical
Synthetic Data Certificate | Artefact | Machine-readable certificate with privacy + utility scores, usage policy | JSON schema, stored in data catalogue | High
Synthetic Data Catalogue | Storage + Discovery | Governs synthetic dataset publication, access control, expiry | DataHub, Atlan, custom catalogue | High
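As an illustration of the machine-readable Synthetic Data Certificate, the sketch below serialises one to JSON. The pattern does not prescribe a schema, so every field name and value here is a hypothetical placeholder chosen to match the scores and policy items the certificate is described as carrying.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SyntheticDatasetCertificate:
    # All field names are assumptions; the pattern does not fix a schema.
    dataset_id: str
    source_dataset_version: str
    generator_type: str
    epsilon: float
    mia_attack_accuracy: float
    tstr_ratio: float
    permitted_uses: list
    expiry_date: str  # ISO 8601 date

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values for illustration only.
certificate = SyntheticDatasetCertificate(
    dataset_id="example-synth-001",
    source_dataset_version="v1",
    generator_type="CTGAN+DP-SGD",
    epsilon=3.0,
    mia_attack_accuracy=0.52,
    tstr_ratio=0.93,
    permitted_uses=["internal-model-training", "test-environments"],
    expiry_date="2026-01-01",
)
print(certificate.to_json())
```

Storing the certificate alongside the dataset version in the catalogue keeps the privacy and utility evidence attached to the exact artefact a consumer receives.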

7. Data Flow

Primary Flow

Step | Actor | Action | Output
1 | Data Profiler | Analyses real dataset; extracts statistical profile | Data profile (distributions, correlations, data types)
2 | Generator Selection | Evaluates data profile + privacy requirements → selects generation approach | Generator configuration
3 | Synthetic Generator | Trains on real data (with DP-SGD if required); generates synthetic dataset | Raw synthetic dataset
4 | k-Anonymity Checker | Tests uniqueness of synthetic records against real dataset | k-anonymity score; l-diversity score
5 | DP Budget Tracker | Verifies DP-SGD parameters; computes cumulative ε budget | Privacy budget report (ε value)
6 | Membership Inference Tester | Trains attack classifier; measures attack accuracy | Attack accuracy score (target: ≤55%)
7 | Privacy Gate | Evaluates all three privacy checks; passes or rejects | Pass/fail + privacy validation report
8 | Statistical Fidelity Validator | Compares distributions, correlations between real and synthetic | Statistical fidelity scores per feature
9 | TSTR / TRTS Validator | Trains models on synthetic/real; evaluates cross-performance | TSTR ratio; TRTS ratio
10 | Utility Gate | Evaluates utility metrics; passes or triggers regeneration | Pass/fail + utility validation report
11 | Certification | Generates Synthetic Dataset Certificate; publishes to catalogue | Certified synthetic dataset with usage policy
12 | Approved Consumer | Accesses synthetic dataset via catalogue; uses for AI training/testing | AI model trained on synthetic data
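Step 4 of the flow can be sketched with two small checks: the k-anonymity score over a set of quasi-identifier columns, and a scan for synthetic rows that coincide with a real row. Both function names and the record layout (a list of dicts) are assumptions for illustration; dedicated anonymisation tooling covers generalisation hierarchies and l-diversity, which this sketch does not.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def exact_matches(synthetic, real, columns):
    """Synthetic rows whose values on `columns` coincide with some real row."""
    real_keys = {tuple(r[c] for c in columns) for r in real}
    return [s for s in synthetic if tuple(s[c] for c in columns) in real_keys]
```

A dataset is k-anonymous for a given k when `k_anonymity(...)` returns at least k; a non-empty `exact_matches(...)` result across all columns is a signal that the generator may have copied real records.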

Error Flow

Error Condition | Trigger | Response | Recovery
Privacy gate failure (membership inference >55%) | Attack accuracy too high | Synthetic dataset rejected; regenerate with higher DP noise (lower ε) | Increase DP-SGD noise multiplier; regenerate; re-run privacy validation
Utility gate failure (TSTR ratio <0.90) | Synthetic data too noisy for useful AI training | Synthetic dataset rejected; privacy-utility trade-off re-evaluated | Increase training epochs; adjust ε budget; consider less strict privacy target
Mode collapse (GAN produces limited variety) | Generator produces repetitive records | GAN training failure detected by diversity metric | Switch to VAE generator; adjust GAN hyperparameters
Source data access revoked before validation | Real data access removed mid-pipeline | Pipeline paused; cannot complete utility validation | Resume with new data access grant; or use previously validated synthetic version
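The recovery path for a privacy gate failure amounts to a bounded retry loop. In the sketch below, `generate` and `measure_attack_accuracy` are hypothetical callables standing in for the generator training job and the membership inference tester; the step size and attempt limit are assumptions.

```python
def regenerate_until_private(generate, measure_attack_accuracy,
                             noise_multiplier=1.0, step=0.5,
                             max_attempts=5, max_attack_accuracy=0.55):
    """Regenerate with more DP noise until the privacy gate passes.

    Returns (dataset, noise_multiplier, attempts). Raises RuntimeError when
    the gate still fails after max_attempts, so the failure is escalated
    rather than silently publishing a leaky dataset.
    """
    for attempt in range(1, max_attempts + 1):
        dataset = generate(noise_multiplier)
        if measure_attack_accuracy(dataset) <= max_attack_accuracy:
            return dataset, noise_multiplier, attempt
        noise_multiplier += step  # more noise corresponds to a lower epsilon
    raise RuntimeError("privacy gate still failing after "
                       f"{max_attempts} attempts")
```

Each retraining run touches the real data again, so the privacy cost of rejected attempts should still be recorded by the budget tracker.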

8. Security Considerations

Authentication & Authorisation

  • Real source data access for generator training is highly restricted; access logged and time-limited.
  • Synthetic dataset access controlled by usage policy in catalogue; different tiers for internal/partner/public.

Secrets Management

  • Source data credentials for generator training stored in secrets manager; not retained beyond training session.
  • Generator model artefacts access-controlled; a trained generator can be used to generate more synthetic data and must be treated as sensitive.

Data Classification

  • Generator model (trained on real data) classified at least as Confidential — it encodes statistical properties of real data.
  • Certified synthetic datasets classified per usage policy; may be Internal or Shareable depending on privacy validation.

Encryption

  • Source data encrypted at rest during generator training; access keys in KMS.
  • Synthetic datasets encrypted at rest; encryption may be relaxed for low-sensitivity certified datasets per policy.

Auditability

  • All access to source data for generation logged.
  • Synthetic dataset access logged per usage policy.
  • Privacy and utility validation results stored immutably with dataset version.

OWASP LLM Top 10 Mapping

OWASP LLM Risk (2023 v1.1 numbering) | Relevance | Mitigation
LLM06: Sensitive Information Disclosure | Generator memorises and reproduces real records | Membership inference attack testing detects leakage; DP-SGD limits memorisation
LLM03: Training Data Poisoning | Synthetic data with adversarial patterns used to poison AI model | Statistical fidelity validation; TSTR validation catches adversarial deviations
LLM04: Model Denial of Service | Generator attacked to produce malformed synthetic data | Input validation on generation requests; rate limiting
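The memorisation risk in the first row can be probed with a simple distance-based membership inference test. This sketch swaps the trained attack classifier described in this pattern for a plain threshold attack: a record is guessed to be a training member if it lies closer to the synthetic data than the median candidate. Function names are illustrative, and records are assumed to be numeric tuples of equal length.

```python
import math


def nearest_distance(record, synthetic):
    """Euclidean distance from a record to its closest synthetic record."""
    return min(math.dist(record, s) for s in synthetic)


def attack_accuracy(members, non_members, synthetic):
    """Accuracy of a threshold attack on a member/non-member candidate set.

    `members` were in the generator's training data; `non_members` are
    held-out real records. Use equal-sized sets so 0.5 means random chance.
    """
    scored = [(nearest_distance(r, synthetic), True) for r in members]
    scored += [(nearest_distance(r, synthetic), False) for r in non_members]
    distances = sorted(d for d, _ in scored)
    threshold = distances[len(distances) // 2]  # median as the cut-off
    correct = sum((d < threshold) == is_member for d, is_member in scored)
    return correct / len(scored)
```

A generator that copies its training records yields an accuracy of 1.0 under this attack, while synthetic data equally distant from members and non-members yields 0.5.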

9. Governance Considerations

Responsible AI

  • Synthetic data must preserve demographic representation; if real data is biased, synthetic data may amplify bias.
  • Bias audit required as part of utility validation: compare demographic distributions in real vs. synthetic.
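One way to run the demographic comparison in the bias audit is the total variation distance between the category proportions of a demographic attribute in the real and synthetic data. This is a minimal sketch with illustrative function names; the acceptable distance is a policy decision that this pattern does not fix.

```python
from collections import Counter


def proportions(values):
    """Share of each category in a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """0.0 means identical category shares; 1.0 means disjoint categories."""
    p = proportions(real_values)
    q = proportions(synthetic_values)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)
```

Running the check per demographic attribute, and per class label within each attribute, helps reveal a generator that under-represents a minority group.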

Model Risk Management

  • Models trained on synthetic data must be validated on real data before production deployment.
  • TSTR ratio ≥ 0.90 is minimum bar; risk committee may require higher threshold for high-risk AI.
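The TSTR ratio can be illustrated end to end with a toy model. The sketch below uses a nearest-centroid classifier and plain accuracy as stand-ins for the production model and the AUC/F1 metrics named in this pattern; all function names are assumptions.

```python
def fit_centroids(rows, labels):
    """Nearest-centroid 'model': the mean feature vector per class."""
    sums, counts = {}, {}
    for x, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def predict(centroids, x):
    """Label of the centroid closest to x (squared Euclidean distance)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, x) == y for x, y in zip(rows, labels))
    return hits / len(rows)


def tstr_ratio(real_train, synth_train, real_test):
    """Each argument is a (rows, labels) pair; both models share one test set."""
    real_score = accuracy(fit_centroids(*real_train), *real_test)
    synth_score = accuracy(fit_centroids(*synth_train), *real_test)
    return synth_score / real_score
```

Evaluating both models on the same held-out real test set is what makes the ratio meaningful; the test records should not have been used to train either the generator or the baseline model.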

Human Approval Checkpoints

  • Privacy Officer must approve Synthetic Dataset Certificate before publication to external partners.
  • Legal counsel review required for cross-border synthetic data sharing.
  • Risk committee approval required for synthetic data used in high-risk AI (EU AI Act Annex III).

Governance Artefacts

Artefact | Owner | Cadence | Purpose
Synthetic Dataset Certificate | Privacy / ML Platform | Per generation run | Privacy + utility scores; ε budget; usage policy; expiry date
Privacy Validation Report | Privacy Team | Per generation run | k-anonymity, DP budget, membership inference test results
Utility Validation Report | ML Platform | Per generation run | Statistical fidelity; TSTR/TRTS ratios
Usage Policy Record | Privacy Officer | Per publication | Permitted use cases; sharing permissions; expiry; approved consumers
Generator Model Audit Log | ML Platform | Continuous | Who trained/used which generator; source data access log

10. Operational Considerations

Monitoring

Metric | Alert Threshold | Tooling
Membership inference attack accuracy | >55% | Validation pipeline output
TSTR ratio | <0.90 | Validation pipeline output
DP budget cumulative ε | >configured threshold | Budget tracker
Synthetic dataset expiry | 30 days before expiry | Catalogue alert
Generator training compute cost | >budget threshold | Cloud cost alert
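The cumulative ε alert can be backed by a simple ledger. The sketch below uses basic sequential composition, under which the ε values of successive DP training runs on the same source data add up; this is a conservative upper bound, and the accountants in DP libraries give tighter figures. The class name and interface are hypothetical.

```python
class PrivacyBudgetTracker:
    """Ledger of epsilon spent per generator training run on one source dataset."""

    def __init__(self, epsilon_limit):
        self.epsilon_limit = epsilon_limit
        self.runs = []  # list of (run_id, epsilon)

    @property
    def spent(self):
        # Basic composition: total epsilon is at most the sum over runs.
        return sum(epsilon for _, epsilon in self.runs)

    def record(self, run_id, epsilon):
        """Record a run, refusing it if the configured limit would be exceeded."""
        if self.spent + epsilon > self.epsilon_limit:
            raise ValueError(
                f"run {run_id} would exceed the epsilon limit "
                f"({self.spent + epsilon:.2f} > {self.epsilon_limit:.2f})"
            )
        self.runs.append((run_id, epsilon))
```

Checking the budget before a training run starts, rather than after, prevents spending privacy budget that cannot be recovered.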

SLOs

SLO | Target | Measurement
Synthetic dataset generation + validation | <24 hours end-to-end | Pipeline execution time
Synthetic dataset catalogue availability | 99.9% | Availability monitor
Privacy validation completion | <4 hours | Validation pipeline time

Logging

  • All generation runs logged with source dataset version, generator type, privacy parameters, validation results.
  • Retained 7 years for regulatory compliance.

Incident Management

  • Privacy gate failure with external partner data → P1; Privacy Officer notified immediately.
  • Unexpected source data access to generate synthetic data → P1 security incident.

Disaster Recovery

Component | RTO | RPO | Strategy
Synthetic Data Catalogue | 4 hours | 24 hours | Database backup; synthetic datasets re-generatable
Generator Model Artefacts | 8 hours | 24 hours | Artefact store backup; can retrain if lost
Validation Pipeline | 2 hours | N/A | Stateless; redeploy from IaC

11. Cost Considerations

Cost Drivers

Cost Driver | Typical Range | Notes
GAN/VAE training compute | $100–$5,000 per run | GPU compute; scales with dataset size; amortised across many generation runs
Privacy validation compute | $50–$500 per run | Membership inference attack training
Synthetic data storage | $10–$200/month | Modest; synthetic datasets typically smaller than real
Enterprise platform licence | $2,000–$20,000/month | Gretel.ai, Mostly AI, YData enterprise
Legal / privacy review | $5,000–$20,000 per use case | One-time for new use case type; ongoing for regulatory changes

Optimisations

  • Use open-source SDV or CTGAN for initial synthetic data; move to enterprise platform only when scale demands.
  • Cache trained generators; regenerate synthetic data without retraining if source distribution unchanged.
  • Run membership inference testing on a sample rather than full synthetic dataset.

Indicative Cost Range

Scale | Monthly Cost | Basis
Small (1–3 use cases, monthly generation) | $500–$3,000 | SDV OSS + custom validation + light storage
Medium (5–10 use cases, weekly generation) | $3,000–$15,000 | CTGAN + validation pipeline + Gretel.ai OSS
Large (20+ use cases, daily generation, external sharing) | $15,000–$60,000 | Enterprise platform + legal + comprehensive validation

12. Trade-Off Analysis

Option Comparison

Option | Pros | Cons | Recommended When
A: Full privacy-validated synthetic data pipeline (this pattern) | Mathematically sound privacy; high utility; regulatory-acceptable | High setup cost; DP reduces data utility; requires real data for generator training | Regulated industry; external data sharing; high-risk AI training
B: Statistical anonymisation (masking/tokenisation) | Simple; no generator training needed | Destroys statistical relationships; models trained on anonymised data perform poorly | Low-complexity AI; non-statistical test data
C: Rule-based test data generation (Faker) | Zero privacy risk; instant | No statistical fidelity; useless for ML model training | Functional software testing only; not ML
D: Commercial synthetic data platform (Mostly AI, Gretel) | Best-in-class fidelity and privacy; legal opinion packages | High cost; vendor dependency | Enterprise at scale; legal opinion needed; limited internal ML capacity

Architectural Tensions

Tension | Trade-Off | Resolution
Privacy (low ε) vs. Utility (high TSTR ratio) | More DP noise → better privacy → worse utility | Tune ε per use case risk level; accept lower TSTR ratio for high-risk cases
Generation fidelity vs. training speed | GANs produce best synthetic data but are slow and unstable | Use VAE by default; GAN only when TSTR ratio requirement is very high
Internal generation vs. external platform | Internal = control and cost; external = better fidelity and legal opinion | Use internal for mature use cases; external for new/sensitive use cases

13. Failure Modes

Failure | Likelihood | Impact | Detection | Recovery
Generator memorises outlier real records | Medium | High (privacy breach) | Membership inference test | Retrain with higher DP noise; or exclude outliers from training
GAN mode collapse: synthetic data under-represents minority class | High | Medium (model trained on synthetic data misses minority class) | Statistical fidelity check on class distribution | Switch to conditional VAE; oversample minority class in real data before generation
Synthetic data used beyond approved use case | Medium | High (privacy and legal violation) | Usage policy enforcement in catalogue | Usage policy automated enforcement; access revocation on violation
Utility degrades after real data distribution shift | Medium | Medium (TSTR ratio drops; old synthetic data used for new models) | Periodic re-validation of existing synthetic datasets | Trigger regeneration on source distribution drift detection

14. Regulatory Considerations

Regulation | Requirement | Pattern Response
GDPR Article 5(1)(b) | Purpose limitation: data used only for specified purposes | Synthetic dataset usage policy enforces purpose limitation
GDPR Recital 26 | Data from which individuals can still be re-identified is not anonymous | Membership inference testing + k-anonymity checks provide evidence of anonymisation
Privacy Act (Australia) APP 3 | Limits on the collection of personal information | Synthetic data reduces real data collection requirements in AI development
Privacy Act (Australia) APP 11 | Security of personal information | DP-SGD limits memorisation; generator model access controlled
EU AI Act Article 10(2)(f) | Examine data for possible biases | Bias distribution comparison in utility validation
EU AI Act Article 10(5) | Processing of special categories of personal data for bias detection/correction | Utility validation includes demographic distribution comparison
APRA CPS 234 | Data integrity | Privacy + utility validation certificates provide attestation of synthetic data integrity
ISO/IEC 42001 Annex A.7 | Data for AI systems | Synthetic Dataset Certificate is a documented governance artefact

15. Reference Implementations

AWS

Component | AWS Service
Generator training compute | SageMaker Training Jobs (GPU)
Generator type | CTGAN on SageMaker + Opacus DP-SGD
Privacy validation | SageMaker Processing Jobs
Synthetic data storage | S3
Catalogue | AWS Glue Data Catalog + custom certificate store in DynamoDB

Azure

Component | Azure Service
Generator training | Azure ML Compute (GPU)
DP framework | Opacus or SmartNoise on Azure ML
Privacy / utility validation | Azure ML Pipelines
Synthetic data storage | ADLS Gen2
Catalogue | Microsoft Purview (formerly Azure Purview)

GCP

Component | GCP Service
Generator training | Vertex AI Custom Training (GPU)
DP framework | Google DP library + Opacus
Validation | Vertex AI Pipelines
Storage | GCS
Catalogue | Google Dataplex

On-Premises

Component | Technology
Generator | CTGAN + Opacus on GPU Kubernetes
Validation | Custom Python pipeline on Kubernetes
Storage | MinIO
Catalogue | OpenMetadata or DataHub

16. Related Patterns

Pattern | ID | Relationship | Notes
Privacy by Design for AI Data | EAAPL-DAT005 | Complements | Synthetic data is a privacy-by-design technique
AI Training Data Governance | EAAPL-DAT007 | Depends on | Synthetic datasets must be governed in training data registry
Data Quality for AI | EAAPL-DAT002 | Complements | Utility validation aligns with quality dimension of training data
Active Learning Loop | EAAPL-HIL002 | Complements | Synthetic data augments rare-class samples for annotation
Fine-Tuning Pipeline | EAAPL-MDL006 | Enables | Synthetic data enables fine-tuning where real data is restricted

17. Maturity Assessment

Overall Maturity: Proven — Core synthetic data generation techniques (CTGAN, VAE, SDV) are mature and production-proven. Differential privacy integration (Opacus) is mature. Regulatory acceptance of DP-validated synthetic data is growing but jurisdiction-specific.

Dimension | Score (1–5) | Notes
Architectural clarity | 4 | Generation pipeline well-defined; DP parameter tuning remains specialist skill
Tooling maturity | 4 | CTGAN/VAE/SDV mature; enterprise platforms (Mostly AI) mature
Regulatory alignment | 4 | Strong GDPR alignment; EU AI Act acceptance emerging
Operational complexity | 3 | DP parameter tuning requires expertise; GAN training unstable
Cost efficiency | 4 | OSS stack cost-effective; amortised across many use cases
Security | 4 | DP-SGD limits memorisation; generator access controls required

18. Revision History

Version | Date | Author | Changes
1.0 | 2023-10-01 | EAAPL Working Group | Initial publication
1.1 | 2024-04-15 | EAAPL Working Group | Added DP-SGD framework; membership inference testing detail
1.2 | 2025-03-01 | EAAPL Working Group | Added EU AI Act Article 10(5) alignment; updated enterprise platform options