EAAPL: Enterprise AI Architecture Pattern Library

[EAAPL-DAT006] Federated Learning Pattern

Category: Data Architecture
Sub-category: Distributed AI / Privacy-Preserving ML
Version: 1.1
Maturity: Emerging
Tags: federated-learning, FedAvg, differential-privacy, gradient-aggregation, consortium-AI, cross-silo, cross-device
Regulatory Relevance: GDPR Article 5/25, Privacy Act (Australia) APP 3, EU AI Act Article 10, APRA CPS 234, ISO 42001 §8.4


1. Executive Summary

Federated learning enables multiple organisations or devices to collaboratively train a shared AI model without any participant sharing their raw data. Each participant trains a local model on their own data and shares only model gradients or weights with a central coordinator, which aggregates them into an improved global model. No raw data ever leaves the participant's environment.

This pattern is transformative for industries where data cannot be centralised due to privacy regulation, competitive sensitivity, or jurisdictional restrictions: hospital consortia training clinical AI without sharing patient records; competing banks jointly training fraud detection models; mobile devices improving language models without uploading private messages.

The pattern covers cross-silo federation (organisation-to-organisation, typically <100 participants with high-quality data) and cross-device federation (device-to-server, potentially millions of participants with intermittent connectivity). It addresses the critical engineering challenges: communication efficiency, statistical heterogeneity (non-IID data), system heterogeneity, and privacy amplification through differential privacy on gradient updates.

Target audience: Chief Data Officers, AI Research leads, Healthcare/Banking consortium architects, ML Platform leads.


2. Problem Statement

Business Problem

High-value AI use cases require data that no single organisation can legally or competitively accumulate. Healthcare networks want clinical AI trained on all patient populations; banks want fraud models trained on cross-institution transaction patterns; governments want public health AI trained across jurisdictions. Traditional data sharing is blocked by privacy law, competition law, or data sovereignty requirements.

Technical Problem

  • Centralising personal data from multiple organisations can violate GDPR, APRA requirements, and similar privacy regulations.
  • Data sharing agreements between competing organisations are commercially infeasible.
  • Cross-border data transfer restrictions prevent cloud-based centralisation.
  • Each organisation has too little data alone; combined they have sufficient statistical power.
  • Even with anonymisation, competing organisations will not share detailed customer data.

Symptoms

  • AI use case identified as high-value but blocked at data access stage indefinitely.
  • Each participant's model has poor performance due to small or non-representative local dataset.
  • Privacy counsel blocking data consortium proposals.
  • Regulators signalling openness to federated approaches (e.g., EBA guidance on federated fraud detection).

Cost of Inaction

| Dimension | Impact |
|---|---|
| Model quality | Individual models underperform (insufficient data) vs. federated models |
| Competitive | Organisations with more data accumulation win; federated levels playing field |
| Regulatory | Data centralisation attempts attract regulatory scrutiny; federated is regulatorily preferred |
| Healthcare | Clinical AI trained on single-hospital data misses rare conditions; patient harm |

3. Context

When to Apply

  • Multiple organisations or devices hold relevant training data that cannot be centralised.
  • Privacy regulation or competition law prohibits raw data sharing.
  • Cross-border data transfer restrictions apply.
  • Data is sufficiently heterogeneous that centralisation would require complex harmonisation.
  • Participants have sufficient compute to run local training (cross-silo: always; cross-device: modern smartphones).

When NOT to Apply

  • Data can be legally and practically centralised (federated adds unnecessary complexity).
  • Participants lack compute for local training.
  • Data is severely heterogeneous (non-IID) to the point where federated training diverges (use transfer learning instead).
  • Security of gradient transmission cannot be guaranteed (gradient inversion attacks possible in low-participant settings).
  • Regulatory framework requires centralised data audit (some jurisdictions require data in a single auditable location).

Prerequisites

| Prerequisite | Minimum Viable | Preferred |
|---|---|---|
| Participant compute | Modern CPU for tabular; GPU for deep learning | Dedicated GPU nodes at each participant |
| Network connectivity | Reliable internet (cross-silo) | High-bandwidth private network |
| Federation framework | Flower (FL framework), PySyft | Enterprise: NVIDIA FLARE, IBM FL, Google FL |
| DP integration | Opacus DP-SGD | Calibrated DP budget per participant |
| Legal agreement | Data sharing agreement (data stays local) | Multilateral federated learning agreement with IP provisions |

Industry Applicability

| Industry | Applicability | Driver |
|---|---|---|
| Healthcare | Critical | Patient privacy; multi-hospital clinical AI; rare disease research |
| Financial Services | High | Cross-bank fraud detection; credit risk consortium; AML |
| Telecommunications | High | Shared network anomaly detection; cross-carrier fraud |
| Government | Medium | Cross-agency AI without central data lake |
| Retail | Medium | Cross-retailer demand forecasting consortium |
| Automotive | High | Cross-manufacturer autonomous driving model improvement |

4. Architecture Overview

Design Philosophy

Federated learning inverts the traditional ML paradigm: rather than bringing data to the model, the model is brought to the data. The architecture must solve five engineering challenges simultaneously.

Challenge 1 — Communication Efficiency. In cross-silo FL, gradient transmission between participants and the coordinator is the primary bottleneck. Full gradient transmission for large neural networks can involve gigabytes per round. The pattern addresses this through gradient compression (sparsification: transmit only top-k% gradients by magnitude; quantisation: reduce gradient precision from float32 to int8 or 4-bit), achieving 10–100× communication reduction with minimal accuracy loss.
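A minimal sketch of the two compression steps named above (top-k sparsification, then int8 quantisation), in plain Python for illustration. The function names, the 30% keep fraction and the toy gradient are assumptions, not part of the pattern; a production compressor would work on tensors and would usually carry the dropped residual into the next round.

```python
from typing import Dict, List, Tuple


def topk_sparsify(grad: List[float], fraction: float) -> Dict[int, float]:
    """Keep only the largest-magnitude entries; return {index: value}."""
    k = max(1, int(len(grad) * fraction))
    top = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return {i: grad[i] for i in sorted(top)}


def quantise_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric linear quantisation to the int8 range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantise(codes: List[int], scale: float) -> List[float]:
    """Coordinator side: map int8 codes back to approximate floats."""
    return [c * scale for c in codes]


# Toy gradient (hypothetical values).
grad = [0.02, -1.27, 0.003, 0.64, -0.01, 0.0, 0.31, -0.05, 0.001, 0.09]
sparse = topk_sparsify(grad, fraction=0.3)          # keeps 3 of 10 entries
codes, scale = quantise_int8(list(sparse.values()))
restored = dict(zip(sparse.keys(), dequantise(codes, scale)))
```

Only the kept indices, the int8 codes and one scale factor need to be transmitted. The indices themselves cost bandwidth, which is why the achieved reduction depends on how they are encoded.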

Challenge 2 — Statistical Heterogeneity (Non-IID Data). Each participant's data reflects their specific population, which may differ significantly from the global distribution. Naive FedAvg converges poorly on non-IID data. The pattern addresses this through FedProx (adds a proximal term to local loss function, preventing local models from drifting too far from the global model) and SCAFFOLD (corrects for client drift using control variates). For severely heterogeneous data, personalised federated learning (each participant maintains a local fine-tuned head on top of the shared global representation) is used.
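The proximal term can be illustrated with a small sketch. FedProx minimises the local objective F_k(w) + (μ/2)·‖w − w_global‖², so each gradient step gains an extra μ·(w − w_global) pull towards the global model. The quadratic toy loss, the `TARGET` vector and all hyperparameter values below are assumptions chosen so the effect is easy to see.

```python
from typing import Callable, List

Vector = List[float]


def fedprox_local_update(w_global: Vector,
                         grad_fn: Callable[[Vector], Vector],
                         mu: float, lr: float, steps: int) -> Vector:
    """Gradient descent on F_k(w) + (mu / 2) * ||w - w_global||^2."""
    w = list(w_global)
    for _ in range(steps):
        g = grad_fn(w)
        # Local gradient plus the proximal pull back towards the global model.
        w = [wi - lr * (gi + mu * (wi - w0))
             for wi, gi, w0 in zip(w, g, w_global)]
    return w


# Hypothetical local loss F_k(w) = 0.5 * ||w - TARGET||^2, gradient w - TARGET.
TARGET = [4.0, -2.0]


def toy_grad(w: Vector) -> Vector:
    return [wi - ti for wi, ti in zip(w, TARGET)]


w_global = [0.0, 0.0]
w_no_prox = fedprox_local_update(w_global, toy_grad, mu=0.0, lr=0.1, steps=200)
w_fedprox = fedprox_local_update(w_global, toy_grad, mu=1.0, lr=0.1, steps=200)
```

With μ = 0 the local model runs all the way to its own optimum; with μ = 1 it settles halfway between the global model and the local optimum, which is the drift limit the proximal term is meant to provide.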

Challenge 3 — Privacy Amplification. Gradient sharing is safer than raw data sharing but not perfectly private: gradient inversion attacks can recover training data from gradients, particularly in small-participant settings. The pattern applies differential privacy via DP-SGD (for example with Opacus) at each participant before transmission: per-sample gradients are clipped to a fixed norm and calibrated Gaussian noise is added. Secure aggregation protocols (Google's SecAgg) let the coordinator compute the aggregate gradient without seeing any individual participant's gradient, so even the coordinator cannot invert a specific participant's contribution.
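The clip-then-noise step at the heart of DP-SGD can be sketched without any ML framework. This is an illustration of the mechanism only, not the Opacus API: the function names, the clipping norm and the noise multiplier are assumptions, and a real deployment would rely on a library implementation together with its privacy accountant.

```python
import math
import random
from typing import List


def clip_to_norm(vec: List[float], max_norm: float) -> List[float]:
    """Scale the vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * factor for v in vec]


def dp_average_gradient(per_example_grads: List[List[float]],
                        max_norm: float,
                        noise_multiplier: float,
                        rng: random.Random) -> List[float]:
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_to_norm(g, max_norm) for g in per_example_grads]
    summed = [sum(col) for col in zip(*clipped)]
    std = noise_multiplier * max_norm      # noise is calibrated to the clip norm
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    return [v / len(clipped) for v in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]           # hypothetical per-example gradients
exact = dp_average_gradient(grads, 1.0, 0.0, random.Random(0))   # no noise
noisy = dp_average_gradient(grads, 1.0, 1.1, random.Random(42))  # with noise
```

Clipping bounds how much any single training example can move the update; the noise scale is then set relative to that bound, which is what makes the privacy loss quantifiable.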

Challenge 4 — System Heterogeneity. In cross-device FL, participants have wildly varying compute capabilities and connectivity. The pattern implements asynchronous federated learning (FedAsync): participants submit gradients when ready rather than in synchronised rounds; the global model is updated with each received gradient using a mixing hyperparameter. This handles stragglers without blocking the federation round.
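A sketch of the asynchronous update described above: the coordinator blends each arriving local model into the global model using a mixing weight. Discounting that weight by staleness with a polynomial function is one option described in the FedAsync literature; the exponent `a` and the base weight `alpha` used here are assumptions.

```python
from typing import List


def staleness_weight(alpha: float, staleness: int, a: float = 0.5) -> float:
    """Base mixing weight, discounted for updates computed on an old model."""
    return alpha * (1 + staleness) ** (-a)


def fedasync_update(global_w: List[float], local_w: List[float],
                    alpha: float, staleness: int, a: float = 0.5) -> List[float]:
    """w_global <- (1 - alpha_t) * w_global + alpha_t * w_local."""
    alpha_t = staleness_weight(alpha, staleness, a)
    return [(1 - alpha_t) * g + alpha_t * l for g, l in zip(global_w, local_w)]


fresh = fedasync_update([0.0, 0.0], [2.0, 4.0], alpha=0.5, staleness=0)
stale = fedasync_update([0.0, 0.0], [2.0, 4.0], alpha=0.5, staleness=3)
```

An update trained on the current global model moves it halfway towards the local result; the same update arriving three versions late moves it only a quarter of the way.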

Challenge 5 — Byzantine Robustness. Malicious or faulty participants may submit adversarial gradients to corrupt the global model (model poisoning). The coordinator applies gradient validation: per-participant gradient norms are compared; outliers (>3σ from mean) are rejected or down-weighted. FedMedian aggregation (median instead of mean) provides Byzantine-robust aggregation in adversarial settings.
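Both defences can be sketched in a few lines: a norm-based filter using the 3σ rule, and coordinate-wise median aggregation. The participant IDs, update values and function names are hypothetical.

```python
import math
import statistics
from typing import Dict, List


def l2_norm(vec: List[float]) -> float:
    return math.sqrt(sum(v * v for v in vec))


def filter_norm_outliers(updates: Dict[str, List[float]],
                         z_max: float = 3.0) -> Dict[str, List[float]]:
    """Drop updates whose norm is more than z_max std devs from the round mean."""
    norms = {pid: l2_norm(u) for pid, u in updates.items()}
    values = list(norms.values())
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return dict(updates)
    return {pid: u for pid, u in updates.items()
            if abs(norms[pid] - mean) / std <= z_max}


def coordinate_median(updates: List[List[float]]) -> List[float]:
    """Median of each coordinate across participants (FedMedian-style)."""
    return [statistics.median(col) for col in zip(*updates)]


# Eleven honest participants and one with an inflated update.
updates = {f"p{i}": [1.0, 0.0] for i in range(11)}
updates["mallory"] = [100.0, 0.0]
kept = filter_norm_outliers(updates)
robust = coordinate_median([[1.0, 10.0], [2.0, 20.0], [900.0, -500.0]])
```

One caveat worth checking before relying on the 3σ rule in a small consortium: among n values, a single outlier can sit at most √(n − 1) population standard deviations from the mean, so the rule cannot fire with ten or fewer participants. In that regime the median aggregation, or a median-based outlier test, is the more useful of the two.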

Federation Coordinator. The coordinator orchestrates rounds: selects participants, distributes the global model, collects and aggregates gradients, validates gradient integrity, and updates the global model. The coordinator does not process raw data. In cross-silo settings, the coordinator may be hosted by a neutral third party (industry consortium body, regulator-approved platform) to ensure no participant gains competitive advantage.


5. Architecture Diagram

ARCHITECTURE DIAGRAM
flowchart TD
    subgraph Participants["Participant Environments"]
        A[Participant A Local Data]
        B[Participant B Local Data]
        C[Participant C Local Data]
    end
    subgraph Training["Local Training"]
        D[Local Model + DP-SGD]
    end
    subgraph Coordinator["Federation Coordinator"]
        E[Gradient Validator]
        F[Secure Aggregator]
        G[(Global Model Registry)]
    end
    A --> D
    B --> D
    C --> D
    D -->|compressed DP gradients| E
    E --> F
    F --> G
    G -->|updated global model| D
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#dbeafe,stroke:#3b82f6
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f3e8ff,stroke:#a855f7
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#fef9c3,stroke:#eab308

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Federation Coordinator | Service | Round orchestration; global model distribution; gradient aggregation | Flower (flwr), NVIDIA FLARE, IBM FL, Google Federated Core | Critical |
| Secure Aggregator | Processing | Aggregates gradients without exposing individual contributions (SecAgg protocol) | Google SecAgg, PySyft SecureSum, custom MPC | High |
| Gradient Validator | Processing | Detects and rejects outlier/adversarial gradients; Byzantine robustness | Custom norm-based filter; FedMedian aggregation | High |
| Local Training Engine | Processing (per participant) | Trains local model on participant's data; implements FedProx/SCAFFOLD | PyTorch + Flower client, TensorFlow Federated, NVIDIA FLARE client | Critical |
| DP-SGD Engine | Processing (per participant) | Applies differential privacy to gradients before transmission | Opacus (PyTorch), TensorFlow Privacy | Critical |
| Gradient Compressor | Processing (per participant) | Sparsification and quantisation of gradients for transmission efficiency | Custom Python; PowerSGD; TopK sparsification | High |
| Global Model Registry | Storage | Stores global model versions; tracks federation round history | MLflow, DVC, Weights & Biases, custom | High |
| DP Budget Tracker | Processing | Tracks cumulative privacy budget (ε) across rounds per participant | Opacus privacy accounting, custom ε tracker | Critical |
| Hold-out Evaluator | Processing | Evaluates global model on neutral validation dataset not owned by any participant | Custom Python evaluation harness | High |
| Federation Agreement Registry | Governance | Stores legal federation agreement; approved use cases; participant consent | Custom registry; legal document management | High |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Coordinator | Initialises global model; selects participants for round | Round configuration with global model checkpoint |
| 2 | Participants | Download global model checkpoint | Local copy of global model |
| 3 | Each participant | Trains local model on local data using FedProx with proximal term | Updated local model weights |
| 4 | Each participant | Applies DP-SGD: clips gradients; adds Gaussian noise | DP-protected gradient |
| 5 | Each participant | Compresses gradient: sparsification + quantisation | Compressed DP gradient |
| 6 | Coordinator | Receives compressed DP gradients from participants | Gradient set for round |
| 7 | Gradient Validator | Checks gradient norms; rejects outliers | Validated gradient set |
| 8 | Secure Aggregator | Aggregates validated gradients (weighted mean, as in FedAvg and FedProx) | Aggregated global gradient |
| 9 | Coordinator | Updates global model with aggregated gradient | New global model version |
| 10 | Hold-out Evaluator | Evaluates global model on neutral validation set | Accuracy + fairness metrics |
| 11 | Coordinator | If metrics meet threshold: promote global model; begin next round | Promoted global model; federated training continues |
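Steps 8 and 9 reduce to a weighted average. The sketch below weights each participant's update by its local sample count, the weighting used by FedAvg; the update values and sample counts are hypothetical.

```python
from typing import List, Tuple


def fedavg(updates: List[Tuple[List[float], int]]) -> List[float]:
    """Average participant updates, weighted by local sample count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(u[j] * n for u, n in updates) / total for j in range(dim)]


# (update vector, number of local training samples)
round_updates = [([1.0, 2.0], 100), ([3.0, 6.0], 300)]
aggregated = fedavg(round_updates)
```

The participant with three times the data contributes three times the weight, so the result lands closer to its update.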

Error Flow

| Error Condition | Trigger | Response | Recovery |
|---|---|---|---|
| Participant dropout mid-round | Network failure; compute failure | Coordinator proceeds with available participants (minimum quorum check) | Retry dropped participant in next round |
| Gradient norm outlier (possible poisoning) | Gradient norm >3σ from round mean | Gradient rejected; participant flagged for review | Review participant's local training setup; escalate to consortium governance |
| DP budget exhausted (ε > threshold) | Cumulative rounds exceed privacy budget | Participant stops contributing; new training data or budget reset required | Negotiate new DP budget; assess if prior training meets privacy requirements |
| Global model divergence (loss increases) | Non-IID data; insufficient FedProx proximal term | Round rejected; proximal term strength increased | Adjust FedProx μ hyperparameter; reduce local training epochs |
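The budget-exhaustion check can be sketched with basic sequential composition, where the ε values of successive rounds simply add up. This is a deliberately loose upper bound; accountants such as the one in Opacus give tighter totals. The class name, the limit and the per-round ε are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BudgetTracker:
    """Per-participant cumulative epsilon under basic composition."""
    epsilon_limit: float
    spent: Dict[str, float] = field(default_factory=dict)

    def can_participate(self, pid: str, round_epsilon: float) -> bool:
        return self.spent.get(pid, 0.0) + round_epsilon <= self.epsilon_limit

    def charge(self, pid: str, round_epsilon: float) -> float:
        """Record one round's spend; refuse if it would exceed the limit."""
        if not self.can_participate(pid, round_epsilon):
            raise RuntimeError(f"DP budget exhausted for {pid}")
        self.spent[pid] = self.spent.get(pid, 0.0) + round_epsilon
        return self.spent[pid]


tracker = BudgetTracker(epsilon_limit=1.0)
tracker.charge("hospital_a", 0.4)
tracker.charge("hospital_a", 0.4)
third_round_allowed = tracker.can_participate("hospital_a", 0.4)
```

After two rounds at ε = 0.4 the participant has spent 0.8 of a 1.0 budget, so a third round at the same cost is refused and the participant stops contributing, as in the error flow above.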

8. Security Considerations

Authentication & Authorisation

  • Each participant authenticates to coordinator using mutual TLS certificates; participant identity linked to federation legal agreement.
  • Coordinator validates participant identity before distributing global model.
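A sketch of the coordinator-side TLS settings implied above, using Python's standard `ssl` module: client certificates are required and the protocol floor is TLS 1.3. The function name and the file-path parameters are hypothetical; certificate issuance and the link to the federation agreement are outside the sketch.

```python
import ssl
from typing import Optional


def coordinator_tls_context(cert_file: Optional[str] = None,
                            key_file: Optional[str] = None,
                            ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Server-side context that only accepts clients presenting a trusted cert."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    ctx.verify_mode = ssl.CERT_REQUIRED          # mutual TLS: client cert needed
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)   # consortium CA bundle
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


ctx = coordinator_tls_context()   # paths omitted here; supply real files in use
```

The participant identity can then be read from the verified client certificate and checked against the list of parties to the federation agreement before the global model is distributed.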

Secrets Management

  • DP-SGD noise seed managed locally by each participant; not shared with coordinator.
  • Secure Aggregation protocol keys ephemeral per round; participants derive shared secrets using Diffie-Hellman.

Data Classification

  • Raw training data classified as Confidential or higher at each participant; never transmitted.
  • DP gradients classified as Internal; coordinator sees only aggregated gradient.
  • Global model classified per use case sensitivity; clinical AI models typically Confidential.

Encryption

  • All gradient transmission encrypted using TLS 1.3.
  • Secure Aggregation provides additional cryptographic guarantee: coordinator cannot decrypt individual participant gradients.
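The idea behind secure aggregation can be shown with pairwise masks that cancel in the sum. This is a simplified illustration, not the SecAgg protocol: there is no dropout recovery, the integer seeds stand in for the Diffie-Hellman shared secrets, and Python's `random` module is not a cryptographic generator. All names and values are hypothetical.

```python
import random
from typing import Dict, FrozenSet, List

MODULUS = 2 ** 32   # all arithmetic is done modulo this value
SCALE = 10 ** 6     # fixed-point scale for encoding floats as integers


def encode(update: List[float]) -> List[int]:
    return [round(x * SCALE) % MODULUS for x in update]


def decode(total: List[int]) -> List[float]:
    # Values in the upper half of the ring represent negative numbers.
    return [(t - MODULUS if t >= MODULUS // 2 else t) / SCALE for t in total]


def mask_update(pid: str, update: List[float], peers: List[str],
                pair_seeds: Dict[FrozenSet[str], int]) -> List[int]:
    """Add one pseudo-random mask per peer; each pair's masks cancel in the sum."""
    masked = encode(update)
    for other in peers:
        if other == pid:
            continue
        rng = random.Random(pair_seeds[frozenset((pid, other))])
        sign = 1 if pid < other else -1     # one side adds, the other subtracts
        masked = [(m + sign * rng.randrange(MODULUS)) % MODULUS for m in masked]
    return masked


def aggregate(masked_updates: List[List[int]]) -> List[float]:
    return decode([sum(col) % MODULUS for col in zip(*masked_updates)])


updates = {"a": [0.5, -1.0], "b": [0.25, 2.0], "c": [-0.125, 0.5]}
peers = sorted(updates)
pair_seeds = {frozenset(("a", "b")): 11, frozenset(("a", "c")): 22,
              frozenset(("b", "c")): 33}
masked = {pid: mask_update(pid, u, peers, pair_seeds) for pid, u in updates.items()}
total = aggregate(list(masked.values()))
```

Each masked vector on its own looks like random integers to the coordinator, yet the sum decodes to the true aggregate because every mask is added once and subtracted once.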

Auditability

  • Every federation round logged: round number, participants, aggregation method, global model version, evaluation metrics.
  • DP budget consumption per participant logged; shared with participant for their own privacy accounting.
  • Gradient rejection events logged with reason; reviewed by consortium governance.
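One way to make the round log tamper-evident is to chain entries by hash, so that altering an earlier record invalidates everything after it. The sketch below is an assumption about how such a log could be built, not a prescribed format; the field names follow the items listed above and the values are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64


def _digest(body: Dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_round(log: List[Dict], record: Dict) -> Dict:
    """Append a record that commits to the hash of the previous entry."""
    body = dict(record, prev_hash=log[-1]["hash"] if log else GENESIS)
    entry = dict(body, hash=_digest(body))
    log.append(entry)
    return entry


def verify_chain(log: List[Dict]) -> bool:
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


log: List[Dict] = []
append_round(log, {"round": 1, "participants": ["a", "b", "c"],
                   "aggregation": "FedAvg", "model_version": "v1",
                   "metrics": {"accuracy": 0.81}, "rejected": []})
append_round(log, {"round": 2, "participants": ["a", "b"],
                   "aggregation": "FedAvg", "model_version": "v2",
                   "metrics": {"accuracy": 0.83}, "rejected": ["c"]})
valid_before = verify_chain(log)
log[0]["participants"] = ["a", "b"]      # simulate tampering with round 1
valid_after = verify_chain(log)
```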

OWASP LLM Top 10 Mapping

| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM03: Training Data Poisoning | Malicious participant submits adversarial gradients | Gradient validation; FedMedian aggregation; Byzantine-robust aggregation |
| LLM06: Sensitive Information Disclosure | Gradient inversion could recover training data | DP-SGD limits what inversion can recover; SecAgg hides individual gradients from coordinator |
| LLM04: Model Denial of Service | Participant submits malformed gradient causing coordinator crash | Gradient schema validation; norm check before aggregation |

9. Governance Considerations

Responsible AI

  • Federated models may perform differently across participant populations (fairness concern): hold-out evaluation must include per-participant subgroup performance metrics.
  • Consortium governance board responsible for global model promotion decisions when performance is uneven.
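A sketch of the per-participant evaluation described above: accuracy is computed separately for each participant's subpopulation, and promotion is held back when the worst group falls below a floor or the spread is too wide. The floor, the maximum gap and the sample predictions are hypothetical; the governance board would set the actual criteria.

```python
from typing import Dict, List, Tuple


def per_participant_accuracy(
        results: Dict[str, List[Tuple[int, int]]]) -> Dict[str, float]:
    """results maps participant ID to (prediction, label) pairs."""
    return {pid: sum(p == y for p, y in pairs) / len(pairs)
            for pid, pairs in results.items()}


def meets_fairness_gate(acc: Dict[str, float],
                        floor: float, max_gap: float) -> bool:
    """Pass only if every group clears the floor and the spread is bounded."""
    worst, best = min(acc.values()), max(acc.values())
    return worst >= floor and (best - worst) <= max_gap


results = {
    "hospital_a": [(1, 1), (0, 0), (1, 1), (0, 0)],
    "hospital_b": [(1, 0), (0, 0), (1, 1), (0, 1)],
}
accuracy = per_participant_accuracy(results)
promote = meets_fairness_gate(accuracy, floor=0.7, max_gap=0.2)
```

Here the global model is perfect on one participant's data and at chance on the other's, so an aggregate accuracy of 75% would hide exactly the unevenness the gate is meant to surface.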

Model Risk Management

  • Global model is trained on distributed data with different quality levels; model risk documentation must describe participant data quality standards.
  • Model risk committee approval required before production deployment of federated model.

Human Approval Checkpoints

  • Federation agreement signed by legal representatives of all participants.
  • Global model promotion requires consortium governance board approval.
  • DP budget reset (increasing privacy expenditure) requires participant-level DPO approval.

Governance Artefacts

| Artefact | Owner | Cadence | Purpose |
|---|---|---|---|
| Federation Agreement | Legal / Consortium | On establishment | Defines IP ownership of global model; permitted uses; data sovereignty |
| Round Audit Log | Coordinator | Per round | Immutable log of participants, gradients received/rejected, model version |
| DP Budget Report | Each Participant | Per round | Cumulative ε consumption; participant's own privacy accounting |
| Per-Participant Fairness Report | Coordinator | Per promotion | Global model performance on each participant's subpopulation |
| Model Card (Federated) | Consortium ML Team | Per model version | Training cohort summary; DP parameters; known limitations per participant |

10. Operational Considerations

Monitoring

| Metric | Alert Threshold | Tooling |
|---|---|---|
| Round completion rate | <80% participants completing round | Coordinator logs |
| Global model accuracy (hold-out) | <performance floor | Evaluation pipeline |
| Gradient rejection rate | >10% in a round | Coordinator metrics |
| DP budget consumption rate | >budget plan | Budget tracker |
| Per-participant contribution latency | >round deadline | Coordinator timing |
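Two of the thresholds above are simple ratios and can be checked at the end of each round. The sketch below is illustrative only; the function name and alert labels are assumptions, and the remaining metrics depend on values (performance floor, budget plan, round deadline) that each consortium defines for itself.

```python
from typing import List


def round_alerts(selected: int, completed: int, rejected: int,
                 completion_floor: float = 0.80,
                 rejection_ceiling: float = 0.10) -> List[str]:
    """Return the names of the alerts triggered by one round's counts."""
    alerts = []
    completion_rate = completed / selected
    rejection_rate = rejected / completed if completed else 0.0
    if completion_rate < completion_floor:
        alerts.append("round_completion_rate")
    if rejection_rate > rejection_ceiling:
        alerts.append("gradient_rejection_rate")
    return alerts


bad_round = round_alerts(selected=10, completed=7, rejected=1)
good_round = round_alerts(selected=10, completed=9, rejected=0)
```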

SLOs

| SLO | Target | Measurement |
|---|---|---|
| Federation round completion | <4 hours per round (cross-silo) | Round timing logs |
| Global model hold-out evaluation | <1 hour after round completion | Evaluation pipeline |
| Gradient transmission availability | >99.5% per participant | Network monitoring |

Disaster Recovery

| Component | RTO | RPO | Strategy |
|---|---|---|---|
| Coordinator | 2 hours | Last completed round | Stateless except model registry; restore from model registry |
| Global Model Registry | 4 hours | 1 hour | Cross-region replication |
| Participant Local Training | Per participant | Per participant | Each participant manages their own training infrastructure |

11. Cost Considerations

Cost Drivers

| Cost Driver | Typical Range | Notes |
|---|---|---|
| Coordinator compute | $500–$5,000/month | Lightweight for aggregation; scales with participant count |
| Participant training compute | $1,000–$10,000/month per participant | GPU training; largest cost component |
| Network (gradient transmission) | $100–$1,000/month | Reduced by compression; scales with model size × rounds |
| Secure Aggregation compute | $200–$2,000/month | Cryptographic overhead; scales with participant count |
| Legal (federation agreement) | $20,000–$100,000 one-time | Multilateral consortium agreement |
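The "model size × rounds" scaling of the network line can be made concrete with a back-of-envelope calculation. Every number here is an assumption for illustration: a 100-million-parameter model, float32 dense updates, and a compressed update that keeps the top 10% of entries as int8 values with a 4-byte index each.

```python
def dense_bytes(n_params: int, bytes_per_value: int = 4) -> int:
    """Upload size of one uncompressed float32 update."""
    return n_params * bytes_per_value


def sparse_bytes(n_params: int, keep_percent: int,
                 value_bytes: int = 1, index_bytes: int = 4) -> int:
    """Upload size when only the kept entries and their indices are sent."""
    kept = n_params * keep_percent // 100
    return kept * (value_bytes + index_bytes)


N_PARAMS = 100_000_000                        # hypothetical model size
dense = dense_bytes(N_PARAMS)                 # 400 MB per participant per round
sparse = sparse_bytes(N_PARAMS, keep_percent=10)   # 50 MB
reduction = dense / sparse
```

Under these assumptions the upload shrinks 8×; the index overhead dominates, so larger reductions require a smaller keep fraction or a more compact index encoding. Multiplying the per-round figure by participants and rounds gives the monthly transfer volume to price.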

Indicative Cost Range

| Scale | Monthly Cost (Coordinator) | Monthly Cost (Per Participant) |
|---|---|---|
| Small consortium (3–5 participants) | $1,000–$5,000 | $1,000–$5,000 |
| Medium consortium (10–20 participants) | $3,000–$15,000 | $2,000–$8,000 |
| Large / cross-device (100+ participants) | $10,000–$50,000 | Varies widely |

12. Trade-Off Analysis

Option Comparison

| Option | Pros | Cons | Recommended When |
|---|---|---|---|
| A: Federated Learning (this pattern) | True data locality; privacy-preserving; legally viable | Complex; convergence slower than centralised; requires participant compute | Data cannot be centralised; regulatory blocking; >3 participants |
| B: Data clean room / privacy sandbox | Raw data never shared; analytics on aggregate queries | Limited to query-based insights; cannot train complex ML models | Analytics use cases; not full ML model training |
| C: Centralised with contractual data sharing | Better convergence; simpler architecture | Legally complex; GDPR/competition risk; single point of breach | Trusted consortium; single jurisdiction; non-sensitive data |
| D: Transfer learning (pre-train centrally, fine-tune locally) | No raw data sharing for fine-tuning; good performance | Requires large public pre-training dataset; may not transfer to specialist domains | Public pre-training data available; specialist fine-tuning needed |

Architectural Tensions

| Tension | Trade-Off | Resolution |
|---|---|---|
| Privacy (DP noise) vs. model utility | More DP noise = better privacy, worse model quality | Tune ε per risk level; accept utility reduction for high-risk use cases |
| Communication efficiency vs. convergence speed | More compression = faster rounds, slower convergence | Use TopK sparsification (top 10% gradients) as default; adjust per round budget |
| Cross-silo vs. cross-device architecture | Cross-device needs async + partial participation; cross-silo needs synchronous consensus | Implement FedAsync for cross-device; synchronous FedProx for cross-silo |

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Byzantine participant (model poisoning via adversarial gradients) | Low | High: global model corrupted | Gradient norm outlier detection; FedMedian | Isolate participant; revert to last clean model; re-run rounds without participant |
| Global model divergence on non-IID data | Medium | High: useless global model | Hold-out evaluation after each round | Increase FedProx proximal term; reduce local epochs; use SCAFFOLD |
| DP budget exhaustion (participant cannot contribute) | Medium | Medium: federation loses participant | DP budget tracker per participant | Assess privacy-utility trade-off; negotiate extended budget or retire participant |
| Coordinator compromise | Very Low | Critical: adversary sees round gradients | Intrusion detection; secure aggregation | Secure aggregation prevents coordinator from seeing individual gradients; rotate keys |
| Legal dispute over global model IP | Low | High: programme halted | Legal agreement monitoring | Federation agreement must pre-define IP ownership; no execution dependency on dispute resolution |

14. Regulatory Considerations

| Regulation | Requirement | Pattern Response |
|---|---|---|
| GDPR Article 5(1)(b) | Purpose limitation | Federation agreement defines permitted use of global model |
| GDPR Article 25 | Privacy by design | Federated architecture keeps data local; DP-SGD minimises gradient leakage |
| EU AI Act Article 10 | Training data governance | Global model inherits data governance requirements of all participants' data |
| APRA CPS 234 | Third-party risk | Coordinator is a third party; security standards contractually required |
| Competition law | Anti-trust compliance | Coordinator must not aggregate commercially sensitive information; legal sign-off required |
| Cross-border data transfer | Data sovereignty | Raw data stays local; gradient transmission reviewed per jurisdiction |

15. Reference Implementations

AWS

| Component | AWS Service |
|---|---|
| Coordinator | Amazon SageMaker Federated Learning + custom Flower coordinator on ECS |
| Participant training | SageMaker Training Jobs at each participant site |
| Global model registry | SageMaker Model Registry |
| Secure communication | AWS PrivateLink (cross-silo) |

Azure

| Component | Azure Service |
|---|---|
| Coordinator | Azure ML Federated Learning (preview) + NVIDIA FLARE on AKS |
| Participant training | Azure ML Compute at each participant |
| Model registry | Azure ML Model Registry |

GCP

| Component | GCP Service |
|---|---|
| Coordinator | Vertex AI + Flower on Cloud Run |
| Participant training | Vertex AI Training per participant |
| Global model | Vertex AI Model Registry |

On-Premises

| Component | Technology |
|---|---|
| Coordinator | Flower (flwr) or NVIDIA FLARE on Kubernetes |
| Participant | PyTorch + Opacus on GPU node |
| Communication | mTLS over private network |
| Model registry | MLflow |

16. Related Patterns

| Pattern | ID | Relationship | Notes |
|---|---|---|---|
| Privacy by Design for AI Data | EAAPL-DAT005 | Complements | DP-SGD is a privacy-by-design technique |
| Synthetic Data Generation | EAAPL-DAT004 | Alternative | Synthetic data is an alternative when federated training is infeasible |
| AI Training Data Governance | EAAPL-DAT007 | Depends on | Federation agreement is a training data governance artefact |
| Model Versioning | EAAPL-MDL001 | Depends on | Global model versioning per federation round |
| Fine-Tuning Pipeline | EAAPL-MDL006 | Complements | Global federated model fine-tuned locally per participant |

17. Maturity Assessment

Overall Maturity: Emerging — Federated learning frameworks (Flower, NVIDIA FLARE) are production-ready. Cross-silo deployments are proven in healthcare and finance consortia. Cross-device FL at scale is mature (Google deployed FL for Gboard). However, regulatory frameworks for federated AI governance are still developing.

| Dimension | Score (1–5) | Notes |
|---|---|---|
| Architectural clarity | 4 | Well-defined federation patterns; non-IID handling still research-active |
| Tooling maturity | 3 | Flower/NVIDIA FLARE production-ready; enterprise tooling maturing |
| Regulatory alignment | 3 | GDPR alignment good; AI Act treatment of federated models unclear |
| Operational complexity | 2 | High operational complexity; multi-party coordination challenging |
| Cost efficiency | 3 | High participant compute cost; offset by regulatory compliance enablement |
| Security | 4 | DP + SecAgg provides strong privacy guarantees |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-03-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2025-03-01 | EAAPL Working Group | Added FedProx/SCAFFOLD detail; Byzantine robustness; NVIDIA FLARE reference |