[EAAPL-DAT008] Real-Time Feature Engineering

Category: Data Architecture
Sub-category: Feature Engineering / Online ML Serving
Version: 1.3
Maturity: Proven
Tags: feature-engineering, feature-store, real-time, Kafka, Flink, Redis, point-in-time-correctness, feature-drift
Regulatory Relevance: EU AI Act Article 10, APRA CPS 230 (operational resilience), ISO/IEC 42001 §8.4

1. Executive Summary

AI models that make real-time decisions — fraud detection, personalised recommendations, credit pre-screening — require features computed from the most current available data. Batch feature engineering, where features are pre-computed overnight, introduces staleness that degrades model performance in time-sensitive use cases: fraud features computed 8 hours ago miss the most recent spending patterns that indicate fraud.

This pattern defines a production real-time feature engineering architecture that delivers features within milliseconds to inference endpoints, with freshness SLAs as tight as 10 seconds from event to serving. It covers the streaming pipeline (Kafka/Flink for feature computation), the dual-layer feature store (Redis for online serving, object store for offline training), point-in-time correctness for training data consistency, feature freshness SLA management, and feature drift detection.

Organisations that implement this pattern achieve fraud detection precision improvements of 15–25% through fresher features, and reduce time-to-feature from hours to seconds for real-time AI use cases.

Target audience: ML Platform leads, Data Engineering leads, Enterprise Architects.

2. Problem Statement

Business Problem

Real-time AI decisions are only as good as the data behind them. Stale features cause models to miss current context: a customer who spent $5,000 in the last 10 minutes looks identical to a normal customer on yesterday's features. Business consequences include missed fraud, poor real-time personalisation, and incorrect risk assessments.

Technical Problem

Batch feature pipelines (nightly ETL) produce features that are 8–24 hours stale by inference time.
Ad hoc feature computation at inference time introduces latency (50–500ms per lookup) and creates unmanaged technical debt.
Features computed differently for training (batch) vs. serving (real-time) introduce training-serving skew — one of the top causes of production model underperformance.
Point-in-time correctness is violated: training data uses features computed at the current time, not the time of the historical event, leaking future information into training.
Feature pipelines are duplicated across teams, with no reuse and inconsistent feature definitions.

Symptoms

Fraud detection model AUC degrades during peak transaction periods (features are stale).
Model performs well in offline evaluation but poorly in production (training-serving skew).
Multiple teams computing the same features with slightly different logic.
Feature pipeline failures cause model fallback to default values with no alerting.
Training labels are point-in-time valid, but features reflect the current time (future data leakage detected in post-hoc audit).

Cost of Inaction

Dimension	Impact
Model quality	15–40% degradation in time-sensitive use cases from feature staleness
Fraud / risk	Missed fraud events during staleness window; direct financial loss
Engineering	Feature duplication across teams; estimated at 2–3× engineering waste
Compliance	APRA operational resilience: feature pipeline failure = AI system degradation

3. Context

When to Apply

AI models making real-time decisions where feature freshness affects quality (fraud, recommendations, pricing, content ranking).
Models where latency budget for inference is <100ms total.
Organisations with >3 ML teams benefiting from shared feature definitions.
Use cases requiring training-serving consistency (training-serving skew is a documented issue).

When NOT to Apply

Batch inference where features can be pre-computed (nightly batch is sufficient).
Very simple AI with 1–3 features that can be trivially computed at request time.
Organisation at early AI maturity where a shared feature store adds operational complexity beyond team capability.

Prerequisites

Prerequisite	Minimum Viable	Preferred
Event streaming infrastructure	Kafka (basic)	Kafka + Schema Registry + Kafka Streams
Stream processing	Flink (basic)	Apache Flink + Flink SQL
Online serving store	Redis (standalone)	Redis Enterprise Cluster
ML platform	MLflow	Feast / Tecton / Vertex AI Feature Store
Feature ownership model	Informal	Formal feature ownership with SLA commitments

Industry Applicability

Industry	Applicability	Driver
Financial Services / Fintech	Critical	Real-time fraud; credit pre-screening; risk pricing
E-commerce	High	Real-time personalisation; dynamic pricing; recommendation
Ride-sharing / Delivery	High	Dynamic pricing; driver-rider matching; ETA prediction
Telecommunications	High	Real-time churn intervention; network anomaly
Gaming	Medium	Real-time player behaviour; churn prediction
Healthcare	Medium	Real-time clinical risk; patient monitoring

4. Architecture Overview

Design Philosophy

The real-time feature engineering architecture solves the fundamental tension between feature freshness and serving reliability by maintaining two feature stores simultaneously — an online store for low-latency serving and an offline store for training data — with a shared computation layer ensuring they produce identical feature values.

Stream-First, Batch-Fallback Design. Features are computed from streaming events (Kafka) using Flink streaming jobs. This provides freshness from seconds (aggregated windows) to sub-second (lookup features). For features that cannot be computed from streams (complex aggregations requiring full historical data), a batch top-up runs at configurable intervals (hourly, daily) to pre-compute the batch component, which is then merged with the streaming-computed component in the feature store.
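The windowed aggregation that a streaming job maintains per entity can be illustrated with a minimal in-memory sketch. It is not Flink code: the class name `SlidingWindowSum` and its methods are hypothetical, and it assumes events for one entity arrive in timestamp order, as they would on a Kafka partition keyed by entity ID.

```python
from collections import deque


class SlidingWindowSum:
    """Running sum of amounts per entity over the trailing window (hypothetical helper)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = {}  # entity_id -> deque of (event_ts, amount)
        self._totals = {}  # entity_id -> current window sum

    def add(self, entity_id, event_ts, amount):
        """Record one event and return the updated window sum for the entity.

        Assumption: event_ts never decreases for a given entity.
        """
        events = self._events.setdefault(entity_id, deque())
        events.append((event_ts, amount))
        total = self._totals.get(entity_id, 0.0) + amount

        # Evict events that are at least one full window older than this event.
        cutoff = event_ts - self.window_seconds
        while events and events[0][0] <= cutoff:
            _, old_amount = events.popleft()
            total -= old_amount

        self._totals[entity_id] = total
        return total
```

Each call costs amortised constant time per event, which is the property that lets a window feature stay seconds-fresh rather than being recomputed from full history.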

Dual-Layer Feature Store. The architecture maintains two feature store layers with different characteristics:

Online store (Redis): Serves features at inference time with <10ms p99 latency. Stores only the most recent feature values per entity (customer_id, account_id, etc.). Redis sorted sets serve time-windowed aggregations. Optimised for latest-value reads by entity key.
Offline store (object store + columnar format): Stores the full history of feature values with timestamps. Used for training dataset creation. Enables point-in-time correct feature retrieval: given a historical event at time T, retrieve the feature value that would have been available at time T — not the current value.

Training-Serving Consistency via Shared Computation Logic. The most critical architectural property is that the Flink streaming job and the offline feature pipeline use identical computation logic. This is achieved through the shared Feature Transform Library, a versioned Python/Java library that contains the feature computation logic and is consumed by both the Flink streaming job and the batch offline pipeline. When feature logic changes, both pipelines are updated together, which removes logic divergence between the two paths as a source of training-serving skew.
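A minimal sketch of the shared-library idea, assuming a hypothetical set of spend features: one pure function holds the feature logic, and both the streaming and batch paths are thin wrappers that only differ in how they select the input events. All names here (`spend_features`, `stream_features`, `batch_features`, `FEATURE_LIB_VERSION`) are illustrative, not part of any real library.

```python
FEATURE_LIB_VERSION = "1.0.0"  # hypothetical version tag recorded with every feature row


def spend_features(amounts):
    """Single source of truth for the (hypothetical) spend features."""
    count = len(amounts)
    total = float(sum(amounts))
    return {
        "txn_count": float(count),
        "txn_sum": total,
        "txn_mean": total / count if count else 0.0,
        "txn_max": float(max(amounts, default=0.0)),
    }


def stream_features(window_events):
    """Streaming path: called with the events currently in the window."""
    return spend_features([e["amount"] for e in window_events])


def batch_features(rows, entity_id, start_ts, end_ts):
    """Batch path: selects the same window from historical rows, then reuses the same logic."""
    amounts = [
        r["amount"]
        for r in rows
        if r["entity_id"] == entity_id and start_ts < r["ts"] <= end_ts
    ]
    return spend_features(amounts)
```

A parity test that feeds identical events through both wrappers and asserts equal output is one simple way to run the scheduled skew check described in this pattern.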

Point-in-Time Correctness. When creating a training dataset, features must be retrieved at the time of the label event, not the current time. The offline store's timestamp index enables point-in-time queries: for a fraud label at 2024-03-15 14:32:00, retrieve the feature values that existed at that exact time. This requires the offline store to retain historical feature snapshots (not just the latest value) — a storage cost trade-off managed through tiered retention (recent history at full fidelity; older history at reduced fidelity or summarised).
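The point-in-time lookup can be sketched with a sorted timestamp index and a binary search. This is an illustration in plain Python rather than a query against a real offline store; `build_index` and `value_as_of` are hypothetical names, and the sketch assumes a feature row written exactly at time T counts as available at T.

```python
from bisect import bisect_right
from datetime import datetime


def build_index(rows):
    """Group (entity_id, feature_ts, value) rows into per-entity sorted histories."""
    index = {}
    for entity_id, ts, value in sorted(rows, key=lambda r: (r[0], r[1])):
        ts_list, values = index.setdefault(entity_id, ([], []))
        ts_list.append(ts)
        values.append(value)
    return index


def value_as_of(index, entity_id, as_of):
    """Return the latest value with feature_ts <= as_of, or None if none existed yet."""
    if entity_id not in index:
        return None
    ts_list, values = index[entity_id]
    pos = bisect_right(ts_list, as_of)  # number of rows written at or before as_of
    return values[pos - 1] if pos else None


history = build_index([
    ("c1", datetime(2024, 3, 15, 14, 0, 0), 120.0),
    ("c1", datetime(2024, 3, 15, 14, 30, 0), 5000.0),
    ("c1", datetime(2024, 3, 15, 15, 0, 0), 5200.0),  # written after the label event
])
label_time = datetime(2024, 3, 15, 14, 32, 0)
feature_at_label = value_as_of(history, "c1", label_time)
```

The row written at 15:00 is never returned for the 14:32 label, which is the leakage this property guards against. Returning `None` when no earlier row exists makes a too-short retention window visible instead of silently substituting a later value.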

Feature Freshness SLA Management. Each feature has a declared freshness SLA — the maximum acceptable age of a feature value when served at inference time. The feature store monitoring layer continuously tracks feature ages and alerts when staleness exceeds SLA. Inference services receive feature freshness metadata alongside feature values, enabling them to downgrade confidence or trigger fallback behaviour when features are stale beyond SLA.
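The per-feature freshness check can be sketched as follows, assuming each served value carries its write timestamp and declared SLA. The `FeatureValue` type, its field names, and the `"fresh"`/`"degraded"` labels are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureValue:
    name: str
    value: float
    written_at: float   # epoch seconds when the value was written to the online store
    sla_seconds: float  # declared freshness SLA for this feature


def check_freshness(features, now):
    """Compare each feature's age against its SLA and report the serving mode."""
    stale = [f.name for f in features if now - f.written_at > f.sla_seconds]
    return {"mode": "degraded" if stale else "fresh", "stale_features": stale}
```

Returning the list of stale feature names, rather than a single flag, lets the inference service decide whether the affected features matter enough to reduce confidence or trigger a fallback.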

Feature Drift Detection. Feature values are monitored for distribution drift using Population Stability Index (PSI) computed over rolling windows. A feature whose distribution shifts significantly (PSI > 0.1) may indicate upstream data pipeline issues, concept drift, or business context changes. Drift alerts are routed to the feature owner for investigation.
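For reference, PSI over fixed bins is the sum of (actual − expected) × ln(actual / expected) across bins, where each term uses the bin's share of the population. The sketch below is a minimal stdlib implementation under some assumptions: bin edges are supplied by the caller, empty bins are floored at a small epsilon to avoid log(0), and the 0.25 cut-off is taken from the monitoring thresholds in this pattern. Function names are hypothetical.

```python
import math
from bisect import bisect_right


def bin_proportions(values, edges, eps=1e-6):
    """Share of values in each bin defined by sorted edges; empty bins floored at eps."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    n = len(values)
    return [max(c / n, eps) for c in counts]


def psi(baseline, current, edges):
    """Population Stability Index of `current` against `baseline`."""
    expected = bin_proportions(baseline, edges)
    actual = bin_proportions(current, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def drift_status(psi_value):
    """Map a PSI value to the alert levels used in this pattern."""
    if psi_value > 0.25:
        return "block_training"
    if psi_value > 0.1:
        return "warning"
    return "ok"
```

Each term of the sum is non-negative, so PSI is zero only when the two binned distributions match. The epsilon floor is a pragmatic choice that makes a newly empty or newly populated bin produce a large PSI instead of an error.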

5. Architecture Diagram

ARCHITECTURE DIAGRAM

flowchart TD
    subgraph Input["Event Sources"]
        A[Streaming Events]
        B[Batch Sources]
    end
    subgraph Compute["Feature Computation"]
        C[Flink Streaming Pipeline]
        D[Batch Top-Up Pipeline]
    end
    subgraph Store["Dual-Layer Feature Store"]
        E[(Online Store Redis)]
        F[(Offline Store Parquet)]
    end
    subgraph Serving["Model Serving"]
        G{Freshness Check}
        H[Inference Service]
        I[Training Dataset Builder]
    end
    A --> C
    B --> D
    C -->|real-time writes| E
    C -->|time-indexed writes| F
    D -->|batch writes| E
    D -->|history writes| F
    E --> G
    G -->|fresh| H
    G -->|stale| H
    F --> I
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#fef9c3,stroke:#eab308
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#f3e8ff,stroke:#a855f7
    style H fill:#d1fae5,stroke:#10b981
    style I fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Kafka Event Bus	Messaging	Ingests operational events; provides ordered, partitioned stream per entity	Apache Kafka, AWS MSK, Confluent Cloud	Critical
Flink Streaming Job	Processing	Computes real-time features from event streams; sliding window aggregations	Apache Flink, Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics), Azure Stream Analytics	Critical
Feature Transform Library	Library	Shared versioned feature computation logic; consumed by both stream and batch pipelines	Python library (pip), Java library (Maven), dbt macros	Critical
Online Feature Store	Storage + Serving	Low-latency feature serving for inference; latest feature values per entity	Redis Enterprise, DynamoDB, Aerospike, Feast online store	Critical
Offline Feature Store	Storage	Historical time-indexed feature values; point-in-time correct training data creation	Parquet on S3/GCS/ADLS, Delta Lake, Apache Hudi	Critical
Feature Registry	Metadata Store	Defines feature schemas, ownership, freshness SLA, lineage, approved use cases	Feast Registry, Tecton, Vertex AI Feature Store Registry, custom PostgreSQL	High
Batch Top-Up Pipeline	Processing	Computes batch-only or computationally heavy features on schedule	Apache Spark, dbt, AWS Glue	High
Feature Retrieval Client	Library	Batch entity lookup from online store for inference; handles missing features	Feast Python SDK, Tecton SDK, custom Redis client	Critical
Point-in-Time Query Engine	Processing	Retrieves feature values at historical timestamp for training data creation	Feast historical retrieval, custom SQL with time-travel, Delta Lake time travel	High
Feature Monitoring Service	Processing	Tracks freshness, drift (PSI), null rate per feature; fires alerts	Custom Python + Grafana, Evidently AI feature monitoring, WhyLabs	High

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Operational system	Emits business event to Kafka topic	Kafka event record
2	Flink streaming job	Consumes event; applies Feature Transform Library logic; computes aggregations over sliding windows	Feature value update
3	Online store writer	Writes updated feature values to Redis with timestamp	Redis HSET per entity ID
4	Offline store writer	Appends feature values with event timestamp to Parquet/Delta	Time-indexed feature row
5	Inference service	Receives prediction request with entity ID	Feature retrieval request
6	Feature Retrieval Client	Batch-reads feature values for entity from Redis	Feature vector with freshness metadata
7	Freshness Checker	Validates each feature age against SLA	Pass (fresh) or degraded mode
8	Inference service	Passes feature vector to model; returns prediction	Prediction with confidence score
9	Training data builder	Point-in-time query: for label events at time T, retrieve feature values at T	Point-in-time correct training dataset
10	Feature Monitor	Continuously computes PSI, null rate, freshness age per feature	Drift alerts; staleness alerts

Error Flow

Error Condition	Trigger	Response	Recovery
Flink job failure	Stream processing crash	Online store stops updating; features become stale; freshness alert raised	Flink job restarted; replay from Kafka retention window; freshness alert until caught up
Redis unavailable	Cache failure	Inference service falls back to cached last-known value or model default	Redis failover (sentinel/cluster); alert; degrade predictions rather than error
Feature freshness SLA breach	Feature age > SLA	Inference service receives stale flag; confidence reduced or fallback triggered	Investigate Flink job; Kafka lag; upstream event source
Training-serving skew detected	Model performance drops in production vs. offline eval	Feature Transform Library version audit; verify stream and batch use same version	Roll back feature library version; retrain on corrected features
Point-in-time query failure	Offline store missing historical window	Training dataset creation fails	Increase offline store retention window; backfill missing history from source
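The "Redis unavailable" row above describes a fallback chain: online store first, then the last-known value, then a model default. A minimal sketch of that chain follows. The `online_get` callable stands in for a real store client, the builtin `ConnectionError` stands in for whatever exception that client actually raises, and `DEFAULTS` and the source labels are hypothetical.

```python
DEFAULTS = {"txn_sum_10m": 0.0}  # hypothetical model default values


def get_features(entity_id, online_get, last_known, defaults):
    """Return (features, source), degrading instead of failing when the store is down.

    online_get: callable(entity_id) -> dict; assumed to raise ConnectionError on outage.
    last_known: dict used as a local cache of the most recent successful reads.
    """
    try:
        features = online_get(entity_id)
        last_known[entity_id] = features
        return features, "online"
    except ConnectionError:
        if entity_id in last_known:
            return last_known[entity_id], "last_known"
        return dict(defaults), "default"
```

Returning the source alongside the values lets the caller log degraded predictions and alert on them, so a fallback does not go unnoticed.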

8. Security Considerations

Authentication & Authorisation

Redis online store access requires authentication; separate read-only credentials for inference services vs. read-write for feature pipelines.
Feature Registry access control: feature ownership declaration restricts writes; reads are broadly accessible within organisation.

Secrets Management

Redis and Kafka credentials are stored in a secrets manager and rotated quarterly.

Data Classification

Online feature store contains derived features from personal data; classified as Confidential minimum.
Offline feature store contains historical personal data features; strict access control; encryption.

Encryption

Redis at rest encrypted (Redis Enterprise AES-256); TLS for Redis connections.
Offline feature store (S3/GCS) encrypted at rest; in transit TLS 1.3.

Auditability

Feature access at inference time logged per entity ID (for right-to-explanation requests).
Feature schema changes logged in Feature Registry with version history.

OWASP LLM Top 10 Mapping

OWASP LLM Risk	Relevance	Mitigation
LLM04: Model Denial of Service	High-QPS inference traffic could overload the feature store and degrade serving	Redis cluster horizontal scaling; rate limiting on feature retrieval API
LLM02: Insecure Output Handling	Stale or incorrect features cause model to produce wrong outputs	Freshness checking; feature validation before inference
LLM06: Sensitive Information Disclosure	Feature store access exposes derived personal data	Access control on feature store; encryption; access logging

9. Governance Considerations

Responsible AI

Feature definitions must document what personal data they are derived from; subject to privacy review.
Features based on protected attributes (age, ethnicity, gender, postcode as proxy) require explicit bias review before use in consequential AI.

Model Risk Management

Feature freshness SLA is a model risk parameter: breach triggers model risk review.
Training-serving skew detection is a model risk control; regular skew audits (quarterly) required.

Human Approval Checkpoints

New feature requiring personal data processing: Privacy Officer review.
Feature based on protected attribute proxy: Bias Review Board sign-off.
Freshness SLA reduction (making SLA tighter): capacity review by ML Platform.

Governance Artefacts

Artefact	Owner	Cadence	Purpose
Feature Registry Entry	Feature Owner	Per feature version	Schema, SLA, lineage, data source declaration, approved uses
Freshness SLA Compliance Report	ML Platform	Weekly	Per-feature compliance with declared freshness SLA
Drift Report	Feature Monitor	Daily	PSI and null rate per feature; trend over 30 days
Training-Serving Skew Audit	ML Platform	Quarterly	Feature value comparison: training snapshot vs. current production feature values

10. Operational Considerations

Monitoring

Metric	Alert Threshold	Tooling
Online store write lag (event to Redis)	>freshness SLA	Flink job lag + Redis write latency
Online store read latency (p99)	>10ms	Redis metrics + Grafana
Feature null rate	>2% for any feature	Feature Monitor
Feature PSI	>0.1 warning; >0.25 block training	Feature Monitor
Kafka consumer group lag	>10,000 messages	Kafka metrics
Flink job backpressure	>50% sustained for 5 min	Flink monitoring

SLOs

SLO	Target	Measurement
Event to online store (end-to-end freshness)	<30 seconds (standard); <10 seconds (high-priority)	Event timestamp to Redis write timestamp
Online store read latency (p99)	<10ms	Redis latency histogram
Online store availability	99.99%	Health check
Training dataset creation (point-in-time query)	<4 hours for 12-month history, 10M entities	Query execution time

Capacity Planning

Redis: size based on peak entity count × feature count × average feature value size × replication factor.
Flink: size based on peak event throughput × window size complexity.
Offline store: size based on entity count × feature count × retention period × snapshot frequency.
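The sizing formulas above can be turned into a back-of-envelope calculation. The sketch below works through the medium-scale example from this pattern (10M entities, 200 features). The 16-byte average value size, replication factor of 2, 90-day retention, and hourly snapshots are assumptions chosen for illustration, and the results ignore Redis key and data-structure overhead as well as columnar compression in the offline store.

```python
def online_store_bytes(entities, features, avg_value_bytes, replication_factor):
    """Raw online store size: entity count x feature count x value size x replication."""
    return entities * features * avg_value_bytes * replication_factor


def offline_store_bytes(entities, features, avg_value_bytes, retention_days, snapshots_per_day):
    """Raw offline store size: one value per entity, feature, and retained snapshot."""
    return entities * features * avg_value_bytes * retention_days * snapshots_per_day


# Assumed inputs for illustration only.
redis_bytes = online_store_bytes(10_000_000, 200, 16, 2)
offline_bytes = offline_store_bytes(10_000_000, 200, 16, 90, 24)

print(f"online store:  {redis_bytes / 1e9:.0f} GB raw")
print(f"offline store: {offline_bytes / 1e12:.2f} TB raw, before compression")
```

Under these assumptions the offline store is roughly three orders of magnitude larger than the online store, which is why retention period and snapshot frequency are the main levers for offline cost.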

Disaster Recovery

Component	RTO	RPO	Strategy
Online Store (Redis)	5 minutes	1 minute	Redis Sentinel/Cluster failover; persistence enabled
Flink Streaming Job	10 minutes	Kafka retention window	Flink checkpoint recovery; Kafka replay
Offline Store	4 hours	24 hours	Cross-region replication; incremental backup

11. Cost Considerations

Cost Drivers

Cost Driver	Typical Range	Notes
Redis Enterprise (online store)	$1,000–$15,000/month	Scales with data size × replication factor
Flink compute	$500–$8,000/month	Scales with event throughput × window complexity
Kafka / MSK	$300–$5,000/month	Scales with throughput × retention period
Offline store (S3 + query)	$200–$3,000/month	Scales with entity × feature × history
Managed feature store (Tecton, Vertex AI FS)	$3,000–$30,000/month	Includes all components; simpler ops
Feature Monitor	$200–$2,000/month	Custom or Evidently AI / WhyLabs

Optimisations

Cache frequently accessed entity feature vectors at inference service level (TTL = freshness SLA) to reduce Redis load.
Use Redis Cluster hash slots to partition entities across nodes and spread load; a single high-velocity entity can still become a hot key, which the inference-side cache mitigates.
Serialise feature values in Redis with a compact binary encoding such as MessagePack or CBOR; typically 3–5× storage reduction.
Tier offline store: recent 90 days at full fidelity; older at reduced granularity.
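The inference-side cache from the first optimisation can be sketched as a small TTL cache. The class name `TTLCache` and the injectable clock are hypothetical. One consideration when choosing the TTL: a cached copy ages on top of the feature's age at the moment it was read, so a TTL equal to the freshness SLA is an upper bound and a shorter TTL leaves headroom.

```python
import time


class TTLCache:
    """Local cache of entity feature vectors with a fixed time-to-live (hypothetical helper)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._entries = {}    # key -> (expires_at, value)

    def get(self, key):
        """Return the cached value, or None if it is absent or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return value

    def put(self, key, value):
        """Store a value that expires ttl_seconds from now."""
        self._entries[key] = (self._clock() + self._ttl, value)
```

On a miss the caller reads from the online store and calls `put`, so repeated requests for a high-velocity entity within the TTL do not reach Redis.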

Indicative Cost Range

Scale	Monthly Cost	Basis
Small (<1M entities, <50 features, <10K events/sec)	$2,000–$8,000	Redis standalone + Flink on Kubernetes + S3
Medium (10M entities, 200 features, 100K events/sec)	$10,000–$40,000	Redis Enterprise Cluster + managed Flink + managed Kafka
Large (100M+ entities, 500+ features, 1M+ events/sec)	$40,000–$200,000	Multi-region feature store + managed streaming platform

12. Trade-Off Analysis

Option Comparison

Option	Pros	Cons	Recommended When
A: Full real-time feature pipeline (this pattern)	Freshest features; training-serving consistency; shared features	High operational complexity; significant infrastructure cost	Fraud, real-time pricing, personalisation; >3 ML teams
B: Managed feature store (Tecton, Vertex AI FS)	Reduced ops; built-in monitoring; enterprise support	High licence cost; vendor lock-in	Large organisation; ML platform team lacks streaming expertise
C: Batch feature computation (nightly ETL)	Simple; low cost; well-understood	Features 8–24 hours stale; misses real-time signals	Batch inference only; freshness not a business requirement
D: Compute features at inference time (ad hoc)	Zero infrastructure; feature always fresh	High per-request latency (50–500ms DB queries); no reuse; creates technical debt	Simple models; 1–3 features; no team sharing required

Architectural Tensions

Tension	Trade-Off	Resolution
Freshness vs. cost	Fresher features require more streaming compute and Redis memory	Declare freshness SLA per feature; only high-priority features use the <10s freshness tier
Streaming vs. batch consistency	Streaming and batch paths must produce identical feature values	Shared Feature Transform Library consumed by both; skew testing on schedule
Online store size vs. latency	More features = more Redis memory = higher cost; but fewer features = worse model	Feature importance analysis; only features with significant model impact in online store
Operational complexity vs. velocity	Full feature store platform has high setup cost	Start with managed platform (Feast hosted) or Vertex AI FS; build custom when scale justifies

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Training-serving skew (feature logic divergence)	Medium	High — model underperforms silently in production	Skew audit; post-deployment performance monitoring	Roll back Feature Transform Library version; retrain on corrected features
Hot key in Redis (all requests for same entity)	Medium	Medium — Redis latency spikes for affected entity	Redis slow log; per-key latency monitoring	Implement local inference-side cache; Redis cluster rebalancing
Kafka consumer lag spike	Medium	Medium — features become stale; SLA breach	Kafka consumer lag metrics	Scale Flink parallelism; increase Kafka partition count
Point-in-time query produces future-leaked features	Low	Critical — model trained with data leakage	Post-hoc leakage detection in training data quality checks	Rebuild offline store with correct timestamp indexing; retrain
Redis cluster split-brain	Very Low	High — inconsistent feature values	Redis cluster health monitoring	Redis Cluster automatic failover; read from primary only during split

14. Regulatory Considerations

Regulation	Requirement	Pattern Response
EU AI Act Article 10	Training data representativeness and freshness	Freshness SLAs and point-in-time correct training address data currency requirements
APRA CPS 230	Operational resilience of AI systems	Online store HA (99.99% SLO); Flink checkpoint recovery; DR targets defined
Privacy Act (Australia) APP 6	Personal data used for primary purpose	Feature Registry declares data sources; purpose-limited feature access
GDPR Article 22	Right to explanation for automated decisions	Feature access audit log per entity enables feature-level explanation

15. Reference Implementations

AWS

Component	AWS Service
Event bus	Amazon MSK (Managed Kafka)
Stream processing	Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics)
Online store	Amazon ElastiCache (Redis Cluster)
Offline store	S3 + AWS Glue Catalog (Parquet/Delta)
Feature registry	Amazon SageMaker Feature Store
Point-in-time query	SageMaker Feature Store offline API

Azure

Component	Azure Service
Event bus	Azure Event Hubs (Kafka compatible)
Stream processing	Azure Stream Analytics + HDInsight Flink
Online store	Azure Cache for Redis Enterprise
Offline store	ADLS Gen2 + Delta Lake
Feature registry	Azure ML Feature Store (preview)

GCP

Component	GCP Service
Event bus	Cloud Pub/Sub + Dataflow (Apache Beam)
Stream processing	Cloud Dataflow (Apache Beam)
Online store	Vertex AI Feature Store (online serving)
Offline store	Vertex AI Feature Store (offline) + GCS
Feature registry	Vertex AI Feature Store

On-Premises

Component	Technology
Event bus	Apache Kafka on Kubernetes
Stream processing	Apache Flink on Kubernetes
Online store	Redis Enterprise on Kubernetes
Offline store	MinIO + Delta Lake
Feature registry	Feast (self-hosted on Kubernetes)

16. Related Patterns

Pattern	ID	Relationship	Notes
AI Data Mesh Integration	EAAPL-DAT001	Complements	Online feature serving is a specialised data product
Data Quality for AI	EAAPL-DAT002	Depends on	Feature null rate and drift checks are quality dimensions
Data Lineage for AI	EAAPL-DAT003	Complements	Feature computation events captured in lineage graph
AI Training Data Governance	EAAPL-DAT007	Depends on	Feature definitions registered in training data governance framework
Shadow Model Deployment	EAAPL-MDL002	Complements	Shadow mode uses same feature pipeline; validates feature parity
Model Versioning	EAAPL-MDL001	Complements	Feature store version linked to model version

17. Maturity Assessment

Overall Maturity: Proven — Real-time feature engineering is production-proven at scale (Netflix, Uber, LinkedIn). Managed feature stores (Vertex AI, SageMaker) are GA. Operational complexity remains high for self-managed deployments.

Dimension	Score (1–5)	Notes
Architectural clarity	5	Well-defined dual-store architecture; point-in-time correctness well-understood
Tooling maturity	4	Redis, Kafka, Flink mature; feature store managed services maturing
Regulatory alignment	4	Freshness SLAs address EU AI Act data currency; APRA resilience covered
Operational complexity	2	High; requires streaming + Redis + feature store expertise
Cost efficiency	3	Significant infrastructure cost; justified for high-value real-time use cases
Security	4	Access controls, encryption, audit logging well-defined

18. Revision History

Version	Date	Author	Changes
1.0	2023-08-01	EAAPL Working Group	Initial publication
1.1	2024-01-15	EAAPL Working Group	Added point-in-time correctness detail; training-serving skew section
1.2	2024-07-01	EAAPL Working Group	Added managed feature store options; GCP Vertex AI FS reference
1.3	2025-03-01	EAAPL Working Group	Updated cost ranges; added feature drift detection section
