[EAAPL-DAT008] Real-Time Feature Engineering
Category: Data Architecture
Sub-category: Feature Engineering / Online ML Serving
Version: 1.3
Maturity: Proven
Tags: feature-engineering, feature-store, real-time, Kafka, Flink, Redis, point-in-time-correctness, feature-drift
Regulatory Relevance: EU AI Act Article 10, APRA CPS 230 (operational resilience), ISO 42001 §8.4
1. Executive Summary
AI models that make real-time decisions — fraud detection, personalised recommendations, credit pre-screening — require features computed from the most current available data. Batch feature engineering, where features are pre-computed overnight, introduces staleness that degrades model performance in time-sensitive use cases: fraud features computed 8 hours ago miss the most recent spending patterns that indicate fraud.
This pattern defines a production real-time feature engineering architecture that delivers features within milliseconds to inference endpoints, with freshness SLAs as tight as 10 seconds from event to serving. It covers the streaming pipeline (Kafka/Flink for feature computation), the dual-layer feature store (Redis for online serving, object store for offline training), point-in-time correctness for training data consistency, feature freshness SLA management, and feature drift detection.
Organisations that implement this pattern can improve fraud detection precision by 15–25% through fresher features, and can cut time-to-feature from hours to seconds for real-time AI use cases.
Target audience: ML Platform leads, Data Engineering leads, Enterprise Architects.
2. Problem Statement
Business Problem
Real-time AI decisions are only as good as the data behind them. Stale features cause models to miss current context: a customer who spent $5,000 in the last 10 minutes looks identical to a normal customer on yesterday's features. Business consequences include missed fraud, poor real-time personalisation, and incorrect risk assessments.
Technical Problem
- Batch feature pipelines (nightly ETL) produce features that are 8–24 hours stale by inference time.
- Ad hoc feature computation at inference time introduces latency (50–500ms per lookup) and creates unmanaged technical debt.
- Features computed differently for training (batch) vs. serving (real-time) introduce training-serving skew — one of the top causes of production model underperformance.
- Point-in-time correctness is violated: training data uses features computed at the current time, not the time of the historical event, leaking future information into training.
- Feature pipelines are duplicated across teams, with no reuse and inconsistent feature definitions.
Symptoms
- Fraud detection model AUC degrades during peak transaction periods (features are stale).
- Model performs well in offline evaluation but poorly in production (training-serving skew).
- Multiple teams computing the same features with slightly different logic.
- Feature pipeline failures cause model fallback to default values with no alerting.
- Training labels are point-in-time valid but features are taken at the current time (future data leakage detected in post-hoc audit).
Cost of Inaction
| Dimension | Impact |
|---|---|
| Model quality | 15–40% degradation in time-sensitive use cases from feature staleness |
| Fraud / risk | Missed fraud events during staleness window; direct financial loss |
| Engineering | Feature duplication across teams; estimated 2–3× engineering waste |
| Compliance | APRA operational resilience: feature pipeline failure = AI system degradation |
3. Context
When to Apply
- AI models making real-time decisions where feature freshness affects quality (fraud, recommendations, pricing, content ranking).
- Models whose total inference latency budget is <100ms.
- Organisations with >3 ML teams that would benefit from shared feature definitions.
- Use cases requiring training-serving consistency (training-serving skew is a documented issue).
When NOT to Apply
- Batch inference where features can be pre-computed (nightly batch is sufficient).
- Very simple AI with 1–3 features that can be trivially computed at request time.
- Organisation at early AI maturity where a shared feature store adds operational complexity beyond team capability.
Prerequisites
| Prerequisite | Minimum Viable | Preferred |
|---|---|---|
| Event streaming infrastructure | Kafka (basic) | Kafka + Schema Registry + Kafka Streams |
| Stream processing | Flink (basic) | Apache Flink + Flink SQL |
| Online serving store | Redis (standalone) | Redis Enterprise Cluster |
| ML platform | MLflow | Feast / Tecton / Vertex AI Feature Store |
| Feature ownership model | Informal | Formal feature ownership with SLA commitments |
Industry Applicability
| Industry | Applicability | Driver |
|---|---|---|
| Financial Services / Fintech | Critical | Real-time fraud; credit pre-screening; risk pricing |
| E-commerce | High | Real-time personalisation; dynamic pricing; recommendation |
| Ride-sharing / Delivery | High | Dynamic pricing; driver-rider matching; ETA prediction |
| Telecommunications | High | Real-time churn intervention; network anomaly |
| Gaming | Medium | Real-time player behaviour; churn prediction |
| Healthcare | Medium | Real-time clinical risk; patient monitoring |
4. Architecture Overview
Design Philosophy
The real-time feature engineering architecture solves the fundamental tension between feature freshness and serving reliability by maintaining two feature stores simultaneously — an online store for low-latency serving and an offline store for training data — with a shared computation layer ensuring they produce identical feature values.
Stream-First, Batch-Fallback Design. Features are computed from streaming events (Kafka) using Flink streaming jobs. This provides freshness from seconds (aggregated windows) to sub-second (lookup features). For features that cannot be computed from streams (complex aggregations requiring full historical data), a batch top-up runs at configurable intervals (hourly, daily) to pre-compute the batch component, which is then merged with the streaming-computed component in the feature store.
Dual-Layer Feature Store. The architecture maintains two feature store layers with different characteristics:
- Online store (Redis): Serves features at inference time with <10ms p99 latency. Stores only the most recent feature values per entity (customer_id, account_id, etc.). Redis sorted sets back the time-windowed aggregations. Optimised for reads of the latest value per entity, not for historical lookups.
- Offline store (object store + columnar format): Stores the full history of feature values with timestamps. Used for training dataset creation. Enables point-in-time correct feature retrieval: given a historical event at time T, retrieve the feature value that would have been available at time T — not the current value.
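As a rough illustration of how a sorted set can back a time-windowed aggregation in the online store, the sketch below keeps one sorted set per entity, scored by event time, and trims members that have left the window. The key layout, the member encoding, and the `InMemorySortedSets` stand-in are assumptions made for illustration. The stand-in exposes only `zadd`, `zremrangebyscore`, and `zrangebyscore`, which mirror the redis-py calls of the same names, so a `redis.Redis` client could be passed in its place. The sketch also assumes events arrive roughly in time order.

```python
from typing import Dict, List


class InMemorySortedSets:
    """Hypothetical stand-in exposing the three sorted-set calls used below."""

    def __init__(self) -> None:
        self._sets: Dict[str, Dict[str, float]] = {}

    def zadd(self, name: str, mapping: Dict[str, float]) -> int:
        target = self._sets.setdefault(name, {})
        added = sum(1 for member in mapping if member not in target)
        target.update(mapping)
        return added

    # Parameter names min/max follow the redis-py signature.
    def zremrangebyscore(self, name: str, min: float, max: float) -> int:
        target = self._sets.get(name, {})
        doomed = [m for m, score in target.items() if min <= score <= max]
        for member in doomed:
            del target[member]
        return len(doomed)

    def zrangebyscore(self, name: str, min: float, max: float) -> List[str]:
        target = self._sets.get(name, {})
        hits = [(score, m) for m, score in target.items() if min <= score <= max]
        return [m for _, m in sorted(hits)]


def record_spend(client, entity_id: str, event_id: str, amount_cents: int,
                 event_ts: float, window_s: float) -> int:
    """Add one event and return the entity's spend over the trailing window."""
    key = f"spend_window:{entity_id}"       # assumed key layout
    member = f"{event_id}:{amount_cents}"   # assumed member encoding
    client.zadd(key, {member: event_ts})
    # Drop events at or before the window start so the set stays bounded.
    client.zremrangebyscore(key, float("-inf"), event_ts - window_s)
    total = 0
    for raw in client.zrangebyscore(key, event_ts - window_s, event_ts):
        text = raw.decode() if isinstance(raw, bytes) else raw
        total += int(text.rsplit(":", 1)[1])
    return total
```

With a 600-second window, events at t=100, t=400, and t=800 give running totals that include the first event only until it ages out at the third call.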
Training-Serving Consistency via Shared Computation Logic. The most critical architectural property: the Flink streaming job and the offline feature pipeline must use identical computation logic. This is achieved through a shared Feature Transformation Library — a versioned Python/Java library containing the feature computation logic, consumed by both the Flink streaming job and the batch offline pipeline. When feature logic changes, both pipelines are updated together, eliminating training-serving skew by construction.
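One way to picture the shared library is a pure function that both pipelines import, so the windowing rule lives in exactly one place. The sketch below is illustrative only: the function name `txn_sum_in_window`, the `(event_time, amount)` tuple shape, the version string, and the two caller functions are hypothetical and stand in for a Flink operator and a batch job respectively.

```python
from typing import Iterable, List, Tuple

Event = Tuple[float, float]  # (event_time_seconds, amount); assumed shape

FEATURE_LIB_VERSION = "1.4.0"  # hypothetical version pinned by both pipelines


def txn_sum_in_window(events: Iterable[Event], as_of: float, window_s: float) -> float:
    """Shared logic: sum amounts with as_of - window_s < event_time <= as_of."""
    return sum(amount for ts, amount in events if as_of - window_s < ts <= as_of)


def streaming_update(buffer: List[Event], new_event: Event, window_s: float) -> float:
    """Streaming caller: append the event, evict expired ones, reuse the shared logic."""
    buffer.append(new_event)
    as_of = new_event[0]
    buffer[:] = [e for e in buffer if e[0] > as_of - window_s]
    return txn_sum_in_window(buffer, as_of, window_s)


def batch_backfill(history: List[Event], label_times: List[float], window_s: float) -> List[float]:
    """Batch caller: recompute the same feature at each historical timestamp."""
    return [txn_sum_in_window(history, t, window_s) for t in label_times]
```

Because both callers delegate to the same function, a parity test that replays the same events through each path and compares the outputs is cheap to write and can run in CI whenever the library version changes.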
Point-in-Time Correctness. When creating a training dataset, features must be retrieved at the time of the label event, not the current time. The offline store's timestamp index enables point-in-time queries: for a fraud label at 2024-03-15 14:32:00, retrieve the feature values that existed at that exact time. This requires the offline store to retain historical feature snapshots (not just the latest value) — a storage cost trade-off managed through tiered retention (recent history at full fidelity; older history at reduced fidelity or summarised).
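The point-in-time rule can be sketched as a backward "as of" lookup: for each label, take the latest feature row whose timestamp is not later than the label time. The in-memory `FeatureHistory` layout and the function names below are assumptions for illustration; a real offline store would run the equivalent join over Parquet or Delta tables.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, Optional, Tuple

# Offline store stand-in (assumed layout): per entity, rows sorted by the
# time at which the feature value became available.
FeatureHistory = Dict[str, List[Tuple[float, float]]]  # entity -> [(available_at, value)]


def feature_as_of(history: FeatureHistory, entity_id: str, t: float) -> Optional[float]:
    """Return the latest value with available_at <= t, or None if none existed yet."""
    rows = history.get(entity_id, [])
    times = [ts for ts, _ in rows]
    idx = bisect_right(times, t)  # count of rows with available_at <= t
    return rows[idx - 1][1] if idx else None


def build_training_rows(
    history: FeatureHistory,
    labels: Iterable[Tuple[str, float, int]],
) -> List[Tuple[str, float, Optional[float], int]]:
    """labels holds (entity_id, label_time, label); features never come from after label_time."""
    return [(e, t, feature_as_of(history, e, t), y) for e, t, y in labels]
```

A label at t=250 for an entity with rows at t=100, 200, and 300 picks up the t=200 value; the t=300 row is ignored because it did not exist yet. In pandas, `merge_asof` with its default backward direction performs the same kind of join.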
Feature Freshness SLA Management. Each feature has a declared freshness SLA — the maximum acceptable age of a feature value when served at inference time. The feature store monitoring layer continuously tracks feature ages and alerts when staleness exceeds SLA. Inference services receive feature freshness metadata alongside feature values, enabling them to downgrade confidence or trigger fallback behaviour when features are stale beyond SLA.
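A minimal sketch of the freshness check follows, assuming each served value carries the time it was last written to the online store. The `FeatureValue` type, the feature names, and the "fresh" / "degraded" mode labels are hypothetical; the pattern only requires that freshness metadata travels with the values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FeatureValue:
    value: float
    written_at: float  # epoch seconds of the last online-store write


def stale_features(features: Dict[str, FeatureValue], sla_s: Dict[str, float], now: float) -> List[str]:
    """Names of features whose age exceeds their declared freshness SLA."""
    return sorted(
        name for name, fv in features.items() if now - fv.written_at > sla_s[name]
    )


def serving_mode(features: Dict[str, FeatureValue], sla_s: Dict[str, float], now: float) -> str:
    """'degraded' if any feature breaches its SLA, otherwise 'fresh'."""
    return "degraded" if stale_features(features, sla_s, now) else "fresh"
```

Returning the list of stale feature names, not only a boolean, lets the inference service decide per feature whether to substitute a default, lower confidence, or route to a fallback model.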
Feature Drift Detection. Feature values are monitored for distribution drift using Population Stability Index (PSI) computed over rolling windows. A feature whose distribution shifts significantly (PSI > 0.1) may indicate upstream data pipeline issues, concept drift, or business context changes. Drift alerts are routed to the feature owner for investigation.
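For reference, PSI compares the share of values per bin in a baseline window against the current window: PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the baseline and current bin shares. The sketch below is one simple way to compute it. The fixed bin edges and the small floor applied to empty bins are assumptions; production monitors often derive the edges from baseline quantiles instead. Both inputs are assumed to be non-empty.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Share of values per bin; edges are interior cut points, giving len(edges) + 1 bins."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(1 for e in edges if v >= e)] += 1
    total = len(values)
    # Floor each share at eps so an empty bin does not produce log(0).
    return [max(c / total, eps) for c in counts]


def psi(baseline: Sequence[float], current: Sequence[float], edges: Sequence[float]) -> float:
    """Population Stability Index of current against baseline over shared bins."""
    expected = bin_fractions(baseline, edges)
    actual = bin_fractions(current, edges)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))
```

An unchanged distribution scores 0, and the score grows as mass moves between bins, which is what the 0.1 threshold is applied to.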
5. Architecture Diagram
flowchart TD
subgraph Input["Event Sources"]
A[Streaming Events]
B[Batch Sources]
end
subgraph Compute["Feature Computation"]
C[Flink Streaming Pipeline]
D[Batch Top-Up Pipeline]
end
subgraph Store["Dual-Layer Feature Store"]
E[(Online Store Redis)]
F[(Offline Store Parquet)]
end
subgraph Serving["Model Serving"]
G{Freshness Check}
H[Inference Service]
I[Training Dataset Builder]
end
A --> C
B --> D
C -->|real-time writes| E
C -->|time-indexed writes| F
D -->|batch writes| E
D -->|history writes| F
E --> G
G -->|fresh| H
G -->|stale| H
F --> I
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#dbeafe,stroke:#3b82f6
style C fill:#f0fdf4,stroke:#22c55e
style D fill:#f0fdf4,stroke:#22c55e
style E fill:#fef9c3,stroke:#eab308
style F fill:#fef9c3,stroke:#eab308
style G fill:#f3e8ff,stroke:#a855f7
style H fill:#d1fae5,stroke:#10b981
style I fill:#d1fae5,stroke:#10b981
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Kafka Event Bus | Messaging | Ingests operational events; provides ordered, partitioned stream per entity | Apache Kafka, AWS MSK, Confluent Cloud | Critical |
| Flink Streaming Job | Processing | Computes real-time features from event streams; sliding window aggregations | Apache Flink, Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics), Azure Stream Analytics | Critical |
| Feature Transform Library | Library | Shared versioned feature computation logic; consumed by both stream and batch pipelines | Python library (pip), Java library (Maven), dbt macros | Critical |
| Online Feature Store | Storage + Serving | Low-latency feature serving for inference; latest feature values per entity | Redis Enterprise, DynamoDB, Aerospike, Feast online store | Critical |
| Offline Feature Store | Storage | Historical time-indexed feature values; point-in-time correct training data creation | Parquet on S3/GCS/ADLS, Delta Lake, Apache Hudi | Critical |
| Feature Registry | Metadata Store | Defines feature schemas, ownership, freshness SLA, lineage, approved use cases | Feast Registry, Tecton, Vertex AI Feature Store Registry, custom PostgreSQL | High |
| Batch Top-Up Pipeline | Processing | Computes batch-only or computationally heavy features on schedule | Apache Spark, dbt, AWS Glue | High |
| Feature Retrieval Client | Library | Batch entity lookup from online store for inference; handles missing features | Feast Python SDK, Tecton SDK, custom Redis client | Critical |
| Point-in-Time Query Engine | Processing | Retrieves feature values at historical timestamp for training data creation | Feast historical retrieval, custom SQL with time-travel, Delta Lake time travel | High |
| Feature Monitoring Service | Processing | Tracks freshness, drift (PSI), null rate per feature; fires alerts | Custom Python + Grafana, Evidently AI feature monitoring, WhyLabs | High |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Operational system | Emits business event to Kafka topic | Kafka event record |
| 2 | Flink streaming job | Consumes event; applies Feature Transform Library logic; computes aggregations over sliding windows | Feature value update |
| 3 | Online store writer | Writes updated feature values to Redis with timestamp | Redis HSET per entity ID |
| 4 | Offline store writer | Appends feature values with event timestamp to Parquet/Delta | Time-indexed feature row |
| 5 | Inference service | Receives prediction request with entity ID | Feature retrieval request |
| 6 | Feature Retrieval Client | Batch-reads feature values for entity from Redis | Feature vector with freshness metadata |
| 7 | Freshness Checker | Validates each feature age against SLA | Pass (fresh) or degraded mode |
| 8 | Inference service | Passes feature vector to model; returns prediction | Prediction with confidence score |
| 9 | Training data builder | Point-in-time query: for label events at time T, retrieve feature values at T | Point-in-time correct training dataset |
| 10 | Feature Monitor | Continuously computes PSI, null rate, freshness age per feature | Drift alerts; staleness alerts |
Error Flow
| Error Condition | Trigger | Response | Recovery |
|---|---|---|---|
| Flink job failure | Stream processing crash | Online store stops updating; features become stale; freshness alert raised | Flink job restarted; replay from Kafka retention window; freshness alert until caught up |
| Redis unavailable | Cache failure | Inference service falls back to cached last-known value or model default | Redis failover (sentinel/cluster); alert; degrade predictions rather than error |
| Feature freshness SLA breach | Feature age > SLA | Inference service receives stale flag; confidence reduced or fallback triggered | Investigate Flink job; Kafka lag; upstream event source |
| Training-serving skew detected | Model performance drops in production vs. offline eval | Feature Transform Library version audit; verify stream and batch use same version | Roll back feature library version; retrain on corrected features |
| Point-in-time query failure | Offline store missing historical window | Training dataset creation fails | Increase offline store retention window; backfill missing history from source |
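The "Redis unavailable" response can be sketched as a small fallback chain: online store first, then the last-known value held by the inference service, then the model default. Everything named here (`fetch_with_fallback`, the source labels, the dictionary-based last-known cache) is hypothetical. The sketch also assumes the store client surfaces an outage as Python's built-in `ConnectionError`; a real client library may raise its own exception type, which would need to be caught instead.

```python
from typing import Callable, Dict, Tuple

FeatureVector = Dict[str, float]


def fetch_with_fallback(
    read_online: Callable[[str], FeatureVector],
    entity_id: str,
    last_known: Dict[str, FeatureVector],
    model_defaults: FeatureVector,
) -> Tuple[FeatureVector, str]:
    """Return (feature_vector, source) so callers can count and alert on fallbacks."""
    try:
        vector = read_online(entity_id)
    except ConnectionError:
        # Assumption: outages surface as the built-in ConnectionError.
        if entity_id in last_known:
            return last_known[entity_id], "last_known"
        return dict(model_defaults), "default"
    # Remember the successful read for use during a later outage.
    last_known[entity_id] = vector
    return vector, "online"
```

Returning the source alongside the vector addresses the silent-default symptom listed in the problem statement: the service can emit a metric for every non-"online" read so that fallbacks are never invisible.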
8. Security Considerations
Authentication & Authorisation
- Redis online store access requires authentication; separate read-only credentials for inference services vs. read-write for feature pipelines.
- Feature Registry access control: feature ownership declaration restricts writes; reads are broadly accessible within organisation.
Secrets Management
- Redis credentials, Kafka credentials stored in secrets manager; rotated quarterly.
Data Classification
- Online feature store contains derived features from personal data; classified as Confidential minimum.
- Offline feature store contains historical personal data features; strict access control; encryption.
Encryption
- Redis at rest encrypted (Redis Enterprise AES-256); TLS for Redis connections.
- Offline feature store (S3/GCS) encrypted at rest; in transit TLS 1.3.
Auditability
- Feature access at inference time logged per entity ID (for right-to-explanation requests).
- Feature schema changes logged in Feature Registry with version history.
OWASP LLM Top 10 Mapping
| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM04: Model Denial of Service | Feature store overloaded by high-QPS inference could degrade | Redis cluster horizontal scaling; rate limiting on feature retrieval API |
| LLM02: Insecure Output Handling | Stale or incorrect features cause model to produce wrong outputs | Freshness checking; feature validation before inference |
| LLM06: Sensitive Information Disclosure | Feature store access exposes derived personal data | Access control on feature store; encryption; access logging |
9. Governance Considerations
Responsible AI
- Feature definitions must document what personal data they are derived from; subject to privacy review.
- Features based on protected attributes (age, ethnicity, gender, postcode as proxy) require explicit bias review before use in consequential AI.
Model Risk Management
- Feature freshness SLA is a model risk parameter: breach triggers model risk review.
- Training-serving skew detection is a model risk control; regular skew audits (quarterly) required.
Human Approval Checkpoints
- New feature requiring personal data processing: Privacy Officer review.
- Feature based on protected attribute proxy: Bias Review Board sign-off.
- Freshness SLA reduction (making SLA tighter): capacity review by ML Platform.
Governance Artefacts
| Artefact | Owner | Cadence | Purpose |
|---|---|---|---|
| Feature Registry Entry | Feature Owner | Per feature version | Schema, SLA, lineage, data source declaration, approved uses |
| Freshness SLA Compliance Report | ML Platform | Weekly | Per-feature compliance with declared freshness SLA |
| Drift Report | Feature Monitor | Daily | PSI and null rate per feature; trend over 30 days |
| Training-Serving Skew Audit | ML Platform | Quarterly | Feature value comparison: training snapshot vs. current production feature values |
10. Operational Considerations
Monitoring
| Metric | Alert Threshold | Tooling |
|---|---|---|
| Online store write lag (event to Redis) | >freshness SLA | Flink job lag + Redis write latency |
| Online store read latency (p99) | >10ms | Redis metrics + Grafana |
| Feature null rate | >2% for any feature | Feature Monitor |
| Feature PSI | >0.1 warning; >0.25 block training | Feature Monitor |
| Kafka consumer group lag | >10,000 messages | Kafka metrics |
| Flink job backpressure | >50% sustained for 5 min | Flink monitoring |
SLOs
| SLO | Target | Measurement |
|---|---|---|
| Event to online store (end-to-end freshness) | <30 seconds (standard); <10 seconds (high-priority) | Event timestamp to Redis write timestamp |
| Online store read latency (p99) | <10ms | Redis latency histogram |
| Online store availability | 99.99% | Health check |
| Training dataset creation (point-in-time query) | <4 hours for 12-month history, 10M entities | Query execution time |
Capacity Planning
- Redis: size based on peak entity count × feature count × average feature value size × replication factor.
- Flink: size based on peak event throughput × window size complexity.
- Offline store: size based on entity count × feature count × retention period × snapshot frequency.
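The sizing rules above reduce to simple multiplication, sketched below as a back-of-envelope calculator. The function names are hypothetical, and the example inputs are assumptions chosen only to show the arithmetic. The Redis figure covers raw value bytes only, so per-key and data-structure overhead would need to be added on top.

```python
def redis_value_bytes(entities: int, features: int, avg_value_bytes: int, replication_factor: int) -> int:
    """Raw value bytes held in the online store, including replicas."""
    return entities * features * avg_value_bytes * replication_factor


def offline_store_values(entities: int, features: int, retention_days: int, snapshots_per_day: int) -> int:
    """Number of feature values retained in the offline store."""
    return entities * features * retention_days * snapshots_per_day


# Example inputs (assumed): 10M entities, 200 features, 16-byte values, 2 copies.
online_bytes = redis_value_bytes(10_000_000, 200, 16, 2)
# Example inputs (assumed): hourly snapshots kept for 90 days.
offline_values = offline_store_values(10_000_000, 200, 90, 24)
```

With these inputs the online store holds 64 GB of raw values and the offline store about 4.3 trillion values, which shows why snapshot frequency and retention are the levers to tune first.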
Disaster Recovery
| Component | RTO | RPO | Strategy |
|---|---|---|---|
| Online Store (Redis) | 5 minutes | 1 minute | Redis Sentinel/Cluster failover; persistence enabled |
| Flink Streaming Job | 10 minutes | Kafka retention window | Flink checkpoint recovery; Kafka replay |
| Offline Store | 4 hours | 24 hours | Cross-region replication; incremental backup |
11. Cost Considerations
Cost Drivers
| Cost Driver | Typical Range | Notes |
|---|---|---|
| Redis Enterprise (online store) | $1,000–$15,000/month | Scales with data size × replication factor |
| Flink compute | $500–$8,000/month | Scales with event throughput × window complexity |
| Kafka / MSK | $300–$5,000/month | Scales with throughput × retention period |
| Offline store (S3 + query) | $200–$3,000/month | Scales with entity × feature × history |
| Managed feature store (Tecton, Vertex AI FS) | $3,000–$30,000/month | Includes all components; simpler ops |
| Feature Monitor | $200–$2,000/month | Custom or Evidently AI / WhyLabs |
Optimisations
- Cache frequently accessed entity feature vectors at inference service level (TTL = freshness SLA) to reduce Redis load.
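- Use Redis Cluster hash slots to partition entities across nodes; this spreads load across the cluster, although a single high-velocity entity still maps to one slot and relies on the inference-side cache to absorb hot-key reads.
- Encode feature values in Redis with a compact binary format (MessagePack or CBOR) instead of a text format such as JSON; typically 3–5× storage reduction.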
- Tier offline store: recent 90 days at full fidelity; older at reduced granularity.
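The inference-side cache from the first optimisation could look like the sketch below: a per-process map with a time-to-live, wrapped around whatever function reads from the online store. The class name and the injectable clock are assumptions for illustration, and the sketch omits eviction of idle entries, so memory grows with the number of distinct entities seen.

```python
import time
from typing import Callable, Dict, Tuple


class FeatureVectorCache:
    """Per-process TTL cache in front of the online store (illustrative only)."""

    def __init__(
        self,
        loader: Callable[[str], dict],
        ttl_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader   # e.g. a function that reads the entity from Redis
        self._ttl_s = ttl_s
        self._clock = clock     # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, entity_id: str) -> dict:
        now = self._clock()
        hit = self._entries.get(entity_id)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]       # still within TTL: skip the store read
        vector = self._loader(entity_id)
        self._entries[entity_id] = (now, vector)
        return vector
```

One design point to weigh: a cached value can be up to one TTL older than what the store holds, so the worst-case served age is the store's write lag plus the TTL. Setting the TTL below the freshness SLA leaves headroom for that lag.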
Indicative Cost Range
| Scale | Monthly Cost | Basis |
|---|---|---|
| Small (<1M entities, <50 features, <10K events/sec) | $2,000–$8,000 | Redis standalone + Flink on Kubernetes + S3 |
| Medium (10M entities, 200 features, 100K events/sec) | $10,000–$40,000 | Redis Enterprise Cluster + managed Flink + managed Kafka |
| Large (100M+ entities, 500+ features, 1M+ events/sec) | $40,000–$200,000 | Multi-region feature store + managed streaming platform |
12. Trade-Off Analysis
Option Comparison
| Option | Pros | Cons | Recommended When |
|---|---|---|---|
| A: Full real-time feature pipeline (this pattern) | Freshest features; training-serving consistency; shared features | High operational complexity; significant infrastructure cost | Fraud, real-time pricing, personalisation; >3 ML teams |
| B: Managed feature store (Tecton, Vertex AI FS) | Reduced ops; built-in monitoring; enterprise support | High licence cost; vendor lock-in | Large organisation; ML platform team lacks streaming expertise |
| C: Batch feature computation (nightly ETL) | Simple; low cost; well-understood | Features 8–24 hours stale; misses real-time signals | Batch inference only; freshness not a business requirement |
| D: Compute features at inference time (ad hoc) | Zero infrastructure; feature always fresh | High per-request latency (50–500ms DB queries); no reuse; creates technical debt | Simple models; 1–3 features; no team sharing required |
Architectural Tensions
| Tension | Trade-Off | Resolution |
|---|---|---|
| Freshness vs. cost | Fresher features require more streaming compute and Redis memory | Declare freshness SLA per feature; only high-priority features use <30s freshness |
| Streaming vs. batch consistency | Streaming and batch paths must produce identical feature values | Shared Feature Transform Library consumed by both; skew testing on schedule |
| Online store size vs. latency | More features = more Redis memory = higher cost; but fewer features = worse model | Feature importance analysis; only features with significant model impact in online store |
| Operational complexity vs. velocity | Full feature store platform has high setup cost | Start with managed platform (Feast hosted) or Vertex AI FS; build custom when scale justifies |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Training-serving skew (feature logic divergence) | Medium | High: model underperforms silently in production | Skew audit; post-deployment performance monitoring | Roll back Feature Transform Library version; retrain on corrected features |
| Hot key in Redis (all requests for same entity) | Medium | Medium: Redis latency spikes for affected entity | Redis slow log; per-key latency monitoring | Implement local inference-side cache; Redis cluster rebalancing |
| Kafka consumer lag spike | Medium | Medium: features become stale; SLA breach | Kafka consumer lag metrics | Scale Flink parallelism; increase Kafka partition count |
| Point-in-time query produces future-leaked features | Low | Critical: model trained with data leakage | Post-hoc leakage detection in training data quality checks | Rebuild offline store with correct timestamp indexing; retrain |
| Redis cluster split-brain | Very Low | High: inconsistent feature values | Redis cluster health monitoring | Redis Cluster automatic failover; read from primary only during split |
14. Regulatory Considerations
| Regulation | Requirement | Pattern Response |
|---|---|---|
| EU AI Act Article 10 | Training data representativeness and freshness | Freshness SLAs and point-in-time correct training address data currency requirements |
| APRA CPS 230 | Operational resilience of AI systems | Online store HA (99.99% SLO); Flink checkpoint recovery; DR targets defined |
| Privacy Act (Australia) APP 6 | Personal data used for primary purpose | Feature Registry declares data sources; purpose-limited feature access |
| GDPR Article 22 | Right to explanation for automated decisions | Feature access audit log per entity enables feature-level explanation |
15. Reference Implementations
AWS
| Component | AWS Service |
|---|---|
| Event bus | Amazon MSK (Managed Kafka) |
| Stream processing | Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) |
| Online store | Amazon ElastiCache (Redis Cluster) |
| Offline store | S3 + AWS Glue Catalog (Parquet/Delta) |
| Feature registry | Amazon SageMaker Feature Store |
| Point-in-time query | SageMaker Feature Store offline API |
Azure
| Component | Azure Service |
|---|---|
| Event bus | Azure Event Hubs (Kafka compatible) |
| Stream processing | Azure Stream Analytics + HDInsight Flink |
| Online store | Azure Cache for Redis Enterprise |
| Offline store | ADLS Gen2 + Delta Lake |
| Feature registry | Azure ML Feature Store (preview) |
GCP
| Component | GCP Service |
|---|---|
| Event bus | Cloud Pub/Sub |
| Stream processing | Cloud Dataflow (Apache Beam) |
| Online store | Vertex AI Feature Store (online serving) |
| Offline store | Vertex AI Feature Store (offline) + GCS |
| Feature registry | Vertex AI Feature Store |
On-Premises
| Component | Technology |
|---|---|
| Event bus | Apache Kafka on Kubernetes |
| Stream processing | Apache Flink on Kubernetes |
| Online store | Redis Enterprise on Kubernetes |
| Offline store | MinIO + Delta Lake |
| Feature registry | Feast (self-hosted on Kubernetes) |
16. Related Patterns
| Pattern | ID | Relationship | Notes |
|---|---|---|---|
| AI Data Mesh Integration | EAAPL-DAT001 | Complements | Online feature serving is a specialised data product |
| Data Quality for AI | EAAPL-DAT002 | Depends on | Feature null rate and drift checks are quality dimensions |
| Data Lineage for AI | EAAPL-DAT003 | Complements | Feature computation events captured in lineage graph |
| AI Training Data Governance | EAAPL-DAT007 | Depends on | Feature definitions registered in training data governance framework |
| Shadow Model Deployment | EAAPL-MDL002 | Complements | Shadow mode uses same feature pipeline; validates feature parity |
| Model Versioning | EAAPL-MDL001 | Complements | Feature store version linked to model version |
17. Maturity Assessment
Overall Maturity: Proven — Real-time feature engineering is production-proven at scale (Netflix, Uber, LinkedIn). Managed feature stores (Vertex AI, SageMaker) are GA. Operational complexity remains high for self-managed deployments.
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Architectural clarity | 5 | Well-defined dual-store architecture; point-in-time correctness well-understood |
| Tooling maturity | 4 | Redis, Kafka, Flink mature; feature store managed services maturing |
| Regulatory alignment | 4 | Freshness SLAs address EU AI Act data currency; APRA resilience covered |
| Operational complexity | 2 | High; requires streaming + Redis + feature store expertise |
| Cost efficiency | 3 | Significant infrastructure cost; justified for high-value real-time use cases |
| Security | 4 | Access controls, encryption, audit logging well-defined |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2023-08-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-01-15 | EAAPL Working Group | Added point-in-time correctness detail; training-serving skew section |
| 1.2 | 2024-07-01 | EAAPL Working Group | Added managed feature store options; GCP Vertex AI FS reference |
| 1.3 | 2025-03-01 | EAAPL Working Group | Updated cost ranges; added feature drift detection section |