[EAAPL-PLT009] Feature Store Integration
Category: Platform Engineering
Sub-category: ML Infrastructure / Data Engineering
Version: 1.1
Maturity: Proven
Tags: feature-store, feature-serving, online-inference, offline-training, feature-pipeline, point-in-time, feature-monitoring, training-serving-skew
Regulatory Relevance: EU AI Act Article 10 (Data Governance), ISO 42001 Clause 6, NIST AI RMF MAP 3.5
1. Executive Summary
Feature stores solve a deceptively simple problem: when an ML model needs a feature during inference, how does it get the right value, freshly computed, at low latency? And when training a new model, how does it get the exact same feature values that would have been available at prediction time in the past—preventing the data leakage that invalidates backtests and production evaluations?
The Feature Store Integration pattern establishes a shared infrastructure layer that decouples feature computation from feature consumption, enabling features to be computed once and reused across models, teams, and use cases. The online store serves low-latency feature retrieval for real-time inference; the offline store enables point-in-time correct training data generation. Feature pipelines manage computation and freshness; feature monitoring detects drift that would degrade model performance before it reaches users. For enterprises with multiple ML models consuming overlapping signals, the feature store is the difference between duplicated, inconsistent feature computation and a shared, governed, quality-assured data layer.
2. Problem Statement
Business Problem
Multiple ML models within the same organisation compute the same features independently, consuming redundant engineering effort and producing inconsistent values (e.g., "30-day spend" computed differently for fraud, recommendation, and credit risk models). Business decisions made on these models are implicitly inconsistent. When models are retrained, the historical features used for training may not match what would have been available at prediction time, leading to overoptimistic evaluation metrics and production performance gaps.
Technical Problem
Online inference requires feature values available in <10ms at the model API boundary; this requires a pre-computed, low-latency store. Training requires point-in-time correct historical feature values to avoid look-ahead bias. Without a feature store, teams either accept this bias or build expensive, fragile point-in-time joins from raw data. Feature pipelines are duplicated across teams with no shared infrastructure.
Symptoms
- Same feature (e.g., customer 30-day transaction count) computed differently in 3 different model codebases
- Production model performance consistently below offline evaluation metrics (training-serving skew)
- Feature pipeline failures causing model inference to serve stale or missing features
- No visibility into when a feature was last updated or what its current distribution is
- Training datasets built from current feature values rather than the values available at the historical prediction time
Cost of Inaction
- Training-serving skew causing production models to underperform offline evaluation by 5–20%
- 30–50% of ML engineering time spent on feature engineering that duplicates existing work
- Model regressions caused by feature drift going undetected for weeks
- Regulatory audits unable to reproduce model predictions because feature values at decision time were not recorded
3. Context
When to Apply
- Organisation has ≥2 ML models sharing overlapping input features
- Real-time inference latency requirements (<50ms) demand pre-computed feature values
- Training pipelines require point-in-time correct historical data
- Feature reuse across teams is a stated engineering goal
- Model performance monitoring requires feature drift detection
When NOT to Apply
- Single simple model with unique features: feature store overhead not warranted
- LLM-only organisation with no traditional ML models: most LLM use cases don't benefit from traditional feature stores (embeddings have their own infrastructure path)
- Research experiments: use pandas and raw data; migrate to feature store when productionising
Prerequisites
- Operational data sources (databases, event streams) producing features
- Feature computation infrastructure (Spark, Flink, or dbt for offline; streaming processor for online)
- Online store infrastructure (Redis or equivalent <10ms lookup)
- Offline store infrastructure (data warehouse or object storage for point-in-time joins)
- ML model serving infrastructure that can retrieve features at inference time
Industry Applicability
| Industry | Applicability | Key Use Case |
|---|---|---|
| Financial Services | Very High | Credit risk, fraud detection, CLV, trading signals |
| E-commerce / Retail | Very High | Personalisation, recommendation, dynamic pricing |
| Technology / SaaS | High | User behaviour, churn prediction, abuse detection |
| Healthcare | High | Risk stratification, readmission prediction |
| Telecommunications | High | Churn, network anomaly, usage prediction |
| Media / Streaming | High | Content recommendation, engagement prediction |
4. Architecture Overview
The feature store architecture is defined by the separation between its online and offline paths, each serving a different consumer with different latency and freshness characteristics.
The Online Store is a low-latency key-value store containing pre-computed feature values, indexed by entity key (e.g., customer_id, product_id, session_id). Lookup latency must be <10ms at P99 to be compatible with real-time inference SLAs. The online store is populated by the feature materialisation pipeline, which computes features from source data and writes them on a schedule (for batch features) or in near-real-time (for streaming features). Redis is the canonical technology for the online store; its GET operation with a compound key (entity_type:entity_id:feature_set) delivers sub-millisecond lookup at scale.
The online store does not store feature history—only the current value for each entity. This makes it fast and cheap. When a model is called for inference, the feature serving layer assembles the feature vector by looking up all required features for the request's entity IDs from the online store, combining them with request-time context (features that cannot be pre-computed because they depend on the current request), and passing the assembled feature vector to the model.
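As a minimal sketch of the assembly step described above, the following uses a plain dictionary as a stand-in for the online store so that it runs without Redis. The key layout follows the entity_type:entity_id:feature_set convention; the function names, feature-set names, and sample values are illustrative assumptions.

```python
from typing import Any, Mapping

# Stand-in for the online store: one entry per entity and feature set,
# holding only the current values (no history).
ONLINE_STORE: dict[str, dict[str, Any]] = {
    "customer:12345:spend_features": {"customer_30d_spend": 412.50},
    "product:P789:engagement": {"product_view_count_7d": 87},
}


def online_key(entity_type: str, entity_id: str, feature_set: str) -> str:
    """Build the compound key entity_type:entity_id:feature_set."""
    return f"{entity_type}:{entity_id}:{feature_set}"


def assemble_feature_vector(
    store: Mapping[str, Mapping[str, Any]],
    lookups: list[tuple[str, str, str]],
    request_context: Mapping[str, Any],
) -> dict[str, Any]:
    """Merge pre-computed features with request-time context."""
    vector: dict[str, Any] = {}
    for entity_type, entity_id, feature_set in lookups:
        key = online_key(entity_type, entity_id, feature_set)
        # A missing key contributes no features here; defaults and alerts
        # for misses are a separate concern.
        vector.update(store.get(key, {}))
    # Request-time context cannot be pre-computed, so it is added last.
    vector.update(request_context)
    return vector


vector = assemble_feature_vector(
    ONLINE_STORE,
    [("customer", "12345", "spend_features"), ("product", "P789", "engagement")],
    {"request_channel": "mobile"},
)
```

With a real Redis deployment the per-key lookups would usually be batched into a single round trip rather than issued one at a time.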
The Offline Store serves training data generation and batch inference. Unlike the online store, the offline store retains historical feature values, so the value that was current at any given point in time can be recovered. This enables point-in-time correct training data generation: given a set of training examples with timestamps, retrieve the feature values that were available just before each timestamp. This prevents look-ahead bias (training on feature values that were not yet available at prediction time), a common source of training-serving skew. The offline store is typically implemented as a time-partitioned table in a data warehouse (BigQuery, Redshift, Snowflake) or as Parquet files in object storage, with a time dimension on every feature record.
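The point-in-time lookup can be illustrated with a small standard-library sketch. It assumes the history for one entity is held as a time-sorted list of (timestamp, value) pairs; the data and the helper name `value_as_of` are hypothetical, and a production system would run the equivalent join in the warehouse.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical feature history for one entity: (valid_from, value), sorted by time.
HISTORY = [
    (datetime(2024, 1, 1), 100.0),
    (datetime(2024, 1, 8), 140.0),
    (datetime(2024, 1, 15), 90.0),
]


def value_as_of(history, event_time):
    """Return the latest value recorded strictly before event_time, or None."""
    timestamps = [ts for ts, _ in history]
    # bisect_left counts the records with timestamp < event_time, so a record
    # written at exactly event_time is excluded (it was not yet available).
    idx = bisect_left(timestamps, event_time)
    return history[idx - 1][1] if idx > 0 else None


# Each training example is paired with the value that was available at its timestamp.
training_events = [datetime(2024, 1, 10), datetime(2024, 1, 15), datetime(2023, 12, 31)]
training_rows = [(t, value_as_of(HISTORY, t)) for t in training_events]
```

Using the strict "before" comparison is a deliberate choice: a value written at the same instant as the prediction is treated as unavailable, which errs on the side of avoiding leakage.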
Feature Pipelines compute and refresh feature values from source data. Batch pipelines run on a schedule (hourly, daily) using Spark or dbt and write to both the offline store (appending the new time-partitioned record) and the online store (overwriting the current value). Streaming pipelines consume event streams (Kafka, Kinesis) and compute features in near-real-time using Flink or Spark Structured Streaming, writing to the online store with low latency. The choice between batch and streaming for a feature depends on its staleness tolerance: fraud detection features may need values that are only seconds old, whereas monthly customer metrics can be refreshed daily.
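The dual write performed by a batch pipeline (append to the offline store, overwrite in the online store) might look like the sketch below. Both stores are in-memory stand-ins, and the `materialise` function and record layout are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Any

offline_store: list[dict[str, Any]] = []      # append-only history
online_store: dict[str, dict[str, Any]] = {}  # current value per entity


def materialise(entity_key: str, features: dict[str, Any], computed_at: datetime) -> None:
    """Write one batch result to both stores."""
    # Offline: append a timestamped record so earlier values are preserved.
    offline_store.append({"entity_key": entity_key, "event_time": computed_at, **features})
    # Online: overwrite, keeping only the latest values for low-latency reads.
    online_store[entity_key] = {**features, "updated_at": computed_at}


materialise("customer:12345", {"customer_30d_spend": 400.0},
            datetime(2024, 1, 1, tzinfo=timezone.utc))
materialise("customer:12345", {"customer_30d_spend": 425.0},
            datetime(2024, 1, 2, tzinfo=timezone.utc))
```

After the two runs the offline store holds both records while the online store holds only the second, which is the asymmetry the two paths rely on.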
The Feature Registry is the metadata layer for the feature store. It records: feature name, description, data type, computation logic (the transformation that produces the feature), data source, update frequency, entity type, business owner, and deprecation status. The feature registry is the discovery mechanism that enables engineers to find existing features before building new ones. It also serves as the configuration source for the feature materialisation pipeline and the feature serving layer.
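One way to picture a registry entry is as a small typed record carrying the metadata fields listed above. The sketch below is not the schema of Feast, Tecton, or any other product; the class names, field names, and sample feature are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    description: str
    dtype: str
    computation: str        # transformation that produces the feature
    source: str
    update_frequency: str
    entity_type: str
    owner: str
    deprecated: bool = False


class FeatureRegistry:
    """In-memory registry supporting registration and keyword discovery."""

    def __init__(self) -> None:
        self._features: dict[str, FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        # Rejecting duplicate names nudges engineers towards reuse.
        if feature.name in self._features:
            raise ValueError(f"feature already registered: {feature.name}")
        self._features[feature.name] = feature

    def search(self, term: str) -> list[FeatureDefinition]:
        """Return non-deprecated features whose name or description matches."""
        term = term.lower()
        return [
            f for f in self._features.values()
            if not f.deprecated
            and (term in f.name.lower() or term in f.description.lower())
        ]


registry = FeatureRegistry()
spend = FeatureDefinition(
    name="customer_30d_spend",
    description="Total customer spend over the trailing 30 days",
    dtype="float",
    computation="SUM(amount) over trailing 30 days",
    source="transactions",
    update_frequency="daily",
    entity_type="customer",
    owner="payments-data",
)
registry.register(spend)
```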
Feature Monitoring is the operational quality layer. For each feature, monitoring tracks: distribution statistics (mean, std, percentile distribution) on a rolling basis, freshness (time since last update vs. configured threshold), null rate (unexpected nulls indicate pipeline failures), and drift (statistical distance between the current distribution and the training-time distribution, using measures like PSI or Jensen-Shannon divergence). Alerts on feature drift enable proactive model retraining before production performance degrades significantly.
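For the drift measure, PSI over binned proportions is the sum of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the baseline and current proportions in bin i. The sketch below is a plain-Python version under stated assumptions: equal-width bins derived from the baseline range, and a small floor `eps` to avoid division by zero in empty bins. Both choices are illustrative, not prescribed by the pattern.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the baseline range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps so the log ratio is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]   # concentrated in the upper half
drift_score = psi(baseline, shifted)
```

A score computed this way can be compared against an alert threshold such as PSI > 0.2; the result is sensitive to the binning scheme, so thresholds should be set with the chosen bins in mind.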
5. Architecture Diagram
flowchart TD
subgraph Sources["Data Sources"]
A[Operational Databases]
B[Event Streams]
end
subgraph Pipelines["Feature Pipelines"]
C[Batch Pipeline]
D[Streaming Pipeline]
end
subgraph Store["Feature Store"]
E[(Online Store Redis)]
F[(Offline Store Point-in-Time)]
G[Feature Registry]
end
subgraph Consumers["Consumers"]
H[Real-Time Inference]
I[Model Training]
end
A --> C
B --> D
C --> E
C --> F
D --> E
G --> C
G --> D
E --> H
F --> I
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#dbeafe,stroke:#3b82f6
style C fill:#f0fdf4,stroke:#22c55e
style D fill:#f0fdf4,stroke:#22c55e
style E fill:#fef9c3,stroke:#eab308
style F fill:#fef9c3,stroke:#eab308
style G fill:#fef9c3,stroke:#eab308
style H fill:#d1fae5,stroke:#10b981
style I fill:#d1fae5,stroke:#10b981
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Online Store | Infrastructure | Sub-10ms feature lookup by entity key | Redis, DynamoDB, Bigtable, Cassandra | Critical |
| Offline Store | Infrastructure | Point-in-time correct historical feature retrieval | BigQuery, Redshift, Snowflake, Hive, Parquet on S3 | Critical |
| Feature Registry | Service | Metadata catalogue for all features | Feast (open source), Tecton, Hopsworks, custom DB | High |
| Batch Feature Pipeline | Service | Compute and materialise batch features | Apache Spark, dbt + Airflow, dbt Cloud | Critical |
| Streaming Feature Pipeline | Service | Compute and materialise near-real-time features | Apache Flink, Spark Structured Streaming, Kafka Streams | High |
| Feature Materialisation Orchestrator | Service | Schedule and coordinate pipeline execution | Apache Airflow, Prefect, Dagster | High |
| Feature Server | Service | Assemble multi-feature vectors for inference requests | Feast Feature Server, Tecton Online Serving, custom FastAPI | Critical |
| Point-in-Time Join Engine | Service | Generate point-in-time correct training datasets | Feast point-in-time join, custom SQL | High |
| Feature Monitor | Service | Track distribution, drift, freshness, null rate | Evidently AI, WhyLogs, Great Expectations, custom | High |
| Feature Discovery UI | Service | Search and explore feature registry | Feast UI, Tecton portal, DataHub, custom | Medium |
7. Data Flow
Primary Flow — Real-Time Inference with Feature Store
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Model API | Receive inference request with entity IDs (customer_id: 12345, product_id: P789) | Entity IDs extracted |
| 2 | Feature Server | Look up required features from feature registry for this model version | Required feature list: [customer_30d_spend, customer_risk_score, product_view_count_7d] |
| 3 | Feature Server | Batch lookup: MGET customer:12345:spend_features, customer:12345:risk_features, product:P789:engagement | Feature values retrieved from Redis in <5ms |
| 4 | Feature Server | Combine pre-computed features with request-time context (e.g., current timestamp, request channel) | Complete feature vector assembled |
| 5 | Model Inference | Pass feature vector to model; receive prediction | Prediction |
| 6 | Feature Monitor | Log feature values and prediction for drift monitoring | Monitoring record |
Error Flow
| Error | Detection | Response |
|---|---|---|
| Feature missing from online store (entity not materialised) | Redis miss | Return feature default value or null; log missing feature; alert if rate >1% |
| Stale feature (pipeline hasn't run) | Freshness monitor | Log staleness; serve stale value with staleness metadata; alert pipeline operator |
| Online store unavailable | Feature server health check | Serve null features or use fallback model without feature enrichment; alert |
| Feature schema mismatch (pipeline produced wrong type) | Feature monitor type check | Reject feature batch write; alert pipeline owner; serve last-known-good value |
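The first two error responses (default on a miss, serve stale with metadata) can be sketched as a read helper that returns a status alongside the value. The helper name, the `updated_at` field, and the default values are assumptions; logging and alerting are left to the caller.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Mapping, Optional, Tuple

# Hypothetical per-feature defaults used when an entity is not materialised.
DEFAULTS: dict[str, Any] = {"customer_30d_spend": 0.0}


def read_feature(
    store: Mapping[str, Mapping[str, Any]],
    key: str,
    name: str,
    max_age: timedelta,
    now: datetime,
) -> Tuple[Optional[Any], str]:
    """Return (value, status) where status is 'ok', 'stale', or 'missing'."""
    record = store.get(key)
    if record is None or name not in record:
        # Miss: fall back to the default (or None if no default is configured).
        return DEFAULTS.get(name), "missing"
    age = now - record["updated_at"]
    # Stale values are still served, but flagged so the caller can log or alert.
    return record[name], ("stale" if age > max_age else "ok")


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
store = {
    "customer:12345:spend_features": {
        "customer_30d_spend": 412.5,
        "updated_at": now - timedelta(hours=30),
    }
}
```

Returning the status with the value keeps the serving path non-blocking while still giving the monitor enough information to compute miss and staleness rates.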
8. Security Considerations
- Feature data may contain derived personal information (spending patterns, risk scores, health indicators); access to the online store must be restricted to authorised model serving infrastructure
- The offline store contains historical PII-derived features; access requires the same data classification controls as the source data
- Entity keys in the online store must not leak information about underlying entities; compound keys should use opaque IDs (UUIDs), not readable identifiers
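One possible way to obtain opaque entity keys is sketched below. Note that it uses a keyed hash (HMAC-SHA256) rather than the random UUIDs mentioned above: a keyed hash is deterministic, so no mapping table is needed, but it depends on keeping the secret outside the online store. The function name and the sample secret are hypothetical.

```python
import hashlib
import hmac


def opaque_entity_id(secret: bytes, entity_type: str, natural_id: str) -> str:
    """Derive a stable identifier that does not reveal the natural ID."""
    message = f"{entity_type}:{natural_id}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


# Assumption: in practice the secret would come from a secrets manager.
SECRET = b"example-secret-do-not-use"
key = f"customer:{opaque_entity_id(SECRET, 'customer', 'jane.doe@example.com')}:spend_features"
```

The same secret and inputs always yield the same key, so pipelines and the feature server can derive it independently.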
OWASP LLM Controls
| OWASP LLM Risk | Feature Store Control |
|---|---|
| LLM03 Training Data Poisoning | Feature registry enforces approved computation logic; point-in-time joins prevent future-data contamination |
| LLM09 Overreliance | Feature monitoring detects when input data quality degrades, which would degrade model predictions |
9. Governance Considerations
Data Governance
- Every feature must have a registered owner responsible for pipeline health and data quality
- Features derived from personal information must document the legal basis and retention policy in the feature registry
- Deprecated features must be retained in the registry with deprecation date and migration guidance; never silently deleted
Model Risk
- Point-in-time join methodology must be validated and documented as part of the model development process; incorrect point-in-time logic is a model risk event
- Feature drift alerts must be routed to the model owner, not just the platform team; the model owner is accountable for model performance
Governance Artefacts
| Artefact | Owner | Cadence | Location |
|---|---|---|---|
| Feature registry | Feature Owner + Data Team | Continuous | Feature registry service |
| Feature lineage documentation | Data Engineering | Per feature | Feature registry |
| Feature monitoring thresholds | Feature Owner | Quarterly review | Monitoring configuration |
| Privacy impact for PII-derived features | Privacy Team | Per feature with PII | Privacy register |
| Feature drift incident log | Model Owner | Per incident | Incident management |
10. Operational Considerations
Monitoring
| Signal | Source | Alert Threshold | Owner |
|---|---|---|---|
| Online store cache miss rate | Feature server metrics | >5% miss (entities not materialised) | Feature Owner |
| Feature pipeline SLA miss | Pipeline orchestrator | Any pipeline overdue by >2× schedule interval | Feature Owner + Data Eng |
| Feature distribution drift (PSI) | Feature monitor | PSI > 0.2 (significant drift) | Model Owner |
| Online store P99 latency | Feature server metrics | >20ms P99 | Platform On-Call |
SLOs
| SLO | Target | Window |
|---|---|---|
| Online feature retrieval P99 latency | <10ms | Rolling 7 days |
| Feature freshness (batch features) | <2× schedule interval | Per feature |
| Feature pipeline success rate | >99.5% | Rolling 30 days |
| Online store availability | 99.9% | Rolling 30 days |
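Two of these targets reduce to simple checks, sketched below under the assumption that the orchestrator exposes each feature's last update time and run counts. The function names are hypothetical, and the treatment of "no runs yet" is an assumption flagged in the code.

```python
from datetime import datetime, timedelta, timezone


def freshness_ok(last_updated: datetime, schedule_interval: timedelta, now: datetime) -> bool:
    """True when the feature is younger than 2x its schedule interval."""
    return (now - last_updated) < 2 * schedule_interval


def success_rate_ok(succeeded: int, total: int, target: float = 0.995) -> bool:
    """True when the pipeline success rate over the window exceeds the target."""
    if total == 0:
        return True  # assumption: no runs in the window counts as no breach
    return succeeded / total > target


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
hourly = timedelta(hours=1)
```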
Disaster Recovery
| Component | RPO | RTO | Strategy |
|---|---|---|---|
| Online store (Redis) | 1 hour | 5 min | Redis Sentinel + persistence; rebuild from offline store |
| Offline store | <1 hour | 30 min | Data warehouse replication |
| Feature pipelines | N/A (stateless) | 15 min | Redeploy from IaC; re-run pipeline to catch up |
11. Cost Considerations
Cost Drivers
| Driver | Description | Relative Weight |
|---|---|---|
| Online store (Redis) memory | Proportional to entity count × feature vector size | Medium-High |
| Batch computation (Spark) | Proportional to data volume and feature count | Medium |
| Offline store (data warehouse) | Storage + query compute for training data generation | Medium |
| Streaming computation (Flink) | Always-on for streaming features | Medium |
Indicative Cost Range
| Scale | Monthly Feature Store Infra Cost |
|---|---|
| Small (1M entities, 10 features) | $500–$2,000 |
| Medium (100M entities, 50 features) | $5,000–$20,000 |
| Large (1B+ entities, 200+ features) | $30,000–$100,000+ |
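The "entity count × feature vector size" driver for online store memory can be turned into a back-of-envelope estimate. The per-feature size and per-key overhead below are placeholder assumptions, not measured Redis figures; actual memory use depends on encoding, key length, and data structures, and the result is a byte count, not a cost.

```python
def online_store_bytes(
    entity_count: int,
    features_per_entity: int,
    bytes_per_feature: int = 8,   # assumption: one 64-bit numeric value per feature
    per_key_overhead: int = 100,  # assumption: key string plus store bookkeeping
) -> int:
    """Rough memory estimate: entities x (feature payload + per-key overhead)."""
    return entity_count * (features_per_entity * bytes_per_feature + per_key_overhead)


small = online_store_bytes(1_000_000, 10)
medium = online_store_bytes(100_000_000, 50)
```

Under these assumptions the small tier needs about 0.18 GB and the medium tier about 50 GB before replication, which shows how quickly entity count dominates the estimate.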
12. Trade-Off Analysis
Feature Store Architecture Options
| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| Open Source (Feast) | Self-managed Feast with Redis + data warehouse | Full control; no vendor lock-in; community support | High operational overhead; less out-of-box tooling | Strong engineering team; cloud-agnostic |
| Managed (Tecton, Hopsworks) | SaaS feature store with managed pipelines | Low ops overhead; strong tooling | Vendor lock-in; cost at scale | Organisations prioritising velocity over cost optimisation |
| Cloud-Native (Vertex AI Feature Store, AWS SageMaker Feature Store) | Cloud provider native | Deep integration with cloud ML stack | Tied to cloud provider; variable feature richness | Orgs committed to single cloud |
Online Store Technology Options
| Option | Latency | Cost | Scalability | Best For |
|---|---|---|---|---|
| Redis | <1ms | Medium | High (cluster) | Most deployments; canonical choice |
| DynamoDB | 1–5ms | Variable (high at scale) | Very High | AWS-native; serverless operations |
| Bigtable | 1–5ms | High | Extremely High | Google Cloud; very large entity counts |
Architectural Tensions
| Tension | Option A | Option B | Resolution |
|---|---|---|---|
| Feature freshness vs. computation cost | Streaming (fresh) | Batch (cheap) | Feature-level decision based on staleness tolerance; most features are batch |
| Centralised feature store vs. team-owned features | Platform team owns all features | Teams own their features in shared store | Teams own features in shared store with platform managing infrastructure |
| Online store size vs. cost | Store all features for all entities | Store only high-usage features | Tiered: hot features in Redis; warm features in DynamoDB; cold in offline only |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Online store memory exhaustion (Redis OOM) | Medium | High: feature serving fails | Redis memory metrics | LRU eviction; increase Redis memory; audit feature set for unused features |
| Batch pipeline failure (features stale) | Medium | High: model consuming stale features | Pipeline SLA monitor; freshness alert | Re-run pipeline; serve stale with staleness flag; alert model owner |
| Training-serving skew (wrong PIT logic) | Low | Critical: model production performance << offline eval | Production vs offline metric gap | Audit PIT join logic; retrain with corrected data; model risk event |
| Feature leakage (future data in training) | Low | Critical: optimistic backtests; poor production performance | PIT join timestamp validation | Audit all PIT joins; retrain affected models |
| Feature drift undetected | Medium | High: gradual model degradation | Production metric monitoring | Improve drift monitoring coverage; lower alert thresholds |
14. Regulatory Considerations
EU AI Act Article 10 (Data Governance)
- Feature computation logic must be documented (in feature registry) as part of the training data governance requirements for high-risk AI systems
- Point-in-time join methodology must be documented to demonstrate absence of data leakage in training data
Privacy Act / GDPR
- PII-derived features (spending patterns, health indicators) must have a documented legal basis in the feature registry
- Data subject deletion requests must propagate to the online store (delete entity's feature values) and be documented in the offline store (mark as deleted rather than hard delete, to preserve training data integrity)
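The deletion propagation described above (remove from the online store, mark as deleted in the offline store) could be sketched as follows. Both stores are in-memory stand-ins, and the `deleted_at` marker and function name are assumptions; whether a soft delete satisfies a given erasure obligation is a legal question this sketch does not settle.

```python
from datetime import datetime, timezone
from typing import Any


def handle_deletion_request(
    entity_key: str,
    online_store: dict[str, dict[str, Any]],
    offline_store: list[dict[str, Any]],
    deleted_at: datetime,
) -> int:
    """Remove online values and mark offline history; return rows marked."""
    # Online: hard delete so the entity's features can no longer be served.
    online_store.pop(entity_key, None)
    marked = 0
    for row in offline_store:
        # Offline: mark rather than remove, and skip rows already marked.
        if row["entity_key"] == entity_key and row.get("deleted_at") is None:
            row["deleted_at"] = deleted_at
            marked += 1
    return marked


online = {"customer:12345": {"customer_30d_spend": 425.0}}
offline = [
    {"entity_key": "customer:12345", "customer_30d_spend": 400.0},
    {"entity_key": "customer:12345", "customer_30d_spend": 425.0},
    {"entity_key": "customer:67890", "customer_30d_spend": 10.0},
]
when = datetime(2024, 3, 1, tzinfo=timezone.utc)
marked_rows = handle_deletion_request("customer:12345", online, offline, when)
```

Training data generation would then need to filter on the marker so that marked rows are excluded from new datasets.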
NIST AI RMF MAP 3.5
- Feature monitoring and drift detection implement MAP 3.5's requirement for ongoing monitoring of AI system inputs
15. Reference Implementations
AWS
| Component | AWS Service |
|---|---|
| Online store | Amazon ElastiCache Redis or DynamoDB |
| Offline store | Amazon Redshift or S3 Parquet |
| Feature registry | Amazon SageMaker Feature Store (metadata) |
| Batch pipeline | AWS Glue / EMR (Spark) |
| Streaming pipeline | Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) |
| Orchestration | Amazon MWAA (Managed Airflow) |
GCP
| Component | GCP Service |
|---|---|
| Online store | Vertex AI Feature Store (Online) or Memorystore |
| Offline store | Vertex AI Feature Store (Offline) or BigQuery |
| Batch pipeline | Dataflow or BigQuery ML |
| Streaming pipeline | Dataflow (Apache Beam) |
On-Premises / Open Source
| Component | Technology |
|---|---|
| Feature store framework | Feast (open source) |
| Online store | Redis Enterprise |
| Offline store | Apache Hive or Delta Lake on MinIO |
| Batch pipeline | Apache Spark + Apache Airflow |
| Streaming pipeline | Apache Flink |
16. Related Patterns
| Pattern ID | Name | Relationship |
|---|---|---|
| EAAPL-PLT001 | Enterprise AI Platform | Parent: feature store is a platform ML infrastructure component |
| EAAPL-PLT008 | AI Experiment Tracking | Complementary: training datasets generated via feature store feed experiment tracking |
| EAAPL-INT004 | Real-Time AI Stream Processing | Integration: streaming feature pipelines share infrastructure with real-time inference |
| EAAPL-INT005 | Batch AI Processing | Integration: batch feature pipelines share scheduling infrastructure |
17. Maturity Assessment
Overall Maturity: Proven
Feature stores are production-proven at major technology and financial services companies. Open-source tooling (Feast) and managed services (Tecton, SageMaker Feature Store) are both mature. Point-in-time joins are well-understood. Feature monitoring is less standardised.
Scoring Matrix
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Pattern Completeness | 5 | All sections documented |
| Implementation Evidence | 5 | Deployed at Netflix, Uber, LinkedIn, major banks at scale |
| Tooling Maturity | 4 | Feast/Tecton/SageMaker mature; feature monitoring less so |
| Regulatory Alignment | 4 | EU AI Act Article 10 mapping; privacy patterns documented |
| Operational Complexity | High | Requires data engineering expertise; streaming pipelines operationally demanding |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-09-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2025-06-12 | EAAPL Working Group | Feature monitoring section expanded; Privacy Act data deletion patterns added; Vertex AI Feature Store reference updated |