ML Infrastructure

Fraud Detection ML Platform

A real-time ML pipeline for financial fraud detection: Kafka-driven transaction ingestion, PySpark feature engineering, MLflow model lifecycle management, FastAPI scoring at sub-ms P99 latency, and a Grafana observability stack. Built to handle 100+ transactions per second with a full feedback loop from detection to model retraining.

Stack: Apache Kafka · PySpark · MLflow · Apache Airflow · FastAPI · Prometheus · Grafana · Docker · PostgreSQL · scikit-learn

  • 100+ transactions/sec sustained
  • <1ms P99 scoring latency
  • 94%+ detection accuracy
  • <2% false positive rate

The Problem

Most fraud detection systems are batch-oriented: transactions are scored hours after they occur, by which time the fraudulent activity has already cascaded. Real-time scoring requires a completely different architecture - one where latency is measured in milliseconds, throughput in hundreds of transactions per second, and the model feedback loop runs continuously.

The secondary problem is model drift. Fraud patterns shift fast. A model trained three months ago has already been partially circumvented by new attack vectors. The system needed to support rapid retraining, A/B testing between model versions, and rollback without downtime.

System Architecture

The pipeline has five distinct layers, each with a clear boundary and independent failure mode.

  • Ingestion: Transaction Producer → Kafka topic raw_transactions
  • Feature engineering: PySpark consumer → feature store (Kafka topic feature_vectors)
  • Scoring: FastAPI scoring service with the MLflow model held in memory
  • Storage + observability: PostgreSQL (predictions), Prometheus (metrics), Grafana (dashboards)
  • Training loop: Airflow retraining DAG → MLflow Model Registry

Ingestion layer

A Kafka producer simulates a realistic transaction stream: varying amounts, merchant categories, geolocation signals, and behavioral anomalies. Transactions flow into a raw_transactions topic at configurable TPS (tested to 100+ sustained). Partitioned by account ID for ordering guarantees.
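A minimal sketch of such a producer, using kafka-python. The field names, topic throughput loop, and broker address are illustrative, not taken from the project; the key point is keying messages by account ID so Kafka's partitioner preserves per-account ordering.

```python
import json
import random
import time
import uuid

# Hypothetical categories; the real producer's schema isn't shown in this write-up.
MERCHANT_CATEGORIES = ["grocery", "electronics", "travel", "gambling", "atm"]

def generate_transaction(account_ids):
    """Simulate one transaction with amount, merchant, and geo signals."""
    return {
        "transaction_id": str(uuid.uuid4()),
        "account_id": random.choice(account_ids),
        "amount": round(random.lognormvariate(3.5, 1.2), 2),
        "merchant_category": random.choice(MERCHANT_CATEGORIES),
        "lat": round(random.uniform(-90, 90), 4),
        "lon": round(random.uniform(-180, 180), 4),
        "timestamp": time.time(),
    }

if __name__ == "__main__":
    # Requires a running broker. Keying by account_id maps every
    # transaction for an account to the same partition.
    from kafka import KafkaProducer  # kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        key_serializer=lambda k: k.encode(),
        value_serializer=lambda v: json.dumps(v).encode(),
    )
    accounts = [f"acct-{i}" for i in range(1000)]
    for _ in range(1000):  # ~100 TPS with the sleep below
        txn = generate_transaction(accounts)
        producer.send("raw_transactions", key=txn["account_id"], value=txn)
        time.sleep(0.01)
    producer.flush()
```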

Feature engineering

PySpark Structured Streaming consumes the raw topic and computes derived features: rolling velocity (transactions in last 5 min / 1 hr), merchant category risk score, geographic anomaly flag, amount deviation from 30-day mean. Output goes to a feature_vectors topic.
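The windowed-feature logic can be sketched in plain Python (the production version runs as a PySpark Structured Streaming job; class and field names here are illustrative, and a one-hour rolling window stands in for the 30-day mean):

```python
from collections import defaultdict, deque

class AccountFeatures:
    """Per-account rolling features: a plain-Python stand-in for the
    streaming job, keeping one hour of history per account."""

    def __init__(self):
        self.history = defaultdict(deque)  # account_id -> deque of (ts, amount)

    def update(self, account_id, ts, amount):
        h = self.history[account_id]
        h.append((ts, amount))
        # Expire events older than one hour.
        while h and h[0][0] < ts - 3600:
            h.popleft()
        amounts = [a for _, a in h]
        mean = sum(amounts) / len(amounts)
        return {
            "velocity_5m": sum(1 for t, _ in h if t >= ts - 300),
            "velocity_1h": len(h),
            "amount_deviation": amount - mean,
        }
```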

Scoring service

FastAPI loads the production model from MLflow at startup and keeps it in memory. There is no network hop per prediction: sub-ms P99 latency comes from eliminating the model-loading round trip. Model swaps happen via a reload endpoint that pulls the new production model without dropping existing connections.
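The hot-swap pattern behind that reload endpoint can be sketched like this; `load_fn` stands in for something like `mlflow.pyfunc.load_model("models:/fraud/Production")`, and the class name is mine, not the project's:

```python
import threading

class ModelHolder:
    """Hold the production model in memory; swap it atomically on reload."""

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._lock = threading.Lock()
        self._model = load_fn()

    def predict(self, features):
        # In-flight requests keep a reference to whichever model object
        # was current when they started, even mid-reload.
        return self._model.predict(features)

    def reload(self):
        new_model = self._load_fn()  # load fully before swapping in
        with self._lock:
            self._model = new_model
```

In the FastAPI service this object would back both the scoring route and a POST reload route, so a registry promotion takes effect without restarting the process.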

Model Lifecycle with MLflow

Every training run is tracked in MLflow: hyperparameters, dataset version, evaluation metrics, and the serialized model artifact. Models graduate through three stages in the MLflow Registry: Staging, Production, Archived. The Airflow retraining DAG runs nightly, compares the new model against the current Production baseline on a holdout set, and only promotes if recall improves and false positive rate stays below 2%.
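The promotion gate itself is a small predicate. A sketch of the comparison the DAG runs (the metric-dict shape is illustrative; the real DAG would read these values from MLflow):

```python
def should_promote(candidate, production, max_fpr=0.02):
    """Promote only if recall improves on the holdout set AND the
    false positive rate stays below the 2% ceiling."""
    return (
        candidate["recall"] > production["recall"]
        and candidate["false_positive_rate"] < max_fpr
    )
```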

This gave a meaningful CI/CD pattern for models: new versions are tested before they serve traffic, rollback is a single registry stage change, and experiment history is fully reproducible from any commit SHA.

MLflow Experiment Tracker - showing training run comparison across 12 model versions, F1 score, precision-recall curves, and feature importance rankings.

Observability Stack

Prometheus scrapes three endpoints: the FastAPI scoring service (prediction count, latency histogram, fraud rate), the Kafka consumer (consumer lag, processing time), and the Airflow scheduler (DAG run durations, task failures). Grafana pulls from Prometheus and renders four dashboards:

  • Real-time fraud rate - rolling 5-minute fraud detection rate with alert threshold
  • Scoring latency - P50, P95, P99 breakdown by transaction type
  • Pipeline health - Kafka consumer lag, PySpark processing backlog
  • Model performance - precision and recall tracked over time against ground truth labels
Grafana dashboard showing real-time fraud rate (top), scoring latency P99 (middle), and Kafka consumer lag across partitions (bottom). Alert firing at 3.1% fraud rate spike.
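The three scrape targets above could be wired up roughly as follows. Job names, hostnames, and ports are assumptions, and Airflow would typically expose metrics through a StatsD exporter rather than a native /metrics endpoint:

```yaml
# prometheus.yml -- scrape targets (values illustrative)
scrape_configs:
  - job_name: scoring-service        # FastAPI: prediction count, latency, fraud rate
    static_configs:
      - targets: ["scoring:8000"]
  - job_name: kafka-consumer         # consumer lag, processing time
    static_configs:
      - targets: ["consumer:9102"]
  - job_name: airflow                # DAG run durations, task failures
    static_configs:
      - targets: ["statsd-exporter:9102"]
```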

Honest Assessment: What Worked and What Didn't

What worked

In-memory model serving. Loading the MLflow model at FastAPI startup and keeping it in RAM was the single biggest latency win. Eliminated 15-40ms per prediction compared to loading per-request.

What failed

Schema evolution. Changing the feature schema mid-stream broke the PySpark consumer. Needed a schema registry (Confluent or AWS Glue) to version feature contracts properly.

What worked

MLflow model staging. The Staging → Production → Archived lifecycle prevented accidental promotion of underperforming models and gave clean rollback at any point.

What failed

Spark executor memory. Under sustained 100+ TPS, PySpark GC pressure caused processing spikes. Fixed by tuning spark.executor.memory, spark.sql.shuffle.partitions, and switching from default GC to G1GC.
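The write-up names the knobs but not the values used; a hedged example of what that tuning might look like at submit time (numbers and job filename are illustrative):

```shell
spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.sql.shuffle.partitions=32 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  feature_engineering_job.py
```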

What worked

Airflow LocalExecutor. Swapped the CeleryExecutor for the LocalExecutor in local development. Eliminated the Redis + Celery dependencies and cut environment setup time from 20 minutes to under 5.

What failed

Kafka consumer group rebalancing. Under rapid producer scale-up, consumer group rebalancing caused processing gaps up to 8 seconds. Mitigated with session.timeout.ms and heartbeat.interval.ms tuning, but not fully resolved.

What I Would Build Next

  • Schema Registry (Confluent) to version feature contracts and prevent silent consumer failures
  • Online feature store (Feast or Hopsworks) to serve precomputed features at scoring time with sub-millisecond lookup
  • Shadow mode: run new model candidates on live traffic, log predictions without serving them, compare offline against ground truth before promoting
  • Drift detection: track input feature distribution against training distribution using KS tests, alert when drift exceeds threshold
  • Explainability: SHAP values per prediction stored in PostgreSQL for compliance audit trail
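The drift-detection item above reduces to a two-sample Kolmogorov–Smirnov check. A dependency-free sketch (function names and the 0.1 threshold are illustrative; in practice scipy.stats.ks_2samp would do this):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two
    empirical CDFs, evaluated at every observed value."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def feature_drifted(training_values, live_values, threshold=0.1):
    """Alert when a feature's live distribution has moved too far
    from the training distribution."""
    return ks_statistic(training_values, live_values) > threshold
```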