ML Infrastructure

FinTune

A production-grade financial sentiment analysis system: QLoRA fine-tuning of Mistral-7B with 4-bit NF4 quantization, PII guardrails via Presidio, circuit breaker self-recovery, real-time monitoring with KL-divergence drift detection, and FastAPI serving with batch inference. Built to classify financial text (earnings calls, analyst reports, SEC filings) into positive, neutral, and negative sentiment with production reliability.

29 automated tests 4-bit quantized inference circuit breaker recovery PII guardrails drift detection Docker + CI/CD

GitHub repo

PyTorch Hugging Face QLoRA / PEFT bitsandbytes FastAPI scikit-learn Presidio Docker GitHub Actions

75% memory reduction via 4-bit

29 automated tests

3 sklearn baselines beaten

5 guardrail checks per request

The Problem

Financial sentiment analysis looks simple on paper - positive, neutral, negative - but production deployment exposes three hard problems that demo models ignore. First, 7B-parameter models don't fit on consumer or mid-tier GPUs without quantization, and quantization done carelessly destroys accuracy. Second, financial text contains PII (account numbers, SSNs, emails) that must never reach the model or logs. Third, models drift silently in production - the distribution of predictions shifts away from training data, and without drift detection you don't know until a downstream system breaks.

FinTune addresses all three: parameter-efficient fine-tuning that fits on a single GPU, guardrails that intercept PII before inference, and a monitoring system that detects drift and triggers self-recovery without human intervention.

System Architecture

The system has four layers, each independently testable and replaceable.

Input

Financial text

financial_phrasebank (2,264 sentences)

↓

Guardrails

PII Detection (Presidio)

Confidence Thresholding

Label Validation

↓

Inference

Mistral-7B QLoRA (4-bit NF4)

PEFT LoRA adapters

↓

Serving + Monitoring

FastAPI (/predict, /batch, /metrics)

Circuit Breaker

Drift Detection (KL-div)

QLoRA fine-tuning

Mistral-7B loaded in 4-bit NF4 precision via bitsandbytes, with LoRA adapters (rank 16, alpha 32) applied to attention projection layers. This reduces trainable parameters from 7B to ~4M while preserving model quality. Training runs on a single T4 GPU in under 15 minutes using Hugging Face TRL's SFTTrainer.

PII guardrails

Every request passes through Presidio-based detection before reaching the model. SSN patterns, credit card numbers, email addresses, and phone numbers are detected and masked. The guardrail result includes flags for low confidence, invalid labels, and a list of detected PII types - all returned to the caller for audit.

Self-recovery engine

A circuit breaker (CLOSED / OPEN / HALF_OPEN) tracks consecutive failures. On threshold breach, requests are rejected immediately instead of cascading. The RecoveryManager handles OOM by clearing CUDA cache and reloading the model, latency spikes by logging diagnostics, and drift by alerting registered hooks. All without human intervention.

Sklearn Baselines

Before fine-tuning a 7B model, I established rigorous baselines. Three TF-IDF pipelines (LogisticRegression, RandomForest, LinearSVC) trained on financial_phrasebank provide the floor that the QLoRA model must beat. All baselines run on CPU in under 30 seconds with confusion matrices, classification reports, and inference latency benchmarks saved to outputs/.

This matters because it answers the question every reviewer should ask: "Did the deep learning model actually earn its complexity, or would a logistic regression have done the same job?" The baselines make that comparison auditable.

Real-Time Monitoring

The SystemMonitor is a thread-safe singleton that tracks every request: latency percentiles (p50, p95, p99), throughput (requests/sec), error rate, and prediction distribution. A composite health score (0-100) combines all signals with weighted penalties for latency spikes, error rates, and low throughput.

Drift detection computes KL-divergence between the current prediction distribution and a stored baseline. When divergence exceeds the configurable threshold (default 0.15), the registered drift hook fires. The /metrics endpoint exports everything as JSON for dashboard integration.

Honest Assessment: What Worked and What Didn't

What worked

4-bit NF4 quantization. Reduced GPU memory from ~14GB to ~3.5GB with negligible accuracy loss. Made the model runnable on consumer hardware (T4, RTX 3060) without cloud GPU costs.

What failed

Dataset library versioning. HuggingFace datasets v4.x dropped support for script-based datasets entirely. The financial_phrasebank repo uses a legacy loading script, requiring a pin to datasets<3.0 and trust_remote_code=True. Cost two days of debugging.

What worked

Circuit breaker pattern. Clean state machine (CLOSED/OPEN/HALF_OPEN) with configurable thresholds. Prevents cascading failures during model reload and makes the system self-healing without operator pages.

What failed

CPU-only fallback path. DistilBERT CPU config was intended as a no-GPU fallback, but the HF Trainer + TRL integration assumes GPU-aware features. The fallback needs more isolation from the QLoRA codepath.

What worked

Sklearn baselines as a discipline. Having auditable baselines prevented the "it's deep learning so it must be better" trap. The comparison is saved as JSON artifacts that any reviewer can inspect.

What failed

bitsandbytes on Windows. Native Windows support for bitsandbytes is fragile. WSL2 or Linux containers are the practical path for QLoRA training. Documentation should have been clearer about this upfront.

What I Would Build Next

GGUF export for llama.cpp inference - eliminate the PyTorch dependency for serving
A/B testing between QLoRA and baseline models with traffic splitting at the FastAPI layer
SHAP explanations per prediction for compliance audit trail on financial decisions
Streaming inference with Server-Sent Events for long financial documents
Feature store integration to cache embeddings and reduce redundant tokenization