A production-grade financial sentiment analysis system: QLoRA fine-tuning of Mistral-7B with 4-bit NF4 quantization, PII guardrails via Presidio, circuit breaker self-recovery, real-time monitoring with KL-divergence drift detection, and FastAPI serving with batch inference. Built to classify financial text (earnings calls, analyst reports, SEC filings) into positive, neutral, and negative sentiment with production reliability.
Financial sentiment analysis looks simple on paper - positive, neutral, negative - but production deployment exposes three hard problems that demo models ignore. First, 7B-parameter models don't fit on consumer or mid-tier GPUs without quantization, and quantization done carelessly destroys accuracy. Second, financial text contains PII (account numbers, SSNs, emails) that must never reach the model or logs. Third, models drift silently in production - the distribution of predictions shifts away from training data, and without drift detection you don't know until a downstream system breaks.
FinTune addresses all three: parameter-efficient fine-tuning that fits on a single GPU, guardrails that intercept PII before inference, and a monitoring system that detects drift and triggers self-recovery without human intervention.
The system has four layers, each independently testable and replaceable.
Mistral-7B loaded in 4-bit NF4 precision via bitsandbytes, with LoRA adapters (rank 16, alpha 32) applied to attention projection layers. This reduces trainable parameters from 7B to ~4M while preserving model quality. Training runs on a single T4 GPU in under 15 minutes using Hugging Face TRL's SFTTrainer.
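The quantization and adapter setup described above can be sketched as a config fragment. This is illustrative, not the project's exact code: the specific `target_modules` names (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and the dropout value are assumptions consistent with "LoRA adapters applied to attention projection layers", and running it requires a GPU and the model weights.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization: weights stored as 4-bit NormalFloat,
# compute performed in bfloat16, double quantization to save memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention projections only: rank 16, alpha 32.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # trainable params on the order of millions, base 7B frozen
```

The resulting PEFT model drops into TRL's SFTTrainer unchanged; only the adapter weights are updated during training.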
Every request passes through Presidio-based detection before reaching the model. SSN patterns, credit card numbers, email addresses, and phone numbers are detected and masked. The guardrail result includes flags for low confidence, invalid labels, and a list of detected PII types - all returned to the caller for audit.
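The detect-then-mask flow can be illustrated with a minimal regex sketch. To be clear: this is not Presidio — Presidio's recognizers use context words, checksums, and confidence scores and are far more robust — but it shows the shape of the guardrail result (masked text plus a list of detected PII types returned for audit).

```python
import re

# Illustrative patterns only; Presidio's built-in recognizers
# handle many more formats and edge cases.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def mask_pii(text: str) -> tuple[str, list[str]]:
    """Mask detected PII; return (masked_text, detected_types) for audit."""
    detected = []
    for pii_type, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            detected.append(pii_type)
            text = pattern.sub(f"<{pii_type}>", text)
    return text, detected

masked, types = mask_pii("Contact john@acme.com, SSN 123-45-6789")
# masked: "Contact <EMAIL>, SSN <SSN>"; types includes "SSN" and "EMAIL"
```

The key property either way: masking happens before the text reaches the model or any log line, so raw PII never persists downstream.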
A circuit breaker (CLOSED / OPEN / HALF_OPEN) tracks consecutive failures. On threshold breach, requests are rejected immediately instead of cascading. The RecoveryManager handles OOM by clearing CUDA cache and reloading the model, latency spikes by logging diagnostics, and drift by alerting registered hooks. All without human intervention.
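The state machine itself is small. Below is a minimal sketch of the CLOSED/OPEN/HALF_OPEN transitions; the threshold and timeout values are illustrative defaults, not the project's configured values.

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # failing fast, rejecting requests
    HALF_OPEN = "half_open"  # probing whether the backend recovered

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = State.CLOSED
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state is State.OPEN:
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = State.HALF_OPEN  # let one probe request through
                return True
            return False  # reject immediately instead of cascading
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self) -> None:
        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = time.monotonic()
```

A failed probe in HALF_OPEN immediately re-opens the breaker, so a still-broken model reload costs one request, not a flood.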
Before fine-tuning a 7B model, I established rigorous baselines. Three TF-IDF pipelines (LogisticRegression, RandomForest, LinearSVC) trained on financial_phrasebank provide the floor that the QLoRA model must beat. All baselines run on CPU in under 30 seconds, with confusion matrices, classification reports, and inference latency benchmarks saved to outputs/.
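One of the three baselines can be sketched as a standard scikit-learn pipeline. The toy sentences below are stand-ins for financial_phrasebank rows, not real dataset entries:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for financial_phrasebank (sentence, label) rows.
train_texts = [
    "Profit rose sharply after the acquisition",
    "Earnings were flat compared to last quarter",
    "The company reported a steep loss",
    "Revenue grew and margins improved",
    "Results matched analyst expectations",
    "Sales declined amid weak demand",
]
train_labels = ["positive", "neutral", "negative",
                "positive", "neutral", "negative"]

# TF-IDF + LogisticRegression; swap the classifier for
# RandomForestClassifier or LinearSVC for the other two baselines.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(train_texts, train_labels)

preds = baseline.predict(["Margins improved and profit grew"])
```

On real data, `classification_report` and the confusion matrix from each pipeline are what get serialized to outputs/ for comparison against the fine-tuned model.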
This matters because it answers the question every reviewer should ask: "Did the deep learning model actually earn its complexity, or would a logistic regression have done the same job?" The baselines make that comparison auditable.
The SystemMonitor is a thread-safe singleton that tracks every request: latency percentiles (p50, p95, p99), throughput (requests/sec), error rate, and prediction distribution. A composite health score (0-100) combines all signals with weighted penalties for latency spikes, error rates, and low throughput.
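The monitor's core bookkeeping can be sketched as follows. The penalty weights and thresholds here are illustrative, not the project's configured values, and the real SystemMonitor tracks more signals (throughput, prediction distribution):

```python
import threading
from collections import deque

class SystemMonitor:
    """Minimal sketch: thread-safe request tracking with latency
    percentiles and a composite health score (0-100)."""

    def __init__(self, window: int = 1000):
        self._lock = threading.Lock()
        self.latencies_ms: deque[float] = deque(maxlen=window)
        self.errors = 0
        self.total = 0

    def record(self, latency_ms: float, ok: bool = True) -> None:
        with self._lock:
            self.latencies_ms.append(latency_ms)
            self.total += 1
            if not ok:
                self.errors += 1

    def percentile(self, p: float) -> float:
        data = sorted(self.latencies_ms)
        if not data:
            return 0.0
        idx = min(len(data) - 1, round(p / 100 * (len(data) - 1)))
        return data[idx]

    def health_score(self) -> float:
        """Start at 100 and apply weighted penalties (illustrative values)."""
        score = 100.0
        if self.percentile(95) > 500:  # p95 latency spike
            score -= 30
        if self.total and self.errors / self.total > 0.05:  # error rate
            score -= 40
        return max(score, 0.0)
```

Collapsing many signals into one score gives alerting a single threshold to watch, while the raw percentiles stay available for diagnosis.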
Drift detection computes the KL divergence between the current prediction distribution and a stored baseline. When divergence exceeds the configurable threshold (default 0.15), the registered drift hook fires. The /metrics endpoint exports everything as JSON for dashboard integration.
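The drift check reduces to a few lines. The baseline distribution below is made up for illustration; in the system it is captured from the model's predictions at deployment time:

```python
import math

def kl_divergence(p: list[float], q: list[float], eps: float = 1e-9) -> float:
    """KL(P || Q) over label distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical baseline over (positive, neutral, negative).
baseline = [0.30, 0.50, 0.20]

def check_drift(current: list[float], threshold: float = 0.15) -> tuple[bool, float]:
    """Return (drifted, divergence); the drift hook fires when drifted is True."""
    divergence = kl_divergence(current, baseline)
    return divergence > threshold, divergence

check_drift([0.31, 0.48, 0.21])  # near baseline: below threshold
check_drift([0.70, 0.20, 0.10])  # heavy positive skew: above threshold
```

Note the asymmetry: KL(P‖Q) measures how surprising the current distribution P is under the baseline Q, which is the direction you want for "predictions have shifted away from training".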
4-bit NF4 quantization. Reduced GPU memory from ~14GB to ~3.5GB with negligible accuracy loss. Made the model runnable on consumer hardware (T4, RTX 3060) without cloud GPU costs.
Dataset library versioning. Hugging Face datasets v3.0 dropped support for script-based datasets entirely. The financial_phrasebank repo uses a legacy loading script, requiring a pin to datasets<3.0 and trust_remote_code=True. Cost two days of debugging.
Circuit breaker pattern. Clean state machine (CLOSED/OPEN/HALF_OPEN) with configurable thresholds. Prevents cascading failures during model reload and makes the system self-healing without operator pages.
CPU-only fallback path. DistilBERT CPU config was intended as a no-GPU fallback, but the HF Trainer + TRL integration assumes GPU-aware features. The fallback needs more isolation from the QLoRA codepath.
Sklearn baselines as a discipline. Having auditable baselines prevented the "it's deep learning so it must be better" trap. The comparison is saved as JSON artifacts that any reviewer can inspect.
bitsandbytes on Windows. Native Windows support for bitsandbytes is fragile. WSL2 or Linux containers are the practical path for QLoRA training. Documentation should have been clearer about this upfront.