$ cat ./blog/insurance-reshopping-predictor.md  ·  MAR 2026  ·  ~8 min read

Building the Insurance Re-Shopping
Predictor — Data Quality First


// why this project

Car insurance re-shopping prediction is a real problem in the insurance industry — knowing when a customer is likely to switch providers is worth millions. I wanted to build the full pipeline end-to-end, from raw data validation through explainable predictions, to learn how production insurance ML actually works.

The job description emphasized data quality above all else. So I made that the foundation of the entire project.

// data quality as the primary concern

Most portfolio projects jump straight to modeling. This one starts with a DataQualityReport class that runs 8 categories of validation checks before any preprocessing begins.
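The shape of the idea is simple: each check is a small function that returns a status plus human-readable details, and the report just aggregates them. Here's a minimal sketch (class and function names are illustrative, not the project's actual API):

```python
# Minimal sketch of the check-runner idea (illustrative names,
# not the project's actual API). Each check returns a status plus
# human-readable details; the report aggregates the results.
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    status: str   # "PASS", "WARN", or "FAIL"
    details: str

def check_missing(rows, columns):
    """Count missing (None) cells across all columns."""
    n_missing = sum(1 for row in rows for col in columns if row.get(col) is None)
    status = "PASS" if n_missing == 0 else "WARN"
    return CheckResult("Missing Values", status, f"{n_missing} missing across all columns")

def check_duplicate_ids(rows, id_col="id"):
    """Flag repeated row identifiers."""
    ids = [row[id_col] for row in rows]
    n_dupes = len(ids) - len(set(ids))
    status = "PASS" if n_dupes == 0 else "FAIL"
    return CheckResult("Duplicate IDs", status, f"{n_dupes} duplicates")

rows = [{"id": 1, "Age": 40}, {"id": 2, "Age": None}]
report = [check_missing(rows, ["id", "Age"]), check_duplicate_ids(rows)]
```

The real class wraps eight such checks; the point is that each one is independently testable and produces output a human can act on.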

| Check | Status | Details |
|---|---|---|
| Schema Validation | PASS | 12 cols, 381,109 rows |
| Missing Values | PASS | 0 missing across all columns |
| Class Balance | WARN | 12.26% positive rate |
| Duplicate IDs | PASS | 0 duplicates |
| Range Violations | PASS | 0 out-of-range values |
| Suspicious Patterns | WARN | Premium outliers, sparse channels |
| Quality Score | 95/100 | Weighted composite |

Every check includes equivalent SQL validation queries as docstrings. In production insurance ML, you’d run these against the data warehouse before any model training. I documented them because thinking in SQL is a signal that you understand how data actually flows in production systems.

-- Flag sparse policy sales channels
SELECT Policy_Sales_Channel, COUNT(*) AS n
FROM policies
GROUP BY Policy_Sales_Channel
HAVING COUNT(*) < 100;

The quality score is a weighted composite: schema validation (25%), missing values (20%), duplicates (15%), range violations (15%), suspicious patterns (15%), and class balance (10%). The weights reflect how much each issue would actually break a production pipeline.
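The arithmetic is a straight weighted average. A sketch with the weights from above (the per-status scores are my assumption here, so this lands near, not exactly at, the 95/100 in the table):

```python
# Weighted composite quality score. The weights are the ones described
# in the post; the PASS/WARN/FAIL point values are assumed for
# illustration, so the result approximates (not reproduces) the 95/100.
WEIGHTS = {  # percent, sums to 100
    "schema": 25, "missing": 20, "duplicates": 15,
    "ranges": 15, "suspicious": 15, "class_balance": 10,
}
STATUS_SCORE = {"PASS": 100, "WARN": 75, "FAIL": 0}  # assumed penalties

def quality_score(statuses):
    total = sum(WEIGHTS[k] * STATUS_SCORE[s] for k, s in statuses.items())
    return round(total / 100)

score = quality_score({
    "schema": "PASS", "missing": "PASS", "duplicates": "PASS",
    "ranges": "PASS", "suspicious": "WARN", "class_balance": "WARN",
})
```

Because class balance carries only 10% of the weight, a WARN there barely dents the score, while a schema failure would crater it. That matches the failure modes: a schema break stops the pipeline; imbalance is just something the modeling stage has to handle.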

// architecture decisions

raw CSV ──> DataQualityReport ──> preprocessing ──> LightGBM ──> SHAP ──> Streamlit
            8 checks              encode + scale
            SQL queries           SMOTE (train only)
            quality score         log transform

The pipeline is intentionally linear. Each stage is a separate module with clear inputs and outputs. No circular dependencies, no shared mutable state. The preprocessing pipeline is saved as a pickle so the Streamlit app can transform new inputs identically to training data.
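That save/load contract is the part worth getting right: whatever statistics the pipeline learned at training time must be frozen and reused at serving time. A simplified stand-in for the real pipeline (the class and field names here are illustrative):

```python
# Sketch of the pickle save/load contract: a transform fitted on
# training data is serialized, then the serving app deserializes it
# and applies identical statistics to new inputs. Simplified stand-in
# for the real preprocessing pipeline; names are illustrative.
import math
import pickle

class PremiumTransform:
    """log1p + standardization with training-time mean/std."""
    def fit(self, values):
        logs = [math.log1p(v) for v in values]
        self.mean = sum(logs) / len(logs)
        self.std = (sum((x - self.mean) ** 2 for x in logs) / len(logs)) ** 0.5
        return self

    def transform(self, v):
        return (math.log1p(v) - self.mean) / self.std

fitted = PremiumTransform().fit([31_000.0, 45_000.0, 540_000.0])
blob = pickle.dumps(fitted)       # written to disk at training time
loaded = pickle.loads(blob)       # read by the Streamlit app
```

If the app re-fit the scaler on each new input instead of loading the pickle, a single customer's premium would standardize to zero every time and the model would see garbage.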

// what the data told me

Class imbalance is severe but expected

Only 12.26% of customers expressed interest in vehicle insurance (the re-shopping proxy). This 88/12 split means accuracy is a useless metric — you could get 88% accuracy by predicting “not interested” for everyone. ROC-AUC is the right primary metric here.
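The arithmetic makes the point concrete:

```python
# Why accuracy misleads at a 12.26% positive rate: the trivial model
# that predicts "not interested" for everyone already scores ~88%.
n_rows = 381_109
positive_rate = 0.1226

n_positive = round(n_rows * positive_rate)
majority_accuracy = (n_rows - n_positive) / n_rows   # ~0.877

# The same constant model has ROC-AUC 0.5: it never ranks any
# positive above any negative, so it carries no information.
```

ROC-AUC can't be gamed this way because it measures ranking quality across all thresholds, not agreement at one threshold.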

Vehicle damage is the strongest signal

Customers with prior vehicle damage are roughly 3.5x more likely to be interested in re-shopping. This makes intuitive sense — if your car’s been damaged, you’ve already experienced the claims process and have direct motivation to compare prices.
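The "3.5x" is a lift ratio between the two groups' interest rates (the rates below are placeholders for illustration, not the dataset's actual numbers):

```python
# Lift calculation behind the "roughly 3.5x" claim. The two group
# rates here are illustrative placeholders, not the real figures.
rate_with_damage = 0.21       # placeholder
rate_without_damage = 0.06    # placeholder

lift = rate_with_damage / rate_without_damage   # ~3.5x more likely
```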

Previously_Insured is suspiciously predictive

99.8% of previously insured customers show zero interest in re-shopping. This is almost too clean — it could indicate label leakage rather than genuine signal. In a production setting, I’d investigate whether this feature was constructed from the target variable.

Age 40–50 is the peak re-shopping window

Interest in re-shopping peaks in the 40–50 age range, then drops sharply. Younger customers may not have enough policy history to motivate switching, while older customers may have loyalty discounts that reduce incentive.

// preprocessing decisions

Every encoding choice has a documented reason:

| Step | What | Why |
|---|---|---|
| 1 | Drop id | Row identifier, would cause memorization |
| 2 | Gender → 0/1 | Binary feature, OHE adds redundant column |
| 3 | Vehicle_Age → 0/1/2 | Ordinal: natural age ordering for tree splits |
| 4 | Vehicle_Damage → 0/1 | Binary, same rationale as Gender |
| 5 | log1p(Annual_Premium) | Heavily right-skewed (max ~540K, median ~31K) |
| 6 | StandardScaler | Consistent SHAP interpretation scale |
| 7 | SMOTE (train only) | Address 88/12 split without losing majority class data |

The SMOTE step is the one that matters most to get right. Applying SMOTE to validation or test sets would leak information about the minority class distribution and produce unrealistically optimistic metrics. The code enforces this: apply_smote() is called only on the training split, never on val/test.
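To make the mechanism concrete, here's a toy SMOTE-style oversampler, interpolating between minority pairs, applied to the training split only. This is a simplified stand-in for imblearn's SMOTE; `apply_smote` mirrors the function name above but the body is illustrative:

```python
# Toy SMOTE-style oversampling (simplified stand-in for imblearn's
# SMOTE): synthesize minority samples by interpolating between
# existing minority pairs until classes balance. Called on the
# training split only; val/test never pass through this.
import random

def apply_smote(X, y, seed=0):
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == 1]
    needed = sum(1 for label in y if label == 0) - len(minority)
    X_new, y_new = list(X), list(y)
    for _ in range(needed):
        a, b = rng.sample(minority, 2)      # pick a minority pair
        t = rng.random()                    # random interpolation point
        X_new.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
        y_new.append(1)
    return X_new, y_new

X_train = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0]]
y_train = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = apply_smote(X_train, y_train)
# Synthetic points land between the real minority samples (10 and 11).
```

Because the synthetic points live on segments between real minority samples, oversampling val/test would tell the model exactly where the minority class sits in those sets — which is the leakage the train-only rule prevents.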

// model selection and results

LightGBM was the obvious choice for tabular insurance data. It handles class imbalance natively via class_weight='balanced', trains fast on 380K+ rows, and is fully compatible with SHAP’s TreeExplainer for exact Shapley value computation.

GridSearchCV tuned over n_estimators, max_depth, learning_rate, min_child_samples, and class_weight. The best configuration used 500 estimators with unlimited depth and balanced class weights.
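The search space looks like this (the parameter names come from the tuning run above; the candidate values are illustrative, not my exact grid):

```python
# Sketch of the GridSearchCV search space. Parameter names match the
# tuned ones; the candidate values shown are illustrative. In
# LightGBM, max_depth=-1 means unlimited depth.
from itertools import product

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [-1, 7, 15],
    "learning_rate": [0.05, 0.1],
    "min_child_samples": [20, 50],
    "class_weight": [None, "balanced"],
}

# Configurations per CV fold; multiply by the fold count for total fits.
n_configs = len(list(product(*param_grid.values())))
```

Even this modest grid is 72 configurations per fold, which is why fast training on 380K rows matters for the model choice.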

| Metric | Train | Val | Test |
|---|---|---|---|
| ROC-AUC | 0.9705 | 0.8478 | 0.8468 |
| F1 (positive) | 0.8912 | 0.3727 | 0.3776 |
| Precision | 0.9022 | 0.3562 | 0.3568 |
| Recall | 0.8805 | 0.3909 | 0.4010 |

The train-test gap is expected — SMOTE inflates training metrics by creating synthetic minority samples. The test metrics are the honest numbers. Val and test AUC being nearly identical suggests the model generalizes well and isn’t overfit to the validation set.

// making predictions explainable

A probability score alone isn’t useful to a customer. The app provides three layers of explanation:

SHAP waterfall chart — shows exactly which features pushed the score up or down for this specific customer, using SHAP’s TreeExplainer for exact (not approximate) Shapley values.

Plain-English factor descriptions — each top factor is translated into a sentence like “Your vehicle damage history increases your re-shopping likelihood by 34%.” No one should need to understand SHAP values to use this tool.

Counterfactual suggestions — the app identifies the single most actionable change that would move the customer’s score. This is the feature with the largest SHAP magnitude among features the customer could realistically influence.
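The selection logic for that third layer is a one-liner over the SHAP values, filtered by an allowlist of features the customer can actually change. A sketch (the SHAP magnitudes and the actionable set below are illustrative):

```python
# Counterfactual pick: the largest-|SHAP| feature among those a
# customer could realistically change. Values and the actionable
# allowlist here are illustrative.
shap_values = {
    "Previously_Insured": -1.42,   # not actionable
    "Vehicle_Damage": 0.88,        # history, not changeable
    "Policy_Sales_Channel": 0.31,
    "Annual_Premium": -0.12,
    "Age": 0.05,                   # not actionable
}
ACTIONABLE = {"Policy_Sales_Channel", "Annual_Premium"}

suggestion = max(
    (f for f in shap_values if f in ACTIONABLE),
    key=lambda f: abs(shap_values[f]),
)
```

The allowlist is the important design choice: without it, the app would "suggest" changing Previously_Insured or Age, which is useless advice.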

// honest limitations

This model is trained on Indian market insurance data. Premium ranges, region codes, and sales channels are India-specific. Applying it directly to North American customers would require recalibration at minimum.

We predict re-shopping propensity, not actual savings. A customer flagged as “likely to save” may or may not find a better price when they actually compare quotes.

The Previously_Insured feature’s near-perfect predictive power warrants serious investigation before production deployment. If it’s genuinely this predictive, the model barely needs other features. If it’s leakage, the model’s real performance is worse than reported.

SMOTE generates synthetic data points by interpolating between existing minority samples. These are plausible but not real customers. The synthetic distribution should be validated against domain expertise before any production use.


Repository ↗  ·  Live Demo ↗

// related posts

→ Triagegeist — from 0.891 to 0.9995 CV accuracy in emergency triage
→ Building RAG From Scratch — every algorithm derived from first principles
→ SupportOps AI Monitor — LLM-powered ticket triage