// why this project
Car insurance re-shopping prediction is a real problem in the insurance industry — knowing when a customer is likely to switch providers is worth millions. I wanted to build the full pipeline end-to-end, from raw data validation through explainable predictions, to learn how production insurance ML actually works.
The job description emphasized data quality above all else. So I made that the foundation of the entire project.
// data quality as the primary concern
Most portfolio projects jump straight to modeling. This one starts with a DataQualityReport class that runs 8 categories of validation checks before any preprocessing begins.
| Check | Status | Details |
|---|---|---|
| Schema Validation | PASS | 12 cols, 381,109 rows |
| Missing Values | PASS | 0 missing across all columns |
| Class Balance | WARN | 12.26% positive rate |
| Duplicate IDs | PASS | 0 duplicates |
| Range Violations | PASS | 0 out-of-range values |
| Suspicious Patterns | WARN | Premium outliers, sparse channels |
| Quality Score | 95/100 | Weighted composite |
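The shape of these checks can be sketched in a few lines. This is an illustrative reduction, not the real `DataQualityReport` class — the function names, thresholds, and the `Response` target column are assumptions, and only two of the categories are shown:

```python
import pandas as pd

# Hypothetical sketch of two DataQualityReport-style checks.
# Names, thresholds, and the "Response" target column are assumptions.
EXPECTED_COLS = 12

def check_schema(df: pd.DataFrame) -> dict:
    """PASS when the frame has the expected column count and non-zero rows."""
    ok = df.shape[1] == EXPECTED_COLS and len(df) > 0
    return {"check": "Schema Validation",
            "status": "PASS" if ok else "FAIL",
            "details": f"{df.shape[1]} cols, {len(df):,} rows"}

def check_class_balance(df: pd.DataFrame, target: str = "Response") -> dict:
    """WARN when the positive rate falls outside an assumed 20-80% band."""
    rate = df[target].mean()
    status = "PASS" if 0.2 <= rate <= 0.8 else "WARN"
    return {"check": "Class Balance", "status": status,
            "details": f"{rate:.2%} positive rate"}

# Toy frame with a 12/88 split and 12 columns total.
df = pd.DataFrame({"Response": [1] * 12 + [0] * 88,
                   **{f"c{i}": 0 for i in range(11)}})
report = [check_schema(df), check_class_balance(df)]
```

Each check returns a small dict, so the final report is just a list of rows that renders directly into the table above.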
Every check includes an equivalent SQL validation query in its docstring. In production insurance ML, you'd run these against the data warehouse before any model training. I documented them because thinking in SQL signals an understanding of how data actually flows through production systems.
```sql
-- Flag sparse policy sales channels
SELECT Policy_Sales_Channel, COUNT(*) AS n
FROM policies
GROUP BY Policy_Sales_Channel
HAVING COUNT(*) < 100;
```
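The docstring pattern pairs each pandas check with its warehouse twin. A hedged sketch of what one such function might look like (the function name and table name are illustrative; the column name comes from the dataset):

```python
import pandas as pd

def flag_sparse_channels(df: pd.DataFrame, min_rows: int = 100) -> pd.Series:
    """Pandas twin of the warehouse check (table name assumed):

        SELECT Policy_Sales_Channel, COUNT(*) AS n
        FROM policies
        GROUP BY Policy_Sales_Channel
        HAVING COUNT(*) < 100;
    """
    counts = df["Policy_Sales_Channel"].value_counts()
    return counts[counts < min_rows]

# Channel 152 appears only 3 times, so it gets flagged as sparse.
df = pd.DataFrame({"Policy_Sales_Channel": [26] * 150 + [152] * 3})
sparse = flag_sparse_channels(df)
```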
The quality score is a weighted composite: schema validation (25%), missing values (20%), duplicates (15%), range violations (15%), suspicious patterns (15%), and class balance (10%). The weights reflect how much each issue would actually break a production pipeline.
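The composite is then just a weighted average of per-category subscores. A minimal sketch, assuming each category scores 0-100 and a WARN docks 20 points (that docking rule is my illustrative assumption, chosen because it reproduces the 95/100 in the table):

```python
# Category weights as stated above.
WEIGHTS = {
    "schema": 0.25, "missing": 0.20, "duplicates": 0.15,
    "ranges": 0.15, "patterns": 0.15, "class_balance": 0.10,
}

def quality_score(subscores: dict) -> float:
    """Weighted average of per-category subscores (each on a 0-100 scale)."""
    return sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)

# Two WARN categories docked 20 points each, everything else perfect:
# 0.25*100 + 0.20*100 + 0.15*100 + 0.15*100 + 0.15*80 + 0.10*80 = 95.
score = quality_score({"schema": 100, "missing": 100, "duplicates": 100,
                       "ranges": 100, "patterns": 80, "class_balance": 80})
```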
// architecture decisions
The pipeline is intentionally linear. Each stage is a separate module with clear inputs and outputs. No circular dependencies, no shared mutable state. The preprocessing pipeline is saved as a pickle so the Streamlit app can transform new inputs identically to training data.
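The pickle round-trip is worth showing because it's the contract between training and serving. A minimal sketch, assuming a scikit-learn `Pipeline` holds the preprocessing steps (the step names and single-column example are illustrative, not the project's actual pipeline):

```python
import pickle
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer

# Sketch: persist the *fitted* pipeline so the app applies the exact
# same transform (including learned scaler statistics) to new inputs.
pipe = Pipeline([
    ("log_premium", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])
X_train = np.array([[30_000.0], [540_000.0], [25_000.0]])
pipe.fit(X_train)

blob = pickle.dumps(pipe)        # written to disk at training time
restored = pickle.loads(blob)    # loaded by the Streamlit app

new_row = np.array([[31_000.0]])
same = np.allclose(pipe.transform(new_row), restored.transform(new_row))
```

The key detail is that the scaler's mean and variance travel inside the pickle, so serving never re-fits anything.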
// what the data told me
**Class imbalance is severe but expected**
Only 12.26% of customers expressed interest in vehicle insurance (the re-shopping proxy). This 88/12 split means accuracy is a useless metric — you could get 88% accuracy by predicting “not interested” for everyone. ROC-AUC is the right primary metric here.
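The "accuracy is useless" claim is easy to demonstrate on synthetic labels with the same positive rate; the always-negative baseline scores high on accuracy while its constant score gives chance-level AUC:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels with roughly the dataset's 12.26% positive rate.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.1226).astype(int)

# "Predict not interested for everyone": high accuracy, zero usefulness.
always_negative = np.zeros_like(y)
acc = accuracy_score(y, always_negative)            # ~0.88

# ROC-AUC is computed from scores; a constant score ranks nothing,
# so it lands at chance level, 0.5.
auc = roc_auc_score(y, np.zeros(len(y), dtype=float))
```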
**Vehicle damage is the strongest signal**
Customers with prior vehicle damage are roughly 3.5x more likely to be interested in re-shopping. This makes intuitive sense — if your car’s been damaged, you’ve already experienced the claims process and have direct motivation to compare prices.
**Previously_Insured is suspiciously predictive**
Customers who already hold vehicle insurance almost never express interest, making this feature close to a perfect negative predictor. A signal this strong is either a genuine behavioral fact or a data-leakage artifact; I return to this in the limitations section.
**Age 40–50 is the peak re-shopping window**
Interest in re-shopping peaks in the 40–50 age range, then drops sharply. Younger customers may not have enough policy history to motivate switching, while older customers may have loyalty discounts that reduce incentive.
// preprocessing decisions
Every encoding choice has a documented reason:
| Step | What | Why |
|---|---|---|
| 1 | Drop id | Row identifier, would cause memorization |
| 2 | Gender → 0/1 | Binary feature, OHE adds redundant column |
| 3 | Vehicle_Age → 0/1/2 | Ordinal: natural age ordering for tree splits |
| 4 | Vehicle_Damage → 0/1 | Binary, same rationale as Gender |
| 5 | log1p(Annual_Premium) | Heavily right-skewed (max ~540K, median ~31K) |
| 6 | StandardScaler | Consistent SHAP interpretation scale |
| 7 | SMOTE (train only) | Address 88/12 split without losing majority class data |
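Steps 1-6 are compact enough to sketch in one function. The column names follow the dataset; the exact category-to-integer mappings are my assumptions about how the ordinal and binary encodings were laid out:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of steps 1-5 above (mapping values assumed)."""
    out = df.drop(columns=["id"])                              # step 1
    out["Gender"] = out["Gender"].map({"Male": 0, "Female": 1})  # step 2
    out["Vehicle_Age"] = out["Vehicle_Age"].map(                 # step 3
        {"< 1 Year": 0, "1-2 Year": 1, "> 2 Years": 2})
    out["Vehicle_Damage"] = out["Vehicle_Damage"].map(           # step 4
        {"No": 0, "Yes": 1})
    out["Annual_Premium"] = np.log1p(out["Annual_Premium"])      # step 5
    return out

df = pd.DataFrame({
    "id": [1, 2], "Gender": ["Male", "Female"],
    "Vehicle_Age": ["< 1 Year", "> 2 Years"],
    "Vehicle_Damage": ["Yes", "No"], "Annual_Premium": [31_000.0, 540_000.0],
})
X = preprocess(df)
X_scaled = StandardScaler().fit_transform(X)  # step 6, fit on train only
```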
The SMOTE step is the one that matters most to get right. Applying SMOTE to validation or test sets would leak information about the minority class distribution and produce unrealistically optimistic metrics. The code enforces this: apply_smote() is called only on the training split, never on val/test.
// model selection and results
LightGBM was the obvious choice for tabular insurance data. It handles class imbalance natively via class_weight='balanced', trains fast on 380K+ rows, and is fully compatible with SHAP’s TreeExplainer for exact Shapley value computation.
GridSearchCV tuned over n_estimators, max_depth, learning_rate, min_child_samples, and class_weight. The best configuration used 500 estimators with unlimited depth and balanced class weights.
| Metric | Train | Val | Test |
|---|---|---|---|
| ROC-AUC | 0.9705 | 0.8478 | 0.8468 |
| F1 (positive) | 0.8912 | 0.3727 | 0.3776 |
| Precision | 0.9022 | 0.3562 | 0.3568 |
| Recall | 0.8805 | 0.3909 | 0.4010 |
The train-test gap is expected — SMOTE inflates training metrics by creating synthetic minority samples. The test metrics are the honest numbers. Val and test AUC being nearly identical suggests the model generalizes well and isn’t overfit to the validation set.
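The per-split evaluation behind the table is straightforward: AUC is computed from probabilities, while F1, precision, and recall need thresholded labels (a 0.5 threshold is assumed here; the helper name is illustrative):

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, f1_score,
                             precision_score, recall_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """One row of the metrics table for a single split (sketch)."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "roc_auc": roc_auc_score(y_true, y_prob),   # from raw scores
        "f1": f1_score(y_true, y_pred),              # from thresholded labels
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }

# Tiny worked example: 1 TP (0.7), 1 FP (0.6), 1 FN (0.4).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.2, 0.6, 0.1, 0.7, 0.4])
metrics = evaluate(y_true, y_prob)
```

Note the same model can have a strong AUC and a weak F1, as in the table above: AUC measures ranking quality across all thresholds, while F1 is pinned to one operating point.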
// making predictions explainable
A probability score alone isn’t useful to a customer. The app provides three layers of explanation:
SHAP waterfall chart — shows exactly which features pushed the score up or down for this specific customer, using SHAP’s TreeExplainer for exact (not approximate) Shapley values.
Plain-English factor descriptions — each top factor is translated into a sentence like “Your vehicle damage history increases your re-shopping likelihood by 34%.” No one should need to understand SHAP values to use this tool.
Counterfactual suggestions — the app identifies the single most actionable change that would move the customer’s score. This is the feature with the largest SHAP magnitude among features the customer could realistically influence.
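The counterfactual selection rule reduces to a one-liner over a customer's SHAP row. In this sketch the feature names, the "actionable" set, and the SHAP values are all illustrative assumptions — the real app computes the row with `TreeExplainer`:

```python
import numpy as np

# Assumed feature order and an assumed set of features a customer
# could realistically change (age and damage history are not among them).
FEATURES = ["Age", "Vehicle_Damage", "Annual_Premium", "Policy_Sales_Channel"]
ACTIONABLE = {"Annual_Premium", "Policy_Sales_Channel"}

def suggest_counterfactual(shap_row: np.ndarray) -> str:
    """Pick the actionable feature with the largest SHAP magnitude."""
    candidates = [(abs(v), name)
                  for name, v in zip(FEATURES, shap_row)
                  if name in ACTIONABLE]
    return max(candidates)[1]

# One customer's SHAP row: damage history dominates overall, but it
# isn't actionable, so the premium becomes the suggestion.
row = np.array([0.05, 0.34, -0.12, 0.02])
suggestion = suggest_counterfactual(row)  # -> "Annual_Premium"
```

Restricting the search to the actionable set is the whole point: telling a customer their age moved the score is an explanation, not a suggestion.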
// honest limitations
This model is trained on Indian market insurance data. Premium ranges, region codes, and sales channels are India-specific. Applying it directly to North American customers would require recalibration at minimum.
We predict re-shopping propensity, not actual savings. A customer flagged as “likely to save” may or may not find a better price when they actually compare quotes.
The Previously_Insured feature’s near-perfect predictive power warrants serious investigation before production deployment. If it’s genuinely this predictive, the model barely needs other features. If it’s leakage, the model’s real performance is worse than reported.
SMOTE generates synthetic data points by interpolating between existing minority samples. These are plausible but not real customers. The synthetic distribution should be validated against domain expertise before any production use.