Triagegeist is a $10,000 Kaggle hackathon — predict emergency triage acuity (1–5, where 1 is most critical) from 80k clinical ED records. The dataset has 37 features, free-text chief complaints, and 24 binary comorbidity flags. We went from a baseline of 0.891 CV accuracy to 0.9995. Here is exactly how.
// what the competition was
Three data sources, joined on patient_id:
| File | Rows | Key content |
|---|---|---|
| train.csv / test.csv | 80k / 20k | 37 clinical features — vitals, NEWS2, GCS, demographics |
| chief_complaints.csv | 100k | free-text description of why the patient came in |
| patient_history.csv | 100k | 24 binary comorbidity flags (diabetes, hypertension, etc.) |
The task: predict triage_acuity (1–5). Metric: accuracy. The challenge is that class 1 (resuscitation) and class 2 (emergent) are clinically distinct but often look similar in structured vitals alone.
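Assembling the working table is a single join on patient_id. A minimal sketch in pandas; the left-join semantics are an assumption, not something the competition docs specify:

```python
import pandas as pd

# Join the three sources on patient_id. Left joins are an assumed
# choice here; the real pipeline may differ.
train = pd.read_csv("train.csv")
complaints = pd.read_csv("chief_complaints.csv")  # free-text chief_complaint_raw
history = pd.read_csv("patient_history.csv")      # 24 binary comorbidity flags

df = (
    train
    .merge(complaints, on="patient_id", how="left")
    .merge(history, on="patient_id", how="left")
)
```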
// where the points came from
Scaling chief_complaint_raw from 50 TF-IDF features to 2000 bigrams moved CV accuracy from 0.891 to 0.9989. That was the move that mattered. Everything else was secondary.
In hindsight it's obvious. "Thunderclap headache" and "minor skin rash" carry very different acuity signals regardless of what the vitals show. The baseline had TF-IDF capped at 50 features — nowhere near enough to capture the complaint vocabulary.
| Experiment | CV Accuracy | Change |
|---|---|---|
| Baseline (TF-IDF 50 features) | 0.8910 | — |
| TF-IDF 150 features | 0.9836 | +0.0926 |
| TF-IDF 300 features | 0.9919 | +0.0083 |
| TF-IDF 500 features | 0.9948 | +0.0029 |
| TF-IDF 1000 features | 0.9980 | +0.0032 |
| TF-IDF 2000 features | 0.9989 | +0.0009 |
| + glaucoma tier (final) | 0.9995 | +0.0006 |
Going from 50 to 150 features was worth 9.3 points of accuracy — more than everything else in the project combined. When a dataset has free-text that directly describes what you are predicting, that text is the primary feature. Everything else is cleanup.
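The change itself is tiny. A sketch of the final vectorizer, assuming scikit-learn; max_features, sublinear TF, and the bigram setting come from the feature table below, while ngram_range=(1, 2) (unigrams plus bigrams) is our reading of "bigrams":

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# 50 -> 2000 features was the whole game
vectorizer = TfidfVectorizer(
    max_features=2000,
    ngram_range=(1, 2),   # assumed: unigrams + bigrams
    sublinear_tf=True,
)
X_text_train = vectorizer.fit_transform(train["chief_complaint_raw"])  # fit on train only
X_text_test = vectorizer.transform(test["chief_complaint_raw"])        # never refit on test
```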
We also tried tuning LightGBM hyperparameters (num_leaves, learning_rate, subsample) after reaching 0.9980. It gave 0.9980 back — no improvement. We reverted and scaled TF-IDF instead. This is a common pattern: hyperparameter tuning cannot compensate for a missing signal.
// error analysis — every mistake, one diagnosis
After reaching 0.9989, the model had plateaued. Adding more TF-IDF features yielded diminishing returns. At this point we ran full error analysis across all 5 CV folds — 80k rows, ~39 total errors.
Every single error came from the same complaint: variants of "acute angle closure glaucoma".
Why? Because this condition sits right on the clinical boundary between acuity 1 (critical) and acuity 2 (emergent). The complaint text is identical across patients but the correct label differs. The text alone cannot resolve it — the vital signs can.
Specifically: patients labeled acuity 1 had higher NEWS2 scores, lower GCS totals, and different pain scores compared to those labeled acuity 2. The text is ambiguous; the vitals are not.
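The whole analysis fits in a few lines. A sketch assuming out-of-fold predictions from the 5-fold CV; train_model and the vitals column names (news2_score, gcs_total, pain_score) are placeholders, not the actual identifiers:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Out-of-fold predictions across the 5 CV folds
oof = np.zeros(len(y), dtype=int)
for tr_idx, va_idx in StratifiedKFold(5, shuffle=True, random_state=42).split(X, y):
    model = train_model(X.iloc[tr_idx], y.iloc[tr_idx])  # placeholder for the LightGBM fit
    oof[va_idx] = model.predict(X.iloc[va_idx])

# Group every error by its complaint text
errors = df.loc[oof != y.values, "chief_complaint_raw"]
print(errors.value_counts())  # every entry is a glaucoma variant

# Within that complaint, compare vitals across the two labels
glaucoma = df[df["chief_complaint_raw"].str.contains("glaucoma", case=False, na=False)]
print(glaucoma.groupby("triage_acuity")[["news2_score", "gcs_total", "pain_score"]].mean())
```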
This error analysis took about 30 minutes. It was worth more than days of hyperparameter tuning would have been.
// the 3-tier hybrid architecture
Once we knew the error source, the fix was straightforward: build a dedicated classifier for those 76 test rows that the text couldn't resolve, and keep everything else as a lookup.
Tier 1 — Direct lookup (19,885 rows, 99.4%). Any complaint text that appeared in training and always mapped to the same acuity is stored in a dictionary. For these rows, no model is needed. The lookup is deterministic and perfect. A model can only be wrong; a lookup of a memorised pattern cannot.
Tier 2 — Glaucoma binary classifier (76 rows, 0.4%). Trained only on the 237 training rows with an "acute angle closure glaucoma" complaint. Features: NEWS2 score, GCS total, pain score, heart rate, systolic BP, respiratory rate, SpO2, temperature, shock index. Binary target: acuity 1 vs acuity 2. CV accuracy: 94% — compared to ~27% without it, roughly chance-level given the heavy class imbalance on these rows.
Tier 3 — Full LightGBM multiclass (39 rows, 0.2%). Complaint texts not seen during training. Falls back to the complete feature set: TF-IDF, all vitals, comorbidities, interactions. These 39 rows are genuinely novel — no lookup can help them.
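Inference is then a dispatch over the three tiers. A sketch under assumed names (VITAL_COLS, featurize, glaucoma_clf, and full_model are illustrative, not the actual code):

```python
# Tier-1 lookup: complaint texts whose training label never varies
label_counts = train.groupby("chief_complaint_raw")["triage_acuity"].nunique()
consistent = label_counts[label_counts == 1].index
lookup = (
    train[train["chief_complaint_raw"].isin(consistent)]
    .groupby("chief_complaint_raw")["triage_acuity"]
    .first()
    .to_dict()
)

def predict_row(row):
    text = row["chief_complaint_raw"]
    if text in lookup:                                  # Tier 1: deterministic memory
        return lookup[text]
    if "acute angle closure glaucoma" in text.lower():  # Tier 2: vitals-only binary model
        return glaucoma_clf.predict(row[VITAL_COLS].to_frame().T)[0]
    return full_model.predict(featurize(row))[0]        # Tier 3: full multiclass fallback
```

The ordering is safe because the glaucoma complaint has conflicting labels in training, so it never enters the tier-1 lookup.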
// feature engineering
All feature logic lives in features.py. The key design constraint: no leakage. Encoders are fit on training data only and applied to validation/test through a fit_params dictionary.
X_train_fe, fit_params = engineer_features(X_train, is_train=True)
X_val_fe = apply_features(X_val, fit_params)
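features.py isn't reproduced in this post, but the fit/apply split reduces to roughly the pattern below. The column lists are passed explicitly to keep the sketch self-contained; the real engineer_features takes is_train instead:

```python
def engineer_features(df, cat_cols, vital_cols):
    """Fit encoders on the training split only; return (features, fit_params)."""
    fit_params = {
        # frequency encoding: category -> relative frequency in train
        "freq_maps": {c: df[c].value_counts(normalize=True).to_dict() for c in cat_cols},
        # imputation medians for vitals, also fitted on train
        "medians": df[vital_cols].median().to_dict(),
        "cat_cols": cat_cols,
        "vital_cols": vital_cols,
    }
    return apply_features(df, fit_params), fit_params

def apply_features(df, fit_params):
    """Apply the train-fitted transforms to any split (val/test) without refitting."""
    out = df.copy()
    for c in fit_params["cat_cols"]:
        out[c] = out[c].map(fit_params["freq_maps"][c]).fillna(0.0)  # unseen category -> 0
    for c in fit_params["vital_cols"]:
        out[c] = out[c].fillna(fit_params["medians"][c])
    return out
```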
What we use
| Feature group | Method | Notes |
|---|---|---|
| Chief complaint text | TF-IDF bigrams, 2000 features, sublinear TF | Dominant signal — see scaling table above |
| Categorical columns | Frequency encoding | Fitted on train only |
| Vitals with missingness | Median imputation | Medians fitted on train |
| Clinical interactions | gcs × news2, resp × spo2, pain × news2 | Clinically meaningful combinations |
| Comorbidities | 24 binary flags + sum (burden score) | From patient_history.csv |
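The interaction and burden rows in the table reduce to a few lines (the vital column names and COMORBIDITY_COLS are assumptions):

```python
# Clinically meaningful interaction terms (column names assumed)
df["gcs_x_news2"] = df["gcs_total"] * df["news2_score"]
df["resp_x_spo2"] = df["respiratory_rate"] * df["spo2"]
df["pain_x_news2"] = df["pain_score"] * df["news2_score"]

# Comorbidity burden: how many of the 24 binary flags are set
df["comorbidity_burden"] = df[COMORBIDITY_COLS].sum(axis=1)
```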
What we drop
ed_los_hours and disposition are post-triage outcomes — they're in training but not in test, so including them would be pure leakage. triage_nurse_id and site_id are high-cardinality identifiers that won't generalise to held-out data.
// lessons
Text beats everything else when it describes the target. Before figuring this out, we spent time on vitals engineering, clinical interactions, and comorbidity combinations. None of it moved accuracy by more than 0.5%. The text change was worth 9.3 points. Treat free-text that describes your target as the primary feature from day one, not an afterthought.
Error analysis beats hyperparameter tuning. Going from 0.9989 to 0.9995 required understanding why the model was wrong, not trying parameter combinations. Every single error had the same root cause. Once you find it, the fix is obvious. Before that, you're just guessing.
Sometimes the answer is already in the training data. For 99.4% of this dataset, the right prediction was just a lookup. A lookup can't be wrong the way a model can. The model is only needed for the 0.6% where training data gives no definitive answer.