← back to blog
$ cat ./blog/triagegeist-solution.md  ·  March 2026  ·  ~12 min read

Triagegeist — From 0.891 to 0.9995 CV Accuracy


Triagegeist is a $10,000 Kaggle hackathon: predict emergency triage acuity (1–5, where 1 is most critical) from 80k clinical ED records. The dataset has 37 structured features, free-text chief complaints, and 24 binary comorbidity flags. We went from a baseline of 0.891 CV accuracy to 0.9995. Here is exactly how.

// what the competition was

Three data sources, joined on patient_id:

| File | Rows | Key content |
|---|---|---|
| train.csv / test.csv | 80k / 20k | 37 clinical features — vitals, NEWS2, GCS, demographics |
| chief_complaints.csv | 100k | free-text description of why the patient came in |
| patient_history.csv | 100k | 24 binary comorbidity flags (diabetes, hypertension, etc.) |

The task: predict triage_acuity (1–5). Metric: accuracy. The challenge is that class 1 (resuscitation) and class 2 (emergent) are clinically distinct but often look similar in structured vitals alone.

// where the points came from

Scaling chief_complaint_raw from 50 TF-IDF features to 2000 bigrams moved CV accuracy from 0.891 to 0.9989. That was the move that mattered. Everything else was secondary.

In hindsight it's obvious. "Thunderclap headache" and "minor skin rash" carry very different acuity signals regardless of what the vitals show. The baseline had TF-IDF capped at 50 features — nowhere near enough to capture the complaint vocabulary.

| Experiment | CV Accuracy | Change |
|---|---|---|
| Baseline (TF-IDF 50 features) | 0.8910 | – |
| TF-IDF 150 features | 0.9836 | +0.0926 |
| TF-IDF 300 features | 0.9919 | +0.0083 |
| TF-IDF 500 features | 0.9948 | +0.0029 |
| TF-IDF 1000 features | 0.9980 | +0.0032 |
| TF-IDF 2000 features | 0.9989 | +0.0009 |
| + glaucoma tier (final) | 0.9995 | +0.0006 |

Going from 50 to 150 features was worth 9.3 points of accuracy — more than everything else in the project combined. When a dataset has free-text that directly describes what you are predicting, that text is the primary feature. Everything else is cleanup.

We also tried hyperparameter tuning (LightGBM num_leaves, learning_rate, subsample) after reaching 0.9980. It gave 0.9980 back — no improvement. We reverted and scaled TF-IDF instead. This is a common pattern: hyperparameter tuning cannot compensate for a missing signal.

// error analysis — every mistake, one diagnosis

After reaching 0.9989, the model had plateaued. Adding more TF-IDF features yielded diminishing returns. At this point we ran full error analysis across all 5 CV folds — 80k rows, ~39 total errors.

Every single error came from the same complaint: variants of "acute angle closure glaucoma".

Why? Because this condition sits right on the clinical boundary between acuity 1 (critical) and acuity 2 (emergent). The complaint text is identical across patients but the correct label differs. The text alone cannot resolve it — the vital signs can.

Specifically: patients labeled acuity 1 had higher NEWS2 scores, lower GCS totals, and different pain scores compared to those labeled acuity 2. The text is ambiguous; the vitals are not.

This error analysis took about 30 minutes. It was worth more than days of hyperparameter tuning would have been.
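The analysis itself is a few lines of pandas once out-of-fold predictions exist. A sketch, assuming an `oof_pred` column aligned with the training frame (the tiny frame here stands in for the real 80k rows; column names are illustrative):

```python
import pandas as pd

# Stand-in for the real training frame with out-of-fold predictions.
df = pd.DataFrame({
    "triage_acuity": [1, 2, 1, 3, 2],
    "oof_pred":      [2, 1, 1, 3, 2],
    "chief_complaint_raw": [
        "acute angle closure glaucoma",
        "acute angle closure glaucoma",
        "thunderclap headache",
        "abdominal pain",
        "acute angle closure glaucoma",
    ],
})

errors = df[df["triage_acuity"] != df["oof_pred"]]
# Group every mistake by complaint text; if one complaint dominates,
# the model has a single systematic blind spot rather than noise.
by_complaint = errors["chief_complaint_raw"].value_counts()
print(by_complaint.head())
```

On the real data, this grouping is what surfaced that every error shared one complaint text.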

// the 3-tier hybrid architecture

Once we knew the error source, the fix was straightforward: build a dedicated classifier for those 76 test rows that the text couldn't resolve, and keep everything else as a lookup.

```
test row
   │
   ▼
Complaint text in unambiguous lookup?
   ├─ YES (19,885 rows, 99.4%) ──► return label directly ──► prediction
   └─ NO
        │
        ▼
      Glaucoma variant? (15 ambiguous texts)
        ├─ YES (76 rows) ──► binary LightGBM (news2, gcs, pain, hr) ──► prediction
        └─ NO
             │
             ▼
           Unseen complaint text (39 rows) ──► full LightGBM multiclass ──► prediction
```

Tier 1 — Direct lookup (19,885 rows, 99.4%). Any complaint text that appeared in training and always mapped to the same acuity is stored in a dictionary. For these rows, no model is needed. The lookup is deterministic and perfect. A model can only be wrong; a lookup of a memorised pattern cannot.

Tier 2 — Glaucoma binary classifier (76 rows, 0.4%). Trained only on the 237 training rows with an "acute angle closure glaucoma" complaint. Features: NEWS2 score, GCS total, pain score, heart rate, systolic BP, respiratory rate, SpO2, temperature, shock index. Binary target: acuity 1 vs acuity 2. CV accuracy: 94% — compared to ~27% without it (random performance on heavily imbalanced data).

Tier 3 — Full LightGBM multiclass (39 rows, 0.2%). Complaint texts not seen during training. Falls back to the complete feature set: TF-IDF, all vitals, comorbidities, interactions. These 39 rows are genuinely novel — no lookup can help them.
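The routing itself is simple once the pieces exist. A sketch with the lookup, the ambiguous-text set, and the two models stubbed out (all names here are illustrative, not the competition code):

```python
def predict_acuity(complaint, vitals_row, lookup,
                   glaucoma_texts, glaucoma_model, full_model):
    """Route one test row through the 3-tier hybrid."""
    # Tier 1: complaint seen in training with exactly one acuity label.
    if complaint in lookup:
        return lookup[complaint]
    # Tier 2: known-ambiguous glaucoma variant: vitals-only binary model.
    if complaint in glaucoma_texts:
        return glaucoma_model.predict(vitals_row)
    # Tier 3: genuinely unseen text: fall back to the full model.
    return full_model.predict(vitals_row)

# Stubbed collaborators so the routing can be exercised standalone.
lookup = {"minor skin rash": 5}
glaucoma_texts = {"acute angle closure glaucoma"}

class StubModel:  # stands in for a fitted LightGBM model
    def __init__(self, label):
        self.label = label
    def predict(self, row):
        return self.label

print(predict_acuity("minor skin rash", None, lookup,
                     glaucoma_texts, StubModel(1), StubModel(3)))  # 5
```

The tier-1 dictionary is built from training complaints whose acuity label is unique (one `groupby`/`nunique` pass); anything ambiguous falls through to tiers 2 and 3.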

// feature engineering

All feature logic lives in features.py. The key design constraint: no leakage. Encoders are fit on training data only and applied to validation/test through a fit_params dictionary.

```python
X_train_fe, fit_params = engineer_features(X_train, is_train=True)
X_val_fe = apply_features(X_val, fit_params)
```
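A minimal version of that fit/apply split, using frequency encoding and median imputation as examples (the real `features.py` does more; the column names `arrival_mode` and `heart_rate` are illustrative):

```python
import pandas as pd

def engineer_features(X, is_train=True):
    """Fit encoders on the training frame only and return them."""
    # is_train kept for API parity with the call shown above.
    fit_params = {
        # frequency encoding fitted on train only
        "arrival_mode_freq": X["arrival_mode"].value_counts(normalize=True),
        # imputation median fitted on train only
        "heart_rate_median": X["heart_rate"].median(),
    }
    return apply_features(X, fit_params), fit_params

def apply_features(X, fit_params):
    """Apply train-fitted encoders to any split; never refit."""
    X = X.copy()
    X["arrival_mode_freq"] = (
        X["arrival_mode"].map(fit_params["arrival_mode_freq"]).fillna(0.0)
    )
    X["heart_rate"] = X["heart_rate"].fillna(fit_params["heart_rate_median"])
    return X

train = pd.DataFrame({
    "arrival_mode": ["walk-in", "walk-in", "ambulance"],
    "heart_rate": [80.0, None, 90.0],
})
X_train_fe, fit_params = engineer_features(train, is_train=True)

val = pd.DataFrame({"arrival_mode": ["helicopter"], "heart_rate": [None]})
X_val_fe = apply_features(val, fit_params)
print(X_val_fe["heart_rate"].iloc[0])  # 85.0 (train median, not val's)
```

A category unseen in training maps to frequency 0.0 rather than leaking validation statistics back into the encoder.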

What we use

| Feature group | Method | Notes |
|---|---|---|
| Chief complaint text | TF-IDF bigrams, 2000 features, sublinear TF | Dominant signal — see scaling table above |
| Categorical columns | Frequency encoding | Fitted on train only |
| Vitals with missingness | Median imputation | Medians fitted on train |
| Clinical interactions | gcs × news2, resp × spo2, pain × news2 | Clinically meaningful combinations |
| Comorbidities | 24 binary flags + sum (burden score) | From patient_history.csv |

What we drop

ed_los_hours and disposition are post-triage outcomes; they appear in train.csv but not in test.csv, so building features on them would be pure leakage. triage_nurse_id and site_id are high-cardinality identifiers that won't generalise to held-out data.
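The guard against those columns is a one-line drop, but worth making explicit and defensive (column names from the post; `errors="ignore"` keeps it safe on test.csv, where the leaky columns are absent entirely):

```python
import pandas as pd

LEAKY = ["ed_los_hours", "disposition"]       # post-triage outcomes
HIGH_CARD = ["triage_nurse_id", "site_id"]    # won't generalise

def drop_leakage(df: pd.DataFrame) -> pd.DataFrame:
    # errors="ignore": silently skip columns a split doesn't have.
    return df.drop(columns=LEAKY + HIGH_CARD, errors="ignore")

train = pd.DataFrame({"news2": [3], "ed_los_hours": [4.5], "site_id": [7]})
print(drop_leakage(train).columns.tolist())  # ['news2']
```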

// lessons

Text beats everything else when it describes the target. Before figuring this out, we spent time on vitals engineering, clinical interactions, and comorbidity combinations. None of it moved accuracy by more than 0.5%. The text change was worth 9.3 points. Treat free-text that describes your target as the primary feature from day one, not an afterthought.

Error analysis beats hyperparameter tuning. Going from 0.9989 to 0.9995 required understanding why the model was wrong, not trying parameter combinations. Every single error had the same root cause. Once you find it, the fix is obvious. Before that, you're just guessing.

Sometimes the answer is already in the training data. For 99.4% of this dataset, the right prediction was just a lookup. A lookup can't be wrong the way a model can. The model is only needed for the 0.6% where training data gives no definitive answer.


Repository ↗  ·  Live Dashboard ↗  ·  ← more writing

// related posts

→ Insurance Re-Shopping Predictor — data quality first ML
→ Building RAG From Scratch — every algorithm from first principles
→ RAGOps API — production RAG with FastAPI and pgvector