$ cat ./blog/supportops-ai-monitor.md  ·  March 2026  ·  ~10 min read

Building SupportOps AI Monitor
— What I Learned


I wanted to understand what enterprise AI platform support actually looks like from an operational standpoint. Not the ML side — the infrastructure and workflow side. What kinds of tickets does a company like OpenAI or Anthropic deal with at scale? How does a team triage hundreds of requests a day? What does API reliability monitoring actually look like when it's hooked up to a real dashboard?

The honest answer: I didn't know. So I built something to find out.

// what the project does

SupportOps AI Monitor simulates a complete support operations workflow for an AI platform. Four modules, one-way pipeline:

ticket_generator.py ──► database.py ──► ai_triage.py ─────► database.py
  Faker templates         tickets         category · sentiment   summary · api_logs
                          table           5 categories
        │                                                            │
        └────────────────────────────────────────────────────────────┘
                           app.py (Streamlit dashboard)

app.py only orchestrates and renders — no business logic, no SQL. database.py owns all persistence. ai_triage.py owns all inference. The whole thing runs in simulation mode with no API key — Gaussian latency (mean 820ms, σ 200ms), 10% failure rate across 429/500/408 status codes.

// architecture decisions (and why)

SQLite, not PostgreSQL

SQLite was the right call for a portfolio project with no deployment target defined. Zero-configuration, inspectable with any SQLite browser, schema readable without tooling.

The tradeoff: SQLite makes cloud deployment awkward. Containerised platforms have ephemeral filesystems — the database resets on every restart. If I were building this for production, I'd start with Supabase (free tier) and psycopg2. The schema queries are standard enough that the swap would be ~50 lines.
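If I were making that swap, I'd hide it behind a single connection factory so only one function changes. A hypothetical sketch — `DATABASE_URL` and `DB_PATH` are names I'm assuming here, not what the real database.py uses:

```python
import os
import sqlite3

def get_connection():
    """Return a DB connection: Postgres if DATABASE_URL is set, else SQLite."""
    url = os.environ.get("DATABASE_URL")
    if url:
        import psycopg2  # only imported on the Postgres path
        return psycopg2.connect(url)
    return sqlite3.connect(os.environ.get("DB_PATH", "db/supportops.db"))
```

Callers never know which backend they got; the remaining work in a real swap is mostly placeholder style (`?` for sqlite3 vs `%s` for psycopg2).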

Simulation mode as first-class citizen

The most useful architectural decision I made was treating simulation as the primary mode, not a fallback. Gaussian latency, real error class proportions (429/500/408), specific error type labels — the dashboard looks live even when nothing has touched the OpenAI API. This matters for demos and for development, and it forced me to think clearly about what the observability data should look like before writing any real triage logic.
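A minimal sketch of that simulated call (function names and the uniform choice among error codes are my assumptions; the post only fixes the mean, σ, and failure rate):

```python
import random

FAILURE_CODES = [429, 500, 408]  # rate limit, server error, timeout

def simulate_api_call(rng=random):
    """One fake triage call: Gaussian latency, 10% chance of failure."""
    latency_ms = max(0.0, rng.gauss(820, 200))   # mean 820 ms, sigma 200 ms
    if rng.random() < 0.10:                      # 10% failure rate
        return {"status": rng.choice(FAILURE_CODES), "latency_ms": latency_ms}
    return {"status": 200, "latency_ms": latency_ms}
```

Passing a seeded `random.Random` makes demo runs repeatable, which is handy when screenshotting the dashboard.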

// what i learned building it

Issue                                  Root cause                                Fix
applymap() crash at runtime            pandas ≥ 2.2 removed it from Styler API   .map() — one-word change
pip install fails on Python 3.14       Exact pins (==), no pre-built wheels      Minimum-version pins (>=)
Stale dashboard after writes           @st.cache_data with no invalidation       data_version counter in session state
SQLite connection leaks on exceptions  conn.close() only on success path         try/finally wrapping every DB function

The applymap() bug

The original code used df.style.applymap(...) to colour-code the ticket queue table. That is a guaranteed crash on pandas ≥ 2.2, where applymap() was removed from the Styler API in favour of .map(). One-word fix, but the lesson was about dependency pinning: exact pins (pandas==2.2.1) felt safe, yet on Python 3.14 no pre-built wheels existed and pip tried to compile pandas from source. Switching to minimum-version pins (>=) let pip pull the latest compatible wheels. The real lesson: exact pins ≠ reproducible builds. For reproducibility you need a lockfile (Poetry, pip-compile, uv).
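The Styler rename mirrors the DataFrame-level one (DataFrame.applymap became DataFrame.map in pandas 2.1), so a version-agnostic shim is a single hasattr check. A sketch, not something the project ships:

```python
import pandas as pd

def elementwise(df, func):
    # pandas >= 2.1 renamed DataFrame.applymap to DataFrame.map;
    # Styler.applymap -> Styler.map is the analogous rename.
    if hasattr(pd.DataFrame, "map"):
        return df.map(func)
    return df.applymap(func)
```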

Connection lifecycle and try/finally

The original database.py closed connections only on the success path:

# Before — connection leak on any exception
def insert_ticket(ticket):
    conn = get_connection()
    conn.execute(INSERT_SQL, ticket)
    conn.commit()
    conn.close()  # never reached if execute() throws

# After — always closes
def insert_ticket(ticket):
    conn = get_connection()
    try:
        conn.execute(INSERT_SQL, ticket)
        conn.commit()
    finally:
        conn.close()

Every single database function needed this. SQLite is forgiving at low volume, but connection leaks cause mysterious failures under load. It's also just correct Python.
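Rather than hand-writing try/finally in every function, sqlite3's own transaction context manager plus contextlib.closing collapses the pattern. A sketch under my assumptions (the schema and SQL here are placeholders, not the project's real tables):

```python
import sqlite3
from contextlib import closing

SCHEMA = "CREATE TABLE IF NOT EXISTS tickets (id INTEGER PRIMARY KEY, subject TEXT)"
INSERT_SQL = "INSERT INTO tickets (subject) VALUES (?)"

def insert_ticket(subject, db_path="db/supportops.db"):
    # closing() guarantees conn.close(); the inner `with conn`
    # commits on success and rolls back if execute() raises.
    with closing(sqlite3.connect(db_path)) as conn, conn:
        conn.execute(SCHEMA)
        conn.execute(INSERT_SQL, (subject,))
```

Note the asymmetry that makes both context managers necessary: `with conn` handles commit/rollback but deliberately does not close the connection.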

Cache busting in Streamlit

@st.cache_data caches based on function arguments. With no arguments, nothing invalidates the entry except the TTL. After generating new tickets, the dashboard would show stale data. Fix: a data_version integer in session state, incremented after every write and passed as a normal argument, so Streamlit hashes it into the cache key and a new value is a cache miss. (Don't underscore-prefix it: Streamlit skips hashing underscore-prefixed parameters, which would defeat the whole trick.)

@st.cache_data(ttl=60)
def load_all_tickets(version):   # hashed into the cache key
    return db.get_all_tickets()

# Once, at startup:
st.session_state.setdefault("data_version", 0)

# After any write:
st.session_state.data_version += 1
tickets = load_all_tickets(st.session_state.data_version)

// what i'd do differently

Service layer abstraction. Right now app.py calls database.py and ai_triage.py directly. A thin service layer — TicketService.generate_and_triage(n) — would make the business logic testable without Streamlit. There are no unit tests because everything is tangled with the UI.
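The shape I have in mind is plain dependency injection of the two modules, so tests can pass stubs. Hypothetical, since none of this exists in the repo yet:

```python
class TicketService:
    """Thin orchestration layer; no Streamlit, no SQL, no HTTP."""

    def __init__(self, db, triage, generator):
        self.db = db              # e.g. the database module
        self.triage = triage      # e.g. the ai_triage module
        self.generator = generator

    def generate_and_triage(self, n):
        results = []
        for _ in range(n):
            ticket = self.generator()
            ticket_id = self.db.insert_ticket(ticket)
            results.append((ticket_id, self.triage.classify(ticket)))
        return results
```

With this in place, app.py shrinks to wiring plus rendering, and a unit test is just two fakes and an assert.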

PostgreSQL from the start. Every cloud deployment path hits the ephemeral filesystem problem. Starting with Supabase and psycopg2 would have made deployment a non-issue.

Async triage. Triaging 100 tickets sequentially takes several minutes. An asyncio + ThreadPoolExecutor pattern would let 10–20 calls run in parallel — meaningful UX improvement for large batches.
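A sketch of that pattern, with the blocking triage call passed in as a function (the real one would live in ai_triage.py; everything here is illustrative):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def triage_batch(tickets, triage_fn, workers=10):
    """Run the blocking triage_fn over tickets on a thread pool."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [loop.run_in_executor(pool, triage_fn, t) for t in tickets]
        return await asyncio.gather(*futures)  # results keep input order
```

For 100 tickets at ~1 s per call, 10 workers cuts wall time from minutes to roughly 10 seconds, and gather preserving input order means the results still line up with the ticket list.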

// deployment challenges

SQLite + ephemeral containers

Streamlit Community Cloud, Heroku, and Render all run containers where the filesystem resets on restart. The db/supportops.db file disappears. Every demo starts empty.

Platform                     Persistence             Effort                         Best for
HuggingFace Spaces           /data dir persists      Change DB_PATH env var         Fast live demo
Supabase + Streamlit Cloud   PostgreSQL, always-on   Swap database.py (~50 lines)   Production
Railway / Render + volume    Docker volume mount     Docker config change           Full control
Streamlit Community Cloud    ❌ Ephemeral            N/A                            Not viable with SQLite
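The HuggingFace row is nearly free if database.py resolves its path from the environment. A one-line sketch, where DB_PATH is my assumed variable name rather than something already in the codebase:

```python
import os

# /data persists across restarts on HF Spaces; local default everywhere else
DB_PATH = os.environ.get("DB_PATH", "db/supportops.db")
```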

Repository ↗