I wanted to understand what enterprise AI platform support actually looks like from an operational standpoint. Not the ML side — the infrastructure and workflow side. What kinds of tickets does a company like OpenAI or Anthropic deal with at scale? How does a team triage hundreds of requests a day? What does API reliability monitoring actually look like when it's hooked up to a real dashboard?
The honest answer: I didn't know. So I built something to find out.
// what the project does
SupportOps AI Monitor simulates a complete support operations workflow for an AI platform. Four modules, one-way pipeline:
app.py only orchestrates and renders — no business logic, no SQL. database.py owns
all persistence. ai_triage.py owns all inference. The whole thing runs in simulation mode with no
API key — Gaussian latency (mean 820ms, σ 200ms), 10% failure rate across 429/500/408 status codes.
// architecture decisions (and why)
SQLite, not PostgreSQL
SQLite was the right call for a portfolio project with no deployment target defined. Zero-configuration, inspectable with any SQLite browser, schema readable without tooling.
The tradeoff: SQLite makes cloud deployment awkward. Containerised platforms have ephemeral filesystems — the
database resets on every restart. If I were building this for production, I'd start with Supabase (free tier)
and psycopg2. The schema queries are standard enough that the swap would be ~50 lines.
Simulation mode as first-class citizen
The most useful architectural decision I made was treating simulation as the primary mode, not a fallback. Gaussian latency, real error class proportions (429/500/408), specific error type labels — the dashboard looks live even when nothing has touched the OpenAI API. This matters for demos, for development, and it forced me to think clearly about what the observability data should look like before writing any real triage logic.
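A minimal sketch of what that simulated call path could look like — the distribution parameters come from the numbers above, but simulate_api_call and the response shape are illustrative, not the project's actual API:

```python
import random

ERROR_CODES = [429, 500, 408]  # rate limit, server error, timeout

def simulate_api_call(rng=random):
    """Return a fake triage response with realistic latency and error mix."""
    latency_ms = max(0.0, rng.gauss(820, 200))  # Gaussian latency, clamped at 0
    if rng.random() < 0.10:  # 10% of calls fail
        return {"ok": False, "status": rng.choice(ERROR_CODES), "latency_ms": latency_ms}
    return {"ok": True, "status": 200, "latency_ms": latency_ms}
```

Because the simulator returns the same shape as a real call, every downstream chart and table works identically in both modes.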
// what i learned building it
| Issue | Root cause | Fix |
|---|---|---|
| applymap() crash at runtime | pandas ≥ 2.2 removed it from the Styler API | .map() — one-word change |
| pip install fails on Python 3.14 | Exact pins (==), no pre-built wheels | Minimum-version pins (>=) |
| Stale dashboard after writes | @st.cache_data with no invalidation | data_version counter in session state |
| SQLite connection leaks on exceptions | conn.close() only on success path | try/finally wrapping every DB function |
The applymap() bug
The original code used df.style.applymap(...) to colour-code the ticket queue table. This is a
guaranteed crash on pandas >= 2.2.0 — .applymap() was removed from the Styler API and replaced with
.map(). A one-word fix, but the lesson was about dependency pinning: exact pins
(pandas==2.2.1) felt safe until Python 3.14, where no pre-built wheels existed and pip tried to compile
pandas from source. Switching to minimum-version pins (>=) let pip pull the latest compatible
wheels. The real lesson: exact pins ≠ reproducible builds. For reproducibility you need a lockfile (Poetry,
pip-compile, uv).
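For concreteness, a sketch of the lockfile split: the constraints file you edit stays loose, and the compiled lockfile carries the exact pins. Package versions below are illustrative, not the project's actual dependency set.

```
# requirements.in — edited by hand, loose constraints
pandas>=2.2
streamlit>=1.30

# requirements.txt — generated (e.g. pip-compile requirements.in), exact pins
pandas==2.2.3
streamlit==1.37.1
```

You get reproducibility from the lockfile and flexibility from the constraints file, instead of trying to make one requirements.txt do both jobs.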
Connection lifecycle and try/finally
The original database.py closed connections only on the success path:
```python
# Before — connection leak on any exception
def insert_ticket(ticket):
    conn = get_connection()
    conn.execute(INSERT_SQL, ticket)
    conn.commit()
    conn.close()  # never reached if execute() throws
```

```python
# After — always closes
def insert_ticket(ticket):
    conn = get_connection()
    try:
        conn.execute(INSERT_SQL, ticket)
        conn.commit()
    finally:
        conn.close()
```
Every single database function needed this. SQLite is forgiving at low volume, but connection leaks cause mysterious failures under load. It's also just correct Python.
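A variant worth knowing: contextlib.closing collapses the try/finally into a with statement, and sqlite3 connections double as transaction context managers (commit on success, rollback on exception — note they do not close themselves, which is why closing() is still needed). The sketch below assumes a flat tickets schema for illustration; the real database.py tables differ.

```python
import sqlite3
from contextlib import closing

def insert_ticket(ticket, db_path="db/supportops.db"):
    # closing() guarantees conn.close() on every exit path;
    # `with conn:` commits on success and rolls back if the block raises
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:
            conn.execute(
                "INSERT INTO tickets (subject, priority) VALUES (?, ?)", ticket
            )
```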
Cache busting in Streamlit
@st.cache_data caches based on function arguments. With no arguments, every call returns the same
cached entry until the TTL expires — so after generating new tickets, the dashboard showed stale data.
Fix: a data_version integer in session state, incremented after every write and passed as a regular
argument, so it becomes part of the cache key and any change busts the cache. (An underscore prefix
would be wrong here: Streamlit excludes _-prefixed parameters from hashing entirely, so changing them
would not invalidate anything.)

```python
@st.cache_data(ttl=60)
def load_all_tickets(version):  # hashed into the cache key
    return db.get_all_tickets()

# After any write:
st.session_state.data_version += 1
tickets = load_all_tickets(st.session_state.data_version)
```
// what i'd do differently
Service layer abstraction. Right now app.py calls database.py and
ai_triage.py directly. A thin service layer — TicketService.generate_and_triage(n) —
would make the business logic testable without Streamlit. There are no unit tests because everything is tangled
with the UI.
PostgreSQL from the start. Every cloud deployment path hits the ephemeral filesystem problem.
Starting with Supabase and psycopg2 would have made deployment a non-issue.
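The swap can hide behind the one connection factory. A hypothetical sketch — SUPABASE_DB_URL is an assumed environment variable, not something the project defines:

```python
import os

def get_connection():
    url = os.environ.get("SUPABASE_DB_URL")  # assumed env var, set in cloud deploys
    if url:
        import psycopg2  # Postgres path (Supabase)
        return psycopg2.connect(url)
    import sqlite3  # local fallback: the existing file-backed database
    return sqlite3.connect("db/supportops.db")
```

The rest of the diff is mostly placeholder syntax (SQLite's ? vs psycopg2's %s) and a few type tweaks, which is roughly where the ~50-line estimate comes from.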
Async triage. Triaging 100 tickets sequentially takes several minutes. An asyncio
+ ThreadPoolExecutor pattern would let 10–20 calls run in parallel — meaningful UX improvement for
large batches.
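A sketch of that pattern, with a stand-in triage_ticket in place of the real blocking call in ai_triage.py (the function names and the 10-worker cap are illustrative):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def triage_ticket(ticket):
    # stand-in for the real blocking API call in ai_triage.py
    return {"id": ticket["id"], "priority": "high"}

async def triage_batch(tickets, max_workers=10):
    # run blocking calls on a thread pool; gather preserves input order
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [loop.run_in_executor(pool, triage_ticket, t) for t in tickets]
        return await asyncio.gather(*futures)

# results = asyncio.run(triage_batch(tickets))
```

Capping the pool size also doubles as a crude rate limiter, which matters once the real API's 429s are in play.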
// deployment challenges
SQLite + ephemeral containers
Streamlit Community Cloud, Heroku, and Render all run containers where the filesystem resets on restart. The
db/supportops.db file disappears. Every demo starts empty.
| Platform | Persistence | Effort | Best for |
|---|---|---|---|
| HuggingFace Spaces | /data dir persists | Change DB_PATH env var | Fast live demo |
| Supabase + Streamlit Cloud | PostgreSQL, always-on | Swap database.py (~50 lines) | Production |
| Railway / Render + volume | Docker volume mount | Docker config change | Full control |
| Streamlit Community Cloud | ❌ Ephemeral | N/A | Not viable with SQLite |