Hilo · Architecture DESIGN DOC

Hilo · How it works

The short version, with diagrams.

In short: Hilo is a FastAPI chat app that wraps the Claude Agent SDK. Every user message spawns a fresh agent subprocess that connects to three HTTP MCP servers (Excel, Trustpilot, Freshchat); the app streams the agent's tokens back to the browser over Server-Sent Events and persists the result to Postgres.


1 · Overview

One EC2 instance. One Postgres database. One FastAPI server. Three long-running MCP sidecar processes. One ephemeral agent subprocess per chat turn. That's the whole production stack.

[Diagram: production stack overview]
Reading the diagram: green = our FastAPI app. Blue = the Claude Agent SDK subprocess we spawn per message. Orange = the MCP servers (Excel, Trustpilot, Freshchat). Red = Postgres. Purple = external APIs (Anthropic, Freshchat).

2 · System architecture

The stack is intentionally boring. Everything runs on a single EC2 host and is wired together with systemd. No microservices, no queue, no worker pool.

Layer | What it is
UI | Jinja2 templates + vanilla JS (no SPA framework)
HTTP server | FastAPI + sse-starlette
Orchestrator | app/services/sdk_orchestrator.py (builds config, spawns subprocess)
Agent runtime | claude-agent-sdk, spawned per message
MCP servers | 3 long-lived HTTP processes (Excel, Trustpilot, Freshchat)
Database | PostgreSQL 16 + pgvector
Streaming | Server-Sent Events with an in-memory queue per session

Three design choices worth knowing

The agent process is short-lived

The agent does not outlive a single user turn. Every message spawns a fresh subprocess. Multi-turn continuity comes from passing the SDK's session_id back for the next turn. This keeps the server stateless between turns and bounds memory and cost per turn.
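
A minimal sketch of how that hand-off could look, assuming the app remembers the last SDK session_id per conversation and adds it to the next turn's config. SessionStore, build_turn_config, and the "resume" key are hypothetical names for illustration; the in-memory dict stands in for wherever Hilo actually stores the id.

```python
import json
from typing import Optional


class SessionStore:
    """Remembers the most recent SDK session_id for each conversation."""

    def __init__(self) -> None:
        self._latest: dict[str, str] = {}

    def get(self, conversation_id: str) -> Optional[str]:
        return self._latest.get(conversation_id)

    def remember(self, conversation_id: str, sdk_session_id: str) -> None:
        self._latest[conversation_id] = sdk_session_id


def build_turn_config(store: SessionStore, conversation_id: str, message: str) -> dict:
    """Build a per-turn config; "resume" is an assumed key name."""
    config = {"prompt": message}
    previous = store.get(conversation_id)
    if previous is not None:
        config["resume"] = previous  # continue the earlier SDK session
    return config


store = SessionStore()
first = build_turn_config(store, "conv-1", "hello")
store.remember("conv-1", "sdk-session-abc")  # id reported by the finished turn
second = build_turn_config(store, "conv-1", "and then?")
print(json.dumps(second))
```

The first turn carries no session id; every later turn carries the one reported by the turn before it, so the subprocess itself never has to stay alive.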

MCP servers are long-lived sidecars

The MCP servers run continuously as systemd units on fixed localhost ports. The agent subprocess connects to them over HTTP. This lets MCP servers hold expensive resources (database connections, file handles) across many requests.

No queue, no worker pool

Each chat message turns into a background asyncio.create_task inside the FastAPI process. The task drives the agent subprocess and pushes events into an in-memory asyncio.Queue keyed by session ID; a second request — the SSE stream — drains the queue. Simpler than SQS/workers and fine for current scale.


3 · How a prompt is assembled

Before the agent ever runs, the orchestrator builds the prompt that the model will see. It's a layering process — start from a fixed foundation, then add context tailored to this user, this conversation, and the tools they have access to. Think of it as a sandwich: the bottom is generic, the top is the user's actual question, and the middle is everything we've stitched in to make the model competent.

[Diagram: prompt assembly]
Reading the diagram: five independent sources of context feed into one assembled prompt. The blue boxes on the left are the inputs; the green box on the right is what actually gets sent to the agent subprocess.

The five layers, bottom to top

1. The base system prompt — fixed text shared by every conversation. It establishes the agent's identity (who it is, today's date, the company it works for), the house style for responses, and the list of built-in tools it can use (Read, Write, Bash, Grep, etc).

2. A usage guide per attached MCP server — for each MCP the user has access to, a short paragraph is appended explaining when to reach for it. For Trustpilot it says something like "use get_summary for instant star ratings, query_reviews when you need to run SQL against the review table". This is what stops the model from blindly guessing tool names.

3. The user's identity — their email gets interpolated into the prompt so the agent knows who it's talking to.

4. A skill preamble — Hilo runs a semantic search (pgvector) against a library of Markdown "skill cards" using the last few messages of context. The top matches are listed as "here are skills relevant to this task" — the agent reads each one on demand if it decides it applies.

5. The user's actual message — appended last. This is what the user typed.
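
To make layer 4 concrete, here is a toy version of the ranking step in plain Python. It assumes cosine similarity as the metric; in pgvector, sorting ascending by the cosine-distance operator (<=>) yields the same order. The skill file names and three-dimensional embeddings are invented for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_skills(query_vec: list[float], skills: list[dict], k: int = 2) -> list[str]:
    """Return the names of the k skills closest to the query embedding."""
    ranked = sorted(
        skills,
        key=lambda s: cosine_similarity(query_vec, s["embedding"]),
        reverse=True,
    )
    return [s["name"] for s in ranked[:k]]


def skill_preamble(names: list[str]) -> str:
    """Render the matches as the text block that goes into the prompt."""
    return "\n".join(["Here are skills relevant to this task:"] + [f"- {n}" for n in names])


# Hypothetical skill cards with toy embeddings.
SKILLS = [
    {"name": "refund-policy.md", "embedding": [0.9, 0.1, 0.0]},
    {"name": "review-triage.md", "embedding": [0.1, 0.9, 0.1]},
    {"name": "excel-pivots.md", "embedding": [0.0, 0.2, 0.9]},
]

matches = top_skills([0.2, 0.8, 0.1], SKILLS)
print(skill_preamble(matches))
```

Only the names are listed in the prompt; the agent still decides whether to open each card.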

The whole stack is serialized into a single JSON config file. The orchestrator then spawns the agent subprocess and points it at that file. The model reads layers 1–4 once at the start of the turn and treats layer 5 as the prompt to respond to.
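
A rough sketch of that layering and serialization, under stated assumptions: the function names, the config keys ("system_prompt", "prompt"), the file name, and the sample texts are all hypothetical and only illustrate the order in which the layers are stacked.

```python
import json
import tempfile
from pathlib import Path


def assemble_system_prompt(base: str, mcp_guides: dict[str, str],
                           user_email: str, skills: list[str]) -> str:
    parts = [base]                                         # layer 1
    for server, guide in mcp_guides.items():               # layer 2
        parts.append(f"[{server}] {guide}")
    parts.append(f"You are talking to {user_email}.")      # layer 3
    if skills:                                             # layer 4
        listing = "\n".join(f"- {s}" for s in skills)
        parts.append("Here are skills relevant to this task:\n" + listing)
    return "\n\n".join(parts)


def write_turn_config(system_prompt: str, user_message: str, directory: Path) -> Path:
    """Serialize layers 1-4 plus the user's message (layer 5) to one JSON file."""
    config = {"system_prompt": system_prompt, "prompt": user_message}
    path = directory / "agent_config.json"
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    system_prompt = assemble_system_prompt(
        base="You are Hilo's assistant.",
        mcp_guides={"trustpilot": "Use get_summary for instant star ratings."},
        user_email="someone@example.com",
        skills=["review-triage.md"],
    )
    config_path = write_turn_config(system_prompt, "How are our reviews trending?", Path(tmp))
    loaded = json.loads(config_path.read_text(encoding="utf-8"))

print(loaded["prompt"])
```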

What the agent gets in the config besides the prompt
  • Model: claude-sonnet-4-6 with thinking enabled.
  • Tool allowlist: SDK built-ins (Bash, Read, Write, Edit, Glob, Grep, ToolSearch); WebFetch and WebSearch are disallowed.
  • MCP server endpoints: URLs and headers for each HTTP MCP the user can use (after RBAC filtering strips the ones they can't).
  • Budget: max_budget_usd: 5.0 per turn.
  • Streaming: include_partial_messages: True, so we can stream tokens to the browser as they arrive.

The RBAC filter runs before the prompt is assembled — if the user can't use the Excel MCP, then the Excel usage guide never appears in layer 2 and the Excel tools never appear in the catalogue. The model can't even consider calling a tool it can't see.
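
The effect of filtering first can be sketched as follows. This is an assumption-laden illustration: filter_mcp_servers and build_agent_config are hypothetical names, the "allowed_tools", "disallowed_tools", "mcp_servers", and "mcp_guides" keys are guesses at the config shape, and the Excel URL is a placeholder (only the Trustpilot URL appears in this doc).

```python
def filter_mcp_servers(all_servers: dict[str, dict], allowed: set[str]) -> dict[str, dict]:
    """Keep only the MCP servers this user is allowed to use."""
    return {name: cfg for name, cfg in all_servers.items() if name in allowed}


def build_agent_config(all_servers: dict[str, dict], guides: dict[str, str],
                       allowed: set[str]) -> dict:
    servers = filter_mcp_servers(all_servers, allowed)
    return {
        "model": "claude-sonnet-4-6",
        "allowed_tools": ["Bash", "Read", "Write", "Edit", "Glob", "Grep", "ToolSearch"],
        "disallowed_tools": ["WebFetch", "WebSearch"],
        "mcp_servers": servers,
        # Usage guides are derived from the filtered set, so a blocked
        # server never leaves a trace in the prompt.
        "mcp_guides": {name: guides[name] for name in servers if name in guides},
        "max_budget_usd": 5.0,
        "include_partial_messages": True,
    }


ALL_SERVERS = {
    "excel": {"type": "http", "url": "http://127.0.0.1:9001/mcp"},  # placeholder URL
    "trustpilot": {"type": "http", "url": "http://127.0.0.1:57401/mcp"},
}
GUIDES = {
    "excel": "Use read_sheet for paginated reads.",
    "trustpilot": "Use get_summary for instant star ratings.",
}

config = build_agent_config(ALL_SERVERS, GUIDES, allowed={"trustpilot"})
print(sorted(config["mcp_servers"]))
```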


4 · A single chat turn

The browser sends a POST to kick off the turn and immediately opens an EventSource to stream the response. Both share an in-memory queue keyed by session ID.
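
For reference, this is roughly what one event looks like on the wire when it reaches the EventSource. The sketch formats a frame by hand according to the Server-Sent Events text format; in the app itself sse-starlette does this job, and the event name and payload shape here are assumptions.

```python
import json
from typing import Optional


def sse_frame(event: str, data: dict, event_id: Optional[int] = None) -> str:
    """Format one Server-Sent Events frame; a blank line terminates it."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps emits no raw newlines, so the payload fits on one data line.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


frame = sse_frame("token", {"text": "Hi"}, event_id=1)
print(frame, end="")
```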

[Diagram: sequence of a single chat turn]
The POST returns fast because it only schedules the background task. Time-to-first-token is dominated by the model. The user message is persisted immediately, so it shows up in history even if the agent crashes; the assistant message is persisted in a single transaction once the SDK emits its result event.

What the background task does
  1. Saves the user message to messages.
  2. Calls sdk_orchestrator.run_agent() and iterates over its events.
  3. For each event: tags with seq and ui_category, pushes to the SSE queue, accumulates into in-memory content blocks.
  4. Tracks every tool call (audit events) and failure (tool_failure events).
  5. On the result event: writes the assistant message + metadata (cost, tokens, duration) + tool events in one transaction.
  6. Pushes None to close the SSE stream.
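
The six steps above can be sketched as one coroutine. Everything here is a stand-in: fake_run_agent replaces sdk_orchestrator.run_agent(), the event shapes and ui_category values are assumptions, and a plain list plays the role of Postgres.

```python
import asyncio
from typing import AsyncIterator


async def fake_run_agent() -> AsyncIterator[dict]:
    """Stand-in for sdk_orchestrator.run_agent(); event shapes are assumed."""
    yield {"type": "text_delta", "text": "4 stars "}
    yield {"type": "tool_use", "name": "mcp__trustpilot__get_summary"}
    yield {"type": "text_delta", "text": "on average."}
    yield {"type": "result", "cost_usd": 0.01}


def ui_category(event: dict) -> str:
    mapping = {"text_delta": "answer", "tool_use": "tool", "result": "meta"}
    return mapping.get(event["type"], "other")


async def background_task(user_message: str, sse_queue: asyncio.Queue, db: list) -> None:
    db.append(("user", user_message))                       # step 1
    seq, text_parts, tool_events = 0, [], []
    async for event in fake_run_agent():                    # step 2
        seq += 1
        tagged = {**event, "seq": seq, "ui_category": ui_category(event)}
        await sse_queue.put(tagged)                         # step 3
        if event["type"] == "text_delta":
            text_parts.append(event["text"])
        elif event["type"] == "tool_use":
            tool_events.append({"kind": "audit", "tool": event["name"]})  # step 4
        elif event["type"] == "result":
            # Step 5: one append stands in for the single transaction.
            db.append(("assistant", "".join(text_parts), event["cost_usd"], tool_events))
    await sse_queue.put(None)                               # step 6


async def demo():
    queue, db = asyncio.Queue(), []
    await background_task("How are reviews?", queue, db)
    streamed = []
    while (item := queue.get_nowait()) is not None:
        streamed.append(item)
    return streamed, db


streamed, db = asyncio.run(demo())
print(db[1][1])
```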

5 · MCP servers and their tools

The agent's "hands" are three Model Context Protocol servers — each a small HTTP service that exposes a handful of tools to the model. The SDK discovers them at session start and surfaces every tool as mcp__<server>__<tool> in the model's tool catalogue.
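
A small helper pair makes the naming scheme concrete. The function names are hypothetical; the sketch assumes server names never contain a double underscore, since the format would be ambiguous otherwise.

```python
def qualified_tool_name(server: str, tool: str) -> str:
    """Build the name the model sees for an MCP tool."""
    return f"mcp__{server}__{tool}"


def split_tool_name(name: str) -> tuple[str, str]:
    """Recover (server, tool) from a qualified name; reject anything else."""
    parts = name.split("__", 2)
    if len(parts) != 3 or parts[0] != "mcp":
        raise ValueError(f"not an MCP tool name: {name!r}")
    return parts[1], parts[2]


name = qualified_tool_name("trustpilot", "get_summary")
print(name)
print(split_tool_name("mcp__freshchat__find_customer_history"))
```

Built-in tools such as Bash carry no prefix, which is why split_tool_name raises for them. That makes the prefix a cheap way to tell MCP calls apart when auditing.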

[Diagram: MCP servers and their tools]

Excel MCP — spreadsheet over HTTP

Reads (and can write) a single xlsx workbook sitting on disk. The agent uses it to look up data the team has dropped into a spreadsheet — orders, customer lists, pricing, whatever.

Tool | What it does
get_workbook_info | Workbook metadata: sheet count, file size, last modified.
list_sheets | Names of all sheets in the workbook.
get_headers | Column header row for a given sheet, so the agent knows the schema before reading.
read_sheet | Paginated read of a sheet's rows (limit, offset).
get_cell | Read one cell by A1-style address (e.g. B14).
search_rows | Keyword search across all cells in a sheet; returns matching rows.
filter_rows | Exact-match filter: "rows where column status = active".
count_by_column | Group-by aggregation: how many rows for each value of a column.
write_cell | Mutate one cell. Use sparingly: this writes to the live workbook.
append_row | Add a new row to a sheet (values as JSON array).
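
To illustrate what three of the read tools return, here is their behaviour over an in-memory list of rows. This only mirrors the semantics described in the table; the real server reads an xlsx workbook, and the sample data and parameter names other than limit and offset are assumptions.

```python
from collections import Counter

# Hypothetical sheet contents, one dict per row keyed by column header.
ROWS = [
    {"order_id": 1, "status": "active", "region": "EU"},
    {"order_id": 2, "status": "closed", "region": "US"},
    {"order_id": 3, "status": "active", "region": "US"},
]


def read_sheet(rows: list[dict], limit: int = 100, offset: int = 0) -> list[dict]:
    """Paginated read: skip `offset` rows, then return at most `limit`."""
    return rows[offset:offset + limit]


def filter_rows(rows: list[dict], column: str, value) -> list[dict]:
    """Exact-match filter on one column."""
    return [row for row in rows if row.get(column) == value]


def count_by_column(rows: list[dict], column: str) -> dict:
    """Group-by count: number of rows for each distinct value."""
    return dict(Counter(row.get(column) for row in rows))


page = read_sheet(ROWS, limit=2, offset=1)
active = filter_rows(ROWS, "status", "active")
counts = count_by_column(ROWS, "status")
print(counts)
```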

Trustpilot MCP — reviews from the Trustpilot database

Backs onto a Postgres database holding ingested Trustpilot reviews. The agent reaches for this whenever the user asks about reputation, sentiment, churn risk, or specific reviewer complaints.

Tool | What it does
get_summary | One-shot aggregate: star distribution, sentiment counts, response lag, churn-risk indicators. Pre-computed, so it's instant.
query_reviews | Run a SQL query against the trustpilot_reviews table. Use when you need detail the summary doesn't have.
search_reviews_semantic | Vector search over review text: "find reviews where customers complained about shipping" returns the closest matches by meaning, not just keywords.

Freshchat MCP — live customer support history

Hits the Freshchat REST API to pull a specific customer's chat history. Used when the agent is helping with a support question and needs to know what's already been said.

Tool | What it does
find_customer_history | Given an email (or first name + last name), looks up the customer in Freshchat and returns their conversation history with timestamps and a short per-conversation summary.

How an MCP gets attached at runtime

For each MCP, the orchestrator reads the endpoint URL from environment settings and adds an entry to the agent config:

{
  "trustpilot": {
    "type": "http",
    "url": "http://127.0.0.1:57401/mcp",
    "headers": { ... }
  }
}

The SDK opens an HTTP client per server, calls each server's introspection endpoint, and registers every returned tool. When the model calls a tool, the SDK POSTs the invocation to the right server and feeds the result back to the model as a tool result.
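
A minimal sketch of the orchestrator side of this, assuming one environment variable per server. The variable naming scheme (<NAME>_MCP_URL) and both function names are hypothetical; the headers are elided in this doc, so they are left empty here.

```python
import os
from typing import Mapping, Optional


def mcp_entry_from_env(name: str, env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Build one config entry, or None if the server has no URL configured."""
    url = env.get(f"{name.upper()}_MCP_URL")  # assumed variable name
    if not url:
        return None
    # Headers are not specified in the source, so none are invented here.
    return {"type": "http", "url": url, "headers": {}}


def build_mcp_servers(names: list[str], env: Mapping[str, str] = os.environ) -> dict[str, dict]:
    servers = {}
    for name in names:
        entry = mcp_entry_from_env(name, env)
        if entry is not None:
            servers[name] = entry
    return servers


fake_env = {"TRUSTPILOT_MCP_URL": "http://127.0.0.1:57401/mcp"}
servers = build_mcp_servers(["excel", "trustpilot"], env=fake_env)
print(servers)
```

A server with no URL configured is simply left out, so the agent never tries to connect to it.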


Where to look in the code
What you want to understand | File
How the prompt is assembled | app/services/sdk_orchestrator.py
The base system prompt | app/services/sdk_orchestrator.py (BASE_AGENT_CONFIG)
The agent subprocess entrypoint | agent/main.py
POST /messages + SSE stream | app/routers/chat.py
RBAC filtering | app/services/rbac_service.py
Skill retrieval (pgvector) | app/services/skill_retriever.py
MCP server implementations | excel_mcp/, trustpilot_mcp/, freshchat_mcp/