Hilo · How it works
The short version, with diagrams.
In one sentence: Hilo is a FastAPI chat app that wraps the Claude Agent SDK. Every user message spawns a fresh agent subprocess that connects to a handful of HTTP MCP servers (Trustpilot, Freshchat), streams tokens back to the browser over Server-Sent Events, and persists the result to Postgres.
1 · Overview
One EC2 instance. One Postgres database. One FastAPI server. Two long-running MCP sidecar processes. One ephemeral agent subprocess per chat turn. That's the whole production stack.
Reading the diagram: green = our FastAPI app. Blue = the Claude Agent SDK subprocess we spawn per message. Orange = the MCP servers (Trustpilot, Freshchat). Red = Postgres. Purple = external APIs (Anthropic, Freshchat).
2 · How Trustpilot reviews get enriched
A raw Trustpilot review is just unstructured text: a title, a body, a star rating, a username, a timestamp. Two separate processes transform each review at ingest time before it's useful to the agent.
- AI extraction
- Semantic embedding (RAG)
What each extractor is actually doing
Every review ends up with ~25 columns. Nine are produced by the LLM, fanned out from the raw review:
The remaining columns — populated without any LLM:
review_id · review_created_utc · username · title · content · stars · language · source · company_response
Mechanical derivations — computed from the raw fields
- has_response: true if company_response is non-null
- response_lag_days: days between review_created_utc and the response timestamp
- review_month: YYYY-MM truncation of the review date, for time-bucketing
- review_word_count: word count of the review body
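The four derivations above are simple enough to sketch in a few lines. This is an illustration rather than the pipeline's actual code: the function name and the response_created_utc parameter are assumptions, since the source does not name the field that holds the response timestamp.

```python
from datetime import datetime, timezone
from typing import Optional


def derive_fields(
    review_created_utc: datetime,
    content: str,
    company_response: Optional[str],
    response_created_utc: Optional[datetime],  # hypothetical field name
) -> dict:
    """Compute the four mechanical columns from raw review fields."""
    has_response = company_response is not None
    lag_days = None
    if has_response and response_created_utc is not None:
        # Whole days between the review and the company's reply.
        lag_days = (response_created_utc - review_created_utc).days
    return {
        "has_response": has_response,
        "response_lag_days": lag_days,
        # YYYY-MM truncation, used for time-bucketing.
        "review_month": review_created_utc.strftime("%Y-%m"),
        # Whitespace-separated word count of the review body.
        "review_word_count": len(content.split()),
    }
```

None of this needs an LLM call, which is why these columns can be recomputed cheaply whenever the raw data changes.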
Example: one review, before and after
Here's a single Trustpilot review as it arrives, and what ends up in the database after the pipeline runs.
Before — the raw review
That's all the platform receives: a star count, a title, a body, a username, a timestamp, a raw language tag. The rest has to be inferred.
After — the same review, broken down into columns
Each row below is one column on the trustpilot_reviews table. The right column shows what the extractors produced for this specific review.
| Column | Extracted value |
|---|---|
| stars | 1 |
| sentiment | negative |
| topic | accuracy |
| has_refund_request | true |
| churn_risk | true |
| competitor_mention | true |
| competitor_name | Omron M7 |
| praise_aspect | NULL |
| language_label | English |
| has_response | true |
| response_lag_days | 8 |
| review_word_count | 67 |
| review_month | 2025-04 |
| embedding | [0.0143, -0.0274, 0.1129, …] |
That's one review. Multiply this across every review in the database and the agent can answer "which competitors are customers comparing us to?" or "how many people said they're cancelling?" with a single SQL query — no LLM call needed at question time.
Twenty reviews after extraction
Here's what twenty rows of the trustpilot_reviews table look like once the pipeline has run. Rows are grouped by sentiment (negative → neutral → positive) so the column patterns are easy to read top-down.
| Review | stars | sentiment | topic | refund | churn | competitor | praise | lang | month |
|---|---|---|---|---|---|---|---|---|---|
| Inaccurate readings and slow support (Hans M.) | 1 | negative | accuracy | ✓ | ✓ | Omron M7 | — | English | 2025-04 |
| Geld zurück, schalte auf Withings (Dieter K.) | 1 | negative | refund | ✓ | ✓ | Withings | — | German | 2025-03 |
| Akku defekt nach drei Wochen (Greta S.) | 1 | negative | hardware | ✓ | — | — | — | German | 2025-05 |
| Llegó después de tres semanas (Beatriz O.) | 1 | negative | shipping | — | — | — | — | Spanish | 2025-05 |
| App crashes, going back to Fitbit (Niko F.) | 1 | negative | app | ✓ | ✓ | Fitbit | — | English | 2025-05 |
| Si scollega in continuazione (Sofia P.) | 2 | negative | connectivity | — | — | — | — | Italian | 2025-01 |
| Switching to Apple Watch, support never replied (Robert F.) | 2 | negative | support | — | ✓ | Apple Watch | — | English | 2025-04 |
| Werte 10 Punkte unter Omron (Helga R.) | 2 | negative | accuracy | — | — | Omron | — | German | 2025-02 |
| Strap broke — Withings was sturdier (Anders T.) | 2 | negative | hardware | — | ✓ | Withings | — | English | 2025-01 |
| €349 ist zu viel für die Funktionen (Erika H.) | 3 | neutral | pricing | — | — | — | — | German | 2025-02 |
| L'app marche, sans plus (Anne L.) | 3 | neutral | app | — | — | — | — | French | 2025-02 |
| Très confortable pour la nuit (Pierre G.) | 4 | positive | positive_experience | — | — | — | comfort | French | 2025-01 |
| Super App, klare Charts (Klaus W.) | 4 | positive | app | — | — | — | app | German | 2025-03 |
| Helped me cut salt, sleep is better (Lars J.) | 4 | positive | positive_experience | — | — | — | lifestyle_insights | English | 2025-04 |
| Worth every euro for peace of mind (Marta R.) | 5 | positive | positive_experience | — | — | — | value | English | 2025-02 |
| Genauere Werte als erwartet (Jürgen B.) | 5 | positive | accuracy | — | — | — | accuracy | German | 2025-04 |
| Servizio clienti gentile e rapido (Lucia M.) | 5 | positive | support | — | — | — | support | Italian | 2025-05 |
| Mi cardiólogo confía en los datos (Carlos D.) | 5 | positive | positive_experience | — | — | — | medical_usefulness | Spanish | 2025-03 |
| Olvidarme del brazalete tradicional (Tomás V.) | 5 | positive | positive_experience | — | — | — | passive_monitoring | Spanish | 2025-03 |
| Bon rapport qualité-prix (Camille B.) | 5 | positive | positive_experience | — | — | — | value | French | 2025-04 |
A few patterns jump out without any analysis: competitor_name almost always lines up with negative sentiment plus churn-risk; praise_aspect only fires on positive reviews; explicit refund requests cluster at 1-star; topic = positive_experience is what most 5-star reviews get bucketed into. Now imagine the agent running WHERE churn_risk = true over 3,500 rows in milliseconds.
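To make the "single SQL query" claim concrete, here is a small sketch of the kind of filter the agent might run. It uses an in-memory SQLite database as a stand-in for Postgres, with a cut-down trustpilot_reviews table and a few of the rows shown above. The function name is illustrative, and SQLite stores booleans as 0/1, so the predicate is written churn_risk = 1 rather than churn_risk = true.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; only four columns are modelled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trustpilot_reviews ("
    "title TEXT, stars INTEGER, churn_risk INTEGER, competitor_name TEXT)"
)
conn.executemany(
    "INSERT INTO trustpilot_reviews VALUES (?, ?, ?, ?)",
    [
        ("Inaccurate readings and slow support", 1, 1, "Omron M7"),
        ("Geld zurück, schalte auf Withings", 1, 1, "Withings"),
        ("App crashes, going back to Fitbit", 1, 1, "Fitbit"),
        ("Akku defekt nach drei Wochen", 1, 0, None),
        ("Werte 10 Punkte unter Omron", 2, 0, "Omron"),
        ("Worth every euro for peace of mind", 5, 0, None),
    ],
)


def competitors_among_churners(db: sqlite3.Connection) -> list[tuple[str, int]]:
    """Count competitor mentions among reviews flagged as churn risks."""
    return db.execute(
        "SELECT competitor_name, COUNT(*) AS n "
        "FROM trustpilot_reviews "
        "WHERE churn_risk = 1 AND competitor_name IS NOT NULL "
        "GROUP BY competitor_name "
        "ORDER BY n DESC, competitor_name "
        "LIMIT 200"
    ).fetchall()
```

Because the extraction already happened at ingest time, answering the question is a plain aggregate over indexed columns.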
Two tables, two access patterns
trustpilot_reviews — one row per review. Holds raw fields plus every extracted column above plus the embedding vector. This is what query_reviews (SQL) and search_reviews_semantic (vector) read from.
trustpilot_summary — one row, period. Pre-computed aggregates: total reviews, sentiment counts, average stars, response coverage, churn count, competitor mention count and breakdown, date range. This is what get_summary reads — a single-row SELECT that returns in milliseconds.
The summary table is rebuilt whenever the reviews table is refreshed with new data. The agent is taught (via the tool docstrings) to reach for get_summary first for any overview question, and only drop down to query_reviews when it needs filtered detail.
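The rebuild step amounts to collapsing all review rows into one row of aggregates. The sketch below shows the idea in plain Python; the function name and the output key names are assumptions, as the source lists the aggregates but not the column names of trustpilot_summary.

```python
from collections import Counter


def build_summary(reviews: list[dict]) -> dict:
    """Collapse review rows into a single pre-computed summary row."""
    total = len(reviews)
    months = sorted(r["review_month"] for r in reviews)
    competitors = Counter(
        r["competitor_name"] for r in reviews if r["competitor_name"]
    )
    return {
        # Key names are illustrative, not the real column names.
        "total_reviews": total,
        "sentiment_counts": dict(Counter(r["sentiment"] for r in reviews)),
        "average_stars": round(sum(r["stars"] for r in reviews) / total, 2)
        if total
        else None,
        "response_coverage": sum(r["has_response"] for r in reviews) / total
        if total
        else None,
        "churn_count": sum(r["churn_risk"] for r in reviews),
        "competitor_mentions": dict(competitors),
        "date_range": (months[0], months[-1]) if months else None,
    }
```

Doing this work once per refresh is what lets get_summary stay a single-row read at question time.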
3 · System architecture
The stack is intentionally boring. Everything runs on a single EC2 host and is wired together with systemd. No microservices, no queue, no worker pool.
| Layer | What it is |
|---|---|
| UI | Jinja2 templates + vanilla JS (no SPA framework) |
| HTTP server | FastAPI + sse-starlette |
| Orchestrator | app/services/sdk_orchestrator.py — builds config, spawns subprocess |
| Agent runtime | claude-agent-sdk, spawned per message |
| MCP servers | 2 long-lived HTTP processes (Trustpilot, Freshchat) |
| Database | PostgreSQL 16 + pgvector |
| Streaming | Server-Sent Events with an in-memory queue per session |
Three design choices worth knowing
The agent process is short-lived
The agent does not outlive a single user turn. Every message spawns a fresh subprocess. Multi-turn continuity comes from passing the SDK's session_id back for the next turn. This keeps the server stateless between turns and bounds memory and cost per turn.
MCP servers are long-lived sidecars
The MCP servers run continuously as systemd units on fixed localhost ports. The agent subprocess connects to them over HTTP. This lets MCP servers hold expensive resources (database connections, file handles) across many requests.
No queue, no worker pool
Each chat message turns into a background asyncio.create_task inside the FastAPI process. The task drives the agent subprocess and pushes events into an in-memory asyncio.Queue keyed by session ID; a second request — the SSE stream — drains the queue. Simpler than SQS/workers and fine for current scale.
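The producer/consumer pattern described above can be sketched with the standard library alone. The names here (queues, queue_for, run_turn, sse_stream) are illustrative and not taken from the Hilo codebase. The real producer drives the agent subprocess rather than replaying a list, and the real consumer sits behind an SSE endpoint.

```python
import asyncio

# One queue per chat session, held in process memory (lost on restart).
queues: dict[str, asyncio.Queue] = {}


def queue_for(session_id: str) -> asyncio.Queue:
    """Return the session's queue, creating it on first use."""
    return queues.setdefault(session_id, asyncio.Queue())


async def run_turn(session_id: str, events: list[dict]) -> None:
    """Producer: stands in for the task that drives the agent subprocess."""
    q = queue_for(session_id)
    for event in events:
        await q.put(event)
    await q.put(None)  # sentinel: tells the SSE side the turn is over


async def sse_stream(session_id: str):
    """Consumer: what the SSE endpoint would iterate over."""
    q = queue_for(session_id)
    while (event := await q.get()) is not None:
        yield event
    queues.pop(session_id, None)  # drop the queue once the turn is drained


async def demo() -> list[dict]:
    events = [{"type": "token", "text": "Hi"}, {"type": "result"}]
    task = asyncio.create_task(run_turn("s1", events))
    received = [e async for e in sse_stream("s1")]
    await task
    return received
```

The trade-off is that queued events live only in the FastAPI process, so this works for a single server but would need rethinking before running multiple instances.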
4 · How a prompt is assembled
Before the agent ever runs, the orchestrator builds the prompt that the model will see. It's a layering process — start from a fixed foundation, then add context tailored to this user, this conversation, and the tools they have access to. Think of it as a sandwich: the bottom is generic, the top is the user's actual question, and the middle is everything we've stitched in to make the model competent.
Reading the diagram: five independent sources of context feed into one assembled prompt. The blue boxes on the left are the inputs; the green box on the right is what actually gets sent to the agent subprocess.
The five layers, bottom to top
1. The base system prompt — fixed text shared by every conversation. It establishes the agent's identity (who it is, today's date, the company it works for), the house style for responses, and the list of built-in tools it can use (Read, Write, Bash, Grep, etc).
2. A usage guide per attached MCP server — for each MCP the user has access to, a short paragraph is appended explaining when to reach for it. For Trustpilot it says something like "use get_summary for instant star ratings, query_reviews when you need to run SQL against the review table". This is what stops the model from blindly guessing tool names.
3. The user's identity — their email gets interpolated into the prompt so the agent knows who it's talking to.
4. A skill preamble — Hilo runs a semantic search (pgvector) against a library of Markdown "skill cards" using the last few messages of context. The top matches are listed as "here are skills relevant to this task" — the agent reads each one on demand if it decides it applies.
5. The user's actual message — appended last. This is what the user typed.
The whole stack is serialized into a single JSON config file. The orchestrator then spawns the agent subprocess and points it at that file. The model reads layers 1–4 once at the start of the turn and treats layer 5 as the prompt to respond to.
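The layering can be sketched as a function that concatenates the first four layers into a system prompt and keeps the user's message separate. Everything named here is hypothetical: the PromptInputs dataclass, the assemble_config function, and the system_prompt / prompt keys are placeholders, since the source does not show the real config schema.

```python
import json
from dataclasses import dataclass


@dataclass
class PromptInputs:
    base_prompt: str              # layer 1: fixed text
    mcp_guides: dict[str, str]    # layer 2: server name -> usage paragraph
    user_email: str               # layer 3
    skill_titles: list[str]       # layer 4: top semantic-search matches
    user_message: str             # layer 5


def assemble_config(inputs: PromptInputs) -> str:
    """Stack layers 1-4 into one system prompt; layer 5 is the prompt."""
    sections = [inputs.base_prompt]
    sections += [f"{name}: {guide}" for name, guide in inputs.mcp_guides.items()]
    sections.append(f"You are talking to {inputs.user_email}.")
    if inputs.skill_titles:
        listing = "\n".join(f"- {title}" for title in inputs.skill_titles)
        sections.append("Here are skills relevant to this task:\n" + listing)
    config = {
        "system_prompt": "\n\n".join(sections),  # hypothetical key name
        "prompt": inputs.user_message,           # hypothetical key name
    }
    return json.dumps(config, indent=2)
```

The order of the sections list mirrors the order of the layers, so adding a sixth source of context would be a one-line change.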
What the agent gets in the config besides the prompt
- Model: claude-sonnet-4-6 with thinking enabled.
- Tool allowlist: SDK built-ins (Bash, Read, Write, Edit, Glob, Grep, ToolSearch); WebFetch/WebSearch disallowed.
- MCP server endpoints: URLs and headers for each HTTP MCP the user can use (after RBAC filtering strips ones they can't).
- Budget: max_budget_usd: 5.0 per turn.
- Streaming: include_partial_messages: True so we can stream tokens to the browser as they arrive.
The RBAC filter runs before the prompt is assembled — if the user can't use the Trustpilot MCP, then the Trustpilot usage guide never appears in layer 2 and the Trustpilot tools never appear in the catalogue. The model can't even consider calling a tool it can't see.
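A minimal sketch of that ordering: filter the server list first, then derive the usage guides from whatever survives. The function names, the localhost URLs and ports, and the guide wording are all assumptions for illustration; the source says only that the servers run on fixed localhost ports.

```python
# Hypothetical server registry; ports and paths are placeholders.
SERVERS = {
    "trustpilot": {"url": "http://127.0.0.1:8101/mcp"},
    "freshchat": {"url": "http://127.0.0.1:8102/mcp"},
}
GUIDES = {
    "trustpilot": "Use get_summary for overviews, query_reviews for detail.",
    "freshchat": "Use find_customer_history to look up a customer.",
}


def filter_mcp_servers(servers: dict[str, dict], allowed: set[str]) -> dict[str, dict]:
    """Keep only the MCP servers this user is permitted to use."""
    return {name: cfg for name, cfg in servers.items() if name in allowed}


def usage_guides(servers: dict[str, dict], guides: dict[str, str]) -> list[str]:
    """Build layer 2 from the already-filtered server list."""
    return [guides[name] for name in servers if name in guides]
```

Since usage_guides only ever sees the filtered dictionary, a server the user cannot access leaves no trace in the prompt.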
5 · MCP servers and their tools
The agent's "hands" are two Model Context Protocol servers — each a small HTTP service that exposes a handful of tools to the model. The SDK discovers them at session start and surfaces every tool as mcp__<server>__<tool> in the model's tool catalogue.
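The naming scheme is mechanical enough to show as a pair of helpers. These functions are illustrative and not part of the SDK, and the lowercase server name trustpilot used in the example is an assumption.

```python
def qualified_tool_name(server: str, tool: str) -> str:
    """Build the name the model sees in its tool catalogue."""
    return f"mcp__{server}__{tool}"


def split_tool_name(name: str) -> tuple[str, str]:
    """Recover (server, tool) from a qualified name."""
    # Split on the first two double-underscores only, so tool names
    # that themselves contain "__" stay intact.
    prefix, server, tool = name.split("__", 2)
    if prefix != "mcp":
        raise ValueError(f"not an MCP tool name: {name!r}")
    return server, tool
```

For example, qualified_tool_name("trustpilot", "get_summary") gives mcp__trustpilot__get_summary, which is handy when grouping audit events by server.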
Tool details — inputs and outputs
Trustpilot MCP
| Tool | Inputs | Output |
|---|---|---|
| get_summary: pre-computed aggregate stats | None. | JSON object: one row from the trustpilot_summary table. |
| query_reviews: SQL against trustpilot_reviews | sql (string, required): SELECT-only query. Must include LIMIT (max 200). Non-SELECT statements are rejected. | JSON array of row objects. Each row has the full ~25 columns: review_id, review_created_utc, username, title, content, stars, sentiment, topic, has_refund_request, has_response, response_lag_days, review_month, review_word_count, language_label, churn_risk, competitor_mention, competitor_name, praise_aspect, language, source, company_response, company_response_author, domain_url, tags, location_name. On SQL error or non-SELECT: {"error": "..."}. |
| search_reviews_semantic: vector similarity over review text | query (string, required): Natural-language phrase. Embedded with all-MiniLM-L6-v2 and matched via cosine similarity. limit (int, optional): Default 15. Hard max 30. | JSON array of reviews ordered by similarity (highest first). Each row includes a similarity_score. If all similarity_score values are below 0.25, the agent is instructed to refuse answering, because retrieval found nothing relevant. |
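Two of the guardrails in the table lend themselves to a short sketch: the SELECT-only / LIMIT check on query_reviews and the 0.25 similarity cut-off on search_reviews_semantic. The function names are hypothetical and the SQL check is deliberately naive (textual matching only). The source does not say how the real server validates queries, and a production guard would more likely rely on a SQL parser or a read-only database role.

```python
import re
from typing import Optional

MAX_LIMIT = 200            # hard cap stated for query_reviews
REFUSAL_THRESHOLD = 0.25   # below this, retrieval found nothing relevant


def check_sql(sql: str) -> Optional[dict]:
    """Return an {"error": ...} dict if the query breaks a rule, else None."""
    stripped = sql.strip()
    if not re.match(r"(?i)select\b", stripped):
        return {"error": "only SELECT statements are allowed"}
    match = re.search(r"(?i)\blimit\s+(\d+)\b", stripped)
    if match is None:
        return {"error": "query must include LIMIT"}
    if int(match.group(1)) > MAX_LIMIT:
        return {"error": f"LIMIT must be at most {MAX_LIMIT}"}
    return None


def nothing_relevant(rows: list[dict]) -> bool:
    """True when every hit scores below the refusal threshold."""
    return all(row["similarity_score"] < REFUSAL_THRESHOLD for row in rows)
```

Returning an error dictionary instead of raising matches the {"error": "..."} shape the tool is documented to return, so the model can read the failure and retry.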
Freshchat MCP
| Tool | Inputs | Output |
|---|---|---|
| find_customer_history: customer lookup + chat history | email (string): Preferred match key; exact, unique. first_name (string), last_name (string): Used when email is unknown; may match multiple users. max_users (int, optional): Default 5. Hard max 15. At least one of email / first_name / last_name must be provided. | JSON object: the Freshchat users that matched, with their conversation history rolled up. sample_user_text is filtered to actor_type=user only, which excludes bot / agent template responses. On bad input or upstream failure: {"error": "...", "details": ..., "search_query": ..., "api_calls": N}. |
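The input rules for find_customer_history can be sketched as a small normalization step. This is an assumption-laden illustration: the function name and the match_on key are invented, and the source does not say whether a max_users above 15 is clamped or rejected, so clamping is a guess.

```python
from typing import Optional

DEFAULT_MAX_USERS = 5
HARD_MAX_USERS = 15


def normalize_lookup(
    email: Optional[str] = None,
    first_name: Optional[str] = None,
    last_name: Optional[str] = None,
    max_users: int = DEFAULT_MAX_USERS,
) -> dict:
    """Validate lookup inputs and pick the match key to search on."""
    if not (email or first_name or last_name):
        return {"error": "provide at least one of email, first_name, last_name"}
    # Assumption: out-of-range values are clamped rather than rejected.
    capped = max(1, min(max_users, HARD_MAX_USERS))
    if email:
        # Email is the preferred key because it is exact and unique.
        return {"match_on": {"email": email}, "max_users": capped}
    names = {
        key: value
        for key, value in (("first_name", first_name), ("last_name", last_name))
        if value
    }
    return {"match_on": names, "max_users": capped}
```

Preferring email whenever it is present avoids the ambiguous multi-user matches that name-only lookups can produce.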
6 · A single chat turn
The browser sends a POST to kick off the turn and immediately opens an EventSource to stream the response. Both share an in-memory queue keyed by session ID.
The POST returns fast — it only enqueues work. Time-to-first-token is dominated by the model. The user message is persisted immediately so it shows up in history even if the agent crashes; the assistant message is persisted in a single transaction once the SDK emits its result event.
What the background task does
- Saves the user message to messages.
- Calls sdk_orchestrator.run_agent() and iterates over its events.
- For each event: tags with seq and ui_category, pushes to the SSE queue, accumulates into in-memory content blocks.
- Tracks every tool call (audit events) and failure (tool_failure events).
- On the result event: writes the assistant message + metadata (cost, tokens, duration) + tool events in one transaction.
- Pushes None to close the SSE stream.
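The steps above can be sketched as one async loop. This is a simplified stand-in, not the real task: fake_agent_events replaces sdk_orchestrator.run_agent(), a plain list replaces the database transaction, the "text" event type and the cost_usd field are assumed names, and ui_category is filled with the event type as a placeholder because the source does not describe how categories are assigned.

```python
import asyncio
from typing import AsyncIterator


async def fake_agent_events() -> AsyncIterator[dict]:
    """Stand-in for the events the orchestrator would yield."""
    for event in (
        {"type": "text", "text": "Hello "},
        {"type": "audit", "tool": "get_summary"},
        {"type": "text", "text": "world"},
        {"type": "result", "cost_usd": 0.01},
    ):
        yield event


async def run_background_turn(
    events: AsyncIterator[dict], queue: asyncio.Queue, saved: list
) -> None:
    blocks: list[str] = []
    tool_events: list[dict] = []
    seq = 0
    async for event in events:
        seq += 1
        # Placeholder: the real ui_category mapping is not documented.
        tagged = {**event, "seq": seq, "ui_category": event["type"]}
        await queue.put(tagged)
        if event["type"] == "text":
            blocks.append(event["text"])
        elif event["type"] in ("audit", "tool_failure"):
            tool_events.append(tagged)
        elif event["type"] == "result":
            # Stand-in for the single transaction that persists the turn.
            saved.append({
                "content": "".join(blocks),
                "cost_usd": event.get("cost_usd"),
                "tool_events": tool_events,
            })
    await queue.put(None)  # closes the SSE stream


async def demo() -> tuple[list[dict], list[dict]]:
    queue: asyncio.Queue = asyncio.Queue()
    saved: list[dict] = []
    await run_background_turn(fake_agent_events(), queue, saved)
    streamed = []
    while (item := await queue.get()) is not None:
        streamed.append(item)
    return streamed, saved
```

Persisting only on the result event means a crashed turn leaves no partial assistant message behind, while the browser has still seen whatever was streamed.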