Hilo · How it works

The short version, with diagrams.

In one sentence: Hilo is a FastAPI chat app that wraps the Claude Agent SDK. Every user message spawns a fresh agent subprocess that connects to a handful of HTTP MCP servers (Trustpilot, Freshchat), streams tokens back to the browser over Server-Sent Events, and persists the result to Postgres.

1 · Overview

One EC2 instance. One Postgres database. One FastAPI server. Two long-running MCP sidecar processes. One ephemeral agent subprocess per chat turn. That's the whole production stack.

Reading the diagram: green = our FastAPI app. Blue = the Claude Agent SDK subprocess we spawn per message. Orange = the MCP servers (Trustpilot, Freshchat). Red = Postgres. Purple = external APIs (Anthropic, Freshchat).

2 · How Trustpilot reviews get enriched

A raw Trustpilot review is just unstructured text: a title, a body, a star rating, a username, a timestamp. Two separate processes transform each review at ingest time before it's useful to the agent.

AI extraction
Semantic embedding (RAG)

What each extractor is actually doing

Every review ends up with ~25 columns. Nine are produced by the LLM, fanned out from the raw review.

The remaining columns — populated without any LLM:

Raw fields, as received from Trustpilot:
review_id · review_created_utc · username · title · content · stars · language · source · company_response

Mechanical derivations, computed from the raw fields:
has_response: true if company_response is non-null
response_lag_days: days between review_created_utc and the response timestamp
review_month: YYYY-MM truncation of the review date, for time-bucketing
review_word_count: word count of the review body
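The four mechanical derivations above are simple enough to sketch in a few lines. This is an illustration, not the pipeline's actual code: the `company_response_utc` key is a made-up name (the source doesn't say where the response timestamp lives), and word count is assumed to be a plain whitespace split.

```python
from datetime import datetime
from typing import Optional


def derive_fields(review: dict) -> dict:
    """Compute the four mechanical columns from one raw review dict."""
    created = datetime.fromisoformat(review["review_created_utc"])
    response = review.get("company_response")
    # Hypothetical key: the response timestamp column is not named in the doc.
    responded_at = review.get("company_response_utc")

    lag_days: Optional[int] = None
    if response is not None and responded_at is not None:
        lag_days = (datetime.fromisoformat(responded_at) - created).days

    return {
        "has_response": response is not None,
        "response_lag_days": lag_days,
        "review_month": created.strftime("%Y-%m"),
        # Assumption: words are whitespace-separated tokens.
        "review_word_count": len(review["content"].split()),
    }


raw = {
    "review_created_utc": "2025-04-12T09:00:00+00:00",
    "content": "Inaccurate readings and slow support",
    "company_response": "Sorry to hear that, Hans.",
    "company_response_utc": "2025-04-20T09:00:00+00:00",
}
derived = derive_fields(raw)
```

None of this needs a model call, which is why these columns can be recomputed cheaply on every refresh.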

Example: one review, before and after

Here's a single Trustpilot review as it arrives, and what ends up in the database after the pipeline runs.

Before — the raw review

★ ☆ ☆ ☆ ☆ Hans M. · 2025-04-12 · language: en

Inaccurate readings and slow support

Bought this band three months ago to track my blood pressure between doctor visits. Initial readings looked promising but they're consistently 15–20 points lower than my Omron M7 cuff. Contacted support twice — the first reply took 8 days, the second never came. The app keeps disconnecting from the band too. I want a refund. Going back to the Omron, this isn't worth €349.

That's all the platform receives: a star count, a title, a body, a username, a timestamp, a raw language tag. The rest has to be inferred.

After — the same review, broken down into columns

Each row below is one column on the trustpilot_reviews table. The right column shows what the extractors produced for this specific review.

Column	Extracted value
`stars`	`1`
`sentiment`	`negative`
`topic`	`accuracy`
`has_refund_request`	`true`
`churn_risk`	`true`
`competitor_mention`	`true`
`competitor_name`	`Omron M7`
`praise_aspect`	`NULL`
`language_label`	`English`
`has_response`	`true`
`response_lag_days`	`8`
`review_word_count`	`67`
`review_month`	`2025-04`
`embedding`	`[0.0143, -0.0274, 0.1129, …]`

That's one review. Multiply this across every review in the database and the agent can answer "which competitors are customers comparing us to?" or "how many people said they're cancelling?" with a single SQL query — no LLM call needed at question time.

Twenty reviews after extraction

Here's what twenty rows of the trustpilot_reviews table look like once the pipeline has run. Rows are grouped by sentiment (negative → neutral → positive) so the column patterns are easy to read top-down.

Review	stars	sentiment	topic	refund	churn	competitor	praise	lang	month
Inaccurate readings and slow support Hans M.	1	negative	`accuracy`	✓	✓	`Omron M7`	—	English	2025-04
Geld zurück, schalte auf Withings Dieter K.	1	negative	`refund`	✓	✓	`Withings`	—	German	2025-03
Akku defekt nach drei Wochen Greta S.	1	negative	`hardware`	✓	—	—	—	German	2025-05
Llegó después de tres semanas Beatriz O.	1	negative	`shipping`	—	—	—	—	Spanish	2025-05
App crashes, going back to Fitbit Niko F.	1	negative	`app`	✓	✓	`Fitbit`	—	English	2025-05
Si scollega in continuazione Sofia P.	2	negative	`connectivity`	—	—	—	—	Italian	2025-01
Switching to Apple Watch, support never replied Robert F.	2	negative	`support`	—	✓	`Apple Watch`	—	English	2025-04
Werte 10 Punkte unter Omron Helga R.	2	negative	`accuracy`	—	—	`Omron`	—	German	2025-02
Strap broke — Withings was sturdier Anders T.	2	negative	`hardware`	—	✓	`Withings`	—	English	2025-01
€349 ist zu viel für die Funktionen Erika H.	3	neutral	`pricing`	—	—	—	—	German	2025-02
L'app marche, sans plus Anne L.	3	neutral	`app`	—	—	—	—	French	2025-02
Très confortable pour la nuit Pierre G.	4	positive	`positive_experience`	—	—	—	`comfort`	French	2025-01
Super App, klare Charts Klaus W.	4	positive	`app`	—	—	—	`app`	German	2025-03
Helped me cut salt, sleep is better Lars J.	4	positive	`positive_experience`	—	—	—	`lifestyle_insights`	English	2025-04
Worth every euro for peace of mind Marta R.	5	positive	`positive_experience`	—	—	—	`value`	English	2025-02
Genauere Werte als erwartet Jürgen B.	5	positive	`accuracy`	—	—	—	`accuracy`	German	2025-04
Servizio clienti gentile e rapido Lucia M.	5	positive	`support`	—	—	—	`support`	Italian	2025-05
Mi cardiólogo confía en los datos Carlos D.	5	positive	`positive_experience`	—	—	—	`medical_usefulness`	Spanish	2025-03
Olvidarme del brazalete tradicional Tomás V.	5	positive	`positive_experience`	—	—	—	`passive_monitoring`	Spanish	2025-03
Bon rapport qualité-prix Camille B.	5	positive	`positive_experience`	—	—	—	`value`	French	2025-04

A few patterns jump out without any analysis: competitor_name almost always lines up with negative sentiment plus churn-risk; praise_aspect only fires on positive reviews; explicit refund requests cluster at 1-star; topic = positive_experience is what most 5-star reviews get bucketed into. Now imagine the agent running WHERE churn_risk = true over 3,500 rows in milliseconds.
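To make the "single SQL query" point concrete, here is a small runnable sketch. It uses an in-memory SQLite table as a stand-in for Postgres, with only four of the columns and six of the rows from the table above; production column types and the real schema are assumptions.

```python
import sqlite3

# SQLite stand-in for the Postgres table; only a few columns are modelled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trustpilot_reviews ("
    "username TEXT, stars INTEGER, churn_risk INTEGER, competitor_name TEXT)"
)
rows = [
    ("Hans M.", 1, 1, "Omron M7"),
    ("Dieter K.", 1, 1, "Withings"),
    ("Greta S.", 1, 0, None),
    ("Niko F.", 1, 1, "Fitbit"),
    ("Helga R.", 2, 0, "Omron"),
    ("Anders T.", 2, 1, "Withings"),
]
conn.executemany("INSERT INTO trustpilot_reviews VALUES (?, ?, ?, ?)", rows)

# "How many people said they're cancelling?"
churn_count = conn.execute(
    "SELECT COUNT(*) FROM trustpilot_reviews WHERE churn_risk = 1"
).fetchone()[0]

# "Which competitors are customers comparing us to?"
competitors = conn.execute(
    "SELECT competitor_name, COUNT(*) AS mentions "
    "FROM trustpilot_reviews "
    "WHERE competitor_name IS NOT NULL "
    "GROUP BY competitor_name "
    "ORDER BY mentions DESC, competitor_name "
    "LIMIT 200"
).fetchall()
```

Both questions become a filter or a GROUP BY over columns that already exist, so no text has to be re-read at question time.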

Two tables, two access patterns

trustpilot_reviews — one row per review. Holds raw fields plus every extracted column above plus the embedding vector. This is what query_reviews (SQL) and search_reviews_semantic (vector) read from.

trustpilot_summary — one row, period. Pre-computed aggregates: total reviews, sentiment counts, average stars, response coverage, churn count, competitor mention count and breakdown, date range. This is what get_summary reads — a single-row SELECT that returns in milliseconds.

The summary table is rebuilt whenever the reviews table is refreshed with new data. The agent is taught (via the tool docstrings) to reach for get_summary first for any overview question, and only drop down to query_reviews when it needs filtered detail.
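As a rough picture of what "rebuilding the summary" involves, the sketch below collapses a list of review rows into one aggregate dict. It covers only a subset of the summary fields, works on plain Python dicts rather than SQL, and the rounding choices are assumptions.

```python
from collections import Counter


def build_summary(reviews: list) -> dict:
    """Collapse review rows into a single aggregate row (subset of fields)."""
    total = len(reviews)
    sentiments = Counter(r["sentiment"] for r in reviews)
    responded = sum(1 for r in reviews if r["has_response"])
    names = Counter(r["competitor_name"] for r in reviews if r["competitor_name"])
    return {
        "total_reviews": total,
        "positive_count": sentiments["positive"],
        "neutral_count": sentiments["neutral"],
        "negative_count": sentiments["negative"],
        # Assumption: two-decimal rounding, as in the sample output.
        "avg_stars": round(sum(r["stars"] for r in reviews) / total, 2) if total else None,
        "response_rate": round(responded / total, 2) if total else None,
        "churn_count": sum(1 for r in reviews if r["churn_risk"]),
        "competitor_mention_count": sum(1 for r in reviews if r["competitor_mention"]),
        "competitor_name_breakdown": dict(names),
    }


sample = [
    {"stars": 1, "sentiment": "negative", "has_response": True,
     "churn_risk": True, "competitor_mention": True, "competitor_name": "Omron M7"},
    {"stars": 3, "sentiment": "neutral", "has_response": False,
     "churn_risk": False, "competitor_mention": False, "competitor_name": None},
    {"stars": 5, "sentiment": "positive", "has_response": True,
     "churn_risk": False, "competitor_mention": False, "competitor_name": None},
]
summary = build_summary(sample)
```

Because the result is one small row, get_summary never has to scan the reviews table at question time.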

3 · System architecture

The stack is intentionally boring. Everything runs on a single EC2 host and is wired together with systemd. No microservices, no queue, no worker pool.

Layer	What it is
UI	Jinja2 templates + vanilla JS (no SPA framework)
HTTP server	FastAPI + sse-starlette
Orchestrator	`app/services/sdk_orchestrator.py` — builds config, spawns subprocess
Agent runtime	`claude-agent-sdk`, spawned per message
MCP servers	2 long-lived HTTP processes (Trustpilot, Freshchat)
Database	PostgreSQL 16 + pgvector
Streaming	Server-Sent Events with an in-memory queue per session

Three design choices worth knowing

The agent process is short-lived

The agent does not outlive a single user turn. Every message spawns a fresh subprocess. Multi-turn continuity comes from passing the SDK's session_id back for the next turn. This keeps the server stateless between turns and bounds memory and cost per turn.

MCP servers are long-lived sidecars

The MCP servers run continuously as systemd units on fixed localhost ports. The agent subprocess connects to them over HTTP. This lets MCP servers hold expensive resources (database connections, file handles) across many requests.

No queue, no worker pool

Each chat message turns into a background asyncio.create_task inside the FastAPI process. The task drives the agent subprocess and pushes events into an in-memory asyncio.Queue keyed by session ID; a second request — the SSE stream — drains the queue. Simpler than SQS/workers and fine for current scale.
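The queue-per-session pattern can be sketched with nothing but asyncio. Everything here is illustrative: `fake_agent_events` stands in for the agent subprocess, the event shape is invented, and the two handlers are collapsed into one `main()` so the example runs on its own.

```python
import asyncio

# One queue per session ID; the POST task and the SSE request both look it up here.
queues: dict = {}


async def fake_agent_events():
    """Stand-in for the agent subprocess: yields a couple of token events."""
    for token in ("Hel", "lo"):
        await asyncio.sleep(0)
        yield {"type": "token", "text": token}


async def run_turn(session_id: str) -> None:
    """Background task: push agent events, then a None sentinel to close the stream."""
    queue = queues[session_id]
    async for event in fake_agent_events():
        await queue.put(event)
    await queue.put(None)


async def drain(session_id: str) -> list:
    """What the SSE request does: read events until the sentinel arrives."""
    queue = queues[session_id]
    received = []
    while (event := await queue.get()) is not None:
        received.append(event)
    return received


async def main() -> list:
    session_id = "s1"
    queues[session_id] = asyncio.Queue()
    task = asyncio.create_task(run_turn(session_id))  # the POST handler returns here
    events = await drain(session_id)                  # the SSE handler streams these
    await task
    queues.pop(session_id, None)                      # drop the queue once the turn ends
    return events


events = asyncio.run(main())
```

The trade-off is that the queue lives in one process's memory, which is exactly why this design fits a single FastAPI server and would need rethinking with multiple workers.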

4 · How a prompt is assembled

Before the agent ever runs, the orchestrator builds the prompt that the model will see. It's a layering process — start from a fixed foundation, then add context tailored to this user, this conversation, and the tools they have access to. Think of it as a sandwich: the bottom is generic, the top is the user's actual question, and the middle is everything we've stitched in to make the model competent.

Reading the diagram: five independent sources of context feed into one assembled prompt. The blue boxes on the left are the inputs; the green box on the right is what actually gets sent to the agent subprocess.

The five layers, bottom to top

1. The base system prompt — fixed text shared by every conversation. It establishes the agent's identity (who it is, today's date, the company it works for), the house style for responses, and the list of built-in tools it can use (Read, Write, Bash, Grep, etc).

2. A usage guide per attached MCP server — for each MCP the user has access to, a short paragraph is appended explaining when to reach for it. For Trustpilot it says something like "use get_summary for instant star ratings, query_reviews when you need to run SQL against the review table". This is what stops the model from blindly guessing tool names.

3. The user's identity — their email gets interpolated into the prompt so the agent knows who it's talking to.

4. A skill preamble — Hilo runs a semantic search (pgvector) against a library of Markdown "skill cards" using the last few messages of context. The top matches are listed as "here are skills relevant to this task" — the agent reads each one on demand if it decides it applies.

5. The user's actual message — appended last. This is what the user typed.

The whole stack is serialized into a single JSON config file. The orchestrator then spawns the agent subprocess and points it at that file. The model reads layers 1–4 once at the start of the turn and treats layer 5 as the prompt to respond to.
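A minimal sketch of the layering, assuming the first four layers are joined into one system prompt and the user message is passed separately. The prompt text, guide text, and the skill file name are placeholders, not Hilo's real content.

```python
BASE_PROMPT = "You are Hilo's assistant."  # placeholder for the fixed base prompt

MCP_GUIDES = {  # placeholder usage guides, one per MCP server
    "trustpilot": "Use get_summary for overviews, query_reviews for SQL detail.",
    "freshchat": "Use find_customer_history to look up a customer.",
}


def assemble(user_email: str, mcp_names: list, skills: list, message: str) -> tuple:
    """Return (system_prompt, user_prompt) built from the five layers."""
    layers = [BASE_PROMPT]                                  # layer 1: base prompt
    layers += [MCP_GUIDES[name] for name in mcp_names]      # layer 2: MCP usage guides
    layers.append(f"You are talking to {user_email}.")      # layer 3: user identity
    if skills:                                              # layer 4: skill preamble
        listing = "\n".join(f"- {skill}" for skill in skills)
        layers.append("Here are skills relevant to this task:\n" + listing)
    return "\n\n".join(layers), message                     # layer 5: the user's message


system_prompt, user_prompt = assemble(
    user_email="hans.m@example.de",
    mcp_names=["trustpilot"],          # already RBAC-filtered by the caller
    skills=["refund-policy.md"],       # hypothetical skill card name
    message="How many churn-risk reviews do we have?",
)
```

Keeping the user's message out of the joined text mirrors the split described above: layers 1 to 4 are context, layer 5 is the thing to answer.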

What the agent gets in the config besides the prompt

Model — claude-sonnet-4-6 with thinking enabled.
Tool allowlist — SDK built-ins (Bash, Read, Write, Edit, Glob, Grep, ToolSearch); WebFetch/WebSearch disallowed.
MCP server endpoints — URLs and headers for each HTTP MCP the user can use (after RBAC filtering strips ones they can't).
Budget — max_budget_usd: 5.0 per turn.
Streaming — include_partial_messages: True so we can stream tokens to the browser as they arrive.

The RBAC filter runs before the prompt is assembled — if the user can't use the Trustpilot MCP, then the Trustpilot usage guide never appears in layer 2 and the Trustpilot tools never appear in the catalogue. The model can't even consider calling a tool it can't see.
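Here is one way the filter-then-serialize step could look. Only `max_budget_usd` and `include_partial_messages` are key names taken from the text above; every other key, the ports, and the URL paths are assumptions made for illustration.

```python
import json

# Hypothetical registry: ports and URL paths are invented for this sketch.
ALL_MCP_SERVERS = {
    "trustpilot": {"type": "http", "url": "http://127.0.0.1:9001/mcp", "headers": {}},
    "freshchat": {"type": "http", "url": "http://127.0.0.1:9002/mcp", "headers": {}},
}


def build_config(allowed_mcps: set) -> dict:
    """Apply RBAC first, so servers the user can't use never reach the config."""
    visible = {
        name: cfg for name, cfg in ALL_MCP_SERVERS.items() if name in allowed_mcps
    }
    return {
        "model": "claude-sonnet-4-6",
        "thinking": True,  # assumed key name for "thinking enabled"
        "allowed_tools": ["Bash", "Read", "Write", "Edit", "Glob", "Grep", "ToolSearch"],
        "disallowed_tools": ["WebFetch", "WebSearch"],
        "mcp_servers": visible,
        "max_budget_usd": 5.0,
        "include_partial_messages": True,
    }


# A user with Freshchat access only: Trustpilot never appears in the file.
config = build_config({"freshchat"})
config_json = json.dumps(config, indent=2)
```

Filtering before serialization means there is no second place where a forbidden server could leak back in.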

5 · MCP servers and their tools

The agent's "hands" are two Model Context Protocol servers — each a small HTTP service that exposes a handful of tools to the model. The SDK discovers them at session start and surfaces every tool as mcp__<server>__<tool> in the model's tool catalogue.
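The naming scheme is easy to work with in code, for example when grouping audit events by server. The helpers below are hypothetical (they are not part of the SDK), and the server key `trustpilot` is an assumed config name.

```python
def qualified_tool_name(server: str, tool: str) -> str:
    """Build the name the model sees: mcp__<server>__<tool>."""
    return f"mcp__{server}__{tool}"


def split_tool_name(name: str) -> tuple:
    """Split a qualified name back into (server, tool)."""
    parts = name.split("__", 2)
    if len(parts) != 3 or parts[0] != "mcp":
        raise ValueError(f"not an MCP tool name: {name!r}")
    return parts[1], parts[2]


full_name = qualified_tool_name("trustpilot", "get_summary")
server, tool = split_tool_name(full_name)
```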

Tool details — inputs and outputs

Trustpilot MCP

Tool	Inputs	Output
`get_summary` pre-computed aggregate stats	None.	JSON object — one row from the `trustpilot_summary` table. `{ "total_reviews": 3548, "positive_count": 2812, "neutral_count": 196, "negative_count": 540, "avg_stars": 4.42, "response_count": 2946, "response_rate": 0.83, "avg_response_lag_days": 4.2, "churn_count": 41, "competitor_mention_count": 87, "competitor_name_breakdown": {"Omron": 31, "Withings": 14, ...}, "language_counts": {"English": 1820, "German": 980, ...}, "date_range_min": "2024-01-03", "date_range_max": "2026-05-19" }`
`query_reviews` SQL against `trustpilot_reviews`	`sql` (string, required): SELECT-only query. Must include `LIMIT` (max 200). Non-SELECT statements are rejected.	JSON array of row objects. Each row has the full ~25 columns: `review_id`, `review_created_utc`, `username`, `title`, `content`, `stars`, `sentiment`, `topic`, `has_refund_request`, `has_response`, `response_lag_days`, `review_month`, `review_word_count`, `language_label`, `churn_risk`, `competitor_mention`, `competitor_name`, `praise_aspect`, `language`, `source`, `company_response`, `company_response_author`, `domain_url`, `tags`, `location_name`. Example: `[ { "review_id": "tp_4f29c1", "username": "Hans M.", "stars": 1, "sentiment": "negative", "topic": "accuracy", "churn_risk": true, "competitor_name": "Omron M7", "review_month": "2025-04", ... } ]` On SQL error or non-SELECT: `{"error": "..."}`.
`search_reviews_semantic` vector similarity over review text	`query` (string, required): natural-language phrase, embedded with `all-MiniLM-L6-v2` and matched via cosine similarity. `limit` (int, optional): default 15, hard max 30.	JSON array of reviews ordered by similarity (highest first). Each row: `[ { "review_id": "tp_8a12b3", "username": "Marta R.", "review_month": "2025-02", "title": "Great battery, runs days", "content": "...", "stars": 5, "sentiment": "positive", "topic": "positive_experience", "churn_risk": false, "praise_aspect": "value", "competitor_name": null, "language_label": "English", "similarity_score": 0.8214 } ]` If all `similarity_score` values are below 0.25, the agent is instructed to refuse to answer, because retrieval found nothing relevant.
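Two of the rules in the table above are easy to picture in code: the SELECT-only / LIMIT guard on query_reviews, and the 0.25 similarity floor on search_reviews_semantic. This is a deliberately naive sketch, not the server's implementation. A regex check like this is not a security boundary on its own (a read-only database role would be the sturdier control), and the error wording is invented.

```python
import re
from typing import Optional

MAX_LIMIT = 200          # from the query_reviews contract
MIN_SIMILARITY = 0.25    # from the search_reviews_semantic contract


def check_sql(sql: str) -> Optional[dict]:
    """Return an error dict if the query breaks the rules, else None."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        return {"error": "only a single statement is allowed"}
    if not re.match(r"(?i)select\b", statement):
        return {"error": "only SELECT statements are allowed"}
    match = re.search(r"(?i)\blimit\s+(\d+)\b", statement)
    if match is None:
        return {"error": "query must include LIMIT"}
    if int(match.group(1)) > MAX_LIMIT:
        return {"error": f"LIMIT must be at most {MAX_LIMIT}"}
    return None


def nothing_relevant(rows: list) -> bool:
    """True when every hit is below the floor (or there are no hits at all)."""
    return all(row["similarity_score"] < MIN_SIMILARITY for row in rows)


ok = check_sql("SELECT username, stars FROM trustpilot_reviews WHERE churn_risk LIMIT 50")
rejected = check_sql("DELETE FROM trustpilot_reviews")
```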

Freshchat MCP

Tool	Inputs	Output
`find_customer_history` customer lookup + chat history	`email` (string): preferred match key; exact, unique. `first_name`, `last_name` (strings): used when email is unknown; may match multiple users. `max_users` (int, optional): default 5, hard max 15. At least one of email / first_name / last_name must be provided.	JSON object with the Freshchat users that matched and their conversation history rolled up: `{ "matched_users": [ { "id": "9f3b2c01-...", "email": "hans.m@example.de", "name": "Hans Müller", "created_time": "2024-09-12T08:14:00Z", "conversations": [ { "conversation_id": "c_412...", "msg_count": 14, "user_msg_count": 6, "first_msg_utc": "2025-04-05T09:22:00Z", "last_msg_utc": "2025-04-13T16:01:00Z", "sample_user_text": "The readings are off by 15 points...", "sample_full_text": "Agent: Hi Hans, sorry to hear..." } ] } ], "total_users_found": 1, "users_returned": 1, "api_calls": 4, "search_query": {"by": "email", "email": "hans.m@example.de"} }` `sample_user_text` is filtered to `actor_type=user` only, which excludes bot / agent template responses. On bad input or upstream failure: `{"error": "...", "details": ..., "search_query": ..., "api_calls": N}`.
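The input rules for find_customer_history (one identifier required, email preferred, max_users capped) could be checked along these lines before any Freshchat call is made. This is a sketch under assumptions: clamping max_users instead of rejecting it is a guess, and `"by": "name"` is an invented value, since the source only shows `"by": "email"`.

```python
from typing import Optional

DEFAULT_MAX_USERS = 5
HARD_MAX_USERS = 15


def build_search(
    email: Optional[str] = None,
    first_name: Optional[str] = None,
    last_name: Optional[str] = None,
    max_users: int = DEFAULT_MAX_USERS,
) -> dict:
    """Validate the inputs and decide which match key to search by."""
    if not (email or first_name or last_name):
        return {"error": "provide at least one of email, first_name, last_name"}

    # Assumption: out-of-range values are clamped rather than rejected.
    capped = max(1, min(max_users, HARD_MAX_USERS))

    if email:
        # Email wins when present: it is the exact, unique key.
        query = {"by": "email", "email": email}
    else:
        # Hypothetical shape for a name search; may match several users.
        query = {"by": "name", "first_name": first_name, "last_name": last_name}
    return {"search_query": query, "max_users": capped}


by_email = build_search(email="hans.m@example.de", first_name="Hans")
by_name = build_search(last_name="Müller", max_users=50)
invalid = build_search()
```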

6 · A single chat turn

The browser sends a POST to kick off the turn and immediately opens an EventSource to stream the response. Both share an in-memory queue keyed by session ID.

The POST returns fast — it only enqueues work. Time-to-first-token is dominated by the model. The user message is persisted immediately so it shows up in history even if the agent crashes; the assistant message is persisted in a single transaction once the SDK emits its result event.

What the background task does

Saves the user message to messages.
Calls sdk_orchestrator.run_agent() and iterates over its events.
For each event: tags with seq and ui_category, pushes to the SSE queue, accumulates into in-memory content blocks.
Tracks every tool call (audit events) and failure (tool_failure events).
On the result event: writes the assistant message + metadata (cost, tokens, duration) + tool events in one transaction.
Pushes None to close the SSE stream.
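Put together, the six steps look roughly like the sketch below. The event shapes, the `ui_category` values, and the in-memory `messages` list (standing in for the Postgres table) are all assumptions; `fake_run_agent` replaces sdk_orchestrator.run_agent() so the example runs on its own. The try/finally around the loop is an addition of this sketch, so the stream still closes if the agent raises.

```python
import asyncio


async def fake_run_agent():
    """Stand-in for sdk_orchestrator.run_agent(); event shapes are assumed."""
    yield {"type": "text", "text": "Average rating is "}
    yield {"type": "tool_use", "name": "mcp__trustpilot__get_summary"}
    yield {"type": "text", "text": "4.42."}
    yield {"type": "result", "cost_usd": 0.01}


async def background_task(user_text: str, queue: asyncio.Queue, messages: list) -> None:
    messages.append({"role": "user", "content": user_text})        # 1. save user message
    blocks, tool_events, seq = [], [], 0
    try:
        async for event in fake_run_agent():                        # 2. iterate agent events
            seq += 1
            category = "tool" if event["type"] == "tool_use" else "message"
            await queue.put({**event, "seq": seq, "ui_category": category})  # 3. tag + push
            if event["type"] == "text":
                blocks.append(event["text"])                        # 3. accumulate content
            elif event["type"] == "tool_use":
                tool_events.append(event["name"])                   # 4. track tool calls
            elif event["type"] == "result":
                messages.append({                                   # 5. one write at the end
                    "role": "assistant",
                    "content": "".join(blocks),
                    "cost_usd": event["cost_usd"],
                    "tool_events": tool_events,
                })
    finally:
        await queue.put(None)                                       # 6. close the SSE stream


async def main() -> tuple:
    queue, messages = asyncio.Queue(), []
    await background_task("What's our average rating?", queue, messages)
    streamed = []
    while (item := queue.get_nowait()) is not None:
        streamed.append(item)
    return messages, streamed


messages, streamed = asyncio.run(main())
```

Note the asymmetry: the user message is stored before the agent runs, while the assistant message only exists once the result event arrives.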