Hilo · Architecture DESIGN DOC

Hilo · How it works

The short version, with diagrams.

In one sentence: Hilo is a FastAPI chat app that wraps the Claude Agent SDK. Every user message spawns a fresh agent subprocess that connects to two HTTP MCP servers (Trustpilot, Freshchat), streams tokens back to the browser over Server-Sent Events, and persists the result to Postgres.


1 · Overview

One EC2 instance. One Postgres database. One FastAPI server. Two long-running MCP sidecar processes. One ephemeral agent subprocess per chat turn. That's the whole production stack.

[Diagram: production stack overview]
Reading the diagram: green = our FastAPI app. Blue = the Claude Agent SDK subprocess we spawn per message. Orange = the MCP servers (Trustpilot, Freshchat). Red = Postgres. Purple = external APIs (Anthropic, Freshchat).

2 · How Trustpilot reviews get enriched

A raw Trustpilot review is just unstructured text: a title, a body, a star rating, a username, a timestamp. Two separate processes transform each review at ingest time before it's useful to the agent.

What each extractor is actually doing

Every review ends up with ~25 columns. Nine are produced by the LLM, fanned out from the raw review:

[Diagram: LLM-produced columns fanned out from the raw review]

The remaining columns — populated without any LLM:

Raw fields (as received from Trustpilot):
review_id  ·  review_created_utc  ·  username  ·  title  ·  content  ·  stars  ·  language  ·  source  ·  company_response

Mechanical derivations (computed from the raw fields):
  • has_response: true if company_response is non-null
  • response_lag_days: days between review_created_utc and the response timestamp
  • review_month: YYYY-MM truncation of the review date, for time-bucketing
  • review_word_count: word count of the review body
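
The mechanical derivations above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function name `derive_fields` and the `company_response_utc` field (the response timestamp) are assumptions, since the source only names the derived columns.

```python
from datetime import datetime
from typing import Optional


def derive_fields(review: dict) -> dict:
    """Compute the non-LLM columns from a raw review dict."""
    created = datetime.fromisoformat(review["review_created_utc"])
    response = review.get("company_response")
    # Hypothetical field: timestamp of the company response, if there is one.
    responded_at: Optional[str] = review.get("company_response_utc")

    lag_days = None
    if response is not None and responded_at is not None:
        lag_days = (datetime.fromisoformat(responded_at) - created).days

    return {
        "has_response": response is not None,
        "response_lag_days": lag_days,
        "review_month": created.strftime("%Y-%m"),
        # Whitespace-separated tokens; the real tokenisation may differ.
        "review_word_count": len(review["content"].split()),
    }


example = derive_fields({
    "review_created_utc": "2025-04-12T10:00:00",
    "content": "The app keeps disconnecting from the band too.",
    "company_response": "Sorry to hear that.",
    "company_response_utc": "2025-04-20T10:00:00",
})
```

None of these columns need a model call, so they can be recomputed cheaply whenever the raw data is refreshed.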

Example: one review, before and after

Here's a single Trustpilot review as it arrives, and what ends up in the database after the pipeline runs.

Before — the raw review

★ ☆ ☆ ☆ ☆ Hans M. · 2025-04-12 · language: en
Inaccurate readings and slow support
Bought this band three months ago to track my blood pressure between doctor visits. Initial readings looked promising but they're consistently 15–20 points lower than my Omron M7 cuff. Contacted support twice — the first reply took 8 days, the second never came. The app keeps disconnecting from the band too. I want a refund. Going back to the Omron, this isn't worth €349.

That's all the platform receives: a star count, a title, a body, a username, a timestamp, a raw language tag. The rest has to be inferred.

After — the same review, broken down into columns

Each row below is one column on the trustpilot_reviews table. The right column shows what the extractors produced for this specific review.

| Column | Extracted value |
| --- | --- |
| stars | 1 |
| sentiment | negative |
| topic | accuracy |
| has_refund_request | true |
| churn_risk | true |
| competitor_mention | true |
| competitor_name | Omron M7 |
| praise_aspect | NULL |
| language_label | English |
| has_response | true |
| response_lag_days | 8 |
| review_word_count | 67 |
| review_month | 2025-04 |
| embedding | [0.0143, -0.0274, 0.1129, …] |

That's one review. Multiply this across every review in the database and the agent can answer "which competitors are customers comparing us to?" or "how many people said they're cancelling?" with a single SQL query — no LLM call needed at question time.
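
As a rough illustration of the "single SQL query" claim, the sketch below answers the competitor question with one GROUP BY. It uses an in-memory SQLite table as a stand-in for the Postgres trustpilot_reviews table, with a reduced column set and a handful of rows taken from this document's examples; production queries would run against Postgres.

```python
import sqlite3

# SQLite stands in for Postgres here; only four columns are modelled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trustpilot_reviews ("
    "username TEXT, stars INTEGER, sentiment TEXT, competitor_name TEXT)"
)
conn.executemany(
    "INSERT INTO trustpilot_reviews VALUES (?, ?, ?, ?)",
    [
        ("Hans M.", 1, "negative", "Omron M7"),
        ("Dieter K.", 1, "negative", "Withings"),
        ("Greta S.", 1, "negative", None),
        ("Anders T.", 2, "negative", "Withings"),
        ("Marta R.", 5, "positive", None),
    ],
)

# "Which competitors are customers comparing us to?"
competitor_counts = conn.execute(
    """
    SELECT competitor_name, COUNT(*) AS mentions
    FROM trustpilot_reviews
    WHERE competitor_name IS NOT NULL
    GROUP BY competitor_name
    ORDER BY mentions DESC, competitor_name
    LIMIT 10
    """
).fetchall()
```

Because the extraction already happened at ingest time, the question reduces to a filter and a count.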

Twenty reviews after extraction

Here's what twenty rows of the trustpilot_reviews table look like once the pipeline has run. Rows are grouped by sentiment (negative → neutral → positive) so the column patterns are easy to read top-down.

| Review | stars | sentiment | topic | refund | churn | competitor | praise | lang | month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Inaccurate readings and slow support · Hans M. | 1 | negative | accuracy | | | Omron M7 | | English | 2025-04 |
| Geld zurück, schalte auf Withings · Dieter K. | 1 | negative | refund | | | Withings | | German | 2025-03 |
| Akku defekt nach drei Wochen · Greta S. | 1 | negative | hardware | | | | | German | 2025-05 |
| Llegó después de tres semanas · Beatriz O. | 1 | negative | shipping | | | | | Spanish | 2025-05 |
| App crashes, going back to Fitbit · Niko F. | 1 | negative | app | | | Fitbit | | English | 2025-05 |
| Si scollega in continuazione · Sofia P. | 2 | negative | connectivity | | | | | Italian | 2025-01 |
| Switching to Apple Watch, support never replied · Robert F. | 2 | negative | support | | | Apple Watch | | English | 2025-04 |
| Werte 10 Punkte unter Omron · Helga R. | 2 | negative | accuracy | | | Omron | | German | 2025-02 |
| Strap broke — Withings was sturdier · Anders T. | 2 | negative | hardware | | | Withings | | English | 2025-01 |
| €349 ist zu viel für die Funktionen · Erika H. | 3 | neutral | pricing | | | | | German | 2025-02 |
| L'app marche, sans plus · Anne L. | 3 | neutral | app | | | | | French | 2025-02 |
| Très confortable pour la nuit · Pierre G. | 4 | positive | positive_experience | | | | comfort | French | 2025-01 |
| Super App, klare Charts · Klaus W. | 4 | positive | app | | | | app | German | 2025-03 |
| Helped me cut salt, sleep is better · Lars J. | 4 | positive | positive_experience | | | | lifestyle_insights | English | 2025-04 |
| Worth every euro for peace of mind · Marta R. | 5 | positive | positive_experience | | | | value | English | 2025-02 |
| Genauere Werte als erwartet · Jürgen B. | 5 | positive | accuracy | | | | accuracy | German | 2025-04 |
| Servizio clienti gentile e rapido · Lucia M. | 5 | positive | support | | | | support | Italian | 2025-05 |
| Mi cardiólogo confía en los datos · Carlos D. | 5 | positive | positive_experience | | | | medical_usefulness | Spanish | 2025-03 |
| Olvidarme del brazalete tradicional · Tomás V. | 5 | positive | positive_experience | | | | passive_monitoring | Spanish | 2025-03 |
| Bon rapport qualité-prix · Camille B. | 5 | positive | positive_experience | | | | value | French | 2025-04 |

A few patterns jump out without any analysis: competitor_name almost always lines up with negative sentiment plus churn-risk; praise_aspect only fires on positive reviews; explicit refund requests cluster at 1-star; topic = positive_experience is what most 5-star reviews get bucketed into. Now imagine the agent running WHERE churn_risk = true over 3,500 rows in milliseconds.

Two tables, two access patterns

trustpilot_reviews — one row per review. Holds raw fields plus every extracted column above plus the embedding vector. This is what query_reviews (SQL) and search_reviews_semantic (vector) read from.

trustpilot_summary — one row, period. Pre-computed aggregates: total reviews, sentiment counts, average stars, response coverage, churn count, competitor mention count and breakdown, date range. This is what get_summary reads — a single-row SELECT that returns in milliseconds.

The summary table is rebuilt whenever the reviews table is refreshed with new data. The agent is taught (via the tool docstrings) to reach for get_summary first for any overview question, and only drop down to query_reviews when it needs filtered detail.
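
A rebuild of the summary row could look roughly like the following. This is a sketch under assumptions: the function name `build_summary` is hypothetical, only a subset of the summary fields is computed, and the sample rows (including their has_response and churn_risk values) are illustrative rather than real data.

```python
from collections import Counter


def build_summary(rows: list) -> dict:
    """Aggregate review rows into one summary row (subset of fields)."""
    total = len(rows)
    sentiments = Counter(r["sentiment"] for r in rows)
    responded = sum(1 for r in rows if r["has_response"])
    competitors = Counter(
        r["competitor_name"] for r in rows if r["competitor_name"]
    )
    return {
        "total_reviews": total,
        "positive_count": sentiments["positive"],
        "neutral_count": sentiments["neutral"],
        "negative_count": sentiments["negative"],
        "avg_stars": round(sum(r["stars"] for r in rows) / total, 2) if total else None,
        "response_count": responded,
        "response_rate": round(responded / total, 2) if total else None,
        "churn_count": sum(1 for r in rows if r["churn_risk"]),
        "competitor_name_breakdown": dict(competitors),
    }


# Illustrative rows only.
sample_rows = [
    {"stars": 1, "sentiment": "negative", "has_response": True,
     "churn_risk": True, "competitor_name": "Omron M7"},
    {"stars": 2, "sentiment": "negative", "has_response": True,
     "churn_risk": False, "competitor_name": "Withings"},
    {"stars": 4, "sentiment": "positive", "has_response": False,
     "churn_risk": False, "competitor_name": None},
    {"stars": 5, "sentiment": "positive", "has_response": True,
     "churn_risk": False, "competitor_name": None},
]
summary = build_summary(sample_rows)
```

In production the same aggregation would more likely be a single SQL statement run after each refresh; the Python version just makes the arithmetic explicit.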


3 · System architecture

The stack is intentionally boring. Everything runs on a single EC2 host and is wired together with systemd. No microservices, no queue, no worker pool.

| Layer | What it is |
| --- | --- |
| UI | Jinja2 templates + vanilla JS (no SPA framework) |
| HTTP server | FastAPI + sse-starlette |
| Orchestrator | app/services/sdk_orchestrator.py (builds config, spawns subprocess) |
| Agent runtime | claude-agent-sdk, spawned per message |
| MCP servers | 2 long-lived HTTP processes (Trustpilot, Freshchat) |
| Database | PostgreSQL 16 + pgvector |
| Streaming | Server-Sent Events with an in-memory queue per session |

Three design choices worth knowing

The agent process is short-lived

The agent does not outlive a single user turn. Every message spawns a fresh subprocess. Multi-turn continuity comes from passing the SDK's session_id back for the next turn. This keeps the server stateless between turns and bounds memory and cost per turn.
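
The continuity mechanism can be illustrated without the SDK itself. In the sketch below, `options_for_turn`, `fake_agent`, the `sdk_session_id` key, and the `resume` option name are all assumptions made for illustration; the only thing taken from the text is that the previous turn's session_id is handed back on the next turn.

```python
from typing import Optional


def options_for_turn(previous_session_id: Optional[str]) -> dict:
    """Build per-turn options; pass the prior session ID back if one exists."""
    options = {"include_partial_messages": True}
    if previous_session_id is not None:
        options["resume"] = previous_session_id  # assumed option name
    return options


def fake_agent(options: dict) -> dict:
    """Stand-in for the agent subprocess: echoes or mints a session ID."""
    session_id = options.get("resume", "sess-001")
    return {"type": "result", "session_id": session_id}


# Turn 1: no prior session, so the fake agent mints one.
conversation = {"sdk_session_id": None}
first = fake_agent(options_for_turn(conversation["sdk_session_id"]))
conversation["sdk_session_id"] = first["session_id"]

# Turn 2: a brand-new subprocess, but it resumes the stored session.
second_options = options_for_turn(conversation["sdk_session_id"])
```

The server only has to remember one string per conversation; everything else about the turn can be thrown away when the subprocess exits.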

MCP servers are long-lived sidecars

The MCP servers run continuously as systemd units on fixed localhost ports. The agent subprocess connects to them over HTTP. This lets MCP servers hold expensive resources (database connections, file handles) across many requests.

No queue, no worker pool

Each chat message turns into a background asyncio.create_task inside the FastAPI process. The task drives the agent subprocess and pushes events into an in-memory asyncio.Queue keyed by session ID; a second request — the SSE stream — drains the queue. Simpler than SQS/workers and fine for current scale.
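
The queue-per-session pattern described above can be sketched with plain asyncio. The names `queues`, `get_queue`, `fake_agent_turn`, and `drain` are hypothetical, and the event shape is a placeholder; in the real app the producer is the task driving the agent subprocess and the consumer is the SSE endpoint.

```python
import asyncio

# One queue per session ID, held in process memory.
queues: dict = {}


def get_queue(session_id: str) -> asyncio.Queue:
    return queues.setdefault(session_id, asyncio.Queue())


async def fake_agent_turn(session_id: str) -> None:
    """Stand-in for the background task that drives the agent."""
    queue = get_queue(session_id)
    for token in ["Hel", "lo", "!"]:
        await queue.put({"type": "token", "text": token})
    await queue.put(None)  # sentinel: the stream is finished


async def drain(session_id: str) -> list:
    """Stand-in for the SSE request: read events until the sentinel."""
    queue = get_queue(session_id)
    events = []
    while (event := await queue.get()) is not None:
        events.append(event)
    queues.pop(session_id, None)  # drop the queue once the turn is over
    return events


async def main() -> list:
    task = asyncio.create_task(fake_agent_turn("s1"))
    events = await drain("s1")
    await task
    return events


received = asyncio.run(main())
```

The trade-off is that queues live in one process: a restart mid-turn loses in-flight events, and the design does not extend to multiple server instances without a shared broker.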


4 · How a prompt is assembled

Before the agent ever runs, the orchestrator builds the prompt that the model will see. It's a layering process — start from a fixed foundation, then add context tailored to this user, this conversation, and the tools they have access to. Think of it as a sandwich: the bottom is generic, the top is the user's actual question, and the middle is everything we've stitched in to make the model competent.

[Diagram: five context sources feeding one assembled prompt]
Reading the diagram: five independent sources of context feed into one assembled prompt. The blue boxes on the left are the inputs; the green box on the right is what actually gets sent to the agent subprocess.

The five layers, bottom to top

1. The base system prompt — fixed text shared by every conversation. It establishes the agent's identity (who it is, today's date, the company it works for), the house style for responses, and the list of built-in tools it can use (Read, Write, Bash, Grep, etc).

2. A usage guide per attached MCP server — for each MCP the user has access to, a short paragraph is appended explaining when to reach for it. For Trustpilot it says something like "use get_summary for instant star ratings, query_reviews when you need to run SQL against the review table". This is what stops the model from blindly guessing tool names.

3. The user's identity — their email gets interpolated into the prompt so the agent knows who it's talking to.

4. A skill preamble — Hilo runs a semantic search (pgvector) against a library of Markdown "skill cards" using the last few messages of context. The top matches are listed as "here are skills relevant to this task" — the agent reads each one on demand if it decides it applies.

5. The user's actual message — appended last. This is what the user typed.

The whole stack is serialized into a single JSON config file. The orchestrator then spawns the agent subprocess and points it at that file. The model reads layers 1–4 once at the start of the turn and treats layer 5 as the prompt to respond to.
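
The layering can be sketched as a small pure function. Everything named here is illustrative: `assemble_prompt`, the `system_prompt` / `prompt` keys, the example email, and the skill path are assumptions, and the real config file almost certainly carries more fields.

```python
import json


def assemble_prompt(base: str, mcp_guides: dict, user_email: str,
                    skills: list, user_message: str) -> dict:
    layers = [base]                                          # layer 1
    layers += [f"[{name}] {guide}"                           # layer 2
               for name, guide in mcp_guides.items()]
    layers.append(f"The current user is {user_email}.")      # layer 3
    if skills:                                               # layer 4
        listing = "\n".join(f"- {path}" for path in skills)
        layers.append("Here are skills relevant to this task:\n" + listing)
    # Layer 5, the user's message, is kept separate from the context layers.
    return {"system_prompt": "\n\n".join(layers), "prompt": user_message}


config = assemble_prompt(
    base="You are Hilo's assistant.",
    mcp_guides={"trustpilot": "Use get_summary for instant star ratings, "
                              "query_reviews when you need SQL."},
    user_email="analyst@example.com",
    skills=["skills/churn_analysis.md"],
    user_message="How many reviewers mention a competitor?",
)
config_json = json.dumps(config, indent=2)
```

Keeping assembly as a pure function of its inputs makes each layer easy to test in isolation.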

What the agent gets in the config besides the prompt
  • Model: claude-sonnet-4-6 with thinking enabled.
  • Tool allowlist: SDK built-ins (Bash, Read, Write, Edit, Glob, Grep, ToolSearch); WebFetch/WebSearch disallowed.
  • MCP server endpoints: URLs and headers for each HTTP MCP the user can use (after RBAC filtering strips the ones they can't).
  • Budget: max_budget_usd = 5.0 per turn.
  • Streaming: include_partial_messages = True, so we can stream tokens to the browser as they arrive.

The RBAC filter runs before the prompt is assembled — if the user can't use the Trustpilot MCP, then the Trustpilot usage guide never appears in layer 2 and the Trustpilot tools never appear in the catalogue. The model can't even consider calling a tool it can't see.
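
A minimal sketch of filter-then-build, using a plain dict rather than the SDK's own options type. The registry layout, role names, localhost ports, guide text, and the `usage_guides` key are all placeholders; the model name, tool lists, budget, and streaming flag are the values listed above.

```python
BUILTIN_TOOLS = ["Bash", "Read", "Write", "Edit", "Glob", "Grep", "ToolSearch"]

# Hypothetical registry: URLs, roles and guide text are placeholders.
MCP_REGISTRY = {
    "trustpilot": {"url": "http://127.0.0.1:9001/mcp", "role": "reviews",
                   "guide": "Use get_summary first for overview questions."},
    "freshchat": {"url": "http://127.0.0.1:9002/mcp", "role": "support",
                  "guide": "Use find_customer_history to look up a customer."},
}


def build_agent_config(user_roles: set) -> dict:
    # RBAC runs first: servers the user can't use never reach the config.
    visible = {name: srv for name, srv in MCP_REGISTRY.items()
               if srv["role"] in user_roles}
    return {
        "model": "claude-sonnet-4-6",
        "allowed_tools": list(BUILTIN_TOOLS),
        "disallowed_tools": ["WebFetch", "WebSearch"],
        "mcp_servers": {name: {"type": "http", "url": srv["url"]}
                        for name, srv in visible.items()},
        "usage_guides": [srv["guide"] for srv in visible.values()],
        "max_budget_usd": 5.0,
        "include_partial_messages": True,
    }


support_only = build_agent_config({"support"})
```

Because the usage guides and the endpoint list are derived from the same filtered set, they cannot drift apart.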


5 · MCP servers and their tools

The agent's "hands" are two Model Context Protocol servers — each a small HTTP service that exposes a handful of tools to the model. The SDK discovers them at session start and surfaces every tool as mcp__<server>__<tool> in the model's tool catalogue.
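
The naming scheme is simple enough to show directly. The server keys `trustpilot` and `freshchat` are assumed spellings of the server names; the tool names are the ones documented in this section.

```python
SERVERS = {
    "trustpilot": ["get_summary", "query_reviews", "search_reviews_semantic"],
    "freshchat": ["find_customer_history"],
}


def catalogue(servers: dict) -> list:
    """Qualified tool names as the model would see them."""
    return [f"mcp__{server}__{tool}"
            for server, tools in servers.items() for tool in tools]


def split_tool_name(qualified: str) -> tuple:
    """Inverse mapping, e.g. for tagging audit events by server."""
    prefix, server, tool = qualified.split("__", 2)
    if prefix != "mcp":
        raise ValueError(f"not an MCP tool name: {qualified}")
    return server, tool


tool_names = catalogue(SERVERS)
```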

[Diagram: MCP servers and their tools]
Tool details — inputs and outputs

Trustpilot MCP

get_summary: pre-computed aggregate stats
Inputs: none.
Output: JSON object, one row from the trustpilot_summary table.
{
  "total_reviews": 3548,
  "positive_count": 2812,
  "neutral_count": 196,
  "negative_count": 540,
  "avg_stars": 4.42,
  "response_count": 2946,
  "response_rate": 0.83,
  "avg_response_lag_days": 4.2,
  "churn_count": 41,
  "competitor_mention_count": 87,
  "competitor_name_breakdown": {"Omron": 31, "Withings": 14, ...},
  "language_counts": {"English": 1820, "German": 980, ...},
  "date_range_min": "2024-01-03",
  "date_range_max": "2026-05-19"
}

query_reviews: SQL against trustpilot_reviews
Inputs:
  • sql (string, required): SELECT-only query. Must include LIMIT (max 200). Non-SELECT statements are rejected.
Output: JSON array of row objects. Each row has the full ~25 columns: review_id, review_created_utc, username, title, content, stars, sentiment, topic, has_refund_request, has_response, response_lag_days, review_month, review_word_count, language_label, churn_risk, competitor_mention, competitor_name, praise_aspect, language, source, company_response, company_response_author, domain_url, tags, location_name.
[
  {
    "review_id": "tp_4f29c1",
    "username": "Hans M.",
    "stars": 1,
    "sentiment": "negative",
    "topic": "accuracy",
    "churn_risk": true,
    "competitor_name": "Omron M7",
    "review_month": "2025-04",
    ...
  }
]
On SQL error or non-SELECT: {"error": "..."}.
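
The query_reviews input rules (SELECT only, LIMIT required, LIMIT at most 200) could be enforced with a guard along these lines. This is a heuristic sketch, not the server's actual check: `validate_sql` is a hypothetical name, the regexes are deliberately conservative, and a real deployment would presumably also rely on a read-only database role.

```python
import re
from typing import Optional

MAX_LIMIT = 200


def validate_sql(sql: str) -> Optional[str]:
    """Return an error message, or None if the query looks acceptable."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:
        # Conservative: also rejects semicolons inside string literals.
        return "multiple statements are not allowed"
    if not re.match(r"(?i)^select\b", stmt):
        return "only SELECT statements are allowed"
    match = re.search(r"(?i)\blimit\s+(\d+)\b", stmt)
    if match is None:
        return "query must include a LIMIT clause"
    if int(match.group(1)) > MAX_LIMIT:
        return f"LIMIT must be at most {MAX_LIMIT}"
    return None


def guard(sql: str) -> dict:
    """Shape the rejection the way the tool reports errors."""
    error = validate_sql(sql)
    return {"error": error} if error else {"ok": True}
```

Usage: `guard("DELETE FROM trustpilot_reviews")` yields an `{"error": ...}` object, while a `SELECT ... LIMIT 50` passes through.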

search_reviews_semantic: vector similarity over review text
Inputs:
  • query (string, required): natural-language phrase. Embedded with all-MiniLM-L6-v2 and matched via cosine similarity.
  • limit (int, optional): default 15, hard max 30.
Output: JSON array of reviews ordered by similarity (highest first). Each row:
[
  {
    "review_id": "tp_8a12b3",
    "username": "Marta R.",
    "review_month": "2025-02",
    "title": "Great battery, runs days",
    "content": "...",
    "stars": 5,
    "sentiment": "positive",
    "topic": "positive_experience",
    "churn_risk": false,
    "praise_aspect": "value",
    "competitor_name": null,
    "language_label": "English",
    "similarity_score": 0.8214
  }
]
If every similarity_score is below 0.25, the agent is instructed to decline to answer, because retrieval found nothing relevant.
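
For reference, cosine similarity and the 0.25 relevance floor can be written out as below. The helper names are hypothetical, and in production the similarity is computed by pgvector rather than in Python; this only makes the threshold rule concrete.

```python
import math

MIN_SCORE = 0.25  # threshold stated above


def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def has_relevant_hit(results: list) -> bool:
    """False when every result scores below the floor (or there are none)."""
    return any(r["similarity_score"] >= MIN_SCORE for r in results)
```

An empty result list is treated the same as all-low scores: there is nothing relevant to answer from.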

Freshchat MCP

find_customer_history: customer lookup + chat history
Inputs (at least one of email / first_name / last_name must be provided):
  • email (string): preferred match key; exact, unique.
  • first_name, last_name (string): used when email is unknown; may match multiple users.
  • max_users (int, optional): default 5, hard max 15.
Output: JSON object with the Freshchat users that matched, with their conversation history rolled up:
{
  "matched_users": [
    {
      "id": "9f3b2c01-...",
      "email": "hans.m@example.de",
      "name": "Hans Müller",
      "created_time": "2024-09-12T08:14:00Z",
      "conversations": [
        {
          "conversation_id": "c_412...",
          "msg_count": 14,
          "user_msg_count": 6,
          "first_msg_utc": "2025-04-05T09:22:00Z",
          "last_msg_utc":  "2025-04-13T16:01:00Z",
          "sample_user_text": "The readings are off by 15 points...",
          "sample_full_text": "Agent: Hi Hans, sorry to hear..."
        }
      ]
    }
  ],
  "total_users_found": 1,
  "users_returned": 1,
  "api_calls": 4,
  "search_query": {"by": "email", "email": "hans.m@example.de"}
}
sample_user_text is filtered to actor_type=user only — excludes bot / agent template responses. On bad input or upstream failure: {"error": "...", "details": ..., "search_query": ..., "api_calls": N}.
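
The input rules for find_customer_history can be sketched as a small normaliser. `normalize_lookup` is a hypothetical name, the `"by": "name"` query shape is an assumption (only the email form appears in the example output), and whether the real server clamps max_users or rejects oversized values is not stated; clamping is assumed here.

```python
from typing import Optional

DEFAULT_MAX_USERS = 5
HARD_MAX_USERS = 15


def normalize_lookup(email: Optional[str] = None,
                     first_name: Optional[str] = None,
                     last_name: Optional[str] = None,
                     max_users: int = DEFAULT_MAX_USERS) -> dict:
    if not (email or first_name or last_name):
        return {"error": "at least one of email / first_name / last_name is required"}
    # Assumed behaviour: clamp into [1, 15] rather than reject.
    max_users = max(1, min(int(max_users), HARD_MAX_USERS))
    if email:
        # Email is the preferred key: exact and unique.
        query = {"by": "email", "email": email}
    else:
        query = {"by": "name", "first_name": first_name, "last_name": last_name}
    return {"search_query": query, "max_users": max_users}
```

Preferring email whenever it is present avoids the ambiguous multi-user matches that name lookups can produce.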

6 · A single chat turn

The browser sends a POST to kick off the turn and immediately opens an EventSource to stream the response. Both share an in-memory queue keyed by session ID.

[Diagram: sequence of a single chat turn]
The POST returns fast — it only enqueues work. Time-to-first-token is dominated by the model. The user message is persisted immediately so it shows up in history even if the agent crashes; the assistant message is persisted in a single transaction once the SDK emits its result event.
What the background task does
  1. Saves the user message to messages.
  2. Calls sdk_orchestrator.run_agent() and iterates over its events.
  3. For each event: tags with seq and ui_category, pushes to the SSE queue, accumulates into in-memory content blocks.
  4. Tracks every tool call (audit events) and failure (tool_failure events).
  5. On the result event: writes the assistant message + metadata (cost, tokens, duration) + tool events in one transaction.
  6. Pushes None to close the SSE stream.
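
The six steps can be sketched end to end with a fake event stream. The event type names, the `fake_agent_events` generator, the list standing in for the database, and the cost figure are all placeholders; ui_category tagging is omitted for brevity.

```python
import asyncio


async def fake_agent_events():
    """Stand-in for iterating sdk_orchestrator.run_agent()."""
    yield {"type": "text_delta", "text": "Avg rating "}
    yield {"type": "tool_use", "name": "mcp__trustpilot__get_summary"}
    yield {"type": "text_delta", "text": "is 4.42."}
    yield {"type": "result", "cost_usd": 0.01}  # placeholder cost


async def run_turn(user_text: str, queue: asyncio.Queue, db: list) -> None:
    db.append({"role": "user", "content": user_text})        # step 1
    text_parts, tool_events = [], []
    seq = 0
    async for event in fake_agent_events():                  # step 2
        seq += 1
        event = {**event, "seq": seq}                        # step 3: tag
        await queue.put(event)                               # step 3: stream
        if event["type"] == "text_delta":
            text_parts.append(event["text"])                 # step 3: accumulate
        elif event["type"] == "tool_use":
            tool_events.append(event)                        # step 4: audit
        elif event["type"] == "result":
            db.append({                                      # step 5: one write
                "role": "assistant",
                "content": "".join(text_parts),
                "cost_usd": event["cost_usd"],
                "tool_events": tool_events,
            })
    await queue.put(None)                                    # step 6: close


async def main():
    queue, db = asyncio.Queue(), []
    await run_turn("What's our average rating?", queue, db)
    events = []
    while not queue.empty():
        events.append(queue.get_nowait())
    return db, events


db, events = asyncio.run(main())
```

Saving the user message before the agent starts is what keeps it in history if the subprocess dies partway through; the assistant row only appears once the result event arrives.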