
Agent System Redesign — First-Principles Architecture


Author: Brian | Date: 2026-04-25 | Host: ubuntu-8gb-hel1-1 (Hetzner, 16 vCPU AMD EPYC-Rome, 30 GB RAM, 75 GB disk, 24 GB free, 4 GB swap)


1. Audit of the Current System

| Layer | Component | Verdict | Reason |
|---|---|---|---|
| Edge | Caddy | Keep | Auto-TLS in 3 lines. Replacement is net-negative work. |
| API | FastAPI/uvicorn | Keep | Single-file viable; right primitive for the job. |
| Queue | Redis | Replace | SELECT … FOR UPDATE SKIP LOCKED in Postgres serves low-moderate volume. Decisive: enqueue + state-write in one transaction kills the dual-write/outbox problem. -1 container, -1 process. |
| Worker | systemd unit | Keep | Right primitive. Don't containerize a single Python process. |
| Agent runtime | LangGraph | Keep | State machine pays for itself once routing has any conditionals or human-in-the-loop pauses. |
| Model gateway | LiteLLM container | Replace | A separate gateway exists to share credentials across many consumers. One worker, one operator → in-process Python module is ~80 LOC. -1 container, -1 HTTP hop, debug via stack trace. |
| State (RDBMS) | Postgres+pgvector | Keep | Workhorse. Queue, memories, approvals, skills, traces all fit here. |
| State (graph) | Neo4j+Graphiti | Replace/Remove | 1–2 GB RAM for Cypher's expressive win — unrealised unless the workload runs ≥3-hop traversals. Postgres kg_nodes/kg_edges + recursive CTEs covers shallow patterns. Empirical question; default action: drop. |
| Memory | LangMem | Merge | Wrap pgvector in-process; the abstraction is thin. |
| Tracing | Langfuse | Conditional | 2 GB resident with its own DB. Justified only if Jonah opens the UI daily. Otherwise → traces table + Grafana panel. |
| Metrics | Prometheus + Grafana + node-exporter + cAdvisor | Keep | ~1.5 GB total; legit operator UX; cAdvisor catches Postgres memory drift before OOM. |
| Adjacent | OpenHands | Remove | If it isn't in the task path, it's a 24/7 process for nothing. |
| Browser / code-exec sandbox | Playwright/Xvfb on host | Replace | Drop Xvfb (headless Playwright is native). Run browser + code-exec tools in ephemeral rootless Docker containers (cgroups, seccomp, no default network). Hostile inputs (web scraping, code exec) deserve real isolation, not host-level processes — same-host failure modes are leaked files, runaway memory, dependency pollution. This is the one added operational surface that pays for itself. |
| Self-improve | cron 6h | Keep | One crontab line. |
| Secrets | SOPS+age | Keep, harden | Decrypt to /run/agent/env (tmpfs) on systemd ExecStartPre, not persistent /opt/agent/core/.env. Removes plaintext-at-rest drift in app dir. |
| Backups | daily 3 AM cron | Replace | Daily cron + untested restore is theatre on a 75 GB host. Move to restic off-host (artifacts, config) + pg_dump -Fc + WAL archiving + weekly automated restore drill on a fresh VM. RPO 24 h → 5 min, RTO proven. |

Four unforced costs: Redis (Postgres exists), LiteLLM (one consumer), Xvfb (headless Playwright is native), and self-hosted Langfuse (its current shape pulls in ClickHouse + Redis/Valkey + blob storage — that's a separate observability platform). Neo4j is the fifth if its queries don't justify the RAM — empirical, not opinion.


2. First-Principles Requirements

The system is a durable task executor that runs LLM-driven workflows with tools, memory, and human-approval gates, serving one operator.

Essential (cannot remove):
- Authenticated HTTP submission, async execution, result-by-ID
- Durable queue with at-least-once delivery, crash-safe state
- Risk-tiered routing with human approval surface
- Multi-model routing with rate-limit / failure fallback
- Vector memory + structured task log
- Tool registry (web/file/db/browser/code)
- Trace + cost attribution per task
- Encrypted secrets, daily restorable backups

Incidental (currently present, not essential):
- Separate gateway process (LiteLLM)
- Separate broker (Redis)
- Separate graph DB (Neo4j) — unless workload proves otherwise
- Separate tracing service (Langfuse) — unless UX is used daily
- OpenHands

The system does not need: HA, multi-tenant isolation, distributed scheduling, service mesh, message-bus fan-out, zero-downtime upgrades. Those become real at 100x+ from here.


3. Proposed Architecture

                         ┌─────────────────────────────┐
                         │   Caddy :80/:443 (TLS)      │
                         └──────────────┬──────────────┘
                                        │
                         ┌──────────────▼──────────────┐
                         │ FastAPI (uvicorn) :8000     │
                         │ - POST /tasks  (auth)       │
                         │ - GET  /tasks/:id           │
                         │ - POST /approvals/:id       │
                         └──────────────┬──────────────┘
                                        │ INSERT (single txn:
                                        │ task_queue + task_log)
                         ┌──────────────▼──────────────┐
                         │ PostgreSQL 16 :5432         │
                         │ ├ task_queue (SKIP LOCKED)  │
                         │ ├ task_log                  │
                         │ ├ memories  (pgvector)      │
                         │ ├ kg_nodes, kg_edges        │
                         │ ├ approval_queue            │
                         │ ├ skills, skill_outcomes    │
                         │ └ traces (replaces Langfuse)│
                         └─────┬───────────────────▲───┘
                LISTEN/NOTIFY  │                   │ writes
                  + SKIP LOCKED│                   │
                         ┌─────▼───────────────────┴───┐
                         │ agent-worker (systemd)      │
                         │ ├ LangGraph runtime         │
                         │ ├ in-proc model gateway     │
                         │ │  → OpenRouter / Gemini    │
                         │ ├ tool registry             │
                         │ │  → web, fs, db, browser   │
                         │ ├ pgvector mem wrapper      │
                         │ └ structured trace emitter  │
                         └──────────────┬──────────────┘
                                        │ /metrics
                         ┌──────────────▼──────────────┐
                         │ Prometheus :9090            │
                         │ + node-exporter, cAdvisor   │
                         └──────────────┬──────────────┘
                                        │
                                ┌───────▼───────┐
                                │ Grafana :3002 │
                                │ + traces panel│
                                └───────────────┘

  Out-of-band:  Playwright sandbox containers (ephemeral, lazy-spawned by the browser/code tools)
                cron self_improve.py (every 6h)
                pg_dump nightly + WAL archive every 5 min → off-host

Component table

| Component | Tech | Purpose | Why it beats current | Replaces |
|---|---|---|---|---|
| Edge | Caddy | TLS + reverse proxy | (kept) | — |
| agentd-web | FastAPI/uvicorn (one mode of agentd package) | Submit/fetch/approve, auth, risk classifier | Same package as worker → one build, one deploy, shared types; only the systemd unit differs | agent-api as separate conceptual service |
| agentd-worker | Same agentd package, separate systemd unit, fixed concurrency | Run LangGraph; lease tasks; emit traces | Crash isolation without another service boundary; runs new in-proc gateway | Standalone worker codebase |
| agentd-sweeper | Same package, third systemd unit (or 60s tick inside worker) | Requeue expired leases; partition trace tables; enforce TTLs | Names the durability work currently implicit; one place to debug stuck tasks | (new — was implicit visibility-timeout) |
| Datastore | Postgres 16 + pgvector | Queue, state, memory, KG, traces, idempotency keys | One DB = one transaction = one backup; tasks (current) + task_attempts (history) separation gives clean retry visibility (DDL sketch below) | Redis, Neo4j, Langfuse storage |
| Queue mechanism | SELECT … FOR UPDATE SKIP LOCKED + LISTEN/NOTIFY + heartbeat | Durable at-least-once queue, lease-with-renewal | Same txn as state writes; NOTIFY wakes workers, polling handles missed signals | Redis lists |
| Model gateway | In-process module gateway.py (LiteLLM SDK or direct clients) | Tier routing, fallback on 429/5xx, per-provider circuit breaker with cooldown rows in Postgres, daily-spend cap on paid tier | No HTTP hop; <100 LOC; circuit-breaker state survives worker restart because it's a Postgres row, not in-memory | LiteLLM container |
| Agent runtime | LangGraph as library | Stateful tool-use loop | Library complexity OK; durability stays in Postgres, not LangGraph state | (kept) |
| Memory | pgvector wrapper (memories, episodes) | Semantic + episodic recall | One DB to back up | LangMem-as-service |
| KG | Postgres kg_nodes(id, type, props jsonb) + kg_edges(src, dst, rel, props jsonb) + recursive CTEs / pg_trgm | Entity/relation store | Shallow traversals adequate; -1 container, -1.5 GB; backup/inspect/repair trivially | Neo4j + Graphiti |
| Tracing | traces + model_calls Postgres tables (partitioned by month) + Grafana panel | Per-task spans, cost, model, latency, tool errors | Drop-by-partition is cheap; -2 GB; SQL-queryable; no ClickHouse/Redis/blob-storage stack to operate | Self-hosted Langfuse |
| Metrics | Prometheus + Grafana + node-exporter + cAdvisor | Host + container metrics | (kept) | — |
| Tool sandbox | Headless Playwright + ephemeral rootless Docker (cgroups, seccomp, no default network, per-task workspace) | Browser, code-exec, file-tool isolation | Real isolation for hostile inputs; per-task disposable; -1 background X server | Xvfb + on-host Playwright + OpenHands |
| Secrets | SOPS+age → /run/agent/env (tmpfs) on systemd ExecStartPre | Encrypted at rest, plaintext only in tmpfs | No plaintext-at-rest drift in app dir | Persistent /opt/agent/core/.env |
| Backups | restic off-host (artifacts, config) + pg_dump -Fc + 5-min WAL archive + weekly automated restore drill | Restorable durability | RPO 24 h → 5 min, RTO proven | Nightly cron-only |

Container count: 11 → 5 (Caddy, Postgres, Prometheus, Grafana, cAdvisor). agentd-web/agentd-worker/agentd-sweeper are systemd-managed Python from one codebase. Disposable tool-sandbox containers are spawned per-task and reaped, not counted in steady-state.
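
To pin down the datastore row, here is the shape of the queue/attempt split. This is a minimal sketch, not final DDL: table names follow the component table, but column types, defaults, and the lease fields are assumptions.

    # Illustrative DDL for the queue/attempt split, applied with psycopg 3.
    # Types, defaults, and lease fields are assumptions, not shipped schema.
    import psycopg

    TASK_QUEUE_DDL = """
    CREATE TABLE IF NOT EXISTS task_queue (
        id              bigserial PRIMARY KEY,
        payload         jsonb       NOT NULL,
        status          text        NOT NULL DEFAULT 'queued',  -- queued|running|done|failed
        idempotency_key text UNIQUE,            -- duplicate submits collapse onto one row
        claimed_by      text,                   -- worker identity holding the lease
        claimed_at      timestamptz,            -- lease start; sweeper requeues after expiry
        attempts        int         NOT NULL DEFAULT 0,
        created_at      timestamptz NOT NULL DEFAULT now()
    )
    """

    TASK_ATTEMPTS_DDL = """
    CREATE TABLE IF NOT EXISTS task_attempts (  -- history: one row per try
        id          bigserial PRIMARY KEY,
        task_id     bigint      NOT NULL REFERENCES task_queue(id),
        started_at  timestamptz NOT NULL DEFAULT now(),
        finished_at timestamptz,
        outcome     text,                       -- ok | error | lease_expired
        error       text
    )
    """

    def apply_schema(dsn: str) -> None:
        with psycopg.connect(dsn) as conn:
            conn.execute(TASK_QUEUE_DDL)
            conn.execute(TASK_ATTEMPTS_DDL)

The UNIQUE idempotency key is what turns a retried submission into a no-op instead of a duplicate task; task_attempts keeps retry history out of the hot queue row.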

Failure modes & recovery

| What dies | Blast radius | Recovery |
|---|---|---|
| Postgres | All tasks halt; API returns 503 | systemd restart; on data loss restore pg_dump + replay WAL. One thing to restore. |
| FastAPI | Submission fails; in-flight tasks unaffected (owned by worker rows in DB) | systemd restart |
| Worker | Tasks stop progressing; leased rows held until lease expiry (claimed_at < now() - interval '5 min') | agentd-sweeper requeues expired leases; task_attempts table prevents hidden duplicates; systemd restart resumes |
| OpenRouter rate-limit | One model unavailable | In-proc gateway opens per-provider circuit breaker (cooldown row in Postgres), falls back: free A → free B → direct Gemini → paid Claude (only if urgency=high AND under daily cap) |
| All free models down | Task waits or escalates | Paid route only via policy + daily-spend cap; fall back to human approval if cap blown |
| Tool sandbox crash / escape attempt | Per-task container killed; host unaffected | Rootless Docker + cgroups + seccomp + no default network bound the blast radius; agent retries with fresh container |
| Caddy | No external access | systemd restart; cert state on disk |
| Disk pressure | Writes fail first (traces, artifacts, WAL) | Alert at 70% / 80% / 85%; admission control at 85% (refuse new submissions; middleware sketch below); artifact TTL + monthly trace partition drop |
| Backup target unavailable | Backups fail; service runs | Alert after one miss; do not prune local last-good snapshot until remote restic run succeeds |
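
The disk-pressure row is the one that needs code rather than an alert. A minimal sketch of the admission check in agentd-web, assuming FastAPI; the mount path, threshold, and Retry-After value are placeholders:

    # Hedged sketch: refuse new task submissions above 85% disk usage.
    import shutil
    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse

    app = FastAPI()
    DATA_MOUNT = "/"            # assumption: one mount to watch
    ADMISSION_THRESHOLD = 0.85

    @app.middleware("http")
    async def disk_admission(request: Request, call_next):
        # Only gate new work; reads, approvals, and health checks pass through.
        if request.method == "POST" and request.url.path.startswith("/tasks"):
            du = shutil.disk_usage(DATA_MOUNT)
            if du.used / du.total >= ADMISSION_THRESHOLD:
                return JSONResponse(
                    status_code=503,
                    content={"detail": "disk above 85%; submissions paused"},
                    headers={"Retry-After": "600"},
                )
        return await call_next(request)

Gating only POST /tasks means the operator can still read results and clear approvals while the host is full.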

Scaling path

| Scale | Action | What changes | What stays |
|---|---|---|---|
| 1x (today) | Baseline | — | All co-resident |
| 10x | Tune Postgres (shared_buffers 4→8 GB, work_mem, autovacuum); spawn 2–4 worker processes; partition traces and model_calls by month (drop-by-partition cheap); cap artifact retention; add PgBouncer only if active connections >50; add 2nd worker VM only if browser/code sandboxes saturate RAM | Same architecture; same API contract | Postgres queue, model policy tables, memory/KG schema |
| 100x | Move Postgres to managed (Neon/RDS) or split: primary for state, replica for traces; horizontal worker pool on a 2nd VM | Postgres location + 2nd VM | API code, worker code, agent loop unchanged |
| 1000x | Introduce real broker (NATS/Redis Streams) only when measured pg queue throughput > 2k jobs/sec | Queue tech | Postgres still primary state |

Resource budget (steady-state, 30 GB host)

| Service | RAM | CPU (steady / burst) | Disk |
|---|---|---|---|
| OS + systemd + Docker daemon | 1.5–2.5 GB | <1 idle | 8–12 GB |
| Caddy | 50 MB | 0.05 / 0.5 | 50 MB |
| agentd-web | 300–600 MB (1 GB cap) | 0.2 / 1 | — |
| agentd-worker (4 workers @ 0.7–1.2 GB) | 3–5 GB (6 GB cgroup cap) | 0.5 / 8 | metadata in PG |
| agentd-sweeper | 100 MB | negligible | — |
| Tool sandboxes (max 2 active) | 0 idle / 3 GB peak (1.5 GB each, cgroup-capped) | 0 / 4 | 5–8 GB tmp w/ TTL |
| Postgres (tuned) | 3–4 GB steady, 6 GB cap | 1 / 4 | 18–25 GB initial, alert at 35 GB |
| Prometheus + node-exporter | 0.7–1.2 GB | 0.2 / 1 | 4–6 GB (15-day retention) |
| Grafana | 250–400 MB | 0.05 / 0.5 | <1 GB |
| journald (JSON) | bounded | negligible | 2 GB cap |
| restic backup runs (nightly) | 200–500 MB during backup | 1–2 vCPU nightly | no large local retention |
| Steady state | ~9–12 GB | ~3 vCPU | ~45 GB |
| Expected peak | 18–21 GB (4 active workers + 2 sandboxes + Postgres warm) | 8 vCPU burst | — |
| Headroom | 8–10 GB RAM for page cache and burst | — | ~25 GB disk |

Swap policy: emergency only. vm.swappiness=10; alert on sustained swap or >512 MB swap used. Workers and sandboxes get cgroup memory caps. Do not tune the system to depend on the 4 GB swap.

Disk policy: 75 GB is tight. Full prompt/tool traces require TTLs. Backups go off-host. If trace/artifact retention beyond 30 days matters, buy more disk before adding services.


4. Tradeoff Analysis

Postgres queue, not Redis Streams or NATS.
Redis with AOF appendfsync=always kills throughput; appendfsync=everysec loses up to 1 s of work on a crash. NATS JetStream is durable but adds a service. SKIP LOCKED on this hardware hits ~2k jobs/sec — ample headroom. The decisive argument is not perf but consistency: enqueue + state-update in one transaction eliminates the dual-write problem (task enqueued but state-write failed, or vice versa). With Redis the fix is either a transactional outbox (more code, more bugs) or accepted drift (more bugs). Reject Redis on consistency, not perf.
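
A minimal sketch of that claim path, assuming psycopg >= 3.2 (for notifies(timeout=...)), the task_queue schema from §3, and that the API's enqueue INSERT issues NOTIFY task_queue in the same transaction; execute_task is a placeholder for the agent loop:

    # Minimal claim loop: lease via SKIP LOCKED, wake early via LISTEN/NOTIFY.
    import psycopg

    CLAIM_SQL = """
    UPDATE task_queue
       SET status = 'running', claimed_by = %(worker)s,
           claimed_at = now(), attempts = attempts + 1
     WHERE id = (SELECT id FROM task_queue
                  WHERE status = 'queued'
                  ORDER BY created_at
                  FOR UPDATE SKIP LOCKED
                  LIMIT 1)
    RETURNING id, payload;
    """

    def run_worker(dsn: str, worker_id: str) -> None:
        with psycopg.connect(dsn, autocommit=True) as conn:
            conn.execute("LISTEN task_queue")
            while True:
                row = conn.execute(CLAIM_SQL, {"worker": worker_id}).fetchone()
                if row is None:
                    # Missed NOTIFYs degrade to a 5 s poll, never a stuck worker.
                    for _ in conn.notifies(timeout=5.0):
                        break
                    continue
                task_id, payload = row
                execute_task(task_id, payload)  # placeholder: LangGraph run + attempt row

The lease is just status='running' plus claimed_at; the sweeper's requeue query is the same WHERE clause with the 5-minute expiry added.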

In-process model gateway, not LiteLLM.
LiteLLM's value is credential isolation across many consumers. With one consumer, the separate process buys nothing and costs plenty: another container healthcheck, another HTTP hop, another :4000 socket, another set of logs to merge during incident review. Tier routing — "try free A, on 429 try free B, on still-failing AND urgency=high try paid C" — is ~50 LOC. Reject LiteLLM on operability, not capability.
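
A sketch of that routing core under stated assumptions: TIERS, call_provider, and RateLimited are stand-ins for the real model list, client call, and 429/5xx errors; provider_health is the cooldown table from §3; the daily-spend cap is omitted for brevity:

    # Tier routing with a Postgres-backed circuit breaker (survives restarts).
    import psycopg

    TIERS = ["free_a", "free_b", "gemini_direct", "paid_claude"]  # illustrative names

    class RateLimited(Exception):  # stand-in for provider 429/5xx errors
        pass

    def open_breaker(conn, provider: str, err: str, cooldown: str = "5 minutes") -> None:
        conn.execute(
            """INSERT INTO provider_health (provider, opened_until, last_error)
               VALUES (%s, now() + %s::interval, %s)
               ON CONFLICT (provider) DO UPDATE
                 SET opened_until = EXCLUDED.opened_until,
                     last_error   = EXCLUDED.last_error""",
            (provider, cooldown, err),
        )

    def complete(conn, prompt: str, urgency: str = "normal") -> str:
        blocked = {p for (p,) in conn.execute(
            "SELECT provider FROM provider_health WHERE opened_until > now()")}
        for provider in TIERS:
            if provider in blocked:
                continue
            if provider == "paid_claude" and urgency != "high":
                continue  # paid tier is policy-gated (and spend-capped)
            try:
                return call_provider(provider, prompt)  # stand-in client call
            except RateLimited as exc:
                open_breaker(conn, provider, str(exc))
        raise RuntimeError("all routes exhausted: wait, or escalate to approval")

Because the breaker is a row, a worker restart mid-cooldown resumes with the same routing decision instead of hammering a rate-limited provider.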

Drop Neo4j (probably).
Cypher genuinely beats SQL for k-hop pattern matching. The honest question: does the workload run those? If Graphiti is a glorified entity store with ≤2-hop queries, Postgres (src, dst, rel) + recursive CTE handles it at a fraction of the RAM. If KG queries grow 4-hop or pattern-heavy in 6 months, you'll regret this. Mitigation: keep the Graphiti schema dump + a re-import path documented — regret-cost is one weekend, not a rewrite.
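
For scale, the 2-hop query shape this bets on, assuming the kg_edges schema from §3 and a psycopg-style connection; the depth cap is what keeps cyclic graphs terminating:

    # Bounded k-hop neighborhood over kg_edges with a recursive CTE.
    NEIGHBORHOOD_SQL = """
    WITH RECURSIVE walk(node, depth) AS (
        SELECT dst, 1 FROM kg_edges WHERE src = %(start)s
      UNION                                  -- UNION (not ALL) dedupes revisits
        SELECT e.dst, w.depth + 1
          FROM kg_edges e
          JOIN walk w ON e.src = w.node
         WHERE w.depth < %(max_depth)s       -- depth cap bounds cyclic graphs
    )
    SELECT DISTINCT node FROM walk;
    """

    def neighborhood(conn, start: str, max_depth: int = 2) -> list[str]:
        rows = conn.execute(NEIGHBORHOOD_SQL,
                            {"start": start, "max_depth": max_depth}).fetchall()
        return [node for (node,) in rows]

At max_depth=4 with pattern constraints this query gets ugly fast — which is exactly the empirical line past which Neo4j earns its RAM back.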

Drop Langfuse (conditionally).
The UI is genuinely good for trace exploration. But it's 2 GB resident with its own DB. Decisive question: is the UI opened during normal debugging, or only during incidents? Daily → keep with MemoryMax=2G. Weekly/monthly → structured traces rows + Grafana panel gives 80% UX at 5% RAM. Honest loss: rich span-tree visualization.
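
The traces replacement is mostly schema. A sketch of the monthly partitioning, with column names assumed rather than final; retention becomes an O(1) partition drop in agentd-sweeper instead of a long DELETE:

    # Month-partitioned trace storage; retention = DROP TABLE on aged partitions.
    TRACES_DDL = """
    CREATE TABLE IF NOT EXISTS traces (
        task_id    bigint      NOT NULL,
        started_at timestamptz NOT NULL,
        span       jsonb       NOT NULL      -- name, duration, model, cost, error
    ) PARTITION BY RANGE (started_at);

    CREATE TABLE IF NOT EXISTS traces_2026_05 PARTITION OF traces
        FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');
    """

    def drop_expired_partition(conn, partition_name: str) -> None:
        # Called from agentd-sweeper's monthly tick; instant vs. a long DELETE.
        conn.execute(f"DROP TABLE IF EXISTS {partition_name}")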

Keep LangGraph.
Counter-position: a while not done: think → tool → observe loop is 30 lines. True until conditional edges, retries, parallel tool calls, and human-in-the-loop pauses appear — at which point you've reinvented LangGraph badly. Single-process, no external deps. Keep.

Keep cAdvisor with fewer containers.
~50 MB for per-container resource attribution that node-exporter doesn't give. Postgres becoming the dominant container makes its memory trajectory the single most valuable signal before OOM. Cheap insurance.

What I'm trading:
- Operability for capability: in-process LangGraph means one bad agent loop can OOM the worker. Mitigation: MemoryMax=4G in the systemd unit + auto-restart.
- Capability for cost: dropping Neo4j gives up Cypher. Acceptable if the workload is shallow.
- Latency for cost: dropping LiteLLM cuts ~5–20 ms per call (HTTP loopback). Net win.


5. Migration Plan

All steps reversible. Execute in order.

| # | Step | Verify | Rollback | Downtime |
|---|---|---|---|---|
| 1 | Add task_queue(id, payload jsonb, status, claimed_at, attempts, created_at). Wire FastAPI to dual-write Redis + Postgres. Worker still reads Redis only. | Row counts match for 24 h. | Drop the Postgres dual-write code path. | Zero |
| 2 | Switch worker to read from Postgres SKIP LOCKED. Keep Redis writes 7 more days as audit. | Queue depth, throughput, error rate match prior baseline. No tasks stuck >5 min. | QUEUE_BACKEND=redis env flip. | Zero |
| 3 | Stop Redis writes; remove Redis container. | Redis-down alert acknowledged; nothing else depends on it. | docker compose up redis, re-enable dual-write. | Zero |
| 4 | Inventory + restore drill on current backups (Codex-adopted). Run pg_dump/restic restore to scratch DB on a fresh VM; verify counts and API smoke test. | Restored DB opens; row counts match; API works against it. | Read-only; nothing to roll back. | Zero |
| 5 | Add agentd package with web / worker / sweeper modes. Deploy as 3 systemd units alongside current services. New units idle (no traffic). | agentd-web healthcheck local; agentd-sweeper logs no errors; new units restart cleanly. | Stop new units. | Zero |
| 6 | Add task_attempts table + idempotency-key column on tasks. Backfill from existing task log. | Sample tasks show one-row-per-attempt; duplicates blocked by idempotency key. | Drop new column / table. | Zero |
| 7 | Implement gateway.py in-process with per-provider circuit-breaker rows (provider_health(provider, opened_until, last_error)). Run in shadow mode: calls still go through LiteLLM, app computes chosen route+cost separately, log diff. | Route/cost decisions match or improve LiteLLM's; fallback simulator triggers breaker correctly. | Disable shadow logging. | Zero |
| 8 | Switch model calls to in-process router. LiteLLM proxy stays running but unused for one retention window. | Provider failure injection triggers fallback chain; daily-spend cap blocks paid tier when exceeded. | Repoint env to LiteLLM. | Zero |
| 9 | Stop LiteLLM container. | No 502s; no missing trace fields. | docker compose up litellm. | Zero |
| 10 | Migrate tool sandbox to ephemeral rootless Docker + headless Playwright (Codex-adopted). Ship per-task container spawner with cgroups, seccomp, no-default-network (spawner sketch below). Canary: 10% of browser/code tasks. | Same task results; container reaped after each task; no host file pollution. | Flip env back to host Playwright/Xvfb. | Zero |
| 11 | Cut all sandbox tasks to Docker. Stop Xvfb, remove its package. | Browser tool works for 7 days. | Restart Xvfb. | Zero |
| 12 | Audit Graphiti queries: instrument 30-day query log. If <100/day AND max depth ≤2 → migrate to Postgres kg_* (one-time export). If heavy → keep Neo4j. | Agent KG-tool returns identical results on 20-task replay. | Re-point agent at Neo4j. | ~30 min maintenance window for cutover |
| 13 | If Step 12 cutover succeeded: stop Neo4j container. | KG tool works for 7 days. | Restart Neo4j. | Zero |
| 14 | Audit Langfuse UI usage (last login, view counts). If unused: implement traces + model_calls Postgres tables (partitioned by month) + Grafana panel; cut over agent trace emitter; dual-emit to Langfuse for one retention window. | New panel shows last-hour spans, costs, tool errors. | Repoint emitter at Langfuse. | Zero |
| 15 | Stop Langfuse if Step 14 succeeded. | Trace queries served from Postgres for 7 days. | Restart Langfuse. | Zero |
| 16 | Stop OpenHands if not in any task path (audit task_log for tool calls in last 30 d). | No alerts, no skill calls fail. | Restart container. | Zero |
| 17 | Move SOPS decrypt target to /run/agent/env (tmpfs) via systemd ExecStartPre. Update all units. | Plaintext env absent from app dir after reboot; services read credentials. | Point units back to old .env. | Brief restart per unit |
| 18 | Postgres tuning: shared_buffers=4GB, effective_cache_size=12GB, work_mem=32MB, max_connections=50, autovacuum_vacuum_scale_factor=0.05. | pg_stat_statements p95 unchanged or improves. | Revert postgresql.conf. | ~30 sec restart |
| 19 | Replace backup cron with restic off-host + pg_dump -Fc + WAL archive + weekly automated restore drill. Keep old cron until new backup has 2 successful runs. | Fresh restore on scratch VM produces working API; restic prune logs healthy. | Re-enable cron. | Zero |
| 20 | Add disk admission control at 85% in agentd-web (returns 503 with retry-after); enable monthly trace partition drop in agentd-sweeper. | Synthetic disk-fill test triggers admission denial. | Disable admission flag. | Zero |

Estimated wall-clock: 6–8 weeks at one solo-operator session per week, mostly bake time + audit windows.
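
Steps 10–11 hinge on the per-task spawner referenced above. A hedged sketch via the docker CLI; the agent-sandbox image name, memory cap, and timeout are placeholders drawn from the resource budget, not shipped code:

    # Per-task disposable sandbox via the docker CLI (rootless daemon assumed).
    import subprocess
    import tempfile

    def run_sandboxed(cmd: list[str], image: str = "agent-sandbox") -> subprocess.CompletedProcess:
        workspace = tempfile.mkdtemp(prefix="task-")     # per-task scratch dir
        return subprocess.run(
            ["docker", "run", "--rm",                    # reaped on exit
             "--network", "none",                        # no default network
             "--memory", "1536m",                        # cgroup cap per budget
             "--pids-limit", "256",                      # no fork bombs
             "-v", f"{workspace}:/workspace",
             image, *cmd],
            capture_output=True, text=True, timeout=600,
        )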


6. Open Questions

  1. What is daily task volume and peak burst rate? <100/day → even more aggressive simplification (drop Prometheus, structured logs to file/Loki). >10k/day → queue tuning order changes.

  2. Is the Graphiti KG queried at ≥3-hop depth or with pattern-matching, or used as an entity store? Decides Steps 12–13 — drop Neo4j or keep it.

  3. Does Jonah open the Langfuse UI during normal debugging, or only during incidents? Decides Steps 14–15 — drop Langfuse or contain it.

  4. Realistic agent loop length distribution — single tool call, or 10+ tool turns? Determines whether worker memory budget of 2.5 GB peak is sufficient or needs to be 4 GB.

  5. Is OpenHands invoked by any task or skill, ever? Decides Step 16 immediately.

  6. Does any current task genuinely need >1 concurrent worker today, or is the queue functionally serial? Affects Postgres max_connections and the 10x scaling assumption.

  7. What is the cost-of-outage tolerance? A 30-min Postgres restore window is acceptable as described. If there's a real-time SLA (Brian responding to Jonah on TG), a hot standby Postgres replica becomes worth its 4 GB RAM and the design needs a streaming-replication arm.

  8. Is code execution fully untrusted (e.g., agent generates and runs arbitrary Python/JS), or only owner-authored scripts? Fully untrusted may justify a second disposable worker VM for sandbox-only workloads — even rootless Docker shares a kernel with Postgres. Cheap insurance against kernel CVEs.

  9. What RPO/RTO is actually needed? pg_dump only = 24 h RPO. WAL archiving = 5 min RPO, ~30 min RTO. Streaming replica + WAL = <1 min RPO, <5 min RTO at the cost of a second host. The design defaults to WAL archiving; upgrade only if 5 min loss is unacceptable.

  10. Will anything besides this agent need a shared LLM gateway in the next 6 months? If yes (e.g., a separate chat UI, a CRM webhook handler), LiteLLM proxy may belong back in the diagram behind the same router interface. If no, in-process stays.


7. Adopted from Peer Review (Codex, 2026-04-25)

Codex's redesign reached the same five top-level cuts (Redis, LiteLLM, Neo4j, Langfuse, OpenHands) but went further on three axes I'd undersold. Adopted into this document:

| # | Adoption | Where applied | Why it improves the plan |
|---|---|---|---|
| 1 | Drop Xvfb; run browser + code-exec in ephemeral rootless Docker (cgroups, seccomp, no default network) | §1 row, §3 Tool sandbox row, §3 failure modes (sandbox escape), §5 Steps 10–11 | Same-host Playwright leaks files / runs uncapped on hostile inputs. The added Docker daemon (~150 MB + 8 GB image disk) buys real isolation — the one added surface that earns its keep. |
| 2 | agentd single package with web / worker / sweeper systemd modes | §3 component table, §5 Step 5 | One build, one deploy, shared types. Sweeper names the durability work that was implicit in my "visibility timeout" hand-wave. |
| 3 | task_attempts table + idempotency keys on tasks | §3 datastore row, §5 Step 6 | Clean retry/duplicate semantics. My single task_log lumped current state with history. |
| 4 | SOPS decrypt → /run/agent/env tmpfs, not persistent .env | §1, §3, §5 Step 17 | Closes the plaintext-at-rest gap that contradicts SOPS's own promise. |
| 5 | restic off-host + weekly automated restore drill (vs my monthly) | §1, §3, §5 Steps 4 & 19 | Backup without restore is theatre; weekly drill on a solo system catches drift before the incident. |
| 6 | Per-provider circuit-breaker as Postgres rows with cooldown timestamps | §3 model gateway, §3 failure modes, §5 Step 7 | State survives worker restart; in-memory breakers don't. |
| 7 | Disk admission control at 85%, not just alerts | §3 failure modes, §5 Step 20 | Refusing new submissions is the only thing that prevents a WAL-write death spiral on a 75 GB host. |
| 8 | Trace + model_call tables partitioned by month | §3 tracing row, §3 scaling 10x | Drop-by-partition is O(1); my "truncate >30d" was a long DELETE. |
| 9 | Shadow-mode validation for the in-process router cutover | §5 Step 7 | Stronger than feature-flag canary: compares route decisions against LiteLLM's before any cutover. |
| 10 | Resource budget peak revised up to 18–21 GB (4 workers + 2 active sandboxes + Postgres warm) | §3 resource budget | More honest than my 13 GB peak; preserves a real headroom claim. |
| 11 | PgBouncer-only-when-needed (>50 connections) at 10x | §3 scaling table | Names the threshold; avoids a cargo-culted pooler. |
| 12 | Disposable-worker-VM caveat for fully untrusted code | §6 Q8 | Honest acknowledgment that rootless Docker shares a kernel. |

Rejected from Codex (kept my position):

| Codex point | Reason rejected |
|---|---|
| Drop cAdvisor "unless container metrics prove necessary" | Postgres-as-container memory drift is the highest-value pre-OOM signal. ~50 MB is cheap insurance. |
| No specific Postgres tuning numbers | Solo operator needs concrete defaults to start from, not "tune later". Kept shared_buffers=4GB, effective_cache_size=12GB, work_mem=32MB, max_connections=50. |
| Drop Langfuse outright | Kept conditional — if the UI is opened daily, the trace UX is genuinely worth 2 GB. Audit before cutting. |
| Assumed Neo4j removal is safe | Kept the empirical query-audit step. Cypher's k-hop expressiveness is real if the workload uses it. |

Net effect on the design: the diagram gets one new component (Docker daemon for sandboxes), the migration goes from 12 to 20 steps (each smaller and more reversible), and the system becomes meaningfully harder to compromise via hostile inputs. RAM peak honesty went up; nothing else got bigger.


8. Actual Rollout Outcome (2026-04-25 21:00 Beirut)

This section is the diff between §5 Migration Plan (intent) and what actually executed tonight. It supersedes the corresponding §5 rows where it conflicts. Rollback artifacts are enumerated at the end.

8.1 What landed live

| Step (§5 ref) | Component | Outcome | Verification |
|---|---|---|---|
| 18 | Postgres tuning (shared_buffers=4GB, effective_cache_size=12GB, work_mem=32MB, max_connections=50, autovacuum_vacuum_scale_factor=0.05) | Applied. ~2 sec restart. | SHOW shared_buffers = 4GB; cAdvisor shows steady warm-up to 4 GB resident. |
| 19 | restic off-host backups to contabo target | Live. 3 snapshots taken (initial, post-Neo4j-migration, post-queue-cutover). | restic snapshots lists 3 IDs incl. ed8aea1a. |
| 14–15 | Langfuse stopped | Container stopped, restart-policy=no, image preserved. Trace emitter rerouted to traces/model_calls Postgres tables. | docker ps no longer shows langfuse-*; traces row count rising. |
| 12–13 | Neo4j stopped + KG migrated to Postgres | 1616 nodes / 2826 edges exported and imported into kg_nodes / kg_edges. Neo4j container stopped, restart-policy=no. | SELECT count(*) FROM kg_nodes = 1616; kg_edges = 2826. Agent KG-tool replay returned identical results. |
| 16 | OpenHands stopped | Container stopped, restart-policy=no, preserved. | docker ps no longer shows OpenHands; no skill calls failed in 2-hour soak. |
| 1–2 | Queue migrated to Postgres (QUEUE_BACKEND=postgres) | Worker confirmed claiming via claimed_by column on task_queue. Redis still receiving writes (audit window not yet closed). | SELECT id, claimed_by, claimed_at FROM task_queue WHERE claimed_by IS NOT NULL returns active rows; no tasks stuck >5 min. |
| 17 | tmpfs SOPS wired (/run/agent/env via systemd ExecStartPre) | All units now decrypt at start; no plaintext .env re-read after reboot test. | Post-reboot ls /opt/agent/core/.env* shows backup only; /run/agent/env populated. |
| 20 | Disk admission middleware | Wired into agentd-web. Returns 503 + retry-after at 85%. | Synthetic disk-fill (loop file in /var/tmp) triggered admission denial; cleared on file delete. |
| 10 | Sandbox image built | Rootless Docker image ready (cgroups, seccomp, no default network). NOT yet wired into tool dispatcher — see §8.2. | docker images \| grep agent-sandbox shows tagged image. |
| 7 | Gateway code wired into agent.py | gateway.py imported; routing through in-process module under shadow flag. | Shadow log shows route/cost decisions matching LiteLLM for the soak window. |

8.2 Deferred (with reason and unblock)

| Item | Why it didn't ship tonight | Unblock |
|---|---|---|
| Stop agent-redis container (§5 Step 3) | Dependency audit (redis_consumer_audit.md) found 14 other code paths still using Redis: chrome_bridge, circuit_breaker, workflow_bus, brian_roles, telegram_bot, cc_routes, collab_orchestrator, plus 7 others. Each is its own scope of change. | Per-consumer migration plan; cannot be one cutover. Redis stays up until each consumer is audited + cut over individually. |
| Flip GATEWAY_MODE=live (§5 Step 8) | First attempt returned 429 on all 5 tiers because gateway.py was reading env var GOOGLE_API_KEY, while LiteLLM (and the actual project quota) uses GOOGLE_AI_STUDIO_KEY — different keys, different quotas. Fix landed at gateway.py:347 (now reads GOOGLE_AI_STUDIO_KEY first, falls back to GOOGLE_API_KEY; sketch below). | Live cutover deferred to next session with proper unit-test validation of the env-var precedence + a single non-shadow synthetic call before flipping the flag. |
| Browser / code-exec → Docker sandbox cutover (§5 Steps 10–11) | Sandbox image is built, but the tool dispatcher refactor that injects per-task containers is still pending. | Refactor tool_registry.py browser/code paths to spawn-via-image instead of host process; canary 10% per Step 10. |
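
The shape of the gateway.py:347 fix, reconstructed as a sketch (not the actual code):

    # Env-var precedence: prefer the key that carries the project quota,
    # fall back to the legacy name so older setups keep working.
    import os

    def google_key() -> str:
        key = os.environ.get("GOOGLE_AI_STUDIO_KEY") or os.environ.get("GOOGLE_API_KEY")
        if not key:
            raise RuntimeError("no Google key in environment")
        return key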

8.3 Bugs caught during execution

These are filed as feedback memory entries so future sessions don't repeat them:

  1. .env file newline corruption — an echo >> appended without a leading \n, mashing two keys onto one line and breaking dotenv parse for downstream consumers. Fix: always printf '\n%s=%s\n' or use a python-dotenv set call (sketch after this list). → feedback_env_file_newline_safety.md.

  2. Off-host backup target tunnel-vision — defaulted to "rent B2/Wasabi" before realizing we already own four off-host targets (contabo VPS, GoogleDrive mount, Hetzner volume, GitHub). Asked Jonah for B2 unnecessarily. → feedback_compose_full_toolbox_backup_targets.md.

  3. LiteLLM key naming — GOOGLE_AI_STUDIO_KEY vs GOOGLE_API_KEY are different keys with different quota meters. Code paths that mix them silently pull from the wrong (often empty) bucket and surface as 429. → reference_litellm_key_naming.md.
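
For bug #1, the safe write path is a rewrite rather than an append; python-dotenv's set_key handles the newline bookkeeping (path illustrative):

    # set_key rewrites the file with correct newlines, unlike a raw `echo >>`.
    from dotenv import set_key

    set_key("/opt/agent/core/.env", "NEW_KEY", "value")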

8.4 Rollback artifacts available

8.5 Container count

8.6 RAM headroom reclaimed


End of document.