Author: Brian | Date: 2026-04-25 | Host: ubuntu-8gb-hel1-1 (Hetzner, 16 vCPU AMD EPYC-Rome, 30 GB RAM, 75 GB disk, 24 GB free, 4 GB swap)
| Layer | Component | Verdict | Reason |
|---|---|---|---|
| Edge | Caddy | Keep | Auto-TLS in 3 lines. Replacement is net-negative work. |
| API | FastAPI/uvicorn | Keep | Single-file viable; right primitive for the job. |
| Queue | Redis | Replace | SELECT … FOR UPDATE SKIP LOCKED in Postgres serves low-moderate volume. Decisive: enqueue + state-write in one transaction kills the dual-write/outbox problem. -1 container, -1 process. |
| Worker | systemd unit | Keep | Right primitive. Don't containerize a single Python process. |
| Agent runtime | LangGraph | Keep | State machine pays for itself once routing has any conditionals or human-in-the-loop pauses. |
| Model gateway | LiteLLM container | Replace | A separate gateway exists to share credentials across many consumers. One worker, one operator → in-process Python module is ~80 LOC. -1 container, -1 HTTP hop, debug via stack trace. |
| State (RDBMS) | Postgres+pgvector | Keep | Workhorse. Queue, memories, approvals, skills, traces all fit here. |
| State (graph) | Neo4j+Graphiti | Replace/Remove | 1–2 GB RAM for Cypher's expressive win — unrealised unless the workload runs ≥3-hop traversals. Postgres kg_nodes/kg_edges + recursive CTEs covers shallow patterns. Empirical question; default action: drop. |
| Memory | LangMem | Merge | Wrap pgvector in-process; the abstraction is thin. |
| Tracing | Langfuse | Conditional | 2 GB resident with its own DB. Justified only if Jonah opens the UI daily. Otherwise → traces table + Grafana panel. |
| Metrics | Prometheus + Grafana + node-exporter + cAdvisor | Keep | ~1.5 GB total; legit operator UX; cAdvisor catches Postgres memory drift before OOM. |
| Adjacent | OpenHands | Remove | If it isn't in the task path, it's a 24/7 process for nothing. |
| Browser / code-exec sandbox | Playwright/Xvfb on host | Replace | Drop Xvfb (headless Playwright is native). Run browser + code-exec tools in ephemeral rootless Docker containers (cgroups, seccomp, no default network). Hostile inputs (web scraping, code exec) deserve real isolation, not host-level processes — same-host failure modes are leaked files, runaway memory, dependency pollution. This is the one added operational surface that pays for itself. |
| Self-improve | cron 6h | Keep | One crontab line. |
| Secrets | SOPS+age | Keep, harden | Decrypt to /run/agent/env (tmpfs) on systemd ExecStartPre, not persistent /opt/agent/core/.env. Removes plaintext-at-rest drift in app dir. |
| Backups | daily 3 AM cron | Replace | Daily cron + untested restore is theatre on a 75 GB host. Move to restic off-host (artifacts, config) + pg_dump -Fc + WAL archiving + weekly automated restore drill on a fresh VM. RPO 24h → 5 min, RTO proven. |
Four unforced costs: Redis (Postgres exists), LiteLLM (one consumer), Xvfb (headless Playwright is native), and self-hosted Langfuse (its current shape pulls in ClickHouse + Redis/Valkey + blob storage — that's a separate observability platform). Neo4j is the fifth if its queries don't justify the RAM — empirical, not opinion.
The system is a durable task executor that runs LLM-driven workflows with tools, memory, and human-approval gates, serving one operator.
Essential (cannot remove):
- Authenticated HTTP submission, async execution, result-by-ID
- Durable queue with at-least-once delivery, crash-safe state
- Risk-tiered routing with human approval surface
- Multi-model routing with rate-limit / failure fallback
- Vector memory + structured task log
- Tool registry (web/file/db/browser/code)
- Trace + cost attribution per task
- Encrypted secrets, daily restorable backups
Incidental (currently present, not essential):
- Separate gateway process (LiteLLM)
- Separate broker (Redis)
- Separate graph DB (Neo4j) — unless workload proves otherwise
- Separate tracing service (Langfuse) — unless UX is used daily
- OpenHands
The system does not need: HA, multi-tenant isolation, distributed scheduling, service mesh, message-bus fan-out, zero-downtime upgrades. Those become real at 100x+ from here.
┌─────────────────────────────┐
│ Caddy :80/:443 (TLS) │
└──────────────┬──────────────┘
│
┌──────────────▼──────────────┐
│ FastAPI (uvicorn) :8000 │
│ - POST /tasks (auth) │
│ - GET /tasks/:id │
│ - POST /approvals/:id │
└──────────────┬──────────────┘
│ INSERT (single txn:
│ task_queue + task_log)
┌──────────────▼──────────────┐
│ PostgreSQL 16 :5432 │
│ ├ task_queue (SKIP LOCKED) │
│ ├ task_log │
│ ├ memories (pgvector) │
│ ├ kg_nodes, kg_edges │
│ ├ approval_queue │
│ ├ skills, skill_outcomes │
│ └ traces (replaces Langfuse)│
└─────┬───────────────────▲───┘
LISTEN/NOTIFY │ │ writes
+ SKIP LOCKED│ │
┌─────▼───────────────────┴───┐
│ agent-worker (systemd) │
│ ├ LangGraph runtime │
│ ├ in-proc model gateway │
│ │ → OpenRouter / Gemini │
│ ├ tool registry │
│ │ → web, fs, db, browser │
│ ├ pgvector mem wrapper │
│ └ structured trace emitter │
└──────────────┬──────────────┘
│ /metrics
┌──────────────▼──────────────┐
│ Prometheus :9090 │
│ + node-exporter, cAdvisor │
└──────────────┬──────────────┘
│
┌───────▼───────┐
│ Grafana :3002 │
│ + traces panel│
└───────────────┘
Out-of-band: tool sandboxes — headless Playwright in ephemeral rootless Docker (lazy-spawned by browser/code tools)
cron self_improve.py (every 6h)
pg_dump nightly + WAL archive every 5 min → off-host
| Component | Tech | Purpose | Why it beats current | Replaces |
|---|---|---|---|---|
| Edge | Caddy | TLS + reverse proxy | (kept) | — |
| agentd-web | FastAPI/uvicorn (one mode of agentd package) | Submit/fetch/approve, auth, risk classifier | Same package as worker → one build, one deploy, shared types; only the systemd unit differs | agent-api as separate conceptual service |
| agentd-worker | Same agentd package, separate systemd unit, fixed concurrency | Run LangGraph; lease tasks; emit traces | Crash isolation without another service boundary; runs new in-proc gateway | Standalone worker codebase |
| agentd-sweeper | Same package, third systemd unit (or 60s tick inside worker) | Requeue expired leases; partition trace tables; enforce TTLs | Names the durability work currently implicit; one place to debug stuck tasks | (new — was implicit visibility-timeout) |
| Datastore | Postgres 16 + pgvector | Queue, state, memory, KG, traces, idempotency keys | One DB = one transaction = one backup; tasks (current) + task_attempts (history) separation gives clean retry visibility | Redis, Neo4j, Langfuse storage |
| Queue mechanism | SELECT … FOR UPDATE SKIP LOCKED + LISTEN/NOTIFY + heartbeat | Durable at-least-once queue, lease-with-renewal | Same txn as state writes; NOTIFY wakes workers, polling handles missed signals | Redis lists |
| Model gateway | In-process module gateway.py (LiteLLM SDK or direct clients) | Tier routing, fallback on 429/5xx, per-provider circuit breaker with cooldown rows in Postgres, daily-spend cap on paid tier | No HTTP hop; <100 LOC; circuit-breaker state survives worker restart because it's a Postgres row, not in-memory | LiteLLM container |
| Agent runtime | LangGraph as library | Stateful tool-use loop | Library complexity OK; durability stays in Postgres, not LangGraph state | (kept) |
| Memory | pgvector wrapper (memories, episodes) | Semantic + episodic recall | One DB to back up | LangMem-as-service |
| KG | Postgres kg_nodes(id, type, props jsonb) + kg_edges(src, dst, rel, props jsonb) + recursive CTEs / pg_trgm | Entity/relation store | Shallow traversals adequate; -1 container, -1.5 GB; backup/inspect/repair trivially | Neo4j + Graphiti |
| Tracing | traces + model_calls Postgres tables (partitioned by month) + Grafana panel | Per-task spans, cost, model, latency, tool errors | Drop-by-partition is cheap; -2 GB; SQL-queryable; no ClickHouse/Redis/blob-storage stack to operate | Self-hosted Langfuse |
| Metrics | Prometheus + Grafana + node-exporter + cAdvisor | Host + container metrics | (kept) | — |
| Tool sandbox | Headless Playwright + ephemeral rootless Docker (cgroups, seccomp, no default network, per-task workspace) | Browser, code-exec, file-tool isolation | Real isolation for hostile inputs; per-task disposable; -1 background X server | Xvfb + on-host Playwright + OpenHands |
| Secrets | SOPS+age → /run/agent/env (tmpfs) on systemd ExecStartPre | Encrypted at rest, plaintext only in tmpfs | No plaintext-at-rest drift in app dir | /opt/agent/core/.env persistent |
| Backups | restic off-host (artifacts, config) + pg_dump -Fc + 5-min WAL archive + weekly automated restore drill | Restorable durability | RPO 24h → 5 min, RTO proven | nightly cron-only |
Container count: 11 → 5 (Caddy, Postgres, Prometheus, Grafana, cAdvisor). agentd-web/agentd-worker/agentd-sweeper are systemd-managed Python from one codebase. Disposable tool-sandbox containers are spawned per-task and reaped, not counted in steady-state.
| What dies | Blast radius | Recovery |
|---|---|---|
| Postgres | All tasks halt; API returns 503 | systemd restart; on data loss restore pg_dump + replay WAL. One thing to restore. |
| FastAPI | Submission fails; in-flight tasks unaffected (owned by worker rows in DB) | systemd restart |
| Worker | Tasks stop progressing; leased rows held until lease expiry (claimed_at < now() - interval '5 min') | agentd-sweeper requeues expired leases; task_attempts table prevents hidden duplicates; systemd restart resumes |
| OpenRouter rate-limit | One model unavailable | In-proc gateway opens per-provider circuit breaker (cooldown row in Postgres), falls back: free A → free B → direct Gemini → paid Claude (only if urgency=high AND under daily cap) |
| All free models down | Task waits or escalates | Paid route only via policy + daily-spend cap; fall back to human approval if cap blown |
| Tool sandbox crash / escape attempt | Per-task container killed; host unaffected | Rootless Docker + cgroups + seccomp + no default network bound the blast radius; agent retries with fresh container |
| Caddy | No external access | systemd restart; cert state on disk |
| Disk pressure | Writes fail first (traces, artifacts, WAL) | Alert at 70% / 80% / 85%; admission control at 85% (refuse new submissions); artifact TTL + monthly trace partition drop |
| Backup target unavailable | Backups fail; service runs | Alert after one miss; do not prune local last-good snapshot until remote restic run succeeds |
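The worker-failure recovery above hinges on the sweeper's lease-expiry pass. A minimal sketch of that pass, using sqlite3 as an in-memory stand-in (production runs the same UPDATE against Postgres; table and column names follow this doc, the 5-minute lease window comes from the table above):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE task_queue (
    id INTEGER PRIMARY KEY, status TEXT, claimed_at TEXT, attempts INTEGER)""")

now = datetime(2026, 4, 25, 12, 0, tzinfo=timezone.utc)
stale = (now - timedelta(minutes=7)).isoformat()   # lease older than 5 min: expired
fresh = (now - timedelta(minutes=1)).isoformat()   # still inside its lease
conn.execute("INSERT INTO task_queue VALUES (1, 'running', ?, 1)", (stale,))
conn.execute("INSERT INTO task_queue VALUES (2, 'running', ?, 1)", (fresh,))

# Sweeper pass: requeue tasks whose lease expired, bumping the attempt counter
# so task_attempts can record the retry rather than hide a duplicate.
cutoff = (now - timedelta(minutes=5)).isoformat()
conn.execute("""UPDATE task_queue
                SET status = 'queued', claimed_at = NULL, attempts = attempts + 1
                WHERE status = 'running' AND claimed_at < ?""", (cutoff,))

rows = conn.execute("SELECT id, status, attempts FROM task_queue ORDER BY id").fetchall()
print(rows)  # task 1 requeued with attempts=2, task 2 untouched
```

A dead worker therefore costs at most one lease window of latency, never a lost task.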
| Scale | Action | What changes | What stays |
|---|---|---|---|
| 1x (today) | Baseline | All co-resident | — |
| 10x | Tune Postgres (shared_buffers 4→8 GB, work_mem, autovacuum); spawn 2–4 worker processes; partition traces and model_calls by month (drop-by-partition cheap); cap artifact retention; add PgBouncer only if active connections >50; add 2nd worker VM only if browser/code sandboxes saturate RAM | Same architecture; same API contract | Postgres queue, model policy tables, memory/KG schema |
| 100x | Move Postgres to managed (Neon/RDS) or split: primary for state, replica for traces; horizontal worker pool on a 2nd VM | Postgres location + 2nd VM | API code, worker code, agent loop unchanged |
| 1000x | Introduce real broker (NATS/Redis Streams) only when measured pg queue throughput > 2k jobs/sec | Queue tech | Postgres still primary state |
| Service | RAM | CPU (steady / burst) | Disk |
|---|---|---|---|
| OS + systemd + Docker daemon | 1.5–2.5 GB | <1 idle | 8–12 GB |
| Caddy | 50 MB | 0.05 / 0.5 | 50 MB |
| agentd-web | 300–600 MB (1 GB cap) | 0.2 / 1 | — |
| agentd-worker (4 workers @ 0.7–1.2 GB) | 3–5 GB (6 GB cgroup cap) | 0.5 / 8 | metadata in PG |
| agentd-sweeper | 100 MB | negligible | — |
| Tool sandboxes (max 2 active) | 0 idle / 3 GB peak (1.5 GB each, cgroup-capped) | 0 / 4 | 5–8 GB tmp w/ TTL |
| Postgres (tuned) | 3–4 GB steady, 6 GB cap | 1 / 4 | 18–25 GB initial, alert at 35 GB |
| Prometheus + node-exporter | 0.7–1.2 GB | 0.2 / 1 | 4–6 GB (15-day retention) |
| Grafana | 250–400 MB | 0.05 / 0.5 | <1 GB |
| journald (JSON) | bounded | negligible | 2 GB cap |
| restic backup runs (nightly) | 200–500 MB during backup | 1–2 vCPU nightly | no large local retention |
| Steady state | ~9–12 GB | ~3 vCPU | ~45 GB |
| Expected peak | 18–21 GB (4 active workers + 2 sandboxes + Postgres warm) | 8 vCPU burst | — |
| Headroom | 8–10 GB RAM for page cache and burst | — | ~25 GB disk |
Swap policy: emergency only. vm.swappiness=10; alert on sustained swap or >512 MB swap used. Workers and sandboxes get cgroup memory caps. Do not tune the system to depend on the 4 GB swap.
Disk policy: 75 GB is tight. Full prompt/tool traces require TTLs. Backups go off-host. If trace/artifact retention beyond 30 days matters, buy more disk before adding services.
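The disk policy's admission-control behaviour reduces to one pure function. A sketch (thresholds from this doc; the FastAPI/agentd-web wiring and function names are illustrative):

```python
import shutil

ALERT_THRESHOLDS = (0.70, 0.80, 0.85)  # alert tiers from the disk-pressure row
ADMISSION_CUTOFF = 0.85                # refuse new submissions beyond this

def disk_used_fraction(path="/"):
    """Fraction of the filesystem in use, via shutil.disk_usage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def admit_submission(used_fraction):
    """Return (http_status, detail) for a new task submission."""
    if used_fraction >= ADMISSION_CUTOFF:
        # agentd-web returns 503 + Retry-After; space must be freed first
        # (monthly trace-partition drop, artifact TTL sweep) before accepting.
        return 503, "disk admission control: retry after partition drop / TTL sweep"
    return 202, "accepted"

print(admit_submission(0.60))  # (202, 'accepted')
print(admit_submission(0.90)[0])  # 503
```

Refusing at 85% rather than merely alerting is what keeps WAL writes alive under pressure.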
Postgres queue, not Redis Streams or NATS.
Redis with AOF appendfsync=always kills throughput; appendfsync=everysec loses up to 1 s of work on crash. NATS JetStream is durable but adds a service. SKIP LOCKED on this hardware hits ~2k jobs/sec — ample headroom. The decisive argument is not perf but consistency: enqueue + state-update in one transaction eliminates the dual-write problem (task enqueued but state-write failed, or vice versa). With Redis, the fixes are either a transactional outbox (more code, more bugs) or accepting drift (more bugs). Reject Redis on consistency, not perf.
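The single-transaction argument can be made concrete. A sketch using sqlite3 as a stand-in for Postgres (table names follow this doc; in production the worker-side claim adds FOR UPDATE SKIP LOCKED): the queue insert and the state-log write either both commit or both roll back, so the dual-write failure mode cannot occur.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task_queue (id INTEGER PRIMARY KEY, payload TEXT, status TEXT);
CREATE TABLE task_log   (task_id INTEGER, event TEXT);
""")

def submit(conn, task_id, payload, fail_midway=False):
    """Enqueue + state-write in ONE transaction: no outbox, no drift."""
    try:
        with conn:  # sqlite3 context manager: BEGIN ... COMMIT, or ROLLBACK on error
            conn.execute("INSERT INTO task_queue VALUES (?, ?, 'queued')",
                         (task_id, payload))
            if fail_midway:  # simulate a crash between the two writes
                raise RuntimeError("boom")
            conn.execute("INSERT INTO task_log VALUES (?, 'submitted')", (task_id,))
    except RuntimeError:
        pass  # with a separate broker, this is exactly the drift case

submit(conn, 1, "ok")
submit(conn, 2, "crashes", fail_midway=True)

queue = conn.execute("SELECT id FROM task_queue").fetchall()
log = conn.execute("SELECT task_id FROM task_log").fetchall()
print(queue, log)  # task 2 appears in neither table: no half-written state
```

The Redis equivalent requires an outbox table plus a relay process to get the same guarantee.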
In-process model gateway, not LiteLLM.
LiteLLM's value is credential isolation across many consumers. With one consumer, the separate process pays nothing and costs: another container's healthcheck, another HTTP hop, another :4000 socket, another set of logs to merge during incident review. Tier routing — "try free A, on 429 try free B, on still-failing AND urgency=high try paid C" — is ~50 LOC. Reject LiteLLM on operability, not capability.
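The ~50 LOC claim is roughly right. A minimal sketch of the tier router (provider names, the RateLimited signal, and the cap numbers are illustrative; the real module would add the Postgres-backed circuit-breaker rows):

```python
class RateLimited(Exception):
    """Stand-in for a provider 429/5xx response."""

FREE_TIERS = ["free_a", "free_b", "gemini_direct"]
PAID_TIER = "paid_claude"

def route(call, urgency="normal", spend_today=0.0, daily_cap=5.0):
    """Try free tiers in order; paid tier only if urgent AND under the daily cap."""
    candidates = list(FREE_TIERS)
    if urgency == "high" and spend_today < daily_cap:
        candidates.append(PAID_TIER)
    last_err = None
    for provider in candidates:
        try:
            return provider, call(provider)
        except RateLimited as err:
            last_err = err  # fall through to the next tier
    raise RuntimeError("all tiers exhausted") from last_err

# Simulated providers: both free tiers are rate-limited today.
def fake_call(provider):
    if provider in {"free_a", "free_b"}:
        raise RateLimited(provider)
    return f"response via {provider}"

print(route(fake_call))  # ('gemini_direct', 'response via gemini_direct')
```

Debugging this is a stack trace in the worker's journal, not a second service's logs.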
Drop Neo4j (probably).
Cypher genuinely beats SQL for k-hop pattern matching. The honest question: does the workload run those? If Graphiti is a glorified entity store with ≤2-hop queries, Postgres (src, dst, rel) + recursive CTE handles it at a fraction of the RAM. If KG queries grow 4-hop or pattern-heavy in 6 months, you'll regret this. Mitigation: keep the Graphiti schema dump + a re-import path documented — regret-cost is one weekend, not a rewrite.
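The shallow-traversal claim is checkable. A sketch using sqlite3 (which also supports WITH RECURSIVE; the Postgres query is the same shape, with the doc's jsonb props columns omitted and node/edge values invented for illustration): 2-hop reachability with the depth cap inside the CTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE kg_nodes (id TEXT PRIMARY KEY, type TEXT);
CREATE TABLE kg_edges (src TEXT, dst TEXT, rel TEXT);
INSERT INTO kg_nodes VALUES ('brian','person'),('agentd','service'),
                            ('postgres','datastore'),('hetzner','host');
INSERT INTO kg_edges VALUES ('brian','agentd','operates'),
                            ('agentd','postgres','stores_in'),
                            ('postgres','hetzner','runs_on');
""")

# All nodes within 2 hops of 'brian' — the shallow pattern the audit is
# expected to find; UNION deduplicates, the WHERE clause caps the depth.
rows = conn.execute("""
WITH RECURSIVE reach(id, depth) AS (
    SELECT 'brian', 0
    UNION
    SELECT e.dst, r.depth + 1
    FROM reach r JOIN kg_edges e ON e.src = r.id
    WHERE r.depth < 2
)
SELECT id, depth FROM reach ORDER BY depth, id
""").fetchall()
print(rows)  # hetzner, 3 hops away, is correctly excluded
```

When this is all the workload runs, Neo4j's RAM buys nothing; when 4-hop patterns appear, this query shape starts to hurt and the re-import path matters.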
Drop Langfuse (conditionally).
The UI is genuinely good for trace exploration. But it's 2 GB resident with its own DB. Decisive question: is the UI opened during normal debugging, or only during incidents? Daily → keep with MemoryMax=2G. Weekly/monthly → structured traces rows + Grafana panel gives 80% UX at 5% RAM. Honest loss: rich span-tree visualization.
Keep LangGraph.
Counter-position: a while not done: think → tool → observe loop is 30 lines. True until conditional edges, retries, parallel tool calls, and human-in-the-loop pauses appear — at which point you've reinvented LangGraph badly. Single-process, no external deps. Keep.
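For reference, the counter-position's loop really is this small (tool names, message shape, and the done-check are illustrative) — and every feature the paragraph lists lands as another branch inside it:

```python
def naive_agent_loop(llm, tools, task, max_turns=10):
    """The while-not-done loop LangGraph replaces. Fine for linear tool
    use; retries, parallel calls, and human-approval pauses each add
    branches until this is a hand-rolled state machine."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        step = llm(history)                           # think
        if step.get("done"):
            return step["answer"]
        result = tools[step["tool"]](step["args"])    # act
        history.append({"role": "tool", "content": result})  # observe
    raise TimeoutError("max turns exceeded")

# Scripted llm: one tool call, then done.
script = iter([{"tool": "echo", "args": "hi"},
               {"done": True, "answer": "hi back"}])
answer = naive_agent_loop(lambda h: next(script), {"echo": lambda a: a}, "greet")
print(answer)  # 'hi back'
```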
Keep cAdvisor with fewer containers.
~50 MB for per-container resource attribution that node-exporter doesn't give. Postgres becoming the dominant container makes its memory trajectory the single most valuable signal before OOM. Cheap insurance.
What I'm trading:
- Operability for capability: in-process LangGraph means one bad agent loop can OOM the worker. Mitigation: MemoryMax=4G in the systemd unit + auto-restart.
- Capability for cost: dropping Neo4j gives up Cypher. Acceptable if the workload is shallow.
- Latency for cost: dropping LiteLLM cuts ~5–20 ms per call (HTTP loopback). Net win.
All steps reversible. Execute in order.
| # | Step | Verify | Rollback | Downtime |
|---|---|---|---|---|
| 1 | Add task_queue(id, payload jsonb, status, claimed_at, attempts, created_at). Wire FastAPI to dual-write Redis + Postgres. Worker still reads Redis only. | Row counts match for 24 h. | Drop the Postgres dual-write code path. | Zero |
| 2 | Switch worker to read from Postgres SKIP LOCKED. Keep Redis writes 7 more days as audit. | Queue depth, throughput, error rate match prior baseline. No tasks stuck >5 min. | QUEUE_BACKEND=redis env flip. | Zero |
| 3 | Stop Redis writes; remove Redis container. | Redis-down alert acknowledged; nothing else depends on it. | docker compose up redis, re-enable dual-write. | Zero |
| 4 | Inventory + restore drill on current backups (Codex-adopted). Run pg_dump/restic restore to scratch DB on a fresh VM; verify counts and API smoke test. | Restored DB opens; row counts match; API works against it. | Read-only. | Zero |
| 5 | Add agentd package with web / worker / sweeper modes. Deploy as 3 systemd units alongside current services. New units idle (no traffic). | agentd-web healthcheck local; agentd-sweeper logs no errors; new units restart cleanly. | Stop new units. | Zero |
| 6 | Add task_attempts table + idempotency-key column on tasks. Backfill from existing task log. | Sample tasks show one-row-per-attempt; duplicates blocked by idempotency key. | Drop new column / table. | Zero |
| 7 | Implement gateway.py in-process with per-provider circuit-breaker rows (provider_health(provider, opened_until, last_error)). Run in shadow mode: calls still go through LiteLLM, app computes chosen route+cost separately, log diff. | Route/cost decisions match or improve LiteLLM's; fallback simulator triggers breaker correctly. | Disable shadow logging. | Zero |
| 8 | Switch model calls to in-process router. LiteLLM proxy stays running but unused for one retention window. | Provider failure injection triggers fallback chain; daily-spend cap blocks paid tier when exceeded. | Repoint env to LiteLLM. | Zero |
| 9 | Stop LiteLLM container. | No 502s; no missing trace fields. | docker compose up litellm. | Zero |
| 10 | Migrate tool sandbox to ephemeral rootless Docker + headless Playwright (Codex-adopted). Ship per-task container spawner with cgroups, seccomp, no-default-network. Canary: 10% of browser/code tasks. | Same task results; container reaped after each task; no host file pollution. | Flip env back to host Playwright/Xvfb. | Zero |
| 11 | Cut all sandbox tasks to Docker. Stop Xvfb, remove its package. | Browser tool works for 7 days. | Restart Xvfb. | Zero |
| 12 | Audit Graphiti queries: instrument 30-day query log. If <100/day AND max depth ≤2 → migrate to Postgres kg_* (one-time export). If heavy → keep Neo4j. | Agent KG-tool returns identical results on 20-task replay. | Re-point agent at Neo4j. | Maintenance window ~30 min for cutover |
| 13 | If Step 12 cutover succeeded: stop Neo4j container. | KG tool works for 7 days. | Restart Neo4j. | Zero |
| 14 | Audit Langfuse UI usage (last login, view counts). If unused: implement traces + model_calls Postgres tables (partitioned by month) + Grafana panel; cut over agent trace emitter; dual-emit to Langfuse for one retention window. | New panel shows last-hour spans, costs, tool errors. | Repoint emitter at Langfuse. | Zero |
| 15 | Stop Langfuse if Step 14 succeeded. | Trace queries served from Postgres for 7 days. | Restart Langfuse. | Zero |
| 16 | Stop OpenHands if not in any task path (audit task_log for tool calls in last 30 d). | No alerts, no skill calls fail. | Restart container. | Zero |
| 17 | Move SOPS decrypt target to /run/agent/env (tmpfs) via systemd ExecStartPre. Update all units. | Plaintext env absent from app dir after reboot; services read credentials. | Point units back to old .env. | Brief restart per unit |
| 18 | Postgres tuning: shared_buffers=4GB, effective_cache_size=12GB, work_mem=32MB, max_connections=50, autovacuum_vacuum_scale_factor=0.05. | pg_stat_statements p95 unchanged or improves. | Revert postgresql.conf. | ~30 sec restart |
| 19 | Replace backup cron with restic off-host + pg_dump -Fc + WAL archive + weekly automated restore drill. Keep old cron until new backup has 2 successful runs. | Fresh restore on scratch VM produces working API; restic prune logs healthy. | Re-enable cron. | Zero |
| 20 | Add disk admission control at 85% in agentd-web (returns 503 with retry-after); enable monthly trace partition drop in agentd-sweeper. | Synthetic disk-fill test triggers admission denial. | Disable admission flag. | Zero |
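Step 6's idempotency key reduces to a UNIQUE constraint plus a conflict-ignoring insert. A sketch with sqlite3 as stand-in (Postgres would use INSERT ... ON CONFLICT DO NOTHING to the same effect; key and payload values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tasks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    idempotency_key TEXT UNIQUE,
    payload TEXT)""")

def submit(conn, key, payload):
    """Insert once per key; a retried submission resolves to the original task id."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO tasks (idempotency_key, payload) VALUES (?, ?)",
            (key, payload))
    return conn.execute(
        "SELECT id FROM tasks WHERE idempotency_key = ?", (key,)).fetchone()[0]

first = submit(conn, "req-abc", "run report")
retry = submit(conn, "req-abc", "run report")   # client retried after a timeout
other = submit(conn, "req-def", "other task")
print(first, retry, other)  # retry resolves to the same id as first
```

Combined with task_attempts, a client timeout-and-retry produces one task with two attempt rows, never two tasks.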
Estimated wall-clock: 6–8 weeks at one solo-operator session per week, mostly bake time + audit windows.
1. What is daily task volume and peak burst rate? <100/day → even more aggressive simplification (drop Prometheus, structured logs to file/Loki). >10k/day → queue tuning order changes.
2. Is the Graphiti KG queried at ≥3-hop depth or with pattern-matching, or used as an entity store? Decides Steps 12–13 — drop Neo4j or keep it.
3. Does Jonah open the Langfuse UI during normal debugging, or only during incidents? Decides Steps 14–15 — drop Langfuse or contain it.
4. Realistic agent loop length distribution — single tool call, or 10+ tool turns? Determines whether the worker memory budget of 2.5 GB peak is sufficient or needs to be 4 GB.
5. Is OpenHands invoked by any task or skill, ever? Decides Step 16 immediately.
6. Does any current task genuinely need >1 concurrent worker today, or is the queue functionally serial? Affects Postgres max_connections and the 10x scaling assumption.
7. What is the cost-of-outage tolerance? A 30-min Postgres restore window is acceptable as described. If there's a real-time SLA (Brian responding to Jonah on TG), a hot standby Postgres replica becomes worth its 4 GB RAM and the design needs a streaming-replication arm.
8. Is code execution fully untrusted (e.g., agent generates and runs arbitrary Python/JS), or only owner-authored scripts? Fully untrusted may justify a second disposable worker VM for sandbox-only workloads — even rootless Docker shares a kernel with Postgres. Cheap insurance against kernel CVEs.
9. What RPO/RTO is actually needed? pg_dump only = 24 h RPO. WAL archiving = 5 min RPO, ~30 min RTO. Streaming replica + WAL = <1 min RPO, <5 min RTO at the cost of a second host. The design defaults to WAL archiving; upgrade only if 5 min loss is unacceptable.
10. Will anything besides this agent need a shared LLM gateway in the next 6 months? If yes (e.g., a separate chat UI, a CRM webhook handler), LiteLLM proxy may belong back in the diagram behind the same router interface. If no, in-process stays.
Codex's redesign reached the same five top-level cuts (Redis, LiteLLM, Neo4j, Langfuse, OpenHands) but went further on three axes I'd undersold. Adopted into this document:
| # | Adoption | Where applied | Why it improves the plan |
|---|---|---|---|
| 1 | Drop Xvfb; run browser + code-exec in ephemeral rootless Docker (cgroups, seccomp, no default network) | §1 row, §3 Tool sandbox row, §3 failure modes (sandbox escape), §5 Steps 10–11 | Same-host Playwright leaks files / runs uncapped on hostile inputs. The added Docker daemon (~150 MB + 8 GB image disk) buys real isolation — the one added surface that earns its keep. |
| 2 | agentd single package with web / worker / sweeper systemd modes | §3 component table, §5 Step 5 | One build, one deploy, shared types. Sweeper names the durability work that was implicit in my "visibility timeout" hand-wave. |
| 3 | task_attempts table + idempotency keys on tasks | §3 datastore row, §5 Step 6 | Clean retry/duplicate semantics. My single task_log lumped current state with history. |
| 4 | SOPS decrypt → /run/agent/env tmpfs, not persistent .env | §1, §3, §5 Step 17 | Closes the plaintext-at-rest gap that contradicts SOPS's own promise. |
| 5 | restic off-host + weekly automated restore drill (vs my monthly) | §1, §3, §5 Steps 4 & 19 | Backup without restore is theatre; weekly drill on a solo system catches drift before the incident. |
| 6 | Per-provider circuit-breaker as Postgres rows with cooldown timestamps | §3 model gateway, §3 failure modes, §5 Step 7 | State survives worker restart; in-memory breakers don't. |
| 7 | Disk admission control at 85%, not just alerts | §3 failure modes, §5 Step 20 | Refusing new submissions is the only thing that prevents WAL-write death spiral on a 75 GB host. |
| 8 | Trace + model_call tables partitioned by month | §3 tracing row, §3 scaling 10x | Drop-by-partition is O(1); my "truncate >30d" was a long DELETE. |
| 9 | Shadow-mode validation for the in-process router cutover | §5 Step 7 | Stronger than feature-flag canary: compares route decisions against LiteLLM's before any cutover. |
| 10 | Resource budget peak revised up to 18–21 GB (4 workers + 2 active sandboxes + Postgres warm) | §3 resource budget | More honest than my 13 GB peak; preserves real headroom claim. |
| 11 | PgBouncer-only-when-needed (>50 connections) at 10x | §3 scaling table | Names the threshold; avoids cargo-culted pooler. |
| 12 | Disposable-worker-VM caveat for fully untrusted code | §6 Q8 | Honest acknowledgment that rootless Docker shares a kernel. |
Rejected from Codex (kept my position):
| Codex point | Reason rejected |
|---|---|
| Drop cAdvisor "unless container metrics prove necessary" | Postgres-as-container memory drift is the highest-value pre-OOM signal. ~50 MB is cheap insurance. |
| No specific Postgres tuning numbers | Solo operator needs concrete defaults to start from, not "tune later". Kept shared_buffers=4GB, effective_cache_size=12GB, work_mem=32MB, max_connections=50. |
| Drop Langfuse outright | Kept conditional — if the UI is opened daily, the trace UX is genuinely worth 2 GB. Audit before cutting. |
| Assumed Neo4j removal is safe | Kept the empirical query-audit step. Cypher's k-hop expressiveness is real if the workload uses it. |
Net effect on the design: the diagram gets one new component (Docker daemon for sandboxes), the migration goes from 12 to 20 steps (each smaller and more reversible), and the system becomes meaningfully harder to compromise via hostile inputs. RAM peak honesty went up; nothing else got bigger.
This section is the diff between §5 Migration Plan (intent) and what actually executed tonight. It supersedes the corresponding §5 rows where it conflicts. Rollback artifacts are enumerated at the end.
| Step (§5 ref) | Component | Outcome | Verification |
|---|---|---|---|
| 18 | Postgres tuning (shared_buffers=4GB, effective_cache_size=12GB, work_mem=32MB, max_connections=50, autovacuum_vacuum_scale_factor=0.05) | Applied. ~2 sec restart. | SHOW shared_buffers = 4GB; cAdvisor shows steady warm-up to 4 GB resident. |
| 19 | restic off-host backups to contabo target | Live. 3 snapshots taken (initial, post-Neo4j-migration, post-queue-cutover). | restic snapshots lists 3 IDs incl. ed8aea1a. |
| 14–15 | Langfuse stopped | Container stopped, restart-policy=no, image preserved. Trace emitter rerouted to traces/model_calls Postgres tables. | docker ps no longer shows langfuse-*; traces row count rising. |
| 12–13 | Neo4j stopped + KG migrated to Postgres | 1616 nodes / 2826 edges exported and imported into kg_nodes / kg_edges. Neo4j container stopped, restart-policy=no. | SELECT count(*) FROM kg_nodes = 1616; kg_edges = 2826. Agent KG-tool replay returned identical results. |
| 16 | OpenHands stopped | Container stopped, restart-policy=no, preserved. | docker ps no longer shows OpenHands; no skill calls failed in 2-hour soak. |
| 1–2 | Queue migrated to Postgres (QUEUE_BACKEND=postgres) | Worker confirmed claiming via claimed_by column on task_queue. Redis still receiving writes (audit window not yet closed). | SELECT id, claimed_by, claimed_at FROM task_queue WHERE claimed_by IS NOT NULL returns active rows; no tasks stuck >5 min. |
| 17 | tmpfs SOPS wired (/run/agent/env via systemd ExecStartPre) | All units now decrypt at start; no plaintext .env re-read after reboot test. | Post-reboot ls /opt/agent/core/.env* shows backup only; /run/agent/env populated. |
| 20 | Disk admission middleware | Wired into agentd-web. Returns 503 + retry-after at 85%. | Synthetic disk-fill (loop file in /var/tmp) triggered admission denial; cleared on file delete. |
| 10 | Sandbox image built | Rootless Docker image ready (cgroups, seccomp, no default network). NOT yet wired into tool dispatcher — see §8.2. | docker images \| grep agent-sandbox shows tagged image. |
| 7 | Gateway code wired into agent.py | gateway.py imported; routing through in-process module under shadow flag. | Shadow log shows route/cost decisions matching LiteLLM for the soak window. |
| Item | Why it didn't ship tonight | Unblock |
|---|---|---|
| Stop agent-redis container (§5 Step 3) | Dependency audit (redis_consumer_audit.md) found 14 other code paths still using Redis: chrome_bridge, circuit_breaker, workflow_bus, brian_roles, telegram_bot, cc_routes, collab_orchestrator, plus 7 others. Each is its own scope of change. | Per-consumer migration plan; cannot be one cutover. Redis stays up until each consumer is audited + cut over individually. |
| Flip GATEWAY_MODE=live (§5 Step 8) | First attempt returned 429 on all 5 tiers because gateway.py was reading env var GOOGLE_API_KEY, while LiteLLM (and the actual project quota) uses GOOGLE_AI_STUDIO_KEY — different keys, different quotas. Fix landed at gateway.py:347 (now reads GOOGLE_AI_STUDIO_KEY first, falls back to GOOGLE_API_KEY). | Live cutover deferred to next session with proper unit-test validation of the env-var precedence + a single non-shadow synthetic call before flipping the flag. |
| Browser / code-exec → Docker sandbox cutover (§5 Steps 10–11) | Sandbox image is built, but the tool dispatcher refactor that injects per-task containers is still pending. | Refactor tool_registry.py browser/code paths to spawn-via-image instead of host process; canary 10% per Step 10. |
These are filed as feedback memory entries so future sessions don't repeat them:
- .env file newline corruption — an echo >> appended without a leading \n, mashing two keys onto one line and breaking dotenv parse for downstream consumers. Fix: always printf '\n%s=%s\n' or use a python-dotenv set call. → feedback_env_file_newline_safety.md.
- Off-host backup target tunnel-vision — defaulted to "rent B2/Wasabi" before realizing we already own four off-host targets (contabo VPS, GoogleDrive mount, Hetzner volume, GitHub). Asked Jonah for B2 unnecessarily. → feedback_compose_full_toolbox_backup_targets.md.
- LiteLLM key naming — GOOGLE_AI_STUDIO_KEY vs GOOGLE_API_KEY are different keys with different quota meters. Code paths that mix them silently pull from the wrong (often empty) bucket and surface as 429. → reference_litellm_key_naming.md.
- /opt/agent/data/redesign_rollout_20260425_121817/pre_rollout_full.sql — 3.6 GB hot pg_dump -Fc taken before any change.
- restic snapshot ed8aea1a — full off-host snapshot of artifacts + config at pre-rollout state.
- /opt/agent/data/redesign_rollout_20260425_121817/env_backup_20260425_142046.env — encrypted env backup pre-tmpfs cutover.
- /opt/agent/scripts/cutover_*.sh — all step-wise cutover scripts (idempotent).
- /opt/agent/scripts/rollback_neo4j.sh — restarts Neo4j container, re-points agent KG tool, re-runs export-import in reverse if needed.
- Stopped containers preserved in place (restart-policy=no): OpenHands, Langfuse stack, Neo4j. Recovery is docker start <name> plus the env flip in the relevant rollback script.
- MemAvailable rose from 11 GiB → 13 GiB at the moment of the stops, then settled as Postgres warmed shared_buffers to 4 GB (expected — that warm cache is now serving queries that previously hit Neo4j/Langfuse/Redis).
End of document.