
Agent System Redesign — First-Principles Architecture


Author: Brian | Date: 2026-04-25 | Host: ubuntu-8gb-hel1-1 (Hetzner, 16 vCPU AMD EPYC-Rome, 30 GB RAM, 75 GB disk, 24 GB free, 4 GB swap)


1. Audit of the Current System

| Layer | Component | Verdict | Reason |
|---|---|---|---|
| Edge | Caddy | Keep | Auto-TLS in 3 lines. Replacement is net-negative work. |
| API | FastAPI/uvicorn | Keep | Single-file viable; right primitive for the job. |
| Queue | Redis | Replace | SELECT … FOR UPDATE SKIP LOCKED in Postgres serves low-moderate volume. Decisive: enqueue + state-write in one transaction kills the dual-write/outbox problem. -1 container, -1 process. |
| Worker | systemd unit | Keep | Right primitive. Don't containerize a single Python process. |
| Agent runtime | LangGraph | Keep | State machine pays for itself once routing has any conditionals or human-in-the-loop pauses. |
| Model gateway | LiteLLM container | Replace | A separate gateway exists to share credentials across many consumers. One worker, one operator → in-process Python module is ~80 LOC. -1 container, -1 HTTP hop, debug via stack trace. |
| State (RDBMS) | Postgres+pgvector | Keep | Workhorse. Queue, memories, approvals, skills, traces all fit here. |
| State (graph) | Neo4j+Graphiti | Replace/Remove | 1–2 GB RAM for Cypher's expressive win — unrealised unless the workload runs ≥3-hop traversals. Postgres kg_nodes/kg_edges + recursive CTEs covers shallow patterns. Empirical question; default action: drop. |
| Memory | LangMem | Merge | Wrap pgvector in-process; the abstraction is thin. |
| Tracing | Langfuse | Conditional | 2 GB resident with its own DB. Justified only if Jonah opens the UI daily. Otherwise → traces table + Grafana panel. |
| Metrics | Prometheus + Grafana + node-exporter + cAdvisor | Keep | ~1.5 GB total; legit operator UX; cAdvisor catches Postgres memory drift before OOM. |
| Adjacent | OpenHands | Remove | If it isn't in the task path, it's a 24/7 process for nothing. |
| Browser / code-exec sandbox | Playwright/Xvfb on host | Replace | Drop Xvfb (headless Playwright is native). Run browser + code-exec tools in ephemeral rootless Docker containers (cgroups, seccomp, no default network). Hostile inputs (web scraping, code exec) deserve real isolation, not host-level processes — same-host failure modes are leaked files, runaway memory, dependency pollution. This is the one added operational surface that pays for itself. |
| Self-improve | cron 6h | Keep | One crontab line. |
| Secrets | SOPS+age | Keep, harden | Decrypt to /run/agent/env (tmpfs) on systemd ExecStartPre, not persistent /opt/agent/core/.env. Removes plaintext-at-rest drift in app dir. |
| Backups | daily 3 AM cron | Replace | Daily cron + untested restore is theatre on a 75 GB host. Move to restic off-host (artifacts, config) + pg_dump -Fc + WAL archiving + weekly automated restore drill on a fresh VM. RPO 24 h → 5 min, RTO proven. |

Four unforced costs: Redis (Postgres exists), LiteLLM (one consumer), Xvfb (headless Playwright is native), and self-hosted Langfuse (its current shape pulls in ClickHouse + Redis/Valkey + blob storage — that's a separate observability platform). Neo4j is the fifth if its queries don't justify the RAM — empirical, not opinion.


2. First-Principles Requirements

The system is a durable task executor that runs LLM-driven workflows with tools, memory, and human-approval gates, serving one operator.

Essential (cannot remove):
- Authenticated HTTP submission, async execution, result-by-ID
- Durable queue with at-least-once delivery, crash-safe state
- Risk-tiered routing with human approval surface
- Multi-model routing with rate-limit / failure fallback
- Vector memory + structured task log
- Tool registry (web/file/db/browser/code)
- Trace + cost attribution per task
- Encrypted secrets, daily restorable backups

Incidental (currently present, not essential):
- Separate gateway process (LiteLLM)
- Separate broker (Redis)
- Separate graph DB (Neo4j) — unless workload proves otherwise
- Separate tracing service (Langfuse) — unless UX is used daily
- OpenHands

The system does not need: HA, multi-tenant isolation, distributed scheduling, service mesh, message-bus fan-out, zero-downtime upgrades. Those become real at 100x+ from here.


3. Proposed Architecture

                         ┌─────────────────────────────┐
                         │   Caddy :80/:443 (TLS)      │
                         └──────────────┬──────────────┘
                                        │
                         ┌──────────────▼──────────────┐
                         │ FastAPI (uvicorn) :8000     │
                         │ - POST /tasks  (auth)       │
                         │ - GET  /tasks/:id           │
                         │ - POST /approvals/:id       │
                         └──────────────┬──────────────┘
                                        │ INSERT (single txn:
                                        │ task_queue + task_log)
                         ┌──────────────▼──────────────┐
                         │ PostgreSQL 16 :5432         │
                         │ ├ task_queue (SKIP LOCKED)  │
                         │ ├ task_log                  │
                         │ ├ memories  (pgvector)      │
                         │ ├ kg_nodes, kg_edges        │
                         │ ├ approval_queue            │
                         │ ├ skills, skill_outcomes    │
                         │ └ traces (replaces Langfuse)│
                         └─────┬───────────────────▲───┘
                LISTEN/NOTIFY  │                   │ writes
                  + SKIP LOCKED│                   │
                         ┌─────▼───────────────────┴───┐
                         │ agent-worker (systemd)      │
                         │ ├ LangGraph runtime         │
                         │ ├ in-proc model gateway     │
                         │ │  → OpenRouter / Gemini    │
                         │ ├ tool registry             │
                         │ │  → web, fs, db, browser   │
                         │ ├ pgvector mem wrapper      │
                         │ └ structured trace emitter  │
                         └──────────────┬──────────────┘
                                        │ /metrics
                         ┌──────────────▼──────────────┐
                         │ Prometheus :9090            │
                         │ + node-exporter, cAdvisor   │
                         └──────────────┬──────────────┘
                                        │
                                ┌───────▼───────┐
                                │ Grafana :3002 │
                                │ + traces panel│
                                └───────────────┘

  Out-of-band:  Playwright sandbox containers (ephemeral, lazy-spawned by the browser/code tools)
                cron self_improve.py (every 6h)
                pg_dump nightly + WAL archive every 5 min → off-host

Component table

| Component | Tech | Purpose | Why it beats current | Replaces |
|---|---|---|---|---|
| Edge | Caddy | TLS + reverse proxy | (kept) | — |
| agentd-web | FastAPI/uvicorn (one mode of agentd package) | Submit/fetch/approve, auth, risk classifier | Same package as worker → one build, one deploy, shared types; only the systemd unit differs | agent-api as separate conceptual service |
| agentd-worker | Same agentd package, separate systemd unit, fixed concurrency | Run LangGraph; lease tasks; emit traces | Crash isolation without another service boundary; runs new in-proc gateway | Standalone worker codebase |
| agentd-sweeper | Same package, third systemd unit (or 60s tick inside worker) | Requeue expired leases; partition trace tables; enforce TTLs | Names the durability work currently implicit; one place to debug stuck tasks | (new — was implicit visibility-timeout) |
| Datastore | Postgres 16 + pgvector | Queue, state, memory, KG, traces, idempotency keys | One DB = one transaction = one backup; tasks (current) + task_attempts (history) separation gives clean retry visibility (DDL sketch below) | Redis, Neo4j, Langfuse storage |
| Queue mechanism | SELECT … FOR UPDATE SKIP LOCKED + LISTEN/NOTIFY + heartbeat | Durable at-least-once queue, lease-with-renewal | Same txn as state writes; NOTIFY wakes workers, polling handles missed signals | Redis lists |
| Model gateway | In-process module gateway.py (LiteLLM SDK or direct clients) | Tier routing, fallback on 429/5xx, per-provider circuit breaker with cooldown rows in Postgres, daily-spend cap on paid tier | No HTTP hop; <100 LOC; circuit-breaker state survives worker restart because it's a Postgres row, not in-memory | LiteLLM container |
| Agent runtime | LangGraph as library | Stateful tool-use loop | Library complexity OK; durability stays in Postgres, not LangGraph state | (kept) |
| Memory | pgvector wrapper (memories, episodes) | Semantic + episodic recall | One DB to back up | LangMem-as-service |
| KG | Postgres kg_nodes(id, type, props jsonb) + kg_edges(src, dst, rel, props jsonb) + recursive CTEs / pg_trgm | Entity/relation store | Shallow traversals adequate; -1 container, -1.5 GB; backup/inspect/repair trivially | Neo4j + Graphiti |
| Tracing | traces + model_calls Postgres tables (partitioned by month) + Grafana panel | Per-task spans, cost, model, latency, tool errors | Drop-by-partition is cheap; -2 GB; SQL-queryable; no ClickHouse/Redis/blob-storage stack to operate | Self-hosted Langfuse |
| Metrics | Prometheus + Grafana + node-exporter + cAdvisor | Host + container metrics | (kept) | — |
| Tool sandbox | Headless Playwright + ephemeral rootless Docker (cgroups, seccomp, no default network, per-task workspace) | Browser, code-exec, file-tool isolation | Real isolation for hostile inputs; per-task disposable; -1 background X server | Xvfb + on-host Playwright + OpenHands |
| Secrets | SOPS+age → /run/agent/env (tmpfs) on systemd ExecStartPre | Encrypted at rest, plaintext only in tmpfs | No plaintext-at-rest drift in app dir | Persistent /opt/agent/core/.env |
| Backups | restic off-host (artifacts, config) + pg_dump -Fc + 5-min WAL archive + weekly automated restore drill | Restorable durability | RPO 24 h → 5 min, RTO proven | Nightly cron-only |

Container count: 11 → 5 (Caddy, Postgres, Prometheus, Grafana, cAdvisor). agentd-web/agentd-worker/agentd-sweeper are systemd-managed Python from one codebase. Disposable tool-sandbox containers are spawned per-task and reaped, not counted in steady-state.
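
To pin down the datastore row, here is the shape of the queue/attempt split. This is a minimal sketch, not final DDL: table names follow the component table, but column types, defaults, and the lease fields are assumptions.

    # Illustrative DDL for the queue/attempt split, applied with psycopg 3.
    # Types, defaults, and lease fields are assumptions, not shipped schema.
    import psycopg

    TASK_QUEUE_DDL = """
    CREATE TABLE IF NOT EXISTS task_queue (
        id              bigserial PRIMARY KEY,
        payload         jsonb       NOT NULL,
        status          text        NOT NULL DEFAULT 'queued',  -- queued|running|done|failed
        idempotency_key text UNIQUE,            -- duplicate submits collapse onto one row
        claimed_by      text,                   -- worker identity holding the lease
        claimed_at      timestamptz,            -- lease start; sweeper requeues after expiry
        attempts        int         NOT NULL DEFAULT 0,
        created_at      timestamptz NOT NULL DEFAULT now()
    )
    """

    TASK_ATTEMPTS_DDL = """
    CREATE TABLE IF NOT EXISTS task_attempts (  -- history: one row per try
        id          bigserial PRIMARY KEY,
        task_id     bigint      NOT NULL REFERENCES task_queue(id),
        started_at  timestamptz NOT NULL DEFAULT now(),
        finished_at timestamptz,
        outcome     text,                       -- ok | error | lease_expired
        error       text
    )
    """

    def apply_schema(dsn: str) -> None:
        with psycopg.connect(dsn) as conn:
            conn.execute(TASK_QUEUE_DDL)
            conn.execute(TASK_ATTEMPTS_DDL)

The UNIQUE idempotency key is what turns a retried submission into a no-op instead of a duplicate task; task_attempts keeps retry history out of the hot queue row.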

Failure modes & recovery

| What dies | Blast radius | Recovery |
|---|---|---|
| Postgres | All tasks halt; API returns 503 | systemd restart; on data loss restore pg_dump + replay WAL. One thing to restore. |
| FastAPI | Submission fails; in-flight tasks unaffected (owned by worker rows in DB) | systemd restart |
| Worker | Tasks stop progressing; leased rows held until lease expiry (claimed_at < now() - interval '5 min') | agentd-sweeper requeues expired leases; task_attempts table prevents hidden duplicates; systemd restart resumes |
| OpenRouter rate-limit | One model unavailable | In-proc gateway opens per-provider circuit breaker (cooldown row in Postgres), falls back: free A → free B → direct Gemini → paid Claude (only if urgency=high AND under daily cap) |
| All free models down | Task waits or escalates | Paid route only via policy + daily-spend cap; fall back to human approval if cap blown |
| Tool sandbox crash / escape attempt | Per-task container killed; host unaffected | Rootless Docker + cgroups + seccomp + no default network bound the blast radius; agent retries with fresh container |
| Caddy | No external access | systemd restart; cert state on disk |
| Disk pressure | Writes fail first (traces, artifacts, WAL) | Alert at 70% / 80% / 85%; admission control at 85% (refuse new submissions; middleware sketch below); artifact TTL + monthly trace partition drop |
| Backup target unavailable | Backups fail; service runs | Alert after one miss; do not prune local last-good snapshot until remote restic run succeeds |
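
The disk-pressure row is the one that needs code rather than an alert. A minimal sketch of the admission check in agentd-web, assuming FastAPI; the mount path, threshold, and Retry-After value are placeholders:

    # Hedged sketch: refuse new task submissions above 85% disk usage.
    import shutil
    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse

    app = FastAPI()
    DATA_MOUNT = "/"            # assumption: one mount to watch
    ADMISSION_THRESHOLD = 0.85

    @app.middleware("http")
    async def disk_admission(request: Request, call_next):
        # Only gate new work; reads, approvals, and health checks pass through.
        if request.method == "POST" and request.url.path.startswith("/tasks"):
            du = shutil.disk_usage(DATA_MOUNT)
            if du.used / du.total >= ADMISSION_THRESHOLD:
                return JSONResponse(
                    status_code=503,
                    content={"detail": "disk above 85%; submissions paused"},
                    headers={"Retry-After": "600"},
                )
        return await call_next(request)

Gating only POST /tasks means the operator can still read results and clear approvals while the host is full.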

Scaling path

| Scale | Action | What changes | What stays |
|---|---|---|---|
| 1x (today) | Baseline | — | All co-resident |
| 10x | Tune Postgres (shared_buffers 4→8 GB, work_mem, autovacuum); spawn 2–4 worker processes; partition traces and model_calls by month (drop-by-partition cheap); cap artifact retention; add PgBouncer only if active connections >50; add 2nd worker VM only if browser/code sandboxes saturate RAM | Same architecture; same API contract | Postgres queue, model policy tables, memory/KG schema |
| 100x | Move Postgres to managed (Neon/RDS) or split: primary for state, replica for traces; horizontal worker pool on a 2nd VM | Postgres location + 2nd VM | API code, worker code, agent loop unchanged |
| 1000x | Introduce real broker (NATS/Redis Streams) only when measured pg queue throughput > 2k jobs/sec | Queue tech | Postgres still primary state |

Resource budget (steady-state, 30 GB host)

| Service | RAM | CPU (steady / burst) | Disk |
|---|---|---|---|
| OS + systemd + Docker daemon | 1.5–2.5 GB | <1 idle | 8–12 GB |
| Caddy | 50 MB | 0.05 / 0.5 | 50 MB |
| agentd-web | 300–600 MB (1 GB cap) | 0.2 / 1 | — |
| agentd-worker (4 workers @ 0.7–1.2 GB) | 3–5 GB (6 GB cgroup cap) | 0.5 / 8 | metadata in PG |
| agentd-sweeper | 100 MB | negligible | — |
| Tool sandboxes (max 2 active) | 0 idle / 3 GB peak (1.5 GB each, cgroup-capped) | 0 / 4 | 5–8 GB tmp w/ TTL |
| Postgres (tuned) | 3–4 GB steady, 6 GB cap | 1 / 4 | 18–25 GB initial, alert at 35 GB |
| Prometheus + node-exporter | 0.7–1.2 GB | 0.2 / 1 | 4–6 GB (15-day retention) |
| Grafana | 250–400 MB | 0.05 / 0.5 | <1 GB |
| journald (JSON) | bounded | negligible | 2 GB cap |
| restic backup runs (nightly) | 200–500 MB during backup | 1–2 vCPU nightly | no large local retention |
| Steady state | ~9–12 GB | ~3 vCPU | ~45 GB |
| Expected peak | 18–21 GB (4 active workers + 2 sandboxes + Postgres warm) | 8 vCPU burst | — |
| Headroom | 8–10 GB RAM for page cache and burst | — | ~25 GB disk |

Swap policy: emergency only. vm.swappiness=10; alert on sustained swap or >512 MB swap used. Workers and sandboxes get cgroup memory caps. Do not tune the system to depend on the 4 GB swap.

Disk policy: 75 GB is tight. Full prompt/tool traces require TTLs. Backups go off-host. If trace/artifact retention beyond 30 days matters, buy more disk before adding services.


4. Tradeoff Analysis

Postgres queue, not Redis Streams or NATS.
Redis with AOF appendfsync=always kills throughput; appendfsync=everysec loses up to 1 s of work on a crash. NATS JetStream is durable but adds a service. SKIP LOCKED on this hardware hits ~2k jobs/sec — ample headroom. The decisive argument is not perf but consistency: enqueue + state-update in one transaction eliminates the dual-write problem (task enqueued but state-write failed, or vice versa). With Redis the fix is either a transactional outbox (more code, more bugs) or accepted drift (more bugs). Reject Redis on consistency, not perf.
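
A minimal sketch of that claim path, assuming psycopg >= 3.2 (for notifies(timeout=...)), the task_queue schema from §3, and that the API's enqueue INSERT issues NOTIFY task_queue in the same transaction; execute_task is a placeholder for the agent loop:

    # Minimal claim loop: lease via SKIP LOCKED, wake early via LISTEN/NOTIFY.
    import psycopg

    CLAIM_SQL = """
    UPDATE task_queue
       SET status = 'running', claimed_by = %(worker)s,
           claimed_at = now(), attempts = attempts + 1
     WHERE id = (SELECT id FROM task_queue
                  WHERE status = 'queued'
                  ORDER BY created_at
                  FOR UPDATE SKIP LOCKED
                  LIMIT 1)
    RETURNING id, payload;
    """

    def run_worker(dsn: str, worker_id: str) -> None:
        with psycopg.connect(dsn, autocommit=True) as conn:
            conn.execute("LISTEN task_queue")
            while True:
                row = conn.execute(CLAIM_SQL, {"worker": worker_id}).fetchone()
                if row is None:
                    # Missed NOTIFYs degrade to a 5 s poll, never a stuck worker.
                    for _ in conn.notifies(timeout=5.0):
                        break
                    continue
                task_id, payload = row
                execute_task(task_id, payload)  # placeholder: LangGraph run + attempt row

The lease is just status='running' plus claimed_at; the sweeper's requeue query is the same WHERE clause with the 5-minute expiry added.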

In-process model gateway, not LiteLLM.
LiteLLM's value is credential isolation across many consumers. With one consumer, the separate process buys nothing and costs plenty: another container healthcheck, another HTTP hop, another :4000 socket, another set of logs to merge during incident review. Tier routing — "try free A, on 429 try free B, on still-failing AND urgency=high try paid C" — is ~50 LOC. Reject LiteLLM on operability, not capability.
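
A sketch of that routing core under stated assumptions: TIERS, call_provider, and RateLimited are stand-ins for the real model list, client call, and 429/5xx errors; provider_health is the cooldown table from §3; the daily-spend cap is omitted for brevity:

    # Tier routing with a Postgres-backed circuit breaker (survives restarts).
    import psycopg

    TIERS = ["free_a", "free_b", "gemini_direct", "paid_claude"]  # illustrative names

    class RateLimited(Exception):  # stand-in for provider 429/5xx errors
        pass

    def open_breaker(conn, provider: str, err: str, cooldown: str = "5 minutes") -> None:
        conn.execute(
            """INSERT INTO provider_health (provider, opened_until, last_error)
               VALUES (%s, now() + %s::interval, %s)
               ON CONFLICT (provider) DO UPDATE
                 SET opened_until = EXCLUDED.opened_until,
                     last_error   = EXCLUDED.last_error""",
            (provider, cooldown, err),
        )

    def complete(conn, prompt: str, urgency: str = "normal") -> str:
        blocked = {p for (p,) in conn.execute(
            "SELECT provider FROM provider_health WHERE opened_until > now()")}
        for provider in TIERS:
            if provider in blocked:
                continue
            if provider == "paid_claude" and urgency != "high":
                continue  # paid tier is policy-gated (and spend-capped)
            try:
                return call_provider(provider, prompt)  # stand-in client call
            except RateLimited as exc:
                open_breaker(conn, provider, str(exc))
        raise RuntimeError("all routes exhausted: wait, or escalate to approval")

Because the breaker is a row, a worker restart mid-cooldown resumes with the same routing decision instead of hammering a rate-limited provider.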

Drop Neo4j (probably).
Cypher genuinely beats SQL for k-hop pattern matching. The honest question: does the workload run those? If Graphiti is a glorified entity store with ≤2-hop queries, Postgres (src, dst, rel) + recursive CTE handles it at a fraction of the RAM. If KG queries grow 4-hop or pattern-heavy in 6 months, you'll regret this. Mitigation: keep the Graphiti schema dump + a re-import path documented — regret-cost is one weekend, not a rewrite.
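
For scale, the 2-hop query shape this bets on, assuming the kg_edges schema from §3 and a psycopg-style connection; the depth cap is what keeps cyclic graphs terminating:

    # Bounded k-hop neighborhood over kg_edges with a recursive CTE.
    NEIGHBORHOOD_SQL = """
    WITH RECURSIVE walk(node, depth) AS (
        SELECT dst, 1 FROM kg_edges WHERE src = %(start)s
      UNION                                  -- UNION (not ALL) dedupes revisits
        SELECT e.dst, w.depth + 1
          FROM kg_edges e
          JOIN walk w ON e.src = w.node
         WHERE w.depth < %(max_depth)s       -- depth cap bounds cyclic graphs
    )
    SELECT DISTINCT node FROM walk;
    """

    def neighborhood(conn, start: str, max_depth: int = 2) -> list[str]:
        rows = conn.execute(NEIGHBORHOOD_SQL,
                            {"start": start, "max_depth": max_depth}).fetchall()
        return [node for (node,) in rows]

At max_depth=4 with pattern constraints this query gets ugly fast — which is exactly the empirical line past which Neo4j earns its RAM back.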

Drop Langfuse (conditionally).
The UI is genuinely good for trace exploration. But it's 2 GB resident with its own DB. Decisive question: is the UI opened during normal debugging, or only during incidents? Daily → keep with MemoryMax=2G. Weekly/monthly → structured traces rows + Grafana panel gives 80% UX at 5% RAM. Honest loss: rich span-tree visualization.
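
The traces replacement is mostly schema. A sketch of the monthly partitioning, with column names assumed rather than final; retention becomes an O(1) partition drop in agentd-sweeper instead of a long DELETE:

    # Month-partitioned trace storage; retention = DROP TABLE on aged partitions.
    TRACES_DDL = """
    CREATE TABLE IF NOT EXISTS traces (
        task_id    bigint      NOT NULL,
        started_at timestamptz NOT NULL,
        span       jsonb       NOT NULL      -- name, duration, model, cost, error
    ) PARTITION BY RANGE (started_at);

    CREATE TABLE IF NOT EXISTS traces_2026_05 PARTITION OF traces
        FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');
    """

    def drop_expired_partition(conn, partition_name: str) -> None:
        # Called from agentd-sweeper's monthly tick; instant vs. a long DELETE.
        conn.execute(f"DROP TABLE IF EXISTS {partition_name}")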

Keep LangGraph.
Counter-position: a while not done: think → tool → observe loop is 30 lines. True until conditional edges, retries, parallel tool calls, and human-in-the-loop pauses appear — at which point you've reinvented LangGraph badly. Single-process, no external deps. Keep.

Keep cAdvisor with fewer containers.
~50 MB for per-container resource attribution that node-exporter doesn't give. Postgres becoming the dominant container makes its memory trajectory the single most valuable signal before OOM. Cheap insurance.

What I'm trading:
- Operability for capability: in-process LangGraph means one bad agent loop can OOM the worker. Mitigation: MemoryMax=4G in the systemd unit + auto-restart.
- Capability for cost: dropping Neo4j gives up Cypher. Acceptable if the workload is shallow.
- Latency for cost: dropping LiteLLM cuts ~5–20 ms per call (HTTP loopback). Net win.


5. Migration Plan

All steps reversible. Execute in order.

| # | Step | Verify | Rollback | Downtime |
|---|---|---|---|---|
| 1 | Add task_queue(id, payload jsonb, status, claimed_at, attempts, created_at). Wire FastAPI to dual-write Redis + Postgres. Worker still reads Redis only. | Row counts match for 24 h. | Drop the Postgres dual-write code path. | Zero |
| 2 | Switch worker to read from Postgres SKIP LOCKED. Keep Redis writes 7 more days as audit. | Queue depth, throughput, error rate match prior baseline. No tasks stuck >5 min. | QUEUE_BACKEND=redis env flip. | Zero |
| 3 | Stop Redis writes; remove Redis container. | Redis-down alert acknowledged; nothing else depends on it. | docker compose up redis, re-enable dual-write. | Zero |
| 4 | Inventory + restore drill on current backups (Codex-adopted). Run pg_dump/restic restore to scratch DB on a fresh VM; verify counts and API smoke test. | Restored DB opens; row counts match; API works against it. | Read-only; nothing to roll back. | Zero |
| 5 | Add agentd package with web / worker / sweeper modes. Deploy as 3 systemd units alongside current services. New units idle (no traffic). | agentd-web healthcheck local; agentd-sweeper logs no errors; new units restart cleanly. | Stop new units. | Zero |
| 6 | Add task_attempts table + idempotency-key column on tasks. Backfill from existing task log. | Sample tasks show one-row-per-attempt; duplicates blocked by idempotency key. | Drop new column / table. | Zero |
| 7 | Implement gateway.py in-process with per-provider circuit-breaker rows (provider_health(provider, opened_until, last_error)). Run in shadow mode: calls still go through LiteLLM, app computes chosen route+cost separately, log diff. | Route/cost decisions match or improve LiteLLM's; fallback simulator triggers breaker correctly. | Disable shadow logging. | Zero |
| 8 | Switch model calls to in-process router. LiteLLM proxy stays running but unused for one retention window. | Provider failure injection triggers fallback chain; daily-spend cap blocks paid tier when exceeded. | Repoint env to LiteLLM. | Zero |
| 9 | Stop LiteLLM container. | No 502s; no missing trace fields. | docker compose up litellm. | Zero |
| 10 | Migrate tool sandbox to ephemeral rootless Docker + headless Playwright (Codex-adopted). Ship per-task container spawner with cgroups, seccomp, no-default-network (spawner sketch below). Canary: 10% of browser/code tasks. | Same task results; container reaped after each task; no host file pollution. | Flip env back to host Playwright/Xvfb. | Zero |
| 11 | Cut all sandbox tasks to Docker. Stop Xvfb, remove its package. | Browser tool works for 7 days. | Restart Xvfb. | Zero |
| 12 | Audit Graphiti queries: instrument 30-day query log. If <100/day AND max depth ≤2 → migrate to Postgres kg_* (one-time export). If heavy → keep Neo4j. | Agent KG-tool returns identical results on 20-task replay. | Re-point agent at Neo4j. | ~30 min maintenance window for cutover |
| 13 | If Step 12 cutover succeeded: stop Neo4j container. | KG tool works for 7 days. | Restart Neo4j. | Zero |
| 14 | Audit Langfuse UI usage (last login, view counts). If unused: implement traces + model_calls Postgres tables (partitioned by month) + Grafana panel; cut over agent trace emitter; dual-emit to Langfuse for one retention window. | New panel shows last-hour spans, costs, tool errors. | Repoint emitter at Langfuse. | Zero |
| 15 | Stop Langfuse if Step 14 succeeded. | Trace queries served from Postgres for 7 days. | Restart Langfuse. | Zero |
| 16 | Stop OpenHands if not in any task path (audit task_log for tool calls in last 30 d). | No alerts, no skill calls fail. | Restart container. | Zero |
| 17 | Move SOPS decrypt target to /run/agent/env (tmpfs) via systemd ExecStartPre. Update all units. | Plaintext env absent from app dir after reboot; services read credentials. | Point units back to old .env. | Brief restart per unit |
| 18 | Postgres tuning: shared_buffers=4GB, effective_cache_size=12GB, work_mem=32MB, max_connections=50, autovacuum_vacuum_scale_factor=0.05. | pg_stat_statements p95 unchanged or improves. | Revert postgresql.conf. | ~30 sec restart |
| 19 | Replace backup cron with restic off-host + pg_dump -Fc + WAL archive + weekly automated restore drill. Keep old cron until new backup has 2 successful runs. | Fresh restore on scratch VM produces working API; restic prune logs healthy. | Re-enable cron. | Zero |
| 20 | Add disk admission control at 85% in agentd-web (returns 503 with retry-after); enable monthly trace partition drop in agentd-sweeper. | Synthetic disk-fill test triggers admission denial. | Disable admission flag. | Zero |

Estimated wall-clock: 6–8 weeks at one solo-operator session per week, mostly bake time + audit windows.
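
Steps 10–11 hinge on the per-task spawner referenced above. A hedged sketch via the docker CLI; the agent-sandbox image name, memory cap, and timeout are placeholders drawn from the resource budget, not shipped code:

    # Per-task disposable sandbox via the docker CLI (rootless daemon assumed).
    import subprocess
    import tempfile

    def run_sandboxed(cmd: list[str], image: str = "agent-sandbox") -> subprocess.CompletedProcess:
        workspace = tempfile.mkdtemp(prefix="task-")     # per-task scratch dir
        return subprocess.run(
            ["docker", "run", "--rm",                    # reaped on exit
             "--network", "none",                        # no default network
             "--memory", "1536m",                        # cgroup cap per budget
             "--pids-limit", "256",                      # no fork bombs
             "-v", f"{workspace}:/workspace",
             image, *cmd],
            capture_output=True, text=True, timeout=600,
        )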


6. Open Questions

  1. What is daily task volume and peak burst rate? <100/day → even more aggressive simplification (drop Prometheus, structured logs to file/Loki). >10k/day → queue tuning order changes.

  2. Is the Graphiti KG queried at ≥3-hop depth or with pattern-matching, or used as an entity store? Decides Steps 12–13 — drop Neo4j or keep it.

  3. Does Jonah open the Langfuse UI during normal debugging, or only during incidents? Decides Steps 14–15 — drop Langfuse or contain it.

  4. Realistic agent loop length distribution — single tool call, or 10+ tool turns? Determines whether worker memory budget of 2.5 GB peak is sufficient or needs to be 4 GB.

  5. Is OpenHands invoked by any task or skill, ever? Decides Step 16 immediately.

  6. Does any current task genuinely need >1 concurrent worker today, or is the queue functionally serial? Affects Postgres max_connections and the 10x scaling assumption.

  7. What is the cost-of-outage tolerance? A 30-min Postgres restore window is acceptable as described. If there's a real-time SLA (Brian responding to Jonah on TG), a hot standby Postgres replica becomes worth its 4 GB RAM and the design needs a streaming-replication arm.

  8. Is code execution fully untrusted (e.g., agent generates and runs arbitrary Python/JS), or only owner-authored scripts? Fully untrusted may justify a second disposable worker VM for sandbox-only workloads — even rootless Docker shares a kernel with Postgres. Cheap insurance against kernel CVEs.

  9. What RPO/RTO is actually needed? pg_dump only = 24 h RPO. WAL archiving = 5 min RPO, ~30 min RTO. Streaming replica + WAL = <1 min RPO, <5 min RTO at the cost of a second host. The design defaults to WAL archiving; upgrade only if 5 min loss is unacceptable.

  10. Will anything besides this agent need a shared LLM gateway in the next 6 months? If yes (e.g., a separate chat UI, a CRM webhook handler), LiteLLM proxy may belong back in the diagram behind the same router interface. If no, in-process stays.


7. Adopted from Peer Review (Codex, 2026-04-25)

Codex's redesign reached the same five top-level cuts (Redis, LiteLLM, Neo4j, Langfuse, OpenHands) but went further on three axes I'd undersold. Adopted into this document:

| # | Adoption | Where applied | Why it improves the plan |
|---|---|---|---|
| 1 | Drop Xvfb; run browser + code-exec in ephemeral rootless Docker (cgroups, seccomp, no default network) | §1 row, §3 Tool sandbox row, §3 failure modes (sandbox escape), §5 Steps 10–11 | Same-host Playwright leaks files / runs uncapped on hostile inputs. The added Docker daemon (~150 MB + 8 GB image disk) buys real isolation — the one added surface that earns its keep. |
| 2 | agentd single package with web / worker / sweeper systemd modes | §3 component table, §5 Step 5 | One build, one deploy, shared types. Sweeper names the durability work that was implicit in my "visibility timeout" hand-wave. |
| 3 | task_attempts table + idempotency keys on tasks | §3 datastore row, §5 Step 6 | Clean retry/duplicate semantics. My single task_log lumped current state with history. |
| 4 | SOPS decrypt → /run/agent/env tmpfs, not persistent .env | §1, §3, §5 Step 17 | Closes the plaintext-at-rest gap that contradicts SOPS's own promise. |
| 5 | restic off-host + weekly automated restore drill (vs my monthly) | §1, §3, §5 Steps 4 & 19 | Backup without restore is theatre; weekly drill on a solo system catches drift before the incident. |
| 6 | Per-provider circuit-breaker as Postgres rows with cooldown timestamps | §3 model gateway, §3 failure modes, §5 Step 7 | State survives worker restart; in-memory breakers don't. |
| 7 | Disk admission control at 85%, not just alerts | §3 failure modes, §5 Step 20 | Refusing new submissions is the only thing that prevents a WAL-write death spiral on a 75 GB host. |
| 8 | Trace + model_call tables partitioned by month | §3 tracing row, §3 scaling 10x | Drop-by-partition is O(1); my "truncate >30d" was a long DELETE. |
| 9 | Shadow-mode validation for the in-process router cutover | §5 Step 7 | Stronger than feature-flag canary: compares route decisions against LiteLLM's before any cutover. |
| 10 | Resource budget peak revised up to 18–21 GB (4 workers + 2 active sandboxes + Postgres warm) | §3 resource budget | More honest than my 13 GB peak; preserves a real headroom claim. |
| 11 | PgBouncer-only-when-needed (>50 connections) at 10x | §3 scaling table | Names the threshold; avoids a cargo-culted pooler. |
| 12 | Disposable-worker-VM caveat for fully untrusted code | §6 Q8 | Honest acknowledgment that rootless Docker shares a kernel. |

Rejected from Codex (kept my position):

| Codex point | Reason rejected |
|---|---|
| Drop cAdvisor "unless container metrics prove necessary" | Postgres-as-container memory drift is the highest-value pre-OOM signal. ~50 MB is cheap insurance. |
| No specific Postgres tuning numbers | Solo operator needs concrete defaults to start from, not "tune later". Kept shared_buffers=4GB, effective_cache_size=12GB, work_mem=32MB, max_connections=50. |
| Drop Langfuse outright | Kept conditional — if the UI is opened daily, the trace UX is genuinely worth 2 GB. Audit before cutting. |
| Assumed Neo4j removal is safe | Kept the empirical query-audit step. Cypher's k-hop expressiveness is real if the workload uses it. |

Net effect on the design: the diagram gets one new component (Docker daemon for sandboxes), the migration goes from 12 to 20 steps (each smaller and more reversible), and the system becomes meaningfully harder to compromise via hostile inputs. RAM peak honesty went up; nothing else got bigger.


8. Actual Rollout Outcome (2026-04-25 21:00 Beirut)

This section is the diff between §5 Migration Plan (intent) and what actually executed tonight. It supersedes the corresponding §5 rows where it conflicts. Rollback artifacts are enumerated at the end.

8.1 What landed live

| Step (§5 ref) | Component | Outcome | Verification |
|---|---|---|---|
| 18 | Postgres tuning (shared_buffers=4GB, effective_cache_size=12GB, work_mem=32MB, max_connections=50, autovacuum_vacuum_scale_factor=0.05) | Applied. ~2 sec restart. | SHOW shared_buffers = 4GB; cAdvisor shows steady warm-up to 4 GB resident. |
| 19 | restic off-host backups to contabo target | Live. 3 snapshots taken (initial, post-Neo4j-migration, post-queue-cutover). | restic snapshots lists 3 IDs incl. ed8aea1a. |
| 14–15 | Langfuse stopped | Container stopped, restart-policy=no, image preserved. Trace emitter rerouted to traces/model_calls Postgres tables. | docker ps no longer shows langfuse-*; traces row count rising. |
| 12–13 | Neo4j stopped + KG migrated to Postgres | 1616 nodes / 2826 edges exported and imported into kg_nodes / kg_edges. Neo4j container stopped, restart-policy=no. | SELECT count(*) FROM kg_nodes = 1616; kg_edges = 2826. Agent KG-tool replay returned identical results. |
| 16 | OpenHands stopped | Container stopped, restart-policy=no, preserved. | docker ps no longer shows OpenHands; no skill calls failed in 2-hour soak. |
| 1–2 | Queue migrated to Postgres (QUEUE_BACKEND=postgres) | Worker confirmed claiming via claimed_by column on task_queue. Redis still receiving writes (audit window not yet closed). | SELECT id, claimed_by, claimed_at FROM task_queue WHERE claimed_by IS NOT NULL returns active rows; no tasks stuck >5 min. |
| 17 | tmpfs SOPS wired (/run/agent/env via systemd ExecStartPre) | All units now decrypt at start; no plaintext .env re-read after reboot test. | Post-reboot ls /opt/agent/core/.env* shows backup only; /run/agent/env populated. |
| 20 | Disk admission middleware | Wired into agentd-web. Returns 503 + retry-after at 85%. | Synthetic disk-fill (loop file in /var/tmp) triggered admission denial; cleared on file delete. |
| 10 | Sandbox image built | Rootless Docker image ready (cgroups, seccomp, no default network). NOT yet wired into tool dispatcher — see §8.2. | docker images \| grep agent-sandbox shows tagged image. |
| 7 | Gateway code wired into agent.py | gateway.py imported; routing through in-process module under shadow flag. | Shadow log shows route/cost decisions matching LiteLLM for the soak window. |

8.2 Deferred (with reason and unblock)

| Item | Why it didn't ship tonight | Unblock |
|---|---|---|
| Stop agent-redis container (§5 Step 3) | Dependency audit (redis_consumer_audit.md) found 14 other code paths still using Redis: chrome_bridge, circuit_breaker, workflow_bus, brian_roles, telegram_bot, cc_routes, collab_orchestrator, plus 7 others. Each is its own scope of change. | Per-consumer migration plan; cannot be one cutover. Redis stays up until each consumer is audited + cut over individually. |
| Flip GATEWAY_MODE=live (§5 Step 8) | First attempt returned 429 on all 5 tiers because gateway.py was reading env var GOOGLE_API_KEY, while LiteLLM (and the actual project quota) uses GOOGLE_AI_STUDIO_KEY — different keys, different quotas. Fix landed at gateway.py:347 (now reads GOOGLE_AI_STUDIO_KEY first, falls back to GOOGLE_API_KEY; sketch below). | Live cutover deferred to next session with proper unit-test validation of the env-var precedence + a single non-shadow synthetic call before flipping the flag. |
| Browser / code-exec → Docker sandbox cutover (§5 Steps 10–11) | Sandbox image is built, but the tool dispatcher refactor that injects per-task containers is still pending. | Refactor tool_registry.py browser/code paths to spawn-via-image instead of host process; canary 10% per Step 10. |
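
The shape of the gateway.py:347 fix, reconstructed as a sketch (not the actual code):

    # Env-var precedence: prefer the key that carries the project quota,
    # fall back to the legacy name so older setups keep working.
    import os

    def google_key() -> str:
        key = os.environ.get("GOOGLE_AI_STUDIO_KEY") or os.environ.get("GOOGLE_API_KEY")
        if not key:
            raise RuntimeError("no Google key in environment")
        return key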

8.3 Bugs caught during execution

These are filed as feedback memory entries so future sessions don't repeat them:

  1. .env file newline corruption — an echo >> appended without a leading \n, mashing two keys onto one line and breaking dotenv parse for downstream consumers. Fix: always printf '\n%s=%s\n' or use a python-dotenv set call (sketch after this list). → feedback_env_file_newline_safety.md.

  2. Off-host backup target tunnel-vision — defaulted to "rent B2/Wasabi" before realizing we already own four off-host targets (contabo VPS, GoogleDrive mount, Hetzner volume, GitHub). Asked Jonah for B2 unnecessarily. → feedback_compose_full_toolbox_backup_targets.md.

  3. LiteLLM key naming — GOOGLE_AI_STUDIO_KEY vs GOOGLE_API_KEY are different keys with different quota meters. Code paths that mix them silently pull from the wrong (often empty) bucket and surface as 429. → reference_litellm_key_naming.md.
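
For bug #1, the safe write path is a rewrite rather than an append; python-dotenv's set_key handles the newline bookkeeping (path illustrative):

    # set_key rewrites the file with correct newlines, unlike a raw `echo >>`.
    from dotenv import set_key

    set_key("/opt/agent/core/.env", "NEW_KEY", "value")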

8.4 Rollback artifacts available

8.5 Container count

8.6 RAM headroom reclaimed


End of document.