Date: 2026-05-03
Status: v2.0 — post-round-table (R1+R2+R3+R4), pending Jonah final sign-off on 4 trade-offs
Owner: Brian
Code home: /opt/agent/cli-pwa (extending existing scaffold)
History below preserved as audit trail (sections 0.5/0.6 are R1/R2/R3 patches). This v2 section is the SINGLE SOURCE OF TRUTH and supersedes all earlier sections on conflict.
Domain: brian.jonahtebaa.com (Jonah locked).
Voices: Puck (male, narration + chat) + Kore (female, clinical session-output reading) (Jonah locked). Two pinned WS to gemini-live :8102 because voiceConfig is immutable mid-session. Serial priority queue across both — never simultaneous.
Activation: Push-to-talk with DataChannel gating + always-on track lifecycle (R2/R3 finding).
Session end: Explicit only (Jonah locked). v2 adds idle-suspend (NOT end) — see §V2.5.
noVNC: Auto-spawn wezterm window per session (Jonah locked). v2 switches WM from xfce to i3wm on :1 for tiling — xfce preserved on :2 for personal use.
PWA: Mobile-first, Apple-feeling premium skin, Framer Motion + SF Pro + glassmorphism + micro-interactions. Honest framing: "premium" not "designed by Apple itself" (R4 universally rejected the latter as 5h-impossible).
Build budget: ~75h (R4 honest median; R1's 27h was wishful, v1.2's 32h and v1.3's 39h were both optimistic).
/app: CLI command typed inside any tmux/CC session adopts THAT session into supervisor (Jonah locked).
The single biggest insight from the four rounds: every trust-failure mode (split-brain, reboot, OOM, rate-limit, surface-desync) collapses to the same root — there is no single source of truth for a session. Build that first.
session-supervisor.py — one Python daemon per session. Owns:
- The tmux session containing the claude CLI subprocess
- The DB row state (running/parked/suspended/dead)
- All surface attachments (PWA WS, noVNC wezterm, voice gateway, /app adoption)
- Permission/approval routing (broadcasts to all surfaces, first responder wins)
- Reboot recovery (tmux-resurrect-style replay)
- Idle-suspend / warm-resume lifecycle
- Quota-fallback signaling (when Gemini 429s)
Source of truth: the tmux pane scrollback + Anthropic CLI's session state in ~/.config/anthropic/sessions/<id>/. All surfaces are READ-MIRRORS of tmux scrollback (via tmux pipe-pane) and WRITE-INJECTORS into tmux (via tmux send-keys). No surface owns its own copy. This solves the wezterm↔PWA bidirectional sync that Vibe identified as the Day-1 abandonment trigger.
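A minimal sketch of the read-mirror / write-injector split, assuming the supervisor shells out to the tmux CLI. The helper names `mirror_cmd` and `inject_cmd`, and the idea of mirroring into a log file that surfaces tail, are illustrative assumptions and not part of the existing scaffold. The helpers only build argv lists; the caller decides how to run them.

```python
import shlex
from typing import List


def mirror_cmd(tmux_target: str, log_path: str) -> List[str]:
    """Build the argv that makes tmux append pane output to log_path.

    -o only opens a pipe if none is attached yet, so calling this twice
    does not stack two writers on the same pane.
    """
    return ["tmux", "pipe-pane", "-o", "-t", tmux_target,
            f"cat >> {shlex.quote(log_path)}"]


def inject_cmd(tmux_target: str, text: str) -> List[List[str]]:
    """Build the argv pair that types text into the pane and presses Enter.

    -l sends the text literally (no key-name lookup); "--" stops option
    parsing so text that starts with "-" is not read as a flag.
    """
    return [
        ["tmux", "send-keys", "-t", tmux_target, "-l", "--", text],
        ["tmux", "send-keys", "-t", tmux_target, "Enter"],
    ]
```

Every surface (PWA WS, voice gateway, /app) would go through the same two helpers, so no surface keeps its own copy of the session.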
PWA mic ──WebRTC──► brian-voice-gateway.py
│
├── DataChannel gate (track always-on, Opus only flows when ptt_down=true)
│
├── Puck WS to gemini-live :8102 (pinned, narration + chat)
└── Kore WS to gemini-live :8102 (pinned, clinical session output)
Audio arbiter (state machine):
IDLE → NARRATING(Puck) → CLINICAL(Kore) → PTT_ACTIVE
Kore preempts Puck on first PCM chunk (50ms gain ramp-down on Puck)
PTT_ACTIVE preempts everything (DataChannel send clear_buffer to BOTH Gemini WS)
Echo-cancel: Kore chunks carry sequence_id; on PTT-down, PWA sends last_played_id;
gateway rewinds Gemini context so model knows where it was interrupted
Codec: SDP forces opus 48kHz mono on transport, gateway resamples 48k→16k once for Gemini
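The arbiter above can be sketched as a small state machine. This is a sketch under assumptions: the event names (`tool_event`, `kore_first_chunk`, `ptt_down`, `ptt_up`, `utterance_complete`) and the recorded action strings are hypothetical, and real side effects (gain ramp, clear_buffer on the Gemini WS) are left to the caller.

```python
from enum import Enum
from typing import List


class State(Enum):
    IDLE = "idle"
    NARRATING = "narrating"    # Puck speaking
    CLINICAL = "clinical"      # Kore speaking
    PTT_ACTIVE = "ptt_active"  # user holding push-to-talk


class AudioArbiter:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.actions: List[str] = []  # side effects for the caller to execute

    def on_event(self, event: str) -> State:
        if event == "ptt_down":
            # PTT preempts everything: flush both pinned Gemini sockets.
            self.actions += ["clear_buffer:puck", "clear_buffer:kore"]
            self.state = State.PTT_ACTIVE
        elif self.state is State.PTT_ACTIVE:
            # While the user is talking, only ptt_up changes state.
            if event == "ptt_up":
                self.state = State.IDLE
        elif event == "tool_event" and self.state is State.IDLE:
            self.state = State.NARRATING
        elif event == "kore_first_chunk":
            if self.state is State.NARRATING:
                # Kore preempts Puck with a short gain ramp-down.
                self.actions.append("ramp_down:puck:50ms")
            self.state = State.CLINICAL
        elif event == "utterance_complete":
            self.state = State.IDLE
        return self.state
```

Keeping the transitions in one place makes the "never simultaneous" rule checkable in a unit test instead of by ear.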
Decisions deferred to Jonah (see §V2.10):
- Should the narrator (mid-call "running command" Puck utterances during tool calls) ship in v2.0 or be deferred? Codex argues defer — "dopamine polish on unreliable plumbing." I lean keep, because without it long tool chains feel dead.
| Issue | Mitigation |
|---|---|
| MediaStreamTrack killed after ~5min backgrounded | track.onended handler triggers re-acquisition flow (silent re-permission if granted, recreate track, ICE-restart) |
| AudioContext suspended on background | audioCtx.resume() on every touchstart of the PTT button |
| PSTN call / notification steals audio focus | audiointerruptbegin/end events → pause UI + "phone call interrupted" banner + auto-resume on audiointerruptend |
| WS dies on background | iOS app-foregrounded event triggers reconnect; UI shows blur+spinner only if recovery >200ms |
| Web Push needs install + 24h interaction | Fallback: every push event ALSO sent to TG COMMS; in-app red-dot badge using setAppBadge (16.4+) or CSS ::after (older) |
| MediaRecorder Opus flaky | opus-recorder WASM library (~30KB), bypass native MediaRecorder |
| PWA install rate 12-18% | Custom A2HS modal mimicking iOS native sheet, trigger after 3rd session OR high-value action; never the cheap browser infobar |
Reboot recovery (was broken in v1.x):
- New systemd unit tmux-resurrect@.service saves tmux state every 5min to /var/lib/brian-pwa/tmux-snapshots/
- On boot, agent-pwa-recovery.service (oneshot) restores tmux sessions BEFORE cli-pwa-backend starts
- Supervisor on first PWA tap rehydrates claude --resume <id> from Anthropic CLI's persisted session state
- PWA shows "Resumed from
- Reaper marks sessions whose Anthropic state is unrecoverable as dead; otherwise suspended and ready for warm-resume
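A rough sketch of the reaper decision, assuming "unrecoverable" can be approximated by the per-session state directory being missing on disk. That test, the function name `reap_after_boot`, and the dict-of-statuses shape are assumptions; the real check may need to inspect the state files themselves.

```python
from pathlib import Path
from typing import Dict


def reap_after_boot(rows: Dict[str, str], state_root: Path) -> Dict[str, str]:
    """Return the post-reboot status for each session id.

    rows maps session_id -> current DB status. Sessions already dead stay
    dead; anything with persisted state on disk becomes suspended and is
    eligible for warm-resume; the rest are marked dead.
    """
    result: Dict[str, str] = {}
    for session_id, status in rows.items():
        if status != "dead" and (state_root / session_id).is_dir():
            result[session_id] = "suspended"
        else:
            result[session_id] = "dead"
    return result
```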
Idle-suspend (NEW in v2 — solves OOM, honors "never auto-end" lock):
- Session enters suspended state after 30 min of pure idle (no tool calls, no user input, no voice).
- Suspend kills the claude subprocess + closes wezterm window. Preserves tmux scrollback + DB row + Anthropic session state on disk.
- ANY surface tap re-spawns claude subprocess via --resume <id> + re-spawns wezterm window. Perceived warm-resume <3s.
- Critical: suspended ≠ ended. The DB row stays alive forever (or until Jonah explicitly ends). Honors the lock — no session auto-ends.
- 8-session OOM scenario: at most 1-2 active at any moment; rest suspended. RAM footprint stays bounded.
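The 30-minute rule reduces to one comparison against the most recent activity of any kind. A minimal sketch, with hypothetical field names for the three activity timestamps (epoch seconds):

```python
import time
from dataclasses import dataclass
from typing import Optional

IDLE_LIMIT_S = 30 * 60  # 30 min of pure idle, per the rule above


@dataclass
class Activity:
    last_tool_call: float
    last_user_input: float
    last_voice: float


def should_suspend(activity: Activity, now: Optional[float] = None) -> bool:
    """True when no tool call, user input, or voice happened for 30 min."""
    now = time.time() if now is None else now
    last_seen = max(activity.last_tool_call,
                    activity.last_user_input,
                    activity.last_voice)
    return now - last_seen >= IDLE_LIMIT_S
```

Suspend only changes the DB status to suspended and frees the subprocess; nothing here ever ends a session.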
When gateway gets 429 from gemini-live:
1. Mute the voice channel (no more dead-silence loops).
2. PWA modal: "Voice quota hit — switching to text mode for this session" + visual indicator on session card.
3. TG COMMS notification: "Voice degraded for session
4. Session continues in text mode (chat works, transcripts work).
5. After 1h cooldown, gateway probes Gemini; if recovered, voice button re-enables.
6. Hard-fallback to local Whisper+Piper is deferred to v2.1 (out of v2.0 scope to keep budget honest).
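Steps 1 to 5 can be sketched as a small cooldown tracker. The class name `VoiceQuota` and its methods are hypothetical, and restarting the cooldown after a failed probe is an assumption the steps above do not state.

```python
from typing import Optional

COOLDOWN_S = 60 * 60  # 1h before the gateway probes Gemini again


class VoiceQuota:
    def __init__(self) -> None:
        self.degraded_at: Optional[float] = None  # None means voice is healthy

    def on_status(self, status: int, now: float) -> None:
        """Record the first 429; the session itself continues in text mode."""
        if status == 429 and self.degraded_at is None:
            self.degraded_at = now

    def voice_enabled(self) -> bool:
        return self.degraded_at is None

    def should_probe(self, now: float) -> bool:
        return (self.degraded_at is not None
                and now - self.degraded_at >= COOLDOWN_S)

    def on_probe(self, ok: bool, now: float) -> None:
        # Assumption: a failed probe restarts the 1h cooldown.
        self.degraded_at = None if ok else now
```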
Locked specs (mandatory):
- Framer Motion springs: type: "spring", stiffness: 260, damping: 20
- Glassmorphism: backdrop-filter: blur(20px) saturate(180%) contrast(90%) + border: 0.5px solid rgba(255,255,255,0.1) (specular edge)
- Typography: -apple-system, 'SF Pro Display', system-ui; headers letter-spacing: -0.022em font-weight: 600; body -0.011em
- Safe-area insets on every screen
- Iconography: SF Symbols via Iconify
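- Haptics: navigator.vibrate([20,50,20]) on PTT, [50] on confirm actions. Feature-detect before calling: iOS Safari does not implement the Vibration API, so iPhone gets no haptics from this path.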
- Micro-interactions (mandatory, not optional):
  - Button press: transform: scale(0.95) plus opacity: 0.8, 100ms ease
  - List swipe: transform: translateX(-100px) plus opacity: 0, 200ms spring
  - Page transitions: slide+fade, no hard cuts
- 60fps gate on PTT call UI + session list scroll. @media (prefers-reduced-transparency) removes blurs in iOS Low Power Mode.
Cut from v2.0 (defer to v2.1):
- Dynamic Island replication
- layoutId shared-element morphs
- 3D transforms / multi-finger gestures
- Apple's specific spring-overshoot tuning per-component (use the one global spring config above)
Honest framing for Jonah: v2.0 ships "premium Apple-feeling skin." NOT "indistinguishable from Apple's own design team." That's a $50k-budget design engagement, not 10h of execution. R4 was unanimous on this: pretending otherwise is the kind of overreach that ships ugly.
UI build: ~10h (was 5h in v1.x, was unrealistic per R4).
| Phase | Scope | Hours |
|---|---|---|
| 0 | Audit + dependency install (i3wm, opus-recorder, framer-motion, workbox, web-push, aiortc, coturn). Dry-run cli-pwa current state. | 2 |
| 1 | session-supervisor.py (the keystone): tmux ownership, scrollback pipe-pane, surface attach API, permission broadcast, idle-suspend, reboot recovery via tmux-resurrect. | 12 |
| 2 | Migrate cli-pwa-backend routes (/sessions/*, /ws/sessions/:id, /sessions/:id/input, /sessions/:id/open-tab) to delegate to supervisor. Kill legacy spawn/kill code paths. Bidirectional wezterm↔PWA sync. | 8 |
| 3 | i3wm install on :1, xfce relocated to :2. wezterm theme + auto-tile. Window-id tracking + focus contract. | 3 |
| 4 | coturn self-host (TLS, UDP 3478-3481/5349, Let's Encrypt). | 2 |
| 5 | brian-voice-gateway.py: dual-WS to gemini-live, audio arbiter state machine, DataChannel gating, codec pinning, 50ms ramp-down crossfade, sequence_id echo-cancel. | 10 |
| 6 | iOS hardening: track lifecycle, audioCtx.resume, audiointerrupt events, ICE-restart-on-foreground, IndexedDB instant-resume, opus-recorder WASM. | 6 |
| 7 | Quota fallback (429 detection → mute + UI modal + TG COMMS push + 1h cooldown probe). | 2 |
| 8 | /app CLI command (Python script in tmux env, registers pane PID with supervisor, idempotent on (pane_pid, session_id)). | 2 |
| 9 | PWA frontend rebuild — Apple-feeling premium skin (Framer Motion, glassmorphism, SF Pro, micro-interactions, safe-area, haptics, A2HS modal, in-app badge). | 10 |
| 10 | Service worker + Workbox precache, BroadcastChannel cross-tab sync, IndexedDB session-state writer. | 3 |
| 11 | Push notifications (web-push subscribe, server emit on 6 event types, TG COMMS mirror). | 3 |
| 12 | noVNC tab top-bar injection (Return-to-Brian button, postMessage theme sync). | 1 |
| 13 | Auth: device-pairing JWT, Cloudflare Access policy, pwa.brianserves.me→brian.jonahtebaa.com Caddy block. | 3 |
| 14 | Narrator (session-narrator.py) IF kept: phrase library, tool-event subscriber, Puck low-priority injection. | 3 |
| 15 | E2E UAT against §11 success criteria. iPhone real-device test. Latency benchmarking on cellular. Reboot drill. 8-session OOM drill. | 5 |
Subtotal: 75h. Buffer: ~5h for the inevitable Safari quirks. Realistic total: 75-80h.
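A quick arithmetic check of the subtotal, using the per-phase hours from the table above. The dict name `PHASE_HOURS` is just for illustration.

```python
# Hours per phase, copied from the phase table (phase number -> hours).
PHASE_HOURS = {
    0: 2, 1: 12, 2: 8, 3: 3, 4: 2, 5: 10, 6: 6, 7: 2,
    8: 2, 9: 10, 10: 3, 11: 3, 12: 1, 13: 3, 14: 3, 15: 5,
}
BUFFER_H = 5  # Safari-quirk buffer

subtotal = sum(PHASE_HOURS.values())             # 75
with_buffer = subtotal + BUFFER_H                # 80
without_narrator = subtotal - PHASE_HOURS[14]    # 72, if phase 14 is dropped

print(subtotal, with_buffer, without_narrator)   # prints: 75 80 72
```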
| Failure | Mitigation |
|---|---|
| Hetzner reboot mid-night | tmux-resurrect + claude --resume; <3s warm-resume on first tap; "Resumed from X" Puck cue |
| iOS background-kill of audio | track.onended re-acquisition + audioCtx.resume + ICE-restart |
| iOS PSTN call interrupt | audiointerrupt events + auto-resume on end |
| Gemini Live 429 | mute voice + UI modal + TG COMMS + 1h probe + text-mode continuation |
| 8 idle sessions OOM | idle-suspend (30min) → at most 1-2 active claude processes |
| wezterm in noVNC vs PWA desync | tmux scrollback as source of truth, all surfaces read-mirror via pipe-pane |
| Permission prompt across surfaces | supervisor broadcasts; first responder wins |
| WS reconnect mid-call | preserve WebRTC, reconnect WS in background, 1s Puck filler |
| /app re-adoption loop | idempotent on (pane_pid, session_id); reject duplicate adoption |
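Idempotent adoption can lean on a uniqueness constraint rather than application-level checks. A minimal sketch with stdlib sqlite3; the table name `adoptions` and the helper names are hypothetical and not the existing cli-pwa schema.

```python
import sqlite3


def open_catalog(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS adoptions (
               pane_pid   INTEGER NOT NULL,
               session_id TEXT    NOT NULL,
               UNIQUE (pane_pid, session_id)
           )"""
    )
    return db


def adopt(db: sqlite3.Connection, pane_pid: int, session_id: str) -> bool:
    """Register a pane; return False if this exact pair was already adopted."""
    cur = db.execute(
        "INSERT OR IGNORE INTO adoptions (pane_pid, session_id) VALUES (?, ?)",
        (pane_pid, session_id),
    )
    db.commit()
    # rowcount is 0 when the UNIQUE constraint caused the insert to be ignored.
    return cur.rowcount == 1
```

A False return is where /app would print its "already in PWA" hint instead of adopting again.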
| # | Decision | LOCK |
|---|---|---|
| 1 | Narrator | KEEP, event-driven only — no speculative filler, no hallucinated "thinking" between real events. Narrator is a state-reporter bound to supervisor tool-events, never imagines progress. |
| 2 | Opus encoder | opus-recorder WASM library. Decided. No future asks on this class of micro-choice. |
| 3 | UI fidelity | "Premium Apple-feeling skin" framing, but executed with craft — every minute of the 10h on micro-interaction polish, not feature creep. Portfolio piece, not half-assed. |
| 4 | Scope | Full vision, 75-80h single push. Both voices, all phases. |
Post-build directive (Jonah explicit):
- Round-table audit by suitable members → fix iteration
- E2E test → fix iteration
- Jonah-perspective self-test using use-my-browser + chrome-mac skills (Brian uses Jonah's real Mac + real Chrome) → fix iteration
- "No too-much-effort budget. Make it close to perfect."
260/20. 60fps gate.

| Item | Decision | Lock |
|---|---|---|
| Domain | brian.jonahtebaa.com | LOCKED |
| Round-table roster | 11-agent (Manus INCLUDED, paid OK for this plan) | LOCKED |
| /app mechanism | CLI command typed inside any tmux/CC session, adopts THAT session into PWA list | LOCKED |
| Session end semantics | Only when Jonah explicitly taps "end session". NEVER auto-end. | LOCKED |
| noVNC window | ALWAYS auto-spawn an XFCE wezterm window the moment a session is created | LOCKED |
| noVNC embed in PWA | NO — separate browser tab, PWA deep-links to noVNC | LOCKED |
| Voice A (narration/chat) | Puck (male, Brian voice — same as SIP) | LOCKED |
| Voice B (session output) | Kore (crisp female, calm, clinical, monotone) | LOCKED |
| Voice activation | Push-to-talk (hold mic button) | LOCKED |
| Push notifications | ON for: session-idle-after-tool, social mentions, backend errors, daily summary, incoming SIP, /agency-published | LOCKED |
| Build scope | FULL vision in one push (~27h+) — no MVP slice | LOCKED |
| Aesthetic mandate | Jonah branding + futuristic + animations + smooth interface — "designed by Apple itself." | LOCKED |
The Apple-grade futuristic mandate elevates the UI build from "mobile-first PWA" to a portfolio-piece front-end. Adds the UI Designer + frontend-design + jonah-branding skills as first-class participants in phase 1.
Round 1 panel: Codex, Gemini, Hermes, Vibe (4 substantive critiques in 110s; Manus paid-flag, OpenHands offline, Antigravity/Jules async-dispatched).
- brian-voice-gateway.py: voice_name switching per utterance instead of two pinned WS. gemini_live_server.py:660 currently hardcodes Puck; extend ?voice= query or per-utterance config message.
- cli-pwa/backend/src/routes/sessions.ts:97 (open-tab route) kills the backend-managed session before spawning claude --resume in xfce4-terminal. cli-pwa/backend/src/routes/ws.ts:13 rejects PWA WS streaming if manager.get(id) isn't live. UAT item "type in wezterm → appears in PWA" is architecturally false right now.
- session-supervisor.py: one daemon per session that owns ONE pty/tmux. PWA WS, noVNC wezterm, voice gateway, and /app adoption are all CLIENTS attaching to that supervisor (multiplexed). The supervisor is the single source of truth.
- tmux attach. PWA attaches via WS. Voice gateway attaches via stdin/stdout pipes. /app adopts by registering an existing tmux into the supervisor's catalog.
- relay candidates; STUN-only fails on cellular symmetric NAT.
- wmctrl to position new windows in a grid + name them with session id; provide "minimize all but X" hotkey.
- /app records (tmux_pane_pid, session_id) uniqueness in DB. Re-running /app inside an already-adopted pane is a no-op + visual "already in PWA" hint. Prevents recursive adoption loop.
- pending_permissions table in DB; PWA shows a modal, voice gateway speaks "needs your approval to write X, say yes or no", noVNC wezterm shows the standard CC prompt. Once any surface answers, others dismiss.
- brian-voice-gateway.py: parallel-mux → single-stream priority queue (FINDING-A)
- session-supervisor.py: single PTY owner per session, multi-surface attach (FINDING-B)
- coturn self-host BEFORE phase-3 WebRTC work (FINDING-C)

+5h: session supervisor (3h), coturn setup (1h), iOS background handling + permission broadcast (1h). New total: ~32h (still buildable as one push, but the call stops being "27h-or-bust").
R2 panel (voice/WebRTC): Vibe, Gemini, Hermes (Codex timed out). R3 panel (PWA/iOS/UI): Vibe, Gemini, Hermes (Codex timed out).
voiceConfig is immutable mid-session. Cannot swap voice_name per utterance on a single WS. The "single WS with dynamic voice swap" pivot is not viable.
brian-voice-gateway.py:
states: IDLE | NARRATING(puck) | CLINICAL(kore) | PTT_ACTIVE
transitions:
tool-event arrives + IDLE → NARRATING (start Puck WS utterance)
assistant-text arrives → CLINICAL (50ms gain ramp-down on Puck, kore_first_chunk preempts)
user PTT_DOWN → PTT_ACTIVE (clear_buffer to Gemini, hard-cut both WS, route mic)
utterance complete → IDLE
- MediaStreamTrack on PTT release. iOS Safari shows the persistent red recording bar regardless of mute state; closing/reopening the track triggers the "Allow Microphone" prompt with ~1s delay every press.
- enabled: true). PWA sends a WebRTC DataChannel message {"event": "ptt_down" | "ptt_up"}. Gateway only routes Opus packets to the Puck-WS during the PTT-down window. No track recreation.
- audioCtx.resume() inside the PTT button's touchstart event handler: iOS suspends AudioContext on background and remains suspended even when PWA returns to foreground unless explicitly resumed via user gesture.
- RTCRtpTransceiver.setCodecPreferences([{mimeType: 'audio/opus', clockRate: 48000, channels: 1}]) AND inject a=fmtp:111 sampling=16000;stereo=0 into SDP. Gateway resamples 48k→16k once on ingress, no per-packet jitter.
- sequence_id. On PTT-down, PWA sends last_played_sequence_id → gateway "rewinds" Gemini's session context to that point. Solves the interrupt-lag problem cleanly. ~2h to implement, kills the "user feels ignored when interrupting" failure mode.
- beforeinstallprompt + custom modal mimicking iOS native A2HS sheet (rounded corners, backdrop-filter: blur(20px), SF Symbols icons). Trigger AFTER 3rd session OR on high-value action ("Start Recording"). Skip the cheap browser mini-infobar.
- navigator.setAppBadge() is iOS 16.4+ only. Older iOS: CSS ::after floating red dot. Web Push fallback to TG COMMS already in v1.2; keeps working for non-installed users.
- audio/webm; codecs=opus but is notoriously flaky with fragmented blobs and pitch-shifts when system clock fluctuates.
- opus-recorder (WASM-based) instead of native MediaRecorder. Bypasses Safari quirks, ~30KB gzipped, well-maintained. Gateway accepts the WASM-encoded Opus directly.

Animation: Framer Motion. 14KB gzipped, iOS Safari native, layoutId for shared-element transitions (Dynamic-Island-style PTT pill morph), whileTap/drag gestures. NOT GSAP (heavier, no real win), NOT Skia (overkill for web).
Spring physics (locked): type: "spring", stiffness: 260, damping: 20 — exact iOS sheet bounce per Gemini.
Glassmorphism (locked): backdrop-filter: blur(20px) saturate(180%) contrast(90%) + border: 0.5px solid rgba(255,255,255,0.1) for the specular edge effect.
Typography (locked): font-family: -apple-system, BlinkMacSystemFont, 'SF Pro Display', system-ui, sans-serif. Headers: letter-spacing: -0.022em; font-weight: 600. Body: letter-spacing: -0.011em.
Safe area (locked): Every screen uses padding-top: env(safe-area-inset-top); padding-bottom: env(safe-area-inset-bottom);. Overlap of home indicator = instant cheap-feel disqualifier.
Iconography: react-symbols or Iconify with SF Symbols set. Stroke weight matches typography weight.
Haptics: navigator.vibrate([20, 50, 20]) on PTT press/release. navigator.vibrate([50]) on session-create, swipe-action confirms.
Performance gate (must-not-skip): PTT UI transition + session list scroll must be 60fps (120fps on ProMotion). If either drops frames, the "Apple illusion shatters" — Gemini's words. Add @media (prefers-reduced-transparency) to drop blurs on iOS Low Power Mode.
Micro-interactions (must-not-skip): Button press transform: scale(0.95) opacity(0.8) 100ms; list swipe translateX(-100px) opacity(0) 200ms spring; page transitions slide+fade (no hard cuts). These ARE the Apple feel — Vibe's "biggest risk" is skipping these.
Cuts for 5h budget: Dynamic Island replication (defer to v2), complex multi-finger gestures (defer), 3D transforms (defer). Spend the time on the four locked items above.
- arrow.clockwise spinner only if WS reconnect is still in-flight after 200ms. The 5s replay was too long for Apple-grade feel.
- _blank with noopener + window.name="brian-novnc". Inject top bar into noVNC HTML with a "← Return to Brian" button (postMessage to PWA). Sync theme via postMessage so noVNC tab inherits Jonah/Apple style.
- opus-recorder) (NEW)
- stiffness:260 damping:20, glassmorphism with specular border, SF Pro + exact letter-spacing, env(safe-area-inset-*), micro-interactions mandatory (LOCKED)
- ← Return + postMessage theme sync (NEW)

+7h: WASM Opus encoder integration (1h), DataChannel echo-cancel + sequence_id rewind (2h), IndexedDB resume + BroadcastChannel (2h), custom A2HS modal (1h), Apple-grade UI spec execution refinement (1h on top of phase 1's existing 4h).
New total: ~39h. That's a real budget — flag in final summary that the "27h" was optimistic and v1.3 is the honest number after R1+R2+R3. Ship-quality demands it.
Mobile-first PWA. One tap → spawns a fresh, persistent Claude Code (CC) session on Hetzner. Three input modes:
- Text chat (PWA UI)
- Voice message (record → transcribe → inject as text)
- 2-way voice call (full-duplex, Brian voice, same as SIP)
The session is simultaneously visible as a real graphical window inside the existing noVNC desktop on :1 (XFCE). Jonah can leave the PWA, come back hours later, the session is still there with full scrollback. Voice calls feel identical to SIP calls today — same Brian voice, same Gemini Live model.
User story:
Jonah opens pwa.brianserves.me on his iPhone in bed. Hits "New Session" → "Webspot pricing thoughts." Starts a voice call. Talks for 10 minutes. Hangs up. Brian keeps working in that session. Two hours later Jonah opens the PWA on his Mac, sees the session in the list, sees in the noVNC tab the same session sitting in an XFCE window with all the work Brian did, scrolls through, types a follow-up. Same continuity, three surfaces.
| Asset | Location | Status |
|---|---|---|
| cli-pwa backend (Fastify + SQLite) | /opt/agent/cli-pwa/backend | RUNNING port 8111, full session CRUD |
| cli-pwa frontend (React + Vite + react-router) | /opt/agent/cli-pwa/frontend | Built, has /sessions/:id route, WS streaming, file attach |
| Session lifecycle (spawn / kill / respawn / resume) | backend/src/sessions.ts + sdk.ts | WORKING, uses claude --resume <sessionId> |
| Open-in-noVNC-as-XFCE-window | backend/src/routes/sessions.ts:92 (/sessions/:id/open-tab) | WORKING, spawns terminal on :1.0 X-display via XFCE env |
| WebSocket streaming /ws/sessions/:id | backend | WORKING |
| Input injection /sessions/:id/input | backend | WORKING |
| Transcript export | backend | WORKING |
| Tool-call status detection (thinking / tool / idle) | sessions.ts:125-167 | WORKING, gold for voice narration |
| Brian voice via Gemini Live | gemini-live.service :8102, WebSocket /voice/gemini-ws?role=secretary | RUNNING, same model SIP uses |
| Voice → CC bridge | voice/sek_cc_manager.py + voice/sek_cc_bridge.py | WORKING, runs claude -p --resume per call |
| SIP→Brian voice path | voice/sip_pure_bridge.py (sip-pure-bridge.service) | RUNNING |
| noVNC desktop (XFCE on :1, websockify :6080) | vncserver@:1.service | RUNNING, full xfce4-session, ready to receive new windows |
| inject-cc skill (tmux send-keys) | ~/.claude/skills/inject-cc | EXISTS (tmux path; cli-pwa uses subprocess path; both valid) |
Gap to vision (the actual work):
1. PWA manifest + service worker + install affordance (mobile-first polish)
2. Voice-message recording UI + STT endpoint + auto-inject
3. 2-way voice call: WebRTC from browser ↔ existing gemini-live :8102 ↔ active CC session
4. Mid-call narrator (silence-killer): tool-status events → short Brian-voice utterances
5. Unify two parallel session models: cli-pwa.sessions (DB) and sek_cc_manager (per-SIP-call). One model, two entry points.
6. Persistence hardening: stale-timeout off for PWA-owned sessions; explicit park/resume; survive host reboot via systemd resurrection
7. Auth: Cloudflare Access in front (Jonah-only) + signed device key for PWA push
8. Push notifications: send a web-push when a long-running task in a backgrounded session goes idle after more than 60s of tool activity
9. Domain + TLS: pwa.brianserves.me (or cc.jonahtebaa.com), behind Cloudflare
┌──────────────────────────────────────────────────────────────────┐
│ iPhone / Mac browser — PWA installed from pwa.brianserves.me │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ React PWA (mobile-first) │ │
│ │ • Session list • Chat view (xterm.js mirror) │ │
│ │ • Voice-msg recorder • 2-way call button (WebRTC) │ │
│ │ • Service worker (push notifs, offline shell) │ │
│ └────────────┬─────────────────────────────────┬──────────────┘ │
└───────────────│─────────────────────────────────│────────────────┘
│ HTTPS + WSS │ WebRTC (audio)
│ (Cloudflare → Caddy) │
▼ ▼
┌──────────────────────────────────────────────────────────────────┐
│ HETZNER ubuntu-8gb-hel1-1 │
│ │
│ ┌──────────────────┐ ┌─────────────────────┐ │
│ │ Caddy (TLS, gate)│ │ Brian Voice Gateway │ NEW │
│ │ pwa.brianserves │ │ WebRTC ↔ WebSocket │ (small Python) │
│ └────────┬─────────┘ │ bridge to :8102 │ │
│ │ └──────────┬──────────┘ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ cli-pwa-backend (Fastify, port 8111) │ │
│ │ • /api/sessions CRUD + WS stream │ │
│ │ • /api/sessions/:id/input (text + transcribed) │ │
│ │ • /api/sessions/:id/open-tab (spawn XFCE window) │ │
│ │ • /api/voice/transcribe NEW (Gemini Live STT) │ │
│ │ • /api/voice/call/:id NEW (WebRTC signaling) │ │
│ │ • /api/push/subscribe NEW (web-push) │ │
│ │ • /api/auth/pair NEW (device-key pair) │ │
│ └────────┬───────────────────────────┬─────────────────┘ │
│ │ │ │
│ ▼ subprocess + DB ▼ control msgs │
│ ┌──────────────────────┐ ┌─────────────────────────┐ │
│ │ claude CLI (per-sess)│◄──►│ session-narrator │ NEW │
│ │ --resume <id> │ │ watches tool events, │ │
│ │ in tmux (optional) │ │ emits Brian utterances │ │
│ └──────────────────────┘ │ during 2-way call │ │
│ └────────────┬────────────┘ │
│ ┌──────────────────────┐ │ │
│ │ XFCE window on :1 │◄────────────────┘ │
│ │ wezterm/xterm │ (also visible via noVNC :6080) │
│ │ attached to session │ │
│ └──────────────────────┘ │
│ │
│ ┌──────────────────────┐ ┌─────────────────────────┐ │
│ │ gemini-live :8102 │ │ sip-pure-bridge │ │
│ │ (Brian voice STT/TTS│ │ (existing SIP path, │ │
│ │ — REUSED AS-IS) │ │ unchanged, still works)│ │
│ └──────────────────────┘ └─────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
(/opt/agent/cli-pwa/frontend)
Tech: React + Vite (existing) + vite-plugin-pwa (already installed) + xterm.js (new) + MediaRecorder API + WebRTC.
Routes (mobile-first, bottom-nav):
- / — Sessions list. Cards show name, last activity, status badge (idle/thinking/tool: Bash), quick "voice call" + "open in noVNC" + "kill" actions.
- /new — One-screen new-session form: name, optional starter prompt, optional cwd (default /root).
- /s/:id — Session view. Tabs:
- Chat (default): xterm.js mirror via /ws/sessions/:id, text input bar with + voice msg button (push-to-talk hold).
- Files: existing attach UI.
- Settings: rename, archive, kill, "open in noVNC tab", "duplicate".
- /s/:id/call — Full-screen 2-way voice call. Visual: pulsing Brian-voice avatar, current narration text in big type, mute/end buttons.
- /settings — Push subscription, voice preferences, paired devices.
PWA manifest (frontend/public/manifest.webmanifest — currently empty, fix):
- Name: "Brian"
- Short name: "Brian"
- Display: standalone
- Theme color: matches Jonah brand
- Icons: 192/512/maskable (generate via /imagen skill)
- Start URL: /
Service worker (sw.js):
- Precache app shell
- Background sync for failed inputs (offline support — queues and replays)
- Push notification handler → opens /s/:id deep link
(/opt/agent/cli-pwa/backend)
New endpoints:
POST /api/voice/transcribe
Body: { audio: <base64 webm/opus>, sessionId?: string }
→ Forwards to gemini-live :8102 (mode=stt-only)
→ Returns { text: "...", durationMs: 1234 }
→ If sessionId provided: also POSTs /sessions/:id/input internally
POST /api/voice/call/:sessionId/offer
Body: { sdp: <WebRTC offer>, role: "brian-pwa" }
→ Spins up bridge worker (see 4.3)
→ Returns { sdp: <answer>, callId, narratorUrl }
DELETE /api/voice/call/:callId
→ Tears down bridge
POST /api/push/subscribe
Body: { endpoint, keys: { p256dh, auth } }
→ Stores per-device subscription in SQLite
POST /api/auth/pair
Body: { qrToken }
→ Issues signed device JWT (used for all subsequent calls)
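For the pairing endpoint, HS256 signing and verification fit in a few stdlib calls. This is a sketch under assumptions: the backend itself is Fastify, so Python here only illustrates the token shape; the claim names (`sub`, `iat`) and function names are hypothetical, and revocation (a lookup against the devices table) is not shown.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_device_jwt(device_id: str, secret: bytes, issued_at: int) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"},
                                separators=(",", ":")).encode())
    payload = _b64url(json.dumps({"sub": device_id, "iat": issued_at},
                                 separators=(",", ":")).encode())
    signature = hmac.new(secret, f"{header}.{payload}".encode("ascii"),
                         hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(signature)}"


def verify_device_jwt(token: str, secret: bytes) -> Optional[dict]:
    """Return the claims if the signature checks out, else None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
            return None  # refuse "none" or any other algorithm
        expected = hmac.new(secret,
                            f"{header_b64}.{payload_b64}".encode("ascii"),
                            hashlib.sha256).digest()
        # Constant-time comparison avoids leaking signature bytes via timing.
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        return json.loads(_b64url_decode(payload_b64))
    except (ValueError, AttributeError):
        return None  # malformed token
```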
Modified:
- /sessions POST — add parkable: true flag → exempts from stale-timeout reaper.
- /sessions/:id/open-tab — already works; add ensureWindow: true semantics → idempotent (don't spawn duplicate windows).
DB schema additions (backend/src/db.ts):
ALTER TABLE sessions ADD COLUMN parkable INTEGER DEFAULT 1;
ALTER TABLE sessions ADD COLUMN owner_device TEXT; -- paired device id
ALTER TABLE sessions ADD COLUMN xfce_window_id TEXT; -- so kill/refocus works
ALTER TABLE sessions ADD COLUMN created_via TEXT; -- 'pwa' | 'sip' | 'cli'
CREATE TABLE push_subscriptions (
id TEXT PRIMARY KEY,
device_id TEXT NOT NULL,
endpoint TEXT NOT NULL,
p256dh TEXT NOT NULL,
auth TEXT NOT NULL,
created_at INTEGER
);
CREATE TABLE devices (
id TEXT PRIMARY KEY,
name TEXT,
paired_at INTEGER,
last_seen INTEGER
);
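SQLite's ALTER TABLE ... ADD COLUMN fails if the column already exists, so the additions above are not safe to re-run as written. A minimal sketch of an idempotent migration, shown in Python with stdlib sqlite3 for illustration (the real db.ts is TypeScript; `migrate_sessions` is a hypothetical name):

```python
import sqlite3
from typing import List

# Column name -> declaration, taken from the schema additions above.
NEW_COLUMNS = {
    "parkable": "INTEGER DEFAULT 1",
    "owner_device": "TEXT",
    "xfce_window_id": "TEXT",
    "created_via": "TEXT",
}


def migrate_sessions(db: sqlite3.Connection) -> List[str]:
    """Add any missing columns to sessions; return the names that were added."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in db.execute("PRAGMA table_info(sessions)")}
    added = []
    for name, declaration in NEW_COLUMNS.items():
        if name not in existing:
            db.execute(f"ALTER TABLE sessions ADD COLUMN {name} {declaration}")
            added.append(name)
    db.commit()
    return added
```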
Purpose: Bridge browser WebRTC ↔ existing gemini-live :8102 WebSocket without breaking the SIP path.
Why a new service: gemini-live :8102 already works. Don't touch it. Instead, write a thin gateway that:
- Accepts WebRTC offer from PWA (via aiortc)
- Opens a WebSocket to ws://127.0.0.1:8102/voice/gemini-ws?role=brian-pwa&session_id=<sid>
- Pipes browser-mic audio (Opus → PCM 16k) → WS audio frames
- Pipes WS audio frames (PCM → Opus) → WebRTC track to browser
- Subscribes to cli-pwa-backend events for the session: when CC emits tool: <name> → also send a "narration text" event to gemini-live's prompt channel so Brian voice speaks "reading the file" etc.
File: /opt/agent/voice/brian_voice_gateway.py
Service: brian-voice-gateway.service, listens on 127.0.0.1:8120
Reuse: gemini_live_server.py already has the role-multiplexing pattern (role=secretary for SIP). Add role=brian-pwa with same Brian voice prompt + system-prompt-aware-of-CC-context.
Purpose: Kill silence during 2-way calls.
Logic:
- Subscribes to cli-pwa-backend events for active call's session
- Threshold: if tool status > 3s without an assistant text emit → fire one short narration
- Library of phrases keyed by tool: Bash → "running command", Read → "reading file", Edit → "patching the file", Agent → "dispatching a subagent", WebSearch → "searching the web", Task → "queuing a task"
- Variation: rotate 2-3 phrasings per tool to not sound robotic
- During pure thinking > 5s → "thinking" / "working it out" / "drafting"
- Sends narration via gateway's narration channel → gemini-live → TTS → browser
File: /opt/agent/voice/session_narrator.py
Co-located with gateway (could be same process — TBD per round-table).
Existing: POST /sessions/:id/open-tab spawns terminal in XFCE on :1.0.
Enhancements:
- Set window title = session name (so Jonah finds it in the XFCE taskbar)
- Use wezterm instead of default xterm (better fonts, better scroll)
- Track xfce_window_id so the PWA can show "open in noVNC" only when window doesn't exist; otherwise "focus in noVNC"
- Add a noVNC quick-link button on each session card → opens https://board.jonahtebaa.com/vnc.html?autoconnect=true&path=...&focus=<window_id>
Caveat: :1.0 has -localhost=1 (Xtigervnc binds 127.0.0.1 only). noVNC websockify on :6080 proxies it. That's already how it works for Jonah today — no change needed.
pwa.brianserves.me (or sub of jonahtebaa.com; Jonah picks)
pwa.brianserves.me {
@authed header_regexp Cookie ^.*device_jwt=.*
handle @authed { reverse_proxy 127.0.0.1:8111 }
handle /pair* { reverse_proxy 127.0.0.1:8111 }
handle { redir /pair }
}
- /pair page shows QR code (or just a one-time token he confirms in a TG message). Issues signed device JWT (HS256, long-lived, revocable). Stored in HttpOnly cookie + localStorage for service worker.
- cli-pwa-backend restarts → DB has session row → on startup, claude --resume <id> reattaches if process died, otherwise reattaches WS to live PID. Already implemented in sessions.ts.
- --resume works because Anthropic CLI persists session state on disk. On boot, run a reaper that marks all running rows parked and lets Jonah explicitly resume.
- parkable=1 exempts session from auto-kill. Only manual delete or 30-day idle removes it.
- data/transcripts/<session_id>.jsonl. Already done.

PWA: hold mic button
→ MediaRecorder captures Opus webm
→ release button
→ POST /api/voice/transcribe { audio, sessionId }
Backend → ws://127.0.0.1:8102/voice/gemini-ws?role=stt
Receives transcript → POST internally to /sessions/:id/input
Returns { text, durationMs } to PWA
PWA: shows transcript bubble in chat ("you said: ..."), then assistant reply streams in
Latency target: <2s for 5s clip.
PWA: tap "call"
→ getUserMedia(audio)
→ RTCPeerConnection, addTrack(mic)
→ POST /api/voice/call/:id/offer { sdp }
Backend asks brian-voice-gateway to allocate a call
Gateway:
① opens RTCPeerConnection, returns answer SDP
② opens WS to gemini-live :8102 role=brian-pwa
③ starts session-narrator subscribed to session events
Returns { sdp, callId }
PWA: setRemoteDescription(answer) → audio flowing
[user speaks] → Opus → Gateway → PCM 16k → gemini-live → CC injection
[CC works] → tool events → narrator → "reading the file" → gemini-live TTS → Gateway → Opus → PWA
[CC final] → assistant text → gemini-live TTS (Brian voice) → PWA
PWA: tap "end" → DELETE /api/voice/call/:callId
Gateway tears down both legs; session lives on
Latency target: first STT token <300ms, first TTS audio <400ms (matching SIP today).
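The call lifecycle above (allocate on offer, tear down both legs on DELETE, leave the CC session alive) can be sketched as a small registry. This is a minimal sketch, not the gateway's actual code: CallRegistry, Call, Leg, and their methods are hypothetical names, and the real WebRTC and gemini-live transports are replaced by stand-in objects.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class Leg:
    """Stand-in for one transport leg (WebRTC peer or gemini-live WS)."""
    name: str
    open: bool = True

    def close(self) -> None:
        self.open = False


@dataclass
class Call:
    call_id: str
    session_id: str
    legs: list[Leg] = field(default_factory=list)


class CallRegistry:
    """Tracks live calls. Ending a call never touches the session itself."""

    def __init__(self) -> None:
        self._calls: dict[str, Call] = {}

    def allocate(self, session_id: str) -> Call:
        # Assumption: one call = a WebRTC leg to the PWA plus a WS leg to gemini-live.
        call = Call(
            call_id=uuid.uuid4().hex,
            session_id=session_id,
            legs=[Leg("webrtc"), Leg("gemini-live")],
        )
        self._calls[call.call_id] = call
        return call

    def end(self, call_id: str) -> bool:
        # Idempotent: a repeated DELETE for the same callId is a no-op.
        call = self._calls.pop(call_id, None)
        if call is None:
            return False
        for leg in call.legs:
            leg.close()
        return True

    def active_for(self, session_id: str) -> list[str]:
        return [c.call_id for c in self._calls.values() if c.session_id == session_id]
```

Keeping session state out of the registry is what lets "end" be safe to retry: the only thing a call owns is its two legs.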
Store in /opt/agent/voice/narrator_phrases.json:
{
"Bash": ["running a command", "executing", "checking the shell"],
"Read": ["reading the file", "pulling up the file", "checking that file"],
"Edit": ["patching the file", "making the edit", "writing the change"],
"Write": ["writing a new file", "saving the new file"],
"Grep": ["searching", "grepping", "looking through the code"],
"Glob": ["scanning files", "finding files"],
"Agent": ["dispatching a subagent", "spinning up a sub-task"],
"WebSearch": ["searching the web", "googling that"],
"WebFetch": ["fetching the page", "pulling the URL"],
"thinking": ["thinking", "working through it", "drafting", "weighing options"]
}
Rotate per call to avoid robotic repetition.
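One way the rotation could work is sketched below. It assumes "per call" means per tool call and that unknown tool names fall back to the "thinking" phrases; both are assumptions, as are the PhraseRotator class and its method names.

```python
from __future__ import annotations

import json
import random


class PhraseRotator:
    """Picks a narration phrase per tool, avoiding back-to-back repeats."""

    def __init__(self, phrases: dict[str, list[str]], seed: int | None = None) -> None:
        self._phrases = phrases
        self._last: dict[str, str] = {}
        self._rng = random.Random(seed)

    @classmethod
    def from_file(cls, path: str) -> "PhraseRotator":
        # Expects the JSON shape shown above: {"ToolName": ["phrase", ...], ...}
        with open(path, encoding="utf-8") as fh:
            return cls(json.load(fh))

    def pick(self, tool: str) -> str:
        # Assumption: tools missing from the library use the "thinking" phrases.
        key = tool if tool in self._phrases else "thinking"
        options = self._phrases[key]
        # Drop the previous phrase for this key when an alternative exists.
        candidates = [p for p in options if p != self._last.get(key)] or options
        choice = self._rng.choice(candidates)
        self._last[key] = choice
        return choice
```

Usage would look like `PhraseRotator.from_file("/opt/agent/voice/narrator_phrases.json").pick("Read")`. Tracking the last phrase per tool keeps consecutive Read events from sounding identical without needing a full shuffle deck.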
| Phase | Scope | Effort | Outcome |
|---|---|---|---|
| 0 | Audit & document existing cli-pwa state, DB schema, voice gateway pattern. Confirm reuse plan. | 1h | Spec validated, no surprises |
| 1 | Mobile-first PWA UI pass: manifest, sw.js, install prompt, bottom-nav, mobile chat layout. xterm.js for session view. Deploy to pwa.brianserves.me behind Cloudflare. | 4h | Installable PWA, text chat works on iPhone |
| 2 | Voice message: MediaRecorder UI + /api/voice/transcribe endpoint + gemini-live STT-only role. Auto-inject. | 3h | Hold-to-talk → transcript → CC reply (text) |
| 3 | brian-voice-gateway.py: WebRTC ↔ gemini-live bridge. Add role=brian-pwa to gemini-live server. 2-way call works without narrator. | 6h | Real voice call, but silent during tool calls |
| 4 | session-narrator.py: tool-event subscriber + phrase library + narration injection into call. | 3h | Calls feel alive, no dead air |
| 5 | Persistence hardening: parkable flag, reboot-safe session reaper, scrollback retention policy. | 2h | "Leave and come back" guaranteed |
| 6 | Push notifications: web-push subscribe + finish-event trigger + iOS-aware payload. | 3h | "CC done" pings phone |
| 7 | Auth: device-pairing flow, signed JWT, Cloudflare Access policy review. | 2h | Jonah-only |
| 8 | noVNC integration polish: window-id tracking, "focus" deep link, wezterm theme. | 1h | Smooth noVNC tab UX |
| 9 | Round-table consultation review → patch plan → re-verify each phase against feedback. | 2h | External validation pass |
Total: ~27h focused work. Ship MVP (phases 0-3) in one sprint (~14h), full vision in two.
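Phase 5's reboot-safe reaper (mark every row still flagged running as parked on boot, then let Jonah resume explicitly) might look like the sketch below. The SQLite backend, the sessions table, and its state column are assumptions for illustration; the real cli-pwa schema may differ.

```python
import sqlite3


def reap_on_boot(conn: sqlite3.Connection) -> int:
    """Mark every row still flagged 'running' as 'parked'; return rows changed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE sessions SET state = 'parked' WHERE state = 'running'"
        )
    return cur.rowcount


if __name__ == "__main__":
    # Throwaway in-memory DB standing in for the real one.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, state TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO sessions VALUES (?, ?)",
        [("a", "running"), ("b", "parked"), ("c", "running"), ("d", "dead")],
    )
    print(reap_on_boot(conn))  # number of sessions that were 'running' before boot
```

The reaper is idempotent: a second run changes no rows, so it is safe to call from every service start rather than only after a true reboot.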
| Failure | Detection | Mitigation |
|---|---|---|
| Gemini Live disconnects mid-call | gateway WS error | Auto-reconnect; play "one second" Brian utterance; resume |
| CC subprocess crashes | sessions.ts already detects session_end | Auto-respawn with --resume (already implemented) |
| PWA loses network mid-input | service worker queue | Background sync replays input on reconnect |
| Multiple browsers open same session | DB xfce_window_id + WS counter | Allow it: both mirror the same stream; warn in UI |
| Host reboot during call | systemd restart sequence | Sessions parked, narrator emits "we got cut off, picking up where we left off" on reconnect |
| iOS PWA push limitations | iOS 16.4+ supports web-push only when installed to home screen | Detect, show "install to home screen" coach mark |
| WebRTC blocked by carrier NAT | aiortc TURN | Configure TURN server (coturn) on Hetzner — free, ~30min setup |
| Voice latency degrades on cellular | client metrics | Show RTT badge; auto-fallback to voice-msg mode if RTT > 800ms |
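The last row's RTT-based fallback could be decided by something like the sketch below. The 800 ms threshold comes from the table; the median-over-a-window smoothing, the window size, and the ModeSelector name are assumptions added so a single spike does not drop a call.

```python
from __future__ import annotations

from collections import deque
from statistics import median


class ModeSelector:
    """Recommends 'call' or 'voice-msg' from recent RTT samples."""

    def __init__(self, threshold_ms: float = 800.0, window: int = 5) -> None:
        self.threshold_ms = threshold_ms
        self._samples: deque[float] = deque(maxlen=window)

    def observe(self, rtt_ms: float) -> str:
        self._samples.append(rtt_ms)
        # Only judge once the window is full, then compare the median
        # (not the latest sample) against the threshold.
        window_full = len(self._samples) == self._samples.maxlen
        if window_full and median(self._samples) > self.threshold_ms:
            return "voice-msg"
        return "call"
```

This returns a recommendation per sample and has no hysteresis, so a real client would probably want to stay in voice-msg mode for a while before switching back.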
Should cli-pwa.sessions and sek_cc_manager (per-SIP-call) merge into one session model, or stay parallel? Merge = one source of truth, but bigger refactor. Parallel = fewer regressions, two systems.
After Jonah signs off on this plan, run /round_table with the 10-member free roster (codex, chatgpt, gemini, antigravity, perplexity, grok, hermes, openhands, vibe, jules); ask Jonah first whether to include Manus (per memory rule).
Round-table prompt outline:
1. Share this plan as context
2. Ask each member to weigh in on the 10 open questions above
3. Specifically request: WebRTC-vs-WebSocket-audio recommendation, PWA-iOS-push edge cases, ttyd-vs-xterm-js tradeoff, narrator-architecture opinion
4. Synthesize → patch plan → re-share final to Jonah
Single paid risk: the Gemini Live free tier may be exhausted under heavy use (currently fine; SIP traffic isn't huge). If that happens, drop to local Whisper+Piper for off-peak. Round-table to advise.
pwa.brianserves.me on iPhone, installs to home screen, opens it as standalone app
pwa.brianserves.me / cc.jonahtebaa.com / other)
After sign-off → run round-table → integrate feedback → execute phase 0.