2026-05-03 21:45 (Beirut) (backfill from DOCUMENTATION/)

Brian Voice PWA — Architecture Plan v2 (ULTIMATE)

Date: 2026-05-03
Status: v2.0 — post-round-table (R1+R2+R3+R4), pending Jonah final sign-off on 4 trade-offs
Owner: Brian
Code home: /opt/agent/cli-pwa (extending existing scaffold)
History below preserved as audit trail (sections 0.5/0.6 are R1/R2/R3 patches). This v2 section is the SINGLE SOURCE OF TRUTH and supersedes all earlier sections on conflict.


V2 — THE ULTIMATE PLAN (one coherent spec)

V2.1 — Locked-and-final

Domain: brian.jonahtebaa.com (Jonah locked).
Voices: Puck (male, narration + chat) + Kore (female, clinical session-output reading) (Jonah locked). Two pinned WS to gemini-live :8102 because voiceConfig is immutable mid-session. Serial priority queue across both — never simultaneous.
Activation: Push-to-talk with DataChannel gating + always-on track lifecycle (R2/R3 finding).
Session end: Explicit only (Jonah locked). v2 adds idle-suspend (NOT end) — see §V2.5.
noVNC: Auto-spawn wezterm window per session (Jonah locked). v2 switches WM from xfce to i3wm on :1 for tiling — xfce preserved on :2 for personal use.
PWA: Mobile-first, Apple-feeling premium skin, Framer Motion + SF Pro + glassmorphism + micro-interactions. Honest framing: "premium" not "designed by Apple itself" (R4 universally rejected the latter as 5h-impossible).
Build budget: ~75h (R4 honest median; R1's 27h was wishful, v1.2's 32h and v1.3's 39h were both optimistic).
/app: CLI command typed inside any tmux/CC session adopts THAT session into supervisor (Jonah locked).

V2.2 — The architectural keystone: Session Supervisor

The single biggest insight from the four rounds: every trust-failure mode (split-brain, reboot, OOM, rate-limit, surface-desync) collapses to the same root — there is no single source of truth for a session. Build that first.

session-supervisor.py — one Python daemon per session. Owns:
- The tmux session containing the claude CLI subprocess
- The DB row state (running/parked/suspended/dead)
- All surface attachments (PWA WS, noVNC wezterm, voice gateway, /app adoption)
- Permission/approval routing (broadcasts to all surfaces, first responder wins)
- Reboot recovery (tmux-resurrect-style replay)
- Idle-suspend / warm-resume lifecycle
- Quota-fallback signaling (when Gemini 429s)

Source of truth: the tmux pane scrollback + Anthropic CLI's session state in ~/.config/anthropic/sessions/<id>/. All surfaces are READ-MIRRORS of tmux scrollback (via tmux pipe-pane) and WRITE-INJECTORS into tmux (via tmux send-keys). No surface owns its own copy. This solves the wezterm↔PWA bidirectional sync that Vibe identified as the Day-1 abandonment trigger.
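
A minimal sketch of the read-mirror / write-injector pair, assuming the supervisor shells out to the tmux CLI. The function names (`mirror_argv`, `inject_argv`, `run_all`), the pane target format, and the log path are illustrative assumptions, not part of the existing scaffold.

```python
import shlex
import subprocess


def mirror_argv(target: str, log_path: str) -> list[str]:
    """Build the tmux command that streams a pane's output into a log file."""
    # tmux closes any existing pipe on the pane before starting this one,
    # so re-issuing the command re-attaches the mirror rather than stacking pipes.
    return ["tmux", "pipe-pane", "-t", target, f"cat >> {shlex.quote(log_path)}"]


def inject_argv(target: str, text: str) -> list[list[str]]:
    """Build the tmux commands that type `text` into a pane and press Enter."""
    return [
        # -l sends the text literally; "--" stops option parsing for text like "-v".
        ["tmux", "send-keys", "-t", target, "-l", "--", text],
        ["tmux", "send-keys", "-t", target, "Enter"],
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order, raising if tmux reports an error."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Every surface would call the same two builders, so no surface keeps a private copy of the session: reads come from the piped log, writes go through `send-keys`.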

V2.3 — Audio architecture (final)

PWA mic ──WebRTC──► brian-voice-gateway.py
                    │
                    ├── DataChannel gate (track always-on, Opus only flows when ptt_down=true)
                    │
                    ├── Puck WS to gemini-live :8102 (pinned, narration + chat)
                    └── Kore WS to gemini-live :8102 (pinned, clinical session output)

Audio arbiter (state machine):
  IDLE → NARRATING(Puck) → CLINICAL(Kore) → PTT_ACTIVE
  Kore preempts Puck on first PCM chunk (50ms gain ramp-down on Puck)
  PTT_ACTIVE preempts everything (DataChannel send clear_buffer to BOTH Gemini WS)
  Echo-cancel: Kore chunks carry sequence_id; on PTT-down, PWA sends last_played_id;
    gateway rewinds Gemini context so model knows where it was interrupted

Codec: SDP forces opus 48kHz mono on transport, gateway resamples 48k→16k once for Gemini
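
The arbiter above can be read as a strict priority ladder. The sketch below is one hedged way to express it; the class name, the `actions` list, and the action strings are hypothetical stand-ins for whatever the gateway actually does on preemption.

```python
from enum import IntEnum


class AudioState(IntEnum):
    """Arbiter states, ordered so a higher value preempts a lower one."""
    IDLE = 0
    NARRATING = 1   # Puck
    CLINICAL = 2    # Kore
    PTT_ACTIVE = 3  # user is holding push-to-talk


class AudioArbiter:
    def __init__(self) -> None:
        self.state = AudioState.IDLE
        self.actions: list[str] = []  # side effects the gateway would perform

    def request(self, wanted: AudioState) -> bool:
        """Grant the channel only if `wanted` outranks the current state."""
        if wanted <= self.state:
            return False
        if self.state is AudioState.NARRATING and wanted is AudioState.CLINICAL:
            self.actions.append("ramp_down_puck_50ms")
        if wanted is AudioState.PTT_ACTIVE:
            self.actions.append("clear_buffer_both_ws")
        self.state = wanted
        return True

    def release(self, finished: AudioState) -> None:
        """Return to IDLE when the current owner finishes."""
        if finished is self.state:
            self.state = AudioState.IDLE
```

Because requests at or below the current priority are refused, the two voices can never play simultaneously, which matches the serial-queue requirement.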

Decisions deferred to Jonah (see §V2.10):
- Should the narrator (mid-call "running command" Puck utterances during tool calls) ship in v2.0 or be deferred? Codex argues defer — "dopamine polish on unreliable plumbing." I lean keep, because without it long tool chains feel dead.

V2.4 — iOS Safari hardening (R3 + R4)

| Issue | Mitigation |
|---|---|
| MediaStreamTrack killed after ~5min backgrounded | track.onended handler triggers re-acquisition flow (silent re-permission if granted, recreate track, ICE-restart) |
| AudioContext suspended on background | audioCtx.resume() on every touchstart of the PTT button |
| PSTN call / notification steals audio focus | audiointerruptbegin/end events → pause UI + "phone call interrupted" banner + auto-resume on audiointerruptend |
| WS dies on background | iOS app-foregrounded event triggers reconnect; UI shows blur+spinner only if recovery >200ms |
| Web Push needs install + 24h interaction | Fallback: every push event ALSO sent to TG COMMS; in-app red-dot badge using setAppBadge (16.4+) or CSS ::after (older) |
| MediaRecorder Opus flaky | opus-recorder WASM library (~30KB), bypass native MediaRecorder |
| PWA install rate 12-18% | Custom A2HS modal mimicking iOS native sheet, trigger after 3rd session OR high-value action; never the cheap browser infobar |

V2.5 — Reboot recovery + idle-suspend (R4 fixes)

Reboot recovery (was broken in v1.x):
- New systemd unit tmux-resurrect@.service saves tmux state every 5min to /var/lib/brian-pwa/tmux-snapshots/
- On boot, agent-pwa-recovery.service (oneshot) restores tmux sessions BEFORE cli-pwa-backend starts
- Supervisor on first PWA tap rehydrates claude --resume <id> from Anthropic CLI's persisted session state
- PWA shows "Resumed from X" header utterance (Puck) on first interaction
- Reaper marks sessions whose Anthropic state is unrecoverable as dead; otherwise suspended and ready for warm-resume

Idle-suspend (NEW in v2 — solves OOM, honors "never auto-end" lock):
- Session enters suspended state after 30 min of pure idle (no tool calls, no user input, no voice).
- Suspend kills the claude subprocess + closes wezterm window. Preserves tmux scrollback + DB row + Anthropic session state on disk.
- ANY surface tap re-spawns claude subprocess via --resume <id> + re-spawns wezterm window. Perceived warm-resume <3s.
- Critical: suspended ≠ ended. The DB row stays alive forever (or until Jonah explicitly ends). Honors the lock — no session auto-ends.
- 8-session OOM scenario: at most 1-2 active at any moment; rest suspended. RAM footprint stays bounded.
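
The suspend and warm-resume rules above can be sketched as two small functions. This is an illustration under assumptions: the `Session` dataclass, `tick`, and `on_surface_tap` are hypothetical names, and the real supervisor would also kill the subprocess and close the wezterm window at the marked point.

```python
from dataclasses import dataclass
from typing import Optional

IDLE_SUSPEND_SECONDS = 30 * 60  # 30 min of pure idle, per the plan


@dataclass
class Session:
    session_id: str
    state: str            # "running" | "parked" | "suspended" | "dead"
    last_activity: float  # unix time of last tool call, user input, or voice


def tick(session: Session, now: float) -> str:
    """Suspend a running session that has idled long enough; never end it."""
    if session.state == "running" and now - session.last_activity >= IDLE_SUSPEND_SECONDS:
        session.state = "suspended"  # real supervisor: kill claude, close wezterm
    return session.state


def on_surface_tap(session: Session, now: float) -> Optional[list[str]]:
    """Warm-resume on any tap; returns the argv the supervisor would spawn."""
    session.last_activity = now
    if session.state != "suspended":
        return None
    session.state = "running"
    return ["claude", "--resume", session.session_id]
```

Note that no code path here moves a session to an ended state, which is the point of the "suspended ≠ ended" rule.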

V2.6 — Quota / rate-limit dignity (R4 fix)

When gateway gets 429 from gemini-live:
1. Mute the voice channel (no more dead-silence loops).
2. PWA modal: "Voice quota hit — switching to text mode for this session" + visual indicator on session card.
3. TG COMMS notification: "Voice degraded for session — text mode active."
4. Session continues in text mode (chat works, transcripts work).
5. After 1h cooldown, gateway probes Gemini; if recovered, voice button re-enables.
6. Hard-fallback to local Whisper+Piper is deferred to v2.1 (out of v2.0 scope to keep budget honest).
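
A hedged sketch of steps 1 to 5 as a small gate object. `VoiceQuotaGate`, the `notify` hook (standing in for the PWA modal and TG COMMS message), and the choice to restart the cooldown after a failed probe are assumptions, not decisions recorded in this plan.

```python
from typing import Callable, Optional

COOLDOWN_SECONDS = 60 * 60  # 1h before the gateway probes Gemini again


class VoiceQuotaGate:
    def __init__(self, notify: Callable[[str], None]) -> None:
        self.notify = notify  # hypothetical hook: PWA modal + TG COMMS
        self.muted_at: Optional[float] = None

    @property
    def voice_enabled(self) -> bool:
        return self.muted_at is None

    def on_status(self, status_code: int, now: float) -> None:
        """Mute voice on the first 429; repeat 429s do not re-notify."""
        if status_code == 429 and self.muted_at is None:
            self.muted_at = now
            self.notify("Voice quota hit: switching to text mode for this session")

    def maybe_probe(self, now: float, probe: Callable[[], bool]) -> bool:
        """After the cooldown, probe once; re-enable voice only on success."""
        if self.muted_at is None:
            return True
        if now - self.muted_at < COOLDOWN_SECONDS:
            return False
        if probe():
            self.muted_at = None
            return True
        self.muted_at = now  # assumption: a failed probe restarts the cooldown
        return False
```

Text mode is unaffected by this gate; it only decides whether the voice button is enabled.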

V2.7 — UI: "Apple-feeling premium" (honest framing)

Locked specs (mandatory):
- Framer Motion springs: type: "spring", stiffness: 260, damping: 20
- Glassmorphism: backdrop-filter: blur(20px) saturate(180%) contrast(90%) + border: 0.5px solid rgba(255,255,255,0.1) (specular edge)
- Typography: -apple-system, 'SF Pro Display', system-ui; headers letter-spacing: -0.022em font-weight: 600; body -0.011em
- Safe-area insets on every screen
- Iconography: SF Symbols via Iconify
- Haptics: navigator.vibrate([20,50,20]) on PTT, [50] on confirm actions
- Micro-interactions (mandatory, not optional):
  - Button press: transform: scale(0.95) plus opacity: 0.8, 100ms ease
  - List swipe: transform: translateX(-100px) plus opacity: 0, 200ms spring
  - Page transitions: slide+fade, no hard cuts
- 60fps gate on PTT call UI + session list scroll. @media (prefers-reduced-transparency) removes blurs in iOS Low Power Mode.

Cut from v2.0 (defer to v2.1):
- Dynamic Island replication
- layoutId shared-element morphs
- 3D transforms / multi-finger gestures
- Apple's specific spring-overshoot tuning per-component (use the one global spring config above)

Honest framing for Jonah: v2.0 ships "premium Apple-feeling skin." NOT "indistinguishable from Apple's own design team." That's a $50k-budget design engagement, not 10h of execution. R4 was unanimous on this: pretending otherwise is the kind of overreach that ships ugly.

UI build: ~10h (was 5h in v1.x, was unrealistic per R4).

V2.8 — Build phases (re-ordered, honest budgets)

| Phase | Scope | Hours |
|---|---|---|
| 0 | Audit + dependency install (i3wm, opus-recorder, framer-motion, workbox, web-push, aiortc, coturn). Dry-run cli-pwa current state. | 2 |
| 1 | session-supervisor.py, the keystone. tmux ownership, scrollback pipe-pane, surface attach API, permission broadcast, idle-suspend, reboot recovery via tmux-resurrect. | 12 |
| 2 | Migrate cli-pwa-backend routes (/sessions/*, /ws/sessions/:id, /sessions/:id/input, /sessions/:id/open-tab) to delegate to supervisor. Kill legacy spawn/kill code paths. Bidirectional wezterm↔PWA sync. | 8 |
| 3 | i3wm install on :1, xfce relocated to :2. wezterm theme + auto-tile. Window-id tracking + focus contract. | 3 |
| 4 | coturn self-host (TLS, UDP 3478-3481/5349, Let's Encrypt). | 2 |
| 5 | brian-voice-gateway.py: dual-WS to gemini-live, audio arbiter state machine, DataChannel gating, codec pinning, 50ms ramp-down crossfade, sequence_id echo-cancel. | 10 |
| 6 | iOS hardening: track lifecycle, audioCtx.resume, audiointerrupt events, ICE-restart-on-foreground, IndexedDB instant-resume, opus-recorder WASM. | 6 |
| 7 | Quota fallback (429 detection → mute + UI modal + TG COMMS push + 1h cooldown probe). | 2 |
| 8 | /app CLI command (Python script in tmux env, registers pane PID with supervisor, idempotent on (pane_pid, session_id)). | 2 |
| 9 | PWA frontend rebuild: Apple-feeling premium skin (Framer Motion, glassmorphism, SF Pro, micro-interactions, safe-area, haptics, A2HS modal, in-app badge). | 10 |
| 10 | Service worker + Workbox precache, BroadcastChannel cross-tab sync, IndexedDB session-state writer. | 3 |
| 11 | Push notifications (web-push subscribe, server emit on 6 event types, TG COMMS mirror). | 3 |
| 12 | noVNC tab top-bar injection (Return-to-Brian button, postMessage theme sync). | 1 |
| 13 | Auth: device-pairing JWT, Cloudflare Access policy, brian.jonahtebaa.com Caddy block. | 3 |
| 14 | Narrator (session-narrator.py) IF kept: phrase library, tool-event subscriber, Puck low-priority injection. | 3 |
| 15 | E2E UAT against §11 success criteria. iPhone real-device test. Latency benchmarking on cellular. Reboot drill. 8-session OOM drill. | 5 |

Subtotal: 75h. Buffer: ~5h for the inevitable Safari quirks. Realistic total: 75-80h.

V2.9 — Failure-mode → mitigation matrix (R4 closed-loop)

| Failure | Mitigation |
|---|---|
| Hetzner reboot mid-night | tmux-resurrect + claude --resume; <3s warm-resume on first tap; "Resumed from X" Puck cue |
| iOS background-kill of audio | track.onended re-acquisition + audioCtx.resume + ICE-restart |
| iOS PSTN call interrupt | audiointerrupt events + auto-resume on end |
| Gemini Live 429 | mute voice + UI modal + TG COMMS + 1h probe + text-mode continuation |
| 8 idle sessions OOM | idle-suspend (30min) → at most 1-2 active claude processes |
| wezterm in noVNC vs PWA desync | tmux scrollback as source of truth, all surfaces read-mirror via pipe-pane |
| Permission prompt across surfaces | supervisor broadcasts; first responder wins |
| WS reconnect mid-call | preserve WebRTC, reconnect WS in background, 1s Puck filler |
| /app re-adoption loop | idempotent on (pane_pid, session_id); reject duplicate adoption |
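
For the /app re-adoption row, idempotency on (pane_pid, session_id) can be as small as a keyed set. The sketch below is illustrative only: `AdoptionRegistry` is a hypothetical name, and a real supervisor would persist the keys in its DB rather than in memory.

```python
class AdoptionRegistry:
    """Tracks adopted panes keyed on (pane_pid, session_id)."""

    def __init__(self) -> None:
        self._adopted: set[tuple[int, str]] = set()

    def adopt(self, pane_pid: int, session_id: str) -> bool:
        """Return True on first adoption, False if this pair is already adopted."""
        key = (pane_pid, session_id)
        if key in self._adopted:
            return False  # duplicate /app call: reject, do not re-register
        self._adopted.add(key)
        return True
```

Typing /app twice in the same pane then becomes a no-op instead of creating a second session card.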

V2.10 — Four trade-offs — JONAH LOCKED 2026-05-03 17:55 Beirut

| # | Decision | LOCK |
|---|---|---|
| 1 | Narrator | KEEP, event-driven only: no speculative filler, no hallucinated "thinking" between real events. Narrator is a state-reporter bound to supervisor tool-events, never imagines progress. |
| 2 | Opus encoder | opus-recorder WASM library. Decided. No future asks on this class of micro-choice. |
| 3 | UI fidelity | "Premium Apple-feeling skin" framing, but executed with craft: every minute of the 10h on micro-interaction polish, not feature creep. Portfolio piece, not half-assed. |
| 4 | Scope | Full vision, 75-80h single push. Both voices, all phases. |

Post-build directive (Jonah explicit):
- Round-table audit by suitable members → fix iteration
- E2E test → fix iteration
- Jonah-perspective self-test using use-my-browser + chrome-mac skills (Brian uses Jonah's real Mac + real Chrome) → fix iteration
- "No too-much-effort budget. Make it close to perfect."

V2.11 — Round-table audit trail


0. LOCKED DECISIONS (Jonah sign-off, 2026-05-03)

| Item | Decision | Lock |
|---|---|---|
| Domain | brian.jonahtebaa.com | LOCKED |
| Round-table roster | 11-agent (Manus INCLUDED, paid OK for this plan) | LOCKED |
| /app mechanism | CLI command typed inside any tmux/CC session, adopts THAT session into PWA list | LOCKED |
| Session end semantics | Only when Jonah explicitly taps "end session". NEVER auto-end. | LOCKED |
| noVNC window | ALWAYS auto-spawn an XFCE wezterm window the moment a session is created | LOCKED |
| noVNC embed in PWA | NO: separate browser tab, PWA deep-links to noVNC | LOCKED |
| Voice A (narration/chat) | Puck (male, Brian voice, same as SIP) | LOCKED |
| Voice B (session output) | Kore (crisp female, calm, clinical, monotone) | LOCKED |
| Voice activation | Push-to-talk (hold mic button) | LOCKED |
| Push notifications | ON for: session-idle-after-tool, social mentions, backend errors, daily summary, incoming SIP, /agency-published | LOCKED |
| Build scope | FULL vision in one push (~27h+); no MVP slice | LOCKED |
| Aesthetic mandate | Jonah branding + futuristic + animations + smooth interface: "designed by Apple itself." | LOCKED |

The Apple-grade futuristic mandate elevates the UI build from "mobile-first PWA" to a portfolio-piece front-end. Adds the UI Designer + frontend-design + jonah-branding skills as first-class participants in phase 1.


0.5. ROUND 1 INTEGRATION (v1.2) — architecture critique patches

Round 1 panel: Codex, Gemini, Hermes, Vibe (4 substantive critiques in 110s; Manus paid-flag, OpenHands offline, Antigravity/Jules async-dispatched).

Convergent breaking findings (all 4 members agreed)

FINDING-A: Dual-voice parallel-mux will glitch under jitter — pivot to ONE-STREAM PRIORITY QUEUE

FINDING-B: Session ownership in cli-pwa is split-brain TODAY — introduce Session Supervisor

FINDING-C: iOS Safari realities — TURN required, background kills WS, Push needs fallback

Additional findings (non-overlapping)

FINDING-D: noVNC window bloat with auto-spawn — needs tiling WM

FINDING-E: /app must be idempotent on tmux pane PID

FINDING-F: Permission/approval routing across surfaces

FINDING-G: Resource reaper without violating Jonah's "never auto-end"

v1.2 changes summary

  1. brian-voice-gateway.py: parallel-mux → single-stream priority queue (FINDING-A)
  2. NEW session-supervisor.py: single PTY owner per session, multi-surface attach (FINDING-B)
  3. coturn self-host BEFORE phase-3 WebRTC work (FINDING-C)
  4. iOS background-kill handling: ICE restart + 5s context replay + audible "resumed" cue (FINDING-C)
  5. Web Push: mirror every event to TG COMMS as fallback; in-app badge counter (FINDING-C)
  6. Tiling WM (i3wm) on :1 for agent desktop, xfce4 retained on :2 for personal (FINDING-D)
  7. /app idempotency on (tmux_pane_pid, session_id) (FINDING-E)
  8. Permission broadcast to all attached surfaces, first-responder-wins (FINDING-F)
  9. Process-dead reaper only, never live-session reaper (FINDING-G)
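
Change 8 (permission broadcast, first-responder-wins) can be sketched as follows. `PermissionRequest`, the surface names, and the callable-per-surface shape are assumptions chosen for illustration; the real supervisor would send over its WS and terminal attachments.

```python
from typing import Callable, Optional


class PermissionRequest:
    """One approval prompt broadcast to every attached surface."""

    def __init__(self, prompt: str, surfaces: dict[str, Callable[[str], None]]) -> None:
        self.prompt = prompt
        self.surfaces = surfaces  # surface name -> hypothetical send function
        self.winner: Optional[tuple[str, bool]] = None

    def broadcast(self) -> None:
        """Show the same prompt on every surface."""
        for send in self.surfaces.values():
            send(self.prompt)

    def respond(self, surface: str, approved: bool) -> bool:
        """Record the first answer; later answers from other surfaces are ignored."""
        if self.winner is not None:
            return False
        self.winner = (surface, approved)
        return True
```

A surface whose `respond` call returns False would dismiss its own prompt, since another surface has already answered.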

Build-budget impact

+5h: session supervisor (3h), coturn setup (1h), iOS background handling + permission broadcast (1h). New total: ~32h (still buildable as one push, but the call stops being "27h-or-bust").


0.6. ROUND 2 + ROUND 3 INTEGRATION (v1.3)

R2 panel (voice/WebRTC): Vibe, Gemini, Hermes (Codex timed out). R3 panel (PWA/iOS/UI): Vibe, Gemini, Hermes (Codex timed out).

R2-CRITICAL: Reversal of v1.2 audio architecture

R2-CRITICAL: iOS PTT mic lifecycle (Gemini deep-dive)

R2 patches — supplementary

R3-CRITICAL: iOS PWA install rate is 12-18% — design for non-installed too

R3-CRITICAL: iOS Safari MediaRecorder Opus is flaky — use WASM encoder

R3 — Concrete Apple-grade UI specs (locked)

Animation: Framer Motion — 14KB gzipped, iOS Safari native, layoutId for shared-element transitions (Dynamic-Island-style PTT pill morph), whileTap/drag gestures. NOT GSAP (heavier, no real win), NOT Skia (overkill for web).

Spring physics (locked): type: "spring", stiffness: 260, damping: 20 — exact iOS sheet bounce per Gemini.

Glassmorphism (locked): backdrop-filter: blur(20px) saturate(180%) contrast(90%) + border: 0.5px solid rgba(255,255,255,0.1) for the specular edge effect.

Typography (locked): font-family: -apple-system, BlinkMacSystemFont, 'SF Pro Display', system-ui, sans-serif. Headers: letter-spacing: -0.022em; font-weight: 600. Body: letter-spacing: -0.011em.

Safe area (locked): Every screen uses padding-top: env(safe-area-inset-top); padding-bottom: env(safe-area-inset-bottom);. Overlap of home indicator = instant cheap-feel disqualifier.

Iconography: react-symbols or Iconify with SF Symbols set. Stroke weight matches typography weight.

Haptics: navigator.vibrate([20, 50, 20]) on PTT press/release. navigator.vibrate([50]) on session-create, swipe-action confirms.

Performance gate (must-not-skip): PTT UI transition + session list scroll must be 60fps (120fps on ProMotion). If either drops frames, the "Apple illusion shatters" — Gemini's words. Add @media (prefers-reduced-transparency) to drop blurs on iOS Low Power Mode.

Micro-interactions (must-not-skip): Button press transform: scale(0.95) opacity(0.8) 100ms; list swipe translateX(-100px) opacity(0) 200ms spring; page transitions slide+fade (no hard cuts). These ARE the Apple feel — Vibe's "biggest risk" is skipping these.

Cuts for 5h budget: Dynamic Island replication (defer to v2), complex multi-finger gestures (defer), 3D transforms (defer). Spend the time on the four locked items above.

R3 patches — service worker + iOS resume

v1.3 changes summary

  1. Audio arch: dual-WS to gemini-live (Puck+Kore pinned) + serial priority queue + 50ms gain ramp-down on Puck preempt (REVISED from v1.2)
  2. iOS PTT: track-always-on + DataChannel gating + audioCtx.resume in touchstart (NEW)
  3. Codec pinning + WASM Opus encoder (opus-recorder) (NEW)
  4. WS reconnect ≠ ICE restart — different mitigation per scenario (REVISED)
  5. DataChannel echo-cancel with sequence_id rewind for clean PTT interrupts (NEW)
  6. Custom A2HS install modal + in-app badge with CSS fallback (NEW)
  7. Apple UI spec locked: Framer Motion stiffness:260 damping:20, glassmorphism with specular border, SF Pro + exact letter-spacing, env(safe-area-inset-*), micro-interactions mandatory (LOCKED)
  8. 60/120fps gate on PTT + session list (mandatory testing)
  9. IndexedDB instant-resume replaces 5s replay (REVISED)
  10. noVNC tab gets injected top-bar with ← Return + postMessage theme sync (NEW)
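
Item 5 (sequence_id rewind) depends on knowing which chunks the listener actually heard before pressing PTT. The sketch below shows only that split. It assumes each chunk carries a transcript fragment, which this plan does not specify; `Chunk` and `split_at_interrupt` are hypothetical names, and how the gateway feeds the result back into Gemini context is left open.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    sequence_id: int
    text: str  # assumed: transcript of the audio carried by this chunk


def split_at_interrupt(chunks: list[Chunk], last_played_id: int) -> tuple[str, str]:
    """Return (heard, unheard) text given the last chunk the PWA actually played."""
    heard = " ".join(c.text for c in chunks if c.sequence_id <= last_played_id)
    unheard = " ".join(c.text for c in chunks if c.sequence_id > last_played_id)
    return heard, unheard
```

For example, with three chunks and `last_played_id=2`, only the first two fragments count as heard, so the model can be told it was cut off before the third.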

Build-budget impact

+7h: WASM Opus encoder integration (1h), DataChannel echo-cancel + sequence_id rewind (2h), IndexedDB resume + BroadcastChannel (2h), custom A2HS modal (1h), Apple-grade UI spec execution refinement (1h on top of phase 1's existing 4h).

New total: ~39h. That's a real budget — flag in final summary that the "27h" was optimistic and v1.3 is the honest number after R1+R2+R3. Ship-quality demands it.


1. Goal

Mobile-first PWA. One tap → spawns a fresh, persistent Claude Code (CC) session on Hetzner. Three input modes:
- Text chat (PWA UI)
- Voice message (record → transcribe → inject as text)
- 2-way voice call (full-duplex, Brian voice, same as SIP)

The session is simultaneously visible as a real graphical window inside the existing noVNC desktop on :1 (XFCE). Jonah can leave the PWA, come back hours later, the session is still there with full scrollback. Voice calls feel identical to SIP calls today — same Brian voice, same Gemini Live model.

User story:

Jonah opens pwa.brianserves.me on his iPhone in bed. Hits "New Session" → "Webspot pricing thoughts." Starts a voice call. Talks for 10 minutes. Hangs up. Brian keeps working in that session. Two hours later Jonah opens the PWA on his Mac, sees the session in the list, sees in the noVNC tab the same session sitting in an XFCE window with all the work Brian did, scrolls through, types a follow-up. Same continuity, three surfaces.


2. Existing Infra — What's Already Built (the gift)

| Asset | Location | Status |
|---|---|---|
| cli-pwa backend (Fastify + SQLite) | /opt/agent/cli-pwa/backend | RUNNING port 8111, full session CRUD |
| cli-pwa frontend (React + Vite + react-router) | /opt/agent/cli-pwa/frontend | Built, has /sessions/:id route, WS streaming, file attach |
| Session lifecycle (spawn / kill / respawn / resume) | backend/src/sessions.ts + sdk.ts | WORKING; uses claude --resume <sessionId> |
| Open-in-noVNC-as-XFCE-window | backend/src/routes/sessions.ts:92 (/sessions/:id/open-tab) | WORKING; spawns terminal on :1.0 X-display via XFCE env |
| WebSocket streaming /ws/sessions/:id | backend | WORKING |
| Input injection /sessions/:id/input | backend | WORKING |
| Transcript export | backend | WORKING |
| Tool-call status detection (thinking / tool / idle) | sessions.ts:125-167 | WORKING; gold for voice narration |
| Brian voice via Gemini Live | gemini-live.service :8102, WebSocket /voice/gemini-ws?role=secretary | RUNNING; same model SIP uses |
| Voice → CC bridge | voice/sek_cc_manager.py + voice/sek_cc_bridge.py | WORKING; runs claude -p --resume per call |
| SIP→Brian voice path | voice/sip_pure_bridge.py (sip-pure-bridge.service) | RUNNING |
| noVNC desktop (XFCE on :1, websockify :6080) | vncserver@:1.service | RUNNING; full xfce4-session, ready to receive new windows |
| inject-cc skill (tmux send-keys) | ~/.claude/skills/inject-cc | EXISTS (tmux path; cli-pwa uses subprocess path; both valid) |

Gap to vision (the actual work):
1. PWA manifest + service worker + install affordance (mobile-first polish)
2. Voice-message recording UI + STT endpoint + auto-inject
3. 2-way voice call: WebRTC from browser ↔ existing gemini-live :8102 ↔ active CC session
4. Mid-call narrator (silence-killer): tool-status events → short Brian-voice utterances
5. Unify two parallel session models: cli-pwa.sessions (DB) and sek_cc_manager (per-SIP-call). One model, two entry points.
6. Persistence hardening: stale-timeout off for PWA-owned sessions; explicit park/resume; survive host reboot via systemd resurrection
7. Auth: Cloudflare Access in front (Jonah-only) + signed device key for PWA push
8. Push notifications: send a web-push when a long-running task in a backgrounded session goes idle after >60s of tool activity
9. Domain + TLS: pwa.brianserves.me (or cc.jonahtebaa.com), behind Cloudflare


3. Target Architecture

┌──────────────────────────────────────────────────────────────────┐
│  iPhone / Mac browser — PWA installed from pwa.brianserves.me    │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │ React PWA (mobile-first)                                    │ │
│  │  • Session list           • Chat view (xterm.js mirror)     │ │
│  │  • Voice-msg recorder     • 2-way call button (WebRTC)      │ │
│  │  • Service worker (push notifs, offline shell)              │ │
│  └────────────┬─────────────────────────────────┬──────────────┘ │
└───────────────│─────────────────────────────────│────────────────┘
                │  HTTPS + WSS                    │  WebRTC (audio)
                │  (Cloudflare → Caddy)           │
                ▼                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│ HETZNER ubuntu-8gb-hel1-1                                        │
│                                                                  │
│  ┌──────────────────┐    ┌─────────────────────┐                 │
│  │ Caddy (TLS, gate)│    │ Brian Voice Gateway │  NEW            │
│  │  pwa.brianserves │    │  WebRTC ↔ WebSocket │  (small Python) │
│  └────────┬─────────┘    │  bridge to :8102    │                 │
│           │              └──────────┬──────────┘                 │
│           ▼                         ▼                            │
│  ┌──────────────────────────────────────────────────────┐        │
│  │ cli-pwa-backend (Fastify, port 8111)                 │        │
│  │  • /api/sessions   CRUD + WS stream                  │        │
│  │  • /api/sessions/:id/input    (text + transcribed)   │        │
│  │  • /api/sessions/:id/open-tab (spawn XFCE window)    │        │
│  │  • /api/voice/transcribe      NEW (Gemini Live STT)  │        │
│  │  • /api/voice/call/:id        NEW (WebRTC signaling) │        │
│  │  • /api/push/subscribe        NEW (web-push)         │        │
│  │  • /api/auth/pair             NEW (device-key pair)  │        │
│  └────────┬───────────────────────────┬─────────────────┘        │
│           │                           │                          │
│           ▼ subprocess + DB           ▼ control msgs             │
│  ┌──────────────────────┐    ┌─────────────────────────┐         │
│  │ claude CLI (per-sess)│◄──►│ session-narrator        │ NEW     │
│  │  --resume <id>       │    │  watches tool events,   │         │
│  │  in tmux (optional)  │    │  emits Brian utterances │         │
│  └──────────────────────┘    │  during 2-way call      │         │
│                              └────────────┬────────────┘         │
│  ┌──────────────────────┐                 │                      │
│  │ XFCE window on :1    │◄────────────────┘                      │
│  │  wezterm/xterm       │   (also visible via noVNC :6080)       │
│  │  attached to session │                                        │
│  └──────────────────────┘                                        │
│                                                                  │
│  ┌──────────────────────┐    ┌─────────────────────────┐         │
│  │ gemini-live :8102    │    │ sip-pure-bridge         │         │
│  │  (Brian voice STT/TTS│    │ (existing SIP path,     │         │
│  │   — REUSED AS-IS)    │    │  unchanged, still works)│         │
│  └──────────────────────┘    └─────────────────────────┘         │
└──────────────────────────────────────────────────────────────────┘

4. Component Specs

4.1 PWA Frontend (extend /opt/agent/cli-pwa/frontend)

Tech: React + Vite (existing) + vite-plugin-pwa (already installed) + xterm.js (new) + MediaRecorder API + WebRTC.

Routes (mobile-first, bottom-nav):
- / — Sessions list. Cards show name, last activity, status badge (idle/thinking/tool: Bash), quick "voice call" + "open in noVNC" + "kill" actions.
- /new — One-screen new-session form: name, optional starter prompt, optional cwd (default /root).
- /s/:id — Session view. Tabs:
  - Chat (default): xterm.js mirror via /ws/sessions/:id, text input bar plus a voice-message button (push-to-talk hold).
  - Files: existing attach UI.
  - Settings: rename, archive, kill, "open in noVNC tab", "duplicate".
- /s/:id/call — Full-screen 2-way voice call. Visual: pulsing Brian-voice avatar, current narration text in big type, mute/end buttons.
- /settings — Push subscription, voice preferences, paired devices.

PWA manifest (frontend/public/manifest.webmanifest — currently empty, fix):
- Name: "Brian"
- Short name: "Brian"
- Display: standalone
- Theme color: matches Jonah brand
- Icons: 192/512/maskable (generate via /imagen skill)
- Start URL: /

Service worker (sw.js):
- Precache app shell
- Background sync for failed inputs (offline support — queues and replays)
- Push notification handler → opens /s/:id deep link

4.2 Backend extensions (/opt/agent/cli-pwa/backend)

New endpoints:

POST /api/voice/transcribe
Body: { audio: <base64 webm/opus>, sessionId?: string }
→ Forwards to gemini-live :8102 (mode=stt-only)
→ Returns { text: "...", durationMs: 1234 }
→ If sessionId provided: also POSTs /sessions/:id/input internally

POST /api/voice/call/:sessionId/offer
Body: { sdp: <WebRTC offer>, role: "brian-pwa" }
→ Spins up bridge worker (see 4.3)
→ Returns { sdp: <answer>, callId, narratorUrl }

DELETE /api/voice/call/:callId
→ Tears down bridge

POST /api/push/subscribe
Body: { endpoint, keys: { p256dh, auth } }
→ Stores per-device subscription in SQLite

POST /api/auth/pair
Body: { qrToken }
→ Issues signed device JWT (used for all subsequent calls)
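
As a small illustration of the /api/voice/transcribe request shape above, the helper below builds the JSON body from raw audio bytes. `build_transcribe_body` is a hypothetical name; only the `audio` and `sessionId` fields come from the spec.

```python
import base64
import json
from typing import Optional


def build_transcribe_body(audio: bytes, session_id: Optional[str] = None) -> str:
    """Encode recorded audio as base64 and wrap it in the endpoint's JSON body."""
    body = {"audio": base64.b64encode(audio).decode("ascii")}
    if session_id is not None:
        # Optional: when present, the backend also injects the transcript
        # into that session's input.
        body["sessionId"] = session_id
    return json.dumps(body)
```

Omitting `session_id` gives a pure transcription call with no injection side effect.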

Modified:
- /sessions POST — add parkable: true flag → exempts from stale-timeout reaper.
- /sessions/:id/open-tab — already works; add ensureWindow: true semantics → idempotent (don't spawn duplicate windows).

DB schema additions (backend/src/db.ts):

ALTER TABLE sessions ADD COLUMN parkable INTEGER DEFAULT 1;
ALTER TABLE sessions ADD COLUMN owner_device TEXT;     -- paired device id
ALTER TABLE sessions ADD COLUMN xfce_window_id TEXT;   -- so kill/refocus works
ALTER TABLE sessions ADD COLUMN created_via TEXT;      -- 'pwa' | 'sip' | 'cli'

CREATE TABLE push_subscriptions (
  id TEXT PRIMARY KEY,
  device_id TEXT NOT NULL,
  endpoint TEXT NOT NULL,
  p256dh TEXT NOT NULL,
  auth TEXT NOT NULL,
  created_at INTEGER
);

CREATE TABLE devices (
  id TEXT PRIMARY KEY,
  name TEXT,
  paired_at INTEGER,
  last_seen INTEGER
);
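
SQLite's ALTER TABLE ... ADD COLUMN fails if the column already exists, so the additions above need a guard if the migration can run more than once. The sketch below shows one way to do that with Python's sqlite3 module, purely as an illustration: the backend itself is TypeScript, and `migrate` plus the minimal `sessions` table used in the usage note are assumptions.

```python
import sqlite3

NEW_COLUMNS = {
    "parkable": "INTEGER DEFAULT 1",
    "owner_device": "TEXT",
    "xfce_window_id": "TEXT",
    "created_via": "TEXT",
}


def migrate(conn: sqlite3.Connection) -> list[str]:
    """Add any missing columns to `sessions`; return the names that were added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(sessions)")}
    added = []
    for name, decl in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE sessions ADD COLUMN {name} {decl}")
            added.append(name)
    conn.commit()
    return added
```

Running `migrate` twice against the same database adds the four columns on the first call and nothing on the second.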

4.3 Brian Voice Gateway (NEW — small Python service)

Purpose: Bridge browser WebRTC ↔ existing gemini-live :8102 WebSocket without breaking the SIP path.

Why a new service: gemini-live :8102 already works. Don't touch it. Instead, write a thin gateway that:
- Accepts WebRTC offer from PWA (via aiortc)
- Opens a WebSocket to ws://127.0.0.1:8102/voice/gemini-ws?role=brian-pwa&session_id=<sid>
- Pipes browser-mic audio (Opus → PCM 16k) → WS audio frames
- Pipes WS audio frames (PCM → Opus) → WebRTC track to browser
- Subscribes to cli-pwa-backend events for the session: when CC emits tool: <name> → also send a "narration text" event to gemini-live's prompt channel so Brian voice speaks "reading the file" etc.

File: /opt/agent/voice/brian_voice_gateway.py
Service: brian-voice-gateway.service, listens on 127.0.0.1:8120
Reuse: gemini_live_server.py already has the role-multiplexing pattern (role=secretary for SIP). Add role=brian-pwa with same Brian voice prompt + system-prompt-aware-of-CC-context.

4.4 Session Narrator (NEW — small daemon)

Purpose: Kill silence during 2-way calls.

Logic:
- Subscribes to cli-pwa-backend events for active call's session
- Threshold: if tool status > 3s without an assistant text emit → fire one short narration
- Library of phrases keyed by tool: Bash → "running command", Read → "reading file", Edit → "patching the file", Agent → "dispatching a subagent", WebSearch → "searching the web", Task → "queuing a task"
- Variation: rotate 2-3 phrasings per tool to not sound robotic
- During pure thinking > 5s → "thinking" / "working it out" / "drafting"
- Sends narration via gateway's narration channel → gemini-live → TTS → browser

File: /opt/agent/voice/session_narrator.py
Co-located with gateway (could be same process — TBD per round-table).
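
The threshold rule in the logic list can be sketched as a single predicate. It follows section 4.4 as written; the function name, the `already_narrated` flag (to enforce "one short narration"), and the status strings are illustrative assumptions.

```python
TOOL_SILENCE_S = 3.0      # tool running this long with no assistant text
THINKING_SILENCE_S = 5.0  # pure thinking this long


def should_narrate(status: str, silent_for: float, already_narrated: bool) -> bool:
    """Decide whether to fire one narration for the current silent stretch."""
    if already_narrated or status == "idle":
        return False
    threshold = THINKING_SILENCE_S if status == "thinking" else TOOL_SILENCE_S
    return silent_for > threshold
```

The caller would reset `already_narrated` whenever the session emits assistant text or changes status, so each silent stretch gets at most one utterance.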

4.5 noVNC visibility (already 95% working)

Existing: POST /sessions/:id/open-tab spawns terminal in XFCE on :1.0.

Enhancements:
- Set window title = session name (so Jonah finds it in the XFCE taskbar)
- Use wezterm instead of default xterm (better fonts, better scroll)
- Track xfce_window_id so the PWA can show "open in noVNC" only when window doesn't exist; otherwise "focus in noVNC"
- Add a noVNC quick-link button on each session card → opens https://board.jonahtebaa.com/vnc.html?autoconnect=true&path=...&focus=<window_id>

Caveat: :1.0 has -localhost=1 (Xtigervnc binds 127.0.0.1 only). noVNC websockify on :6080 proxies it. That's already how it works for Jonah today — no change needed.

4.6 Auth & deploy

4.7 Persistence guarantees


5. Voice Flow — Detailed

5.1 Voice message (push-to-talk)

PWA: hold mic button
  → MediaRecorder captures Opus webm
  → release button
  → POST /api/voice/transcribe { audio, sessionId }
    Backend → ws://127.0.0.1:8102/voice/gemini-ws?role=stt
    Receives transcript → POST internally to /sessions/:id/input
    Returns { text, durationMs } to PWA
PWA: shows transcript bubble in chat ("you said: ..."), then assistant reply streams in

Latency target: <2s for 5s clip.

5.2 2-way call

PWA: tap "call"
  → getUserMedia(audio)
  → RTCPeerConnection, addTrack(mic)
  → POST /api/voice/call/:id/offer { sdp }
    Backend asks brian-voice-gateway to allocate
    Gateway:
      ① opens RTCPeerConnection, returns answer SDP
      ② opens WS to gemini-live :8102 role=brian-pwa
      ③ starts session-narrator subscribed to session events
    Returns { sdp, callId }
PWA: setRemoteDescription(answer) → audio flowing
[user speaks] → Opus → Gateway → PCM 16k → gemini-live → CC injection
[CC works]   → tool events → narrator → "reading the file" → gemini-live TTS → Gateway → Opus → PWA
[CC final]   → assistant text → gemini-live TTS (Brian voice) → PWA
PWA: tap "end" → DELETE /api/voice/call/:callId
  Gateway tears down both legs; session lives on

Latency target: first STT token <300ms, first TTS audio <400ms (matching SIP today).
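
The key lifecycle property in this flow is that ending a call removes the call and leaves the session untouched. A minimal sketch of that bookkeeping follows; `CallRegistry` and its methods are hypothetical, and the actual WebRTC and gemini-live legs are represented only by the call record.

```python
import uuid


class CallRegistry:
    """Tracks active calls separately from the sessions they attach to.

    Hypothetical bookkeeping sketch; real leg setup/teardown is omitted.
    """

    def __init__(self):
        self._calls = {}  # callId -> sessionId

    def allocate(self, session_id):
        """Register a new call on a session and return its callId."""
        call_id = uuid.uuid4().hex
        self._calls[call_id] = session_id
        return call_id

    def active_for(self, session_id):
        """List callIds currently attached to a session."""
        return [c for c, s in self._calls.items() if s == session_id]

    def end(self, call_id):
        """Drop the call; return the surviving sessionId, or None if unknown."""
        return self._calls.pop(call_id, None)
```

Returning the session id from `end` makes the "session lives on" rule explicit at the call site: the caller gets back something it can keep using.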

5.3 Mid-call narrator phrasing library

Store in /opt/agent/voice/narrator_phrases.json:

{
  "Bash":      ["running a command", "executing", "checking the shell"],
  "Read":      ["reading the file", "pulling up the file", "checking that file"],
  "Edit":      ["patching the file", "making the edit", "writing the change"],
  "Write":     ["writing a new file", "saving the new file"],
  "Grep":      ["searching", "grepping", "looking through the code"],
  "Glob":      ["scanning files", "finding files"],
  "Agent":     ["dispatching a subagent", "spinning up a sub-task"],
  "WebSearch": ["searching the web", "googling that"],
  "WebFetch":  ["fetching the page", "pulling the URL"],
  "thinking":  ["thinking", "working through it", "drafting", "weighing options"]
}

Rotate per call to avoid robotic repetition.
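
One way to implement the rotation, as a sketch: load the JSON once and never repeat the previous phrase for the same tool. The class name `PhraseRotator` and the "no immediate repeat" reading of "rotate" are assumptions; only the file path and the JSON shape come from the plan.

```python
import json
import random
from pathlib import Path

# Path taken from the plan; class and method names are hypothetical.
PHRASES_PATH = Path("/opt/agent/voice/narrator_phrases.json")


class PhraseRotator:
    """Picks a phrase per tool without repeating the previous pick."""

    def __init__(self, phrases, rng=None):
        self._phrases = phrases
        self._last = {}  # tool -> phrase used most recently
        self._rng = rng or random.Random()

    @classmethod
    def from_file(cls, path=PHRASES_PATH):
        return cls(json.loads(Path(path).read_text(encoding="utf-8")))

    def pick(self, tool):
        options = self._phrases.get(tool)
        if not options:
            return None  # unknown tool: stay silent rather than guess
        # Exclude the last phrase for this tool when alternatives exist.
        candidates = [p for p in options if p != self._last.get(tool)] or options
        choice = self._rng.choice(candidates)
        self._last[tool] = choice
        return choice
```

Returning `None` for a tool with no entry lets the narrator skip the utterance instead of inventing one.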


6. Build Phases (executable, in order)

| Phase | Scope | Effort | Outcome |
|---|---|---|---|
| 0 | Audit & document existing cli-pwa state, DB schema, voice gateway pattern. Confirm reuse plan. | 1h | Spec validated, no surprises |
| 1 | Mobile-first PWA UI pass: manifest, sw.js, install prompt, bottom-nav, mobile chat layout. xterm.js for session view. Deploy to pwa.brianserves.me behind Cloudflare. | 4h | Installable PWA, text chat works on iPhone |
| 2 | Voice message: MediaRecorder UI + /api/voice/transcribe endpoint + gemini-live STT-only role. Auto-inject. | 3h | Hold-to-talk → transcript → CC reply (text) |
| 3 | brian-voice-gateway.py: WebRTC ↔ gemini-live bridge. Add role=brian-pwa to gemini-live server. 2-way call works without narrator. | 6h | Real voice call, but silent during tool calls |
| 4 | session-narrator.py: tool-event subscriber + phrase library + narration injection into call. | 3h | Calls feel alive, no dead air |
| 5 | Persistence hardening: parkable flag, reboot-safe session reaper, scrollback retention policy. | 2h | "Leave and come back" guaranteed |
| 6 | Push notifications: web-push subscribe + finish-event trigger + iOS-aware payload. | 3h | "CC done" pings phone |
| 7 | Auth: device-pairing flow, signed JWT, Cloudflare Access policy review. | 2h | Jonah-only |
| 8 | noVNC integration polish: window-id tracking, "focus" deep link, wezterm theme. | 1h | Smooth noVNC tab UX |
| 9 | Round-table consultation review → patch plan → re-verify each phase against feedback. | 2h | External validation pass |

Total: ~27h focused work. Ship MVP (phases 0-3) in one sprint (~14h), full vision in two.
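
The totals can be checked directly against the per-phase efforts in the table above. The dictionary name is arbitrary; the hours are copied from the table.

```python
# Effort per phase in hours, copied from the phase table.
PHASE_HOURS = {0: 1, 1: 4, 2: 3, 3: 6, 4: 3, 5: 2, 6: 3, 7: 2, 8: 1, 9: 2}

total_hours = sum(PHASE_HOURS.values())
mvp_hours = sum(h for phase, h in PHASE_HOURS.items() if phase <= 3)

print(total_hours, mvp_hours)  # 27 14
```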


7. Failure Modes & Mitigations

| Failure | Detection | Mitigation |
|---|---|---|
| Gemini Live disconnects mid-call | gateway WS error | Auto-reconnect; play "one second" Brian utterance; resume |
| CC subprocess crashes | sessions.ts already detects session_end | Auto-respawn with --resume (already implemented) |
| PWA loses network mid-input | service worker queue | Background sync replays input on reconnect |
| Multiple browsers open same session | DB xfce_window_id + WS counter | Allow it: both mirror same stream; warn in UI |
| Host reboot during call | systemd restart sequence | Sessions parked, narrator emits "we got cut off, picking up where we left" on reconnect |
| iOS PWA push limitations | iOS 16.4+ supports web-push only when installed to home screen | Detect, show "install to home screen" coach mark |
| WebRTC blocked by carrier NAT | aiortc TURN | Configure TURN server (coturn) on Hetzner (free, ~30min setup) |
| Voice latency degrades on cellular | client metrics | Show RTT badge; auto-fallback to voice-msg mode if RTT > 800ms |
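
The RTT-based fallback in the last row could be driven by a small selector like the one below. This is a sketch under assumptions: `ModeSelector` is a hypothetical name, and taking the median of the last few samples (rather than reacting to a single spike) is a design guess the plan does not state. Only the 800ms threshold comes from the table.

```python
from collections import deque
from statistics import median

RTT_FALLBACK_MS = 800  # threshold from the failure-modes table


class ModeSelector:
    """Chooses between live call and voice-message mode from recent RTTs."""

    def __init__(self, window=5):
        # Keep only the most recent samples so old readings age out.
        self._samples = deque(maxlen=window)

    def add_sample(self, rtt_ms):
        self._samples.append(rtt_ms)

    def mode(self):
        if not self._samples:
            return "call"  # no data yet: assume the link is fine
        if median(self._samples) > RTT_FALLBACK_MS:
            return "voice-msg"
        return "call"
```

Using a median means one slow packet on cellular does not drop the user out of a call, while a sustained degradation still triggers the fallback.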

8. Open Questions (round-table seeds)

  1. One model or two? Should cli-pwa.sessions and sek_cc_manager (per-SIP-call) merge into one session model, or stay parallel? Merge = one source of truth, but bigger refactor. Parallel = fewer regressions, two systems.
  2. PWA terminal: xterm.js or custom? xterm.js is heavy (~300KB). Custom JSON-message renderer is lighter, mobile-friendlier, but loses copy-paste-of-raw-output. Trade-off?
  3. Narrator: separate process or co-located? Separate = cleaner, restartable. Co-located in gateway = lower latency, fewer moving parts. Round-table opinion?
  4. Voice-only mode for low bandwidth? Auto-detect cellular + drop video/visual updates, just stream audio + text-input fallback?
  5. Session-name auto-generation? Use first user prompt as title (LLM-summarize)? Or always require Jonah to name?
  6. Multi-device same session? Two browsers open same session — merge inputs, last-write-wins, or lock per-device? Round-table view on UX?
  7. TURN server cost? Self-host coturn on Hetzner (free, but more attack surface) vs Cloudflare Calls (paid)?
  8. What does "ended" mean? PWA close = end? Tab close = end? Explicit "end session" only? My default: only explicit end + 30-day idle.
  9. iOS quirks: Web Push only works on iOS 16.4+ installed PWAs. Is that acceptable, or do we also build a native iOS shell (TestFlight)?
  10. Auth fallback: What if Jonah loses his device? QR pairing from another paired device? TG-bot one-time token?

9. Round-Table Consultation Plan

After Jonah signs off this plan, run /round_table with the 10-member free roster (codex, chatgpt, gemini, antigravity, perplexity, grok, hermes, openhands, vibe, jules) — ask Jonah first whether to include Manus (per memory rule).

Round-table prompt outline:
1. Share this plan as context
2. Ask each member to weigh in on the 10 open questions above
3. Specifically request: WebRTC-vs-WebSocket-audio recommendation, PWA-iOS-push edge cases, ttyd-vs-xterm-js tradeoff, narrator-architecture opinion
4. Synthesize → patch plan → re-share final to Jonah


10. What This Costs (real money discipline)

Single paid risk: the Gemini Live free tier could be exhausted under heavy use (currently fine; SIP traffic isn't huge). If that happens, drop to local Whisper+Piper for off-peak. Round-table to advise.


11. Success Criteria (UAT)


12. Sign-off needed

After sign-off → run round-table → integrate feedback → execute phase 0.