2026-05-07 22:30 (Beirut) (backfill from DOCUMENTATION/)

260507_session_brian_hook_hardening

================================================================================
SESSION TRANSCRIPT — Brian hook hardening + revenue pivot
Date: 2026-05-07 (Beirut)
Host: ubuntu-8gb-hel1-1 (Hetzner)
Session start: ~22:00 Beirut (post-/clear)
================================================================================


CONTEXT AT SESSION START
------------------------
Just after /clear. Pre-clear preservation note carried 7 followup items:
  (1) Ed25519 migration steps 1-4 APPLIED; HMAC+Ed25519 dual-sign window
      active, decommission HMAC ~1 week.
  (2) Stale Brian session PID 2290184 alive since May 5 firing test
      fixtures — investigate.
  (3) Sweep 5 sibling Stop-hooks for COMMS-default alerts:
      brian-presend-evidence-gate, brian-done-plus-guard,
      brian-flat-voice-guard, brian-proactive-stop-guard,
      brian-action-language-guard.
  (4) brian_validator.py:303 sparse-args command_zero_exit crash → clean
      policy-fail.
  (5) Defense-in-depth: hard-fail hook on BRIAN_TEST_MARKER files in prod
      overrides dir.
  (6) Freepik fallback for blog hero images half-installed.
  (7) Patterns: ALERT_RELAY_RE guard for any prompt-content hook;
      test-vs-prod env isolation flags OVERRIDE_AUDIT_DIR/NO_ALERT/MEMORY_FILE;
      HARD rule errors→LOGS never COMMS (locked 2026-04-17).

Jonah opened the session by running:
  echo '{"transcript_path":"/dev/null","stop_hook_active":false}' \
    | /root/.claude/hooks/brian-evidence-gate-sandboxed.sh
  rc=0
That rc=0 closed item #4 (sparse-args path now exits clean).


TURN 1 — "now what?"
--------------------
Brian: confirmed item #4 closed; proposed investigating PID 2290184 next.

Inspected PID 2290184:
  /root/.local/bin/claude --system-prompt You are Brian. Jonah Tebaa's partner...
  ETIME=2-00:41 STAT=Sl+ TTY=pts/1 PPID=2289512
Conclusion: NOT a stale runaway. It's a live interactive Brian Claude CLI
sitting on pts/1 since May 5 (lined up with the Codex behavioral-enforcement
work). Foreground, sleeping, no CPU. The earlier preservation note's
"firing test fixtures" framing was wrong — it's just a detached terminal.
Reclassified item #2: not stale, leave-or-kill is Jonah's call.


TURN 2 — "yes" (proceed with #3 sweep)
--------------------------------------
Located all 5 Stop-hooks at /root/.claude/hooks/. Grepped each for:
  COMMS / LOGS / jonah_claude_bot / brian_system_logs_bot / alert_relay /
  tg_send / send_telegram / hermes_send / level= / chat_id / TELEGRAM_
Result: ZERO matches across all five.

Confirmed by import grep:
  All 5 hooks import only json/sys/re/os/pathlib/datetime.
  No requests, urllib, or subprocess module imports anywhere.
  Pure-local stdin-validators that block via stderr + nonzero exit.

Verdict: clean. No COMMS-vs-LOGS misroute possible from inside these 5.
Item #3 closed.
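
The import check above could also be done with Python's ast module
instead of grep, which avoids matching on comments or strings. A minimal
sketch, assuming hypothetical helper names (imported_modules,
outbound_imports) and an illustrative OUTBOUND set; none of these come
from the session itself.

```python
import ast

# Assumption: modules treated as "outbound-capable" for this illustration.
OUTBOUND = {"requests", "urllib", "subprocess", "socket", "http"}


def imported_modules(source: str) -> set:
    """Return the top-level module names imported anywhere in `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # node.module is None for "from . import x"; skip those.
            found.add(node.module.split(".")[0])
    return found


def outbound_imports(source: str) -> set:
    """Subset of imports that could open a network or process channel."""
    return imported_modules(source) & OUTBOUND


sample = "import json, sys, re\nfrom pathlib import Path\nimport os.path\n"
print(sorted(imported_modules(sample)))  # ['json', 'os', 'pathlib', 're', 'sys']
print(outbound_imports(sample))          # set()
```

An empty result from outbound_imports() for each hook file would
correspond to the "clean" verdict recorded above.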

PROACTIVE RISK SURFACED: with no outbound alerting, a CRASHED hook
silently passes (a Stop hook blocks only on exit 2; exit 0 or any
other error allows). If one of the five throws on malformed stdin, its
trait stops being enforced and nobody knows. This is the same failure
mode that bit brian_validator.py:303. Recommended adding a try/except
wrapper that catches exceptions and logs to the rotating crash file
referenced in feedback_hook_crash_pattern.md.
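
The exit-code behaviour this recommendation aims for can be sketched as
a plain function: a deliberate exit 2 still blocks, while an unexpected
exception is recorded and converted to exit 0. This is an illustration
only; guarded_exit_code and the on_crash callback are hypothetical names,
not the wrapper that was later applied.

```python
import sys


def guarded_exit_code(main, on_crash=None) -> int:
    """Run `main` and return the exit code the hook process would end with."""
    try:
        main()
    except SystemExit as exc:
        # Deliberate exits pass through: exit 2 must keep blocking.
        if exc.code is None:
            return 0
        return exc.code if isinstance(exc.code, int) else 1
    except Exception as exc:
        # Unexpected crash inside the hook: record it, then fail-allow.
        if on_crash is not None:
            try:
                on_crash(exc)
            except Exception:
                pass  # a broken recorder must never turn a crash into a block
        return 0
    return 0


def blocks():
    sys.exit(2)


def crashes():
    raise KeyError("transcript_path")


seen = []
print(guarded_exit_code(blocks))                # 2
print(guarded_exit_code(crashes, seen.append))  # 0
print(len(seen))                                # 1
```

The crash is no longer silent: the trait still fails open, but the
recorder sees every exception.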


TURN 3 — "wanna push them all? or not recommended?"
---------------------------------------------------
Brian's honest take per remaining queue:
  #1 Ed25519 HMAC decommission → DEFER. Dual-sign window runs to
     ~May 14; pulling HMAC early risks breaking signature verification.
  #5 BRIAN_TEST_MARKER hard-fail → PUSH NOW. Small, additive, blast
     radius zero.
  #6 Freepik fallback → CHECK FIRST. Half-installed, likely blocked on
     creds or quota.
  #7 patterns docs → LOW VALUE vs revenue cost. Defer.
  Hook-crash wrapper (the proactive risk) → PUSH NOW. Higher leverage
     than #5/#6/#7 combined; protects every other hook.

Bigger picture flagged: all of this is infra polish. CORE memory says
"notice when you drift back into infra-polish mode; flag yourself and
pivot to revenue work". Recommended bundle-then-pivot:
  1. Push #5 + hook-crash wrapper as 20-min defense-in-depth bundle
  2. Stop the infra sweep
  3. Pivot to revenue — /agency post status, GEO hour, lead-form
     probes, signal sweep


TURN 4 — "do what you know is best. also do not forget to test
         with codex and, if available, jules and gemini."
-------------------------------------------------------------

PHASE A — Discovery (parallel):
  ✓ Prod overrides dir = /opt/agent/data/agent_runtime/overrides/
    Already had a leaked test fixture: test_t1_1778135072.jsonl
    (from May 7 06:24). Real threat existed at the time of the work.
  ✓ Crash convention: _brian_hook_crash.record_crash(name, exc) module
    at /root/.claude/hooks/_brian_hook_crash.py. Fail-allow pattern,
    threshold 3/hour → breach marker → alert.
  ✓ Reviewer CLIs:
      codex   → /usr/bin/codex          (sync, exec subcommand)
      jules   → /usr/bin/jules          (async, repo-bound — not fit)
      gemini  → /usr/local/bin/gemini   (sync, -p prompt)
    Round-table = codex + gemini. Jules skipped (wrong fit, not failure).
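
The crash convention found in Phase A (fail-allow, threshold 3/hour, then
a breach marker) can be illustrated with a small stand-in. The internals
of _brian_hook_crash.py are not shown in this session, so everything
below is an assumption: the function name record_crash_sketch, the
JSON-lines log format, the explicit log path argument, and the marker
filename are all hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Optional

THRESHOLD = 3          # crashes per window, per the "3/hour" note
WINDOW_SECONDS = 3600


def record_crash_sketch(log: Path, name: str, exc: BaseException,
                        now: Optional[float] = None) -> bool:
    """Append one crash record; return True if the hourly threshold is hit.

    Hypothetical stand-in: the real record_crash(name, exc) takes no path.
    """
    now = time.time() if now is None else now
    entry = {"hook": name, "ts": now, "error": f"{type(exc).__name__}: {exc}"}
    with log.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")

    recent = 0
    for line in log.read_text(encoding="utf-8").splitlines():
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip a torn line rather than crash inside the recorder
        if not isinstance(rec, dict) or rec.get("hook") != name:
            continue
        if now - rec.get("ts", 0) <= WINDOW_SECONDS:
            recent += 1

    breached = recent >= THRESHOLD
    if breached:
        # Assumed marker name; the alert itself is out of scope here.
        marker = log.with_name(f"{name}.breach.json")
        marker.write_text(json.dumps({"hook": name, "count": recent, "ts": now}))
    return breached
```

Counting per hook name keeps one noisy hook from masking or triggering a
breach for another.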

PHASE B — Backup before edit:
  /opt/agent/data/agent_runtime/_hook_backups/20260507T191008Z_pre_wrapper/
  (6 hooks copied with mode preserved)

PHASE C — Design doc written, dispatched to codex + gemini in parallel.
  Doc: /tmp/brian_hook_hardening_design.md
  Both ran headless; gemini first attempt SIGTERM'd on --approval-mode
  plan (interactive); retried with -y (yolo headless), succeeded.

ROUND-TABLE VERDICTS:
  CODEX  = GO-WITH-CHANGES
  GEMINI = GO-WITH-CHANGES

Convergent answers (both reviewers):
  Q1 exit code:    keep 0; alert + cleanup, never block prompts
  Q3 SystemExit:   re-raise as written
  Q4 performance:  acceptable
  Q5 race:         ignore — goal is leak detection, not atomic enforce

Codex unique catches:
  - sys.path.insert idempotency guard (no duplicate entries)
  - Narrow detector except to OSError, not blanket-pass (preserves
    real bug visibility)
  - shlex.quote filenames in error message; remove auto rm -f
    (filename injection risk)
  - Cap directory scan at 1000 entries
  - Write a first-occurrence breach marker DIRECTLY (do not wait
    for 3/hour aggregation — single leak might never trigger)
  - Add prefixes pytest_, fixture_, _test

Gemini unique catches:
  - Use os.scandir() instead of pathlib.Path.iterdir() (DirEntry type
    checks can skip an extra stat call)
  - Filter to .jsonl only + "MARKER" substring (fewer false positives)
  - Avoid _test prefix (collision risk with future legit files)
  - Verify imports + ensure sys.path insert happens before import in
    every branch

Conflict resolution:
  _test prefix → DROPPED (gemini's conservative option wins; legit
                  files in OVERRIDES_ROOT are UUID-named, so neither
                  choice risks a collision today, but dropping the
                  prefix is safer against future file names)
  All other recommendations → MERGED


PHASE D — Applied edits:
  1) Wrapper appended to 5 hooks:
     brian-presend-evidence-gate.py
     brian-done-plus-guard.py
     brian-flat-voice-guard.py
     brian-proactive-stop-guard.py
     brian-action-language-guard.py

     Pattern:
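       if __name__ == "__main__":
           try:
               main()
           except SystemExit:
               # Deliberate exits (e.g. exit 2 to block) pass through.
               raise
           except Exception as _e:
               # Unexpected crash: record it, then fail-allow with exit 0.
               try:
                   _HD = "/root/.claude/hooks"
                   if _HD not in sys.path:  # idempotency guard (codex)
                       sys.path.insert(0, _HD)
                   from _brian_hook_crash import record_crash
                   record_crash("brian-<hookname>", _e)
               except Exception:
                   # The recorder itself must never block a prompt.
                   pass
               sys.exit(0)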

  2) Pollution detector added to brian-override-audit.py:
     - shlex import added
     - _detect_pollution() helper: os.scandir, capped 1000,
       .jsonl-only, prefix-or-MARKER match
     - main() runs detect first; on pollution: stderr msg with
       shlex.quote'd names + write _pollution_breach.json
       first-occurrence marker + record_crash + sys.exit(0)
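
The detector described in item 2 can be sketched as follows. This is not
the code added to brian-override-audit.py. It is a minimal illustration
under stated assumptions: the prefix list is a guess (the session only
confirms that pytest_ and fixture_ were added and _test was dropped), the
"MARKER" check is assumed to apply to the filename, and the function
names are hypothetical.

```python
import os
import shlex

# Assumption: the exact prefix list is not recorded in the session notes.
TEST_PREFIXES = ("test_", "pytest_", "fixture_")
MAX_ENTRIES = 1000  # scan cap from the codex review


def detect_pollution(root: str) -> list:
    """Return sorted names of .jsonl files in `root` that look like fixtures."""
    hits = []
    try:
        with os.scandir(root) as entries:
            for index, entry in enumerate(entries):
                if index >= MAX_ENTRIES:
                    break  # bounded scan: leak detection, not a full audit
                name = entry.name
                if not name.endswith(".jsonl"):
                    continue
                # Assumption: "MARKER" is matched against the filename.
                if name.startswith(TEST_PREFIXES) or "MARKER" in name:
                    hits.append(name)
    except OSError:
        return []  # narrow catch: only filesystem errors are swallowed
    return sorted(hits)


def pollution_message(root: str, names: list) -> str:
    """Build a stderr line with shell-quoted filenames (no auto rm -f)."""
    quoted = ", ".join(shlex.quote(n) for n in names)
    return (f"PROD OVERRIDES DIR POLLUTED: {len(names)} test-fixture "
            f"file(s) in {root}: {quoted}")
```

Quoting with shlex.quote means a hostile filename cannot smuggle shell
syntax into a command an operator copies out of the message.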


PHASE E — Smoke tests (6 cases):
  TEST 1 valid Stop-hook input    → 5/5 hooks rc=0 ✓
  TEST 2 empty stdin              → 5/5 hooks rc=0 (no crash) ✓
  TEST 3 junk (non-JSON) stdin    → 5/5 hooks rc=0 (no crash) ✓
  TEST 4 EXISTING leak detected   →
    rc=0
    stderr: PROD OVERRIDES DIR POLLUTED — 1 test-fixture file(s)
            in /opt/agent/data/agent_runtime/overrides:
            test_t1_1778135072.jsonl. Clean before override audit
            resumes.
    _pollution_breach.json written ✓
  TEST 5 cleanup → re-run         → rc=0, no pollution stderr,
                                     normal session jsonl created ✓
  TEST 6 synthetic BRIAN_TEST_MARKER plant → detected, rc=0,
                                              marker re-written ✓

Result: 6/6 PASS.
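
The stdin-driven cases above can be reproduced with a small harness. A
minimal sketch: run_hook and PAYLOADS are hypothetical names, the "valid"
payload is the one Jonah piped at session start, and the stand-in command
below replaces a real hook path for illustration.

```python
import subprocess
import sys


def run_hook(cmd, stdin_bytes: bytes, timeout: float = 10.0):
    """Feed `stdin_bytes` to a hook command; return (exit code, stderr text)."""
    proc = subprocess.run(cmd, input=stdin_bytes, capture_output=True,
                          timeout=timeout)
    return proc.returncode, proc.stderr.decode("utf-8", errors="replace")


# Payload shapes mirroring the smoke tests: valid, empty, non-JSON.
PAYLOADS = {
    "valid": b'{"transcript_path":"/dev/null","stop_hook_active":false}',
    "empty": b"",
    "junk": b"not json {{{",
}

# Stand-in hook for illustration: reads stdin and exits 0.
STAND_IN = [sys.executable, "-c", "import sys; sys.stdin.read(); sys.exit(0)"]

for label, payload in PAYLOADS.items():
    rc, err = run_hook(STAND_IN, payload)
    print(label, rc)
```

To exercise a real hook, STAND_IN would be swapped for something like
[sys.executable, "/root/.claude/hooks/brian-done-plus-guard.py"].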

PHASE F — Quarantine + cleanup:
  Original leaked file (test_t1_1778135072.jsonl) moved from /tmp to
  /opt/agent/data/agent_runtime/_hook_backups/20260507T191008Z_pre_wrapper/
  as test_t1_quarantined.jsonl for forensics. Smoketest artifact
  (smoketest123.jsonl) removed.
  OVERRIDES_ROOT now clean.
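
For forensics it helps to date the quarantined fixture from its name.
Assuming the numeric suffix in test_t1_1778135072.jsonl is a Unix epoch
(an assumption; the session does not say so), it decodes to May 7 06:24
UTC, matching the time noted in Phase A. The helper name below is
hypothetical.

```python
import re
from datetime import datetime, timezone


def fixture_timestamp(filename: str):
    """Return the UTC datetime encoded in a trailing 10-digit epoch, or None."""
    match = re.search(r"_(\d{10})\.jsonl$", filename)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


print(fixture_timestamp("test_t1_1778135072.jsonl"))
# 2026-05-07 06:24:32+00:00
print(fixture_timestamp("smoketest123.jsonl"))
# None
```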


PHASE G — Revenue pivot:
  ✓ GEO hour armed for 02:00 Beirut tonight (cron 0 23 * * * UTC,
    hard rule confirmed).
  ✓ brianserves.me/api/contact → 200 (X-Probe header working,
    no Jonah ring per feedback_no_probe_alerts_for_lead_form.md).
  ⚠ jonahtebaa.com → 403 on /api/contact, 405 on /contact and /lead.
    Real endpoint moved or Cloudflare-gated. Worth fixing — primary
    inbound surface.
  ❓ webspot.me + pgpro.me → no response on standard paths. Endpoints
    unknown to Brian.
  ❓ /agency daily log shows generic queue routing, no per-platform
    LI/IG/FB success/fail visibility for today's post.
  ❓ Signal sweep storage location not found in social_signals/
    or 20crm/ dirs.

  PROACTIVE RISK: visibility gap on /agency daily post is the bigger
  issue — if today's cross-post silently failed, the "no silent skip"
  hard rule is violated and we wouldn't know.
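
  The GEO-hour schedule in Phase G is pinned in UTC, so the local fire
  time depends on Beirut's offset. A minimal sketch of the arithmetic,
  using fixed offsets as stated assumptions (UTC+3 in summer, UTC+2 in
  winter) rather than a tz database lookup; local_fire_time is a
  hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
EEST = timezone(timedelta(hours=3))  # assumption: Beirut summer time, UTC+3
EET = timezone(timedelta(hours=2))   # assumption: Beirut winter time, UTC+2


def local_fire_time(utc_hour: int, utc_minute: int, day: datetime,
                    tz: timezone) -> datetime:
    """Local wall-clock time at which a daily UTC cron entry fires on `day`."""
    fire_utc = datetime(day.year, day.month, day.day, utc_hour, utc_minute,
                        tzinfo=UTC)
    return fire_utc.astimezone(tz)


# cron "0 23 * * *" in UTC, on the session date
print(local_fire_time(23, 0, datetime(2026, 5, 7), EEST))
# 2026-05-08 02:00:00+03:00

# same entry on a winter date
print(local_fire_time(23, 0, datetime(2026, 12, 7), EET))
# 2026-12-08 01:00:00+02:00
```

  Under these assumptions the entry lands at 02:00 Beirut in May but at
  01:00 once the offset drops to UTC+2, which is worth checking against
  the GEO hard rule before winter.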


TURN 5 — "tell me what we have done so far today"
-------------------------------------------------
Pulled today's full activity record. Earlier today (before /clear):
  • Evidence-gate Day 5-3 — Ed25519 migration steps 1-4 applied.
    HMAC + Ed25519 dual-sign window active. 99/99 tests green.
    Decommission scheduled ~May 14. Touched: rotate_evidence_gate_hmac.py,
    setup_validator_uid_separation.sh, brian_validator.py,
    evidence_gate_calibrate.py.
  • Created monitor_service_status skill — proactive systemd watcher.
  • AUDIT_FINDINGS.md + d1_review postmortems at
    /opt/agent/data/postmortems/260507_*.
  • Hook-crash threshold + calibration-drift tracked in:
    feedback_hook_crash_pattern.md, feedback_hook_crash_test-hook.md,
    feedback_hook_crash_target-bs-test.md, feedback_calibration_drift.md.
  • Pre-/clear preservation captured the 7-item followup queue.

This session (since ~22:00 Beirut):
  • Item #4 sparse-args crash — closed (rc=0 verified).
  • Item #2 PID 2290184 — reclassified as live interactive session.
  • Item #3 5-hook COMMS sweep — clean (no outbound paths exist).
  • Item #5 + hook-crash wrapper bundle — round-tabled, merged,
    applied, 6/6 smoke tests pass, original May 7 leak quarantined.
  • Revenue pivot — partial; GEO armed, brianserves working,
    jonahtebaa.com lead form broken, webspot/pgpro unknown,
    /agency log opaque, signal sweep not found.


STILL OPEN (held per Jonah's "hold this plan" instruction)
----------------------------------------------------------
- Lead-form endpoint discovery on jonahtebaa.com / webspot.me / pgpro.me
- Today's /agency LI+FB+IG publish receipt trace
- Item #1 Ed25519 HMAC decommission (~May 14, scheduled — not due)
- Item #6 Freepik blog-hero fallback finish (half-installed)
- Item #7 patterns docs consolidation


FILES TOUCHED THIS SESSION
--------------------------
/root/.claude/hooks/brian-presend-evidence-gate.py     (wrapper added)
/root/.claude/hooks/brian-done-plus-guard.py           (wrapper added)
/root/.claude/hooks/brian-flat-voice-guard.py          (wrapper added)
/root/.claude/hooks/brian-proactive-stop-guard.py      (wrapper added)
/root/.claude/hooks/brian-action-language-guard.py     (wrapper added)
/root/.claude/hooks/brian-override-audit.py            (pollution detector added)

Backup:
/opt/agent/data/agent_runtime/_hook_backups/20260507T191008Z_pre_wrapper/
  brian-action-language-guard.py
  brian-done-plus-guard.py
  brian-flat-voice-guard.py
  brian-override-audit.py
  brian-presend-evidence-gate.py
  brian-proactive-stop-guard.py
  test_t1_quarantined.jsonl  (the original May 7 leak)

Working files:
/tmp/brian_hook_hardening_design.md   (design doc sent to reviewers)
/tmp/brian_review_prompt.txt          (review prompt template)
/tmp/brian_review_full.txt            (combined doc + prompt)
/tmp/codex_review.txt                 (codex verdict + recommendations)
/tmp/gemini_review.txt                (gemini verdict + recommendations)


VERIFICATION LEDGER
-------------------
[✓] python3 ast.parse on all 6 modified files → no syntax errors
[✓] 5 wrapped hooks rc=0 on valid Stop-hook payload
[✓] 5 wrapped hooks rc=0 on empty stdin
[✓] 5 wrapped hooks rc=0 on non-JSON stdin
[✓] override-audit detects pre-existing real leak, writes breach marker
[✓] override-audit resumes normal operation after leak removed
[✓] override-audit detects synthetic BRIAN_TEST_MARKER plant
[✓] OVERRIDES_ROOT clean at end of session
[✓] Original leak preserved in backup dir for forensics

================================================================================
END OF SESSION TRANSCRIPT
================================================================================