04 — Probes, Autofix, Flap Detection


The probe contract

A probe is a script (typically under /opt/agent/scripts/probes/) that prints exactly one JSON line to stdout:

```json
{
  "ok": true,
  "layer": "transport|auth|quota|shape|runtime",
  "evidence": "free-form success blurb",
  "hint": "free-form failure blurb (only when ok=false)",
  "latency_ms": 123
}
```

Exit code is fallback only — the JSON is authoritative. Probes that crash or return malformed JSON are treated as ok=false, layer=shape, hint="probe crashed".
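That normalization rule can be sketched as follows. This is a hypothetical consumer-side harness, not the real CLI's code; the assumption that the authoritative JSON sits on the last non-empty stdout line is mine:

```python
import json

def normalize_probe_output(stdout_text: str) -> dict:
    """Normalize raw probe stdout per the contract: the JSON line is
    authoritative; crashes and malformed JSON collapse to a shape failure."""
    try:
        # Assumption: the probe's JSON is the last non-empty stdout line.
        result = json.loads(stdout_text.strip().splitlines()[-1])
        if not isinstance(result, dict) or "ok" not in result:
            raise ValueError("missing 'ok' field")
        return result
    except (ValueError, IndexError):
        return {"ok": False, "layer": "shape", "hint": "probe crashed"}
```

Note that `json.JSONDecodeError` subclasses `ValueError`, so a single except clause covers both malformed JSON and a missing `ok` field.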

Layers

| Layer | What it checks | Autofix class |
|---|---|---|
| transport | Can I reach the host? (DNS, TCP, TLS) | safe-retry — Tailscale reconnect, DNS flush |
| auth | Are credentials valid? (token, cookie, API key) | at-least-once for refresh; always-ask for re-grant |
| quota | Am I rate-limited or out of budget? | back-off, alert if persistent |
| shape | Does the surface look right? (schema, file existence) | safe-retry — remount, recreate dir |
| runtime | Is the service alive and responding? | safe-retry — systemctl restart |

The probe library helper

/opt/agent/scripts/probes/_lib.sh provides:

```sh
. /opt/agent/scripts/probes/_lib.sh
START=$(now_ms)
KEY=$(read_env API_KEY_NAME)              # reads from /opt/agent/core/.env
# ... do the check ...
probe_emit_ok   "auth" "evidence string"   "$(($(now_ms) - START))"
probe_emit_fail "auth" "hint string"       "$(($(now_ms) - START))"
```

read_env strips surrounding quotes from .env values; never source-eval the file.

Existing probes (23 scripts as of 2026-05-03)

| Script | What it probes |
|---|---|
| agent_pg.sh | PostgreSQL reachability |
| agency_pipeline.sh | /agency skill/cmd file presence |
| arg_self.sh | arg validate clean |
| board_local.sh | board.jonahtebaa.com backend |
| brian_runtime.sh | agent-api /health |
| composio.sh | Composio API (Brian-Gmail, Calendar, Drive path) |
| disabled_marker.sh | Always-ok marker for intentionally-disabled assets |
| fb_page.sh | Meta page token validity |
| file_check.sh | Generic existence probe (file_check.sh <path> [<min_size>]) |
| gh_user.sh | GitHub PAT |
| ig_brian.sh | Brian's Instagram via Graph |
| journal_alive.sh | systemd journal activity (last 5 min) |
| li_brian.sh | LinkedIn voyager/api/me with Brian's cookies |
| reddit_brian.sh | Reddit OAuth |
| site_check.sh | Generic HTTP probe (site_check.sh <url>) |
| systemd_unit.sh | systemctl is-active <unit> |
| tg_bot_comms.sh, tg_bot_logs.sh, tg_bot_hermes.sh | TG bot tokens via getMe |
| twenty_crm.sh | 20CRM container + healthz |
| voice_gateway.sh | Gemini Live + Puck on :8102 |
| wa_evolution.sh | WhatsApp via Evolution container |
| zoho_refresh.sh | Zoho refresh-token flow (used by autofix) |

Rate limiting (min_interval_seconds)

Some upstreams flag rapid identical probes as bot traffic. Currently rate-limited:

| Atom(s) | Interval | Reason |
|---|---|---|
| acc.brian.linkedin / key.li_at / key.li_jsessionid | 3600s | LinkedIn flags rapid /voyager/api/me calls against the session |
| acc.brian.fb_page / acc.brian.ig / key.meta_page_token / key.ig_graph_token | 3600s | Meta app-level rate limit (#4 — added 2026-05-03 after a probe storm tripped it) |

The CLI honors min_interval_seconds via observability/probe_status.json. Force-bypass = delete the row from probe_status.json (only main Brian).
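A sketch of the interval check. The per-atom record shape (a `last_run_ts` epoch-seconds field) is an assumption, not the real probe_status.json schema:

```python
import json
import time

def should_probe(atom: str, min_interval_s: int,
                 status_path: str = "observability/probe_status.json") -> bool:
    """True when the atom's min_interval_seconds has elapsed since the
    last recorded run. A missing row means run now, which is why deleting
    the row from probe_status.json acts as the force-bypass."""
    try:
        with open(status_path) as fh:
            last = json.load(fh)[atom]["last_run_ts"]  # assumed field name
    except (OSError, KeyError, ValueError):
        return True
    return time.time() - last >= min_interval_s
```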

Autofix matrix

/opt/agent/scripts/arg_autofix.py runs at :15,:45 via cron.

SAFE_FIXES (auto-applied)

| Target | Layer | Action | Idempotency |
|---|---|---|---|
| host.hermes_gateway | runtime | systemctl restart hermes-gateway | safe-retry |
| host.brian_board | runtime | systemctl restart brian-board | safe-retry |
| data.agent_redis | transport | docker restart agent-redis | safe-retry |
| sub.brightdata | shape | /opt/agent/scripts/heal_npx_cache.sh | safe-retry |
| data.drive_shared | transport | systemctl restart rclone-shared | safe-retry |
| acc.brian.zoho_mail | auth | zoho_refresh.sh | at-least-once |
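Internally this presumably amounts to a lookup from (target, layer) to a command. A minimal sketch; the dict shape and dispatch are guesses about arg_autofix.py's structure, not its actual code:

```python
import subprocess

# Illustrative subset of the SAFE_FIXES table above.
SAFE_FIXES: dict[tuple[str, str], list[str]] = {
    ("host.hermes_gateway", "runtime"): ["systemctl", "restart", "hermes-gateway"],
    ("data.agent_redis", "transport"): ["docker", "restart", "agent-redis"],
    ("sub.brightdata", "shape"): ["/opt/agent/scripts/heal_npx_cache.sh"],
}

def apply_safe_fix(target: str, layer: str) -> bool:
    """Run the mapped fix; a miss means the failure is not auto-fixable
    and should go down the ALWAYS_ASK / escalation path instead."""
    cmd = SAFE_FIXES.get((target, layer))
    if cmd is None:
        return False
    return subprocess.run(cmd).returncode == 0
```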

ALWAYS_ASK (escalate, never auto-act)

Flap detection

If the same (rid, action) pair fires ≥5 times within 180 minutes without sticking, autofix stops auto-trying and escalates. State persisted at observability/autofix_state.json. Verified working on acc.brian.zoho_mail::zoho-refresh during 2026-05-02 smoke test.

Idempotency-aware: even at-least-once actions trip flap detection — running a refresh-token refresh 100 times is technically safe but tells you something fundamental is broken (e.g. the refresh token itself is gone). Better to escalate.

Capability miner

/opt/agent/scripts/arg_capability_miner.py scans the event journal for capability_resolved events on cap-ids NOT yet in capabilities.json. Drops proposals into observability/inbox/ with approval_required: True. Main Brian processes via arg inbox accept|reject.

The miner exists so the registry grows from observed reality, not just hand-curation.
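In outline, the scan looks like this. Event and field names follow the description above, but the journal's actual schema is not shown here:

```python
def mine_capabilities(journal_events: list[dict], known_cap_ids: set[str]) -> list[dict]:
    """Collect proposals for cap-ids seen in capability_resolved events
    but absent from capabilities.json. Every proposal requires approval."""
    proposals, seen = [], set()
    for ev in journal_events:
        if ev.get("event") != "capability_resolved":
            continue
        cap_id = ev.get("cap_id")
        if cap_id and cap_id not in known_cap_ids and cap_id not in seen:
            seen.add(cap_id)  # de-dupe: one proposal per unseen cap-id
            proposals.append({"cap_id": cap_id, "approval_required": True})
    return proposals
```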