04 — Probes, Autofix, Flap Detection


The probe contract

A probe is a script (typically under /opt/agent/scripts/probes/) that prints exactly one JSON line to stdout:

```json
{
  "ok": true,
  "layer": "transport|auth|quota|shape|runtime",
  "evidence": "free-form success blurb",
  "hint": "free-form failure blurb (only when ok=false)",
  "latency_ms": 123
}
```

Exit code is fallback only — the JSON is authoritative. Probes that crash or return malformed JSON are treated as ok=false, layer=shape, hint="probe crashed".
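That normalization rule can be sketched as follows. This is a hypothetical consumer-side harness, not the real CLI's code; the assumption that the authoritative JSON sits on the last non-empty stdout line is mine:

```python
import json

def normalize_probe_output(stdout_text: str) -> dict:
    """Normalize raw probe stdout per the contract: the JSON line is
    authoritative; crashes and malformed JSON collapse to a shape failure."""
    try:
        # Assumption: the probe's JSON is the last non-empty stdout line.
        result = json.loads(stdout_text.strip().splitlines()[-1])
        if not isinstance(result, dict) or "ok" not in result:
            raise ValueError("missing 'ok' field")
        return result
    except (ValueError, IndexError):
        return {"ok": False, "layer": "shape", "hint": "probe crashed"}
```

Note that `json.JSONDecodeError` subclasses `ValueError`, so a single except clause covers both malformed JSON and a missing `ok` field.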

Layers

| Layer | What it checks | Autofix class |
|---|---|---|
| transport | Can I reach the host? (DNS, TCP, TLS) | safe-retry — Tailscale reconnect, DNS flush |
| auth | Are credentials valid? (token, cookie, API key) | at-least-once for refresh; always-ask for re-grant |
| quota | Am I rate-limited or out of budget? | back-off, alert if persistent |
| shape | Does the surface look right? (schema, file existence) | safe-retry — remount, recreate dir |
| runtime | Is the service alive and responding? | safe-retry — systemctl restart |

The probe library helper

/opt/agent/scripts/probes/_lib.sh provides:

```sh
. /opt/agent/scripts/probes/_lib.sh
START=$(now_ms)
KEY=$(read_env API_KEY_NAME)              # reads from /opt/agent/core/.env
# ... do the check ...
probe_emit_ok   "auth" "evidence string"   "$(($(now_ms) - START))"
probe_emit_fail "auth" "hint string"       "$(($(now_ms) - START))"
```

read_env strips surrounding quotes from .env values; never source-eval the file.

Existing probes (23 scripts as of 2026-05-03)

| Script | What it probes |
|---|---|
| agent_pg.sh | PostgreSQL reachability |
| agency_pipeline.sh | /agency skill/cmd file presence |
| arg_self.sh | arg validate clean |
| board_local.sh | board.jonahtebaa.com backend |
| brian_runtime.sh | agent-api /health |
| composio.sh | Composio API (Brian-Gmail, Calendar, Drive path) |
| disabled_marker.sh | Always-ok marker for intentionally-disabled assets |
| fb_page.sh | Meta page token validity |
| file_check.sh | Generic existence probe (file_check.sh <path> [<min_size>]) |
| gh_user.sh | GitHub PAT |
| ig_brian.sh | Brian's Instagram via Graph |
| journal_alive.sh | systemd journal activity (last 5 min) |
| li_brian.sh | LinkedIn voyager/api/me with Brian's cookies |
| reddit_brian.sh | Reddit OAuth |
| site_check.sh | Generic HTTP probe (site_check.sh <url>) |
| systemd_unit.sh | systemctl is-active <unit> |
| tg_bot_comms.sh, tg_bot_logs.sh, tg_bot_hermes.sh | TG bot tokens via getMe |
| twenty_crm.sh | 20CRM container + healthz |
| voice_gateway.sh | Gemini Live + Puck on :8102 |
| wa_evolution.sh | WhatsApp via Evolution container |
| zoho_refresh.sh | Zoho refresh-token flow (used by autofix) |

Rate limiting (min_interval_seconds)

Some upstreams flag rapid identical probes as bot traffic. Currently rate-limited:

| Atom(s) | Interval | Reason |
|---|---|---|
| acc.brian.linkedin / key.li_at / key.li_jsessionid | 3600s | LinkedIn flags rapid /voyager/api/me calls against the session |
| acc.brian.fb_page / acc.brian.ig / key.meta_page_token / key.ig_graph_token | 3600s | Meta app-level rate limit (#4 — added 2026-05-03 after a probe storm tripped it) |

The CLI honors min_interval_seconds via observability/probe_status.json. Force-bypass = delete the row from probe_status.json (only main Brian).
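A sketch of the interval check. The per-atom record shape (a `last_run_ts` epoch-seconds field) is an assumption, not the real probe_status.json schema:

```python
import json
import time

def should_probe(atom: str, min_interval_s: int,
                 status_path: str = "observability/probe_status.json") -> bool:
    """True when the atom's min_interval_seconds has elapsed since the
    last recorded run. A missing row means run now, which is why deleting
    the row from probe_status.json acts as the force-bypass."""
    try:
        with open(status_path) as fh:
            last = json.load(fh)[atom]["last_run_ts"]  # assumed field name
    except (OSError, KeyError, ValueError):
        return True
    return time.time() - last >= min_interval_s
```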

Autofix matrix

/opt/agent/scripts/arg_autofix.py runs at :15,:45 via cron.

SAFE_FIXES (auto-applied)

| Target | Layer | Action | Idempotency |
|---|---|---|---|
| host.hermes_gateway | runtime | systemctl restart hermes-gateway | safe-retry |
| host.brian_board | runtime | systemctl restart brian-board | safe-retry |
| data.agent_redis | transport | docker restart agent-redis | safe-retry |
| sub.brightdata | shape | /opt/agent/scripts/heal_npx_cache.sh | safe-retry |
| data.drive_shared | transport | systemctl restart rclone-shared | safe-retry |
| acc.brian.zoho_mail | auth | zoho_refresh.sh | at-least-once |
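Internally this presumably amounts to a lookup from (target, layer) to a command. A minimal sketch; the dict shape and dispatch are guesses about arg_autofix.py's structure, not its actual code:

```python
import subprocess

# Illustrative subset of the SAFE_FIXES table above.
SAFE_FIXES: dict[tuple[str, str], list[str]] = {
    ("host.hermes_gateway", "runtime"): ["systemctl", "restart", "hermes-gateway"],
    ("data.agent_redis", "transport"): ["docker", "restart", "agent-redis"],
    ("sub.brightdata", "shape"): ["/opt/agent/scripts/heal_npx_cache.sh"],
}

def apply_safe_fix(target: str, layer: str) -> bool:
    """Run the mapped fix; a miss means the failure is not auto-fixable
    and should go down the ALWAYS_ASK / escalation path instead."""
    cmd = SAFE_FIXES.get((target, layer))
    if cmd is None:
        return False
    return subprocess.run(cmd).returncode == 0
```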

ALWAYS_ASK (escalate, never auto-act)

Flap detection

If the same (rid, action) pair fires ≥5 times within 180 minutes without sticking, autofix stops auto-trying and escalates. State persisted at observability/autofix_state.json. Verified working on acc.brian.zoho_mail::zoho-refresh during 2026-05-02 smoke test.

Idempotency-aware: even at-least-once actions trip flap detection — running a refresh-token refresh 100 times is technically safe but tells you something fundamental is broken (e.g. the refresh token itself is gone). Better to escalate.

Capability miner

/opt/agent/scripts/arg_capability_miner.py scans the event journal for capability_resolved events on cap-ids NOT yet in capabilities.json. Drops proposals into observability/inbox/ with approval_required: True. Main Brian processes via arg inbox accept|reject.

The miner exists so the registry grows from observed reality, not just hand-curation.
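In outline, the scan looks like this. Event and field names follow the description above, but the journal's actual schema is not shown here:

```python
def mine_capabilities(journal_events: list[dict], known_cap_ids: set[str]) -> list[dict]:
    """Collect proposals for cap-ids seen in capability_resolved events
    but absent from capabilities.json. Every proposal requires approval."""
    proposals, seen = [], set()
    for ev in journal_events:
        if ev.get("event") != "capability_resolved":
            continue
        cap_id = ev.get("cap_id")
        if cap_id and cap_id not in known_cap_ids and cap_id not in seen:
            seen.add(cap_id)  # de-dupe: one proposal per unseen cap-id
            proposals.append({"cap_id": cap_id, "approval_required": True})
    return proposals
```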