A probe is a script (typically under /opt/agent/scripts/probes/) that prints exactly one JSON line to stdout:

    {
      "ok": true,
      "layer": "transport|auth|quota|shape|runtime",
      "evidence": "free-form success blurb",
      "hint": "free-form failure blurb (only when ok=false)",
      "latency_ms": 123
    }

The exit code is only a fallback; the JSON is authoritative. Probes that crash or return malformed JSON are treated as ok=false, layer=shape, hint="probe crashed".
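
The crash and malformed-output rule can be illustrated from the consumer's side. The following is a minimal sketch, not the actual CLI code: the function name `parse_probe_output`, the constants, and the strictness of the validation (exactly one non-empty line, boolean `ok`, known `layer`) are assumptions made for illustration. It does not model the exit-code fallback.

```python
import json

# Layer names taken from the probe contract above.
VALID_LAYERS = {"transport", "auth", "quota", "shape", "runtime"}

# Result substituted when a probe crashes or prints malformed JSON.
CRASHED = {"ok": False, "layer": "shape", "hint": "probe crashed"}


def parse_probe_output(stdout: str) -> dict:
    """Return the probe result, or the crash fallback if stdout is unusable.

    Hypothetical helper: the validation rules here are an assumption,
    not the documented behavior of the real CLI.
    """
    lines = [ln for ln in stdout.splitlines() if ln.strip()]
    if len(lines) != 1:
        return dict(CRASHED)
    try:
        result = json.loads(lines[0])
    except json.JSONDecodeError:
        return dict(CRASHED)
    if not isinstance(result, dict):
        return dict(CRASHED)
    layer = result.get("layer")
    if not isinstance(result.get("ok"), bool):
        return dict(CRASHED)
    if not isinstance(layer, str) or layer not in VALID_LAYERS:
        return dict(CRASHED)
    return result
```

A traceback, empty output, or an unknown layer all collapse to the same fallback, so downstream code only ever sees a well-formed result.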

| Layer | What it checks | Autofix class |
|---|---|---|
| transport | Can I reach the host? (DNS, TCP, TLS) | safe-retry: Tailscale reconnect, DNS flush |
| auth | Are credentials valid? (token, cookie, API key) | at-least-once for refresh; always-ask for re-grant |
| quota | Am I rate-limited or out of budget? | back-off, alert if persistent |
| shape | Does the surface look right? (schema, file existence) | safe-retry: remount, recreate dir |
| runtime | Is the service alive and responding? | safe-retry: systemctl restart |

/opt/agent/scripts/probes/_lib.sh provides:

    # Load the shared probe helpers.
    . /opt/agent/scripts/probes/_lib.sh
    START=$(now_ms)
    KEY=$(read_env API_KEY_NAME)  # reads from /opt/agent/core/.env
    # ... do the check ...
    # Emit exactly one result line: ok on success, fail otherwise.
    probe_emit_ok "auth" "evidence string" "$(($(now_ms) - START))"
    probe_emit_fail "auth" "hint string" "$(($(now_ms) - START))"

read_env strips surrounding quotes from .env values. Never source or eval the file directly.
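
To show why parsing is preferred over sourcing, here is a rough Python illustration of the same idea: look up one key as plain text and strip a single pair of surrounding quotes, without ever evaluating the value. This is not the implementation in _lib.sh; the function name `parse_env_value` and its handling of comments and blank lines are assumptions.

```python
def parse_env_value(text: str, key: str):
    """Look up `key` in .env-style text and strip one pair of surrounding quotes.

    Hypothetical helper: values are treated as inert text and never
    executed, which is the point of not sourcing the file.
    """
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() != key:
            continue
        value = value.strip()
        # Strip matching single or double quotes around the whole value.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        return value
    return None
```

A value such as `$(rm -rf /)` comes back as a literal string here, whereas sourcing the file in a shell would execute it.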

| Script | What it probes |
|---|---|
| agent_pg.sh | PostgreSQL reachability |
| agency_pipeline.sh | /agency skill/cmd file presence |
| arg_self.sh | arg validate clean |
| board_local.sh | board.jonahtebaa.com backend |
| brian_runtime.sh | agent-api /health |
| composio.sh | Composio API (Brian-Gmail, Calendar, Drive path) |
| disabled_marker.sh | Always-ok marker for intentionally-disabled assets |
| fb_page.sh | Meta page token validity |
| file_check.sh | Generic existence probe (file_check.sh <path> [<min_size>]) |
| gh_user.sh | GitHub PAT |
| ig_brian.sh | Brian's Instagram via Graph |
| journal_alive.sh | systemd journal activity (last 5 min) |
| li_brian.sh | LinkedIn voyager/api/me with Brian's cookies |
| reddit_brian.sh | Reddit OAuth |
| site_check.sh | Generic HTTP probe (site_check.sh <url>) |
| systemd_unit.sh | systemctl is-active <unit> |
| tg_bot_comms.sh, tg_bot_logs.sh, tg_bot_hermes.sh | TG bot tokens via getMe |
| twenty_crm.sh | 20CRM container + healthz |
| voice_gateway.sh | Gemini Live + Puck on :8102 |
| wa_evolution.sh | WhatsApp via Evolution container |
| zoho_refresh.sh | Zoho refresh-token flow (used by autofix) |

Rate limiting (min_interval_seconds): some upstreams flag rapid identical probes as bot traffic. Currently rate-limited:

| Atom | Interval | Reason |
|---|---|---|
| acc.brian.linkedin / key.li_at / key.li_jsessionid | 3600s | LI session-flags rapid /voyager/api/me |
| acc.brian.fb_page / acc.brian.ig / key.meta_page_token / key.ig_graph_token | 3600s | Meta app-level rate limit (#4, added 2026-05-03 after probe storm tripped it) |

The CLI honors min_interval_seconds via observability/probe_status.json. To force a bypass, delete the atom's row from probe_status.json (main Brian only).
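
The interval check can be sketched as follows. The layout of probe_status.json is not described here, so this sketch assumes a mapping from atom id to a row with a hypothetical `last_run` field holding epoch seconds; the function name `should_probe` is likewise illustrative.

```python
import time


def should_probe(status: dict, atom: str, min_interval_seconds: int, now=None) -> bool:
    """Decide whether a rate-limited probe may run again.

    `status` stands in for the parsed contents of probe_status.json.
    The `last_run` field (epoch seconds) is an assumed schema.
    """
    now = time.time() if now is None else now
    row = status.get(atom)
    if row is None:
        # No row recorded: nothing to throttle against, so the probe runs.
        return True
    return now - row["last_run"] >= min_interval_seconds
```

Under this assumed layout, deleting an atom's row makes the next call return True, which is consistent with the force-bypass procedure described above.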
/opt/agent/scripts/arg_autofix.py runs via cron at :15 and :45 past each hour.

| Target | Action | Idempotency |
|---|---|---|
| host.hermes_gateway runtime | systemctl restart hermes-gateway | safe-retry |
| host.brian_board runtime | systemctl restart brian-board | safe-retry |
| data.agent_redis transport | docker restart agent-redis | safe-retry |
| sub.brightdata shape | /opt/agent/scripts/heal_npx_cache.sh | safe-retry |
| data.drive_shared transport | systemctl restart rclone-shared | safe-retry |
| acc.brian.zoho_mail auth | zoho_refresh.sh | at-least-once |

Targets that require manual action (not auto-fixed):

- acc.brian.linkedin (cookie expiry → Mac browser)
- acc.brian.fb_page (OAuth re-issue → Graph API Explorer)
- acc.brian.ig
- acc.brian.composio (API key regen)
- acc.brian.stripe (money-sensitive)
- data.agent_pg (Postgres restart drops connections for 12+ apps)

If the same (rid, action) pair fires ≥5 times within 180 minutes without sticking, autofix stops retrying and escalates. State is persisted at observability/autofix_state.json. This was verified working on acc.brian.zoho_mail::zoho-refresh during the 2026-05-02 smoke test.
Flap detection applies regardless of idempotency class: even at-least-once actions trip it. Running a token refresh 100 times is technically safe, but it signals that something fundamental is broken (e.g. the refresh token itself is gone), so escalating is the better response.
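
The flap rule (≥5 firings of the same pair within 180 minutes) can be sketched as a sliding-window counter. This is an illustration under assumptions, not the code in arg_autofix.py: the state layout (a list of timestamps per `rid::action` key) and the function name are hypothetical, and the sketch does not model a successful fix clearing the history.

```python
FLAP_THRESHOLD = 5                 # firings before escalation
FLAP_WINDOW_SECONDS = 180 * 60     # 180-minute window


def record_and_check_flap(state: dict, rid: str, action: str, now: float) -> bool:
    """Record one firing of (rid, action); return True if autofix should escalate.

    `state` stands in for the parsed contents of autofix_state.json.
    Storing a list of epoch timestamps per key is an assumed schema.
    """
    key = f"{rid}::{action}"
    # Keep only firings that are still inside the window, then add this one.
    recent = [t for t in state.get(key, []) if now - t < FLAP_WINDOW_SECONDS]
    recent.append(now)
    state[key] = recent
    return len(recent) >= FLAP_THRESHOLD
```

Because the check counts firings and ignores the idempotency class, a safe-retry restart and an at-least-once token refresh are escalated by the same rule.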
/opt/agent/scripts/arg_capability_miner.py scans the event journal for capability_resolved events whose cap-ids are not yet in capabilities.json. It drops proposals into observability/inbox/ with approval_required: True, and main Brian processes them via arg inbox accept|reject.
The miner exists so the registry grows from observed reality, not just hand-curation.
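
The mining step can be sketched as a filter over journal events. The event and proposal schemas are not given here, so the field names `type`, `cap_id`, and `seen` below are assumptions, as is the function name; only `capability_resolved` and `approval_required: True` come from the description above.

```python
def mine_proposals(events: list, known_cap_ids: set) -> list:
    """Build one proposal per unknown cap-id seen in capability_resolved events.

    `events` stands in for parsed journal entries and `known_cap_ids`
    for the ids already present in capabilities.json. Field names other
    than `approval_required` are hypothetical.
    """
    proposals = {}
    for event in events:
        if event.get("type") != "capability_resolved":
            continue
        cap_id = event.get("cap_id")
        if not cap_id or cap_id in known_cap_ids:
            continue
        entry = proposals.setdefault(
            cap_id, {"cap_id": cap_id, "approval_required": True, "seen": 0}
        )
        entry["seen"] += 1
    # Sorted output keeps repeated runs deterministic.
    return [proposals[k] for k in sorted(proposals)]
```

Every proposal carries approval_required: True, so nothing reaches the registry without the accept or reject step.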