2026-05-03 06:43 (Beirut) (backfill from DOCUMENTATION/)

07 — Observability

Event journal

Append-only NDJSON, one file per UTC day, at observability/events/<YYYY-MM-DD>.ndjson.

Every line is a JSON object with at least ts, actor, event_type. Common event types:

event_type | Emitted by | Fields
--- | --- | ---
probe_run | arg probe, arg probe-all | target, ok, layer, evidence, hint, latency_ms
capability_resolved | arg resolve | cap_id, state, blocking, warnings
policy_block | arg_policy_hook.py | rule_id, decision, tool_name
policy_approval_required | hook | rule_id, tool_name
autofix_run | arg_autofix.py | target, layer, action, idempotency, outcome_ok
autofix_reprobe | arg_autofix.py | target, ok_after_fix
inbox_proposal_added | sub-agents (via inbox/) | proposal_id, category, source_actor
inbox_proposal_accepted | main Brian | proposal_id
inbox_proposal_rejected | main Brian | proposal_id, reason
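To make the journal format concrete, here is a minimal sketch of how a writer could append one event. It is an illustration only: the helper name append_event and its signature are hypothetical, and the real emitters (arg, arg_autofix.py, the policy hook) may be implemented differently. It assumes the caller passes a timezone-aware datetime when overriding now.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(events_dir, actor, event_type, now=None, **fields):
    """Append one event as a single NDJSON line to the current UTC day's file.

    Hypothetical helper: only the path layout and the three required keys
    (ts, actor, event_type) come from the documentation.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        "ts": now.isoformat(timespec="seconds"),
        "actor": actor,
        "event_type": event_type,
        **fields,
    }
    path = Path(events_dir) / f"{now:%Y-%m-%d}.ndjson"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps the file append-only; one JSON object per line.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return path
```

A call such as append_event("observability/events", "main", "probe_run", target="host.hetzner_main", ok=True) would add one line to that day's file and leave earlier lines untouched.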

Retention

/opt/agent/scripts/arg_rotate_events.sh runs on the cron schedule 10 4 * * * (04:10 UTC daily):
- gzip files older than 7 days
- delete files older than 90 days
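The two thresholds can be expressed as a small decision function. This is a sketch, not the contents of arg_rotate_events.sh: it assumes file age is derived from the date in the file name (the real script may use modification time instead), and the function name rotation_action is hypothetical.

```python
from datetime import date

GZIP_AFTER_DAYS = 7     # threshold from the retention policy
DELETE_AFTER_DAYS = 90  # threshold from the retention policy


def rotation_action(filename, today):
    """Return 'delete', 'gzip', or 'keep' for one journal file name.

    Assumption: the age is computed from the YYYY-MM-DD prefix of the name.
    """
    stem = filename.split(".", 1)[0]  # "2026-05-03" from "2026-05-03.ndjson[.gz]"
    age_days = (today - date.fromisoformat(stem)).days
    if age_days > DELETE_AFTER_DAYS:
        return "delete"
    if age_days > GZIP_AFTER_DAYS and not filename.endswith(".gz"):
        return "gzip"
    return "keep"
```

Checking the delete threshold first means a file that was never compressed is still removed once it passes 90 days.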

arg events grep is gzip-aware, so it searches across all archived files transparently.
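For scripts that need the same behaviour outside the CLI, a gzip-aware reader might look like the sketch below. It assumes rotated files keep their name with a .gz suffix appended (for example 2026-04-01.ndjson.gz); the function names iter_events and grep_events are hypothetical and are not part of arg.

```python
import gzip
import json
from pathlib import Path


def iter_events(events_dir):
    """Yield parsed events from plain and gzipped daily files, oldest day first."""
    for path in sorted(Path(events_dir).glob("*.ndjson*")):
        # Assumption: archived files end in ".gz"; everything else is plain text.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)


def grep_events(events_dir, event_type):
    """Return every event of the given type across the whole journal."""
    return [e for e in iter_events(events_dir) if e.get("event_type") == event_type]
```

Sorting by file name is enough to get chronological order because every name starts with the UTC date.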

probe_status.json

Cache of latest probe result per id:

{
  "acc.brian.linkedin": {
    "ok": true,
    "layer": "auth",
    "evidence": "voyager/api/me 200",
    "latency_ms": 597,
    "checked_at": "2026-05-03T03:13:00+00:00"
  },
  "..."
}

Used by:
- arg status (counts come from here)
- arg resolve (atom state lookup)
- arg probe-all --critical (decides what to skip based on freshness)
- min_interval_seconds enforcement

To force-bust the cache, delete the row and re-probe. Only main Brian may do this.
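The freshness check and the force-bust can be sketched as follows. Where min_interval_seconds is configured is not stated here, so the sketch takes it as a plain argument; the function names is_fresh and force_bust are hypothetical, and the re-probe step is left to the caller.

```python
from datetime import datetime, timezone


def is_fresh(row, min_interval_seconds, now=None):
    """Return True if the cached probe result is younger than the minimum interval."""
    now = now or datetime.now(timezone.utc)
    checked_at = datetime.fromisoformat(row["checked_at"])
    return (now - checked_at).total_seconds() < min_interval_seconds


def force_bust(cache, probe_id):
    """Drop one row from the in-memory cache so the next probe cannot be skipped."""
    cache.pop(probe_id, None)
```

A fresh row lets a caller skip the probe; a missing row, as after force_bust, always leads to a new probe.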

autofix_state.json

Flap-detection history. One key per (rid, action) pair:

{
  "history": {
    "acc.brian.zoho_mail::zoho-refresh": [
      "2026-05-02T22:57:11+00:00",
      "2026-05-02T23:15:01+00:00",
      "..."
    ]
  }
}

The last 50 timestamps per key are retained; older ones are trimmed automatically. After 5 fires within 180 minutes, the action stops retrying automatically and escalates.
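The flap rule can be sketched with two small functions over the same state shape. The constants come from the text above; the function names are hypothetical, and whether the real arg_autofix.py checks the limit before or after recording a fire is an assumption (here the check happens before the next attempt).

```python
from datetime import datetime, timedelta, timezone

MAX_HISTORY = 50                # timestamps retained per key
MAX_FIRES = 5                   # fires allowed inside the window
WINDOW = timedelta(minutes=180)


def record_fire(state, rid, action, now):
    """Append one fire for the (rid, action) key and trim to the newest 50."""
    history = state.setdefault("history", {}).setdefault(f"{rid}::{action}", [])
    history.append(now.isoformat(timespec="seconds"))
    del history[:-MAX_HISTORY]  # no-op while the list has 50 entries or fewer


def should_escalate(state, rid, action, now):
    """Return True once 5 or more fires fall inside the last 180 minutes."""
    history = state.get("history", {}).get(f"{rid}::{action}", [])
    recent = [t for t in history if now - datetime.fromisoformat(t) <= WINDOW]
    return len(recent) >= MAX_FIRES
```

Because the window slides, an action that flapped earlier becomes eligible for automatic retries again once enough of its old fires age out.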

Inbox (sub-agent proposals)

observability/inbox/ is the only path sub-agents have to influence the registry. A proposal is a JSON file:

{
  "proposal_id": "prop-260503-001",
  "from_actor": "agent.gsd_planner",
  "category": "capabilities",
  "row": { "id": "cap.example.new_thing", "name": "...", ... },
  "rationale": "observed via cap-id resolved with state=yes 5 times on date X",
  "approval_required": true
}

Because of the single-writer invariant, only main Brian processes the queue, via:

arg inbox list                              # see queue
arg inbox accept prop-260503-001            # promote to canonical
arg inbox reject prop-260503-001 --reason   # drop with explanation

Both accept and reject emit events for audit.

Capability miner

/opt/agent/scripts/arg_capability_miner.py reads the event journal looking for capability_resolved state=yes events whose cap_id is NOT in capabilities.json. For each such cap_id it drops a proposal into the inbox (with approval_required: True).

This is the path by which the registry learns from observed reality rather than hand-curation alone. If Brian successfully invokes a capability often enough that the journal proves it works, the miner surfaces it for canonicalization.
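The core of the mining step can be sketched as a count over journal events. This is not the code of arg_capability_miner.py: the function name mine_candidates and the min_hits threshold are hypothetical (the actual meaning of "often enough" is not stated here), and the sketch leaves proposal_id and from_actor for the caller to fill in.

```python
from collections import Counter


def mine_candidates(events, known_cap_ids, min_hits=3):
    """Build inbox proposal bodies for cap_ids seen in the journal but not registered.

    min_hits is an assumed threshold, not a documented value.
    """
    hits = Counter(
        e["cap_id"]
        for e in events
        if e.get("event_type") == "capability_resolved"
        and e.get("state") == "yes"
        and "cap_id" in e
        and e["cap_id"] not in known_cap_ids
    )
    return [
        {
            "category": "capabilities",
            "row": {"id": cap_id},
            "rationale": f"capability_resolved state=yes observed {n} times",
            "approval_required": True,
        }
        for cap_id, n in sorted(hits.items())
        if n >= min_hits
    ]
```

Every candidate still carries approval_required set to True, so nothing reaches the registry without main Brian accepting it from the inbox.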

SessionStart status injection

/opt/agent/scripts/arg_sessionstart.sh runs on every Claude Code SessionStart and prints to the model's context:

===== ARG | self-knowledge system =====
Index: /root/.claude/system/README.md  |  CLI: /usr/local/bin/arg  |  /ARG skill

Recent events:
  03:12:51  probe_run                 host.hetzner_main
  ...

Status counts: {"verified-fresh":71,"verified-stale":0,"red":13,"unknown":183}
Use 'arg resolve <cap-id>' before non-trivial actions.
====================================================

Belt-and-suspenders: even if MEMORY.md is somehow stripped, the SessionStart injection still surfaces ARG.

Logging out-of-band

For events the journal can't capture (e.g. external alerts about ARG itself), use:

/opt/agent/venv/bin/python3 /opt/agent/scripts/brian_alert.py \
  --error-log --level low --title "ARG XYZ" --text "..."

This routes to TG LOGS (@brian_system_logs_bot), never to COMMS. ARG escalations from autofix already use this path.