2026-05-12 14:23 (Beirut). Proposal Agent — teammate handoff: setup, run guide, what to ask Jonah for, current state, what to work on. Share this URL with the teammate alongside the tarball.

Proposal Agent — Teammate Handoff


From: Jonah Tebaa (Webspot)
Repo state: working, proven on one client (ABC Store, 4 rounds)
Your mission: run it on your own machine, continue teaching it from the proposals folder, and push it forward


TL;DR

You'll be running a Python pipeline that:
1. Ingests a Drive folder of past Webspot proposal PDFs (251 of them) into a local Qdrant vector DB
2. Searches that corpus — given a brief, returns the top 5 closest past proposals + top 12 most relevant sections
3. Generates new proposal PDFs by copying a master Google Slides deck, replacing text frames per client, and exporting to PDF
4. Audits each generated round — flags off-brief content, banned phrases, visual issues

Zero paid model calls anywhere. Embeddings are local (bge-m3 via sentence-transformers). Section classification is regex. There's no LLM in the generation path today — that's actually the biggest open improvement (see "What to work on" below).


What you'll receive from Jonah

| Item | How | Notes |
| --- | --- | --- |
| Source tarball proposal-agent.tar.gz (13 MB) | Direct file send | Contains code + 251 canonical JSONs + all ABC Store rounds + master deck refs + images. No raw PDFs (you'll re-ingest those). |
| Drive folder access — WEBSPOT \| PROPOSALS | Drive share | Folder ID: 1y0jWe8aNTVMGWihwMhAk8oQLf7UquRQA. You'll get Viewer on your service account's email. |
| Master Slides deck access | Drive share | Slides ID: 1PlMaMn2sOAkqy1GJNnsj292zShEolrCgHLezCVopSFA (WEBSPOT_PROPOSAL_MASTER_v1). You'll get Editor. |
| Source PPTX (optional, if you want to rebuild the master from scratch) | Drive share | Copy of WS Proposal Template.pptx in BRIAN SHARED/PROPOSALS |

What you need to send Jonah so he can grant access:
- Your Google account email
- Your service account email (created in step 3 below, format: xxx@yyy.iam.gserviceaccount.com)


What's inside the tarball

When you extract proposal-agent.tar.gz:

proposal-agent-fullstate/
├── cli.py                          # main CLI: ingest / search / report / notify
├── ingest/                         # Drive walker, PDF parser, section classifier, canonical builder
├── rag/                            # bge-m3 embedder, Qdrant client, retriever
├── scripts/
│   ├── 01_analyze_master.py        # Deep PPTX analysis → master_analysis.json
│   ├── 03_build_master.py          # (superseded) original master builder
│   ├── 03b_build_master_via_rclone_token.py   # current master builder (uses rclone OAuth)
│   ├── 04_visual_fidelity.py       # PPTX-PDF vs Slides-PDF perceptual diff
│   ├── 04b_visual_fidelity_streaming.py       # streaming variant
│   ├── 04c_visual_fidelity_fast.py            # fast batch variant
│   ├── 05_generate_abc_store.py    # v1 generator (anchor-only, superseded)
│   ├── 06_generate_abc_store_v2.py # CURRENT generator (full text-frame replace, round-aware)
│   └── 07_audit_abc_store.py       # per-round audit + PNG render
├── data/
│   ├── canonical/                  # 251 ingested proposals as JSON (the distilled learning)
│   ├── generated_drafts/           # ABC Store: all 4 rounds (PDFs, manifests, audit PNGs)
│   ├── abc_store_brief_intent.md   # the spec ABC Store generator was audited against
│   ├── master_build.json           # master Slides ID + every anchor occurrence
│   ├── master_analysis.json        # deep PPTX analysis (slides, runs, fonts, colors, positions)
│   ├── page_library.yaml           # routing config: which layouts always include, which gate
│   ├── layout_mapping.yaml         # layout-tag → swap-anchor map
│   ├── visual_fidelity.json        # perceptual-hash drift results
│   └── INGESTION_REPORT.md         # corpus stats + confidence flags
├── images/                         # 50+ slide rasters from the master deck
├── docker-compose.yml              # local Qdrant
├── requirements.txt                # Python deps
├── .env.example                    # all env vars documented
├── .gitignore
├── HANDOFF.md                      # short notes
└── README.md                       # original README

Not included (recreate as needed):
- data/pdf_cache/ — 1.2 GB of raw PDFs from Drive. Recreate with cli.py ingest once you have folder access.
- venv/, __pycache__/, logs/, working/, .env — runtime/secret.


Setup — first time on your machine

1. Extract + Python env

mkdir -p /opt/agent && cd /opt/agent
tar xzf ~/Downloads/proposal-agent.tar.gz
mv proposal-agent-fullstate webspot_proposal_agent   # paths in scripts assume this name
cd webspot_proposal_agent
python3 -m venv venv
./venv/bin/pip install -r requirements.txt

Python 3.11+ recommended. If you don't want it at /opt/agent/, you can put it anywhere — but you'll need to grep for /opt/agent/webspot_proposal_agent in scripts/ and adjust those paths, OR symlink:

sudo mkdir -p /opt/agent
sudo ln -s /your/actual/path /opt/agent/webspot_proposal_agent

The CLI itself (cli.py) is env-driven and works from any path. Only the scripts/ files have hardcoded paths.

2. Local Qdrant

docker compose up -d

Brings up Qdrant on 127.0.0.1:6333. Verify: curl http://localhost:6333/healthz should return "healthz check passed".

3. Google Cloud — service account

Used for headless Slides API editing.

  1. Go to https://console.cloud.google.com — create a project (or reuse one)
  2. Enable APIs: Google Drive API, Google Slides API
  3. Create a service account: IAM → Service Accounts → Create
  4. Give it a name like proposal-agent, no roles needed (Drive permissions are granted via Drive sharing, not IAM)
  5. Create a JSON key, download it → save to ~/secrets/proposal-agent-sa.json (or wherever, just outside the repo)
  6. Send Jonah the SA's email (it'll look like proposal-agent@your-project-id.iam.gserviceaccount.com)

Jonah will then share:
- WEBSPOT | PROPOSALS folder with your SA email as Viewer
- Master Slides deck with your SA email as Editor

4. rclone — user OAuth for uploads the SA can't do

The service account has no Drive quota and is only Viewer/Editor on shared folders — it can't upload/convert PPTX → Slides into folders it doesn't own. rclone gives you a user OAuth token that runs as a real Drive account.

# install rclone if you don't have it
curl https://rclone.org/install.sh | sudo bash

# configure a "gdrive" remote authenticated as YOU
rclone config
# → n (new remote)
# → name: gdrive
# → storage: drive (Google Drive)
# → client_id: <blank> (uses rclone's public default)
# → client_secret: <blank>
# → scope: 1 (full access)
# → root_folder_id: <blank>
# → service_account_file: <blank>
# → edit advanced config: n
# → use auto config: y (opens browser)
# → configure as team drive: n
# → confirm: y
# → quit: q

Verify: rclone lsd gdrive: should list your Drive root.

The scripts read the token from ~/.config/rclone/rclone.conf by default.
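If you need to reuse that token in your own tooling, it can be read straight out of rclone.conf — the file is INI-style and the token key holds a JSON blob. A minimal sketch (the helper name load_rclone_token is hypothetical; the shipped scripts have their own loader):

```python
import configparser
import json
from pathlib import Path

def load_rclone_token(remote: str = "gdrive",
                      conf_path: Path = Path.home() / ".config/rclone/rclone.conf") -> dict:
    """Parse the OAuth token rclone stored for a remote.

    The `token` value is a JSON object with access_token, refresh_token,
    token_type, and expiry.
    """
    # interpolation=None so '%' in values can't trip configparser
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read(conf_path)
    if remote not in cfg:
        raise KeyError(f"remote [{remote}] not found in {conf_path}")
    return json.loads(cfg[remote]["token"])
```

Note the refresh_token is what lets headless scripts mint fresh access tokens without a browser.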

5. .env

cp .env.example .env

Edit .env:
- GCP_SERVICE_ACCOUNT_JSON=/absolute/path/to/your-sa-key.json
- Leave the rclone defaults as-is — they're rclone's public OAuth client (open-source, works for everyone)
- Leave Drive IDs as Jonah's defaults — those are what Jonah is sharing with you
- MAX_WORKERS — drop to 4 if you're on a laptop, default 8 is for a 16-vCPU server

Load it in your shell:

set -a; source .env; set +a

(Or use direnv / python-dotenv. The CLI auto-loads .env if you run it from the repo root.)


First runs — verify it works

Re-embed, then smoke test the search

The 251 canonical JSONs ship in the tarball, but Qdrant starts empty, so any search returns nothing until you re-embed. Run a quick re-embed first:

./venv/bin/python cli.py ingest --limit 5

This pulls 5 raw PDFs from Drive (verifies folder access), parses them, re-embeds, and upserts to Qdrant. Now smoke test the search:

./venv/bin/python cli.py search "AI agent for hospitality"

You should get 5 proposals + 12 sections. If yes, the ingestion pipeline is healthy and you're ready. Then run the full ingest:

./venv/bin/python cli.py ingest

~15-30 minutes depending on machine. Idempotent — safe to re-run.

Generate the ABC Store proposal as a smoke test of the generator

This regenerates Jonah's already-done ABC Store work, end-to-end:

./venv/bin/python scripts/06_generate_abc_store_v2.py 5

(Pass any round number that doesn't collide with the existing 4. Round 5 produces a fresh PDF you can compare against the existing data/generated_drafts/2026-05-10_abc_store_proposal_r4.pdf.)

If it produces a clean PDF matching the existing r4, the generator + master deck access + SA + rclone are all wired correctly.


How to use the agent

cli.py ingest

Walks Drive, parses PDFs (PyMuPDF + OCR fallback), classifies pages by section (cover/scope/pricing/...), builds canonical JSON, embeds locally with bge-m3, upserts to Qdrant. Idempotent.

./venv/bin/python cli.py ingest                    # full
./venv/bin/python cli.py ingest --limit 10         # smoke test

Writes:
- data/pdf_cache/ — raw PDFs mirrored from Drive
- data/canonical/ — one JSON per proposal
- data/INGESTION_REPORT.md — stats
- Qdrant collections webspot_proposal_summaries, _sections, _blocks
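The section classification step is plain regex. A minimal sketch of the idea — these keyword patterns are illustrative, the real list lives in ingest/:

```python
import re

# Illustrative keyword patterns — the actual classifier in ingest/ has its own list.
SECTION_PATTERNS = {
    "pricing":  re.compile(r"\b(pricing|investment|cost breakdown|payment terms)\b", re.I),
    "scope":    re.compile(r"\b(scope of work|deliverables|what we will do)\b", re.I),
    "timeline": re.compile(r"\b(timeline|milestones|project phases|week \d+)\b", re.I),
    "cover":    re.compile(r"\b(prepared for|proposal for|presented to)\b", re.I),
}

def classify_page(text: str) -> str:
    """Return the first section whose pattern matches the page text, else 'other'."""
    for section, pattern in SECTION_PATTERNS.items():
        if pattern.search(text):
            return section
    return "other"
```

First-match-wins ordering means ambiguous pages resolve in dict order; the upside of regex over an LLM here is it's free, fast, and deterministic across 251 PDFs.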

./venv/bin/python cli.py search "AI customer service agent for retail"
./venv/bin/python cli.py search "ecommerce rebuild" --json
./venv/bin/python cli.py search "branding refresh" --proposals 10 --sections 20

Returns the closest past proposals + most relevant sections (pricing, scope, deliverables, terms).

scripts/03b_build_master_via_rclone_token.py

One-time, only if you want to rebuild the master deck from a PPTX. Uploads PPTX → Slides, sprinkles {{ANCHOR}} placeholders into text frames, shares with SA as Editor, writes new master_build.json.

./venv/bin/python scripts/03b_build_master_via_rclone_token.py

The existing data/master_build.json already points at Jonah's live master, so you can skip this unless you're working on a fresh template.

scripts/06_generate_abc_store_v2.py — the current generator

./venv/bin/python scripts/06_generate_abc_store_v2.py <round_number>

Does:
1. Copies master Slides
2. Keeps only the slides that apply to ABC Store; deletes the rest
3. For every kept slide: reads the existing text frame's first-run style, deletes its content, inserts new text, re-applies the captured style (this is the "full text-frame replacement" — fixes the partial-anchor smashing that v1 produced)
4. Exports to PDF via Drive API
5. Renders each page to PNG at 110 DPI for audit
6. Writes data/generated_drafts/2026-05-10_abc_store_proposal_r<N>_manifest.json

This script has the ABC Store proposal prose hardcoded as Python strings. It is a reference implementation, not a generic generator. See "What to work on" below.

scripts/07_audit_abc_store.py — per-round audit

./venv/bin/python scripts/07_audit_abc_store.py <round_number>

Extracts text per page, runs:
- Banned phrase check (e.g., "ComfyUI", "Marketing Services" — anything not in the ABC Store brief)
- Required phrase check (pre/post-test, multi-channel, human handoff)
- Client correctness ("ABC Store" present, no leftover from other clients)
- Visual sanity (overlapping text, smashed words)

Writes AUDIT.md per round. Loop rounds until clean.
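The banned/required checks are simple substring scans over the extracted page text. A minimal sketch — the phrase lists here are the examples from above, not the full lists the script derives from the brief:

```python
# Illustrative phrase lists — the real audit builds these from the brief.
BANNED = ["ComfyUI", "Marketing Services"]
REQUIRED = ["pre/post-test", "multi-channel", "human handoff"]

def audit_text(pages: list[str]) -> dict:
    """Flag banned phrases per page (1-indexed) and required phrases
    missing from the document as a whole."""
    full_text = "\n".join(pages).lower()
    return {
        "banned_hits": [
            (i + 1, phrase)
            for i, page in enumerate(pages)
            for phrase in BANNED
            if phrase.lower() in page.lower()
        ],
        "missing_required": [p for p in REQUIRED if p.lower() not in full_text],
    }
```

Banned hits are per-page (so the AUDIT.md can point at a slide), while required phrases only need to appear somewhere in the round.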


Architecture

Drive folder of PDFs (1y0jWe8aNTVMGWihwMhAk8oQLf7UquRQA)
        ↓
  cli.py ingest
        ↓
  PyMuPDF (+ ocrmypdf fallback when text density is low)
        ↓
  regex section classifier (cover / scope / pricing / timeline / ...)
        ↓
  canonical JSON  ──→  bge-m3 local embeddings  ──→  Qdrant (5 collections)
                                                      │
  cli.py search  ←──────────────────────────────────  ┘
        ↓
  top-5 proposals + top-12 sections

  (one-time)
  scripts/03b_*  PPTX  ──→  Google Slides + {{ANCHOR}} placeholders
                            (master deck, ID in master_build.json)

  (per client)
  scripts/06_*   master copy  ──→  text-frame replace  ──→  PDF export  ──→  PNG render

  (per round)
  scripts/07_*   audit text + visual checks  ──→  AUDIT.md

Qdrant collections (namespaced webspot_proposal_* so they coexist with other apps)
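The OCR fallback in the pipeline above triggers on low text density. A minimal version of that decision — the 200-chars-per-page threshold is an assumption for illustration, not the value the ingest code uses:

```python
def needs_ocr(extracted_text: str, page_count: int, min_chars_per_page: int = 200) -> bool:
    """Heuristic: if PyMuPDF extracts very little text per page, the PDF is
    probably scanned images, so fall back to ocrmypdf.
    The 200-char threshold is illustrative only."""
    if page_count == 0:
        return False
    density = len(extracted_text.strip()) / page_count
    return density < min_chars_per_page
```

Density per page (rather than total length) keeps the check fair for both 3-page and 40-page proposals.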


Current state


Hard constraints (please keep these)


What to work on

In rough priority:

  1. Build the missing generic generator. Today 06_* has ABC Store prose hardcoded. The natural shape is: brief markdown in → content per anchor out → fill the master deck. Could be an LLM (local model only, see constraint above), a template engine, or hand-author per service line. The retrieval CLI gives you reference language from the 251 past proposals to draft from.

  2. Populate edit_diffs and style_rules collections. They exist in Qdrant but are empty. The intent: capture human edits across rounds so the system learns Jonah's style preferences over time. Diff each round's text vs the prior round's, embed the change, store it.

  3. Routing intelligence. page_library.yaml does keyword scoring today (a routing_keywords: list per layout). Upgrade to embedding-similarity scoring against the brief as the corpus grows.

  4. Image swap. Logos and hero images aren't replaced today — they come from the master. Add a per-client image map (logo, hero, optional team photos).

  5. Wrap into CLI. Once a generic generator exists: cli.py generate <brief.md> / cli.py audit <round>. Right now those are loose scripts.

  6. Optional: portal. A simple web UI where someone pastes a brief and gets a generated PDF + audit. Only worth building once #1 lands — without a generic generator, a portal can't work for new clients.
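Item 2 can start very small: diff each round's extracted text against the prior round's, then embed and upsert whatever changed. A sketch of the diff step using stdlib difflib, assuming round text is already split into blocks (the embed/upsert step would follow):

```python
import difflib

def round_edit_diffs(prev_round: list[str], curr_round: list[str]) -> list[dict]:
    """Capture human edits between two rounds as (old, new) pairs — the raw
    material for the empty edit_diffs collection."""
    diffs = []
    matcher = difflib.SequenceMatcher(a=prev_round, b=curr_round)
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "replace":
            diffs.append({"old": "\n".join(prev_round[a1:a2]),
                          "new": "\n".join(curr_round[b1:b2])})
        elif op == "insert":
            diffs.append({"old": "", "new": "\n".join(curr_round[b1:b2])})
        elif op == "delete":
            diffs.append({"old": "\n".join(prev_round[a1:a2]), "new": ""})
    return diffs
```

Each (old, new) pair is one learnable style signal; embedding the new side with bge-m3 and storing the pair keeps the whole loop local-only, consistent with the no-paid-calls constraint.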


Reference


Stuck? Ask Jonah

Best questions to ping him about:
- Anything around the master deck design / layout decisions
- What "good" looks like for a Webspot proposal (tone, structure, pricing logic)
- Whether a feature you're considering matches the brand direction
- Drive access issues — only Jonah can share

You shouldn't need to ask him about:
- Code structure / how a script works — read the file, run it, iterate
- Qdrant / embeddings / RAG mechanics — those are standard, just docs
- Google API errors — Stack Overflow + the API reference linked above