From: Jonah Tebaa (Webspot)
Repo state: working, proven on one client (ABC Store, 4 rounds)
Your mission: run it on your own machine, continue teaching it on the proposals folder, and push it forward
You'll be running a Python pipeline that:
1. Ingests a Drive folder of past Webspot proposal PDFs (251 of them) into a local Qdrant vector DB
2. Searches that corpus — given a brief, returns the top 5 closest past proposals + top 12 most relevant sections
3. Generates new proposal PDFs by copying a master Google Slides deck, replacing text frames per client, and exporting to PDF
4. Audits each generated round — flags off-brief content, banned phrases, visual issues
Zero paid model calls anywhere. Embeddings are local (bge-m3 via sentence-transformers). Section classification is regex. There's no LLM in the generation path today — that's actually the biggest open improvement (see "What to work on" below).
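For flavor, the regex classification step can be sketched with stdlib Python. The labels and patterns below are illustrative only; the real classifier lives in ingest/ with its own pattern set:

```python
import re

# Illustrative patterns -- the actual classifier in ingest/ has its own set.
SECTION_PATTERNS = [
    ("cover",    re.compile(r"\bproposal for\b|\bprepared by\b", re.I)),
    ("pricing",  re.compile(r"\bpricing\b|\binvestment\b|\$\s?\d", re.I)),
    ("scope",    re.compile(r"\bscope of work\b|\bdeliverables\b", re.I)),
    ("timeline", re.compile(r"\btimeline\b|\bmilestones?\b|\bweek \d", re.I)),
]

def classify_page(text: str) -> str:
    """Return the first matching section label, or 'other'."""
    for label, pattern in SECTION_PATTERNS:
        if pattern.search(text):
            return label
    return "other"
```

First-match-wins keeps the classifier cheap and deterministic, which is why there's no LLM in this path.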
| Item | How | Notes |
|---|---|---|
| Source tarball — proposal-agent.tar.gz (13 MB) | Direct file send | Contains code + 251 canonical JSONs + all ABC Store rounds + master deck refs + images. No raw PDFs (you'll re-ingest those). |
| Drive folder access — WEBSPOT \| PROPOSALS | Drive share | Folder ID: 1y0jWe8aNTVMGWihwMhAk8oQLf7UquRQA. You'll get Viewer on your service account's email. |
| Master Slides deck access | Drive share | Slides ID: 1PlMaMn2sOAkqy1GJNnsj292zShEolrCgHLezCVopSFA (WEBSPOT_PROPOSAL_MASTER_v1). You'll get Editor. |
| Source PPTX (optional, if you want to rebuild the master from scratch) | Drive share | Copy of WS Proposal Template.pptx in BRIAN SHARED/PROPOSALS |
What you need to send Jonah so he can grant access:
- Your Google account email
- Your service account email (created in step 3 below, format: xxx@yyy.iam.gserviceaccount.com)
When you extract proposal-agent.tar.gz:
proposal-agent-fullstate/
├── cli.py # main CLI: ingest / search / report / notify
├── ingest/ # Drive walker, PDF parser, section classifier, canonical builder
├── rag/ # bge-m3 embedder, Qdrant client, retriever
├── scripts/
│ ├── 01_analyze_master.py # Deep PPTX analysis → master_analysis.json
│ ├── 03_build_master.py # (superseded) original master builder
│ ├── 03b_build_master_via_rclone_token.py # current master builder (uses rclone OAuth)
│ ├── 04_visual_fidelity.py # PPTX-PDF vs Slides-PDF perceptual diff
│ ├── 04b_visual_fidelity_streaming.py # streaming variant
│ ├── 04c_visual_fidelity_fast.py # fast batch variant
│ ├── 05_generate_abc_store.py # v1 generator (anchor-only, superseded)
│ ├── 06_generate_abc_store_v2.py # CURRENT generator (full text-frame replace, round-aware)
│ └── 07_audit_abc_store.py # per-round audit + PNG render
├── data/
│ ├── canonical/ # 251 ingested proposals as JSON (the distilled learning)
│ ├── generated_drafts/ # ABC Store: all 4 rounds (PDFs, manifests, audit PNGs)
│ ├── abc_store_brief_intent.md # the spec ABC Store generator was audited against
│ ├── master_build.json # master Slides ID + every anchor occurrence
│ ├── master_analysis.json # deep PPTX analysis (slides, runs, fonts, colors, positions)
│ ├── page_library.yaml # routing config: which layouts always include, which gate
│ ├── layout_mapping.yaml # layout-tag → swap-anchor map
│ ├── visual_fidelity.json # perceptual-hash drift results
│ └── INGESTION_REPORT.md # corpus stats + confidence flags
├── images/ # 50+ slide rasters from the master deck
├── docker-compose.yml # local Qdrant
├── requirements.txt # Python deps
├── .env.example # all env vars documented
├── .gitignore
├── HANDOFF.md # short notes
└── README.md # original README
Not included (recreate as needed):
- data/pdf_cache/ — 1.2 GB of raw PDFs from Drive. Recreate with cli.py ingest once you have folder access.
- venv/, __pycache__/, logs/, working/, .env — runtime/secret.
mkdir -p /opt/agent && cd /opt/agent
tar xzf ~/Downloads/proposal-agent.tar.gz
mv proposal-agent-fullstate webspot_proposal_agent # paths in scripts assume this name
cd webspot_proposal_agent
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
Python 3.11+ recommended. If you don't want it at /opt/agent/, you can put it anywhere — but you'll need to grep for /opt/agent/webspot_proposal_agent in scripts/ and adjust those paths, OR symlink:
sudo mkdir -p /opt/agent
sudo ln -s /your/actual/path /opt/agent/webspot_proposal_agent
The CLI itself (cli.py) is env-driven and works from any path. Only the scripts/ files have hardcoded paths.
docker compose up -d
Brings up Qdrant on 127.0.0.1:6333. Verify: curl http://localhost:6333/healthz → healthz check passed.
Used for headless Slides API editing.
- Create a GCP service account named proposal-agent. No IAM roles needed (Drive permissions are granted via Drive sharing, not IAM).
- Download its JSON key to ~/secrets/proposal-agent-sa.json (or wherever, just outside the repo).
- Send Jonah the service account email (format: proposal-agent@your-project-id.iam.gserviceaccount.com).
Jonah will then share:
- WEBSPOT | PROPOSALS folder with your SA email as Viewer
- Master Slides deck with your SA email as Editor
The service account has no Drive quota and is only Viewer/Editor on shared folders — it can't upload/convert PPTX → Slides into folders it doesn't own. rclone gives you a user OAuth token that runs as a real Drive account.
# install rclone if you don't have it
curl https://rclone.org/install.sh | sudo bash
# configure a "gdrive" remote authenticated as YOU
rclone config
# → n (new remote)
# → name: gdrive
# → storage: drive (Google Drive)
# → client_id: <blank> (uses rclone's public default)
# → client_secret: <blank>
# → scope: 1 (full access)
# → root_folder_id: <blank>
# → service_account_file: <blank>
# → edit advanced config: n
# → use auto config: y (opens browser)
# → configure as team drive: n
# → confirm: y
# → quit: q
Verify: rclone lsd gdrive: should list your Drive root.
The scripts read the token from ~/.config/rclone/rclone.conf by default.
cp .env.example .env
Edit .env:
- GCP_SERVICE_ACCOUNT_JSON=/absolute/path/to/your-sa-key.json
- Leave the rclone defaults as-is — they're rclone's public OAuth client (open-source, works for everyone)
- Leave Drive IDs as Jonah's defaults — those are what Jonah is sharing with you
- MAX_WORKERS — drop to 4 if you're on a laptop, default 8 is for a 16-vCPU server
Load it in your shell:
set -a; source .env; set +a
(Or use direnv / python-dotenv. The CLI auto-loads .env if you run it from the repo root.)
./venv/bin/python cli.py search "AI agent for hospitality"
You should get 5 proposals + 12 sections. If yes: canonical+embeddings transferred cleanly, ingestion pipeline is healthy, you're ready.
Caveat: the 251 canonical JSONs ship in the tarball, but Qdrant starts empty, so the first search returns nothing until you re-embed. Run a quick re-embed:
./venv/bin/python cli.py ingest --limit 5
This pulls 5 raw PDFs from Drive (verifies folder access), parses them, re-embeds, upserts to Qdrant. After this, search works. Then run the full ingest:
./venv/bin/python cli.py ingest
~15-30 minutes depending on machine. Idempotent — safe to re-run.
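Idempotency in a pipeline like this usually comes from deterministic point IDs, so re-upserting the same PDF overwrites its points rather than duplicating them. A sketch of that scheme (an assumption about the mechanism, not a confirmed reading of cli.py):

```python
import uuid

# Qdrant accepts UUIDs as point IDs. Deriving them deterministically from the
# Drive file ID + section index means re-ingesting the same PDF overwrites
# its existing points instead of duplicating them.
# (Assumption: illustrative scheme, not necessarily cli.py's exact one.)
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "webspot_proposal_agent")

def point_id(drive_file_id: str, section_index: int) -> str:
    return str(uuid.uuid5(NAMESPACE, f"{drive_file_id}:{section_index}"))
```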
This regenerates Jonah's already-done ABC Store work, end-to-end:
./venv/bin/python scripts/06_generate_abc_store_v2.py 5
(Pass any round number that doesn't collide with the existing 4. Round 5 produces a fresh PDF you can compare against the existing data/generated_drafts/2026-05-10_abc_store_proposal_r4.pdf.)
If it produces a clean PDF matching the existing r4, the generator + master deck access + SA + rclone are all wired correctly.
cli.py ingest
Walks Drive, parses PDFs (PyMuPDF + OCR fallback), classifies pages by section (cover/scope/pricing/...), builds canonical JSON, embeds locally with bge-m3, upserts to Qdrant. Idempotent.
./venv/bin/python cli.py ingest # full
./venv/bin/python cli.py ingest --limit 10 # smoke test
Writes:
- data/pdf_cache/ — raw PDFs mirrored from Drive
- data/canonical/ — one JSON per proposal
- data/INGESTION_REPORT.md — stats
- Qdrant collections webspot_proposal_summaries, _sections, _blocks
cli.py search
./venv/bin/python cli.py search "AI customer service agent for retail"
./venv/bin/python cli.py search "ecommerce rebuild" --json
./venv/bin/python cli.py search "branding refresh" --proposals 10 --sections 20
Returns the closest past proposals + most relevant sections (pricing, scope, deliverables, terms).
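Under the hood this is cosine similarity between the bge-m3 query embedding and the stored vectors. Conceptually (toy in-memory vectors standing in for Qdrant, names illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, corpus, k=5):
    """corpus: list of (doc_id, vector). Returns the k best (score, doc_id)."""
    scored = sorted(((cosine(query_vec, v), d) for d, v in corpus), reverse=True)
    return scored[:k]
```

Qdrant does exactly this ranking, just with an index instead of a linear scan.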
scripts/03b_build_master_via_rclone_token.py
One-time, only if you want to rebuild the master deck from a PPTX. Uploads PPTX → Slides, sprinkles {{ANCHOR}} placeholders into text frames, shares with the SA as Editor, writes a new master_build.json.
./venv/bin/python scripts/03b_build_master_via_rclone_token.py
The existing data/master_build.json already points at Jonah's live master, so you can skip this unless you're working on a fresh template.
scripts/06_generate_abc_store_v2.py — the current generator
./venv/bin/python scripts/06_generate_abc_store_v2.py <round_number>
Does:
1. Copies master Slides
2. Keeps only the slides that apply to ABC Store; deletes the rest
3. For every kept slide: reads the existing text frame's first-run style, deletes its content, inserts new text, re-applies the captured style (this is the "full text-frame replacement" — fixes the partial-anchor smashing that v1 produced)
4. Exports to PDF via Drive API
5. Renders each page to PNG at 110 DPI for audit
6. Writes data/generated_drafts/2026-05-10_abc_store_proposal_r<N>_manifest.json
This script has the ABC Store proposal prose hardcoded as Python strings. It is a reference implementation, not a generic generator. See "What to work on" below.
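The full text-frame replacement in step 3 maps onto Slides API batchUpdate requests roughly like this (a sketch: object IDs and style fields are illustrative, and the real script may batch differently):

```python
def replace_frame_requests(object_id: str, new_text: str, style: dict) -> list:
    """Build Slides batchUpdate requests for one text frame: delete all text,
    insert the new text, re-apply the captured first-run style.
    Request shapes follow the Slides API; the style dict is illustrative."""
    return [
        {"deleteText": {"objectId": object_id, "textRange": {"type": "ALL"}}},
        {"insertText": {"objectId": object_id, "insertionIndex": 0,
                        "text": new_text}},
        {"updateTextStyle": {"objectId": object_id,
                             "textRange": {"type": "ALL"},
                             "style": style,
                             # "fields" tells the API which style keys to apply
                             "fields": ",".join(style.keys())}},
    ]
```

Capturing the style before the delete, then re-applying it after the insert, is what prevents the partial-anchor smashing v1 produced.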
scripts/07_audit_abc_store.py — per-round audit
./venv/bin/python scripts/07_audit_abc_store.py <round_number>
Extracts text per page, runs:
- Banned phrase check (e.g., "ComfyUI", "Marketing Services" — anything not in the ABC Store brief)
- Required phrase check (pre/post-test, multi-channel, human handoff)
- Client correctness ("ABC Store" present, no leftover from other clients)
- Visual sanity (overlapping text, smashed words)
Writes AUDIT.md per round. Loop rounds until clean.
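The phrase checks reduce to case-insensitive substring scans. A minimal sketch (phrase lists copied from the checks above; the return shape is illustrative, not 07's actual output):

```python
BANNED   = ["ComfyUI", "Marketing Services"]
REQUIRED = ["pre/post-test", "multi-channel", "human handoff"]

def audit_text(text: str) -> dict:
    """Flag banned phrases that appear and required phrases that are missing."""
    lowered = text.lower()
    return {
        "banned_hits":      [p for p in BANNED if p.lower() in lowered],
        "missing_required": [p for p in REQUIRED if p.lower() not in lowered],
        "client_ok":        "abc store" in lowered,
    }
```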
Drive folder of PDFs (1y0jWe8aNTVMGWihwMhAk8oQLf7UquRQA)
↓
cli.py ingest
↓
PyMuPDF (+ ocrmypdf fallback when text density is low)
↓
regex section classifier (cover / scope / pricing / timeline / ...)
↓
canonical JSON ──→ bge-m3 local embeddings ──→ Qdrant (5 collections)
│
cli.py search ←────────────────────────────────── ┘
↓
top-5 proposals + top-12 sections
(one-time)
scripts/03b_* PPTX ──→ Google Slides + {{ANCHOR}} placeholders
(master deck, ID in master_build.json)
(per client)
scripts/06_* master copy ──→ text-frame replace ──→ PDF export ──→ PNG render
(per round)
scripts/07_* audit text + visual checks ──→ AUDIT.md
All collections are prefixed webspot_proposal_* so they coexist with other apps:
- webspot_proposal_summaries — 1 vec / proposal (populated)
- webspot_proposal_sections — 1 vec / section (populated — primary retrieval unit)
- webspot_proposal_blocks — 1 vec / pricing+scope+terms+deliverables block (populated)
- webspot_proposal_edit_diffs — created, not yet populated (this is Week 4 work — see below)
- webspot_proposal_style_rules — created, not yet populated (Week 4)
In rough priority:
Build the missing generic generator. Today 06_* has ABC Store prose hardcoded. The natural shape is: brief markdown in → content per anchor out → fill the master deck. Could be an LLM (local model only, see constraint above), a template engine, or hand-author per service line. The retrieval CLI gives you reference language from the 251 past proposals to draft from.
Populate edit_diffs and style_rules collections. They exist in Qdrant but are empty. The intent: capture human edits across rounds so the system learns Jonah's style preferences over time. Diff each round's text vs the prior round's, embed the change, store it.
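The round-over-round diff step can be prototyped with stdlib difflib; each changed line would then be embedded and upserted to webspot_proposal_edit_diffs (function name and file labels are placeholders):

```python
import difflib

def round_diff(prev_text: str, curr_text: str) -> list:
    """Unified diff of one round's text vs the prior round's, as lines
    suitable for embedding and storing in the edit_diffs collection."""
    return list(difflib.unified_diff(
        prev_text.splitlines(), curr_text.splitlines(),
        fromfile="r_prev", tofile="r_curr", lineterm=""))
```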
Routing intelligence. page_library.yaml does keyword scoring today (routing_keywords: per layout). Upgrade to embedding-similarity scoring against the brief as the corpus grows.
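Today's keyword scoring amounts to a hit count per layout; the upgrade would swap that count for cosine similarity between the brief embedding and a layout-description embedding. The current scheme, sketched (function name and exact scoring are assumptions, not page_library.yaml's literal logic):

```python
def score_layout(brief: str, routing_keywords: list) -> int:
    """Count case-insensitive routing_keywords hits in the brief.
    The embedding upgrade replaces this count with
    cosine(embed(brief), embed(layout_description))."""
    lowered = brief.lower()
    return sum(1 for kw in routing_keywords if kw.lower() in lowered)
```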
Image swap. Logos and hero images aren't replaced today — they come from the master. Add a per-client image map (logo, hero, optional team photos).
Wrap into CLI. Once a generic generator exists: cli.py generate <brief.md> / cli.py audit <round>. Right now those are loose scripts.
Optional: portal. A simple web UI where someone pastes a brief and gets a generated PDF + audit. Only worth building once the generic-generator work above lands: without a generic generator, a portal can't serve new clients.
Best questions to ping him about:
- Anything around the master deck design / layout decisions
- What "good" looks like for a Webspot proposal (tone, structure, pricing logic)
- Whether a feature you're considering matches the brand direction
- Drive access issues — only Jonah can share
You shouldn't need to ask him about:
- Code structure / how a script works — read the file, run it, iterate
- Qdrant / embeddings / RAG mechanics — those are standard, just docs
- Google API errors — Stack Overflow + the API reference linked above