From: Jonah Tebaa (Webspot)
Repo state: working, proven on one client (ABC Store, 4 rounds)
Your mission: run it on your own machine, continue teaching it on the proposals folder, and push it forward
You'll be running a Python pipeline that:
1. Ingests a Drive folder of past Webspot proposal PDFs (251 of them) into a local Qdrant vector DB
2. Searches that corpus — given a brief, returns the top 5 closest past proposals + top 12 most relevant sections
3. Generates new proposal PDFs by copying a master Google Slides deck, replacing text frames per client, and exporting to PDF
4. Audits each generated round — flags off-brief content, banned phrases, visual issues
Zero paid model calls anywhere. Embeddings are local (bge-m3 via sentence-transformers). Section classification is regex. There's no LLM in the generation path today — that's actually the biggest open improvement (see "What to work on" below).
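For flavor, the regex classification step can be sketched with stdlib Python. The labels and patterns below are illustrative only; the real classifier lives in ingest/ with its own pattern set:

```python
import re

# Illustrative patterns -- the actual classifier in ingest/ has its own set.
SECTION_PATTERNS = [
    ("cover",    re.compile(r"\bproposal for\b|\bprepared by\b", re.I)),
    ("pricing",  re.compile(r"\bpricing\b|\binvestment\b|\$\s?\d", re.I)),
    ("scope",    re.compile(r"\bscope of work\b|\bdeliverables\b", re.I)),
    ("timeline", re.compile(r"\btimeline\b|\bmilestones?\b|\bweek \d", re.I)),
]

def classify_page(text: str) -> str:
    """Return the first matching section label, or 'other'."""
    for label, pattern in SECTION_PATTERNS:
        if pattern.search(text):
            return label
    return "other"
```

First-match-wins keeps the classifier cheap and deterministic, which is why there's no LLM in this path.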
| Item | How | Notes |
|---|---|---|
| Source tarball — proposal-agent.tar.gz (13 MB) | Direct file send | Contains code + 251 canonical JSONs + all ABC Store rounds + master deck refs + images. No raw PDFs (you'll re-ingest those). |
| Drive folder access — WEBSPOT \| PROPOSALS | Drive share | Folder ID: 1y0jWe8aNTVMGWihwMhAk8oQLf7UquRQA. You'll get Viewer on your service account's email. |
| Master Slides deck access | Drive share | Slides ID: 1PlMaMn2sOAkqy1GJNnsj292zShEolrCgHLezCVopSFA (WEBSPOT_PROPOSAL_MASTER_v1). You'll get Editor. |
| Source PPTX (optional, if you want to rebuild the master from scratch) | Drive share | Copy of WS Proposal Template.pptx in BRIAN SHARED/PROPOSALS |
What you need to send Jonah so he can grant access:
- Your Google account email
- Your service account email (created in step 3 below, format: xxx@yyy.iam.gserviceaccount.com)
When you extract proposal-agent.tar.gz:
proposal-agent-fullstate/
├── cli.py # main CLI: ingest / search / report / notify
├── ingest/ # Drive walker, PDF parser, section classifier, canonical builder
├── rag/ # bge-m3 embedder, Qdrant client, retriever
├── scripts/
│ ├── 01_analyze_master.py # Deep PPTX analysis → master_analysis.json
│ ├── 03_build_master.py # (superseded) original master builder
│ ├── 03b_build_master_via_rclone_token.py # current master builder (uses rclone OAuth)
│ ├── 04_visual_fidelity.py # PPTX-PDF vs Slides-PDF perceptual diff
│ ├── 04b_visual_fidelity_streaming.py # streaming variant
│ ├── 04c_visual_fidelity_fast.py # fast batch variant
│ ├── 05_generate_abc_store.py # v1 generator (anchor-only, superseded)
│ ├── 06_generate_abc_store_v2.py # CURRENT generator (full text-frame replace, round-aware)
│ └── 07_audit_abc_store.py # per-round audit + PNG render
├── data/
│ ├── canonical/ # 251 ingested proposals as JSON (the distilled learning)
│ ├── generated_drafts/ # ABC Store: all 4 rounds (PDFs, manifests, audit PNGs)
│ ├── abc_store_brief_intent.md # the spec ABC Store generator was audited against
│ ├── master_build.json # master Slides ID + every anchor occurrence
│ ├── master_analysis.json # deep PPTX analysis (slides, runs, fonts, colors, positions)
│ ├── page_library.yaml # routing config: which layouts always include, which gate
│ ├── layout_mapping.yaml # layout-tag → swap-anchor map
│ ├── visual_fidelity.json # perceptual-hash drift results
│ └── INGESTION_REPORT.md # corpus stats + confidence flags
├── images/ # 50+ slide rasters from the master deck
├── docker-compose.yml # local Qdrant
├── requirements.txt # Python deps
├── .env.example # all env vars documented
├── .gitignore
├── HANDOFF.md # short notes
└── README.md # original README
Not included (recreate as needed):
- data/pdf_cache/ — 1.2 GB of raw PDFs from Drive. Recreate with cli.py ingest once you have folder access.
- venv/, __pycache__/, logs/, working/, .env — runtime/secret.
mkdir -p /opt/agent && cd /opt/agent
tar xzf ~/Downloads/proposal-agent.tar.gz
mv proposal-agent-fullstate webspot_proposal_agent # paths in scripts assume this name
cd webspot_proposal_agent
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
Python 3.11+ recommended. If you don't want it at /opt/agent/, you can put it anywhere — but you'll need to grep for /opt/agent/webspot_proposal_agent in scripts/ and adjust those paths, OR symlink:
sudo mkdir -p /opt/agent
sudo ln -s /your/actual/path /opt/agent/webspot_proposal_agent
The CLI itself (cli.py) is env-driven and works from any path. Only the scripts/ files have hardcoded paths.
docker compose up -d
Brings up Qdrant on 127.0.0.1:6333. Verify: curl http://localhost:6333/healthz → healthz check passed.
Used for headless Slides API editing.
- Create a GCP service account named proposal-agent. No IAM roles needed (Drive permissions are granted via Drive sharing, not IAM).
- Download its JSON key to ~/secrets/proposal-agent-sa.json (or wherever, just outside the repo).
- Send Jonah the service account email (format: proposal-agent@your-project-id.iam.gserviceaccount.com).
Jonah will then share:
- WEBSPOT | PROPOSALS folder with your SA email as Viewer
- Master Slides deck with your SA email as Editor
The service account has no Drive quota and is only Viewer/Editor on shared folders — it can't upload/convert PPTX → Slides into folders it doesn't own. rclone gives you a user OAuth token that runs as a real Drive account.
# install rclone if you don't have it
curl https://rclone.org/install.sh | sudo bash
# configure a "gdrive" remote authenticated as YOU
rclone config
# → n (new remote)
# → name: gdrive
# → storage: drive (Google Drive)
# → client_id: <blank> (uses rclone's public default)
# → client_secret: <blank>
# → scope: 1 (full access)
# → root_folder_id: <blank>
# → service_account_file: <blank>
# → edit advanced config: n
# → use auto config: y (opens browser)
# → configure as team drive: n
# → confirm: y
# → quit: q
Verify: rclone lsd gdrive: should list your Drive root.
The scripts read the token from ~/.config/rclone/rclone.conf by default.
cp .env.example .env
Edit .env:
- GCP_SERVICE_ACCOUNT_JSON=/absolute/path/to/your-sa-key.json
- Leave the rclone defaults as-is — they're rclone's public OAuth client (open-source, works for everyone)
- Leave Drive IDs as Jonah's defaults — those are what Jonah is sharing with you
- MAX_WORKERS — drop to 4 if you're on a laptop, default 8 is for a 16-vCPU server
Load it in your shell:
set -a; source .env; set +a
(Or use direnv / python-dotenv. The CLI auto-loads .env if you run it from the repo root.)
./venv/bin/python cli.py search "AI agent for hospitality"
You should get 5 proposals + 12 sections. If yes: canonical+embeddings transferred cleanly, ingestion pipeline is healthy, you're ready.
Caveat: the 251 canonical JSONs ship in the tarball, but Qdrant starts empty, so the first search returns nothing until you re-embed. Run a quick re-embed:
./venv/bin/python cli.py ingest --limit 5
This pulls 5 raw PDFs from Drive (verifies folder access), parses them, re-embeds, upserts to Qdrant. After this, search works. Then run the full ingest:
./venv/bin/python cli.py ingest
~15-30 minutes depending on machine. Idempotent — safe to re-run.
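Idempotency in a pipeline like this usually comes from deterministic point IDs, so re-upserting the same PDF overwrites its points rather than duplicating them. A sketch of that scheme (an assumption about the mechanism, not a confirmed reading of cli.py):

```python
import uuid

# Qdrant accepts UUIDs as point IDs. Deriving them deterministically from the
# Drive file ID + section index means re-ingesting the same PDF overwrites
# its existing points instead of duplicating them.
# (Assumption: illustrative scheme, not necessarily cli.py's exact one.)
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "webspot_proposal_agent")

def point_id(drive_file_id: str, section_index: int) -> str:
    return str(uuid.uuid5(NAMESPACE, f"{drive_file_id}:{section_index}"))
```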
This regenerates Jonah's already-done ABC Store work, end-to-end:
./venv/bin/python scripts/06_generate_abc_store_v2.py 5
(Pass any round number that doesn't collide with the existing 4. Round 5 produces a fresh PDF you can compare against the existing data/generated_drafts/2026-05-10_abc_store_proposal_r4.pdf.)
If it produces a clean PDF matching the existing r4, the generator + master deck access + SA + rclone are all wired correctly.
cli.py ingest
Walks Drive, parses PDFs (PyMuPDF + OCR fallback), classifies pages by section (cover/scope/pricing/...), builds canonical JSON, embeds locally with bge-m3, upserts to Qdrant. Idempotent.
./venv/bin/python cli.py ingest # full
./venv/bin/python cli.py ingest --limit 10 # smoke test
Writes:
- data/pdf_cache/ — raw PDFs mirrored from Drive
- data/canonical/ — one JSON per proposal
- data/INGESTION_REPORT.md — stats
- Qdrant collections webspot_proposal_summaries, _sections, _blocks
cli.py search
./venv/bin/python cli.py search "AI customer service agent for retail"
./venv/bin/python cli.py search "ecommerce rebuild" --json
./venv/bin/python cli.py search "branding refresh" --proposals 10 --sections 20
Returns the closest past proposals + most relevant sections (pricing, scope, deliverables, terms).
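Under the hood this is cosine similarity between the bge-m3 query embedding and the stored vectors. Conceptually (toy in-memory vectors standing in for Qdrant, names illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, corpus, k=5):
    """corpus: list of (doc_id, vector). Returns the k best (score, doc_id)."""
    scored = sorted(((cosine(query_vec, v), d) for d, v in corpus), reverse=True)
    return scored[:k]
```

Qdrant does exactly this ranking, just with an index instead of a linear scan.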
scripts/03b_build_master_via_rclone_token.py
One-time, only if you want to rebuild the master deck from a PPTX. Uploads PPTX → Slides, sprinkles {{ANCHOR}} placeholders into text frames, shares with the SA as Editor, writes a new master_build.json.
./venv/bin/python scripts/03b_build_master_via_rclone_token.py
The existing data/master_build.json already points at Jonah's live master, so you can skip this unless you're working on a fresh template.
scripts/06_generate_abc_store_v2.py — the current generator
./venv/bin/python scripts/06_generate_abc_store_v2.py <round_number>
Does:
1. Copies master Slides
2. Keeps only the slides that apply to ABC Store; deletes the rest
3. For every kept slide: reads the existing text frame's first-run style, deletes its content, inserts new text, re-applies the captured style (this is the "full text-frame replacement" — fixes the partial-anchor smashing that v1 produced)
4. Exports to PDF via Drive API
5. Renders each page to PNG at 110 DPI for audit
6. Writes data/generated_drafts/2026-05-10_abc_store_proposal_r<N>_manifest.json
This script has the ABC Store proposal prose hardcoded as Python strings. It is a reference implementation, not a generic generator. See "What to work on" below.
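The full text-frame replacement in step 3 maps onto Slides API batchUpdate requests roughly like this (a sketch: object IDs and style fields are illustrative, and the real script may batch differently):

```python
def replace_frame_requests(object_id: str, new_text: str, style: dict) -> list:
    """Build Slides batchUpdate requests for one text frame: delete all text,
    insert the new text, re-apply the captured first-run style.
    Request shapes follow the Slides API; the style dict is illustrative."""
    return [
        {"deleteText": {"objectId": object_id, "textRange": {"type": "ALL"}}},
        {"insertText": {"objectId": object_id, "insertionIndex": 0,
                        "text": new_text}},
        {"updateTextStyle": {"objectId": object_id,
                             "textRange": {"type": "ALL"},
                             "style": style,
                             # "fields" tells the API which style keys to apply
                             "fields": ",".join(style.keys())}},
    ]
```

Capturing the style before the delete, then re-applying it after the insert, is what prevents the partial-anchor smashing v1 produced.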
scripts/07_audit_abc_store.py — per-round audit
./venv/bin/python scripts/07_audit_abc_store.py <round_number>
Extracts text per page, runs:
- Banned phrase check (e.g., "ComfyUI", "Marketing Services" — anything not in the ABC Store brief)
- Required phrase check (pre/post-test, multi-channel, human handoff)
- Client correctness ("ABC Store" present, no leftover from other clients)
- Visual sanity (overlapping text, smashed words)
Writes AUDIT.md per round. Loop rounds until clean.
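The phrase checks reduce to case-insensitive substring scans. A minimal sketch (phrase lists copied from the checks above; the return shape is illustrative, not 07's actual output):

```python
BANNED   = ["ComfyUI", "Marketing Services"]
REQUIRED = ["pre/post-test", "multi-channel", "human handoff"]

def audit_text(text: str) -> dict:
    """Flag banned phrases that appear and required phrases that are missing."""
    lowered = text.lower()
    return {
        "banned_hits":      [p for p in BANNED if p.lower() in lowered],
        "missing_required": [p for p in REQUIRED if p.lower() not in lowered],
        "client_ok":        "abc store" in lowered,
    }
```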
Drive folder of PDFs (1y0jWe8aNTVMGWihwMhAk8oQLf7UquRQA)
↓
cli.py ingest
↓
PyMuPDF (+ ocrmypdf fallback when text density is low)
↓
regex section classifier (cover / scope / pricing / timeline / ...)
↓
canonical JSON ──→ bge-m3 local embeddings ──→ Qdrant (5 collections)
│
cli.py search ←────────────────────────────────── ┘
↓
top-5 proposals + top-12 sections
(one-time)
scripts/03b_* PPTX ──→ Google Slides + {{ANCHOR}} placeholders
(master deck, ID in master_build.json)
(per client)
scripts/06_* master copy ──→ text-frame replace ──→ PDF export ──→ PNG render
(per round)
scripts/07_* audit text + visual checks ──→ AUDIT.md
All collections are prefixed webspot_proposal_* so they coexist with other apps:
- webspot_proposal_summaries — 1 vec / proposal (populated)
- webspot_proposal_sections — 1 vec / section (populated — primary retrieval unit)
- webspot_proposal_blocks — 1 vec / pricing+scope+terms+deliverables block (populated)
- webspot_proposal_edit_diffs — created, not yet populated (this is Week 4 work — see below)
- webspot_proposal_style_rules — created, not yet populated (Week 4)
In rough priority:
Build the missing generic generator. Today 06_* has ABC Store prose hardcoded. The natural shape is: brief markdown in → content per anchor out → fill the master deck. Could be an LLM (local model only, see constraint above), a template engine, or hand-author per service line. The retrieval CLI gives you reference language from the 251 past proposals to draft from.
Populate edit_diffs and style_rules collections. They exist in Qdrant but are empty. The intent: capture human edits across rounds so the system learns Jonah's style preferences over time. Diff each round's text vs the prior round's, embed the change, store it.
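The round-over-round diff step can be prototyped with stdlib difflib; each changed line would then be embedded and upserted to webspot_proposal_edit_diffs (function name and file labels are placeholders):

```python
import difflib

def round_diff(prev_text: str, curr_text: str) -> list:
    """Unified diff of one round's text vs the prior round's, as lines
    suitable for embedding and storing in the edit_diffs collection."""
    return list(difflib.unified_diff(
        prev_text.splitlines(), curr_text.splitlines(),
        fromfile="r_prev", tofile="r_curr", lineterm=""))
```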
Routing intelligence. page_library.yaml does keyword scoring today (routing_keywords: per layout). Upgrade to embedding-similarity scoring against the brief as the corpus grows.
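Today's keyword scoring amounts to a hit count per layout; the upgrade would swap that count for cosine similarity between the brief embedding and a layout-description embedding. The current scheme, sketched (function name and exact scoring are assumptions, not page_library.yaml's literal logic):

```python
def score_layout(brief: str, routing_keywords: list) -> int:
    """Count case-insensitive routing_keywords hits in the brief.
    The embedding upgrade replaces this count with
    cosine(embed(brief), embed(layout_description))."""
    lowered = brief.lower()
    return sum(1 for kw in routing_keywords if kw.lower() in lowered)
```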
Image swap. Logos and hero images aren't replaced today — they come from the master. Add a per-client image map (logo, hero, optional team photos).
Wrap into CLI. Once a generic generator exists: cli.py generate <brief.md> / cli.py audit <round>. Right now those are loose scripts.
Optional: portal. A simple web UI where someone pastes a brief and gets a generated PDF + audit. Only worth building once the generic-generator work above lands: without a generic generator, a portal can't serve new clients.
Best questions to ping him about:
- Anything around the master deck design / layout decisions
- What "good" looks like for a Webspot proposal (tone, structure, pricing logic)
- Whether a feature you're considering matches the brand direction
- Drive access issues — only Jonah can share
You shouldn't need to ask him about:
- Code structure / how a script works — read the file, run it, iterate
- Qdrant / embeddings / RAG mechanics — those are standard, just docs
- Google API errors — Stack Overflow + the API reference linked above