2026-05-12 13:49 (Beirut). Webspot Proposal Agent: what it does, how to use it, how it works (ingest → retrieve → generate → audit). Note: Google Slides, not Canva.

Webspot Proposal Agent — Full Guide

What it is: a no-paid-model agent that takes a client brief, retrieves the closest past Webspot proposals, generates pricing + content, and outputs a finished branded PDF proposal by editing a copy of the master Google Slides deck. (Note: the design layer is Google Slides, not Canva — Webspot's existing 54-slide A4 portrait master is the source of visual fidelity.)

Where it lives: /opt/agent/webspot_proposal_agent/ on Hetzner (ubuntu-8gb-hel1-1).

Status (2026-05-12): ingestion live (251 proposals indexed), master deck built with anchor placeholders, end-to-end generation proven on the ABC Store client (4 rounds, R4 = clean).

How to use it

1. Write the brief (intent doc)

Drop a markdown file describing the client and services into data/. Example: data/abc_store_brief_intent.md. It must include:

Client name + country
Each service: purpose, deliverables quoted verbatim from Jonah, distinguishing features
Audit checklist (per-page) — keeps the generator honest

2. Search the corpus (optional but recommended)

cd /opt/agent/webspot_proposal_agent
./venv/bin/python cli.py search "AI training + customer service agent for retail"
./venv/bin/python cli.py search "ecommerce site rebuild" --json

Returns top-5 closest past proposals + top-12 most relevant sections (pricing, scope, deliverables, etc.). These become reference material when drafting.

3. Generate the proposal

./venv/bin/python scripts/06_generate_abc_store_v2.py <round_number>

What that script does, in order:

Copies WEBSPOT_PROPOSAL_MASTER_v1 (Slides ID 1PlMaMn2sOAkqy1GJNnsj292zShEolrCgHLezCVopSFA).
Reads the keep-set: which of the 54 master slides apply to this brief.
Deletes the rest.
For each kept slide, fully replaces text-frame content via the Slides API (delete-old → insert-new → re-apply original first-run style — preserves fonts, sizes, colors, spacing).
Exports as PDF via Drive API.
Renders each PDF page to PNG at 110 DPI for audit.
Writes a manifest JSON: data/generated_drafts/YYYY-MM-DD_<client>_proposal_r<N>_manifest.json.
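
The manifest naming pattern in the last step can be sketched as a small helper. This is illustrative only: the function name `manifest_path` and the slug rule (lowercase, runs of non-alphanumerics collapsed to underscores) are assumptions, not taken from the generator script.

```python
import re
from datetime import date
from pathlib import Path


def manifest_path(client: str, round_number: int, day: date,
                  root: Path = Path("data/generated_drafts")) -> Path:
    """Build the manifest path for one round (hypothetical helper)."""
    # Assumed slug rule: lowercase, runs of non-alphanumerics become "_".
    slug = re.sub(r"[^a-z0-9]+", "_", client.lower()).strip("_")
    name = f"{day.isoformat()}_{slug}_proposal_r{round_number}_manifest.json"
    return root / name


if __name__ == "__main__":
    print(manifest_path("ABC Store", 4, date(2026, 5, 12)).as_posix())
```

Keeping the date, client, and round in the filename means each round's artifacts sort together and a bad round can be deleted without touching the others.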

4. Audit each round

./venv/bin/python scripts/07_audit_abc_store.py <round_number>

Extracts text, flags banned phrases (other client names, off-brief services like "ComfyUI workshop"), runs visual checks, writes AUDIT.md. Loop rounds until clean.

5. Deliver

PDF lands in data/generated_drafts/ and is mirrored to the BRIAN SHARED Drive folder so it appears on Jonah's Mac within ~4s.

Other CLI commands

./venv/bin/python cli.py ingest                    # full Drive walk (idempotent)
./venv/bin/python cli.py ingest --limit 10         # smoke test
./venv/bin/python cli.py report                    # re-emit INGESTION_REPORT.md
./venv/bin/python cli.py notify                    # send TG COMMS completion notice

How it works

Pipeline overview (Stages A to E)

Drive corpus  →  Retrieval (Qdrant)  →  Generation (Slides API)  →  PDF
   (275 PDFs)      (bge-m3 local)         (anchor replacement)       (audit)

Stage A — Ingestion (`cli.py ingest`)

Step	File	Detail
Walk Drive	`ingest/drive_ingest.py`	Recursive via service account `webspot-proposal-agent@gen-lang-client-0765538237.iam.gserviceaccount.com`. Skips `trash`, `demo`, `price card`, `_`, `old`, `template_test*`.
Parse PDFs	`ingest/pdf_parser.py`	PyMuPDF first; OCR (ocrmypdf) fallback when text density low. ~8.8% OCR rate.
Classify pages	`ingest/section_classifier.py`	Regex heuristic — cover / intro / scope / pricing / timeline / deliverables / terms / contact / case_study / about / other.
Build canonical JSON	`ingest/canonical.py`	One JSON per proposal in `data/canonical/`.
Embed	`rag/embed.py`	Local bge-m3 via sentence-transformers (normalized vectors, no paid API).
Upsert	`rag/qdrant_client.py`	5 Qdrant collections (namespaced `webspot_proposal_*`).
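
A regex page classifier of the kind described in the "Classify pages" row might look roughly like the sketch below. The keyword patterns, the scoring rule (most matches wins), and the name `classify_page` are assumptions for illustration; the actual rules live in `ingest/section_classifier.py` and cover more labels.

```python
import re

# Hypothetical keyword patterns, one per section label (subset of the real labels).
SECTION_PATTERNS = [
    ("pricing", re.compile(r"\b(pricing|price|investment|subtotal)\b", re.I)),
    ("timeline", re.compile(r"\b(timeline|milestones?|weeks?)\b", re.I)),
    ("deliverables", re.compile(r"\bdeliverables?\b", re.I)),
    ("scope", re.compile(r"\bscope\b", re.I)),
    ("terms", re.compile(r"\b(terms|conditions)\b", re.I)),
    ("contact", re.compile(r"\b(contact|email|phone)\b", re.I)),
]


def classify_page(text: str) -> str:
    """Return the label whose pattern matches most often, or "other"."""
    scores = {label: len(pattern.findall(text))
              for label, pattern in SECTION_PATTERNS}
    best = max(scores, key=scores.get)  # ties resolve to the earliest label
    return best if scores[best] > 0 else "other"


if __name__ == "__main__":
    print(classify_page("Pricing\nTotal investment: 4,000 USD"))
    print(classify_page("Project timeline: 6 weeks, 3 milestones"))
```

A heuristic like this needs no model call, which fits the zero-paid-model constraint, at the cost of misclassifying pages that use unusual wording.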

Stage B — Retrieval (`cli.py search`)

Five Qdrant collections on the shared agent-qdrant container (127.0.0.1:6333):

Collection	Granularity	Status
`webspot_proposal_summaries`	1 vec / proposal	populated
`webspot_proposal_sections`	1 vec / section	populated (primary retrieval unit)
`webspot_proposal_blocks`	1 vec / block (pricing/scope/terms/deliverables only)	populated
`webspot_proposal_edit_diffs`	—	created, Week 4 work
`webspot_proposal_style_rules`	—	created, Week 4 work

rag/retrieve.py returns top-5 proposals + top-12 sections per query.
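
Because the embeddings are normalized, cosine similarity reduces to a dot product, and "top-5 proposals" is a sort over those scores. The sketch below shows that ranking in memory with toy 2-D vectors; the real pipeline queries Qdrant, and the names `normalize` and `top_k` are assumptions.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query: list[float], corpus: dict[str, list[float]],
          k: int) -> list[tuple[str, float]]:
    """Rank corpus entries by cosine similarity to the query."""
    q = normalize(query)
    scored = [
        (name, sum(a * b for a, b in zip(q, normalize(vec))))
        for name, vec in corpus.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k([1.0, 0.1], toy, k=2))
```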

Stage C — Generation (`scripts/06_generate_abc_store_v2.py`)

The master is a real Google Slides file with anchor strings sprinkled in: {{PROPOSAL_TITLE}}, {{CLIENT_NAME}}, {{PROPOSAL_DATE}}, {{CLIENT_LOCATION}}, {{CLIENT_AUDIENCE}}, {{SERVICE_DESCRIPTION}} (×24 occurrences), and many more — built by scripts/03b_build_master_via_rclone_token.py from the original PPTX.

Why rclone's OAuth token, not the service account? The SA has no Drive quota and has only read access to the folder; rclone is authenticated as Jonah (the owner), so it can upload the PPTX, convert it to native Slides, and edit freely. The SA is then added as Editor so future generator runs work headless.

Full text-frame replacement (v2 strategy) avoids the partial-anchor smashing problem ("GENERATIONWORKSHOP") that the earlier replaceAllText-only path produced. For every slide we author content for, the script:

Reads the existing text frame's first-run style.
Deletes the entire frame contents.
Inserts the new generated text.
Re-applies the captured style.

Then exports to PDF via drive.files.export(mimeType="application/pdf").
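
The delete, insert, re-style sequence maps onto three Slides API request types: `deleteText`, `insertText`, and `updateTextStyle`. The sketch below only builds the request payloads; the helper name `replace_frame_requests`, the object ID, and the style values are hypothetical, and capturing the first-run style from the existing frame is left out.

```python
from typing import Any


def replace_frame_requests(object_id: str, new_text: str,
                           style: dict[str, Any],
                           fields: str) -> list[dict[str, Any]]:
    """Build batchUpdate requests that fully replace one text frame."""
    return [
        # 1. Remove everything currently in the frame.
        {"deleteText": {"objectId": object_id,
                        "textRange": {"type": "ALL"}}},
        # 2. Insert the generated text at the start of the now-empty frame.
        {"insertText": {"objectId": object_id,
                        "insertionIndex": 0,
                        "text": new_text}},
        # 3. Re-apply the captured style; `fields` is the update mask.
        {"updateTextStyle": {"objectId": object_id,
                             "textRange": {"type": "ALL"},
                             "style": style,
                             "fields": fields}},
    ]


if __name__ == "__main__":
    captured = {"bold": False, "fontSize": {"magnitude": 11, "unit": "PT"}}
    for request in replace_frame_requests("shape_1", "New scope text",
                                          captured, "bold,fontSize"):
        print(request)
```

With google-api-python-client the list would be sent as `service.presentations().batchUpdate(presentationId=..., body={"requests": requests}).execute()`. A `deleteText` over `ALL` on a frame that is already empty may be rejected, so a guard for empty frames is probably worth checking.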

Stage D — Visual fidelity check

scripts/04_visual_fidelity.py (+ 04b streaming and 04c fast variants): renders the new Slides master to PDF and diffs it page-by-page against the PDF converted from the original PPTX, using a perceptual hash (8×8 average-hash) to flag drift between the original master and the new template. Used once at template-build time, not per-client.
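
An 8×8 average-hash can be sketched in pure Python as below. It assumes each page has already been rendered to a 2-D grayscale grid of at least 8×8 pixels; the function names and any drift threshold are assumptions rather than the script's actual code.

```python
def average_hash(pixels: list[list[int]], size: int = 8) -> int:
    """Downscale to size x size by block averaging, then threshold at the mean."""
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        for c in range(size):
            rows = range(r * height // size, (r + 1) * height // size)
            cols = range(c * width // size, (c + 1) * width // size)
            block = [pixels[y][x] for y in rows for x in cols]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: 1 if brighter than the overall mean.
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count differing bits between two hashes (0 means identical)."""
    return bin(hash_a ^ hash_b).count("1")


if __name__ == "__main__":
    page = [[0] * 8 + [255] * 8 for _ in range(16)]      # dark left, bright right
    inverted = [[255 - v for v in row] for row in page]
    print(hamming_distance(average_hash(page), average_hash(page)))
    print(hamming_distance(average_hash(page), average_hash(inverted)))
```

A small Hamming distance between the two page hashes suggests the layouts match; where to draw the line for "drift" is a tuning choice.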

Stage E — Per-round audit

scripts/07_audit_abc_store.py extracts text per page and runs:

Relevance check — does this page belong in this brief?
Content match — does the page text reflect Jonah's brief verbatim (pre/post-test, multi-channel handling, human handoff/handback for ABC Store)?
Client correctness — client name present, no leftover from other proposals.
Visual check — no overlapping text, no smashed words, no garbled fonts.

Writes AUDIT.md per round. Loop until clean.
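
The client-correctness part of the audit can be approximated with plain substring checks over the extracted page text. This is a minimal sketch: `audit_pages`, the result keys, and the sample banned list are hypothetical, and the relevance and visual checks are not covered.

```python
def audit_pages(pages: list[str], client_name: str,
                banned: list[str]) -> dict:
    """Flag banned phrases per page and confirm the client name appears."""
    banned_hits: dict[int, list[str]] = {}
    for number, text in enumerate(pages, start=1):
        lowered = text.lower()
        found = [phrase for phrase in banned if phrase.lower() in lowered]
        if found:
            banned_hits[number] = found
    client_present = any(client_name.lower() in text.lower() for text in pages)
    return {
        "banned_hits": banned_hits,
        "client_present": client_present,
        "clean": client_present and not banned_hits,
    }


if __name__ == "__main__":
    pages = ["Proposal for ABC Store", "Includes a ComfyUI workshop"]
    print(audit_pages(pages, "ABC Store", ["ComfyUI workshop", "Other Client Ltd"]))
```

A round would count as clean only when `clean` is true, which matches the loop-until-clean rule.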

Current numbers (post-ingestion, 2026-05-10)

273 Drive files scanned · 251 proposals indexed · 0 parse failures · 0 download failures
OCR fallback: 8.8% (22/251)
Corpus size: ~366K tokens, ~1.46M chars
Year mix: 18 (2023), 68 (2024), 139 (2025), 23 (2026 YTD)
Project mix: ai_agent 105, website 31, ecommerce 31, website_branding 16, ads 11, branding 6, unknown 51
Master deck: 54 slides, A4 portrait (7,556,500 × 10,693,400 EMU)
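
As a quick check on the master deck dimensions, EMU converts to millimetres at 36,000 EMU per mm (914,400 EMU per inch divided by 25.4). The helper name `emu_to_mm` is an assumption for illustration.

```python
EMU_PER_INCH = 914_400
EMU_PER_MM = 36_000  # 914,400 / 25.4


def emu_to_mm(emu: int) -> float:
    """Convert English Metric Units to millimetres."""
    return emu / EMU_PER_MM


width_mm = emu_to_mm(7_556_500)
height_mm = emu_to_mm(10_693_400)

if __name__ == "__main__":
    print(f"{width_mm:.1f} mm x {height_mm:.1f} mm")
```

Both values land within a fraction of a millimetre of A4 (210 mm × 297 mm), which is consistent with the A4 portrait description.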

Files map

Path	Purpose
`cli.py`	`ingest` / `search` / `report` / `notify`
`ingest/drive_ingest.py`	Drive walker (service account)
`ingest/pdf_parser.py`	PyMuPDF + ocrmypdf fallback
`ingest/section_classifier.py`	Per-page regex classifier
`ingest/canonical.py`	Canonical JSON builder
`rag/embed.py`	Local bge-m3 embeddings
`rag/qdrant_client.py`	5-collection upserter
`rag/retrieve.py`	Top-5 proposals + top-12 sections
`scripts/01_analyze_master.py`	Deep PPTX analysis → `master_analysis.json`
`scripts/03b_build_master_via_rclone_token.py`	PPTX → native Google Slides w/ anchors (live builder)
`scripts/04*.py`	Visual-fidelity checks (3 variants)
`scripts/05_generate_abc_store.py`	First-gen generator (v1, anchor-only)
`scripts/06_generate_abc_store_v2.py`	Current generator — full text-frame replacement, round-aware
`scripts/07_audit_abc_store.py`	Per-round audit + PNG renders
`data/canonical/`	One JSON per proposal (251 files)
`data/pdf_cache/`	Mirrored Drive structure, raw PDFs
`data/page_library.yaml`	Routing index: which layouts are always included and which are gated on keywords
`data/master_build.json`	Live master Slides ID + anchor occurrence map
`data/INGESTION_REPORT.md`	Stats + confidence flags
`data/generated_drafts/`	Output PDFs + manifests, one set per round
`logs/ingest.log`	Append-only ingestion log

Hard constraints

Zero paid model calls — bge-m3 local, regex classifier, anchor replacement (no LLM in the gen path). Consistent with daily-spend lockdown.
No mutation of the master — every run copies first.
A4 portrait enforced — page size re-asserted to 7,556,500 × 10,693,400 EMU on every copy.
Round-based iteration — rounds are atomic; each writes its own manifest + PDF + PNG audits so a bad round can be discarded.

Known gaps / next work

Week 4: populate webspot_proposal_edit_diffs + webspot_proposal_style_rules collections (currently created but empty). Lets the generator learn from Jonah's edits across rounds.
page_library.yaml routing scoring is keyword-only — fine for now; upgrade to embedding-similarity scoring when corpus grows.
Current ABC Store generator is client-specific (scripts/06_generate_abc_store_v2.py). Generalize into scripts/generate.py <brief.md> <round> once the second client lands.
No Slides-side image swap yet (logos, hero images). Body images come from the master; client-specific images need a follow-up step.

Trigger words / one-liners

Generate: "draft a Webspot proposal for <client> covering <services>" → I write the intent doc, run search to pull references, run generator round 1, run audit, iterate.
Re-ingest: "re-ingest the proposals folder" → cli.py ingest (idempotent, picks up new files only).
Search corpus: "find Webspot proposals about <X>" → cli.py search "X".

Index location: /opt/agent/webspot_proposal_agent/ · master Slides: 1PlMaMn2sOAkqy1GJNnsj292zShEolrCgHLezCVopSFA · Drive folder: 1y0jWe8aNTVMGWihwMhAk8oQLf7UquRQA (WEBSPOT | PROPOSALS).
