Building a deep agentic document-understanding system on Sciforium
This notebook is a hands-on tour. By the end you will have built, stage by stage, an investor-grade due-diligence pipeline that takes a PDF and returns a cited, fact-checked memo (plus a podcast and a cover image). Everything is defined inline — no project imports. Each stage uses one primitive of the Sciforium API. The same pattern underlies most useful production agent pipelines: cheap model ×N in parallel → mid-tier model ×M in parallel → best model ×1 for synthesis.What you’ll learn
| Primitive | Endpoint | Where it shows up |
|---|---|---|
chat() | POST /v1/chat/completions | Every LLM call |
chat_with_attachment() | same, with a file content part | Metrics extraction, verification |
parse_file() | POST /api/attachments/parse | Turning a PDF into text |
synthesize_speech() | POST /v1/audio/speech | Podcast TTS |
generate_image() | POST /v1/images/generations | Cover image |
| Agentic pattern | Stage |
|---|---|
| Many small LLM calls in parallel | Extraction, analysis |
| Bounded concurrency (semaphore) around a third-party API | Exa grounding |
| Batching work so a big task streams progress | Verification |
| Evidence fusion — feeding the output of cheap calls into a best-in-class synthesis | Memo |
| Schema-constrained JSON output with robust parsing | Verification, podcast scripting |
| Plan → fan-out generate → stitch | Podcast |
Prerequisites
- A Sciforium API key in
.envasSCIFORIUM_API_KEY. - Optional:
EXA_API_KEYfor web grounding. pip install openai httpx(already in this project’s venv).
0 · Setup
Load environment variables, set base URLs, and sanity-check the key. No pipeline imports — everything from here down is written in the notebook.1 · Configuration — models and tasks
The whole pipeline is parameterised by aMODELS dict and two task lists. This is the only place you choose capability vs. cost per stage.
Rule of thumb:
- Extractor — cheap and fast. You’ll call it many times in parallel.
- Verifier — mid-tier. We batch the work so throughput > single-call capability.
- Analyst — mid/strong. Fewer calls, each needs to reason across evidence.
- Synthesizer — best model you have. One call, highest-stakes output.
2 · Primitive — chat() one-shot completion
Sciforium speaks the OpenAI Chat Completions protocol. Anything that works with openai-python against OpenAI works here — just point base_url at https://api.sciforium.com/v1.
We use AsyncOpenAI so downstream stages can fan out with asyncio.gather.
3 · Primitive — chat_with_attachment() multimodal file input
When numeric fidelity matters — tables, figures, dense financials — a lossy text parse is a bad input. Sciforium’s chat endpoint accepts a file content part with a base64 data URL. The gateway extracts native bytes before the model sees the message.
We’ll use this in two places: the metrics extraction task, and the whole verification stage.
4 · Primitive — parse_file() layout-aware parser
chat_with_attachment is great for a single question about one doc, but for multi-stage pipelines you usually want plain text once — otherwise every extraction step re-parses the same file on the server.
Sciforium’s /api/attachments/parse endpoint returns per-page structured text with layout preserved. Call it once, cache the output, reuse for every text-only stage.
5 · Pick a document
SetDOC_PATH to your PDF. The fallback below grabs the newest PDF in jobs/ if you’ve already run something through the web UI.
6 · Stage 1 — Parallel extraction (the many-small-calls pattern)
Five tasks, fired simultaneously withasyncio.gather. Because each task is a separate HTTP request, the wall-clock cost is dominated by the slowest one — not the sum.
Notice the use_attachment=True flag on the metrics task. For that one call we skip our own parse output and send the original PDF — the gateway’s native extractor preserves table numerics better than anything a general parser does with a bag of words.
7 · Stage 2 — Grounding with Exa (bounded concurrency)
For every top claim and every named person we want external corroboration. Two rules:- Fan out — queries are independent, so run them with
asyncio.gather. - Bound the fan — Exa (like any third-party API) rate-limits aggressive callers. A shared
asyncio.Semaphorecaps how many requests are in flight. 4 is a safe default for free/basic tiers.
8 · Stage 3 — Batched verification
We now ask theverifier model to re-check every extracted claim and metric against the original document. The naive implementation is one giant call with all 40 items — which turns into a multi-minute silent request that users abandon.
The fix is a pattern worth remembering: split the work into batches, fire the batches in parallel, log progress per batch. The user sees steady motion and the provider can serve the batches independently. Same total work, much better UX.
We also need to coax the model into returning JSON reliably. parse_json_response is a tiny utility that handles the three common failure modes (markdown fences, prose prelude, partial braces).
9 · Stage 4 — Analysis with evidence fusion
Every analyst task sees the same context packet: extractions + verification verdicts + numbered web evidence + the investor’s focus. The system prompt tells the model to downgrade confidence on anything the verifier markedUNVERIFIED or CONTRADICTED — that’s how fact-checking actually propagates into reasoning.
All six analyses run in parallel.
10 · Stage 5 — Synthesis with citation & confidence invariants
One call to the best model. The prompt enforces three invariants the reader can verify:- Citation — every external fact gets an inline
[n]linked to a numberedSourceslist. - Confidence tagging — every company claim carries
[VERIFIED],[PARTIAL],[UNVERIFIED], or[CONTRADICTED], taken from the verifier’s output. - No invented numbers — every metric must come from the extractions or verification.
11 · Stage 6 — Multimodal output (TTS + image)
Two more primitives and a nice orchestration pattern.synthesize_speech—POST /v1/audio/speechreturns WAV bytes.generate_image—POST /v1/images/generationsreturns a base64 PNG.synthesize_podcast— a mini-pipeline: ask the synthesizer to split the memo into short spoken chunks, TTS each chunk in parallel, stitch them into one WAV with silence gaps. This is the “plan → fan-out → stitch” pattern you can use for any long-form generation where latency matters.
What you’ve built
A full document → cited memo pipeline, in roughly 250 lines of Python, sitting entirely on top of Sciforium primitives:- One endpoint (
/v1/chat/completions) handled text extraction, verification, analysis, and synthesis. - The same endpoint with a
filecontent part did multimodal grounding against the original PDF. - A separate attachments endpoint gave you high-fidelity text once, cached for every downstream text-only call.
- TTS and image endpoints rounded it out into a multimodal deliverable.
Patterns worth keeping
- Stack models by cost. Cheap → mid → best, matched to call volume. Don’t use your biggest model for 40 parallel extractions.
- Fan out, then bound.
asyncio.gatherfor independent work; aSemaphorein front of any external API that can rate-limit you. - Batch big prompts. Prefer five parallel 8-item prompts over one 40-item prompt — better throughput, better UX, better error isolation.
- Cache the parse. Run the attachments API once, reuse the text. Attach the file only for the passes that truly need native bytes.
- Enforce invariants in the synthesis prompt. Citations, confidence tags, no invented numbers — these are what separate a demo from a trustworthy artefact.
Where to go next
- Swap
MODELSvalues to try different verifiers, analysts, or synthesizers. - Extend
EXTRACTION_TASKS/ANALYSIS_TASKS— new lenses cost roughly nothing since they fan out in parallel. - Add retries with backoff around
chat()(see the Exa helper for the pattern) for production robustness. - Batch mode: process many PDFs and bucket/rank them — same building blocks, wrapped in an outer
asyncio.Semaphore.