Overview
This guide walks you through parsing one or more files with the Sciforium API, extracting plain text from the results, and sending that text to an LLM with a question.
| Step | What happens |
|---|---|
| 1 | Configure — set your file path, question, model, and API key |
| 2 | Install dependencies |
| 3 | Initialize the Sciforium client |
| 4a | Parse a single file |
| 4b | (Optional) Batch-parse multiple files in parallel |
| 5 | (Optional) Inspect the raw parse response |
| 6 | Extract plain text from the parse results |
| 7 | Chat — send the extracted text + your question to the LLM |
Prerequisites
- A Sciforium API key — get one at console.sciforium.com
- Python 3.8+
- The file(s) you want to parse accessible on disk (or uploaded to Colab — see the note below)
If you’re running in Google Colab, click the folder icon in the left sidebar, upload your file, then right-click it and select Copy path to use in your code. Uploaded files are deleted when the runtime disconnects.
Step 1 — Configuration
Set the four variables below before running anything else.
FILE_PATH — absolute path to the file you want to parse. Supported formats: PDF, DOCX, DOC, TXT, MD, CSV, HTML, JSON, PNG, JPG.
QUESTION — the question you want to ask the LLM about the document.
MODEL — the LLM model identifier (e.g. openai/gpt-oss-120b, anthropic/claude-sonnet-4-6).
SCIFORIUM_API_KEY — your Sciforium API key.
FILE_PATH = "/content/sample_data/sample.pdf"
QUESTION = "Extract the candidate's email address and latest company."
MODEL = "openai/gpt-oss-120b"
SCIFORIUM_API_KEY = "your-api-key-here"
Never commit your API key to version control. Use environment variables or a secrets manager in production.
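As a sketch of that advice, you can resolve the key from the environment first and fall back to a hidden interactive prompt. The resolve_api_key helper below is illustrative, not part of any Sciforium SDK:

```python
import os
from getpass import getpass

def resolve_api_key(env=None, prompt=getpass):
    # Prefer the environment variable; otherwise prompt interactively.
    # getpass hides the typed key in terminals and notebooks.
    env = os.environ if env is None else env
    return env.get("SCIFORIUM_API_KEY") or prompt("Sciforium API key: ")
```

In a notebook you would then write SCIFORIUM_API_KEY = resolve_api_key() instead of pasting the key into the cell.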
Step 2 — Install dependencies
Run this once if openai or requests aren’t already installed.
pip install openai requests
Step 3 — Initialize the client
This constructs the Sciforium endpoint URLs and resolves the API key, preferring the SCIFORIUM_API_KEY environment variable and falling back to the value you set in Step 1.
import base64, os
from pathlib import Path
import requests
from openai import OpenAI
BASE_URL = os.environ.get("SCIFORIUM_BASE_URL", "https://api.sciforium.com").rstrip("/")
PARSE_API_URL = BASE_URL + "/api/attachments/parse"
LLM_BASE_URL = BASE_URL + "/v1"
API_KEY = os.environ.get("SCIFORIUM_API_KEY", SCIFORIUM_API_KEY)
assert API_KEY, "Set SCIFORIUM_API_KEY in the Config cell or as an environment variable."
Step 4a — Parse a single file
Reads the file from FILE_PATH, base64-encodes it, and POSTs it to the Sciforium parse endpoint. The response contains structured content (pages, text, metadata) for the file.
import time
MIME_MAP = {
"pdf": "application/pdf",
"docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"doc": "application/msword",
"txt": "text/plain",
"md": "text/markdown",
"csv": "text/csv",
"html": "text/html",
"json": "application/json",
"png": "image/png",
"jpg": "image/jpeg",
"jpeg": "image/jpeg",
}
path = Path(FILE_PATH)
mime = MIME_MAP.get(path.suffix.lstrip(".").lower(), "application/octet-stream")
encoded = base64.b64encode(path.read_bytes()).decode()
print(f"Parsing '{path.name}' ({path.stat().st_size // 1024} KB)...")
start_t = time.time()
resp = requests.post(
PARSE_API_URL,
headers={"Content-Type": "application/json", "x-api-key": API_KEY},
json={
"files": [{
"url": f"data:{mime};base64,{encoded}",
"filename": path.name,
"media_type": mime,
}]
},
timeout=300,
)
elapsed = time.time() - start_t
resp.raise_for_status()
parse_response = resp.json()
meta = parse_response.get("metadata", {})
print(
f"Done – files={meta.get('total_files', 1)} | "
f"completed={meta.get('completed', '?')} | "
f"failed={meta.get('failed', 0)} | "
f"time={meta.get('total_processing_time_ms', '?')}ms | "
f"wall={elapsed:.2f}s"
)
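If you would rather not maintain MIME_MAP by hand, the standard library's mimetypes module can guess most common types from the filename. This sketch keeps the same application/octet-stream fallback as the code above:

```python
import mimetypes

def guess_media_type(filename: str) -> str:
    # mimetypes covers the common extensions (pdf, png, csv, ...);
    # fall back to a generic binary type for anything it can't identify.
    mime, _ = mimetypes.guess_type(filename)
    return mime or "application/octet-stream"
```

Note that coverage of newer types (e.g. text/markdown) varies between Python versions, which is why an explicit map can still be the safer choice.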
Step 4b — Batch parse (optional)
Use this instead of Step 4a to parse multiple files in parallel with a thread pool. Add all your file paths to FILE_PATHS.
Run either Step 4a or Step 4b — not both. If you use this batch step, update Step 6 to iterate over parse_responses (plural) instead of parse_response.
from concurrent.futures import ThreadPoolExecutor, as_completed
FILE_PATHS = [
"/content/sample_data/sample1.pdf",
"/content/sample_data/sample2.pdf",
# Add more file paths here
]
def parse_single_file(file_path):
path = Path(file_path)
mime = MIME_MAP.get(path.suffix.lstrip(".").lower(), "application/octet-stream")
encoded = base64.b64encode(path.read_bytes()).decode()
print(f"Parsing '{path.name}' ({path.stat().st_size // 1024} KB)...")
start_t = time.time()
resp = requests.post(
PARSE_API_URL,
headers={"Content-Type": "application/json", "x-api-key": API_KEY},
json={
"files": [{
"url": f"data:{mime};base64,{encoded}",
"filename": path.name,
"media_type": mime,
}]
},
timeout=300,
)
elapsed = time.time() - start_t
resp.raise_for_status()
parse_response = resp.json()
meta = parse_response.get("metadata", {})
print(
f"Done – '{path.name}' | "
f"completed={meta.get('completed', '?')} | "
f"failed={meta.get('failed', 0)} | "
f"time={meta.get('total_processing_time_ms', '?')}ms | "
f"wall={elapsed:.2f}s"
)
return parse_response
parse_responses = []
with ThreadPoolExecutor(max_workers=min(8, len(FILE_PATHS))) as executor:
futures = {executor.submit(parse_single_file, fp): fp for fp in FILE_PATHS}
for future in as_completed(futures):
parse_responses.append(future.result())
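If you used this batch step, Step 6 as written expects a single parse_response. One way to keep Step 6 unchanged is to merge the per-file responses into one dict with a combined results list. The merge_parse_responses helper below is illustrative, not part of the API:

```python
def merge_parse_responses(responses):
    # Combine the "results" lists from each per-file response so the
    # Step 6 extraction loop can run unchanged over a single dict.
    merged = {"results": []}
    for r in responses:
        merged["results"].extend(r.get("results", []))
    return merged
```

After the batch step, run parse_response = merge_parse_responses(parse_responses) and continue with Steps 5-7 as written.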
Step 5 — Inspect the raw response (optional)
Print the full JSON to explore the response schema or debug issues. You can skip this step — it has no side effects.
import json
print(json.dumps(parse_response, indent=2))
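For large documents the full dump can run to thousands of lines. A lighter option is to print one status line per file and skip the page text; summarize_results is just an illustrative helper:

```python
def summarize_results(parse_response):
    # One line per file: filename and parse status, without the page text.
    return [
        f"{r.get('filename', '?')}: {r.get('status', '?')}"
        for r in parse_response.get("results", [])
    ]
```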
Step 6 — Extract plain text
This walks the parse response and stitches all page text into a single document_text string. Pages are labeled [Page N] so the LLM can reference them. Files that failed to parse are skipped with a warning.
texts = []
for result in parse_response.get("results", []):
if result.get("status") not in ("success", "completed"):
print(f"Warning: '{result.get('filename')}' status={result.get('status')}, skipping.")
continue
raw = result.get("content") or {}
if not isinstance(raw, dict):
raw = {"text": raw}
pages = raw.get("pages") or []
if pages:
texts.append(
"\n\n".join(
f"[Page {p.get('page', '?')}]\n{p['text'].strip()}"
for p in pages
if p.get("text", "").strip()
)
)
print(f" '{result['filename']}': {len(pages)} pages")
elif raw.get("text", "").strip():
texts.append(raw["text"].strip())
document_text = "\n\n".join(texts)
assert document_text, "No text extracted."
print(f"Extracted ~{len(document_text):,} characters.")
You are responsible for context management beyond this point. If document_text is larger than your model’s context window, you must truncate, chunk, or summarize it before sending. For large documents, consider splitting by page and processing in batches, or using a retrieval step (e.g. embeddings + vector search) to select only the relevant sections.
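As a rough sketch of that advice, the helpers below truncate or chunk document_text by a character budget. A common rule of thumb is about 4 characters per token for English text, but the exact ratio depends on the model's tokenizer, so treat max_chars as an approximation:

```python
def truncate_to_chars(text: str, max_chars: int) -> str:
    # Crude character-based truncation; marks the cut so the LLM
    # knows the document is incomplete.
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n\n[...document truncated...]"

def chunk_by_chars(text: str, chunk_size: int):
    # Split into fixed-size character chunks for batch processing.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

For page-aware splitting, build the chunks from the per-page texts in Step 6 instead of slicing the joined string, so no chunk cuts a page in half.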
Step 7 — Ask the LLM
Send document_text plus your QUESTION to the configured model via the Sciforium OpenAI-compatible gateway.
client = OpenAI(api_key=API_KEY, base_url=LLM_BASE_URL)
response = client.chat.completions.create(
model=MODEL,
messages=[
{
"role": "system",
"content": "You are a helpful assistant. Answer questions based on the document content provided. Be concise and accurate.",
},
{
"role": "user",
"content": f"<document>\n{document_text}\n</document>\n\n{QUESTION}",
},
],
)
answer = response.choices[0].message.content
print(answer)
To ask multiple questions without re-parsing, just change QUESTION and re-run this cell.
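If you prefer a reusable helper over editing the cell, the sketch below rebuilds the same system/user message pair used in Step 7, so each new question is a single chat call with no re-parse. build_messages is illustrative, not part of the SDK:

```python
def build_messages(question: str, document_text: str):
    # Assembles the Step 7 message pair: a fixed system prompt plus the
    # document wrapped in <document> tags followed by the question.
    return [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. Answer questions based on "
                "the document content provided. Be concise and accurate."
            ),
        },
        {
            "role": "user",
            "content": f"<document>\n{document_text}\n</document>\n\n{question}",
        },
    ]
```

Usage: client.chat.completions.create(model=MODEL, messages=build_messages("Who is the author?", document_text)).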