Overview
This guide walks you through parsing one or more files with the Sciforium API, extracting plain text from the results, and sending that text to an LLM with a question.
| Step | What happens |
|---|---|
| 1 | Configure — set your file path, question, model, and API key |
| 2 | Install dependencies |
| 3 | Initialize the Sciforium client |
| 4a | Parse a single file |
| 4b | *(Optional)* Batch-parse multiple files in parallel |
| 5 | *(Optional)* Inspect the raw parse response |
| 6 | Extract plain text from the parse results |
| 7 | Chat — send the extracted text + your question to the LLM |
Prerequisites
- A Sciforium API key — get one at console.sciforium.com
- Python 3.8+
- The file(s) you want to parse accessible on disk (or uploaded to Colab — see the note below)
If you’re running in Google Colab, click the folder icon in the left sidebar, upload your file, then right-click it and select Copy path to use in your code. Uploaded files are deleted when the runtime disconnects.
Step 1 — Configuration
Set the four variables below before running anything else.
- FILE_PATH — absolute path to the file you want to parse. Supported formats: PDF, DOCX, DOC, TXT, MD, CSV, HTML, JSON, and any other UTF-8 encoded text file.
- QUESTION — the question you want to ask the LLM about the document.
- MODEL — the model identifier (e.g. openai/gpt-oss-120b, anthropic/claude-sonnet-4-6).
- SCIFORIUM_API_KEY — your Sciforium API key.
FILE_PATH = "/content/sample_data/sample.pdf"
QUESTION = "Extract the candidate's email address and latest company."
MODEL = "openai/gpt-oss-120b"
SCIFORIUM_API_KEY = "your-api-key-here"
Never commit your API key to version control. Use environment variables or a secrets manager in production.
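One way to follow this advice is to read the key from the environment instead of hard-coding it. The sketch below assumes the key has been exported as SCIFORIUM_API_KEY; the helper name load_api_key is hypothetical and not part of any Sciforium SDK.

```python
import os


def load_api_key(env_var: str = "SCIFORIUM_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

With this in place, you could write SCIFORIUM_API_KEY = load_api_key() in the configuration instead of pasting the key into the file.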
Step 2 — Install dependencies
Run this once if openai or requests aren’t already installed.
pip install openai requests
Step 3 — Initialize the client
This constructs the Sciforium endpoint URLs and resolves the API key. The SCIFORIUM_API_KEY environment variable takes precedence when it is set; otherwise the value from Step 1 is used.
import base64, os
from pathlib import Path
import requests
from openai import OpenAI
BASE_URL = os.environ.get("SCIFORIUM_BASE_URL", "https://api.sciforium.com").rstrip("/")
PARSE_API_URL = BASE_URL + "/api/attachments/parse"
LLM_BASE_URL = BASE_URL + "/v1"
API_KEY = os.environ.get("SCIFORIUM_API_KEY", SCIFORIUM_API_KEY)
assert API_KEY, "Set SCIFORIUM_API_KEY in the Config cell or as an environment variable."
Step 4a — Parse a single file
Reads the file from FILE_PATH, base64-encodes it, and POSTs it to the Sciforium parse endpoint. The response contains structured content (pages, text, metadata) for the file.
import time

# Map file extensions to the MIME type sent with the request.
MIME_MAP = {
    "pdf": "application/pdf",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "doc": "application/msword",
    "txt": "text/plain",
    "md": "text/markdown",
    "csv": "text/csv",
    "html": "text/html",
    "json": "application/json",
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
}

path = Path(FILE_PATH)
mime = MIME_MAP.get(path.suffix.lstrip(".").lower(), "application/octet-stream")
encoded = base64.b64encode(path.read_bytes()).decode()

print(f"Parsing '{path.name}' ({path.stat().st_size // 1024} KB)...")
start_t = time.time()
resp = requests.post(
    PARSE_API_URL,
    headers={"Content-Type": "application/json", "x-api-key": API_KEY},
    json={
        "files": [{
            "url": f"data:{mime};base64,{encoded}",
            "filename": path.name,
            "media_type": mime,
        }]
    },
    timeout=300,
)
elapsed = time.time() - start_t
resp.raise_for_status()
parse_response = resp.json()

meta = parse_response.get("metadata", {})
print(
    f"Done – files={meta.get('total_files', 1)} | "
    f"completed={meta.get('completed', '?')} | "
    f"failed={meta.get('failed', 0)} | "
    f"time={meta.get('total_processing_time_ms', '?')}ms | "
    f"wall={elapsed:.2f}s"
)
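The request body in this step is a single file sent as a base64 data URL. As a minimal sketch, the body can be built by a small helper so it is easy to inspect before sending; build_parse_payload is a hypothetical name, and the field names simply mirror the request shown in this step.

```python
import base64
from pathlib import Path


def build_parse_payload(file_path: str, mime: str) -> dict:
    """Build the JSON body for one file, encoded as a base64 data URL."""
    path = Path(file_path)
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return {
        "files": [{
            "url": f"data:{mime};base64,{encoded}",
            "filename": path.name,
            "media_type": mime,
        }]
    }
```

Keep in mind that base64 encoding grows the payload to roughly 4/3 of the original file size, so large files produce noticeably larger requests.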
Step 4b — Batch parse (optional)
Use this instead of Step 4a to parse multiple files in parallel with a thread pool. Add all your file paths to FILE_PATHS.
Run either Step 4a or Step 4b — not both. If you use this batch step, update Step 6 to iterate over parse_responses (plural) instead of parse_response.
from concurrent.futures import ThreadPoolExecutor, as_completed

FILE_PATHS = [
    "/content/sample_data/sample1.pdf",
    "/content/sample_data/sample2.pdf",
    # Add more file paths here
]


def parse_single_file(file_path):
    """Parse one file and return the raw parse response."""
    path = Path(file_path)
    mime = MIME_MAP.get(path.suffix.lstrip(".").lower(), "application/octet-stream")
    encoded = base64.b64encode(path.read_bytes()).decode()

    print(f"Parsing '{path.name}' ({path.stat().st_size // 1024} KB)...")
    start_t = time.time()
    resp = requests.post(
        PARSE_API_URL,
        headers={"Content-Type": "application/json", "x-api-key": API_KEY},
        json={
            "files": [{
                "url": f"data:{mime};base64,{encoded}",
                "filename": path.name,
                "media_type": mime,
            }]
        },
        timeout=300,
    )
    elapsed = time.time() - start_t
    resp.raise_for_status()
    parse_response = resp.json()

    meta = parse_response.get("metadata", {})
    print(
        f"Done – '{path.name}' | "
        f"completed={meta.get('completed', '?')} | "
        f"failed={meta.get('failed', 0)} | "
        f"time={meta.get('total_processing_time_ms', '?')}ms | "
        f"wall={elapsed:.2f}s"
    )
    return parse_response


# Responses are collected in completion order, not in FILE_PATHS order.
parse_responses = []
with ThreadPoolExecutor(max_workers=min(8, len(FILE_PATHS))) as executor:
    futures = {executor.submit(parse_single_file, fp): fp for fp in FILE_PATHS}
    for future in as_completed(futures):
        parse_responses.append(future.result())
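In the loop above, future.result() re-raises any exception from a worker, so one failed request stops the whole batch. If you would rather keep the successful responses, a variant along these lines may help. It is a sketch only: parse_all is a hypothetical helper, and parse_fn stands for any function that takes a file path and returns a response, such as parse_single_file.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_all(file_paths, parse_fn, max_workers=8):
    """Run parse_fn over file_paths in parallel.

    Returns (responses, errors): responses keeps the order of file_paths
    (None where a file failed); errors maps each failed path to its exception.
    """
    responses = [None] * len(file_paths)
    errors = {}
    if not file_paths:
        return responses, errors  # ThreadPoolExecutor rejects max_workers=0
    workers = min(max_workers, len(file_paths))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(parse_fn, fp) for fp in file_paths]
        for i, future in enumerate(futures):
            try:
                responses[i] = future.result()
            except Exception as exc:  # e.g. an HTTP error from raise_for_status()
                errors[file_paths[i]] = exc
    return responses, errors
```

A call such as parse_all(FILE_PATHS, parse_single_file) would then return the responses in input order, plus a dictionary of the paths that failed and why.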
Step 5 — Inspect the raw response (optional)
Print the full JSON to explore the response schema or debug issues. You can skip this step — it has no side effects.
import json
print(json.dumps(parse_response, indent=2))
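For long documents the full dump can run to thousands of lines. A small helper that shortens long strings before printing keeps the structure visible; this is a sketch, and preview is a hypothetical name with an arbitrary default length.

```python
def preview(obj, max_chars=200):
    """Return a copy of obj with long strings shortened, for readable printing."""
    if isinstance(obj, str):
        if len(obj) <= max_chars:
            return obj
        return obj[:max_chars] + f"... [{len(obj) - max_chars} more chars]"
    if isinstance(obj, dict):
        return {key: preview(value, max_chars) for key, value in obj.items()}
    if isinstance(obj, list):
        return [preview(value, max_chars) for value in obj]
    return obj  # numbers, booleans, and None pass through unchanged
```

You could then print json.dumps(preview(parse_response), indent=2) instead of the full response. The original parse_response is left unmodified.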
Step 6 — Extract plain text
This walks the parse response and stitches all page text into a single document_text string. Pages are labeled [Page N] so the LLM can reference them. Files that failed to parse are skipped with a warning.
texts = []
for result in parse_response.get("results", []):
    if result.get("status") not in ("success", "completed"):
        print(f"Warning: '{result.get('filename')}' status={result.get('status')}, skipping.")
        continue

    # Content may be a dict (with pages and/or text) or a bare string.
    raw = result.get("content") or {}
    if not isinstance(raw, dict):
        raw = {"text": raw}

    pages = raw.get("pages") or []
    if pages:
        texts.append(
            "\n\n".join(
                f"[Page {p.get('page', '?')}]\n{p['text'].strip()}"
                for p in pages
                if p.get("text", "").strip()
            )
        )
        print(f" '{result['filename']}': {len(pages)} pages")
    elif raw.get("text", "").strip():
        texts.append(raw["text"].strip())

document_text = "\n\n".join(texts)
assert document_text, "No text extracted."
print(f"Extracted ~{len(document_text):,} characters.")
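If you used the batch step (Step 4b), the same logic has to run once per entry in parse_responses. One possible arrangement is to wrap the extraction in a function and join the results. The sketch below assumes the response shape implied by the code above (results, status, content, pages, text); extract_text and extract_all are hypothetical helper names.

```python
def extract_text(parse_response: dict) -> str:
    """Collect page text from one parse response, skipping failed files."""
    texts = []
    for result in parse_response.get("results", []):
        if result.get("status") not in ("success", "completed"):
            continue
        raw = result.get("content") or {}
        if not isinstance(raw, dict):
            raw = {"text": raw}
        pages = raw.get("pages") or []
        if pages:
            texts.append("\n\n".join(
                f"[Page {p.get('page', '?')}]\n{p['text'].strip()}"
                for p in pages
                if (p.get("text") or "").strip()
            ))
        elif (raw.get("text") or "").strip():
            texts.append(raw["text"].strip())
    return "\n\n".join(texts)


def extract_all(parse_responses: list) -> str:
    """Join the text of several parse responses into one string."""
    parts = [extract_text(response) for response in parse_responses if response]
    return "\n\n".join(part for part in parts if part)
```

With the batch step, document_text = extract_all(parse_responses) would take the place of the single-response loop.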
You are responsible for context management beyond this point. If document_text is larger than your model’s context window, you must truncate, chunk, or summarize it before sending. For large documents, consider splitting by page and processing in batches, or using a retrieval step (e.g. embeddings + vector search) to select only the relevant sections.
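As a rough illustration of the chunking option, the sketch below splits text on blank lines into pieces that stay under a character budget. Character count is only a crude proxy for tokens, the 8000 default is an arbitrary assumption rather than a limit of any particular model, and chunk_text is a hypothetical helper.

```python
def chunk_text(text: str, max_chars: int = 8000) -> list:
    """Split text into chunks of at most max_chars, breaking on blank lines where possible."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    chunks, current = [], ""
    for block in text.split("\n\n"):
        # A single block longer than the budget is cut into fixed-size slices.
        while len(block) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(block[:max_chars])
            block = block[max_chars:]
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = block
    if current:
        chunks.append(current)
    return chunks
```

Because the extracted text separates pages with blank lines, each [Page N] label stays attached to the start of its page text unless a single page exceeds the budget on its own.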
Step 7 — Ask the LLM
Send document_text plus your QUESTION to the configured model via the Sciforium OpenAI-compatible gateway.
client = OpenAI(api_key=API_KEY, base_url=LLM_BASE_URL)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Answer questions based on the document content provided. Be concise and accurate.",
        },
        {
            "role": "user",
            "content": f"<document>\n{document_text}\n</document>\n\n{QUESTION}",
        },
    ],
)

answer = response.choices[0].message.content
print(answer)
To ask multiple questions without re-parsing, just change QUESTION and re-run this cell.
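If you have several questions up front, you could also loop over them in one go. The sketch below reuses the prompt layout from this step; build_messages and ask_all are hypothetical helpers, and client is expected to be the OpenAI client created above.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer questions based on the document "
    "content provided. Be concise and accurate."
)


def build_messages(document_text: str, question: str) -> list:
    """Build the chat messages for one question about the document."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"<document>\n{document_text}\n</document>\n\n{question}"},
    ]


def ask_all(client, model: str, document_text: str, questions: list) -> dict:
    """Send each question as a separate request and map question to answer."""
    answers = {}
    for question in questions:
        response = client.chat.completions.create(
            model=model,
            messages=build_messages(document_text, question),
        )
        answers[question] = response.choices[0].message.content
    return answers
```

Each question is sent as an independent request, so the full document text is resent every time; for large documents this multiplies token usage by the number of questions.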