
Overview

This guide walks you through parsing one or more files with the Sciforium API, extracting plain text from the results, and sending that text to an LLM with a question.
  • Step 1: Configure — set your file path, question, model, and API key
  • Step 2: Install dependencies
  • Step 3: Initialize the Sciforium client
  • Step 4a: Parse a single file
  • Step 4b: (Optional) Batch-parse multiple files in parallel
  • Step 5: (Optional) Inspect the raw parse response
  • Step 6: Extract plain text from the parse results
  • Step 7: Chat — send the extracted text + your question to the LLM

Prerequisites

  • A Sciforium API key — get one at console.sciforium.com
  • Python 3.8+
  • The file(s) you want to parse accessible on disk (or uploaded to Colab — see the note below)
If you’re running in Google Colab, click the folder icon in the left sidebar, upload your file, then right-click it and select Copy path to use in your code. Uploaded files are deleted when the runtime disconnects.
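Before moving on, it can help to confirm that the path you copied actually resolves. This is a small sanity-check sketch, not part of the Sciforium API; the check_path name is our own:

```python
from pathlib import Path

def check_path(file_path: str) -> bool:
    """Return True if the file exists and is non-empty."""
    p = Path(file_path)
    return p.is_file() and p.stat().st_size > 0

# Example: check_path("/content/sample_data/sample.pdf")
```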

Step 1 — Configuration

Set the four variables below before running anything else.
  • FILE_PATH — absolute path to the file you want to parse. Supported formats: PDF, DOCX, DOC, TXT, MD, CSV, HTML, JSON, PNG, JPG.
  • QUESTION — the question you want to ask the LLM about the document.
  • MODEL — the LLM model identifier (e.g. openai/gpt-oss-120b, anthropic/claude-sonnet-4-6).
  • SCIFORIUM_API_KEY — your Sciforium API key.
FILE_PATH = "/content/sample_data/sample.pdf"
QUESTION  = "Extract the candidate's email address and latest company."
MODEL     = "openai/gpt-oss-120b"

SCIFORIUM_API_KEY = "your-api-key-here"
Never commit your API key to version control. Use environment variables or a secrets manager in production.
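One way to avoid hardcoding the key is to read it from the environment and, in a notebook, fall back to a hidden interactive prompt. This is a sketch; resolve_api_key is a hypothetical helper name:

```python
import getpass
import os

def resolve_api_key(prompt: bool = True) -> str:
    """Read the key from the environment; optionally prompt if it's missing."""
    key = os.environ.get("SCIFORIUM_API_KEY", "")
    if not key and prompt:
        # getpass hides the input, so the key never appears in cell output.
        key = getpass.getpass("Sciforium API key: ")
    return key
```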

Step 2 — Install dependencies

Run this once if openai or requests aren’t already installed.
pip install openai requests

Step 3 — Initialize the client

This constructs the Sciforium endpoint URLs and resolves the API key: the SCIFORIUM_API_KEY environment variable takes precedence, falling back to the value you set in Step 1.
import base64, os
from pathlib import Path
import requests
from openai import OpenAI

BASE_URL      = os.environ.get("SCIFORIUM_BASE_URL", "https://api.sciforium.com").rstrip("/")
PARSE_API_URL = BASE_URL + "/api/attachments/parse"
LLM_BASE_URL  = BASE_URL + "/v1"
API_KEY       = os.environ.get("SCIFORIUM_API_KEY", SCIFORIUM_API_KEY)

assert API_KEY, "Set SCIFORIUM_API_KEY in the Config cell or as an environment variable."
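The rstrip("/") guard means a trailing slash in SCIFORIUM_BASE_URL won't produce a doubled slash in the derived URLs. A quick illustration:

```python
# With or without a trailing slash, the derived URLs come out the same.
base = "https://api.sciforium.com/".rstrip("/")
assert base + "/v1" == "https://api.sciforium.com/v1"
assert base + "/api/attachments/parse" == "https://api.sciforium.com/api/attachments/parse"
```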

Step 4a — Parse a single file

Reads the file from FILE_PATH, base64-encodes it, and POSTs it to the Sciforium parse endpoint. The response contains structured content (pages, text, metadata) for the file.
import time

MIME_MAP = {
    "pdf": "application/pdf",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "doc": "application/msword",
    "txt": "text/plain",
    "md": "text/markdown",
    "csv": "text/csv",
    "html": "text/html",
    "json": "application/json",
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
}

path = Path(FILE_PATH)
mime = MIME_MAP.get(path.suffix.lstrip(".").lower(), "application/octet-stream")
encoded = base64.b64encode(path.read_bytes()).decode()
print(f"Parsing '{path.name}' ({path.stat().st_size // 1024} KB)...")

start_t = time.time()
resp = requests.post(
    PARSE_API_URL,
    headers={"Content-Type": "application/json", "x-api-key": API_KEY},
    json={
        "files": [{
            "url": f"data:{mime};base64,{encoded}",
            "filename": path.name,
            "media_type": mime,
        }]
    },
    timeout=300,
)
elapsed = time.time() - start_t
resp.raise_for_status()
parse_response = resp.json()

meta = parse_response.get("metadata", {})
print(
    f"Done – files={meta.get('total_files', 1)} | "
    f"completed={meta.get('completed', '?')} | "
    f"failed={meta.get('failed', 0)} | "
    f"time={meta.get('total_processing_time_ms', '?')}ms | "
    f"wall={elapsed:.2f}s"
)
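Network calls to the parse endpoint can fail transiently. If you see timeouts or 5xx responses, a simple retry wrapper with exponential backoff is one option. This is a sketch, not part of the Sciforium API; post_with_retries is our own name, and the post_fn parameter exists only so the policy can be tested without a live network:

```python
import time

import requests

def post_with_retries(url, post_fn=requests.post, max_retries=3, backoff_s=2.0, **kwargs):
    """POST with exponential backoff on connection errors and 5xx responses.

    Returns the last response (success, 4xx, or final 5xx); re-raises the
    last connection error if every attempt failed at the transport level.
    """
    resp = None
    for attempt in range(max_retries):
        try:
            resp = post_fn(url, **kwargs)
            if resp.status_code < 500:
                return resp  # success, or a 4xx we should not retry
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
        time.sleep(backoff_s * (2 ** attempt))
    return resp
```

Usage: replace the requests.post(...) call above with post_with_retries(PARSE_API_URL, headers=..., json=..., timeout=300).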

Step 4b — Batch parse (optional)

Use this instead of Step 4a to parse multiple files in parallel with a thread pool. Add all your file paths to FILE_PATHS.
Run either Step 4a or Step 4b — not both. If you use this batch step, update Step 6 to iterate over parse_responses (plural) instead of parse_response.
from concurrent.futures import ThreadPoolExecutor, as_completed

FILE_PATHS = [
    "/content/sample_data/sample1.pdf",
    "/content/sample_data/sample2.pdf",
    # Add more file paths here
]

def parse_single_file(file_path):
    path = Path(file_path)
    mime = MIME_MAP.get(path.suffix.lstrip(".").lower(), "application/octet-stream")
    encoded = base64.b64encode(path.read_bytes()).decode()
    print(f"Parsing '{path.name}' ({path.stat().st_size // 1024} KB)...")

    start_t = time.time()
    resp = requests.post(
        PARSE_API_URL,
        headers={"Content-Type": "application/json", "x-api-key": API_KEY},
        json={
            "files": [{
                "url": f"data:{mime};base64,{encoded}",
                "filename": path.name,
                "media_type": mime,
            }]
        },
        timeout=300,
    )
    elapsed = time.time() - start_t
    resp.raise_for_status()
    parse_response = resp.json()

    meta = parse_response.get("metadata", {})
    print(
        f"Done – '{path.name}' | "
        f"completed={meta.get('completed', '?')} | "
        f"failed={meta.get('failed', 0)} | "
        f"time={meta.get('total_processing_time_ms', '?')}ms | "
        f"wall={elapsed:.2f}s"
    )
    return parse_response

parse_responses = []
with ThreadPoolExecutor(max_workers=min(8, len(FILE_PATHS))) as executor:
    futures = {executor.submit(parse_single_file, fp): fp for fp in FILE_PATHS}
    for future in as_completed(futures):
        parse_responses.append(future.result())
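As the callout above notes, Step 6 needs adjusting after a batch run. One straightforward approach is to flatten every response's results list into one list, then point the Step 6 loop at it. merge_parse_results is a hypothetical helper name:

```python
def merge_parse_results(parse_responses):
    """Flatten a list of parse responses into one list of per-file results."""
    all_results = []
    for pr in parse_responses:
        all_results.extend(pr.get("results", []))
    return all_results

# Usage after Step 4b:
#   all_results = merge_parse_results(parse_responses)
# then in Step 6, iterate over all_results instead of
# parse_response.get("results", []).
```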

Step 5 — Inspect the raw response (optional)

Print the full JSON to explore the response schema or debug issues. You can skip this step — it has no side effects. (If you ran Step 4b, print parse_responses instead of parse_response.)
import json
print(json.dumps(parse_response, indent=2))
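For large documents, the full JSON dump can be unwieldy. A lighter-weight alternative is a one-line-per-file summary; this sketch assumes the same response shape as above, and summarize_parse_response is our own name:

```python
def summarize_parse_response(parse_response):
    """Print one line per file instead of the full JSON payload."""
    for result in parse_response.get("results", []):
        content = result.get("content") or {}
        pages = content.get("pages") if isinstance(content, dict) else None
        print(
            f"{result.get('filename', '?')}: "
            f"status={result.get('status', '?')}, "
            f"pages={len(pages) if pages else 'n/a'}"
        )
```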

Step 6 — Extract text from parse results

This walks the parse response and stitches all page text into a single document_text string. Pages are labeled [Page N] so the LLM can reference them. Files that failed to parse are skipped with a warning.
texts = []
for result in parse_response.get("results", []):
    if result.get("status") not in ("success", "completed"):
        print(f"Warning: '{result.get('filename')}' status={result.get('status')}, skipping.")
        continue

    raw = result.get("content") or {}
    if not isinstance(raw, dict):
        raw = {"text": raw}

    pages = raw.get("pages") or []
    if pages:
        texts.append(
            "\n\n".join(
                f"[Page {p.get('page', '?')}]\n{p['text'].strip()}"
                for p in pages
                if p.get("text", "").strip()
            )
        )
        print(f"  '{result['filename']}': {len(pages)} pages")
    elif raw.get("text", "").strip():
        texts.append(raw["text"].strip())

document_text = "\n\n".join(texts)
assert document_text, "No text extracted."
print(f"Extracted ~{len(document_text):,} characters.")
You are responsible for context management beyond this point. If document_text is larger than your model’s context window, you must truncate, chunk, or summarize it before sending. For large documents, consider splitting by page and processing in batches, or using a retrieval step (e.g. embeddings + vector search) to select only the relevant sections.
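If you need to chunk, a minimal character-based splitter is enough to get started; real pipelines usually split on page markers or token counts instead. chunk_text is a sketch of our own, not a library function:

```python
def chunk_text(text, max_chars=12000, overlap=500):
    """Split text into overlapping character windows.

    A crude stand-in for token-aware chunking: the overlap keeps context
    that straddles a chunk boundary from being lost entirely.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks
```

You could then run Step 7 once per chunk and combine the answers, or feed each chunk to a summarization pass first.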

Step 7 — Ask the LLM

Send document_text plus your QUESTION to the configured model via the Sciforium OpenAI-compatible gateway.
client = OpenAI(api_key=API_KEY, base_url=LLM_BASE_URL)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Answer questions based on the document content provided. Be concise and accurate.",
        },
        {
            "role": "user",
            "content": f"<document>\n{document_text}\n</document>\n\n{QUESTION}",
        },
    ],
)

answer = response.choices[0].message.content
print(answer)
To ask multiple questions without re-parsing, just change QUESTION and re-run this cell.
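If you have several questions up front, you can loop over them in one cell, reusing the already-extracted document_text. This is a sketch; ask_many is a hypothetical helper built on the same chat.completions.create call as Step 7:

```python
def ask_many(client, model, document_text, questions):
    """Ask several questions against the same parsed text without re-parsing."""
    answers = {}
    for q in questions:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "Answer based only on the document provided."},
                {"role": "user",
                 "content": f"<document>\n{document_text}\n</document>\n\n{q}"},
            ],
        )
        answers[q] = response.choices[0].message.content
    return answers

# Usage:
#   answers = ask_many(client, MODEL, document_text, ["Question 1?", "Question 2?"])
```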