Overview
This guide walks you through parsing one or more files with the Sciforium API, extracting plain text from the results, and sending that text to an LLM with a question.
| Step | What happens |
|---|---|
| 1 | Configure — set your file path, question, model, and API key |
| 2 | Install dependencies |
| 3 | Initialize the Sciforium client |
| 4a | Parse a single file |
| 4b | (Optional) Batch-parse multiple files in parallel |
| 5 | (Optional) Inspect the raw parse response |
| 6 | Extract plain text from the parse results |
| 7 | Chat — send the extracted text + your question to the LLM |
Prerequisites
- A Sciforium API key — get one at console.sciforium.com
- Python 3.8+
- The file(s) you want to parse accessible on disk (or uploaded to Colab — see the note below)
If you’re running in Google Colab, click the folder icon in the left sidebar, upload your file, then right-click it and select Copy path to use in your code. Uploaded files are deleted when the runtime disconnects.
Step 1 — Configuration
Set the four variables below before running anything else.
FILE_PATH — absolute path to the file you want to parse. Supported formats: PDF, DOCX, DOC, TXT, MD, CSV, HTML, JSON, PNG, JPG.
QUESTION — the question you want to ask the LLM about the document.
MODEL — the LLM model identifier (e.g. openai/gpt-oss-120b, anthropic/claude-sonnet-4-6).
SCIFORIUM_API_KEY — your Sciforium API key.
FILE_PATH = "/content/sample_data/sample.pdf"
QUESTION = "Extract the candidate's email address and latest company."
MODEL = "openai/gpt-oss-120b"
SCIFORIUM_API_KEY = "your-api-key-here"
Never commit your API key to version control. Use environment variables or a secrets manager in production.
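As a sketch of that advice, you can resolve the key from the environment first and fall back to a hidden interactive prompt. The resolve_api_key helper below is illustrative, not part of any Sciforium SDK:

```python
import os
from getpass import getpass

def resolve_api_key(env=None, prompt=getpass):
    # Prefer the environment variable; otherwise prompt interactively.
    # getpass hides the typed key in terminals and notebooks.
    env = os.environ if env is None else env
    return env.get("SCIFORIUM_API_KEY") or prompt("Sciforium API key: ")
```

In a notebook you would then write SCIFORIUM_API_KEY = resolve_api_key() instead of pasting the key into the cell.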
Step 2 — Install dependencies
Run this once if openai or requests aren’t already installed.
pip install openai requests
Step 3 — Initialize the client
This constructs the Sciforium endpoint URLs and resolves the API key, preferring the SCIFORIUM_API_KEY environment variable and falling back to the value you set in Step 1.
import base64, os
from pathlib import Path
import requests
from openai import OpenAI
BASE_URL = os.environ.get("SCIFORIUM_BASE_URL", "https://api.sciforium.com").rstrip("/")
PARSE_API_URL = BASE_URL + "/api/attachments/parse"
LLM_BASE_URL = BASE_URL + "/v1"
API_KEY = os.environ.get("SCIFORIUM_API_KEY", SCIFORIUM_API_KEY)
assert API_KEY, "Set SCIFORIUM_API_KEY in the Config cell or as an environment variable."
Step 4a — Parse a single file
Reads the file from FILE_PATH, base64-encodes it, and POSTs it to the Sciforium parse endpoint. The response contains structured content (pages, text, metadata) for the file.
import time
MIME_MAP = {
"pdf": "application/pdf",
"docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"doc": "application/msword",
"txt": "text/plain",
"md": "text/markdown",
"csv": "text/csv",
"html": "text/html",
"json": "application/json",
"png": "image/png",
"jpg": "image/jpeg",
"jpeg": "image/jpeg",
}
path = Path(FILE_PATH)
mime = MIME_MAP.get(path.suffix.lstrip(".").lower(), "application/octet-stream")
encoded = base64.b64encode(path.read_bytes()).decode()
print(f"Parsing '{path.name}' ({path.stat().st_size // 1024} KB)...")
start_t = time.time()
resp = requests.post(
PARSE_API_URL,
headers={"Content-Type": "application/json", "x-api-key": API_KEY},
json={
"files": [{
"url": f"data:{mime};base64,{encoded}",
"filename": path.name,
"media_type": mime,
}]
},
timeout=300,
)
elapsed = time.time() - start_t
resp.raise_for_status()
parse_response = resp.json()
meta = parse_response.get("metadata", {})
print(
f"Done – files={meta.get('total_files', 1)} | "
f"completed={meta.get('completed', '?')} | "
f"failed={meta.get('failed', 0)} | "
f"time={meta.get('total_processing_time_ms', '?')}ms | "
f"wall={elapsed:.2f}s"
)
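If you would rather not maintain MIME_MAP by hand, the standard library's mimetypes module can guess most common types from the filename. This sketch keeps the same application/octet-stream fallback as the code above:

```python
import mimetypes

def guess_media_type(filename: str) -> str:
    # mimetypes covers the common extensions (pdf, png, csv, ...);
    # fall back to a generic binary type for anything it can't identify.
    mime, _ = mimetypes.guess_type(filename)
    return mime or "application/octet-stream"
```

Note that coverage of newer types (e.g. text/markdown) varies between Python versions, which is why an explicit map can still be the safer choice.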
Step 4b — Batch parse (optional)
Use this instead of Step 4a to parse multiple files in parallel with a thread pool. Add all your file paths to FILE_PATHS.
Run either Step 4a or Step 4b — not both. If you use this batch step, update Step 6 to iterate over parse_responses (plural) instead of parse_response.
from concurrent.futures import ThreadPoolExecutor, as_completed
FILE_PATHS = [
"/content/sample_data/sample1.pdf",
"/content/sample_data/sample2.pdf",
# Add more file paths here
]
def parse_single_file(file_path):
path = Path(file_path)
mime = MIME_MAP.get(path.suffix.lstrip(".").lower(), "application/octet-stream")
encoded = base64.b64encode(path.read_bytes()).decode()
print(f"Parsing '{path.name}' ({path.stat().st_size // 1024} KB)...")
start_t = time.time()
resp = requests.post(
PARSE_API_URL,
headers={"Content-Type": "application/json", "x-api-key": API_KEY},
json={
"files": [{
"url": f"data:{mime};base64,{encoded}",
"filename": path.name,
"media_type": mime,
}]
},
timeout=300,
)
elapsed = time.time() - start_t
resp.raise_for_status()
parse_response = resp.json()
meta = parse_response.get("metadata", {})
print(
f"Done – '{path.name}' | "
f"completed={meta.get('completed', '?')} | "
f"failed={meta.get('failed', 0)} | "
f"time={meta.get('total_processing_time_ms', '?')}ms | "
f"wall={elapsed:.2f}s"
)
return parse_response
parse_responses = []
with ThreadPoolExecutor(max_workers=min(8, len(FILE_PATHS))) as executor:
futures = {executor.submit(parse_single_file, fp): fp for fp in FILE_PATHS}
for future in as_completed(futures):
parse_responses.append(future.result())
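If you used this batch step, Step 6 as written expects a single parse_response. One way to keep Step 6 unchanged is to merge the per-file responses into one dict with a combined results list. The merge_parse_responses helper below is illustrative, not part of the API:

```python
def merge_parse_responses(responses):
    # Combine the "results" lists from each per-file response so the
    # Step 6 extraction loop can run unchanged over a single dict.
    merged = {"results": []}
    for r in responses:
        merged["results"].extend(r.get("results", []))
    return merged
```

After the batch step, run parse_response = merge_parse_responses(parse_responses) and continue with Steps 5-7 as written.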
Step 5 — Inspect the raw response (optional)
Print the full JSON to explore the response schema or debug issues. You can skip this step — it has no side effects.
import json
print(json.dumps(parse_response, indent=2))
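For large documents the full dump can run to thousands of lines. A lighter option is to print one status line per file and skip the page text; summarize_results is just an illustrative helper:

```python
def summarize_results(parse_response):
    # One line per file: filename and parse status, without the page text.
    return [
        f"{r.get('filename', '?')}: {r.get('status', '?')}"
        for r in parse_response.get("results", [])
    ]
```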
Step 6 — Extract plain text
This walks the parse response and stitches all page text into a single document_text string. Pages are labeled [Page N] so the LLM can reference them. Files that failed to parse are skipped with a warning.
texts = []
for result in parse_response.get("results", []):
if result.get("status") not in ("success", "completed"):
print(f"Warning: '{result.get('filename')}' status={result.get('status')}, skipping.")
continue
raw = result.get("content") or {}
if not isinstance(raw, dict):
raw = {"text": raw}
pages = raw.get("pages") or []
if pages:
texts.append(
"\n\n".join(
f"[Page {p.get('page', '?')}]\n{p['text'].strip()}"
for p in pages
if p.get("text", "").strip()
)
)
print(f" '{result['filename']}': {len(pages)} pages")
elif raw.get("text", "").strip():
texts.append(raw["text"].strip())
document_text = "\n\n".join(texts)
assert document_text, "No text extracted."
print(f"Extracted ~{len(document_text):,} characters.")
You are responsible for context management beyond this point. If document_text is larger than your model’s context window, you must truncate, chunk, or summarize it before sending. For large documents, consider splitting by page and processing in batches, or using a retrieval step (e.g. embeddings + vector search) to select only the relevant sections.
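As a rough sketch of that advice, the helpers below truncate or chunk document_text by a character budget. A common rule of thumb is about 4 characters per token for English text, but the exact ratio depends on the model's tokenizer, so treat max_chars as an approximation:

```python
def truncate_to_chars(text: str, max_chars: int) -> str:
    # Crude character-based truncation; marks the cut so the LLM
    # knows the document is incomplete.
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n\n[...document truncated...]"

def chunk_by_chars(text: str, chunk_size: int):
    # Split into fixed-size character chunks for batch processing.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

For page-aware splitting, build the chunks from the per-page texts in Step 6 instead of slicing the joined string, so no chunk cuts a page in half.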
Step 7 — Ask the LLM
Send document_text plus your QUESTION to the configured model via the Sciforium OpenAI-compatible gateway.
client = OpenAI(api_key=API_KEY, base_url=LLM_BASE_URL)
response = client.chat.completions.create(
model=MODEL,
messages=[
{
"role": "system",
"content": "You are a helpful assistant. Answer questions based on the document content provided. Be concise and accurate.",
},
{
"role": "user",
"content": f"<document>\n{document_text}\n</document>\n\n{QUESTION}",
},
],
)
answer = response.choices[0].message.content
print(answer)
To ask multiple questions without re-parsing, just change QUESTION and re-run this cell.
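If you prefer a reusable helper over editing the cell, the sketch below rebuilds the same system/user message pair used in Step 7, so each new question is a single chat call with no re-parse. build_messages is illustrative, not part of the SDK:

```python
def build_messages(question: str, document_text: str):
    # Assembles the Step 7 message pair: a fixed system prompt plus the
    # document wrapped in <document> tags followed by the question.
    return [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. Answer questions based on "
                "the document content provided. Be concise and accurate."
            ),
        },
        {
            "role": "user",
            "content": f"<document>\n{document_text}\n</document>\n\n{question}",
        },
    ]
```

Usage: client.chat.completions.create(model=MODEL, messages=build_messages("Who is the author?", document_text)).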