Experimental: Contact Us
This surface is directional and may change before public release. Contact Spiral if you want to use it in an early workflow.
The Document Extension is intended for PDF, office document, HTML, plain-text, and document-image corpora. It should let users query document metadata, pages, chunks, OCR output, and retrieval artifacts without rendering or embedding everything eagerly.
Planned table functions
| Function | Purpose |
|---|---|
| documents.scan(path) | Discover documents or read a document manifest. |
| documents.pages(path) | Produce page-level rows and deferred page references. |
| documents.chunks(path) | Produce text chunks with page and span metadata. |
| documents.ocr(page_ref) | Materialize OCR text for selected pages. |
| documents.embeddings(chunk_ref) | Produce or read retrieval embeddings. |
These names are preview syntax. The functions are not registered in the current default CLI session.
Output shape
Page functions should produce rows like:
| Column | Meaning |
|---|---|
| doc_id | Stable document id. |
| path | Local path or object URI. |
| mime_type | Document MIME type. |
| page_number | One-based page number when applicable. |
| title | Optional document title. |
| page_ref | Deferred page reference. |
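One way to picture a deferred page reference is as a handle that carries identifying metadata and postpones the expensive work until it is asked for. The Python sketch below is only an illustration of that idea: `PageRef`, `materialize`, and `fake_ocr` are hypothetical names invented here, not part of the extension, and the real reference format may look quite different.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PageRef:
    """Hypothetical deferred page reference (not the extension's real type)."""
    doc_id: str
    page_number: int                  # one-based page number
    loader: Callable[[], str]         # expensive step, e.g. rendering or OCR
    _cached: Optional[str] = field(default=None, repr=False)

    def materialize(self) -> str:
        """Run the loader on first use and cache the result."""
        if self._cached is None:
            self._cached = self.loader()
        return self._cached


ocr_calls = []  # records which pages were actually processed


def fake_ocr(doc_id: str, page_number: int) -> Callable[[], str]:
    """Stand-in for a real OCR engine; it only logs the call and returns text."""
    def run() -> str:
        ocr_calls.append((doc_id, page_number))
        return f"text of {doc_id} page {page_number}"
    return run


pages = [PageRef("doc-a", n, fake_ocr("doc-a", n)) for n in (1, 2, 3)]

# Metadata-only filter: no loader runs here.
selected = [p for p in pages if p.page_number == 2]

# The expensive step runs only for the selected page, and only once.
text = selected[0].materialize()
text_again = selected[0].materialize()
```

After this runs, `ocr_calls` holds a single entry for page 2, which is the behavior the page-level rows are meant to enable: filter on metadata first, then pay for OCR only on the pages that survive the filter.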
Chunk functions should add text span information such as chunk_id,
page_number, char_start, char_end, and text.
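To make the span columns concrete, here is a minimal Python sketch of a fixed-size chunker that emits rows with those fields. It is an assumption-laden illustration rather than the extension's chunking policy: the `Chunk` class, the `chunk_page` helper, the `size` and `overlap` parameters, the `chunk_id` format, and the choice of an inclusive `char_start` with an exclusive `char_end` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    page_number: int  # one-based, matching the page rows
    char_start: int   # inclusive offset into the page text (assumed convention)
    char_end: int     # exclusive offset into the page text (assumed convention)
    text: str


def chunk_page(doc_id: str, page_number: int, page_text: str,
               size: int = 200, overlap: int = 50) -> list[Chunk]:
    """Split one page's text into fixed-size, overlapping chunks."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks = []
    step = size - overlap
    for index, start in enumerate(range(0, len(page_text), step)):
        end = min(start + size, len(page_text))
        chunks.append(Chunk(
            chunk_id=f"{doc_id}:{page_number}:{index}",  # illustrative id scheme
            page_number=page_number,
            char_start=start,
            char_end=end,
            text=page_text[start:end],
        ))
        if end == len(page_text):
            break  # the last chunk already reaches the end of the page
    return chunks
```

With these conventions, `page_text[c.char_start:c.char_end] == c.text` holds for every chunk, so a retrieval hit can be traced back to its exact location on the page.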
Example shape
SELECT doc_id, page_number, page_ref
FROM documents.pages('./corpus/')
WHERE mime_type = 'application/pdf';
Notes
- Metadata queries should avoid rendering pages or running OCR.
- OCR, chunking, and embedding policies should be explicit and reproducible.
- Retrieval-ready artifacts should document whether they are ordinary tables, Vortex files, or extension-managed artifacts.
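One possible reading of "explicit and reproducible" is that every derived artifact records exactly which policy produced it. The Python sketch below shows one way that could work, by hashing a canonical form of the policy settings into a short identifier. `ChunkPolicy`, its fields, and `policy_fingerprint` are hypothetical names used only for illustration; the extension does not define them.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkPolicy:
    # All field names and defaults are illustrative, not part of the extension.
    strategy: str = "fixed"
    size: int = 200
    overlap: int = 50


def policy_fingerprint(policy: ChunkPolicy) -> str:
    """Hash a canonical JSON form so that equal policies get equal ids."""
    canonical = json.dumps(asdict(policy), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


default_id = policy_fingerprint(ChunkPolicy())
wider_id = policy_fingerprint(ChunkPolicy(size=400))
```

Storing an identifier like this next to each chunk or embedding would let a later query tell whether two artifacts were produced under the same settings, without comparing the settings field by field.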