Experimental: Contact Us

This surface describes the intended direction and may change before public release. Contact Spiral if you want to use it in an early workflow.

The Document Extension is intended for PDF, office document, HTML, plain-text, and document-image corpora. It should let users query document metadata, pages, chunks, OCR output, and retrieval artifacts without rendering or embedding everything eagerly.

Planned table functions

| Function | Purpose |
| --- | --- |
| documents.scan(path) | Discover documents or read a document manifest. |
| documents.pages(path) | Produce page-level rows and deferred page references. |
| documents.chunks(path) | Produce text chunks with page and span metadata. |
| documents.ocr(page_ref) | Materialize OCR text for selected pages. |
| documents.embeddings(chunk_ref) | Produce or read retrieval embeddings. |

These names are preview syntax. The functions are not registered in the current default CLI session.

Output shape

Page functions should produce rows like:

| Column | Meaning |
| --- | --- |
| doc_id | Stable document id. |
| path | Local path or object URI. |
| mime_type | Document MIME type. |
| page_number | One-based page number when applicable. |
| title | Optional document title. |
| page_ref | Deferred page reference. |
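
To make the row shape and the idea of a deferred page reference concrete, here is a minimal Python sketch. It is not the extension's implementation: `PageRef`, `PageRow`, `materialize`, and `fake_render` are hypothetical names, and representing `page_ref` as a small locator object is an assumption. The column names are the ones listed above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PageRef:
    """Hypothetical deferred reference: enough to locate a page, not its contents."""
    path: str
    page_number: int

    def materialize(self, render: Callable[[str, int], str]) -> str:
        # Rendering or OCR happens only here, when a caller asks for it.
        return render(self.path, self.page_number)


@dataclass(frozen=True)
class PageRow:
    doc_id: str
    path: str
    mime_type: str
    page_number: Optional[int]  # one-based; None when pages do not apply
    title: Optional[str]
    page_ref: PageRef


calls = []


def fake_render(path: str, page_number: int) -> str:
    # Stand-in for an expensive render/OCR step; records that it ran.
    calls.append((path, page_number))
    return f"text of {path} page {page_number}"


rows = [
    PageRow("doc-1", "./corpus/a.pdf", "application/pdf", n, None,
            PageRef("./corpus/a.pdf", n))
    for n in (1, 2, 3)
]

# Filtering on metadata touches no page contents.
selected = [r for r in rows if r.page_number == 2]
assert calls == []

# Only the selected page is materialized.
text = selected[0].page_ref.materialize(fake_render)
```

In this sketch the expensive step runs once, for the single page that survived the metadata filter.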

Chunk functions should add text span information such as chunk_id, page_number, char_start, char_end, and text.
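
The span columns can be illustrated with a small chunker. This is a sketch under stated assumptions, not the planned chunking policy: fixed-size overlapping character windows, half-open offsets (`char_start` inclusive, `char_end` exclusive) relative to the page text, and a `doc_id:page:start` format for `chunk_id` are all choices made for the example. `Chunk` and `chunk_page` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    page_number: int
    char_start: int  # assumed inclusive offset into the page text
    char_end: int    # assumed exclusive offset into the page text
    text: str


def chunk_page(doc_id: str, page_number: int, page_text: str,
               size: int = 40, overlap: int = 10) -> List[Chunk]:
    """Split one page into fixed-size, overlapping character windows."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks = []
    step = size - overlap
    for start in range(0, len(page_text), step):
        end = min(start + size, len(page_text))
        chunks.append(Chunk(
            chunk_id=f"{doc_id}:{page_number}:{start}",
            page_number=page_number,
            char_start=start,
            char_end=end,
            text=page_text[start:end],
        ))
        if end == len(page_text):
            break  # the last window already reaches the end of the page
    return chunks


page_text = "Deferred references keep metadata queries cheap until text is needed."
chunks = chunk_page("doc-1", 1, page_text)
```

With half-open offsets, `page_text[c.char_start:c.char_end] == c.text` holds for every chunk, which makes spans easy to verify against the source page.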

Example shape

SELECT doc_id, page_number, page_ref FROM documents.pages('./corpus/') WHERE mime_type = 'application/pdf';
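
Because `documents.pages` is not registered yet, the query above cannot be run as written. The sketch below emulates it with Python's standard `sqlite3` module, using an ordinary table named `pages` as a stand-in for the function's output. The table name, the sample rows, and the string form of `page_ref` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Ordinary table standing in for the output of documents.pages('./corpus/').
conn.execute("""
    CREATE TABLE pages (
        doc_id TEXT, path TEXT, mime_type TEXT,
        page_number INTEGER, title TEXT, page_ref TEXT
    )
""")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("doc-1", "./corpus/a.pdf", "application/pdf", 1, "Report", "ref:doc-1:1"),
        ("doc-1", "./corpus/a.pdf", "application/pdf", 2, "Report", "ref:doc-1:2"),
        ("doc-2", "./corpus/b.html", "text/html", None, None, "ref:doc-2"),
    ],
)

# Same projection and filter as the preview query, against the stand-in table.
rows = conn.execute(
    """
    SELECT doc_id, page_number, page_ref
    FROM pages
    WHERE mime_type = 'application/pdf'
    ORDER BY doc_id, page_number
    """
).fetchall()
conn.close()
```

Only the two PDF pages come back, and the query reads nothing but metadata columns and opaque references.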

Notes

  • Metadata queries should avoid rendering pages or running OCR.
  • OCR, chunking, and embedding policies should be explicit and reproducible.
  • Retrieval-ready artifacts should document whether they are ordinary tables, Vortex files, or extension-managed artifacts.
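
One way to make OCR, chunking, and embedding policies explicit and reproducible is to record them as a single value and derive a fingerprint that can be stored alongside the artifact. The sketch below is an assumption about how that could look, not a described feature: `ArtifactPolicy`, its fields, and the placeholder engine and model names are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ArtifactPolicy:
    # All field names and values here are illustrative assumptions.
    ocr_engine: str
    ocr_language: str
    chunk_size: int
    chunk_overlap: int
    embedding_model: str
    storage: str  # e.g. "table", "vortex", or "extension-managed"

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so equal policies hash equally.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


policy_a = ArtifactPolicy("example-ocr", "eng", 512, 64, "example-embedder", "vortex")
policy_b = ArtifactPolicy("example-ocr", "eng", 512, 64, "example-embedder", "vortex")
policy_c = ArtifactPolicy("example-ocr", "eng", 256, 64, "example-embedder", "vortex")
```

Two artifacts built under the same policy share a fingerprint, while changing any setting, such as the chunk size in `policy_c`, yields a different one.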