# SpiralDB Documentation

> SpiralDB is a fast analytical database for storing and querying multimodal, multi-rate data streams — built on Vortex, optimized for AI workloads on physical-world data.

Source: https://docs.spiraldb.com

---

# Welcome to Spiral

Section: Overview
Source: https://docs.spiraldb.com/

SpiralDB is an extremely fast analytical database designed for storing and querying multimodal, multi-rate data streams — the kind of data produced by sensors, cameras, and other instruments in fields like robotics, biology, autonomous systems, and other domains at the intersection of AI and the physical world.

It is built by the creators of [Vortex](https://vortex.dev/), the open-source columnar file format, and uses Vortex as its storage foundation. This gives SpiralDB best-in-class compression and scan performance out of the box.

The core abstraction is Collections: hierarchical, tree-structured groups of records (similar to nested JSON arrays) backed by sorted columnar storage. They enable:

- **Multimodal storage with isolated column groups.** Columns are stored independently, so tables can be arbitrarily wide — mixing video frames, embeddings, tensors, and metadata — without penalizing queries that only touch a few columns.
- **Multi-rate alignment.** Different sensors record at different rates (a camera at 60 Hz, a robotic arm at 500 Hz). SpiralDB provides a streaming resample primitive that aligns sibling data streams to a shared timeline, so you can combine them in a single query.
- **Free column appends.** Adding derived data like labels, annotations, or embeddings doesn't require rewriting existing columns. New columns are appended independently through enrichment.

Additionally, SpiralDB supports **GPU-optimized data loading** by streaming data directly from object storage to the GPU, bypassing the traditional CPU-bound staging bottleneck.
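The multi-rate alignment described above can be pictured as an "as-of" join onto a shared timeline: each high-rate sample picks up the most recent sample from a lower-rate sibling stream. A minimal sketch in plain Python (illustrative only, not the SpiralDB resample API; the timestamps and stream contents are made up):

```python
# Toy illustration of multi-rate alignment (NOT the pyspiral API): align a
# low-rate stream to a high-rate timeline by carrying the most recent
# low-rate sample forward ("as-of" semantics).
def align_asof(timeline_ms, stream):
    """For each timestamp in timeline_ms, pick the latest stream sample
    with timestamp <= it (or None if none has arrived yet)."""
    out, i = [], -1
    for t in timeline_ms:
        while i + 1 < len(stream) and stream[i + 1][0] <= t:
            i += 1
        out.append(stream[i][1] if i >= 0 else None)
    return out

arm_times = [i * 2 for i in range(10)]        # 500 Hz arm: 0, 2, ..., 18 ms
camera = [(0, "frame0"), (16, "frame1")]      # ~60 Hz camera: (ms, frame)
assert align_asof(arm_times, camera) == ["frame0"] * 8 + ["frame1"] * 2
```

The real primitive works on streams rather than in-memory lists, but the alignment semantics are the same: after alignment, every row of the fast stream carries a value from each sibling stream, so a single query can combine them.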
It integrates with PyTorch via a custom DataLoader with built-in sharding, shuffling, checkpointing, and multi-threaded Rust-based I/O.

## Who is SpiralDB for?

SpiralDB is built for teams training AI models on data from the physical world — any domain where you're capturing observations from sensors, instruments, or simulations and need to turn them into training-ready datasets. This includes teams working in robotics, autonomous vehicles, biology, neuroscience, organic chemistry, materials science, and similar fields.

The common thread is data that is both multimodal (video, audio, point clouds, scalar telemetry, embeddings) and temporal (recorded over time, often at different rates across sources). If your workflow involves ingesting heterogeneous sensor data, aligning streams across time resolutions, deriving new features like labels or embeddings, and then loading the result into GPUs at high throughput — SpiralDB is designed for exactly that pipeline.

## Use Cases

- [**Feature Engineering**](https://docs.spiraldb.com/usecases/media-tables): Large-scale multimodal datasets.
- [**Data Loaders**](https://docs.spiraldb.com/usecases/data-loaders): High-throughput data loading for machine learning.

## Dive Deeper

- [**Spiral Tables**](https://docs.spiraldb.com/spiral-tables): Store, analyze and query massive and/or multimodal datasets.
- [**File Systems**](https://docs.spiraldb.com/filesystems): Learn how to configure and manage file systems in Spiral.
- [**Access and Permissions**](https://docs.spiraldb.com/authorization): Manage user permissions and access control.

## Getting Started

Ready to take Spiral for a spin? Use `pip` (or your favorite package manager) to install the Python client library and CLI.

```bash copy
pip install pyspiral
```

```bash copy
uv add pyspiral
```

---

# Organizations

Section: Concepts
Source: https://docs.spiraldb.com/organizations

Organizations are the top-level administrative unit in Spiral.
They are used to configure single sign-on, billing, audit logging, and other settings that apply to all users within the organization. Since resources belong to projects, and all projects must belong to an organization, each user must be a member of at least one organization in order to use Spiral. ## Organization Roles Each member of an organization is given one of three roles: - **Owner**: The owner has full control over the organization. - **Member**: Members cannot configure the organization, but are able to create projects. - **Guest**: Guests can only access projects they have been invited to, and cannot create new ones. You can check your role within an organization using the Spiral CLI: ```bash copy spiral orgs ls ``` ```bash Organizations ┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ ┃ id ┃ name ┃ role ┃ ┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ 👉 │ org_01J9KF5Y2CB6DFK9YMBKG7S0Q4 │ Spiral │ owner │ └────┴────────────────────────────────┴────────┴───────┘ ``` The 👉 symbol indicates the organization you are currently logged into. All administrative commands will apply to this organization. You can switch organization using the Spiral CLI: ```bash copy spiral orgs switch org_01J9KF5Y2CB6DFK9YMBKG7S0Q4 ``` You can decide the role of a user when you [invite](#manual-invitation) them to your organization, or by configuring default role mappings with [directory sync](#directory-sync). ## User Management There are three ways to manage users within your organization (in order of recommendation): SSO, domain verification, and manual invitation. ### Single Sign-On (SSO) Single sign-on (SSO) is the recommended way to manage organization membership. This allows you to automatically add and remove users from your organization based on your existing identity provider. 
You can configure SSO for your organization using the Spiral CLI: ```bash copy spiral orgs sso ``` See also the [directory sync](https://docs.spiraldb.com/organizations#directory-sync) feature for more advanced group-based configuration. ### Domain Verification If you are unable to use SSO, you can verify your domain using a DNS record. This will automatically add to your organization any user that signs up with an email address from your domain. ```bash copy spiral orgs domains ``` ### Manual Invitation Finally, you can manually invite users to your organization using the Spiral CLI: ```bash copy spiral orgs invite --role ``` ## Groups and Teams Groups and teams allow you to grant permissions to sets of users at a time. Groups and teams are not yet supported. Please contact us if you are interested in this feature. Groups are created automatically via [directory sync](https://docs.spiraldb.com/organizations#directory-sync), a feature that replicates group structures from an external directory provider such as *Active Directory* or *Google Workspace*. Teams are functionally identical to groups, however they are managed manually by an organization owner. ### Directory Sync Coming soon! Please contact support if you would like to enable this feature. ### Managing Teams Coming soon! Please contact support if you would like to enable this feature. --- # Projects Section: Concepts Source: https://docs.spiraldb.com/projects In Spiral, projects are the fundamental unit for structuring resources, access controls, and usage tracking. Projects can be managed either through the Spiral CLI or the PySpiral client. ```bash copy spiral projects --help ``` ## Creating a Project Projects are created within an organization. Each project has a globally unique and discoverable ID that is used to identify it across the Spiral platform. You can also configure a description to provide context only to those users with permission to view the project. 
To create a new project, use the Spiral CLI:

```bash copy
spiral projects create --description "My First Project"
```

## Listing Projects

You can list all projects visible to you using the Spiral CLI:

```bash copy
spiral projects ls
```

```bash
 Projects
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ id                      ┃ organization_id                ┃ name             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ blazing-parakeet-036267 │ org_01J9KF5Y2CB6DFK9YMBKG7S0Q4 │ My First Project │
└─────────────────────────┴────────────────────────────────┴──────────────────┘
```

## Dropping a Project

You can drop a project using the Spiral CLI:

```bash copy
spiral projects drop blazing-parakeet-036267
```

Note that dropping a project does not delete data from storage. To permanently delete project data, please contact support.

## Permissions

Projects can be shared with other users, teams, organizations etc. using **grants**. For more information on grants and principals, see the documentation on [authorization](https://docs.spiraldb.com/authorization). For more information on access across systems, see the documentation on [OIDC](https://docs.spiraldb.com/oidc).
You can list all grants for a project using the Spiral CLI: ```bash copy spiral projects grants blazing-parakeet-036267 ``` ```bash Project Grants ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ id ┃ project_id ┃ role_id ┃ principal ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ grant_Y7DH5asdM33eMR7leXtXx │ blazing-parakeet-036267 │ admin │ /org/org_01J9KF5Y2CB6DFK9YMBKG7S0Q4/user/user_01JAYY9RA8AR8ZCC93VJD7S71Z │ │ grant_zwZdWVcBUBluRIhHebsbG │ blazing-parakeet-036267 │ viewer │ /org/org_01J9KF5Y2CB6DFK9YMBKG7S0Q4/role/guest │ │ grant_4SyXL7hRsS8jgc0eoc6KU │ blazing-parakeet-036267 │ viewer │ /org/org_01J9KF5Y2CB6DFK9YMBKG7S0Q4/role/member │ │ grant_ASyW3OFWhw943e2So96ZZ │ blazing-parakeet-036267 │ admin │ /org/org_01J9KF5Y2CB6DFK9YMBKG7S0Q4/role/owner │ └─────────────────────────────┴─────────────────────────┴─────────┴──────────────────────────────────────────────────────────────────────────┘ ``` You can grant a role on a project to a user using the Spiral CLI: ```bash copy spiral projects grant blazing-parakeet-036267 --role viewer --org-id org_01J9KF5Y2CB6DFK9YMBKG7S0Q4 --org-user user_01JAYY9RA8AR8ZCC93VJD7S71Z ``` You can grant a role on a project to all users with a specific role in an organization using the Spiral CLI: ```bash copy spiral projects grant blazing-parakeet-036267 --role viewer --org-id org_01J9KF5Y2CB6DFK9YMBKG7S0Q4 --org-role member ``` For more details on managing grants, such as authorizing compute jobs in AWS, GCP, or Modal to access project resources, see [grant](https://docs.spiraldb.com/cli#spiral-projects-grant) command. --- # Authorization Section: Concepts Source: https://docs.spiraldb.com/authorization Spiral uses a role-based access control (RBAC) model for authorization, where currently the only unit of permissioning is the project. 
In this model, **principals** (users, teams, workloads) are granted **roles** on **projects**. A role confers a set of **permissions** on the project, which are used to determine whether a principal is allowed to perform a given action.

## Roles

The roles currently defined in Spiral are:

- `admin` - full access to the project
- `editor` - read/write access to the project
- `viewer` - read-only access to the project

These roles expand to resource-specific permissions, such as `table:read` and `table:write`. Please reach out to us if you have specific requirements for custom roles.

## Principals

Principals are entities that can be granted roles on projects. Currently, the following types of principals are supported:

- **Org-scoped Users** - an individual org/user pair
- **Teams** - groups of users as defined manually in Spiral
- **Groups** - groups of users as synchronized from an external identity provider
- **Organizations** - a team implicitly defined as all members of an organization
- **Workloads** - non-human entities. Workloads authenticate via OIDC, AWS IAM, or client credentials. See [Workloads](#workloads) below and the [OIDC Integrations](https://docs.spiraldb.com/oidc) page for setup instructions.

## Grants

Grants are the association of a principal with a role on a project. They are the mechanism by which permissions are conferred. Grants can be listed, created, updated, and deleted using the Spiral CLI. To create a grant, see the [grant](https://docs.spiraldb.com/cli#spiral-projects-grant) command.

## Workloads

Workloads are non-human principals in Spiral — they represent services, jobs, or automations that need to access data. Workloads are managed via the `spiral workloads` set of CLI commands.

If a workload writes to resources in a single project, it is recommended to place it in that project. If a workload modifies resources across multiple projects, it is recommended to keep it in a separate project from all resources.
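The grant model can be sketched in a few lines of Python. This is a conceptual toy, not Spiral's implementation; the `project:grant` permission name and the permission sets are assumptions for illustration, with the principal and project IDs reused from the examples above.

```python
# Conceptual sketch of RBAC grants (toy, not Spiral's implementation).
# Roles expand to resource-specific permissions; a grant associates a
# principal with a role on a project.
ROLE_PERMISSIONS = {
    "admin":  {"table:read", "table:write", "project:grant"},  # illustrative
    "editor": {"table:read", "table:write"},
    "viewer": {"table:read"},
}

# Each grant is (principal, role, project).
grants = [("user_01JAYY9RA8AR8ZCC93VJD7S71Z", "viewer", "blazing-parakeet-036267")]

def is_allowed(principal, permission, project):
    """A principal may act iff some grant on the project carries a role
    whose permission set includes the requested permission."""
    return any(
        p == principal and proj == project and permission in ROLE_PERMISSIONS[role]
        for p, role, proj in grants
    )

assert is_allowed("user_01JAYY9RA8AR8ZCC93VJD7S71Z", "table:read", "blazing-parakeet-036267")
assert not is_allowed("user_01JAYY9RA8AR8ZCC93VJD7S71Z", "table:write", "blazing-parakeet-036267")
```

The point of the indirection through roles is that a grant never names individual permissions: changing what `editor` means updates every editor grant at once.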
An environment can authenticate as a workload using either a **policy** or **client credentials**.

### Policies

A workload policy maps an external identity to a Spiral workload. This is the recommended approach — it avoids managing static credentials and integrates with your existing identity provider. A policy maps claims from your identity provider's tokens to the Spiral workload: it specifies conditions (key=value pairs) that must match the token's claims.

To use a workload policy, follow these steps:

1. **Create a workload** — `spiral workloads create`.
2. **Create a workload policy** — `spiral workloads create-policy`.
3. **Grant the workload a role** — `spiral projects grant`.
4. **Set `SPIRAL_WORKLOAD_ID`** in the environment where you use the pyspiral library.

### OpenID Connect

[OIDC](https://auth0.com/docs/authenticate/protocols/openid-connect-protocol) is a protocol for authenticating entities between systems. We strongly recommend using OIDC for workload authentication — it is easier to set up and more secure than managing static credentials. Spiral integrates with several OIDC providers. Check out the [OIDC Integrations](https://docs.spiraldb.com/oidc) page for available integrations and how to set them up.

### Client Credentials

When an external identity is not available (e.g. the environment does not provide OIDC tokens), you can issue client credentials instead and store them as secrets in your environment. This is a less secure approach, but it can be used as a fallback.

To use client credentials, follow these steps:

1. **Create a workload** — `spiral workloads create`.
2. **Issue credentials** — `spiral workloads issue-creds`.
3. **Grant the workload a role** — `spiral projects grant`.
4. **Set `SPIRAL_CLIENT_ID` and `SPIRAL_CLIENT_SECRET`** in the environment where you use the pyspiral library.
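The condition-matching behavior of a workload policy can be sketched as follows. This is illustrative, not the pyspiral implementation, and the claim names (`iss`, `repository`, `ref`) are hypothetical examples of what an identity provider's token might carry.

```python
# Illustrative sketch (not the pyspiral implementation) of how a workload
# policy's key=value conditions could be checked against the claims of an
# identity provider's token.
def policy_matches(conditions, claims):
    """Every policy condition must equal the corresponding token claim."""
    return all(claims.get(key) == value for key, value in conditions.items())

conditions = {"iss": "https://token.actions.githubusercontent.com",
              "repository": "my-org/my-repo"}
claims = {"iss": "https://token.actions.githubusercontent.com",
          "repository": "my-org/my-repo",
          "ref": "refs/heads/main"}

assert policy_matches(conditions, claims)
assert not policy_matches(conditions, {**claims, "repository": "other/repo"})
```

Extra claims on the token are ignored; only the claims named in the policy's conditions have to match, which is why a policy can be scoped as narrowly or broadly as your identity provider's tokens allow.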
---

# Security

Section: Concepts
Source: https://docs.spiraldb.com/security

Spiral is designed with security as a core principle, implementing defense-in-depth strategies across authentication, authorization, encryption, and data protection.

## Architecture

Spiral Tables are designed in such a way that your data never leaves your VPC. The only exception is key columns: Spiral stores the key schema for each table, which **includes key column names**, and tracks [fragments](https://docs.spiraldb.com/format), which **include key column stats** such as min/max values.

PySpiral, the Spiral client library, runs in your VPC and has network access to your bucket, but it is not itself authorized to read from the bucket. Spiral's control plane provides temporary authorization for each read/write operation to PySpiral via signed URLs.

![VPC](/img/architecture-vpc.png)

## Authentication

Spiral supports multiple authentication mechanisms to integrate with your existing infrastructure.

### OIDC Integrations

Spiral integrates with several OIDC (OpenID Connect) identity providers for seamless workload authentication:

- **GCP** - Google Cloud service accounts
- **Modal** - Serverless platform integration
- **AWS** - AWS IAM roles (coming soon)

See [OIDC Integrations](https://docs.spiraldb.com/oidc) for detailed setup instructions.

### SSO (Single Sign-On)

Enterprise organizations can configure SSO through WorkOS, enabling:

- SAML and OIDC-based enterprise SSO
- Domain-verified automatic user provisioning
- Centralized identity management

## Authorization

Spiral implements role-based access control (RBAC) with fine-grained permissions. See [Authorization](https://docs.spiraldb.com/authorization) for complete details.

### Hierarchy

Access is controlled at two levels:

1. **Organization level** - Manages membership and project creation
2.
**Project level** - Controls access to tables, indexes, and file systems

### Roles

| Role       | Project Access                          |
| ---------- | --------------------------------------- |
| **Admin**  | Full access including grant management  |
| **Editor** | Read/write access to all resources      |
| **Viewer** | Read-only access                        |

### Principals

Grants can be assigned to various principal types:

- Users and user groups
- Organizations and teams
- Workloads (service accounts, GitHub Actions, Modal apps)

Use OIDC-based workload authentication instead of static API keys whenever possible. This eliminates the need to manage and rotate secrets.

## Encryption

### Encryption in Transit

All communication with Spiral is encrypted using TLS 1.2 or higher:

- **HTTPS only** - All API endpoints require TLS
- **Modern cipher suites** - Using industry-standard encryption
- **Certificate validation** - Web PKI root certificate verification

### Encryption at Rest

Data stored in Spiral is protected at rest:

- **Object storage encryption** - Data files are encrypted by the underlying cloud storage (S3, GCS, Azure Blob)
- **Database encryption** - Metadata is encrypted at rest in the database layer

## Data Protection

### Multi-Tenant Isolation

Spiral provides strong isolation between tenants:

- **Organization boundaries** - Data is isolated at the organization level
- **Project scoping** - All operations are scoped to authorized projects

## Network Security

### API Security

- **Rate limiting** - Protection against abuse
- **Request validation** - All inputs are validated before processing
- **CORS configuration** - Controlled cross-origin access

## Compliance

Spiral is designed to help you meet compliance requirements:

- **Data residency** - Control where your data is stored
- **Access controls** - Fine-grained RBAC for compliance with least-privilege principles
- **Audit trails** - Logging for compliance and forensics

Contact us for specific compliance requirements including SOC 2, HIPAA, or GDPR.
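The per-operation signed URLs described in the Architecture section can be sketched with a toy HMAC scheme. This is illustrative only, not Spiral's implementation: real object stores use their own presigning (e.g. S3 Signature Version 4), and the secret and URL below are made up.

```python
# Toy sketch of per-operation signed URLs (illustrative, not Spiral's
# implementation): the control plane signs a URL with an expiry, and the
# client uses it without ever holding bucket credentials.
import hashlib
import hmac
import time

SECRET = b"control-plane-secret"  # hypothetical; held by the control plane only

def sign_url(url, expires_at):
    msg = f"{url}?expires={expires_at}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{url}?expires={expires_at}&sig={sig}"

def verify(signed_url, now):
    base, _, query = signed_url.partition("?")
    params = dict(kv.split("=", 1) for kv in query.split("&"))
    expected = sign_url(base, int(params["expires"]))  # re-sign and compare
    return hmac.compare_digest(expected, signed_url) and now < int(params["expires"])

url = sign_url("https://bucket.example/fragment", int(time.time()) + 300)
assert verify(url, int(time.time()))            # valid and unexpired
assert not verify(url, int(time.time()) + 600)  # expired
```

The security property this models: the client can perform exactly the operation that was authorized, for exactly the window the control plane allowed, and any tampering with the URL or its expiry invalidates the signature.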
## Security Best Practices ### For Administrators 1. **Use SSO** - Enable SSO for your organization when possible 2. **Principle of least privilege** - Grant only the minimum necessary permissions 3. **Regular access reviews** - Periodically review and revoke unnecessary grants 4. **Use workload identity** - Prefer OIDC over static credentials ### For Developers 1. **Use OIDC authentication** - Avoid storing API keys in code or configuration 2. **Scope access appropriately** - Request only the permissions your workload needs 3. **Rotate credentials** - If using API keys, rotate them regularly 4. **Secure your environment** - Protect the environment where Spiral credentials are used ## Reporting Security Issues If you discover a security vulnerability in Spiral, please report it responsibly: - Email: [security@spiraldb.com](mailto:security@spiraldb.com) - Include details about the vulnerability and steps to reproduce We take all security reports seriously and will respond promptly. --- # File Systems Section: Concepts Source: https://docs.spiraldb.com/filesystems Many Spiral resources make use of object storage. Spiral File Systems provide a means to securely federate and accelerate access to underlying object storage. As with all resources in Spiral, File Systems are scoped to a project. However, unlike many other resource types, each project may only have a single file system. This file system is often referred to as the project's **default** file system. By default, any resource that requires a file system will use the project's default file system, unless otherwise specified. ## Configuration You can see a project's file system configuration using the Spiral CLI ```bash copy spiral fs show ``` When resources like *Tables* are created they configure a prefix in a file system. File systems can only be reconfigured when they have zero used prefixes. ## Built-in Providers Spiral supports several built-in file system providers. 
The full set of providers can be listed using the CLI:

```bash copy
spiral fs list-providers
```

Configure a project's file system provider:

```bash copy
spiral fs update --provider
```

If you require a specific provider, please reach out to us, and we should be able to add it for you.

## Bring Your Own Bucket

If you have an existing bucket that you would like to use as a Spiral File System, you can update the project's default file system to use it. See the specific provider instructions below for how to configure the bucket and permissions.

Use the CLI to update the default file system:

```bash copy
spiral fs update --help
```

We recommend using a dedicated project to register your own bucket as a default file system, i.e. avoid creating any resources in this project. This allows you to separate bucket management permissions from resource management permissions. Projects with resources can then be configured to use the "bucket" project as their default file system using:

```bash copy
spiral fs update --upstream
```

### AWS S3

To configure an S3 bucket for Spiral, first create an IAM policy.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowListBucketAll",
      "Effect": "Allow",
      "Action": "s3:ListBucket*",
      "Resource": ["arn:aws:s3:::{your-bucket-name}"]
    },
    {
      "Sid": "AllowObjectSome",
      "Effect": "Allow",
      "Action": [
        "s3:*Object",
        "s3:*ObjectVersion",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": ["arn:aws:s3:::{your-bucket-name}/*"]
    }
  ]
}
```

Next, you have to allow Spiral to assume a role with this policy. The easiest and most secure way to do this (it avoids any long-lived tokens!) is to allow Spiral's GCP identity to assume the role. Create an IAM role with the following trust policy and attach the object storage access policy above to it.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "accounts.google.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "accounts.google.com:sub": "116500466089430548312",
          "accounts.google.com:aud": "116500466089430548312",
          "accounts.google.com:oaud": "https://iss.spiraldb.com"
        }
      }
    }
  ]
}
```

`116500466089430548312` is the unique identifier for Spiral's service account ([spiraldb-filesystems@pyspiral-dev.iam.gserviceaccount.com](mailto:spiraldb-filesystems@pyspiral-dev.iam.gserviceaccount.com)) currently running in GCP. Spiral exchanges a short-lived GCP identity token for AWS temporary credentials. See the [AWS docs](https://aws.amazon.com/blogs/security/access-aws-using-a-google-cloud-platform-native-workload-identity/) for more details on this approach.

Use the ARN of the created role to update the project's file system:

```bash copy
spiral fs update --type s3 --bucket --region --role-arn
```

### GCP GCS

To configure a GCS bucket for Spiral, grant the `Storage Object User` and `Storage Object Viewer` roles on the bucket to the Spiral service account `spiraldb-filesystems@pyspiral-dev.iam.gserviceaccount.com`. Use the CLI to update the project's file system:

```bash copy
spiral fs update --type gcs --bucket --region
```

### S3-compatible

S3-compatible providers (e.g. MinIO, Tigris, R2) are supported using access and secret keys. Use the CLI to update the project's file system:

```bash copy
spiral fs update --type s3like --endpoint --bucket --region
```

Secrets are stored in [WorkOS Vault](https://workos.com/vault). Use the CLI if you want to [BYOKs](https://workos.com/docs/vault/byok):

```bash copy
spiral orgs keys
```

---

# Spiral Tables

Section: Concepts
Source: https://docs.spiraldb.com/spiral-tables

Spiral Tables are a powerful and flexible way to store, analyze, and query massive and/or multimodal datasets.
The data model will feel familiar to users of SQL- or DataFrame-style systems, yet is designed to be more flexible, more powerful, and more useful in the context of modern data processing. Tables are stored and queried directly from **object storage**.

## Key Features

- **Sufficiently Schemaless:** Spiral supports complex data with nested relationships using column groups. Tables are sparse and support column appends **without** rewriting existing rows.
- **High-throughput Scanning:** Saturate your network bandwidth with optimized high-throughput scanning.
- **Scalable Storage with Flexible Cost-Performance:** Based on log-structured merge (LSM) trees and a lakehouse architecture for efficient data storage and retrieval, with an acceleration layer when performance is critical.
- **Python-centric Data Access Layer:** Retrieve data in powerful columnar or row-based formats like PyArrow, Pandas, Polars, Dask, PyTorch and more, with intuitive projection and filtering syntax.
- **Cell Push-down:** Keep large values like images, audio, and video seamlessly integrated with your tabular data. Filter rows and read only the parts you need with cell-level filtering. No need for separate storage systems.

Spiral Tables are **built with [Vortex](https://docs.vortex.dev/)**, our SOTA file format and columnar data toolkit. Vortex's random access performance advancements enable search and indexing features (soon)!

## Dive Deeper

- [**PySpiral**](https://docs.spiraldb.com/python-api/spiral): Learn how to interact with Spiral Tables using the Python client library.
- [**Data Model**](https://docs.spiraldb.com/tables/data-model): Understand Spiral's flexible and powerful data model.
- [**Best Practices**](https://docs.spiraldb.com/tables/best-practices): Tips and tricks for optimizing your use of Spiral Tables.
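The LSM-tree idea behind the storage layer can be sketched in a few lines. This is a toy model, not Spiral's implementation: writes land as sorted L0 fragments, and compaction merges overlapping fragments into a single sorted run at the next level.

```python
# Toy LSM sketch (illustrative, not Spiral's implementation): each L0
# fragment is a sorted run of (key, value) pairs; compaction merges
# several runs into one larger sorted run.
import heapq

l0_fragments = [
    [(1, "a"), (4, "d")],   # one flush of writes
    [(2, "b"), (3, "c")],   # another flush
]

def compact(fragments):
    """Merge sorted fragments into one sorted run (the next level)."""
    return list(heapq.merge(*fragments, key=lambda kv: kv[0]))

l1 = compact(l0_fragments)
assert [k for k, _ in l1] == [1, 2, 3, 4]
```

This is why writes are cheap (append a new sorted fragment) while reads stay fast (fragments are sorted by key, and compaction bounds how many of them a scan has to merge). The `Level` and `Compacted At` columns in the manifest output later in these docs reflect exactly this structure.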
---

# Media Tables

Section: Use Cases
Source: https://docs.spiraldb.com/usecases/media-tables

The [Data Model](https://docs.spiraldb.com/tables/data-model) of [Spiral Tables](https://docs.spiraldb.com/spiral-tables#key-features) makes them an ideal choice for storing and managing large-scale multimodal datasets (including images, videos, audio files, text, transformer-based embeddings and more). Tables are **designed to store both data and traditional metadata** (labels, annotations, complex nested data and more) **in a single model**.

Let's build a table of crawled image data.

## Start from Metadata

First, create a table with a composite primary key made up of the `page_url` and `key` columns. Primary keys in Spiral Tables can consist of multiple columns; they define the table's sort order and are used to support updates and zero-copy appends. In this case, `key` represents a unique identifier for each media item on the page identified by `page_url`.

```python
from spiral import Spiral
import pyarrow as pa

sp = Spiral()
project = sp.project("example")
table = project.create_table(
    "media-table",
    key_schema=pa.schema({"page_url": pa.string(), "key": pa.int64()})
)
```

While it's possible to write raw data (images, audio, video, etc.) directly into Spiral Tables, it's much more common to start from metadata such as URLs or S3 paths, and then use [Enrichment](https://docs.spiraldb.com/tables/enrich) to fetch the actual media data.

Let's skip crawling web pages, as it is not specific to Spiral, and explore already-ingested metadata. This metadata is based on the [filtered-wit](https://huggingface.co/datasets/laion/filtered-wit) dataset.
```python >>> table.schema() ``` ```python pa.schema({ "page_url": pa.string(), "key": pa.int64(), "page_title": pa.string(), "section_title": pa.string(), "caption": pa.string(), "caption_attribution_description": pa.string(), "url": pa.string(), "context": pa.struct({ "page_description": pa.string(), "section_description": pa.string() }) }) ``` In our table, `context` is a [Column Group](https://docs.spiraldb.com/tables/data-model#column-groups). While Spiral Tables don't require up-front schema design, check out [Best Practices](https://docs.spiraldb.com/tables/best-practices#column-groups) when it comes to splitting data into column groups for optimal performance. In this case, contextual metadata is large text that is rarely filtered on so it makes sense to group it separately. Let's use Polars to look at some sample data. ```python table.to_polars_lazy_frame().head(100) ``` ![Media Table](/img/media-tables-tbl.png) Let's use Spiral CLI to explore the structure of the table (the output is truncated). 
```bash copy spiral tables manifests --project example --table media-table ``` ``` Key Space manifest 131 fragments, total: 60.6MB, avg: 473.5KB, metadata: 431.7KB ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓ ┃ ID ┃ Size (Metadata) ┃ Format ┃ Key Span ┃ Level ┃ Committed At ┃ Compacted At ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩ │ 0qaq87gz87 │ 30.4MB (19.6KB) │ vortex │ 0..1294000 │ L0 │ 2025-11-06 18:20:00.224537+00:00 │ N/A │ ``` ``` Column Group manifest for table_sl6o0u 6 fragments, total: 113.9MB, avg: 19.0MB, metadata: 111.5KB ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓ ┃ ID ┃ Size (Metadata) ┃ Format ┃ Key Span ┃ Level ┃ Committed At ┃ Compacted At ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩ │ pqjb420f8r │ 20.2MB (19.1KB) │ vortex │ 0..228261 │ L0 │ 2025-11-06 18:20:00.224537+00:00 │ N/A │ │ vrv19qwg1i │ 20.2MB (19.1KB) │ vortex │ 228261..456522 │ L0 │ 2025-11-06 18:20:00.224537+00:00 │ N/A │ │ 9lhiuacvoi │ 20.0MB (19.1KB) │ vortex │ 456522..684783 │ L0 │ 2025-11-06 18:20:00.224537+00:00 │ N/A │ │ yq7dqeed9r │ 20.1MB (19.1KB) │ vortex │ 684783..913044 │ L0 │ 2025-11-06 18:20:00.224537+00:00 │ N/A │ │ 2tbvh0v6td │ 20.1MB (19.1KB) │ vortex │ 913044..1141305 │ L0 │ 2025-11-06 18:20:00.224537+00:00 │ N/A │ │ d4xlsgo7ff │ 13.3MB (16.1KB) │ vortex │ 1141305..1294000 │ L0 │ 2025-11-06 18:20:00.224537+00:00 │ N/A │ ``` ``` Column Group manifest for table_sl6o0u.context 11 fragments, total: 2.6GB, avg: 120.9MB, metadata: 2.7MB ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓ ┃ ID ┃ Size (Metadata) ┃ Format ┃ Key Span ┃ Level ┃ Committed At ┃ Compacted At ┃ 
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ xn6rq15db4 │ 129.5MB (2.4KB) │ vortex │ 0..1177     │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ vn4d04pp2r │ 128.7MB (2.4KB) │ vortex │ 1177..2354  │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ yqgn69c1rh │ 126.2MB (2.4KB) │ vortex │ 2354..3531  │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ 9mmqc78utg │ 129.3MB (2.4KB) │ vortex │ 3531..4708  │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ ffsgw7yf5m │ 126.6MB (2.4KB) │ vortex │ 4708..5885  │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ 3319x978du │ 127.5MB (2.4KB) │ vortex │ 5885..7062  │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
```

## Enrich with Data

Our metadata table has URLs pointing to images. Let's use [Table Enrichment](https://docs.spiraldb.com/tables/enrich) to fetch the images and store them directly in the table. Using [Expressions](https://docs.spiraldb.com/tables/expressions), we can define how to derive new columns (like `image`) from existing columns (like `url`).

```python
from spiral import expressions as se

enrichment = table.enrich(
    {
        "image": se.http.get(table["url"])
    }
)
```

There are different ways to run the enrichment, but in this case let's execute it in a streaming fashion, since that is the simplest.

```python
enrichment.run()
```

`se.http.get` fetches the image data from the URL as well as some useful metadata. It creates two column groups: `image` (with the raw image bytes in the `bytes` column) and `image.meta` (with metadata such as the status code).
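Conceptually, an enrichment derives a new column from existing ones and appends it without rewriting the existing data. A toy sketch in plain Python (not the pyspiral expression engine; the derived `ext` column is a made-up example, whereas `se.http.get` would fetch actual bytes):

```python
# Toy sketch of enrichment semantics (not the pyspiral implementation):
# derive a new column row-by-row from an existing column; the existing
# columns are left untouched.
from urllib.parse import urlparse

def enrich(rows, source_col, new_col, fn):
    for row in rows:
        row[new_col] = fn(row[source_col])

rows = [{"url": "https://example.com/a.jpg"},
        {"url": "https://example.com/b.png"}]

# Hypothetical derived column: the file extension of each URL.
enrich(rows, "url", "ext", lambda u: urlparse(u).path.rsplit(".", 1)[-1])
assert [r["ext"] for r in rows] == ["jpg", "png"]
```

In Spiral the derived values land as new column groups aligned to the table's key order, which is what makes the append free: no existing fragment is touched.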
Our table now looks like this: ```python table.schema() ``` ```python pa.schema({ "page_url": pa.string(), "key": pa.int64(), "page_title": pa.string(), "section_title": pa.string(), "caption": pa.string(), "caption_attribution_description": pa.string(), "url": pa.string(), "context": pa.struct({ "page_description": pa.string(), "section_description": pa.string() }), "image": pa.struct({ "bytes": pa.binary(), "meta": pa.struct({ "location": pa.string(), "last_modified": pa.int64(), "size": pa.uint64(), "e_tag": pa.string(), "version": pa.string(), "status_code": pa.uint16() }) }) }) ``` Let's use Spiral CLI to explore the updated structure of the table (the output is truncated). We expect to see two new column groups: `image` and `image.meta`. ```bash copy spiral tables manifests --project example --table media-table ``` ``` Column Group manifest for table_sl6o0u.image 1165 fragments, total: 137.6GB, avg: 120.9MB, metadata: 2.7MB ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓ ┃ ID ┃ Size (Metadata) ┃ Format ┃ Key Span ┃ Level ┃ Committed At ┃ Compacted At ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩ │ xn6rq15db4 │ 129.5MB (2.4KB) │ vortex │ 0..1177 │ L0 │ 2025-11-06 20:10:41.359348+00:00 │ N/A │ │ vn4d04pp2r │ 128.7MB (2.4KB) │ vortex │ 1177..2354 │ L0 │ 2025-11-06 20:10:41.359348+00:00 │ N/A │ │ yqgn69c1rh │ 126.2MB (2.4KB) │ vortex │ 2354..3531 │ L0 │ 2025-11-06 20:10:41.359348+00:00 │ N/A │ │ 9mmqc78utg │ 129.3MB (2.4KB) │ vortex │ 3531..4708 │ L0 │ 2025-11-06 20:10:41.359348+00:00 │ N/A │ │ ffsgw7yf5m │ 126.6MB (2.4KB) │ vortex │ 4708..5885 │ L0 │ 2025-11-06 20:10:41.359348+00:00 │ N/A │ │ 3319x978du │ 127.5MB (2.4KB) │ vortex │ 5885..7062 │ L0 │ 2025-11-06 20:10:41.359348+00:00 │ N/A │ ``` ``` Column Group manifest for table_sl6o0u.image.meta 130 fragments, total: 1.5MB, avg: 12.0KB, metadata: 977.8KB 
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓ ┃ ID ┃ Size (Metadata) ┃ Format ┃ Key Span ┃ Level ┃ Committed At ┃ Compacted At ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩ │ cyjkjdaezp │ 11.8KB (7.5KB) │ vortex │ 0..10000 │ L0 │ 2025-11-06 20:10:41.359348+00:00 │ N/A │ │ cmcb44n6z6 │ 12.2KB (7.5KB) │ vortex │ 0..10000 │ L0 │ 2025-11-06 20:10:41.359348+00:00 │ N/A │ │ t5vovxqks2 │ 12.3KB (7.6KB) │ vortex │ 0..10000 │ L0 │ 2025-11-06 20:10:41.359348+00:00 │ N/A │ │ gof24kc956 │ 12.2KB (7.5KB) │ vortex │ 0..10000 │ L0 │ 2025-11-06 20:10:41.359348+00:00 │ N/A │ │ x6ujppiw06 │ 11.8KB (7.5KB) │ vortex │ 0..10000 │ L0 │ 2025-11-06 20:10:41.359348+00:00 │ N/A │ │ r0gesz2qpe │ 12.1KB (7.6KB) │ vortex │ 0..10000 │ L0 │ 2025-11-06 20:10:41.359348+00:00 │ N/A │ ``` ## Explore Tables The entry point for reading data from Spiral Tables is [scan](https://docs.spiraldb.com/python-api/spiral#scan) and [Scan Object](https://docs.spiraldb.com/tables/scan#scan-object). Let's explore failures in our enrichment by scanning the table and filtering on `image.meta.status_code`. ```python scan = sp.scan( table[["url", "image.meta.status_code"]], where=table["image.meta.status_code"] != pa.scalar(200, pa.uint16()), ) ``` Let's check if the result is what we expect. ```python scan.schema() ``` ```python pa.schema({ "url": pa.string(), "image": pa.struct({ "meta": pa.struct({ "status_code": pa.uint16(), }), }), }) ``` Let's execute the scan and check the results. ```python scan.to_polars() ``` ![Failed Enrichments](/img/media-tables-meta.png) Nice! Only a few failed enrichments, and most of them are `404 Not Found` errors as expected. Now let's get images from a specific page URL. 
```python sp.scan( table["image"][["bytes"]], where=(table["page_url"] == "https://en.wikipedia.org/wiki/Ballona_Creek") ).to_table().to_pydict() ``` ```python {'bytes': [b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x03G\x00\x00\x02h\x08\x03\x00\x00\x00\x9a\xe9j\xe6\x00\x00\x03\x00PLTE\xff\xff\xff\xde\xde\xdeBBB::B!1)B:)RRR\x9c\x9c\x94\xce\xce\xc5\xbd\xbd\xd6\xbd\xbd\xbd\x9c\xa5\xb5\xbd\xc5\xc5\xad\xa... ``` ## High Throughput Scans Spiral Tables are designed for high-throughput scans. Getting a stream of record batches from any scan is as simple as this: ```python sp.scan(table["image"][["bytes"]]).to_record_batches() ``` Check out the [API Reference](https://docs.spiraldb.com/python-api/spiral#to_record_batches) for more details on how to customize scans. See also: - [Data Loaders](https://docs.spiraldb.com/usecases/data-loaders) - [Distributed Compute](https://docs.spiraldb.com/tables/distributed) --- # Data Loaders Section: Use Cases Source: https://docs.spiraldb.com/usecases/data-loaders Low data loader efficiency means ***fewer iterations on the model***. Even the best teams spend tremendous amounts of human capital building, maintaining, and adapting custom solutions - and their ***GPU still sits idle***. State-of-the-art data loading solutions are complex, brittle, and hard to maintain. ![Training Data](/img/data-loading-sota.png) Spiral is designed to deliver maximal throughput and elasticity, enabling efficient training cycles. ![Data Loader](/img/data-loading-spiral.png) ## Training on (Live) Tables Tables offer a powerful and flexible way to store dynamic, massive, and/or multimodal datasets, and an easy way to obtain a high-throughput, production-ready data loader. The [Scan](https://docs.spiraldb.com/tables/scan#scan-object) object integrates with many training frameworks. We recommend starting with it, since it provides a simple interface for getting a PyTorch-compatible `DataLoader`.
```python from spiral import Spiral, SpiralDataLoader from spiral.demo import images sp = Spiral() table = images(sp) scan = sp.scan(table) data_loader: SpiralDataLoader = scan.to_data_loader(seed=42, batch_size=32) ``` Unlike PyTorch's `DataLoader`, which uses multiprocessing for I/O (`num_workers`), `SpiralDataLoader` leverages Spiral's efficient Rust-based streaming and only uses multiprocessing for CPU-bound post-processing transforms. Key differences from PyTorch's `DataLoader`: - No `num_workers` for I/O (Spiral's Rust layer is already multi-threaded) - `map_workers` for parallel post-processing (tokenization, decoding, etc.) - Explicit shard-based architecture for distributed training - Built-in checkpointing

CPU-bound transforms such as tokenization can be parallelized, similar to `num_workers` in PyTorch, while remaining compatible with Spiral's efficient I/O. ```python import pyarrow as pa # CPU-bound batch transform function def tokenizer_fn(rb: pa.RecordBatch): # tokenize the record batch return rb data_loader: SpiralDataLoader = scan.to_data_loader(transform_fn=tokenizer_fn, map_workers=8) ``` Built-in checkpointing allows resuming training from the last processed record in case of interruptions. ```python loader = scan.to_data_loader(batch_size=2, seed=42) for i, batch in enumerate(loader): if i == 3: # Save checkpoint checkpoint = loader.state_dict() # ... # Resume from checkpoint loader = scan.resume_data_loader(checkpoint, batch_size=32) ``` A lower-level interface, a Torch-compatible `IterableDataset`, is also available. ```python from torch.utils.data import IterableDataset, DataLoader iterable_dataset: IterableDataset = scan.to_iterable_dataset() data_loader = DataLoader(iterable_dataset, batch_size=32) ``` ## Distributed Training Distributed training is natively supported. ```python data_loader: SpiralDataLoader = scan.to_distributed_data_loader(seed=42, batch_size=32) ``` Optionally, world size and rank can be specified explicitly.
```python from spiral import World world = World(rank=0, world_size=64) data_loader: SpiralDataLoader = scan.to_distributed_data_loader( world=world, seed=42, batch_size=32 ) ``` Training on data with a highly selective filter in a distributed setting can lead to unbalanced shards across nodes. In such cases, it is recommended to provide shards via the `shards` argument to ensure balanced data distribution. Shards can be obtained from a scan via [shards()](https://docs.spiraldb.com/python-api/spiral#shards) or built using [compute\_shards()](https://docs.spiraldb.com/python-api/spiral#compute_shards). --- # Overview Section: Collections Source: https://docs.spiraldb.com/collections/overview Collections are not yet available. Please contact us if you are interested. SpiralDB stores your data as a tree of **collections** — like JSON arrays of objects, but backed by columnar storage. Each collection is physically a separate sorted table. Queries across collections are sort-merge joins — the cheapest possible join! ## Schema Here is a dataset of scene recordings for world modeling, where each scene contains clips from different cameras, and each clip contains video frames, object detections, and audio: ```python from spiral import Spiral sp = Spiral() ds = sp.dataset("recordings") print(ds.schema) ``` ``` recordings ├── scenes {_id: utf8} │ ├── location: utf8 │ ├── clips {_id: utf8} │ │ ├── camera: utf8 │ │ ├── start_time: f64 │ │ ├── duration: f64 │ │ ├── frames [] one per video frame (e.g. 30fps) │ │ │ ├── rgb: tensor[H, W, 3] │ │ │ └── depth: tensor[H, W] │ │ ├── detections [] one per detected object (variable rate) │ │ │ ├── label: utf8 │ │ │ ├── confidence: f32 │ │ │ └── ts: f64 │ │ └── audio [] one per chunk (e.g.
512 samples at 16kHz) │ │ ├── pcm: tensor[512] │ │ └── level_db: f32 │ └── metadata: struct │ ├── weather: utf8 │ └── lighting: utf8 └── models {_id: utf8} ├── name: utf8 └── config: struct ├── lr: f64 └── batch_size: i32 ``` Three kinds of things appear in this tree: - **Collections** (`scenes`, `clips`, `frames`, `detections`, `audio`, `models`). Records you can iterate over. Each has its own key — either *keyed* (`{_id: type}` — upserts match on `_id`) or *positional* (`[]` — rows are ordered, with a virtual `_pos` field, `u64`, starting at 0). - **Structs** (`metadata`, `config`). Single nested objects — one per parent record. Dot access navigates through them. They don't introduce new rows. - **Scalars** (`location`, `duration`, `rgb`, `confidence`). Leaf values. Note that `frames`, `detections`, and `audio` are **sibling collections** within `clips`. **They have independent row counts** — a 30-second clip might have 900 frames, 12 detections, and 960 audio chunks. ## Physical Model Each collection is stored as its own sorted columnar table. A child table's sort key is always a prefix extension of its parent's key: ``` scenes: sorted by (scenes_id) clips: sorted by (scenes_id, clips_id) frames: sorted by (scenes_id, clips_id, frames_pos) detections: sorted by (scenes_id, clips_id, detections_pos) audio: sorted by (scenes_id, clips_id, audio_pos) ``` Because every parent-child pair shares a key prefix, joins between them are sort-merge joins. When two streams have identical key sets (not just the same key schema, but the same actual rows), the join degenerates to a zip: walk both streams in lockstep with no searching. ## Navigating the Tree Dot access walks the tree and returns a lazy reference. Nothing is read until you scan. 
```python scenes = ds.scenes clips = scenes.clips frames = clips.frames detections = clips.detections audio = clips.audio frames.rgb # references recordings/scenes/clips/frames/rgb scenes.metadata.weather # references recordings/scenes/metadata/weather ``` Save any part of the tree to a variable and keep navigating. Path variables make queries readable — especially when referencing ancestors deep in the tree. ## Expression Levels Every expression has a **level** — the collection whose key columns define its row domain. The level is determined by the expression's source and is always unambiguous. **Field references** have the level of their collection: ```python scenes.location # level: scenes — one value per scene clips.duration # level: clips — one value per clip detections.confidence # level: detections — one value per detection ``` **Scalar operations** inherit the deepest level of their operands: ```python clips.duration * 2 # level: clips detections.confidence * 100 # level: detections detections.ts - clips.start_time # level: detections (clips.start_time broadcasts) ``` **Reductions** move one level up (to the immediate parent): ```python clips.duration.sum() # level: scenes (clips → scenes) detections.confidence.mean() # level: clips (detections → clips) detections.confidence.mean().sum() # level: scenes (detections → clips → scenes) ``` ## Scanning `ds.scan()` reads data and produces rows. Pass one or more expressions — they must all lie on the same ancestor path (no siblings or unrelated branches). Ancestor expressions are broadcast to the scan level automatically. ```python ds.scan(duration=clips.duration, camera=clips.camera).to_arrow() ``` ``` ┌──────────┬────────┐ │ duration │ camera │ ├──────────┼────────┤ │ 30.0 │ front │ │ 15.0 │ front │ │ 20.0 │ rear │ └──────────┴────────┘ ``` The **scan level** is the deepest result level among all expressions. Shallower expressions are broadcast automatically. 
```python # Scan level = clips (deepest result level) ds.scan( duration=clips.duration, # level: clips location=scenes.location, # level: scenes → broadcasts to clips confidence_mean=detections.confidence.mean(), # level: clips (reduced from detections) ).to_arrow() ``` ``` ┌──────────┬──────────┬─────────────────┐ │ duration │ location │ confidence_mean │ ├──────────┼──────────┼─────────────────┤ │ 30.0 │ downtown │ 0.85 │ │ 15.0 │ downtown │ 0.92 │ │ 20.0 │ highway │ 0.78 │ └──────────┴──────────┴─────────────────┘ ``` **Projection forms.** Several shorthand forms work inside `ds.scan()`: ```python # Bare collections ds.scan(frames) # Kwargs — keys become output column names ds.scan(rgb=frames.rgb, dur=clips.duration) # Dict — keys become output column names, nested dicts become structs ds.scan({ 'image': frames.rgb, 'meta': { 'dur': clips.duration, 'loc': scenes.location, }, }) ``` ## Implicit Broadcasting When an ancestor expression appears in a scan at a deeper level, its values are automatically repeated for every child row. This is the only implicit operation — it is always unambiguous because each child row has exactly one parent. ```python ds.scan( rgb=frames.rgb, # level: frames duration=clips.duration, # level: clips → broadcast to frames location=scenes.location, # level: scenes → broadcast to frames ).to_arrow() ``` ``` ┌─────┬──────────┬──────────┐ │ rgb │ duration │ location │ ├─────┼──────────┼──────────┤ │ f0 │ 30.0 │ downtown │ │ f1 │ 30.0 │ downtown │ │ f2 │ 15.0 │ downtown │ │ f3 │ 20.0 │ highway │ └─────┴──────────┴──────────┘ ``` Physically, this is a sort-merge join between the frames table and the clips/scenes tables on their shared key prefix. 
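As a toy illustration of that sort-merge broadcast (plain Python, not SpiralDB internals), both tables are sorted by the shared key prefix, so a single forward pass pairs each frame with its clip's value — no hashing and no random access:

```python
# Toy illustration: broadcast a parent column (clips.duration) to child
# rows (frames) with one forward walk over two key-sorted tables.
clips = [  # sorted by (scene_id, clip_id)
    ("s1", "c1", 30.0),
    ("s1", "c2", 15.0),
    ("s2", "c3", 20.0),
]
frames = [  # sorted by (scene_id, clip_id, frame_pos)
    ("s1", "c1", 0, "f0"),
    ("s1", "c1", 1, "f1"),
    ("s1", "c2", 0, "f2"),
    ("s2", "c3", 0, "f3"),
]

def broadcast(parents, children):
    out, i = [], 0
    for scene, clip, pos, rgb in children:
        # Advance the parent cursor until its key prefix matches the child's.
        # Both sides only ever move forward because both are sorted.
        while parents[i][:2] != (scene, clip):
            i += 1
        out.append((rgb, parents[i][2]))
    return out

print(broadcast(clips, frames))
# [('f0', 30.0), ('f1', 30.0), ('f2', 15.0), ('f3', 20.0)]
```

Because each child row has exactly one parent, the parent cursor never needs to back up, which is what makes the broadcast effectively free.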
**Siblings and unrelated collections are errors:** ```python # ✗ frames and detections are siblings — neither is an ancestor of the other ds.scan(rgb=frames.rgb, label=detections.label) # Error: cannot combine 'frames' and 'detections' — they are siblings # ✗ models is on a different branch entirely ds.scan(dur=clips.duration, name=ds.models.name) # Error: 'clips' and 'models' share no ancestor path ``` Siblings must be explicitly aligned to be combined in the same scan — see [Aligning Siblings](#aligning-siblings) below. ## Reduction Referencing a child collection's fields from a parent level requires an explicit **reduction** — you must specify how many-to-one values collapse. Reduction methods move the expression one level up to the immediate parent. ```python # At scenes level: one value per scene clips.duration.sum() # total recording time per scene clips.duration.mean() # average clip duration per scene clips.duration.list() # list of clip durations per scene clips.duration.count() # number of clips per scene clips.duration.min() # shortest clip per scene clips.duration.max() # longest clip per scene clips.duration.first() # first clip's duration per scene clips.duration.last() # last clip's duration per scene ``` Reductions compose — each step moves one level up: ```python detections.confidence.mean() # level: clips (mean per clip) detections.confidence.mean().sum() # level: scenes (sum of per-clip means) ``` ## Filtering Filter narrows a collection's row domain without changing its level. It is a first-class operation on collections. 
**`.where()` on a collection** returns a filtered collection that can be reused: ```python from spiral import _ # `_` refers to the collection being filtered: long_clips = clips.where(_.duration > 10) # Use the filtered collection in scans: ds.scan(duration=long_clips.duration).to_arrow() ``` ``` ┌──────────┐ │ duration │ ├──────────┤ │ 30.0 │ │ 15.0 │ │ 20.0 │ └──────────┘ ``` `_` supports arithmetic, comparisons, and boolean operators (`&`, `|`, `~`): ```python from spiral import _ # Compound predicates: high_conf_cars = detections.where((_.confidence > 0.8) & (_.label == "car")) cars_or_peds = detections.where((_.label == "car") | (_.label == "pedestrian")) not_bicycles = detections.where(~(_.label == "bicycle")) ``` **Filters affect reductions.** When a filtered collection is aggregated, only the matching rows participate: ```python high_conf = detections.where(_.confidence > 0.9) # Count only high-confidence detections per clip: ds.scan(cnt=high_conf.count()) ``` **Filter placement is an optimizer concern.** The user specifies *what* to filter; the system decides *where* in the plan to apply it. When two streams have identical row domains, the join becomes a zip (lockstep walk, no searching). The optimizer weighs this against the benefit of pushing filters down to reduce data volume. ## Lists and Collections Lists and child collections are the same abstraction viewed from different levels. From the **clips** level, `detections` is conceptually a list — for each clip, there is a variable-length sequence of detection rows. Reducing that list is the same as a streaming GroupBy over the detections collection: ```python detections.confidence.sum() # streaming: one pass, constant memory detections.confidence.list() # materializes all values as a list per clip ``` Both produce one value per clip. The first streams and reduces in a single pass; the second materializes the list for later processing. Use `.sum()`, `.mean()`, etc. 
when you want a scalar; use `.list()` when you need the raw values. Going the other direction, if a table has a list-typed column, you can **explode** it to create a child level: ```python # Suppose scenes has a tags: list column scenes.tags # level: scenes, type: list scenes.tags.len() # level: scenes, type: u64 scenes.tags.explode() # level: (_id, $idx) — one row per tag ``` `.explode()` is the inverse of `.list()`. A list column IS a pre-materialized child collection. This duality means the same methods work on both child collection fields and list columns: ```python # Child collection field at parent level: detections.confidence.sum() # sum of confidences per clip detections.confidence.list() # list of confidences per clip detections.confidence.count() # number of detections per clip # List column in the schema: scenes.tags.sum() # (if numeric) sum of elements scenes.tags.list() # identity — it's already a list scenes.tags.count() # number of elements ``` ## Window Functions A window function computes a value for each row using its neighboring rows within the same parent group. It is a **reduction that preserves the input's structure** — the result stays at the same level as the input. ```python # Rolling 3-frame window: for each frame, collect [frame-2, frame-1, frame] frames.rgb.window(frame=(-2, 0)).list() # level: frames, type: list # Rolling mean over a 5-detection window: detections.confidence.window(frame=(-2, 2)).mean() # level: detections, type: f32 ``` **Windows never cross parent boundaries.** A rolling window on frames is implicitly partitioned by clips — frame 0 of clip c2 never includes frames from clip c1. This is because a window function is fundamentally a reduction grouped by the parent level: 1. Group frames by parent clip 2. Within each group, compute a sliding aggregate 3. Result: one value per frame (same cardinality as input) The system optimizes this as a streaming window — no materialization of intermediate lists. 
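The partition-boundary rule can be sketched in plain Python — a toy model (not the streaming implementation) of a `(-2, 0)` rolling mean over frames, implicitly partitioned by their parent clip:

```python
# Toy model: rolling mean over a (-2, 0) window of frame values.
# Neighbors from a different parent clip are excluded, so the window
# never crosses a clip boundary.
frames = [  # (clip_id, value), sorted by clip
    ("c1", 1.0), ("c1", 2.0), ("c1", 3.0),
    ("c2", 10.0), ("c2", 20.0),
]

def rolling_mean(rows, lo=-2, hi=0):
    out = []
    for i, (clip, _) in enumerate(rows):
        # Slice the window, then keep only rows sharing the same parent.
        window = [v for c, v in rows[max(0, i + lo): i + hi + 1] if c == clip]
        out.append(sum(window) / len(window))
    return out

print(rolling_mean(frames))
# [1.0, 1.5, 2.0, 10.0, 15.0]
```

Note how the first frame of `c2` restarts the window at `10.0` instead of mixing in `c1` values — exactly the grouped-reduction behavior described above.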
## Aligning Siblings `frames`, `detections`, and `audio` are siblings — they share a parent (`clips`) but have independent row domains. To combine them in a single scan, you need to align them to a **shared index**. ### Resampling `.resample(by, target)` is a **bucket-join**: each row in `self` is assigned to a row in `target` by evaluating the `by` expression, which must produce a value in `target`'s positional key space (i.e. a valid `_pos` index into `target`). Because both `self` and `target` are sorted by the same key prefix, and `by` must be monotonically non-decreasing, the assignment is a streaming walk — both sides advance forward together with no random access. Multiple source rows may land in the same bucket, so `resample` pushes a virtual `_pos` component onto the result key. Chain a reduction (`.mean()`, `.count()`, `.list()`, etc.) to collapse back to `target`'s key level. The two parameters: - **`by`** — an expression evaluated over `self` that maps each source row to a `target` position. Must be monotonically non-decreasing. - **`target`** — a sibling collection (same parent as `self`) that defines the output row domain. ```python clips = ds.scenes.clips detections = clips.detections frames = clips.frames # Convert detection timestamp (seconds) → frame index at clip fps. # At 10fps: ts=0.05 → frame 0, ts=0.15 → frame 1. frame_idx = (detections.ts * clips.fps).floor() resampled = detections.resample(by=frame_idx, target=frames) # resampled key: (scenes_id, clips_id, frames_pos, _pos) # Chain reductions to get back to frames level. 
ds.scan( rgb = frames.rgb, avg_conf = resampled.confidence.mean(), det_count = resampled.count(), ).to_arrow() ``` ``` ┌─────┬─────┬──────────┬───────────┐ │ _id │ rgb │ avg_conf │ det_count │ ├─────┼─────┼──────────┼───────────┤ │ c1 │ f0 │ 0.90 │ 1 │ │ c1 │ f1 │ 0.85 │ 1 │ │ c2 │ f2 │ 0.70 │ 1 │ │ c3 │ f3 │ 0.775 │ 2 │ └─────┴─────┴──────────┴───────────┘ ``` ## Row Selection The `[]` operator **steps into** a collection's key prefix, pinning one or more leading key values. Each index removes the leftmost unbound key dimension from the row domain — you are walking into the tree, not filtering arbitrary rows. ```python ds.scenes["s1"] # pin scenes._id = "s1" ds.scenes["s1"].clips["c1"] # pin scenes._id = "s1", clips._id = "c1" ds.scenes["s1"].clips["c1"].frames[0] # fully specified — one frame ``` Because the data is sorted by key prefix, stepping into a prefix is a seek, not a scan. All downstream expressions and joins see the narrowed domain: ```python s1_clips = ds.scenes["s1"].clips ds.scan(duration=s1_clips.duration, location=scenes.location).to_arrow() ``` ``` ┌──────────┬──────────┐ │ duration │ location │ ├──────────┼──────────┤ │ 30.0 │ downtown │ │ 15.0 │ downtown │ └──────────┴──────────┘ ``` When every collection in the path is scalar-indexed, you have a fully-specified path to a single value: ```pycon >>> ds.scenes["s1"].location.to_py() "downtown" >>> ds.scenes["s1"].clips["c1"].duration.to_py() 30.0 ``` ### Selection forms **Scalar index** — pins one key value at one level: ```python ds.scenes["s1"] # one scene ds.scenes["s1"].clips["c1"] # one clip within one scene ds.scenes["s1"].clips["c1"].frames[0] # one frame (by _pos) ``` **Array of keys** — selects specific rows (must be sorted): ```python ds.scenes[["s1", "s2", "s5"]] # level: scenes, row domain: 3 scenes ``` **Key table** — multi-level row selection. 
Each column corresponds to a level in the hierarchy, rows correspond element-wise: ```python import pyarrow as pa ds.scenes.clips[pa.table({ "scenes": ["s1", "s1", "s2"], "scenes.clips": ["c1", "c2", "c3"], })] # level: clips, row domain: 3 specific clips ``` Key tables are useful for sampling, precomputed batch indices, or feeding an externally-computed selection into a training loop. ## Execution Model ### Physical Storage Each collection is a separate sorted table. Conceptually, child tables store parent keys for joining: **scenes** — sorted by `(scenes_id)` ``` ┌───────────┬──────────┐ │ scenes_id │ location │ ├───────────┼──────────┤ │ s1 │ downtown │ │ s2 │ highway │ └───────────┴──────────┘ ``` **clips** — sorted by `(scenes_id, clips_id)` ``` ┌───────────┬──────────┬──────────┐ │ scenes_id │ clips_id │ duration │ ├───────────┼──────────┼──────────┤ │ s1 │ c1 │ 30.0 │ │ s1 │ c2 │ 15.0 │ │ s2 │ c3 │ 20.0 │ └───────────┴──────────┴──────────┘ ``` **frames** — sorted by `(scenes_id, clips_id, frames_pos)` ``` ┌───────────┬──────────┬────────────┬─────┬───────┐ │ scenes_id │ clips_id │ frames_pos │ rgb │ depth │ ├───────────┼──────────┼────────────┼─────┼───────┤ │ s1 │ c1 │ 0 │ f0 │ d0 │ │ s1 │ c1 │ 1 │ f1 │ d1 │ │ s1 │ c2 │ 0 │ f2 │ d2 │ │ s2 │ c3 │ 0 │ f3 │ d3 │ └───────────┴──────────┴────────────┴─────┴───────┘ ``` In practice, SpiralDB maintains efficient indexes to optimize cross-level joins, and the Vortex columnar format makes broadcasting a zero-copy operation. But the logical model above is equivalent. 
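The streaming resample described under [Aligning Siblings](#aligning-siblings) can also be sketched against this logical model. Below is a toy model in plain Python, assuming a single clip and a `by` expression already evaluated to a monotone list of frame positions:

```python
# Toy model of resample as a bucket-join: each detection is assigned to a
# frame bucket by a monotonically non-decreasing `by` value, then a
# reduction (mean) collapses the buckets back to the frames level.
confidences = [0.9, 0.8, 0.7, 0.8, 0.75]  # one value per detection
frame_idx = [0, 1, 2, 3, 3]               # `by`: target frame positions
n_frames = 4

def resample_mean(values, by, n_targets):
    buckets = [[] for _ in range(n_targets)]
    for value, pos in zip(values, by):
        # `by` is monotone, so this assignment is a single forward pass.
        buckets[pos].append(value)
    return [sum(b) / len(b) if b else None for b in buckets]

means = resample_mean(confidences, frame_idx, n_frames)
# One mean per frame; frames 0-2 get one detection each, frame 3 averages two.
```

Frames with no detections would yield `None` here; the real system's null semantics may differ.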
### Execution Plan Consider the same resampling query from the [previous section](#aligning-siblings), but with a filter on detections: ```python dts = detections.where(detections.label == "pedestrian") ds.scan( frames.rgb, dts.resample(by=(dts.ts * clips.fps).floor(), target=frames).list(), clips.id, clips.duration, ).to_arrow() ``` Each node below is a streaming operator; data flows top-to-bottom from table scans to the final output: ``` Scan(scenes) Scan(clips) Scan(frames) Scan(detections) │ │ │ │ └─── Join ────┘ │ Filter(label=="pedestrian") │ │ │ │ │ Resample(agg=list) │ │ │ │ └──────── Join ──────────┘ │ │ └─────────── Join ──────────────┘ │ Output ``` Four table scans feed into the plan. The filter on `label` is pushed down before resampling. Resample is a bucket join, and it enables a sort-merge join with the frames branch. The ancestor fields (scenes, clips) are joined via sort-merge on the shared key prefix. **Filter placement is an optimizer choice.** The filter could instead be applied after the zip — this preserves the cheap lockstep walk but processes more rows through the resample. Pushing the filter before the zip reduces data volume but may break the row-domain identity, turning the zip into a join. The optimizer weighs filter selectivity against join cost. **Sideways information passing.** When the filter is pushed to the detections scan, it can also accelerate the frames scan. Both streams are sorted by the same key prefix (`scenes_id`, `clips_id`). As the filtered detections stream skips past clips with no matching detections, it can pass updated key bounds to the frames stream, allowing it to seek forward rather than scan through frames that will be discarded after the zip. This cross-operator communication — where one scan's progress informs another's seek position — is known as **sideways information passing**. ## Writing Data All writes are **upserts**. Writing to an existing key replaces the record; writing to a new key creates it.
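The upsert rule can be sketched with a plain dict keyed by `_id` — a toy model of the semantics, not of the storage engine:

```python
# Toy model of upsert semantics: a collection as a dict keyed by _id.
scenes = {}

def upsert(collection, key, record):
    # Existing key: the whole record is replaced. New key: it is created.
    collection[key] = record

upsert(scenes, "s1", {"location": "downtown"})
upsert(scenes, "s2", {"location": "highway"})
upsert(scenes, "s1", {"location": "park"})  # replaces the existing "s1" record

print(scenes["s1"]["location"])  # park
print(len(scenes))               # 2
```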
**Prefer batch writes** for best performance — columnar storage is most efficient when writing many records at once. **Single records** — setitem with a dict: ```python ds.scenes["s3"] = { "location": "park", "metadata": {"weather": "sunny", "lighting": "natural"}, "clips": [ { "_id": "c10", "camera": "front", "start_time": 0.0, "duration": 60.0, "frames": [{"rgb": b"f10", "depth": b"d10"}], "detections": [{"label": "car", "confidence": 0.9, "ts": 0.5}], "audio": [{"pcm": b"a10", "level_db": -20.0}], }, ], } ``` **Scalars** — when the path is fully indexed: ```python ds.scenes["s1"].location = "downtown_v2" ds.scenes["s1"].clips["c1"].duration = 45.0 ``` **Batch writes** — setitem with an Arrow table or list of dicts: ```python import pyarrow as pa ds.scenes["s1"].clips["c1"].frames = pa.table({ "rgb": [b"f20", b"f21", b"f22"], "depth": [b"d20", b"d21", b"d22"], }) ``` Note that positional collections (like `frames`) do not support upserting individual rows by index. You can only replace the entire collection at once. **Scoped writes** — the client batches them for you: ```python with ds.write() as w: for scene in scene_dicts: w.scenes[scene['_id']] = scene # all writes flushed as a single batch on exit ``` --- # Data Model Section: Tables Source: https://docs.spiraldb.com/tables/data-model The data model of Spiral Tables is a dictionary-like structure of columnar arrays, sorted and unique by a set of primary key columns. The data model enables: - Efficient **sparse columns**. - Columns arranged in a **nested dictionary**-like structure. - Support for large values and **cell-level pushdown** filtering. - Support for **appending columns** without rewriting the entire table. ## Key Schema When a table is created, a `key schema` is defined that represents the primary key and sort order of the table. This schema is fixed and cannot be changed after the table is created.
The key schema can be any number of columns of the following types: - `(u)int{8,16,32,64}` - `float{16,32,64}` - `timestamp` - `bytes` (up to 1KB) - `string` (up to 1KB) ## Column Groups The columns of a table are arranged in a nested dictionary-like structure. We refer to this as the *schema tree*. Column groups can be thought of as vertical partitions of the table. Tables provide complete isolation between column groups, scanning only the column groups needed for a query. For example, the [GitHub Archive](https://www.gharchive.org) dataset has a key schema that looks like this: ```python import pyarrow as pa key_schema = [('created_at', pa.timestamp('us')), ('id', pa.int64())] ``` And a schema tree of: ```python schema_tree = { 'name': pa.string(), 'public': pa.bool_(), 'payload': pa.string(), 'repo': { 'id': pa.int64(), 'name': pa.string(), 'url': pa.string() }, 'actor': { 'id': pa.int64(), 'login': pa.string(), 'gravatar_id': pa.string(), 'url': pa.string(), 'avatar_url': pa.string() }, 'org': { 'id': pa.int64(), 'login': pa.string(), 'gravatar_id': pa.string(), 'url': pa.string(), 'avatar_url': pa.string() } } ``` A **column group** refers to a set of sibling leaf columns in the schema tree. For example, in the schema tree above, there is a root (\`\`) column group containing the `name`, `public`, and `payload` columns; as well as `repo`, `actor`, and `org` column groups. ## Storage Model Each column group is stored as a log-structured merge (LSM) tree in object storage. This is a data structure that consists of sorted runs of data. The key columns are split out and stored in key files, while the value columns are stored in fragment files. Background maintenance jobs periodically compact the LSM tree, merging overlapping sorted runs of value columns to improve read performance. See [Table Format](https://docs.spiraldb.com/format) for a more detailed specification of the storage model.
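To build intuition for the LSM read path, here is a toy sketch in plain Python — assuming runs are ordered oldest to newest and that a read simply keeps the newest value per key; the actual on-disk layout is specified in the Table Format docs:

```python
# Toy model of LSM merge-on-read: each write produces a key-sorted run;
# a read merges the runs, and for a duplicated key the newest run wins.
runs = [
    [(1, "a0"), (3, "c0")],   # oldest sorted run
    [(2, "b1"), (3, "c1")],   # newer sorted run: key 3 was rewritten
]

def read_merged(runs):
    merged = {}
    for run in runs:             # runs ordered oldest -> newest
        for key, value in run:
            merged[key] = value  # newer values overwrite older ones
    return sorted(merged.items())

print(read_merged(runs))
# [(1, 'a0'), (2, 'b1'), (3, 'c1')]
```

Compaction does the same newest-wins merge ahead of time, so reads touch fewer overlapping runs.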
--- # Write Tables Section: Tables Source: https://docs.spiraldb.com/tables/write We're going to explore the Spiral Tables write API using the ever-classic [GitHub Archive dataset](https://www.gharchive.org/). ## Create a Table Currently, tables can only be created using the Spiral Python API. ```python import pyarrow as pa import spiral from spiral.demo import demo_project sp = spiral.Spiral() project = demo_project(sp) # Define the key schema for the table. # Random keys such as UUIDs should be avoided, so we're using `created_at` as part of the primary key. key_schema = pa.schema([('created_at', pa.timestamp('ms')), ('id', pa.string())]) # Create the 'events-v1' table in the 'gharchive' dataset events = project.create_table('gharchive.events-v1', key_schema=key_schema, exist_ok=True) ``` To open the table in the future: ```python events = project.table("gharchive.events-v1") ``` We'll define a function that downloads data from the GitHub Archive, and use it to download an hour of data. ```python import pyarrow as pa import pandas as pd import duckdb def get_events(period: pd.Period, limit=100) -> pa.Table: json_gz_url = f"https://data.gharchive.org/{period.strftime('%Y-%m-%d')}-{str(period.hour)}.json.gz" arrow_table = ( duckdb.read_json(json_gz_url, union_by_name=True) .limit(limit) # remove this line to load more rows .select(""" * REPLACE ( cast(created_at AS TIMESTAMP_MS) AS created_at, to_json(payload) AS payload ) """) .to_arrow_table() ) events = duckdb.from_arrow(arrow_table).order("created_at, id").distinct().to_arrow_table() events = ( events.drop_columns("id") .add_column(0, "id", events["id"].cast(pa.large_string())) .drop_columns("created_at") .add_column(0, "created_at", events["created_at"].cast(pa.timestamp("ms"))) .drop_columns("org") ) return events data: pa.Table = get_events(pd.Period("2023-01-01T00:00:00Z", freq="h")) ``` ## Write to a Table All write operations to a Spiral table must include the key columns.
Writing nested data such as struct arrays or dictionaries into a table creates column groups. See [Best Practices](https://docs.spiraldb.com/tables/best-practices#column-groups) for how to design your schema for optimal performance.

The write API accepts a variety of data formats, including PyArrow tables and dictionaries. Since our function returns a PyArrow table, we can write it directly to the Spiral table.

```python
events.write(data)
```

We could also have written multiple times atomically within a transaction.

```python
with events.txn() as tx:
    for period in pd.period_range("2023-01-01T00:00:00Z", "2023-01-01T01:00:00Z", freq="h"):
        data = get_events(period)
        tx.write(data)
```

Or in parallel, which is useful when ingesting large amounts of data:

```python
from concurrent.futures import ThreadPoolExecutor

def write_period(tx, period: pd.Period):
    data = get_events(period)
    tx.write(data)

with events.txn() as tx:
    with ThreadPoolExecutor() as executor:
        for period in pd.period_range("2023-01-01T00:00:00Z", "2023-01-01T01:00:00Z", freq="h"):
            executor.submit(write_period, tx, period)
```

It is important that the primary key columns are unique within the transaction, because all writes in a transaction will have the same timestamp. If a different value is written for the same column and the same key in the same transaction, the result when reading the table is **undefined**.

Our table schema now looks like this:

```python
events.schema().to_arrow()
```

```python
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "payload": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})
```

Recall that column groups are defined as sibling columns in the same node of the schema tree.
Our schema, in addition to the root column group, contains `actor` and `repo` column groups.

While columnar storage formats are quite efficient at pruning unused columns, the distance in the file between the data for each column can introduce seek operations that can significantly slow down queries over object storage. In many cases, if you know upfront that you will always query a set of columns together, you can improve query performance by grouping these columns together in a column group.

## Appending Columns

Spiral Tables are merged across writes, making it very easy to append columns to an existing table. Suppose we wanted to derive an `action` column. We could easily append this derived data to the existing table without rewriting the entire dataset. This example will use some syntax from the `spiral.expressions` module. For more detailed examples, please refer to [Expressions](https://docs.spiraldb.com/tables/expressions).

```python
action = sp.scan({
    'action': {
        "repo_id": events["repo.id"],
        "actor_id": events["actor.id"],
        "type": events["type"],
        "payload": events["payload"]
    }
}, where=events["repo.name"] == "apache/arrow").to_table()

# Write the derived column group back to the table.
events.write(action)
```

Our table schema now contains an `action` column group:

```python
events.schema().to_arrow()
```

```python
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "payload": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
    "action": pa.struct({
        "actor_id": pa.int64(),
        "payload": pa.string(),
        "repo_id": pa.int64(),
        "type": pa.string(),
    }),
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})
```

---

# Scan & Query

Section: Tables
Source: https://docs.spiraldb.com/tables/scan

Scanning tables is the process of reading data row by row, performing row-based scalar transformations or
row-based filtering operations.

## Scan Object

The `scan` method returns a `Scan` object that encapsulates a specific query. This object can then return rows of data, or be used to perform further operations.

```python
import spiral
import pyarrow as pa
from spiral.demo import gharchive

sp = spiral.Spiral()
events_table = gharchive(sp)

scan = sp.scan(events_table[["id", "type", "public", "actor", "payload.*", "repo"]])

# The result schema of the scan
schema: pa.Schema = scan.schema

# Read as a stream of RecordBatches
record_batches: pa.RecordBatchReader = scan.to_record_batches()

# Read into a single PyArrow Table
arrow_table: pa.Table = scan.to_table()

# Read into a Dask DataFrame
dask_df: dd.DataFrame = scan.to_dask()

# Read into a Ray Dataset
# ray_ds: ray.data.Dataset = scan.to_ray_dataset()

# Read into a Pandas DataFrame
pandas_df: pd.DataFrame = scan.to_pandas()

# Read into a Polars DataFrame
polars_df: pl.DataFrame = scan.to_polars()

# Read into a Torch-compatible DataLoader for model training
data_loader: torch.utils.data.DataLoader = scan.to_data_loader()

# Read into a Torch-compatible DataLoader for distributed model training
distributed_data_loader: torch.utils.data.DataLoader = scan.to_distributed_data_loader()

# Read into a Torch-compatible IterableDataset for model training
iterable_dataset: torch.utils.data.IterableDataset = scan.to_iterable_dataset()
```

Scan's `to_record_batches` method returns rows sorted by the table's key. If ordering is not important, consider using `to_unordered_record_batches`, which can be faster. A scan executes splits concurrently, but can only keep a limited number of splits in memory at a time, which caps the degree of concurrency for ordered streams. Unlike ordered streams, `to_unordered_record_batches` yields batches as soon as they are ready, without waiting for the previous batch to be processed, so buffering never limits concurrency.

## Filtering

Filtering is the process of selecting rows that meet a certain condition.
For example, to find events with a specific event type:

```python
pull_request_events = sp.scan(
    events_table,
    where=events_table['type'] == 'PullRequestEvent'
)
```

Any expression that resolves to a boolean value can be used as a filter. See the [Expressions](https://docs.spiraldb.com/tables/expressions) documentation for more information.

## Projection

Projection is the process of applying a transformation function to a single row of a table. This can be as simple as selecting a subset of columns, through to much more complex functions such as passing a string column through an LLM API. A projection expression *must* resolve to a struct value. See the [Expressions](https://docs.spiraldb.com/tables/expressions) documentation for more information.

### Nested Data

Scanning a table selects all columns, including nested column groups:

```python
sp.scan(events_table.select(exclude=['payload'])).schema.to_arrow()
```

```python
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    })
})
```

You can select an entire column group using bracket notation. Remember, the result must always be a struct; a column group is itself a struct, so this is valid:

```python
sp.scan(events_table["repo"]).schema.to_arrow()
```

```python
pa.schema({
    "id": pa.int64(),
    "name": pa.string(),
    "url": pa.string(),
})
```

Selecting a single column this way is not valid.
When selecting a single column, use double brackets:

```python
sp.scan(events_table[["public"]]).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
})
```

Or multiple columns:

```python
sp.scan(events_table[["public", "repo"]]).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})
```

You can pack columns with custom names using a dictionary:

```python
sp.scan({
    "is_public": events_table["public"],
    "repo": events_table["repo"],
}).schema.to_arrow()
```

```python
pa.schema({
    "is_public": pa.bool_(),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})
```

Double brackets are syntactic sugar for the `.select()` method:

```python
sp.scan(events_table.select("public")).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
})
```

Selection applies over columns and nested column groups:

```python
sp.scan(events_table.select("public", "repo")).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})
```

It is possible to "select into" a column group to get specific fields within nested structures:

```python
sp.scan(events_table.select("public", "repo.id", "repo.url")).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
    "repo": pa.struct({
        "id": pa.int64(),
        "url": pa.string(),
    }),
})
```

You can select from multiple nested structures:

```python
sp.scan(events_table.select("public", "repo", "actor.login")).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
    "actor": pa.struct({
        "login": pa.string(),
    }),
})
```

Note that selecting "into" a column group returns a nested structure.
To flatten, "step into" the column group first:

```python
sp.scan(events_table[["public"]], events_table["repo"][["id", "url"]]).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
    "id": pa.int64(),
    "url": pa.string(),
})
```

Column groups support the same selection operations. This is equivalent to the flattened example above:

```python
sp.scan(events_table["repo"].select("id", "url")).schema.to_arrow()
```

```python
pa.schema({
    "id": pa.int64(),
    "url": pa.string(),
})
```

Use `exclude` to remove specific columns, column groups, or keys:

```python
sp.scan(events_table.select(exclude=["payload", "repo", "actor"])).schema.to_arrow()
```

```python
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
})
```

The wildcard `"*"` allows you to select or exclude the columns of a specific column group, without including or excluding nested column groups:

```python
sp.scan(events_table.select("*")).schema.to_arrow()
```

```python
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
})
```

Exclude all root columns to keep only the nested column groups, here additionally excluding the `payload` column group:

```python
sp.scan(events_table.select(exclude=["*", "payload"])).schema.to_arrow()
```

```python
pa.schema({
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})
```

Wildcards can be mixed with other selections:

```python
sp.scan(events_table.select("*", "actor.login")).schema.to_arrow()
```

```python
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
    "actor": pa.struct({
        "login": pa.string(),
    }),
})
```

## Querying

Recent years have seen several excellent query engines, including:

- **[Polars](https://pola.rs)** - with Python and Rust DataFrame
APIs
- **[DuckDB](https://duckdb.org)** - with a SQL-oriented API
- **[DataFusion](https://datafusion.apache.org)** - SQL as well as a Rust DataFrame API

### Polars

Tables can also be queried with [Polars LazyFrames](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html):

```python
tbl = gharchive(sp)
df = tbl.to_polars_lazy_frame()
result = df.collect()
```

### DuckDB

Tables can be turned into PyArrow Datasets with `to_arrow_dataset()`, which in turn enables the [DuckDB Python API](https://duckdb.org/docs/api/python/overview).

```python
import duckdb

tbl = gharchive(sp)
ds = tbl.to_arrow_dataset()
result = duckdb.sql("SELECT type, COUNT(*) FROM ds GROUP BY type").arrow()
```

Or using the [DuckDB Relational API](https://duckdb.org/docs/api/python/relational_api.html):

```python
result = tbl.to_duckdb_relation().filter("public is not null").to_arrow_table()
```

## Key Table

A key table scan is the equivalent of a primary key lookup. Any scan can be evaluated against a table of keys to return only the rows that match those keys, with columns defined by the scan's projection. The result will always contain all the rows from the input key table; if the scan specifies any filters, rows that do not match the filter are returned as nulls.

```python
import pyarrow as pa

scan = sp.scan(events_table)

# The key table's columns must match the table's key schema.
key_table = pa.table({
    "created_at": ['2025-11-20 13:00:00'],
    "id": ['4807206745'],
}, schema=pa.schema([("created_at", pa.timestamp("ms")), ("id", pa.string())]))

results = scan.to_record_batches(key_table=key_table)
```

For optimal performance, provide keys that are sorted and unique. Unsorted and duplicate keys are still supported, but performance is less predictable.

It is possible to stream the keys into the scan using an Arrow `RecordBatchReader`.
```python
import pyarrow as pa
from typing import Iterable
from spiral.demo import abc

table_abc = abc(sp)
scan = sp.scan(table_abc.select("a", "b"))

def batches() -> Iterable[pa.RecordBatch]:
    yield from [
        pa.record_batch(
            {
                "a": pa.array([3, 5, 8], type=pa.int64()),
            }
        ),
        pa.record_batch(
            {
                "a": pa.array([0, 1, 2, 6, 7, 9], type=pa.int64()),
            }
        ),
        pa.record_batch(
            {
                "a": pa.array([1, 2, 3, 4, 5], type=pa.int64()),
            }
        ),
    ]

key_table_reader = pa.RecordBatchReader.from_batches(
    table_abc.key_schema.to_arrow(),
    batches(),
)

scan.to_record_batches(
    key_table=key_table_reader
).read_all().to_pylist()
```

```python
[
    {"a": 3, "b": 103},
    {"a": 5, "b": 105},
    {"a": 8, "b": 108},
    {"a": 0, "b": 100},
    {"a": 1, "b": 101},
    {"a": 2, "b": 102},
    {"a": 6, "b": 106},
    {"a": 7, "b": 107},
    {"a": 9, "b": 109},
    {"a": 1, "b": 101},
    {"a": 2, "b": 102},
    {"a": 3, "b": 103},
    {"a": 4, "b": 104},
    {"a": 5, "b": 105},
]
```

Note that when streaming scan results with `to_record_batches`, the result stream can contain more batches than the input key stream; the scan is optimized to return rows as soon as possible. A batch in the result stream will never contain rows that span multiple input key batches.

`to_record_batches` supports a `batch_aligned` parameter that can be used to ensure that each output batch corresponds exactly to an input key batch. The flag can have a performance impact, since it may require buffering rows until the end of the input key batch is reached, and possibly an extra copy. It is recommended to only use this flag when the full result for a given input batch is needed before processing can continue.

## Serialization

Scan objects support Python's pickle protocol, enabling seamless integration with distributed systems like Ray and multiprocessing without requiring manual serialization.
### Pickle

```python
import pickle
from spiral.demo import abc

table_abc = abc(sp)
scan = sp.scan(table_abc[["a", "b"]], where=table_abc["a"] < 3)

# Serialize
pickled = pickle.dumps(scan)

# Deserialize
restored = pickle.loads(pickled)
restored.to_table().to_pylist()
```

```python
[{"a": 0, "b": 100}, {"a": 1, "b": 101}, {"a": 2, "b": 102}]
```

### Ray

```python
import ray
from spiral.demo import abc

table_abc = abc(sp)

@ray.remote
def process_scan(scan):
    return scan.to_table().num_rows

scan = sp.scan(table_abc)
count = ray.get(process_scan.remote(scan))
```

### Manual State (JSON-compatible)

For non-pickle contexts, use `to_bytes_compressed()` and `resume_scan()`:

```python
from spiral.demo import abc

table_abc = abc(sp)
scan = sp.scan(table_abc[["a", "b"]], where=table_abc["a"] < 3)

# Serialize scan state
state = scan.to_bytes_compressed()  # bytes, can be base64-encoded for JSON

# Resume on another process or machine
restored = sp.resume_scan(state)
restored.to_table().to_pylist()
```

```python
[{"a": 0, "b": 100}, {"a": 1, "b": 101}, {"a": 2, "b": 102}]
```

## Cross-table

It is possible to jointly scan any tables that share a common key schema. This has the same behaviour as an outer join, but is more efficient since we know the tables are both sorted by the key columns.

```python
sp.scan(
    {
        "events": events_table,
        "other_events": other_events_table,
    }
)
```

---

# Enrichment

Section: Tables
Source: https://docs.spiraldb.com/tables/enrich

The easiest way to build multimodal Spiral tables is to start with metadata containing pointers, such as file paths or URLs, and use **enrichment** to ingest the actual data bytes, such as images, audio, video, etc. Because the column-group design supports hundreds of thousands of columns, horizontally expanding tables is a powerful primitive. We'll explore this using the [Open Images dataset](https://github.com/cvdfoundation/open-images-dataset), a large-scale collection of images with associated metadata.
## Starting with Metadata

Let's begin by creating a table and loading the image URLs from the Open Images dataset.

```python
import spiral
from spiral.demo import images
import pandas as pd
import pyarrow as pa

sp = spiral.Spiral()
table = images(sp)
```

At this point, our table contains only lightweight metadata:

```python
table.schema().to_arrow()
```

```python
pa.schema({
    "idx": pa.int64(),
    "etag": pa.string(),
    "size": pa.int64(),
    "url": pa.string(),
})
```

## Ingesting Data

We can enrich the table by fetching the actual image data from the ingested URLs.

```python
from spiral import expressions as se

# Create an enrichment that fetches images over HTTP
enrichment = table.enrich(
    se.pack({
        "image": se.http.get(table["url"])
    })
)

# Apply the enrichment in a streaming fashion
enrichment.run()
```

After enrichment, our table contains the actual image data, as well as metadata about the fetch requests:

```python
table.schema().to_arrow()
```

```python
pa.schema({
    "idx": pa.int64(),
    "etag": pa.string(),
    "size": pa.int64(),
    "url": pa.string(),
    "image": pa.struct({
        "bytes": pa.binary(),
        "meta": pa.struct({
            "e_tag": pa.string(),
            "last_modified": pa.timestamp("ms"),
            "location": pa.string(),
            "size": pa.uint64(),
            "status_code": pa.uint16(),
            "version": pa.string(),
        }),
    }),
})
```

### Incremental Progress

Rows can be selectively enriched using the `where` parameter. The `se.http.get()` expression reads data from HTTP(S) URLs. Spiral handles parallel fetching, retries, and error handling automatically, but some requests may still fail (e.g., 404 Not Found). Errors can be inspected using the `status_code` metadata field. Let's retry fetching images that previously failed with a 500 status code.
```python
enrichment = table.enrich(
    se.pack({
        "image": se.http.get(table["url"])
    }),
    where=(table["image.meta.status_code"] == pa.scalar(500, pa.uint16()))
)
enrichment.run()
```

### Other Data Sources

In addition to `se.http.get()`, Spiral supports other data sources such as `se.s3.get()` for object storage ingestion and `se.file.get()` for local filesystem reads. Note that the client environment must have appropriate permissions to access these resources.

## Deriving Columns with UDFs

It's also possible to enrich with User-Defined Functions (UDFs) for custom processing logic. Suppose the same images also live in an S3 bucket, keyed by the path component of their original URLs, and we want to fetch them with `se.s3.get()`, which requires URLs in the form `s3://bucket_name/path/to/object`. First, we define a UDF that extracts the path component from each URL; these paths can then be combined with the bucket name to build S3 URLs.

```python
import pyarrow as pa
from spiral.expressions.udf import UDF

class GetPath(UDF):
    """
    Turns URLs like 'https://c2.staticflickr.com/6/5606/15611395595_f51465687d_o.jpg'
    into paths like '/6/5606/15611395595_f51465687d_o.jpg'.
    """

    def name(self):
        return "get_path"

    def return_type(self, input_type: pa.DataType):
        return pa.string()

    def invoke(self, urls: pa.Array):
        from urllib.parse import urlparse

        paths = [
            urlparse(url).path
            for url in urls.to_pylist()
        ]
        return pa.array(paths, type=pa.string())
```

Now we can use this UDF in an enrichment to append a `path` column to the table.

```python
get_path = GetPath()

enrichment = table.enrich(
    se.pack({
        "path": get_path(table["url"]),
    })
)
enrichment.run()
```

## Distributing Work

Distributed enrichment is supported through distributed execution engines like Dask or Ray. Distributed enrichment defaults to table shards, but that might not provide sufficient parallelism. When the metadata is small, a default shard may contain far too many rows for a single distributed task.
Custom shards can be generated using `compute_shards()`, which will scan the key columns to compute a set of shards with roughly equal numbers of keys (and thus rows) per shard.

---

# Distributed

Section: Tables
Source: https://docs.spiraldb.com/tables/distributed

Both Spiral Table scans and transactions can be row-wise partitioned for use in distributed query engines such as PySpark, Dask, and Ray. For the engines listed below we have first-class integrations. For any other engine, see [Others](#Others) for instructions on how to integrate a Spiral scan or transaction with an arbitrary distributed compute system with a Python API.

## Ray

A Spiral Table can be read by a Ray Dataset:

```python
import ray
import spiral
from spiral.demo import gharchive, demo_project

sp = spiral.Spiral()
table = gharchive(sp)

scan = sp.scan(table)
ds: ray.data.Dataset = scan.to_ray()
```

A Ray Dataset can be written into a Spiral Table:

```python
ds: ray.data.Dataset = sp.scan(table).to_ray()

gharchive_events_copy = demo_project(sp).create_table(
    "gharchive.events_copy", key_schema=table.key_schema
)

with gharchive_events_copy.txn() as txn:
    ds.write_datasink(txn.to_ray_datasink())
```

## Others

Spiral's distribution primitives should integrate with any Python-based distributed compute platform. In this section, we use Python's `ThreadPoolExecutor` to demonstrate this integration.

### Reading

All scan functions (e.g. [`to_table()`](https://docs.spiraldb.com/python-api/spiral#to_table), [`to_pandas()`](https://docs.spiraldb.com/python-api/spiral#to_pandas), [`to_record_batches()`](https://docs.spiraldb.com/python-api/spiral#to_record_batches)) accept a `shards` argument: row-wise partitions of the table defined by non-overlapping key ranges. There are two ways to generate a set of shards for a given scan:

1. [`shards()`](https://docs.spiraldb.com/python-api/spiral#shards), which is cheap to compute but shards based on the physical data layout.
2.
[`compute_shards()`](https://docs.spiraldb.com/python-api/spiral#compute_shards), which scans the key columns to compute a set of shards with roughly equal numbers of keys (and thus rows) per shard.

When distributing work across processes or nodes, one must ensure a consistent table state. The state of a table can be fixed (i.e. later writes are ignored) by constructing a [`snapshot()`](https://docs.spiraldb.com/python-api/spiral#snapshot). A Snapshot's [`asof`](https://docs.spiraldb.com/python-api/spiral#asof), a pseudo-timestamp, is sufficient to uniquely identify the state of a Spiral table. Any scan that specifies the same `asof` will see the same state of the table:

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

asof = table.snapshot().asof
table_id = table.table_id
predicate = table["created_at"] > datetime(2020, 3, 1)

driver_scan = sp.scan(table, where=predicate, asof=asof)
shards = driver_scan.shards()

def worker_fn(shard):
    """Code executed on another node."""
    from spiral import Spiral

    # Each worker constructs its own client rather than reusing the driver's.
    sp = Spiral()
    table = sp.table(table_id)
    worker_scan = sp.scan(table, where=predicate, asof=asof, shard=shard)
    return worker_scan.to_table()

with ThreadPoolExecutor() as executor:
    results = executor.map(worker_fn, shards)
```

### Writing

Spiral Transactions are also horizontally scalable. The orchestration / driver node should start a transaction, execute and collect the transaction operations of each node, compose them into one transaction, and commit the entire transaction at once. Instead of serializing the transaction object itself, we [`Transaction.take()`](https://docs.spiraldb.com/python-api/spiral#take) the list of operations and serialize those with `Operation.to_json`. `Operation.from_json` deserializes an operation, and [`Transaction.include()`](https://docs.spiraldb.com/python-api/spiral#include) subsumes one or more operations into another transaction.
```python
from concurrent.futures import ThreadPoolExecutor
from spiral.core.table.spec import TransactionOps
import pyarrow as pa

project = demo_project(sp)
project_id = project.id
table = project.create_table("squares", key_schema={"key": pa.int64()})

def worker_fn(index: int) -> TransactionOps:
    """Code executed on another node."""
    from spiral import Spiral

    sp = Spiral()
    project = sp.project(project_id)

    txn = project.table("squares").txn()
    txn.write({
        "key": [index],
        "square": [index * index],
    })
    return txn.take()

with table.txn() as txn:
    # The driver-side transaction must start before the workers!
    with ThreadPoolExecutor() as executor:
        worker_results = list(executor.map(worker_fn, [0, 1, 2, 3]))

    for worker_tx in worker_results:
        txn.include(worker_tx)
```

See also:

- [Data Loading](https://docs.spiraldb.com/usecases/data-loaders)

---

# Expressions

Section: Tables
Source: https://docs.spiraldb.com/tables/expressions

The `expressions` module provides a powerful way to build and manipulate expressions for data processing. This guide will walk you through common use cases using [GitHub event data](https://www.gharchive.org/) as examples.

## Data Schema

Let's take a look at the public GitHub Archive dataset hosted in Spiral:

```python
import spiral
from spiral.demo import gharchive
import pyarrow as pa

sp = spiral.Spiral()
table = gharchive(sp)
```

## Basic Concepts

### Tables

Spiral tables are structured as a series of nested structs. Use Python dict syntax to project columns from the table.

```python
# Create a variable referring to the actor's login
actor_login = table["actor"]["login"]
```

### Scalars

Scalars are constant values that can be used in expressions.

```python
# Compare with a scalar
is_octocat = actor_login == "octocat"
```

Scalars can be explicitly created using the `se.scalar` function.
```python
from spiral import expressions as se

# Create a scalar
example_user = se.scalar("octocat")
```

## Common Operations

A full list of common operations is available in the API docs.

### Filtering

```python
# Find issues with number greater than 100
high_number_issues = table["payload"]["issue"]["number"] > 100
```

### Combining

```python
octocat_high_issues = is_octocat & high_number_issues
```

This is equivalent to:

```python
from spiral import expressions as se

# Find issues created by octocat that are high-numbered
octocat_high_issues = se.and_(
    table["actor"]["login"] == "octocat",
    table["payload"]["issue"]["number"] > 100
)
```

## Struct Operations

Struct operations allow you to manipulate nested data structures. Although `se.struct` is a submodule of `se`, some functions are also available directly on `se`.

### Selecting Fields

```python
from spiral import expressions as se

# Select only actor information.
actor_info = se.select(table["actor"], names=["login", "avatar_url"])

# Exclude the avatar URL.
actor_info_no_avatar = se.select(actor_info, exclude=["avatar_url"])
```

### Merging Structs

```python
from spiral import expressions as se

# Combine actor and repo information.
actor_and_repo = se.merge(table["actor"], table["repo"])
```

On a key clash, values are taken from the rightmost struct.

### Creating New Structs

```python
# Create a simplified event representation
simple_event = se.pack({
    "user": table["actor"]["login"],
    "repo_name": table["repo"]["name"],
    "action": table["payload"]["action"],
})
```

## Advanced Example

```python
import spiral
import pyarrow as pa
from spiral import expressions as se
from datetime import datetime
from spiral.demo import gharchive

sp = spiral.Spiral()
table = gharchive(sp)

# First, let's create some reusable components
actor = table["actor"]
repo = table["repo"]
payload = table["payload"]
created_at = table["created_at"]

# 1.
# Filter conditions
is_2024 = created_at >= datetime(2024, 1, 1)
is_issue_event = payload["action"] == "opened"
has_avatar = actor["gravatar_id"] != ""

# 2. Popular repositories (arbitrary threshold for example)
popular_repo = repo["id"] > 1000000

# 3. Specific users of interest
vip_users = ["octocat", "torvalds", "gvanrossum"]
is_vip = se.or_(*[
    actor["login"] == user
    for user in vip_users
])

# 4. Combine all filters
where_filter = se.and_(
    is_2024,
    se.or_(
        se.and_(is_issue_event, popular_repo),
        is_vip
    )
)

# 5. Create a scoring system
user_score = {
    "base_score": se.divide(actor["id"], 100),
    "avatar_bonus": has_avatar.cast(pa.int64()) * 50
}

# 6. Create the final enriched event structure
projection = se.pack({
    # Original fields
    "timestamp": created_at,

    # User info with computed fields
    "user": se.pack({
        "login": actor["login"],
        "is_vip": is_vip,
        "has_avatar": has_avatar,
        **user_score
    }),

    # Repository info
    "repository": se.pack({
        "name": repo["name"],
        "is_popular": popular_repo
    }),

    # Event type info
    "event_type": se.pack({
        "is_issue": is_issue_event,
        # Null for non-issue events
        "issue_number": payload["issue"]["number"],
    }),

    # Computed total score
    "total_score": se.add(
        user_score["base_score"],
        user_score["avatar_bonus"]
    )
})

# 7. Scan the table.
scan = sp.scan(projection, where=where_filter)
scan.schema.to_arrow()
```

```python
pa.schema({
    'timestamp': pa.timestamp("ms"),
    'user': pa.struct({
        'login': pa.string(),
        'is_vip': pa.bool_(),
        'has_avatar': pa.bool_(),
        'base_score': pa.int64(),
        'avatar_bonus': pa.int64()
    }),
    'repository': pa.struct({
        'name': pa.string(),
        'is_popular': pa.bool_()
    }),
    'event_type': pa.struct({
        'is_issue': pa.bool_(),
        'issue_number': pa.int64()
    }),
    'total_score': pa.int64()
})
```

---

# Best Practices

Section: Tables
Source: https://docs.spiraldb.com/tables/best-practices

Spiral Tables are designed to be flexible, yet performant for a wide variety of use cases.
However, there are some best practices to follow when designing your table schema to ensure optimal performance.

## Key Schema

The key schema defines the sorting order of the table and is crucial for performance. In general, rows frequently accessed together should be "close" in the key space.

- Avoid using random keys such as UUIDs, as they lead to low locality and frequent compactions.
- If possible, use keys that reflect the natural ordering of your data. If data doesn't have a natural ordering, consider using a time-based key (e.g., UUIDv7, a timestamp column, etc.) to group recent data together.

## Column Groups

Column groups partition a table's columns into sets that are (almost) completely independent from each other. This design allows Spiral Tables to support very wide schemas with hundreds of thousands of columns efficiently.

- Columns that aren't used as filters should be separated from those that commonly appear in filters.
- Columns that commonly appear in filters together should be placed in the same column group.
- Columns with large cells (e.g., text), and especially binary columns (e.g., images, files, etc.), should be placed in their own column groups.

When scanning a table, Spiral prunes fragments based on column groups. To do this efficiently, **at least one column from a column group must be included in the projection or filter when scanning the table**. If no column groups are referenced in your query, the scan would have to consider all column groups across the table, which defeats the purpose of having independent column groups and can significantly impact performance. Due to this design, **it is not currently possible to read or write just key columns without including at least one column from any column group**. If this limitation is causing problems for your use case, please reach out.
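Following these guidelines, a schema tree for a sensor-heavy workload might keep small filterable columns in the root group and isolate bulky binary data in its own groups. A sketch using the nested-dict convention from the Column Groups page (the field names are illustrative, and types are written as strings for brevity):

```python
# Illustrative schema tree (hypothetical fields; type names as strings for
# brevity). Small filter columns live in the root column group, while large
# binary payloads are isolated in their own column groups.
schema_tree = {
    # Root column group: small, frequently-filtered columns.
    "vehicle_id": "string",
    "speed": "float32",
    # Large binary payloads in their own column groups.
    "camera": {"frame": "bytes"},
    "lidar": {"points": "bytes"},
}

root_columns = [k for k, v in schema_tree.items() if not isinstance(v, dict)]
nested_groups = [k for k, v in schema_tree.items() if isinstance(v, dict)]
print(root_columns)   # ['vehicle_id', 'speed']
print(nested_groups)  # ['camera', 'lidar']
```

A query filtering on `vehicle_id` and `speed` then never touches the fragments holding camera frames or lidar points.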
## Interactive Mode

When in interactive mode, such as in a Jupyter notebook, it is recommended to enable disk caching to speed up subsequent scans of the same table. Disk caching can be enabled and configured when creating a Spiral client.

```python
from spiral import Spiral

sp = Spiral(overrides={"cache.enabled": "1"})
```

Or with more configuration options:

```python
from spiral import Spiral

sp = Spiral(overrides={
    "keys_cache.enabled": "1",
    "keys_cache.memory_capacity_bytes": "1073741824",  # 1 GiB
    "keys_cache.disk_capacity_bytes": "10737418240",  # 10 GiB
    "metadata_cache.enabled": "1",
    "metadata_cache.memory_capacity_bytes": "1073741824",  # 1 GiB
    "metadata_cache.disk_capacity_bytes": "10737418240",  # 10 GiB
})
```

---

# spiral

Section: PySpiral
Source: https://docs.spiraldb.com/python-api/spiral

## Spiral

```python
class Spiral()
```

Main client for interacting with the Spiral data platform.

Configuration is loaded with the following priority (highest to lowest):

1. Explicit parameters.
2. Environment variables (`SPIRAL__*`)
3. Config file (`~/.spiral.toml`)
4. Default values (production URLs)

**Examples**:

```python
import spiral

# Default configuration
sp = spiral.Spiral()

# With config overrides
sp = spiral.Spiral(overrides={"limits.concurrency": "16"})
```

**Arguments**:

- `config` - Custom ClientSettings object. Defaults to global settings.
- `overrides` - Configuration overrides using dot notation; see the [Client Configuration](https://docs.spiraldb.com/config) page for a full list.

### config

```python
@property
def config() -> ClientSettings
```

Returns the client's configuration.

### authn

```python
@property
def authn() -> Authn
```

Get the authentication handler for this client.

### list_projects

```python
def list_projects() -> list["Project"]
```

List projects.
### create\_project

```python
def create_project(*,
                   description: str | None = None,
                   id_prefix: str | None = None,
                   **kwargs) -> "Project"
```

Create a project in the current, or given, organization.

### project

```python
def project(project_id: str) -> "Project"
```

Open an existing project.

### scan

```python
def scan(*projections: ExprLike,
         where: ExprLike | None = None,
         asof: datetime | int | None = None,
         shard: Shard | None = None,
         limit: int | None = None,
         hide_progress_bar: bool = False) -> Scan
```

Starts a read transaction on the Spiral.

**Arguments**:

- `projections` - a set of expressions that return struct arrays.
- `where` - a query expression to apply to the data.
- `asof` - execute the scan on the version of the table as of the given timestamp. Each transaction creates a new version of the table. The commit timestamp of the transaction can be used in asof. See `spiral txn` for transaction commands in CLI.
- `shard` - if provided, opens the scan only for the given shard. While shards can be provided when executing the scan, providing a shard here optimizes the scan planning phase and can significantly reduce metadata download.
- `limit` - maximum number of rows to return. When set, the scan will stop reading data once the limit is reached, providing efficient early termination.
- `hide_progress_bar` - if True, disables the progress bar during scan building.

### scan\_keys

```python
def scan_keys(*projections: ExprLike,
              where: ExprLike | None = None,
              asof: datetime | int | None = None,
              shard: Shard | None = None,
              limit: int | None = None,
              hide_progress_bar: bool = False) -> Scan
```

Starts a keys-only read transaction on the Spiral.

To determine which keys are present in at least one column group of the table, key scan the table itself:

```python
sp.scan_keys(table)
```

**Arguments**:

- `projections` - scan the keys of the column groups referenced by these expressions.
- `where` - a query expression to apply to the data.
- `asof` - execute the scan on the version of the table as of the given timestamp. Each transaction creates a new version of the table. The commit timestamp of the transaction can be used in asof. See `spiral txn` for transaction commands in CLI.
- `shard` - if provided, opens the scan only for the given shard. While shards can be provided when executing the scan, providing a shard here optimizes the scan planning phase and can significantly reduce metadata download.
- `limit` - maximum number of rows to return. When set, the scan will stop reading data once the limit is reached, providing efficient early termination.
- `hide_progress_bar` - if True, disables the progress bar during scan building.

### sample

```python
def sample(*projections: ExprLike,
           sampler: Callable[[pa.Array], pa.Array],
           where: ExprLike | None = None,
           asof: datetime | int | None = None,
           hide_progress_bar: bool = False) -> SampleScan
```

Creates a SampleScan that can be inspected before execution.

NOTE: This API is experimental and will likely change in the near future. For most use cases, prefer using `sample()` directly.

This method is useful when you need to inspect the key\_scan and value\_scan plans before executing the sample operation. Call `to_record_batches()` on the returned SampleScan to execute and get a RecordBatchReader.

**Arguments**:

- `projections` - a set of expressions that return struct arrays.
- `sampler` - A function that takes a struct array of keys and returns a boolean array indicating which keys to sample.
- `where` - a query expression to apply to the data.
- `asof` - execute the scan on the version of the table as of the given timestamp. Each transaction creates a new version of the table. The commit timestamp of the transaction can be used in asof. See `spiral txn` for transaction commands in CLI.
- `hide_progress_bar` - if True, disables the progress bar during scan building.
**Returns**:

A SampleScan object with key\_plan(), value\_plan(), and to\_record\_batches() methods.

### search

```python
def search(top_k: int,
           *rank_by: ExprLike,
           filters: ExprLike | None = None,
           freshness_window: timedelta | None = None) -> pa.RecordBatchReader
```

Queries the index with the given rank-by and filters clauses. Returns a stream of scored keys.

**Arguments**:

- `top_k` - The number of top results to return.
- `rank_by` - Rank by expressions are combined for scoring. See `se.text.find` and `se.text.boost` for scoring expressions.
- `filters` - The `filters` expression is used to filter the results. It must return a boolean value and use only conjunctions (ANDs). Expressions in the filters statement are treated as either a `must` or a `must_not` clause, in search terminology.
- `freshness_window` - If provided, the index will not be refreshed as long as its age is within this window.

### resume\_scan

```python
def resume_scan(scan_bytes: bytes) -> Scan
```

Open a serialized scan in this instance of a client.

**Arguments**:

- `scan_bytes` - The compressed scan bytes returned by a previous scan's to\_bytes\_compressed().

### compute\_shards

```python
def compute_shards(*projections: ExprLike,
                   where: ExprLike | None = None,
                   asof: datetime | int | None = None,
                   batch_size: int | None = None) -> list[Shard]
```

Computes shards over the given projections and filter.

**Arguments**:

- `projections` - a set of expressions that return struct arrays.
- `where` - a query expression to apply to the data.
- `asof` - execute the scan on the version of the table as of the given timestamp. Each transaction creates a new version of the table. The commit timestamp of the transaction can be used in asof. See `spiral txn` for transaction commands in CLI.
- `batch_size` - a specific batch size, otherwise the shards will be computed based on the fragments in the table.
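A common reason to call compute\_shards is to split work across processes. The strided partitioning below is a sketch, not part of the API; the strings are placeholders standing in for the real Shard objects that compute\_shards would return.

```python
# Sketch: assign shards to workers by striding over the shard list.
# The strings below are placeholders for Shard objects, which would
# come from sp.compute_shards(...).
shards = [f"shard-{i}" for i in range(10)]
world_size = 4

def shards_for_rank(shards: list, rank: int, world_size: int) -> list:
    # Rank r takes every world_size-th shard starting at index r.
    return shards[rank::world_size]

partitions = [shards_for_rank(shards, r, world_size) for r in range(world_size)]

# Every shard is assigned to exactly one rank.
assert sorted(s for p in partitions for s in p) == sorted(shards)
```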
## Project

```python
class Project()
```

### table

```python
def table(identifier: str) -> Table
```

Open a table with a `dataset.table` identifier, or `table` name using the `default` dataset.

### create\_table

```python
def create_table(identifier: str,
                 *,
                 key_schema: Schema | pa.Schema
                 | Iterable[pa.Field[pa.DataType]]
                 | Iterable[tuple[str, pa.DataType]]
                 | Mapping[str, pa.DataType],
                 root_uri: Uri | None = None,
                 exist_ok: bool = False) -> Table
```

Create a new table in the project.

**Arguments**:

- `identifier` - The table identifier, in the form `dataset.table` or `table`.
- `key_schema` - The schema of the table's keys.
- `root_uri` - The root URI for the table.
- `exist_ok` - If True, do not raise an error if the table already exists.

### move\_table

```python
def move_table(identifier: str, new_dataset: str)
```

Move a table to a new dataset in the project.

**Arguments**:

- `identifier` - The table identifier, in the form `dataset.table` or `table`.
- `new_dataset` - The dataset into which to move this table.

### rename\_table

```python
def rename_table(identifier: str, new_table: str)
```

Rename a table within its dataset.

**Arguments**:

- `identifier` - The table identifier, in the form `dataset.table` or `table`.
- `new_table` - The new name for the table.

### drop\_table

```python
def drop_table(identifier: str)
```

Drop a table from the project.

**Arguments**:

- `identifier` - The table identifier, in the form `dataset.table` or `table`.

### text\_index

```python
def text_index(name: str) -> TextIndex
```

Returns the index with the given name.

### create\_text\_index

```python
def create_text_index(name: str,
                      *projections: ExprLike,
                      where: ExprLike | None = None,
                      root_uri: Uri | None = None,
                      exist_ok: bool = False) -> TextIndex
```

Creates a text index over the table projection.

**Arguments**:

- `name` - The index name. Must be unique within the project.
- `projections` - At least one projection expression is required.
  All projections must reference the same table.
- `where` - An optional filter expression to apply to the index.
- `root_uri` - The root URI for the index.
- `exist_ok` - If True, do not raise an error if the index already exists.

### compute

```python
def compute() -> Compute
```

Gets compute resources configured for this project.

NOTE: Compute is experimental and will likely change in the near future.

## Table

```python
class Table(Expr)
```

API for interacting with a SpiralDB Table.

A Spiral Table is a powerful and flexible way to store, analyze, and query massive and/or multimodal datasets. The data model will feel familiar to users of SQL- or DataFrame-style systems, yet is designed to be more flexible, more powerful, and more useful in the context of modern data processing. Tables are stored and queried directly from object storage.

### identifier

```python
@property
def identifier() -> str
```

Returns the fully qualified identifier of the table.

### project

```python
@property
def project() -> str | None
```

Returns the project of the table.

### dataset

```python
@property
def dataset() -> str | None
```

Returns the dataset of the table.

### name

```python
@property
def name() -> str | None
```

Returns the name of the table.

### key\_schema

```python
@property
def key_schema() -> Schema
```

Returns the key schema of the table.

### schema

```python
def schema() -> Schema
```

Returns the FULL schema of the table.

NOTE: This can be expensive for large tables.

### write

```python
def write(table: LazyTableLike, push_down_nulls: bool = False, **kwargs) -> None
```

Write an item to the table inside a single transaction.

**Arguments**:

- `push_down_nulls`: Whether to push nulls of nullable structs down into their children. E.g. `[{"a": 1}, null]` would become `[{"a": 1}, {"a": null}]`. SpiralDB doesn't allow struct-level nullability, so use this option if your data contains nullable structs.
- `table`: The table to write.
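The push\_down\_nulls transformation can be illustrated without a client. The helper below is a pure-Python sketch of the behavior described above, using plain dicts and `None` in place of Arrow struct arrays and nulls; the function name is made up for illustration.

```python
def push_down_nulls(rows: list, fields: list) -> list:
    # A null struct becomes a struct whose fields are all null, e.g.
    # [{"a": 1}, None] -> [{"a": 1}, {"a": None}]
    return [row if row is not None else {f: None for f in fields} for row in rows]

rows = [{"a": 1}, None]
assert push_down_nulls(rows, ["a"]) == [{"a": 1}, {"a": None}]
```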
### enrich

```python
def enrich(*projections: ExprLike, where: ExprLike | None = None) -> Enrichment
```

Returns an Enrichment object that, when applied, produces new columns. Enrichment can be applied in different ways, e.g., in a distributed fashion.

**Arguments**:

- `projections`: Projection expressions deriving new columns to write back. Expressions can be over multiple Spiral tables, but all tables including this one must share the same key schema.
- `where`: Optional filter expression to apply when reading the input tables.

### drop\_columns

```python
def drop_columns(column_paths: list[str]) -> None
```

Drops the specified columns from the table.

**Arguments**:

- `column_paths`: Fully qualified column names (e.g., "column\_name" or "nested.field"). All columns must exist; if a column doesn't exist, the function returns an error.

### snapshot

```python
def snapshot(asof: datetime | int | None = None) -> Snapshot
```

Returns a snapshot of the table at the given timestamp.

Each transaction creates a new version of the table. The commit timestamp of the transaction can be used in asof. See `spiral txn` for transaction commands in CLI.

### txn

```python
def txn(**kwargs) -> Transaction
```

Begins a new transaction. The transaction must be committed for writes to become visible.

While a transaction can be used to atomically write data to the table, it is important that the primary key columns are unique within the transaction. The behavior is undefined if this is not the case.

### to\_arrow\_dataset

```python
def to_arrow_dataset() -> "ds.Dataset"
```

Returns a PyArrow Dataset representing the table.

### to\_polars\_lazy\_frame

```python
def to_polars_lazy_frame() -> "pl.LazyFrame"
```

Returns a Polars LazyFrame for the Spiral table.

### to\_duckdb\_relation

```python
def to_duckdb_relation() -> "duckdb.DuckDBPyRelation"
```

Returns a DuckDB relation for the Spiral table.
### key

```python
def key(*parts) -> Key
```

Creates a Key object for the given parts according to the table's key schema.

**Arguments**:

- `parts` - Parts of the key. Must be a valid prefix of the table's key schema.

**Returns**:

Key object representing the given parts.

### set\_metadata

```python
def set_metadata(metadata: dict[str, bytes]) -> None
```

Set metadata on the table. Entries are upserted: existing keys are overwritten, keys not mentioned are left unchanged.

**Warnings**:

The metadata is not versioned. Parallel calls to set\_metadata (on the same machine or on different machines) race to set a value. Ordering is not guaranteed.

Each key + value pair can be at most 8 KiB, keys can be at most 255 characters, and a table can have at most 1024 entries.

**Examples**:

```python
table.set_metadata({"source": b"sensor", "version": b"2"})
```

If you need to store more complex data, use json to serialize:

```python
import json

table.set_metadata({"config": json.dumps({"f1": 123, "f2": "abc"}).encode()})
```

**Arguments**:

- `metadata` - A dict mapping string keys to bytes values.

### get\_metadata

```python
def get_metadata() -> dict[str, bytes]
```

Get metadata for the table.

**Returns**:

Dict mapping string keys to bytes values.

### drop\_metadata

```python
def drop_metadata(keys: list[str]) -> None
```

Remove the given keys from the table's metadata. Keys that do not exist are silently ignored.

**Arguments**:

- `keys` - List of metadata keys to remove.

### column\_group

```python
def column_group(*paths: str) -> ColumnGroup
```

Creates a ColumnGroup object for the given column paths.

**Arguments**:

- `paths` - Path to the column group. List of column names or dot-separated paths.

**Returns**:

ColumnGroup object representing the given columns.

## Snapshot

```python
class Snapshot()
```

Spiral table snapshot.

A snapshot represents a point-in-time view of a table.

### asof

```python
@property
def asof() -> Timestamp
```

Returns the asof timestamp of the snapshot.
### schema

```python
def schema() -> Schema
```

Returns the schema of the snapshot.

### table

```python
@property
def table() -> "Table"
```

Returns the table associated with the snapshot.

### to\_arrow\_dataset

```python
def to_arrow_dataset() -> "ds.Dataset"
```

Returns a PyArrow Dataset representing the table.

### to\_polars\_lazy\_frame

```python
def to_polars_lazy_frame() -> "pl.LazyFrame"
```

Returns a Polars LazyFrame for the Spiral table.

### to\_duckdb\_relation

```python
def to_duckdb_relation() -> "duckdb.DuckDBPyRelation"
```

Returns a DuckDB relation for the Spiral table.

### set\_column\_metadata

```python
def set_column_metadata(column_path: str, metadata: dict[str, bytes]) -> None
```

Set metadata on a column, identified by its dotted path. Entries are upserted: existing keys are overwritten, keys not mentioned are left unchanged.

**Arguments**:

- `column_path` - Dotted path to the column (e.g. "color" or "video.frames").
- `metadata` - A dict mapping string keys to bytes values.

### get\_column\_metadata

```python
def get_column_metadata(column_path: str) -> dict[str, bytes]
```

Get metadata for a column, identified by its dotted path.

**Arguments**:

- `column_path` - Dotted path to the column (e.g. "color" or "video.frames").

**Returns**:

Dict mapping string keys to bytes values.

### drop\_column\_metadata

```python
def drop_column_metadata(column_path: str, keys: list[str]) -> None
```

Remove the given keys from a column's metadata. Keys that do not exist are silently ignored.

**Arguments**:

- `column_path` - Dotted path to the column (e.g. "color" or "video.frames").
- `keys` - List of metadata keys to remove.

## Scan

```python
class Scan()
```

Scan object.

### limit

```python
@property
def limit() -> int | None
```

Returns the limit set on this scan, if any.

### schema

```python
@property
def schema() -> Schema
```

Returns the schema of the scan.
### key\_schema

```python
@property
def key_schema() -> Schema
```

Returns the key schema of the scan.

### table\_ids

```python
@property
def table_ids() -> list[str]
```

Returns the set of table IDs referenced by this scan.

### plan

```python
def plan() -> "Plan"
```

Builds the executable plan for this scan.

### to\_bytes\_compressed

```python
def to_bytes_compressed() -> bytes
```

Get the scan state as compressed bytes.

This state can be used to resume the scan later using Spiral.resume\_scan().

**Returns**:

Compressed bytes representing the internal scan state.

### \_\_getstate\_\_

```python
def __getstate__() -> bytes
```

Serialize scan for pickling.

Enables seamless integration with distributed systems like Ray, Dask, and Python's multiprocessing without requiring manual serialization.

**Returns**:

Zstd-compressed bytes containing JSON-serialized config and scan state.

### \_\_setstate\_\_

```python
def __setstate__(state: bytes) -> None
```

Deserialize scan from pickled state.

**Arguments**:

- `state` - Zstd-compressed bytes from `__getstate__`.

## Plan

```python
class Plan()
```

Executable plan object.

### limit

```python
@property
def limit() -> int | None
```

Returns the limit set on this scan, if any.

### metrics

```python
@property
def metrics() -> dict[str, Any]
```

Returns metrics about the plan.

### schema

```python
@property
def schema() -> Schema
```

Returns the schema of the plan.

### key\_schema

```python
@property
def key_schema() -> Schema
```

Returns the key schema of the plan.

### is\_empty

```python
def is_empty() -> bool
```

Check if the Spiral is empty for the given key range.

False negatives are possible, but false positives are not, i.e. is\_empty can return False even though a scan returns zero rows.
### to\_record\_batches

```python
def to_record_batches(
        *,
        shards: Iterable[Shard] | None = None,
        key_table: TableLike | None = None,
        batch_readahead: int | None = None,
        batch_aligned: bool | None = None,
        grouping_prefix: list[str] | None = None,
        explode: str | None = None,
        hide_progress_bar: bool | None = None) -> pa.RecordBatchReader
```

Read as a stream of RecordBatches.

**Arguments**:

- `shards` - Optional iterable of shards to evaluate. If provided, only the specified shards will be read. Must not be provided together with key\_table.
- `key_table` - a table of keys to "take" (including aux columns for cell-push-down). Key table must be either a table or a stream of table-like objects (e.g. Arrow's RecordBatchReader). For optimal performance, each batch should contain sorted and unique keys. Unsorted and duplicate keys are still supported, but performance is less predictable.
- `batch_readahead` - the number of batches to prefetch in the background.
- `batch_aligned` - if True, ensures that batches are aligned with key\_table batches. The stream will yield batches that correspond exactly to the batches in key\_table, but may be less efficient and use more memory (aligning batches requires buffering and maybe a copy). Must only be used when key\_table is provided.
- `grouping_prefix` - list of key column names to group by. Must be a prefix of the key schema. Non-group columns are collected into List arrays.
- `explode` - name of a `List` column to unnest. The struct fields become top-level columns and other columns are repeated to match. Typically used together with grouping\_prefix.
- `hide_progress_bar` - DEPRECATED; the progress bar can be hidden when opening a scan.

### to\_unordered\_record\_batches

```python
def to_unordered_record_batches(
        *,
        shards: Iterable[Shard] | None = None,
        key_table: TableLike | None = None,
        batch_readahead: int | None = None,
        hide_progress_bar: bool | None = None) -> pa.RecordBatchReader
```

Read as a stream of RecordBatches, NOT ordered by key.

**Arguments**:

- `shards` - Optional iterable of shards to evaluate. If provided, only the specified shards will be read. Must not be provided together with key\_table.
- `key_table` - a table of keys to "take" (including aux columns for cell-push-down). Key table must be either a table or a stream of table-like objects (e.g. Arrow's RecordBatchReader). For optimal performance, each batch should contain sorted and unique keys. Unsorted and duplicate keys are still supported, but performance is less predictable.
- `batch_readahead` - the number of batches to prefetch in the background.
- `hide_progress_bar` - DEPRECATED; the progress bar can be hidden when opening a scan.

### to\_table

```python
def to_table(*,
             shards: Iterable[Shard] | None = None,
             key_table: TableLike | None = None,
             batch_readahead: int | None = None,
             hide_progress_bar: bool | None = None) -> pa.Table
```

Read into a single PyArrow Table.

**Warnings**:

This downloads the entire Spiral Table into memory on this machine.

**Arguments**:

- `shards` - Optional iterable of shards to evaluate. If provided, only the specified shards will be read. Must not be provided together with key\_table.
- `key_table` - a table of keys to "take" (including aux columns for cell-push-down).
- `batch_readahead` - the number of batches to prefetch in the background.
- `hide_progress_bar` - If True, disables the progress bar during reading.
**Returns**:

pyarrow.Table

### to\_dask

```python
def to_dask(*,
            shards: Iterable[Shard] | None = None,
            batch_readahead: int | None = None,
            hide_progress_bar: bool | None = None) -> "dd.DataFrame"
```

Read into a Dask DataFrame. Requires the `dask` package to be installed.

Dask execution has some limitations, e.g. UDFs are not currently supported. These limitations usually manifest as serialization errors when Dask workers attempt to serialize the state. If you are encountering such issues, please reach out to support for assistance.

**Arguments**:

- `shards` - Optional iterable of shards to evaluate. If provided, only the specified shards will be read.
- `batch_readahead` - the number of batches to prefetch in the background. Applies to each shard read task.
- `hide_progress_bar` - If True, disables the progress bar during reading.

**Returns**:

dask.dataframe.DataFrame

### to\_ray

```python
def to_ray(*,
           shards: Iterable[Shard] | None = None,
           batch_readahead: int | None = None,
           hide_progress_bar: bool | None = None) -> "ray.data.Dataset"
```

Read into a Ray Dataset. Requires the `ray` package to be installed.

**Warnings**:

The output row order is not guaranteed. Shards are read concurrently and Ray does not guarantee inter-block ordering. Sort the resulting dataset if order matters.

If the Scan returns zero rows, the resulting Ray Dataset will have [an empty schema](https://github.com/ray-project/ray/issues/59946).

**Arguments**:

- `shards` - Optional iterable of shards to evaluate. If provided, only the specified shards will be read.
- `batch_readahead` - the number of batches to prefetch in the background.
- `hide_progress_bar` - If True, disables the progress bar during reading.

**Returns**:

- `ray.data.Dataset` - A Ray Dataset distributed across shards.
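The `grouping_prefix` and `explode` options of to\_record\_batches above can be modeled in plain Python. This sketch uses dicts and lists in place of Arrow struct and List arrays; the helper names are made up, and rows are assumed pre-sorted by key, matching Spiral's sorted storage.

```python
from itertools import groupby

# Rows keyed by (run, t); grouping by the key prefix ["run"] collects
# the remaining columns into lists, mimicking grouping_prefix.
rows = [
    {"run": "a", "t": 0, "v": 1.0},
    {"run": "a", "t": 1, "v": 2.0},
    {"run": "b", "t": 0, "v": 3.0},
]

def group_by_prefix(rows: list, prefix: list) -> list:
    # groupby relies on rows being sorted by the prefix columns.
    out = []
    for key, grp in groupby(rows, key=lambda r: tuple(r[k] for k in prefix)):
        grp = list(grp)
        rest = {k: [r[k] for r in grp] for k in grp[0] if k not in prefix}
        out.append({**dict(zip(prefix, key)), **rest})
    return out

def explode_structs(rows: list, col: str) -> list:
    # Unnest a list-of-struct column: struct fields become top-level
    # columns and the other columns are repeated to match (like explode).
    return [
        {**{k: v for k, v in r.items() if k != col}, **s}
        for r in rows
        for s in r[col]
    ]

grouped = group_by_prefix(rows, ["run"])
# -> [{"run": "a", "t": [0, 1], "v": [1.0, 2.0]},
#     {"run": "b", "t": [0], "v": [3.0]}]
```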
### to\_pandas

```python
def to_pandas(*,
              shards: Iterable[Shard] | None = None,
              key_table: TableLike | None = None,
              batch_readahead: int | None = None,
              hide_progress_bar: bool | None = None) -> "pd.DataFrame"
```

Read into a Pandas DataFrame. Requires the `pandas` package to be installed.

**Warnings**:

This downloads the entire Spiral Table into memory on this machine.

**Arguments**:

- `shards` - Optional iterable of shards to evaluate. If provided, only the specified shards will be read. Must not be provided together with key\_table.
- `key_table` - a table of keys to "take" (including aux columns for cell-push-down).
- `batch_readahead` - the number of batches to prefetch in the background.
- `hide_progress_bar` - If True, disables the progress bar during reading.

**Returns**:

pandas.DataFrame

### to\_polars

```python
def to_polars(*,
              shards: Iterable[Shard] | None = None,
              key_table: TableLike | None = None,
              batch_readahead: int | None = None,
              hide_progress_bar: bool | None = None) -> "pl.DataFrame"
```

Read into a Polars DataFrame. Requires the `polars` package to be installed.

**Warnings**:

This downloads the entire Spiral Table into memory on this machine. To lazily interact with a Spiral Table try Table.to\_polars\_lazy\_frame.

**Arguments**:

- `shards` - Optional iterable of shards to evaluate. If provided, only the specified shards will be read. Must not be provided together with key\_table.
- `key_table` - a table of keys to "take" (including aux columns for cell-push-down).
- `batch_readahead` - the number of batches to prefetch in the background.
- `hide_progress_bar` - If True, disables the progress bar during reading.

**Returns**:

polars.DataFrame

### to\_data\_loader

```python
def to_data_loader(seed: int = 42,
                   shuffle_buffer_size: int = 0,
                   batch_size: int = 32,
                   **kwargs) -> "SpiralDataLoader"
```

Read into a Torch-compatible DataLoader for single-node training.

**Arguments**:

- `seed` - Random seed for reproducibility.
- `shuffle_buffer_size` - Size of shuffle buffer. Zero means no shuffling.
- `batch_size` - Batch size.
- `**kwargs` - Additional arguments passed to SpiralDataLoader constructor.

**Returns**:

SpiralDataLoader with shuffled shards.

### to\_distributed\_data\_loader

```python
def to_distributed_data_loader(world: "World | None" = None,
                               shards: Iterable[Shard] | None = None,
                               seed: int = 42,
                               shuffle_buffer_size: int = 0,
                               batch_size: int = 32,
                               **kwargs) -> "SpiralDataLoader"
```

Read into a Torch-compatible DataLoader for distributed training.

**Arguments**:

- `world` - World configuration with rank and world\_size. If None, auto-detects from torch.distributed.
- `shards` - Optional sharding. Sharding is global, i.e. the world will be used to select the shards for this rank. If None, uses scan's natural sharding.
- `seed` - Random seed for reproducibility.
- `shuffle_buffer_size` - Size of shuffle buffer. Zero means no shuffling.
- `batch_size` - Batch size.
- `**kwargs` - Additional arguments passed to SpiralDataLoader constructor.

**Returns**:

SpiralDataLoader with shards partitioned for this rank.

Auto-detect from PyTorch distributed:

```python
import spiral
from spiral.dataloader import SpiralDataLoader, World
from spiral.demo import fineweb

sp = spiral.Spiral()
fineweb_table = fineweb(sp)
plan = sp.scan(fineweb_table[["text"]]).plan()
loader: SpiralDataLoader = plan.to_distributed_data_loader(batch_size=32)
```

Explicit world configuration:

```python
world = World(rank=0, world_size=4)
loader: SpiralDataLoader = plan.to_distributed_data_loader(world=world, batch_size=32)
```

### resume\_data\_loader

```python
def resume_data_loader(state: dict[str, Any], **kwargs) -> "SpiralDataLoader"
```

Create a DataLoader from checkpoint state, resuming from where it left off.

This is the recommended way to resume training from a checkpoint.
It extracts the seed, samples\_yielded, and shards from the state dict and creates a new DataLoader that will skip the already-processed samples.

**Arguments**:

- `state` - Checkpoint state from state\_dict().
- `**kwargs` - Additional arguments to pass to SpiralDataLoader constructor. These will override values in the state dict where applicable.

**Returns**:

New SpiralDataLoader instance configured to resume from the checkpoint.

Save checkpoint during training:

```python
import spiral
from spiral.dataloader import SpiralDataLoader, World
from spiral.demo import images, fineweb

sp = spiral.Spiral()
table = images(sp)
fineweb_table = fineweb(sp)
plan = sp.scan(fineweb_table[["text"]]).plan()
loader = plan.to_distributed_data_loader(batch_size=32, seed=42)
checkpoint = loader.state_dict()
```

Resume later - uses same shards from checkpoint:

```python
resumed_loader = plan.resume_data_loader(
    checkpoint,
    batch_size=32,
    # An optional transform_fn may be provided:
    # transform_fn=my_transform,
)
```

### to\_iterable\_dataset

```python
def to_iterable_dataset(shards: Iterable[Shard] | None = None,
                        seed: int = 42,
                        shuffle_buffer_size: int = 0,
                        batch_readahead: int | None = None,
                        infinite: bool = False) -> "hf.IterableDataset"
```

Returns a Hugging Face IterableDataset. Requires the `datasets` package to be installed.

Note: For new code, consider using SpiralDataLoader instead.

**Arguments**:

- `shards` - Optional iterable of shards to read. If None, uses scan's natural sharding.
- `seed` - Base random seed for deterministic shuffling and checkpointing.
- `shuffle_buffer_size` - Size of shuffle buffer for within-shard shuffling. 0 means no shuffling.
- `batch_readahead` - Controls how many batches to read ahead concurrently. If pipeline includes work after reading (e.g. decoding, transforming, ...) this can be set higher. Otherwise, it should be kept low to reduce next batch latency.
  Defaults to min(number of CPU cores, 64), or to shuffle\_buffer\_size/16 when shuffling is enabled.
- `infinite` - If True, the returned IterableDataset will loop infinitely over the data, re-shuffling ranges after exhausting all data.

### shards

```python
def shards() -> list[Shard]
```

Get list of shards for this plan.

The shards are based on the scan's physical data layout (file fragments). Each shard contains a key range and cardinality (set to None when unknown).

**Returns**:

List of Shard objects with key range and cardinality (if known).

## Transaction

```python
@final
class Transaction()
```

Spiral table transaction.

While a transaction can be used to atomically write data to the table, it is important that the primary key columns are unique within the transaction.

### status

```python
@property
def status() -> str
```

The status of the transaction.

### table

```python
@property
def table() -> Table
```

The table associated with this transaction.

### is\_empty

```python
def is_empty() -> bool
```

Check if the transaction has no operations.

### write

```python
def write(table: LazyTableLike, push_down_nulls: bool = False)
```

Write an item to the table inside a single transaction.

**Arguments**:

- `push_down_nulls`: Whether to push nulls of nullable structs down into their children. E.g. `[{"a": 1}, null]` would become `[{"a": 1}, {"a": null}]`. SpiralDB doesn't allow struct-level nullability, so use this option if your data contains nullable structs.
- `table`: The table to write.

### writeback

```python
def writeback(plan: Plan, *, shards: Iterable[Shard] | None = None)
```

Write back the results of a plan to the table.

**Arguments**:

- `plan`: The plan to write back. The plan does NOT need to be over the same table as transaction, but it does need to have the same key schema.
- `shards`: The shards to read from. If not provided, all shards are read.
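Because behavior is undefined when primary keys repeat within a transaction, it can be worth checking for duplicates before writing. The helper below is a sketch with plain dicts standing in for table rows; the function name is made up for illustration.

```python
from collections import Counter

def duplicate_keys(rows: list, key_columns: list) -> list:
    # Count occurrences of each composite key and report any repeats,
    # which would make a transactional write's behavior undefined.
    counts = Counter(tuple(r[k] for k in key_columns) for r in rows)
    return [key for key, n in counts.items() if n > 1]

rows = [{"id": 1, "v": "x"}, {"id": 2, "v": "y"}, {"id": 1, "v": "z"}]
assert duplicate_keys(rows, ["id"]) == [(1,)]  # id=1 appears twice
```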
### to\_ray\_datasink

```python
def to_ray_datasink() -> ray.data.Datasink[tuple[Timestamp, list[str]]]
```

Returns a Ray Datasink which writes into this transaction.

### drop\_columns

```python
def drop_columns(column_paths: list[str])
```

Drops the specified columns from the table.

**Arguments**:

- `column_paths`: Fully qualified column names (e.g., "column\_name" or "nested.field"). All columns must exist; if a column doesn't exist, the function returns an error.

### compact\_key\_space

```python
def compact_key_space()
```

Compact the key space of the table.

### compact\_column\_group

```python
def compact_column_group(column_group: ColumnGroup,
                         strategy: CompactionStrategy,
                         *,
                         key_range: KeyRange | Shard | None = None)
```

Compact the specified column group of the table.

This convenience method combines planning, execution, and progress submission when there is no distributed context, i.e. when compaction is run on a single node. See [https://docs.spiraldb.com/config](https://docs.spiraldb.com/config) for configuration options.

**Arguments**:

- `column_group`: The column group to compact. Can be obtained from Table.column\_groups.
- `strategy`: The compaction strategy to use.
- `key_range`: Optional key range or shard to limit the range of compaction. Requires that no fragment overlaps the given key range, but is not covered by it.

### column\_group\_compaction

```python
def column_group_compaction(
        column_group: ColumnGroup,
        strategy: CompactionStrategy,
        *,
        key_range: KeyRange | Shard | None = None) -> Compaction
```

Plan a compaction for the specified column group. Called by the "driver".

**Arguments**:

- `column_group`: The column group to compact. Can be obtained from Table.column\_groups.
- `strategy`: The compaction strategy to use.
- `key_range`: Optional key range or shard to limit the range of compaction. Requires that no fragment overlaps the given key range, but is not covered by it.
**Returns**: A Compaction object representing the planned compaction.

### column\_group\_compaction\_execute\_tasks

```python
def column_group_compaction_execute_tasks(tasks: CompactionTasks) -> None
```

Execute the given compaction tasks inside the transaction. Called by a "worker" inside of a worker transaction. Operations can be collected by calling `take()` after this method.

**Arguments**:

- `tasks`: The compaction tasks to execute.

### take

```python
def take() -> TransactionOps
```

Take the operations from the transaction. The transaction can no longer be committed or aborted after calling this method.

### include

```python
def include(ops: TransactionOps)
```

Include the given operations in the transaction. Checks for conflicts between the included operations and any existing operations. IMPORTANT: The `self` transaction must be started at or before the timestamp of the included operations.

### commit

```python
def commit(*, txn_dump: str | None = None)
```

Commit the transaction.

### load\_dumps

```python
@staticmethod
def load_dumps(client: Spiral, *txn_dump: str) -> list[Operation]
```

Load transactions from one or more dump files.

### abort

```python
def abort()
```

Abort the transaction.

## Compaction

```python
class Compaction()
```

Compaction is used to optimize the physical layout of data in a table's column group.

### column\_group

```python
@property
def column_group() -> ColumnGroup
```

The column group being compacted.

### run

```python
def run() -> None
```

Run the compaction to completion by repeatedly getting tasks and executing them.

### run\_dask

```python
def run_dask(client: dask.distributed.Client) -> None
```

Run the compaction using distributed Dask workers. Tasks are partitioned and distributed across Dask workers for parallel execution.

**Arguments**:

- `client` - Dask distributed client. Required.
Create with: `from dask.distributed import Client; client = Client()`

### run\_ray

```python
def run_ray() -> None
```

Run the compaction using distributed Ray workers. Ray must be initialized before calling this method. To initialize Ray, run `ray.init()` for a local cluster or `ray.init(address="ray://<head-node-host>:<port>")` to connect to an existing cluster. Tasks are partitioned and distributed across Ray workers for parallel execution.

## Enrichment

```python
class Enrichment()
```

An enrichment is used to derive new columns from existing ones, such as fetching data from object storage with `se.s3.get` or computing embeddings. With a column-group design that supports hundreds of thousands of columns, horizontally expanding tables become a powerful primitive. Spiral optimizes enrichments where the source and destination tables are the same.

### table

```python
@property
def table() -> Table
```

The table to write back into.

### projection

```python
@property
def projection() -> Expr
```

The projection expression.

### where

```python
@property
def where() -> Expr | None
```

The filter expression.

### run

```python
def run(*, shards: Iterable[Shard] | None = None) -> None
```

Apply the enrichment onto the table in a streaming fashion. For large tables, consider using `run_dask()` or `run_ray()` for distributed execution.

**Arguments**:

- `shards` - Optional iterable of shards to process. If not provided, processes all data.

### run\_dask

```python
def run_dask(client: dask.distributed.Client,
             *,
             shards: Iterable[Shard] | None = None,
             checkpoint: str | None = None) -> None
```

Use distributed Dask to apply the enrichment. Requires `dask[distributed]` to be installed.

How shards are determined:
- If `shards` is provided, those will be used directly.
- Else, if `checkpoint` is provided, shards will be loaded from the checkpoint file.
- Else, the scan's default sharding will be used.

**Arguments**:

- `client` - Dask distributed client. Required. Create with: `from dask.distributed import Client; client = Client()`
- `shards` - Optional iterable of shards to process. If not provided, uses default sharding, or checkpoint sharding if available.
- `checkpoint` - Optional path to checkpoint file for incremental progress. If the file exists, processing will resume from failed shards.
Failed shards are written back to this file.

### run\_ray

```python
def run_ray(*,
            shards: Iterable[Shard] | None = None,
            checkpoint: str | None = None) -> None
```

Use distributed Ray to apply the enrichment. Requires `ray` to be installed. Ray must be initialized before calling this method. To initialize Ray, run `ray.init()` for a local cluster or `ray.init(address="ray://<head-node-host>:<port>")` to connect to an existing cluster.

Ray execution has some limitations. These limitations usually manifest as serialization errors when Ray workers attempt to serialize the state. If you encounter such issues, consider splitting the enrichment into a UDF-only derivation executed in a streaming fashion, followed by a Ray enrichment for the rest of the computation. If that is not possible, please reach out to support for assistance.

How shards are determined:
- If `shards` is provided, those will be used directly.
- Else, if `checkpoint` is provided, shards will be loaded from the checkpoint file.
- Else, the scan's default sharding will be used.

**Arguments**:

- `shards` - Optional iterable of shards to process. If not provided, uses default sharding, or checkpoint sharding if available.
- `checkpoint` - Optional path to checkpoint file for incremental progress. If the file exists, processing will resume from failed shards. Failed shards are written back to this file.

## World

```python
@dataclass(frozen=True)
class World()
```

Distributed training configuration.

**Attributes**:

- `rank` - Process rank (0 to world\_size-1).
- `world_size` - Total number of processes.

### shards

```python
def shards(shards: list[Shard], shuffle_seed: int | None = None) -> list[Shard]
```

Partition shards for distributed training.

**Arguments**:

- `shards` - List of Shard objects to partition.
- `shuffle_seed` - Optional seed to shuffle before partitioning.

**Returns**: Subset of shards for this rank (round-robin partitioning).

### from\_torch

```python
@classmethod
def from_torch(cls) -> World
```

Auto-detect world configuration from PyTorch distributed.

## SpiralDataLoader

```python
class SpiralDataLoader()
```

DataLoader optimized for Spiral's multi-threaded streaming architecture.
Unlike PyTorch's DataLoader, which uses multiprocessing for I/O (num\_workers), SpiralDataLoader leverages Spiral's efficient Rust-based streaming and only uses multiprocessing for CPU-bound post-processing transforms.

Key differences from PyTorch DataLoader:
- No num\_workers for I/O (Spiral's Rust layer is already multi-threaded)
- map\_workers for parallel post-processing (tokenization, decoding, etc.)
- Built-in checkpoint support via skip\_samples
- Explicit shard-based architecture for distributed training

Simple usage:

```python
def train_step(batch):
    pass

loader = SpiralDataLoader(plan, batch_size=32)
for batch in loader:
    train_step(batch)
```

With parallel transforms:

```python
def tokenize_batch(batch):
    # ...
    return batch

loader = SpiralDataLoader(
    plan,
    batch_size=32,
    transform_fn=tokenize_batch,
    map_workers=4,
)
```

### \_\_init\_\_

```python
def __init__(plan: Plan,
             *,
             shards: list[Shard] | None = None,
             shuffle_shards: bool = True,
             seed: int = 42,
             skip_samples: int = 0,
             shuffle_buffer_size: int = 0,
             batch_size: int = 32,
             batch_readahead: int | None = None,
             transform_fn: Callable[[pa.RecordBatch], Any] | None = None,
             map_workers: int = 0,
             infinite: bool = False)
```

Initialize SpiralDataLoader.

**Arguments**:

- `plan` - Spiral plan to load data from.
- `shards` - Optional list of Shard objects to read. If None, uses the plan's natural sharding based on physical layout.
- `shuffle_shards` - Whether to shuffle the list of shards. Uses the provided seed.
- `seed` - Base random seed for deterministic shuffling and checkpointing.
- `skip_samples` - Number of samples to skip at the beginning (for resuming from checkpoint).
- `shuffle_buffer_size` - Size of shuffle buffer for within-shard shuffling. 0 means no shuffling.
- `batch_size` - Number of rows per batch.
- `batch_readahead` - Number of batches to prefetch in the background. If None, uses a sensible default based on whether transforms are applied.
- `transform_fn` - Optional function to transform each batch.
Takes a PyArrow RecordBatch and returns any type. Users can call batch.to\_pydict() inside the function if they need a dict. If map\_workers > 0, this function must be picklable.
- `map_workers` - Number of worker processes for parallel transform\_fn application. 0 means single-process (no parallelism). Use this for CPU-bound transforms like tokenization or audio decoding.
- `infinite` - Whether to cycle through the dataset infinitely. If True, the dataloader will repeat the dataset indefinitely. If False, the dataloader will stop after going through the dataset once.

### \_\_iter\_\_

```python
def __iter__() -> Iterator[Any]
```

Iterate over batches.

### state\_dict

```python
def state_dict() -> dict[str, Any]
```

Get checkpoint state for resuming.

**Returns**: Dictionary containing samples\_yielded, seed, and shards.

Example checkpoint:

```python
loader = SpiralDataLoader(plan, batch_size=32, seed=42)
for i, batch in enumerate(loader):
    if i == 10:
        checkpoint = loader.state_dict()
        break
```

Example resume:

```python
loader = SpiralDataLoader.from_state_dict(plan, checkpoint, batch_size=32)
```

### from\_state\_dict

```python
@classmethod
def from_state_dict(cls, plan: Plan, state: dict[str, Any],
                    **kwargs) -> SpiralDataLoader
```

Create a DataLoader from checkpoint state, resuming from where it left off.

This is the recommended way to resume training from a checkpoint. It extracts the seed, samples\_yielded, and shards from the state dict and creates a new DataLoader that will skip the already-processed samples.

**Arguments**:

- `plan` - Spiral plan to load data from.
- `state` - Checkpoint state from state\_dict().
- `**kwargs` - Additional arguments to pass to SpiralDataLoader constructor. These will override values in the state dict where applicable.

**Returns**: New SpiralDataLoader instance configured to resume from the checkpoint.
Save checkpoint during training:

```python
loader = plan.to_distributed_data_loader(batch_size=32, seed=42)
checkpoint = loader.state_dict()
```

Resume later using the same shards from checkpoint:

```python
resumed_loader = SpiralDataLoader.from_state_dict(
    plan,
    checkpoint,
    batch_size=32,
    # An optional transform_fn may be provided:
    # transform_fn=my_transform,
)
```

---

# spiral.expressions

Section: PySpiral
Source: https://docs.spiraldb.com/python-api/expressions

#### aux

```python
def aux(name: builtins.str, dtype: pa.DataType) -> Expr
```

Create a variable expression referencing a column in the auxiliary table. The auxiliary table is optionally given to the `Scan#to_record_batches` function when reading only specific keys or doing cell pushdown.

**Arguments**:

- `name` - variable name
- `dtype` - must match the dtype of the column in the auxiliary table.

#### scalar

```python
def scalar(value: Any) -> Expr
```

Create a scalar expression.

#### cast

```python
def cast(expr: ExprLike, dtype: pa.DataType) -> Expr
```

Cast an expression into another PyArrow DataType.

#### and\_

```python
def and_(expr: ExprLike, *exprs: ExprLike) -> Expr
```

Create a conjunction of one or more expressions.

#### or\_

```python
def or_(expr: ExprLike, *exprs: ExprLike) -> Expr
```

Create a disjunction of one or more expressions.

#### eq

```python
def eq(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create an equality comparison.

#### neq

```python
def neq(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create a not-equal comparison.

#### xor

```python
def xor(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create an XOR comparison.

#### lt

```python
def lt(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create a less-than comparison.

#### lte

```python
def lte(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create a less-than-or-equal comparison.

#### gt

```python
def gt(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create a greater-than comparison.
#### gte

```python
def gte(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create a greater-than-or-equal comparison.

#### negate

```python
def negate(expr: ExprLike) -> Expr
```

Arithmetically negate the given expression.

#### not\_

```python
def not_(expr: ExprLike) -> Expr
```

Logically negate (NOT) the given expression.

#### is\_null

```python
def is_null(expr: ExprLike) -> Expr
```

Check if the given expression is null.

#### is\_not\_null

```python
def is_not_null(expr: ExprLike) -> Expr
```

Check if the given expression is not null.

#### add

```python
def add(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Add two expressions.

#### subtract

```python
def subtract(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Subtract two expressions.

#### multiply

```python
def multiply(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Multiply two expressions.

#### divide

```python
def divide(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Divide two expressions.

#### modulo

```python
def modulo(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Compute the modulo of two expressions.

#### getitem

```python
def getitem(expr: ExprLike, field: str) -> Expr
```

Get a field from a struct.

**Arguments**:

- `expr` - The struct expression to get the field from.
- `field` - The field to get. A dot-separated string is supported to access nested fields.

#### pack

```python
def pack(fields: dict[str, ExprLike], *, nullable: bool = False) -> Expr
```

Assemble a new struct from the given named fields.

**Arguments**:

- `fields` - A dictionary of field names to expressions. The field names will be used as the struct field names.
- `nullable` - Whether the resulting struct type is nullable.

#### merge

```python
def merge(*structs: "ExprLike") -> Expr
```

Merge fields from the given structs into a single struct.

**Arguments**:

- `*structs` - Each expression must evaluate to a struct.

**Returns**: A single struct containing all the fields from the input structs. If a field is present in multiple structs, the value from the last struct is used.
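The `merge` precedence rule (last struct wins) and `getitem`'s dot-separated paths behave like this plain-Python sketch (illustrative only — the real functions build Spiral expressions over Arrow structs, and these helper names are hypothetical):

```python
def merge_structs(*structs: dict) -> dict:
    # Later structs take precedence on field-name collisions,
    # matching merge()'s "value from the last struct" rule.
    out: dict = {}
    for s in structs:
        out.update(s)
    return out

def get_path(struct: dict, path: str):
    # Dot-separated paths descend into nested structs, like getitem().
    value = struct
    for field in path.split("."):
        value = value[field]
    return value
```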
#### select

```python
def select(expr: ExprLike, names: list[str] = None, exclude: list[str] = None) -> Expr
```

Select fields from a struct.

**Arguments**:

- `expr` - The struct-like expression to select fields from.
- `names` - Field names to select. If a path contains a dot, it is assumed to be a nested struct field.
- `exclude` - List of field names to exclude from the result. Exactly one of `names` or `exclude` must be provided.

## Expr

```python
class Expr()
```

Base class for Spiral expressions. All expressions support comparison and basic arithmetic operations.

#### \_\_getitem\_\_

```python
def __getitem__(item: str | int | list[str]) -> "Expr"
```

Get an item from a struct or list.

**Arguments**:

- `item` - The key or index to get. If item is a string, it is assumed to be a field in a struct. A dot-separated string is supported to access nested fields. If item is a list of strings, it is assumed to be a list of fields in a struct. If item is an integer, it is assumed to be an index in a list.

#### cast

```python
def cast(dtype: pa.DataType) -> "Expr"
```

Cast the expression result to a different data type.

#### select

```python
def select(*paths: str, exclude: list[str] = None) -> "Expr"
```

Select fields from a struct-like expression.

**Arguments**:

- `*paths` - Field names to select. If a path contains a dot, it is assumed to be a nested struct field.
- `exclude` - List of field names to exclude from the result.

## UDF

```python
@dataclass(frozen=True)
class UDF(abc.ABC)
```

A User-Defined Function (UDF). This class should be subclassed to define custom UDFs.

**Warning**

You must define `__eq__` and `__hash__` correctly for your sub-class! A simple way to do this is to annotate your sub-class with `@dataclass(frozen=True)` and list its type-annotated member variables:

```python
from dataclasses import dataclass
from spiral import expressions as se

@dataclass(frozen=True)
class FMA(se.UDF):
    b: int
    c: int

    def name(self):
        ...

    def return_type(self, scope):
        ...

    def invoke(self, scope):
        ...
```

**Example**

```python
import spiral
from spiral.demo import fineweb

sp = spiral.Spiral()
fineweb_table = fineweb(sp)

from spiral import expressions as se
import pyarrow as pa

class MyAdd(se.UDF):
    def name(self):
        return "my_add"

    def return_type(self, scope: pa.DataType):
        if not isinstance(scope, pa.StructType):
            raise ValueError("Expected struct type as input")
        return scope.field(0).type

    def invoke(self, scope: pa.Array):
        if not isinstance(scope, pa.StructArray):
            raise ValueError("Expected struct array as input")
        return pa.compute.add(scope.field(0), scope.field(1))

my_add = MyAdd()
expr = my_add(fineweb_table.select("first_arg", "second_arg"))
```

#### \_\_call\_\_

```python
def __call__(scope: ExprLike) -> Expr
```

Create an expression that calls this UDF with the given arguments.

#### name

```python
@abc.abstractmethod
def name() -> str
```

A name used when displaying this UDF.

**Example**

This UDF:

```python
@dataclass(frozen=True)
class Foo(se.UDF):
    bar: int
    baz: str

    def name(self):
        return "hello_world"

    def return_type(self, s):
        ...

    def invoke(self, s):
        ...

Foo(3, "hello")(1)
```

renders as follows: `udf_hello_world[Foo(bar=3, baz="hello")](...)`

You can change what is displayed between the square brackets by defining `__str__`:

```python
@dataclass(frozen=True)
class Foo(se.UDF):
    bar: int
    baz: str

    def __str__(self):
        return f'{self.bar}, {self.baz}'

    def name(self):
        return "hello_world"

    def return_type(self, s):
        ...

    def invoke(self, s):
        ...

Foo(3, "hello")(1)
```

renders as follows: `udf_hello_world[3, "hello"](...)`

#### return\_type

```python
@abc.abstractmethod
def return_type(scope: pa.DataType) -> pa.DataType
```

Must return the return type of the UDF given the input scope type. All expressions in Spiral must return nullable (Arrow default) types, including nested structs: all fields in structs must also be nullable, and if those fields are structs, their fields must also be nullable, and so on.
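The warning above about defining `__eq__` and `__hash__` is exactly what `@dataclass(frozen=True)` provides for free, as this standalone check illustrates (plain Python, no Spiral imports; the class is a stand-in for a UDF subclass):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FMA:
    # Frozen dataclasses auto-generate __eq__ and __hash__ from the
    # listed fields, so two instances with equal fields compare
    # (and hash) equal — the property UDF subclasses need.
    b: int
    c: int

assert FMA(2, 3) == FMA(2, 3)
assert hash(FMA(2, 3)) == hash(FMA(2, 3))
assert FMA(2, 3) != FMA(2, 4)
```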
#### invoke

```python
@abc.abstractmethod
def invoke(scope: pa.Array) -> pa.Array
```

Must implement the UDF logic given the input scope array.

---

# spiral.expressions.s3

Section: PySpiral
Source: https://docs.spiraldb.com/python-api/expressions.s3

#### get

```python
def get(expr: ExprLike, abort_on_error: bool = False) -> Expr
```

Read data from object storage by s3:// URL.

**Arguments**:

- `expr` - URLs of the data that needs to be read from object storage.
- `abort_on_error` - Whether the expression should abort on errors or just collect them.

---

# spiral.expressions.http

Section: PySpiral
Source: https://docs.spiraldb.com/python-api/expressions.http

#### get

```python
def get(expr: ExprLike, abort_on_error: bool = False) -> Expr
```

Read data from the URL.

**Arguments**:

- `expr` - URLs of the data that needs to be read.
- `abort_on_error` - Whether the expression should abort on errors or just collect them.

---

# spiral.expressions.file

Section: PySpiral
Source: https://docs.spiraldb.com/python-api/expressions.file

#### get

```python
def get(expr: ExprLike, abort_on_error: bool = False) -> Expr
```

Read data from the local filesystem by file:// URL.

**Arguments**:

- `expr` - URLs of the data that needs to be read.
- `abort_on_error` - Whether the expression should abort on errors or just collect them.

---

# spiral.expressions.list

Section: PySpiral
Source: https://docs.spiraldb.com/python-api/expressions.list

Spiral List Expressions.

```python
from spiral import expressions as se
se.list
```

#### contains

```python
def contains(expr: ExprLike, value: ExprLike) -> Expr
```

Check if a list contains a value.

**Arguments**:

- `expr` - The list expression.
- `value` - The value to search for.

```python
se.list.contains(table["tags"], "important")
```

#### map\_

```python
def map_(expr: ExprLike, transform: Callable[[Expr], ExprLike]) -> Expr
```

Apply a transform to each element of a list, returning a new list.

**Arguments**:

- `expr` - The list expression.
- `transform` - A lambda that takes an element expression and returns a transformed expression.

```python
se.list.map_(table["scores"], lambda el: el > 90)
```

#### any\_

```python
def any_(expr: ExprLike, predicate: Callable[[Expr], ExprLike] | None = None) -> Expr
```

Check if any element of a list satisfies a predicate.

**Arguments**:

- `expr` - A list expression. If no predicate is given, must be a list of booleans.
- `predicate` - Optional lambda applied to each element. When provided, it is equivalent to `any_(map_(expr, predicate))`.

```python
se.list.any_(table["scores"], lambda el: el > 90)
```

#### all\_

```python
def all_(expr: ExprLike, predicate: Callable[[Expr], ExprLike] | None = None) -> Expr
```

Check if all elements of a list satisfy a predicate.

**Arguments**:

- `expr` - A list expression. If no predicate is given, must be a list of booleans.
- `predicate` - Optional lambda applied to each element. When provided, it is equivalent to `all_(map_(expr, predicate))`.

```python
se.list.all_(table["scores"], lambda el: el > 90)
```

#### slice

```python
def slice(array: ExprLike, start: int, stop: int | None, step: int = 1) -> Expr
```

Slice each list in a list array to `start..stop` by `step`.

---

# spiral.expressions.vector

Section: PySpiral
Source: https://docs.spiraldb.com/python-api/expressions.vector

#### cosine\_similarity

```python
def cosine_similarity(left: ExprLike, right: ExprLike) -> Expr
```

Compute the cosine similarity of `left` and `right`. See the Wikipedia article ["Cosine similarity"](https://en.wikipedia.org/wiki/Cosine_similarity). Only implemented for fixed-size lists of the 16-, 32-, and 64-bit floating-point types.

##### Examples

```python
>>> import spiral.expressions as se
>>> import pyarrow as pa
>>> needle = pa.scalar([1, 0, 0], type=pa.list_(pa.float64(), 3))
>>> sp.scan({"result": se.vector.cosine_similarity(t["embedding"], needle)}).to_table()
pyarrow.Table
x: int64
result: double
----
x: ...
result: ...
```

**Arguments**:

- `left` - an array of fixed-size-lists of length N and type T
- `right` - an array of fixed-size-lists of length N and type T

**Returns**: an array of T

#### dot

```python
def dot(left: ExprLike, right: ExprLike) -> Expr
```

Compute the dot product of `left` and `right`. Only implemented for fixed-size lists of the 16-, 32-, and 64-bit floating-point types.

##### Examples

```python
>>> import spiral.expressions as se
>>> sp.scan({"5_34_sim": se.vector.dot(t["layer5"], t["layer34"])}).to_table()
pyarrow.Table
x: int64
5_34_sim: double
----
x: ...
5_34_sim: ...
```

**Arguments**:

- `left` - an array of fixed-size-lists of length N and type T
- `right` - an array of fixed-size-lists of length N and type T

**Returns**: an array of T

#### norm

```python
def norm(array: ExprLike) -> Expr
```

Compute the Euclidean (L2) norm of `array`. Only implemented for fixed-size lists of the 16-, 32-, and 64-bit floating-point types.

##### Examples

```python
>>> import spiral.expressions as se
>>> sp.scan({"norm": se.vector.norm(t["embedding"])}).to_table()
pyarrow.Table
x: int64
norm: double
----
x: ...
norm: ...
```

**Arguments**:

- `array` - an array of fixed-size-lists of type T

**Returns**: an array of T

---

# spiral.viz

Section: PySpiral
Source: https://docs.spiraldb.com/python-api/viz

## send\_to\_rerun

```python
def send_to_rerun(data: Table | Scan | pa.RecordBatchReader | pa.Table
                  | pa.RecordBatch | pa.ChunkedArray | pa.StructArray,
                  spec: dict[str, Archetype],
                  *,
                  default_timeline: str | None = None)
```

Visualize a table, scan, record batch, record batch reader, chunked array, or struct array.

**Arguments**:

- `data` - positional-only; the data to visualize
- `spec` - positional-only; a named set of visualizations referencing columns of the data.
- `default_timeline` - keyword-only; a dot-separated path to a timeline column (an integer or timestamp).

## Archetype

```python
@final
class Archetype()
```

An ergonomic shim around a Rerun Archetype.
An archetype defines visualization semantics for a value. For example, a pair of floating-point values could be interpreted as:

- a point in 2-d Euclidean space,
- a vector in 2-d Euclidean space (perhaps rooted at the origin, perhaps not), or
- a point in 2-d Hyperbolic space

among many other possible interpretations. Rerun defines archetypes for the first two: Points2D and Arrows2D.

`spiral.viz.Archetype` is a shim which flattens or broadcasts components of an Archetype so that they are all aligned to one another and to their timeline.

**Example**

Consider a 2-d point cloud which rotates around the origin. We store it as an Arrow table with two columns: `timeline` of type `timestamp[ns]` and `cloud` of type `list(list(f64, 2))`. The outer lists represent the point cloud at each timestamp and the inner fixed-size-list of length 2 represents the (x, y) pairs.

```python
import math
import pyarrow as pa
from datetime import datetime
import spiral
from spiral import viz as sv

timeline = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def point_on_circle(index, time):
    x = 2 * math.pi * (index / 8 + time)
    return (math.cos(x), math.sin(x))

cloud = [
    [point_on_circle(index, time) for index in range(8)]
    for time in timeline
]
timeline = [datetime.fromtimestamp(t) for t in timeline]

table = pa.table({
    "timeline": pa.array(timeline, type=pa.timestamp("ns")),
    "cloud": pa.array(cloud),
})
```

Now we can use `sv.Points2D`, a spiral Archetype, to convert this data into a Rerun visualization. `sv.Points2D` will automatically broadcast the timeline across the values of the outer list.

```python
import rerun

rerun.init("spiral 1")
sv.send_to_rerun(
    table,
    {"point_cloud": sv.Points2D(positions="cloud", timeline="timeline")}
)
rerun.notebook_show()
```

## ArrowComponentBatchLike

```python
@final
class ArrowComponentBatchLike(rr.ComponentBatchLike)
```

A Rerun shim around an Arrow array.
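The timeline broadcasting described above can be sketched in plain Python (an illustrative helper, not part of the `spiral.viz` API): each timeline entry is repeated once per element of the corresponding outer list, so the flattened values stay aligned with their timestamps.

```python
def broadcast_timeline(timeline, nested_values):
    """Align a per-row timeline with flattened outer-list values."""
    times, flat = [], []
    for t, values in zip(timeline, nested_values):
        # Repeat the timestamp once for each point recorded at it.
        times.extend([t] * len(values))
        flat.extend(values)
    return times, flat
```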
## ArrowTimeColumn

```python
@final
class ArrowTimeColumn(rr._send_columns.TimeColumnLike)
```

A timeline represented as a series of points on that timeline. A timeline may be either temporal or ordinal. The `ArrowTimeColumn` for a temporal timeline must have type `timestamp[ns]`. An ordinal timeline must have type `int64`.

**Example**

A series of values may simply use their row index as a timeline:

```python
import pyarrow as pa
import spiral

data = ['a', 'b', 'c', 'd']
ordinal_timeline = pa.array(list(range(len(data))))
spiral.viz.ArrowTimeColumn.from_name_and_data("my_timeline", ordinal_timeline)
```

They may also use any nanosecond-resolution timestamp column. For example, a 10 Hz sensor with data from 0 s to 1 s might use a timeline such as this:

```python
from datetime import datetime

timeline = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
timeline = [datetime.fromtimestamp(t) for t in timeline]
spiral.viz.ArrowTimeColumn.from_name_and_data("my_timeline", timeline)
```

## Arrows2D

```python
def Arrows2D(**kwargs)
```

[rerun.archetypes.Arrows2D](https://rerun.io/docs/reference/types/archetypes/arrows2d)

## Arrows3D

```python
def Arrows3D(**kwargs)
```

[rerun.archetypes.Arrows3D](https://rerun.io/docs/reference/types/archetypes/arrows3d)

## Points2D

```python
def Points2D(**kwargs)
```

[rerun.archetypes.Points2D](https://rerun.io/docs/reference/types/archetypes/points2d)

## Points3D

```python
def Points3D(**kwargs)
```

[rerun.archetypes.Points3D](https://rerun.io/docs/reference/types/archetypes/points3d)

## TextLog

```python
def TextLog(**kwargs)
```

[rerun.archetypes.TextLog](https://rerun.io/docs/reference/types/archetypes/text_log)

## Scalars

```python
def Scalars(**kwargs)
```

[rerun.archetypes.Scalars](https://rerun.io/docs/reference/types/archetypes/scalars)

---

# Command Line

Section: PySpiral
Source: https://docs.spiraldb.com/cli

**Usage**:

```console
$ spiral [OPTIONS] COMMAND [ARGS]...
```

**Options**:

- `--version`: Display the version of the Spiral CLI.
- `-v, --verbose`: Increase log verbosity. Use -v for info, -vv for debug. \[default: 0] - `--install-completion`: Install completion for the current shell. - `--show-completion`: Show completion for the current shell, to copy it or customize the installation. - `--help`: Show this message and exit. **Commands**: - `login` - `whoami`: Display the current user's information. - `orgs`: Org admin. - `projects`: Projects and grants. - `tables`: Spiral Tables. - `fs`: File Systems. - `txn`: Table Transactions. - `workloads` ## `spiral login` **Usage**: ```console $ spiral login [OPTIONS] ``` **Options**: - `--org-id TEXT` - `--force / --no-force`: \[default: no-force] - `--show-token / --no-show-token`: \[default: no-show-token] - `--help`: Show this message and exit. ## `spiral whoami` Display the current user's information. **Usage**: ```console $ spiral whoami [OPTIONS] ``` **Options**: - `--help`: Show this message and exit. ## `spiral orgs` **Usage**: ```console $ spiral orgs [OPTIONS] COMMAND [ARGS]... ``` **Options**: - `--help`: Show this message and exit. **Commands**: - `switch`: Switch the active organization. - `ls`: List organizations. - `invite`: Invite a user to the organization. - `sso`: Configure single sign-on for your... - `directory`: Configure directory services for your... - `audit-logs`: Configure audit logs for your organization. - `log-streams`: Configure log streams for your organization. - `domains`: Configure domains for your organization. - `keys`: Configure bring-your-own-key for your... ### `spiral orgs switch` Switch the active organization. **Usage**: ```console $ spiral orgs switch [OPTIONS] [ORG_ID] ``` **Arguments**: - `[ORG_ID]`: Organization ID **Options**: - `--help`: Show this message and exit. ### `spiral orgs ls` List organizations. **Usage**: ```console $ spiral orgs ls [OPTIONS] ``` **Options**: - `--help`: Show this message and exit. ### `spiral orgs invite` Invite a user to the organization. 
**Usage**: ```console $ spiral orgs invite [OPTIONS] EMAIL ``` **Arguments**: - `EMAIL`: \[required] **Options**: - `--role [owner|member|guest]`: \[default: member] - `--expires-in-days INTEGER`: \[default: 7] - `--help`: Show this message and exit. ### `spiral orgs sso` Configure single sign-on for your organization. **Usage**: ```console $ spiral orgs sso [OPTIONS] ``` **Options**: - `--help`: Show this message and exit. ### `spiral orgs directory` Configure directory services for your organization. **Usage**: ```console $ spiral orgs directory [OPTIONS] ``` **Options**: - `--help`: Show this message and exit. ### `spiral orgs audit-logs` Configure audit logs for your organization. **Usage**: ```console $ spiral orgs audit-logs [OPTIONS] ``` **Options**: - `--help`: Show this message and exit. ### `spiral orgs log-streams` Configure log streams for your organization. **Usage**: ```console $ spiral orgs log-streams [OPTIONS] ``` **Options**: - `--help`: Show this message and exit. ### `spiral orgs domains` Configure domains for your organization. **Usage**: ```console $ spiral orgs domains [OPTIONS] ``` **Options**: - `--help`: Show this message and exit. ### `spiral orgs keys` Configure bring-your-own-key for your organization. **Usage**: ```console $ spiral orgs keys [OPTIONS] ``` **Options**: - `--help`: Show this message and exit. ## `spiral projects` **Usage**: ```console $ spiral projects [OPTIONS] COMMAND [ARGS]... ``` **Options**: - `--help`: Show this message and exit. **Commands**: - `ls`: List projects. - `create`: Create a new project. - `drop`: Drop a project. - `grant`: Grant a role on a project to a principal. - `grants`: List project grants. ### `spiral projects ls` List projects. **Usage**: ```console $ spiral projects ls [OPTIONS] ``` **Options**: - `--help`: Show this message and exit. ### `spiral projects create` Create a new project. 
**Usage**: ```console $ spiral projects create [OPTIONS] ``` **Options**: - `--description TEXT`: Description for the project. \[required] - `--id-prefix TEXT`: An optional ID prefix to which a random number will be appended. - `--help`: Show this message and exit. ### `spiral projects drop` Drop a project. **Usage**: ```console $ spiral projects drop [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--help`: Show this message and exit. ### `spiral projects grant` Grant a role on a project to a principal. **Usage**: ```console $ spiral projects grant [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--role [viewer|editor|admin]`: Project role to grant. \[required] - `--org-id TEXT`: Pass an organization ID to grant a role to an organization user(s). - `--org-user TEXT`: Pass a user ID when using --org-id to grant a role to a user. - `--org-role [owner|member|guest]`: Pass an org role when using --org-id to grant a role to all users with that role. - `--workload-id TEXT`: Pass a workload ID to grant a role to a workload. - `--github TEXT`: Pass an `{org}/{repo}` string to grant a role to a job running in GitHub Actions. - `--modal TEXT`: Pass a `{workspace_id}/{env_name}` string to grant a role to a job running in Modal environment. - `--gcp-service-account TEXT`: Pass a `{service_account_email}/{unique_id}` to grant a role to a GCP service account. - `--aws-iam-role TEXT`: Pass a `{account_id}/{role_name}` to grant a Spiral role to an AWS IAM Role. - `--help`: Show this message and exit. ### `spiral projects grants` List project grants. **Usage**: ```console $ spiral projects grants [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--help`: Show this message and exit. ## `spiral tables` **Usage**: ```console $ spiral tables [OPTIONS] COMMAND [ARGS]... ``` **Options**: - `--help`: Show this message and exit. **Commands**: - `ls`: List tables. 
- `head`: Show the leading rows of the table. - `move`: Move table to a different dataset. - `rename`: Rename table. - `drop`: Drop table. - `key-schema`: Show the table key schema. - `schema`: Compute the full table schema. - `wal`: Fetch Write-Ahead-Log. - `flush`: Flush Write-Ahead-Log. - `truncate`: Truncate column group metadata. - `manifests`: Display all fragments from metadata. (DEPRECATED: Use 'fragments' instead.) - `fragments`: Display all fragments from metadata. - `fragments-scan`: Display the fragments used in a scan of a... - `debug-scan`: Visualize the scan of a given column group. - `stats`: Display table statistics and compaction... ### `spiral tables ls` List tables. **Usage**: ```console $ spiral tables ls [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--help`: Show this message and exit. ### `spiral tables head` Show the leading rows of the table. **Usage**: ```console $ spiral tables head [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `-n INTEGER`: Maximum number of rows to show. Defaults to 10. \[default: 10] - `--asof TEXT`: Transaction timestamp from `spiral txn ls` (microseconds or UTC datetime). - `--help`: Show this message and exit. ### `spiral tables move` Move table to a different dataset. **Usage**: ```console $ spiral tables move [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--new-dataset TEXT`: New dataset name. - `--help`: Show this message and exit. ### `spiral tables rename` Rename table. **Usage**: ```console $ spiral tables rename [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--new-table TEXT`: New table name. - `--help`: Show this message and exit. ### `spiral tables drop` Drop table. 
**Usage**: ```console $ spiral tables drop [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--help`: Show this message and exit. ### `spiral tables key-schema` Show the table key schema. **Usage**: ```console $ spiral tables key-schema [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--help`: Show this message and exit. ### `spiral tables schema` Compute the full table schema. **Usage**: ```console $ spiral tables schema [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--help`: Show this message and exit. ### `spiral tables wal` Fetch Write-Ahead-Log. **Usage**: ```console $ spiral tables wal [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--help`: Show this message and exit. ### `spiral tables flush` Flush Write-Ahead-Log. **Usage**: ```console $ spiral tables flush [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--help`: Show this message and exit. ### `spiral tables truncate` Truncate column group metadata. **Usage**: ```console $ spiral tables truncate [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--help`: Show this message and exit. ### `spiral tables manifests` Display all fragments from metadata. **Usage**: ```console $ spiral tables manifests [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--max-rows INTEGER`: Maximum number of rows to show per manifest. 
- `--asof TEXT`: Transaction timestamp from `spiral txn ls` (microseconds or UTC datetime). - `--help`: Show this message and exit. ### `spiral tables fragments` Display all fragments from metadata. **Usage**: ```console $ spiral tables fragments [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--max-rows INTEGER`: Maximum number of fragments to show per manifest. - `--asof TEXT`: Transaction timestamp from `spiral txn ls` (microseconds or UTC datetime). - `--help`: Show this message and exit. ### `spiral tables fragments-scan` Display the fragments used in a scan of a given column group. **Usage**: ```console $ spiral tables fragments-scan [OPTIONS] [PROJECT] [COLUMN_GROUP] ``` **Arguments**: - `[PROJECT]`: Project ID - `[COLUMN_GROUP]`: Dot-separated column group path. \[default: .] **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--help`: Show this message and exit. ### `spiral tables debug-scan` Visualize the scan of a given column group. **Usage**: ```console $ spiral tables debug-scan [OPTIONS] [PROJECT] [COLUMN_GROUP] ``` **Arguments**: - `[PROJECT]`: Project ID - `[COLUMN_GROUP]`: Dot-separated column group path. \[default: .] **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--help`: Show this message and exit. ### `spiral tables stats` Display table statistics and compaction indicators. **Usage**: ```console $ spiral tables stats [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--help`: Show this message and exit. ## `spiral fs` **Usage**: ```console $ spiral fs [OPTIONS] COMMAND [ARGS]... ``` **Options**: - `--help`: Show this message and exit. **Commands**: - `show`: Show the file system configured for project. - `update`: Update a project's default file system. 
- `list-providers`: Lists the available built-in file system... ### `spiral fs show` Show the file system configured for project. **Usage**: ```console $ spiral fs show [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--help`: Show this message and exit. ### `spiral fs update` Update a project's default file system. **Usage**: ```console $ spiral fs update [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--type [builtin|s3|s3like|gcs|upstream]`: Type of the file system. - `--provider TEXT`: Provider, when using `builtin` type. - `--endpoint TEXT`: Endpoint, when using `s3` or `s3like` type. - `--region TEXT`: Region, when using `s3`, `s3like` or `gcs` type (defaults to `auto` for `s3` when `endpoint` is set). - `--bucket TEXT`: Bucket, when using `s3` or `gcs` type. - `--role-arn TEXT`: Role ARN to assume, when using `s3` type. - `-y, --yes`: Skip confirmation prompt. - `--dangerously-move-data`: Allow updating a file system which has mounted tables. - `--help`: Show this message and exit. ### `spiral fs list-providers` Lists the available built-in file system providers. **Usage**: ```console $ spiral fs list-providers [OPTIONS] ``` **Options**: - `--help`: Show this message and exit. ## `spiral txn` **Usage**: ```console $ spiral txn [OPTIONS] COMMAND [ARGS]... ``` **Options**: - `--help`: Show this message and exit. **Commands**: - `ls`: List transactions for a table. - `revert`: Revert a transaction by table ID and... ### `spiral txn ls` List transactions for a table. **Usage**: ```console $ spiral txn ls [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--table TEXT`: Table name. - `--dataset TEXT`: Dataset name. - `--since INTEGER`: List transactions committed after this timestamp (microseconds). - `--help`: Show this message and exit. ### `spiral txn revert` Revert a transaction by table ID and transaction index. 
**Usage**: ```console $ spiral txn revert [OPTIONS] TABLE_ID TXN_IDX ``` **Arguments**: - `TABLE_ID`: \[required] - `TXN_IDX`: \[required] **Options**: - `--help`: Show this message and exit. ## `spiral workloads` **Usage**: ```console $ spiral workloads [OPTIONS] COMMAND [ARGS]... ``` **Options**: - `--help`: Show this message and exit. **Commands**: - `ls`: List workloads. - `create`: Create a new workload. - `deactivate`: Deactivate a workload. - `issue-creds`: Issue new workload credentials. - `revoke-creds`: Revoke workload credentials. - `policies`: List workload policies. - `delete-policy`: Delete a workload policy. - `create-policy`: Create a workload policy. ### `spiral workloads ls` List workloads. **Usage**: ```console $ spiral workloads ls [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--help`: Show this message and exit. ### `spiral workloads create` Create a new workload. **Usage**: ```console $ spiral workloads create [OPTIONS] [PROJECT] ``` **Arguments**: - `[PROJECT]`: Project ID **Options**: - `--name TEXT`: Friendly name for the workload. - `--help`: Show this message and exit. ### `spiral workloads deactivate` Deactivate a workload. Removes all associated credentials and policies. **Usage**: ```console $ spiral workloads deactivate [OPTIONS] WORKLOAD_ID ``` **Arguments**: - `WORKLOAD_ID`: Workload ID. \[required] **Options**: - `--help`: Show this message and exit. ### `spiral workloads issue-creds` Issue new workload credentials. **Usage**: ```console $ spiral workloads issue-creds [OPTIONS] WORKLOAD_ID ``` **Arguments**: - `WORKLOAD_ID`: Workload ID. \[required] **Options**: - `--skip-prompt / --no-skip-prompt`: Skip prompt and print secret to console. \[default: no-skip-prompt] - `--help`: Show this message and exit. ### `spiral workloads revoke-creds` Revoke workload credentials. **Usage**: ```console $ spiral workloads revoke-creds [OPTIONS] CLIENT_ID ``` **Arguments**: - `CLIENT_ID`: Client ID to revoke. 
\[required] **Options**: - `--help`: Show this message and exit. ### `spiral workloads policies` List workload policies. **Usage**: ```console $ spiral workloads policies [OPTIONS] [WORKLOAD_ID] ``` **Arguments**: - `[WORKLOAD_ID]`: Workload ID. **Options**: - `--help`: Show this message and exit. ### `spiral workloads delete-policy` Delete a workload policy. **Usage**: ```console $ spiral workloads delete-policy [OPTIONS] POLICY_ID ``` **Arguments**: - `POLICY_ID`: Policy ID. \[required] **Options**: - `--help`: Show this message and exit. ### `spiral workloads create-policy` Create a workload policy. The audience (aud) claim of OIDC tokens must be [https://iss.spiraldb.com](https://iss.spiraldb.com) (except for Modal, where the audience is oidc.modal.com). See [https://docs.spiraldb.com/oidc](https://docs.spiraldb.com/oidc) for more info. **Usage**: ```console $ spiral workloads create-policy [OPTIONS] COMMAND [ARGS]... ``` **Options**: - `--help`: Show this message and exit. **Commands**: - `github`: Create a policy for a GitHub Actions... - `modal`: Create a policy for a Modal environment. - `gcp`: Create a policy for a GCP environment. - `aws`: Create a policy for an AWS environment. - `oidc`: Create a policy with a custom OIDC issuer. #### `spiral workloads create-policy github` Create a policy for a GitHub Actions environment. **Usage**: ```console $ spiral workloads create-policy github [OPTIONS] WORKLOAD_ID ``` **Arguments**: - `WORKLOAD_ID`: Workload ID. \[required] **Options**: - `--repository TEXT`: GitHub repository (org/repo). \[required] - `--environment TEXT`: GitHub deployment environment name. - `--ref TEXT`: Git ref (e.g. refs/heads/main). - `--sha TEXT`: Commit SHA that triggered the workflow. - `--repository-owner TEXT`: Repository owner (org or user). - `--actor-id TEXT`: GitHub user ID of the actor. - `--repository-visibility TEXT`: Repository visibility (public, private, internal). - `--repository-id TEXT`: Repository ID. 
- `--repository-owner-id TEXT`: Repository owner ID. - `--runner-environment TEXT`: Runner environment (github-hosted or self-hosted). - `--actor TEXT`: GitHub username of the actor. - `--workflow TEXT`: Workflow name. - `--head-ref TEXT`: Head ref (source branch for PRs). - `--base-ref TEXT`: Base ref (target branch for PRs). - `--event-name TEXT`: Event that triggered the workflow (e.g. push, pull\_request). - `--ref-type TEXT`: Ref type (branch or tag). - `--job-workflow-ref TEXT`: Reusable workflow ref. - `--help`: Show this message and exit. #### `spiral workloads create-policy modal` Create a policy for a Modal environment. **Usage**: ```console $ spiral workloads create-policy modal [OPTIONS] WORKLOAD_ID ``` **Arguments**: - `WORKLOAD_ID`: Workload ID. \[required] **Options**: - `--workspace-id TEXT`: Modal workspace ID. \[required] - `--environment-name TEXT`: Modal environment name. - `--environment-id TEXT`: Modal environment ID. - `--app-id TEXT`: Modal app ID. - `--app-name TEXT`: Modal app name. - `--function-id TEXT`: Modal function ID. - `--function-name TEXT`: Modal function name. - `--container-id TEXT`: Modal container ID. - `--help`: Show this message and exit. #### `spiral workloads create-policy gcp` Create a policy for a GCP environment. **Usage**: ```console $ spiral workloads create-policy gcp [OPTIONS] WORKLOAD_ID ``` **Arguments**: - `WORKLOAD_ID`: Workload ID. \[required] **Options**: - `--email TEXT`: GCP service account email. \[required] - `--unique-id TEXT`: GCP unique ID (sub claim). \[required] - `--help`: Show this message and exit. #### `spiral workloads create-policy aws` Create a policy for an AWS environment. **Usage**: ```console $ spiral workloads create-policy aws [OPTIONS] WORKLOAD_ID ``` **Arguments**: - `WORKLOAD_ID`: Workload ID. \[required] **Options**: - `--account TEXT`: AWS account ID. \[required] - `--role TEXT`: AWS IAM role name. \[required] - `--help`: Show this message and exit. 
#### `spiral workloads create-policy oidc` Create a policy with a custom OIDC issuer. **Usage**: ```console $ spiral workloads create-policy oidc [OPTIONS] WORKLOAD_ID ``` **Arguments**: - `WORKLOAD_ID`: Workload ID. \[required] **Options**: - `--iss TEXT`: OIDC issuer URL. \[required] - `--conditions TEXT`: Conditions in KEY=VALUE format. Can be provided multiple times. - `--help`: Show this message and exit. --- # Client Configuration Section: PySpiral Source: https://docs.spiraldb.com/config This document describes the configuration options for the Spiral client. Configuration can be set via: - **File**: `~/.spiral.toml` - **Environment variables**: Prefixed with `SPIRAL__` (double underscore) - **Runtime overrides**: Override values when initializing the client, using dot notation ## Caching Spiral supports three independent caches for different file types: - **`keys_cache`**: Caches key space files. **Enabled memory-only by default**. Recommended for workloads with repeated key lookups. - **`manifests_cache`**: Caches manifest files. Recommended to enable for better query planning performance. - **`fragments_cache`**: Caches fragment (data) files. **Not recommended** to enable in most cases as fragments are typically very large and benefit less from caching. Each cache can be independently configured with separate memory limits, disk capacity, and storage paths. Set `disk_capacity_bytes` to `0` for memory-only caching. ## Top level Main configuration for the Spiral client; these settings can affect many different parts of the client. 
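The environment-variable form of any setting in the tables below nests with double underscores. As a quick sketch (values here are illustrative examples, not recommendations), and a spelled-out check of the shift-expression byte sizes used as defaults:

```python
import os

# Double underscores separate nesting levels in setting names.
os.environ["SPIRAL__MANIFESTS_CACHE__ENABLED"] = "true"
os.environ["SPIRAL__KEYS_CACHE__DISK_CAPACITY_BYTES"] = "0"  # memory-only

# The byte-size defaults below are written as shift expressions:
assert 4 * (1 << 10) == 4 * 1024                  # 4 KiB
assert 16 * (1 << 20) == 16 * 1024 * 1024         # 16 MiB
assert 20 * (1 << 30) == 20 * 1024 * 1024 * 1024  # 20 GiB
```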
| Field | Default | Environment Variable | Description | | ------------- | -------- | --------------------- | ----------------------------- | | `dev` | `false` | `SPIRAL__DEV` | Development mode flag | | `file_format` | `vortex` | `SPIRAL__FILE_FORMAT` | File format for table storage | ## Authn Authentication-related settings | Field | Default | Environment Variable | Description | | ------------------- | ------- | ---------------------------- | -------------------------------------------------------------------------------------------------------------------------- | | `authn.device_code` | `false` | `SPIRAL__AUTHN__DEVICE_CODE` | Whether to force device code authentication flow. This exists as an override when e.g. ssh-ing into an environment machine. | | `authn.token` | `-` | `SPIRAL__AUTHN__TOKEN` | Optional authentication token. | ## Fragments Cache Client cache limits and constraints | Field | Default | Environment Variable | Description | | --------------------------------------------- | ---------------- | ------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- | | `fragments_cache.blob_index_size` | `4 * (1 << 10)` | `SPIRAL__FRAGMENTS_CACHE__BLOB_INDEX_SIZE` | Size of the blob index for each blob on disk. A larger index can hold more entries per blob but increases per-blob I/O. | | `fragments_cache.block_size` | `16 * (1 << 20)` | `SPIRAL__FRAGMENTS_CACHE__BLOCK_SIZE` | Block size for the disk cache engine. This is the minimum eviction unit and also limits the maximum cacheable entry size. | | `fragments_cache.buffer_pool_size` | `16 * (1 << 20)` | `SPIRAL__FRAGMENTS_CACHE__BUFFER_POOL_SIZE` | Total flush buffer pool size in bytes (RAM). Each flusher gets `buffer_pool_size / flushers` bytes. Entries larger than this are dropped. 
| | `fragments_cache.disk_capacity_bytes` | `20 * (1 << 30)` | `SPIRAL__FRAGMENTS_CACHE__DISK_CAPACITY_BYTES` | Disk cache size in bytes. | | `fragments_cache.disk_path` | `spiral` | `SPIRAL__FRAGMENTS_CACHE__DISK_PATH` | Disk cache path override. If not set, defaults to $XDG\_CACHE\_HOME/spiral or \~/.cache/spiral | | `fragments_cache.enabled` | `false` | `SPIRAL__FRAGMENTS_CACHE__ENABLED` | Whether this cache is enabled. Default varies per cache type (see ClientSettings). | | `fragments_cache.memory_capacity_bytes` | `2 * (1 << 30)` | `SPIRAL__FRAGMENTS_CACHE__MEMORY_CAPACITY_BYTES` | Memory cache size in bytes. | | `fragments_cache.submit_queue_size_threshold` | `16 * (1 << 20)` | `SPIRAL__FRAGMENTS_CACHE__SUBMIT_QUEUE_SIZE_THRESHOLD` | If the total estimated size in the submit queue exceeds this threshold, further entries are silently dropped. | | `fragments_cache.write_on_insertion` | `-` | `SPIRAL__FRAGMENTS_CACHE__WRITE_ON_INSERTION` | When true, entries are written to disk immediately on insertion. When false, entries are only written to disk when evicted from memory. | ## Keys Cache Client cache limits and constraints | Field | Default | Environment Variable | Description | | ---------------------------------------- | ---------------- | ------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | | `keys_cache.blob_index_size` | `4 * (1 << 10)` | `SPIRAL__KEYS_CACHE__BLOB_INDEX_SIZE` | Size of the blob index for each blob on disk. A larger index can hold more entries per blob but increases per-blob I/O. | | `keys_cache.block_size` | `16 * (1 << 20)` | `SPIRAL__KEYS_CACHE__BLOCK_SIZE` | Block size for the disk cache engine. This is the minimum eviction unit and also limits the maximum cacheable entry size. 
| | `keys_cache.buffer_pool_size` | `16 * (1 << 20)` | `SPIRAL__KEYS_CACHE__BUFFER_POOL_SIZE` | Total flush buffer pool size in bytes (RAM). Each flusher gets `buffer_pool_size / flushers` bytes. Entries larger than this are dropped. | | `keys_cache.disk_capacity_bytes` | `0` | `SPIRAL__KEYS_CACHE__DISK_CAPACITY_BYTES` | Disk cache size in bytes. | | `keys_cache.disk_path` | `spiral` | `SPIRAL__KEYS_CACHE__DISK_PATH` | Disk cache path override. If not set, defaults to $XDG\_CACHE\_HOME/spiral or \~/.cache/spiral | | `keys_cache.enabled` | `true` | `SPIRAL__KEYS_CACHE__ENABLED` | Whether this cache is enabled. Default varies per cache type (see ClientSettings). | | `keys_cache.memory_capacity_bytes` | `1 << 30` | `SPIRAL__KEYS_CACHE__MEMORY_CAPACITY_BYTES` | Memory cache size in bytes. | | `keys_cache.submit_queue_size_threshold` | `16 * (1 << 20)` | `SPIRAL__KEYS_CACHE__SUBMIT_QUEUE_SIZE_THRESHOLD` | If the total estimated size in the submit queue exceeds this threshold, further entries are silently dropped. | | `keys_cache.write_on_insertion` | `-` | `SPIRAL__KEYS_CACHE__WRITE_ON_INSERTION` | When true, entries are written to disk immediately on insertion. When false, entries are only written to disk when evicted from memory. 
| ## Limits Client resource limits and constraints | Field | Default | Environment Variable | Description | | ----------------------------------------------------- | ------------------------ | -------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | | `limits.batch_readahead` | `min(num_cpus::get(), 32)` | `SPIRAL__LIMITS__BATCH_READAHEAD` | Maximum number of concurrent shards to evaluate, "read ahead", when scanning. Defaults to min(number of CPU cores, 32). | | `limits.compaction_size_based_threshold_bytes` | `8 * 1024 * 1024` | `SPIRAL__LIMITS__COMPACTION_SIZE_BASED_THRESHOLD_BYTES` | Fragments smaller than this threshold will be considered for compaction when the size-based compaction strategy is used. | | `limits.compaction_split_compressed_max_bytes` | `1024 * 1024 * 1024` | `SPIRAL__LIMITS__COMPACTION_SPLIT_COMPRESSED_MAX_BYTES` | Maximum total compressed size of fragments accumulated in a single compaction split. | | `limits.compaction_task_concurrency` | `1` | `SPIRAL__LIMITS__COMPACTION_TASK_CONCURRENCY` | Maximum number of concurrent tasks in compaction. | | `limits.http_max_retries` | `10` | `SPIRAL__LIMITS__HTTP_MAX_RETRIES` | Maximum number of HTTP request retries on transient errors. | | `limits.http_max_retry_interval_ms` | `5 * 60 * 1000` | `SPIRAL__LIMITS__HTTP_MAX_RETRY_INTERVAL_MS` | Maximum number of milliseconds to wait between HTTP request retries. | | `limits.http_min_retry_interval_ms` | `100` | `SPIRAL__LIMITS__HTTP_MIN_RETRY_INTERVAL_MS` | Minimum number of milliseconds to wait between HTTP request retries. 
| | `limits.io_merge_threshold` | `1024 * 1024` | `SPIRAL__LIMITS__IO_MERGE_THRESHOLD` | Byte range merge distance threshold (in bytes) for IO coalescing. Nearby byte ranges within this distance are merged into a single HTTP request. | | `limits.join_set_concurrency` | `num_cpus::get()` | `SPIRAL__LIMITS__JOIN_SET_CONCURRENCY` | Maximum number of concurrent tasks in a JoinSet. Defaults to the number of CPU cores. | | `limits.key_space_max_rows` | `100_000_000` | `SPIRAL__LIMITS__KEY_SPACE_MAX_ROWS` | Maximum number of rows in a key space. Compaction keeps fragments across column groups aligned to L1 key spaces. When this limit is larger, the L1 key spaces might change more frequently, causing a re-alignment to happen. When this limit is smaller, more fragments may have to be split due to key space boundaries. | | `limits.manifest_read_concurrency` | `2 * num_cpus::get()` | `SPIRAL__LIMITS__MANIFEST_READ_CONCURRENCY` | Maximum number of concurrent manifest file reads. | | `limits.manifest_write_concurrency` | `2 * num_cpus::get()` | `SPIRAL__LIMITS__MANIFEST_WRITE_CONCURRENCY` | Maximum number of concurrent manifest file writes. | | `limits.object_storage_client_pool_idle_timeout_s` | `None` | `SPIRAL__LIMITS__OBJECT_STORAGE_CLIENT_POOL_IDLE_TIMEOUT_S` | `pool_idle_timeout` for object storage clients. | | `limits.object_storage_client_pool_max_idle_per_host` | `None` | `SPIRAL__LIMITS__OBJECT_STORAGE_CLIENT_POOL_MAX_IDLE_PER_HOST` | `pool_max_idle_per_host` for object storage clients. | | `limits.prefetch_buffer_size` | `20` | `SPIRAL__LIMITS__PREFETCH_BUFFER_SIZE` | Per-stream channel buffer size (number of KeyTables buffered ahead). | | `limits.read_max_iops_per_file` | `-` | `SPIRAL__LIMITS__READ_MAX_IOPS_PER_FILE` | Maximum number of concurrent HTTP requests when reading a single file. This is an upper bound on the number of concurrent I/O requests to a single VortexReadAt instance. 
Usually only one of these exists per Vortex file, so this is a rough upper limit on the number of concurrent requests to each fragment or manifest file. See also: \[read\_threads\_per\_core\_per\_file]. | | `limits.read_threads_per_core_per_file` | `1` | `SPIRAL__LIMITS__READ_THREADS_PER_CORE_PER_FILE` | The number of read threads per core per file. See also: \[read\_max\_iops\_per\_file], which is a limit shared across all cores and threads reading from the same file object. | | `limits.scan_eval_concurrency` | `16` | `SPIRAL__LIMITS__SCAN_EVAL_CONCURRENCY` | Maximum number of concurrent scan node evaluations (streams preparation) during scans. | | `limits.scan_min_rows_per_chunk` | `1000` | `SPIRAL__LIMITS__SCAN_MIN_ROWS_PER_CHUNK` | Minimum number of rows to buffer when executing the scan. Small chunks are concatenated together until this threshold is reached. | | `limits.spfs_max_retries` | `8` | `SPIRAL__LIMITS__SPFS_MAX_RETRIES` | Maximum number of SPFS operation retries on transient errors. A single SPFS read operation may comprise tens or hundreds of distinct HTTP requests. SPFS write operations may comprise multiple HTTP requests but are typically just one (long) HTTP request. | | `limits.transaction_compact_threshold` | `100` | `SPIRAL__LIMITS__TRANSACTION_COMPACT_THRESHOLD` | Automatically compact transaction operations before commit if the operation count exceeds this threshold. Set to 0 to disable auto-compaction. | | `limits.transaction_manifest_max_rows` | `-` | `SPIRAL__LIMITS__TRANSACTION_MANIFEST_MAX_ROWS` | Maximum number of rows in a manifest when compacting a transaction. | | `limits.transaction_retries` | `3` | `SPIRAL__LIMITS__TRANSACTION_RETRIES` | Maximum number of transaction retries on conflict before giving up. | | `limits.write_buffer_max_bytes` | `1024 * 1024 * 1024` | `SPIRAL__LIMITS__WRITE_BUFFER_MAX_BYTES` | Maximum number of bytes to buffer in memory when writing key table batches. 
| | `limits.write_partition_max_bytes` | `128 * 1024 * 1024` | `SPIRAL__LIMITS__WRITE_PARTITION_MAX_BYTES` | Maximum number of bytes in an uncompressed fragment file. | ## Manifests Cache Client cache limits and constraints | Field | Default | Environment Variable | Description | | --------------------------------------------- | ---------------- | ------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- | | `manifests_cache.blob_index_size` | `4 * (1 << 10)` | `SPIRAL__MANIFESTS_CACHE__BLOB_INDEX_SIZE` | Size of the blob index for each blob on disk. A larger index can hold more entries per blob but increases per-blob I/O. | | `manifests_cache.block_size` | `16 * (1 << 20)` | `SPIRAL__MANIFESTS_CACHE__BLOCK_SIZE` | Block size for the disk cache engine. This is the minimum eviction unit and also limits the maximum cacheable entry size. | | `manifests_cache.buffer_pool_size` | `16 * (1 << 20)` | `SPIRAL__MANIFESTS_CACHE__BUFFER_POOL_SIZE` | Total flush buffer pool size in bytes (RAM). Each flusher gets `buffer_pool_size / flushers` bytes. Entries larger than this are dropped. | | `manifests_cache.disk_capacity_bytes` | `20 * (1 << 30)` | `SPIRAL__MANIFESTS_CACHE__DISK_CAPACITY_BYTES` | Disk cache size in bytes. | | `manifests_cache.disk_path` | `spiral` | `SPIRAL__MANIFESTS_CACHE__DISK_PATH` | Disk cache path override. If not set, defaults to $XDG\_CACHE\_HOME/spiral or \~/.cache/spiral | | `manifests_cache.enabled` | `false` | `SPIRAL__MANIFESTS_CACHE__ENABLED` | Whether this cache is enabled. Default varies per cache type (see ClientSettings). | | `manifests_cache.memory_capacity_bytes` | `2 * (1 << 30)` | `SPIRAL__MANIFESTS_CACHE__MEMORY_CAPACITY_BYTES` | Memory cache size in bytes. 
| | `manifests_cache.submit_queue_size_threshold` | `16 * (1 << 20)` | `SPIRAL__MANIFESTS_CACHE__SUBMIT_QUEUE_SIZE_THRESHOLD` | If the total estimated size in the submit queue exceeds this threshold, further entries are silently dropped. | | `manifests_cache.write_on_insertion` | `-` | `SPIRAL__MANIFESTS_CACHE__WRITE_ON_INSERTION` | When true, entries are written to disk immediately on insertion. When false, entries are only written to disk when evicted from memory. | ## Server API endpoint configuration for the Spiral control plane | Field | Default | Environment Variable | Description | | ------------ | -------------------------- | --------------------- | ----------------------------- | | `server.url` | `https://api.spiraldb.dev` | `SPIRAL__SERVER__URL` | The SpiralDB API endpoint URL | ## Spfs SpFS (Spiral File System) configuration | Field | Default | Environment Variable | Description | | ---------- | --------------------------- | -------------------- | ------------------------------------------ | | `spfs.url` | `https://spfs.spiraldb.dev` | `SPIRAL__SPFS__URL` | The SpFS (Spiral File System) endpoint URL | ## Telemetry Telemetry configuration for the Spiral client | Field | Default | Environment Variable | Description | | ------------------- | ------- | ---------------------------- | ------------------------------------------------ | | `telemetry.enabled` | `false` | `SPIRAL__TELEMETRY__ENABLED` | Whether client-side telemetry export is enabled. | --- # OIDC Integrations Section: Reference Source: https://docs.spiraldb.com/oidc [OIDC](https://auth0.com/docs/authenticate/protocols/openid-connect-protocol) is a protocol for authenticating entities between systems. Spiral integrates with several OIDC providers to allow [workloads](https://docs.spiraldb.com/authorization#workloads) to authenticate without managing static credentials. 
For general workload setup steps (creating a workload, policy, and grant), see [Workloads](https://docs.spiraldb.com/authorization#workloads). ## GitHub Actions [GitHub Actions](https://github.com/features/actions) workflows can authenticate to Spiral using [GitHub's OIDC tokens](https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/about-security-hardening-with-openid-connect). Create a workload policy matching tokens from a specific repository: ```bash copy spiral workloads create-policy github work_p9cqam --repository vortex-data/vortex ``` In your GitHub Actions workflow, request an OIDC token and set the workload ID: ```yaml permissions: id-token: write env: SPIRAL_WORKLOAD_ID: work_p9cqam ``` For more on GitHub's OIDC claims, see [Understanding the OIDC token](https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/about-security-hardening-with-openid-connect#understanding-the-oidc-token). ## Modal [Modal](https://modal.com/) functions can authenticate to Spiral using [Modal's OIDC tokens](https://modal.com/docs/guide/oidc-integration). Create a workload policy matching tokens from a specific workspace and environment: ```bash copy spiral workloads create-policy modal work_p9cqam --workspace-id ws-12345abcd --environment-name main ``` In your Modal function, set the workload ID: ```bash export SPIRAL_WORKLOAD_ID=work_p9cqam ``` For more on Modal's OIDC claims, see [Understanding your OIDC claims](https://modal.com/docs/guide/oidc-integration#step-0-understand-your-oidc-claims). ## GCP [GCP](https://cloud.google.com/) service accounts can authenticate to Spiral using [GCP metadata identity tokens](https://cloud.google.com/docs/authentication/get-id-token). 
Create a workload policy matching tokens from a specific service account: ```bash copy spiral workloads create-policy gcp work_p9cqam \ --email my-sa@my-project.iam.gserviceaccount.com \ --unique-id 107691503500061507150 ``` Your code must be running on a Google Cloud service that provides metadata identity tokens: - Compute Engine - App Engine (standard and flexible) - Cloud Run / Cloud Run functions - Google Kubernetes Engine (see [GKE](#gke) below) - Cloud Build ### GKE Google Kubernetes Engine pods authenticate through GCP IAM service account impersonation. The GKE Kubernetes service account impersonates a GCP IAM service account, and that GCP IAM service account is the identity matched by the workload policy. 1. [Enable Workload Identity Federation](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#enable_on_clusters_and_node_pools) for your GKE cluster. 2. [Grant your GKE service account impersonation rights](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#kubernetes-sa-to-iam) on a GCP IAM service account. We recommend creating a dedicated GCP IAM service account for this. You may use `roles/iam.serviceAccountOpenIdTokenCreator` instead of `roles/iam.workloadIdentityUser`. 3. Create a workload and policy for the GCP IAM service account (the one being impersonated), and grant it access to your project. The policy should match the GCP IAM service account, not the GKE Kubernetes service account. 4. Set the workload ID in your pod's environment: ```bash export SPIRAL_WORKLOAD_ID=work_p9cqam ``` ## AWS [AWS](https://aws.amazon.com/) IAM roles can authenticate to Spiral using STS (Security Token Service). AWS authentication uses STS rather than OIDC, but the setup follows the same workload and policy pattern. 
Create a workload policy matching a specific IAM role:

```bash copy
spiral workloads create-policy aws work_p9cqam --account 123456789012 --role MyDeployRole
```

Your code must be running with the specified IAM role assumed (e.g. via an EC2 instance role, ECS task role, or Lambda execution role). In your AWS environment, set the workload ID:

```bash
export SPIRAL_WORKLOAD_ID=work_p9cqam
```

## OIDC Provider

If your identity provider is not listed above, you can create a policy with a custom OIDC provider.

Create a workload policy with a custom issuer and claim conditions:

```bash copy
spiral workloads create-policy oidc work_p9cqam \
  --iss https://login.example.com \
  --conditions sub=service-account-1 \
  --conditions email=bot@example.com
```

In your environment, set the workload ID and the OIDC token issued by your provider:

```bash
export SPIRAL_WORKLOAD_ID=work_p9cqam
export SPIRAL_OIDC_PROVIDER_TOKEN=
```

The pyspiral client detects `SPIRAL_OIDC_PROVIDER_TOKEN` and automatically exchanges it for a Spiral JWT. Your OIDC provider must issue tokens with the audience (`aud`) claim set to `https://iss.spiraldb.com`.

---

# Architecture

Section: Reference
Source: https://docs.spiraldb.com/architecture

![SpiralDB Architecture](/img/architecture.svg)

---

# Table Format

Section: Reference
Source: https://docs.spiraldb.com/format

Spiral Tables are designed as a table format, similar to Apache Iceberg. The client is configurable with a pluggable metastore, and the SpiralDB server provides a reference implementation over HTTP.

## Data Model

The data model is tabular, requiring a mandatory primary key and an optional partitioning key. A primary key is a column or a group of columns used to uniquely identify a row in a table. A partitioning key is simply a prefix of the primary key that forces a physical partitioning of data files. From here on, we will refer to the primary (and partitioning) key as the "key".
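To make the prefix relationship concrete, here is a toy sketch in plain Python (the column names and the `partition_rows` helper are made up for illustration, not a pyspiral API). A primary key of `(device_id, timestamp)` with a partitioning key of `(device_id,)` routes rows to physical partitions by the prefix, while the full key keeps each partition sorted:

```python
from collections import defaultdict

def partition_rows(rows, prefix_len):
    """Group (primary_key, value) rows by the partitioning key — the
    first prefix_len fields of the primary key — keeping each partition
    sorted by the full primary key."""
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[key[:prefix_len]].append((key, value))
    return {p: sorted(rs) for p, rs in partitions.items()}

rows = [
    (("cam-1", 2), "frame-b"),
    (("cam-2", 1), "frame-c"),
    (("cam-1", 1), "frame-a"),
]
parts = partition_rows(rows, prefix_len=1)
# All "cam-1" rows land in one partition, sorted by (device_id, timestamp).
```

Because the partitioning key is a prefix of the primary key, any scan over a contiguous key range touches a contiguous set of partitions.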
### Schema Tree & Column Groups

The table's schema forms a tree, with a column at each leaf. Sibling leaves are grouped together into a *column group*. At a basic level, column groups can be thought of as independent tables that are efficiently zipped together at read time.

![Column Tree](/img/column-tree.png)

## Storage Model

The storage model consists of key and column group tables, each partitioned and sorted by the key. At a basic level, a read operation performs an outer join on these tables.

![Wide Table](/img/wide-table.png)

Keys (and column groups) are stored as a log-structured merge (LSM) tree. This is a data structure that consists of sorted runs of data. Within each sorted run, all keys are sorted and unique. But between runs, keys may overlap. Periodic compaction of an LSM tree reads overlapping sorted runs, merges them together, and writes them back as a single non-overlapping run. This improves read performance, since overlapping keys would otherwise have to be merged at read time.

![LSM Tree](/img/lsm-tree.png)

In the context of keys, each sorted run represents a *key space*.

### Key Spaces

Key spaces store the actual key data, separately from the column group data. This separation allows us to reuse key spaces across multiple column groups and reduce the number of outer joins required at read time. Key spaces have an upper bound on the number of keys they can store, which depends on the level of the LSM tree.

Furthermore, each key space is content-addressed, allowing us to deduplicate identical runs of keys across multiple column groups. *(not yet implemented)*

Key spaces are stored as a partitioned Merkle hash trie, enabling us to quickly join dense runs of identical keys. *(not yet implemented; currently, key spaces are stored as a collection of key files)*

### Column Groups

Column groups are stored as partitions, called fragments, which are sorted by key and aligned to exactly one key space.
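The LSM compaction described above — merging overlapping sorted runs into a single run with unique, non-overlapping keys — can be sketched in a few lines of plain Python. This is a hypothetical illustration, not Spiral's implementation; in particular, the "newest run wins on duplicate keys" rule here is a common LSM convention assumed for the sketch:

```python
def compact(runs):
    """Merge sorted runs into one sorted run with unique keys.

    `runs` is ordered oldest-to-newest; each run is a sorted list of
    (key, value) pairs. Later runs overwrite earlier ones on duplicate
    keys (assumed last-write-wins semantics)."""
    merged = {}
    for run in runs:          # oldest first, so newer values overwrite
        for key, value in run:
            merged[key] = value
    return sorted(merged.items())

old = [(1, "a"), (3, "b"), (5, "c")]
new = [(3, "B"), (4, "d")]
compact([old, new])
# -> [(1, "a"), (3, "B"), (4, "d"), (5, "c")]
```

Without compaction, a reader would have to perform this same merge across every overlapping run on each scan; compaction pays that cost once, in the background.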
Fragments have an upper bound on the number of bytes they can store. The partitioning key is used to split data. A fragment's alignment to its key space, called a key mask, can be dense or sparse \[1], and is used to join the column group data with the key space data at read time. From here on, we will refer to a fragment with dense alignment as a dense fragment. *(sparse alignment is not yet implemented)*

![Fragments](/img/fragments.png)

Each write to a column group creates new fragments and a manifest — a listing of metadata for each new fragment. Metadata includes a file handle, the fragment's alignment to the key space, and additional fields such as column names and statistics used for pruning at read time.

Column group configuration, such as the latest schema and whether the schema is mutable, is stored in column group metadata. To support column rename and deletion, fragment columns are replaced with column IDs, and metadata stores the mapping. Additionally, metadata holds a listing of all active fragments, called a snapshot. Metadata must be atomically updated, but to support atomic updates across many column groups, we need a table-wide write-ahead log.

### Write-Ahead Log

The write-ahead log (WAL) is a table-wide log of atomic write operations. It contains a sequence of operations detailing the addition/removal of key spaces and fragments, as well as any configuration changes to the column groups. Similar to the column group snapshot, a listing of all active key spaces, called the key space snapshot, is stored in the WAL.

Periodically, the WAL is flushed. Key space addition/removal operations are flushed into the key space snapshot, and column group fragments and configuration are flushed into column group metadata.

### Compaction

Key space compaction is an LSM tree compaction. L0 runs are all potentially overlapping; all other levels have non-overlapping sorted runs.
*(there are only two levels currently)*

L1 initially starts with a key range of `b''` to `b'0xff'`, so compaction selects all L0 runs and merges them together. While writing, if the number of keys grows too large, a new key boundary is introduced and the sorted run is split in two. Subsequent compactions respect the new set of key ranges and include any existing L1 data alongside overlapping L0 data.

When a key space is compacted, a displacement map is produced. The displacement map describes how to map a fragment aligned to an origin key space to one or more fragments aligned to a destination key space. *(not yet implemented)*

L1 ranges determine the key boundaries that form the basis for column group compaction.

The goal of minor compaction is to reduce outer joins, but it does not rewrite fragments' data to achieve that. It uses displacement maps to re-align fragments to new, higher-level key spaces, purely as a metadata operation. Minor compaction might produce sparse fragments, or fragments that reuse a file handle but mask out some of the file's rows, called masked fragments. *(minor compaction is not yet implemented)*

The goal of major compaction is to maintain reasonably large and not-too-sparse fragments (in the context of alignment), and it will rewrite fragments to achieve that. It is similar to an LSM tree compaction, in that sorted runs are merged, split, and moved across levels, except that it uses the key space LSM tree structure to do so. This ensures that fragments are optimal both in the context of a single column group and of the entire table.

***

\[1] A fragment is densely aligned to a key space if and only if there are no key gaps in the fragment. For example, consider a key space with the keys 1, 3, 4, 5, 10 and two fragments: fragment one has data for keys 1, 3, 4, 5 and fragment two has data for keys 3, 5. Fragment one is densely aligned to the key space because, within the key range of its rows, every key has a row.
Fragment two is sparsely aligned because, within the key range of its rows, one key lacks a corresponding row: key 4.

---

# Anatomy of a Scan

Section: Reference
Source: https://docs.spiraldb.com/anatomy-of-scan

Before we dive into the details of table scans, let's remind ourselves of a few design decisions in the [table format](https://docs.spiraldb.com/format):

- [Column Groups](https://docs.spiraldb.com/format#column-groups) are co-partitioned sets of columns. A partition is called a fragment.
- Fragments are stored as [Vortex](https://vortex.dev/) files, sorted by primary key, but keys are stored separately.
- [Key Spaces](https://docs.spiraldb.com/format#key-spaces) store keys and are optimized for sort-merge joins (especially for dense runs of identical keys).
- Manifests store fragment metadata, including the alignment between a fragment and a key space.

![Fragments](/img/fragments.png)

These design decisions have the following implications for table scans:

- Metadata that is frequently used for filtering is partitioned separately from data columns that are usually just projected (for example, it's very rare to filter on audio bytes).
- Fragments are partitioned by size (optimized for object storage). In practice, fragments that store metadata have far more rows than fragments that store projected data.

## A look at the table

Our table contains audio data with the following schema:

```bash
Schema({
  audio_length=f64?,
  silence_ratio=f64?,
  audio={
    bytes=binary?,
    meta={
      size=u64?,
      e_tag=utf8?,
    }?
  }?
})
```

The table contains audio bytes in a column group called `audio`, and two "metadata" column groups: the root one, and `audio.meta` with some additional source-specific metadata.
A look at the manifests shows the following (truncated):

```bash
Key Space manifest
131 fragments, total: 60.6MB, avg: 473.5KB, metadata: 431.7KB
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ID         ┃ Size (Metadata) ┃ Format ┃ Key Span   ┃ Level ┃ Committed At                     ┃ Compacted At ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ m86rvtu9y6 │ 260.5KB (3.2KB) │ vortex │ 0..10000   │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ 0qaq87gz87 │ 30.4MB (19.6KB) │ vortex │ 0..1294000 │ L0    │ 2025-11-06 18:20:00.224537+00:00 │ N/A          │
│ dve28ce6z8 │ 247.1KB (3.1KB) │ vortex │ 0..10000   │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ hcjbwo31q5 │ 237.4KB (3.1KB) │ vortex │ 0..10000   │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ tzs9ssfgd7 │ 238.0KB (3.2KB) │ vortex │ 0..10000   │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │

Column Group manifest for table_sl6o0u
6 fragments, total: 113.9MB, avg: 19.0MB, metadata: 111.5KB
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ID         ┃ Size (Metadata) ┃ Format ┃ Key Span         ┃ Level ┃ Committed At                     ┃ Compacted At ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ pqjb420f8r │ 20.2MB (19.1KB) │ vortex │ 0..228261        │ L0    │ 2025-11-06 18:20:00.224537+00:00 │ N/A          │
│ vrv19qwg1i │ 20.2MB (19.1KB) │ vortex │ 228261..456522   │ L0    │ 2025-11-06 18:20:00.224537+00:00 │ N/A          │
│ 9lhiuacvoi │ 20.0MB (19.1KB) │ vortex │ 456522..684783   │ L0    │ 2025-11-06 18:20:00.224537+00:00 │ N/A          │
│ yq7dqeed9r │ 20.1MB (19.1KB) │ vortex │ 684783..913044   │ L0    │ 2025-11-06 18:20:00.224537+00:00 │ N/A          │
│ 2tbvh0v6td │ 20.1MB (19.1KB) │ vortex │ 913044..1141305  │ L0    │ 2025-11-06 18:20:00.224537+00:00 │ N/A          │

Column Group manifest for table_sl6o0u.audio
1165 fragments, total: 137.6GB, avg: 120.9MB, metadata: 2.7MB
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ID         ┃ Size (Metadata) ┃ Format ┃ Key Span    ┃ Level ┃ Committed At                     ┃ Compacted At ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ xn6rq15db4 │ 129.5MB (2.4KB) │ vortex │ 0..1177     │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ vn4d04pp2r │ 128.7MB (2.4KB) │ vortex │ 1177..2354  │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ yqgn69c1rh │ 126.2MB (2.4KB) │ vortex │ 2354..3531  │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ 9mmqc78utg │ 129.3MB (2.4KB) │ vortex │ 3531..4708  │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ ffsgw7yf5m │ 126.6MB (2.4KB) │ vortex │ 4708..5885  │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │

Column Group manifest for table_sl6o0u.audio.meta
130 fragments, total: 63.8MB, avg: 502.5KB, metadata: 891.4KB
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ID         ┃ Size (Metadata) ┃ Format ┃ Key Span ┃ Level ┃ Committed At                     ┃ Compacted At ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ lkbhpyhmhq │ 507.0KB (6.9KB) │ vortex │ 0..10000 │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ y052mol59q │ 514.9KB (6.9KB) │ vortex │ 0..10000 │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ w128y2udmu │ 504.6KB (6.9KB) │ vortex │ 0..10000 │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ v36ealhiys │ 502.9KB (6.9KB) │ vortex │ 0..10000 │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
│ a03enoo1j5 │ 502.7KB (6.9KB) │ vortex │ 0..10000 │ L0    │ 2025-11-06 20:10:41.359348+00:00 │ N/A          │
```

## A typical scan

A typical scan over this table might look like this:

```python
sp.scan(table["audio.bytes"], where=table["silence_ratio"] < 0.1)
```

When executing this scan, the Spiral client performs the following steps:

### Load Fragment Manifest(s)

The client identifies that
the scan involves two column groups:

- root, for filtering on `silence_ratio`
- `audio`, for projecting `audio.bytes`

The client scans the table's manifests to identify relevant fragments for both column groups. It determines that the `silence_ratio` filter can be pushed down into the root column group scan, and prunes manifests using the statistics metadata about fragments.

![Fragments](/img/scan-fragments.png)

### Scan Filtered Column Group(s)

The client scans the fragments of the column group involved in filtering (root). It applies the filter `silence_ratio < 0.1` and produces a row mask (and the projected columns, though in this case no columns are projected from this column group). In practice, this is a columnar file scan over lots of rows (very efficient!). The row mask indicates which rows satisfy the filter condition; combined with alignment metadata from the manifests, the client can determine which keys correspond to the filtered rows.

![Filtered](/img/scan-filtered.png)

### Join Needed Key Spaces

The client identifies the key spaces needed for the join between the filtered column group and the projected column group. It loads the key spaces, applies the row mask produced by filtering, and joins them together to produce a new row mask that indicates which value rows are needed from the projected column group.

![Keys](/img/scan-keys.png)

### Take Projected Column Group(s)

The client scans the fragments of the column group involved in projection, `audio` in this case. It applies the row mask obtained in the previous step to read only the needed rows from the projected column group.

![Take](/img/scan-take.png)

This is only possible because of the random access performance of Vortex files.

## About Performance

Scans are optimized for high throughput.

- Filters are pushed down into Vortex file scans over large numbers of rows (10-20x faster than Parquet!).
- Keys are joined efficiently using partitioned Merkle hash tries.
- Projections are evaluated as random access Vortex file reads (100x faster than Parquet!).

This last point means that each batch of rows out of a scan is expressed only as a masked Vortex array, enabling **zero-copy & zero-decompression** transfer of data from storage all the way to the end user. And since Vortex arrays can decompress on the GPU, this means that **data can be transferred directly into GPU memory!**

---

# Troubleshooting

Section: Reference
Source: https://docs.spiraldb.com/troubleshooting

A guide to troubleshooting common Spiral client-side issues.

## Exporting telemetry

Client-side [OTEL](https://opentelemetry.io/) telemetry data is exported automatically when telemetry is enabled. To export to your own collector instead of Spiral Cloud, set the `SPIRAL_OTEL_ENDPOINT` environment variable to your OTEL collector endpoint. Standard OTEL environment variables, such as `OTEL_EXPORTER_OTLP_HEADERS`, can be used to configure the OTEL exporter. To enable telemetry export, see [Telemetry](https://docs.spiraldb.com/config#telemetry).

### Enabling OpenTelemetry internal logs

By default, the OpenTelemetry SDK's internal diagnostic logs are suppressed. To enable them (e.g. to debug export failures), set the `RUST_LOG` environment variable to include the `opentelemetry` target:

```bash
RUST_LOG=warn,opentelemetry=debug
```

This will show detailed logs from the OpenTelemetry SDK, including export attempts, retries, and errors.

## Temporary workarounds

The following issues may occur and currently have temporary workarounds. We are actively working on permanent fixes. If a workaround does not work for you, please reach out to us.

- *"ArrowInvalid: offset overflow while concatenating arrays"*

  This error occurs on write when working with variable-length data (e.g., bytes, strings, lists) and the total size of the data exceeds the maximum allowed size for a single Arrow array.
  Spiral currently does not support writing Arrow chunked arrays, so all chunks are combined into a single array before writing. To resolve this, we recommend using "large" Arrow types (e.g. `pa.large_binary()`, `pa.large_string()`, `pa.large_list()`) when passing in Arrow arrays. Spiral uses [Vortex](https://vortex.dev/) as its storage and memory format. Unlike Arrow, Vortex has logical types and does not differentiate between "large" types.

- *"Too many open files"*

  This error occurs when the operating system limit for open file descriptors is reached. Default limits are usually too low for data-intensive applications. Use `ulimit -n` to check the current soft limit and `ulimit -Hn` to check the hard limit. Use `ulimit -n <new-limit>` to increase the limit for the session, or reach out to your system administrator to increase the hard limit. We recommend setting the limit to at least 65536, but do not exceed the hard limit.