# SpiralDB Documentation
> SpiralDB is a fast analytical database for storing and querying multimodal, multi-rate data streams — built on Vortex, optimized for AI workloads on physical-world data.

Source: https://docs.spiraldb.com

---
# Welcome to Spiral
Section: Overview
Source: https://docs.spiraldb.com/

SpiralDB is an extremely fast analytical database designed for storing and querying multimodal, multi-rate data streams — the kind of data produced by sensors, cameras, and other instruments in fields like robotics, biology, autonomous systems, and other domains at the intersection of AI and the physical world.

It is built by the creators of [Vortex](https://vortex.dev/), the open-source columnar file format, and uses Vortex as its storage foundation. This gives SpiralDB best-in-class compression and scan performance out of the box.

The core abstraction is Collections: hierarchical, tree-structured groups of records (similar to nested JSON arrays) backed by sorted columnar storage. They enable:

- **Multimodal storage with isolated column groups.** Columns are stored independently, so tables can be arbitrarily wide — mixing video frames, embeddings, tensors, and metadata — without penalizing queries that only touch a few columns.
- **Multi-rate alignment.** Different sensors record at different rates (a camera at 60 Hz, a robotic arm at 500 Hz). SpiralDB provides a streaming resample primitive that aligns sibling data streams to a shared timeline, so you can combine them in a single query.
- **Free column appends.** Adding derived data like labels, annotations, or embeddings doesn't require rewriting existing columns. New columns are appended independently through enrichment.

Additionally, SpiralDB supports **GPU-optimized data loading** by streaming data directly from object storage to the GPU, bypassing the traditional CPU-bound staging bottleneck. It integrates with PyTorch via a custom DataLoader with built-in sharding, shuffling, checkpointing, and multi-threaded Rust-based I/O.

## Who is SpiralDB for?

SpiralDB is built for teams training AI models on data from the physical world — any domain where you're capturing observations from sensors, instruments, or simulations and need to turn them into training-ready datasets.

This includes teams working in robotics, autonomous vehicles, biology, neuroscience, organic chemistry, materials science, and similar fields. The common thread is data that is both multimodal (video, audio, point clouds, scalar telemetry, embeddings) and temporal (recorded over time, often at different rates across sources).

If your workflow involves ingesting heterogeneous sensor data, aligning streams across time resolutions, deriving new features like labels or embeddings, and then loading the result into GPUs at high throughput — SpiralDB is designed for exactly that pipeline.

## Use Cases

- [**Feature Engineering**](https://docs.spiraldb.com/usecases/media-tables): Large-scale multimodal datasets.
- [**Data Loaders**](https://docs.spiraldb.com/usecases/data-loaders): High-throughput data loading for machine learning.

## Dive Deeper

- [**Spiral Tables**](https://docs.spiraldb.com/spiral-tables): Store, analyze and query massive and/or multimodal datasets.
- [**File Systems**](https://docs.spiraldb.com/filesystems): Learn how to configure and manage file systems in Spiral.
- [**Access and Permissions**](https://docs.spiraldb.com/authorization): Manage user permissions and access control.

## Getting Started

Ready to take Spiral for a spin?

Use `pip` (or your favorite package manager) to install the Python client library and CLI.

```bash copy
pip install pyspiral
```

```bash copy
uv add pyspiral
```

---
# Organizations
Section: Concepts
Source: https://docs.spiraldb.com/organizations

Organizations are the top-level administrative unit in Spiral. They are used to configure single
sign-on, billing, audit logging, and other settings that apply to all users within the organization.

Since resources belong to projects, and all projects must belong to an organization, each user must
be a member of at least one organization in order to use Spiral.

## Organization Roles

Each member of an organization is given one of three roles:

- **Owner**: The owner has full control over the organization.
- **Member**: Members cannot configure the organization, but are able to create projects.
- **Guest**: Guests can only access projects they have been invited to, and cannot create new ones.

You can check your role within an organization using the Spiral CLI:

```bash copy
spiral orgs ls
```

```bash
                     Organizations
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃    ┃ id                             ┃ name   ┃ role  ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ 👉 │ org_01J9KF5Y2CB6DFK9YMBKG7S0Q4 │ Spiral │ owner │
└────┴────────────────────────────────┴────────┴───────┘
```

The 👉 symbol indicates the organization you are currently logged into. All administrative commands will apply to
this organization. You can switch organization using the Spiral CLI:

```bash copy
spiral orgs switch org_01J9KF5Y2CB6DFK9YMBKG7S0Q4
```

You can decide the role of a user when you [invite](#manual-invitation) them to your organization, or by configuring
default role mappings with [directory sync](#directory-sync).

## User Management

There are three ways to manage users within your organization (in order of recommendation): SSO, domain verification,
and manual invitation.

### Single Sign-On (SSO)

Single sign-on (SSO) is the recommended way to manage organization membership. This allows you to automatically add
and remove users from your organization based on your existing identity provider.

You can configure SSO for your organization using the Spiral CLI:

```bash copy
spiral orgs sso
```

See also the [directory sync](https://docs.spiraldb.com/organizations#directory-sync) feature for more advanced group-based configuration.

### Domain Verification

If you are unable to use SSO, you can verify your domain using a DNS record. This will automatically
add to your organization any user that signs up with an email address from your domain.

```bash copy
spiral orgs domains
```

### Manual Invitation

Finally, you can manually invite users to your organization using the Spiral CLI:

```bash copy
spiral orgs invite <email> --role <role>
```

## Groups and Teams

Groups and teams allow you to grant permissions to sets of users at a time.

Groups and teams are not yet supported. Please contact us if you are interested in this feature.

Groups are created automatically via [directory sync](https://docs.spiraldb.com/organizations#directory-sync), a feature that replicates group structures from
an external directory provider such as *Active Directory* or *Google Workspace*.

Teams are functionally identical to groups, however they are managed manually by an organization owner.

### Directory Sync

Coming soon! Please contact support if you would like to enable this feature.

### Managing Teams

Coming soon! Please contact support if you would like to enable this feature.

---
# Projects
Section: Concepts
Source: https://docs.spiraldb.com/projects

In Spiral, projects are the fundamental unit for structuring resources, access controls, and usage tracking.

Projects can be managed either through the Spiral CLI or the PySpiral client.

```bash copy
spiral projects --help
```

## Creating a Project

Projects are created within an organization. Each project has a globally unique and discoverable ID that is used to
identify it across the Spiral platform. You can also configure a description to provide context only to those users
with permission to view the project.

To create a new project, use the Spiral CLI:

```bash copy
spiral projects create --description "My First Project"
```

## Listing Projects

You can list all projects visible to you using the Spiral CLI:

```bash copy
spiral projects ls
```

## Dropping a Project

You can drop a project using the Spiral CLI:

```bash copy
spiral projects drop blazing-parakeet-036267
```

Note that dropping a project does not delete data from storage. To permanently delete project data, please contact support.

```bash
                                   Projects
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ id                      ┃ organization_id                ┃ name             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ blazing-parakeet-036267 │ org_01J9KF5Y2CB6DFK9YMBKG7S0Q4 │ My First Project │
└─────────────────────────┴────────────────────────────────┴──────────────────┘
```

## Permissions

Projects can be shared with other users, teams, organizations etc. using **grants**.
For more information on grants and principals, see the documentation on [authorization](https://docs.spiraldb.com/authorization).
For more information on access across systems, see the documentation on [OIDC](https://docs.spiraldb.com/oidc).

You can list all grants for a project using the Spiral CLI:

```bash copy
spiral projects grants blazing-parakeet-036267
```

```bash
                                                                Project Grants
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ id                          ┃ project_id              ┃ role_id ┃ principal                                                                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ grant_Y7DH5asdM33eMR7leXtXx │ blazing-parakeet-036267 │ admin   │ /org/org_01J9KF5Y2CB6DFK9YMBKG7S0Q4/user/user_01JAYY9RA8AR8ZCC93VJD7S71Z │
│ grant_zwZdWVcBUBluRIhHebsbG │ blazing-parakeet-036267 │ viewer  │ /org/org_01J9KF5Y2CB6DFK9YMBKG7S0Q4/role/guest                           │
│ grant_4SyXL7hRsS8jgc0eoc6KU │ blazing-parakeet-036267 │ viewer  │ /org/org_01J9KF5Y2CB6DFK9YMBKG7S0Q4/role/member                          │
│ grant_ASyW3OFWhw943e2So96ZZ │ blazing-parakeet-036267 │ admin   │ /org/org_01J9KF5Y2CB6DFK9YMBKG7S0Q4/role/owner                           │
└─────────────────────────────┴─────────────────────────┴─────────┴──────────────────────────────────────────────────────────────────────────┘
```

You can grant a role on a project to a user using the Spiral CLI:

```bash copy
spiral projects grant blazing-parakeet-036267 --role viewer --org-id org_01J9KF5Y2CB6DFK9YMBKG7S0Q4 --org-user user_01JAYY9RA8AR8ZCC93VJD7S71Z
```

You can grant a role on a project to all users with a specific role in an organization using the Spiral CLI:

```bash copy
spiral projects grant blazing-parakeet-036267 --role viewer --org-id org_01J9KF5Y2CB6DFK9YMBKG7S0Q4 --org-role member
```

For more details on managing grants, such as authorizing compute jobs in AWS, GCP, or Modal to access project resources,
see [grant](https://docs.spiraldb.com/cli#spiral-projects-grant) command.

---
# Authorization
Section: Concepts
Source: https://docs.spiraldb.com/authorization

Spiral uses a role-based access control (RBAC) model for authorization, where currently the only unit of
permissioning is the project.

In this model, **principals** (users, teams, workloads) are granted **roles** on **projects**.

A role confers a set of **permissions** on the project, which are used to determine whether a principal is allowed to
perform a given action.

## Roles

The roles currently defined in Spiral are:

- `admin` - full access to the project
- `editor` - read/write access to the project
- `viewer` - read-only access to the project

These roles expand to resource-specific permissions, such as `table:read` and `table:write`.

Please reach out to us if you have specific requirements for custom roles.

## Principals

Principals are entities that can be granted roles on projects. Currently, the following types of principals are
supported:

- **Org-scoped Users** - an individual org/user pair
- **Teams** - groups of users as defined manually in Spiral
- **Groups** - groups of users as synchronized from an external identity provider
- **Organizations** - a team implicitly defined as all members of an organization
- **Workloads** - non-human entities. Workloads authenticate via OIDC, AWS IAM, or client credentials.
  See [Workloads](#workloads) below and the [OIDC Integrations](https://docs.spiraldb.com/oidc) page for setup instructions.

## Grants

Grants are the association of a principal with a role on a project. They are the mechanism by which permissions are
conferred. Grants can be listed, created, updated, and deleted using the Spiral CLI.
To create a grant, see [grant](https://docs.spiraldb.com/cli#spiral-projects-grant) command.

## Workloads

Workloads are non-human principals in Spiral — they represent services, jobs, or automations that need to access data.
Workloads are managed via the `spiral workloads` set of CLI commands.

If workloads write to resources in a single project, it is recommended that they be placed in that project.
If workloads modify multiple projects, it is recommended to keep them in a separate project from all resources.

An environment can authenticate as workload using either a **policy** or **client credentials**.

### Policies

A workload policy maps an external identity to a Spiral workload. This is the recommended approach — it avoids
managing static credentials and integrates with your existing identity provider.

A policy maps claims from your identity provider's tokens to the Spiral workload.
The policy specifies conditions (key=value pairs) that must match the token's claims.

To use a workload policy, follow these steps:

1. **Create a workload** — `spiral workloads create`.
2. **Create a workload policy** — `spiral workloads create-policy`.
3. **Grant the workload a role** — `spiral projects grant`.
4. **Set `SPIRAL_WORKLOAD_ID`** in the environment where you use the pyspiral library.

### OpenID Connect

[OIDC](https://auth0.com/docs/authenticate/protocols/openid-connect-protocol) is a protocol for authenticating entities between systems.
We strongly recommend using OIDC for workload authentication — it is easier to set up and more secure than managing static credentials.

Spiral integrates with several OIDC providers. Check out the [OIDC Integrations](https://docs.spiraldb.com/oidc) page for available
integrations and how to set them up.

### Client Credentials

When external identity is not available (e.g. the environment does not provide OIDC tokens), you can issue
client credentials instead and store them as secrets in your environment. This is a less secure approach,
but it can be used as a fallback.

To use client credentials, follow these steps:

1. **Create a workload** — `spiral workloads create`.
2. **Issue credentials** — `spiral workloads issue-creds`.
3. **Grant the workload a role** — `spiral projects grant`.
4. **Set `SPIRAL_CLIENT_ID` and `SPIRAL_CLIENT_SECRET`** in the environment where you use the pyspiral library.

---
# Security
Section: Concepts
Source: https://docs.spiraldb.com/security

Spiral is designed with security as a core principle, implementing defense-in-depth strategies across
authentication, authorization, encryption, and data protection.

## Architecture

Spiral Tables are designed in such way that your data never leaves your VPC. The only exception are key columns.
Spiral stores key schema for each table, which **includes key column names**, and tracks [fragments](https://docs.spiraldb.com/format)
which **include key column stats**, such as min/max values. PySpiral, Spiral client library, runs in your VPC and
can access your bucket, but it is not authorized to read from the bucket. Spiral, control plane, provides temporary
authorization for each read / write operation to PySpiral via signed URLs.

![VPC](/img/architecture-vpc.png)

## Authentication

Spiral supports multiple authentication mechanisms to integrate with your existing infrastructure.

### OIDC Integrations

Spiral integrates with several OIDC (OpenID Connect) identity providers for seamless workload authentication:

- **GCP** - Google Cloud service accounts
- **Modal** - Serverless platform integration
- **AWS** - AWS IAM roles (coming soon)

See [OIDC Integrations](https://docs.spiraldb.com/oidc) for detailed setup instructions.

### SSO (Single Sign-On)

Enterprise organizations can configure SSO through WorkOS, enabling:

- SAML and OIDC-based enterprise SSO
- Domain-verified automatic user provisioning
- Centralized identity management

## Authorization

Spiral implements role-based access control (RBAC) with fine-grained permissions.
See [Authorization](https://docs.spiraldb.com/authorization) for complete details.

### Hierarchy

Access is controlled at two levels:

1. **Organization level** - Manages membership and project creation
2. **Project level** - Controls access to tables, indexes, and file systems

### Roles

| Role       | Project Access                         |
| ---------- | -------------------------------------- |
| **Admin**  | Full access including grant management |
| **Editor** | Read/write access to all resources     |
| **Viewer** | Read-only access                       |

### Principals

Grants can be assigned to various principal types:

- Users and user groups
- Organizations and teams
- Workloads (service accounts, GitHub Actions, Modal apps)

Use OIDC-based workload authentication instead of static API keys whenever possible.
This eliminates the need to manage and rotate secrets.

## Encryption

### Encryption in Transit

All communication with Spiral is encrypted using TLS 1.2 or higher:

- **HTTPS only** - All API endpoints require TLS
- **Modern cipher suites** - Using industry-standard encryption
- **Certificate validation** - Web PKI root certificate verification

### Encryption at Rest

Data stored in Spiral is protected at rest:

- **Object storage encryption** - Data files are encrypted by the underlying cloud storage (S3, GCS, Azure Blob)
- **Database encryption** - Metadata is encrypted at rest in the database layer

## Data Protection

### Multi-Tenant Isolation

Spiral provides strong isolation between tenants:

- **Organization boundaries** - Data is isolated at the organization level
- **Project scoping** - All operations are scoped to authorized projects

## Network Security

### API Security

- **Rate limiting** - Protection against abuse
- **Request validation** - All inputs are validated before processing
- **CORS configuration** - Controlled cross-origin access

## Compliance

Spiral is designed to help you meet compliance requirements:

- **Data residency** - Control where your data is stored
- **Access controls** - Fine-grained RBAC for compliance with least-privilege principles
- **Audit trails** - Logging for compliance and forensics

Contact us for specific compliance requirements including SOC 2, HIPAA, or GDPR.

## Security Best Practices

### For Administrators

1. **Use SSO** - Enable SSO for your organization when possible
2. **Principle of least privilege** - Grant only the minimum necessary permissions
3. **Regular access reviews** - Periodically review and revoke unnecessary grants
4. **Use workload identity** - Prefer OIDC over static credentials

### For Developers

1. **Use OIDC authentication** - Avoid storing API keys in code or configuration
2. **Scope access appropriately** - Request only the permissions your workload needs
3. **Rotate credentials** - If using API keys, rotate them regularly
4. **Secure your environment** - Protect the environment where Spiral credentials are used

## Reporting Security Issues

If you discover a security vulnerability in Spiral, please report it responsibly:

- Email: [security@spiraldb.com](mailto:security@spiraldb.com)
- Include details about the vulnerability and steps to reproduce

We take all security reports seriously and will respond promptly.

---
# File Systems
Section: Concepts
Source: https://docs.spiraldb.com/filesystems

Many Spiral resources make use of object storage. Spiral File Systems provide a means to securely federate and
accelerate access to underlying object storage.

As with all resources in Spiral, File Systems are scoped to a project. However, unlike many other resource types,
each project may only have a single file system. This file system is often referred to as the project's **default**
file system. By default, any resource that requires a file system will use the project's default file system, unless
otherwise specified.

## Configuration

You can see a project's file system configuration using the Spiral CLI

```bash copy
spiral fs show <project_id>
```

When resources like *Tables* are created they configure a prefix in a file system. File systems can only be
reconfigured when they have zero used prefixes.

## Built-in Providers

Spiral supports several built-in file system providers.

The full set of providers can be listed using the CLI:

```bash copy
spiral fs list-providers
```

Configure a project's file system provider:

```bash copy
spiral fs update <project_id> --provider <provider>
```

If you require a specific provider, please reach out to us, and we should be able to add it for you.

## Bring Your Own Bucket

If you have an existing bucket that you would like to use as a Spiral File System, you can update the project's default
file system to use it. See specific provider instructions below for how to configure the bucket and permissions.

Use the CLI to update default file system:

```bash copy
spiral fs update --help
```

We recommend that you use a dedicated project to register your own bucket as default file system, i.e. avoid
creating any resources in this project. This allows you to separate bucket management permissions from resource
management permissions. Projects with resources can be configured to use the "bucket" project as their default
file system using:

```bash copy
spiral fs update <project_id> --upstream <bucket_project_id>
```

### AWS S3

To configure an S3 bucket for Spiral, create an IAM policy first.

```json
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "AllowListBucketAll",
			"Effect": "Allow",
			"Action": "s3:ListBucket*",
			"Resource": [
				"arn:aws:s3:::{your-bucket-name}"
			]
		},
		{
			"Sid": "AllowObjectSome",
			"Effect": "Allow",
			"Action": [
				"s3:*Object",
				"s3:*ObjectVersion",
				"s3:AbortMultipartUpload",
				"s3:ListMultipartUploadParts"
			],
			"Resource": [
				"arn:aws:s3:::{your-bucket-name}/*"
			]
		}
	]
}
```

Next, you have to allow Spiral to assume a role with this policy. The easiest and most secure way to do this
(avoids any long lived tokens!), is to allow Spiral's GCP identity to assume the role.

Create an IAM role with a following trust policy and attach the above object storage access policy to it.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "accounts.google.com"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "accounts.google.com:sub": "116500466089430548312",
                    "accounts.google.com:aud": "116500466089430548312",
                    "accounts.google.com:oaud": "https://iss.spiraldb.com"
                }
            }
        }
    ]
}
```

`116500466089430548312` is a unique identifier for Spiral's service account ([spiraldb-filesystems@pyspiral-dev.iam.gserviceaccount.com](mailto:spiraldb-filesystems@pyspiral-dev.iam.gserviceaccount.com)) currently running in GCP.
Spiral exchanges a short-lived GCP identity token for AWS temporary credentials. See [AWS docs](https://aws.amazon.com/blogs/security/access-aws-using-a-google-cloud-platform-native-workload-identity/) for more details on this approach.

Use the ARN of the created role to update project's file system:

```bash copy
spiral fs update --type s3 --bucket <your-bucket-name> --region <your-bucket-region> --role-arn <your-role-arn>
```

### GCP GCS

To configure a GCS bucket for Spiral, grant `Storage Object User` and `Storage Object Viewer` roles on the bucket to
Spiral service account `spiraldb-filesystems@pyspiral-dev.iam.gserviceaccount.com`.

Use the CLI to update project's file system:

```bash copy
spiral fs update --type gcs --bucket <your-bucket-name> --region <your-bucket-region>
```

### S3-compatible

S3-compatible providers (e.g. MinIO, Tigris, R2) are supported using access and secret keys.

Use the CLI to update project's file system:

```bash copy
spiral fs update --type s3like --endpoint <your-provider-endpoint> --bucket <your-bucket-name> --region <your-provider-region>
```

Secrets are stored in [WorkOS Vault](https://workos.com/vault). Use the CLI if you want to [BYOKs](https://workos.com/docs/vault/byok):

```bash copy
spiral orgs keys
```

---
# Spiral Tables
Section: Concepts
Source: https://docs.spiraldb.com/spiral-tables

Spiral Tables a powerful and flexible way for storing, analyzing, and querying massive and/or multimodal datasets.
The data model will feel familiar to users of SQL- or DataFrame-style systems, yet is designed to be more flexible,
more powerful, and more useful in the context of modern data processing. Tables are stored and queried directly from
**object storage**.

## Key Features

- **Sufficiently Schemaless:** Spiral supports complex data with nested relationships using column groups. Tables are sparse and support column appends **without** rewriting existing rows.
- **High-throughput Scanning**: Saturate the network bandwidth with an optimized high-throughput scanning.
- **Scalable Storage with Flexible Cost-Performance**: Based on log-structured merge (LSM) trees and lakehouse architecture for efficient data storage and retrieval, with acceleration layer when performance is critical.
- **Python-centric Data Access Layer**: Retrieve data in powerful columnar or row-based formats like PyArrow, Pandas, Polars, Dask, PyTorch and more with intuitive projection and filtering syntax.
- **Cell Push-down:** Keep large values like images, audio, and video seamlessly integrated with your tabular data. Filter rows and read only the parts you need with cell-level filtering. No need for separate storage systems.

Spiral Tables is **built with [Vortex](https://docs.vortex.dev/)**, our SOTA file format and columnar data toolkit. Vortex’s random access performance advancements enable search and indexing features (soon)!

## Dive Deeper

- [**PySpiral**](https://docs.spiraldb.com/python-api/spiral): Learn how to interact with Spiral Tables using the Python client library.
- [**Data Model**](https://docs.spiraldb.com/tables/data-model): Understand Spiral's flexible and powerful data model.
- [**Best Practices**](https://docs.spiraldb.com/tables/best-practices): Tips and tricks for optimizing your use of Spiral Tables.

---
# Media Tables
Section: Use Cases
Source: https://docs.spiraldb.com/usecases/media-tables

[Data Model](https://docs.spiraldb.com/tables/data-model) of [Spiral Tables](https://docs.spiraldb.com/spiral-tables#key-features)
makes them an ideal choice for storing and managing large-scale multimodal datasets (including images, videos, audio files, text, transformer-based embeddings and more).
Tables are **designed to store both data and traditional metadata** (labels, annotations, complex nested data and more) **in a single model**.

Let's build a table of crawled image data.

## Start from Metadata

First, create a table with a composite primary key made up of `page_url` and `key` columns. Spiral Tables have
primary keys that can consist of multiple columns, and are used as both sort order as well as to support updates
and zero-copy appends. In this case, `key` represents a unique identifier for each media item on the page identified by `page_url`.

```python
from spiral import Spiral
import pyarrow as pa

sp = Spiral()
project = sp.project("example")

table = project.create_table(
    "media-table",
    key_schema=pa.schema({"page_url": pa.string(), "key": pa.int64()})
)
```

While it's possible to write raw data (images, audio, video, etc.) directly into Spiral Tables, it's much more common
to start from metadata such as URLs or S3 paths, and then use [Enrichment](https://docs.spiraldb.com/tables/enrich) to fetch the actual media data.

Let's skip crawling web pages as it is not specific to Spiral, and explore already ingested metadata. This metadata
is based on [filtered-wit](https://huggingface.co/datasets/laion/filtered-wit) dataset.

```python
>>> table.schema()
```

```python
pa.schema({
    "page_url": pa.string(),
    "key": pa.int64(),
    "page_title": pa.string(),
    "section_title": pa.string(),
    "caption": pa.string(),
    "caption_attribution_description": pa.string(),
    "url": pa.string(),
    "context": pa.struct({
        "page_description": pa.string(),
        "section_description": pa.string()
    })
})
```

In our table, `context` is a [Column Group](https://docs.spiraldb.com/tables/data-model#column-groups). While Spiral Tables don't require
up-front schema design, check out [Best Practices](https://docs.spiraldb.com/tables/best-practices#column-groups) when it comes to splitting
data into column groups for optimal performance. In this case, contextual metadata is large text that is rarely filtered
on so it makes sense to group it separately.

Let's use Polars to look at some sample data.

```python
table.to_polars_lazy_frame().head(100)
```

![Media Table](/img/media-tables-tbl.png)

Let's use Spiral CLI to explore the structure of the table (the output is truncated).

```bash copy
spiral tables manifests --project example --table media-table
```

```
Key Space manifest
131 fragments, total: 60.6MB, avg: 473.5KB, metadata: 431.7KB
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ID         ┃ Size (Metadata) ┃ Format ┃  Key Span  ┃ Level ┃           Committed At           ┃ Compacted At ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ 0qaq87gz87 │ 30.4MB (19.6KB) │ vortex │ 0..1294000 │  L0   │ 2025-11-06 18:20:00.224537+00:00 │     N/A      │
```

```
Column Group manifest for table_sl6o0u
6 fragments, total: 113.9MB, avg: 19.0MB, metadata: 111.5KB
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ID         ┃ Size (Metadata) ┃ Format ┃     Key Span     ┃ Level ┃           Committed At           ┃ Compacted At ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ pqjb420f8r │ 20.2MB (19.1KB) │ vortex │    0..228261     │  L0   │ 2025-11-06 18:20:00.224537+00:00 │     N/A      │
│ vrv19qwg1i │ 20.2MB (19.1KB) │ vortex │  228261..456522  │  L0   │ 2025-11-06 18:20:00.224537+00:00 │     N/A      │
│ 9lhiuacvoi │ 20.0MB (19.1KB) │ vortex │  456522..684783  │  L0   │ 2025-11-06 18:20:00.224537+00:00 │     N/A      │
│ yq7dqeed9r │ 20.1MB (19.1KB) │ vortex │  684783..913044  │  L0   │ 2025-11-06 18:20:00.224537+00:00 │     N/A      │
│ 2tbvh0v6td │ 20.1MB (19.1KB) │ vortex │ 913044..1141305  │  L0   │ 2025-11-06 18:20:00.224537+00:00 │     N/A      │
│ d4xlsgo7ff │ 13.3MB (16.1KB) │ vortex │ 1141305..1294000 │  L0   │ 2025-11-06 18:20:00.224537+00:00 │     N/A      │
```

```
Column Group manifest for table_sl6o0u.context
11 fragments, total: 2.6GB, avg: 120.9MB, metadata: 2.7MB
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ID         ┃ Size (Metadata) ┃ Format ┃  Key Span   ┃ Level ┃           Committed At           ┃ Compacted At ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ xn6rq15db4 │ 129.5MB (2.4KB) │ vortex │   0..1177   │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ vn4d04pp2r │ 128.7MB (2.4KB) │ vortex │ 1177..2354  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ yqgn69c1rh │ 126.2MB (2.4KB) │ vortex │ 2354..3531  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ 9mmqc78utg │ 129.3MB (2.4KB) │ vortex │ 3531..4708  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ ffsgw7yf5m │ 126.6MB (2.4KB) │ vortex │ 4708..5885  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ 3319x978du │ 127.5MB (2.4KB) │ vortex │ 5885..7062  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
```

## Enrich with Data

Our metadata table has URLs pointing to images. Let's use [Table Enrichment](https://docs.spiraldb.com/tables/enrich) to fetch the images and store them
directly into the table. Using [Expressions](https://docs.spiraldb.com/tables/expressions) we can define how to derive new columns (like `image`)
from existing columns (like `url`).

```python
from spiral import expressions as se

enrichment = table.enrich(
    {
        "image": se.http.get(table["url"])
    }
)
```

There are different ways to run the enrichment, but in this case let's execute it in a streaming fashion since it is the simplest.

```python
enrichment.run()
```

`se.http.get` fetches the image data from the URL as well as some useful metadata. It creates two columns groups,
`image` (with the raw image bytes in `bytes` column) and `image.meta` (with metadata such as status code).
Our table now looks like this:

```python
table.schema()
```

```python
pa.schema({
    "page_url": pa.string(),
    "key": pa.int64(),
    "page_title": pa.string(),
    "section_title": pa.string(),
    "caption": pa.string(),
    "caption_attribution_description": pa.string(),
    "url": pa.string(),
    "context": pa.struct({
        "page_description": pa.string(),
        "section_description": pa.string()
    }),
    "image": pa.struct({
        "bytes": pa.binary(),
        "meta": pa.struct({
            "location": pa.string(),
            "last_modified": pa.int64(),
            "size": pa.uint64(),
            "e_tag": pa.string(),
            "version": pa.string(),
            "status_code": pa.uint16()
        })
    })
})
```

Let's use Spiral CLI to explore the updated structure of the table (the output is truncated).
We expect to see two new column groups: `image` and `image.meta`.

```bash copy
spiral tables manifests --project example --table media-table
```

```
Column Group manifest for table_sl6o0u.image
1165 fragments, total: 137.6GB, avg: 120.9MB, metadata: 2.7MB
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ID         ┃ Size (Metadata) ┃ Format ┃  Key Span   ┃ Level ┃           Committed At           ┃ Compacted At ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ xn6rq15db4 │ 129.5MB (2.4KB) │ vortex │   0..1177   │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ vn4d04pp2r │ 128.7MB (2.4KB) │ vortex │ 1177..2354  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ yqgn69c1rh │ 126.2MB (2.4KB) │ vortex │ 2354..3531  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ 9mmqc78utg │ 129.3MB (2.4KB) │ vortex │ 3531..4708  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ ffsgw7yf5m │ 126.6MB (2.4KB) │ vortex │ 4708..5885  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ 3319x978du │ 127.5MB (2.4KB) │ vortex │ 5885..7062  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
```

```
Column Group manifest for table_sl6o0u.image.meta
130 fragments, total: 1.5MB, avg: 12.0KB, metadata: 977.8KB
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ID         ┃ Size (Metadata) ┃ Format ┃ Key Span ┃ Level ┃           Committed At           ┃ Compacted At ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ cyjkjdaezp │  11.8KB (7.5KB) │ vortex │ 0..10000 │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ cmcb44n6z6 │  12.2KB (7.5KB) │ vortex │ 0..10000 │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ t5vovxqks2 │  12.3KB (7.6KB) │ vortex │ 0..10000 │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ gof24kc956 │  12.2KB (7.5KB) │ vortex │ 0..10000 │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ x6ujppiw06 │  11.8KB (7.5KB) │ vortex │ 0..10000 │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ r0gesz2qpe │  12.1KB (7.6KB) │ vortex │ 0..10000 │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
```

## Explore Tables

The entry point for reading data from Spiral Tables is [scan](https://docs.spiraldb.com/python-api/spiral#scan) and [Scan Object](https://docs.spiraldb.com/tables/scan#scan-object).

Let's explore failures in our enrichment by scanning the table and filtering on `image.meta.status_code`.

```python
scan = sp.scan(
    table[["url", "image.meta.status_code"]],
    where=table["image.meta.status_code"] != pa.scalar(200, pa.uint16()),
)
```

Let's check if the result is what we expect.

```python
scan.schema()
```

```python
pa.schema({
    "url": pa.string(),
    "image": pa.struct({
        "meta": pa.struct({
            "status_code": pa.uint16(),
        }),
    }),
})
```

Let's execute the scan and check the results.

```python
scan.to_polars()
```

![Failed Enrichments](/img/media-tables-meta.png)

Nice! Only a few failed enrichments, and most of them are `404 Not Found` errors as expected.

Now let's get images from a specific page URL.

```python
sp.scan(
    table["image"][["bytes"]],
    where=(table["page_url"] == "https://en.wikipedia.org/wiki/Ballona_Creek")
).to_table().to_pydict()
```

```python
{'bytes': [b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x03G\x00\x00\x02h\x08\x03\x00\x00\x00\x9a\xe9j\xe6\x00\x00\x03\x00PLTE\xff\xff\xff\xde\xde\xdeBBB::B!1)B:)RRR\x9c\x9c\x94\xce\xce\xc5\xbd\xbd\xd6\xbd\xbd\xbd\x9c\xa5\xb5\xbd\xc5\xc5\xad\xa...
```

## High Throughput Scans

Spiral Tables are designed for high-throughput scans.

Getting a stream of record batches from any scan is as simple as this:

```python
sp.scan(table["image"][["bytes"]]).to_record_batches()
```

Checkout the [API Reference](https://docs.spiraldb.com/python-api/spiral#to_record_batches) for more details on how to customize scans.

See also:

- [Data Loaders](https://docs.spiraldb.com/usecases/data-loaders)
- [Distributed Compute](https://docs.spiraldb.com/tables/distributed)

---
# Data Loaders
Section: Use Cases
Source: https://docs.spiraldb.com/usecases/data-loaders

Low Data Loader Efficiency means ***fewer iterations on the model***. Even the best teams spend tremendous amounts of
human capital building, maintaining and adapting custom solutions - and their ***GPU still sits idle***.

The state of the art data loading solutions are complex, brittle and hard to maintain.

![Training Data](/img/data-loading-sota.png)

Spiral is designed to deliver maximal throughput and elasticity enabling efficient training cycles.

![Data Loader](/img/data-loading-spiral.png)

## Training on (Live) Tables

Tables offer a powerful and flexible way for storing dynamic massive and/or multimodal datasets, and an easy way to
obtain high throughput production-ready data loader. [Scan](https://docs.spiraldb.com/tables/scan#scan-object) object integrates with many training frameworks.

We recommend starting with a `Scan` object which provides a simple interface for getting a PyTorch-compatible `DataLoader`.

```python
from spiral import Spiral, SpiralDataLoader
from spiral.demo import images

sp = Spiral()
table = images(sp)
scan = sp.scan(table)

data_loader: SpiralDataLoader = scan.to_data_loader(seed=42, batch_size=32)
```

Unlike PyTorch's `DataLoader` which uses multiprocessing for I/O (num\_workers), `SpiralDataLoader` leverages Spiral's
efficient Rust-based streaming and only uses multiprocessing for CPU-bound post-processing transforms.

Key differences from PyTorch's `DataLoader`:

- No num\_workers for I/O (Spiral's Rust layer is already multi-threaded)
- `map_workers` for parallel post-processing (tokenization, decoding, etc.)
- Explicit shard-based architecture for distributed training
- Built-in checkpoint

CPU-bound transforms such as tokenization can be parallelized, similar to `num_workers` in PyTorch, while remaining
compatible with Spiral's efficient I/O.

```python
import pyarrow as pa

# CPU-bound batch transform function
def tokenizer_fn(rb: pa.RecordBatch):
    # tokenize the rb
    return rb

data_loader: SpiralDataLoader = scan.to_data_loader(transform_fn=tokenizer_fn, map_workers=8)
```

Built-in checkpointing allows resuming training from the last processed record in case of interruptions.

```python
loader = scan.to_data_loader(batch_size=2, seed=42)
for i, batch in enumerate(loader):
    if i == 3:
        # Save checkpoint
        checkpoint = loader.state_dict()
        # ...

# Resume from checkpoint
loader = scan.resume_data_loader(checkpoint, batch_size=32)
```

A lower-level interface such as Torch-compatible `IterableDataset` is also available.

```python
from torch.utils.data import IterableDataset, DataLoader

iterable_dataset: IterableDataset = scan.to_iterable_dataset()

data_loader = DataLoader(iterable_dataset, batch_size=32)
```

## Distributed Training

Distributed training is natively supported.

```python
data_loader: SpiralDataLoader = scan.to_distributed_data_loader(seed=42, batch_size=32)
```

Optionally, world size and rank can be specified explicitly.

```python
from spiral import World

world = World(rank=0, world_size=64)

data_loader: SpiralDataLoader = scan.to_distributed_data_loader(
    world=world, seed=42, batch_size=32
)
```

Training on filtered data with a higly selective filter in distribute setting can lead to unbalanced shards across nodes.
In such cases, it is recommended to provide shards through `shards` arg to ensure balanced data distribution.
Shards can be obtained from scan via [shards()](https://docs.spiraldb.com/python-api/spiral#shards) or built using [compute\_shards()](https://docs.spiraldb.com/python-api/spiral#compute_shards).

---
# Overview
Section: Collections
Source: https://docs.spiraldb.com/collections/overview

Collections are not yet available. Please contact us if you are interested.

SpiralDB stores your data as a tree of **collections** — like JSON arrays of objects, but backed by columnar storage.
Each collection is physically a separate sorted table. Queries across collections are sort-merge joins — the cheapest
possible join!

## Schema

Here is a dataset of scene recordings for world modeling, where each scene contains clips from different cameras,
and each clip contains video frames, object detections, and audio:

```python
from spiral import Spiral

sp = Spiral()

ds = sp.dataset("recordings")
print(ds.schema)
```

```
recordings
├── scenes {_id: utf8}
│   ├── location: utf8
│   ├── clips {_id: utf8}
│   │   ├── camera: utf8
│   │   ├── start_time: f64
│   │   ├── duration: f64
│   │   ├── frames []                          one per video frame (e.g. 30fps)
│   │   │   ├── rgb: tensor<u8>[H, W, 3]
│   │   │   └── depth: tensor<f32>[H, W]
│   │   ├── detections []                      one per detected object (variable rate)
│   │   │   ├── label: utf8
│   │   │   ├── confidence: f32
│   │   │   └── ts: f64
│   │   └── audio []                           one per chunk (e.g. 512 samples at 16kHz)
│   │       ├── pcm: tensor<i16>[512]
│   │       └── level_db: f32
│   └── metadata: struct
│       ├── weather: utf8
│       └── lighting: utf8
└── models {_id: utf8}
    ├── name: utf8
    └── config: struct
        ├── lr: f64
        └── batch_size: i32
```

Three kinds of things appear in this tree:

- **Collections** (`scenes`, `clips`, `frames`, `detections`, `audio`, `models`).
  Records you can iterate over. Each has its own key — either *keyed* (`{_id: type}` — upserts match on `_id`) or
  *positional* (`[]` — rows are ordered, with a virtual `_pos` field, `u64`, starting at 0).

- **Structs** (`metadata`, `config`). Single nested objects — one per parent record.
  Dot access navigates through them. They don't introduce new rows.

- **Scalars** (`location`, `duration`, `rgb`, `confidence`). Leaf values.

Note that `frames`, `detections`, and `audio` are **sibling collections** within `clips`.
**They have independent row counts** — a 30-second clip might have 900 frames, 12 detections, and 960 audio chunks.

## Physical Model

Each collection is stored as its own sorted columnar table.
A child table's sort key is always a prefix extension of its parent's key:

```
scenes:      sorted by (scenes_id)
clips:       sorted by (scenes_id, clips_id)
frames:      sorted by (scenes_id, clips_id, frames_pos)
detections:  sorted by (scenes_id, clips_id, detections_pos)
audio:       sorted by (scenes_id, clips_id, audio_pos)
```

Because every parent-child pair shares a key prefix, joins between them are sort-merge joins.
When two streams have identical key sets (not just the same key schema, but the same actual rows),
the join degenerates to a zip: walk both streams in lockstep with no searching.

## Navigating the Tree

Dot access walks the tree and returns a lazy reference. Nothing is read until you scan.

```python
scenes = ds.scenes
clips = scenes.clips
frames = clips.frames
detections = clips.detections
audio = clips.audio

frames.rgb                  # references recordings/scenes/clips/frames/rgb
scenes.metadata.weather     # references recordings/scenes/metadata/weather
```

Save any part of the tree to a variable and keep navigating.
Path variables make queries readable — especially when referencing ancestors deep in the tree.

## Expression Levels

Every expression has a **level** — the collection whose key columns define its row domain.
The level is determined by the expression's source and is always unambiguous.

**Field references** have the level of their collection:

```python
scenes.location          # level: scenes     — one value per scene
clips.duration           # level: clips      — one value per clip
detections.confidence    # level: detections — one value per detection
```

**Scalar operations** inherit the deepest level of their operands:

```python
clips.duration * 2                          # level: clips
detections.confidence * 100                 # level: detections
detections.ts - clips.start_time            # level: detections (clips.start_time broadcasts)
```

**Reductions** move one level up (to the immediate parent):

```python
clips.duration.sum()                    # level: scenes     (clips → scenes)
detections.confidence.mean()            # level: clips      (detections → clips)
detections.confidence.mean().sum()      # level: scenes     (detections → clips → scenes)
```

## Scanning

`ds.scan()` reads data and produces rows. Pass one or more expressions — they must all lie on the same ancestor path
(no siblings or unrelated branches). Ancestor expressions are broadcast to the scan level automatically.

```python
ds.scan(duration=clips.duration, camera=clips.camera).to_arrow()
```

```
┌──────────┬────────┐
│ duration │ camera │
├──────────┼────────┤
│     30.0 │ front  │
│     15.0 │ front  │
│     20.0 │ rear   │
└──────────┴────────┘
```

The **scan level** is the deepest result level among all expressions. Shallower expressions are broadcast automatically.

```python
# Scan level = clips (deepest result level)
ds.scan(
    duration=clips.duration,                        # level: clips
    location=scenes.location,                       # level: scenes → broadcasts to clips
    confidence_mean=detections.confidence.mean(),   # level: clips (reduced from detections)
).to_arrow()
```

```
┌──────────┬──────────┬─────────────────┐
│ duration │ location │ confidence_mean │
├──────────┼──────────┼─────────────────┤
│     30.0 │ downtown │            0.85 │
│     15.0 │ downtown │            0.92 │
│     20.0 │ highway  │            0.78 │
└──────────┴──────────┴─────────────────┘
```

**Projection forms.** Several shorthand forms work inside `ds.scan()`:

```python
# Bare collections
ds.scan(frames)

# Kwargs — keys become output column names
ds.scan(rgb=frames.rgb, dur=clips.duration)

# Dict — keys become output column names, nested dicts become structs
ds.scan({
    'image': frames.rgb,
    'meta': {
        'dur': clips.duration,
        'loc': scenes.location,
    },
})
```

## Implicit Broadcasting

When an ancestor expression appears in a scan at a deeper level, its values are automatically repeated for every
child row. This is the only implicit operation — it is always unambiguous because each child row has exactly one parent.

```python
ds.scan(
    rgb=frames.rgb,                  # level: frames
    duration=clips.duration,         # level: clips  → broadcast to frames
    location=scenes.location,        # level: scenes → broadcast to frames
).to_arrow()
```

```
┌─────┬──────────┬──────────┐
│ rgb │ duration │ location │
├─────┼──────────┼──────────┤
│ f0  │     30.0 │ downtown │
│ f1  │     30.0 │ downtown │
│ f2  │     15.0 │ downtown │
│ f3  │     20.0 │ highway  │
└─────┴──────────┴──────────┘
```

Physically, this is a sort-merge join between the frames table and the clips/scenes tables on their shared key prefix.

**Siblings and unrelated collections are errors:**

```python
# ✗ frames and detections are siblings — neither is an ancestor of the other
ds.scan(rgb=frames.rgb, label=detections.label)
# Error: cannot combine 'frames' and 'detections' — they are siblings

# ✗ models is on a different branch entirely
ds.scan(dur=clips.duration, name=ds.models.name)
# Error: 'clips' and 'models' share no ancestor path
```

Siblings must be explicitly aligned to be combined in the same scan — see [Aligning Siblings](#aligning-siblings) below.

## Reduction

Referencing a child collection's fields from a parent level requires
an explicit **reduction** — you must specify how many-to-one values
collapse. Reduction methods move the expression one level up to the
immediate parent.

```python
# At scenes level: one value per scene
clips.duration.sum()      # total recording time per scene
clips.duration.mean()     # average clip duration per scene
clips.duration.list()     # list of clip durations per scene
clips.duration.count()    # number of clips per scene
clips.duration.min()      # shortest clip per scene
clips.duration.max()      # longest clip per scene
clips.duration.first()    # first clip's duration per scene
clips.duration.last()     # last clip's duration per scene
```

Reductions compose — each step moves one level up:

```python
detections.confidence.mean()          # level: clips  (mean per clip)
detections.confidence.mean().sum()    # level: scenes (sum of per-clip means)
```

## Filtering

Filter narrows a collection's row domain without changing its level. It is a first-class operation on collections.

**`.where()` on a collection** returns a filtered collection that can be reused:

```python
from spiral import _

# `_` refers to the collection being filtered:
long_clips = clips.where(_.duration > 10)

# Use the filtered collection in scans:
ds.scan(duration=long_clips.duration).to_arrow()
```

```
┌──────────┐
│ duration │
├──────────┤
│     30.0 │
│     15.0 │
│     20.0 │
└──────────┘
```

`_` supports arithmetic, comparisons, and boolean operators (`&`, `|`, `~`):

```python
from spiral import _

# Compound predicates:
high_conf_cars = detections.where((_.confidence > 0.8) & (_.label == "car"))
cars_or_peds   = detections.where((_.label == "car") | (_.label == "pedestrian"))
not_bicycles   = detections.where(~(_.label == "bicycle"))
```

**Filters affect reductions.** When a filtered collection is aggregated, only the matching rows participate:

```python
high_conf = detections.where(_.confidence > 0.9)

# Count only high-confidence detections per clip:
ds.scan(cnt=high_conf.count())
```

**Filter placement is an optimizer concern.** The user specifies
*what* to filter; the system decides *where* in the plan to apply
it. When two streams have identical row domains, the join becomes a
zip (lockstep walk, no searching). The optimizer weighs this against
the benefit of pushing filters down to reduce data volume.

## Lists and Collections

Lists and child collections are the same abstraction viewed from different levels.

From the **clips** level, `detections` is conceptually a list — for
each clip, there is a variable-length sequence of detection rows.
Reducing that list is the same as a streaming GroupBy over the
detections collection:

```python
detections.confidence.sum()     # streaming: one pass, constant memory
detections.confidence.list()    # materializes all values as a list per clip
```

Both produce one value per clip. The first streams and reduces in a single pass;
the second materializes the list for later processing. Use `.sum()`, `.mean()`,
etc. when you want a scalar; use `.list()` when you need the raw values.

Going the other direction, if a table has a list-typed column, you
can **explode** it to create a child level:

```python
# Suppose scenes has a tags: list<utf8> column
scenes.tags              # level: scenes, type: list<utf8>
scenes.tags.len()        # level: scenes, type: u64
scenes.tags.explode()    # level: (_id, $idx) — one row per tag
```

`.explode()` is the inverse of `.list()`. A list column IS a
pre-materialized child collection.

This duality means the same methods work on both child collection
fields and list columns:

```python
# Child collection field at parent level:
detections.confidence.sum()      # sum of confidences per clip
detections.confidence.list()     # list of confidences per clip
detections.confidence.count()    # number of detections per clip

# List column in the schema:
scenes.tags.sum()         # (if numeric) sum of elements
scenes.tags.list()        # identity — it's already a list
scenes.tags.count()       # number of elements
```

## Window Functions

A window function computes a value for each row using its
neighboring rows within the same parent group. It is a **reduction
that preserves the input's structure** — the result stays at the
same level as the input.

```python
# Rolling 3-frame window: for each frame, collect [frame-2, frame-1, frame]
frames.rgb.window(frame=(-2, 0)).list()
# level: frames, type: list<binary>

# Rolling mean over a 5-detection window:
detections.confidence.window(frame=(-2, 2)).mean()
# level: detections, type: f32
```

**Windows never cross parent boundaries.** A rolling window on
frames is implicitly partitioned by clips — frame 0 of clip c2
never includes frames from clip c1. This is because a window
function is fundamentally a reduction grouped by the parent level:

1. Group frames by parent clip
2. Within each group, compute a sliding aggregate
3. Result: one value per frame (same cardinality as input)

The system optimizes this as a streaming window — no materialization
of intermediate lists.

## Aligning Siblings

`frames`, `detections`, and `audio` are siblings — they share a parent
(`clips`) but have independent row domains. To combine them in a
single scan, you need to align them to a **shared index**.

### Resampling

`.resample(by, target)` is a **bucket-join**: each row in `self` is
assigned to a row in `target` by evaluating the `by` expression, which
must produce a value in `target`'s positional key space (i.e. a valid
`_pos` index into `target`).

Because both `self` and `target` are sorted by the same key prefix, and
`by` must be monotonically non-decreasing, the assignment is a streaming
walk — both sides advance forward together with no random access.

Multiple source rows may land in the same bucket, so `resample` pushes a
virtual `_pos` component onto the result key. Chain a reduction (`.mean()`,
`.count()`, `.list()`, etc.) to collapse back to `target`'s key level.

The two parameters:

- **`by`** — an expression evaluated over `self` that maps each source row
  to a `target` position. Must be monotonically non-decreasing.
- **`target`** — a sibling collection (same parent as `self`) that defines
  the output row domain.

```python
clips      = ds.scenes.clips
detections = clips.detections
frames     = clips.frames

# Convert detection timestamp (seconds) → frame index at clip fps.
# At 10fps: ts=0.05 → frame 0, ts=0.15 → frame 1.
frame_idx = (detections.ts * clips.fps).floor()
resampled  = detections.resample(by=frame_idx, target=frames)

# resampled key: (scenes_id, clips_id, frames_pos, _pos)
# Chain reductions to get back to frames level.
ds.scan(
    rgb       = frames.rgb,
    avg_conf  = resampled.confidence.mean(),
    det_count = resampled.count(),
).to_arrow()
```

```
┌─────┬─────┬──────────┬───────────┐
│ _id │ rgb │ avg_conf │ det_count │
├─────┼─────┼──────────┼───────────┤
│ c1  │ f0  │ 0.90     │ 1         │
│ c1  │ f1  │ 0.85     │ 1         │
│ c2  │ f2  │ 0.70     │ 1         │
│ c3  │ f3  │ 0.775    │ 2         │
└─────┴─────┴──────────┴───────────┘
```

## Row Selection

The `[]` operator **steps into** a collection's key prefix,
pinning one or more leading key values. Each index removes the
leftmost unbound key dimension from the row domain — you are
walking into the tree, not filtering arbitrary rows.

```python
ds.scenes["s1"]                        # pin scenes._id = "s1"
ds.scenes["s1"].clips["c1"]            # pin scenes._id = "s1", clips._id = "c1"
ds.scenes["s1"].clips["c1"].frames[0]  # fully specified — one frame
```

Because the data is sorted by key prefix, stepping into a prefix
is a seek, not a scan. All downstream expressions and joins see
the narrowed domain:

```python
s1_clips = ds.scenes["s1"].clips

ds.scan(duration=s1_clips.duration, location=scenes.location).to_arrow()
```

```
┌──────────┬──────────┐
│ duration │ location │
├──────────┼──────────┤
│     30.0 │ downtown │
│     15.0 │ downtown │
└──────────┴──────────┘
```

When every collection in the path is scalar-indexed, you have a
fully-specified path to a single value:

```pycon
>>> ds.scenes["s1"].location.to_py()
"downtown"
>>> ds.scenes["s1"].clips["c1"].duration.to_py()
30.0
```

### Selection forms

**Scalar index** — pins one key value at one level:

```python
ds.scenes["s1"]                           # one scene
ds.scenes["s1"].clips["c1"]              # one clip within one scene
ds.scenes["s1"].clips["c1"].frames[0]    # one frame (by _pos)
```

**Array of keys** — selects specific rows (must be sorted):

```python
ds.scenes[["s1", "s2", "s5"]]
# level: scenes, row domain: 3 scenes
```

**Key table** — multi-level row selection. Each column corresponds
to a level in the hierarchy, rows correspond element-wise:

```python
import pyarrow as pa

ds.scenes.clips[pa.table({
    "scenes": ["s1", "s1", "s2"],
    "scenes.clips": ["c1", "c2", "c3"],
})]
# level: clips, row domain: 3 specific clips
```

Key tables are useful for sampling, precomputed batch indices, or
feeding an externally-computed selection into a training loop.

## Execution Model

### Physical Storage

Each collection is a separate sorted table. Conceptually, child tables store parent keys for joining:

**scenes** — sorted by `(scenes_id)`

```
┌───────────┬──────────┐
│ scenes_id │ location │
├───────────┼──────────┤
│ s1        │ downtown │
│ s2        │ highway  │
└───────────┴──────────┘
```

**clips** — sorted by `(scenes_id, clips_id)`

```
┌───────────┬──────────┬──────────┐
│ scenes_id │ clips_id │ duration │
├───────────┼──────────┼──────────┤
│ s1        │ c1       │     30.0 │
│ s1        │ c2       │     15.0 │
│ s2        │ c3       │     20.0 │
└───────────┴──────────┴──────────┘
```

**frames** — sorted by `(scenes_id, clips_id, frames_pos)`

```
┌───────────┬──────────┬────────────┬─────┬───────┐
│ scenes_id │ clips_id │ frames_pos │ rgb │ depth │
├───────────┼──────────┼────────────┼─────┼───────┤
│ s1        │ c1       │ 0          │ f0  │ d0    │
│ s1        │ c1       │ 1          │ f1  │ d1    │
│ s1        │ c2       │ 0          │ f2  │ d2    │
│ s2        │ c3       │ 0          │ f3  │ d3    │
└───────────┴──────────┴────────────┴─────┴───────┘
```

In practice, SpiralDB maintains efficient indexes to optimize
cross-level joins, and the Vortex columnar format makes
broadcasting a zero-copy operation. But the logical model above
is equivalent.

### Execution Plan

Consider the same resampling query from the
[previous section](#aligning-siblings), but with a filter on
detections:

```python
dts = detections.where(detections.label == "pedestrian")

ds.scan(
    frames.rgb,
    dts.resample(by=(dts.ts * clips.fps).floor(), target=frames).list(),
    clips.id,
    clips.duration,
).to_arrow()
```

Each node below is a streaming operator; data flows top-to-bottom
from table scans to the final output:

```
Scan(scenes) Scan(clips) Scan(frames)            Scan(detections)
    │             │             │                        │
    └─── Join ────┘             │              Filter(label=="pedestrian")
           │                    │                        │
           │                    │                 Resample(agg=list)
           │                    │                        │
           │                    └──────── Join ──────────┘
           │                               │
           └─────────── Join ──────────────┘
                          │
                       Output
```

Four table scans feed into the plan. The filter on `label` is
pushed down before resampling. Resample is bucket join, and
enables sort-merge join with the frames branch. The ancestor fields
(scenes, clips) are joined via sort-merge on the shared key prefix.

**Filter placement is an optimizer choice.** The filter could
instead be applied after the zip — this preserves the cheap
lockstep walk but processes more rows through the resample.
Pushing the filter before the zip reduces data volume but may
break the row-domain identity, turning the zip into a join.
The optimizer weighs filter selectivity against join cost.

**Sideways information passing.** When the filter is pushed to
the detections scan, it can also accelerate the frames scan. Both
streams are sorted by the same key prefix (`scenes_id`,
`clips_id`). As the filtered detections stream skips past clips
with no matching detections, it can pass updated key bounds to the
frames stream, allowing it to seek forward rather than scan
through frames that will be discarded after the zip. This
cross-operator communication — where one scan's progress
informs another's seek position — is known as **sideways
information passing**.

## Writing Data

All writes are **upserts**. Writing to an existing key replaces the
record; writing to a new key creates it.

**Prefer batch writes** for best performance — columnar storage is
most efficient when writing many records at once.

**Single records** — setitem with a dict:

```python
ds.scenes["s3"] = {
    "location": "park",
    "metadata": {"weather": "sunny", "lighting": "natural"},
    "clips": [
        {
            "_id": "c10",
            "camera": "front",
            "start_time": 0.0,
            "duration": 60.0,
            "frames": [{"rgb": b"f10", "depth": b"d10"}],
            "detections": [{"label": "car", "confidence": 0.9, "ts": 0.5}],
            "audio": [{"pcm": b"a10", "level_db": -20.0}],
        },
    ],
}
```

**Scalars** — when the path is fully indexed:

```python
ds.scenes["s1"].location = "downtown_v2"
ds.scenes["s1"].clips["c1"].duration = 45.0
```

**Batch writes** — setitem with an Arrow table or list of dicts:

```python
import pyarrow as pa

ds.scenes["s1"].clips["c1"].frames = pa.table({
    "rgb":   [b"f20", b"f21", b"f22"],
    "depth": [b"d20", b"d21", b"d22"],
})
```

Note that positional collections (like `frames`) do not support upserting individual rows by index.
You can only replace the entire collection at once.

**Scoped writes** — the client batches them for you:

```python
with ds.write() as w:
    for scene in scene_dicts:
        w.scenes[scene['_id']] = scene
    # all writes flushed as a single batch on exit
```

---
# Data Model
Section: Tables
Source: https://docs.spiraldb.com/tables/data-model

The data model of Spiral Tables is dictionary-like structures of columnar arrays, sorted and unique by a set of primary
key columns. Tables are **sorted** and **unique** by a primary key. The data model enables:

- Efficient **sparse columns**.
- Columns arranged in a **nested dictionary**-like structure.
- Support for large values and **cell-level pushdown** filtering.
- Support for **appending columns** without rewriting the entire table.

## Key Schema

When a table is created, a `key schema` is defined that represents the primary key and sort order of the table. This
schema is fixed and cannot be changed after the table is created.

The key schema can be any number of columns of the following types:

- `(u)int{8,16,32,64}`
- `float{16,32,64}`
- `timestamp`
- `bytes` (up to 1KB)
- `string` (up to 1KB)

## Column Groups

The columns of a table are arranged in a nested dictionary-like structure. We refer to this as the *schema tree*.
Column groups can be thought of as horizontal partitions of the table. Tables provide complete isolation between
column groups, scanning only the column groups needed for a query.

For example, the [GitHub Archive](https://www.gharchive.org) dataset has a key schema that looks like this:

```python
import pyarrow as pa

key_schema = [('created_at', pa.timestamp('us')), ('id', pa.int64())]
```

And a schema tree of:

```python
schema_tree = {
    'name': pa.string(),
    'public': pa.bool_(),
    'payload': pa.string(),
    'repo': {
        'id': pa.int64(),
        'name': pa.string(),
        'url': pa.string()
    },
    'actor': {
        'id': pa.int64(),
        'login': pa.string(),
        'gravatar_id': pa.string(),
        'url': pa.string(),
        'avatar_url': pa.string()
    },
    'org': {
        'id': pa.int64(),
        'login': pa.string(),
        'gravatar_id': pa.string(),
        'url': pa.string(),
        'avatar_url': pa.string()
    }
}
```

A **column group** refers to a set of sibling leaf columns in the schema tree. For example, in the schema tree above,
there is a root (\`\`) column group containing the `name`, `public`, and `payload` columns; as well as `repo`, `actor`
and `org` column groups.

## Storage Model

Each column group is stored as a log-structured merge (LSM) tree in object storage. This is a data structure that
consists of sorted runs of data.

The key columns are split out and stored in key files, while the value columns are stored in fragment files. Background
maintenance jobs periodically compact the LSM tree to merge overlapping sorted runs of value columns to improve read
performance.

See [Table Format](https://docs.spiraldb.com/format) for a more detailed specification of the storage model.

---
# Write Tables
Section: Tables
Source: https://docs.spiraldb.com/tables/write

We're going to explore the Spiral Tables write API using the ever classic [GitHub Archive dataset](https://www.gharchive.org/).

## Create a Table

Currently, tables can only be created using the Spiral Python API.

```python
import pyarrow as pa
import spiral
from spiral.demo import demo_project


sp = spiral.Spiral()
project = demo_project(sp)

# Define the key schema for the table.
# Random keys such as UUIDs should be avoided, so we're using `created_at` as part of primary key.
key_schema = pa.schema([('created_at', pa.timestamp('ms')), ('id', pa.string())])

# Create the 'events-v1' table in the 'gharchive' dataset
events = project.create_table('gharchive.events-v1', key_schema=key_schema, exist_ok=True)
```

To open the table in the future:

```python
events = project.table("gharchive.events-v1")
```

We'll define a function that downloads data from the GitHub Archive and download an hour of data.

```python
import pyarrow as pa
import pandas as pd
import duckdb

def get_events(period: pd.Period, limit=100) -> pa.Table:
    json_gz_url = f"https://data.gharchive.org/{period.strftime('%Y-%m-%d')}-{str(period.hour)}.json.gz"
    arrow_table = (
        duckdb.read_json(json_gz_url, union_by_name=True)
        .limit(limit) # remove this line to load more rows
        .select("""
        * REPLACE (
            cast(created_at AS TIMESTAMP_MS) AS created_at,
            to_json(payload) AS payload
        )
        """)
        .to_arrow_table()
    )

    events = duckdb.from_arrow(arrow_table).order("created_at, id").distinct().to_arrow_table()
    events = (
        events.drop_columns("id")
        .add_column(0, "id", events["id"].cast(pa.large_string()))
        .drop_columns("created_at")
        .add_column(0, "created_at", events["created_at"].cast(pa.timestamp("ms")))
        .drop_columns("org")
    )

    return events

data: pa.Table = get_events(pd.Period("2023-01-01T00:00:00Z", freq="h"))
```

## Write to a Table

All write operations to a Spiral table must include the key columns.

Writing nested data such as struct arrays or dictionaries into a table creates column groups.
See [Best Practices](https://docs.spiraldb.com/tables/best-practices#column-groups) for how to design your schema for optimal performance.

The write API accepts a variety of data formats, including PyArrow tables and dictionaries.
Since our function returns a PyArrow table, we can write it directly to the Spiral table.

```python
events.write(data)
```

We could also have written multiple times atomically within a transaction.

```python
with events.txn() as tx:
    for period in pd.period_range("2023-01-01T00:00:00Z", "2023-01-01T01:00:00Z", freq="h"):
        data = get_events(period)
        tx.write(data)
```

Or in parallel, useful when ingesting large amounts of data:

```python
from concurrent.futures import ThreadPoolExecutor

def write_period(tx, period: pd.Period):
    data = get_events(period)
    tx.write(data)

with events.txn() as tx:
    with ThreadPoolExecutor() as executor:
        for period in pd.period_range("2023-01-01T00:00:00Z", "2023-01-01T01:00:00Z", freq="h"):
            executor.submit(write_period, tx, period)
```

It is important that the primary key columns are unique within the transaction because all writes in a transaction will have a same timestamp.

If a different value is written for the same column and the same key in the same transaction,
the result when reading the table is **undefined**.

Our table schema now looks like this:

```python
events.schema().to_arrow()
```

```python
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "payload": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})
```

Recall that column groups are defined as sibling columns in the same node of the schema tree.
Our schema, in addition to the root column group, contains `actor`, `org`, and `repo` column groups.

While columnar storage formats are quite efficient at pruning unused columns, the distance in the file between
the data for each column can introduce seek operations that can significantly slow down queries over object storage.
In many cases, if you know upfront that you will always query a set of columns together, you can improve query
performance by grouping these columns together in a column group.

## Appending Columns

Spiral Tables are merged across writes making it very easy to append columns to an existing table.

Suppose we wanted to derive an `action` column. We could easily append these columns to the existing table without
rewriting the entire dataset.

This example will use some syntax from the `spiral.expressions` module. For more detailed examples, please refer to
the [Expressions](https://docs.spiraldb.com/tables/expressions).

```python
action = sp.scan({
    'action': {
        "repo_id": events["repo.id"],
        "actor_id": events["actor.id"],
        "type": events["type"],
        "payload": events["payload"]
    }
}, where=events["repo.name"] == "apache/arrow").to_table()
```

Our table schema now contains `action` column group, like this:

```python
events.schema().to_arrow()
```

```
pa.schema(
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "payload": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
    "action": pa.struct({
        "actor_id": pa.int64(),
        "org_id": pa.int64(),
        "payload": pa.string(),
        "repo_id": pa.int64(),
        "type": pa.string(),
    }),
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
}
```

---
# Scan & Query
Section: Tables
Source: https://docs.spiraldb.com/tables/scan

Scanning tables is the process of reading data row-by-row performing row-based scalar transformations, or row-based filtering operations.

## Scan Object

The `scan` method returns a `Scan` object that encapsulates a specific query. This object can then return
rows of data, or be used to perform further operations.

```python
import spiral
import pyarrow as pa
from spiral.demo import gharchive


sp = spiral.Spiral()
events_table = gharchive(sp)

scan = sp.scan(events_table[["id", "type", "public", "actor", "payload.*", "repo"]])

# The result schema of the scan
schema: pa.Schema = scan.schema

# Read as a stream of RecordBatches
record_batches: pa.RecordBatchReader = scan.to_record_batches()

# Read into a single PyArrow Table
arrow_table: pa.Table = scan.to_table()

# Read into a Dask DataFrame
dask_df: dd.DataFrame = scan.to_dask()

# Read into a Ray Dataset
# ray_ds: ray.data.Dataset = scan.to_ray_dataset()

# Read into a Pandas DataFrame
pandas_df: pd.DataFrame = scan.to_pandas()

# Read into a Polars DataFrame
polars_df: pl.DataFrame = scan.to_polars()

# Read into a Torch-compatible DataLoader for model training
data_loader: torch.utils.data.DataLoader = scan.to_data_loader()

# Read into a Torch-compatible DataLoader for distributed model training
distributed_data_loader: torch.utils.data.DataLoader = scan.to_distributed_data_loader()

# Read into a Torch-compatible IterableDataset for model training
iterable_dataset: torch.utils.data.IterableDataset = scan.to_iterable_dataset()
```

Scan's `to_record_batches` method returns rows sorted by the table's key. If ordering is not important,
consider using `to_unordered_record_batches` which can be faster. Scan executes splits concurrently,
but can only keep some number of splits in memory at a time which limits the degree of concurrency.
Unlike ordered streams, `to_unordered_record_batches` yields batches as soon as they are ready,
without waiting for the previous batch to be processed so buffering never limits concurrency.

## Filtering

Filtering is the process of selecting rows that meet a certain condition. For example, to find events with a specific
event type:

```python
insertion_events = sp.scan(
    events_table,
    where=events_table['type'] == 'PullRequestEvent'
)
```

Any expression that resolves to a boolean value can be used as a filter. See the [Expressions](https://docs.spiraldb.com/tables/expressions)
documentation for more information.

## Projection

Projection is the process of applying a transformation function to a single row of a table. This can be as simple as
selecting a subset of columns, through to much more complex functions such as passing a string column through an
LLM API.

A projection expression *must* resolve to a struct value. See [Expressions](https://docs.spiraldb.com/tables/expressions)
documentation for more information.

### Nested Data

Scanning a table selects all columns including nested column groups:

```python
sp.scan(events_table.select(exclude=['payload'])).schema.to_arrow()
```

```python
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    })
})
```

You can select an entire column group using bracket notation. Remember, the result must always be a struct but
column group is a struct so this is valid:

```python
sp.scan(events_table["repo"]).schema.to_arrow()
```

```python
pa.schema({
    "id": pa.int64(),
    "name": pa.string(),
    "url": pa.string(),
})
```

Selecting a column like this is not valid. When selecting column, use double brackets:

```python
sp.scan(events_table[["public"]]).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
})
```

Or multiple columns:

```python
sp.scan(events_table[["public", "repo"]]).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})
```

You can pack columns with custom names using a dictionary:

```python
sp.scan({
    "is_public": events_table["public"],
    "repo": events_table["repo"],
}).schema.to_arrow()
```

```python
pa.schema({
    "is_public": pa.bool_(),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})
```

Double-brackets is a "syntax sugar" for the `.select()` method:

```python
sp.scan(events_table.select("public")).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
})
```

Selection applies over columns and nested column groups:

```python
sp.scan(events_table.select("public", "repo")).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})
```

It is possible to "select into" a column group to get specific fields within nested structures:

```python
sp.scan(events_table.select("public", "repo.id", "repo.url")).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
    "repo": pa.struct({
        "id": pa.int64(),
        "url": pa.string(),
    }),
})
```

You can select from multiple nested structures:

```python
sp.scan(events_table.select("public", "repo", "actor.login")).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
    "actor": pa.struct({
        "login": pa.string(),
    }),
})
```

Note that selecting "into" a column group returns a nested structure. To flatten, "step into" the column group first:

```python
sp.scan(events_table[["public"]], events_table["repo"][["id", "url"]]).schema.to_arrow()
```

```python
pa.schema({
    "public": pa.bool_(),
    "id": pa.int64(),
    "url": pa.string(),
})
```

Column groups support the same selection operations. This is equivalent to the flattened example above:

```python
sp.scan(events_table["repo"].select("id", "url")).schema.to_arrow()
```

```python
pa.schema({
    "id": pa.int64(),
    "url": pa.string(),
})
```

Use `exclude` to remove specific columns, column groups, or keys:

```python
sp.scan(events_table.select(exclude=["payload", "repo", "actor"])).schema.to_arrow()
```

```python
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
})
```

Wildcard `"*"` allows you to select or exclude columns from a specific column group, without including or excluding nested column groups:

```python
sp.scan(events_table.select("*")).schema.to_arrow()
```

```python
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
})
```

Exclude all columns to keep only nested groups except payload column group:

```python
sp.scan(events_table.select(exclude=["*", "payload"])).schema.to_arrow()
```

```python
pa.schema({
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})
```

Wildcards can be mixed with other selections:

```python
sp.scan(events_table.select("*", "actor.login")).schema.to_arrow()
```

```python
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
    "actor": pa.struct({
        "login": pa.string(),
    }),
})
```

## Querying

Recent years have seen several excellent query engines including:

- **[Polars](https://pola.rs)** - with Python and Rust DataFrame APIs
- **[DuckDB](https://duckdb.org)** - with a SQL oriented API
- **[DataFusion](https://datafusion.apache.org)** - SQL as well as a Rust DataFrame API

### Polars

Tables can also be queried with [Polars LazyFrames](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html):

```python
tbl = gharchive(sp)
df = tbl.to_polars_lazy_frame()
result = df.collect()
```

### DuckDB

Tables can be turned into PyArrow Datasets with `to_dataset()`, which in turn enables the [DuckDB Python API](https://duckdb.org/docs/api/python/overview).

```python
import duckdb

tbl = gharchive(sp)
ds = tbl.to_arrow_dataset()

result = duckdb.execute("SELECT type, COUNT(*) FROM ds GROUP BY type")
```

Or using the [DuckDB Relational API](https://duckdb.org/docs/api/python/relational_api.html):

```python
result = tbl.to_duckdb_relation().filter("public is not null").to_arrow_table()
```

## Key Table

Key Table scan is an equivalent of a primary key lookup. Any scan can be evaluated against a table of keys
to return only the rows that match the keys, with columns defined by the scan's projection. The result
will always contain all the rows from the input key table, and if scan specifies any filters, rows that do not match
the filter will be returned as nulls.

```python
import pyarrow as pa

scan = sp.scan(events_table)

key_table = pa.table({
    "created_at": ['2025-11-20 13:00:00'],
    "id": ['4807206745'],
})
results = scan.to_record_batches(key_table=key_table)
```

For optimal performance, provide keys that are sorted and unique. Unsorted and duplicate keys are
still supported, but performance is less predictable.

It is possible to stream the keys into the scan using an Arrow `RecordBatchReader`.

```python
import pyarrow as pa
from typing import Iterable

from spiral.demo import abc

table_abc = abc(sp)

scan = sp.scan(table_abc.select("a", "b"))

def batches() -> Iterable[pa.RecordBatch]:
    yield from [
        pa.record_batch(
            {
                "a": pa.array([3, 5, 8], type=pa.int64()),
            }
        ),
        pa.record_batch(
            {
                "a": pa.array([0, 1, 2, 6, 7, 9], type=pa.int64()),
            }
        ),
        pa.record_batch(
            {
                "a": pa.array([1, 2, 3, 4, 5], type=pa.int64()),
            }
        ),
    ]

key_table_reader = pa.RecordBatchReader.from_batches(
    table_abc.key_schema.to_arrow(),
    batches(),
)

scan.to_record_batches(
    key_table=key_table_reader
).read_all().to_pylist()
```

```python
[
    {"a": 3, "b": 103},
    {"a": 5, "b": 105},
    {"a": 8, "b": 108},
    {"a": 0, "b": 100},
    {"a": 1, "b": 101},
    {"a": 2, "b": 102},
    {"a": 6, "b": 106},
    {"a": 7, "b": 107},
    {"a": 9, "b": 109},
    {"a": 1, "b": 101},
    {"a": 2, "b": 102},
    {"a": 3, "b": 103},
    {"a": 4, "b": 104},
    {"a": 5, "b": 105},
]
```

Note that when streaming scan results with `to_record_batches`, the result stream can contain more batches
than the input key stream; the scan is optimized to return rows as soon as possible. The batch in the result stream will
never contain rows that span multiple input key batches. `to_record_batches` supports a `batch_aligned` parameter that
can be used to ensure that each output batch corresponds exactly to an input key batch. The flag can have a performance
impact since it may require buffering rows until the end of the input key batch is reached and even an extra copy.
It is recommended to only use this flag when the full result for a given input batch is needed before processing can continue.

## Serialization

Scan objects support Python's pickle protocol, enabling seamless integration with distributed systems
like Ray and multiprocessing without requiring manual serialization.

### Pickle

```python
import pickle

from spiral.demo import abc

table_abc = abc(sp)

scan = sp.scan(table_abc[["a", "b"]], where=table_abc["a"] < 3)

# Serialize
pickled = pickle.dumps(scan)

# Deserialize
restored = pickle.loads(pickled)

restored.to_table().to_pylist()
```

```python
[{"a": 0, "b": 100}, {"a": 1, "b": 101}, {"a": 2, "b": 102}]
```

### Ray

```python
import ray

from spiral.demo import abc

table_abc = abc(sp)

@ray.remote
def process_scan(scan):
    return scan.to_table().num_rows

scan = sp.scan(table_abc)
count = ray.get(process_scan.remote(scan))
```

### Manual State (JSON-compatible)

For non-pickle contexts, use `to_bytes_compressed()` and `resume_scan()`:

```python
from spiral.demo import abc

table_abc = abc(sp)

scan = sp.scan(table_abc[["a", "b"]], where=table_abc["a"] < 3)

# Serialize scan state
state = scan.to_bytes_compressed()  # bytes, can be base64-encoded for JSON

# Resume on another process or machine
restored = sp.resume_scan(state)

restored.to_table().to_pylist()
```

```python
[{"a": 0, "b": 100}, {"a": 1, "b": 101}, {"a": 2, "b": 102}]
```

## Cross-table

It is possible to jointly scan any tables that have a common key schema. This has the same behaviour as an outer join
but is more efficient since we know the tables are both sorted by the key columns.

```python
sp.scan(
    {
        "events": events_table,
        "other_events": other_events_table,
    }
)
```

---
# Enrichment
Section: Tables
Source: https://docs.spiraldb.com/tables/enrich

The easiest way to build multimodal Spiral tables is to start with metadata with pointers, such as file paths or URLs,
and use **enrichment** to ingest the actual data bytes, such as images, audio, video, etc. With column groups
design supporting hundreds of thousands of columns, horizontally expanding tables is a powerful primitive.

We'll explore this using the [Open Images dataset](https://github.com/cvdfoundation/open-images-dataset),
a large-scale collection of images with associated metadata.

## Starting with Metadata

Let's begin by creating a table and loading the image URLs from the Open Images dataset.

```python
import spiral
from spiral.demo import images
import pandas as pd
import pyarrow as pa

sp = spiral.Spiral()
table = images(sp)
```

At this point, our table contains only lightweight metadata:

```python
table.schema().to_arrow()
```

```python
pa.schema({
    "idx": pa.int64(),
    "etag": pa.string(),
    "size": pa.int64(),
    "url": pa.string(),
})
```

## Ingesting Data

We can enrich the table by fetching the actual image data from ingested URLs.

```python
from spiral import expressions as se

# Create an enrichment that fetches images from S3
enrichment = table.enrich(
    se.pack({
        "image": se.http.get(table["url"])
    })
)

# Apply the enrichment in a streaming fashion
enrichment.run()
```

After enrichment, our table now contains the actual image data, as well as metadata requests:

```python
table.schema().to_arrow()
```

```python
pa.schema({
    "idx": pa.int64(),
    "etag": pa.string(),
    "size": pa.int64(),
    "url": pa.string(),
    "image": pa.struct({
        "bytes": pa.binary(),
        "meta": pa.struct({
            "e_tag": pa.string(),
            "last_modified": pa.timestamp("ms"),
            "location": pa.string(),
            "size": pa.uint64(),
            "status_code": pa.uint16(),
            "version": pa.string(),
        }),
    }),
})
```

### Incremental Progress

Rows can be selectively enriched using the `where` parameter.

The `se.http.get()` expression reads data from S3 URLs. Spiral handles parallel fetching, retries, and error handling
automatically, but some requests may still fail (e.g., 404 Not Found). The errors can be inspected in using the
`status_code` metadata field.

Let's retry fetching images that previously failed with a 500 status code.

```python
enrichment = table.enrich(
    se.pack({
        "image": se.http.get(table["url"])
    }),
    where=(table["image.meta.status_code"] == pa.scalar(500, pa.uint16()))
)

enrichment.run()
```

### Other Data Sources

In addition to `se.http.get()`, Spiral supports other data sources such as `se.s3.get()` for object storage ingestion
and `se.file.get()` for local filesystem reads. Note that the client environment must have appropriate permissions
to access these resources.

## Deriving Columns with UDFs

It's also possible to enrich with User-Defined Functions (UDFs) for custom processing logic. Let's say we had a column
`path` representing a path within a specific S3 bucket, and we wanted to enrich with the full image data. To use
`se.s3.get()` we need to have URLs in the form `s3://bucket_name/path/to/object`. Let's enrich table with these.

First, we define a UDF to convert S3 paths to full S3 URLs.

```python
import pyarrow as pa
from spiral.expressions.udf import UDF

class GetPath(UDF):
    """
    turns urls like 'https://c2.staticflickr.com/6/5606/15611395595_f51465687d_o.jpg'
    into paths like '/6/5606/15611395595_f51465687d_o.jpg'
    """
    def name(self):
        return "get_path"

    def return_type(self, input_type: pa.DataType):
        return pa.string()

    def invoke(self, urls: pa.Array):
        from urllib.parse import urlparse
        paths = [
            urlparse(url).path
            for url in urls.to_pylist()
        ]

        return pa.array(paths, type=pa.string())
```

Now we can use this UDF in an enrichment to generate full S3 URLs and fetch the images.

```python
get_path = GetPath()

enrichment = table.enrich(
    se.pack({
        "path": get_path(table["url"]),
    })
)

enrichment.run()
```

## Distributing Work

Distributed enrichment is supported through distributed execution engines like Dask or Ray.

Distributed enrichment will default to table shards, but that might not be sufficient parallelism. When metadata is small,
default sharding may contain way too many rows for a single distributed task. Custom shards can be generated using `compute_shards()`
which will scan the key columns to compute a set of shards with roughly equal numbers of keys (and thus rows) per shard.

---
# Distributed
Section: Tables
Source: https://docs.spiraldb.com/tables/distributed

Both Spiral Table scans and transactions can be row-wise partitioned for use in distributed query
engines such as PySpark, Dask, and Ray. For the engines listed below we have first-class
integrations. For any other engine, see [Others](#Others) for instructions on how to integrate a
Spiral scan or transaction with an arbitrary distributed compute system with a Python API.

## Ray

A Spiral Table can be read by a Ray Dataset:

```python
import spiral
from spiral.demo import gharchive, demo_project

sp = spiral.Spiral()
table = gharchive(sp)
scan = sp.scan(table)

ds: ray.data.Dataset = scan.to_ray()
```

A Ray Dataset can be written into a Spiral Table:

```python
ds: ray.Dataset = sp.scan(table).to_ray()

gharchive_events_copy = demo_project(sp).create_table(
    "gharchive.events_copy",
    key_schema=table.key_schema
)

with gharchive_events_copy.txn() as txn:
    ds.write_datasink(txn.to_ray_datasink())
```

## Others

Spiral's distribution primitives should integrate with any Python-based distributed compute
platform. In this section, we use Python's `ThreadPoolExecutor` to demonstrate this integration.

### Reading

All scan functions (e.g. [`to_table()`](https://docs.spiraldb.com/python-api/spiral#to_table),
[`to_pandas()`](https://docs.spiraldb.com/python-api/spiral#to_pandas),
[`to_record_batches()`](https://docs.spiraldb.com/python-api/spiral#to_record_batches)) accept a `shards` argument which
are row-wise partitions of the table defined by non-overlapping key ranges. There are two ways to
generate a set of shards for a given scan:

1. [`shards()`](https://docs.spiraldb.com/python-api/spiral#shards) which is cheap to compute but shards based on the
   physical data layout.
2. [`compute_shards()`](https://docs.spiraldb.com/python-api/spiral#compute_shards) which scans the key columns to
   compute a set of shards with roughly equal numbers of keys (and thus rows) per shard.

When distributing work across processes or nodes, one must ensure a consistent table state. The
state of a table can be fixed (i.e. later writes are ignored) by constructing a
[`snapshot()`](https://docs.spiraldb.com/python-api/spiral#snapshot). A Snapshot's
[`asof`](https://docs.spiraldb.com/python-api/spiral#asof), a psuedo-timestamp, is sufficient to uniquely identify the
state of a Spiral table. Any Scan which specifies the same asof will see the same state of the
table:

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

asof = table.snapshot().asof
table_id = table.table_id

predicate = table["created_at"] > datetime(2020, 3, 1)
driver_scan = sp.scan(table, where=predicate, asof=asof)

shards = driver_scan.shards()

def worker_fn(shard):
    """Code executed on another node."""
    from spiral import Spiral

    table = Spiral().table(table_id)
    worker_scan = sp.scan(table, where=predicate, asof=asof, shard=shard)
    return worker_scan.to_table()

with ThreadPoolExecutor() as executor:
    results = executor.map(worker_fn, shards)
```

### Writing

Spiral Transactions are also horizontally scalable. The orchestration / driver node should start a
transaction, execute and collect the transaction operations of each node, compose them into one
transaction, and commit the entire transaction at once. Instead of serializing the transaction
object itself, we [`Transaction.take()`](https://docs.spiraldb.com/python-api/spiral#take) the list of operations and
serialize those with `Operation.to_json`. `Operation.from_json` deserializes an operation and
[`Transaction.include()`](https://docs.spiraldb.com/python-api/spiral#include) subsumes one or more operations into
another transaction.

```python
from concurrent.futures import ThreadPoolExecutor
from spiral.core.table.spec import TransactionOps
import pyarrow as pa

project = demo_project(sp)
project_id = project.id
table = project.create_table("squares", key_schema={"key": pa.int64()})

def worker_fn(index: int) -> TransactionOps:
    """Code executed on another node."""
    from spiral import Spiral

    sp = Spiral()
    project = sp.project(project_id)

    txn = project.table("squares").txn()
    txn.write({
        "key": [index],
        "square": [index * index],
    })

    return txn.take()

with table.txn() as txn:
    # The driver-side transaction must start before the workers!

    with ThreadPoolExecutor() as executor:
       worker_results = list(executor.map(worker_fn, [0, 1, 2, 3]))

    for worker_tx in worker_results:
        txn.include(worker_tx)
```

See also:

- [Data Loading](https://docs.spiraldb.com/usecases/data-loaders)

---
# Expressions
Section: Tables
Source: https://docs.spiraldb.com/tables/expressions

The `expressions` module provides a powerful way to build and manipulate expressions for data processing.
This guide will walk you through common use cases using [GitHub event data](https://www.gharchive.org/) as examples.

## Data Schema

Let's take a look at the public GitHub Archive dataset hosted in Spiral:

```python
import spiral as sp
from spiral.demo import gharchive
import pyarrow as pa

sp = sp.Spiral()
table = gharchive(sp)
```

## Basic Concepts

### Tables

Spiral tables are structured as a series of nested structs. Use Python dict syntax to project columns from the table.

```python
# Create a variable referring to the actor's login
actor_login = table["actor"]["login"]
```

### Scalars

Scalars are constant values that can be used in expressions.

```python
# Compare with scalar
is_octocat = actor_login == "octocat"
```

Scalars can be explicitly created using the `se.scalar` function.

```python
from spiral import expressions as se
# Create a scalar
example_user = se.scalar("octocat")
```

## Common Operations

Full list of common operations is available in API docs.

### Filtering

```python
# Find issues with number greater than 100
high_number_issues = table["payload"]["issue"]["number"] > 100
```

### Combining

```python
octocat_high_issues = is_octocat & high_number_issues
```

This is equivalent to:

```python
from spiral import expressions as se
# Find issues created by octocat that are high-numbered
octocat_high_issues = se.and_(
    table["actor"]["login"] == "octocat",
    table["payload"]["issue"]["number"] > 100
)
```

## Struct Operations

Struct operations allow you to manipulate nested data structures.

Although `se.struct` is a submodule of `se`, some functions are also available directly on `se`.

### Selecting Fields

```python
from spiral import expressions as se

# Select only actor information.
actor_info = se.select(table["actor"], names=["login", "avatar_url"])

# Exclude the avatar URL.
actor_info_no_avatar = se.select(actor_info, exclude=["avatar_url"])
```

### Merging Structs

```python
from spiral import expressions as se

# Combine actor and repo information.
actor_and_repo = se.merge(table["actor"], table["repo"])
```

On a key clash keys are taken from the right most struct.

### Creating New Structs

```python
# Create a simplified event representation
simple_event = se.pack({
    "user": table["actor"]["login"],
    "repo_name": table["repo"]["name"],
    "action": table["payload"]["action"],
})
```

## Advanced Example

```python
import spiral
import pyarrow as pa
from spiral import expressions as se
from datetime import datetime
from spiral.demo import gharchive

sp = spiral.Spiral()
table = gharchive(sp)

# First, let's create some reusable components
actor = table["actor"]
repo = table["repo"]
payload = table["payload"]
created_at = table["created_at"]

# 1. Filter conditions
is_2024 = created_at >= datetime(2024, 1, 1)
is_issue_event = payload["action"] == "opened"
has_avatar = actor["gravatar_id"] != ""

# 2. Popular repositories (arbitrary threshold for example)
popular_repo = repo["id"] > 1000000

# 3. Specific users of interest
vip_users = ["octocat", "torvalds", "gvanrossum"]
is_vip = se.or_(*[
    actor["login"] == user
    for user in vip_users
])

# 4. Combine all filters
where_filter = se.and_(
    is_2024,
    se.or_(
        se.and_(is_issue_event, popular_repo),
        is_vip
    )
)

# 5. Create a scoring system
user_score = {
    "base_score": se.divide(actor["id"], 100),
    "avatar_bonus": has_avatar.cast(pa.int64()) * 50
}

# 6. Create the final enriched event structure
projection = se.pack({
    # Original fields
    "timestamp": created_at,

    # User info with computed fields
    "user": se.pack({
        "login": actor["login"],
        "is_vip": is_vip,
        "has_avatar": has_avatar,
        **user_score
    }),

    # Repository info
    "repository": se.pack({
        "name": repo["name"],
        "is_popular": popular_repo
    }),

    # Event type info
    "event_type": se.pack({
        "is_issue": is_issue_event,
        # Null for non-issue events
        "issue_number": payload["issue"]["number"],
    }),

    # Computed total score
    "total_score": se.add(
        user_score["base_score"],
        user_score["avatar_bonus"]
    )
})

# 7. Scan the table.
scan: Scan = sp.scan(projection, where=where_filter)
scan.schema.to_arrow()
```

```python
pa.schema({
    'timestamp': pa.timestamp("ms"),
    'user': pa.struct({
        'login': pa.string(),
        'is_vip': pa.bool_(),
        'has_avatar': pa.bool_(),
        'base_score': pa.int64(),
        'avatar_bonus': pa.int64()
    }),
    'repository': pa.struct({
        'name': pa.string(),
        'is_popular': pa.bool_()
    }),
    'event_type': pa.struct({
        'is_issue': pa.bool_(),
        'issue_number': pa.int64()
    }),
    'total_score': pa.int64()
})
```

---
# Best Practices
Section: Tables
Source: https://docs.spiraldb.com/tables/best-practices

Spiral Tables are designed to be flexible, yet performant for a wide variety of use cases. However, there are some
best practices to follow when designing your table schema to ensure optimal performance.

## Key Schema

The key schema defines the sorting order of the table and is crucial for performance. In general, rows frequently
accessed together should be "close" in the key space.

- Avoid using random keys such as UUIDs, as they lead to low locality and frequent compactions.
- If possible, use keys that reflect the natural ordering of your data. If data doesn't have a natural ordering,
  consider using a time-based key (e.g., UUIDv7, timestamp column, etc.) to group recent data together.

## Column Groups

Column groups are horizontal partitions of a table that are (almost) completely independent from each other.
This design allows Spiral Tables to support very wide schemas with hundreds of thousands of columns efficiently.

- Columns that aren't used as filters should be separated from those that commonly appear in filters.
- Columns that commonly appear in filters together should be placed in the same column group.
- Columns with large cells (e.g. text), and especially binary columns, (e.g., images, files, etc.) should be
  placed in their own column groups.

When scanning a table, Spiral prunes fragments based on column groups. To do this efficiently,
**at least one column from a column group must be included in the projection or filter when scanning the table**.
If no column groups are referenced in your query, the scan would have to consider all column groups across the table,
which defeats the purpose of having independent column groups and can significantly impact performance.
Due to this design, **it is not currently possible to read or write just key columns without including at least
one column from any column group**. If this limitation is causing problems for your use case, please reach out.

## Interactive Mode

When in interactive mode, such as in a Jupyter notebook, it is recommended to enable disk caching to speed up
subsequent scans of the same table. Disk caching can be enabled and configured when creating a Spiral client.

```python
from spiral import Spiral

sp = Spiral(overrides={"cache.enabled": "1"})
```

Or with more configuration options:

```python
from spiral import Spiral

sp = Spiral(overrides={
    "keys_cache.enabled": "1",
    "keys_cache.memory_capacity_bytes": "1073741824",  # 1 GiB
    "keys_cache.disk_capacity_bytes": "10737418240",  # 10 GiB

    "metadat_cache.enabled": "1",
    "metadat_cache.memory_capacity_bytes": "1073741824",  # 1 GiB
    "metadat_cache.disk_capacity_bytes": "10737418240",  # 10 GiB
})
```

---
# spiral
Section: PySpiral
Source: https://docs.spiraldb.com/python-api/spiral

## Spiral

```python
class Spiral()
```

Main client for interacting with the Spiral data platform.

Configuration is loaded with the following priority (highest to lowest):

1. Explicit parameters.
2. Environment variables (`SPIRAL__*`)
3. Config file (`~/.spiral.toml`)
4. Default values (production URLs)

**Examples**:

```python
import spiral
# Default configuration
sp = spiral.Spiral()

# With config overrides
sp = spiral.Spiral(overrides={"limits.concurrency": "16"})
```

**Arguments**:

- `config` - Custom ClientSettings object. Defaults to global settings.
- `overrides` - Configuration overrides using dot notation,
  see the [Client Configuration](https://docs.spiraldb.com/config) page for a full list.

### config

```python
@property
def config() -> ClientSettings
```

Returns the client's configuration

### authn

```python
@property
def authn() -> Authn
```

Get the authentication handler for this client.

### list\_projects

```python
def list_projects() -> list["Project"]
```

List project IDs.

### create\_project

```python
def create_project(*,
                   description: str | None = None,
                   id_prefix: str | None = None,
                   **kwargs) -> "Project"
```

Create a project in the current, or given, organization.

### project

```python
def project(project_id: str) -> "Project"
```

Open an existing project.

### scan

```python
def scan(*projections: ExprLike,
         where: ExprLike | None = None,
         asof: datetime | int | None = None,
         shard: Shard | None = None,
         limit: int | None = None,
         hide_progress_bar: bool = False) -> Scan
```

Starts a read transaction on the Spiral.

**Arguments**:

- `projections` - a set of expressions that return struct arrays.
- `where` - a query expression to apply to the data.
- `asof` - execute the scan on the version of the table as of the given timestamp.
  Each transaction creates a new version of the table. The commit timestamp
  of the transaction can be used in asof. See `spiral txn` for transaction
  commands in CLI.
- `shard` - if provided, opens the scan only for the given shard.
  While shards can be provided when executing the scan, providing a shard here
  optimizes the scan planning phase and can significantly reduce metadata download.
- `limit` - maximum number of rows to return. When set, the scan will stop reading
  data once the limit is reached, providing efficient early termination.
- `hide_progress_bar` - if True, disables the progress bar during scan building.

### scan\_keys

```python
def scan_keys(*projections: ExprLike,
              where: ExprLike | None = None,
              asof: datetime | int | None = None,
              shard: Shard | None = None,
              limit: int | None = None,
              hide_progress_bar: bool = False) -> Scan
```

Starts a keys-only read transaction on the Spiral.

To determine which keys are present in at least one column group of the table, key scan the
table itself:

```
sp.scan_keys(table)
```

**Arguments**:

- `projections` - scan the keys of the column groups referenced by these expressions.
- `where` - a query expression to apply to the data.
- `asof` - execute the scan on the version of the table as of the given timestamp.
  Each transaction creates a new version of the table. The commit timestamp
  of the transaction can be used in asof. See `spiral txn` for transaction
  commands in CLI.
- `shard` - if provided, opens the scan only for the given shard.
  While shards can be provided when executing the scan, providing a shard here
  optimizes the scan planning phase and can significantly reduce metadata download.
- `limit` - maximum number of rows to return. When set, the scan will stop reading
  data once the limit is reached, providing efficient early termination.
- `hide_progress_bar` - if True, disables the progress bar during scan building.

### sample

```python
def sample(*projections: ExprLike,
           sampler: Callable[[pa.Array], pa.Array],
           where: ExprLike | None = None,
           asof: datetime | int | None = None,
           hide_progress_bar: bool = False) -> SampleScan
```

Creates a SampleScan that can be inspected before execution.

NOTE: This API is experimental and will likely change in the near future.

For most use cases, prefer using `sample()` directly. This method is useful
when you need to inspect the key\_scan and value\_scan plans before executing
the sample operation. Call `to_record_batches()` on the returned SampleScan to
execute and get a RecordBatchReader.

**Arguments**:

- `projections` - a set of expressions that return struct arrays.
- `sampler` - A function that takes a struct array of keys and returns a boolean array
  indicating which keys to sample.
- `where` - a query expression to apply to the data.
- `asof` - execute the scan on the version of the table as of the given timestamp.
  Each transaction creates a new version of the table. The commit timestamp
  of the transaction can be used in asof. See `spiral txn` for transaction
  commands in CLI.
- `hide_progress_bar` - if True, disables the progress bar during scan building.

**Returns**:

A SampleScan object with key\_plan(), value\_plan(), and to\_record\_batches() methods.

### search

```python
def search(top_k: int,
           *rank_by: ExprLike,
           filters: ExprLike | None = None,
           freshness_window: timedelta | None = None) -> pa.RecordBatchReader
```

Queries the index with the given rank by and filters clauses. Returns a stream of scored keys.

**Arguments**:

- `top_k` - The number of top results to return.
- `rank_by` - Rank by expressions are combined for scoring.
  See `se.text.find` and `se.text.boost` for scoring expressions.
- `filters` - The `filters` expression is used to filter the results.
  It must return a boolean value and use only conjunctions (ANDs). Expressions in filters
  statement are considered either a `must` or `must_not` clause in search terminology.
- `freshness_window` - If provided, the index will not be refreshed if its freshness does not exceed this window.

### resume\_scan

```python
def resume_scan(scan_bytes: bytes) -> Scan
```

Open a serialized scan in this instance of a client.

**Arguments**:

- `scan_bytes` - The compressed scan bytes returned by a previous scan's to\_bytes\_compressed().

### compute\_shards

```python
def compute_shards(*projections: ExprLike,
                   where: ExprLike | None = None,
                   asof: datetime | int | None = None,
                   batch_size: int | None = None) -> list[Shard]
```

Computes shards over the given projections and filter.

**Arguments**:

- `projections` - a set of expressions that return struct arrays.
- `where` - a query expression to apply to the data.
- `asof` - execute the scan on the version of the table as of the given timestamp.
  Each transaction creates a new version of the table. The commit timestamp
  of the transaction can be used in asof. See `spiral txn` for transaction
  commands in CLI.
- `batch_size` - a specific batch size, otherwise the shards will be computed based
  on the fragments in the table.

## Project

```python
class Project()
```

### table

```python
def table(identifier: str) -> Table
```

Open a table with a `dataset.table` identifier, or `table` name using the `default` dataset.

### create\_table

```python
def create_table(identifier: str,
                 *,
                 key_schema: Schema
                 | pa.Schema
                 | Iterable[pa.Field[pa.DataType]]
                 | Iterable[tuple[str, pa.DataType]]
                 | Mapping[str, pa.DataType],
                 root_uri: Uri | None = None,
                 exist_ok: bool = False) -> Table
```

Create a new table in the project.

**Arguments**:

- `identifier` - The table identifier, in the form `dataset.table` or `table`.
- `key_schema` - The schema of the table's keys.
- `root_uri` - The root URI for the table.
- `exist_ok` - If True, do not raise an error if the table already exists.

### move\_table

```python
def move_table(identifier: str, new_dataset: str)
```

Move a table to a new dataset in the project.

**Arguments**:

- `identifier` - The table identifier, in the form `dataset.table` or `table`.
- `new_dataset` - The dataset into which to move this table.

### rename\_table

```python
def rename_table(identifier: str, new_table: str)
```

Move a table to a new dataset in the project.

**Arguments**:

- `identifier` - The table identifier, in the form `dataset.table` or `table`.
- `new_dataset` - The dataset into which to move this table.

### drop\_table

```python
def drop_table(identifier: str)
```

Drop a table from the project.

**Arguments**:

- `identifier` - The table identifier, in the form `dataset.table` or `table`.

### text\_index

```python
def text_index(name: str) -> TextIndex
```

Returns the index with the given name.

### create\_text\_index

```python
def create_text_index(name: str,
                      *projections: ExprLike,
                      where: ExprLike | None = None,
                      root_uri: Uri | None = None,
                      exist_ok: bool = False) -> TextIndex
```

Creates a text index over the table projection.

**Arguments**:

- `name` - The index name. Must be unique within the project.
- `projections` - At least one projection expression is required.
  All projections must reference the same table.
- `where` - An optional filter expression to apply to the index.
- `root_uri` - The root URI for the index.
- `exist_ok` - If True, do not raise an error if the index already exists.

### compute

```python
def compute() -> Compute
```

Gets compute resources configured for this project.

NOTE: Compute is experimental and will likely change in the near future.

## Table

```python
class Table(Expr)
```

API for interacting with a SpiralDB's Table.

Spiral Table is a powerful and flexible way for storing, analyzing,
and querying massive and/or multimodal datasets. The data model will feel familiar
to users of SQL- or DataFrame-style systems, yet is designed to be more flexible, more powerful,
and more useful in the context of modern data processing.

Tables are stored and queried directly from object storage.

### identifier

```python
@property
def identifier() -> str
```

Returns the fully qualified identifier of the table.

### project

```python
@property
def project() -> str | None
```

Returns the project of the table.

### dataset

```python
@property
def dataset() -> str | None
```

Returns the dataset of the table.

### name

```python
@property
def name() -> str | None
```

Returns the name of the table.

### key\_schema

```python
@property
def key_schema() -> Schema
```

Returns the key schema of the table.

### schema

```python
def schema() -> Schema
```

Returns the FULL schema of the table.

NOTE: This can be expensive for large tables.

### write

```python
def write(table: LazyTableLike,
          push_down_nulls: bool = False,
          **kwargs) -> None
```

Write an item to the table inside a single transaction.

**Arguments**:

- `push_down_nulls`: Whether to push down nullable structs down its children. E.g. `[{"a": 1}, null]` would
  become `[{"a": 1}, {"a": null}]`. SpiralDB doesn't allow struct-level nullability, so use this option if your
  data contains nullable structs.
- `table`: The table to write.

### enrich

```python
def enrich(*projections: ExprLike,
           where: ExprLike | None = None) -> Enrichment
```

Returns an Enrichment object that, when applied, produces new columns.

Enrichment can be applied in different ways, e.g. distributed.

**Arguments**:

- `projections`: Projection expressions deriving new columns to write back.
  Expressions can be over multiple Spiral tables, but all tables including
  this one must share the same key schema.
- `where`: Optional filter expression to apply when reading the input tables.

### drop\_columns

```python
def drop_columns(column_paths: list[str]) -> None
```

Drops the specified columns from the table.

**Arguments**:

- `column_paths`: Fully qualified column names. (e.g., "column\_name" or "nested.field").
  All columns must exist, if a column doesn't exist the function will return an error.

### snapshot

```python
def snapshot(asof: datetime | int | None = None) -> Snapshot
```

Returns a snapshot of the table at the given timestamp.

Each transaction creates a new version of the table. The commit timestamp
of the transaction can be used in asof. See `spiral txn` for transaction
commands in CLI.

### txn

```python
def txn(**kwargs) -> Transaction
```

Begins a new transaction. Transaction must be committed for writes to become visible.

While transaction can be used to atomically write data to the table,
it is important that the primary key columns are unique within the transaction.
The behavior is undefined if this is not the case.

### to\_arrow\_dataset

```python
def to_arrow_dataset() -> "ds.Dataset"
```

Returns a PyArrow Dataset representing the table.

### to\_polars\_lazy\_frame

```python
def to_polars_lazy_frame() -> "pl.LazyFrame"
```

Returns a Polars LazyFrame for the Spiral table.

### to\_duckdb\_relation

```python
def to_duckdb_relation() -> "duckdb.DuckDBPyRelation"
```

Returns a DuckDB relation for the Spiral table.

### key

```python
def key(*parts) -> Key
```

Creates a Key object for the given parts according to the table's key schema.

**Arguments**:

- `parts` - Parts of the key. Must be a valid prefix of the table's key schema.

**Returns**:

Key object representing the given parts.

### set\_metadata

```python
def set_metadata(metadata: dict[str, bytes]) -> None
```

Set metadata on the table.

Entries are upserted: existing keys are overwritten, keys not mentioned are left unchanged.

**Warnings**:

The metadata is not versioned. Parallel calls to set\_metadata (on the same
machine or on different machines) race to set a value. Ordering is not guaranteed.

Each key + value pair can be at most 8 KiB, keys can be at most 255 characters, and a
table can have at most 1024 entries.

**Examples**:

```
table.set_metadata({"source": b"sensor", "version": b"2"})
```

If you need to store more complex data, use json to serialize:

```
import json

table.set_metadata({"config": json.dumps({"f1": 123, "f2": "abc"}).encode()})
```

**Arguments**:

- `metadata` - A dict mapping string keys to bytes values.

### get\_metadata

```python
def get_metadata() -> dict[str, bytes]
```

Get metadata for the table.

**Returns**:

Dict mapping string keys to bytes values.

### drop\_metadata

```python
def drop_metadata(keys: list[str]) -> None
```

Remove the given keys from the table's metadata.

Keys that do not exist are silently ignored.

**Arguments**:

- `keys` - List of metadata keys to remove.

### column\_group

```python
def column_group(*paths: str) -> ColumnGroup
```

Creates a ColumnGroup object for the given column paths.

**Arguments**:

- `paths` - Path to the column group. List of column names or dot-separated paths.

**Returns**:

ColumnGroup object representing the given columns.

## Snapshot

```python
class Snapshot()
```

Spiral table snapshot.

A snapshot represents a point-in-time view of a table.

### asof

```python
@property
def asof() -> Timestamp
```

Returns the asof timestamp of the snapshot.

### schema

```python
def schema() -> Schema
```

Returns the schema of the snapshot.

### table

```python
@property
def table() -> "Table"
```

Returns the table associated with the snapshot.

### to\_arrow\_dataset

```python
def to_arrow_dataset() -> "ds.Dataset"
```

Returns a PyArrow Dataset representing the table.

### to\_polars\_lazy\_frame

```python
def to_polars_lazy_frame() -> "pl.LazyFrame"
```

Returns a Polars LazyFrame for the Spiral table.

### to\_duckdb\_relation

```python
def to_duckdb_relation() -> "duckdb.DuckDBPyRelation"
```

Returns a DuckDB relation for the Spiral table.

### set\_column\_metadata

```python
def set_column_metadata(column_path: str, metadata: dict[str, bytes]) -> None
```

Set metadata on a column, identified by its dotted path.

Entries are upserted: existing keys are overwritten, keys not mentioned are left unchanged.

**Arguments**:

- `column_path` - Dotted path to the column (e.g. "color" or "video.frames").
- `metadata` - A dict mapping string keys to bytes values.

### get\_column\_metadata

```python
def get_column_metadata(column_path: str) -> dict[str, bytes]
```

Get metadata for a column, identified by its dotted path.

**Arguments**:

- `column_path` - Dotted path to the column (e.g. "color" or "video.frames").

**Returns**:

Dict mapping string keys to bytes values.

### drop\_column\_metadata

```python
def drop_column_metadata(column_path: str, keys: list[str]) -> None
```

Remove the given keys from a column's metadata.

Keys that do not exist are silently ignored.

**Arguments**:

- `column_path` - Dotted path to the column (e.g. "color" or "video.frames").
- `keys` - List of metadata keys to remove.

## Scan

```python
class Scan()
```

Scan object.

### limit

```python
@property
def limit() -> int | None
```

Returns the limit set on this scan, if any.

### schema

```python
@property
def schema() -> Schema
```

Returns the schema of the scan.

### key\_schema

```python
@property
def key_schema() -> Schema
```

Returns the key schema of the scan.

### table\_ids

```python
@property
def table_ids() -> list[str]
```

Returns the set of table IDs referenced by this scan.

### plan

```python
def plan() -> "Plan"
```

Builds the executable plan for this scan.

### to\_bytes\_compressed

```python
def to_bytes_compressed() -> bytes
```

Get the scan state as compressed bytes.

This state can be used to resume the scan later using Spiral.resume\_scan().

**Returns**:

Compressed bytes representing the internal scan state.

### \_\_getstate\_\_

```python
def __getstate__() -> bytes
```

Serialize scan for pickling.

Enables seamless integration with distributed systems like Ray, Dask, and
Python's multiprocessing without requiring manual serialization.

**Returns**:

Zstd-compressed bytes containing JSON-serialized config and scan state.

### \_\_setstate\_\_

```python
def __setstate__(state: bytes) -> None
```

Deserialize scan from pickled state.

**Arguments**:

- `state` - Zstd-compressed bytes from **getstate**.

## Plan

```python
class Plan()
```

Executable plan object.

### limit

```python
@property
def limit() -> int | None
```

Returns the limit set on this scan, if any.

### metrics

```python
@property
def metrics() -> dict[str, Any]
```

Returns metrics about the plan.

### schema

```python
@property
def schema() -> Schema
```

Returns the schema of the plan.

### key\_schema

```python
@property
def key_schema() -> Schema
```

Returns the key schema of the plan.

### is\_empty

```python
def is_empty() -> bool
```

Check if the Spiral is empty for the given key range.

False negatives are possible, but false positives are not,
i.e. is\_empty can return False and scan can return zero rows.

### to\_record\_batches

```python
def to_record_batches(
        *,
        shards: Iterable[Shard] | None = None,
        key_table: TableLike | None = None,
        batch_readahead: int | None = None,
        batch_aligned: bool | None = None,
        grouping_prefix: list[str] | None = None,
        explode: str | None = None,
        hide_progress_bar: bool | None = None) -> pa.RecordBatchReader
```

Read as a stream of RecordBatches.

**Arguments**:

- `shards` - Optional iterable of shards to evaluate.
  If provided, only the specified shards will be read.
  Must not be provided together with key\_table.
- `key_table` - a table of keys to "take" (including aux columns for cell-push-down).
  Key table must be either a table or a stream of table-like objects
  (e.g. Arrow's RecordBatchReader). For optimal performance, each batch should
  contain sorted and unique keys. Unsorted and duplicate keys are still supported,
  but performance is less predictable.
- `batch_readahead` - the number of batches to prefetch in the background.
- `batch_aligned` - if True, ensures that batches are aligned with key\_table batches.
  The stream will yield batches that correspond exactly to the batches in key\_table,
  but may be less efficient and use more memory (aligning batches requires buffering and maybe a copy).
  Must only be used when key\_table is provided.
- `grouping_prefix` - list of key column names to group by. Must be a prefix of the key schema.
  Non-group columns are collected into List arrays.
- `explode` - name of a `List<Struct>` column to unnest. The struct fields become
  top-level columns and other columns are repeated to match. Typically used
  together with grouping\_prefix.
- `hide_progress_bar` - DEPRECATED, the progress bar can be hidden when opening a scan

### to\_unordered\_record\_batches

```python
def to_unordered_record_batches(
        *,
        shards: Iterable[Shard] | None = None,
        key_table: TableLike | None = None,
        batch_readahead: int | None = None,
        hide_progress_bar: bool | None = None) -> pa.RecordBatchReader
```

Read as a stream of RecordBatches, NOT ordered by key.

**Arguments**:

- `shards` - Optional iterable of shards to evaluate.
  If provided, only the specified shards will be read.
  Must not be provided together with key\_table.
- `key_table` - a table of keys to "take" (including aux columns for cell-push-down).
  Key table must be either a table or a stream of table-like objects
  (e.g. Arrow's RecordBatchReader). For optimal performance, each batch should
  contain sorted and unique keys. Unsorted and duplicate keys are still supported,
  but performance is less predictable.
- `batch_readahead` - the number of batches to prefetch in the background.
- `hide_progress_bar` - DEPRECATED, the progress bar can be hidden when opening a scan

### to\_table

```python
def to_table(*,
             shards: Iterable[Shard] | None = None,
             key_table: TableLike | None = None,
             batch_readahead: int | None = None,
             hide_progress_bar: bool | None = None) -> pa.Table
```

Read into a single PyArrow Table.

**Warnings**:

This downloads the entire Spiral Table into memory on this machine.

**Arguments**:

- `shards` - Optional iterable of shards to evaluate.
  If provided, only the specified shards will be read.
  Must not be provided together with key\_table.
- `key_table` - a table of keys to "take" (including aux columns for cell-push-down).
- `batch_readahead` - the number of batches to prefetch in the background.
- `hide_progress_bar` - If True, disables the progress bar during reading.

**Returns**:

pyarrow\.Table

### to\_dask

```python
def to_dask(*,
            shards: Iterable[Shard] | None = None,
            batch_readahead: int | None = None,
            hide_progress_bar: bool | None = None) -> "dd.DataFrame"
```

Read into a Dask DataFrame.

Requires the `dask` package to be installed.

Dask execution has some limitations, e.g. UDFs are not currently supported. These limitations
usually manifest as serialization errors when Dask workers attempt to serialize the state. If you are
encountering such issues, please reach out to the support for assistance.

**Arguments**:

- `shards` - Optional iterable of shards to evaluate.
  If provided, only the specified shards will be read.
- `batch_readahead` - the number of batches to prefetch in the background.
  Applies to each shard read task.
- `hide_progress_bar` - If True, disables the progress bar during reading.

**Returns**:

dask.dataframe.DataFrame

### to\_ray

```python
def to_ray(*,
           shards: Iterable[Shard] | None = None,
           batch_readahead: int | None = None,
           hide_progress_bar: bool | None = None) -> "ray.data.Dataset"
```

Read into a Ray Dataset.

Requires the `ray` package to be installed.

**Warnings**:

The output row order is not guaranteed. Shards are read concurrently and Ray
does not guarantee inter-block ordering. Sort the resulting dataset if order matters.

If the Scan returns zero rows, the resulting Ray Dataset will have [an empty
schema](https://github.com/ray-project/ray/issues/59946).

**Arguments**:

- `shards` - Optional iterable of shards to evaluate.
  If provided, only the specified shards will be read.
- `batch_readahead` - the number of batches to prefetch in the background.
- `hide_progress_bar` - If True, disables the progress bar during reading.

**Returns**:

- `ray.data.Dataset` - A Ray Dataset distributed across shards.

### to\_pandas

```python
def to_pandas(*,
              shards: Iterable[Shard] | None = None,
              key_table: TableLike = None,
              batch_readahead: int | None = None,
              hide_progress_bar: bool | None = None) -> "pd.DataFrame"
```

Read into a Pandas DataFrame.

Requires the `pandas` package to be installed.

**Warnings**:

This downloads the entire Spiral Table into memory on this machine.

**Arguments**:

- `shards` - Optional iterable of shards to evaluate.
  If provided, only the specified shards will be read.
  Must not be provided together with key\_table.
- `key_table` - a table of keys to "take" (including aux columns for cell-push-down).
- `batch_readahead` - the number of batches to prefetch in the background.
- `hide_progress_bar` - If True, disables the progress bar during reading.

**Returns**:

pandas.DataFrame

### to\_polars

```python
def to_polars(*,
              shards: Iterable[Shard] | None = None,
              key_table: TableLike = None,
              batch_readahead: int | None = None,
              hide_progress_bar: bool | None = None) -> "pl.DataFrame"
```

Read into a Polars DataFrame.

Requires the `polars` package to be installed.

**Warnings**:

This downloads the entire Spiral Table into memory on this machine. To lazily interact
with a Spiral Table try Table.to\_polars\_lazy\_frame.

**Arguments**:

- `shards` - Optional iterable of shards to evaluate.
  If provided, only the specified shards will be read.
  Must not be provided together with key\_table.
- `key_table` - a table of keys to "take" (including aux columns for cell-push-down).
- `batch_readahead` - the number of batches to prefetch in the background.
- `hide_progress_bar` - If True, disables the progress bar during reading.

**Returns**:

polars.DataFrame

### to\_data\_loader

```python
def to_data_loader(seed: int = 42,
                   shuffle_buffer_size: int = 0,
                   batch_size: int = 32,
                   **kwargs) -> "SpiralDataLoader"
```

Read into a Torch-compatible DataLoader for single-node training.

**Arguments**:

- `seed` - Random seed for reproducibility.
- `shuffle_buffer_size` - Size of shuffle buffer.
  Zero means no shuffling.
- `batch_size` - Batch size.
- `**kwargs` - Additional arguments passed to SpiralDataLoader constructor.

**Returns**:

SpiralDataLoader with shuffled shards.

### to\_distributed\_data\_loader

```python
def to_distributed_data_loader(world: "World | None" = None,
                               shards: Iterable[Shard] | None = None,
                               seed: int = 42,
                               shuffle_buffer_size: int = 0,
                               batch_size: int = 32,
                               **kwargs) -> "SpiralDataLoader"
```

Read into a Torch-compatible DataLoader for distributed training.

**Arguments**:

- `world` - World configuration with rank and world\_size.
  If None, auto-detects from torch.distributed.
- `shards` - Optional sharding. Sharding is global, i.e. the world will be used to select
  the shards for this rank. If None, uses scan's natural sharding.
- `seed` - Random seed for reproducibility.
- `shuffle_buffer_size` - Size of shuffle buffer.
  Zero means no shuffling.
- `batch_size` - Batch size.
- `**kwargs` - Additional arguments passed to SpiralDataLoader constructor.

**Returns**:

SpiralDataLoader with shards partitioned for this rank.

Auto-detect from PyTorch distributed:

```python
import spiral
from spiral.dataloader import SpiralDataLoader, World
from spiral.demo import fineweb

sp = spiral.Spiral()
fineweb_table = fineweb(sp)

plan = sp.scan(fineweb_table[["text"]]).plan()

loader: SpiralDataLoader = plan.to_distributed_data_loader(batch_size=32)
```

Explicit world configuration:

```python
world = World(rank=0, world_size=4)
loader: SpiralDataLoader = plan.to_distributed_data_loader(world=world, batch_size=32)
```

### resume\_data\_loader

```python
def resume_data_loader(state: dict[str, Any], **kwargs) -> "SpiralDataLoader"
```

Create a DataLoader from checkpoint state, resuming from where it left off.

This is the recommended way to resume training from a checkpoint. It extracts
the seed, samples\_yielded, and shards from the state dict and creates a new
DataLoader that will skip the already-processed samples.

**Arguments**:

- `state` - Checkpoint state from state\_dict().
- `**kwargs` - Additional arguments to pass to SpiralDataLoader constructor.
  These will override values in the state dict where applicable.

**Returns**:

New SpiralDataLoader instance configured to resume from the checkpoint.

Save checkpoint during training:

```python
import spiral
from spiral.dataloader import SpiralDataLoader, World
from spiral.demo import images, fineweb

sp = spiral.Spiral()
table = images(sp)
fineweb_table = fineweb(sp)

plan = sp.scan(fineweb_table[["text"]]).plan()

loader = plan.to_distributed_data_loader(batch_size=32, seed=42)
checkpoint = loader.state_dict()
```

Resume later - uses same shards from checkpoint:

```python
resumed_loader = plan.resume_data_loader(
    checkpoint,
    batch_size=32,
    # An optional transform_fn may be provided:
    # transform_fn=my_transform,
)
```

### to\_iterable\_dataset

```python
def to_iterable_dataset(shards: Iterable[Shard] | None = None,
                        seed: int = 42,
                        shuffle_buffer_size: int = 0,
                        batch_readahead: int | None = None,
                        infinite: bool = False) -> "hf.IterableDataset"
```

Returns a Huggingface's IterableDataset.

Requires `datasets` package to be installed.

Note: For new code, consider using SpiralDataLoader instead.

**Arguments**:

- `shards` - Optional iterable of shards to read. If None, uses scan's natural sharding.
- `seed` - Base random seed for deterministic shuffling and checkpointing.
- `shuffle_buffer_size` - Size of shuffle buffer for within-shard shuffling.
  0 means no shuffling.
- `batch_readahead` - Controls how many batches to read ahead concurrently.
  If pipeline includes work after reading (e.g. decoding, transforming, ...) this can be set higher.
  Otherwise, it should be kept low to reduce next batch latency.
  Defaults to min(number of CPU cores, 64) or to shuffle.buffer\_size/16 if shuffle is not None.
- `infinite` - If True, the returned IterableDataset will loop infinitely over the data,
  re-shuffling ranges after exhausting all data.

### shards

```python
def shards() -> list[Shard]
```

Get list of shards for this plan.

The shards are based on the scan's physical data layout (file fragments).
Each shard contains a key range and cardinality (set to None when unknown).

**Returns**:

List of Shard objects with key range and cardinality (if known).

## Transaction

```python
@final
class Transaction()
```

Spiral table transaction.

While transaction can be used to atomically write data to the table,
it is important that the primary key columns are unique within the transaction.

### status

```python
@property
def status() -> str
```

The status of the transaction.

### table

```python
@property
def table() -> Table
```

The table associated with this transaction.

### is\_empty

```python
def is_empty() -> bool
```

Check if the transaction has no operations.

### write

```python
def write(table: LazyTableLike, push_down_nulls: bool = False)
```

Write an item to the table inside a single transaction.

**Arguments**:

- `push_down_nulls`: Whether to push down nullable structs down its children. E.g. `[{"a": 1}, null]` would
  become `[{"a": 1}, {"a": null}]`. SpiralDB doesn't allow struct-level nullability, so use this option if your
  data contains nullable structs.
- `table`: The table to write.

### writeback

```python
def writeback(plan: Plan, *, shards: Iterable[Shard] | None = None)
```

Write back the results of a plan to the table.

**Arguments**:

- `plan`: The plan to write back.
  The plan does NOT need to be over the same table as transaction,
  but it does need to have the same key schema.
- `shards`: The shards to read from. If not provided, all shards are read.

### to\_ray\_datasink

```python
def to_ray_datasink() -> ray.data.Datasink[tuple[Timestamp, list[str]]]
```

Returns a Ray Datasink which writes into this transaction.

### drop\_columns

```python
def drop_columns(column_paths: list[str])
```

Drops the specified columns from the table.

**Arguments**:

- `column_paths`: Fully qualified column names. (e.g., "column\_name" or "nested.field").
  All columns must exist, if a column doesn't exist the function will return an error.

### compact\_key\_space

```python
def compact_key_space()
```

Compact the key space of the table.

### compact\_column\_group

```python
def compact_column_group(column_group: ColumnGroup,
                         strategy: CompactionStrategy,
                         *,
                         key_range: KeyRange | Shard | None = None)
```

Compact the specified column group of the table.

This method is a convenience method that combines planning, execution, and progress submission,
when there is no distributed context, i.e. compaction is run on a single node.
See [https://docs.spiraldb.com/config](https://docs.spiraldb.com/config) for configuration options.

**Arguments**:

- `column_group`: The column group to compact. Can be obtained from Table.column\_groups.
- `strategy`: The compaction strategy to use.
- `key_range`: Optional key range or shard to limit the range of compaction.
  Requires that no fragment overlaps the given key range, but is not covered by it.

### column\_group\_compaction

```python
def column_group_compaction(
        column_group: ColumnGroup,
        strategy: CompactionStrategy,
        *,
        key_range: KeyRange | Shard | None = None) -> Compaction
```

Plan a compaction for the specified column group.

Called by the "driver".

**Arguments**:

- `column_group`: The column group to compact. Can be obtained from Table.column\_groups.
- `strategy`: The compaction strategy to use.
- `key_range`: Optional key range or shard to limit the range of compaction.
  Requires that no fragment overlaps the given key range, but is not covered by it.

**Returns**:

A Compaction object representing the planned compaction.

### column\_group\_compaction\_execute\_tasks

```python
def column_group_compaction_execute_tasks(tasks: CompactionTasks) -> None
```

Execute the given compaction tasks inside the transaction.

Called by "worker" inside of a worker transaction.

Operations can be collected by calling `take()` after this method.

**Arguments**:

- `tasks`: The compaction tasks to execute.

### take

```python
def take() -> TransactionOps
```

Take the operations from the transaction

Transaction can no longer be committed or aborted after calling this method.
.

### include

```python
def include(ops: TransactionOps)
```

Include the given operations in the transaction.

Checks for conflicts between the included operations and any existing operations.

IMPORTANT: The `self` transaction must be started at or before the timestamp of the included operations.

### commit

```python
def commit(*, txn_dump: str | None = None)
```

Commit the transaction.

### load\_dumps

```python
@staticmethod
def load_dumps(client: Spiral, *txn_dump: str) -> list[Operation]
```

Load a transaction from a dump file.

### abort

```python
def abort()
```

Abort the transaction.

## Compaction

```python
class Compaction()
```

Compaction is used to optimize the physical layout of data in a table's column group.

### column\_group

```python
@property
def column_group() -> ColumnGroup
```

The column group being compacted.

### run

```python
def run() -> None
```

Run the compaction to completion by repeatedly getting tasks and executing them.

### run\_dask

```python
def run_dask(client: dask.distributed.Client) -> None
```

Run the compaction using distributed Dask workers.

Tasks are partitioned and distributed across Dask workers for parallel execution.

**Arguments**:

- `client` - Dask distributed client. Required. Create with:
  `from dask.distributed import Client; client = Client()`

### run\_ray

```python
def run_ray() -> None
```

Run the compaction using distributed Ray workers.

Ray must be initialized before calling this method.
To initialize Ray run `ray.init()` for a local cluster or `ray.init(address="ray://<address>:<port>")`
to connect to an existing cluster.

Tasks are partitioned and distributed across Ray workers for parallel execution.

## Enrichment

```python
class Enrichment()
```

An enrichment is used to derive new columns from the existing ones, such as fetching data from object storage
with `se.s3.get` or compute embeddings. With column groups design supporting 100s of thousands of columns,
horizontally expanding tables are a powerful primitive. Spiral optimizes enrichments where source and destination
table are the same.

### table

```python
@property
def table() -> Table
```

The table to write back into.

### projection

```python
@property
def projection() -> Expr
```

The projection expression.

### where

```python
@property
def where() -> Expr | None
```

The filter expression.

### run

```python
def run(*, shards: Iterable[Shard] | None = None) -> None
```

Apply the enrichment onto the table in a streaming fashion.

For large tables, consider using `run_dask()` or `run_ray()` for distributed execution.

**Arguments**:

- `shards` - Optional iterable of shards to process. If not provided, processes all data.

### run\_dask

```python
def run_dask(client: dask.distributed.Client,
             *,
             shards: Iterable[Shard] | None = None,
             checkpoint: str | None = None) -> None
```

Use distributed Dask to apply the enrichment. Requires `dask[distributed]` to be installed.

How shards are determined:

- If `shards` is provided, those will be used directly.
- Else, if `checkpoint` is provided, shards will be loaded from the checkpoint file.
- Else, the scan's default sharding will be used.

**Arguments**:

- `client` - Dask distributed client. Required. Create with:
  `from dask.distributed import Client; client = Client()`
- `shards` - Optional iterable of shards to process. If not provided, uses default sharding or
  checkpoint sharding if available.
- `checkpoint` - Optional path to checkpoint file for incremental progress. If the file exists,
  processing will resume from failed shards. Failed shards are written back to this file.

### run\_ray

```python
def run_ray(*,
            shards: Iterable[Shard] | None = None,
            checkpoint: str | None = None) -> None
```

Use distributed Ray to apply the enrichment. Requires `ray` to be installed and initialized.

Ray must be initialized before calling this method.
To initialize Ray run `ray.init()` for a local cluster or `ray.init(address="ray://<address>:<port>")`
to connect to an existing cluster.

Ray execution has some limitations. These limitations usually manifest as serialization
errors when Ray workers attempt to serialize the state. If you are encountering such issues,
consider splitting the enrichment into UDF-only derivation that will be executed in a
streaming fashion, followed by a Ray enrichment for the rest of the computation.  If that is
not possible, please reach out to support for assistance.

How shards are determined:

- If `shards` is provided, those will be used directly.
- Else, if `checkpoint` is provided, shards will be loaded from the checkpoint file.
- Else, the scan's default sharding will be used.

**Arguments**:

- `shards` - Optional iterable of shards to process. If not provided, uses default sharding or
  checkpoint sharding if available.
- `checkpoint` - Optional path to checkpoint file for incremental progress. If the file exists,
  processing will resume from failed shards. Failed shards are written back to this file.

## World

```python
@dataclass(frozen=True)
class World()
```

Distributed training configuration.

**Attributes**:

- `rank` - Process rank (0 to world\_size-1).
- `world_size` - Total number of processes.

### shards

```python
def shards(shards: list[Shard],
           shuffle_seed: int | None = None) -> list[Shard]
```

Partition shards for distributed training.

**Arguments**:

- `shards` - List of Shard objects to partition.
- `shuffle_seed` - Optional seed to shuffle before partitioning.

**Returns**:

Subset of shards for this rank (round-robin partitioning).

### from\_torch

```python
@classmethod
def from_torch(cls) -> World
```

Auto-detect world configuration from PyTorch distributed.

## SpiralDataLoader

```python
class SpiralDataLoader()
```

DataLoader optimized for Spiral's multi-threaded streaming architecture.

Unlike PyTorch's DataLoader which uses multiprocessing for I/O (num\_workers),
SpiralDataLoader leverages Spiral's efficient Rust-based streaming and only
uses multiprocessing for CPU-bound post-processing transforms.

Key differences from PyTorch DataLoader:

- No num\_workers for I/O (Spiral's Rust layer is already multi-threaded)
- map\_workers for parallel post-processing (tokenization, decoding, etc.)
- Built-in checkpoint support via skip\_samples
- Explicit shard-based architecture for distributed training

Simple usage:

```python
def train_step(batch):
    pass

loader = SpiralDataLoader(plan, batch_size=32)
for batch in loader:
    train_step(batch)
```

With parallel transforms:

```python
def tokenize_batch(batch):
    # ...
    return batch

loader = SpiralDataLoader(
    plan,
    batch_size=32,
    transform_fn=tokenize_batch,
    map_workers=4,
)
```

### \_\_init\_\_

```python
def __init__(plan: Plan,
             *,
             shards: list[Shard] | None = None,
             shuffle_shards: bool = True,
             seed: int = 42,
             skip_samples: int = 0,
             shuffle_buffer_size: int = 0,
             batch_size: int = 32,
             batch_readahead: int | None = None,
             transform_fn: Callable[[pa.RecordBatch], Any] | None = None,
             map_workers: int = 0,
             infinite: bool = False)
```

Initialize SpiralDataLoader.

**Arguments**:

- `plan` - Spiral plan to load data from.
- `shards` - Optional list of Shard objects to read. If None, uses
  plan's natural sharding based on physical layout.
- `shuffle_shards` - Whether to shuffle the list of shards.
  Uses the provided seed.
- `seed` - Base random seed for deterministic shuffling and checkpointing.
- `skip_samples` - Number of samples to skip at the beginning (for resuming
  from checkpoint).
- `shuffle_buffer_size` - Size of shuffle buffer for within-shard shuffling.
  0 means no shuffling.
- `batch_size` - Number of rows per batch.
- `batch_readahead` - Number of batches to prefetch in background. If None,
  uses a sensible default based on whether transforms are applied.
- `transform_fn` - Optional function to transform each batch. Takes a PyArrow
  RecordBatch and returns any type. Users can call batch.to\_pydict()
  inside the function if they need a dict. If map\_workers > 0, this
  function must be picklable.
- `map_workers` - Number of worker processes for parallel transform\_fn
  application. 0 means single-process (no parallelism). Use this for
  CPU-bound transforms like tokenization or audio decoding.
- `infinite` - Whether to cycle through the dataset infinitely. If True,
  the dataloader will repeat the dataset indefinitely. If False,
  the dataloader will stop after going through the dataset once.

### \_\_iter\_\_

```python
def __iter__() -> Iterator[Any]
```

Iterate over batches.

### state\_dict

```python
def state_dict() -> dict[str, Any]
```

Get checkpoint state for resuming.

**Returns**:

Dictionary containing samples\_yielded, seed, and shards.

Example checkpoint:

```python
loader = SpiralDataLoader(plan, batch_size=32, seed=42)
for i, batch in enumerate(loader):
    if i == 10:
        checkpoint = loader.state_dict()
        break
```

Example resume:

```python
loader = SpiralDataLoader.from_state_dict(plan, checkpoint, batch_size=32)
```

### from\_state\_dict

```python
@classmethod
def from_state_dict(cls, plan: Plan, state: dict[str, Any],
                    **kwargs) -> SpiralDataLoader
```

Create a DataLoader from checkpoint state, resuming from where it left off.

This is the recommended way to resume training from a checkpoint. It extracts
the seed, samples\_yielded, and shards from the state dict and creates a new
DataLoader that will skip the already-processed samples.

**Arguments**:

- `plan` - Spiral plan to load data from.
- `state` - Checkpoint state from state\_dict().
- `**kwargs` - Additional arguments to pass to SpiralDataLoader constructor.
  These will override values in the state dict where applicable.

**Returns**:

New SpiralDataLoader instance configured to resume from the checkpoint.

Save checkpoint during training:

```python
loader = plan.to_distributed_data_loader(batch_size=32, seed=42)
checkpoint = loader.state_dict()
```

Resume later using the same shards from checkpoint:

```python
resumed_loader = SpiralDataLoader.from_state_dict(
    plan,
    checkpoint,
    batch_size=32,
    # An optional transform_fn may be provided:
    # transform_fn=my_transform,
)
```

---
# spiral.expressions
Section: PySpiral
Source: https://docs.spiraldb.com/python-api/expressions

#### aux

```python
def aux(name: builtins.str, dtype: pa.DataType) -> Expr
```

Create a variable expression referencing a column in the auxiliary table.

Auxiliary table is optionally given to `Scan#to_record_batches` function when reading only specific keys
or doing cell pushdown.

**Arguments**:

- `name` - variable name
- `dtype` - must match dtype of the column in the auxiliary table.

#### scalar

```python
def scalar(value: Any) -> Expr
```

Create a scalar expression.

#### cast

```python
def cast(expr: ExprLike, dtype: pa.DataType) -> Expr
```

Cast an expression into another PyArrow DataType.

#### and\_

```python
def and_(expr: ExprLike, *exprs: ExprLike) -> Expr
```

Create a conjunction of one or more expressions.

#### or\_

```python
def or_(expr: ExprLike, *exprs: ExprLike) -> Expr
```

Create a disjunction of one or more expressions.

#### eq

```python
def eq(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create an equality comparison.

#### neq

```python
def neq(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create a not-equal comparison.

#### xor

```python
def xor(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create a XOR comparison.

#### lt

```python
def lt(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create a less-than comparison.

#### lte

```python
def lte(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create a less-than-or-equal comparison.

#### gt

```python
def gt(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create a greater-than comparison.

#### gte

```python
def gte(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Create a greater-than-or-equal comparison.

#### negate

```python
def negate(expr: ExprLike) -> Expr
```

Negate the given expression.

#### not\_

```python
def not_(expr: ExprLike) -> Expr
```

Negate the given expression.

#### is\_null

```python
def is_null(expr: ExprLike) -> Expr
```

Check if the given expression is null.

#### is\_not\_null

```python
def is_not_null(expr: ExprLike) -> Expr
```

Check if the given expression is not null.

#### add

```python
def add(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Add two expressions.

#### subtract

```python
def subtract(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Subtract two expressions.

#### multiply

```python
def multiply(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Multiply two expressions.

#### divide

```python
def divide(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Divide two expressions.

#### modulo

```python
def modulo(lhs: ExprLike, rhs: ExprLike) -> Expr
```

Modulo two expressions.

#### getitem

```python
def getitem(expr: ExprLike, field: str) -> Expr
```

Get field from a struct.

**Arguments**:

- `expr` - The struct expression to get the field from.
- `field` - The field to get. Dot-separated string is supported to access nested fields.

#### pack

```python
def pack(fields: dict[str, ExprLike], *, nullable: bool = False) -> Expr
```

Assemble a new struct from the given named fields.

**Arguments**:

- `fields` - A dictionary of field names to expressions. The field names will be used as the struct field names.

#### merge

```python
def merge(*structs: "ExprLike") -> Expr
```

Merge fields from the given structs into a single struct.

**Arguments**:

- `*structs` - Each expression must evaluate to a struct.

**Returns**:

A single struct containing all the fields from the input structs.
If a field is present in multiple structs, the value from the last struct is used.

#### select

```python
def select(expr: ExprLike,
           names: list[str] = None,
           exclude: list[str] = None) -> Expr
```

Select fields from a struct.

**Arguments**:

- `expr` - The struct-like expression to select fields from.
- `names` - Field names to select. If a path contains a dot, it is assumed to be a nested struct field.
- `exclude` - List of field names to exclude from result. Exactly one of `names` or `exclude` must be provided.

## Expr

```python
class Expr()
```

Base class for Spiral expressions. All expressions support comparison and basic arithmetic operations.

#### \_\_getitem\_\_

```python
def __getitem__(item: str | int | list[str]) -> "Expr"
```

Get an item from a struct or list.

**Arguments**:

- `item` - The key or index to get.
  If item is a string, it is assumed to be a field in a struct. Dot-separated string is supported
  to access nested fields.
  If item is a list of strings, it is assumed to be a list of fields in a struct.
  If item is an integer, it is assumed to be an index in a list.

#### cast

```python
def cast(dtype: pa.DataType) -> "Expr"
```

Cast the expression result to a different data type.

#### select

```python
def select(*paths: str, exclude: list[str] = None) -> "Expr"
```

Select fields from a struct-like expression.

**Arguments**:

- `*paths` - Field names to select. If a path contains a dot, it is assumed to be a nested struct field.
- `exclude` - List of field names to exclude from result.

## UDF

```python
@dataclass(frozen=True)
class UDF(abc.ABC)
```

A User-Defined Function (UDF). This class should be subclassed to define custom UDFs.

**Warning**

You must define **eq** and **hash** correctly for your sub-class! A simple way to do this is to
annotate your sub-class with `@dataclass(frozen=True)` and list its type-annotated member variables:

```python
from dataclasses import dataclass
from spiral import expressions as se

@dataclass(frozen=True)
class FMA(se.UDF):
    b: int
    c: int

    def name(self):
        ...
    def return_type(self, scope):
        ...
    def invoke(self, scope):
        ...
```

**Example**

```python
import spiral
from spiral.demo import fineweb

sp = spiral.Spiral()
fineweb_table = fineweb(sp)

from spiral import expressions as se
import pyarrow as pa

class MyAdd(se.UDF):
    def name(self):
        return "my_add"

    def return_type(self, scope: pa.DataType):
        if not isinstance(scope, pa.StructType):
            raise ValueError("Expected struct type as input")
        return scope.field(0).type

    def invoke(self, scope: pa.Array):
        if not isinstance(scope, pa.StructArray):
            raise ValueError("Expected struct array as input")
        return pa.compute.add(scope.field(0), scope.field(1))

my_add = MyAdd()

expr = my_add(fineweb_table.select("first_arg", "second_arg"))
```

#### \_\_call\_\_

```python
def __call__(scope: ExprLike) -> Expr
```

Create an expression that calls this UDF with the given arguments.

#### name

```python
@abc.abstractmethod
def name() -> str
```

A name used when displaying this UDF.

**Example**

This UDF:

```python
@dataclass(frozen=True)
class Foo(se.UDF):
    bar: int
    baz: str

    def name(self):
        return "hello_world"
    def return_type(self, s): ...
    def invoke(self, s): ...

Foo(3, "hello")(1)
```

renders as follows:

`udf_hello_world[Foo(bar=3, baz="hello")](...)`

You can change what is displayed between the square brackets by defining **str**:

```python
@dataclass(frozen=True)
class Foo(se.UDF):
    bar: int
    baz: str

    def __str__(self):
        return f'{bar}, {baz}'
    def name(self):
        return "hello_world"
    def return_type(self, s): ...
    def invoke(self, s): ...

Foo(3, "hello")(1)
```

renders as follows:

`udf_hello_world[3, "hello"](...)`

#### return\_type

```python
@abc.abstractmethod
def return_type(scope: pa.DataType) -> pa.DataType
```

Must return the return type of the UDF given the input scope type.

All expressions in Spiral must return nullable (Arrow default) types,
including nested structs, meaning that all fields in structs must also be nullable,
and if those fields are structs, their fields must also be nullable, and so on.

#### invoke

```python
@abc.abstractmethod
def invoke(scope: pa.Array) -> pa.Array
```

Must implement the UDF logic given the input scope array.

---
# spiral.expressions.s3
Section: PySpiral
Source: https://docs.spiraldb.com/python-api/expressions.s3

#### get

```python
def get(expr: ExprLike, abort_on_error: bool = False) -> Expr
```

Read data from object storage by the s3:// URL.

**Arguments**:

- `expr` - URLs of the data that needs to be read from object storage.
- `abort_on_error` - Should the expression abort on errors or just collect them.

---
# spiral.expressions.http
Section: PySpiral
Source: https://docs.spiraldb.com/python-api/expressions.http

#### get

```python
def get(expr: ExprLike, abort_on_error: bool = False) -> Expr
```

Read data from the URL.

**Arguments**:

- `expr` - URLs of the data that needs to be read.
- `abort_on_error` - Should the expression abort on errors or just collect them.

---
# spiral.expressions.file
Section: PySpiral
Source: https://docs.spiraldb.com/python-api/expressions.file

#### get

```python
def get(expr: ExprLike, abort_on_error: bool = False) -> Expr
```

Read data from the local filesystem by the file:// URL.

**Arguments**:

- `expr` - URLs of the data that needs to be read.
- `abort_on_error` - Should the expression abort on errors or just collect them.

---
# spiral.expressions.list
Section: PySpiral
Source: https://docs.spiraldb.com/python-api/expressions.list

Spiral List Expressions.

```python
from spiral import expressions as se

se.list
```

#### contains

```python
def contains(expr: ExprLike, value: ExprLike) -> Expr
```

Check if a list contains a value.

**Arguments**:

- `expr` - The list expression.
- `value` - The value to search for.

```python
se.list.contains(table["tags"], "important")
```

#### map\_

```python
def map_(expr: ExprLike, transform: Callable[[Expr], ExprLike]) -> Expr
```

Apply a transform to each element of a list, returning a new list.

**Arguments**:

- `expr` - The list expression.
- `transform` - A lambda that takes an element expression and returns a
  transformed expression.

```python
se.list.map_(table["scores"], lambda el: el > 90)
```

#### any\_

```python
def any_(expr: ExprLike,
         predicate: Callable[[Expr], ExprLike] | None = None) -> Expr
```

Check if any element of a list satisfies a predicate.

**Arguments**:

- `expr` - A list expression. If no predicate is given, must be a list of booleans.
- `predicate` - Optional lambda applied to each element. When provided, it is
  equivalent to `any_(map_(expr, predicate))`.

```python
se.list.any_(table["scores"], lambda el: el > 90)
```

#### all\_

```python
def all_(expr: ExprLike,
         predicate: Callable[[Expr], ExprLike] | None = None) -> Expr
```

Check if all elements of a list satisfy a predicate.

**Arguments**:

- `expr` - A list expression. If no predicate is given, must be a list of booleans.
- `predicate` - Optional lambda applied to each element. When provided, it is
  equivalent to `all_(map_(expr, predicate))`.

```python
se.list.all_(table["scores"], lambda el: el > 90)
```

#### slice

```python
def slice(array: ExprLike,
          start: int,
          stop: int | None,
          step: int = 1) -> Expr
```

Slice each element of a list array to start..stop by step.

---
# spiral.expressions.vector
Section: PySpiral
Source: https://docs.spiraldb.com/python-api/expressions.vector

#### cosine\_similarity

```python
def cosine_similarity(left: ExprLike, right: ExprLike) -> Expr
```

Compute the cosine similarity of `left` and `right`.

See the Wikipedia article ["Cosine similarity"](https://en.wikipedia.org/wiki/Cosine_similarity).

Only implemented for fixed size lists of the 16-, 32-, and 64-bit floating-point types.

##### Examples

```python
>>> import spiral.expressions as se
>>> import pyarrow as pa
>>> needle = pa.scalar([1, 0, 0], type=pa.list_(pa.float64(), 3))
>>> sp.scan({"result": se.vector.cosine_similarity(t["embedding"], needle)}).to_table()
pyarrow.Table
x: int64
result: double
----
x: ...
result: ...
```

**Arguments**:

- `left` - an array of fixed-size-list of length N and type T
- `right` - an array of fixed-size-list of length N and type T

**Returns**:

an array of T

#### dot

```python
def dot(left: ExprLike, right: ExprLike) -> Expr
```

Compute the dot product of `left` and `right`.

Only implemented for fixed size lists of the 16-, 32-, and 64-bit floating-point types.

##### Examples

```python
>>> import spiral.expressions as se
>>> sp.scan({"5_34_sim": se.vector.dot(t["layer5"], t["layer34"])}).to_table()
pyarrow.Table
x: int64
5_34_sim: double
----
x: ...
4_34_sim: ...
```

**Arguments**:

- `left` - an array of fixed-size-list of length N and type T
- `right` - an array of fixed-size-list of length N and type T

**Returns**:

an array of T

#### norm

```python
def norm(array: ExprLike) -> Expr
```

Compute the Euclidean (L2) norm of `array`.

Only implemented for fixed size lists of the 16-, 32-, and 64-bit floating-point types.

##### Examples

```python
>>> import spiral.expressions as se
>>> sp.scan({"norm": se.vector.norm(t["embedding"])}).to_table()
pyarrow.Table
x: int64
norm: double
----
x: ...
norm: ...
```

**Arguments**:

- `array` - an array of fixed-size-list of type T

**Returns**:

an array of T

---
# spiral.viz
Section: PySpiral
Source: https://docs.spiraldb.com/python-api/viz

## send\_to\_rerun

```python
def send_to_rerun(data: Table | Scan | pa.RecordBatchReader | pa.Table
                  | pa.RecordBatch | pa.ChunkedArray | pa.StructArray,
                  spec: dict[str, Archetype],
                  *,
                  default_timeline: str | None = None)
```

Visualize a table, scan, record batch, record batch reader, array, or table.

**Arguments**:

- `data` - positional-only; the data to visualize

- `spec` - positional-only; a named set of visualizations referencing columns of data.

- `default_timeline` - keyword-only; a dot-separated path to a timeline column (an integer or timestamp).

## Archetype

```python
@final
class Archetype()
```

An ergonomic shim around a Rerun Archetype.

An archetype defines visualization semantics for a value. For example, a pair of floating-point
values could be interpreted as:

- a point in 2-d Euclidean space,
- a vector in 2-d Euclidean space (perhaps rooted at the origin, perhaps not), or
- a point in 2-d Hyberbolic space

among many other possible interpretations. Rerun defines archetypes for the first two:
Points2D and Arrows2D.

`spiral.viz.Archetype` is a shim which flattens or broadcasts components of an Archetype so that
they are all aligned to one another and to their timeline.

**Example**

Consider a 2-d point cloud which rotates around the origin. We store it as an Arrow table with
two columns: `timeline` of type `timestamp[ns]` and `cloud` of type `list(list(f64, 2))`. The
outer lists represent the point cloud at each timestamp and the inner fixed-size-list of length
2 represents the pairs.

```python
import math
import pyarrow as pa
from datetime import datetime
import spiral
from spiral import viz as sv

timeline = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def point_on_circle(index, time):
    x = 2 * math.pi * (index / 8 + time)
    return (math.cos(x), math.sin(x))

cloud = [
    [point_on_circle(index, time) for index in range(8)]
    for time in timeline
]

timeline = [datetime.fromtimestamp(t) for t in timeline]

table = pa.table({
    "timeline": pa.array(timeline, type=pa.timestamp("ns")),
    "cloud": pa.array(cloud),
})
```

Now we can use `sv.Points2D`, a spiral Archetype, to convert this data into a Rerun
visualization. `sv.Points2D` will automatically broadcast the timeline across the values of the
outer list.

```python
import rerun

rerun.init("spiral 1")
sv.send_to_rerun(
    table,
    {"point_cloud": sv.Points2D(positions="cloud", timeline="timeline")}
)
rerun.notebook_show()
```

## ArrowComponentBatchLike

```python
@final
class ArrowComponentBatchLike(rr.ComponentBatchLike)
```

A Rerun shim around an Arrow arrray.

## ArrowTimeColumn

```python
@final
class ArrowTimeColumn(rr._send_columns.TimeColumnLike)
```

A timeline represented as a series of points on that timeline.

A timeline may either be temporal or ordinal. The `ArrowTimeColumn` for a temporal timeline must
have type `timestamp[ns]`. An ordinal timeline must have type `int64`.

**Example**

A series of values may simply use their row index as a timeline:

```
import pyarrow as pa
import spiral

data = ['a', 'b', 'c', 'd']
ordinal_timeline = pa.array(list(range(len(data))))
spiral.viz.ArrowTimeColumn.from_name_and_data("my_timeline", ordinal_timeline)
```

They may also use any nanosecond-resolution timestamp column. For example, a 10Hz sensor with
data from 0s to 1 might use a timeline such as this:

```
from datetime import datetime

timeline = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
timeline = [datetime.fromtimestamp(t) for t in timeline]

spiral.viz.ArrowTimeColumn.from_name_and_data("my_timeline", timeline)
```

## Arrows2D

```python
def Arrows2D(**kwargs)
```

[rerun.archetypes.Arrows2D](https://rerun.io/docs/reference/types/archetypes/arrows2d)

## Arrows3D

```python
def Arrows3D(**kwargs)
```

[rerun.archetypes.Arrows3D](https://rerun.io/docs/reference/types/archetypes/arrows3d)

## Points2D

```python
def Points2D(**kwargs)
```

[rerun.archetypes.Points2D](https://rerun.io/docs/reference/types/archetypes/points2d)

## Points3D

```python
def Points3D(**kwargs)
```

[rerun.archetypes.Points3D](https://rerun.io/docs/reference/types/archetypes/points3d)

## TextLog

```python
def TextLog(**kwargs)
```

[rerun.archetypes.TextLog](https://rerun.io/docs/reference/types/archetypes/text_log)

## Scalars

```python
def Scalars(**kwargs)
```

[rerun.archetypes.Scalars](https://rerun.io/docs/reference/types/archetypes/scalars)

---
# Command Line
Section: PySpiral
Source: https://docs.spiraldb.com/cli

**Usage**:

```console
$ spiral [OPTIONS] COMMAND [ARGS]...
```

**Options**:

- `--version`: Display the version of the Spiral CLI.
- `-v, --verbose`: Increase log verbosity. Use -v for info, -vv for debug.  \[default: 0]
- `--install-completion`: Install completion for the current shell.
- `--show-completion`: Show completion for the current shell, to copy it or customize the installation.
- `--help`: Show this message and exit.

**Commands**:

- `login`
- `whoami`: Display the current user's information.
- `orgs`: Org admin.
- `projects`: Projects and grants.
- `tables`: Spiral Tables.
- `fs`: File Systems.
- `txn`: Table Transactions.
- `workloads`

## `spiral login`

**Usage**:

```console
$ spiral login [OPTIONS]
```

**Options**:

- `--org-id TEXT`
- `--force / --no-force`: \[default: no-force]
- `--show-token / --no-show-token`: \[default: no-show-token]
- `--help`: Show this message and exit.

## `spiral whoami`

Display the current user's information.

**Usage**:

```console
$ spiral whoami [OPTIONS]
```

**Options**:

- `--help`: Show this message and exit.

## `spiral orgs`

**Usage**:

```console
$ spiral orgs [OPTIONS] COMMAND [ARGS]...
```

**Options**:

- `--help`: Show this message and exit.

**Commands**:

- `switch`: Switch the active organization.
- `ls`: List organizations.
- `invite`: Invite a user to the organization.
- `sso`: Configure single sign-on for your...
- `directory`: Configure directory services for your...
- `audit-logs`: Configure audit logs for your organization.
- `log-streams`: Configure log streams for your organization.
- `domains`: Configure domains for your organization.
- `keys`: Configure bring-your-own-key for your...

### `spiral orgs switch`

Switch the active organization.

**Usage**:

```console
$ spiral orgs switch [OPTIONS] [ORG_ID]
```

**Arguments**:

- `[ORG_ID]`: Organization ID

**Options**:

- `--help`: Show this message and exit.

### `spiral orgs ls`

List organizations.

**Usage**:

```console
$ spiral orgs ls [OPTIONS]
```

**Options**:

- `--help`: Show this message and exit.

### `spiral orgs invite`

Invite a user to the organization.

**Usage**:

```console
$ spiral orgs invite [OPTIONS] EMAIL
```

**Arguments**:

- `EMAIL`: \[required]

**Options**:

- `--role [owner|member|guest]`: \[default: member]
- `--expires-in-days INTEGER`: \[default: 7]
- `--help`: Show this message and exit.

### `spiral orgs sso`

Configure single sign-on for your organization.

**Usage**:

```console
$ spiral orgs sso [OPTIONS]
```

**Options**:

- `--help`: Show this message and exit.

### `spiral orgs directory`

Configure directory services for your organization.

**Usage**:

```console
$ spiral orgs directory [OPTIONS]
```

**Options**:

- `--help`: Show this message and exit.

### `spiral orgs audit-logs`

Configure audit logs for your organization.

**Usage**:

```console
$ spiral orgs audit-logs [OPTIONS]
```

**Options**:

- `--help`: Show this message and exit.

### `spiral orgs log-streams`

Configure log streams for your organization.

**Usage**:

```console
$ spiral orgs log-streams [OPTIONS]
```

**Options**:

- `--help`: Show this message and exit.

### `spiral orgs domains`

Configure domains for your organization.

**Usage**:

```console
$ spiral orgs domains [OPTIONS]
```

**Options**:

- `--help`: Show this message and exit.

### `spiral orgs keys`

Configure bring-your-own-key for your organization.

**Usage**:

```console
$ spiral orgs keys [OPTIONS]
```

**Options**:

- `--help`: Show this message and exit.

## `spiral projects`

**Usage**:

```console
$ spiral projects [OPTIONS] COMMAND [ARGS]...
```

**Options**:

- `--help`: Show this message and exit.

**Commands**:

- `ls`: List projects.
- `create`: Create a new project.
- `drop`: Drop a project.
- `grant`: Grant a role on a project to a principal.
- `grants`: List project grants.

### `spiral projects ls`

List projects.

**Usage**:

```console
$ spiral projects ls [OPTIONS]
```

**Options**:

- `--help`: Show this message and exit.

### `spiral projects create`

Create a new project.

**Usage**:

```console
$ spiral projects create [OPTIONS]
```

**Options**:

- `--description TEXT`: Description for the project.  \[required]
- `--id-prefix TEXT`: An optional ID prefix to which a random number will be appended.
- `--help`: Show this message and exit.

### `spiral projects drop`

Drop a project.

**Usage**:

```console
$ spiral projects drop [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--help`: Show this message and exit.

### `spiral projects grant`

Grant a role on a project to a principal.

**Usage**:

```console
$ spiral projects grant [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--role [viewer|editor|admin]`: Project role to grant.  \[required]
- `--org-id TEXT`: Pass an organization ID to grant a role to an organization user(s).
- `--org-user TEXT`: Pass a user ID when using --org-id to grant a role to a user.
- `--org-role [owner|member|guest]`: Pass an org role when using --org-id to grant a role to all users with that role.
- `--workload-id TEXT`: Pass a workload ID to grant a role to a workload.
- `--github TEXT`: Pass an `{org}/{repo}` string to grant a role to a job running in GitHub Actions.
- `--modal TEXT`: Pass a `{workspace_id}/{env_name}` string to grant a role to a job running in Modal environment.
- `--gcp-service-account TEXT`: Pass a `{service_account_email}/{unique_id}` to grant a role to a GCP service account.
- `--aws-iam-role TEXT`: Pass a `{account_id}/{role_name}` to grant a Spiral role to an AWS IAM Role.
- `--help`: Show this message and exit.

### `spiral projects grants`

List project grants.

**Usage**:

```console
$ spiral projects grants [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--help`: Show this message and exit.

## `spiral tables`

**Usage**:

```console
$ spiral tables [OPTIONS] COMMAND [ARGS]...
```

**Options**:

- `--help`: Show this message and exit.

**Commands**:

- `ls`: List tables.
- `head`: Show the leading rows of the table.
- `move`: Move table to a different dataset.
- `rename`: Rename table.
- `drop`: Drop table.
- `key-schema`: Show the table key schema.
- `schema`: Compute the full table schema.
- `wal`: Fetch Write-Ahead-Log.
- `flush`: Flush Write-Ahead-Log.
- `truncate`: Truncate column group metadata.
- `manifests`: Display all fragments from metadata. (DEPRECATED: Use 'fragments' instead.)
- `fragments`: Display all fragments from metadata.
- `fragments-scan`: Display the fragments used in a scan of a...
- `debug-scan`: Visualize the scan of a given column group.
- `stats`: Display table statistics and compaction...

### `spiral tables ls`

List tables.

**Usage**:

```console
$ spiral tables ls [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--help`: Show this message and exit.

### `spiral tables head`

Show the leading rows of the table.

**Usage**:

```console
$ spiral tables head [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `-n INTEGER`: Maximum number of rows to show. Defaults to 10.  \[default: 10]
- `--asof TEXT`: Transaction timestamp from `spiral txn ls` (microseconds or UTC datetime).
- `--help`: Show this message and exit.

### `spiral tables move`

Move table to a different dataset.

**Usage**:

```console
$ spiral tables move [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--new-dataset TEXT`: New dataset name.
- `--help`: Show this message and exit.

### `spiral tables rename`

Rename table.

**Usage**:

```console
$ spiral tables rename [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--new-table TEXT`: New table name.
- `--help`: Show this message and exit.

### `spiral tables drop`

Drop table.

**Usage**:

```console
$ spiral tables drop [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--help`: Show this message and exit.

### `spiral tables key-schema`

Show the table key schema.

**Usage**:

```console
$ spiral tables key-schema [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--help`: Show this message and exit.

### `spiral tables schema`

Compute the full table schema.

**Usage**:

```console
$ spiral tables schema [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--help`: Show this message and exit.

### `spiral tables wal`

Fetch Write-Ahead-Log.

**Usage**:

```console
$ spiral tables wal [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--help`: Show this message and exit.

### `spiral tables flush`

Flush Write-Ahead-Log.

**Usage**:

```console
$ spiral tables flush [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--help`: Show this message and exit.

### `spiral tables truncate`

Truncate column group metadata.

**Usage**:

```console
$ spiral tables truncate [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--help`: Show this message and exit.

### `spiral tables manifests`

Display all fragments from metadata.

**Usage**:

```console
$ spiral tables manifests [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--max-rows INTEGER`: Maximum number of rows to show per manifest.
- `--asof TEXT`: Transaction timestamp from `spiral txn ls` (microseconds or UTC datetime).
- `--help`: Show this message and exit.

### `spiral tables fragments`

Display all fragments from metadata.

**Usage**:

```console
$ spiral tables fragments [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--max-rows INTEGER`: Maximum number of fragments to show per manifest.
- `--asof TEXT`: Transaction timestamp from `spiral txn ls` (microseconds or UTC datetime).
- `--help`: Show this message and exit.

### `spiral tables fragments-scan`

Display the fragments used in a scan of a given column group.

**Usage**:

```console
$ spiral tables fragments-scan [OPTIONS] [PROJECT] [COLUMN_GROUP]
```

**Arguments**:

- `[PROJECT]`: Project ID
- `[COLUMN_GROUP]`: Dot-separated column group path.  \[default: .]

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--help`: Show this message and exit.

### `spiral tables debug-scan`

Visualize the scan of a given column group.

**Usage**:

```console
$ spiral tables debug-scan [OPTIONS] [PROJECT] [COLUMN_GROUP]
```

**Arguments**:

- `[PROJECT]`: Project ID
- `[COLUMN_GROUP]`: Dot-separated column group path.  \[default: .]

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--help`: Show this message and exit.

### `spiral tables stats`

Display table statistics and compaction indicators.

**Usage**:

```console
$ spiral tables stats [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--help`: Show this message and exit.

## `spiral fs`

**Usage**:

```console
$ spiral fs [OPTIONS] COMMAND [ARGS]...
```

**Options**:

- `--help`: Show this message and exit.

**Commands**:

- `show`: Show the file system configured for project.
- `update`: Update a project's default file system.
- `list-providers`: Lists the available built-in file system...

### `spiral fs show`

Show the file system configured for project.

**Usage**:

```console
$ spiral fs show [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--help`: Show this message and exit.

### `spiral fs update`

Update a project's default file system.

**Usage**:

```console
$ spiral fs update [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--type [builtin|s3|s3like|gcs|upstream]`: Type of the file system.
- `--provider TEXT`: Provider, when using `builtin` type.
- `--endpoint TEXT`: Endpoint, when using `s3` or `s3like` type.
- `--region TEXT`: Region, when using `s3`, `s3like` or `gcs` type (defaults to `auto` for `s3` when `endpoint` is set).
- `--bucket TEXT`: Bucket, when using `s3` or `gcs` type.
- `--role-arn TEXT`: Role ARN to assume, when using `s3` type.
- `-y, --yes`: Skip confirmation prompt.
- `--dangerously-move-data`: Allow updating a file system which has mounted tables.
- `--help`: Show this message and exit.

### `spiral fs list-providers`

Lists the available built-in file system providers.

**Usage**:

```console
$ spiral fs list-providers [OPTIONS]
```

**Options**:

- `--help`: Show this message and exit.

## `spiral txn`

**Usage**:

```console
$ spiral txn [OPTIONS] COMMAND [ARGS]...
```

**Options**:

- `--help`: Show this message and exit.

**Commands**:

- `ls`: List transactions for a table.
- `revert`: Revert a transaction by table ID and...

### `spiral txn ls`

List transactions for a table.

**Usage**:

```console
$ spiral txn ls [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--table TEXT`: Table name.
- `--dataset TEXT`: Dataset name.
- `--since INTEGER`: List transactions committed after this timestamp (microseconds).
- `--help`: Show this message and exit.

### `spiral txn revert`

Revert a transaction by table ID and transaction index.

**Usage**:

```console
$ spiral txn revert [OPTIONS] TABLE_ID TXN_IDX
```

**Arguments**:

- `TABLE_ID`: \[required]
- `TXN_IDX`: \[required]

**Options**:

- `--help`: Show this message and exit.

## `spiral workloads`

**Usage**:

```console
$ spiral workloads [OPTIONS] COMMAND [ARGS]...
```

**Options**:

- `--help`: Show this message and exit.

**Commands**:

- `ls`: List workloads.
- `create`: Create a new workload.
- `deactivate`: Deactivate a workload.
- `issue-creds`: Issue new workflow credentials.
- `revoke-creds`: Revoke workflow credentials.
- `policies`: List workload policies.
- `delete-policy`: Delete a workload policy.
- `create-policy`: Create a workload policy.

### `spiral workloads ls`

List workloads.

**Usage**:

```console
$ spiral workloads ls [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--help`: Show this message and exit.

### `spiral workloads create`

Create a new workload.

**Usage**:

```console
$ spiral workloads create [OPTIONS] [PROJECT]
```

**Arguments**:

- `[PROJECT]`: Project ID

**Options**:

- `--name TEXT`: Friendly name for the workload.
- `--help`: Show this message and exit.

### `spiral workloads deactivate`

Deactivate a workload. Removes all associated credentials and policies.

**Usage**:

```console
$ spiral workloads deactivate [OPTIONS] WORKLOAD_ID
```

**Arguments**:

- `WORKLOAD_ID`: Workload ID.  \[required]

**Options**:

- `--help`: Show this message and exit.

### `spiral workloads issue-creds`

Issue new workflow credentials.

**Usage**:

```console
$ spiral workloads issue-creds [OPTIONS] WORKLOAD_ID
```

**Arguments**:

- `WORKLOAD_ID`: Workload ID.  \[required]

**Options**:

- `--skip-prompt / --no-skip-prompt`: Skip prompt and print secret to console.  \[default: no-skip-prompt]
- `--help`: Show this message and exit.

### `spiral workloads revoke-creds`

Revoke workflow credentials.

**Usage**:

```console
$ spiral workloads revoke-creds [OPTIONS] CLIENT_ID
```

**Arguments**:

- `CLIENT_ID`: Client ID to revoke.  \[required]

**Options**:

- `--help`: Show this message and exit.

### `spiral workloads policies`

List workload policies.

**Usage**:

```console
$ spiral workloads policies [OPTIONS] [WORKLOAD_ID]
```

**Arguments**:

- `[WORKLOAD_ID]`: Workload ID.

**Options**:

- `--help`: Show this message and exit.

### `spiral workloads delete-policy`

Delete a workload policy.

**Usage**:

```console
$ spiral workloads delete-policy [OPTIONS] POLICY_ID
```

**Arguments**:

- `POLICY_ID`: Policy ID.  \[required]

**Options**:

- `--help`: Show this message and exit.

### `spiral workloads create-policy`

Create a workload policy.

The audience (aud) claim of OIDC tokens must be [https://iss.spiraldb.com](https://iss.spiraldb.com) (except for Modal, where the audience is oidc.modal.com). See [https://docs.spiraldb.com/oidc](https://docs.spiraldb.com/oidc) for more info.

**Usage**:

```console
$ spiral workloads create-policy [OPTIONS] COMMAND [ARGS]...
```

**Options**:

- `--help`: Show this message and exit.

**Commands**:

- `github`: Create a policy for a GitHub Actions...
- `modal`: Create a policy for a Modal environment.
- `gcp`: Create a policy for a GCP environment.
- `aws`: Create a policy for an AWS environment.
- `oidc`: Create a policy with a custom OIDC issuer.

#### `spiral workloads create-policy github`

Create a policy for a GitHub Actions environment.

**Usage**:

```console
$ spiral workloads create-policy github [OPTIONS] WORKLOAD_ID
```

**Arguments**:

- `WORKLOAD_ID`: Workload ID.  \[required]

**Options**:

- `--repository TEXT`: GitHub repository (org/repo).  \[required]
- `--environment TEXT`: GitHub deployment environment name.
- `--ref TEXT`: Git ref (e.g. refs/heads/main).
- `--sha TEXT`: Commit SHA that triggered the workflow.
- `--repository-owner TEXT`: Repository owner (org or user).
- `--actor-id TEXT`: GitHub user ID of the actor.
- `--repository-visibility TEXT`: Repository visibility (public, private, internal).
- `--repository-id TEXT`: Repository ID.
- `--repository-owner-id TEXT`: Repository owner ID.
- `--runner-environment TEXT`: Runner environment (github-hosted or self-hosted).
- `--actor TEXT`: GitHub username of the actor.
- `--workflow TEXT`: Workflow name.
- `--head-ref TEXT`: Head ref (source branch for PRs).
- `--base-ref TEXT`: Base ref (target branch for PRs).
- `--event-name TEXT`: Event that triggered the workflow (e.g. push, pull\_request).
- `--ref-type TEXT`: Ref type (branch or tag).
- `--job-workflow-ref TEXT`: Reusable workflow ref.
- `--help`: Show this message and exit.

#### `spiral workloads create-policy modal`

Create a policy for a Modal environment.

**Usage**:

```console
$ spiral workloads create-policy modal [OPTIONS] WORKLOAD_ID
```

**Arguments**:

- `WORKLOAD_ID`: Workload ID.  \[required]

**Options**:

- `--workspace-id TEXT`: Modal workspace ID.  \[required]
- `--environment-name TEXT`: Modal environment name.
- `--environment-id TEXT`: Modal environment ID.
- `--app-id TEXT`: Modal app ID.
- `--app-name TEXT`: Modal app name.
- `--function-id TEXT`: Modal function ID.
- `--function-name TEXT`: Modal function name.
- `--container-id TEXT`: Modal container ID.
- `--help`: Show this message and exit.

#### `spiral workloads create-policy gcp`

Create a policy for a GCP environment.

**Usage**:

```console
$ spiral workloads create-policy gcp [OPTIONS] WORKLOAD_ID
```

**Arguments**:

- `WORKLOAD_ID`: Workload ID.  \[required]

**Options**:

- `--email TEXT`: GCP service account email.  \[required]
- `--unique-id TEXT`: GCP unique ID (sub claim).  \[required]
- `--help`: Show this message and exit.

#### `spiral workloads create-policy aws`

Create a policy for an AWS environment.

**Usage**:

```console
$ spiral workloads create-policy aws [OPTIONS] WORKLOAD_ID
```

**Arguments**:

- `WORKLOAD_ID`: Workload ID.  \[required]

**Options**:

- `--account TEXT`: AWS account ID.  \[required]
- `--role TEXT`: AWS IAM role name.  \[required]
- `--help`: Show this message and exit.

#### `spiral workloads create-policy oidc`

Create a policy with a custom OIDC issuer.

**Usage**:

```console
$ spiral workloads create-policy oidc [OPTIONS] WORKLOAD_ID
```

**Arguments**:

- `WORKLOAD_ID`: Workload ID.  \[required]

**Options**:

- `--iss TEXT`: OIDC issuer URL.  \[required]
- `--conditions TEXT`: Conditions in KEY=VALUE format. Can be provided multiple times.
- `--help`: Show this message and exit.

---
# Client Configuration
Section: PySpiral
Source: https://docs.spiraldb.com/config

This document describes the configuration options for the Spiral client.

Configuration can be set via:

- **File**: `~/.spiral.toml`
- **Environment variables**: Prefixed with `SPIRAL__` (double underscore)
- **Runtime overrides**: Override values when initializing the client, using the dot notation

## Caching

Spiral supports three independent caches for different file types:

- **`keys_cache`**: Caches key space files. **Enabled memory-only by default**. Recommended for workloads with repeated key lookups.
- **`manifests_cache`**: Caches manifest files. Recommended to enable for better query planning performance.
- **`fragments_cache`**: Caches fragment (data) files. **Not recommended** to enable in most cases as fragments are typically very large and benefit less from caching.

Each cache can be independently configured with separate memory limits, disk capacity, and storage paths. Set `disk_capacity_bytes` to `0` for memory-only caching.

## Top level

Main configuration for the Spiral client, that can affect many different parts of the client.

| Field         | Default  | Environment Variable  | Description                   |
| ------------- | -------- | --------------------- | ----------------------------- |
| `dev`         | `false`  | `SPIRAL__DEV`         | Development mode flag         |
| `file_format` | `vortex` | `SPIRAL__FILE_FORMAT` | File format for table storage |

## Authn

Authentication related settings

| Field               | Default | Environment Variable         | Description                                                                                                                |
| ------------------- | ------- | ---------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `authn.device_code` | `false` | `SPIRAL__AUTHN__DEVICE_CODE` | Whether to force device code authentication flow. This exist as an override when e.g. ssh-ing into an environment machine. |
| `authn.token`       | `-`     | `SPIRAL__AUTHN__TOKEN`       | Optional authentication token.                                                                                             |

## Fragments Cache

Client cache limits and constraints

| Field                                         | Default          | Environment Variable                                   | Description                                                                                                                               |
| --------------------------------------------- | ---------------- | ------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `fragments_cache.blob_index_size`             | `4 * (1 << 10)`  | `SPIRAL__FRAGMENTS_CACHE__BLOB_INDEX_SIZE`             | Size of the blob index for each blob on disk. A larger index can hold more entries per blob but increases per-blob I/O.                   |
| `fragments_cache.block_size`                  | `16 * (1 << 20)` | `SPIRAL__FRAGMENTS_CACHE__BLOCK_SIZE`                  | Block size for the disk cache engine. This is the minimum eviction unit and also limits the maximum cacheable entry size.                 |
| `fragments_cache.buffer_pool_size`            | `16 * (1 << 20)` | `SPIRAL__FRAGMENTS_CACHE__BUFFER_POOL_SIZE`            | Total flush buffer pool size in bytes (RAM). Each flusher gets `buffer_pool_size / flushers` bytes. Entries larger than this are dropped. |
| `fragments_cache.disk_capacity_bytes`         | `20 * (1 << 30)` | `SPIRAL__FRAGMENTS_CACHE__DISK_CAPACITY_BYTES`         | Disk cache size in bytes.                                                                                                                 |
| `fragments_cache.disk_path`                   | `spiral`         | `SPIRAL__FRAGMENTS_CACHE__DISK_PATH`                   | Disk cache path override. If not set, defaults to $XDG\_CACHE\_HOME/spiral or \~/.cache/spiral                                            |
| `fragments_cache.enabled`                     | `false`          | `SPIRAL__FRAGMENTS_CACHE__ENABLED`                     | Whether this cache is enabled. Default varies per cache type (see ClientSettings).                                                        |
| `fragments_cache.memory_capacity_bytes`       | `2 * (1 << 30)`  | `SPIRAL__FRAGMENTS_CACHE__MEMORY_CAPACITY_BYTES`       | Memory cache size in bytes.                                                                                                               |
| `fragments_cache.submit_queue_size_threshold` | `16 * (1 << 20)` | `SPIRAL__FRAGMENTS_CACHE__SUBMIT_QUEUE_SIZE_THRESHOLD` | If the total estimated size in the submit queue exceeds this threshold, further entries are silently dropped.                             |
| `fragments_cache.write_on_insertion`          | `-`              | `SPIRAL__FRAGMENTS_CACHE__WRITE_ON_INSERTION`          | When true, entries are written to disk immediately on insertion. When false, entries are only written to disk when evicted from memory.   |

## Keys Cache

Client cache limits and constraints

| Field                                    | Default          | Environment Variable                              | Description                                                                                                                               |
| ---------------------------------------- | ---------------- | ------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `keys_cache.blob_index_size`             | `4 * (1 << 10)`  | `SPIRAL__KEYS_CACHE__BLOB_INDEX_SIZE`             | Size of the blob index for each blob on disk. A larger index can hold more entries per blob but increases per-blob I/O.                   |
| `keys_cache.block_size`                  | `16 * (1 << 20)` | `SPIRAL__KEYS_CACHE__BLOCK_SIZE`                  | Block size for the disk cache engine. This is the minimum eviction unit and also limits the maximum cacheable entry size.                 |
| `keys_cache.buffer_pool_size`            | `16 * (1 << 20)` | `SPIRAL__KEYS_CACHE__BUFFER_POOL_SIZE`            | Total flush buffer pool size in bytes (RAM). Each flusher gets `buffer_pool_size / flushers` bytes. Entries larger than this are dropped. |
| `keys_cache.disk_capacity_bytes`         | `0`              | `SPIRAL__KEYS_CACHE__DISK_CAPACITY_BYTES`         | Disk cache size in bytes.                                                                                                                 |
| `keys_cache.disk_path`                   | `spiral`         | `SPIRAL__KEYS_CACHE__DISK_PATH`                   | Disk cache path override. If not set, defaults to $XDG\_CACHE\_HOME/spiral or \~/.cache/spiral                                            |
| `keys_cache.enabled`                     | `true`           | `SPIRAL__KEYS_CACHE__ENABLED`                     | Whether this cache is enabled. Default varies per cache type (see ClientSettings).                                                        |
| `keys_cache.memory_capacity_bytes`       | `1 << 30`        | `SPIRAL__KEYS_CACHE__MEMORY_CAPACITY_BYTES`       | Memory cache size in bytes.                                                                                                               |
| `keys_cache.submit_queue_size_threshold` | `16 * (1 << 20)` | `SPIRAL__KEYS_CACHE__SUBMIT_QUEUE_SIZE_THRESHOLD` | If the total estimated size in the submit queue exceeds this threshold, further entries are silently dropped.                             |
| `keys_cache.write_on_insertion`          | `-`              | `SPIRAL__KEYS_CACHE__WRITE_ON_INSERTION`          | When true, entries are written to disk immediately on insertion. When false, entries are only written to disk when evicted from memory.   |

## Limits

Client resource limits and constraints

| Field                                                 | Default                                        | Environment Variable                                           | Description                                                                                                                                                                                                                                                                                                                                                                          |
| ----------------------------------------------------- | ---------------------------------------------- | -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `limits.batch_readahead`                              | `NonZeroUsize :: new (min (num_cpus :: get ()` | `SPIRAL__LIMITS__BATCH_READAHEAD`                              | Maximum number of concurrent shards to evaluate, "read ahead", when scanning. Defaults to min(number of CPU cores, 32).                                                                                                                                                                                                                                                              |
| `limits.compaction_size_based_threshold_bytes`        | `8 * 1024 * 1024`                              | `SPIRAL__LIMITS__COMPACTION_SIZE_BASED_THRESHOLD_BYTES`        | Fragments smaller than this threshold will be considered for compaction when size-based compaction strategy is used.                                                                                                                                                                                                                                                                 |
| `limits.compaction_split_compressed_max_bytes`        | `1024 * 1024 * 1024`                           | `SPIRAL__LIMITS__COMPACTION_SPLIT_COMPRESSED_MAX_BYTES`        | Maximum total compressed size of fragments accumulated in a single compaction split.                                                                                                                                                                                                                                                                                                 |
| `limits.compaction_task_concurrency`                  | `1`                                            | `SPIRAL__LIMITS__COMPACTION_TASK_CONCURRENCY`                  | Maximum number of concurrent tasks in compaction.                                                                                                                                                                                                                                                                                                                                    |
| `limits.http_max_retries`                             | `10`                                           | `SPIRAL__LIMITS__HTTP_MAX_RETRIES`                             | Maximum number of HTTP request retries on transient errors.                                                                                                                                                                                                                                                                                                                          |
| `limits.http_max_retry_interval_ms`                   | `5 * 60 * 1000`                                | `SPIRAL__LIMITS__HTTP_MAX_RETRY_INTERVAL_MS`                   | Maximum number of milliseconds to wait between HTTP request retries.                                                                                                                                                                                                                                                                                                                 |
| `limits.http_min_retry_interval_ms`                   | `100`                                          | `SPIRAL__LIMITS__HTTP_MIN_RETRY_INTERVAL_MS`                   | Minimum number of milliseconds to wait between HTTP request retries.                                                                                                                                                                                                                                                                                                                 |
| `limits.io_merge_threshold`                           | `1024 * 1024`                                  | `SPIRAL__LIMITS__IO_MERGE_THRESHOLD`                           | Byte range merge distance threshold (in bytes) for IO coalescing. Nearby byte ranges within this distance are merged into single HTTP requests.                                                                                                                                                                                                                                      |
| `limits.join_set_concurrency`                         | `num_cpus :: get ()`                           | `SPIRAL__LIMITS__JOIN_SET_CONCURRENCY`                         | Maximum number of concurrent tasks in a JoinSet. Defaults to the number of CPU cores.                                                                                                                                                                                                                                                                                                |
| `limits.key_space_max_rows`                           | `100_000_000`                                  | `SPIRAL__LIMITS__KEY_SPACE_MAX_ROWS`                           | Maximum number of rows in a key space.  Compaction keeps fragments across column groups aligned to L1 key spaces. When this limit is larger, the L1 key spaces might change more frequently, causing a re-alignment to happen. When this limit is smaller, more fragments may have to be split due to key space boundaries.                                                          |
| `limits.manifest_read_concurrency`                    | `2 * num_cpus :: get ()`                       | `SPIRAL__LIMITS__MANIFEST_READ_CONCURRENCY`                    | Maximum number of concurrent manifest file reads.                                                                                                                                                                                                                                                                                                                                    |
| `limits.manifest_write_concurrency`                   | `2 * num_cpus :: get ()`                       | `SPIRAL__LIMITS__MANIFEST_WRITE_CONCURRENCY`                   | Maximum number of concurrent manifest file writes.                                                                                                                                                                                                                                                                                                                                   |
| `limits.object_storage_client_pool_idle_timeout_s`    | `None`                                         | `SPIRAL__LIMITS__OBJECT_STORAGE_CLIENT_POOL_IDLE_TIMEOUT_S`    | `pool_idle_timeout` for object storage clients.                                                                                                                                                                                                                                                                                                                                      |
| `limits.object_storage_client_pool_max_idle_per_host` | `None`                                         | `SPIRAL__LIMITS__OBJECT_STORAGE_CLIENT_POOL_MAX_IDLE_PER_HOST` | `pool_max_idle_per_host` for object storage clients.                                                                                                                                                                                                                                                                                                                                 |
| `limits.prefetch_buffer_size`                         | `20`                                           | `SPIRAL__LIMITS__PREFETCH_BUFFER_SIZE`                         | Per-stream channel buffer size (number of KeyTables buffered ahead)                                                                                                                                                                                                                                                                                                                  |
| `limits.read_max_iops_per_file`                       | `-`                                            | `SPIRAL__LIMITS__READ_MAX_IOPS_PER_FILE`                       | Maximum number of concurrent HTTP requests when reading a single file.  This is an upper bound on the number of concurrent I/O requests to a single VortexReadAt instance. Usually only one of these exists per Vortex file, so this is a rough upper limit on the number of concurrent requsts to each fragment or manifest file.  See also: \[read\_thread\_per\_core\_per\_file]. |
| `limits.read_threads_per_core_per_file`               | `1`                                            | `SPIRAL__LIMITS__READ_THREADS_PER_CORE_PER_FILE`               | The number of read threads per core per file.  See also: \[read\_max\_iops\_per\_file], which is a limit shared across all cores and threads reading from the same file object.                                                                                                                                                                                                      |
| `limits.scan_eval_concurrency`                        | `16`                                           | `SPIRAL__LIMITS__SCAN_EVAL_CONCURRENCY`                        | Maximum number of concurrent scan node evaluations (streams preparation) during scans.                                                                                                                                                                                                                                                                                               |
| `limits.scan_min_rows_per_chunk`                      | `1000`                                         | `SPIRAL__LIMITS__SCAN_MIN_ROWS_PER_CHUNK`                      | Minimum number of rows to buffer when executing the scan Small chunks are concatenated together until this threshold is reached.                                                                                                                                                                                                                                                     |
| `limits.spfs_max_retries`                             | `8`                                            | `SPIRAL__LIMITS__SPFS_MAX_RETRIES`                             | Maximum number of SPFS operation retries on transient errors.  A single SPFS read operation may comprise tens or hundreds of distinct HTTP requests. SPFS write operations may comprise multiple HTTP requests but are typically just one (long) HTTP request.                                                                                                                       |
| `limits.transaction_compact_threshold`                | `100`                                          | `SPIRAL__LIMITS__TRANSACTION_COMPACT_THRESHOLD`                | Automatically compact transaction operations before commit if operation count exceeds this threshold. Set to 0 to disable auto-compaction.                                                                                                                                                                                                                                           |
| `limits.transaction_manifest_max_rows`                | `-`                                            | `SPIRAL__LIMITS__TRANSACTION_MANIFEST_MAX_ROWS`                | Maximum number of rows in manifest when compacting transaction.                                                                                                                                                                                                                                                                                                                      |
| `limits.transaction_retries`                          | `3`                                            | `SPIRAL__LIMITS__TRANSACTION_RETRIES`                          | Maximum number of transaction retries on conflict before giving up.                                                                                                                                                                                                                                                                                                                  |
| `limits.write_buffer_max_bytes`                       | `1024 * 1024 * 1024`                           | `SPIRAL__LIMITS__WRITE_BUFFER_MAX_BYTES`                       | Maximum number of bytes to buffer in memory when writing key table batches.                                                                                                                                                                                                                                                                                                          |
| `limits.write_partition_max_bytes`                    | `128 * 1024 * 1024`                            | `SPIRAL__LIMITS__WRITE_PARTITION_MAX_BYTES`                    | Maximum number of bytes in an uncompressed fragment file.                                                                                                                                                                                                                                                                                                                            |

## Manifests Cache

Client cache limits and constraints

| Field                                         | Default          | Environment Variable                                   | Description                                                                                                                               |
| --------------------------------------------- | ---------------- | ------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `manifests_cache.blob_index_size`             | `4 * (1 << 10)`  | `SPIRAL__MANIFESTS_CACHE__BLOB_INDEX_SIZE`             | Size of the blob index for each blob on disk. A larger index can hold more entries per blob but increases per-blob I/O.                   |
| `manifests_cache.block_size`                  | `16 * (1 << 20)` | `SPIRAL__MANIFESTS_CACHE__BLOCK_SIZE`                  | Block size for the disk cache engine. This is the minimum eviction unit and also limits the maximum cacheable entry size.                 |
| `manifests_cache.buffer_pool_size`            | `16 * (1 << 20)` | `SPIRAL__MANIFESTS_CACHE__BUFFER_POOL_SIZE`            | Total flush buffer pool size in bytes (RAM). Each flusher gets `buffer_pool_size / flushers` bytes. Entries larger than this are dropped. |
| `manifests_cache.disk_capacity_bytes`         | `20 * (1 << 30)` | `SPIRAL__MANIFESTS_CACHE__DISK_CAPACITY_BYTES`         | Disk cache size in bytes.                                                                                                                 |
| `manifests_cache.disk_path`                   | `spiral`         | `SPIRAL__MANIFESTS_CACHE__DISK_PATH`                   | Disk cache path override. If not set, defaults to $XDG\_CACHE\_HOME/spiral or \~/.cache/spiral                                            |
| `manifests_cache.enabled`                     | `false`          | `SPIRAL__MANIFESTS_CACHE__ENABLED`                     | Whether this cache is enabled. Default varies per cache type (see ClientSettings).                                                        |
| `manifests_cache.memory_capacity_bytes`       | `2 * (1 << 30)`  | `SPIRAL__MANIFESTS_CACHE__MEMORY_CAPACITY_BYTES`       | Memory cache size in bytes.                                                                                                               |
| `manifests_cache.submit_queue_size_threshold` | `16 * (1 << 20)` | `SPIRAL__MANIFESTS_CACHE__SUBMIT_QUEUE_SIZE_THRESHOLD` | If the total estimated size in the submit queue exceeds this threshold, further entries are silently dropped.                             |
| `manifests_cache.write_on_insertion`          | `-`              | `SPIRAL__MANIFESTS_CACHE__WRITE_ON_INSERTION`          | When true, entries are written to disk immediately on insertion. When false, entries are only written to disk when evicted from memory.   |

## Server

API endpoint configuration for the Spiral control plane

| Field        | Default                    | Environment Variable  | Description                   |
| ------------ | -------------------------- | --------------------- | ----------------------------- |
| `server.url` | `https://api.spiraldb.dev` | `SPIRAL__SERVER__URL` | The SpiralDB API endpoint URL |

## Spfs

SpFS (Spiral File System) configuration

| Field      | Default                     | Environment Variable | Description                                |
| ---------- | --------------------------- | -------------------- | ------------------------------------------ |
| `spfs.url` | `https://spfs.spiraldb.dev` | `SPIRAL__SPFS__URL`  | The SpFS (Spiral File System) endpoint URL |

## Telemetry

Telemetry configuration for the Spiral client

| Field               | Default | Environment Variable         | Description                                      |
| ------------------- | ------- | ---------------------------- | ------------------------------------------------ |
| `telemetry.enabled` | `false` | `SPIRAL__TELEMETRY__ENABLED` | Whether client-side telemetry export is enabled. |

---
# OIDC Integrations
Section: Reference
Source: https://docs.spiraldb.com/oidc

[OIDC](https://auth0.com/docs/authenticate/protocols/openid-connect-protocol) is a protocol for authenticating
entities between systems. Spiral integrates with several OIDC providers to allow [workloads](https://docs.spiraldb.com/authorization#workloads) to authenticate
without managing static credentials. For general workload setup steps (creating a workload, policy, and grant), see
[Workloads](https://docs.spiraldb.com/authorization#workloads).

## GitHub Actions

[GitHub Actions](https://github.com/features/actions) workflows can authenticate to Spiral using
[GitHub's OIDC tokens](https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/about-security-hardening-with-openid-connect).

Create a workload policy matching tokens from a specific repository:

```bash copy
spiral workloads create-policy github work_p9cqam --repository vortex-data/vortex
```

In your GitHub Actions workflow, request an OIDC token and set the workload ID:

```yaml
permissions:
  id-token: write

env:
  SPIRAL_WORKLOAD_ID: work_p9cqam
```

For more on GitHub's OIDC claims, see
[Understanding the OIDC token](https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/about-security-hardening-with-openid-connect#understanding-the-oidc-token).

## Modal

[Modal](https://modal.com/) functions can authenticate to Spiral using
[Modal's OIDC tokens](https://modal.com/docs/guide/oidc-integration).

Create a workload policy matching tokens from a specific workspace and environment:

```bash copy
spiral workloads create-policy modal work_p9cqam --workspace-id ws-12345abcd --environment-name main
```

In your Modal function, set the workload ID:

```bash
export SPIRAL_WORKLOAD_ID=work_p9cqam
```

For more on Modal's OIDC claims, see
[Understanding your OIDC claims](https://modal.com/docs/guide/oidc-integration#step-0-understand-your-oidc-claims).

## GCP

[GCP](https://cloud.google.com/) service accounts can authenticate to Spiral using
[GCP metadata identity tokens](https://cloud.google.com/docs/authentication/get-id-token).

Create a workload policy matching tokens from a specific service account:

```bash copy
spiral workloads create-policy gcp work_p9cqam \
    --email my-sa@my-project.iam.gserviceaccount.com \
    --unique-id 107691503500061507150
```

Your code must be running on a Google Cloud service that provides metadata identity tokens:

- Compute Engine
- App Engine (standard and flexible)
- Cloud Run / Cloud Run functions
- Google Kubernetes Engine (see [GKE](#gke) below)
- Cloud Build

### GKE

Google Kubernetes Engine pods authenticate through GCP IAM service account impersonation.
The GKE Kubernetes service account impersonates a GCP IAM service account, and that GCP IAM service account
is the identity matched by the workload policy.

1. [Enable Workload Identity Federation](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#enable_on_clusters_and_node_pools) for your GKE cluster.
2. [Grant your GKE service account impersonation rights](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#kubernetes-sa-to-iam)
   on a GCP IAM service account. We recommend creating a dedicated GCP IAM service account for this.
   You may use `roles/iam.serviceAccountOpenIdTokenCreator` instead of `roles/iam.workloadIdentityUser`.
3. Create a workload and policy for the GCP IAM service account (the one being impersonated), and grant it access to your project.
   The policy should match the GCP IAM service account, not the GKE Kubernetes service account.
4. Set the workload ID in your pod's environment:

   ```bash
   export SPIRAL_WORKLOAD_ID=work_p9cqam
   ```

## AWS

[AWS](https://aws.amazon.com/) IAM roles can authenticate to Spiral using STS (Security Token Service).

AWS authentication uses STS rather than OIDC, but the setup follows the same workload and policy pattern.

Create a workload policy matching a specific IAM role:

```bash copy
spiral workloads create-policy aws work_p9cqam --account 123456789012 --role MyDeployRole
```

Your code must be running with the specified IAM role assumed (e.g. via EC2 instance role, ECS task role,
or Lambda execution role).

In your AWS environment, set the workload ID:

```bash
export SPIRAL_WORKLOAD_ID=work_p9cqam
```

## OIDC Provider

If your identity provider is not listed above, you can create a policy with a custom OIDC provider.

Create a workload policy with a custom issuer and claim conditions:

```bash copy
spiral workloads create-policy oidc work_p9cqam \
    --iss https://login.example.com \
    --conditions sub=service-account-1 \
    --conditions email=bot@example.com
```

In your environment, set the workload ID and the OIDC token issued by your provider:

```bash
export SPIRAL_WORKLOAD_ID=work_p9cqam
export SPIRAL_OIDC_PROVIDER_TOKEN=<token from your OIDC provider>
```

The pyspiral client detects `SPIRAL_OIDC_PROVIDER_TOKEN` and automatically exchanges it for a Spiral JWT.

Your OIDC provider must issue tokens with the audience (`aud`) claim set to `https://iss.spiraldb.com`.

---
# Architecture
Section: Reference
Source: https://docs.spiraldb.com/architecture

![SpiralDB Architecture](/img/architecture.svg)

---
# Table Format
Section: Reference
Source: https://docs.spiraldb.com/format

Spiral Tables are designed as a table format, similar to Apache Iceberg.
The client is configurable with a pluggable metastore, and the SpiralDB server provides a reference implementation over HTTP.

## Data Model

The data model is tabular, requiring a mandatory primary key and an optional partitioning key.

A primary key is a column or a group of columns used to uniquely identify a row in a table.
A partitioning key is simply a prefix of the primary key that forces a physical partitioning of data files.
From here on, we will refer to the primary (and partitioning) key as the "key".

### Schema Tree & Column Groups

The table's schema forms a tree, with a column at each leaf. Sibling leaves are grouped together into a *column group*.
At a basic level, column groups can be thought of as independent tables that are efficiently zipped together at read time.

![Column Tree](/img/column-tree.png)

## Storage Model

The storage model consists of key and column group tables, each partitioned and sorted by the key.
At a basic level, a read operation performs an outer join on these tables.

![Wide Table](/img/wide-table.png)

Keys (and column groups) are stored as a log-structured merge (LSM) tree.

This is a data structure that consists of sorted runs of data. Within each sorted run, all keys are sorted and unique.
But between runs, keys may overlap. Periodic compaction of an LSM tree attempts to read overlapping sorted runs, merge
them together, and write them back as a single non-overlapping run. This improves read performance since the overlapping
keys must be merged at read time.

![LSM Tree](/img/lsm-tree.png)

In the context of keys, each sorted run represents a *key space*.

### Key Spaces

Key spaces store the actual key data, separately from the column group data. This separation allows us to reuse key
spaces across multiple column groups and reduce the number of outer joins required at read time.

Key spaces have an upper bound on the number of keys they can store, which depends on the level of the LSM tree.

Furthermore, each key space is content-addressed, allowing us to deduplicate identical runs of keys across multiple
column groups. *(not yet implemented)*

Key spaces are stored as a partitioned Merkle hash trie, enabling us to quickly join dense runs of identical keys.
*(not yet implemented; currently, key spaces are stored as a collection of key files)*

### Column Groups

Column groups are stored as partitions, called fragments, which are sorted by key and aligned to exactly one key space.

Fragments have an upper bound on the number of bytes they can store. The partitioning key is used to split data.

Alignment of the fragment to the key space, called key mask, can be dense or sparse \[1], and is used to join the
column group data with the key space data at read time. From here on, we will refer to fragment with dense alignment
as dense fragment. *(sparse alignment is not yet implemented)*

![Fragments](/img/fragments.png)

Each write to a column group creates new fragments and a manifest — a listing of metadata for each new fragment.
Metadata includes a file handle, the fragment's alignment to the key space, and additional fields such as column names
and statistics used for pruning during read time.

Column group configuration, such as the latest schema and whether the schema is mutable, is stored in column group
metadata. To support column rename and deletion, fragment columns are replaced with column IDs and metadata stores
the mapping.

Additionally, metadata holds a listing of all active fragments, called a snapshot.

Metadata must be atomically updated, but to support atomic updates across many column groups, we need a table-wide write-ahead log.

### Write-Ahead Log

The write-ahead log (WAL) is a table-wide log of atomic write operations.
It contains a sequence of operations detailing the addition/removal of key spaces and fragments,
as well as any configuration changes to the column groups.

Similar to the column group snapshot, a listing of all active key spaces, called the key space snapshot, is stored in the WAL.

Periodically, the WAL is flushed. Key space addition/removal operations are flushed into the key space snapshot,
and column group fragments and configuration are flushed into column group metadata.

### Compaction

Key space compaction is an LSM tree compaction. L0 runs are all potentially overlapping; all other levels have
non-overlapping sorted runs. *(there are only two levels currently)*

L1 initially starts with a key range of `b''` to `b'0xff'`, so compaction selects all L0 runs and merges them together.
While writing, if the number of keys grows too large, a new key boundary is introduced and the sorted run is split
in two. Subsequent compactions respect the new set of key ranges and include any existing L1 data alongside overlapping
L0 data.

When a key space is compacted, a displacement map is produced. The displacement map describes how to map a fragment
aligned to an origin key space to one or more fragments aligned to a destination key space. *(not yet implemented)*

L1 ranges determine the key boundaries that form the basis for column group compaction.

The goal of minor compaction is to reduce outer joins, but it does not rewrite fragments' data to achieve that.
It uses displacement maps to re-align fragments to new, higher-level key spaces, purely as a metadata operation.
Minor compaction might produce sparse or fragments that reuse a file handle but mask the file rows, called masked
fragments. *(minor compaction is not yet implemented)*

The goal of major compaction is to maintain reasonably large and not too sparse fragments (in the context of alignment),
and it will rewrite fragments to achieve that. It is similar to an LSM tree compaction, in that sorted runs are merged,
split, and moved across levels, except it uses the key space LSM tree structure to do so.
This ensures that fragments are optimal both in the context of a single column group and the entire table.

***

\[1] A fragment is densely aligned to a key space if and only if there are no key gaps in the fragment.
For example, consider a key space with the keys 1, 3, 4, 5, 10 and two fragments:
fragment one has data for keys 1, 3, 4, 5 and fragment two has data for keys 3, 5.
Fragment one is densely aligned to the key space because, within the key range of its rows, every key has a row.
Fragment two is sparsely aligned because within the key range of its rows one key lacks a corresponding row, key 4.

---
# Anatomy of a Scan
Section: Reference
Source: https://docs.spiraldb.com/anatomy-of-scan

Before we dive into the details of table scans, let's remind ourselves of the few design decisions in a [table format](https://docs.spiraldb.com/format):

- [Column Groups](https://docs.spiraldb.com/format#column-groups) are co-partitioned sets of columns. A partition is called fragment.
- Fragments are stored as [Vortex](https://vortex.dev/) files, sorted by primary key, but keys are stored separately.
- [Key Spaces](https://docs.spiraldb.com/format#key-spaces) store keys and are optimized for sort-merge joins (especially for dense runs of identical keys).
- Manifests store fragments metadata including an alignment between fragment and a key space.

![Fragments](/img/fragments.png)

These design decisions have following implications for table scans:

- Metadata that is frequently used for filtering is partitioned separately from data columns that are usually
  just projected (for example, it's very rare to filter on audio bytes).
- Fragments are partitioned by size (optimized for object storage). In practice, fragments that store metadata
  have a lot more rows compared to fragments that store projected data.

## A look at the table

Our table contains audio data with following schema:

```bash
Schema({
    audio_length=f64?,
    silence_ratio=f64?
    audio={
        bytes=binary?,
        meta={
            size=u64?,
            e_tag=utf8?,
        }?
    }?
})
```

Table contains audio bytes in a column group called `audio`, and two "metadata" column groups, a root one and
`audio.meta` with some additional source-specific metadata. A look at the manifests shows following (truncated):

```bash
Key Space manifest
131 fragments, total: 60.6MB, avg: 473.5KB, metadata: 431.7KB
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ID         ┃ Size (Metadata) ┃ Format ┃  Key Span  ┃ Level ┃           Committed At           ┃ Compacted At ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ m86rvtu9y6 │ 260.5KB (3.2KB) │ vortex │  0..10000  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ 0qaq87gz87 │ 30.4MB (19.6KB) │ vortex │ 0..1294000 │  L0   │ 2025-11-06 18:20:00.224537+00:00 │     N/A      │
│ dve28ce6z8 │ 247.1KB (3.1KB) │ vortex │  0..10000  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ hcjbwo31q5 │ 237.4KB (3.1KB) │ vortex │  0..10000  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ tzs9ssfgd7 │ 238.0KB (3.2KB) │ vortex │  0..10000  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │


Column Group manifest for table_sl6o0u
6 fragments, total: 113.9MB, avg: 19.0MB, metadata: 111.5KB
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ID         ┃ Size (Metadata) ┃ Format ┃     Key Span     ┃ Level ┃           Committed At           ┃ Compacted At ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ pqjb420f8r │ 20.2MB (19.1KB) │ vortex │    0..228261     │  L0   │ 2025-11-06 18:20:00.224537+00:00 │     N/A      │
│ vrv19qwg1i │ 20.2MB (19.1KB) │ vortex │  228261..456522  │  L0   │ 2025-11-06 18:20:00.224537+00:00 │     N/A      │
│ 9lhiuacvoi │ 20.0MB (19.1KB) │ vortex │  456522..684783  │  L0   │ 2025-11-06 18:20:00.224537+00:00 │     N/A      │
│ yq7dqeed9r │ 20.1MB (19.1KB) │ vortex │  684783..913044  │  L0   │ 2025-11-06 18:20:00.224537+00:00 │     N/A      │
│ 2tbvh0v6td │ 20.1MB (19.1KB) │ vortex │ 913044..1141305  │  L0   │ 2025-11-06 18:20:00.224537+00:00 │     N/A      │


Column Group manifest for table_sl6o0u.audio
1165 fragments, total: 137.6GB, avg: 120.9MB, metadata: 2.7MB
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ID         ┃ Size (Metadata) ┃ Format ┃  Key Span   ┃ Level ┃           Committed At           ┃ Compacted At ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ xn6rq15db4 │ 129.5MB (2.4KB) │ vortex │   0..1177   │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ vn4d04pp2r │ 128.7MB (2.4KB) │ vortex │ 1177..2354  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ yqgn69c1rh │ 126.2MB (2.4KB) │ vortex │ 2354..3531  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ 9mmqc78utg │ 129.3MB (2.4KB) │ vortex │ 3531..4708  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ ffsgw7yf5m │ 126.6MB (2.4KB) │ vortex │ 4708..5885  │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │


Column Group manifest for table_sl6o0u.audio.meta
130 fragments, total: 63.8MB, avg: 502.5KB, metadata: 891.4KB
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ ID         ┃ Size (Metadata) ┃ Format ┃ Key Span ┃ Level ┃           Committed At           ┃ Compacted At ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ lkbhpyhmhq │ 507.0KB (6.9KB) │ vortex │ 0..10000 │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ y052mol59q │ 514.9KB (6.9KB) │ vortex │ 0..10000 │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ w128y2udmu │ 504.6KB (6.9KB) │ vortex │ 0..10000 │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ v36ealhiys │ 502.9KB (6.9KB) │ vortex │ 0..10000 │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
│ a03enoo1j5 │ 502.7KB (6.9KB) │ vortex │ 0..10000 │  L0   │ 2025-11-06 20:10:41.359348+00:00 │     N/A      │
```

## A typical scan

A typical scan over this table might look like this:

```python
sp.scan(table["audio.bytes"], where=table["silence_ratio"] < 0.1)
```

When executing this scan, Spiral client performs following steps:

### Load Fragment Manifest(s)

Client identifies that the scan involves two column groups:

- root, for filtering on `silence_ratio`
- `audio`, for projecting `audio.bytes`

Client scans the table's manifests to identify relevant fragments for both column groups.

Client determines that `silence_ratio` filter can be pushed-down into the root column group scan, and
prunes manifests using statistics metadata about fragments.

![Fragments](/img/scan-fragments.png)

### Scan Filtered Column Group(s)

Client scans the fragments of the column group involved in filtering (root).

Client applies the filter `silence_ratio < 0.1` and produces a row mask (and the projected columns but in this
case no columns are being projected from this column group). In practice, this is a columnar file scan over lots
of rows (very efficient!).

Row mask indicates which rows satisfy the filter condition, and when combined with alignment metadata from manifests,
client can determine which keys correspond to the filtered rows.

![Filtered](/img/scan-filtered.png)

### Join Needed Key Spaces

Client identifies the key spaces needed for join between the filtered column group and the projected column group.

Client loads key spaces, applies row mask that is the result of the filtering, and joins them together to produce
a new row mask that indicates which value rows are needed for the projected column group.

![Keys](/img/scan-keys.png)

### Take Projected Column Group(s)

Client scans the fragments of the column group involved in projection, `audio` in this case.

Client applies the row mask obtained from the previous step to read only the needed rows from the projected column group.

![Take](/img/scan-take.png)

This is only possible because of the random access performance of Vortex files.

## About Performance

Scans are optimized for high-throughput.

- Filters are pushed down into Vortex file scans over large number of rows (10-20x faster that Parquet!).
- Keys are joined efficiently using partitioned Merkle hash tries.
- Projections are evaluated as random access Vortex file reads (100x faster than Parquet!).

This last point means that each batch of rows out of a scan is expressed only as masked Vortex array, enabling
**zero-copy & zero-decompression** transfer of data from storage all the way to the end user. And since Vortex
arrays can decompress on the GPU, this means that **data can be transferred directly into GPU memory!**

---
# Troubleshooting
Section: Reference
Source: https://docs.spiraldb.com/troubleshooting

A guide to troubleshooting common Spiral client-side issues.

## Exporting telemetry

Client-side [OTEL](https://opentelemetry.io/) telemetry data is exported automatically when telemetry is enabled.

To export to your own collector instead of Spiral Cloud, set the `SPIRAL_OTEL_ENDPOINT` environment variable to your OTEL collector endpoint.
Standard OTEL environment variables can be used to configure the OTEL exporter, such as `OTEL_EXPORTER_OTLP_HEADERS`, etc.

To enable telemetry export, see [Telemetry](https://docs.spiraldb.com/config#telemetry).

### Enabling OpenTelemetry internal logs

By default, the OpenTelemetry SDK's internal diagnostic logs are suppressed. To enable them (e.g. to debug export failures), set the `RUST_LOG` environment variable to include the `opentelemetry` target:

```bash
RUST_LOG=warn,opentelemetry=debug
```

This will show detailed logs from the OpenTelemetry SDK, including export attempts, retries, and errors.

## Temporary workarounds

Following issues may occur and currently have temporary workarounds. We are actively working on permanent fixes.
If the workaround does not work for you, please reach out to us.

- *"ArrowInvalid: offset overflow while concatenating arrays"* The error occurs on write when working with
  variable-length data (e.g., bytes, strings, lists) and the total size of the data exceeds the maximum allowed size for
  a single Arrow array. Spiral currently does not support writing Arrow chunked arrays so all chunks are combined into a
  single array before writing. To resolve this, we recommend using "large" Arrow types (e.g. `pa.large_binary()`,
  `pa.large_string()`, `pa.large_list()`) when passing in Arrow arrays. Spiral uses [Vortex](https://vortex.dev/) as
  storage and memory format. Unlike Arrow, Vortex has logical types and does not differentiate between "large" types.

- *"Too many open files"* This error occurs when the operating system limit for open file descriptors is reached.
  Default limits are usually too low for data-intensive applications. Use `ulimit -n` to check the current soft limit and
  `ulimit -Hn` to check the hard limit. Use `ulimit -n <new_limit>` to increase the limit for the session or reach out to
  your system administrator to increase the hard limit. We recommend setting the limit to at least 65536 or higher,
  but do not exceed the hard limit.