
Write Tables

We’re going to explore the Spiral Tables write API using the ever-classic GitHub Archive dataset.

Create a Table

Currently, tables can only be created using the Spiral Python API.

```python
import pyarrow as pa

from spiral import Spiral
from spiral.demo import demo_project

sp = Spiral()
project = demo_project(sp)

# Define the key schema for the table.
# Random keys such as UUIDs should be avoided, so we're using `created_at` as part of the primary key.
key_schema = pa.schema([
    ('created_at', pa.timestamp('ms')),
    ('id', pa.string()),
])

# Create the 'events-v1' table in the 'gharchive' dataset.
events = project.create_table('gharchive.events-v1', key_schema=key_schema, exist_ok=True)
```

To open the table in the future:

```python
events = project.table("gharchive.events-v1")
```

We’ll define a function that downloads data from the GitHub Archive, then use it to load an hour of data.

```python
import pyarrow as pa
import pandas as pd
import duckdb

def get_events(period: pd.Period, limit=100) -> pa.Table:
    json_gz_url = f"https://data.gharchive.org/{period.strftime('%Y-%m-%d')}-{str(period.hour)}.json.gz"
    arrow_table = (
        duckdb.read_json(json_gz_url, union_by_name=True)
        .limit(limit)  # remove this line to load more rows
        .select("""
            * REPLACE (
                cast(created_at AS TIMESTAMP_MS) AS created_at,
                to_json(payload) AS payload
            )
        """)
        .to_arrow_table()
    )
    events = duckdb.from_arrow(arrow_table).order("created_at, id").distinct().to_arrow_table()
    events = (
        events.drop_columns("id")
        .add_column(0, "id", events["id"].cast(pa.large_string()))
        .drop_columns("created_at")
        .add_column(0, "created_at", events["created_at"].cast(pa.timestamp("ms")))
        .drop_columns("org")
    )
    return events

data: pa.Table = get_events(pd.Period("2023-01-01T00:00:00Z", freq="h"))
```

Write to a Table

All write operations to a Spiral table must include the key columns.
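As a hypothetical pre-flight check (not part of the Spiral API), you might verify that a batch carries the table's key columns before writing it:

```python
# Hypothetical helper, not part of the Spiral API: verify a batch
# includes the table's key columns before calling events.write().
def check_key_columns(batch: dict, key_columns=("created_at", "id")) -> None:
    missing = [name for name in key_columns if name not in batch]
    if missing:
        raise ValueError(f"write must include key columns: {missing}")

check_key_columns({"created_at": [...], "id": [...], "type": [...]})  # ok
```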

Writing nested data such as struct arrays or dictionaries into a table creates column groups. See Best Practices for how to design your schema for optimal performance.

The write API accepts a variety of data formats, including PyArrow tables and dictionaries. Since our function returns a PyArrow table, we can write it directly to the Spiral table.

```python
events.write(data)
```

We could also write multiple times atomically within a single transaction.

```python
with events.txn() as tx:
    for period in pd.period_range("2023-01-01T00:00:00Z", "2023-01-01T01:00:00Z", freq="h"):
        data = get_events(period)
        tx.write(data)
```

Or in parallel, useful when ingesting large amounts of data:

```python
from concurrent.futures import ThreadPoolExecutor

def write_period(tx, period: pd.Period):
    data = get_events(period)
    tx.write(data)

with events.txn() as tx:
    with ThreadPoolExecutor() as executor:
        for period in pd.period_range("2023-01-01T00:00:00Z", "2023-01-01T01:00:00Z", freq="h"):
            executor.submit(write_period, tx, period)
```

It is important that the primary key values are unique within a transaction, because all writes in a transaction share the same timestamp.

If a different value is written for the same column and the same key in the same transaction, the result when reading the table is undefined.

Our table schema now looks like this:

```python
events.schema().to_arrow()
```

```python
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "payload": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})
```

Recall that column groups are defined as sibling columns in the same node of the schema tree. Our schema, in addition to the root column group, contains the actor and repo column groups.

While columnar storage formats are quite efficient at pruning unused columns, the distance in the file between each column's data can introduce seek operations that significantly slow down queries over object storage. If you know upfront that you will always query a set of columns together, grouping them into a column group can improve query performance.

Append Columns

Spiral Tables are merged across writes, making it easy to append columns to an existing table.

Suppose we wanted to derive an action column group. We can append it to the existing table without rewriting the entire dataset.

This example uses syntax from the spiral.expressions module. For more detailed examples, refer to the Expressions documentation.

```python
actions = sp.scan({
    'created_at': events["created_at"],
    'id': events["id"],
    'action': {
        "repo_id": events["repo.id"],
        "actor_id": events["actor.id"],
        "type": events["type"],
        "payload": events["payload"],
    },
}).to_table()

events.write(actions)
```

Our table schema now contains an action column group:

```python
events.schema().to_arrow()
```

```python
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "payload": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
    "action": pa.struct({
        "actor_id": pa.int64(),
        "payload": pa.string(),
        "repo_id": pa.int64(),
        "type": pa.string(),
    }),
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})
```

Write Blobs

Tables support blob columns for storing binary payloads — images, audio, video, or arbitrary binary data — alongside structured columns in the same table. A blob column is created automatically the first time you write a blob value into it.

Use helpers from spiral.expressions.blob to write blob data:

```python
from spiral import expressions as se

blobs_tbl = project.create_table('blobs', key_schema={"id": pa.int64()}, exist_ok=True)

png_bytes = b"\x89PNG\r\n\x1a\n..."  # truncated PNG file bytes
jpeg_bytes = b"\xff\xd8\xff..."      # truncated JPEG file bytes
tiny_bytes = b"\x00\x01\x02"         # small payload for inline storage

# Always upload to external object storage
blobs_tbl.write({"id": [1, 2], "image": se.blob.write([png_bytes, jpeg_bytes])})

# Always store inline in the column data (good for small payloads)
blobs_tbl.write({"id": [3], "thumbnail": se.blob.inline([tiny_bytes])})
```

For fine-grained control, use se.Blob() directly on a per-row basis. By default, payloads below 256 KiB are stored inline and larger ones are uploaded externally. Use threshold= to override this:

```python
blobs_tbl.write({
    "id": [1, 2, 3],
    "data": [
        se.Blob(tiny_bytes),                        # auto: inline if < 256 KiB
        se.Blob(jpeg_bytes, threshold=1024),        # external if >= 1 KiB
        se.Blob(png_bytes, mime_type="image/png"),  # explicit MIME type
    ],
})
```
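The inline-vs-external decision described above amounts to a size check against the threshold. An illustrative stdlib sketch (the real logic lives inside se.Blob):

```python
# Illustrative sketch of the documented default: payloads below the
# threshold (256 KiB unless overridden) are stored inline, larger
# ones are uploaded to external object storage.
def storage_mode(payload: bytes, threshold: int = 256 * 1024) -> str:
    return "inline" if len(payload) < threshold else "external"

storage_mode(b"\x00\x01\x02")              # "inline"
storage_mode(bytes(300 * 1024))            # "external"
storage_mode(bytes(2048), threshold=1024)  # "external"
```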

To write a blob at a predictable, content-independent storage path (useful for deduplication or human-readable URIs), pass path=:

```python
blobs_tbl.write({"id": [4], "logo": [se.Blob(png_bytes, path="assets/logo.png")]})
```

Use read expressions to project blob fields in a scan:

```python
sp.scan({
    "id": blobs_tbl["id"],
    "payload": se.blob.bytes(blobs_tbl["image"]),          # raw bytes (fetches external blobs)
    "uri": se.blob.uri(blobs_tbl["image"]),                # spfsm:// URI, or NULL if inline
    "mime": se.blob.mime_type(blobs_tbl["image"]),         # auto-detected or explicit MIME type
    "external": se.blob.is_external(blobs_tbl["image"]),   # True / False
    "inline": se.blob.is_inline(blobs_tbl["image"]),       # True / False
}).to_table()
```

MIME types are detected automatically from magic bytes (PNG, JPEG, etc.) unless overridden with mime_type= at write time.
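Magic-byte detection of this kind can be sketched in a few lines (a hypothetical illustration; Spiral's actual detector may cover more formats):

```python
# Hypothetical sketch of magic-byte MIME sniffing, mirroring the
# PNG/JPEG detection described above.
_MAGIC = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}

def sniff_mime(payload: bytes):
    for magic, mime in _MAGIC.items():
        if payload.startswith(magic):
            return mime
    return None  # unknown: caller falls back to an explicit mime_type=

sniff_mime(b"\x89PNG\r\n\x1a\n...")  # "image/png"
```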

Referencing Blobs

If your blobs already live in an S3 bucket (or another supported object store), you can reference them zero-copy — no re-upload required.

1. Point a project’s file system at your bucket

Use spiral fs update to configure the project’s file system:

```shell
$ spiral fs update --type s3 --bucket my-bucket --region us-east-1 <project-id>
```

2. Mount a path from that file system into your table

Use spiral tables fs-mount to attach a directory prefix of that file system to the table. The command prints the mount identifier:

```shell
$ spiral tables fs-mount \
    --table my-table \
    --file-system <project-id> \
    --directory /images \
    <project-id>
mount_id: mount_123
```

To look up mount identifiers later, use spiral tables fs-show:

```shell
$ spiral tables fs-show --table my-table <project-id>
```

3. Write external blob references

Construct spfsm:// URIs using the mount identifier and write them with se.blob.external(). No upload happens; the table stores a read-only reference:

```python
mount_id = "mount_123"  # from fs-mount output above
blobs_tbl.write({
    "id": [1, 2],
    "image": se.blob.external([
        f"spfsm://{mount_id}/photo-a.png",
        f"spfsm://{mount_id}/photo-b.png",
    ]),
})
```

se.blob.bytes() will then fetch the file contents directly from the bucket at query time. External references are never deleted during transaction rollback.
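An spfsm:// URI is just a scheme, a mount identifier, and a path within the mount, so the stdlib can split one apart. A sketch, assuming the URI shape shown above:

```python
from urllib.parse import urlparse

# Split an spfsm:// reference into its mount id and in-mount path.
# Sketch based on the spfsm://<mount-id>/<path> shape shown above.
def parse_spfsm(uri: str) -> tuple:
    parsed = urlparse(uri)
    if parsed.scheme != "spfsm":
        raise ValueError(f"not an spfsm URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

parse_spfsm("spfsm://mount_123/photo-a.png")  # ("mount_123", "photo-a.png")
```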
