Write Tables
We’re going to explore the Spiral Tables write API using the classic GitHub Archive dataset.
Create a Table
Currently, tables can only be created using the Spiral Python API.
import pyarrow as pa

from spiral import Spiral
from spiral.demo import demo_project

sp = Spiral()
project = demo_project(sp)

# Define the key schema for the table.
# Random keys such as UUIDs should be avoided, so we're using `created_at` as part of the primary key.
key_schema = pa.schema([('created_at', pa.timestamp('ms')), ('id', pa.string())])

# Create the 'events-v1' table in the 'gharchive' dataset.
events = project.create_table('gharchive.events-v1', key_schema=key_schema, exist_ok=True)

To open the table in the future:
events = project.table("gharchive.events-v1")

We’ll define a function that downloads data from the GitHub Archive, then use it to load an hour of data.
import pyarrow as pa
import pandas as pd
import duckdb

def get_events(period: pd.Period, limit=100) -> pa.Table:
    json_gz_url = f"https://data.gharchive.org/{period.strftime('%Y-%m-%d')}-{period.hour}.json.gz"
    arrow_table = (
        duckdb.read_json(json_gz_url, union_by_name=True)
        .limit(limit)  # remove this line to load more rows
        .select("""
            * REPLACE (
                cast(created_at AS TIMESTAMP_MS) AS created_at,
                to_json(payload) AS payload
            )
        """)
        .to_arrow_table()
    )
    events = duckdb.from_arrow(arrow_table).order("created_at, id").distinct().to_arrow_table()
    events = (
        events.drop_columns("id")
        .add_column(0, "id", events["id"].cast(pa.large_string()))
        .drop_columns("created_at")
        .add_column(0, "created_at", events["created_at"].cast(pa.timestamp("ms")))
        .drop_columns("org")
    )
    return events
data: pa.Table = get_events(pd.Period("2023-01-01T00:00:00Z", freq="h"))

Write to a Table
All write operations to a Spiral table must include the key columns.
Writing nested data such as struct arrays or dictionaries into a table creates column groups. See Best Practices for how to design your schema for optimal performance.
The write API accepts a variety of data formats, including PyArrow tables and dictionaries. Since our function returns a PyArrow table, we can write it directly to the Spiral table.
events.write(data)

We could also have written multiple times atomically within a transaction:
with events.txn() as tx:
    for period in pd.period_range("2023-01-01T00:00:00Z", "2023-01-01T01:00:00Z", freq="h"):
        data = get_events(period)
        tx.write(data)

Or in parallel, useful when ingesting large amounts of data:
from concurrent.futures import ThreadPoolExecutor

def write_period(tx, period: pd.Period):
    data = get_events(period)
    tx.write(data)

with events.txn() as tx:
    with ThreadPoolExecutor() as executor:
        for period in pd.period_range("2023-01-01T00:00:00Z", "2023-01-01T01:00:00Z", freq="h"):
            executor.submit(write_period, tx, period)

It is important that the primary key columns are unique within the transaction, because all writes in a transaction share the same timestamp.
If a different value is written for the same column and the same key in the same transaction, the result when reading the table is undefined.
Our table schema now looks like this:
events.schema().to_arrow()

pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "payload": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})

Recall that column groups are defined as sibling columns in the same node of the schema tree.
Our schema, in addition to the root column group, contains the actor and repo column groups.
While columnar storage formats are quite efficient at pruning unused columns, the distance in the file between the data for each column can introduce seek operations that can significantly slow down queries over object storage. In many cases, if you know upfront that you will always query a set of columns together, you can improve query performance by grouping these columns together in a column group.
Append Columns
Spiral Tables are merged across writes, making it easy to append columns to an existing table.
Suppose we wanted to derive an action column. We could append it to the existing table without
rewriting the entire dataset.
This example uses some syntax from the spiral.expressions module. For more detailed examples, refer to
Expressions.
actions = sp.scan({
    'created_at': events["created_at"],
    'id': events["id"],
    'action': {
        "repo_id": events["repo.id"],
        "actor_id": events["actor.id"],
        "type": events["type"],
        "payload": events["payload"],
    },
}).to_table()

events.write(actions)

Our table schema now contains an action column group:
events.schema().to_arrow()

pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "payload": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
    "action": pa.struct({
        "actor_id": pa.int64(),
        "payload": pa.string(),
        "repo_id": pa.int64(),
        "type": pa.string(),
    }),
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})

Write Blobs
Tables support blob columns for storing binary payloads — images, audio, video, or arbitrary binary data — alongside structured columns in the same table. A blob column is created automatically the first time you write a blob value into it.
Use helpers from spiral.expressions.blob to write blob data:
from spiral import expressions as se

blobs_tbl = project.create_table('blobs', key_schema={"id": pa.int64()}, exist_ok=True)

png_bytes = b"\x89PNG\r\n\x1a\n..."  # truncated PNG file bytes
jpeg_bytes = b"\xff\xd8\xff..."  # truncated JPEG file bytes
tiny_bytes = b"\x00\x01\x02"  # small payload for inline storage

# Always upload to external object storage
blobs_tbl.write({"id": [1, 2], "image": se.blob.write([png_bytes, jpeg_bytes])})

# Always store inline in the column data (good for small payloads)
blobs_tbl.write({"id": [3], "thumbnail": se.blob.inline([tiny_bytes])})

For fine-grained control, use se.Blob() directly on a per-row basis. By default payloads below 256 KiB are
stored inline and larger ones are uploaded externally. Use threshold= to override this:
blobs_tbl.write({
    "id": [1, 2, 3],
    "data": [
        se.Blob(tiny_bytes),  # auto: inline if < 256 KiB
        se.Blob(jpeg_bytes, threshold=1024),  # external if >= 1 KiB
        se.Blob(png_bytes, mime_type="image/png"),  # explicit MIME type
    ],
})

To write a blob at a predictable, content-independent storage path (useful for deduplication or
human-readable URIs), pass path=:
blobs_tbl.write({"id": [4], "logo": [se.Blob(png_bytes, path="assets/logo.png")]})

Use read expressions to project blob fields in a scan:
sp.scan({
    "id": blobs_tbl["id"],
    "payload": se.blob.bytes(blobs_tbl["image"]),  # raw bytes (fetches external blobs)
    "uri": se.blob.uri(blobs_tbl["image"]),  # spfsm:// URI, or NULL if inline
    "mime": se.blob.mime_type(blobs_tbl["image"]),  # auto-detected or explicit MIME type
    "external": se.blob.is_external(blobs_tbl["image"]),  # True / False
    "inline": se.blob.is_inline(blobs_tbl["image"]),  # True / False
}).to_table()

MIME types are detected automatically from magic bytes (PNG, JPEG, etc.) unless overridden with
mime_type= at write time.
Referencing blobs
If your blobs already live in an S3 bucket (or another supported object store), you can reference them zero-copy — no re-upload required.
1. Point a project’s file system at your bucket
Use spiral fs update to configure the project’s file system:
$ spiral fs update --type s3 --bucket my-bucket --region us-east-1 <project-id>

2. Mount a path from that file system into your table
Use spiral tables fs-mount to attach a directory prefix of that
file system to the table. The command prints the mount identifier:
$ spiral tables fs-mount \
    --table my-table \
    --file-system <project-id> \
    --directory /images \
    <project-id>
mount_id: mount_123

To look up mount identifiers later, use spiral tables fs-show:
$ spiral tables fs-show --table my-table <project-id>

3. Write external blob references
Construct spfsm:// URIs using the mount identifier and write them with se.blob.external(). No
upload happens; the table stores a read-only reference:
mount_id = "mount_123"  # from fs-mount output above

blobs_tbl.write({
    "id": [1, 2],
    "image": se.blob.external([
        f"spfsm://{mount_id}/photo-a.png",
        f"spfsm://{mount_id}/photo-b.png",
    ]),
})

se.blob.bytes() will then fetch the file contents directly from the bucket at query time.
External references are never deleted during transaction rollback.