
Scan & Query

Scanning tables is the process of reading data row by row, optionally applying row-based scalar transformations or row-based filtering operations.

Scan Object

The scan method returns a Scan object that encapsulates a specific query. This object can then return rows of data, or be used to perform further operations.

import spiral
import pyarrow as pa
from spiral.demo import gharchive

sp = spiral.Spiral()
events_table = gharchive(sp)

scan = sp.scan(events_table[["id", "type", "public", "actor", "payload.*", "repo"]])

# The result schema of the scan
schema: pa.Schema = scan.schema

# Whether the scan is definitely empty
is_empty: bool = scan.is_empty()

# Read as a stream of RecordBatches
record_batches: pa.RecordBatchReader = scan.to_record_batches()

# Read into a single PyArrow Table
arrow_table: pa.Table = scan.to_table()

# Read into a Dask DataFrame
dask_df: dd.DataFrame = scan.to_dask()

# Read into a Ray Dataset
# ray_ds: ray.data.Dataset = scan.to_ray_dataset()

# Read into a Pandas DataFrame
pandas_df: pd.DataFrame = scan.to_pandas()

# Read into a Polars DataFrame
polars_df: pl.DataFrame = scan.to_polars()

# Read into a Torch-compatible DataLoader for model training
data_loader: torch.utils.data.DataLoader = scan.to_data_loader()

# Read into a Torch-compatible DataLoader for distributed model training
distributed_data_loader: torch.utils.data.DataLoader = scan.to_distributed_data_loader()

# Read into a Torch-compatible IterableDataset for model training
iterable_dataset: torch.utils.data.IterableDataset = scan.to_iterable_dataset()

Scan’s to_record_batches method returns rows sorted by the table’s key. If ordering is not important, consider using to_unordered_record_batches which can be faster.
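
For example, when downstream processing does not depend on row order, the unordered reader is a drop-in replacement. A minimal sketch (process is a hypothetical placeholder for your own per-batch logic, and it is assumed both readers can be iterated batch by batch):

# Ordered: batches arrive sorted by the table's key.
for batch in scan.to_record_batches():
    process(batch)

# Unordered: may be faster when ordering does not matter.
for batch in scan.to_unordered_record_batches():
    process(batch)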

Filtering

Filtering is the process of selecting rows that meet a certain condition. For example, to find events with a specific event type:

pull_request_events = sp.scan(
    events_table,
    where=events_table['type'] == 'PullRequestEvent'
)

Any expression that resolves to a boolean value can be used as a filter. See the Expressions documentation for more information.
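
For example, a compound filter might combine several conditions. The sketch below assumes boolean expressions compose with the & operator, as is common in expression APIs; consult the Expressions documentation for the exact syntax:

# Hypothetical compound filter: public pull-request events only.
# The `&` composition is an assumption; see the Expressions docs.
public_prs = sp.scan(
    events_table,
    where=(events_table['type'] == 'PullRequestEvent') & events_table['public'],
)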

Projection

Projection is the process of applying a transformation function to each row of a table. This can be as simple as selecting a subset of columns, or as involved as passing a string column through an LLM API.

A projection expression must resolve to a struct value. See Expressions documentation for more information.
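
For example, a projection can select and rename columns by packing them into a struct with a dictionary (this form is covered in more detail under Nested Data below):

# Project each row into a struct with renamed fields.
renamed = sp.scan({
    "event_type": events_table["type"],
    "repo": events_table["repo"],
})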

Nested Data

Scanning a table selects all columns, including nested column groups. Here the large payload column group is excluded for brevity:

sp.scan(events_table.select(exclude=['payload'])).schema.to_arrow()
pa.schema({
    "created_at": pa.timestamp("ms"),
    "id": pa.string(),
    "public": pa.bool_(),
    "type": pa.string(),
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})

You can select an entire column group using bracket notation. Remember, the result must always be a struct; a column group is itself a struct, so this is valid:

sp.scan(events_table["repo"]).schema.to_arrow()
pa.schema({ "id": pa.int64(), "name": pa.string(), "url": pa.string(), })

Selecting a single column this way is not valid, because a plain column is not a struct. When selecting columns, use double brackets:

sp.scan(events_table[["public"]]).schema.to_arrow()
pa.schema({ "public": pa.bool_(), })

Or multiple columns:

sp.scan(events_table[["public", "repo"]]).schema.to_arrow()
pa.schema({ "public": pa.bool_(), "repo": pa.struct({ "id": pa.int64(), "name": pa.string(), "url": pa.string(), }), })

You can pack columns with custom names using a dictionary:

sp.scan({ "is_public": events_table["public"], "repo": events_table["repo"], }).schema.to_arrow()
pa.schema({ "is_public": pa.bool_(), "repo": pa.struct({ "id": pa.int64(), "name": pa.string(), "url": pa.string(), }), })

Double brackets are syntactic sugar for the .select() method:

sp.scan(events_table.select("public")).schema.to_arrow()
pa.schema({ "public": pa.bool_(), })

Selection applies over columns and nested column groups:

sp.scan(events_table.select("public", "repo")).schema.to_arrow()
pa.schema({ "public": pa.bool_(), "repo": pa.struct({ "id": pa.int64(), "name": pa.string(), "url": pa.string(), }), })

It is possible to “select into” a column group to get specific fields within nested structures:

sp.scan(events_table.select("public", "repo.id", "repo.url")).schema.to_arrow()
pa.schema({ "public": pa.bool_(), "repo": pa.struct({ "id": pa.int64(), "url": pa.string(), }), })

You can select from multiple nested structures:

sp.scan(events_table.select("public", "repo", "actor.login")).schema.to_arrow()
pa.schema({ "public": pa.bool_(), "repo": pa.struct({ "id": pa.int64(), "name": pa.string(), "url": pa.string(), }), "actor": pa.struct({ "login": pa.string(), }), })

Note that selecting “into” a column group returns a nested structure. To flatten, “step into” the column group first:

sp.scan(events_table[["public"]], events_table["repo"][["id", "url"]]).schema.to_arrow()
pa.schema({ "public": pa.bool_(), "id": pa.int64(), "url": pa.string(), })

Column groups support the same selection operations. This is equivalent to the flattened example above:

sp.scan(events_table["repo"].select("id", "url")).schema.to_arrow()
pa.schema({ "id": pa.int64(), "url": pa.string(), })

Use exclude to remove specific columns, column groups, or keys:

sp.scan(events_table.select(exclude=["payload", "repo", "actor"])).schema.to_arrow()
pa.schema({ "created_at": pa.timestamp("ms"), "id": pa.string(), "public": pa.bool_(), "type": pa.string(), })

The "*" wildcard allows you to select or exclude the columns of a specific column group without including or excluding its nested column groups:

sp.scan(events_table.select("*")).schema.to_arrow()
pa.schema({ "created_at": pa.timestamp("ms"), "id": pa.string(), "public": pa.bool_(), "type": pa.string(), })

Exclude all top-level columns, together with the payload column group, to keep only the remaining nested groups:

sp.scan(events_table.select(exclude=["*", "payload"])).schema.to_arrow()
pa.schema({
    "actor": pa.struct({
        "avatar_url": pa.string(),
        "display_login": pa.string(),
        "gravatar_id": pa.string(),
        "id": pa.int64(),
        "login": pa.string(),
        "url": pa.string(),
    }),
    "repo": pa.struct({
        "id": pa.int64(),
        "name": pa.string(),
        "url": pa.string(),
    }),
})

Wildcards can be mixed with other selections:

sp.scan(events_table.select("*", "actor.login")).schema.to_arrow()
pa.schema({ "created_at": pa.timestamp("ms"), "id": pa.string(), "public": pa.bool_(), "type": pa.string(), "actor": pa.struct({ "login": pa.string(), }), })

Querying

Recent years have seen the emergence of several excellent query engines. Spiral tables integrate with the following:

Polars

Tables can also be queried with Polars LazyFrames:

tbl = gharchive(sp)

df = tbl.to_polars_lazy_frame()
result = df.collect()
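
Once you have a LazyFrame, the usual Polars operations apply. A small sketch reusing df from above (standard Polars API; whether filters are pushed down into the Spiral scan is not asserted here):

import polars as pl

# Filter and select with ordinary LazyFrame operations, then collect.
result = (
    df.filter(pl.col("type") == "PullRequestEvent")
      .select("id", "repo")
      .collect()
)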

DuckDB

Tables can be turned into PyArrow Datasets with to_arrow_dataset(), which in turn enables the DuckDB Python API.

import duckdb

tbl = gharchive(sp)
ds = tbl.to_arrow_dataset()

result = duckdb.execute("SELECT type, COUNT(*) FROM ds GROUP BY type")

Or using the DuckDB Relational API:

result = tbl.to_duckdb_relation().filter("public is not null").to_arrow_table()
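
The relation supports further chained operations from the standard DuckDB relational API, for example (a sketch; the column names come from the gharchive demo table):

# Project two columns, cap the row count, and materialize as Arrow.
preview = (
    tbl.to_duckdb_relation()
       .project("type, public")
       .limit(10)
       .to_arrow_table()
)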

Key Table

A key table scan is the equivalent of a primary key lookup. Any scan can be evaluated against a sorted, unique table of keys to return only the rows that match those keys, with columns defined by the scan’s projection. The result always contains all the rows from the input key table; if the scan specifies any filters, rows that do not match the filter are returned as nulls.

import pyarrow as pa

scan = sp.scan(events_table)

key_table = pa.table({
    "created_at": ['2025-11-20 13:00:00'],
    "id": ['4807206745'],
})

results = scan.to_record_batches(key_table=key_table)

It is possible to stream the keys into the scan using an Arrow RecordBatchReader. When streaming, each batch must be sorted and must not contain duplicates, but the stream as a whole does not need to be sorted.

import pyarrow as pa
from typing import Iterable
from spiral.demo import abc

table_abc = abc(sp)
scan = sp.scan(table_abc.select("a", "b"))

def batches() -> Iterable[pa.RecordBatch]:
    yield from [
        pa.record_batch({
            "a": pa.array([3, 5, 8], type=pa.int64()),
        }),
        pa.record_batch({
            "a": pa.array([0, 1, 2, 6, 7, 9], type=pa.int64()),
        }),
        pa.record_batch({
            "a": pa.array([1, 2, 3, 4, 5], type=pa.int64()),
        }),
    ]

scan.to_record_batches(
    key_table=pa.RecordBatchReader.from_batches(
        table_abc.key_schema.to_arrow(),
        batches(),
    )
).read_all().to_pylist()
[
    {"a": 3, "b": 103},
    {"a": 5, "b": 105},
    {"a": 8, "b": 108},
    {"a": 0, "b": 100},
    {"a": 1, "b": 101},
    {"a": 2, "b": 102},
    {"a": 6, "b": 106},
    {"a": 7, "b": 107},
    {"a": 9, "b": 109},
    {"a": 1, "b": 101},
    {"a": 2, "b": 102},
    {"a": 3, "b": 103},
    {"a": 4, "b": 104},
    {"a": 5, "b": 105},
]

Note that when streaming scan results with to_record_batches, the result stream can contain more batches than the input key stream; the scan is optimized to return rows as soon as possible. A batch in the result stream will never contain rows that span multiple input key batches. to_record_batches supports a batch_aligned parameter that ensures each output batch corresponds exactly to an input key batch. This flag can have a performance impact, since it may require buffering rows until the end of the input key batch is reached, and possibly an extra copy. It is recommended to use it only when the full result for a given input batch is needed before processing can continue.
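
For example, to request aligned batches for the streaming example above (a minimal sketch; the exact keyword call shape is assumed):

# batch_aligned=True: each output batch corresponds exactly to an input key
# batch, at the cost of buffering until that key batch is fully processed.
aligned = scan.to_record_batches(
    key_table=pa.RecordBatchReader.from_batches(
        table_abc.key_schema.to_arrow(),
        batches(),
    ),
    batch_aligned=True,
)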

Serialization

Scan objects support Python’s pickle protocol, enabling seamless integration with distributed systems like Ray and multiprocessing without requiring manual serialization.

Pickle

import pickle
from spiral.demo import abc

table_abc = abc(sp)
scan = sp.scan(table_abc[["a", "b"]], where=table_abc["a"] < 3)

# Serialize
pickled = pickle.dumps(scan)

# Deserialize
restored = pickle.loads(pickled)
restored.to_table().to_pylist()
[{"a": 0, "b": 100}, {"a": 1, "b": 101}, {"a": 2, "b": 102}]

Ray

import ray
from spiral.demo import abc

table_abc = abc(sp)

@ray.remote
def process_scan(scan):
    return scan.to_table().num_rows

scan = sp.scan(table_abc)
count = ray.get(process_scan.remote(scan))

Manual State (JSON-compatible)

For non-pickle contexts, use state_bytes() and resume_scan():

from spiral.demo import abc

table_abc = abc(sp)
scan = sp.scan(table_abc[["a", "b"]], where=table_abc["a"] < 3)

# Serialize scan state
state = scan.state_bytes()  # bytes, can be base64-encoded for JSON

# Resume on another process or machine
restored = sp.resume_scan(state)
restored.to_table().to_pylist()
[{"a": 0, "b": 100}, {"a": 1, "b": 101}, {"a": 2, "b": 102}]
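
For example, the state can be embedded in a JSON payload by base64-encoding it (a sketch using only the standard library):

import base64
import json

# Encode the scan state for a JSON-compatible transport.
payload = json.dumps({"scan_state": base64.b64encode(state).decode("ascii")})

# ...later, on another process or machine:
state = base64.b64decode(json.loads(payload)["scan_state"])
restored = sp.resume_scan(state)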

Cross-table

It is possible to jointly scan any tables that share a common key schema. This has the same behaviour as an outer join, but is more efficient since the tables are all sorted by their key columns.

sp.scan(
    {
        "events": events_table,
        "other_events": other_events_table,
    }
)
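
The joint scan is a regular Scan object, so the usual readers apply. A sketch (other_events_table stands in for any table with the same key schema; the exact layout of the output columns is not asserted here):

joint = sp.scan({
    "events": events_table,
    "other_events": other_events_table,
})

# Read the outer-join-like result over the shared key as a single table.
joint_table = joint.to_table()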