
Scan & Query

Scanning tables is the process of reading data row-by-row, optionally performing row-based scalar transformations or row-based filtering operations.

Scan Object

The scan method returns a Scan object that encapsulates a specific query. This object can then return rows of data, or be used to perform further operations.

import spiral
import pyarrow as pa
from spiral.demo import gharchive

# Imports used only for the type annotations below
import dask.dataframe as dd
import pandas as pd
import polars as pl
import torch

sp = spiral.Spiral()
events_table = gharchive(sp)

scan = sp.scan(events_table[["id", "type", "public", "actor", "payload.*", "repo"]])

# The result schema of the scan
schema: pa.Schema = scan.schema

# Whether the scan is definitely empty
is_empty: bool = scan.is_empty()

# Read as a stream of RecordBatches
record_batches: pa.RecordBatchReader = scan.to_record_batches()

# Read into a single PyArrow Table
arrow_table: pa.Table = scan.to_table()

# Read into a Dask DataFrame
dask_df: dd.DataFrame = scan.to_dask()

# Read into a Pandas DataFrame
pandas_df: pd.DataFrame = scan.to_pandas()

# Read into a Polars DataFrame
polars_df: pl.DataFrame = scan.to_polars()

# Read into a Torch-compatible DataLoader for model training
data_loader: torch.utils.data.DataLoader = scan.to_data_loader()

# Read into a Torch-compatible DataLoader for distributed model training
distributed_data_loader: torch.utils.data.DataLoader = scan.to_distributed_data_loader()

# Read into a Torch-compatible IterableDataset for model training
iterable_dataset: torch.utils.data.IterableDataset = scan.to_iterable_dataset()

Scan’s to_record_batches method returns rows sorted by the table’s key. If ordering is not important, consider using to_unordered_record_batches which can be faster.
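For example, when downstream processing does not depend on key order (a minimal sketch; we assume to_unordered_record_batches returns a RecordBatchReader like its ordered counterpart):

# Assumption: same return type as to_record_batches
unordered_batches: pa.RecordBatchReader = scan.to_unordered_record_batches()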

Filtering

Filtering is the process of selecting rows that meet a certain condition. For example, to find events with a specific event type:

pull_request_events = sp.scan(
    events_table,
    where=events_table['type'] == 'PullRequestEvent'
)

Any expression that resolves to a boolean value can be used as a filter. See the Expressions documentation for more information.
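Filters can also be composed. A minimal sketch, assuming boolean expressions combine with the & operator (an assumption not confirmed on this page; see the Expressions documentation):

# Assumption: expressions support & for logical AND
public_pr_events = sp.scan(
    events_table,
    where=(events_table['type'] == 'PullRequestEvent') & (events_table['public'] == True),
)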

Projection

Projection is the process of applying a transformation function to a single row of a table. This can range from simply selecting a subset of columns to much more complex functions, such as passing a string column through an LLM API.

A projection expression must resolve to a struct value. See the Expressions documentation for more information.
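For example, a simple projection that selects and renames columns using the dictionary form described under Nested Data below (a minimal sketch; the expressions here only rearrange existing columns):

# Project each row into a struct with two renamed fields
renamed = sp.scan({
    "event_type": events_table["type"],
    "is_public": events_table["public"],
})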

Nested Data

Scanning a table selects all columns, including nested column groups (in this example the payload column group is excluded):

sp.scan(events_table.select(exclude=['payload'])).schema.to_arrow()
pa.schema({ "created_at": pa.timestamp("ms"), "id": pa.string_view(), "public": pa.bool_(), "type": pa.string_view(), "actor": pa.struct({ "avatar_url": pa.string_view(), "display_login": pa.string_view(), "gravatar_id": pa.string_view(), "id": pa.int64(), "login": pa.string_view(), "url": pa.string_view(), }), "repo": pa.struct({ "id": pa.int64(), "name": pa.string_view(), "url": pa.string_view(), }) })

You can select an entire column group using bracket notation. Remember, the result must always be a struct; since a column group is itself a struct, this is valid:

sp.scan(events_table["repo"]).schema.to_arrow()
pa.schema({ "id": pa.int64(), "name": pa.string_view(), "url": pa.string_view(), })

Selecting a single column this way is not valid, because a column is not a struct. When selecting columns, use double brackets:

sp.scan(events_table[["public"]]).schema.to_arrow()
pa.schema({ "public": pa.bool_(), })

Or multiple columns:

sp.scan(events_table[["public", "repo"]]).schema.to_arrow()
pa.schema({ "public": pa.bool_(), "repo": pa.struct({ "id": pa.int64(), "name": pa.string_view(), "url": pa.string_view(), }), })

You can pack columns with custom names using a dictionary:

sp.scan({ "is_public": events_table["public"], "repo": events_table["repo"], }).schema.to_arrow()
pa.schema({ "is_public": pa.bool_(), "repo": pa.struct({ "id": pa.int64(), "name": pa.string_view(), "url": pa.string_view(), }), })

Double brackets are syntactic sugar for the .select() method:

sp.scan(events_table.select("public")).schema.to_arrow()
pa.schema({ "public": pa.bool_(), })

Selection applies over columns and nested column groups:

sp.scan(events_table.select("public", "repo")).schema.to_arrow()
pa.schema({ "public": pa.bool_(), "repo": pa.struct({ "id": pa.int64(), "name": pa.string_view(), "url": pa.string_view(), }), })

It is possible to “select into” a column group to get specific fields within nested structures:

sp.scan(events_table.select("public", "repo.id", "repo.url")).schema.to_arrow()
pa.schema({ "public": pa.bool_(), "repo": pa.struct({ "id": pa.int64(), "url": pa.string_view(), }), })

You can select from multiple nested structures:

sp.scan(events_table.select("public", "repo", "actor.login")).schema.to_arrow()
pa.schema({ "public": pa.bool_(), "repo": pa.struct({ "id": pa.int64(), "name": pa.string_view(), "url": pa.string_view(), }), "actor": pa.struct({ "login": pa.string_view(), }), })

Note that selecting “into” a column group returns a nested structure. To flatten, “step into” the column group first:

sp.scan(events_table[["public"]], events_table["repo"][["id", "url"]]).schema.to_arrow()
pa.schema({ "public": pa.bool_(), "id": pa.int64(), "url": pa.string_view(), })

Column groups support the same selection operations. This is equivalent to the flattened example above:

sp.scan(events_table["repo"].select("id", "url")).schema.to_arrow()
pa.schema({ "id": pa.int64(), "url": pa.string_view(), })

Use exclude to remove specific columns, column groups, or keys:

sp.scan(events_table.select(exclude=["payload", "repo", "actor"])).schema.to_arrow()
pa.schema({ "created_at": pa.timestamp("ms"), "id": pa.string_view(), "public": pa.bool_(), "type": pa.string_view(), })

Wildcard "*" allows you to select or exclude columns from a specific column group, without including or excluding nested column groups:

sp.scan(events_table.select("*")).schema.to_arrow()
pa.schema({ "created_at": pa.timestamp("ms"), "id": pa.string_view(), "public": pa.bool_(), "type": pa.string_view(), })

Exclude all top-level columns to keep only the nested column groups; here the payload column group is excluded as well:

sp.scan(events_table.select(exclude=["*", "payload"])).schema.to_arrow()
pa.schema({ "actor": pa.struct({ "avatar_url": pa.string_view(), "display_login": pa.string_view(), "gravatar_id": pa.string_view(), "id": pa.int64(), "login": pa.string_view(), "url": pa.string_view(), }), "repo": pa.struct({ "id": pa.int64(), "name": pa.string_view(), "url": pa.string_view(), }), })

Wildcards can be mixed with other selections:

sp.scan(events_table.select("*", "actor.login")).schema.to_arrow()
pa.schema({ "created_at": pa.timestamp("ms"), "id": pa.string_view(), "public": pa.bool_(), "type": pa.string_view(), "actor": pa.struct({ "login": pa.string_view(), }), })

Querying

Recent years have seen several excellent query engines, and Spiral tables can be queried with a number of them:

Polars

Tables can also be queried with Polars LazyFrames:

tbl = gharchive(sp)
df = tbl.to_polars()
result = df.collect()
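Since to_polars returns a LazyFrame, standard Polars operations compose lazily before collect. A minimal sketch (column names taken from the schema above):

import polars as pl

result = (
    tbl.to_polars()
    .filter(pl.col("type") == "PushEvent")
    .select("id", "repo")
    .collect()
)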

DuckDB

Tables can be turned into PyArrow Datasets with to_dataset(), which in turn enables the DuckDB Python API.

import duckdb

tbl = gharchive(sp)
ds = tbl.to_dataset()
result = duckdb.execute("SELECT type, COUNT(*) FROM ds GROUP BY type")

Or using the DuckDB Relational API:

result = tbl.to_duckdb().filter("public is not null").to_arrow_table()

Key Table

A key table scan is the equivalent of a primary key lookup. Any scan can be evaluated against a (sorted) table of keys to return only the rows matching those keys, with columns defined by the scan’s projection (and optionally filtered by the scan’s filter).

import pyarrow as pa

scan = sp.scan(events_table)

# The key table must be sorted by the table's key columns
key_table = pa.table({
    "created_at": ['2025-11-20 13:00:00'],
    "id": ['4807206745'],
})

results = scan.to_record_batches(key_table=key_table)

Cross-table

It is possible to jointly scan any tables that have a common key schema. This has the same behaviour as an outer join but is more efficient since we know the tables are both sorted by the key columns.

sp.scan(
    {
        "events": events_table,
        "users": users_table,
    },
    where=events_table['actor_id'] == users_table['id'],
)