Skip to Content
TablesExpressions

Spiral Expressions

The expressions module provides a powerful way to build and manipulate expressions for data processing. This guide will walk you through common use cases using GitHub event data  as examples.

Data Schema

Let’s take a look at the public GitHub Archive dataset hosted in Spiral:

import spiral as sp from spiral.demo import gharchive import pyarrow as pa sp = sp.Spiral() table = gharchive(sp)

Basic Concepts

Tables

Spiral tables are structured as a series of nested structs. Use Python dict syntax to project columns from the table.

# Create a variable referring to the actor's login actor_login = table["actor"]["login"]

Scalars

Scalars are constant values that can be used in expressions.

# Compare with scalar is_octocat = actor_login == "octocat"

Scalars can be explicitly created using the se.scalar function.

from spiral import expressions as se # Create a scalar example_user = se.scalar("octocat")

Common Operations

Full list of common operations is available in API docs.

Filtering

# Find issues with number greater than 100 high_number_issues = table["payload"]["issue"]["number"] > 100

Combining

octocat_high_issues = is_octocat & high_number_issues

This is equivalent to:

from spiral import expressions as se # Find issues created by octocat that are high-numbered octocat_high_issues = se.and_( table["actor"]["login"] == "octocat", table["payload"]["issue"]["number"] > 100 )

Struct Operations

Struct operations allow you to manipulate nested data structures.

Although se.struct is a submodule of se, some functions are also available directly on se.

Selecting Fields

from spiral import expressions as se # Select only actor information. actor_info = se.select(table["actor"], names=["login", "avatar_url"]) # Exclude the avatar URL. actor_info_no_avatar = se.select(actor_info, exclude=["avatar_url"])

Merging Structs

from spiral import expressions as se # Combine actor and repo information. actor_and_repo = se.merge(table["actor"], table["repo"])

On a key clash keys are taken from the right most struct.

Creating New Structs

# Create a simplified event representation simple_event = se.pack({ "user": table["actor"]["login"], "repo_name": table["repo"]["name"], "action": table["payload"]["action"], })

Advanced Example

import spiral import pyarrow as pa from spiral import expressions as se from datetime import datetime from spiral.demo import gharchive sp = spiral.Spiral() table = gharchive(sp) # First, let's create some reusable components actor = table["actor"] repo = table["repo"] payload = table["payload"] created_at = table["created_at"] # 1. Filter conditions is_2024 = created_at >= datetime(2024, 1, 1) is_issue_event = payload["action"] == "opened" has_avatar = actor["gravatar_id"] != "" # 2. Popular repositories (arbitrary threshold for example) popular_repo = repo["id"] > 1000000 # 3. Specific users of interest vip_users = ["octocat", "torvalds", "gvanrossum"] is_vip = se.or_(*[ actor["login"] == user for user in vip_users ]) # 4. Combine all filters where_filter = se.and_( is_2024, se.or_( se.and_(is_issue_event, popular_repo), is_vip ) ) # 5. Create a scoring system user_score = { "base_score": se.divide(actor["id"], 100), "avatar_bonus": has_avatar.cast(pa.int64()) * 50 } # 6. Create the final enriched event structure projection = se.pack({ # Original fields "timestamp": created_at, # User info with computed fields "user": se.pack({ "login": actor["login"], "is_vip": is_vip, "has_avatar": has_avatar, **user_score }), # Repository info "repository": se.pack({ "name": repo["name"], "is_popular": popular_repo }), # Event type info "event_type": se.pack({ "is_issue": is_issue_event, # Null for non-issue events "issue_number": payload["issue"]["number"], }), # Computed total score "total_score": se.add( user_score["base_score"], user_score["avatar_bonus"] ) }) # 7. Scan the table. scan: Scan = sp.scan(projection, where=where_filter) scan.schema.to_arrow()
pa.schema({ 'timestamp': pa.timestamp("ms"), 'user': pa.struct({ 'login': pa.string(), 'is_vip': pa.bool_(), 'has_avatar': pa.bool_(), 'base_score': pa.int64(), 'avatar_bonus': pa.int64() }), 'repository': pa.struct({ 'name': pa.string(), 'is_popular': pa.bool_() }), 'event_type': pa.struct({ 'is_issue': pa.bool_(), 'issue_number': pa.int64() }), 'total_score': pa.int64() })
Last updated on