Skip to Content
TablesExpressions

Spiral Expressions

The expressions module provides a powerful way to build and manipulate expressions for data processing. This guide will walk you through common use cases using GitHub event data  as examples.

Data Schema

Let’s take a look at the public GitHub Archive dataset hosted in Spiral:

import spiral as sp sp = sp.Spiral() table = sp.project("public").table("gharchive.events")

All examples in this documentation work with the following GitHub events data schema:

table.schema().to_pyarrow()
pa.schema({ 'id': pa.int64(), 'created_at': pa.timestamp(), 'actor': pa.struct({ 'id': pa.int64(), 'login': pa.string(), 'display_login': pa.string(), 'gravatar_id': pa.string(), 'url': pa.string(), 'avatar_url': pa.string() }), 'repo': pa.struct({ 'id': pa.int64(), 'name': pa.string(), 'url': pa.string() }), 'payload': pa.struct({ 'action': pa.string(), 'issue': pa.struct({ 'url': pa.string(), 'number': pa.int64() }) }) })

Basic Concepts

To access the expression module, import it as follows:

from spiral import expressions as se

Tables

Spiral tables are structured as a series of nested structs. Use Python dict syntax to project columns from the table.

# Create a variable referring to the actor's login actor_login = table["actor"]["login"]

Scalars

Scalars are constant values that can be used in expressions.

# Compare with scalar is_octocat = actor_login == "octocat"

Scalars can be explicitly created using the se.scalar function.

# Create a scalar example_user = se.scalar("octocat")

Common Operations

Full list of common operations is available in API docs.

Filtering

# Find issues with number greater than 100 high_number_issues = table["payload"]["issue"]["number"] > 100

Combining

octocat_high_issues = is_octocat & high_number_issues

This is equivalent to:

# Find issues created by octocat that are high-numbered octocat_high_issues = se.and_( table["actor"]["login"] == "octocat", table["payload"]["issue"]["number"] > 100 )

Struct Operations

Struct operations allow you to manipulate nested data structures.

Although se.struct is a submodule of se, some functions are also available directly on se.

Selecting Fields

# Select only actor information. actor_info = se.select(table["actor"], names=["login", "avatar_url"]) # Exclude the avatar URL. actor_info_no_avatar = se.select(actor_info, exclude=["avatar_url"])

Merging Structs

# Combine actor and repo information. actor_and_repo = se.merge(table["actor"], table["repo"])

On a key clash keys are taken from the right most struct.

Creating New Structs

# Create a simplified event representation simple_event = se.pack({ "user": table["actor"]["login"], "repo_name": table["repo"]["name"], "action": table["payload"]["action"], })

Advanced Example

import spiral as sp import pyarrow as pa from spiral import expressions as se from spiral import tables from datetime import datetime db = sp.Spiral() table = sp.project("public").table("gharchive.events") # First, let's create some reusable components actor = table["actor"] repo = table["repo"] payload = table["payload"] created_at = table["created_at"] # 1. Filter conditions is_2024 = created_at >= datetime(2024, 1, 1) is_issue_event = payload["action"] == "opened" has_avatar = actor["gravatar_id"] != "" # 2. Popular repositories (arbitrary threshold for example) popular_repo = repo["id"] > 1000000 # 3. Specific users of interest vip_users = ["octocat", "torvalds", "gvanrossum"] is_vip = se.or_(*[ actor["login"] == user for user in vip_users ]) # 4. Combine all filters where_filter = se.and_( is_2024, se.or_( se.and_(is_issue_event, popular_repo), is_vip ) ) # 5. Create a scoring system user_score = se.pack({ "base_score": se.modulo(actor["id"], 100) * 10, "avatar_bonus": has_avatar.cast(pa.int64()) * 50 }) # 6. Create the final enriched event structure projection = se.pack({ # Original fields "timestamp": created_at, # User info with computed fields "user": se.pack({ "login": actor["login"], "is_vip": is_vip, "has_avatar": has_avatar, **user_score }), # Repository info "repository": se.pack({ "name": repo["name"], "is_popular": popular_repo }), # Event type info "event_type": se.pack({ "is_issue": is_issue_event, # Null for non-issue events "issue_number": payload["issue"]["number"], }), # Computed total score "total_score": se.add( user_score["base_score"], user_score["avatar_bonus"] ) }) # 7. Scan the table. scan: Scan = sp.scan(projection, where=where_filter) print(scan.schema)
{ 'timestamp': pa.timestamp(), 'user': { 'login': pa.string(), 'is_vip': pa.bool_(), 'has_avatar': pa.bool_(), 'base_score': pa.int64(), 'avatar_bonus': pa.int64() }, 'repository': { 'name': pa.string(), 'is_popular': pa.bool_() }, 'event_type': { 'is_issue': pa.bool_(), 'issue_number': pa.int64() }, 'total_score': pa.int64() }
Last updated on