Data Model

pgflow’s data model separates flow definitions from runtime execution state. Flow orchestration runs entirely through PostgreSQL tables and functions in the pgflow schema, so its behavior is transparent and queryable.

Flows and steps are identified by slugs - simple text identifiers like 'analyze_website' or 'fetch_data'. Slugs must be valid identifiers (alphanumeric plus underscores, max 128 characters) and serve as natural, readable keys throughout the system.
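
As a sketch, the slug rule above maps onto a simple pattern check. The exact pattern - including the leading-character restriction (no digit first, as for SQL identifiers) - is an assumption; pgflow's actual validation may differ in detail:

-- Hedged sketch of the slug rule: identifier-style, max 128 characters
SELECT 'analyze_website' ~ '^[a-zA-Z_][a-zA-Z0-9_]{0,127}$';  -- true
SELECT 'fetch-data'      ~ '^[a-zA-Z_][a-zA-Z0-9_]{0,127}$';  -- false (hyphen not allowed)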

The schema uses composite primary and foreign keys that include denormalized slug values:

-- Definition tables
flows: (flow_slug)
steps: (flow_slug, step_slug)
deps: (flow_slug, dep_slug, step_slug)
-- Runtime tables
runs: (run_id) + denormalized flow_slug
step_states: (run_id, step_slug) + denormalized flow_slug
step_tasks: (run_id, step_slug, task_index) + denormalized flow_slug
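
A minimal DDL sketch shows how these keys fit together. It is illustrative only - not pgflow's actual migration - and omits most columns:

-- Illustrative sketch: composite keys with denormalized slugs
CREATE TABLE pgflow.steps (
  flow_slug text NOT NULL REFERENCES pgflow.flows (flow_slug),
  step_slug text NOT NULL,
  PRIMARY KEY (flow_slug, step_slug)
);

CREATE TABLE pgflow.step_states (
  run_id    uuid NOT NULL REFERENCES pgflow.runs (run_id),
  flow_slug text NOT NULL,  -- denormalized, so queries skip a join to runs
  step_slug text NOT NULL,
  PRIMARY KEY (run_id, step_slug),
  FOREIGN KEY (flow_slug, step_slug)
    REFERENCES pgflow.steps (flow_slug, step_slug)
);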

Denormalized slugs simplify queries - no joins needed:

-- Get all tasks for a specific step across all runs
SELECT * FROM pgflow.step_tasks
WHERE flow_slug = 'my_flow' AND step_slug = 'my_step';

The schema organizes tables into two distinct categories that serve different purposes in the flow lifecycle. All tables live in the pgflow schema.

Definition Tables

These tables store the flow structure - the “blueprint” of what flows exist and how they connect:

flows - Flow identity and global configuration

  • Contains opt_max_attempts, opt_base_delay, opt_timeout
  • Each flow has a unique slug

steps - DAG nodes with step type (single or map)

  • single: Regular steps created with .step() or .array() DSL methods
  • map: Map steps created with .map() method for parallel array processing
  • Can override flow-level options
  • Must be added in topological order

deps - DAG edges defining step dependencies

  • Determines execution order and data flow

These tables are populated during deployment through migrations and change only when flow definitions change. Their contents can be read with plain queries, as the sketch below shows.
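
Both the DAG shape and the effective per-step configuration are directly queryable. The deps columns are exactly those listed above; the opt_* override columns on steps are an assumption based on “Can override flow-level options”:

-- DAG edges for one flow
SELECT step_slug, dep_slug
FROM pgflow.deps
WHERE flow_slug = 'analyze_website'
ORDER BY step_slug;

-- Effective options: step-level overrides fall back to flow defaults
SELECT s.step_slug,
       COALESCE(s.opt_max_attempts, f.opt_max_attempts) AS max_attempts,
       COALESCE(s.opt_timeout,      f.opt_timeout)      AS timeout
FROM pgflow.steps s
JOIN pgflow.flows f USING (flow_slug)
WHERE s.flow_slug = 'analyze_website';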

Runtime Tables

These tables track the execution state of flow instances:

runs - Execution instances of flows

  • Tracks run status (started, completed, failed)
  • Stores input and aggregated outputs
  • Maintains remaining_steps counter for completion detection

step_states - State of individual steps within a run

  • Tracks step status (created, started, completed, failed)
  • For map steps, tracks initial_tasks and remaining_tasks counts
  • Coordinates step-level completion and aggregation

step_tasks - Units of work for steps

  • Single steps create 1 task, map steps create N tasks
  • Each task tracks its attempt count for retries
  • Contains task_index for map task array elements
  • Stores individual outputs for aggregation

These tables are created and modified during execution; each run creates new records that track progress from start to completion.
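
For example, a run's overall progress and its per-step states can be read in one query, using the status values described above:

-- Progress of one run across all of its steps
SELECT r.status AS run_status,
       r.remaining_steps,
       s.step_slug,
       s.status AS step_status
FROM pgflow.runs r
JOIN pgflow.step_states s USING (run_id)
WHERE r.run_id = :run_id;  -- bind the run's UUID here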

The relationships between these tables form the complete pgflow data model:

[Schema ERD diagram]

The separation between definition and runtime tables enables several key capabilities:

🔒 Separation of Concerns - Definition tables are read-only during execution, runtime tables can be queried without affecting definitions, and multiple runs execute concurrently

👁️ Observability - All execution state lives in queryable SQL tables (step_states, step_tasks), with no hidden state or external coordination; see the query sketch below

⚡ ACID Guarantees - All state transitions happen inside transactions, dependency resolution and task spawning are atomic, and there are no race conditions
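
As a small observability sketch, a task-level breakdown needs nothing beyond standard SQL. A status column on step_tasks is assumed here, analogous to the ones on runs and step_states:

-- Task counts by status for every step, across all runs
SELECT flow_slug, step_slug, status, count(*) AS tasks
FROM pgflow.step_tasks
GROUP BY flow_slug, step_slug, status
ORDER BY flow_slug, step_slug, status;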

The relationship between steps and tasks depends on the step type:

Step Type   Tasks Created                Task Index   Input Received
single      Exactly 1                    N/A          Full input with dependencies
map         N (one per array element)    0 to N-1     Individual array element

This design handles both single-task steps and parallel fanout through the same execution model.
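
For a map step, per-element results can be read back in array order via task_index. The output column is assumed from “Stores individual outputs for aggregation” above, and process_items is a hypothetical step slug:

-- Individual outputs of a map step, in the order they will be aggregated
SELECT task_index, output
FROM pgflow.step_tasks
WHERE run_id = :run_id
  AND step_slug = 'process_items'
ORDER BY task_index;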

🏃 Run-level tracking

  • remaining_steps starts at total step count, decrements on completion
  • When zero, run marked complete and outputs aggregated

🔸 Step-level tracking (map steps only)

  • initial_tasks set when array size known
  • remaining_tasks decrements as tasks complete
  • When zero, outputs aggregated into array
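
A hedged sketch of the counter mechanics described above - pgflow's real implementation wraps this in its own functions and runs it transactionally:

-- Illustrative only: decrement remaining_tasks when a map task completes
UPDATE pgflow.step_states
SET remaining_tasks = remaining_tasks - 1
WHERE run_id = :run_id
  AND step_slug = :step_slug
RETURNING remaining_tasks;
-- A returned value of 0 means every task finished: the task outputs are
-- aggregated into an array and the run's remaining_steps is decremented
-- in the same transaction.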