Clinker

Clinker is a pure-Rust, bounded-memory batch DAG executor for CSV, JSON, XML, and fixed-width data. It reads finite inputs, drives them through a directed acyclic graph of transformation nodes one record at a time, and exits when the inputs are drained. It ships as a single static binary with no interpreter, no runtime, and no install dependencies.

Pipelines are declared in YAML. Data transformation logic is written in CXL, a custom expression language purpose-built for ETL. Together they replace legacy tools like Informatica, SSIS, Talend, and NiFi with something deterministic, lightweight, and easy to reason about.

What Clinker is, plainly

A finite batch executor with per-record streaming evaluation, not a long-running stream processor. A pipeline run is a job: Sources read until EOF, the DAG drains, the process exits. Within a run, stateless operators (Transform, Route, most Combine probe-side work, Output) evaluate records one at a time without accumulating per-record state. Every stage is charged against the configured RSS budget. Fused Source → Transform → Output paths run streaming with no per-stage materialization; non-fused boundaries (Route fan-out, Merge fan-in, Composition bodies, diamond DAGs) materialize records into per-stage buffers that charge against the same envelope. The engine spills buffers to disk at 80% of the limit and fails fast with E310 MemoryBudgetExceeded at the hard limit, naming the offending producer. Blocking operators (Aggregate, sort, grace-hash Combine) accumulate state inside that same budget and spill to disk when soft and hard memory thresholds trip, rather than OOM-killing the process.

If you have used Flink, Kafka Streams, or Beam in unbounded mode: Clinker is not that. There are no watermarks against wall-clock time, no infinite-source semantics, no exactly-once delivery across restarts. The closest prior art is Pentaho Kettle / Apache Hop, Embulk, Singer, Benthos in batch mode, and Vector running file-to-file – finite ETL jobs with per-record evaluation and a hard memory ceiling.

Three pillars of what Clinker is:

  1. Finite inputs. Files (CSV, JSON, XML, fixed-width) are the canonical shape. Finite-cursor network sources (paginated REST APIs, SQL SELECT cursors) fit the same model – they exhaust their cursor and EOF. Unbounded sources (Kafka topics, Kinesis streams, Server-Sent Events, webhooks, tail -f-style file followers) are out of scope and will remain so.
  2. Finite jobs. A pipeline run begins when you invoke clinker run, drains the DAG, and exits with a status code. No long-running daemon, no service surface, no infinite event loop.
  3. Single process. One clinker binary invocation is one operating-system process. Parallelism happens inside the process via threads (std::thread, Rayon). Clinker does not spawn worker processes, does not coordinate a cluster, and does not shuffle data between machines. Scale by giving the host more cores, more RAM, and more disk – the DuckDB / Polars / Kettle model. If a single host genuinely can’t fit the work, partition the input by file or by key and run multiple clinker invocations from a shell script; that’s a five-line bash script, not an architectural addition.
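
The partition-by-key approach mentioned in item 3 can be sketched as follows. This is only an illustration and not part of Clinker: the helper names shard_for and partition, the shard count, and the sample rows are all hypothetical, and CRC32 is just one stable hash that would work.

```python
import zlib

def shard_for(key: str, shards: int) -> int:
    """Map a partition key to a shard index that is stable across runs."""
    # zlib.crc32 is deterministic, unlike Python's built-in hash() for str,
    # which is randomized per process.
    return zlib.crc32(key.encode("utf-8")) % shards

def partition(rows, key_field, shards):
    """Group rows (dicts) into `shards` lists by the value of `key_field`."""
    buckets = [[] for _ in range(shards)]
    for row in rows:
        buckets[shard_for(str(row[key_field]), shards)].append(row)
    return buckets

# Hypothetical sample rows, for illustration only.
rows = [
    {"customer_id": i, "region": region}
    for i, region in enumerate(["US", "EU", "US", "APAC"])
]
buckets = partition(rows, "region", 2)
```

Because every row with the same key lands in the same bucket, each bucket could be written to its own file and handed to a separate clinker run invocation without splitting a key across processes.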

Why Clinker?

Single binary, zero dependencies. Download it, run it. No JVM, no Python, no package manager. Works on any Linux server out of the box.

Good neighbor on busy servers. Clinker enforces a strict memory ceiling (default 512 MB) so it can run alongside JVM applications, databases, and other services without competing for RAM. Aggregation spills to disk when memory pressure rises.

Reproducible output. Given the same input and pipeline, Clinker produces byte-identical output across runs. No nondeterminism from thread scheduling, hash randomization, or floating-point reordering.
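
One way to check the byte-identical claim on your own data is to hash the output of two runs and compare the digests. The sketch below assumes you have already produced two output files; the function names are hypothetical and nothing here is part of Clinker.

```python
import hashlib

def file_sha256(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def outputs_identical(path_a: str, path_b: str) -> bool:
    """True when the two files are byte-for-byte identical."""
    return file_sha256(path_a) == file_sha256(path_b)
```

Reading in chunks keeps the check itself memory-bounded, so it works on outputs larger than RAM.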

Operability-first design. Per-stage metrics, dead-letter queues for error records, explain plans for understanding execution, and structured exit codes for scripting. Built for production from day one.

Two binaries:

Binary | Purpose
clinker | Run pipelines against real data
cxl | Check, evaluate, and format CXL expressions interactively

A taste of Clinker

Here is a complete pipeline that reads a customer CSV, filters to active customers, classifies them into tiers, and writes the result:

pipeline:
  name: customer_etl

nodes:
  - type: source
    name: customers
    config:
      name: customers
      type: csv
      path: "./data/customers.csv"
      schema:
        - { name: customer_id, type: int }
        - { name: first_name, type: string }
        - { name: last_name, type: string }
        - { name: status, type: string }
        - { name: lifetime_value, type: float }

  - type: transform
    name: enrich
    input: customers
    config:
      cxl: |
        filter status == "active"
        emit customer_id = customer_id
        emit full_name = first_name + " " + last_name
        emit tier = if lifetime_value >= 10000 then "gold" else "standard"

  - type: output
    name: result
    input: enrich
    config:
      name: enriched
      type: csv
      path: "./output/enriched_customers.csv"

Run it:

clinker run customer_etl.yaml

That is the entire workflow. No project scaffolding, no separate configuration files, no compile step. One YAML file, one command.
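
For readers who think in general-purpose code, the enrich block above behaves roughly like the Python function below. This is a reference model for intuition only, not how Clinker evaluates CXL; the sample records are made up.

```python
def enrich(record: dict):
    """Model of the `enrich` transform: return the output record, or None if filtered."""
    # filter status == "active"
    if record["status"] != "active":
        return None
    # Only emitted fields appear in the output record.
    return {
        "customer_id": record["customer_id"],
        "full_name": record["first_name"] + " " + record["last_name"],
        "tier": "gold" if record["lifetime_value"] >= 10000 else "standard",
    }

# Made-up input records, for illustration only.
sample = [
    {"customer_id": 1, "first_name": "Ada", "last_name": "Lovelace",
     "status": "active", "lifetime_value": 12500.0},
    {"customer_id": 2, "first_name": "Bob", "last_name": "Stone",
     "status": "inactive", "lifetime_value": 300.0},
    {"customer_id": 3, "first_name": "Cy", "last_name": "Young",
     "status": "active", "lifetime_value": 9999.99},
]
results = [out for out in map(enrich, sample) if out is not None]
```

The inactive customer produces no output record, and status, first_name, last_name, and lifetime_value do not appear downstream because the block never emits them.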

Non-Goals

This page lists what Clinker is deliberately not. These are architectural commitments — design surfaces Clinker will not grow into, not just features that haven’t been built yet.

If you arrived here because you were considering Clinker for one of the scenarios below, the answer is “a different tool is the right fit.” Each non-goal is paired with the kind of tool that is the right fit.

Not an unbounded stream processor

Clinker reads sources that have an end. A pipeline run is a finite job: Sources read until EOF, the DAG drains, the process exits.

Out of scope:

  • Kafka topics, Kinesis streams, Pub/Sub subscriptions (long-running consumers without a natural end).
  • Server-Sent Events, WebSocket subscriptions, webhooks-as-input.
  • tail -f-style file followers.
  • Watermarking against wall-clock time.
  • Exactly-once delivery across process restarts.
  • Stateful infinite-stream windowing (tumbling / sliding / session windows over event time without a finite boundary).

Right fit instead: Apache Flink, Kafka Streams, Apache Beam in unbounded mode, Vector with streaming sources, Benthos with streaming inputs, Apache NiFi.

Not a multi-process or distributed engine

One clinker run invocation is one operating-system process. Clinker does not spawn worker processes, does not coordinate a cluster, and does not shuffle data between machines.

Out of scope:

  • Worker-process pools on a single machine.
  • Multi-machine sharded execution.
  • Network shuffle between executors.
  • Cluster managers (Kubernetes operators, YARN, Mesos integrations).
  • Distributed memory accounting.
  • Partial-failure recovery across worker boundaries.

Right fit instead: Apache Spark, Trino / Presto, Apache Flink in cluster mode, Apache Beam on Dataflow, Hadoop MapReduce.

Scaling Clinker: give the host more cores, more RAM, more disk — the DuckDB / Polars / Kettle / Hop model. If a single host genuinely can’t fit the work, partition the input by file or by key and run multiple clinker invocations from a shell script. That’s a five-line script, not an architectural addition.

Not a long-running service

Clinker is a CLI binary, not a server. There is no daemon mode, no HTTP control plane, no JDBC/ODBC listener, no UI server, no scheduled job runner inside Clinker itself.

Out of scope:

  • HTTP API exposing pipeline execution.
  • Built-in cron / scheduler / orchestrator.
  • Persistent connection pool living across pipeline runs.
  • A long-lived process accepting new pipeline submissions over a socket.

Right fit instead:

  • For scheduling: cron, systemd timers, Airflow, Dagster, Prefect, Temporal.
  • For HTTP-fronted ETL: any of the above orchestrators wrapping clinker run invocations.
  • For interactive queries against finite data: DuckDB, Polars, or any embedded query engine.
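
A scheduler or orchestrator task that wraps a run can be as small as the sketch below. It uses only the clinker run command and the --dry-run flag shown elsewhere in this book; the function names are hypothetical, and the meaning of specific exit codes is not covered here, so the wrapper simply passes the status through.

```python
import subprocess

def build_command(pipeline_path: str, dry_run: bool = False) -> list:
    """Build the argv for one `clinker run` invocation."""
    argv = ["clinker", "run", pipeline_path]
    if dry_run:
        argv.append("--dry-run")
    return argv

def run_pipeline(pipeline_path: str, dry_run: bool = False) -> int:
    """Run one pipeline to completion and return the process exit status."""
    completed = subprocess.run(build_command(pipeline_path, dry_run))
    return completed.returncode
```

Since each run is a finite job that exits on its own, the orchestrator needs no health checks or shutdown logic: it starts the process, waits, and inspects the return code.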

Not an OLAP / SQL query engine

Clinker is a per-record expression engine with explicit nodes in a DAG. It does not parse SQL, does not optimize joins via cost-based optimization across the whole pipeline, and does not present a relational table model.

Out of scope:

  • SQL parsing (the CXL language is the surface; no SELECT ... FROM is accepted).
  • Cost-based join reordering across more than the local Combine node.
  • Materialized views or query caching.
  • Interactive query latencies under a second.
  • ANSI-SQL semantics for NULL, type coercion, or aggregate behavior.

Right fit instead: DuckDB, ClickHouse, DataFusion, Trino, Postgres, or any RDBMS. If you want SQL-driven transformation over files, DuckDB is the closest single-binary alternative to Clinker for the cases where SQL is the right surface.

Not a connector marketplace

Clinker ships with a deliberately small set of source and sink types: CSV, JSON, XML, fixed-width files in the current release; finite-cursor REST and SQL sources on the roadmap. There is no plugin registry, no third-party connector store, no SaaS-API catalog.

Out of scope:

  • Hundreds of pre-built SaaS integrations (Salesforce, HubSpot, Stripe, etc.).
  • A central registry of community-maintained connectors.
  • Schema discovery against arbitrary external APIs.
  • Change-data-capture (CDC) sources.

Right fit instead: Airbyte, Fivetran, Stitch, Singer with its tap ecosystem, dlt (data load tool).

Not a streaming-CDC engine

Clinker treats each pipeline run as a fresh, finite pass over the input. It does not maintain a persistent log of source changes, does not replicate row-level changes from a database, and does not produce an append-only stream of inserts / updates / deletes.

Out of scope:

  • Postgres logical replication subscriptions.
  • MySQL binlog tailing.
  • Debezium-style CDC stream production.
  • Maintaining a target database in continuous sync with a source.

Right fit instead: Debezium, Maxwell, AWS DMS, Striim, Estuary Flow, or vendor-native CDC like Snowflake Streams.

What Clinker is

For the positive framing, see the Introduction and Key Concepts. The short version:

  • A pure-Rust, single-binary, bounded-memory batch DAG executor for finite file and finite-cursor inputs.
  • Per-record evaluation through a directed acyclic graph of Source, Transform, Aggregate, Route, Merge, Combine, Output, and Composition nodes.
  • Pipelines declared in YAML, transformation logic written in CXL (a custom per-record expression language).
  • One process, finite job, EOF-then-exit. Disk spill under memory pressure rather than OOM.

Installation

Clinker is a single static binary with no runtime dependencies. Download it, put it on your PATH, and you are ready to go.

Binaries

Clinker ships two binaries:

  • clinker – the pipeline executor. This is the main tool you use to validate and run pipelines against data.
  • cxl – the CXL expression checker, evaluator, and formatter. Use it during development to test expressions interactively, check types, and format CXL blocks.

Verify installation

After placing the binaries on your PATH, confirm they work:

clinker --version
clinker 0.1.0
cxl --version
cxl 0.1.0

Both commands should print a version string and exit. If you see command not found, check that the directory containing the binaries is in your PATH.

Building from source

Clinker requires Rust 1.91+ (edition 2024). If you have a Rust toolchain installed, build and install both binaries directly from the repository:

# Clone the repository
git clone https://github.com/rustpunk/clinker.git
cd clinker

# Install the pipeline executor
cargo install --path crates/clinker

# Install the CXL expression tool
cargo install --path crates/cxl-cli

This compiles release-optimized binaries and places them in ~/.cargo/bin/, which is typically already on your PATH.

To verify the build:

cargo test --workspace

This runs the full test suite (approximately 1100 tests) and confirms everything is working correctly on your system.

Rust toolchain

The repository includes a rust-toolchain.toml that pins the exact Rust version. If you use rustup, it will automatically download the correct toolchain when you build.

Requirement | Value
Rust edition | 2024
Minimum version | 1.91
C dependencies | None

Your First Pipeline

This walkthrough builds a pipeline from scratch, runs it, and explores the tools Clinker provides for validating and understanding pipelines before they touch real data.

1. Create sample data

Save the following as employees.csv:

id,name,department,salary
1,Alice Chen,Engineering,95000
2,Bob Martinez,Marketing,62000
3,Carol Johnson,Engineering,88000
4,Dave Williams,Sales,71000

2. Write the pipeline

Save the following as my_first_pipeline.yaml:

pipeline:
  name: salary_report

nodes:
  - type: source
    name: employees
    config:
      name: employees
      type: csv
      path: "./employees.csv"
      schema:
        - { name: id, type: int }
        - { name: name, type: string }
        - { name: department, type: string }
        - { name: salary, type: int }

  - type: transform
    name: classify
    input: employees
    config:
      cxl: |
        emit id = id
        emit name = name
        emit department = department
        emit salary = salary
        emit level = if salary >= 90000 then "senior" else "junior"

  - type: output
    name: report
    input: classify
    config:
      name: salary_report
      type: csv
      path: "./salary_report.csv"

This pipeline has three nodes:

  1. employees (source) – reads the CSV file and declares the schema.
  2. classify (transform) – passes all fields through and adds a level field based on salary.
  3. report (output) – writes the result to a new CSV file.

The input: field on each consumer node wires the DAG together. Data flows from employees through classify to report.

3. Validate before running

Before processing any data, check that the pipeline is well-formed:

clinker run my_first_pipeline.yaml --dry-run

Dry-run parses the YAML, resolves the DAG, and type-checks all CXL expressions against the declared schemas. If there are errors – a typo in a field name, a type mismatch, a missing input: reference – Clinker reports them with source-location diagnostics and stops. No data is read.

4. Preview records

To see what the output will look like without writing files, preview a few records:

clinker run my_first_pipeline.yaml --dry-run -n 2

This reads the first 2 records from the source, runs them through the pipeline, and prints the results to the terminal. Useful for sanity-checking transformations before committing to a full run.

5. Understand the execution plan

To see how Clinker will execute the pipeline:

clinker run my_first_pipeline.yaml --explain

The explain plan shows the DAG topology, the order nodes will execute, per-node parallelism strategy, and schema propagation through the pipeline. This is valuable for understanding complex pipelines with routes, merges, and aggregations.

6. Run it

clinker run my_first_pipeline.yaml

Clinker reads employees.csv, applies the transform, and writes salary_report.csv. The output:

id,name,department,salary,level
1,Alice Chen,Engineering,95000,senior
2,Bob Martinez,Marketing,62000,junior
3,Carol Johnson,Engineering,88000,junior
4,Dave Williams,Sales,71000,junior

Alice’s salary of 95,000 meets the threshold, so she is classified as senior. Everyone else is junior.

What just happened

The pipeline executed as a streaming process:

  1. The source node read employees.csv one record at a time.
  2. Each record flowed through the classify transform, which evaluated the CXL block to produce the output fields.
  3. The output node wrote each transformed record to salary_report.csv.

At no point was the entire dataset loaded into memory. This is how Clinker processes files of any size under its memory ceiling.
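
The same read-transform-write loop can be modeled in a few lines of Python. This is an analogy for the per-record flow described above, not Clinker's implementation; classify_stream is a hypothetical name, and in-memory text buffers stand in for the two files.

```python
import csv
import io

def classify_stream(source, sink):
    """Read CSV records one at a time, add `level`, and write each one immediately."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(
        sink,
        fieldnames=["id", "name", "department", "salary", "level"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:  # only the current record is held at any time
        row["level"] = "senior" if int(row["salary"]) >= 90000 else "junior"
        writer.writerow(row)

source = io.StringIO(
    "id,name,department,salary\n"
    "1,Alice Chen,Engineering,95000\n"
    "2,Bob Martinez,Marketing,62000\n"
    "3,Carol Johnson,Engineering,88000\n"
    "4,Dave Williams,Sales,71000\n"
)
sink = io.StringIO()
classify_stream(source, sink)
```

After the call, sink.getvalue() holds the same five lines shown in step 6. The loop never builds a list of rows, which is why the footprint does not grow with the input.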

Key Concepts

This page covers the mental model behind Clinker pipelines. If you have experience with other ETL tools, most of this will feel familiar – but pay attention to where Clinker diverges, especially around CXL, per-record evaluation, and the memory budget.

Batch jobs, not unbounded streams

A Clinker run is a finite batch job. Source nodes read their files until EOF, the DAG drains, and the process exits. There are no watermarks against wall-clock time, no infinite-source semantics, no exactly-once delivery across restarts. If you have used Flink, Kafka Streams, or Beam in unbounded mode: Clinker is not that.

The word “streaming” in Clinker’s documentation always refers to per-record evaluation within a single batch run – records flow through the graph one at a time rather than being materialized as a whole table – not to long-running stream-processor semantics. Internal identifiers in the codebase (function names like streaming_output_task, config fields like strategy: streaming, error messages, log lines) use the word in the same row-by-row sense; if you see it in a stack trace, it is not Flink leaking through.

Finite inputs only

Clinker reads sources that have an end. Files are the canonical shape, and finite-cursor network sources (paginated REST APIs, SQL SELECT cursors) fit the same model – they exhaust their cursor and EOF. Unbounded sources (Kafka, Kinesis, Server-Sent Events, webhooks, tail -f-style file followers) are explicitly out of scope and will remain so.

Single process, ever

One clinker run invocation is one OS process. Parallelism happens inside that process via threads. Clinker does not spawn worker processes, does not coordinate a cluster, and does not shuffle data between machines. Scale by giving the host more cores, more RAM, more disk – the DuckDB / Polars / Kettle model. If a single host genuinely can’t fit the work, partition the input by file or by key and run multiple clinker invocations from a shell script; that’s a five-line script, not an architectural addition to Clinker.

For the full list of what Clinker deliberately does not do, see Non-Goals.

Pipelines are DAGs

A pipeline is a directed acyclic graph of nodes. Data flows from sources, through processing nodes, to outputs. There are no cycles – a node cannot consume its own output, directly or indirectly.

You define the graph by setting input: on each consumer node, naming the upstream node it reads from. Clinker resolves these references, validates that the graph is acyclic, and determines execution order automatically.
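
The resolution step can be pictured with a short sketch built on Python's standard graphlib module. It assumes nodes are plain dicts carrying name, input, and inputs keys; execution_order is a hypothetical helper, and Clinker's actual planner is not shown in this book.

```python
from graphlib import CycleError, TopologicalSorter

def execution_order(nodes):
    """Return node names in a valid execution order, or raise ValueError."""
    names = {node["name"] for node in nodes}
    graph = {}
    for node in nodes:
        # A node may declare `input` (one upstream) or `inputs` (several).
        refs = list(node.get("inputs", []))
        if "input" in node:
            refs.append(node["input"])
        # "route.port" refers to the node "route"; drop the port suffix.
        upstreams = {ref.split(".", 1)[0] for ref in refs}
        unknown = upstreams - names
        if unknown:
            raise ValueError(f"{node['name']}: unknown input {sorted(unknown)}")
        graph[node["name"]] = upstreams
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        raise ValueError(f"pipeline graph contains a cycle: {err.args[1]}") from err

pipeline = [
    {"name": "customers", "type": "source"},
    {"name": "enrich", "type": "transform", "input": "customers"},
    {"name": "result", "type": "output", "input": "enrich"},
]
order = execution_order(pipeline)
```

For this linear pipeline the only valid order is customers, enrich, result. A pair of nodes that name each other as inputs raises an error instead of producing an order.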

The nodes: list

Every pipeline has a single flat list of nodes. Each node has a type: discriminator that determines its behavior. The eight node types are:

Type | Purpose
source | Read data from a file (CSV, JSON, XML, fixed-width)
transform | Apply CXL logic to reshape, filter, or enrich records
aggregate | Group records and compute summary values (sum, count, etc.)
route | Split a stream into named ports based on conditions
merge | Concatenate multiple streams that share a schema
combine | Join records across N inputs with cross-input predicates
output | Write data to a file
composition | Embed a reusable sub-pipeline

You can have as many nodes of each type as your pipeline requires. The only constraint is that the resulting graph must be a valid DAG.

CXL is not SQL

CXL is a per-record expression language. Each record flows through a CXL block independently – there is no table-level context, no SELECT, no FROM, no JOIN. Think of it as a programmable row mapper.

The core statements:

  • emit name = expr – produce a field in the output record. Only emitted fields appear downstream. If you want to pass a field through unchanged, you must emit it explicitly: emit id = id.
  • let name = expr – bind a local variable for use in later expressions. Local variables do not appear in the output.
  • filter condition – discard the record if the condition is false. A filtered record produces no output and is not counted as an error.
  • distinct / distinct by field – deduplicate records. distinct deduplicates on all output fields; distinct by field deduplicates on a specific field.

CXL uses and, or, and not for boolean logic – not && or ||. String concatenation uses +. Conditional expressions use if ... then ... else ... syntax.

System namespaces use a $ prefix: $pipeline.*, $window.*, $meta.*. These provide access to pipeline metadata, window function state, and record metadata respectively.

Per-record evaluation and the memory budget

Within a run, records flow through the pipeline one at a time. Clinker does not load an entire file into memory before processing it. A source reads one record, pushes it through the downstream nodes, and then reads the next. This is what “streaming” means in Clinker – row-by-row evaluation inside a finite batch job, not Flink-style unbounded stream processing.

Per-record evaluation keeps per-row memory usage bounded for the stateless parts of the graph (Transform, Route, Merge, most Combine probe-side work, Output). Every stage is charged against the configured RSS budget. Fused Source → Transform → Output paths run streaming, with no per-stage materialization, so a 100 GB CSV passes through with the same footprint as a 100 KB CSV. A stage that hands its output to a single downstream sink Output also avoids a charged inter-stage buffer – single-branch Route, non-fused Merge, streaming Aggregate, and the Combine probe-side stream their result straight to the writer (see Streaming vs. Blocking Stages). The remaining boundaries – multi-branch Route fan-out, output that forks to several consumers, Composition bodies, diamond DAGs – materialize records into per-stage buffers that charge against the same budget envelope. When a buffer would push cumulative usage past the soft threshold (80% of the limit), the engine spills the buffer to disk; when it would exceed the hard limit, the engine fails fast with a structured E310 MemoryBudgetExceeded diagnostic that names the offending producer.
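
The two thresholds can be summarized as a small decision function. This is a simplified model under stated assumptions: the function name, the return labels, and the exception message format are hypothetical, and the real arbitrator also tracks memory released by spilling, which this sketch ignores.

```python
class MemoryBudgetExceeded(Exception):
    """Stand-in for the E310 diagnostic; the message names the offending producer."""

def charge(used: int, request: int, limit: int, producer: str,
           soft_ratio: float = 0.8) -> str:
    """Decide what happens when `producer` asks to buffer `request` more bytes."""
    projected = used + request
    if projected > limit:
        # Hard limit: fail fast rather than let the process be OOM-killed.
        raise MemoryBudgetExceeded(f"E310 MemoryBudgetExceeded: producer={producer}")
    if projected > soft_ratio * limit:
        return "spill"   # past the soft threshold: write the buffer to disk
    return "buffer"      # under the soft threshold: keep it in memory

MB = 1024 * 1024
LIMIT = 512 * MB  # the documented default budget
```

With the default budget, a request that brings usage to 420 MB crosses the soft threshold of about 409.6 MB and spills, while one that would bring it to 520 MB raises the error.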

Use clinker run --explain to see which nodes will materialize (buffer: materialized) versus which will stream (buffer: streaming) before runtime – that label is the canonical “which stages charge the budget” signal. See the --explain reference and the memory-tuning page.

Stateful operators must accumulate. Aggregate, sort, and grace-hash Combine cannot emit until they have seen enough input – sums need every addend, a full sort needs the last row, a hash join needs the build side complete. These operators run inside a configured RSS budget (default 512 MB) and degrade gracefully under pressure rather than OOM:

  • Aggregate uses hash aggregation by default and spills partitions to disk when soft/hard memory thresholds trip. When the input is already sorted by the group key, the planner picks streaming aggregation, which requires only constant memory.
  • Sort spills runs to disk and merges them.
  • Combine picks among in-memory hash join, grace hash join (spilled), and IEJoin / sort-merge depending on predicates and memory pressure.
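
The difference between the two aggregation strategies is easiest to see side by side. The sketch below is a conceptual model in Python with hypothetical function names; it does not include spilling and is not Clinker's implementation.

```python
from itertools import groupby
from operator import itemgetter

def streaming_sum(records, key, field):
    """Yield (group, total) pairs for input that is already sorted by `key`."""
    for group, rows in groupby(records, key=itemgetter(key)):
        # Only the running total for the current group is held in memory.
        yield group, sum(row[field] for row in rows)

def hash_sum(records, key, field):
    """Sum per group for input in any order; holds one entry per distinct group."""
    totals = {}
    for row in records:
        totals[row[key]] = totals.get(row[key], 0) + row[field]
    return totals

# Input sorted by department, as streaming aggregation requires.
sorted_rows = [
    {"department": "Engineering", "salary": 95000},
    {"department": "Engineering", "salary": 88000},
    {"department": "Marketing", "salary": 62000},
    {"department": "Sales", "salary": 71000},
]
```

streaming_sum can emit a group's total as soon as the key changes, so its state is constant regardless of how many groups exist. hash_sum must keep every group until the input ends, which is the state a hash aggregate would have to spill under memory pressure.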

The memory ceiling is a first-class promise. Clinker is designed to share a server with JVM applications, databases, and other services without competing for RAM.

Input wiring

Consumer nodes reference their upstream via the input: field:

- type: transform
  name: enrich
  input: customers    # reads from the node named "customers"

Route nodes produce named output ports. Downstream nodes reference a specific port using dot notation:

- type: route
  name: split_by_region
  input: customers
  config:
    routes:
      us: region == "US"
      eu: region == "EU"
    default: other

- type: output
  name: us_output
  input: split_by_region.us    # reads from the "us" port

Merge nodes accept multiple inputs using inputs: (plural):

- type: merge
  name: combined
  inputs:
    - us_transform
    - eu_transform

Schema declaration

Source nodes require an explicit schema: that declares every column’s name and type:

config:
  schema:
    - { name: customer_id, type: int }
    - { name: email, type: string }
    - { name: balance, type: float }
    - { name: created_at, type: date }

Clinker uses these declarations to type-check CXL expressions at compile time, before any data is read. If a CXL block references a field that does not exist in the upstream schema, or applies an operation to an incompatible type, the error is caught during validation – not at row 5 million of a production run.
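
The unknown-field half of that check can be approximated in a few lines. This is a deliberately naive sketch: it tokenizes with a regular expression rather than parsing CXL, it would wrongly flag method names such as trim in value.trim(), and missing_fields is a hypothetical helper, not a Clinker API.

```python
import re

# Keywords taken from the CXL syntax described earlier on this page.
KEYWORDS = {"if", "then", "else", "and", "or", "not"}

def missing_fields(schema, expression):
    """Return identifiers used in `expression` that the schema does not declare."""
    declared = {column["name"] for column in schema}
    # Drop string literals first so their contents are not mistaken for fields.
    stripped = re.sub(r'"[^"]*"', "", expression)
    identifiers = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", stripped))
    return sorted(identifiers - declared - KEYWORDS)

schema = [
    {"name": "customer_id", "type": "int"},
    {"name": "email", "type": "string"},
    {"name": "balance", "type": "float"},
]
```

An expression that references emial instead of email is reported before a single record is read, which is the property the paragraph above describes.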

Supported types include int, float, string, bool, date, and date_time. See Source Nodes for the full list.

Error handling

Each node can specify an error handling strategy:

Strategy | Behavior
fail_fast | Stop the pipeline on the first error (default)
continue | Route error records to a dead-letter queue file and continue
best_effort | Log errors and continue without writing error records

When using continue, Clinker writes rejected records to a DLQ file alongside the output. Each DLQ entry includes the original record, the error category, the error message, and the node that rejected it. This makes diagnosing production issues straightforward: check the DLQ, fix the data or the pipeline, and rerun.
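
The three strategies differ only in what happens when a record fails. The sketch below models that difference in Python; run_node and parse_amount are hypothetical names, using the exception type as the error category is an assumption, and the DLQ entry is shown as a dict rather than Clinker's actual file format.

```python
def run_node(records, transform, strategy="fail_fast", node="transform"):
    """Apply `transform` to each record under one of the three strategies."""
    if strategy not in {"fail_fast", "continue", "best_effort"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    outputs, dlq = [], []
    for record in records:
        try:
            outputs.append(transform(record))
        except Exception as err:
            if strategy == "fail_fast":
                raise  # stop the pipeline on the first error
            if strategy == "continue":
                # The four pieces of information a DLQ entry is documented to carry.
                dlq.append({
                    "record": record,
                    "category": type(err).__name__,  # assumption: type name as category
                    "message": str(err),
                    "node": node,
                })
            # best_effort: the error would be logged; no error record is written.
    return outputs, dlq

def parse_amount(record):
    """Example transform that fails on a non-numeric amount."""
    return {"id": record["id"], "amount": float(record["amount"])}

rows = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "n/a"}, {"id": 3, "amount": "7"}]
```

Under continue, the two good rows are emitted and the bad row lands in the DLQ with enough context to fix and rerun. Under best_effort the same two rows are emitted but the bad row is dropped. Under fail_fast the run stops at the second row.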

Pipeline YAML Structure

A Clinker pipeline is a single YAML file with three top-level sections: pipeline (metadata), nodes (the processing graph), and optionally error_handling.

Top-level shape

pipeline:
  name: my_pipeline            # Required — pipeline identifier
  memory:                      # Optional — see ops/memory.md
    limit: "256M"              # Optional (K/M/G suffixes), default 512M
    backpressure: pause        # Optional, default `pause`
  vars:                        # Optional key-value pairs
    threshold: 500
    label: "Monthly Report"
  date_formats: ["%Y-%m-%d"]   # Optional — custom date parsing formats
  rules_path: "./rules/"       # Optional — CXL module search path
  concurrency:                 # Optional
    threads: 4
    chunk_size: 1000
  metrics:                     # Optional
    spool_dir: "./metrics/"

nodes:                         # Required — flat list of pipeline nodes
  - type: source
    name: raw_data
    config:
      name: raw_data
      type: csv
      path: "./data/input.csv"
      schema:
        - { name: id, type: int }
        - { name: value, type: string }

  - type: transform
    name: clean
    input: raw_data
    config:
      cxl: |
        emit id = id
        emit value = value.trim()

  - type: output
    name: result
    input: clean
    config:
      name: result
      type: csv
      path: "./output/result.csv"

error_handling:                # Optional
  strategy: fail_fast

Pipeline metadata

The pipeline: block carries global settings that apply to the entire run.

Field | Required | Description
name | Yes | Pipeline identifier. Used in logs and metrics.
memory | No | Memory-arbitrator tuning. Nested fields: limit (RSS budget, K/M/G suffixes, default 512M) and backpressure (spill/pause/both, default pause). See Memory Tuning.
vars | No | Scalar constants accessible in CXL via $vars.*.
date_formats | No | List of strftime-style patterns for date parsing.
rules_path | No | Directory for CXL use module resolution.
concurrency | No | threads and chunk_size for parallel chunk processing.
metrics | No | spool_dir for per-run JSON metric files.
date_locale | No | Locale for date formatting.
include_provenance | No | Attach provenance metadata to records.
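
To make the limit syntax concrete, here is one way a value such as "256M" could be turned into a byte count. The sketch assumes binary multiples (1K = 1024 bytes), uppercase suffixes, and a mandatory suffix; the documentation above does not state any of these, so check the Memory Tuning page before relying on them. parse_limit is a hypothetical helper.

```python
import re

# Assumption: binary multiples. The docs only say "K/M/G suffixes".
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def parse_limit(text: str) -> int:
    """Convert a limit such as "256M" into a byte count."""
    match = re.fullmatch(r"(\d+)([KMG])", text.strip())
    if match is None:
        raise ValueError(f"invalid memory limit: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]
```

Under these assumptions, parse_limit("256M") returns 268435456 and the default "512M" corresponds to 536870912 bytes.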

The nodes list

Every pipeline has a flat nodes: list. Each entry is a node with a type: discriminator that determines its kind:

Type | Role
source | Reads data from a file
transform | Applies CXL expressions to each record
aggregate | Groups and summarizes records
route | Splits records into named branches by condition
merge | Concatenates multiple upstream branches that share a schema
combine | Joins records across N inputs with where: predicates
output | Writes records to a file
composition | Imports a reusable transform fragment

Node naming

Every node must have a name: field. Names must be unique within the pipeline and must not contain dots – the dot character is reserved for port syntax (see below). Names are used for wiring, logging, and diagnostics.

Wiring: input and inputs

Nodes connect to each other through input: (singular) and inputs: (plural) fields that live at the node’s top level, alongside name: and type:.

Single upstream – used by transform, aggregate, route, and output nodes:

- type: transform
  name: clean
  input: raw_data       # References the source node named "raw_data"
  config: ...

Port syntax – for consuming a specific branch from a route node, use node.port:

- type: output
  name: high_value_out
  input: split.high     # Consumes the "high" branch of route node "split"
  config: ...
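
The naming rule and the port syntax fit together: because a node name can never contain a dot, the first dot in an input reference always separates the node from the port. A minimal sketch of both checks, with hypothetical function names:

```python
def validate_names(names):
    """Reject duplicate node names and names containing a dot."""
    seen = set()
    for name in names:
        if "." in name:
            raise ValueError(f"node name {name!r} must not contain '.'")
        if name in seen:
            raise ValueError(f"duplicate node name {name!r}")
        seen.add(name)

def parse_input_ref(ref: str):
    """Split an input reference into (node, port); port is None when there is no dot."""
    node, dot, port = ref.partition(".")
    return (node, port if dot else None)
```

With these definitions, parse_input_ref("split.high") yields ("split", "high") and parse_input_ref("raw_data") yields ("raw_data", None).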

Multiple upstreams – merge nodes use inputs: (plural) in place of the singular input: field:

- type: merge
  name: combined
  inputs:
    - east_processed
    - west_processed
  config: {}

Source nodes have no input field. They are entry points – adding an input: field to a source is a parse error.

Using inputs: on a non-merge node (or input: on a merge node) is caught at parse time by deny_unknown_fields.

Optional fields on all nodes

Every node type supports these optional fields:

  • description: – human-readable text for documentation. Ignored by the engine.
  • _notes: – arbitrary metadata (JSON object). Ignored by the engine, used by the Kiln IDE for visual annotations and inspector panels.

For example:

- type: transform
  name: enrich
  description: "Add customer tier based on lifetime value"
  _notes:
    color: "#4a9eff"
    position: { x: 300, y: 200 }
  input: customers
  config:
    cxl: |
      emit tier = if lifetime_value >= 10000 then "gold" else "standard"

Strict parsing

All config structs use deny_unknown_fields. If you misspell a field name – for example, writing inputt: instead of input: or stratgy: instead of strategy: – the YAML parser rejects it immediately with a diagnostic pointing to the typo. This catches configuration errors before any data processing begins.
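
The effect of strict parsing can be imitated in a few lines of Python. The allowed-key set below is assembled from the transform fields shown on this page; the "did you mean" hint is an illustrative addition and not a claim about Clinker's diagnostic wording. check_keys is a hypothetical helper.

```python
import difflib

# Top-level keys a transform node is shown to accept on this page.
ALLOWED_TRANSFORM_KEYS = {"type", "name", "input", "config", "description", "_notes"}

def check_keys(node: dict, allowed=ALLOWED_TRANSFORM_KEYS) -> None:
    """Raise ValueError on the first key that is not in `allowed`."""
    for key in node:
        if key not in allowed:
            hint = difflib.get_close_matches(key, sorted(allowed), n=1)
            suffix = f" (did you mean {hint[0]!r}?)" if hint else ""
            raise ValueError(f"unknown field {key!r}{suffix}")
```

A lenient parser would silently ignore inputt: and leave the node unwired; rejecting it up front turns a confusing runtime failure into an immediate, local error.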

Environment variable: CLINKER_ENV

The CLINKER_ENV environment variable can be used for conditional logic outside of pipelines (e.g., selecting channel directories or controlling CLI behavior). It is not directly referenced within pipeline YAML but is available to the channel and workspace systems.

Source Nodes

Source nodes read data from files and are the entry points of every pipeline. They have no input: field – they produce records, they do not consume them.

Basic structure

- type: source
  name: customers
  config:
    name: customers
    type: csv
    path: "./data/customers.csv"
    schema:
      - { name: customer_id, type: int }
      - { name: name, type: string }
      - { name: email, type: string }
      - { name: status, type: string }
      - { name: amount, type: float }

Schema declaration

The schema: field is required on every source node. Clinker does not infer types from data – you must declare each column’s name and CXL type explicitly. This schema drives compile-time type checking across the entire pipeline.

Each entry is a { name, type } pair:

schema:
  - { name: employee_id, type: string }
  - { name: salary, type: int }
  - { name: hired_at, type: date_time }
  - { name: is_active, type: bool }
  - { name: notes, type: nullable(string) }

Available types

Type | Description
string | UTF-8 text
int | 64-bit signed integer
float | 64-bit IEEE 754 floating point
bool | Boolean (true / false)
date | Calendar date
date_time | Date with time component
array | Ordered sequence of values
numeric | Union of int and float – resolved during type unification
any | Unknown type – field used in type-agnostic contexts
nullable(T) | Nullable wrapper around any inner type (e.g. nullable(int))

long_unique — storage hint for high-cardinality text

A string column may carry an optional long_unique: true flag. It is an advisory, opt-in storage hint, not a type change: it tells the engine the column’s values are long and effectively unique — never repeated across records — so it stores them in a leaner, header-free representation that drops the per-value bookkeeping the default representation keeps for values that might be shared. Typical candidates are UUIDs rendered as text, street addresses, and free-text comment or note fields.

schema:
  - { name: ticket_id,  type: string, long_unique: true }   # 36-char UUID
  - { name: notes,      type: string, long_unique: true }   # free text
  - { name: department, type: string }                      # low-cardinality, default

The flag changes only the in-memory footprint of the annotated column. The value’s content, its comparison/grouping/join/sort behavior, and the on-disk encoding when a run spills to disk are all unchanged — a long_unique value compares and groups identically to the same text in any other column. Omitting the flag (the common case) leaves the default behavior untouched. Set it only when you know a column is genuinely high-cardinality free text; on a column whose values repeat, the default representation is the better choice because it shares repeated values instead of storing each copy independently.

Transport vs format

A source declaration has two independent layers:

  • Transport (transport:) selects where the records come from. The only transport today is file — read bytes from the filesystem, resolved through one of the file matchers (path / glob / regex / paths). transport: is optional and defaults to file, so a source that omits it reads from disk exactly as before.
  • Format (type:) selects how the bytes decode into records: csv, json, xml, fixed_width, edifact, x12.

- type: source
  name: orders
  config:
    name: orders
    transport: file        # optional; this is the default
    type: csv              # the on-disk format
    path: "./data/orders.csv"
    schema:
      - { name: order_id, type: int }

A file transport requires exactly one file matcher (path, glob, regex, or paths). Declaring none fails validation with E211; declaring more than one fails with E210. Both are reported at config-load time, before any file is opened.
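The matcher rule can be sketched in a few lines of Python. This is an illustration of the rule as stated, not Clinker's validator: the helper name check_file_matchers and the plain-dict config shape are assumptions, and only the two diagnostic codes come from the text above.

```python
from typing import Optional

# The four file matchers named above.
FILE_MATCHERS = ("path", "glob", "regex", "paths")


def check_file_matchers(config: dict) -> Optional[str]:
    """Return a diagnostic code for a file-transport source config,
    or None when exactly one matcher is declared (hypothetical helper)."""
    declared = [key for key in FILE_MATCHERS if key in config]
    if not declared:
        return "E211"  # no matcher declared
    if len(declared) > 1:
        return "E210"  # more than one matcher declared
    return None
```

A config carrying both path: and glob: would map to E210 under this sketch, and one carrying neither would map to E211.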

Format types

The type: field inside config: selects the on-disk format. Supported values: csv, json, xml, fixed_width, edifact, x12.

The edifact format reads UN/EDIFACT interchanges; it has its own reference page covering the segment-record schema, delimiter discovery, envelope sections, and control-count validation. See EDIFACT Format. Its one input option is max_elements (default 32) — the number of positional eNN element columns on the record schema; a segment with more data elements than that is rejected rather than truncated.

The x12 format reads ANSI ASC X12 interchanges with their three-tier ISA..IEA / GS..GE / ST..SE envelope, surfacing all three tiers as nested $doc sections. It shares the same max_elements input option and has its own reference page. See X12 Format.

CSV

- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: "./data/orders.csv"
    schema:
      - { name: order_id, type: int }
      - { name: customer_id, type: int }
      - { name: amount, type: float }
      - { name: order_date, type: date }
    options:
      delimiter: ","         # Default: ","
      quote_char: "\""       # Default: "\""
      has_header: true        # Default: true
      encoding: "utf-8"      # Default: "utf-8"

All CSV options are optional. With no options: block, Clinker uses standard RFC 4180 defaults.

JSON

- type: source
  name: events
  config:
    name: events
    type: json
    path: "./data/events.json"
    schema:
      - { name: event_id, type: string }
      - { name: timestamp, type: date_time }
      - { name: payload, type: string }
    options:
      format: ndjson          # array | ndjson | object (auto-detect if omitted)
      record_path: "$.data"   # JSONPath to records array

  • array – the file is a single JSON array of objects.
  • ndjson – one JSON object per line (newline-delimited JSON).
  • object – single top-level object; use record_path to locate the records array within it.

If format is omitted, Clinker auto-detects based on file content.

XML

- type: source
  name: catalog
  config:
    name: catalog
    type: xml
    path: "./data/catalog.xml"
    schema:
      - { name: product_id, type: int }
      - { name: name, type: string }
      - { name: price, type: float }
    options:
      record_path: "//product"          # XPath to record elements
      attribute_prefix: "@"             # Prefix for XML attribute fields
      namespace_handling: strip         # strip | qualify

  • strip (default) – removes namespace prefixes from element and attribute names.
  • qualify – preserves namespace-qualified names.

Fixed-width

- type: source
  name: legacy_data
  config:
    name: legacy_data
    type: fixed_width
    path: "./data/mainframe.dat"
    schema:
      - { name: account_id, type: string }
      - { name: balance, type: float }
      - { name: status_code, type: string }
    options:
      line_separator: crlf    # Line ending style

Fixed-width sources require a separate format schema (.schema.yaml file) that defines field positions, widths, and padding. The schema: on the source body declares CXL types for compile-time checking; the format schema defines the physical layout.

on_unmapped — undeclared input fields

The per-source on_unmapped policy decides what to do with input fields the source’s schema: block does not name. Three modes — auto_widen (default), drop, reject:

- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: "./data/orders.csv"
    on_unmapped:
      mode: auto_widen     # default; other values: drop, reject
    schema:
      - { name: order_id, type: string }
      - { name: amount, type: numeric }

See Auto-Widen & Schema Drift for the full specification: the $widened sidecar absorber design, propagation rules per downstream node type, the include_unmapped Output flag, E315 merge-policy mismatch, and fixed-width inertness.

Sort order

If your source data is pre-sorted, declare the sort order so the optimizer can use streaming aggregation instead of hash aggregation:

- type: source
  name: sorted_transactions
  config:
    name: sorted_transactions
    type: csv
    path: "./data/transactions_sorted.csv"
    schema:
      - { name: account_id, type: string }
      - { name: txn_date, type: date }
      - { name: amount, type: float }
    sort_order:
      - { field: "account_id", order: asc }
      - { field: "txn_date", order: asc }

Sort order declarations are trusted – Clinker does not verify that the data is actually sorted. If the data violates the declared order, downstream streaming aggregation may produce incorrect results.

The shorthand form is also accepted – a bare string defaults to ascending:

    sort_order:
      - "account_id"
      - { field: "txn_date", order: desc }
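Because declared sort order is trusted rather than verified, it can be worth checking a sample of the data yourself before relying on it. The sketch below is a hypothetical pre-flight check, not part of Clinker: normalize_sort_order expands the bare-string shorthand to ascending, and first_violation reports the first row that breaks the declared order. Rows are modelled as plain dicts with comparable values, which is an assumption.

```python
def normalize_sort_order(entries):
    """Expand the shorthand: a bare string means ascending order."""
    normalized = []
    for entry in entries:
        if isinstance(entry, str):
            normalized.append({"field": entry, "order": "asc"})
        else:
            normalized.append({"field": entry["field"], "order": entry["order"]})
    return normalized


def first_violation(rows, sort_order):
    """Return the index of the first row that breaks the declared order,
    or None when every adjacent pair is in order."""
    for i in range(1, len(rows)):
        prev, cur = rows[i - 1], rows[i]
        for key in sort_order:
            a, b = prev[key["field"]], cur[key["field"]]
            if a == b:
                continue  # tie on this key: compare the next key
            in_order = a < b if key["order"] == "asc" else a > b
            if not in_order:
                return i
            break  # this key already decides the pair
    return None
```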

Watermarks

An event-time watermark declares which column on the source carries each record’s event time — the wall-clock instant the event happened, distinct from when Clinker read the row. When set, the engine reads the column on every record, subtracts the source’s delay, and folds the result into a per-source monotonic watermark. The delay-corrected value is also stamped on every record as $source.event_time, the column a downstream time-windowed aggregate uses to assign records to windows.

- type: source
  name: clicks
  config:
    name: clicks
    type: csv
    path: "./data/clicks.csv"
    options:
      has_header: true
    watermark:
      column: event_ts       # must be date_time or date
      delay: 5s              # bounded out-of-order tolerance
      idle_timeout: 30s      # flip partitions to idle if quiet
    schema:
      - { name: user_id, type: string }
      - { name: event_ts, type: date_time }
      - { name: amount, type: int }

Fields:

  • column (required) — the schema column whose value is each record’s event time. The column’s declared type must be date_time or date. A column: that names a field absent from schema: raises E154; a column: whose declared type is neither raises E155.

  • delay (optional duration, default unset) — bounded out-of-order tolerance. Each record’s event time is shifted earlier by delay before being folded into the watermark, so the source’s effective watermark trails its observed max event time by this amount. Mirrors Flink’s BoundedOutOfOrdernessWatermarks. Without delay, the watermark advances strictly to the observed max — a single late record routes to the DLQ.

  • idle_timeout (optional duration, default unset) — if a live source’s receiver stays quiet longer than this, its partitions flip to idle and stop holding back min_across_sources. Lets downstream windows keep closing when one source pauses. None means never go idle, preserving the prior behaviour for pipelines without a window-close consumer.

Durations use the suffixes ms, s, m, h, d. ms is matched before the single-character s, so 500ms reads as 500 milliseconds, not 500 seconds with a stray m.
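The suffix-ordering point is easy to see in a small parser. The following is a sketch of the rule as described, not Clinker's parser: the function name is hypothetical, and accepting only unsigned whole numbers is an assumption. If the "s" entry came first in the table, "500ms" would match it and leave the non-numeric remainder "500m".

```python
# Suffix table in match order: "ms" must come before the one-character "s" and "m".
_SUFFIXES = (
    ("ms", 1),
    ("s", 1_000),
    ("m", 60_000),
    ("h", 3_600_000),
    ("d", 86_400_000),
)


def parse_duration_ms(text: str) -> int:
    """Parse a duration such as '5s' or '500ms' into milliseconds."""
    for suffix, factor in _SUFFIXES:
        if text.endswith(suffix):
            digits = text[: -len(suffix)]
            if not (digits.isascii() and digits.isdigit()):
                raise ValueError(f"invalid duration: {text!r}")
            return int(digits) * factor
    raise ValueError(f"invalid duration: {text!r}")
```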

A pipeline whose aggregate declares time_window: must have a watermark.column on every upstream-reachable source. Without it, min_across_sources over the source set stays at None and the window can never close — the planner rejects this with E156.
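A minimal model of the watermark arithmetic, assuming one partition per source and ignoring idle_timeout, looks like this. The class and function names are illustrative only. The sketch shows the two behaviours described above: the per-source watermark never moves backwards, and the minimum across sources stays at None while any source has no watermark.

```python
from datetime import datetime, timedelta


class SourceWatermark:
    """Per-source monotonic watermark (illustrative model)."""

    def __init__(self, delay: timedelta = timedelta(0)):
        self.delay = delay
        self.value = None  # None until the first record is observed

    def observe(self, event_time: datetime) -> datetime:
        """Fold one record in; return the delay-corrected event time."""
        stamped = event_time - self.delay
        if self.value is None or stamped > self.value:
            self.value = stamped  # only ever advances
        return stamped


def min_across_sources(watermarks):
    """Minimum watermark over all sources, or None if any source has none."""
    values = [w.value for w in watermarks]
    if not values or any(v is None for v in values):
        return None
    return min(values)
```

With delay: 5s, a record at 12:00:10 moves the watermark to 12:00:05; a later record at 12:00:00 leaves it there.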

Array paths

For nested data (JSON/XML sources with embedded arrays), array_paths controls how nested arrays are handled:

- type: source
  name: invoices
  config:
    name: invoices
    type: json
    path: "./data/invoices.json"
    schema:
      - { name: invoice_id, type: int }
      - { name: customer, type: string }
      - { name: line_item, type: string }
      - { name: line_amount, type: float }
    array_paths:
      - path: "$.line_items"
        mode: explode         # One output record per array element
      - path: "$.tags"
        mode: join            # Concatenate array elements into a string
        separator: ","

  • explode (default) – produces one output record per array element, with parent fields repeated.
  • join – concatenates array elements into a single string using the specified separator.
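The two modes can be pictured on a plain dict. This is a conceptual sketch only: how Clinker maps an array element onto schema fields is not specified here, so the element_field parameter and the behaviour on an empty array (zero output records) are assumptions.

```python
def explode(record: dict, array_field: str, element_field: str) -> list:
    """One output record per array element, parent fields repeated."""
    parent = {k: v for k, v in record.items() if k != array_field}
    return [{**parent, element_field: item} for item in record.get(array_field, [])]


def join(record: dict, array_field: str, separator: str = ",") -> dict:
    """Replace the array with its elements concatenated into one string."""
    out = dict(record)
    out[array_field] = separator.join(str(item) for item in record.get(array_field, []))
    return out
```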

EDIFACT Format

Clinker reads and writes UN/EDIFACT interchanges alongside CSV, JSON, XML, and fixed-width. An interchange is a finite file: it opens with an optional UNA service-string advice and a mandatory UNB header, wraps one or more UNH..UNT messages, and closes with a UNZ trailer. The reader streams one segment at a time and the writer reconstructs the envelope around emitted records. The reader decodes release-escape sequences into clean data values and the writer re-escapes them on output, so a reader → writer → reader round-trip preserves the data values and the envelope control references.

Delimiters and the UNA service string

Each segment is terminated by the segment terminator; within a segment, data elements split on the element separator and components on the component separator. A release character escapes a delimiter that occurs as literal data.

When the file begins with a 9-byte UNA prefix, its six service characters override the defaults in this fixed order: component, element, decimal, release, repetition, terminator. When UNA is absent, the syntax Level-A defaults apply:

| Role | Level-A default |
| --- | --- |
| Component separator | : |
| Element separator | + |
| Decimal notation | . |
| Release / escape | ? |
| Repetition | space (inactive) |
| Segment terminator | ' |

UNA is optional — a parser that requires it would fail on the common no-UNA interchange, so Clinker assumes Level-A when it is absent.
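Delimiter discovery as described above can be sketched as follows. The names ServiceChars and discover_delimiters are illustrative, and the sketch works on a decoded string rather than raw bytes for brevity.

```python
from typing import NamedTuple, Tuple


class ServiceChars(NamedTuple):
    component: str
    element: str
    decimal: str
    release: str
    repetition: str
    terminator: str


# Defaults applied when no UNA prefix is present.
LEVEL_A = ServiceChars(":", "+", ".", "?", " ", "'")


def discover_delimiters(data: str) -> Tuple[ServiceChars, str]:
    """Return the service characters and the payload that follows them."""
    if data.startswith("UNA") and len(data) >= 9:
        # "UNA" plus six service characters, in the fixed order listed above.
        return ServiceChars(*data[3:9]), data[9:]
    return LEVEL_A, data
```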

Release character

The release character (default ?) marks the following byte as literal data rather than a delimiter: ?+ is a literal + inside an element, ?' is a literal apostrophe (not a terminator), and ?? is a literal ?. The reader decodes these sequences into clean data values, so a downstream CSV/JSON sink, a CXL string comparison, or a $doc field sees O'BRIEN, never the wire form O?'BRIEN. The writer re-escapes on output: any element value that carries the element separator, the segment terminator, or the release character is release-escaped automatically, so a value computed by a Transform or sourced from CSV — never EDIFACT-escaped to begin with — does not corrupt the interchange. A reader → writer → reader round-trip therefore preserves the data values exactly.

The component separator inside an element (e.g. the : in the composite UNOA:1) is kept as part of the element’s text and is not escaped — the positional element model works above component resolution, so a composite element round-trips unchanged. A literal colon in free-text data is the one ambiguity this introduces: because components are not split into separate fields, a : in a value re-reads as a component boundary. Repeating elements ride inside one element string intact and are likewise never truncated to their first repetition.
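The read-side decode and write-side escape can be sketched as a pair of small functions. These are illustrative helpers, not Clinker's implementation; the names are assumptions. Note that the escape function leaves the component separator alone, matching the behaviour described above.

```python
def split_unescaped(text: str, separator: str, release: str = "?") -> list:
    """Split on separator, treating the character after a release as literal.
    The returned parts are already decoded (release characters removed)."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == release and i + 1 < len(text):
            current.append(text[i + 1])  # escaped character is literal data
            i += 2
        elif ch == separator:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def escape_element(value: str, element: str = "+", terminator: str = "'",
                   release: str = "?") -> str:
    """Release-escape the element separator, terminator, and release character."""
    special = (element, terminator, release)
    return "".join(release + ch if ch in special else ch for ch in value)
```

Splitting the segment body NAD+O?'BRIEN on + yields the clean value O'BRIEN, and escaping that value again restores the wire form.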

Newlines between segments

Some producers insert CR/LF after each segment terminator for readability. Those bytes are insignificant and are stripped between segments; CR/LF that appears inside an element is preserved.

Record shape

Each non-service segment becomes one record under a fixed positional schema:

| Column | Meaning |
| --- | --- |
| seg_id | The segment tag (BGM, NAD, …) |
| msg_ref | The enclosing message reference (the UNH element 1) |
| msg_type | The message type (the UNH element 2, full composite) |
| e01, e02, … | The segment’s positional data elements (release sequences decoded) |

Service segments (UNB, UNZ, UNT) are consumed by the reader to drive envelope state and validation and are never emitted as body records. The UNH segment that opens a message is the exception: it also drives envelope state, but it is emitted as a body record (its seg_id is UNH), carrying the message reference and type.

The number of eNN columns is controlled by the source max_elements option (default 32). A segment carrying more data elements than that is rejected with guidance rather than silently truncated. Absent trailing elements read as null.

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: edifact
      glob: ./inbox/*.edi
      options:
        max_elements: 48      # widen the positional schema for exotic segments
      schema:
        - { name: seg_id, type: string }
        - { name: msg_ref, type: string }
        - { name: e01, type: string }

Envelope sections over UNB

The interchange header UNB is extractable as a document envelope section, exposing its positional elements to CXL as $doc.<section>.<field>. Use the segment extract rule with the section field names matching the positional keys e01, e02, …:

envelope:
  sections:
    interchange:
      extract: { segment: "UNB" }
      fields:
        e05: string          # interchange control reference (UNB element 5)

A Transform can then read $doc.interchange.e05 on every body record.

Only the UNB header is extractable as an envelope section. Trailer segments (UNT, UNZ) arrive after the body and cannot become $doc fields without buffering the whole interchange — their control counts are instead validated inline by the reader (see below). A segment extract naming any tag other than UNB, or an xml_path / json_pointer extract against an EDIFACT source, is rejected at startup.

Control-count validation

The reader validates the structural integrity claims carried in the trailers as they arrive, failing the run on a mismatch (a truncation or corruption signal):

  • UNT segment count — must equal the actual number of segments in the message, counting the UNH and UNT themselves.
  • UNT message reference — must echo the opening UNH reference.
  • UNZ message count — must equal the actual number of UNH messages in the interchange.
  • UNZ control reference — must echo the UNB control reference.

The UNB control reference (data element 0020) is located by its structural position — the first data element after the four mandatory leading composites (syntax identifier, sender, recipient, date/time) — rather than at a fixed element index. An interchange that carries an empty optional element ahead of the control reference (shifting it past the fifth position) therefore validates and round-trips correctly: the reader reads the real reference and the writer echoes the same one into UNZ, so the trailer never contradicts its own header.

A missing UNZ at end of input is a truncation error; content after the UNZ trailer is rejected.
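The four checks and the two end-of-input rules can be sketched as a single pass over already-split segments. This is a simplified model, not the reader itself: each segment is a list of [tag, e01, e02, ...], and the UNB control reference is read from the fifth element, which skips the structural lookup described above.

```python
def validate_interchange(segments) -> None:
    """Raise ValueError on a control-count or reference mismatch."""
    unb_ref = msg_ref = None
    seg_count = msg_count = 0
    closed = False
    for seg in segments:
        tag = seg[0]
        if closed:
            raise ValueError("content after the UNZ trailer")
        if tag == "UNB":
            unb_ref = seg[5]  # simplification: fixed fifth element
        elif tag == "UNH":
            msg_ref, seg_count = seg[1], 1  # the UNH itself counts
            msg_count += 1
        elif tag == "UNT":
            seg_count += 1  # the UNT itself counts
            if int(seg[1]) != seg_count:
                raise ValueError("UNT segment count mismatch")
            if seg[2] != msg_ref:
                raise ValueError("UNT message reference mismatch")
        elif tag == "UNZ":
            if int(seg[1]) != msg_count:
                raise ValueError("UNZ message count mismatch")
            if seg[2] != unb_ref:
                raise ValueError("UNZ control reference mismatch")
            closed = True
        else:
            seg_count += 1  # ordinary body segment
    if not closed:
        raise ValueError("missing UNZ: input is truncated")
```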

Writing EDIFACT

An EDIFACT Output node reconstructs the envelope around emitted records. Records map by the same positional columns (seg_id, msg_ref, msg_type, eNN); trailing null/empty elements are trimmed so no fabricated delimiters appear, and a column the writer does not recognize is an error (project the record to the EDIFACT columns first). Engine-internal $-namespaced columns are excluded automatically.

nodes:
  - type: output
    name: out
    input: messages
    config:
      name: out
      type: edifact
      path: ./out/result.edi
      options:
        interchange: ["UNOA:1", "SENDER", "RECEIVER", "240101:1200", "REF1"]
        message_type: "ORDERS:D:96A:UN"
        write_una: false
        segment_newline: true

Output options:

| Option | Meaning |
| --- | --- |
| interchange | Literal UNB data elements (release-escaped as needed on write). |
| interchange_from_doc | Name of a $doc section to echo the UNB elements from (round-trip). |
| message_type | Fallback UNH message type when a record carries no msg_type value. |
| write_una | Emit a leading UNA segment (default false). |
| segment_newline | Write a newline after each segment terminator (default true). |

Consecutive records are grouped into UNH..UNT messages on msg_ref transitions. The writer recomputes the UNT segment count and UNZ message count, and echoes the message and interchange control references, so the output passes its own count validation on re-read.
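Grouping on transitions means that only adjacent records with the same msg_ref share a message. The sketch below models that with itertools.groupby and derives the counts the writer would recompute. It assumes the input holds body segments only (no UNH records), and plan_messages is a hypothetical name.

```python
from itertools import groupby


def plan_messages(records):
    """Return (msg_ref, unt_segment_count) for each consecutive run of msg_ref."""
    plan = []
    for msg_ref, run in groupby(records, key=lambda r: r["msg_ref"]):
        body_segments = sum(1 for _ in run)
        plan.append((msg_ref, body_segments + 2))  # body + UNH + UNT
    return plan
```

Under this model, the UNZ message count is simply len(plan_messages(records)). A msg_ref that reappears after a different one starts a new message, so records for one message should arrive contiguously.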

interchange_from_doc echoes the header from a record’s document context. That context is populated by a source’s UNB envelope section (declare a segment: "UNB" envelope section on the source) and travels with every body record through the pipeline — including to a sink that sits directly downstream of the source with no intervening Transform. The reader stashes the complete, ordered UNB element list (empty middle elements included), so the reconstructed header is faithful even when a middle element is empty and the user declares only the fields they care about. Supply interchange literal elements instead when the records have no source UNB section to echo.

Limitations

  • Charset. Element text is decoded as UTF-8. Non-UTF-8 interchanges (UNOA/UNOB/Latin-1 high bytes) are rejected explicitly rather than silently corrupted.
  • Functional groups. A single UNB..UNZ interchange is supported; UNG/UNE functional-group segments are rejected with a precise error.
  • UNH composite fidelity. The reader stamps the UNH reference (element 1) and the full message-type composite (element 2). A UNH carrying additional elements (e.g. a common access reference) is reconstructed as a two-element UNH on round-trip.
  • Output splitting. An interchange is a single UNB..UNZ envelope and cannot be divided across files. An edifact output combined with a split: block is rejected at config-validation time (diagnostic E323) rather than emitting a structurally corrupt interchange.

X12 Format

Clinker reads and writes ANSI ASC X12 interchanges alongside CSV, JSON, XML, fixed-width, and EDIFACT. An X12 interchange is a finite file with a three-tier envelope: an ISA..IEA interchange wraps one or more GS..GE functional groups, and each functional group wraps one or more ST..SE transaction sets. The reader streams one segment at a time and the writer reconstructs the three envelope tiers around emitted records.

The three tiers surface as nested document-context levels: the ISA interchange becomes the file-level $doc document, and each GS group and ST set opens a nested level whose $doc sections layer over the enclosing tiers. A body record therefore sees every enclosing tier’s fields through one $doc.<section>.<field> lookup.

Delimiters and the ISA header

Unlike EDIFACT’s optional UNA service-string advice, X12 declares its delimiters in a fixed-length 106-byte ISA header. Three delimiter bytes live at structural positions within it:

| Role | Source in the ISA |
| --- | --- |
| Element (data) separator | The byte immediately after the ISA tag |
| Sub-element (component) sep. | ISA16, the last single-byte ISA element |
| Segment terminator | The byte immediately after ISA16 |

The reader reads these three bytes from the header rather than assuming a fixed delimiter set, so an interchange that uses */:/~, |/^/newline, or any other producer-chosen delimiters parses correctly. The ISA13 interchange control number is located as the 13th element of the header split on the discovered element separator — structurally, not by an absolute byte offset — so producer padding quirks do not misalign it.
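A structural reading of the header can be sketched like this: take the element separator from the byte after the tag, split on it, and read ISA16 and the terminator from the start of the final piece. The names are illustrative and the sketch works on a decoded string rather than raw bytes.

```python
from typing import NamedTuple


class X12Delimiters(NamedTuple):
    element: str
    component: str
    terminator: str
    control_number: str  # ISA13


def discover_x12_delimiters(data: str) -> X12Delimiters:
    """Discover delimiters and ISA13 by position in the split header."""
    if not data.startswith("ISA") or len(data) < 4:
        raise ValueError("input does not start with an ISA header")
    element = data[3]
    # fields[0] is the tag, fields[1..15] are ISA01..ISA15, and fields[16]
    # starts with ISA16 followed by the segment terminator.
    fields = data.split(element, 16)
    if len(fields) != 17 or len(fields[16]) < 2:
        raise ValueError("truncated ISA header")
    tail = fields[16]
    return X12Delimiters(element, tail[0], tail[1], fields[13])
```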

No escape character

X12 has no release/escape character (EDIFACT’s ? has no X12 equivalent). A data value that contains a delimiter byte is therefore unrepresentable. On output the writer rejects any element value carrying the element separator or the segment terminator with a precise error rather than silently corrupting the interchange; re-encode the value or choose delimiters the data does not contain.
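In contrast to EDIFACT's escape-on-write, the X12 output rule is a plain rejection. A minimal sketch, with a hypothetical function name and the common * and ~ delimiters assumed as defaults:

```python
def check_x12_element(value: str, element_sep: str = "*", terminator: str = "~") -> str:
    """Return value unchanged, or raise if it contains a delimiter byte."""
    for name, delim in (("element separator", element_sep),
                        ("segment terminator", terminator)):
        if delim in value:
            raise ValueError(
                f"value {value!r} contains the {name} {delim!r}; "
                "re-encode it or choose different delimiters"
            )
    return value
```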

The sub-element (component) separator inside an element (e.g. the : in a composite A:B:C) is kept as part of the element’s text and is not split — the positional element model works above component resolution, so a composite element round-trips unchanged.

Newlines between segments

Some producers insert CR/LF after each segment terminator for readability. Those bytes are insignificant and are stripped between segments; CR/LF that appears inside an element is preserved.

Record shape

Each non-service segment becomes one record under a fixed positional schema:

| Column | Meaning |
| --- | --- |
| seg_id | The segment tag (BEG, PO1, …) |
| set_ref | The enclosing transaction set control number (ST02) |
| set_type | The transaction set identifier code (ST01, e.g. 850) |
| e01, e02, … | The segment’s positional data elements |

Service segments (ISA, IEA, GS, GE, SE) are consumed by the reader to drive the envelope and validation — they are never emitted as body records. The ST segment that opens a transaction set is emitted as a body record (its seg_id is ST), carrying the set reference and type.

The number of eNN columns is controlled by the source max_elements option (default 32). A segment carrying more data elements than that is rejected with guidance rather than silently truncated. Absent trailing elements read as null.

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: x12
      glob: ./inbox/*.x12
      options:
        max_elements: 48      # widen the positional schema for exotic segments
      schema:
        - { name: seg_id, type: string }
        - { name: set_ref, type: string }
        - { name: e01, type: string }

Envelope sections over the three tiers

The interchange header ISA is extractable as a file-level document envelope section, exposing its positional elements to CXL as $doc.<section>.<field>. Use the segment extract rule with the section field names matching the positional keys e01, e02, …:

envelope:
  sections:
    interchange:
      extract: { segment: "ISA" }
      fields:
        e13: string          # interchange control number (ISA13)

The GS functional group and the ST transaction set surface automatically as the nested $doc sections functional_group and transaction_set, each keyed by positional eNN elements — no envelope declaration is needed for them. A Transform on any body record can read all three tiers at once:

emit isa13 = $doc.interchange.e13       # interchange control number
emit gs06  = $doc.functional_group.e06  # group control number (GS06)
emit st02  = $doc.transaction_set.e02   # set control number (ST02)

Only the ISA header is extractable as a declared envelope section. Trailer segments (SE, GE, IEA) arrive after the body they close and cannot become $doc fields without buffering the whole interchange — their control counts are instead validated inline by the reader (see below). A segment extract naming any tag other than ISA, or an xml_path / json_pointer extract against an X12 source, is rejected at startup.

Control-count validation

The reader validates the structural integrity claims carried in the trailers as they arrive, failing the run on a mismatch (a truncation or corruption signal):

  • SE segment count (SE01) — must equal the number of segments in the transaction set, counting the ST and SE themselves.
  • SE set control number (SE02) — must echo the opening ST02.
  • GE transaction-set count (GE01) — must equal the number of ST sets in the functional group.
  • GE group control number (GE02) — must echo the GS06.
  • IEA functional-group count (IEA01) — must equal the number of GS groups in the interchange.
  • IEA control number (IEA02) — must echo the ISA13.

A missing IEA at end of input is a truncation error; content after the IEA trailer is rejected.

Writing X12

An X12 Output node reconstructs the three-tier envelope around emitted records. Records map by the same positional columns (seg_id, set_ref, set_type, eNN); trailing null/empty elements are trimmed so no fabricated delimiters appear, and a column the writer does not recognize is an error (project the record to the X12 columns first). Engine-internal $-namespaced columns are excluded automatically.

nodes:
  - type: output
    name: out
    input: messages
    config:
      name: out
      type: x12
      path: ./out/result.x12
      options:
        interchange:
          ["00", "          ", "00", "          ", "ZZ", "SENDER         ",
           "ZZ", "RECEIVER       ", "240101", "1200", "U", "00401",
           "000000001", "0", "P", ":"]
        group_header: ["PO", "SENDER", "RECEIVER", "20240101", "1200", "1", "X", "004010"]
        set_type: "850"
        segment_newline: true

Output options:

| Option | Meaning |
| --- | --- |
| interchange | Literal ISA data elements (the 16 fixed-width ISA fields). |
| interchange_from_doc | Name of a $doc section to echo the ISA elements from (round-trip). |
| group_header | Literal GS01..GS08 elements (GS06 control number recomputed). |
| set_type | Fallback ST01 set type when a record carries no set_type value. |
| segment_newline | Write a newline after each segment terminator (default true). |

Consecutive records are grouped into ST..SE transaction sets on set_ref transitions, and all sets are wrapped in a single GS..GE functional group. The writer recomputes the SE segment count, the GE transaction-set count, and the IEA functional-group count, and echoes the set, group, and interchange control numbers, so the output passes its own count validation on re-read.

interchange_from_doc echoes the header from a record’s document context. That context is populated by a source’s ISA envelope section (declare a segment: "ISA" envelope section on the source) and travels with every body record through the pipeline — including to a sink that sits directly downstream of the source with no intervening Transform. The reader stashes the complete, ordered ISA element list, so the reconstructed header is faithful. Supply interchange literal elements instead when the records have no source ISA section to echo.

Limitations

  • Charset. Element text is decoded as UTF-8. Non-UTF-8 interchanges are rejected explicitly rather than silently corrupted.
  • No escape character. X12 has no release mechanism, so a data value that contains a delimiter byte is rejected on output rather than corrupting the interchange.
  • One functional group on output. The writer wraps all transaction sets in a single GS..GE functional group; the reader handles any number of groups on input. A multi-group output shape requires multiple runs.
  • Output splitting. An interchange is a single ISA..IEA envelope and cannot be divided across files. An x12 output combined with a split: block is rejected at config-validation time (diagnostic E338) rather than emitting a structurally corrupt interchange.

Network Sources (REST)

A Source reads from the filesystem by default. To pull records from a network endpoint instead, declare a transport: block on the Source. The transport selects where records come from; it sits above the on-disk type: (the format), which for a REST source still selects how the response bodies decode.

A network transport is a finite-pull source: it runs on its own thread, drives a synchronous client to cursor exhaustion, then exits. There is no daemon, no event loop, and no async runtime; it follows the same single-process, run-to-drain model as a file pipeline. Finiteness is a hard property of the reader: a REST source caps its pull with an explicit page/record limit, so an unbounded endpoint cannot keep it running forever.

A network source still requires a schema: block. That authored schema is the row-to-record target: the reader maps each decoded object onto it, coercing values leniently. A per-row value that cannot coerce is left unchanged at the reader and routed to the dead-letter queue at the Transform stage — identical to file-source semantics. A network source declares no file matcher (path / glob / regex / paths); declaring one is a configuration error (E219).

Because a network source has no file path, its $source.file provenance column and the {source_file} output template both resolve to a stable synthetic identifier, <source:NAME>, where NAME is the Source node’s name.

REST sources

A rest source issues paginated HTTP GETs against a base URL, decoding each response body through the declared json or xml format. (Other formats are rejected with E220 — a REST body is a multi-record document, not a flat CSV/fixed-width stream.)

nodes:
  - type: source
    name: orders_api
    config:
      name: orders_api
      type: json
      options:
        format: array        # each page body is a JSON array of objects
      transport:
        kind: rest
        url: https://api.example.com/v1/orders
        max_pages: 50         # HARD page cap — required
        pagination:
          strategy: link_header
        auth:
          scheme: bearer
          token: "${ORDERS_TOKEN}"
      schema:
        - { name: order_id, type: int }
        - { name: total,    type: float }
        - { name: placed_at, type: date_time }

Pagination strategies

The pagination.strategy selects how the reader advances pages and detects the last one. Whatever the strategy, the pull always stops at the max_pages / max_records cap, even when the server keeps offering more.

  • none (default) — a single GET; the body is the whole result.

  • offset – the reader sends ?offset=N&limit=L, advancing the offset by the page size each request. The last page is the one that returns fewer rows than limit.

    pagination:
      strategy: offset
      limit: 200
      offset_param: offset     # optional, defaults shown
      limit_param: limit
    
  • cursor_token — the reader reads a continuation token from a JSON pointer in each response and sends it back on the next request. Paging stops when the token field is absent or null.

    pagination:
      strategy: cursor_token
      cursor_param: page_token
      next_token_pointer: /meta/next_page   # RFC 6901 JSON pointer
    
  • link_header — the reader follows the URL in the response’s RFC 5988 Link: <…>; rel="next" header until no such link is present.

    pagination:
      strategy: link_header
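The interaction between a strategy's own stop condition and the hard caps can be sketched for the offset strategy. This is an illustration, not the reader's code: fetch_page is a hypothetical callable standing in for the HTTP GET, and truncating the final page at max_records is an assumption about how the record cap applies.

```python
def pull_offset(fetch_page, limit, max_pages, max_records=None):
    """Pull pages until a short page arrives or a hard cap is reached.

    fetch_page(offset, limit) -> list of rows (hypothetical callable).
    """
    rows, offset = [], 0
    for _ in range(max_pages):  # hard page cap, regardless of the server
        page = fetch_page(offset, limit)
        rows.extend(page)
        if max_records is not None and len(rows) >= max_records:
            return rows[:max_records]  # hard record cap
        if len(page) < limit:
            break  # short page: this was the last one
        offset += limit
    return rows
```

With 450 rows on the server and limit: 200, the pull ends after three pages; with max_pages: 2 it ends after two pages even though the server has more.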
    

Authentication

auth.scheme selects the credential sent on every request:

  • none (default) — no auth header.

  • bearer — sends Authorization: Bearer <token>.

  • header — sends an arbitrary static header, e.g. an API key.

    auth:
      scheme: header
      name: X-API-Key
      value: "${API_KEY}"
    

Reliability and finiteness knobs

| Key | Default | Meaning |
| --- | --- | --- |
| max_pages | | Required. Hard ceiling on pages fetched, regardless of the server. |
| max_records | none | Optional hard ceiling on records emitted. |
| retries | 3 | Bounded retries on a transient failure (5xx, connect/timeout error). A 4xx is fatal: retrying cannot help. |
| timeout_secs | 30 | Per-request timeout. Bounds in-flight time so an interrupt lands within the shutdown window. |

A partial-page decode failure routes that page’s offending rows to the DLQ per-row, exactly like a file source; it does not abort the pull.
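The retry classification in the table can be sketched as a bounded loop. Several details here are assumptions rather than documented behaviour: that retries: 3 means one initial attempt plus up to three retries, that no backoff is applied, and that any status outside 2xx and 5xx is fatal. The send callable is a hypothetical stand-in for one HTTP request.

```python
def fetch_with_retries(send, retries=3):
    """Call send() until it succeeds, fails fatally, or retries run out.

    send() -> (status, body); it may raise ConnectionError or TimeoutError.
    """
    last_error = None
    for _ in range(retries + 1):  # assumed: first attempt plus `retries` retries
        try:
            status, body = send()
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc  # transient transport failure: try again
            continue
        if 200 <= status < 300:
            return body
        if 500 <= status < 600:
            last_error = RuntimeError(f"server error {status}")
            continue  # transient server failure: try again
        raise RuntimeError(f"fatal HTTP status {status}")  # e.g. a 4xx
    raise RuntimeError("retries exhausted") from last_error
```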

Shutdown

On SIGINT/SIGTERM the reader polls its cancellation handle at each page boundary and stops cleanly with a normal end-of-input — the same graceful drain a file source performs. The timeout_secs per-request bound caps how long a single in-flight request can delay that stop.

Auto-Widen & Schema Drift

When an input file carries columns the source’s declared schema: block does not name, Clinker decides what to do with them via the per-source on_unmapped policy. The engine-wide default is auto_widen — schema drift is preserved end-to-end without user-visible breakage. This chapter is the single source of truth for the absorber design, propagation rules, output controls, and diagnostics.

The three modes

- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: "./data/orders.csv"
    on_unmapped:
      mode: auto_widen     # default; other values: drop, reject
    schema:
      - { name: order_id, type: string }
      - { name: amount, type: numeric }

  • auto_widen (default) — per-record undeclared fields are absorbed into a Value::Map payload carried by an engine-stamped $widened sidecar column appended to the source’s schema. The sidecar’s payload propagates through downstream nodes and the sink expands it back to top-level columns whenever include_unmapped: true is set on the Output node (the default). Pattern precedent: Databricks Auto Loader’s _rescued_data sidecar and ClickHouse’s JSON column type.

  • drop — undeclared input fields are silently stripped at read time. No sidecar; the source’s plan-time schema equals the declared schema:. Matches Snowflake’s MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE' with ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE and dbt’s on_schema_change=ignore.

  • reject — any input record carrying a key not in the declared schema fails the source with a FormatError::UndeclaredField diagnostic naming the offending field. Strict; matches dlt’s freeze mode.

The $widened sidecar absorber

auto_widen is implemented as an on-schema sidecar: the engine appends a single column named $widened to the source’s schema, marked with FieldMetadata::WidenedSidecar. Each record’s undeclared input fields are stored as the sidecar’s Value::Map payload — keyed by input field name, valued by the read scalar.
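
As a rough model of the three modes, the Python sketch below splits a raw input record into declared fields and an absorbed map. It is not the engine's Rust code: the function name and exception class are hypothetical, and storing None when nothing was absorbed is an assumption.

```python
from typing import Any, Dict, List

SIDECAR = "$widened"


class UndeclaredField(Exception):
    """Stands in for the reject-mode diagnostic; carries the offending field name."""


def apply_on_unmapped(raw: Dict[str, Any], declared: List[str], mode: str = "auto_widen") -> Dict[str, Any]:
    known = {name: raw.get(name) for name in declared}
    extras = {k: v for k, v in raw.items() if k not in declared}
    if mode == "drop":
        return known                      # undeclared fields are stripped, no sidecar
    if mode == "reject":
        if extras:
            raise UndeclaredField(next(iter(extras)))
        return known
    if mode == "auto_widen":
        # Assumption: the slot holds None when no field was absorbed.
        known[SIDECAR] = extras or None
        return known
    raise ValueError(f"unknown on_unmapped mode: {mode}")


raw = {"order_id": "1", "amount": 9.5, "coupon": "SPRING"}
print(apply_on_unmapped(raw, ["order_id", "amount"]))
# {'order_id': '1', 'amount': 9.5, '$widened': {'coupon': 'SPRING'}}
```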

The on-schema design is deliberate. An off-schema sidecar (a parallel data structure outside Schema) is a silent-loss bug class: any code path that reconstructs a Record from schema.columns() and a value vector silently drops the side-channel. The on-schema slot inherits the same serialization, span propagation, sort/spill, and projection machinery as user-declared columns — there is no “remember to copy the sidecar” obligation on every consumer. CXL expressions cannot read or write the sidecar (the typechecker is blind to its contents); see System variables → $widened for the parser-level rejection.

Propagation through the DAG

The $widened sidecar follows these rules through downstream nodes:

| Node type | Sidecar behavior |
|---|---|
| Transform | Inherits unchanged from input (transforms are row-preserving). |
| Aggregate | Output’s $widened slot is Value::Null — per-row payloads have no canonical aggregation. Users who need an unmapped field at aggregate output must add it to group_by or emit it explicitly via an aggregate function. |
| Combine | Driver’s sidecar rides through; build-side sidecars are dropped (mirrors propagate_ck: Driver). Build-side iter_user_fields() filters every engine-stamped column from match: collect array payloads, so build $widened cannot leak into the collect array. Users can lift a build-side unmapped field via <build_qualifier>.<field> in the combine body’s CXL. |
| Route / Merge | Row-preserving — sidecar passes through. Merge requires every input source to share the same on_unmapped policy; mixing fails compile with E315 (see below). |
| Composition | Body inherits the parent’s sidecar via the synthetic input port; whatever the body’s terminal node carries flows back to the parent. The body’s terminal-node propagation rule applies (e.g. an Aggregate terminal yields Value::Null at the parent boundary, a match: first Combine terminal carries the driver’s payload). |
| Output | Sidecar expands to top-level columns when include_unmapped: true (the default). Set include_unmapped: false to strip the sidecar (and every other unmapped input field) so only explicitly-emitted columns reach the writer. |

Output controls

- type: output
  name: out
  input: src
  config:
    name: out
    type: json
    path: out.json
    include_unmapped: true    # default: true

When true (the default), fields the source absorbed into $widened are expanded back to top-level columns at the sink. Useful for pass-through pipelines where every original input field should reach the output regardless of whether it was declared in schema:. Set include_unmapped: false to strip the sidecar (and every other input field not explicitly emitted upstream) so the writer sees only user-declared columns.

include_unmapped composes independently with include_correlation_keys: true — each, both, or neither can be set. include_correlation_keys does not surface $widened; the two flags are orthogonal.

Cross-format flow

The expansion happens at the projection layer, before the writer sees the record. So a CSV source with auto_widen plus a JSON output with include_unmapped: true produces JSON objects whose top-level keys include both declared columns and absorbed input columns:

input.csv:    id,extra,city
              1,foo,Paris

output.json:  {"id": "1", "extra": "foo", "city": "Paris"}

The literal $widened slot is stripped during expansion; the writer never sees a Value::Map.
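
The expansion step can be pictured with the short Python sketch below. It assumes only id is declared in schema: (so extra and city were absorbed), and that a declared column wins if an absorbed key has the same name; the function name is hypothetical and the real projection layer is Rust.

```python
import json
from typing import Any, Dict


def project_for_writer(record: Dict[str, Any], include_unmapped: bool = True) -> Dict[str, Any]:
    # The literal sidecar slot never reaches the writer.
    out = {k: v for k, v in record.items() if k != "$widened"}
    payload = record.get("$widened")
    if include_unmapped and payload:
        for key, value in payload.items():
            out.setdefault(key, value)   # assumption: declared columns win on a name clash
    return out


record = {"id": "1", "$widened": {"extra": "foo", "city": "Paris"}}
print(json.dumps(project_for_writer(record)))
# {"id": "1", "extra": "foo", "city": "Paris"}
print(json.dumps(project_for_writer(record, include_unmapped=False)))
# {"id": "1"}
```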

Writer rejection of Value::Map payloads

CSV, XML, and fixed-width writers refuse records carrying a Value::Map payload at any column slot, raising FormatError::UnserializableMapValue { format, column }. The rejection lives in each writer’s value-to-string helper — single point of truth, no defensive prechecks. JSON serializes Value::Map natively as a nested object and does not raise.

The most common cause: the $widened sidecar reaches the writer because the Output node set include_unmapped: false. Remediation is either to leave include_unmapped at its default of true (so the projection layer expands the map to top-level columns before write) or to coerce the map to a scalar in CXL before the emit. The error message lists both routes.

The DLQ writer applies the same filter at its own layer: dlq::dlq_user_columns strips any column tagged FieldMetadata::WidenedSidecar, so the DLQ CSV header never contains $widened even when the DLQ entry’s original_record retains the auto_widen schema shape. Correlation-lattice columns ($ck.*) are retained in the DLQ output for collateral debugging.

E315 — Merge inputs must agree on policy

Merge concatenates streams positionally against the merge node’s output_schema (taken from the first input). Every input must agree on column shape — same column names, same on_unmapped policy, same correlation_key set.

If two upstream sources disagree on whether they carry the $widened sidecar (one source uses auto_widen, another uses drop / reject), compile fails:

E315: merge "merged": input schemas disagree on the `$widened` auto_widen sidecar column.

Remediation: set every merge upstream source to the same on_unmapped policy. The engine-wide default is auto_widen; for sources that should explicitly omit the sidecar, declare on_unmapped: { mode: drop } (or reject) on each.

Fixed-width sources are structurally inert

Fixed-width sources are positional — the schema is constructed from width / start..end byte ranges, and bytes outside the declared ranges are invisible to the reader. auto_widen therefore can never populate the sidecar for fixed-width sources; the slot stays Value::Null for every record.

The executor emits a tracing::info diagnostic at source-reader construction time when auto_widen is the policy on a fixed-width source, naming the source. The diagnostic fires once per reader instance; a source used as a combine build-side input across multiple combines may produce one log per combine. To avoid the noise, switch to on_unmapped: { mode: drop } (or reject) for explicit scalar semantics, or accept the empty sidecar.

Transform Nodes

Transform nodes apply CXL expressions to each record, producing new fields, filtering records, or both. They process one record at a time in streaming fashion with constant memory overhead.

Basic structure

- type: transform
  name: enrich
  input: customers
  config:
    cxl: |
      emit full_name = first_name + " " + last_name
      emit tier = if lifetime_value >= 10000 then "gold" else "standard"
      filter status == "active"

The cxl: field is required and contains a CXL program. The three core CXL statements for transforms are:

  • emit – produces an output field. Only emitted fields appear in downstream nodes.
  • filter – drops records that do not match the boolean condition.
  • let – binds a local variable for use in subsequent expressions (not emitted).

    cxl: |
      let margin = revenue - cost
      emit product_id = product_id
      emit margin = margin
      emit margin_pct = if revenue > 0 then margin / revenue * 100 else 0
      filter margin > 0
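
For readers new to CXL, the program above behaves roughly like the Python function below. This is only a reading aid under two assumptions: a filter rejects the whole record regardless of where it appears in the body, and a rejected record is modeled as None.

```python
from typing import Any, Dict, Optional


def margin_transform(record: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    margin = record["revenue"] - record["cost"]          # let: local, not emitted by itself
    if not margin > 0:                                    # filter: drop the record
        return None
    revenue = record["revenue"]
    return {                                              # only emitted fields go downstream
        "product_id": record["product_id"],
        "margin": margin,
        "margin_pct": margin / revenue * 100 if revenue > 0 else 0,
    }


print(margin_transform({"product_id": "p1", "revenue": 200, "cost": 150, "note": "x"}))
# {'product_id': 'p1', 'margin': 50, 'margin_pct': 25.0}
```

The input field note does not appear in the result because it was never emitted.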

Analytic window

The analytic_window field enables cross-source lookups by joining a secondary dataset into the transform. The secondary source is loaded into memory and indexed by the join key.

- type: transform
  name: enrich_orders
  input: orders
  config:
    analytic_window:
      source: products
      on: product_id
      group_by: [product_id]
    cxl: |
      emit order_id = order_id
      emit product_name = $window.first()
      emit quantity = quantity
      emit line_total = quantity * price

The $window.* namespace provides access to the windowed data. Functions like $window.first(), $window.last(), and $window.count() operate over the matched group.

Validations

Declarative validation checks can be attached to a transform. They run against each record and either route failures to the DLQ (severity error) or log a warning and continue (severity warn).

- type: transform
  name: validate_orders
  input: raw_orders
  config:
    cxl: |
      emit order_id = order_id
      emit amount = amount
      emit email = email
    validations:
      - field: email
        check: "not_empty"
        severity: error
        message: "Email is required"
      - check: "amount > 0"
        severity: warn
        message: "Non-positive amount"
      - field: order_id
        check: "not_empty"
        severity: error

Validation fields

| Field | Required | Description |
|---|---|---|
| field | No | Restrict the check to a single field |
| check | Yes | Validation name (e.g. "not_empty") or CXL boolean expression |
| severity | No | error (default) routes to DLQ; warn logs and continues |
| message | No | Custom error message for DLQ entries |
| name | No | Validation name for DLQ reporting. Auto-derived from field + check if omitted |
| args | No | Additional arguments as key-value pairs |

Expansion cap (max_expansion)

When a transform body contains an emit each statement, every input record can fan out into multiple output records. The max_expansion field caps how many output records a single input record may produce – a safety bound against unexpectedly large arrays.

- type: transform
  name: explode_items
  input: orders
  config:
    max_expansion: 5000      # default: 10000
    cxl: |
      emit each it in items {
        emit order_id = order_id
        emit sku = it["sku"]
        emit price = it["price"]
      }

| Field | Type | Default | Description |
|---|---|---|---|
| max_expansion | u64 | 10000 | Maximum cumulative output records per input record. |

If a single input record’s emit each block produces more than max_expansion output records, the originating record routes to the DLQ with category expansion_limit_exceeded instead of producing a truncated or unbounded result. No partial output is emitted for that record – the cap is enforced eagerly so the writer never sees records from a runaway expansion.
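
The all-or-nothing behavior can be sketched as follows. The function name and the shape of the DLQ entry are hypothetical; the point is that output for a record is buffered until its expansion completes, so crossing the cap yields a DLQ entry and no rows.

```python
from typing import Any, Dict, List, Optional, Tuple


def explode_items(record: Dict[str, Any], max_expansion: int = 10_000) -> Tuple[List[dict], Optional[dict]]:
    outputs: List[dict] = []
    for item in record["items"]:
        if len(outputs) == max_expansion:
            # One more row would exceed the cap: discard everything for this record.
            return [], {"category": "expansion_limit_exceeded", "record": record}
        outputs.append({
            "order_id": record["order_id"],
            "sku": item["sku"],
            "price": item["price"],
        })
    return outputs, None


order = {"order_id": "o1", "items": [{"sku": "a", "price": 1}, {"sku": "b", "price": 2}, {"sku": "c", "price": 3}]}
rows, dlq = explode_items(order, max_expansion=3)   # exactly at the cap: 3 rows, no DLQ entry
rows, dlq = explode_items(order, max_expansion=2)   # over the cap: no rows, one DLQ entry
```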

When to tune

  • Lower (e.g. 100, 1000) when input arrays are bounded by a known business rule and you want hostile or malformed input to surface as a DLQ entry rather than as a flood of downstream records.
  • Higher (e.g. 100000, 1000000) when legitimate input carries large arrays – for example, an order with a long line-item list or an event carrying a per-second pricing curve.

The DLQ category expansion_limit_exceeded is distinct from generic CXL evaluation failures, so DLQ-side filters and metrics can target expansion runaway specifically. See Error Handling & DLQ for the wider DLQ contract.

Batch size (batch_size)

A streaming-eligible transform hands its output downstream in bounded batches rather than accumulating the whole stage before the next stage runs. batch_size sets how many events (records plus document-boundary punctuations) a batch holds. A per-transform batch_size overrides the pipeline-level pipeline.batch_size for this one stage; omit it to inherit the pipeline value (or the built-in default of 2048).

- type: transform
  name: enrich
  input: orders
  config:
    batch_size: 512         # override pipeline.batch_size for this stage
    cxl: |
      emit order_id = order_id
      emit total = quantity * unit_price

| Field | Type | Default | Description |
|---|---|---|---|
| batch_size | usize | inherits pipeline.batch_size (else 2048) | Events per streaming batch for this transform. Must be >= 1. |

A batch_size of 0 is rejected at config load (a zero-event batch never flushes). Smaller batches lower the in-flight memory of a streaming stage at the cost of more per-batch bookkeeping; larger batches amortize the bookkeeping at the cost of a larger live working set. The default suits typical record widths — tune it only when a profiling run shows a streaming stage’s per-batch footprint matters. See Streaming vs. Blocking Stages for which stages stream and which fully materialize.

Log directives

Log directives control diagnostic output during transform execution:

- type: transform
  name: process
  input: validated
  config:
    cxl: |
      emit id = id
      emit result = compute(value)
    log:
      - level: info
        when: per_record
        every: 1000
        message: "Processed record"
      - level: warn
        when: on_error
        message: "Record failed processing"
      - level: debug
        when: before_transform
        message: "Starting transform"

Log directive fields

| Field | Required | Description |
|---|---|---|
| level | Yes | trace, debug, info, warn, or error |
| when | Yes | before_transform, after_transform, per_record, or on_error |
| message | Yes | Log message text |
| every | No | Only log every N records (for per_record timing) |
| condition | No | CXL boolean expression – only log when true |
| fields | No | List of field names to include in the log output |
| log_rule | No | Reference to an external log rule definition |

Complete example

- type: source
  name: employees
  config:
    name: employees
    type: csv
    path: "./data/employees.csv"
    schema:
      - { name: employee_id, type: string }
      - { name: first_name, type: string }
      - { name: last_name, type: string }
      - { name: department, type: string }
      - { name: salary, type: int }
      - { name: hire_date, type: date }

- type: transform
  name: enrich_employees
  description: "Compute display name and tenure"
  input: employees
  config:
    cxl: |
      emit employee_id = employee_id
      emit display_name = last_name + ", " + first_name
      emit department = department.upper()
      emit salary = salary
      emit annual_bonus = if salary >= 80000 then salary * 0.15
        else salary * 0.10
    validations:
      - field: employee_id
        check: "not_empty"
        severity: error
        message: "Employee ID is required"
      - check: "salary > 0"
        severity: warn
        message: "Salary should be positive"
    log:
      - level: info
        when: per_record
        every: 5000
        message: "Processing employees"

Aggregate Nodes

Aggregate nodes group records by one or more fields and compute summary values using CXL aggregate functions. They consume all input records in a group before emitting a single summary record per group.

Basic structure

- type: aggregate
  name: dept_totals
  input: employees
  config:
    group_by: [department]
    cxl: |
      emit total_salary = sum(salary)
      emit headcount = count(*)
      emit avg_salary = avg(salary)

Group-by fields pass through automatically – you do not need to emit them. In this example, the output records contain department, total_salary, headcount, and avg_salary.

Group-by fields

The group_by: field is a list of field names from the input schema. Records sharing the same values for all group-by fields are placed in the same group.

    group_by: [region, department]
    cxl: |
      emit total_salary = sum(salary)
      emit max_salary = max(salary)

This produces one output record per unique (region, department) combination.

Global aggregation

An empty group_by list treats the entire input as a single group, producing exactly one output record:

- type: aggregate
  name: grand_totals
  input: orders
  config:
    group_by: []
    cxl: |
      emit grand_total = sum(amount)
      emit record_count = count(*)
      emit avg_order = avg(amount)

Aggregate functions

The following aggregate functions are available in CXL:

| Function | Description |
|---|---|
| sum(field) | Sum of all values in the group |
| count(*) | Number of records in the group |
| avg(field) | Arithmetic mean |
| min(field) | Minimum value |
| max(field) | Maximum value |
| collect(field) | Collect all values into an array |
| weighted_avg(value, weight) | Weighted average |

Strategy hint

The strategy: field controls how aggregation is executed:

- type: aggregate
  name: totals
  input: sorted_data
  config:
    group_by: [account_id]
    strategy: streaming
    cxl: |
      emit total = sum(amount)

| Strategy | Behavior |
|---|---|
| auto | Default. The optimizer chooses based on whether the input is provably sorted for the group-by keys. |
| hash | Force hash aggregation. Works on any input ordering. Holds all groups in memory (with disk spill if memory budget is exceeded). |
| streaming | Require streaming aggregation. Processes one group at a time with O(1) memory per group. Compile-time error if the input is not provably sorted for the group-by keys. |

When to use streaming

If your source declares a sort_order: that covers the group-by fields, the optimizer will automatically choose streaming aggregation. Use strategy: streaming as an explicit assertion – it turns a silent fallback to hash aggregation into a compile error, which is useful for catching sort-order regressions.
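
The difference between the two strategies is easy to see in a small Python model (illustrative only; the function names are hypothetical). The streaming version holds one accumulator at a time and is only correct on input sorted by the key, which is why the engine wants the sort order proven at compile time.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, Iterator, List, Tuple


def streaming_sum(rows: Iterable[dict], key: str, field: str) -> Iterator[Tuple[str, float]]:
    """Emit one total per run of equal keys; state is a single group's accumulator."""
    for k, group in groupby(rows, key=itemgetter(key)):
        yield k, sum(r[field] for r in group)


def hash_sum(rows: Iterable[dict], key: str, field: str) -> Dict[str, float]:
    """Hold every group's accumulator at once; input order does not matter."""
    totals: Dict[str, float] = {}
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0) + r[field]
    return totals


sorted_rows = [
    {"account_id": "a", "amount": 1},
    {"account_id": "a", "amount": 2},
    {"account_id": "b", "amount": 5},
]
print(list(streaming_sum(sorted_rows, "account_id", "amount")))  # [('a', 3), ('b', 5)]

unsorted_rows = [sorted_rows[0], sorted_rows[2], sorted_rows[1]]
print(list(streaming_sum(unsorted_rows, "account_id", "amount")))  # [('a', 1), ('b', 5), ('a', 2)]
print(hash_sum(unsorted_rows, "account_id", "amount"))             # {'a': 3, 'b': 5}
```

On the unsorted input the streaming model splits account a into two groups, which is the kind of silent error that strategy: streaming turns into a compile-time failure.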

When to use hash

Hash aggregation works on unsorted input and is the safe default. It uses more memory but handles any data ordering. Memory-aware disk spill kicks in when RSS approaches the pipeline’s memory.limit.

Correlation-key interaction

In a pipeline whose sources declare correlation_key: fields, the engine inspects each aggregate’s group_by against the upstream CK lattice (the union of $ck.* shadow columns visible at the aggregate’s input):

  • group_by covers every upstream CK field — strict-collateral path. The aggregate emits one row per group, the row inherits the correlation identity of its inputs, and a DLQ trigger anywhere in the group rolls back the whole group including the aggregate output. Zero retraction overhead.
  • group_by omits any upstream CK field — retraction protocol path. A single correlation group may span multiple aggregate groups; CK fields omitted from group_by stop being visible to downstream consumers of this aggregate’s output as user-named columns. The engine retracts only the failing records and refinalizes affected groups.

Authors do not configure this — the engine selects the path automatically based on group_by content. A retraction-mode aggregate is incompatible with strategy: streaming (rejected with E15Y, because streaming aggregates emit at group-boundary close before the terminal correlation commit and that defeats the rollback window). See Correlation Keys for the full lattice rules.

A retraction-mode aggregate emits one engine-managed $ck.aggregate.<name> shadow column on its output schema, alongside [group_by_columns] ++ [emitted_binding_columns]. The column carries the aggregator’s per-group index at finalize and costs ~16 bytes per emitted row (the Value::Integer payload plus its slot overhead); it is hidden from default writer output. The synthetic column is the lineage hook that lifts the post-aggregate retract path: a Transform or Output that fails on an aggregate output row carries the column on the failing record, the orchestrator’s detect phase decodes the index back to the contributing source row ids, and the recompute phase retracts those source rows so the failing aggregate row’s contributors are removed from the writer payload — matching the upstream-failure DLQ-fan-out semantic. See Correlation Keys → Where retraction triggers are sourced and the runnable demo at examples/pipelines/retract-demo/.

The retraction protocol carries a per-aggregate cost — Reversible accumulators use a per-row lineage map, BufferRequired accumulators hold raw contributions until commit. Both paths additionally pay ~16 bytes per output row for the synthetic-CK shadow column. The operator-by-operator retraction cost reference has the per-operator breakdown; clinker run --explain reports the live per-aggregate detail including the synthetic-CK line.

Time-windowed aggregates

When time_window: is set on the aggregate body, the operator groups records not just by group_by but also by event-time window. Each record is assigned to one or more windows by the engine-stamped $source.event_time column; state accumulates per (group_by, window); a window closes once min_across_sources >= window_end + allowed_lateness and emits one row per group it saw. The shape parallels Flink SQL Window TVFs, Spark Structured Streaming window / session_window, and Beam windowing.

Every upstream-reachable source must declare a watermark:. Otherwise min_across_sources stays at None, no window ever closes, and the planner rejects the pipeline with E156.

The engine emits user-declared columns only — window bounds do not appear in the output unless you compute and emit them yourself. The emit order is ascending window_start (deterministic), so output rows naturally group by window.

Tumbling windows

Non-overlapping fixed-size buckets. Each record lands in exactly one window [floor(t / size) * size, floor(t / size) * size + size).

time_window:
  tumbling: { size: 1h }

Input (tumbling_demo.csv):

user_id,event_ts,kind
u1,2026-05-14T10:05:00,click
u2,2026-05-14T10:30:00,click
u1,2026-05-14T10:42:00,click
u1,2026-05-14T11:03:00,click
u2,2026-05-14T11:15:00,click
u2,2026-05-14T11:50:00,click

Output with tumbling: { size: 1h }, group_by: [user_id], emit n = count(*):

user_id,n
u1,2
u2,1
u1,1
u2,2

Reading top-to-bottom: the first two rows are the [10:00, 11:00) bucket (u1’s 10:05 and 10:42, then u2’s 10:30); the next two are the [11:00, 12:00) bucket (u1’s 11:03, then u2’s 11:15 and 11:50). Each input record contributes to exactly one window.
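
The bucket formula can be checked against the demo with a few lines of Python. The sketch assumes naive timestamps aligned to the Unix epoch and orders rows within a window by user_id; both are assumptions for illustration, not documented engine behavior.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

EPOCH = datetime(1970, 1, 1)


def tumbling_window(ts: datetime, size: timedelta) -> Tuple[datetime, datetime]:
    """[floor(t / size) * size, floor(t / size) * size + size)"""
    start = EPOCH + ((ts - EPOCH) // size) * size
    return start, start + size


def tumbling_counts(events: List[Tuple[str, datetime]], size: timedelta) -> List[Tuple[str, int]]:
    counts = {}
    for user, ts in events:
        start, _end = tumbling_window(ts, size)
        counts[(start, user)] = counts.get((start, user), 0) + 1
    # Ascending window_start, then user_id (the tie-break is an assumption).
    return [(user, n) for (_start, user), n in sorted(counts.items())]


parse = datetime.fromisoformat
events = [
    ("u1", parse("2026-05-14T10:05:00")),
    ("u2", parse("2026-05-14T10:30:00")),
    ("u1", parse("2026-05-14T10:42:00")),
    ("u1", parse("2026-05-14T11:03:00")),
    ("u2", parse("2026-05-14T11:15:00")),
    ("u2", parse("2026-05-14T11:50:00")),
]
print(tumbling_counts(events, timedelta(hours=1)))
# [('u1', 2), ('u2', 1), ('u1', 1), ('u2', 2)]
```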

Hopping windows

Overlapping fixed-size buckets advanced by slide. Each record lands in at most ceil(size / slide) windows: slide < size produces overlap, slide == size degenerates to tumbling (exactly one window per record), and slide > size leaves gaps, so some records fall in zero windows.

time_window:
  hopping: { size: 1h, slide: 30m }

Input (hopping_demo.csv):

user_id,event_ts,amount
u1,2026-05-14T10:05:00,10
u1,2026-05-14T10:42:00,20
u1,2026-05-14T11:10:00,15

Output with group_by: [user_id], emit total = sum(amount), emit n = count(*):

user_id,total,n
u1,10,1
u1,30,2
u1,35,2
u1,15,1

Three input records, four output rows — each record fans into two overlapping size: 1h, slide: 30m windows:

  • [09:30, 10:30) — just 10:05 → total=10, n=1
  • [10:00, 11:00) — 10:05 + 10:42 → total=30, n=2
  • [10:30, 11:30) — 10:42 + 11:10 → total=35, n=2
  • [11:00, 12:00) — just 11:10 → total=15, n=1
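
The same assignment can be reproduced in Python. This sketch assumes window starts are multiples of slide counted from the Unix epoch, which matches the 09:30 / 10:00 / 10:30 / 11:00 starts listed above; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

EPOCH = datetime(1970, 1, 1)


def hopping_windows(ts: datetime, size: timedelta, slide: timedelta) -> List[Tuple[datetime, datetime]]:
    """Every window [start, start + size) with start on a slide boundary that contains ts."""
    start = EPOCH + ((ts - EPOCH) // slide) * slide   # latest start at or before ts
    windows = []
    while start + size > ts:                          # walk back while the window still covers ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def hopping_totals(events, size: timedelta, slide: timedelta):
    acc = {}
    for user, ts, amount in events:
        for start, _end in hopping_windows(ts, size, slide):
            total, n = acc.get((start, user), (0, 0))
            acc[(start, user)] = (total + amount, n + 1)
    return [(user, total, n) for (_start, user), (total, n) in sorted(acc.items())]


parse = datetime.fromisoformat
events = [
    ("u1", parse("2026-05-14T10:05:00"), 10),
    ("u1", parse("2026-05-14T10:42:00"), 20),
    ("u1", parse("2026-05-14T11:10:00"), 15),
]
print(hopping_totals(events, timedelta(hours=1), timedelta(minutes=30)))
# [('u1', 10, 1), ('u1', 30, 2), ('u1', 35, 2), ('u1', 15, 1)]
```

With slide larger than size the loop body may never run, which is the zero-window gap case.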

Session windows

Per-key gap-bounded sessions. A new record extends its key’s current session if its event time is within gap of the session’s last event time; otherwise it starts a new session. The boundary is data-driven, not clock-aligned.

time_window:
  session: { gap: 10m }

Input (session_demo.csv):

user_id,event_ts,action
u1,2026-05-14T10:00:00,login
u1,2026-05-14T10:07:00,click
u1,2026-05-14T10:13:00,click
u1,2026-05-14T10:50:00,login
u1,2026-05-14T10:55:00,click

Output with group_by: [user_id], emit n = count(*):

user_id,n
u1,3
u1,2

u1’s first three rows form one session (10:00 → 10:07 → 10:13, consecutive gaps ≤ 10m). The 37-minute idle stretch exceeds gap, so 10:50 starts a fresh session that runs through 10:55. Two sessions, two output rows.
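
A batch-style Python model of the gap rule is shown below. It sorts each key's events up front, which the engine does not need to do (it works incrementally under watermarks), so treat it as a check on the demo rather than a description of the operator.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def session_counts(events: List[Tuple[str, datetime]], gap: timedelta) -> List[Tuple[str, int]]:
    by_user = {}
    for user, ts in events:
        by_user.setdefault(user, []).append(ts)

    sessions = []  # (session_start, user, count)
    for user, stamps in by_user.items():
        stamps.sort()
        start, last, count = stamps[0], stamps[0], 1
        for ts in stamps[1:]:
            if ts - last <= gap:          # within gap of the last event: extend
                count += 1
            else:                         # idle stretch exceeds gap: new session
                sessions.append((start, user, count))
                start, count = ts, 1
            last = ts
        sessions.append((start, user, count))
    return [(user, n) for _start, user, n in sorted(sessions)]


parse = datetime.fromisoformat
events = [
    ("u1", parse("2026-05-14T10:00:00")),
    ("u1", parse("2026-05-14T10:07:00")),
    ("u1", parse("2026-05-14T10:13:00")),
    ("u1", parse("2026-05-14T10:50:00")),
    ("u1", parse("2026-05-14T10:55:00")),
]
print(session_counts(events, timedelta(minutes=10)))
# [('u1', 3), ('u1', 2)]
```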

Allowed lateness

allowed_lateness is an operator-side knob, distinct from the source-side watermark.delay. On an aggregate with time_window: set, a window closes when min_across_sources >= window_end + allowed_lateness. Records that arrive for a window after it has closed route to the DLQ as LateRecord with stage label time_window:<aggregate-name>. See DLQ category: LateRecord for the DLQ row layout.

- type: aggregate
  name: hourly
  input: clicks
  config:
    group_by: [user_id]
    time_window:
      tumbling: { size: 1h }
    allowed_lateness: 30s
    cxl: |
      emit n = count(*)

Default (unset) means no grace beyond the watermark — windows close the instant min_across_sources crosses window_end. Set allowed_lateness when the source’s watermark.delay alone is too small to absorb the observed out-of-order tail.
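
The close condition is a single comparison, sketched here in Python. The per-source watermark dictionary and the function names are hypothetical; a source without a watermark is modeled as None, which keeps every window open.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional


def min_across_sources(watermarks: Dict[str, Optional[datetime]]) -> Optional[datetime]:
    """Lowest watermark over all sources, or None if any source has not reported one."""
    if not watermarks or any(w is None for w in watermarks.values()):
        return None
    return min(watermarks.values())


def window_is_closed(
    window_end: datetime,
    watermarks: Dict[str, Optional[datetime]],
    allowed_lateness: timedelta = timedelta(0),
) -> bool:
    low = min_across_sources(watermarks)
    return low is not None and low >= window_end + allowed_lateness


end = datetime(2026, 5, 14, 11, 0, 0)
marks = {"src_web": datetime(2026, 5, 14, 11, 0, 40), "src_mobile": datetime(2026, 5, 14, 11, 0, 10)}
print(window_is_closed(end, marks, timedelta(seconds=30)))   # False: src_mobile lags
marks["src_mobile"] = datetime(2026, 5, 14, 11, 0, 30)
print(window_is_closed(end, marks, timedelta(seconds=30)))   # True
```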

Worked example: multi-source session window

This pipeline merges two independent login feeds and groups per-user events into gap-bounded sessions. Issue #61 tracks the multi-source synchronisation contract: a window cannot close until every upstream source has advanced its watermark past window_end + allowed_lateness.

pipeline:
  name: multi_source_session

nodes:
  - type: source
    name: src_web
    description: Web login events.
    config:
      name: src_web
      type: csv
      path: ./data/session_logins.csv
      options:
        has_header: true
      watermark:
        column: event_ts
      schema:
        - { name: user_id, type: string }
        - { name: event_ts, type: date_time }
        - { name: source, type: string }

  - type: source
    name: src_mobile
    description: Mobile login events.
    config:
      name: src_mobile
      type: csv
      path: ./data/session_mobile.csv
      options:
        has_header: true
      watermark:
        column: event_ts
      schema:
        - { name: user_id, type: string }
        - { name: event_ts, type: date_time }
        - { name: source, type: string }

  - type: merge
    name: all_logins
    inputs: [src_web, src_mobile]

  - type: aggregate
    name: user_sessions
    input: all_logins
    config:
      group_by: [user_id]
      time_window:
        session: { gap: 5m }
      allowed_lateness: 30s
      cxl: |
        emit user_id = user_id
        emit logins = count(*)

  - type: output
    name: results
    input: user_sessions
    config:
      name: results
      type: csv
      path: ./output/multi_source_session.csv

Both sources declare their own watermark.column independently. At ingest, each record gets the engine-stamped $source.event_time column, so the aggregate is column-name-agnostic about which source delivered any given record. The aggregate’s close decision reads min_across_sources across both sources’ partitions: a session cannot emit until both src_web and src_mobile have advanced past the session’s end + allowed_lateness. Drop the watermark: block on either source and the planner rejects the pipeline with E156.

Run it from the repo:

cargo run -p clinker -- run examples/pipelines/multi_source_session.yaml

Complete example

- type: source
  name: transactions
  config:
    name: transactions
    type: csv
    path: "./data/transactions.csv"
    schema:
      - { name: account_id, type: string }
      - { name: txn_date, type: date }
      - { name: amount, type: float }
      - { name: category, type: string }
    sort_order:
      - { field: "account_id", order: asc }

- type: aggregate
  name: account_summary
  input: transactions
  config:
    group_by: [account_id]
    strategy: streaming
    cxl: |
      emit total_amount = sum(amount)
      emit txn_count = count(*)
      emit avg_amount = avg(amount)
      emit max_amount = max(amount)
      emit categories = collect(category)

- type: output
  name: summary_output
  input: account_summary
  config:
    name: summary_output
    type: csv
    path: "./output/account_summary.csv"

Route Nodes

Route nodes split a stream of records into named branches based on CXL boolean conditions. Each branch becomes an independent output port that downstream nodes can wire to using port syntax.

Basic structure

- type: route
  name: split_by_value
  input: orders
  config:
    mode: exclusive
    conditions:
      high: "amount.to_int() > 1000"
      medium: "amount.to_int() > 100"
    default: low

This creates three output ports: split_by_value.high, split_by_value.medium, and split_by_value.low.

Conditions

The conditions: field is an ordered map of branch names to CXL boolean expressions. Each expression is evaluated against the incoming record.

    conditions:
      priority: "urgency == \"high\" and amount > 500"
      standard: "urgency == \"medium\""
      bulk: "quantity > 100"
    default: other

Condition keys become the port names used in downstream input: wiring.

Default branch

The default: field is required. Records that match no condition are routed to the default branch. The default branch name must not collide with any condition key.

Routing modes

Exclusive (default)

In exclusive mode, conditions are evaluated in declaration order and the first matching condition wins. A record appears in exactly one branch. Order matters – put more specific conditions first.

    mode: exclusive
    conditions:
      vip: "lifetime_value > 100000"
      high: "lifetime_value > 10000"
      medium: "lifetime_value > 1000"
    default: standard

A customer with lifetime_value = 50000 fails the vip condition (50000 is not > 100000) but satisfies both high and medium. Because exclusive mode stops at the first match in declaration order, the record goes to high only.

Inclusive

In inclusive mode, all matching conditions route the record. A single record can appear in multiple branches simultaneously.

    mode: inclusive
    conditions:
      needs_review: "amount > 10000"
      flagged: "status == \"flagged\""
      international: "country != \"US\""
    default: standard

A flagged international order over 10000 would appear in needs_review, flagged, and international – three copies routed to three branches.
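
Both modes reduce to the same loop with a different stopping rule. The Python sketch below models conditions as an ordered list of (branch name, predicate) pairs; the function name is hypothetical and the predicates stand in for CXL expressions.

```python
from typing import Any, Callable, Dict, List, Tuple

Condition = Tuple[str, Callable[[Dict[str, Any]], bool]]


def route(record: Dict[str, Any], conditions: List[Condition], default: str, mode: str = "exclusive") -> List[str]:
    """Return the branch names that receive this record."""
    if mode not in ("exclusive", "inclusive"):
        raise ValueError(f"unknown mode: {mode}")
    matches = [name for name, predicate in conditions if predicate(record)]
    if not matches:
        return [default]                 # no condition matched
    return matches[:1] if mode == "exclusive" else matches


tiers = [
    ("vip", lambda r: r["lifetime_value"] > 100000),
    ("high", lambda r: r["lifetime_value"] > 10000),
    ("medium", lambda r: r["lifetime_value"] > 1000),
]
print(route({"lifetime_value": 50000}, tiers, "standard"))   # ['high']

flags = [
    ("needs_review", lambda r: r["amount"] > 10000),
    ("flagged", lambda r: r["status"] == "flagged"),
    ("international", lambda r: r["country"] != "US"),
]
order = {"amount": 25000, "status": "flagged", "country": "DE"}
print(route(order, flags, "standard", mode="inclusive"))
# ['needs_review', 'flagged', 'international']
```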

Downstream wiring

Downstream nodes reference route branches using port syntax: route_name.branch_name.

- type: route
  name: classify
  input: transactions
  config:
    mode: exclusive
    conditions:
      high: "amount > 1000"
      medium: "amount > 100"
    default: low

- type: transform
  name: high_value_processing
  input: classify.high
  config:
    cxl: |
      emit txn_id = txn_id
      emit amount = amount
      emit review_flag = true

- type: transform
  name: standard_processing
  input: classify.medium
  config:
    cxl: |
      emit txn_id = txn_id
      emit amount = amount

- type: output
  name: low_value_out
  input: classify.low
  config:
    name: low_value_out
    type: csv
    path: "./output/low_value.csv"

Constraints

  • At least 1 condition is required.
  • Maximum 256 branches (conditions + default).
  • Branch names must be unique.
  • The default name must not collide with any condition key.

Complete example

pipeline:
  name: order_routing

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: csv
      path: "./data/orders.csv"
      schema:
        - { name: order_id, type: int }
        - { name: region, type: string }
        - { name: amount, type: float }
        - { name: priority, type: string }

  - type: route
    name: by_region
    input: orders
    config:
      mode: exclusive
      conditions:
        domestic: "region == \"US\" or region == \"CA\""
        emea: "region == \"UK\" or region == \"DE\" or region == \"FR\""
        apac: "region == \"JP\" or region == \"AU\" or region == \"SG\""
      default: other

  - type: output
    name: domestic_orders
    input: by_region.domestic
    config:
      name: domestic_orders
      type: csv
      path: "./output/domestic.csv"

  - type: output
    name: emea_orders
    input: by_region.emea
    config:
      name: emea_orders
      type: csv
      path: "./output/emea.csv"

  - type: output
    name: apac_orders
    input: by_region.apac
    config:
      name: apac_orders
      type: csv
      path: "./output/apac.csv"

  - type: output
    name: other_orders
    input: by_region.other
    config:
      name: other_orders
      type: csv
      path: "./output/other_regions.csv"

Merge Nodes

Merge nodes concatenate multiple upstream branches into a single stream. They are the counterpart to route nodes – where a route splits one stream into many, a merge joins many streams back into one.

Merge is for streamwise concatenation of inputs that share a schema. For record-level joining across inputs that have different schemas, see Combine Nodes.

Basic structure

- type: merge
  name: combined
  inputs:
    - east_data
    - west_data
  config: {}

Note the key differences from other node types:

  • Uses inputs: (plural), not input: (singular).
  • The config: block carries no wiring – all wiring is on the node header. It may be empty ({}), which selects the default mode (see Modes).
  • Using input: (singular) on a merge node is a parse error.

Wiring

The inputs: field is a list of upstream node references. These can be bare node names or port references from route nodes:

- type: merge
  name: rejoin
  inputs:
    - process_high
    - process_medium
    - classify.low           # Port syntax for a route branch
  config: {}

Downstream nodes wire to the merge as a normal single-input reference:

- type: output
  name: final_output
  input: rejoin
  config:
    name: final_output
    type: csv
    path: "./output/combined.csv"

Modes

Merge’s cross-input ordering discipline is selected by config.mode. Two modes exist; concat is the default.

concat (default)

Predecessor records drain in declaration order: inputs[0] flows to output first, then inputs[1], then inputs[2], and so on. Within a single predecessor, per-source FIFO order is preserved. Output is reproducible run-to-run.

- type: merge
  name: combined
  inputs: [east, west]
  config:
    mode: concat

interleave

Records flow to output as they become available from any predecessor. Per-source FIFO is preserved within each input; cross-input order follows wall-clock arrival and is non-deterministic.

- type: merge
  name: combined
  inputs: [east, west]
  config:
    mode: interleave

When every direct predecessor of an unseeded interleave merge is a Source node, the executor fuses the Merge into the source ingest loop — predecessor channels are polled directly and Merge consumption proceeds at live ingest rate without any intermediate buffering tier.

Seeded interleave — interleave_seed:

Snapshot tests and benchmarks that need reproducible cross-input ordering can opt into a deterministic schedule:

- type: merge
  name: combined
  inputs: [east, west]
  config:
    mode: interleave
    interleave_seed: 42

A seeded interleave bypasses the fused live-channel path. The Merge instead pre-buffers each predecessor’s output into a Vec, then emits records in fastrand-driven order seeded by interleave_seed. Output is reproducible regardless of upstream timing — at the cost of opting out of live back-pressure across this Merge (see below).

Back-pressure semantics

How a slow consumer or slow upstream reader propagates back through the DAG depends on the merge mode.

concat

Each Source ingest task pushes into its own bounded mpsc channel (capacity 1024 records per Source). Peer sources produce concurrently up to that capacity — the dispatch arm just consumes from inputs[0]’s channel before turning to inputs[1]’s.

Consequences:

  • Memory: a non-leading input can hold up to one channel’s worth of buffered records before its producer blocks. Multi-input concat over N Sources may carry up to (N - 1) × 1024 records in flight even while only one input is being drained; with four Sources, that is up to 3 × 1024 = 3,072 records.
  • Latency: a record produced by inputs[1] while inputs[0] is still draining will not reach output until inputs[0] finishes, regardless of how fast it was produced.
  • Producer-side back-pressure: when a non-leading input’s channel fills, its reader blocks at blocking_send, propagating pressure back to the upstream file/network reader. The upstream is throttled even though it is not the currently-consumed input.

concat is the right choice when downstream consumers depend on declaration-ordered records (e.g. snapshot tests asserting on byte-identical output) or when the inputs represent ordered time partitions that must remain contiguous.

interleave (unseeded)

Fused with Source predecessors, the Merge arm polls every predecessor’s channel concurrently. Live back-pressure flows end-to-end:

  • A slow downstream operator delays Merge consumption, which fills the predecessor channels, which blocks the Source reader tasks.
  • A fast input does not wait on a slow peer — the Merge schedules whichever channel has a ready record.

When predecessors are not all Sources (e.g. Transform → Merge), fusion does not apply and the Merge consumes pre-buffered predecessor outputs in round-robin order; live back-pressure across the Merge boundary itself is unavailable in that shape, though the upstream operator’s own bounded buffer still throttles its predecessors.
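
For illustration, here is a shape in which fusion would not apply under the rule above, because one Merge input is a Transform rather than a Source. The node names (east, west, east_clean) and the CXL body are hypothetical, and the sketch assumes both inputs expose the same columns.

```yaml
# Hypothetical names; assumes east and west are Source nodes with matching schemas.
- type: transform
  name: east_clean            # a Transform sitting between a Source and the Merge
  input: east
  config:
    cxl: |
      emit sale_id = sale_id
      emit amount = amount

- type: merge
  name: combined
  inputs:
    - east_clean              # Transform predecessor: not every input is a Source
    - west                    # Source predecessor
  config:
    mode: interleave          # unseeded, but not fused in this shape
```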

Unseeded interleave is the right choice when end-to-end latency matters and the downstream consumer is order-insensitive (e.g. an aggregator grouping on a key, or a writer that does not assert on row sequencing).

interleave (seeded)

The seeded path does not preserve live back-pressure across the Merge: it pre-buffers each predecessor’s full output into a Vec before emitting in fastrand-driven order. A slow consumer downstream of a seeded Merge will not throttle the Source readers while the buffers are still filling.

If you need both run-to-run determinism and live back-pressure, prefer asserting on the multiset of records rather than their sequence and use unseeded interleave, or fall back to concat over deterministically-declared inputs.

Record ordering

Records arrive in the order described by the mode in use — see Modes and Back-pressure semantics above. If you need sorted output regardless of merge mode, apply a sort_order on the downstream output node.
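
A minimal sketch of that arrangement, assuming two hypothetical sources named jan_sales and feb_sales that share sale_id and region columns: the Merge runs in interleave mode, and the downstream output imposes the final order with sort_order.

```yaml
# Hypothetical node names and path; sort_order keys as documented under Output Nodes.
- type: merge
  name: all_sales
  inputs: [jan_sales, feb_sales]
  config:
    mode: interleave          # cross-input arrival order is non-deterministic

- type: output
  name: sorted_sales
  input: all_sales
  config:
    name: sorted_sales
    type: csv
    path: "./output/sorted_sales.csv"
    sort_order:               # applied at the output, independent of merge mode
      - { field: "region", order: asc }
      - { field: "sale_id", order: asc }
```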

Use cases

Reuniting route branches

The most common pattern is routing records through different processing paths and then merging them back together:

- type: route
  name: classify
  input: orders
  config:
    mode: exclusive
    conditions:
      high: "amount > 1000"
    default: standard

- type: transform
  name: process_high
  input: classify.high
  config:
    cxl: |
      emit order_id = order_id
      emit amount = amount
      emit surcharge = amount * 0.02
      emit tier = "premium"

- type: transform
  name: process_standard
  input: classify.standard
  config:
    cxl: |
      emit order_id = order_id
      emit amount = amount
      emit surcharge = 0
      emit tier = "standard"

- type: merge
  name: all_orders
  inputs:
    - process_high
    - process_standard
  config: {}

- type: output
  name: result
  input: all_orders
  config:
    name: result
    type: csv
    path: "./output/all_orders.csv"

Unioning multiple sources

Merge nodes can combine records from multiple source files that share the same schema:

- type: source
  name: jan_sales
  config:
    name: jan_sales
    type: csv
    path: "./data/sales_jan.csv"
    schema:
      - { name: sale_id, type: int }
      - { name: amount, type: float }
      - { name: region, type: string }

- type: source
  name: feb_sales
  config:
    name: feb_sales
    type: csv
    path: "./data/sales_feb.csv"
    schema:
      - { name: sale_id, type: int }
      - { name: amount, type: float }
      - { name: region, type: string }

- type: merge
  name: all_sales
  inputs:
    - jan_sales
    - feb_sales
  config: {}

- type: aggregate
  name: totals
  input: all_sales
  config:
    group_by: [region]
    cxl: |
      emit total = sum(amount)
      emit count = count(*)

Schema constraints across inputs

Merge concatenates streams positionally against the merge node’s output_schema (taken from the first input). Every input must therefore agree on column shape — same column names, same on_unmapped policy, same correlation_key set.

Disagreement on the $widened auto_widen sidecar (one source uses auto_widen, another uses drop / reject) fails compilation with E315. See Auto-Widen & Schema Drift → E315 for the full diagnostic shape and remediation.

Combine Nodes

Combine nodes are the N-ary record-combining operator. Every input is declared up front and bound to a qualifier; the where: expression matches records across inputs using qualified field references (e.g. orders.product_id == products.product_id); the cxl: body shapes the output row.

Combine is distinct from merge: merge concatenates upstream branches that share a schema, while combine joins records across inputs that have different schemas.

Basic structure

- type: combine
  name: enrich
  input:
    orders: orders         # qualifier: upstream node name
    products: products
  config:
    where: "orders.product_id == products.product_id"
    match: first
    on_miss: null_fields
    cxl: |
      emit order_id = orders.order_id
      emit product_name = products.product_name
      emit amount = orders.amount
    propagate_ck: driver

Note the differences from other node types:

  • Uses input: as a map, binding qualifier names to upstream node references. Other nodes use input: as a single string or inputs: as a list of strings.
  • Every field reference inside where: and cxl: must be qualified (<qualifier>.<field>). Bare field names are a compile error.
  • Using inputs: (plural list) on a combine node is a parse error.

Wiring

Each entry in the input: map binds a qualifier to an upstream node:

  input:
    orders: orders                  # qualifier "orders" -> source node "orders"
    products: products
    high_priority: classify.high    # qualifier "high_priority" -> route port

Qualifiers are local names used inside where: and cxl:; they do not need to match the upstream node name. Upstream references can be bare node names or port references from a route node.

Iteration order in the input: map is preserved and used as the default driver-selection order (see Choosing the driving input below).

Configuration fields

| Field | Required | Default | Description |
|---|---|---|---|
| where | Yes | – | CXL boolean expression matching records across inputs. Must contain at least one cross-input equality. |
| match | No | first | Match cardinality: first, all, or collect. |
| on_miss | No | null_fields | Driver-record handling on zero matches: null_fields, skip, or error. |
| cxl | Yes (except under match: collect) | – | Emit statements defining the output row. Empty under match: collect. |
| drive | No | first input | Explicit driver-input qualifier. Overrides the iteration-order default. |
| strategy | No | auto | Execution strategy hint: auto or grace_hash. |
| propagate_ck | Yes | – | Selects which correlation-key columns ride onto the output. driver keeps the driver’s CK only; all unions every input’s CK columns; { named: [<field>, ...] } carries an explicit subset. See Correlation-key propagation below. |

The where: predicate

The where: expression is a CXL boolean expression evaluated for every candidate record pair across inputs. It must contain at least one cross-input equality – an equality with field references from two different inputs:

  where: "orders.product_id == products.product_id"

Compound predicates combine multiple conjuncts with and. Each conjunct is classified by the planner:

  • Equi conjunct – a cross-input equality (a.x == b.y). Drives the hash lookup or sort-merge join.
  • Range conjunct – a cross-input ordered comparison (a.start <= b.ts and b.ts <= a.end). Handled by the IEJoin algorithm when no equi conjunct constrains the same input pair.
  • Residual conjunct – any other CXL predicate (intra-input filter, function call, etc.). Applied as a post-filter after the equi/range match.

  where: |
    orders.product_id == products.product_id
    and orders.amount >= 100
    and products.region == "us-east"

Above: the equi conjunct drives the join; orders.amount >= 100 and products.region == "us-east" are applied as residuals.

At least one cross-input equality is required for every combine. Pure-range predicates without an equi conjunct are also supported via IEJoin.

Match modes

match: first

Emit one output row per driver record, using the first matching build-side record. Standard 1:1 enrichment. Default.

  config:
    where: "orders.product_id == products.product_id"
    match: first
    cxl: |
      emit order_id = orders.order_id
      emit product_name = products.product_name

match: all

Emit one output row for every matching build-side record. 1:N fan-out – if a driver record matches three build records, three rows are emitted.

  config:
    where: "employees.department == benefits.department"
    match: all
    cxl: |
      emit employee_id = employees.employee_id
      emit benefit = benefits.benefit_name

match: collect

Gather every matching build-side record into a single Array-typed field on the output row. The driver record appears once; the build matches are aggregated into an array. The cxl: body must be empty under collect – the combine node synthesizes the output as { driver fields..., <build_qualifier>: Array }.

  config:
    where: "orders.product_id == products.product_id"
    match: collect
    cxl: ""

A per-group entry limit of 10,000 prevents unbounded growth.

Use collect when you need the set of matches as a single structured value; use all when you need a flat row per match.

Unmatched records (on_miss)

on_miss controls what happens to driver records with zero matches:

| Value | Semantics |
|---|---|
| null_fields (default) | Build-side fields resolve to null. Driver record is still emitted. Equivalent to left-join. |
| skip | Driver record is dropped. Equivalent to inner-join. |
| error | Pipeline fails on the first unmatched driver record. |

  config:
    where: "orders.product_id == products.product_id"
    on_miss: skip

on_miss: error is useful for strict referential integrity where any miss should halt processing. on_miss: skip is the inner-join shape. on_miss: null_fields is the left-join shape and the default.
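
As a sketch of the strict referential-integrity case, the combine below halts the run on the first order whose product_id has no match in products. The node name strict_enrich is hypothetical; the fields are the ones used in the examples above.

```yaml
# Hypothetical node name; assumes orders and products nodes as in the earlier examples.
- type: combine
  name: strict_enrich
  input:
    orders: orders
    products: products
  config:
    where: "orders.product_id == products.product_id"
    match: first
    on_miss: error            # fail on the first order with no matching product
    cxl: |
      emit order_id = orders.order_id
      emit product_name = products.product_name
    propagate_ck: driver
```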

Composite keys

Chain multiple cross-input equalities with and:

  config:
    where: |
      sales.department == targets.department
      and sales.region == targets.region
    cxl: |
      emit department = sales.department
      emit region = sales.region
      emit actual = sales.amount
      emit goal = targets.goal

All conjuncts must hold for a record pair to match.

Multi-input combine (three or more)

Combine accepts any number of inputs. Each pair of inputs that should be related needs an explicit cross-input equality:

- type: combine
  name: fully_enriched
  input:
    orders: orders
    products: products
    categories: categories
  config:
    where: |
      orders.product_id == products.product_id
      and products.category_id == categories.category_id
    match: first
    on_miss: null_fields
    cxl: |
      emit order_id = orders.order_id
      emit product_name = products.product_name
      emit category_name = categories.name
      emit amount = orders.amount
    propagate_ck: driver

The planner builds a join tree by walking equalities pairwise and ordering the joins by selectivity.

Choosing the driving input

The driver is the input whose records flow through one at a time during execution; the other inputs are materialized as build-side hash tables (or IEJoin index structures). By default the first input in the input: map is the driver.

Use drive: to override:

  config:
    where: "orders.product_id == products.product_id"
    drive: products
    cxl: |
      emit product_id = products.product_id
      emit product_name = products.product_name
      emit sample_order_id = orders.order_id

With drive: products, the pipeline emits one row per product enriched with a matching order, instead of one row per order enriched with its product. Pick the driver based on which side you want to iterate over (typically the larger stream, or the one whose ordering you want to preserve).

Strategy hint

| Value | Behavior |
|---|---|
| auto (default) | Planner picks a strategy from the predicate shape. Hash join for equi predicates; IEJoin for pure-range predicates. |
| grace_hash | Force grace hash join (disk-spilling partitioned hash). Applies only to pure-equi predicates; ignored on predicates with range conjuncts. |

grace_hash is the right hint when build-side inputs are larger than the memory budget but fit on disk after partitioning. The planner falls back automatically to grace-hash spill when an in-memory hash table approaches the RSS soft limit, so strategy: grace_hash is mostly an explicit assertion for performance reasoning.
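
A minimal sketch of the hint in place, assuming a hypothetical customers input that is too large for the memory budget and a hypothetical customer_name field. The predicate is pure-equi, which is the only shape the hint applies to.

```yaml
# Hypothetical inputs and field names; strategy is the only key of interest here.
- type: combine
  name: enrich_large
  input:
    orders: orders
    customers: customers
  config:
    where: "orders.customer_id == customers.customer_id"   # pure-equi predicate
    strategy: grace_hash      # force the disk-spilling partitioned hash join
    match: first
    on_miss: null_fields
    cxl: |
      emit order_id = orders.order_id
      emit customer_name = customers.customer_name
    propagate_ck: driver
```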

Correlation-key propagation

Combine declares which correlation-key columns its output rows carry via the required propagate_ck field. The choice shapes both the combine’s compile-time output schema and the runtime record builder.

- type: combine
  name: enriched
  input:
    orders: orders
    products: products
  config:
    where: "orders.product_id == products.product_id"
    cxl: |
      emit order_id = orders.order_id
      emit product_name = products.name
    propagate_ck: driver        # driver-only (the behavior before this field existed)
    # Alternatives (set exactly one propagate_ck value per combine):
    # propagate_ck: all         # union of every input's $ck.* columns
    # propagate_ck:
    #   named: [order_id]       # explicit subset (intersected with upstream)

  • driver – output schema carries only the driver input’s $ck.<field> columns. Build-side records contribute body fields; their CK identity is consumed by the match.
  • all – output schema carries every input’s $ck.<field> columns; the runtime copies build-side values onto each output row alongside the body’s emit columns. Use when the build side carries CK fields downstream operators need to read.
  • named: [<field>, ...] – explicit subset, intersected with what’s actually present upstream. Use to project a multi-field CK down to a single field after a join.

Driver wins on a name collision: if both the driver and a build input declare $ck.<field>, the column appears once on the output schema and the runtime keeps the driver’s value. See the Correlation-key combine interaction reference for match-mode interaction details (especially match: collect, where the propagated slot is single-valued and the array column preserves full lineage).

propagate_ck is required on every combine; pipelines without an explicit value fail to compile. Existing pipelines migrate by adding propagate_ck: driver, which is bit-for-bit equivalent to the driver-only behavior that applied before the field became required.

Memory considerations

Build-side inputs are materialized in memory as hash tables keyed by the equi columns. For each non-driving input, plan for roughly 1.5-2x the raw CSV size in heap. A 50 MB product catalog typically uses 75-100 MB of hash-table memory. Tune with pipeline.memory.limit at the pipeline level; see Memory Tuning for spill thresholds, the backpressure knob, and strategy overrides.

Complete example

pipeline:
  name: order_enrichment

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: csv
      path: "./data/orders.csv"
      schema:
        - { name: order_id, type: string }
        - { name: product_id, type: string }
        - { name: amount, type: float }

  - type: source
    name: products
    config:
      name: products
      type: csv
      path: "./data/products.csv"
      schema:
        - { name: product_id, type: string }
        - { name: product_name, type: string }
        - { name: category, type: string }

  - type: combine
    name: enrich
    input:
      orders: orders
      products: products
    config:
      where: "orders.product_id == products.product_id"
      match: first
      on_miss: null_fields
      cxl: |
        emit order_id = orders.order_id
        emit product_id = orders.product_id
        emit product_name = products.product_name
        emit category = products.category
        emit amount = orders.amount
      propagate_ck: driver

  - type: output
    name: result
    input: enrich
    config:
      name: result
      type: csv
      path: "./output/enriched_orders.csv"

See also

  • Multi-Input Combine – recipe-style walkthrough with input data and expected output.
  • Merge Nodes – streamwise concatenation; the right operator when inputs share a schema and no per-record matching is needed.
  • Memory Tuning – memory budget, spill thresholds, and strategy overrides.

Output Nodes

Output nodes write processed records to files. They are the terminal nodes of a pipeline – every pipeline path must end at an output (or records are silently dropped).

Basic structure

- type: output
  name: result
  input: transform_node
  config:
    name: output_stage
    type: csv
    path: "./output/result.csv"

The type: field selects the output format: csv, json, xml, fixed_width, edifact, or x12. The edifact and x12 writers reconstruct their EDI interchange envelopes around emitted records; see EDIFACT Format and X12 Format.

Field control

Output nodes can either pass every upstream field through to the writer or restrict output to the fields the upstream transform explicitly emitted. Several options control which fields appear and how they are named.

Unmapped input field passthrough

    include_unmapped: false    # Default: true

When true (the default), every field on an input record that the upstream transform did not explicitly emit still passes through to the output unchanged. This includes fields the source’s on_unmapped: auto_widen policy absorbed into the per-record $widened sidecar map – their contents expand back to top-level columns at the sink.

When false, only fields named by an emit statement in the upstream transform appear in the output. The $widened sidecar slot is stripped and undeclared input fields are dropped.

Migration notice

The default flipped from false to true in a recent release (see issue #90). Pipelines that relied on the previous behavior – where output records contained only the fields explicitly emitted upstream – must now set include_unmapped: false explicitly to restore that shape.

The flag composes independently with include_correlation_keys: true – see below. See Auto-Widen & Schema Drift -> Output controls for the full specification, cross-format flow examples, and the writer-rejection contract for Value::Map payloads.

Worked example

Suppose the upstream source emits records with order_id, customer_id, amount, and region, and a transform emits only one derived field:

- type: transform
  name: classify
  input: orders
  config:
    cxl: |
      emit amount_bucket = if amount >= 1000 then "high" else "low"

With include_unmapped: true (the default), each output record carries order_id, customer_id, amount, region, and amount_bucket. With include_unmapped: false, each output record carries only amount_bucket. The transform’s CXL is unchanged in both cases – the Output node decides the field set.

Include correlation-key shadow columns

    include_correlation_keys: true    # Default: false

When the pipeline declares error_handling.correlation_key: <field>, the engine adds shadow columns named $ck.<field> to the schema. These shadow columns preserve correlation-group identity through transforms that may rewrite the user-declared field. They are an internal engine namespace and are stripped from output by default.

Set include_correlation_keys: true to surface the shadow columns in the writer output – typically for debugging correlation-group routing or auditing DLQ behavior. See Correlation Keys for the full lifecycle.

include_correlation_keys does not surface the $widened sidecar – include_unmapped is the separate flag for that. The two are independent: each, both, or neither can be set.
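
To illustrate that independence, the sketch below sets both flags on one output. It assumes the pipeline declares a correlation key on order_id (so a $ck.order_id shadow column exists) and reuses the classify transform from the worked example above; the output name and path are hypothetical.

```yaml
# Assumes error_handling.correlation_key: order_id is declared for the pipeline.
- type: output
  name: audit_out
  input: classify
  config:
    name: audit_out
    type: csv
    path: "./output/audit.csv"
    include_unmapped: false          # only fields emitted upstream; $widened is stripped
    include_correlation_keys: true   # surface the $ck.order_id shadow column
```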

Writer rejection of Value::Map payloads

CSV, XML, fixed-width, EDIFACT, and X12 writers refuse records carrying a Value::Map payload at any column slot, raising FormatError::UnserializableMapValue { format, column }. JSON serializes Value::Map natively as a nested object.

The typical cause is a $widened sidecar reaching a non-JSON writer because the Output node set include_unmapped: false. See Auto-Widen & Schema Drift -> Writer rejection for the rejection contract and remediation routes.

Field mapping

Rename fields at output time without changing upstream CXL:

    mapping:
      "Customer Name": "full_name"
      "Order Total": "amount"

Keys are output column names; values are the source field names from upstream.

Excluding fields

Remove specific fields from output:

    exclude: [internal_id, _debug_flag, temp_calc]

Header control (CSV)

    include_header: true      # Default: true

Set to false to omit the CSV header row.

Null handling

    preserve_nulls: false     # Default: false

When false, null values are written as empty strings. When true, nulls are preserved in the output format’s native null representation (e.g., null in JSON).
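
For example, a JSON output that should keep nulls rather than write empty strings might look like the following; the node name and path are illustrative.

```yaml
# Hypothetical node name and path.
- type: output
  name: json_out
  input: processed
  config:
    name: json_out
    type: json
    path: "./output/result.json"
    preserve_nulls: true      # write JSON null instead of an empty string
```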

Metadata inclusion

Control whether per-record $meta.* metadata fields appear in output:

    include_metadata: none      # Default -- strip all metadata
    # Alternatives (set exactly one include_metadata value):
    # include_metadata: all     # Include all metadata fields
    # include_metadata:
    #   - source_file           # Include only listed metadata keys
    #   - source_row

Metadata fields are prefixed with meta. in the output.

Output format options

CSV

- type: output
  name: csv_out
  input: processed
  config:
    name: csv_out
    type: csv
    path: "./output/result.csv"
    options:
      delimiter: "|"

JSON

- type: output
  name: json_out
  input: processed
  config:
    name: json_out
    type: json
    path: "./output/result.json"
    options:
      format: ndjson           # array | ndjson
      pretty: true             # Pretty-print JSON

  • array (default) – writes a single JSON array containing all records.
  • ndjson – writes one JSON object per line.

XML

- type: output
  name: xml_out
  input: processed
  config:
    name: xml_out
    type: xml
    path: "./output/result.xml"
    options:
      root_element: "data"
      record_element: "row"

Fixed-width

- type: output
  name: fw_out
  input: processed
  config:
    name: fw_out
    type: fixed_width
    path: "./output/result.dat"
    schema: "./schemas/output.schema.yaml"
    options:
      line_separator: crlf

Fixed-width output requires a format schema defining field positions and widths.

EDIFACT

- type: output
  name: edi_out
  input: messages
  config:
    name: edi_out
    type: edifact
    path: "./out/result.edi"
    options:
      interchange: ["UNOA:1", "SENDER", "RECEIVER", "240101:1200", "REF1"]
      message_type: "ORDERS:D:96A:UN"
      write_una: false
      segment_newline: true

The EDIFACT writer reconstructs the interchange envelope around emitted records, recomputing the UNT/UNZ control counts and echoing the control references, and release-escapes any element data that carries a service character. The UNB header comes from interchange (literal elements) or interchange_from_doc (echoed from a $doc section). An interchange is a single envelope, so an edifact output cannot be combined with a split: block — the combination is rejected at config-validation time (E323). See EDIFACT Format for the full option reference, the record schema, and the round-trip semantics.

Sort order

Sort records before writing:

    sort_order:
      - { field: "name", order: asc }
      - { field: "amount", order: desc, null_order: last }

| Sort option | Values | Default |
|---|---|---|
| order | asc, desc | asc |
| null_order | first, last, drop | last |

  • first – nulls sort before all non-null values.
  • last – nulls sort after all non-null values.
  • drop – records with null sort keys are excluded from output.

Shorthand: a bare string defaults to ascending with nulls last:

    sort_order:
      - "name"
      - { field: "amount", order: desc }

File splitting

Split output into multiple files based on record count, byte size, or group boundaries:

- type: output
  name: split_output
  input: processed
  config:
    name: split_output
    type: csv
    path: "./output/result.csv"
    split:
      max_records: 10000
      max_bytes: 10485760           # 10 MB
      group_key: "department"       # Never split mid-group
      naming: "{stem}_{seq:04}.{ext}"
      repeat_header: true           # Repeat CSV header in each file
      oversize_group: warn          # warn | error | allow

Split configuration fields

| Field | Required | Default | Description |
|---|---|---|---|
| max_records | No | – | Soft record count limit per file |
| max_bytes | No | – | Soft byte size limit per file |
| group_key | No | – | Field name – never split within a group sharing this key value |
| naming | No | "{stem}_{seq:04}.{ext}" | File naming pattern. {stem} is the base name, {seq:04} is a zero-padded sequence number, {ext} is the file extension |
| repeat_header | No | true | Repeat CSV header row in each split file |
| oversize_group | No | warn | What to do when a single key group exceeds file limits |

At least one of max_records or max_bytes should be specified for splitting to have any effect.

Oversize group policies

  • warn (default) – log a warning and allow the oversized file.
  • error – stop the pipeline.
  • allow – silently allow the oversized file.

When group_key is set, the split point is the first group boundary after the threshold is reached (greedy). Without group_key, files are split at the exact limit.
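
The walk-through below is one reading of the greedy rule on a tiny, made-up input; the record sequence and the resulting file contents are illustrative assumptions rather than output captured from a run.

```yaml
# Assumed input, already ordered by department: A, A, B, B, C
- type: output
  name: by_department
  input: processed
  config:
    name: by_department
    type: csv
    path: "./output/result.csv"
    split:
      max_records: 3
      group_key: "department"
# Walk-through under this reading:
#   Record 3 (the first B) reaches max_records, but group B is still open,
#   so the file is cut at the next group boundary, after the second B.
#   First file  -> A, A, B, B   (4 records; max_records is a soft limit)
#   Second file -> C
```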

Streaming writes under fused Merge.interleave

When a single Output sits directly downstream of a Merge whose mode is interleave and whose every direct predecessor is a Source, the executor takes a streaming path: a bounded tokio::sync::mpsc::channel connects the Merge arm to the writer task, and Writer::write_record fires per record as Merge emits, concurrent with Merge production.

The buffered alternative, which still runs for every other Output topology, waits until the Merge arm has accumulated every record before invoking the writer. With a slow upstream Source, that buffering defeats the live back-pressure that the Merge.interleave fusion provides at the Source-channel layer: each record sits in node_buffers[merge] until the slow Source finishes.

Topology

- type: source
  name: src_a
  config: { type: csv, path: a.csv, schema: ... }
- type: source
  name: src_b
  config: { type: csv, path: b.csv, schema: ... }
- type: merge
  name: merged
  inputs: [src_a, src_b]
  config:
    mode: interleave        # required
- type: output
  name: out
  input: merged
  config:
    name: out
    type: csv
    path: out.csv

The streaming path is selected automatically — there is no opt-in setting. Pipelines that don’t match the topology keep the buffered path.

Eligibility

Every condition must hold for the streaming path to engage; if any fails, the buffered path runs:

  • The Output has exactly one incoming edge, and that predecessor is a Merge with mode: interleave.
  • Every direct predecessor of that Merge is a Source (same predicate the fused Merge.interleave arm uses for its live tokio::select!).
  • The Merge has no other downstream consumer besides this one Output (no fan-out).
  • The Output is not in the init-phase ancestor closure.
  • The OutputConfig has no split: block — splitting writers manage their own file rotation lifecycle.
  • The writer is registered in the single-file writer registry (not fan_out_per_source_file).
  • No Source in the pipeline declares a correlation key — the correlation-buffered output path defers writes to CorrelationCommit and is incompatible with per-record write.
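
The checklist above amounts to a conjunction of independent predicates. The sketch below restates it in Python for readers who want to reason about their own topology. `OutputTopology` and all of its field names are hypothetical summaries invented for this illustration; they are not the planner's real data structures.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class OutputTopology:
    """Hypothetical summary of the facts the eligibility rules inspect."""
    predecessor_kinds: List[str]        # node kind of each edge into the Output
    merge_mode: Optional[str]           # mode of the upstream Merge, if any
    merge_predecessor_kinds: List[str]  # node kind of each Merge input
    merge_consumer_count: int           # downstream consumers of the Merge
    in_init_phase_closure: bool
    has_split_block: bool
    single_file_writer: bool
    any_source_declares_correlation_key: bool


def streaming_eligible(t: OutputTopology) -> bool:
    """True only when every documented condition holds."""
    return (
        t.predecessor_kinds == ["merge"]
        and t.merge_mode == "interleave"
        and len(t.merge_predecessor_kinds) > 0
        and all(kind == "source" for kind in t.merge_predecessor_kinds)
        and t.merge_consumer_count == 1
        and not t.in_init_phase_closure
        and not t.has_split_block
        and t.single_file_writer
        and not t.any_source_declares_correlation_key
    )


# The topology shown earlier: two Sources -> Merge(interleave) -> one Output.
example = OutputTopology(
    predecessor_kinds=["merge"],
    merge_mode="interleave",
    merge_predecessor_kinds=["source", "source"],
    merge_consumer_count=1,
    in_init_phase_closure=False,
    has_split_block=False,
    single_file_writer=True,
    any_source_declares_correlation_key=False,
)
```

Flipping any single field, for example `replace(example, has_split_block=True)`, makes the predicate false, which corresponds to the buffered path running instead.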

Back-pressure flow

Under the streaming path, back-pressure flows end-to-end:

writer slow → mpsc::Sender::send().await yields
             → Merge arm yields
             → Source mpsc::Receiver fills
             → Source ingest task blocks on send

The bounded handoff channel between Merge and Output (256 slots) and the existing per-Source ingest channels (issue #67) form a single pace-bound chain from the underlying Write sink back to the source reader. A slow file system, a saturated network sink, or a deliberately-paced writer no longer accumulates records in pipeline-internal Vecs; the upstream readers slow down to match.
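
The effect of a bounded handoff is easy to reproduce outside the engine. The sketch below uses Python's `asyncio.Queue` as a stand-in for the bounded tokio channel; it is an analogy under that assumption, not Clinker code, and `run_chain` is a hypothetical name. The producer suspends whenever the queue is full, so the number of in-flight records never exceeds the capacity no matter how slow the writer is.

```python
import asyncio


async def run_chain(n_records: int, capacity: int):
    """Push n_records through a bounded queue into a deliberately slow writer."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=capacity)
    written = []
    max_depth = 0

    async def producer() -> None:
        nonlocal max_depth
        for i in range(n_records):
            await queue.put(i)  # suspends while the queue is full
            max_depth = max(max_depth, queue.qsize())
        await queue.put(None)  # sentinel: no more records

    async def writer() -> None:
        while True:
            record = await queue.get()
            if record is None:
                return
            await asyncio.sleep(0)  # stand-in for a slow sink
            written.append(record)

    await asyncio.gather(producer(), writer())
    return written, max_depth
```

Calling `asyncio.run(run_chain(1000, 8))` delivers all records in order while the observed queue depth stays at or below 8; the buffered path, by contrast, corresponds to an unbounded list that grows to the full input size.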

Counter semantics

Counter behavior under the streaming path matches the buffered Output arm exactly: records_written increments once per Writer::write_record call, ok_count counts distinct source row_nums reaching the Output, and dlq_count is unaffected (DLQ entries originate upstream). Stage metrics (SchemaScan, Write, Projection) accumulate into the same fields the buffered path uses; the dispatcher folds the streaming task’s per-task accounting back into the run-wide totals at end of DAG.
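
The difference between records_written and ok_count matters when one source row produces more than one output record. The sketch below is a minimal model of the two definitions given above, assuming a single-source pipeline; the class `OutputCounters` and its method names are hypothetical.

```python
class OutputCounters:
    """Hypothetical model of the two Output-side counters."""

    def __init__(self) -> None:
        self.records_written = 0
        self._source_rows = set()

    def on_write_record(self, source_row_num: int) -> None:
        # One increment per write call, regardless of where the record came from.
        self.records_written += 1
        # Distinct source rows are tracked separately for ok_count.
        self._source_rows.add(source_row_num)

    @property
    def ok_count(self) -> int:
        return len(self._source_rows)
```

If source row 1 fans out into two records and source row 2 into one, the model reports `records_written == 3` and `ok_count == 2`.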

Complete example

- type: output
  name: department_reports
  input: enriched_employees
  config:
    name: department_reports
    type: csv
    path: "./output/employees.csv"
    mapping:
      "Employee ID": "employee_id"
      "Full Name": "display_name"
      "Department": "department"
      "Annual Salary": "salary"
    exclude: [internal_flags]
    include_header: true
    sort_order:
      - { field: "department", order: asc }
      - { field: "display_name", order: asc }
    split:
      max_records: 5000
      group_key: "department"
      naming: "employees_{seq:03}.csv"
      repeat_header: true

Error Handling & DLQ

Clinker provides structured error handling with a dead-letter queue (DLQ) for records that fail processing. The error_handling: block at the top level of the pipeline YAML controls the behavior.

Configuration

error_handling:
  strategy: continue
  dlq:
    path: "./output/errors.csv"
    include_reason: true
    include_source_row: true

Strategies

The strategy: field controls what happens when a record fails:

| Strategy | Behavior |
|---|---|
| fail_fast | Default. Stop the pipeline on the first error. |
| continue | Route bad records to the DLQ and keep processing good records. |
| best_effort | Continue processing with partial results, even if some stages produce incomplete output. |

fail_fast

The safest strategy. Any record-level error (type coercion failure, validation error, missing required field) halts the pipeline immediately. Use this when data quality is critical and you prefer to fix issues before reprocessing.

continue

The production workhorse. Bad records are written to the DLQ file with diagnostic metadata, and the pipeline continues processing remaining records. After the run completes, inspect the DLQ to understand and correct failures.

A pipeline that completes with DLQ entries exits with code 2 – this signals “pipeline completed successfully but some records were rejected.” It is not a crash or internal error.

best_effort

The most lenient strategy. Processing continues even with partial results. Use this for exploratory data analysis where completeness is less important than progress.

DLQ configuration

The DLQ is always written as CSV, regardless of the pipeline’s input/output formats.

  dlq:
    path: "./output/errors.csv"
    include_reason: true
    include_source_row: true

| Field | Required | Default | Description |
|---|---|---|---|
| path | No | | File path for DLQ output. If omitted, DLQ records are logged but not written to file. |
| include_reason | No | | Include _cxl_dlq_error_category and _cxl_dlq_error_detail columns. |
| include_source_row | No | | Include original source fields alongside DLQ metadata. |

DLQ columns

Every DLQ record includes these metadata columns:

| Column | Description |
|---|---|
| _cxl_dlq_id | UUID v7 (time-ordered unique identifier) |
| _cxl_dlq_timestamp | RFC 3339 timestamp of when the error occurred |
| _cxl_dlq_source_file | Input filename that produced the failing record |
| _cxl_dlq_source_row | 1-based row number in the source file |
| _cxl_dlq_stage | Name of the transform or aggregate node where the error occurred |
| _cxl_dlq_route | Route branch name (if the error occurred after routing) |
| _cxl_dlq_trigger | Validation rule name that triggered the rejection |

When include_reason: true is set, two additional columns appear:

| Column | Description |
|---|---|
| _cxl_dlq_error_category | Machine-readable error classification |
| _cxl_dlq_error_detail | Human-readable error description |

Error categories

The _cxl_dlq_error_category column contains one of these values:

| Category | Description |
|---|---|
| missing_required_field | A required field is absent from the record |
| type_coercion_failure | A value could not be converted to the expected type |
| required_field_conversion_failure | A required field exists but its value cannot be converted |
| nan_in_output_field | A computation produced NaN |
| aggregate_type_error | An aggregate function received an incompatible type |
| validation_failure | A declarative validation check failed |
| aggregate_finalize | An aggregate function failed during finalization |
| correlated | A non-failing record was DLQ’d as collateral because another record in its correlation group failed |
| group_size_exceeded | A correlation-key group exceeded the configured max_group_buffer limit |
| late_record | A record arrived at a time-windowed aggregate after its event-time window had already closed |
| expansion_limit_exceeded | A transform’s emit each fan-out produced more output records than its max_expansion ceiling allows |
| combine_output_row | A Combine output-stage eval failed for one driver row (probe-key, residual, or matched / on_miss: null_fields body); the entry carries the contributing-build lineage and rewinds both the driver and matched build source’s rollback cursor. Routed to the DLQ under continue / best_effort across every Combine join mode; fail_fast propagates the eval error |
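
Because the DLQ is always CSV and the category column is machine-readable, a post-run triage script can be very small. The sketch below counts DLQ entries per category using only the standard library. It assumes the DLQ file has a header row naming the columns as documented above and that include_reason: true was set; the sample rows are invented for illustration and `summarize_dlq` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Invented sample: a real DLQ file carries more columns than shown here.
SAMPLE_DLQ = (
    "_cxl_dlq_source_row,_cxl_dlq_stage,_cxl_dlq_error_category\n"
    "3,validate_orders,validation_failure\n"
    "7,validate_orders,type_coercion_failure\n"
    "9,validate_orders,validation_failure\n"
)


def summarize_dlq(csv_text: str) -> Counter:
    """Count DLQ rows per _cxl_dlq_error_category value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["_cxl_dlq_error_category"] for row in reader)


summary = summarize_dlq(SAMPLE_DLQ)
```

For a file on disk, pass `open(path, newline="").read()` instead of the sample string.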

Advanced options

Type error threshold

Abort the pipeline if the fraction of failing records exceeds a threshold:

  type_error_threshold: 0.05    # Abort if >5% of records fail

This acts as a circuit breaker – if your input data is unexpectedly corrupt, the pipeline stops early rather than filling the DLQ with millions of entries.
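
The comparison behind the circuit breaker is a simple ratio test. The sketch below shows that arithmetic only; when the engine evaluates the ratio (per record, per batch, or at another cadence) is not stated here, and `exceeds_threshold` is a hypothetical name.

```python
def exceeds_threshold(failed: int, total: int, threshold: float) -> bool:
    """True when the failing fraction is strictly above the threshold."""
    if total == 0:
        return False  # nothing processed yet, so nothing to trip on
    return failed / total > threshold
```

With `threshold=0.05`, 5 failures in 100 records does not trip the breaker, while 6 in 100 does.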

Correlation key

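Group DLQ rejections by a key field. The key is declared per source, as a correlation_key: field in that source’s config: block (see Correlation Keys). When any record in a correlation group fails, records from the failing source’s contribution to that group are routed to the DLQ: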

  correlation_key: order_id

For compound keys:

  correlation_key: [order_id, customer_id]

This is useful for transactional data where partial processing of a group is worse than rejecting the entire group. For example, if one line item in an order fails validation, you may want to reject the entire order.

Under multi-source ingest, the collateral fan-out narrows to the failing source: a src_b trigger does NOT DLQ records from src_a that share the same correlation key. Single-source pipelines see bit-identical behavior to today’s pipeline-wide collateral DLQ. See Per-source rollback narrowing for the full semantic and the two documented exceptions (max_group_buffer overflow and Combine output failures).

For the full lifecycle and per-operator semantics (route, merge, aggregate, combine), see Correlation Keys.

Max group buffer

Limit the number of records buffered per correlation group:

  max_group_buffer: 100000     # Default: 100,000

Groups exceeding this limit are DLQ’d entirely with a group_size_exceeded summary entry.

Exit codes

| Code | Meaning |
|---|---|
| 0 | Pipeline completed successfully, no errors |
| 1 | Pipeline failed (internal error, config error, or fail_fast triggered) |
| 2 | Pipeline completed, but DLQ entries were produced |

Exit code 2 is not a failure – it means the pipeline ran to completion and handled errors according to the configured strategy. Check the DLQ file for details.
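
Schedulers and wrapper scripts often treat any non-zero exit as a failure, which would misreport code 2. The sketch below shows one way a wrapper might classify the three documented codes. The function names are hypothetical, and treating undocumented codes as failures is an assumption of this sketch.

```python
EXIT_MEANINGS = {
    0: "completed, no errors",
    1: "failed (internal error, config error, or fail_fast triggered)",
    2: "completed, DLQ entries produced",
}


def run_completed(code: int) -> bool:
    """Codes 0 and 2 both mean the pipeline ran to completion."""
    return code in (0, 2)


def has_rejects(code: int) -> bool:
    """Only code 2 signals that the DLQ should be inspected."""
    return code == 2


def describe(code: int) -> str:
    # Assumption: any code outside the documented table is treated as a failure.
    return EXIT_MEANINGS.get(code, "failed (undocumented exit code)")
```

A wrapper would pass `subprocess.run(...).returncode` to these helpers and alert on `not run_completed(code)` rather than on `code != 0`.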

Complete example

pipeline:
  name: order_processing
  memory: { limit: "512M" }

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: csv
      path: "./data/orders.csv"
      correlation_key: order_id
      schema:
        - { name: order_id, type: int }
        - { name: customer_id, type: int }
        - { name: amount, type: float }
        - { name: email, type: string }

  - type: transform
    name: validate_orders
    input: orders
    config:
      cxl: |
        emit order_id = order_id
        emit customer_id = customer_id
        emit amount = amount
        emit email = email
      validations:
        - field: email
          check: "not_empty"
          severity: error
          message: "Customer email is required"
        - check: "amount > 0"
          severity: error
          message: "Order amount must be positive"

  - type: output
    name: valid_orders
    input: validate_orders
    config:
      name: valid_orders
      type: csv
      path: "./output/valid_orders.csv"

error_handling:
  strategy: continue
  dlq:
    path: "./output/rejected_orders.csv"
    include_reason: true
    include_source_row: true
  type_error_threshold: 0.10

Correlation Keys

A correlation key declares a set of records from a single source as an atomic group: if any record in the group fails validation or processing, the whole group is sent to the DLQ. This is the right shape for transactional data where partial processing is worse than total rejection – the canonical example is an order with multiple line items where one bad line should reject the entire order.

This page describes the full lifecycle of a correlation key and how it interacts with each operator that can fan out, fan in, group, or join records.

Declaration

Correlation keys are declared per source. Each source’s config: block carries an optional correlation_key: field naming the column (or list of columns) whose value identifies a record’s correlation group within that source. The engine widens each declaring source’s schema with one $ck.<field> shadow column per field and stamps the user-declared value into it at ingest.

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: csv
      path: ./data/orders.csv
      correlation_key: order_id
      schema:
        - { name: order_id, type: string }
        - { name: amount, type: int }

  - type: source
    name: customers
    config:
      name: customers
      type: csv
      path: ./data/customers.csv
      correlation_key: [customer_id, region]
      schema:
        - { name: customer_id, type: string }
        - { name: region, type: string }
        - { name: name, type: string }

  - type: source
    name: sensor_readings
    config:
      name: sensor_readings
      type: csv
      path: ./data/sensors.csv
      # No correlation_key field declared. This source carries no
      # $ck.* widening; record-level errors land in the DLQ as
      # standalone entries with no group atomicity.
      schema:
        - { name: ts, type: date_time }
        - { name: value, type: float }

A record’s correlation group is identified by the tuple of values for that source’s listed fields. Records sharing the same tuple within the same source belong to the same group. There is no pipeline-level correlation key — the previous error_handling.correlation_key: field has been removed; pipelines that previously declared it move the field down to each contributing source.

A source whose declared correlation_key: field names a column not present in its own schema: block is rejected at compile time with diagnostic E153.
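
The E153 rule is a membership check between two lists in the same source declaration. The sketch below mirrors it so a pre-flight linter could catch the mistake before invoking the engine. `check_correlation_key` is a hypothetical helper and the message wording is invented; only the code E153 and the rule itself come from this page.

```python
from typing import List, Union


def check_correlation_key(
    source_name: str,
    schema_fields: List[str],
    correlation_key: Union[str, List[str]],
) -> List[str]:
    """Return one diagnostic per correlation-key field missing from the schema."""
    # correlation_key accepts a single column name or a list of names.
    fields = [correlation_key] if isinstance(correlation_key, str) else list(correlation_key)
    return [
        f"E153: source '{source_name}': correlation_key field '{field}' is not in schema"
        for field in fields
        if field not in schema_fields
    ]
```

For the customers source above, dropping region from schema: while keeping correlation_key: [customer_id, region] would produce exactly one diagnostic, naming region.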

Lifecycle

The engine adds a shadow column named $ck.<field> (one per correlation-key field) to every declaring source’s schema and copies the field’s value into it at ingest. From that point on, the shadow column is the authoritative group identity – if a downstream transform rewrites the user-declared correlation field, the shadow column is untouched and the group identity is preserved.

Shadow columns are an internal engine namespace. You never write $ck.<field> in YAML or CXL – the engine manages them. They are stripped from default writer output. To surface them for debugging, set include_correlation_keys: true on an output node:

- type: output
  name: debug_out
  input: validate
  config:
    name: debug_out
    type: csv
    path: "./debug.csv"
    include_correlation_keys: true

Multi-source pipelines

Different sources can declare different correlation-key fields. The engine treats each source’s CK identity as locally consistent: a record from customers is a member of the customer-id group named in its row, and a record from orders is a member of the order-id group named in its row, regardless of whether customer_id appears in orders or vice versa. Combine and Merge nodes that join across sources negotiate which CK columns survive into the joined output via the Combine node’s propagate_ck: field (see Combine interaction below).

A source that declares no correlation_key: carries no $ck.* widening. Records from such a source flow through the pipeline without group identity; per-record errors DLQ on a per-record basis with no group fan-out. The orchestrator’s relaxed-aggregate retraction protocol still activates if any other source on the same DAG carries a CK field that an aggregate’s group_by omits — the retraction protocol scope is the DAG’s lattice of $ck.* columns, not any single source’s declaration.

DLQ semantics

When a record fails inside a correlation group:

  • The failing record produces a trigger DLQ entry. Its category reflects the actual failure (e.g. type_coercion_failure, validation_failure).
  • Every other record from a source that contributed a trigger to the same group produces a collateral DLQ entry. Collaterals carry the category correlated.
  • Records belonging to other (clean) groups proceed normally.

A record with a null value for the correlation-key field is treated as its own per-record group: it has no peers and DLQ atomicity does not span multiple records.

A Combine output-row eval failure that the engine recovers from (under continue / best_effort, in the hash build-probe inline arm) produces entries under the combine_output_row category — distinct from the upstream-Transform type_coercion_failure because the entry carries the contributing-build lineage and rewinds both the driver and the matched build source’s rollback cursor. See Per-source rollback narrowing below for the cursor-rewind detail.

The dlq_count counter sums triggers and collaterals.
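
The trigger and collateral rules can be traced on a small single-source example. The sketch below is a simplified model: it assumes every co-grouped record becomes a collateral (the fan-out policy can change which records DLQ), and `classify_records` is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple


def classify_records(
    records: List[Tuple[int, Optional[str]]],
    failures: Dict[int, str],
) -> Tuple[List[Tuple[int, str]], List[int]]:
    """records: (row_num, group_key); failures: row_num -> error category."""
    # A null key is its own per-record group, so it never taints other rows.
    dirty = {key for row, key in records if row in failures and key is not None}
    dlq: List[Tuple[int, str]] = []
    clean: List[int] = []
    for row, key in records:
        if row in failures:
            dlq.append((row, failures[row]))   # trigger keeps its real category
        elif key in dirty:
            dlq.append((row, "correlated"))    # collateral
        else:
            clean.append(row)
    return dlq, clean


dlq, clean = classify_records(
    [(1, "A"), (2, "A"), (3, "B"), (4, None), (5, None)],
    {2: "validation_failure", 4: "type_coercion_failure"},
)
```

Here row 1 is a collateral of row 2, row 3 belongs to a clean group, and the null-keyed row 5 is unaffected by the null-keyed row 4 failing. `len(dlq)` is 3, matching a dlq_count that sums triggers and collaterals.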

Per-source rollback narrowing

When two sources contribute records to the same correlation group, a failure originating from one source does NOT collaterally DLQ records from the OTHER source. The collateral fan-out is scoped to the failing source’s records only.

Concretely, consider [src_a, src_b] → merge → tfm → out with both sources declaring correlation_key: id. A mid-stream Transform error fires on every src_b record but leaves src_a records untouched:

- type: transform
  name: tfm
  input: m
  config:
    cxl: |
      emit id = id
      emit ratio = if($source.name == "src_b") then (1 / 0) else amt

Under per-source rollback, the dirty correlation group for each id value contains:

  • One trigger DLQ entry — the src_b row that hit 1 / 0.
  • The src_a row sharing the same id is spared and reaches the output.

The engine identifies origin per record via the engine-stamped $source.name column. Within the failing source’s records, the existing CorrelationFanoutPolicy (Any / All / Primary) determines which records DLQ — the policy semantics are unchanged. Single-source pipelines see bit-identical behavior to the pre-narrowing engine because every co-grouped record shares the failing source by construction.

Records that carry no single-source attribution — synthetic aggregate emits and Combine output rows — are NOT spared by per-source narrowing. They flow through the existing collateral path because their stamp falls back to the merged-source identity which is ambiguous about origin.
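
Per-source narrowing changes one thing in the collateral computation: the set of "dirty" peers is keyed by (group, source) instead of by group alone. The sketch below models that for records that carry a single-source attribution, again assuming every co-grouped record of the failing source becomes a collateral. `narrowed_dlq` is a hypothetical name.

```python
from typing import Dict, List, Set, Tuple

Record = Tuple[str, int, str]  # (source_name, row_num, correlation key value)


def narrowed_dlq(records: List[Record], failures: Set[Tuple[str, int]]):
    """Split records into DLQ entries and surviving output rows."""
    # For each group, remember which sources contributed a trigger.
    failing_sources: Dict[str, Set[str]] = {}
    for source, row, key in records:
        if (source, row) in failures:
            failing_sources.setdefault(key, set()).add(source)

    dlq, out = [], []
    for source, row, key in records:
        if (source, row) in failures:
            dlq.append((source, row, "trigger"))
        elif source in failing_sources.get(key, set()):
            dlq.append((source, row, "correlated"))
        else:
            out.append((source, row))  # other sources in the group are spared
    return dlq, out


dlq_entries, survivors = narrowed_dlq(
    [("src_a", 1, "k1"), ("src_b", 1, "k1"), ("src_b", 2, "k1")],
    {("src_b", 1)},
)
```

The src_a row survives even though it shares key k1 with the failing src_b row; the second src_b row is a collateral.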

The engine also surfaces a per_source_rollback_cursors map on the ExecutionReport, keyed by source name and carrying the highest source row number that cleanly exited a forward operator. The map advances per record at the clean exit of Transform / Route / Aggregate, and rewinds per contributing source on max_group_buffer overflow to the lowest row_num any group member of that source contributed. Sources whose records all DLQ never land in the map. The map is the replay anchor for per-source resume: a downstream rerun reads each source’s cursor as the floor for what must be reprocessed.

On max_group_buffer overflow, every record in the overflowing group still lands in DLQ (one GroupSizeExceeded trigger plus per-row collaterals), but the per-source rollback cursor rewinds independently per contributing source. Attributing the overflow failure itself to one source would be a fiction — every contributing source shared blame proportionally — so the DLQ shape stays group-wide while the rewind narrows per source.

The relaxed-CK aggregator’s per-row lineage carries (row_id, source_name) pairs so a finalize-time retract scoped to one source rewinds only that source’s contributions to each affected group. The source half of the pair is load-bearing under multi-source ingest: each source numbers its rows from its own monotonic counter, so two sources that both feed the same aggregate group can contribute records at identical row_id values. Pairing the row id with its source keeps src_a’s row 1 distinct from src_b’s row 1 when both land in one group, so a retract that must remove both reaches each one instead of collapsing the colliding ids and stranding the second source’s contribution.

Combine input snapshots are captured at fold start and cleared at every Combine arm’s exit (inline, IEJoin, GraceHash, SortMerge). When a Combine output-row eval fails recoverably under continue / best_effort in the hash build-probe (inline) arm (a probe-key, residual-filter, or matched / on_miss: null_fields body failure on one driver row), the snapshot restores each contributing source’s rollback cursor to the value it held at the start of the fold (its pre-fold floor), lowering the cursor only if it had since advanced, then routes the row to the DLQ under the combine_output_row category. Only the sources that fed the failing row rewind; co-folded sources that did not contribute keep their forward progress.

The IEJoin, grace-hash, and sort-merge arms propagate an output-eval failure as fail-fast regardless of strategy.
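
The row-id collision that motivates (row_id, source_name) lineage pairs is easy to demonstrate. The sketch below indexes the same two contributions twice, once by row id alone and once by the pair; `index_contributions` and the sample amounts are hypothetical.

```python
from typing import Dict, List, Tuple


def index_contributions(
    contributions: List[Tuple[str, int, int]],
    key_by_pair: bool,
) -> Dict[object, int]:
    """Index (source_name, row_id, amount) tuples for later retraction."""
    index: Dict[object, int] = {}
    for source_name, row_id, amount in contributions:
        key = (row_id, source_name) if key_by_pair else row_id
        index[key] = amount  # a colliding key silently overwrites the earlier entry
    return index


# Both sources number their rows from 1, so the row ids collide.
group = [("src_a", 1, 10), ("src_b", 1, 32)]
by_row_id = index_contributions(group, key_by_pair=False)
by_pair = index_contributions(group, key_by_pair=True)
```

Keyed by row id alone, the index holds a single entry and src_a's contribution can no longer be retracted; keyed by the pair, both contributions stay addressable.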

Group buffering

The engine buffers records per correlation group until either the group completes (all source records observed) or a failure triggers a flush. The max_group_buffer: field on the pipeline-level error_handling: block caps per-group buffering across every source’s groups:

error_handling:
  max_group_buffer: 100000     # Default: 100,000

Groups that exceed the cap are DLQ’d entirely with a group_size_exceeded trigger plus a collateral entry per buffered record. This is a backpressure boundary, not a hard error.
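
The resulting DLQ shape for an oversized group is one summary trigger plus one collateral per record. The sketch below computes that shape after the fact from a list of (key, row) pairs. It is a simplification: the real engine buffers incrementally, and how it treats records that arrive after a group has already overflowed is not covered here. `apply_group_cap` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def apply_group_cap(records: List[Tuple[str, int]], max_group_buffer: int):
    """records: (correlation key, row_num). Returns (dlq entries, kept groups)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, row_num in records:
        groups[key].append(row_num)

    dlq: List[Tuple[str, Optional[int], str]] = []
    kept: Dict[str, List[int]] = {}
    for key, rows in groups.items():
        if len(rows) > max_group_buffer:
            dlq.append((key, None, "group_size_exceeded"))       # summary trigger
            dlq.extend((key, row, "correlated") for row in rows)  # per-row collaterals
        else:
            kept[key] = rows
    return dlq, kept


cap_dlq, cap_kept = apply_group_cap([("A", 1), ("A", 2), ("A", 3), ("B", 4)], 2)
```

Group A has three records against a cap of two, so it yields four DLQ entries; group B is untouched.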

Compile-time constraints

Two compile-time invariants are enforced:

  • CK field must exist in source schema (E153). A source that declares correlation_key: <field> must list <field> in its own schema: block; otherwise the engine emits E153 pointing at the offending source declaration. The remediation is to either add the field to the source’s schema: block or remove the field from correlation_key:.

  • Arena execution incompatible. The arena-evaluated execution path is incompatible with correlation grouping. Combinations are rejected at compile time.

Aggregates whose group_by covers the upstream CK lattice stay on the strict-collateral path; aggregates that omit any CK field visible upstream activate the retraction protocol automatically. Authors do not configure this — the engine inspects the configuration and picks the correct path. See Aggregate interaction below.
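
The strict-versus-retraction classification reduces to a subset test between the upstream CK fields and the aggregate's group_by. The sketch below expresses that test; `aggregate_path` is a hypothetical name and the planner's actual representation of the CK lattice is not shown on this page.

```python
from typing import Iterable


def aggregate_path(group_by: Iterable[str], upstream_ck_fields: Iterable[str]) -> str:
    """'strict' when group_by covers every upstream CK field, else 'retraction'."""
    return "strict" if set(upstream_ck_fields) <= set(group_by) else "retraction"
```

With correlation_key: order_id upstream, `group_by: [order_id]` classifies as strict and `group_by: [department]` classifies as retraction. A pipeline with no upstream CK fields is trivially strict.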

Per-operator interactions

Transform interaction

A transform that rewrites the user-declared correlation-key field does not change a record’s group identity. The shadow column captured at ingest is what the buffer-key extractor reads, not the live field value.

- type: source
  name: orders
  config:
    correlation_key: order_id
    schema:
      - { name: order_id, type: string }
      - { name: amount, type: float }

# At ingest each record gets $ck.order_id = order_id

- type: transform
  name: anonymize
  input: orders
  config:
    cxl: |
      emit order_id = "REDACTED"      # writes the live field
      emit amount = amount

# Group identity is still the original order_id from $ck.order_id;
# anonymize does not collapse records into a single "REDACTED"-keyed group.

This makes the correlation-key declaration robust against routine field-rewrite logic in transforms.
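
The stamp-once, read-from-shadow behavior can be modelled with plain dictionaries. The sketch below is an illustration only: records are not dictionaries inside the engine, and `stamp_ck` and `group_identity` are hypothetical names.

```python
from typing import Dict, List, Tuple


def stamp_ck(record: Dict[str, object], ck_fields: List[str]) -> Dict[str, object]:
    """Copy each correlation-key field into its $ck.<field> shadow column."""
    stamped = dict(record)
    for field in ck_fields:
        stamped[f"$ck.{field}"] = record.get(field)
    return stamped


def group_identity(record: Dict[str, object], ck_fields: List[str]) -> Tuple[object, ...]:
    """Group identity reads the shadow columns, never the live fields."""
    return tuple(record[f"$ck.{field}"] for field in ck_fields)


row = stamp_ck({"order_id": "ORD-1", "amount": 5.0}, ["order_id"])
row["order_id"] = "REDACTED"  # what the anonymize transform does to the live field
identity = group_identity(row, ["order_id"])
```

After the rewrite, `identity` is still `("ORD-1",)`, so two anonymized orders remain in separate groups.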

Route interaction (fan-out)

A correlation group can span multiple route branches. Group atomicity is preserved across branches: if any record in the group fails (in any branch’s transform, or in the route predicate itself), the entire group is rejected from every branch.

- type: route
  name: split
  input: validate
  config:
    mode: inclusive
    conditions:
      a: 'priority == "high"'
      b: 'priority == "low"'
    default: a

- type: output
  name: out_a
  input: split.a
  config: { ... }

- type: output
  name: out_b
  input: split.b
  config: { ... }

For an inclusive route where one record reaches both branches, a single failure in the source still DLQ’s that source row exactly once – not once per (row, output) pair. The group identity dedupes the DLQ entries at the source-row level.

A route predicate that itself fails to evaluate (e.g. type error inside the condition expression) is treated like any other failure: it triggers DLQ atomicity for the whole correlation group.

Merge interaction (fan-in)

Merge concatenates upstream branches that share a schema. Each record carries its $ck.<field> shadow column unchanged through the merge. Groups originating from different upstream sources but sharing the same correlation-key value are treated as a single correlation domain downstream:

- type: source
  name: east_orders
  config: { ... }

- type: source
  name: west_orders
  config: { ... }

- type: merge
  name: all_orders
  inputs: [east_orders, west_orders]
  config: {}

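If east_orders and west_orders both contain rows for order_id = ORD-42, all of those rows are members of the same correlation group post-merge. A failure on any one of them marks that group dirty; collateral DLQ fan-out then follows Per-source rollback narrowing, so it covers the failing source’s rows in the group and spares the rows the other source contributed.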

Aggregate interaction

When an aggregate’s group_by covers every CK field visible upstream, the aggregate stays on the strict-collateral path: each emitted row inherits the correlation identity of its inputs and any DLQ trigger in the group rolls back every record in the group, including the aggregate output row. This is the zero-overhead default.

- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: ./data/orders.csv
    correlation_key: order_id
    schema:
      - { name: order_id, type: string }
      - { name: amount, type: int }

- type: aggregate
  name: order_totals
  input: orders
  config:
    group_by: [order_id]               # strict -- covers the upstream CK
    cxl: |
      emit total = sum(amount)

When an aggregate’s group_by omits any CK field visible upstream, the engine routes the aggregate through the retraction protocol automatically. A single correlation group may span multiple aggregate groups; CK fields omitted from group_by stop being visible to downstream consumers of this aggregate’s output as user-named columns. Authors do not configure this — the engine inspects the configuration and picks the correct path.

- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: ./data/orders.csv
    correlation_key: order_id
    schema:
      - { name: order_id, type: string }
      - { name: department, type: string }
      - { name: amount, type: int }

- type: aggregate
  name: dept_totals
  input: orders
  config:
    group_by: [department]             # retraction protocol is active
    cxl: |
      emit total = sum(amount)

Aggregate output rows on the strict path inherit the correlation meta of the records that fed them. If any input record in a correlation group fails, the surviving records in that group still flow through the aggregator and produce one aggregate row – but that aggregate row is itself DLQ’d as a collateral and never reaches the writer.

On the retraction path, the engine retracts only the failing records and refinalizes affected groups, so the aggregate output row reflects the surviving contributions. The retraction protocol’s runtime constraint (E15Y for strategy: streaming on a retraction-mode aggregate) is enforced automatically once the engine has classified the aggregate. Operators downstream of a retraction-mode aggregate run only at commit time on the post-recompute aggregate emits, so non-deterministic CXL builtins (e.g. now) evaluate exactly once per output row and need no special-casing.

Synthetic correlation column

A retraction-mode aggregate emits one engine-managed $ck.aggregate.<name> column on its output schema, alongside the user-emitted bindings. The column carries the aggregator’s per-group index at finalize and is the lineage hook that lifts the post-aggregate retract path: a Transform or Output that fails on an aggregate output row carries the synthetic column on the failing record, the orchestrator’s detect phase decodes the index back to the contributing source row ids via the retained aggregator’s input_rows table, and the recompute phase retracts those source rows just as it would retract a directly-failing source record. Authors never write or read $ck.aggregate.<name> — the column is hidden from default writer output (mirroring the source-CK shadow column posture) and lives outside any user-visible CXL surface.

Where retraction triggers are sourced

Retraction is fine-grained for failures upstream of a retraction-mode aggregate (Source ingest, Transform evaluation, Combine probe, Validation): the failing record carries $ck.<field> shadow columns, the engine identifies its correlation group from those columns, and retract_row removes that record’s specific contribution from every affected aggregate group while leaving every other contributing record intact.

Failures downstream of a retraction-mode aggregate (a Transform that fails on an aggregate output row, an Output writer that rejects an aggregate row) carry the synthetic $ck.aggregate.<name> lineage column described above. The detect phase resolves that column to the contributing source row ids and feeds them into the same recompute pipeline as upstream failures. The end-to-end demo at examples/pipelines/retract-demo/ runs both surfaces in one pipeline.

Combine interaction

Every combine declares propagate_ck: to select which correlation-key fields its output rows carry:

  • propagate_ck: driver – output inherits only the driver input’s correlation identity. Build-side records contribute fields to the output but their group identity is consumed by the match. This reproduces the behavior combines had before propagate_ck was introduced; existing strict-correlation pipelines stay on this setting.
  • propagate_ck: all – output carries the union of correlation-key fields across every input. Use when the build side carries CK fields that downstream operators need to read (for example, a build-side stream is also subject to correlation-driven DLQ on its own keys).
  • propagate_ck: { named: [<field>, ...] } – output carries exactly the named subset, intersected with what is actually present upstream. Use to project a multi-field correlation key down to a single field after a join.
- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: ./data/orders.csv
    correlation_key: employee_id
    schema:
      - { name: employee_id, type: string }
      - { name: amount, type: float }

- type: source
  name: departments
  config:
    name: departments
    type: csv
    path: ./data/departments.csv
    correlation_key: employee_id
    schema:
      - { name: employee_id, type: string }
      - { name: dept, type: string }

- type: combine
  name: enriched
  input:
    o: orders                          # driver
    d: departments                     # build side
  config:
    where: "o.employee_id == d.employee_id"
    match: first
    on_miss: skip
    cxl: |
      emit employee_id = o.employee_id
      emit amount = o.amount
      emit dept = d.dept
    propagate_ck: driver
- type: combine
  name: enriched_all
  input:
    o: orders                          # both sources declare correlation_key
    d: departments
  config:
    where: "o.employee_id == d.employee_id"
    cxl: |
      emit employee_id = o.employee_id
      emit dept = d.dept
    propagate_ck: all                  # union of every input's CK columns

Under propagate_ck: driver, output rows from enriched carry the $ck.employee_id value from the driver record, regardless of which department record matched. A trigger error on a driver record DLQ’s that driver’s whole correlation group, including any combine output rows that were already produced for that group.

Under propagate_ck: all (or { named: [...] }), the combine widens its output schema with the build-side $ck.<field> columns it propagates, and the runtime copies the matched build record’s values into those columns. Driver wins on a name collision: if both the driver and a build input declare $ck.<field>, the column appears once on the output schema and the runtime keeps the driver’s value – the build’s value would only land if the driver’s slot was null, which never happens for a same-named CK field that the driver itself observes.

Match-mode interaction:

  • match: first – one matched build per driver row; that build’s $ck.<field> fills the propagated slot.
  • match: all – one output row per matched build; each row carries its own matched build’s $ck.<field>.
  • match: collect – one synthesized output row per driver. The propagated $ck.<field> slot is single-valued: the first matched build’s CK fills it. Every matched build’s full payload still rides inside the array column via Value::Map, so per-build lineage is preserved at the cost of single-valued addressing on the propagated slot.

This rule holds across all combine execution paths: the hash-join path, the IEJoin range-predicate path, the grace-hash spill path, the sort-merge path, and chained combines (combine consuming the output of another combine).
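
The three propagation modes and the driver-wins collision rule can be summarised as two small functions. This is a sketch over plain lists and dictionaries: `propagated_ck_fields` and `fill_ck_slots` are hypothetical names, and the ordering of the returned field list is a choice made for this illustration.

```python
from typing import Dict, List, Optional


def propagated_ck_fields(
    mode: str,
    driver_ck: List[str],
    build_ck: List[str],
    named: Optional[List[str]] = None,
) -> List[str]:
    """Which CK fields the combine output carries under each mode."""
    if mode == "driver":
        return list(driver_ck)
    union = list(driver_ck) + [f for f in build_ck if f not in driver_ck]
    if mode == "all":
        return union
    if mode == "named":
        # The named subset is intersected with what is present upstream.
        return [f for f in union if f in (named or [])]
    raise ValueError(f"unknown propagate_ck mode: {mode}")


def fill_ck_slots(
    fields: List[str],
    driver_rec: Dict[str, object],
    build_rec: Dict[str, object],
) -> Dict[str, object]:
    """Driver wins on a name collision; the build value lands only on a null slot."""
    out: Dict[str, object] = {}
    for field in fields:
        column = f"$ck.{field}"
        driver_value = driver_rec.get(column)
        out[column] = driver_value if driver_value is not None else build_rec.get(column)
    return out
```

With a driver keyed on order_id and a build side keyed on [customer_id, region], mode all yields all three fields, and `named=["region", "missing"]` yields only region because missing is not present upstream.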

The drive: field on a combine selects which input is the driver. Choose the side that carries the authoritative group identity for downstream DLQ routing – typically the larger or more transactional stream.

propagate_ck is a required field with no default value – every combine must spell out which propagation mode it uses. Existing pipelines migrate by adding propagate_ck: driver to keep today’s behavior.

Composition interaction

A composition’s body operates on records flowing in from the parent pipeline. The correlation-key shadow columns flow into composition inputs and back out the named ports unchanged. Compositions cannot declare their own correlation key — CK is a property of a source’s identity, not of the composition body that consumes records from one.

Operator-by-operator retraction cost reference

An aggregate whose group_by omits any upstream CK field activates the retraction protocol automatically. Each operator on the post-source DAG carries a different cost profile under retraction; the table below summarizes the per-operator footprint so you can size memory and pick propagate_ck settings before pipelines hit production.

| Operator | Retraction cost |
| --- | --- |
| Source | None at retraction time. The CK shadow columns are stamped at ingest; replay never re-reads the source file. |
| Transform | Runs only at commit time on post-recompute aggregate emits when sitting inside a deferred region. Cost = O(rows_emitted_post_recompute) per region member, no extra state held. Non-deterministic CXL builtins (e.g. now) evaluate exactly once per output row, same as on a non-retraction pipeline. |
| Aggregate (strict, group_by covers upstream CK lattice) | None. Strict aggregates short-circuit to today’s two-phase commit body and pay zero retraction overhead. |
| Aggregate (retraction-mode, Reversible bindings) | Per-row lineage map (input_row_id → group_index) carried alongside accumulator state (~8 bytes/row plus the per-group input_rows Vec inline cost), plus one synthetic $ck.aggregate.<name> shadow column on every output row at ~16 bytes/row. Retract is O(retracted_rows) reverse-op calls plus one finalize_in_place. Reversible accumulators: sum, count, collect, any. |
| Aggregate (retraction-mode, BufferRequired bindings) | Per-group raw contributions held until commit, plus one synthetic $ck.aggregate.<name> shadow column on every output row at ~16 bytes/row. Memory cost = O(input_rows × Σ binding_value_size) plus the synthetic-column tail. Retract recomputes affected groups from contributions − retracted_rows. BufferRequired accumulators: min, max, avg, weighted_avg. |
| Combine (driver propagation) | One propagated $ck.<field> slot from the driver record. No retraction state held by the combine itself; replay carries upstream deltas through. |
| Combine (propagate_ck: all / named: [...]) | Same per-row cost as driver propagation, plus the widened output schema’s $ck.<field> columns must be re-populated on replay. Cost scales with the output schema width, not retraction frequency. |
| Window (streaming) | None. Streaming windows are incompatible with a retraction-mode aggregate whose dropped CK fields overlap partition_by; the plan-time derivation switches such windows into buffer mode. |
| Window (buffer-mode) | Per-partition raw row buffers held until commit. Memory cost = O(largest partition × per-row-size). Retract reruns the configured $window.* evaluation over partition − retracted_rows. Covers all 13 $window.* builtins uniformly via wholesale recompute. |
| Output | Holds retracted rows in correlation_buffers until commit. Replay substitutes the post-retract row in place; clean records flush to the writer, dirty records DLQ per the resolved correlation_fanout_policy. |
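
As a rough worked example of the Reversible-bindings row above, the sketch below adds up the two per-row figures quoted there: about 8 bytes per input row for the lineage map and about 16 bytes per output row for the synthetic shadow column. The function name and the row counts are illustrative assumptions, not part of Clinker. The estimate leaves out the per-group input_rows Vec cost and the accumulator state itself, so it is best read as a lower bound rather than a measurement.

```python
# Back-of-envelope lower bound for retraction-mode overhead (Reversible bindings).
# The 8 and 16 byte figures are the approximate per-row costs quoted in the table;
# the helper itself is hypothetical and not a Clinker API.
LINEAGE_BYTES_PER_INPUT_ROW = 8
SYNTHETIC_CK_BYTES_PER_OUTPUT_ROW = 16


def reversible_retraction_overhead(input_rows: int, output_rows: int) -> int:
    """Return an approximate lower bound, in bytes, for the extra retraction state."""
    lineage = input_rows * LINEAGE_BYTES_PER_INPUT_ROW
    synthetic_ck = output_rows * SYNTHETIC_CK_BYTES_PER_OUTPUT_ROW
    return lineage + synthetic_ck


if __name__ == "__main__":
    # Assumed workload: 10 million input rows aggregated into 50,000 output rows.
    overhead = reversible_retraction_overhead(10_000_000, 50_000)
    print(f"{overhead / 1_000_000:.1f} MB")  # 80.8 MB
```

For live numbers, the === Retraction === section of --explain remains the authoritative source.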

The --explain output’s === Retraction === section reports the live per-aggregate / per-window detail derived from the current pipeline, including the per-aggregate synthetic-CK column and its 16-byte/output-row cost. The clinker metrics collect spool reports the runtime counterpart: correlation.retract.groups_recomputed, .partitions_recomputed, .subdag_replay_rows, .output_rows_retracted_total, .degrade_fallback_count, .synthetic_ck_columns_emitted_total, .synthetic_ck_fanout_lookups_total, .synthetic_ck_fanout_rows_expanded_total. Use the explain block for plan-time capacity sizing, the metrics spool for post-run confirmation.

When retraction’s preconditions break at runtime (an aggregate spilled before retract reached it, or a window partition exceeded the memory budget), the orchestrator degrades to “DLQ entire affected group/partition” — the same strict-collateral DLQ shape every aggregate uses on the strict path. Each degrade increments correlation.retract.degrade_fallback_count; persistent non-zero values point at a tighter memory budget or a smaller correlation key cardinality.

Debugging

To see correlation-key shadow columns in writer output:

- type: output
  name: debug
  input: any_node
  config:
    type: csv
    path: "./debug.csv"
    include_correlation_keys: true

The output will contain extra columns named $ck.<field> (literal $ck. prefix in the CSV header) for each correlation-key field declared on the source whose records reach this output. The synthetic $ck.aggregate.<name> shadow column emitted by retraction-mode aggregates is also surfaced when this flag is enabled.

To investigate DLQ collaterals: every collateral entry’s category is correlated. The trigger entry in the same group carries the actual failure category and message.

Scoped Variables

Clinker’s scoped-variable system lets a pipeline read and write named values at three lifetimes: the pipeline run, the source, and the record. Variables are declared statically at pipeline top with their type and scope, read inline from CXL via the $pipeline.*, $source.*, and $record.* namespaces, and written exclusively by a dedicated state node.

The three scopes

| Scope | Lifetime | Reset | Reader namespace |
| --- | --- | --- | --- |
| pipeline | Entire pipeline run | Never (per run) | $pipeline.<key> |
| source | One per source file (Arc<str>-keyed) | Per source-file | $source.<key> |
| record | A single record as it flows through nodes | Per record | $record.<key> |

$record.<key> is a separate namespace from $meta.<key>. Metadata is written via emit $meta.x = ... from a transform and survives only to the immediate downstream operator. Record-scope vars survive the whole row pipeline (every transform along the row’s path can read them) but never serialize as output columns unless explicitly emitted as a regular column.
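
The reset behavior in the table can be pictured as three dictionaries that are cleared at different moments. The sketch below is a toy model only: ScopedRegistry and its method names are hypothetical, and it assumes that a variable declared without a default starts out as None (standing in for CXL null).

```python
# Toy model of the three scope lifetimes. The class and method names are
# hypothetical; only the reset behavior mirrors the table above.
class ScopedRegistry:
    def __init__(self, pipeline_defaults, source_defaults, record_defaults):
        self._source_defaults = dict(source_defaults)
        self._record_defaults = dict(record_defaults)
        self.pipeline = dict(pipeline_defaults)  # lives for the whole run
        self.source = {}                         # one dict per source file
        self.record = dict(record_defaults)      # replaced for every record

    def source_vars(self, source_file):
        # A file seen for the first time starts from the declared defaults.
        return self.source.setdefault(source_file, dict(self._source_defaults))

    def begin_record(self):
        # Record-scope values never leak from one record to the next.
        self.record = dict(self._record_defaults)


if __name__ == "__main__":
    reg = ScopedRegistry(
        {"cutoff_date": "2024-01-01"}, {"batch_id": None}, {"fuzzy_score": None}
    )
    reg.source_vars("a.csv")["batch_id"] = "B1"
    reg.record["fuzzy_score"] = 0.9
    reg.begin_record()
    print(reg.record["fuzzy_score"])              # None: reset per record
    print(reg.source_vars("a.csv")["batch_id"])   # B1: survives within the file
    print(reg.source_vars("b.csv")["batch_id"])   # None: fresh per source file
```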

Declaring variables

Every scoped variable must be declared in the pipeline’s top-level vars: block, named, scoped, typed, and optionally given a default:

pipeline:
  name: order_processing
  vars:
    pipeline:
      cutoff_date:
        type: date
        default: "2024-01-01"
      fuzzy_threshold:
        type: float
        default: 0.85
    source:
      batch_id:
        type: string
      ingestion_label:
        type: string
    record:
      fuzzy_score:
        type: float

Allowed types: int, float, string, bool, date, date_time.

Built-in members of each scope ($source.file, $source.row, $source.path, $source.count, $source.batch, $source.ingestion_timestamp; $pipeline.start_time, $pipeline.name, $pipeline.execution_id, $pipeline.batch_id, $pipeline.total_count, $pipeline.ok_count, $pipeline.dlq_count, $pipeline.filtered_count, $pipeline.distinct_count) are reserved — declaring a user variable with one of those names is rejected at parse time.

$source.count semantics

$source.count is the finalized per-source record total for the Source that produced the current record. It is observable only after that Source’s input stream closes:

  • Mid-stream reads (records emitted before the Source’s input closes — typical of Transform / Route / Window / Merge per-record evaluation) resolve to Null. The final count cannot be known before every record has been observed; the engine does not speculate or block.
  • Post-close reads (terminal aggregate emits, commit-time deferred dispatch, post-recompute paths, any record emitted after the originating Source’s mpsc::Receiver returned None) resolve to the per-source total.

Pipelines that previously used $source.count as a streaming denominator (e.g. value / $source.count) will now see Null from that division on mid-stream records. If you need a streaming row counter, declare a scope: source variable and increment it from a state writer — that gives you a running count instead of waiting for the final.
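
The difference between the finalized total and a running counter can be sketched in a few lines. SourceCounter is a hypothetical name, None stands in for CXL Null, and the model ignores everything about how the engine actually tracks stream closure.

```python
# Minimal model of the two read modes described above.
# SourceCounter is a hypothetical name, not an engine type.
class SourceCounter:
    def __init__(self):
        self.running = 0      # what a scope: source counter variable would track
        self._closed = False

    def observe(self):
        """Called once per record read from the source."""
        self.running += 1

    def close(self):
        """Called when the source's input stream reaches EOF."""
        self._closed = True

    @property
    def count(self):
        # Mid-stream the final total is unknowable, so the read yields None.
        return self.running if self._closed else None


if __name__ == "__main__":
    counter = SourceCounter()
    mid_stream = []
    for _ in range(3):
        counter.observe()
        mid_stream.append((counter.running, counter.count))
    counter.close()
    print(mid_stream)     # [(1, None), (2, None), (3, None)]
    print(counter.count)  # 3
```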

Reading variables

CXL access is identical for declared and built-in keys:

- type: transform
  name: filter_recent
  input: orders
  config:
    cxl: |
      emit id = id
      filter received_at > $pipeline.cutoff_date
      emit batch = $source.batch_id
      emit confidence = $record.fuzzy_score

Reads of undeclared keys are rejected with E200 (CXL name resolution failed) at compile time, with a “did you mean” suggestion that scans the declared registry.
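
A resolver of this kind can be sketched with Python's standard difflib module. The text above does not say which similarity measure the engine uses, so the matching strategy here is an assumption, and resolve is a hypothetical helper.

```python
import difflib


def resolve(name, declared):
    """Return name if it is declared; otherwise raise with a closest-match hint."""
    if name in declared:
        return name
    # Assumption: a generic string-similarity ranking stands in for whatever
    # scan the real resolver performs over the declared registry.
    hint = difflib.get_close_matches(name, declared, n=1)
    suffix = f"; did you mean '{hint[0]}'?" if hint else ""
    raise NameError(f"E200: undeclared scoped variable '{name}'{suffix}")


if __name__ == "__main__":
    declared = ["cutoff_date", "fuzzy_threshold"]
    try:
        resolve("cutof_date", declared)
    except NameError as err:
        print(err)
```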

Writing variables: the state node

The only way to mutate a scoped variable is a dedicated state node. The node is a pass-through for records — its input record forwards unchanged on the output edge — but evaluates its set: assignments and writes the results into the appropriate scope-keyed runtime registry.

- type: state
  name: capture_header
  input: salesforce_in
  config:
    scope: source
    set:
      - var: batch_id
        cxl: "first(this.batch)"
      - var: ingestion_label
        cxl: "$source.file.file_stem()"

- type: state
  name: row_score
  input: enrich
  config:
    scope: record
    set:
      - var: fuzzy_score
        cxl: "fuzzy_match(this.name, $pipeline.canonical_name)"

Inline mutation from a regular transform (emit $pipeline.x = ...) is a parse error. The dedicated-node design keeps the dependency between writers and readers visible at plan time.

Init phase: pre-runtime population

A state node may declare phase: init to run to completion before any runtime-phase node sees a record:

- type: source
  name: config_src
  config:
    name: config_src
    type: csv
    path: config.csv
    schema:
      - { name: cutoff, type: int }

- type: aggregate
  name: max_agg
  input: config_src
  config:
    group_by: []
    cxl: |
      emit cap = max(cutoff)

- type: state
  name: precompute_cutoff
  input: max_agg
  config:
    scope: pipeline
    phase: init
    set:
      - var: cutoff_date
        cxl: "cap"

Init-phase nodes must be terminal — no runtime-phase node may consume from an init-phase state node. (Init-phase state nodes can chain through init-only descendants for compositions.) Use disjoint Sources for init vs runtime when you need both, since a Source shared between an init and a runtime branch only feeds the init pass.

Compile-time validation

Scoped variables earn their architectural payoff at plan time. Every reference and every writer is checked against a static registry, and every cross-DAG flow is verified against the topology.

| Code | What it catches |
| --- | --- |
| E107 | Channel var override declares a different type than the pipeline. |
| E109 | Channel targets a composition but carries vars: overrides. |
| E110 | Channel var name shadows a reserved system field for that scope. |
| E111 | Channel vars.source.<src> references an unknown source-node name. |
| E164 | An init-phase state node has a runtime descendant. |
| E171 | A reader is not a transitive DAG descendant of its writer. |
| E172 | Bare $source.<custom> read downstream of a Merge or Combine. |
| E173 | Composition body reads a parent scoped var without opting in. |
| E174 | Composition _compose.scoped_vars declares a different type than the parent. |
| E175 | An init-phase node reads a runtime-only writer’s variable. |
| E200 | A reference to an undeclared scoped variable (resolver-level failure). |

Duplicate declares: entries across Transforms (the same (scope, name) pair declared on two Transforms) are rejected at config-validation time, ahead of compilation. $pipeline, $source, and $record are flat shared namespaces; declare each name once and reference it from every consumer.

Each diagnostic carries the offending span plus secondary spans pointing at the conflicting writer or the parent declaration, so the report shows up where the user is reading or writing — not in some unrelated configuration block.
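
The topology checks in the table are reachability questions over the DAG. As an illustration of the E171 rule (a reader must be a transitive descendant of its writer), the sketch below walks a small edge map breadth-first. The graph representation, node names, and function names are assumptions made for the example.

```python
from collections import deque


def descendants(edges, start):
    """All nodes reachable from start; edges maps a node to its consumers."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def check_reader_after_writer(edges, writer, reader):
    """Return None when the reader is downstream, else an E171-style message."""
    if reader in descendants(edges, writer):
        return None
    return (f"E171: '{reader}' reads a variable written by '{writer}' "
            f"but is not downstream of it")


if __name__ == "__main__":
    # Hypothetical topology: two independent branches.
    edges = {
        "src": ["capture_header"],
        "capture_header": ["enrich"],
        "other_src": ["side_branch"],
    }
    print(check_reader_after_writer(edges, "capture_header", "enrich"))
    print(check_reader_after_writer(edges, "capture_header", "side_branch"))
```

The same traversal, run from an init-phase state node and looking for any runtime-phase node, would express the E164 rule.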

Post-merge access: qualified $source.<input>.<key>

After a Merge or Combine, the bare $source.<custom> form is ambiguous: each record carries its own source’s value, but the reader’s intent is usually to compare across inputs. E172 rejects the unqualified form and the qualified form is the legal alternative:

- type: transform
  name: read_after_merge
  input: merged
  config:
    cxl: |
      emit id = id
      emit lt = $source.left_input.left_label
      emit rt = $source.right_input.right_label

The <input_name> segment matches the named input on the Combine (its IndexMap key) or the upstream node name on the Merge.

Composition opt-in

A composition body cannot see parent scoped variables by default — the seal is enforced by E173. To pass values across the boundary, the composition declares the schema of parent vars it consumes in its _compose.scoped_vars block:

# read_pipeline_var.comp.yaml
_compose:
  name: read_pipeline_var
  inputs:
    inp:
      schema:
        - { name: id, type: int }
  outputs:
    out: tap
  scoped_vars:
    pipeline:
      cutoff:
        type: int

nodes:
  - type: transform
    name: tap
    input: inp
    config:
      cxl: |
        emit id = id
        emit cutoff_seen = $pipeline.cutoff

The parent must declare cutoff with the matching type; mismatches raise E174.

What scoped variables are not

These are intentional non-features:

  • No persistence across runs. State is in-memory only. A pipeline run starts with declaration defaults; the writes don’t survive the process.
  • No inline emit $pipeline.x writes. Convenience-style mutation from a transform body is forbidden — empirical evidence from comparable engines shows it leads to race conditions and hidden DAG dependencies.
  • No dynamic var creation. The set of variables is closed at plan time, by design. This bounds memory and makes the validation matrix above tractable.

Channel overrides

A channel can both override a pipeline’s declaration defaults and add new entries across all four registries ($vars.*, $pipeline.*, $source.*, $record.*). Each registry has its own sub-block under vars: on a .channel.yaml, and each entry uses the same { type, default } shape that pipeline-side declarations use:

# Pipeline declarations
pipeline:
  name: orders
  vars:
    fuzzy_threshold: { type: float, default: 0.85 }   # $vars.*
nodes:
  - type: source
    name: orders_src
    config: { name: orders_src, type: csv, path: in.csv,
              schema: [{ name: id, type: int }] }
  - type: transform
    name: enrich
    input: orders_src
    config:
      declares:
        - { name: cutoff_date,  scope: pipeline, type: date,   default: "2024-01-01" }
        - { name: ingest_label, scope: source,   type: string, default: "prod" }
        - { name: tier,         scope: record,   type: string, default: "bronze" }
      cxl: |
        emit id = id

# channels/acme-prod.channel.yaml
channel:
  name: acme-prod
  target: ./pipelines/orders.yaml
vars:
  static:
    fuzzy_threshold: { type: float, default: 0.95 }
  pipeline:
    cutoff_date: { type: date, default: "2026-01-01" }
  source:
    orders_src:
      ingest_label: { type: string, default: "acme-prod" }
  record:
    tier: { type: string, default: "platinum" }

Override semantics (entry name already declared) require the channel’s type to match the declared type — mismatches produce E107. Add semantics (entry name not yet declared) extend the registry with a new declaration. $source overrides are keyed by source-node name; an unknown source name produces E111. The reserved-name guard (E110) blocks channels from shadowing system fields like $pipeline.execution_id or $source.path. Channels that target a .comp.yaml may not carry vars: (E109 if they do).
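
The override and add rules amount to a guarded dictionary merge. The sketch below models one registry at a time. apply_overlay is a hypothetical helper, the RESERVED table lists only a few of the reserved names mentioned earlier, and two checks are deliberately left out: the typecheck of the default value itself and the source-node lookup behind E111.

```python
# Partial list of reserved names, for illustration only.
RESERVED = {
    "pipeline": {"execution_id", "start_time", "name"},
    "source": {"path", "row", "file", "count"},
}


def apply_overlay(scope, declared, overlay):
    """Merge channel entries into one registry and return a new dict."""
    merged = dict(declared)
    for name, entry in overlay.items():
        if name in RESERVED.get(scope, set()):
            raise ValueError(f"E110: '{name}' is reserved in scope '{scope}'")
        if name in declared and declared[name]["type"] != entry["type"]:
            raise ValueError(
                f"E107: '{name}' declared {declared[name]['type']}, "
                f"channel declares {entry['type']}"
            )
        merged[name] = dict(entry)  # same assignment covers override and add
    return merged


if __name__ == "__main__":
    declared = {"cutoff_date": {"type": "date", "default": "2024-01-01"}}
    overlay = {
        "cutoff_date": {"type": "date", "default": "2026-01-01"},  # override
        "region": {"type": "string", "default": "eu"},             # add
    }
    print(apply_overlay("pipeline", declared, overlay))
```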

See Channels for the full overlay rules and the channel manifest reference.

Document Envelope Context ($doc.*)

Many enterprise file formats wrap their record body in an envelope: named sections that surround the records and carry document-level metadata — a batch header with a run date and batch id, a trailer with a record count and checksum, or arbitrary sibling sections. Clinker exposes these sections to CXL through the $doc.<section>.<field> namespace.

sources:
  - name: payments
    path: data/payments.xml
    format: xml
    envelope:
      sections:
        BatchInfo:
          extract: { xml_path: "/payments/BatchInfo" }
          fields:
            batch_id: string
            run_date: date
        Summary:
          extract: { xml_path: "/payments/Summary" }
          fields:
            record_count: int
            checksum: string

A downstream transform reads any declared section field on every body record:

nodes:
  - transform: tag
    inputs: { in: payments }
    project:
      - batch: $doc.BatchInfo.batch_id
      - expected_total: $doc.Summary.record_count
      - amount: amount

Section names are yours

The engine reserves no section names. BatchInfo and Summary above are arbitrary identifiers chosen by the pipeline author — Head / Foot, preamble / trailer, batch_metadata / eob_summary are all equally valid. A section name is whatever string you put in the sections: map; CXL exposes it verbatim as $doc.<that_name>.<field>.

All sections are available everywhere in the body stream

Before the first body record streams from a file, the reader runs a one-time envelope pre-scan that extracts every declared section — no matter where it physically sits in the file. A header at the top and a trailer at the bottom are both pulled out up front. The result: every body record sees every $doc.<section>.<field> value, from the first record to the last.

This means a trailer field is available during body streaming, not just at end-of-file. A pipeline can compute, on every row, a ratio against the trailer’s total:

project:
  - running_fraction: row_index / $doc.Summary.record_count

The pre-scan reads the envelope-bearing segments of the file before body streaming begins. Envelope payloads are small (a few hundred bytes per document is typical), but reaching a trailing section requires the reader to have buffered the file — so envelope-aware sources hold the source file’s bytes in memory for the lifetime of the read. Body records still stream one at a time; only the envelope sections (not the body) live in the document context.

$doc.* is not the file in memory. It holds the parsed envelope sections only — body records flow through the pipeline one at a time, and the only stages that buffer multiple records are the usual blocking operators (Aggregate, Sort, grace-hash Combine) under the standard RSS budget.

Extract rules per format

Each section declares how the reader locates its payload:

| Format | extract: key | Value |
| --- | --- | --- |
| XML | xml_path | Slash-path to the section element, e.g. /doc/Head |
| JSON | json_pointer | RFC 6901 pointer, e.g. /Head |
| EDIFACT | segment | A service-segment tag; only UNB |
| X12 | segment | A service-segment tag; only ISA (GS/ST surface as nested levels) |

Declaring an xml_path section against a JSON source (or vice versa), or a segment extract against XML/JSON, is a configuration error and fails fast when the source opens, rather than silently producing empty sections. CSV and fixed-width sources do not yet support envelope extraction; declaring envelope sections on those formats is a no-op today.
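
The per-format rules can be read as a small lookup table. The sketch below is one way to express them: validate_extract is a hypothetical helper, the lowercase format names are an assumption, and returning False for formats without envelope support mirrors the no-op behavior described above for CSV and fixed-width sources.

```python
# Format -> (required extract key, allowed values or None for "any string").
ALLOWED_EXTRACT = {
    "xml": ("xml_path", None),
    "json": ("json_pointer", None),
    "edifact": ("segment", {"UNB"}),
    "x12": ("segment", {"ISA"}),
}


def validate_extract(fmt, extract):
    """Return True if the section will be extracted, False if it is ignored.

    Raises ValueError when the extract rule does not fit the source format.
    """
    if fmt not in ALLOWED_EXTRACT:
        return False  # e.g. csv or fixed-width: declared sections are a no-op
    key, allowed_values = ALLOWED_EXTRACT[fmt]
    if set(extract) != {key}:
        raise ValueError(f"{fmt} sections must use '{key}', got {sorted(extract)}")
    if allowed_values is not None and extract[key] not in allowed_values:
        raise ValueError(f"{fmt} segment must be one of {sorted(allowed_values)}")
    return True


if __name__ == "__main__":
    print(validate_extract("xml", {"xml_path": "/doc/Head"}))  # True
    print(validate_extract("csv", {"xml_path": "/doc/Head"}))  # False
```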

EDIFACT segment extract

An EDIFACT source exposes its interchange header UNB as an envelope section. The section’s field names are the positional element keys e01, e02, … :

envelope:
  sections:
    interchange:
      extract: { segment: "UNB" }
      fields:
        e05: string          # interchange control reference

Only the UNB header is extractable. EDIFACT is scanned as a flat byte stream with only the header pre-read, so trailer segments (UNT, UNZ) that arrive after the body are not envelope sections — their control counts are validated inline by the reader instead. A segment extract naming any tag other than UNB is rejected at startup. See EDIFACT Format for the full reference.

A JSON example:

sources:
  - name: payments
    path: data/payments.json
    format: json
    record_path: records
    envelope:
      sections:
        Head:
          extract: { json_pointer: "/Head" }
          fields:
            batch_id: string
        Foot:
          extract: { json_pointer: "/Foot" }
          fields:
            count: int

against:

{
  "Head": { "batch_id": "RUN-001" },
  "records": [ { "amount": 10 }, { "amount": 20 } ],
  "Foot": { "count": 2 }
}
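
To make the json_pointer lookups concrete, here is a small RFC 6901 resolver applied to the document above. It is a standalone sketch rather than Clinker's reader: resolve_pointer is a hypothetical function, and returning None for an absent path is an assumption chosen to match the null convention described for missing sections.

```python
import json


def resolve_pointer(doc, pointer):
    """Resolve an RFC 6901 JSON Pointer; return None if the path is absent."""
    if pointer == "":
        return doc
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    node = doc
    for token in pointer[1:].split("/"):
        # RFC 6901 unescaping: '~1' first, then '~0'.
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(node, dict) and token in node:
            node = node[token]
        elif (isinstance(node, list) and token.isascii() and token.isdigit()
              and int(token) < len(node)):
            node = node[int(token)]
        else:
            return None
    return node


if __name__ == "__main__":
    text = """
    {
      "Head": { "batch_id": "RUN-001" },
      "records": [ { "amount": 10 }, { "amount": 20 } ],
      "Foot": { "count": 2 }
    }
    """
    doc = json.loads(text)
    print(resolve_pointer(doc, "/Head"))        # {'batch_id': 'RUN-001'}
    print(resolve_pointer(doc, "/Foot/count"))  # 2
    print(resolve_pointer(doc, "/Missing"))     # None
```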

Typed fields

Each section’s fields: map declares the field name and its type, drawn from the same small vocabulary as source schemas: string, int, float, bool, date, date_time. The extracted raw value is coerced to the declared type at pre-scan time; a value that cannot coerce (e.g. a non-numeric string declared int) fails the source with a diagnostic naming the section, field, and offending value.

A field that the document does not carry resolves to null; $doc.* follows the same missing-value convention as $source.* and $pipeline.*. A section that the document does not carry at all is simply absent from the context; any $doc.<missing_section>.<field> resolves to null.
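
The coercion step can be sketched as a table of converters keyed by type name. Several details here are assumptions, because the accepted literal formats are not spelled out above: dates and date-times are parsed as ISO 8601, booleans accept only true and false, and raw values arrive as strings. coerce_field is a hypothetical helper.

```python
from datetime import date, datetime


def _to_bool(raw):
    text = raw.strip().lower()
    if text in ("true", "false"):
        return text == "true"
    raise ValueError(raw)


# Assumed parsing rules per declared type (ISO 8601 for the date types).
COERCERS = {
    "string": str,
    "int": int,
    "float": float,
    "bool": _to_bool,
    "date": date.fromisoformat,
    "date_time": datetime.fromisoformat,
}


def coerce_field(section, field, raw, type_name):
    """Coerce one raw envelope value; a missing value (None) stays None."""
    if raw is None:
        return None
    try:
        return COERCERS[type_name](raw)
    except ValueError as err:
        # The diagnostic names the section, field, and offending value.
        raise ValueError(
            f"section '{section}', field '{field}': "
            f"cannot coerce {raw!r} to {type_name}"
        ) from err


if __name__ == "__main__":
    print(coerce_field("Summary", "record_count", "42", "int"))        # 42
    print(coerce_field("BatchInfo", "run_date", "2024-01-31", "date"))  # 2024-01-31
    print(coerce_field("Summary", "checksum", None, "string"))         # None
```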

One document per file

Each source file is its own document with its own envelope context. When a source matches multiple files (via glob: / paths:), each file gets a fresh document context with its own section values. Records from different files never share a context — a record’s $doc.* always reflects the file that record came from.

Document boundaries flow through the pipeline as inline punctuation signals (one when a document opens, one when it closes). These signals let document-scoped operators — for example a future per-document aggregate flush or trailer-count validation — fire at exactly the right point. Today the signals propagate through Source, Transform, Route, Sort, and Combine, and are reconciled at Merge (a document that fans in through several branches closes downstream exactly once).

Nested (multi-level) envelopes

Some formats wrap their records in several envelope levels, one inside another. EDI X12 is the canonical example and the first format that implements this: an interchange (ISA/IEA) contains one or more functional groups (GS/GE), each containing one or more transaction sets (ST/SE), each containing the records. A single file can carry multiple interchanges back to back. See X12 Format for the full reference.

A reader for such a format opens and closes each nested level as it crosses the corresponding envelope boundary mid-file. Each level contributes its own sections to $doc. There is no new $doc syntax for nesting — every level’s sections are read through the same two-level $doc.<section>.<field> lookup. A record inside the innermost level sees every enclosing level’s sections at once. For X12 the interchange header is a declared segment: "ISA" envelope section (you choose its name), while the GS group and ST set surface automatically as the reader-supplied sections functional_group and transaction_set, each keyed by positional eNN elements:

project:
  - interchange_control: $doc.interchange.e13        # ISA13, declared section
  - functional_id:       $doc.functional_group.e01   # GS01 (reader-supplied)
  - transaction_type:    $doc.transaction_set.e01     # ST01 (reader-supplied)
  - claim_amount:        amount                       # body field

A record streamed inside the ST level resolves the ST section, the enclosing GS section, and the outermost ISA section, all at once: each inner level inherits every enclosing level’s sections as siblings in one flat namespace. If two levels declare a section with the same name, the innermost wins for records inside it — the same shadowing rule a nested scope follows in any language. Picking distinct per-level names (as above) keeps every level independently visible.
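
The flat-namespace lookup with innermost-wins shadowing behaves like Python's collections.ChainMap, which searches its mappings left to right. The sketch below uses that as an analogy only. doc_context is a hypothetical helper, and the element values are illustrative placeholders.

```python
from collections import ChainMap


def doc_context(*levels):
    """Build a flat $doc view from section maps listed outermost first."""
    # ChainMap searches left to right, so the innermost level goes first.
    return ChainMap(*reversed(levels))


if __name__ == "__main__":
    isa = {"interchange": {"e13": "000000905"}}
    gs = {"functional_group": {"e01": "HC"}}
    st = {"transaction_set": {"e01": "837"}}

    doc = doc_context(isa, gs, st)
    # A record inside the ST level sees all three levels at once.
    print(doc["interchange"]["e13"])      # 000000905
    print(doc["transaction_set"]["e01"])  # 837

    # Two levels declaring the same section name: the innermost wins.
    shadowed = doc_context({"header": {"level": "outer"}},
                           {"header": {"level": "inner"}})
    print(shadowed["header"]["level"])    # inner
```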

Boundaries nest correctly through the pipeline: each level opens before the records inside it and closes after them, in strict innermost-first order. A level that fans in through several branches is still reconciled once at Merge, exactly like a single-level document.

Header-only interchanges

A multi-level envelope file can legitimately carry an interchange whose body is empty — envelope structure (an interchange header, and possibly inner group headers) with zero records inside. Such an interchange still opens a document and emits its open/close boundaries, so downstream operators and trailer-count validation observe it just like any other document. The interchange’s $doc.* sections are extracted and the boundaries flow even though no body record ever streams from it.

The same holds for an empty inner envelope — an open/close pair with no records between — and for an inner envelope that opens or closes after the file’s last body record. Every envelope boundary a reader signals is applied, whether or not a record follows it, so the document frame stays balanced end to end.
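
A balanced document frame is the classic matched-brackets property, which a stack checks in one pass. The event tuples and the check_balanced function below are assumptions made for illustration; the engine's actual punctuation signals are not shown in this chapter.

```python
def check_balanced(events):
    """Return True if every opened level closes, innermost first.

    events is an iterable of (kind, level) pairs where kind is
    "open", "close", or "record" (level is ignored for records).
    """
    stack = []
    for kind, level in events:
        if kind == "open":
            stack.append(level)
        elif kind == "close":
            if not stack or stack[-1] != level:
                return False  # closed out of order, or never opened
            stack.pop()
    return not stack  # anything left on the stack never closed


if __name__ == "__main__":
    # Header-only interchange: envelope structure with zero body records.
    header_only = [("open", "ISA"), ("open", "GS"),
                   ("close", "GS"), ("close", "ISA")]
    print(check_balanced(header_only))  # True

    out_of_order = [("open", "ISA"), ("open", "GS"),
                    ("close", "ISA"), ("close", "GS")]
    print(check_balanced(out_of_order))  # False
```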

Channels

Channels enable multi-tenant pipeline customization. A single pipeline definition can be run with different configurations per client, environment, or business unit – without duplicating or modifying the base YAML.

A .channel.yaml file declares a target pipeline (or composition), composition-level config knobs, and overrides/adds for the four scoped-variable registries the pipeline reads.

Channel manifest

# channels/staging.channel.yaml
channel:
  name: staging
  target: ./pipelines/my_pipeline.yaml

# Composition-level config knobs (DottedPath keys: alias.param)
config:
  default:
    enrich1.fuzzy_threshold: 0.85
  fixed:
    enrich1.lookup_table: "s3://acme/lookups/staging.csv"

# Variable overrides / adds (issue #45)
vars:
  static:                          # overrides + adds for $vars.*
    fuzzy_threshold:
      type: float
      default: 0.92
  pipeline:                        # overrides + adds for $pipeline.*
    cutoff_date:
      type: date
      default: "2026-01-01"
  source:                          # per-source-name overrides + adds for $source.*
    orders:
      ingest_label:
        type: string
        default: "staging"
  record:                          # overrides + adds for $record.*
    tier:
      type: string
      default: "bronze"

Top-level fields

| Field | Required | Description |
| --- | --- | --- |
| channel.name | Yes | Channel identifier; used in --channel, path templates, and the channel-identity stamp on the compiled plan. |
| channel.target | Yes | Path to the target pipeline (*.yaml) or composition (*.comp.yaml). |
| config.default / config.fixed | No | Composition-config overlays. default can be overridden by a higher layer; fixed cannot. |
| vars.* | No | See Variable overrides below. |

Running with a channel

clinker run pipeline.yaml --channel ./channels/staging.channel.yaml

--channel loads the binding once, validates it against the compiled plan, applies the overlay, and seeds the executor’s eval context before any record-stream-phase node runs. The channel name is also available as the {channel} token in output path templates.

If channel.target does not match the loaded <config> path, clinker emits W104 and proceeds — the operator may have a legitimate reason to run a sibling pipeline against the same channel.

Variable overrides

A pipeline exposes four scoped-variable registries:

| Read syntax | Lifetime | Pipeline declaration site |
| --- | --- | --- |
| $vars.<key> | Frozen at pipeline start | Top-level vars: { key: { type, default } } |
| $pipeline.<key> | Pipeline-wide, mutable | Transform declares: [{ name, scope: pipeline, type, default? }] |
| $source.<key> | Per-source-file, mutable | Transform declares: [{ name, scope: source, type, default? }] |
| $record.<key> | Per-record, mutable | Transform declares: [{ name, scope: record, type, default? }] |

Each registry has a corresponding sub-block under vars: on a channel YAML. Every entry uses the same { type, default } shape the pipeline declarations use:

  • Override — entry name already exists in the registry. The channel-supplied type MUST equal the declared type (mismatch → E107). The channel default replaces the declared default after passing the same typecheck pipeline declarations use.
  • Add — entry name not yet declared. The full { type, default } becomes the new declaration in that registry.

Source overrides are keyed by source-node name (vars.source.<src>.<var>). Adds and overrides on $source apply to every file the named source ingests; an unknown source name produces E111.

Reserved-system fields

Each scope has a small set of reserved field names that the engine populates (e.g. $pipeline.execution_id, $source.path, $source.row, $pipeline.start_time). Channels cannot shadow these — attempting it produces E110, naming the offending scope and field. The full lists live in crates/clinker-core/src/config/mod.rs (RESERVED_PIPELINE_NAMES, RESERVED_SOURCE_NAMES, RESERVED_RECORD_NAMES); $vars.* has no reserved subset.

Composition-target channels

Channels that target a .comp.yaml may not carry a vars: block (composition var overlay is out of scope today) — the binding emits E109 if vars: is non-empty. Channel-config knobs (config: block) on composition targets continue to work as before.

Diagnostic codes

| Code | Meaning |
| --- | --- |
| E107 | Var override type mismatch (declared T, override declared U). |
| E109 | Var overrides not supported on composition channels. |
| E110 | Channel var shadows reserved system field for that scope. |
| E111 | vars.source.<src> references a source-node name not declared in the pipeline. |
| W103 | Channel config.* key did not match any composition parameter in the compiled plan. |
| W104 | channel.target does not match the <config> argument passed to clinker run. |

Cross-Transform declaration uniqueness

$pipeline, $source, and $record are flat shared namespaces. The same name declared on more than one Transform’s declares: is a config-validation error — clinker mirrors the fail-fast posture of Beam, Flink, Kafka Streams, Dagster, and post-fix dbt for shared-namespace key collisions. Authors who want shared state declare it once and reference everywhere.

Workspace discovery

Channels are part of the broader workspace system. Clinker discovers workspaces via clinker.toml files, which can define the channel directory layout and other workspace-level settings.

Compositions

Compositions are reusable pipeline fragments that can be imported into multiple pipelines. They encapsulate common transform patterns – date derivations, address normalization, currency conversion – into self-contained, testable units.

Using a composition

A composition node in your pipeline references an external .comp.yaml file:

- type: composition
  name: fiscal_dates
  input: invoices
  use: "./compositions/fiscal_date.comp.yaml"
  config:
    start_month: 4

The use: field points to the composition definition file. The config: block passes parameters that customize the composition’s behavior for this specific invocation.

Composition definition file

A .comp.yaml file declares the composition’s interface – what fields it requires from upstream and what fields it produces:

# compositions/fiscal_date.comp.yaml
composition:
  name: fiscal_date
  description: "Derive fiscal year, quarter, and period from a date field"

  requires:
    - { name: invoice_date, type: date }

  produces:
    - { name: fiscal_year, type: int }
    - { name: fiscal_quarter, type: string }
    - { name: fiscal_period, type: int }

  params:
    - name: start_month
      type: int
      default: 1
      description: "First month of the fiscal year (1-12)"

Composition fields

| Field | Required | Description |
| --- | --- | --- |
| name | Yes | Composition identifier |
| description | No | Human-readable purpose |
| requires | Yes | Input fields the composition needs from upstream (name + type) |
| produces | Yes | Output fields the composition adds to the record (name + type) |
| params | No | Configurable parameters with optional defaults |
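
The requires contract can be read as a schema check against the upstream node. The sketch below illustrates that intended contract under stated assumptions: missing_requirements is a hypothetical helper, type names are compared as exact strings, and nothing here reflects how the engine validates compositions internally.

```python
def missing_requirements(upstream_schema, requires):
    """Return messages for required fields that are absent or mistyped."""
    available = {f["name"]: f["type"] for f in upstream_schema}
    problems = []
    for req in requires:
        actual = available.get(req["name"])
        if actual is None:
            problems.append(f"missing field '{req['name']}'")
        elif actual != req["type"]:
            problems.append(
                f"field '{req['name']}' is {actual}, expected {req['type']}"
            )
    return problems


if __name__ == "__main__":
    upstream = [
        {"name": "invoice_id", "type": "int"},
        {"name": "invoice_date", "type": "date"},
    ]
    requires = [{"name": "invoice_date", "type": "date"}]
    print(missing_requirements(upstream, requires))  # []
```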

Advanced wiring

For compositions with multiple input or output ports, the node supports explicit port bindings:

- type: composition
  name: enrich_address
  input: customers
  use: "./compositions/address_normalize.comp.yaml"
  inputs:
    primary: customers
    reference: zip_lookup
  outputs:
    normalized: next_stage
  config:
    country_code: "US"
  resources:
    zip_database: "./data/zipcodes.csv"

Port and resource fields

| Field | Required | Description |
| --- | --- | --- |
| inputs | No | Map of composition input ports to upstream node references |
| outputs | No | Map of composition output ports to downstream node references |
| config | No | Parameter overrides (key-value pairs) |
| resources | No | External resource bindings (file paths, connection strings) |
| alias | No | Namespace prefix for expanded node names (avoids collisions) |

Complete example

pipeline:
  name: invoice_pipeline

nodes:
  - type: source
    name: invoices
    config:
      name: invoices
      type: csv
      path: "./data/invoices.csv"
      schema:
        - { name: invoice_id, type: int }
        - { name: customer_id, type: int }
        - { name: invoice_date, type: date }
        - { name: amount, type: float }

  - type: composition
    name: fiscal_dates
    input: invoices
    use: "./compositions/fiscal_date.comp.yaml"
    config:
      start_month: 4

  - type: transform
    name: final_enrich
    input: fiscal_dates
    config:
      cxl: |
        emit invoice_id = invoice_id
        emit customer_id = customer_id
        emit amount = amount
        emit fiscal_year = fiscal_year
        emit fiscal_quarter = fiscal_quarter

  - type: output
    name: result
    input: final_enrich
    config:
      name: result
      type: csv
      path: "./output/invoices_enriched.csv"

Current status

Note: Composition support is being built in Phase 16c. The YAML shape parses and validates, but compilation currently returns a diagnostic (E100) per composition node. The documentation above reflects the intended design. Full compilation and expansion will land when Phase 16c is complete.

CXL Overview

CXL (Clinker Expression Language) is a per-record expression language designed for ETL transformations. Every CXL program operates on one record at a time, producing output fields, filtering records, or computing derived values.

CXL is not SQL. There are no SELECT, FROM, or WHERE keywords. CXL programs are sequences of statements – emit, let, filter, distinct – that execute top to bottom against the current record.

Key differences from SQL

| SQL | CXL |
|---|---|
| SELECT col AS alias | emit alias = col |
| WHERE condition | filter condition |
| AND / OR / NOT | and / or / not (keywords) |
| && / \|\| / ! | Not supported – use keywords |
| COALESCE(a, b) | a ?? b |
| CASE WHEN ... THEN ... END | if ... then ... else ... or match { } |

Boolean operators are keywords

CXL uses English keywords for boolean logic, not symbols:

$ cxl eval -e 'emit result = true and false' --field dummy=1
{
  "result": false
}

The operators &&, ||, and ! are syntax errors in CXL. Always use and, or, and not.

System namespaces use $ prefix

CXL provides built-in namespaces for accessing pipeline state, metadata, and window functions. All system namespaces are prefixed with $:

  • $pipeline.* – pipeline execution context (name, counters, provenance)
  • $meta.* – per-record metadata
  • $window.* – window function calls
  • $vars.* – user-defined pipeline variables

$ cxl eval -e 'emit name = $pipeline.name'
{
  "name": "cxl-eval"
}

Compile-time type checking

CXL catches type errors before data processing begins. A program passes through four phases; the first three run at compile time and the last runs against each record:

  1. Parse – tokenize and build an AST from CXL source text
  2. Resolve – bind field references, validate method names, check arity
  3. Typecheck – infer types, validate operator compatibility, check method receiver types
  4. Eval – execute the typed program against each record

Errors at any phase produce rich diagnostics with source locations and fix suggestions via miette.

$ cxl check transform.cxl
ok: transform.cxl is valid

If there are type errors, the checker reports them with spans:

error[typecheck]: cannot apply '+' to String and Int (at transform.cxl:12)
  help: convert one operand — use .to_int() or .to_string()

A minimal CXL program

emit greeting = "hello"
emit doubled = amount * 2
filter amount > 0

This program:

  1. Emits a constant string field greeting
  2. Emits doubled as twice the input amount
  3. Filters out records where amount is not positive

Try it:

$ cxl eval -e 'emit greeting = "hello"' -e 'emit doubled = amount * 2' \
    --field amount=5
{
  "greeting": "hello",
  "doubled": 10
}

Statement order matters

CXL statements execute sequentially. Later statements can reference fields produced by earlier emit or let statements:

$ cxl eval -e 'let tax_rate = 0.21' -e 'emit tax = price * tax_rate' \
    --field price=100
{
  "tax": 21.0
}

A filter statement short-circuits execution – if the condition is false, remaining statements do not run and the record is excluded from output.
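
The sequential model described above can be sketched in Python. This is an illustrative model only, not Clinker's evaluator: the `run_program` helper and the tuple encoding of statements are hypothetical names introduced for this sketch.

```python
def run_program(statements, record):
    """Run (kind, name, fn) statements top to bottom against one record.

    Returns the output record, or None when a filter drops the record.
    """
    scope = dict(record)  # input fields plus earlier let/emit bindings
    output = {}
    for kind, name, fn in statements:
        value = fn(scope)
        if kind == "filter":
            if not value:
                return None  # short-circuit: remaining statements do not run
        elif kind == "let":
            scope[name] = value  # visible to later statements, never emitted
        elif kind == "emit":
            scope[name] = value
            output[name] = value
    return output


program = [
    ("let", "tax_rate", lambda s: 0.21),
    ("filter", None, lambda s: s["price"] > 0),
    ("emit", "tax", lambda s: s["price"] * s["tax_rate"]),
]

kept = run_program(program, {"price": 100})
dropped = run_program(program, {"price": -5})
```

Here `kept` contains only the `tax` field, because `let` bindings stay out of the output, and `dropped` is `None` because the filter condition was false.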

Types & Literals

CXL has 9 value types. Every field value, literal, and expression result is one of these types.

Value types

| Type | Rust backing | Description |
|---|---|---|
| Null | Value::Null | Missing or absent value |
| Bool | bool | true or false |
| Integer | i64 | 64-bit signed integer |
| Float | f64 | 64-bit double-precision float |
| String | Box<str> | UTF-8 text |
| Date | NaiveDate | Calendar date without timezone |
| DateTime | NaiveDateTime | Date and time without timezone |
| Array | Vec<Value> | Ordered collection of values |
| Map | IndexMap<Box<str>, Value> | Key-value pairs |

Literal syntax

Integers

Standard decimal notation. Negative values use the unary minus operator.

$ cxl eval -e 'emit a = 42' -e 'emit b = -5' -e 'emit c = 0'
{
  "a": 42,
  "b": -5,
  "c": 0
}

Floats

Decimal notation with a dot. Must have digits on both sides of the decimal point.

$ cxl eval -e 'emit a = 3.14' -e 'emit b = -0.5'
{
  "a": 3.14,
  "b": -0.5
}

Strings

Double-quoted or single-quoted. Supports escape sequences: \\, \", \', \n, \t, \r.

$ cxl eval -e 'emit greeting = "hello world"'
{
  "greeting": "hello world"
}

Booleans

The keywords true and false.

$ cxl eval -e 'emit flag = true' -e 'emit neg = not flag'
{
  "flag": true,
  "neg": false
}

Dates

Hash-delimited ISO 8601 format: #YYYY-MM-DD#.

$ cxl eval -e 'emit d = #2024-01-15#'
{
  "d": "2024-01-15"
}

Null

The keyword null.

$ cxl eval -e 'emit nothing = null'
{
  "nothing": null
}

Schema types

When declaring column types in YAML pipeline schemas, use these type names:

| Schema type | CXL type | Description |
|---|---|---|
| string | String | Text values |
| int | Integer | 64-bit integers |
| float | Float | 64-bit floats |
| bool | Bool | Boolean values |
| date | Date | Calendar dates |
| date_time | DateTime | Date and time |
| array | Array | Ordered collections |
| numeric | Int or Float | Union type – accepts either |
| any | Any | Unknown type – no type constraints |
| nullable(T) | Nullable(T) | Wrapper – value may be null |

Example YAML schema declaration:

schema:
  employee_id: int
  name: string
  salary: nullable(float)
  start_date: date

Type promotion

CXL automatically promotes types in mixed expressions:

Int + Float promotes to Float:

$ cxl eval -e 'emit result = 2 + 3.5'
{
  "result": 5.5
}

Null + T produces Nullable(T): Any operation involving null produces a nullable result.

$ cxl eval -e 'emit result = null + 5'
{
  "result": null
}

Nullable(A) + B unifies to Nullable(unified): When a nullable value meets a non-nullable value, the result type wraps the unified inner type in Nullable.

Type unification rules

The type checker follows these rules when two types meet in an expression:

  1. Same types unify to themselves: Int + Int produces Int
  2. Any unifies with anything: Any + T produces T
  3. Numeric resolves to the concrete type: Numeric + Int produces Int, Numeric + Float produces Float
  4. Int promotes to Float: Int + Float produces Float
  5. Null wraps: Null + T produces Nullable(T)
  6. Nullable propagates: Nullable(A) + B produces Nullable(unified(A, B))
  7. Incompatible types fail: String + Int is a type error
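
The seven rules can be read as a small function over type names. The sketch below is a Python model of the list above, not the actual type checker: the `unify` function, the string encoding of types, and the handling of cases the list does not cover (such as Null meeting Null) are assumptions made for illustration.

```python
def _unify_plain(a, b):
    """Unify two non-nullable type names; None means a type error."""
    if a == b:
        return a  # rule 1: same types unify to themselves
    if a == "Any":
        return b  # rule 2: Any unifies with anything
    if b == "Any":
        return a
    if a == "Numeric" and b in ("Int", "Float"):
        return b  # rule 3: Numeric resolves to the concrete type
    if b == "Numeric" and a in ("Int", "Float"):
        return a
    if {a, b} == {"Int", "Float"}:
        return "Float"  # rule 4: Int promotes to Float
    return None  # rule 7: incompatible types fail


def unify(a, b):
    def peel(t):
        # Split "Nullable(T)" into ("T", True); leave other names alone.
        if t.startswith("Nullable(") and t.endswith(")"):
            return t[len("Nullable("):-1], True
        return t, False

    a, a_nullable = peel(a)
    b, b_nullable = peel(b)

    # Rule 5: Null wraps the other side in Nullable.
    if a == "Null" or b == "Null":
        other = b if a == "Null" else a
        return "Null" if other == "Null" else f"Nullable({other})"

    inner = _unify_plain(a, b)
    if inner is None:
        return None
    # Rule 6: Nullable propagates around the unified inner type.
    return f"Nullable({inner})" if (a_nullable or b_nullable) else inner
```

For example, `unify("Nullable(Int)", "Float")` yields `"Nullable(Float)"`, while `unify("String", "Int")` yields `None` to stand in for a type error.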

Operators & Expressions

CXL provides arithmetic, comparison, boolean, null coalescing, and string operators. Boolean logic uses keywords (and, or, not), not symbols.

Arithmetic operators

| Operator | Description | Example |
|---|---|---|
| + | Addition (or string concatenation) | 2 + 3 |
| - | Subtraction | 10 - 4 |
| * | Multiplication | 3 * 5 |
| / | Division | 10 / 3 |
| % | Modulo (remainder) | 10 % 3 |

$ cxl eval -e 'emit result = 2 + 3 * 4'
{
  "result": 14
}

Multiplication binds tighter than addition, so 2 + 3 * 4 is 2 + (3 * 4) = 14, not (2 + 3) * 4 = 20.

$ cxl eval -e 'emit result = 10 % 3'
{
  "result": 1
}

Comparison operators

| Operator | Description | Example |
|---|---|---|
| == | Equal | x == 0 |
| != | Not equal | x != 0 |
| > | Greater than | x > 10 |
| < | Less than | x < 10 |
| >= | Greater than or equal | x >= 10 |
| <= | Less than or equal | x <= 10 |

$ cxl eval -e 'emit result = 5 > 3' --field dummy=1
{
  "result": true
}

Boolean operators

CXL uses keywords for boolean logic. The symbols &&, ||, and ! are not valid CXL syntax.

| Operator | Description | Example |
|---|---|---|
| and | Logical AND | a and b |
| or | Logical OR | a or b |
| not | Logical NOT (unary) | not a |

$ cxl eval -e 'emit result = true and not false'
{
  "result": true
}
$ cxl eval -e 'emit result = 5 > 3 or 10 < 2'
{
  "result": true
}

Null coalesce operator

The ?? operator returns its left operand if non-null, otherwise its right operand.

$ cxl eval -e 'emit result = null ?? "default"'
{
  "result": "default"
}
$ cxl eval -e 'emit result = "present" ?? "default"'
{
  "result": "present"
}
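
For readers coming from Python, note that `??` tests only for null, which differs from Python's `or` (that also skips `0`, `""`, and `False`). A rough Python equivalent, using a hypothetical `coalesce` helper and `None` as a stand-in for CXL null:

```python
def coalesce(*candidates):
    """Return the first candidate that is not None, or None if all are."""
    for value in candidates:
        if value is not None:
            return value
    return None


fallback = coalesce(None, "default")   # left side is null, so the right side wins
present = coalesce("present", "default")
zero = coalesce(0, 5)                  # 0 is non-null, so it is kept
```

The last line is where the two differ: `0 or 5` in Python gives `5`, whereas a null-only test keeps the `0`.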

String concatenation

The + operator concatenates strings when both operands are strings.

$ cxl eval -e 'emit result = "hello" + " " + "world"'
{
  "result": "hello world"
}

Unary operators

| Operator | Description | Example |
|---|---|---|
| - | Numeric negation | -x |
| not | Boolean negation | not done |

$ cxl eval -e 'emit result = -42'
{
  "result": -42
}

Method calls

Methods are called on a receiver using dot notation:

$ cxl eval -e 'emit result = "hello".upper()'
{
  "result": "HELLO"
}

Methods can be chained:

$ cxl eval -e 'emit result = "  hello  ".trim().upper()'
{
  "result": "HELLO"
}

Field references

Bare identifiers reference fields from the input record:

$ cxl eval -e 'emit result = price * qty' \
    --field price=10 \
    --field qty=3
{
  "result": 30
}

Qualified field references use dot notation for multi-source pipelines: source.field.

Operator precedence

From highest (binds tightest) to lowest:

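| Precedence | Operators | Associativity |
|---|---|---|
| 1 (highest) | . (method calls, field access) | Left |
| 2 | - (unary), not | Prefix |
| 3 | * / % | Left |
| 4 | + - | Left |
| 5 | == != > < >= <= | Left |
| 6 | and | Left |
| 7 | or | Left |
| 8 (lowest) | ?? | Right |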

Use parentheses to override precedence:

$ cxl eval -e 'emit result = (2 + 3) * 4'
{
  "result": 20
}

Comments

Line comments start with # (when not followed by a digit – digit-prefixed # starts a date literal):

# This is a comment
emit total = price * qty  # inline comment
emit deadline = #2024-12-31#  # this is a date literal, not a comment

Statements

CXL programs are sequences of statements that execute top-to-bottom against each input record. Statement order matters – later statements can reference values produced by earlier ones.

emit

The emit statement produces an output field. Each emit becomes a column in the output record.

emit name = expression

$ cxl eval -e 'emit greeting = "hello"' -e 'emit doubled = 21 * 2'
{
  "greeting": "hello",
  "doubled": 42
}

Multiple emit statements build up the output record field by field:

$ cxl eval -e 'emit first = "Alice"' -e 'emit last = "Smith"' \
    -e 'emit full = first + " " + last'
{
  "first": "Alice",
  "last": "Smith",
  "full": "Alice Smith"
}

let

The let statement creates a local variable binding. The variable is available to subsequent statements but is NOT included in the output record.

let name = expression

$ cxl eval -e 'let tax_rate = 0.21' -e 'emit tax = 100 * tax_rate'
{
  "tax": 21.0
}

Note that tax_rate does not appear in the output – only emit statements produce output fields.

filter

The filter statement excludes records where the condition evaluates to false. When a filter excludes a record, remaining statements do not execute (short-circuit).

filter condition

$ cxl eval -e 'filter amount > 0' -e 'emit result = amount * 2' \
    --field amount=5
{
  "result": 10
}

When the filter condition is false, the entire record is dropped and no output is produced.

Filters can appear anywhere in the statement sequence. Place them early to skip unnecessary computation:

filter status == "active"
let discount = if tier == "gold" then 0.2 else 0.1
emit final_price = price * (1 - discount)

distinct

The distinct statement deduplicates records. The bare form deduplicates on all emitted fields. The by form deduplicates on a specific field.

distinct
distinct by field_name

In a pipeline, distinct tracks values seen so far and drops records that have already been emitted with the same key.
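
The "values seen so far" behavior can be modeled with a set of keys. The Python sketch below is only an illustration of that idea: the `distinct` generator, its `by` parameter, and the use of dictionaries as records are hypothetical, and it says nothing about how Clinker stores or spills the seen-key state.

```python
def distinct(records, by=None):
    """Yield each record whose key has not been seen before.

    With by=None the whole record is the key (bare form);
    otherwise only the named field is used (by form).
    """
    seen = set()
    for rec in records:
        key = rec[by] if by is not None else tuple(rec.items())
        if key in seen:
            continue  # already emitted a record with this key
        seen.add(key)
        yield rec


rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 1, "city": "Bergen"},
    {"id": 1, "city": "Oslo"},
]
all_fields = list(distinct(rows))
by_id = list(distinct(rows, by="id"))
```

The bare form keeps the first two rows because they differ in `city`; the `by="id"` form keeps only the first row.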

emit meta

The emit meta statement writes a value to the $meta.* namespace – per-record metadata that is not part of the output columns. Metadata can be read by downstream nodes via $meta.field.

emit meta quality_flag = if amount < 0 then "suspect" else "ok"

Access metadata downstream:

filter $meta.quality_flag == "ok"

trace

The trace statement emits debug logging. It has no effect on the output record. Trace messages are only visible when tracing is enabled at the appropriate level.

trace "processing record"
trace warn "unusual value detected"
trace info if amount > 10000 then "high value transaction"

Trace levels: trace (default), debug, info, warn, error. An optional guard condition (via if) limits when the trace fires.

Statement ordering

Statements execute sequentially. A statement can reference any field or variable defined by a preceding emit or let:

$ cxl eval -e 'let base = 100' -e 'let rate = 0.15' \
    -e 'emit subtotal = base * rate' \
    -e 'emit total = base + subtotal'
{
  "subtotal": 15.0,
  "total": 115.0
}

Referencing a name before it is defined is a resolve-time error:

emit total = base + tax    # error: 'base' is not defined yet
let base = 100
let tax = base * 0.21

use

The use statement imports a CXL module for reuse. See Modules & use for details.

use shared.dates as d
emit fy = d::fiscal_year(invoice_date)

Conditionals

CXL provides two conditional expression forms: if/then/else and match. Both are expressions – they return values and can be used anywhere an expression is expected.

If / then / else

The basic conditional expression:

if condition then value else alternative

$ cxl eval -e 'emit label = if amount > 100 then "high" else "low"' \
    --field amount=250
{
  "label": "high"
}

The else branch is optional. When it is omitted and the condition is false, the expression evaluates to null:

$ cxl eval -e 'emit bonus = if score > 90 then score * 0.1' \
    --field score=80
{
  "bonus": null
}

Chained conditionals

Chain multiple conditions with else if:

$ cxl eval -e 'emit tier = if amount > 1000 then "platinum"
    else if amount > 500 then "gold"
    else if amount > 100 then "silver"
    else "bronze"' \
    --field amount=750
{
  "tier": "gold"
}

Nested usage

Since if/then/else is an expression, it can be used inside other expressions:

$ cxl eval -e 'emit price = base * (if member then 0.8 else 1.0)' \
    --field base=100 \
    --field member=true
{
  "price": 80.0
}

Match

The match expression provides pattern matching. It comes in two forms: value matching (with a subject) and condition matching (without a subject).

Value form (with subject)

Match a subject expression against literal patterns:

match subject {
  pattern1 => result1,
  pattern2 => result2,
  _ => default
}

$ cxl eval -e 'emit label = match status {
    "A" => "Active",
    "I" => "Inactive",
    "P" => "Pending",
    _ => "Unknown"
  }' \
    --field status=A
{
  "label": "Active"
}

The wildcard _ is the catch-all arm. It matches any value not covered by preceding arms.

Condition form (without subject)

When no subject is provided, each arm’s pattern is evaluated as a boolean condition. This is CXL’s equivalent of SQL’s CASE WHEN:

match {
  condition1 => result1,
  condition2 => result2,
  _ => default
}

$ cxl eval -e 'emit tier = match {
    amount > 1000 => "high",
    amount > 100 => "medium",
    _ => "low"
  }' \
    --field amount=500
{
  "tier": "medium"
}

Practical examples

Tiered pricing:

emit discount = match {
  qty >= 1000 => 0.25,
  qty >= 100  => 0.15,
  qty >= 10   => 0.05,
  _           => 0.0
}

Status code mapping:

emit status_text = match http_code {
  200 => "OK",
  201 => "Created",
  400 => "Bad Request",
  404 => "Not Found",
  500 => "Internal Server Error",
  _   => "HTTP " + http_code.to_string()
}

Region classification:

emit region = match country {
  "US" => "North America",
  "CA" => "North America",
  "MX" => "North America",
  "GB" => "Europe",
  "DE" => "Europe",
  "FR" => "Europe",
  _    => "Other"
}

Match arms are evaluated in order

The first matching arm wins. Place more specific conditions before general ones:

# Correct: specific before general
emit category = match {
  amount > 10000 => "enterprise",
  amount > 1000  => "business",
  _              => "personal"
}

# Wrong: the first arm shadows the more specific arms below it
emit category = match {
  amount > 0     => "personal",    # catches everything positive
  amount > 1000  => "business",    # never reached
  amount > 10000 => "enterprise",  # never reached
  _              => "unknown"
}
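
The first-match-wins rule can be reproduced in a few lines of Python. This is a sketch of the ordering behavior only; `match_first` and the two `categorize_*` functions are hypothetical helpers, and unlike a real match expression this version evaluates every condition eagerly before checking them in order.

```python
def match_first(arms, default):
    """arms is a list of (condition, result) pairs checked top to bottom."""
    for condition, result in arms:
        if condition:
            return result  # first true arm wins; later arms are ignored
    return default


def categorize_correct(amount):
    return match_first(
        [(amount > 10000, "enterprise"), (amount > 1000, "business")],
        "personal",
    )


def categorize_shadowed(amount):
    # Same arms in the wrong order: the general test comes first.
    return match_first(
        [
            (amount > 0, "personal"),
            (amount > 1000, "business"),
            (amount > 10000, "enterprise"),
        ],
        "unknown",
    )
```

With an amount of 50000, `categorize_correct` returns `"enterprise"` while `categorize_shadowed` returns `"personal"`, because the `amount > 0` arm is reached first.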

Built-in Methods

CXL provides built-in scalar methods organized into categories. Methods are called on a receiver value using dot notation: receiver.method(args).

Null propagation

Most methods return null when the receiver is null. This means null values flow through method chains without causing errors. The exceptions are documented in Introspection & Debug.
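
As a mental model, null propagation behaves like wrapping every method call in a null check. The Python sketch below uses `None` for null and a hypothetical `call_method` helper; it illustrates the rule and is not how the engine dispatches methods.

```python
def call_method(receiver, method, *args):
    """Apply method to receiver, or pass None through untouched."""
    if receiver is None:
        return None
    return method(receiver, *args)


# Equivalent in spirit to "  hello  ".trim().upper()
chained = call_method(call_method("  hello  ", str.strip), str.upper)

# A null receiver flows through the whole chain without raising.
missing = call_method(call_method(None, str.strip), str.upper)
```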

Method categories

String Methods (24 methods)

Text manipulation: case conversion, trimming, padding, searching, splitting, regex matching.

| Method | Description |
|---|---|
| upper, lower | Case conversion |
| trim, trim_start, trim_end | Whitespace removal |
| starts_with, ends_with, contains | Substring testing |
| replace | Find and replace |
| substring, left, right | Extraction |
| pad_left, pad_right | Padding |
| repeat, reverse | Repetition and reversal |
| length | Character count |
| split, join | Splitting and joining |
| matches, find, capture | Regex operations |
| format, concat | Formatting and concatenation |

Numeric Methods (8 methods)

Rounding, clamping, and comparison for integers and floats.

| Method | Description |
|---|---|
| abs | Absolute value |
| ceil, floor | Ceiling and floor |
| round, round_to | Rounding to decimal places |
| clamp | Constrain to range |
| min, max | Pairwise minimum/maximum |

Date & Time Methods (13 methods)

Date component extraction, arithmetic, and formatting.

| Method | Description |
|---|---|
| year, month, day | Date component extraction |
| hour, minute, second | Time component extraction (DateTime only) |
| add_days, add_months, add_years | Date arithmetic |
| diff_days, diff_months, diff_years | Date difference |
| format_date | Custom date formatting |

Conversion Methods (11 methods)

Type conversion in strict (error on failure) and lenient (null on failure) variants.

| Method | Description |
|---|---|
| to_int, to_float, to_string, to_bool | Strict conversion |
| to_date, to_datetime | Strict date parsing |
| try_int, try_float, try_bool | Lenient conversion |
| try_date, try_datetime | Lenient date parsing |

Introspection & Debug (5 methods)

Type inspection, null checking, and debugging. These are the only methods that accept null receivers without propagating null.

| Method | Description |
|---|---|
| type_of | Returns the type name as a string |
| is_null | Tests for null |
| is_empty | Tests for empty string, empty array, or null |
| catch | Null fallback (equivalent to ??) |
| debug | Passthrough with tracing side effect |

Path Methods (5 methods)

File path component extraction.

| Method | Description |
|---|---|
| file_name | Full filename with extension |
| file_stem | Filename without extension |
| extension | File extension |
| parent | Parent directory path |
| parent_name | Parent directory name |

Array Methods

Traversal and transformation over nested arrays. Closure-bearing methods take an arrow-syntax closure and evaluate it per element.

| Method | Description |
|---|---|
| filter, map, find, any, flat_map | Closure-bearing traversal |
| remove | Drop the element at a given index |
| length, join | Cross-listed on arrays (also defined on strings) |

Map Methods

Builders and accessors for Value::Map payloads. All map methods return new maps – they never mutate the receiver.

| Method | Description |
|---|---|
| keys, values | List map keys / values as arrays |
| merge | Union of two maps (right wins on conflict) |
| set | Insert / replace an entry, by single key or by a nested a.b[0].c path |
| remove_field | Drop a single entry by key |

String Methods

CXL provides 24 built-in methods for string manipulation. All string methods return null when the receiver is null (null propagation).

Case conversion

upper()

Converts all characters to uppercase.

$ cxl eval -e 'emit result = "hello world".upper()'
{
  "result": "HELLO WORLD"
}

lower()

Converts all characters to lowercase.

$ cxl eval -e 'emit result = "Hello World".lower()'
{
  "result": "hello world"
}

Whitespace trimming

trim()

Removes leading and trailing whitespace.

$ cxl eval -e 'emit result = "  hello  ".trim()'
{
  "result": "hello"
}

trim_start()

Removes leading whitespace only.

$ cxl eval -e 'emit result = "  hello  ".trim_start()'
{
  "result": "hello  "
}

trim_end()

Removes trailing whitespace only.

$ cxl eval -e 'emit result = "  hello  ".trim_end()'
{
  "result": "  hello"
}

Substring testing

starts_with(prefix: String) -> Bool

Tests whether the string starts with the given prefix.

$ cxl eval -e 'emit result = "hello world".starts_with("hello")'
{
  "result": true
}

ends_with(suffix: String) -> Bool

Tests whether the string ends with the given suffix.

$ cxl eval -e 'emit result = "report.csv".ends_with(".csv")'
{
  "result": true
}

contains(substring: String) -> Bool

Tests whether the string contains the given substring.

$ cxl eval -e 'emit result = "hello world".contains("lo wo")'
{
  "result": true
}

Find and replace

replace(find: String, replacement: String) -> String

Replaces all occurrences of find with replacement.

$ cxl eval -e 'emit result = "foo-bar-baz".replace("-", "_")'
{
  "result": "foo_bar_baz"
}

Extraction

substring(start: Int [, length: Int]) -> String

Extracts a substring starting at start (0-based character index). If length is provided, takes at most that many characters. If omitted, takes all remaining characters.

$ cxl eval -e 'emit result = "hello world".substring(6)'
{
  "result": "world"
}
$ cxl eval -e 'emit result = "hello world".substring(0, 5)'
{
  "result": "hello"
}

left(n: Int) -> String

Returns the first n characters.

$ cxl eval -e 'emit result = "hello world".left(5)'
{
  "result": "hello"
}

right(n: Int) -> String

Returns the last n characters.

$ cxl eval -e 'emit result = "hello world".right(5)'
{
  "result": "world"
}
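
The three extraction methods map onto Python slicing, which is also character-based and 0-indexed. The helpers below are a sketch that reproduces the documented examples; how CXL treats out-of-range or negative arguments is not stated here, so the clamping in `right` is an assumption.

```python
def substring(s, start, length=None):
    """Characters from start, optionally limited to length characters."""
    return s[start:] if length is None else s[start:start + length]


def left(s, n):
    """First n characters."""
    return s[:n]


def right(s, n):
    """Last n characters.

    s[-n:] would return the whole string for n == 0, so compute the
    start index explicitly (assumed behavior for n <= 0 and n > len).
    """
    if n <= 0:
        return ""
    return s[max(len(s) - n, 0):]
```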

Padding

pad_left(width: Int [, char: String]) -> String

Left-pads the string to the given width. Default pad character is a space.

$ cxl eval -e 'emit result = "42".pad_left(5, "0")'
{
  "result": "00042"
}
$ cxl eval -e 'emit result = "hi".pad_left(6)'
{
  "result": "    hi"
}

pad_right(width: Int [, char: String]) -> String

Right-pads the string to the given width. Default pad character is a space.

$ cxl eval -e 'emit result = "hi".pad_right(6, ".")'
{
  "result": "hi...."
}

Repetition and reversal

repeat(n: Int) -> String

Repeats the string n times.

$ cxl eval -e 'emit result = "ab".repeat(3)'
{
  "result": "ababab"
}

reverse() -> String

Reverses the characters in the string.

$ cxl eval -e 'emit result = "hello".reverse()'
{
  "result": "olleh"
}

Length

length() -> Int

Returns the number of characters in the string. Also works on arrays, returning the number of elements.

$ cxl eval -e 'emit result = "hello".length()'
{
  "result": 5
}

Splitting and joining

split(delimiter: String) -> Array

Splits the string by the delimiter, returning an array of strings.

$ cxl eval -e 'emit result = "a,b,c".split(",")'
{
  "result": ["a", "b", "c"]
}

join(delimiter: String) -> String

Joins an array of values into a string with the given delimiter. The receiver must be an array.

$ cxl eval -e 'emit result = "a,b,c".split(",").join(" - ")'
{
  "result": "a - b - c"
}

Regex operations

matches(pattern: String) -> Bool

Tests whether the string fully matches the given regex pattern.

$ cxl eval -e 'emit result = "abc123".matches("^[a-z]+[0-9]+$")'
{
  "result": true
}

find(pattern: String) -> Bool

Tests whether the string contains a substring matching the given regex pattern (partial match).

$ cxl eval -e 'emit result = "hello world 42".find("[0-9]+")'
{
  "result": true
}

capture(pattern: String [, group: Int]) -> String

Extracts a capture group from the first regex match. Default group is 0 (the full match).

$ cxl eval -e 'emit result = "order-12345".capture("order-([0-9]+)", 1)'
{
  "result": "12345"
}
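
The distinction between a full match, a partial match, and a capture can be illustrated with Python's `re` module. This is a sketch under assumptions: Python's regex dialect is not identical to the one Clinker uses, and returning `None` from `capture` when nothing matches is a guess, not documented behavior.

```python
import re


def matches(s, pattern):
    """True only when the whole string matches the pattern."""
    return re.fullmatch(pattern, s) is not None


def find(s, pattern):
    """True when any substring matches the pattern."""
    return re.search(pattern, s) is not None


def capture(s, pattern, group=0):
    """Text of the requested group from the first match (0 = full match)."""
    m = re.search(pattern, s)
    return m.group(group) if m else None


full = matches("abc123", "[a-z]+[0-9]+")
partial_only = matches("abc123!", "[a-z]+[0-9]+")
order_id = capture("order-12345", "order-([0-9]+)", 1)
```

`partial_only` is `False` because the trailing `!` is not covered by the pattern, even though `find` would succeed on the same input.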

Formatting and concatenation

format(fmt: String) -> String

Formats the receiver value as a string.

$ cxl eval -e 'emit result = 42.format("")'
{
  "result": "42"
}

concat(args: String…) -> String

Concatenates the receiver with one or more string arguments. Null arguments are treated as empty strings.

$ cxl eval -e 'emit result = "hello".concat(" ", "world")'
{
  "result": "hello world"
}

This is variadic – it accepts any number of string arguments:

$ cxl eval -e 'emit result = "a".concat("b", "c", "d")'
{
  "result": "abcd"
}

Numeric Methods

CXL provides 8 built-in methods for numeric operations. These methods work on both Integer and Float values (the Numeric receiver type). All return null when the receiver is null.

abs() -> Numeric

Returns the absolute value. Preserves the original type (Int stays Int, Float stays Float).

$ cxl eval -e 'emit result = (-42).abs()'
{
  "result": 42
}
$ cxl eval -e 'emit result = (-3.14).abs()'
{
  "result": 3.14
}

ceil() -> Int

Rounds up to the nearest integer. Returns the value unchanged for integers.

$ cxl eval -e 'emit result = 3.2.ceil()'
{
  "result": 4
}
$ cxl eval -e 'emit result = (-3.2).ceil()'
{
  "result": -3
}

floor() -> Int

Rounds down to the nearest integer. Returns the value unchanged for integers.

$ cxl eval -e 'emit result = 3.8.floor()'
{
  "result": 3
}
$ cxl eval -e 'emit result = (-3.2).floor()'
{
  "result": -4
}

round([decimals: Int]) -> Float

Rounds to the specified number of decimal places. Default is 0 decimal places.

$ cxl eval -e 'emit result = 3.456.round()'
{
  "result": 3.0
}
$ cxl eval -e 'emit result = 3.456.round(2)'
{
  "result": 3.46
}

round_to(decimals: Int) -> Float

Rounds to the specified number of decimal places. Unlike round(), the decimals argument is required.

$ cxl eval -e 'emit result = 3.14159.round_to(3)'
{
  "result": 3.142
}

Use round_to when you want to be explicit about precision in financial or scientific calculations:

$ cxl eval -e 'emit price = 19.995.round_to(2)'
{
  "price": 20.0
}

clamp(min: Numeric, max: Numeric) -> Numeric

Constrains the value to the given range. Returns min if the value is below it, max if above, or the value itself if within range.

$ cxl eval -e 'emit result = 150.clamp(0, 100)'
{
  "result": 100
}
$ cxl eval -e 'emit result = (-5).clamp(0, 100)'
{
  "result": 0
}
$ cxl eval -e 'emit result = 50.clamp(0, 100)'
{
  "result": 50
}
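
Clamping is the composition of a pairwise minimum and maximum. A one-line Python sketch (the `clamp` function here is a stand-in for the CXL method, written for illustration):

```python
def clamp(value, lo, hi):
    """Constrain value to the inclusive range [lo, hi]."""
    return max(lo, min(value, hi))


above = clamp(150, 0, 100)   # capped at the upper bound
below = clamp(-5, 0, 100)    # raised to the lower bound
inside = clamp(50, 0, 100)   # unchanged
```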

min(other: Numeric) -> Numeric

Returns the smaller of the receiver and the argument.

$ cxl eval -e 'emit result = 10.min(20)'
{
  "result": 10
}
$ cxl eval -e 'emit result = 10.min(5)'
{
  "result": 5
}

max(other: Numeric) -> Numeric

Returns the larger of the receiver and the argument.

$ cxl eval -e 'emit result = 10.max(20)'
{
  "result": 20
}
$ cxl eval -e 'emit result = 10.max(5)'
{
  "result": 10
}

Practical examples

Clamp a percentage:

emit pct = (completed / total * 100).clamp(0, 100).round_to(1)

Absolute difference:

emit diff = (actual - expected).abs()

Floor division for batch numbering:

emit batch = (row_number / 1000).floor()

Date & Time Methods

CXL provides 13 built-in methods for date and time manipulation. These methods work on Date and DateTime values. All return null when the receiver is null.

Component extraction

year() -> Int

Returns the year component.

$ cxl eval -e 'emit result = #2024-03-15#.year()'
{
  "result": 2024
}

month() -> Int

Returns the month component (1-12).

$ cxl eval -e 'emit result = #2024-03-15#.month()'
{
  "result": 3
}

day() -> Int

Returns the day-of-month component (1-31).

$ cxl eval -e 'emit result = #2024-03-15#.day()'
{
  "result": 15
}

hour() -> Int

Returns the hour component (0-23). DateTime only – returns null for Date values.

$ cxl eval -e 'emit result = "2024-03-15T14:30:00".to_datetime().hour()'
{
  "result": 14
}

minute() -> Int

Returns the minute component (0-59). DateTime only – returns null for Date values.

$ cxl eval -e 'emit result = "2024-03-15T14:30:00".to_datetime().minute()'
{
  "result": 30
}

second() -> Int

Returns the second component (0-59). DateTime only – returns null for Date values.

$ cxl eval -e 'emit result = "2024-03-15T14:30:45".to_datetime().second()'
{
  "result": 45
}

Date arithmetic

add_days(n: Int) -> Date

Adds n days to the date. Use negative values to subtract. Works on both Date and DateTime.

$ cxl eval -e 'emit result = #2024-01-15#.add_days(10)'
{
  "result": "2024-01-25"
}
$ cxl eval -e 'emit result = #2024-01-15#.add_days(-5)'
{
  "result": "2024-01-10"
}

add_months(n: Int) -> Date

Adds n months to the date. Day is clamped to the last day of the target month if necessary.

$ cxl eval -e 'emit result = #2024-01-31#.add_months(1)'
{
  "result": "2024-02-29"
}
$ cxl eval -e 'emit result = #2024-03-15#.add_months(-2)'
{
  "result": "2024-01-15"
}

add_years(n: Int) -> Date

Adds n years to the date. Leap day (Feb 29) is clamped to Feb 28 in non-leap years.

$ cxl eval -e 'emit result = #2024-02-29#.add_years(1)'
{
  "result": "2025-02-28"
}
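
The day-clamping rule for add_months and add_years can be sketched with Python's standard library. This mirrors the documented examples; the function names match the CXL methods for readability, but the implementation is an illustration, not Clinker's code.

```python
import calendar
from datetime import date


def add_months(d, n):
    """Shift d by n months, clamping the day to the target month's length."""
    total = d.year * 12 + (d.month - 1) + n
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def add_years(d, n):
    """Shift d by n years; Feb 29 becomes Feb 28 in non-leap years."""
    return add_months(d, 12 * n)


end_of_feb = add_months(date(2024, 1, 31), 1)
two_back = add_months(date(2024, 3, 15), -2)
leap_shift = add_years(date(2024, 2, 29), 1)
```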

Date difference

diff_days(other: Date) -> Int

Returns the number of days between the receiver and the argument (receiver - other). Positive when the receiver is later.

$ cxl eval -e 'emit result = #2024-03-15#.diff_days(#2024-03-01#)'
{
  "result": 14
}
$ cxl eval -e 'emit result = #2024-01-01#.diff_days(#2024-03-15#)'
{
  "result": -74
}

diff_months(other: Date) -> Int

Returns the difference in months between two dates.

Note: This method currently returns null (unimplemented). Use diff_days and divide by 30 as an approximation.

diff_years(other: Date) -> Int

Returns the difference in years between two dates.

Note: This method currently returns null (unimplemented). Use diff_days and divide by 365 as an approximation.
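
The suggested approximations look like this in Python. The helper names are hypothetical, and the results are fractional estimates rather than calendar-accurate month or year counts; whether CXL's `/` on two integers yields a float or an integer is not covered here, so round or floor the result explicitly if you need a whole number.

```python
from datetime import date


def diff_days(receiver, other):
    """receiver - other, in whole days (positive when receiver is later)."""
    return (receiver - other).days


def approx_diff_months(receiver, other):
    return diff_days(receiver, other) / 30


def approx_diff_years(receiver, other):
    return diff_days(receiver, other) / 365


days = diff_days(date(2024, 3, 15), date(2024, 3, 1))
negative = diff_days(date(2024, 1, 1), date(2024, 3, 15))
months = approx_diff_months(date(2024, 3, 15), date(2024, 1, 1))
```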

Formatting

format_date(format: String) -> String

Formats the date/datetime using a chrono format string. See chrono format syntax.

Common format specifiers:

| Specifier | Description | Example |
|---|---|---|
| %Y | 4-digit year | 2024 |
| %m | 2-digit month | 03 |
| %d | 2-digit day | 15 |
| %H | Hour (24h) | 14 |
| %M | Minute | 30 |
| %S | Second | 00 |
| %B | Full month name | March |
| %b | Abbreviated month | Mar |
| %A | Full weekday | Friday |

$ cxl eval -e 'emit result = #2024-03-15#.format_date("%B %d, %Y")'
{
  "result": "March 15, 2024"
}
$ cxl eval -e 'emit result = #2024-03-15#.format_date("%Y/%m/%d")'
{
  "result": "2024/03/15"
}

Practical examples

Fiscal year calculation (April start):

let d = invoice_date
emit fiscal_year = if d.month() < 4 then d.year() - 1 else d.year()

Age in days:

emit days_since = now.diff_days(created_date)

Quarter:

emit quarter = match {
  invoice_date.month() <= 3  => "Q1",
  invoice_date.month() <= 6  => "Q2",
  invoice_date.month() <= 9  => "Q3",
  _                          => "Q4"
}

ISO week format:

emit formatted = order_date.format_date("%G-W%V")  # %G is the ISO week-numbering year that pairs with %V
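
When formatting ISO weeks, the week number (`%V`) belongs to the ISO week-numbering year (`%G`), which can differ from the calendar year (`%Y`) in the days around New Year. The Python sketch below shows the difference using `date.isocalendar()`; the two label functions are hypothetical helpers for illustration.

```python
from datetime import date


def iso_week_label(d):
    """ISO year paired with ISO week number."""
    iso_year, iso_week, _weekday = d.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


def mixed_label(d):
    """Calendar year paired with ISO week number (misleading at year edges)."""
    return f"{d.year}-W{d.isocalendar()[1]:02d}"


boundary = date(2024, 12, 30)   # a Monday that starts ISO week 1 of 2025
correct = iso_week_label(boundary)
misleading = mixed_label(boundary)
```

For 2024-12-30 the ISO label is `2025-W01`, whereas mixing the calendar year with the ISO week gives `2024-W01`, which points at the wrong week.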

Conversion Methods

CXL provides two families of conversion methods: strict (6 methods) and lenient (5 methods). Strict conversions raise an error on failure, halting pipeline execution. Lenient conversions return null on failure, allowing graceful handling of dirty data.

All conversion methods accept any receiver type (Any).

Strict conversions

Use strict conversions for required fields where invalid data should halt processing.

to_int() -> Int

Converts the receiver to an integer. Errors on failure.

  • Float: truncates toward zero
  • String: parses as integer
  • Bool: true becomes 1, false becomes 0

$ cxl eval -e 'emit result = "42".to_int()'
{
  "result": 42
}
$ cxl eval -e 'emit result = 3.9.to_int()'
{
  "result": 3
}
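
Truncation toward zero is not the same as floor for negative numbers. The Python sketch below models the three documented to_int cases; the `to_int` function is an illustration, and Python's `int()` accepts some string forms (surrounding whitespace, underscores) that CXL may reject.

```python
import math


def to_int(value):
    """Strict conversion: raises on anything that cannot become an int."""
    if isinstance(value, bool):      # check bool first: bool is an int subclass
        return 1 if value else 0
    if isinstance(value, int):
        return value
    if isinstance(value, float):
        return math.trunc(value)     # truncates toward zero
    if isinstance(value, str):
        return int(value)            # raises ValueError on bad input
    raise TypeError(f"cannot convert {type(value).__name__} to int")


positive = to_int(3.9)
negative = to_int(-3.9)
floored = math.floor(-3.9)
```

`negative` is `-3` (truncation) while `floored` is `-4`, so use floor() explicitly when you want rounding toward negative infinity.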

to_float() -> Float

Converts the receiver to a float. Errors on failure.

  • Integer: promotes to float
  • String: parses as float

$ cxl eval -e 'emit result = "3.14".to_float()'
{
  "result": 3.14
}
$ cxl eval -e 'emit result = 42.to_float()'
{
  "result": 42.0
}

to_string() -> String

Converts any value to its string representation. Never fails.

$ cxl eval -e 'emit result = 42.to_string()'
{
  "result": "42"
}
$ cxl eval -e 'emit result = true.to_string()'
{
  "result": "true"
}

to_bool() -> Bool

Converts the receiver to a boolean. Errors on failure.

  • String: "true", "1", "yes" become true; "false", "0", "no" become false (case-insensitive)
  • Integer: 0 is false, everything else is true

$ cxl eval -e 'emit result = "yes".to_bool()'
{
  "result": true
}
$ cxl eval -e 'emit result = 0.to_bool()'
{
  "result": false
}

to_date([format: String]) -> Date

Parses a string to a Date. Without a format argument, expects ISO 8601 (YYYY-MM-DD). With a format, uses chrono strftime syntax.

$ cxl eval -e 'emit result = "2024-03-15".to_date()'
{
  "result": "2024-03-15"
}
$ cxl eval -e 'emit result = "15/03/2024".to_date("%d/%m/%Y")'
{
  "result": "2024-03-15"
}

to_datetime([format: String]) -> DateTime

Parses a string to a DateTime. Without a format argument, expects ISO 8601 (YYYY-MM-DDTHH:MM:SS). With a format, uses chrono strftime syntax.

$ cxl eval -e 'emit result = "2024-03-15T14:30:00".to_datetime()'
{
  "result": "2024-03-15T14:30:00"
}

Lenient conversions

Use lenient conversions for optional or dirty data fields. They return null instead of raising errors, making them safe to combine with ?? for fallback values.

try_int() -> Int

Attempts to convert to integer. Returns null on failure.

$ cxl eval -e 'emit a = "42".try_int()' -e 'emit b = "abc".try_int()'
{
  "a": 42,
  "b": null
}

try_float() -> Float

Attempts to convert to float. Returns null on failure.

$ cxl eval -e 'emit a = "3.14".try_float()' -e 'emit b = "N/A".try_float()'
{
  "a": 3.14,
  "b": null
}

try_bool() -> Bool

Attempts to convert to boolean. Returns null on failure.

$ cxl eval -e 'emit a = "yes".try_bool()' -e 'emit b = "maybe".try_bool()'
{
  "a": true,
  "b": null
}
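
The relationship between a strict conversion, its lenient twin, and the ?? operator can be sketched in Python. This is a model of the documented behavior only: the names to_bool, try_bool, and coalesce are illustrative, and whether CXL trims whitespace before matching is not stated in this reference, so the sketch does not trim.

```python
def to_bool(value):
    """Strict model: raise when the value has no boolean reading."""
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        return value != 0                       # 0 is false, everything else is true
    if isinstance(value, str):
        text = value.lower()                    # matching is case-insensitive
        if text in ("true", "1", "yes"):
            return True
        if text in ("false", "0", "no"):
            return False
    raise ValueError(f"cannot convert {value!r} to Bool")


def try_bool(value):
    """Lenient model: same rules, but failure becomes None (null)."""
    try:
        return to_bool(value)
    except ValueError:
        return None


def coalesce(value, fallback):
    """Model of `value ?? fallback`: only null triggers the fallback."""
    return fallback if value is None else value


print(coalesce(try_bool("maybe"), False))
```

Note that coalesce tests for null, not for falsiness, so a successfully parsed false is kept and not replaced by the fallback.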

try_date([format: String]) -> Date

Attempts to parse a string as a Date. Returns null on failure.

$ cxl eval -e 'emit a = "2024-03-15".try_date()' \
    -e 'emit b = "not a date".try_date()'
{
  "a": "2024-03-15",
  "b": null
}

try_datetime([format: String]) -> DateTime

Attempts to parse a string as a DateTime. Returns null on failure.

$ cxl eval -e 'emit a = "2024-03-15T14:30:00".try_datetime()' \
    -e 'emit b = "invalid".try_datetime()'
{
  "a": "2024-03-15T14:30:00",
  "b": null
}

When to use each

Strict conversions (to_*) for:

  • Required fields that must be valid
  • Schema-enforced data where bad input should halt the pipeline
  • Fields already validated upstream

Lenient conversions (try_*) for:

  • Optional fields that may be missing or malformed
  • Dirty data with mixed formats
  • Fields where a fallback value is acceptable

Practical patterns

Safe numeric parsing with fallback:

emit amount = raw_amount.try_float() ?? 0.0

Parse dates from multiple formats:

emit parsed = raw_date.try_date("%Y-%m-%d")
    ?? raw_date.try_date("%m/%d/%Y")
    ?? raw_date.try_date("%d-%b-%Y")
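
The fallback chain above tries each format in order and keeps the first one that parses. A rough Python equivalent, assuming the three format strings mean the same thing in Python's strptime as in chrono (they do for these specifiers), looks like this. The helper names try_date and parse_any are illustrative.

```python
from datetime import datetime


def try_date(text, fmt="%Y-%m-%d"):
    """Return a date, or None when the text does not match the format."""
    try:
        return datetime.strptime(text, fmt).date()
    except (ValueError, TypeError):   # TypeError covers a null (None) input
        return None


def parse_any(text, formats=("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y")):
    """Try each format in order; the first successful parse wins."""
    for fmt in formats:
        parsed = try_date(text, fmt)
        if parsed is not None:
            return parsed
    return None


print(parse_any("03/15/2024"))
```

Because the first match wins, order matters when formats are ambiguous: a value such as 04/05/2024 parses under both month-first and day-first patterns, so list the format your data actually uses first.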

Strict conversion for required fields:

emit employee_id = raw_id.to_int()    # halts on bad data -- correct behavior
emit salary = raw_salary.to_float()   # must be numeric

Lenient conversion for optional fields:

emit bonus = raw_bonus.try_float()    # null if missing or non-numeric
emit total = salary + (bonus ?? 0.0)  # safe arithmetic

Introspection & Debug

CXL provides 4 introspection methods and 1 debug method. These are the only methods that accept null receivers without propagating null – they are designed specifically for inspecting and handling null values.

type_of() -> String

Returns the type name of the receiver as a string. Works on any value, including null.

Type name strings: "String", "Int", "Float", "Bool", "Date", "DateTime", "Null", "Array", "Map".

$ cxl eval -e 'emit a = 42.type_of()' -e 'emit b = "hello".type_of()' \
    -e 'emit c = null.type_of()'
{
  "a": "Int",
  "b": "String",
  "c": "Null"
}

Useful for branching on dynamic types:

emit formatted = match value.type_of() {
  "Int"   => value.to_string() + " (integer)",
  "Float" => value.round_to(2).to_string() + " (decimal)",
  _       => value.to_string()
}

is_null() -> Bool

Returns true if the receiver is null, false otherwise. This is the primary way to test for null values – it is NOT subject to null propagation.

$ cxl eval -e 'emit a = null.is_null()' -e 'emit b = 42.is_null()'
{
  "a": true,
  "b": false
}

Use in filter statements:

filter not field.is_null()

is_empty() -> Bool

Returns true for empty strings, empty arrays, or null values. Returns false for all other values.

$ cxl eval -e 'emit a = "".is_empty()' -e 'emit b = "hello".is_empty()' \
    -e 'emit c = null.is_empty()'
{
  "a": true,
  "b": false,
  "c": true
}

Useful for filtering out blank or missing records:

filter not name.is_empty()

catch(fallback: Any) -> Any

Returns the receiver if it is non-null, otherwise returns the fallback value. This is the method equivalent of the ?? operator.

$ cxl eval -e 'emit a = null.catch("default")' \
    -e 'emit b = "present".catch("default")'
{
  "a": "default",
  "b": "present"
}

catch and ?? are interchangeable:

# These two are equivalent:
emit name = raw_name.catch("Unknown")
emit name = raw_name ?? "Unknown"

debug(label: String) -> Any

Passes the receiver through unchanged while emitting a trace log with the given label. Zero overhead when tracing is disabled. The return value is always the receiver, making it safe to insert into any expression chain.

$ cxl eval -e 'emit result = 42.debug("check value")'
{
  "result": 42
}

Insert debug anywhere in a method chain for inspection without affecting the output:

emit total = price.debug("price")
    * qty.debug("qty")

When tracing is enabled, this produces log lines like:

TRACE source_row=1 source_file=input.csv: price: Integer(100)
TRACE source_row=1 source_file=input.csv: qty: Integer(5)

Null-safe summary

Method              Null receiver behavior
type_of()           Returns "Null"
is_null()           Returns true
is_empty()          Returns true
catch(x)            Returns x
debug(l)            Passes through null, logs it
All other methods   Return null (propagation)
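
The summary can be read as a dispatch rule: a null receiver short-circuits to null unless the method is one of the five null-safe ones. The Python sketch below models that rule only. The call helper and the METHODS table are hypothetical, upper stands in for any ordinary method, the Date and DateTime type names are omitted, and debug is reduced to a pass-through with no logging.

```python
def type_of(value):
    names = {bool: "Bool", int: "Int", float: "Float", str: "String",
             list: "Array", dict: "Map", type(None): "Null"}
    return names[type(value)]


def is_empty(value):
    if value is None:
        return True
    return isinstance(value, (str, list)) and len(value) == 0


METHODS = {
    "type_of": type_of,
    "is_null": lambda v: v is None,
    "is_empty": is_empty,
    "catch": lambda v, fallback: fallback if v is None else v,
    "debug": lambda v, label: v,      # pass-through; trace logging omitted
    "upper": str.upper,               # stand-in for "all other methods"
}
NULL_SAFE = {"type_of", "is_null", "is_empty", "catch", "debug"}


def call(receiver, method, *args):
    """Apply null propagation before dispatching to the method body."""
    if receiver is None and method not in NULL_SAFE:
        return None
    return METHODS[method](receiver, *args)


print(call(None, "upper"), call(None, "catch", "default"))
```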

Path Methods

CXL provides 5 built-in methods for extracting components from file path strings. All path methods take a string receiver and return a string. They return null when the receiver is null or when the requested component does not exist.

file_name() -> String

Returns the full filename (with extension) from the path.

$ cxl eval -e 'emit result = "/data/reports/sales.csv".file_name()'
{
  "result": "sales.csv"
}

file_stem() -> String

Returns the filename without the extension.

$ cxl eval -e 'emit result = "/data/reports/sales.csv".file_stem()'
{
  "result": "sales"
}

extension() -> String

Returns the file extension (without the leading dot).

$ cxl eval -e 'emit result = "/data/reports/sales.csv".extension()'
{
  "result": "csv"
}

Returns null when no extension is present:

$ cxl eval -e 'emit result = "/data/reports/README".extension()'
{
  "result": null
}

parent() -> String

Returns the parent directory path.

$ cxl eval -e 'emit result = "/data/reports/sales.csv".parent()'
{
  "result": "/data/reports"
}

parent_name() -> String

Returns just the name of the parent directory (not the full path).

$ cxl eval -e 'emit result = "/data/reports/sales.csv".parent_name()'
{
  "result": "reports"
}
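
For readers who know Python, the five methods line up closely with attributes of pathlib.PurePosixPath. The mapping below is an analogy under that assumption, not a description of Clinker's implementation, and edge cases (relative paths, trailing slashes, multi-part extensions) may differ. The helper name path_parts is illustrative.

```python
from pathlib import PurePosixPath


def path_parts(path):
    """Return the five components, using None where a component is absent."""
    p = PurePosixPath(path)
    return {
        "file_name": p.name or None,
        "file_stem": p.stem or None,
        "extension": p.suffix[1:] or None,   # drop the leading dot
        "parent": str(p.parent),
        "parent_name": p.parent.name or None,
    }


print(path_parts("/data/reports/sales.csv"))
```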

Practical examples

Organize output by source directory:

emit source_dir = $pipeline.source_file.parent_name()
emit source_type = $pipeline.source_file.extension()

Extract file identifiers:

emit file_id = $pipeline.source_file.file_stem()
emit is_csv = $pipeline.source_file.extension() == "csv"

Route by file type:

let ext = input_path.extension()
emit format = match ext {
  "csv"  => "delimited",
  "json" => "structured",
  "xml"  => "markup",
  _      => "unknown"
}

Array Methods

CXL provides closure-bearing and non-closure array builtins for traversing and transforming nested arrays carried on a single record. The closure-bearing methods take an arrow-syntax closure and evaluate it once per element.

Null propagation

Every array method returns null when the receiver is null. The closure body is not invoked on a null receiver.

Closure-bearing methods

filter(it => Bool) -> Array

Returns a new array containing the elements for which the closure body evaluates to true.

- type: transform
  name: filter_items
  input: orders
  config:
    cxl: |
      emit kept = items.filter(it => it["price"] > 5)

For an input record where items is [{"sku":"a","price":10},{"sku":"b","price":20},{"sku":"c","price":5}], kept is [{"sku":"a","price":10},{"sku":"b","price":20}].

map(it => T) -> Array

Returns a new array whose elements are the closure body’s value for each input element. The element type need not match the input element type.

    cxl: |
      emit skus = items.map(it => it["sku"])
      emit doubled_prices = items.map(it => it["price"] * 2)

skus is ["a", "b", "c"]; doubled_prices is [20, 40, 10].

find(it => Bool) -> Element | Null

Returns the first element for which the closure body evaluates to true. Returns null if no element matches.

    cxl: |
      emit first_premium = items.find(it => it["price"] > 15)

first_premium is {"sku":"b","price":20} for the running example.

any(it => Bool) -> Bool

Returns true if the closure body evaluates to true for at least one element. Returns false if no element matches (including on an empty array).

    cxl: |
      emit has_cheap = items.any(it => it["price"] < 10)

has_cheap is true.

flat_map(it => Array) -> Array

Like map, but the closure body returns an array per input element; the results are concatenated into a single flat array. A null body result contributes no elements; a non-array body result contributes a single element.

    cxl: |
      emit all_tags = items.flat_map(it => it["tags"])

For input items carrying tags arrays (e.g. [{"sku":"a","tags":["new"]},{"sku":"b","tags":["sale","new"]}]), all_tags is ["new","sale","new"].
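
The three flattening rules (array results are spliced in, null results contribute nothing, other results contribute one element) can be modeled in a few lines of Python. This is a sketch of the documented behavior with an illustrative function name, not Clinker's code.

```python
def flat_map(array, body):
    """Model of flat_map: concatenate per-element results into one flat list."""
    if array is None:
        return None                  # null receiver: the body is never invoked
    out = []
    for element in array:
        result = body(element)
        if result is None:
            continue                 # a null body result contributes no elements
        if isinstance(result, list):
            out.extend(result)       # an array result is spliced in
        else:
            out.append(result)       # a non-array result contributes one element
    return out


items = [{"sku": "a", "tags": ["new"]}, {"sku": "b", "tags": ["sale", "new"]}]
print(flat_map(items, lambda it: it["tags"]))
```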

Non-closure methods

remove(index: Int) -> Array

Returns a new array with the element at the given 0-based index removed. The original array is unchanged.

    cxl: |
      emit shifted = items.remove(1)

shifted is [{"sku":"a","price":10},{"sku":"c","price":5}] – index 0 is preserved, index 2 shifts down to index 1.

If the index is negative or out of range, remove returns the receiver array unchanged.
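
A small Python model of these rules, with an illustrative function name, shows that the original array is never modified and that a bad index is a no-op and not an error:

```python
def remove(array, index):
    """Model of remove(index): return a new list without the element at index."""
    if array is None:
        return None                            # null receiver propagates
    if index < 0 or index >= len(array):
        return array                           # negative or out of range: unchanged
    return array[:index] + array[index + 1:]   # slicing builds a new list


items = ["a", "b", "c"]
print(remove(items, 1), remove(items, 9), items)
```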

length() -> Int

Returns the number of elements in the array. length is also defined on strings (see String Methods).

    cxl: |
      emit item_count = items.length()

item_count is 3.

join(separator: String) -> String

Joins an array of values into a single string with the given separator between elements. Defined as a string method (see String Methods) but accepts array receivers.

    cxl: |
      emit sku_list = items.map(it => it["sku"]).join(", ")

sku_list is "a, b, c".

Bracket indexing vs .remove

Bracket indexing (items[0]) reads an element by position and returns null when out of range. .remove(idx) returns a new array with the element dropped; out-of-range indices leave the array unchanged. See Nested Paths for the index-access surface.

See also

  • Closures – the it => body form used by closure-bearing array methods.
  • Map Methods – builtins that operate on the map elements typically iterated by these array methods.
  • Nested Paths – bracket-index and dotted-path access through nested arrays and maps.
  • Emit Each – fan one input record into many output records, one per array element.

Map Methods

CXL provides five built-in methods for working with map values (key-value pairs). Maps arise naturally from JSON object inputs, from the set builder below, and from upstream emits that produce nested structures.

All map methods return new values – they never mutate the receiver. This is copy-on-write semantics: chaining .set then .remove_field produces a fresh map at each step, leaving the upstream binding untouched.

Null propagation

Every map method returns null when the receiver is null or is not a Value::Map.

Method reference

keys() -> Array

Returns the map’s keys as an array of strings, preserving insertion order.

- type: transform
  name: list_keys
  input: rows
  config:
    cxl: |
      emit field_names = profile.keys()

For an input record where profile is {"name":"Alice","tier":"gold","since":"2021-04"}, field_names is ["name","tier","since"].

values() -> Array

Returns the map’s values as an array, preserving insertion order. Value types are heterogeneous – the array carries each value as-is.

    cxl: |
      emit field_values = profile.values()

field_values is ["Alice","gold","2021-04"].

merge(other: Map) -> Map

Returns a new map containing every key from the receiver and from other. On conflicting keys, other’s value wins.

    cxl: |
      emit enriched = profile.merge(overrides)

For profile = {"name":"Alice","tier":"gold"} and overrides = {"tier":"platinum","since":"2021-04"}, enriched is {"name":"Alice","tier":"platinum","since":"2021-04"}.

set(key: String, value: Any) -> Map

Returns a new map with key set to value. If the key was already present, its value is replaced; insertion order is preserved.

    cxl: |
      emit stamped = profile.set("region", "us-east")

For the three-key profile from the keys() example, stamped is {"name":"Alice","tier":"gold","since":"2021-04","region":"us-east"}.

Nested paths

key may be a dotted/indexed path that descends into nested maps and arrays, so a single set writes into a deep document. Dots separate map keys; a [n] suffix indexes an array.

    cxl: |
      emit moved = profile.set("address.city", "NYC")
      emit relabel = order.set("items[0].sku", "A-100")
  • Auto-create. Missing intermediate map segments are created as empty maps, so a path can build structure that does not yet exist. {}.set("a.b.c", 7) returns {"a":{"b":{"c":7}}}. This is what lets set assemble a nested document from scratch (matching jq setpath and Bloblang assignment).
  • Type conflict -> null. If an intermediate segment already exists but is the wrong kind for the next step – descending into a key whose value is a scalar, indexing a map with [n], or naming a field on an array – the whole operation returns null. Nothing is partially written.
  • Array index past the end -> null. Indexing past the last element returns null for the whole operation; arrays are never silently grown. The path can only overwrite an array slot that already exists.
  • A bare key is a single key, not a path. "region" writes the top-level region. Only . and [n] introduce nesting; a key with neither is treated as one literal key.

For profile = {"name":"Alice","address":{"city":"LA"}}, profile.set("address.city", "NYC") is {"name":"Alice","address":{"city":"NYC"}} – the sibling name and any other address keys are preserved.
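
The four path rules are easier to check against a small model. The Python sketch below is an illustration of the documented behavior, not Clinker's implementation: parse_path, set_path, and PathConflict are hypothetical names, malformed paths are not validated, and treating an existing null intermediate as a type conflict is an assumption.

```python
import re

_SEGMENT = re.compile(r"([^.\[\]]+)|\[(\d+)\]")
_MISSING = object()          # marks "key not present", distinct from a null value


class PathConflict(Exception):
    """Raised internally when the path cannot be followed."""


def parse_path(path):
    """'items[0].sku' -> ['items', 0, 'sku'] (str = map key, int = array index)."""
    return [int(idx) if idx else key for key, idx in _SEGMENT.findall(path)]


def _set(node, segments, value):
    if not segments:
        return value
    head, rest = segments[0], segments[1:]
    if isinstance(head, str):                     # map-key step
        if node is _MISSING:
            node = {}                             # auto-create a missing map
        if not isinstance(node, dict):
            raise PathConflict(head)              # scalar, array, or null in the way
        child = _set(node.get(head, _MISSING), rest, value)
        return {**node, head: child}              # new map; existing keys keep their order
    if not isinstance(node, list) or head >= len(node):
        raise PathConflict(head)                  # not an array, or index past the end
    child = _set(node[head], rest, value)
    return node[:head] + [child] + node[head + 1:]


def set_path(doc, path, value):
    """Return a new document, or None when the receiver or path is unusable."""
    if not isinstance(doc, dict):
        return None
    try:
        return _set(doc, parse_path(path), value)
    except PathConflict:
        return None


print(set_path({}, "a.b.c", 7))
```

Raising internally and converting to None at the top keeps the "nothing is partially written" guarantee simple: the new document is only assembled on the way back up from a fully successful descent.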

Known limitation. Because . and [ are path syntax, set cannot target a key whose name literally contains a . or [ (for example a JSON field literally named "a.b"). To write such a key, build it with merge and a map literal; to remove it, use remove_field, which matches the exact key string.

remove_field(key: String) -> Map

Returns a new map without key. If the key was absent, the receiver is returned unchanged.

    cxl: |
      emit slim = profile.remove_field("since")

For the three-key profile from the keys() example, slim is {"name":"Alice","tier":"gold"}.

Worked example: chained set + remove_field

Map methods compose naturally because each returns a new map.

- type: transform
  name: rewrite_profile
  input: rows
  config:
    cxl: |
      emit profile =
        profile.set("region", "us-east").remove_field("internal_id")

For profile = {"name":"Alice","internal_id":"ix-77","tier":"gold"}, the emitted profile is {"name":"Alice","tier":"gold","region":"us-east"}. The internal_id slot is removed and the region slot is appended; both happen on a fresh map so the upstream record’s profile is unaffected for any other downstream branch.

Parentheses are required

All map methods are method calls and must be written with parentheses, even the zero-argument ones:

profile.keys()         -- ok
profile.keys           -- parses as a field lookup, not a method call

profile.keys parses as a dotted path – a lookup for a field literally named keys inside profile. That path almost certainly returns null. Always include the parentheses when invoking a map method.

Using map methods inside array closures

Map methods compose with closure-bearing array builtins when the array elements are themselves maps.

    cxl: |
      emit enriched_items = items.map(it => it.set("region", "us-east"))
      emit item_keys = items.map(it => it.keys())

Each it is a map; the closure body invokes a map method on it. enriched_items is an array where every element gained a region field. item_keys is an array of key-name arrays, one per element.

See also

  • Closures – arrow-syntax closures often invoke map methods on their it binding.
  • Array Methods – closure-bearing array methods commonly carry maps as their elements.
  • Nested Paths – bracket-index access (profile["name"]) reads a single key without producing a new map.

Window Functions

Window functions allow CXL expressions to access aggregated values across a set of records within an analytic window. Unlike aggregate functions (which collapse groups into single rows), window functions attach computed values to each individual record.

Window functions are accessed via the $window.* namespace and require an analytic_window: configuration on the transform node.

Configuring an analytic window

Window functions are only available in transform nodes that declare an analytic_window: section in YAML:

nodes:
  - name: ranked_sales
    type: transform
    input: raw_sales
    analytic_window:
      group_by: [region]
      sort_by:
        - field: amount
          order: desc
    cxl: |
      emit region = region
      emit amount = amount
      emit running_total = $window.sum(amount)
      emit rank_position = $window.count()

Window configuration fields

Field      Description
group_by   List of fields to partition the window by (the SQL PARTITION BY axis).
sort_by    List of { field, order } ordering specifications (order is asc or desc).
source     Optional explicit source-name reference for cross-source windows.
on         Optional cross-source partition-lookup field.

Frame specification (frame: { rows: ... } / frame: { range: ... }) is not yet supported by the YAML parser. Today every window evaluates over a rows: unbounded_preceding..current_row frame, which matches the SQL default for the listed window functions. See the deferred-work tracker for the status of explicit frame syntax.
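
To make the unbounded_preceding..current_row frame concrete, here is a Python model of what $window.sum and $window.count yield under it: rows are partitioned by the group_by field, ordered by the sort_by field, and each row sees only itself and the rows before it. The function name and single-field arguments are simplifications for illustration, and the order in which partitions are emitted is not meant to reflect Clinker's output order.

```python
from collections import defaultdict


def running_totals(rows, group_key, sort_key, field, descending=False):
    """Attach a running sum and running count to every row of each partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[group_key]].append(row)

    out = []
    for partition in partitions.values():
        partition.sort(key=lambda r: r[sort_key], reverse=descending)
        total = 0
        for position, row in enumerate(partition, start=1):
            total += row[field]                     # frame ends at the current row
            out.append({**row, "running_total": total, "running_count": position})
    return out


rows = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 30},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 20},
]
for r in running_totals(rows, "region", "amount", "amount", descending=True):
    print(r)
```

Under this frame, a count is the row's position within its partition and a sum is a running total; neither reflects the whole partition until the last row.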

Aggregate window functions

These compute aggregate values over the window frame.

$window.sum(field)

Sum of the field values in the window frame.

emit running_total = $window.sum(amount)

$window.avg(field)

Average of the field values in the window frame. Returns Float.

emit moving_avg = $window.avg(amount)

$window.min(field)

Minimum value in the window frame.

emit window_min = $window.min(amount)

$window.max(field)

Maximum value in the window frame.

emit window_max = $window.max(amount)

$window.count()

Count of records in the window frame. Takes no arguments.

emit window_size = $window.count()

$window.first_value(field)

Returns the value of field at the first record of the window frame (ordered by sort_by). Equivalent to SQL FIRST_VALUE(field).

emit opening_amount = $window.first_value(amount)

$window.last_value(field)

Returns the value of field at the last record of the window frame (ordered by sort_by). Equivalent to SQL LAST_VALUE(field).

emit closing_amount = $window.last_value(amount)

Ranking window functions

Zero-argument integer functions that return the current row’s rank within its partition.

$window.row_number()

1-indexed position of the current record within its partition.

emit row_idx = $window.row_number()

$window.rank()

SQL RANK(): rows that share the same sort_by tuple receive the same rank, and the next distinct row jumps by the size of the tie group.

emit sales_rank = $window.rank()

$window.dense_rank()

SQL DENSE_RANK(): ties share a rank with no gaps between distinct ranks.

emit sales_dense_rank = $window.dense_rank()
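
The difference between the three ranking functions only shows up when rows tie. The Python sketch below follows the standard SQL definitions that this section references; the function name is illustrative and the input is the sequence of sort_by values for one partition, already in window order.

```python
def rankings(sort_keys):
    """Return (row_number, rank, dense_rank) for each row of one partition."""
    out = []
    rank = dense = 0
    previous = object()                 # sentinel that equals no real key
    for row_number, key in enumerate(sort_keys, start=1):
        if key != previous:
            rank = row_number           # RANK jumps past the whole tie group
            dense += 1                  # DENSE_RANK advances by exactly one
            previous = key
        out.append((row_number, rank, dense))
    return out


# Amounts sorted descending, with a two-way tie at 200.
print(rankings([300, 200, 200, 100]))
# -> [(1, 1, 1), (2, 2, 2), (3, 2, 2), (4, 4, 3)]
```

After the tie at rank 2, rank() continues at 4 while dense_rank() continues at 3.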

Positional window functions

These access specific records by position within the window frame.

$window.first()

Returns the value of the current field from the first record in the window frame.

emit first_amount = $window.first()

$window.last()

Returns the value of the current field from the last record in the window frame.

emit last_amount = $window.last()

$window.lag(n)

Returns the value from n records before the current record. Returns null if there is no record at that offset.

emit prev_amount = $window.lag(1)
emit two_back = $window.lag(2)

$window.lead(n)

Returns the value from n records after the current record. Returns null if there is no record at that offset.

emit next_amount = $window.lead(1)

Iterable window functions

These evaluate predicates or collect values across the window.

$window.any(predicate)

Returns true if the predicate is true for any record in the window.

emit has_high = $window.any(amount > 1000)

$window.every(predicate)

Returns true if the predicate is true for every record in the window.

emit all_positive = $window.every(amount > 0)

$window.exists(predicate)

Returns true if the predicate is true for at least one record in the window. It is an alias of $window.any for readers used to SQL's EXISTS.

emit any_high = $window.exists(amount > 1000)

$window.not_exists(predicate)

Returns true if no record in the window satisfies the predicate. Equivalent to not $window.exists(predicate) and to $window.every(not predicate).

emit none_negative = $window.not_exists(amount < 0)
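
The stated equivalences are ordinary predicate logic, and can be checked with a short Python model. The function names are illustrative and a window is represented as a plain list of rows.

```python
def window_any(rows, predicate):
    return any(predicate(r) for r in rows)


def window_every(rows, predicate):
    return all(predicate(r) for r in rows)


def window_not_exists(rows, predicate):
    return not window_any(rows, predicate)


window = [{"amount": 40}, {"amount": 0}, {"amount": 1200}]
negative = lambda r: r["amount"] < 0

# not_exists(p) == not exists(p) == every(not p)
print(window_not_exists(window, negative),
      window_every(window, lambda r: not negative(r)))
```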

$window.collect(field)

Collects all values of the field in the window into an array.

emit all_amounts = $window.collect(amount)

$window.distinct(field)

Collects distinct values of the field in the window into an array.

emit unique_regions = $window.distinct(region)

Complete example

nodes:
  - name: sales_analysis
    type: transform
    input: daily_sales
    analytic_window:
      group_by: [store_id]
      sort_by:
        - field: sale_date
          order: asc
    cxl: |
      emit store_id = store_id
      emit sale_date = sale_date
      emit daily_revenue = revenue
      emit running_avg = $window.avg(revenue)
      emit running_total = $window.sum(revenue)
      emit prev_day_revenue = $window.lag(1)
      emit day_over_day = revenue - ($window.lag(1) ?? revenue)

This computes per-store running averages and totals over each partition's history up to and including the current row.

Retraction interaction

When a window sits downstream of a relaxed-CK aggregate whose dropped correlation-key fields overlap the window’s group_by, the planner switches the window from streaming-emit to buffer-mode. The window operator stores per-partition raw row buffers until commit; on retraction, it reruns the configured $window.* evaluation over partition − retracted_rows and emits per-output deltas through the replay phase.

Every $window.* function documented above is covered uniformly by wholesale recompute. The operator-by-operator retraction cost reference has the per-operator memory ceilings; clinker run --explain reports the live per-window detail.

Aggregate Functions

Aggregate functions operate across grouped record sets in aggregate nodes, collapsing multiple input records into summary rows. They are distinct from window functions, which attach computed values to each individual record.

Aggregate functions

CXL provides 7 aggregate functions. These are called as free-standing function calls (not method calls) within the CXL block of an aggregate node.

Function                      Signature          Returns   Description
sum(expr)                     Numeric            Numeric   Sum of values
count(*)                      -                  Int       Count of records in the group
avg(expr)                     Numeric            Float     Arithmetic mean
min(expr)                     Any                Any       Minimum value
max(expr)                     Any                Any       Maximum value
collect(expr)                 Any                Array     All values collected into an array
weighted_avg(value, weight)   Numeric, Numeric   Float     Weighted arithmetic mean

YAML aggregate node

Aggregate functions are used inside the cxl: block of a node with type: aggregate. The node must declare group_by: fields.

nodes:
  - name: dept_summary
    type: aggregate
    input: employees
    group_by: [department]
    cxl: |
      emit total_salary = sum(salary)
      emit headcount = count(*)
      emit avg_salary = avg(salary)
      emit max_salary = max(salary)
      emit min_salary = min(salary)

Group-by fields pass through automatically

Fields listed in group_by: are automatically included in the output. You do NOT need to emit them – they are carried through as group keys.

In the example above, department is automatically present in every output record without an explicit emit department = department statement.
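
The pass-through rule can be pictured with a small Python model of an aggregate node: each output record starts from the group key values, and the emitted aggregates are added on top. The aggregate function and its emits argument are hypothetical names for illustration, and output order here simply follows first appearance of each group.

```python
def aggregate(rows, group_by, emits):
    """Group rows by the group_by fields and compute each emit per group."""
    groups = {}
    for row in rows:
        key = tuple(row[field] for field in group_by)
        groups.setdefault(key, []).append(row)

    out = []
    for key, members in groups.items():
        record = dict(zip(group_by, key))      # group keys pass through automatically
        for name, compute in emits.items():
            record[name] = compute(members)
        out.append(record)
    return out


employees = [
    {"department": "eng", "salary": 100},
    {"department": "ops", "salary": 90},
    {"department": "eng", "salary": 120},
]
summary = aggregate(
    employees,
    group_by=["department"],
    emits={
        "total_salary": lambda g: sum(r["salary"] for r in g),
        "headcount": len,
    },
)
print(summary)
```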

Function details

sum(expr) -> Numeric

Computes the sum of the expression across all records in the group. Null values are skipped.

cxl: |
  emit total_revenue = sum(price * quantity)

count(*) -> Int

Counts the number of records in the group. The argument is the wildcard *.

cxl: |
  emit num_orders = count(*)

avg(expr) -> Float

Computes the arithmetic mean. Null values are skipped. Returns Float.

cxl: |
  emit avg_order_value = avg(order_total)
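
Skipping nulls changes the denominator of an average, which is easy to overlook. The Python sketch below models the documented rule for sum and avg. The function names are illustrative, and the result for a group whose values are all null (None here) is an assumption, since this reference does not state it.

```python
def agg_sum(values):
    """Sum the non-null values."""
    return sum(v for v in values if v is not None)


def agg_avg(values):
    """Mean of the non-null values; None for an all-null group (assumption)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


order_totals = [10, None, 20]
print(agg_sum(order_totals), agg_avg(order_totals), len(order_totals))
```

For this group the average is 15.0, not 10.0: the null row is excluded from both the sum and the divisor, while count(*) still reports three records.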

min(expr) -> Any

Returns the minimum value in the group. Works on numeric, string, and date types.

cxl: |
  emit earliest_order = min(order_date)
  emit lowest_price = min(unit_price)

max(expr) -> Any

Returns the maximum value in the group. Works on numeric, string, and date types.

cxl: |
  emit latest_order = max(order_date)
  emit highest_price = max(unit_price)

collect(expr) -> Array

Collects all values of the expression into an array. Useful for building lists of values per group.

cxl: |
  emit all_order_ids = collect(order_id)

weighted_avg(value, weight) -> Float

Computes a weighted average: sum(value * weight) / sum(weight). Takes two arguments.

cxl: |
  emit weighted_price = weighted_avg(unit_price, quantity)
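
The formula sum(value * weight) / sum(weight) can be worked through in Python. The function name mirrors the CXL one for readability only. How CXL treats null inputs or a zero total weight is not stated here, so returning None for a zero total is an assumption of this sketch.

```python
def weighted_avg(pairs):
    """pairs is an iterable of (value, weight) tuples."""
    pairs = list(pairs)
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        return None                      # assumption: avoid dividing by zero
    return sum(value * weight for value, weight in pairs) / total_weight


# One unit sold at 10, three units sold at 20.
print(weighted_avg([(10, 1), (20, 3)]))   # (10*1 + 20*3) / (1 + 3) = 17.5
```

The plain average of the two prices would be 15.0; weighting by quantity pulls the result toward the price that sold more units.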

Aggregates vs. windows

Feature         Aggregate node                Window function
Record output   One row per group             One row per input record
Syntax          sum(field) (free-standing)    $window.sum(field) (namespace)
Configuration   type: aggregate + group_by:   type: transform + analytic_window:
Use case        Summarize groups              Enrich records with group context

Combining aggregates with expressions

Aggregate function calls can be mixed with regular CXL expressions in emit statements:

nodes:
  - name: category_stats
    type: aggregate
    input: products
    group_by: [category]
    cxl: |
      emit total_revenue = sum(price * quantity)
      emit avg_price = avg(price)
      emit margin_pct = (sum(revenue) - sum(cost)) / sum(revenue) * 100
      emit product_count = count(*)
      emit has_premium = max(price) > 100

Restrictions

  • let bindings in aggregate transforms are restricted to row-pure expressions (no aggregate function calls in let).
  • filter in aggregate transforms runs pre-aggregation – it filters input records before grouping.
  • distinct is not permitted inside aggregate transforms. Place a separate distinct transform upstream.

Complete example

pipeline:
  name: sales_summary
  nodes:
    - name: raw_sales
      type: source
      format: csv
      path: sales.csv

    - name: monthly_summary
      type: aggregate
      input: raw_sales
      group_by: [region, month]
      cxl: |
        emit total_sales = sum(amount)
        emit order_count = count(*)
        emit avg_order = avg(amount)
        emit top_sale = max(amount)
        emit all_reps = collect(sales_rep)

    - name: output
      type: output
      input: monthly_summary
      format: csv
      path: summary.csv

Closures

CXL supports arrow-syntax closures as arguments to closure-bearing array builtins like filter, map, find, any, and flat_map. They give CXL a way to express element-by-element predicates and projections over nested arrays carried inside a single record – without writing a separate transform node per element.

Syntax

it => expression

A closure has one parameter, named it, and a single expression body. The arrow => separates them.

- type: transform
  name: filter_items
  input: orders
  config:
    cxl: |
      emit kept = items.filter(it => it["price"] > 5)

The body is an expression, not a block of statements. Use if/then/else or match if you need branching inside a closure.

    cxl: |
      emit price_buckets = items.map(it =>
        if it["price"] >= 100 then "premium"
        else if it["price"] >= 10 then "standard"
        else "value")

Parameter name

The parameter is always it. Other identifiers are not accepted as the closure binding:

items.filter(item => item["price"] > 5)   -- parse error
items.filter(it => it["price"] > 5)       -- ok

it is recognized in expression position only inside a closure body. Outside of one, it has no special meaning.

Lexical capture

Inside the closure body, the outer record’s fields and let bindings remain visible. For each iteration the closure parameter it is bound to the current element, the body evaluates, then it is removed before the next iteration.

    cxl: |
      let threshold = 10
      emit kept = items.filter(it => it["price"] > threshold)

Here the closure body reads both it (the current array element) and threshold (an outer let binding). The record’s fields are also reachable by name – a closure over items can still read customer_id, region, or any other field on the same record.
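
Python lambdas capture enclosing names in a comparable way, which makes them a convenient analogy. In the sketch below, filter_array is an illustrative stand-in for the filter builtin, and threshold plays the role of the outer let binding.

```python
def filter_array(array, body):
    """Model of items.filter(it => ...): the body runs once per element."""
    if array is None:
        return None
    return [element for element in array if body(element)]


record = {
    "customer_id": "C-9",
    "items": [{"sku": "a", "price": 10}, {"sku": "b", "price": 20}, {"sku": "c", "price": 5}],
}
threshold = 10   # outer binding, visible inside the closure body

kept = filter_array(record["items"], lambda it: it["price"] > threshold)
print(kept)
```

One difference from the analogy: CXL fixes the parameter name to it and does not allow the closure to be stored, whereas a Python lambda is an ordinary value.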

Where closures appear

Closures are valid only as method-call arguments to closure-bearing builtins. They cannot be assigned to variables, stored in fields, or passed to non-closure builtins:

let f = it => it * 2          -- rejected at resolve time
emit doubler = it => it * 2   -- rejected at resolve time

If you need to share a closure across multiple call sites, repeat the literal closure expression. CXL has no first-class function values.

Null propagation

Closure-bearing builtins applied to a null receiver return null without evaluating the body, so the body never runs for records where the array is null:

    cxl: |
      emit kept = items.filter(it => it["price"] > 5)
      -- when `items` is null, `kept` is null; the body never runs

This matches the null-propagation policy on every other builtin – see Null Handling for the wider rules.

Worked example: filter and map over a nested array

Suppose each input record carries an items array of objects, each with sku and price:

{"order_id":"O-1","items":[{"sku":"a","price":10},{"sku":"b","price":20},{"sku":"c","price":5}]}

A transform that drops cheap items and projects the remaining SKUs:

- type: transform
  name: filter_items
  input: orders
  config:
    cxl: |
      emit order_id = order_id
      emit kept = items.filter(it => it["price"] > 5)
      emit kept_skus = items.filter(it => it["price"] > 5).map(it => it["sku"])

For the input above, the transform produces:

{
  "order_id": "O-1",
  "kept": [{"sku": "a", "price": 10}, {"sku": "b", "price": 20}],
  "kept_skus": ["a", "b"]
}

Bracket-index access (it["price"]) reaches into each map element. See Nested Paths for the full traversal surface.

See also

  • Array Methods – the closure-bearing builtins (filter, map, find, any, flat_map).
  • Map Methods – callable on map elements inside a closure body.
  • Nested Paths – bracket-index and dotted-path navigation through nested arrays and maps.
  • Emit Each – statement that fans one input record into many output records, using a binding similar to the closure parameter.

Nested Paths

CXL records can carry nested arrays and maps as field values (for example, a JSON input where each record has an items array of objects). Reaching into that structure uses two complementary forms: dotted paths and bracket indices.

Dotted paths

A dotted identifier path reads a static field name from a map.

doc.metadata.tenant

Each segment must be a valid identifier. Dotted paths are resolved at compile time – the typechecker walks the structure declared in the source schema and reports a missing-field error if any segment doesn’t exist.

- type: transform
  name: project_tenant
  input: events
  config:
    cxl: |
      emit tenant = doc.metadata.tenant
      emit user_id = doc.user.id

Use dotted paths for structures whose shape is fixed and known at authoring time.

Bracket indices

A bracket index reads a runtime-computed key. The receiver may be an array (integer index) or a map (string index).

items[0]
profile["name"]
items.map(it => it["sku"])

Bracket indices are dynamic – the index expression evaluates per record. The typechecker treats the result as Any and does not assert that the key is present.

Integer index on an array

- type: transform
  name: first_item
  input: orders
  config:
    cxl: |
      emit head = items[0]
      emit second = items[1]

For items = [{"sku":"a"},{"sku":"b"},{"sku":"c"}], head is {"sku":"a"} and second is {"sku":"b"}.

Out-of-range indices return null. Negative indices also return null (CXL does not support negative indexing).

String index on a map

    cxl: |
      emit name = profile["name"]
      emit tier = profile["tier"]

Missing keys return null – the lookup never raises an error. This is the same null-propagation policy closure builtins use on their receivers.

Mixing forms

The two forms compose in either order:

    cxl: |
      emit first_sku = items[0]["sku"]
      emit profile_email = users.profile["email"]

items[0]["sku"] is two bracket indices chained – an integer index against the array, then a string index against the resulting map. users.profile["email"] walks a dotted path to reach profile (a map field on users), then bracket-indexes into it for a runtime key.

Null propagation

Every nested-access form propagates null end-to-end. If the receiver is null, the result is null without evaluating the index expression:

    cxl: |
      emit sku = items[0]["sku"]
      -- when `items` is null, `sku` is null
      -- when `items[0]` is null, `sku` is also null

This matches the null behavior on dotted paths and on method-call receivers. Records with missing intermediate structure produce nulls in their derived fields rather than aborting the transform.
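
The access rules above (null receiver, out-of-range or negative index, missing key) can be summarized in a small Python model. This is a sketch of the documented behavior, not Clinker's implementation: the `index` helper is a hypothetical name, Python's `None` stands in for CXL null, and the scalar-receiver case is an assumption the text does not cover. Python evaluates arguments eagerly, so the "index expression is not evaluated" detail is not modeled.

```python
from typing import Any


def index(receiver: Any, key: Any) -> Any:
    """Model a CXL bracket index: return None (null) instead of raising."""
    if receiver is None:
        return None  # a null receiver propagates
    if isinstance(receiver, list):
        # Integer index on an array: negative or out-of-range yields null.
        is_int = isinstance(key, int) and not isinstance(key, bool)
        if is_int and 0 <= key < len(receiver):
            return receiver[key]
        return None
    if isinstance(receiver, dict):
        return receiver.get(key)  # a missing key yields null
    return None  # assumption: indexing a scalar is also null in this model


items = [{"sku": "a"}, {"sku": "b"}, {"sku": "c"}]
first_sku = index(index(items, 0), "sku")   # models items[0]["sku"]
past_end = index(index(items, 7), "sku")    # index past the end of the array
no_items = index(index(None, 0), "sku")     # the whole array is null
print(first_sku, past_end, no_items)
```

Each chained call mirrors one bracket in the CXL expression, which is why a null at any level reaches the final field unchanged.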

Method calls on indexed values

A bracket-indexed expression is a regular value, so it composes with any method or further index:

    cxl: |
      emit head_sku_upper = items[0]["sku"].upper()
      emit cheap_skus = items.filter(it => it["price"] < 10).map(it => it["sku"])

The first chain reads a string out of nested structure and uppercases it. The second filters an array of maps by a numeric field and projects the SKU strings out.

See also

  • Closures – closures over arrays of maps typically use bracket-index on the it binding.
  • Array Methods – traversal builtins that consume nested arrays.
  • Map Methods – builders and accessors for map values.
  • Null Handling – the wider null-propagation rules.

Emit Each

The emit each statement fans one input record into multiple output records – one per element of an array on the input. The body emits the fields each output record carries. A trailing outer modifier preserves the trigger row when the array is empty or null.

Syntax

emit each <binding> in <source> {
  <statements>
}

  • <binding> is the identifier the body uses to refer to the current array element. The conventional name is it (same as the closure parameter), but any identifier is accepted.
  • <source> is any expression producing an array. Typically a field reference on the input record.
  • The body is a block of let and emit statements that produce one output record per iteration.

Worked example

Suppose each input record carries an items array of objects, each with sku and price:

{"order_id":"O-1","items":[{"sku":"a","price":10},{"sku":"b","price":20},{"sku":"c","price":5}]}

A transform that fans each input into one record per item:

- type: transform
  name: explode
  input: orders
  config:
    cxl: |
      emit each it in items {
        emit order_id = order_id
        emit sku = it["sku"]
        emit price = it["price"]
      }

For the input above, the transform produces three output records:

{"order_id":"O-1","sku":"a","price":10}
{"order_id":"O-1","sku":"b","price":20}
{"order_id":"O-1","sku":"c","price":5}

The body reads both it (the current element) and order_id (an outer record field). Outer-record fields remain visible inside the body for every iteration.

Cardinality

If the source array has N elements, emit each produces exactly N output records. Empty array sources produce zero records. A null source also produces zero records – no DLQ entry, no error – mirroring the explode-on-null convention used elsewhere in CXL.

When fan-out nests, the cardinalities multiply: an outer array of M elements whose inner arrays have N elements each produces up to M×N records. The cumulative max_expansion cap bounds that product.

A non-array, non-null source raises a runtime type-mismatch error and routes the originating record to the DLQ.

Preserving the trigger row: outer

A trailing outer modifier switches emit each to its outer-join variant. The grammar is identical except for the keyword after the source:

emit each <binding> in <source> outer {
  <statements>
}

The only behavioral difference is what happens when the source is null or an empty array. Plain emit each drops the trigger row entirely (zero output records). The outer variant instead emits the trigger row once, with <binding> bound to null:

| Source | emit each ... | emit each ... outer |
| --- | --- | --- |
| 3-element | 3 records | 3 records (identical) |
| empty array | 0 records | 1 record, binding = null |
| null | 0 records | 1 record, binding = null |

This is the shape SQL engines spell LATERAL VIEW OUTER EXPLODE (Spark, Hive) or an outer UNNEST (DuckDB): “for each tag on this article emit a tagged row, but keep articles that have no tags.”

Using the worked example above with an order that carries no items:

{"order_id":"O-2","items":[]}

- type: transform
  name: explode_outer
  input: orders
  config:
    cxl: |
      emit each it in items outer {
        emit order_id = order_id
        emit sku = it["sku"]
        emit price = it["price"]
      }

produces a single record that keeps order_id while the per-item fields read through the null binding:

{"order_id":"O-2","sku":null,"price":null}

Outer-record fields (like order_id) and any emit statements preceding the block still apply to the preserved trigger row, so an outer row is never bare.
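
The difference between the plain and outer variants can be modeled in a few lines of Python. This is only a sketch of the semantics in the table above, not the engine's code: `emit_each` and `item_body` are hypothetical names, `None` stands in for null, and the DLQ routing for a non-array source is reduced to a raised `TypeError`.

```python
def emit_each(record, source_field, body, outer=False):
    """Fan one record into one output per array element (model only)."""
    source = record.get(source_field)
    if source is not None and not isinstance(source, list):
        # The engine routes the record to the DLQ; the model just raises.
        raise TypeError(f"{source_field} is not an array")
    if not source:  # null or empty array
        # outer keeps the trigger row once, with the binding set to null
        return [body(record, None)] if outer else []
    return [body(record, it) for it in source]


def item_body(record, it):
    """Body of the worked example; reads through a null binding safely."""
    element = it if it is not None else {}
    return {
        "order_id": record["order_id"],
        "sku": element.get("sku"),
        "price": element.get("price"),
    }


full = {"order_id": "O-1", "items": [{"sku": "a", "price": 10},
                                     {"sku": "b", "price": 20},
                                     {"sku": "c", "price": 5}]}
empty = {"order_id": "O-2", "items": []}

print(len(emit_each(full, "items", item_body)))          # three items
print(emit_each(empty, "items", item_body))              # plain variant
print(emit_each(empty, "items", item_body, outer=True))  # outer variant
```

Only the null-or-empty branch differs between the two variants; a populated array takes the same path in both.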

The source type rule is slightly wider than plain emit each: a statically-null source is accepted (it is the case the variant exists to handle), alongside arrays and Any. Everything else in this page — the cumulative max_expansion cap, the nesting rules, the body-statement restrictions — applies unchanged to the outer variant. The two variants compose freely: an outer block may nest inside a plain emit each block and vice versa.

Output schema

The body’s emit statements define the output record’s field set, the same way emit does in a regular transform body. Fields the body does not emit fall under the Output node’s include_unmapped policy (see Output Nodes).

Fields written by the body shadow same-named fields on the originating input record.

Nested fan-out: fan-out within fan-out

An emit each body may itself contain emit each blocks — fan-out within fan-out for one trigger row. This is the canonical “for each article, for each section, for each tag, emit a row” shape:

emit each section in article["sections"] {
  emit each tag in section["tags"] {
    emit article_id = article_id
    emit section = section["name"]
    emit tag = tag
  }
}

For one input article, this produces one output record per (section, tag) pair. The inner binding (tag) reads the current inner element; the outer binding (section) and any outer-record field (article_id) stay visible inside the inner body. A field name reused as both an outer and inner binding shadows lexically — the inner binding wins inside the inner body, and the outer value is restored when the inner block finishes.

Emits are positional: an emit placed in the outer body before a nested block applies to every leaf record that block produces, but an emit placed after a nested block does not retroactively reach the records that block already emitted. Put the fields shared across leaves above the nested block.

Plain and outer blocks compose in any order. An inner plain emit each over an empty or null array contributes no records for that branch, while an inner emit each ... outer preserves one trigger row (inner binding bound to null) — exactly the per-level semantics from the single-level table, applied at each level.

Nesting is bounded to 32 levels so that adversarially deep input cannot exhaust the parser stack; legitimate document fan-out is only a few levels deep. Beyond that bound, parsing fails with a “nesting too deep” diagnostic.

The flat-array workaround (precompute a flattened array with .flat_map and use a single emit each) is still available and may be clearer for a simple two-level cartesian product, but is no longer required.

Body-statement restrictions

Within the body, let, emit, trace, and nested emit each / emit each ... outer are accepted. filter and distinct are rejected at evaluation time – a body filter would split work between branches the engine can’t represent. Move filter/distinct logic into a downstream transform, or pre-filter the source array with .filter before the emit each block.

Safety cap: max_expansion

To bound fan-out, every transform body carries a max_expansion cap on the cumulative records emit each may produce from a single original input record. The cap is cumulative across all nesting levels: every leaf record a nested fan-out produces charges against one shared budget, so nesting cannot multiply past the cap undetected. If the cap is exceeded, the originating record routes to the DLQ with category expansion_limit_exceeded instead of producing a truncated or unbounded result. The default cap is 10000.

See Transform Nodes -> Expansion Cap for the YAML field and tuning guidance.
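
As an illustration of the cumulative cap, the following Python sketch models the two-level article fan-out with one shared budget. It is an assumption-laden model rather than the engine's logic: `fan_out` and `ExpansionLimitExceeded` are hypothetical names, and raising an exception stands in for routing the record to the DLQ with category expansion_limit_exceeded.

```python
class ExpansionLimitExceeded(Exception):
    """Stands in for the expansion_limit_exceeded DLQ category."""


def fan_out(record, max_expansion=10000):
    """Model: for each section, for each tag, emit one leaf record."""
    produced = []
    article = record.get("article") or {}
    for section in article.get("sections") or []:
        for tag in section.get("tags") or []:
            # One shared budget across both nesting levels.
            if len(produced) >= max_expansion:
                # No truncated result: the partial output is discarded.
                raise ExpansionLimitExceeded(record.get("article_id"))
            produced.append({
                "article_id": record["article_id"],
                "section": section["name"],
                "tag": tag,
            })
    return produced


record = {
    "article_id": "A-1",
    "article": {"sections": [
        {"name": "intro", "tags": ["x", "y", "z"]},
        {"name": "body", "tags": ["p", "q", "r"]},
    ]},
}

print(len(fan_out(record)))  # 2 sections x 3 tags
```

With `max_expansion=5` the same record raises instead of returning five records, which mirrors the "no truncated result" rule.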


System Variables

CXL provides several system variable namespaces prefixed with $. These give CXL expressions access to pipeline execution context, user-defined variables, per-record metadata, and the current time.

$pipeline.* – Pipeline context

Pipeline variables are accessed via $pipeline.member_name. Some are frozen at pipeline start; others update per record.

Stable (frozen at pipeline start)

| Variable | Type | Description |
| --- | --- | --- |
| $pipeline.name | String | Pipeline name from YAML config |
| $pipeline.execution_id | String | UUID v7, unique per pipeline run |
| $pipeline.batch_id | String | From --batch-id CLI flag, or auto-generated UUID v7 |
| $pipeline.start_time | DateTime | Frozen at pipeline start, deterministic within a run |

$ cxl eval -e 'emit name = $pipeline.name' \
    -e 'emit exec = $pipeline.execution_id'
{
  "name": "cxl-eval",
  "exec": "00000000-0000-0000-0000-000000000000"
}

Counters

| Variable | Type | Description |
| --- | --- | --- |
| $pipeline.total_count | Int | Total records processed so far |
| $pipeline.ok_count | Int | Records that passed successfully |
| $pipeline.dlq_count | Int | Records sent to dead-letter queue |
| $pipeline.filtered_count | Int | Records excluded by filter statements |
| $pipeline.distinct_count | Int | Records excluded by distinct statements |

trace info if $pipeline.total_count % 10000 == 0 then "processed " + $pipeline.total_count.to_string() + " records"

$source.* – Per-record source lineage

$source.* exposes engine-stamped columns that travel with every record from its origin Source node downstream through merges, combines, and transforms. They identify where the record came from and when in event-time it happened. All three columns are filtered out of default Output projections — reference them explicitly with emit if you need them in your output schema.

| Variable | Type | Description |
| --- | --- | --- |
| $source.file | String | Path of the input file the current record was read from. |
| $source.name | String | Name of the Source node that produced the current record. Survives through merge / combine so downstream nodes can branch on origin. |
| $source.event_time | DateTime | Engine-stamped event time, delay-corrected by the source’s watermark.delay. Null when the source has no watermark: block, or when the per-record value did not parse. |

filter $source.name == "src_web"
emit origin = $source.name
emit ingest_file = $source.file
emit ts = $source.event_time

$source.event_time is the column a time-windowed aggregate reads to assign records to windows. It is only populated for records from a source that declares watermark: — otherwise it holds Null.

$vars.* – User-defined variables

User-defined variables are declared in the YAML pipeline config under pipeline.vars: and accessed via $vars.name in CXL expressions.

YAML declaration

pipeline:
  name: invoice_processing
  vars:
    high_value_threshold: 10000
    tax_rate: 0.21
    output_currency: "USD"
    fiscal_year_start_month: 4

CXL usage

filter amount > $vars.high_value_threshold
emit tax = amount * $vars.tax_rate
emit currency = $vars.output_currency

Variables provide a clean way to externalize configuration from CXL logic. Combined with channels, different variable sets can parameterize the same pipeline for different environments or clients.

$meta.* – Per-record metadata

Metadata is a per-record key-value store that travels with the record through the pipeline but is not part of the output columns. Write to it with emit meta; read from it with $meta.field.

Writing metadata

emit meta quality = if amount < 0 then "suspect" else "ok"
emit meta source_system = "legacy_erp"

Reading metadata

Downstream nodes can read metadata:

filter $meta.quality == "ok"
emit audit_system = $meta.source_system

Metadata is useful for tagging records with quality flags, routing hints, or audit information that should not appear in the final output unless explicitly emitted.

now – Current time

The now keyword returns the current wall-clock time as a DateTime value. It is evaluated fresh per record, so each record gets the actual time of its processing.

$ cxl eval -e 'emit timestamp = now'
{
  "timestamp": "2026-04-11T15:30:00"
}

now is useful for timestamping records:

emit processed_at = now
emit days_old = now.diff_days(created_date)

Note: now is a keyword, not a function call. Write now, not now().

Complete example

pipeline:
  name: order_enrichment
  vars:
    discount_threshold: 500
    tax_rate: 0.08

  nodes:
    - name: orders
      type: source
      format: csv
      path: orders.csv

    - name: enrich
      type: transform
      input: orders
      cxl: |
        emit order_id = order_id
        emit amount = amount
        emit discount = if amount > $vars.discount_threshold then 0.1 else 0.0
        emit tax = amount * $vars.tax_rate
        emit total = amount * (1 - discount) + tax
        emit processed_at = now
        emit meta source_file = $source.file
        emit pipeline_run = $pipeline.execution_id

    - name: output
      type: output
      input: enrich
      format: csv
      path: enriched_orders.csv

Null Handling

Null values in CXL represent missing or absent data. CXL uses null propagation – most operations on null produce null – with specific tools for detecting and handling nulls.

Null propagation

When a method receives a null receiver, it returns null without executing. This is called null propagation and applies to every method except the five exempt methods listed under Null propagation exceptions.

$ cxl eval -e 'emit result = null.upper()'
{
  "result": null
}

Propagation flows through method chains:

$ cxl eval -e 'emit result = null.trim().upper().length()'
{
  "result": null
}

Null propagation exceptions

Five methods are exempt from null propagation and actively handle null receivers:

| Method | Null behavior |
| --- | --- |
| is_null() | Returns true |
| type_of() | Returns "Null" |
| is_empty() | Returns true |
| catch(x) | Returns x |
| debug(l) | Passes through null, logs it |

$ cxl eval -e 'emit a = null.is_null()' -e 'emit b = null.type_of()' \
    -e 'emit c = null.catch("fallback")'
{
  "a": true,
  "b": "Null",
  "c": "fallback"
}
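
One way to picture the dispatch rule is a Python model in which every method call first checks for a null receiver. This is a sketch under stated assumptions, not how the CXL evaluator is written: `call`, `NULL_HANDLERS`, and `STRING_METHODS` are hypothetical names, only a handful of methods are modeled, and Python's `str.strip` is a stand-in for `trim`.

```python
# Behavior of the five exempt methods when the receiver is null.
NULL_HANDLERS = {
    "is_null": lambda: True,
    "type_of": lambda: "Null",
    "is_empty": lambda: True,
    "catch": lambda fallback: fallback,
    "debug": lambda label: None,  # passes null through; logging omitted
}

# A few ordinary string methods, enough to model a chain.
STRING_METHODS = {"upper": str.upper, "trim": str.strip, "length": len}


def call(receiver, method, *args):
    """Model a CXL method call with null propagation."""
    if receiver is None:
        handler = NULL_HANDLERS.get(method)
        # Non-exempt methods return null without executing.
        return handler(*args) if handler else None
    if method == "is_null":
        return False
    if method == "catch":
        return receiver  # non-null receiver: the fallback is unused
    return STRING_METHODS[method](receiver, *args)


chain = call(call(call(None, "trim"), "upper"), "length")
print(chain)                               # models null.trim().upper().length()
print(call(None, "catch", "fallback"))     # models null.catch("fallback")
print(call(call(" hi ", "trim"), "upper"))
```

The chain result stays null because each step sees a null receiver and short-circuits before the method body would run.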

Null coalesce operator (??)

The ?? operator returns its left operand if non-null, otherwise its right operand. It is the primary tool for providing default values.

$ cxl eval -e 'emit a = null ?? "default"' \
    -e 'emit b = "present" ?? "default"'
{
  "a": "default",
  "b": "present"
}

Chain multiple ?? operators for fallback chains:

$ cxl eval -e 'emit result = null ?? null ?? "last resort"'
{
  "result": "last resort"
}
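
For readers who think in Python, a fallback chain behaves like "first non-null value wins". The sketch below models that rule with a hypothetical `coalesce` helper and `None` as null. It is not a drop-in for Python's `or`, because `??` as described tests only for null, so values such as 0 or false are kept.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


print(coalesce(None, "default"))            # models null ?? "default"
print(coalesce("present", "default"))       # models "present" ?? "default"
print(coalesce(None, None, "last resort"))  # models a three-step chain
print(coalesce(0, 5))                       # 0 is non-null, so it is kept
```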

Three-valued logic

Boolean operations with null follow three-valued logic (like SQL):

and

| Left | Right | Result |
| --- | --- | --- |
| true | null | null |
| false | null | false |
| null | true | null |
| null | false | false |
| null | null | null |

The key insight: false and null is false because the result is false regardless of the unknown value.

or

| Left | Right | Result |
| --- | --- | --- |
| true | null | true |
| false | null | null |
| null | true | true |
| null | false | null |
| null | null | null |

The key insight: true or null is true because the result is true regardless of the unknown value.

not

| Operand | Result |
| --- | --- |
| true | false |
| false | true |
| null | null |
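
The three truth tables can be reproduced with a short Python model, which may help when reasoning about filter conditions on nullable fields. The function names `and3`, `or3`, and `not3` are hypothetical, and `None` stands in for null. This is a sketch of the documented tables, not the evaluator itself.

```python
def and3(left, right):
    """Three-valued AND: false dominates, then null, then true."""
    if left is False or right is False:
        return False
    if left is None or right is None:
        return None
    return True


def or3(left, right):
    """Three-valued OR: true dominates, then null, then false."""
    if left is True or right is True:
        return True
    if left is None or right is None:
        return None
    return False


def not3(operand):
    """Three-valued NOT: null stays null."""
    return None if operand is None else (not operand)


print(and3(False, None), or3(True, None), not3(None))
```

The ordering of the checks encodes the key insight from each table: a dominating operand decides the result before null is even considered.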

Arithmetic with null

Any arithmetic operation involving null produces null:

$ cxl eval -e 'emit result = 5 + null'
{
  "result": null
}

Comparison with null

Comparisons involving null produce null (not false):

$ cxl eval -e 'emit result = null == null'
{
  "result": null
}

To test for null, use is_null():

$ cxl eval -e 'emit result = null.is_null()'
{
  "result": true
}

Practical patterns

Fallback values with ??

emit name = raw_name ?? "Unknown"
emit amount = raw_amount ?? 0
emit active = is_active ?? false

Safe conversion with try_* and ??

emit price = raw_price.try_float() ?? 0.0
emit qty = raw_qty.try_int() ?? 1

Explicit null testing

filter not amount.is_null()
emit has_email = not email.is_null()

Catch method (equivalent to ??)

emit name = raw_name.catch("Unknown")

Conditional null handling

emit status = if amount.is_null() then "missing"
    else if amount < 0 then "invalid"
    else "ok"

Filter blank or null

# Filter out records where name is null or empty string
filter not name.is_empty()

Null-safe chaining

When working with fields that may be null, place the null check early or use ??:

# Safe: coalesce first, then transform
emit normalized = (raw_name ?? "").trim().upper()

# Safe: test before use
emit name = if raw_name.is_null() then "N/A" else raw_name.trim()

Modules & use

CXL supports a module system for organizing reusable expressions. Modules contain function declarations and constant bindings that can be imported into CXL programs.

Module files

A module is a .cxl file containing fn declarations and let constants. Module files live in the rules path (default: ./rules/).

Function declarations

Functions are pure, single-expression bodies with named parameters:

fn fiscal_year(d) = if d.month() < 4 then d.year() - 1 else d.year()

fn full_name(first, last) = first.trim() + " " + last.trim()

fn clamp_pct(value) = value.clamp(0, 100).round_to(1)

Functions are pure – they have no side effects and always return a value.

Module constants

Constants are let bindings at the module level:

let tax_rate = 0.21
let max_retries = 3
let default_currency = "USD"

Example module file

File: rules/shared/dates.cxl

fn fiscal_year(d) = if d.month() < 4 then d.year() - 1 else d.year()

fn quarter(d) = match {
  d.month() <= 3  => 1,
  d.month() <= 6  => 2,
  d.month() <= 9  => 3,
  _               => 4
}

fn fiscal_quarter(d) = quarter(d.add_months(-3))

let fiscal_start_month = 4
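
To sanity-check the module's date logic outside CXL, the same three functions can be mirrored in Python. This is a rough equivalent for verification only: `add_months` here is a hypothetical helper that clamps the day to the end of the target month, which is an assumption about how CXL's add_months behaves (the fiscal-quarter result depends only on the month, so the clamp does not affect it).

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by whole months, clamping the day (assumed behavior)."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def fiscal_year(d):
    # Fiscal year starts in April: January to March belong to the prior year.
    return d.year - 1 if d.month < 4 else d.year


def quarter(d):
    if d.month <= 3:
        return 1
    if d.month <= 6:
        return 2
    if d.month <= 9:
        return 3
    return 4


def fiscal_quarter(d):
    # Shifting back three months maps April onto calendar quarter 1.
    return quarter(add_months(d, -3))


print(fiscal_year(date(2024, 2, 10)), fiscal_quarter(date(2024, 2, 10)))
print(fiscal_year(date(2024, 4, 15)), fiscal_quarter(date(2024, 4, 15)))
```

February 2024 falls in fiscal year 2023, fourth fiscal quarter, while April 2024 opens fiscal year 2024.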

Importing modules

Use the use statement to import a module. Module paths use dot notation (not ::):

use shared.dates as d

This imports the module at rules/shared/dates.cxl and binds it to the alias d.

Import syntax

use module.path
use module.path as alias

The as alias clause is optional. When omitted, the last segment of the path becomes the default name.

use shared.dates          # access as dates::fiscal_year(...)
use shared.dates as d     # access as d::fiscal_year(...)

Path resolution

Module paths are resolved relative to the rules path:

| Import | File path |
| --- | --- |
| use shared.dates | rules/shared/dates.cxl |
| use transforms.normalize | rules/transforms/normalize.cxl |
| use utils | rules/utils.cxl |

The rules path defaults to ./rules/ and can be overridden with --rules-path.
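
The mapping from a dotted module path to a file is mechanical, and a small Python sketch makes it explicit. The helpers `resolve_module` and `default_alias` are hypothetical names for illustration. The real resolver may apply extra checks (for example, existence or path-safety rules) that are not modeled here.

```python
from pathlib import PurePosixPath


def resolve_module(module_path, rules_path="./rules/"):
    """Map a dotted module path to its .cxl file under the rules path."""
    segments = module_path.split(".")
    return PurePosixPath(rules_path).joinpath(*segments).with_suffix(".cxl")


def default_alias(module_path):
    """Without an `as` clause, the last path segment names the module."""
    return module_path.split(".")[-1]


print(resolve_module("shared.dates"))
print(resolve_module("utils"))
print(default_alias("shared.dates"))
```

Passing a different `rules_path` models the effect of --rules-path: only the directory prefix changes, not the segment-to-directory mapping.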

Using imported functions and constants

After importing, reference module members with :: (double colon) syntax:

use shared.dates as d
use shared.finance as f

emit fiscal_year = d::fiscal_year(invoice_date)
emit quarter = d::quarter(invoice_date)
emit tax = amount * f::tax_rate
emit net = amount - tax

Functions

Call functions with alias::function_name(args):

use shared.dates as d
emit fy = d::fiscal_year(order_date)

Constants

Access constants with alias::constant_name:

use shared.finance as f
emit tax = amount * f::tax_rate

Restrictions

  • No wildcard imports. use shared.* is not supported. Import modules explicitly.
  • Dot separator only. Module paths use ., not ::. The :: syntax is reserved for member access after import.
  • Single expression bodies. Functions must be a single expression – no multi-statement bodies.
  • Pure functions. Functions cannot use emit, filter, distinct, or other statement forms. They are pure computations.
  • No recursion. Functions cannot call themselves (directly or indirectly).

Complete example

File: rules/etl/clean.cxl

fn normalize_name(name) = name.trim().upper()

fn safe_amount(raw) = raw.try_float() ?? 0.0

fn flag_suspicious(amount, threshold) =
  if amount > threshold then "review" else "ok"

let max_amount = 999999.99

Pipeline CXL block:

use etl.clean as c

emit customer = c::normalize_name(raw_customer)
emit amount = c::safe_amount(raw_amount)
filter amount <= c::max_amount
emit review_flag = c::flag_suspicious(amount, 10000)

The cxl CLI Tool

The cxl command-line tool validates, evaluates, and formats CXL source files. It is the standalone companion to the Clinker pipeline engine, useful for testing expressions, validating transforms, and debugging CXL logic.

Commands

cxl check

Parse, resolve, and type-check a .cxl file. Reports errors with source locations and fix suggestions.

$ cxl check transform.cxl
ok: transform.cxl is valid

On errors:

error[parse]: expected expression, found '}' (at transform.cxl:12)
  help: check for missing operand or extra closing brace
error[resolve]: unknown field 'amoutn' (at transform.cxl:5)
  help: did you mean 'amount'?
error[typecheck]: cannot apply '+' to String and Int (at transform.cxl:8)
  help: convert one operand — use .to_int() or .to_string()

cxl eval

Evaluate CXL expressions against provided data and print the result as JSON.

Inline expression:

$ cxl eval -e 'emit result = 1 + 2'
{
  "result": 3
}

From a file with field values:

$ cxl eval transform.cxl \
    --field Price=10.5 \
    --field Qty=3

From a file with JSON input:

$ cxl eval transform.cxl --record '{"price": 10.5, "qty": 3}'

Multiple inline expressions:

$ cxl eval -e 'let tax = 0.21' -e 'emit net = price * (1 - tax)' \
    --field price=100
{
  "net": 79.0
}

cxl fmt

Parse and pretty-print a .cxl file in canonical format with normalized whitespace and consistent styling.

$ cxl fmt transform.cxl

Output is printed to stdout. Redirect to overwrite:

$ cxl fmt transform.cxl > transform.cxl.tmp && mv transform.cxl.tmp transform.cxl

Input data

--field name=value

Provide individual field values as key-value pairs. Values are automatically type-inferred:

| Input | Inferred type | Example |
| --- | --- | --- |
| Integer pattern | Int | --field count=42 |
| Decimal pattern | Float | --field price=10.5 |
| true / false | Bool | --field active=true |
| null | Null | --field value=null |
| Anything else | String | --field name=Alice |

$ cxl eval -e 'emit t = amount.type_of()' --field amount=42
{
  "t": "Int"
}
$ cxl eval -e 'emit t = name.type_of()' --field name=Alice
{
  "t": "String"
}
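
The inference table can be read as an ordered series of checks. The Python sketch below models that order. The exact patterns the cxl tool uses for "integer" and "decimal" are not documented on this page, so the regular expressions are assumptions, and `infer_value` and `parse_field` are hypothetical names.

```python
import re


def infer_value(raw):
    """Infer a typed value from the text after `=` (assumed patterns)."""
    if re.fullmatch(r"[+-]?\d+", raw):
        return int(raw)            # Integer pattern -> Int
    if re.fullmatch(r"[+-]?\d+\.\d+", raw):
        return float(raw)          # Decimal pattern -> Float
    if raw == "true":
        return True                # Bool
    if raw == "false":
        return False
    if raw == "null":
        return None                # Null
    return raw                     # Anything else -> String


def parse_field(arg):
    """Split a `name=value` argument at the first `=`."""
    name, _, raw = arg.partition("=")
    return name, infer_value(raw)


print(parse_field("amount=42"))
print(parse_field("price=10.5"))
print(parse_field("name=Alice"))
```

When a value must stay a string even though it looks numeric, --record with an explicit JSON string is the unambiguous alternative.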

--record JSON

Provide a full JSON object as input. Mutually exclusive with --field.

$ cxl eval -e 'emit total = price * qty' \
    --record '{"price": 10.5, "qty": 3}'
{
  "total": 31.5
}

JSON types map directly:

| JSON type | CXL type |
| --- | --- |
| null | Null |
| true / false | Bool |
| integer number | Int |
| decimal number | Float |
| "string" | String |
| [array] | Array |

Output format

Output is always JSON. Each emit statement produces a key-value pair:

$ cxl eval -e 'emit a = 1' -e 'emit b = "two"' -e 'emit c = true'
{
  "a": 1,
  "b": "two",
  "c": true
}

Date and DateTime values are serialized as ISO 8601 strings:

$ cxl eval -e 'emit d = #2024-03-15#'
{
  "d": "2024-03-15"
}

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success (or warnings only) |
| 1 | Parse, resolve, type-check, or evaluation errors |
| 2 | I/O error (file not found, invalid JSON, etc.) |

Pipeline context in eval mode

When running cxl eval, a minimal pipeline context is provided:

| Variable | Value |
| --- | --- |
| $pipeline.name | "cxl-eval" |
| $pipeline.execution_id | Zeroed UUID |
| $pipeline.batch_id | Zeroed UUID |
| $pipeline.start_time | Current wall-clock time |
| $pipeline.source_file | Filename or "<inline>" |
| $pipeline.source_row | 1 |
| now | Current wall-clock time (live) |

Practical usage

Quick expression testing:

$ cxl eval -e 'emit result = "hello world".upper().split(" ").length()'
{
  "result": 2
}

Validate a transform file:

$ cxl check transforms/enrich_orders.cxl && echo "Valid"

Test conditional logic:

$ cxl eval -e 'emit tier = match {
    amount > 1000 => "high",
    amount > 100 => "med",
    _ => "low"
  }' \
    --field amount=500
{
  "tier": "med"
}

Test date operations:

$ cxl eval -e 'emit year = d.year()' -e 'emit month = d.month()' \
    -e 'emit next_week = d.add_days(7)' \
    --record '{"d": "2024-03-15"}'

Test null handling:

$ cxl eval -e 'emit safe = raw.try_int() ?? 0' --field raw=abc
{
  "safe": 0
}

CLI Reference

Clinker ships two command-line tools: clinker (the pipeline runner) and cxl (the CXL checker, evaluator, and formatter, covered in The cxl CLI Tool chapter). This page is the complete reference for clinker.

clinker run

Execute a pipeline.

clinker run [OPTIONS] <CONFIG>

Positional arguments

| Argument | Description |
| --- | --- |
| <CONFIG> | Path to the pipeline YAML configuration file (required) |

Options

| Flag | Default | Description |
| --- | --- | --- |
| --memory-limit <SIZE> | 256M | Memory budget for the execution. Accepts binary (1024-based) K/M/G suffixes (K = 1024 bytes, M = 1024², G = 1024³); a bare integer is bytes. When the limit is approached, aggregation operators spill to disk rather than crashing. CLI value overrides any memory.limit set in the YAML. |
| --threads <N> | number of CPUs | Size of the thread pool used for parallel node execution. |
| --error-threshold <N> | 0 (unlimited) | Maximum number of records routed to the dead-letter queue before the pipeline aborts. 0 means no limit – the pipeline will run to completion regardless of DLQ volume. |
| --batch-id <ID> | UUID v7 | Custom execution identifier. Appears in metrics output and log lines. Use a meaningful value (e.g. daily-2026-04-11) for correlation across retries. |
| --explain [FORMAT] | text | Print the execution plan and exit without processing data. Accepted formats: text, json, dot. See Explain Plans. |
| --dry-run | | Validate the configuration (YAML structure, CXL syntax, type checking, DAG wiring) without reading any data. |
| -n, --dry-run-n <N> | | Process only the first N records through the full pipeline. Implies --dry-run. |
| --dry-run-output <FILE> | stdout | Redirect dry-run output to a file instead of stdout. Only meaningful with -n. |
| --rules-path <DIR> | ./rules/ | Search path for CXL module files referenced by use statements. |
| --base-dir <DIR> | | Base directory for resolving relative paths in the YAML config. Defaults to the directory containing the config file. |
| --allow-absolute-paths | | Permit absolute file paths in the pipeline YAML. By default, absolute paths are rejected to encourage portable configs. |
| --env <NAME> | | Set the active environment. Equivalent to setting CLINKER_ENV. Used by when: conditions in channel overrides. |
| --quiet | | Suppress progress output. Errors are still printed to stderr. |
| --force | | Allow output files to be overwritten if they already exist. Without this flag, the pipeline aborts rather than clobbering existing output. |
| --log-level <LEVEL> | info | Logging verbosity. One of: error, warn, info, debug, trace. |
| --metrics-spool-dir <DIR> | | Directory for per-execution metrics files. See Metrics & Monitoring. |
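
To make the --memory-limit size syntax concrete, here is a minimal Python sketch of the documented rule (binary K/M/G suffixes, bare integers as bytes). `parse_size` is a hypothetical helper, not Clinker's parser. Whether lowercase suffixes or fractional values are accepted is not stated on this page, so the sketch accepts neither.

```python
MULTIPLIERS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '256M' to bytes; a bare integer is bytes."""
    suffix = text[-1:]
    if suffix in MULTIPLIERS:
        return int(text[:-1]) * MULTIPLIERS[suffix]
    return int(text)  # raises ValueError for anything unrecognized


print(parse_size("256M"))   # the default budget, in bytes
print(parse_size("512M"))
print(parse_size("4096"))   # bare integer: already bytes
```

Under this rule the default 256M budget is 268435456 bytes, which is the figure to compare against when sizing a container or cgroup limit around a run.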

Examples

# Basic execution
clinker run pipeline.yaml

# Production run with memory budget and forced overwrite
clinker run pipeline.yaml --memory-limit 512M --force --log-level warn

# Validate without processing
clinker run pipeline.yaml --dry-run

# Preview first 10 records
clinker run pipeline.yaml --dry-run -n 10

# Show execution plan as Graphviz
clinker run pipeline.yaml --explain dot | dot -Tpng -o plan.png

# Run with a custom batch ID for tracing
clinker run pipeline.yaml --batch-id "daily-2026-04-11" --metrics-spool-dir ./metrics/

clinker metrics collect

Sweep per-execution metrics files from a spool directory into a single NDJSON archive.

clinker metrics collect [OPTIONS]

Options

| Flag | Description |
| --- | --- |
| --spool-dir <DIR> | Spool directory to sweep (required). |
| --output-file <FILE> | NDJSON archive destination (required). If the file exists, new entries are appended. |
| --delete-after-collect | Remove spool files after they have been successfully written to the archive. |
| --dry-run | Preview which files would be collected without writing anything. |

Examples

# Collect and archive, then clean up spool
clinker metrics collect \
  --spool-dir /var/spool/clinker/ \
  --output-file /var/log/clinker/metrics.ndjson \
  --delete-after-collect

# Preview what would be collected
clinker metrics collect \
  --spool-dir ./metrics/ \
  --output-file ./archive.ndjson \
  --dry-run

Environment Variables

| Variable | Description |
| --- | --- |
| CLINKER_ENV | Active environment name. Equivalent to --env. Used by when: conditions in channel overrides to select environment-specific configuration. |
| CLINKER_METRICS_SPOOL_DIR | Default metrics spool directory. Overridden by --metrics-spool-dir. |

Precedence (highest to lowest): CLI flag, environment variable, YAML config value.
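
The precedence rule amounts to "take the first source that supplies a value". A minimal Python sketch, assuming a hypothetical `resolve_setting` helper and using `None` for "not supplied":

```python
def resolve_setting(cli_value=None, env_value=None, yaml_value=None):
    """Pick a setting by precedence: CLI flag, then env var, then YAML."""
    for candidate in (cli_value, env_value, yaml_value):
        if candidate is not None:
            return candidate
    return None  # nothing supplied; the tool's built-in default applies


# --env prod on the command line beats CLINKER_ENV=staging.
print(resolve_setting(cli_value="prod", env_value="staging"))

# With no flag, the environment variable is used.
print(resolve_setting(env_value="staging"))
```

The values "prod" and "staging" are illustrative environment names, not names defined by Clinker.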

Validation & Dry Run

Clinker provides two levels of pre-flight validation so you can catch problems before committing to a full run.

Config-only validation

clinker run pipeline.yaml --dry-run

This validates everything that can be checked without reading data:

  • YAML structure and required fields
  • CXL syntax and compile-time type checking
  • Schema compatibility between connected nodes
  • DAG wiring (no cycles, no dangling inputs, no missing nodes)
  • File path resolution (existence checks for inputs)

No records are read. No output files are created. The command exits with code 0 on success or code 1 with a diagnostic message on failure.

Use this after every YAML edit. It runs in milliseconds and catches the majority of configuration mistakes.

Record preview

clinker run pipeline.yaml --dry-run -n 10

This reads the first 10 records from each source and processes them through the full pipeline – transforms, aggregations, routing, and output formatting. Results are printed to stdout.

The record preview exercises the runtime evaluation path, catching issues that config-only validation cannot:

  • CXL expressions that are syntactically valid but fail at runtime (e.g., calling a string method on an integer)
  • Data format mismatches between the declared schema and actual file contents
  • Unexpected null values in required fields

Save preview to file

clinker run pipeline.yaml --dry-run -n 100 --dry-run-output preview.csv

The output format matches what the pipeline’s output node would produce, so preview.csv shows you exactly what the full run will write.

Use both validation levels in sequence before every production run:

  1. --dry-run – catch configuration and type errors instantly.
  2. --dry-run -n 10 – verify output shape and values against real data.
  3. Full run – execute with confidence.

This three-step pattern is especially valuable when:

  • Editing CXL expressions in transform or aggregate nodes
  • Changing source schemas or swapping input files
  • Adding or removing nodes from the pipeline DAG
  • Modifying route conditions

Combining with explain

You can also inspect the execution plan before running:

clinker run pipeline.yaml --explain

This shows the DAG structure, parallelism strategy, and node ordering without reading any data. See Explain Plans for details.

The typical full pre-flight sequence is:

clinker run pipeline.yaml --explain          # inspect the DAG
clinker run pipeline.yaml --dry-run          # validate config
clinker run pipeline.yaml --dry-run -n 10    # preview with data
clinker run pipeline.yaml --force            # run for real
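
For teams that wrap the pre-flight sequence in automation, the same four steps can be driven from a short Python script. This is a sketch only: `preflight_commands` and `run_preflight` are hypothetical helper names, the script assumes the clinker binary is on PATH, and it uses only the flags documented on this page. Stopping at the first non-zero exit code is a design choice of the sketch.

```python
import subprocess


def preflight_commands(config, preview_records=10):
    """Build the argv list for each pre-flight step, in order."""
    base = ["clinker", "run", config]
    return [
        base + ["--explain"],                              # inspect the DAG
        base + ["--dry-run"],                              # validate config
        base + ["--dry-run", "-n", str(preview_records)],  # preview with data
        base + ["--force"],                                # run for real
    ]


def run_preflight(config, preview_records=10):
    """Run each step; return the first non-zero exit code, else 0."""
    for argv in preflight_commands(config, preview_records):
        result = subprocess.run(argv)
        if result.returncode != 0:
            return result.returncode
    return 0


for argv in preflight_commands("pipeline.yaml"):
    print(" ".join(argv))
```

Because --force overwrites existing output files, a wrapper like this is safest when the final step writes to a location that is expected to be replaced on every run.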

Explain Plans

The --explain flag prints the execution plan – the DAG of nodes, their connections, and the parallelism strategy the optimizer has chosen – without reading any data.

Text format

clinker run pipeline.yaml --explain
# or explicitly:
clinker run pipeline.yaml --explain text

The text format shows a human-readable summary of the execution plan:

Execution Plan: customer_etl
============================

Node 0: customers (Source, parallel: file-chunked)
  -> transform_1

Node 1: transform_1 (Transform, parallel: record)
  -> route_1

Node 2: route_1 (Route, parallel: record)
  -> [high] output_high
  -> [default] output_standard

Node 3: output_high (Output, parallel: serial)

Node 4: output_standard (Output, parallel: serial)

Key information shown:

  • Node index and name – the topological position in the DAG
  • Node type – Source, Transform, Aggregate, Route, Merge, Output, Composition
  • Parallelism strategy – how the optimizer plans to execute the node
  • Connections – downstream nodes, with port labels for route branches
  • Buffer class (Physical Properties section) – buffer: streaming for nodes whose output is handed straight to a single downstream consumer rather than crossing a charged inter-stage buffer (fused Source → Transform → Output chains, Merge.interleave of Sources, single-branch Route, non-fused Merge, streaming-strategy Aggregate, hash-build-probe Combine probe-side, and every sink Output); buffer: materialized for nodes whose output sits in an inter-stage buffer between dispatch arms. See Streaming vs. Blocking Stages for which streaming stages bound their footprint to one batch and which only spare the second copy
  • Arbitration parameters (Physical Properties section, plus a === Buffer Edges === block) – each node’s arbitration: spill_priority=.., can_back_pressure=.. line shows which operator the memory arbitrator would spill or pause first. See Reading --explain arbitration output for the full annotation model and a worked example.

The buffer class is a pre-runtime signal for memory pressure: every materialized node charges its in-flight rows against pipeline.memory.limit and may spill to disk once the soft threshold trips. A streaming node’s output crosses no charged inter-stage buffer, so it is never charged twice and never spill-eligible (though a non-fused streaming stage still builds its own result before handing it off — see Streaming vs. Blocking Stages). Use the annotation alongside --memory-limit / pipeline.memory.limit to predict which stages dominate the RSS budget before running the pipeline.

JSON format

clinker run pipeline.yaml --explain json

Produces a machine-readable JSON object for programmatic consumption. Useful for:

  • CI pipelines that need to assert plan properties
  • Custom dashboards that visualize execution plans
  • Diffing plans between config versions

# Compare plans before and after a config change
clinker run old.yaml --explain json > plan_old.json
clinker run new.yaml --explain json > plan_new.json
diff plan_old.json plan_new.json
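
A textual diff also flags reordered keys and whitespace changes. As a minimal sketch of a structural comparison, the Python below flattens two parsed JSON documents into path/value pairs and reports only the leaves that differ. It assumes nothing about the plan schema: flatten, diff_plans, and the sample keys are illustrative names, not part of Clinker.

```python
import json

ABSENT = "<absent>"  # placeholder shown when a path exists on one side only


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {"a.b[0].c": leaf} pairs."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for i, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}[{i}]"))
    else:
        flat[prefix] = value
    return flat


def diff_plans(old, new):
    """Return sorted (path, old_leaf, new_leaf) tuples for every difference."""
    a, b = flatten(old), flatten(new)
    changes = []
    for path in sorted(set(a) | set(b)):
        left, right = a.get(path, ABSENT), b.get(path, ABSENT)
        if left != right:
            changes.append((path, left, right))
    return changes


if __name__ == "__main__":
    # Hypothetical plan fragments; the real JSON schema may differ.
    old = json.loads('{"mode": "x", "nodes": [{"name": "a", "parallel": "record"}]}')
    new = json.loads('{"nodes": [{"parallel": "serial", "name": "a"}], "mode": "x"}')
    for path, before, after in diff_plans(old, new):
        print(f"{path}: {before} -> {after}")
```

In practice the two inputs would come from json.load on plan_old.json and plan_new.json. Key order does not matter here, so only real plan changes are reported.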

Graphviz DOT format

clinker run pipeline.yaml --explain dot

Produces a Graphviz DOT graph. Pipe it to dot to render an image:

# PNG
clinker run pipeline.yaml --explain dot | dot -Tpng -o pipeline.png

# SVG (scalable, good for documentation)
clinker run pipeline.yaml --explain dot | dot -Tsvg -o pipeline.svg

# PDF
clinker run pipeline.yaml --explain dot | dot -Tpdf -o pipeline.pdf

This requires the graphviz package to be installed on the system.

The resulting diagram shows:

  • Nodes as labeled boxes with type and parallelism annotations
  • Edges as arrows with port labels where applicable
  • Branch/merge fan-out and fan-in structure

When to use explain

  • During development – verify the DAG shape matches your mental model before writing test data.
  • After adding route or merge nodes – confirm branch wiring is correct.
  • When tuning parallelism – check which strategy the optimizer selected for each node.
  • In code review – generate a DOT diagram and include it in the PR for visual confirmation.

Explain runs instantly because it only parses the YAML and builds the plan – no data is touched. Pair it with --dry-run for full config validation:

clinker run pipeline.yaml --explain       # inspect plan
clinker run pipeline.yaml --dry-run       # validate config

Retraction section

Pipelines in which at least one Aggregate has a group_by that omits a correlation-key field get a === Retraction === block in the text output. The engine selects the retraction-mode path automatically based on group_by content. The block is omitted for every other pipeline, so strict-correlation and non-correlated --explain output is unchanged.

The block opens with a one-line summary – retraction enabled — N relaxed aggregates, M buffer-mode windows, fanout policy: <policy>. – followed by one block per retraction-mode Aggregate and one per buffer-mode window index.

Per retraction-mode Aggregate the block reports:

  • the resolved accumulator path (Reversible or BufferRequired),
  • the per-row lineage memory cost (~8 bytes/row for Reversible, n/a for BufferRequired which holds raw contributions instead),
  • the worst-case degrade fallback when retraction’s preconditions break at runtime.

Per buffer-mode window index the block reports:

  • the source name and partition_by fields,
  • the per-row buffer cost in Value slots over the index’s arena fields,
  • the worst-case partition memory ceiling under degrade.

Group cardinality is reported as “unknown at plan time”: the planner has no group-cardinality side-table to consult before the run. For capacity planning, use the operator-by-operator retraction cost reference and the per-row figures the explain block prints, then confirm the live shape via clinker metrics collect after the first production run.

Looking up diagnostic codes

clinker explain --code <CODE> prints the documentation for any registered error or warning code, including retraction-specific codes:

clinker explain --code E15Y   # retraction-mode aggregate incompatible with strategy: streaming

The full set of codes is enumerated in the error returned when an unknown code is passed.

Memory Tuning

Clinker is designed to be a good neighbor on shared servers. Rather than consuming all available memory, it works within a configurable budget and reaches for back-pressure or disk spill before it runs out.

The memory: block

All pipeline-level memory tuning lives under a single optional block:

pipeline:
  name: my_pipeline
  memory:
    limit: "1G"          # optional — defaults to 512M
    backpressure: pause  # optional — defaults to pause

The entire block is optional. A pipeline with no opinions about memory writes nothing:

pipeline:
  name: my_pipeline

…and gets the runtime defaults (512 MB hard limit, backpressure: pause).

Individual fields are also optional. Setting just one is fine:

pipeline:
  name: my_pipeline
  memory:
    limit: "2G"

Setting the memory limit

CLI flag (highest priority):

clinker run pipeline.yaml --memory-limit 512M

YAML config:

pipeline:
  memory:
    limit: "512M"

The CLI flag overrides the YAML value. Suffixes are binary (1024-based): K = 1024 bytes, M = 1024², G = 1024³; a bare integer is bytes. (This differs from the decimal KB/MB/GB used by min_size/max_size, which are 1000-based.)

Default: 512 MB.
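
The suffix arithmetic can be checked with a short sketch. The Python below is an illustration, not Clinker's parser: parse_size and the two unit tables are hypothetical names, and it assumes integer values, case-sensitive suffixes, and no space between number and suffix. The real grammar may accept more.

```python
import re

# Binary suffixes for memory limits, as described above.
BINARY_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}
# Decimal suffixes used by the size filters, shown for contrast.
DECIMAL_UNITS = {"": 1, "KB": 1000, "MB": 1000**2, "GB": 1000**3}


def parse_size(text, units):
    """Convert a string such as "512M" or "1GB" to bytes using a suffix table."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", text.strip())
    if match is None:
        raise ValueError(f"not a size: {text!r}")
    number, suffix = match.groups()
    if suffix not in units:
        raise ValueError(f"unknown suffix {suffix!r} in {text!r}")
    return int(number) * units[suffix]


print(parse_size("512M", BINARY_UNITS))   # 536870912
print(parse_size("1G", BINARY_UNITS))     # 1073741824
print(parse_size("1GB", DECIMAL_UNITS))   # 1000000000
```

The last two lines show the gap that matters in practice: a binary 1G is about 7% larger than a decimal 1GB.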

Choosing a backpressure policy

When the soft spill threshold (80 % of limit) trips, somebody has to give up memory. The backpressure knob picks the policy that decides who:

| Value | Active policy | Behavior |
| --- | --- | --- |
| pause (default) | BackPressurePreferred -> Priority | If any consumer can be paused at its inbound channel, pause it. Otherwise pick the lowest-priority consumer (cheapest to spill) and ask it to spill. |
| spill | Priority | Never pause a producer. Pick the lowest-priority consumer and ask it to spill. Closest to the pre-arbitrator react-only behavior, but with deterministic priority-based selection. |
| both | BackPressurePreferred -> LargestFirst | Pause when possible, otherwise force the largest holder to spill regardless of priority. Useful when one operator dominates the budget and a fairness override is wanted. |

pause is the right default for most pipelines: when a fast Source is feeding a slow Combine through a bounded inter-stage buffer, pausing the Source is strictly cheaper than spilling the buffer to disk (no I/O, no serialization round-trip). spill and both exist for the rarer cases where you want a different posture.

To inspect which policy is active before running, use --explain:

=== Execution Plan ===

Mode: ...
DAG nodes: 7
arbitration: BackPressurePreferred -> Priority

The arbitration: line shows the composed policy name. A pipeline with backpressure: spill prints arbitration: Priority; one with backpressure: both prints arbitration: BackPressurePreferred -> LargestFirst.

When to override the default

  • Fast Source + slow Combine → leave on pause (the default). The arbitrator pauses the Source when the inter-stage buffer approaches its share of the budget, and no spill files are written.
  • Two parallel Aggregate stages, one much larger than the other → consider both. BackPressurePreferred -> LargestFirst pauses where it can, then targets the dominant Aggregate for spill, freeing the most headroom per spill call.
  • Pure react-only with deterministic priority → spill. Pauses are disabled; the arbitrator picks the cheapest-to-spill consumer (node_buffers before grace-hash before sort before Aggregate) every time. Closest to the pre-arbitrator behavior.
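
The three policies can be compared side by side with a small model. This is a sketch under stated assumptions, not the arbitrator's code: Consumer and choose_action are hypothetical names, None stands in for a spill_priority of N/A, and when several consumers can be paused the sketch simply takes the first (the text above does not say how that tie is broken).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Consumer:
    name: str
    spill_priority: Optional[int]  # None models "N/A": nothing to spill
    can_back_pressure: bool
    usage_bytes: int


def choose_action(policy, consumers):
    """Return ("pause" | "spill", consumer name), or None if nothing can act."""
    if policy not in ("pause", "spill", "both"):
        raise ValueError(f"unknown backpressure policy: {policy!r}")

    # BackPressurePreferred: pause and both try a pause before any spill.
    if policy in ("pause", "both"):
        pausable = [c for c in consumers if c.can_back_pressure]
        if pausable:
            return ("pause", pausable[0].name)  # assumption: first one wins

    spillable = [c for c in consumers if c.spill_priority is not None]
    if not spillable:
        return None
    if policy == "both":
        # LargestFirst: ignore priority, target the biggest holder.
        victim = max(spillable, key=lambda c: c.usage_bytes)
    else:
        # Priority: the lowest spill_priority is the cheapest victim.
        victim = min(spillable, key=lambda c: c.spill_priority)
    return ("spill", victim.name)


source = Consumer("source.orders", None, True, 1024)
buffer = Consumer("node_buffer[0]", 0, False, 4096)
agg = Consumer("aggregation.dept_totals", 30, False, 65536)

print(choose_action("pause", [source, buffer, agg]))  # ('pause', 'source.orders')
print(choose_action("spill", [source, buffer, agg]))  # ('spill', 'node_buffer[0]')
print(choose_action("both", [buffer, agg]))           # ('spill', 'aggregation.dept_totals')
```

The last call shows the difference between pause and both once nothing can be paused: Priority would pick the small buffer, LargestFirst picks the dominant Aggregate.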

Streaming batch size (batch_size)

pipeline.batch_size sets how many events (records plus document-boundary punctuations) a streaming-eligible stage hands off to its downstream consumer at a time over a back-pressured channel. For a fused stage (Source → Transform → Output, Merge.interleave of Sources) it bounds the in-flight working set to one batch rather than the whole stage, because the stage pulls records off a live upstream channel without ever building a full result. The other streaming stages build their full result first and stream it in batches; there the knob sizes only the inter-stage slice, not the producer’s footprint. The knob is optional; omit it to use the built-in default of 2048 events. See Streaming vs. Blocking Stages for the distinction.

pipeline:
  name: orders_rollup
  batch_size: 1024          # optional; default 2048

A per-transform override is available on a Transform’s config.batch_size (see Transform Nodes); it takes precedence over the pipeline value for that one stage. A batch_size of 0 is rejected at config load. The knob affects only the memory profile of streaming stages, never their output — blocking stages (sort, hash Aggregate, Combine build side) ignore it and continue to fully materialize. See Streaming vs. Blocking Stages for the full model.
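
The precedence described above (per-transform override, then pipeline value, then the built-in default) can be sketched in a few lines. effective_batch_size is a hypothetical helper, not a Clinker API. Rejecting negative values alongside 0 is an assumption of this sketch.

```python
DEFAULT_BATCH_SIZE = 2048  # built-in default stated above


def effective_batch_size(pipeline_value=None, transform_value=None):
    """Resolve the batch size one Transform stage would use."""
    settings = (
        ("pipeline.batch_size", pipeline_value),
        ("config.batch_size", transform_value),
    )
    for label, value in settings:
        # 0 is rejected at config load; negatives are rejected here by assumption.
        if value is not None and value < 1:
            raise ValueError(f"{label} must be at least 1, got {value}")
    if transform_value is not None:
        return transform_value      # per-transform override wins
    if pipeline_value is not None:
        return pipeline_value       # then the pipeline-level knob
    return DEFAULT_BATCH_SIZE       # otherwise the default


print(effective_batch_size())            # 2048
print(effective_batch_size(1024))        # 1024
print(effective_batch_size(1024, 256))   # 256
```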

How it works

Clinker tracks memory in two layers. RSS (resident set size) is sampled at chunk boundaries and supplies the primary spill / abort signal. Alongside RSS, every memory-touching operator (Source ingest channels, Aggregate hash maps, sort buffers, grace-hash partitions, sort-merge accumulators, IEJoin arrays, inline-Combine hash tables, node_buffers slots, and window-runtime arenas) registers a MemoryConsumer wrapper with the pipeline-scoped arbitrator. Each operator owns its live byte counter and updates it on every admit / spill transition; the arbitrator queries current_usage() per consumer at every policy poll. This pull-mode attribution lets the policy distinguish reclaimable bytes (what an operator can give up right now) from currently-held bytes — a grace-hash with on-disk partitions, for instance, reports only its in-memory portion.

Window-runtime arenas (the columnar backing store that analytic-window evaluation reads from) are attributed but not independently spillable: an arena is immutable once built and is freed only indirectly, when the operator that consumes its windows drains to disk. Its wrapper reports the arena’s bytes so the arbitrator’s attribution is complete, but ranks last among spill victims so a policy never elects an arena while any consumer that can actually pause or spill remains.

Per-operator arbitration parameters

Each registered consumer carries two parameters the active policy reads: a spill priority (lower is spilled first under Priority) and a back-pressure flag (whether its producer can be paused instead). The defaults are:

| Operator class | spill_priority | can_back_pressure |
| --- | --- | --- |
| node_buffers slot (inter-stage buffer) | 0 | false |
| grace-hash Combine | 10 | false |
| sort buffer / IEJoin build | 20 | false |
| sort-merge Combine | 25 | false |
| hash Aggregate | 30 | false |
| inline-hash Combine | 30 | false |
| Source ingest | N/A | true |
| streaming Aggregate | N/A | false |
| window arena | last | false |

Lower priority is spilled first, so node_buffers slots (priority 0) are the cheapest victim class — spilling an inter-stage buffer to disk costs one LZ4 + postcard round-trip and frees the most reclaimable bytes per call. The blocking operators climb from there: a grace-hash Combine (10) is preferred over a sort buffer (20), which is preferred over a hash Aggregate or inline-hash Combine (30).

A Source and a streaming Aggregate show spill_priority=N/A because neither operator holds spillable accumulated state. A Source’s try_spill always frees zero bytes — its only real lever is the pause its can_back_pressure=true advertises. A streaming Aggregate emits each group as it completes and never accumulates a spillable group table. The N/A here is about the operator’s own state, not its downstream handoff: when a streaming stage’s output rides a per-batch streaming handoff to a single consumer, that handoff registers a priority-0 consumer just like a node_buffers slot does, and its in-flight batches are spilled to disk one batch at a time if RSS crosses the soft threshold while they are in flight. So a streaming Aggregate’s group table is never a spill victim, but the batches it hands downstream can be.

When memory pressure crosses the soft threshold (80 % of limit), the arbitrator runs the active policy to pick a victim and invokes the corresponding action: pause() on a back-pressureable consumer (its producer’s hot loop parks on a Condvar until resume), or try_spill(target_bytes) on a spillable consumer (the consumer’s wrapper flips a spill-requested flag the operator reads at its next batch boundary). When RSS crosses the hard limit, the engine fails fast with E310 MemoryBudgetExceeded.
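
As a worked example of the two thresholds, the sketch below classifies a single RSS sample against a limit. soft_threshold and pressure_state are hypothetical names. Treating "crosses" as strictly greater than the threshold, and rounding the 80 % mark down to a whole byte, are assumptions of this sketch.

```python
SOFT_PERCENT = 80  # soft spill threshold as a percentage of the limit


def soft_threshold(limit_bytes):
    """Bytes above which the arbitrator starts running the active policy."""
    return limit_bytes * SOFT_PERCENT // 100


def pressure_state(rss_bytes, limit_bytes):
    """Classify one RSS sample as "ok", "soft" (arbitrate) or "hard" (abort)."""
    if rss_bytes > limit_bytes:
        return "hard"   # the engine would fail fast with E310
    if rss_bytes > soft_threshold(limit_bytes):
        return "soft"   # the policy picks a consumer to pause or spill
    return "ok"


limit = 512 * 1024**2                          # the default 512M limit
print(soft_threshold(limit))                   # 429496729
print(pressure_state(400 * 1024**2, limit))    # ok
print(pressure_state(450 * 1024**2, limit))    # soft
print(pressure_state(600 * 1024**2, limit))    # hard
```

With the default limit, arbitration therefore begins at roughly 410 MiB of resident memory.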

This means:

  • A pipeline whose pressure can be relieved by pausing or spilling completes regardless of input size, as long as disk space is available.
  • Performance degrades gracefully under memory pressure: expect slower execution (and possibly disk I/O) well before an E310 failure.
  • The limit is enforced by sampling, not by the allocator. RSS is sampled at chunk boundaries, so momentary spikes may briefly exceed a threshold before the policy fires.

Bounded-memory contract for non-fused stages

A stage runs streaming — no charged per-stage node_buffers slot — when it hands its output to a single downstream sink Output and roots no window: fused Source → Transform → Output and Merge.interleave-of-Sources chains, plus single-branch Route, non-fused Merge, streaming-strategy Aggregate, and hash-build-probe Combine probe-side feeding one Output (see Streaming vs. Blocking Stages). The remaining boundaries — multi-branch Route fan-out, a Merge or other operator whose output forks to several consumers, Composition bodies, diamond DAGs, and every blocking strategy — materialize records into per-stage node_buffers. Each slot registers a NodeBufferConsumer with the arbitrator (priority 0 — the cheapest-to-spill victim class), so the active policy’s victim selection is fully attributed.

When a buffer crosses the soft threshold (80 % of the limit) the arbitrator runs the active policy. Under the default pause, the producer feeding the buffer is paused at its inbound channel; under spill or when no consumer can be paused, the slot spills to disk using the same LZ4 + postcard frame format as grace-hash sort partitions. When RSS crosses the hard limit, the engine fails fast with E310 MemoryBudgetExceeded { node } naming the operator whose hot loop polled the abort gate. See error E310 for the full diagnostic model, including the composition-involved two-shape error model.

Spill fires at the producer side of the first slot whose downstream topology permits it — single-consumer, port-less. For a Source feeding a Route, that’s the Source’s own slot, not the Route’s per-branch slots, because the Source has the one outgoing edge that satisfies the topology rule. Per-branch slots can still spill independently when their own row-distribution drives them past the soft threshold, but the canonical case lands at the producer.

Use clinker run --explain to predict which stages will dominate the budget before runtime — each node carries a buffer: streaming | materialized annotation. Materialized nodes charge pipeline.memory.limit as one full-stage slot and spill the whole stage; streaming nodes charge per in-flight batch and, on a single-consumer edge, spill those batches one at a time. Both classes count against the limit and can spill — the annotation tells you the granularity (whole-stage vs. per-batch), not whether a stage is exempt from the budget.

Reading --explain arbitration output

Alongside the buffer: class, every node in the Physical Properties stanza of --explain carries an arbitration: line giving the per-operator parameters the arbitrator would apply at runtime. The numbers are derived at plan time — --explain does no I/O, so there are no live consumers to query — but they mirror the runtime values exactly, so an author can read the spill/pause model before running the pipeline.

For a fast Source feeding a slow Aggregate (the canonical bounded-memory shape), the relevant lines read:

=== Physical Properties ===

source.orders:
  buffer: materialized
  arbitration: spill_priority=N/A, can_back_pressure=true, predicted_peak=1K, predicted_freed=0B, predicted_subtree_reclaim=1K

aggregation.dept_totals:
  buffer: materialized
  arbitration: spill_priority=30, can_back_pressure=false, predicted_peak=1K, predicted_freed=1K, predicted_subtree_reclaim=1K

The Source advertises can_back_pressure=true and spill_priority=N/A: when memory pressure rises, the arbitrator pauses the Source rather than asking it to spill (it has nothing to free). The hash Aggregate advertises the opposite — spill_priority=30, can_back_pressure=false — so it is a spill victim, ranked behind any cheaper consumer.

The three predicted_* values are the scheduler’s inputs (see Scheduling below):

  • predicted_peak is the live volume a node is expected to hold at its peak. It is seeded at a file-backed Source from the on-disk size of its path: file and propagated forward.

  • predicted_freed is what the node returns to the budget the instant it finishes draining. A blocking Aggregate holds its whole accumulated input (predicted_peak=1K) and frees it on drain (predicted_freed=1K), while a streaming Source carries the volume through but frees nothing the instant it drains (predicted_freed=0B).

  • predicted_subtree_reclaim is the largest reclaim the node’s downstream chain eventually unlocks. The Source frees nothing itself, but launching it is the only way to reach the point where its downstream Aggregate can drain, so it inherits that Aggregate’s reclaim (predicted_subtree_reclaim=1K). Propagation of the subtree value stops at a convergence node (the Combine that two independent chains feed), so each feeding chain keeps the distinct reclaim it owns up to the join rather than the shared post-join total.

All three render 0B when no file-size seed reached the node: a multi-file (glob/regex/paths) or absent/unreadable Source, or any node downstream of one. The bytes are formatted in the same binary-prefix units as memory.limit (1K, 64M, 2G), and the same three values appear in --explain json under node_properties.<name>.predicted_peak_bytes, predicted_freed_bytes_on_complete, and predicted_subtree_reclaim_bytes.

A === Buffer Edges === section follows, listing the node_buffers slot between each pair of non-fused stages. Every slot is a priority-0, non-back-pressureable NodeBufferConsumer — the cheapest victim class — and the slot= number is the stable index the executor admits into. The slot carries the producer’s predicted volume (it holds the producer’s materialized output and frees that whole buffer once the consumer drains it):

=== Buffer Edges ===

edge source.orders -> aggregation.dept_totals:
  buffer: node_buffer (slot=0)
  arbitration: spill_priority=0, can_back_pressure=false, predicted_peak=1K, predicted_freed=1K (producer: source)

Reading the two sections together: under memory pressure the default pause policy (BackPressurePreferred -> Priority) first pauses the Source, the only back-pressureable consumer here. When no consumer can be paused, it falls back to priority order and spills the inter-stage buffer (priority 0) before it ever forces the Aggregate (priority 30) to spill. Cross-reference the per-operator table to see where any operator in your own pipeline lands.

Scheduling

When a pipeline has several nodes that are simultaneously runnable — every one of their inputs is ready, so the executor could legally run any of them next — the engine picks one deterministically rather than walking topological position blindly. The common case is a single linear chain where only one node is ever runnable at a time, and there is nothing to choose. The choice matters only for a pipeline whose DAG has multiple independent subgraphs (for example, two unrelated Source → Aggregate branches that a later Combine or Merge joins): both branches’ lead nodes become runnable together.

The engine runs one node to completion before dispatching the next. When two independent chains converge (two Source → Aggregate branches that a later Combine joins), both branches’ outputs must be materialized and held until the Combine consumes them, so the chain that runs second builds its working set while the first chain’s output already sits in a buffer. Running the memory-heaviest chain first therefore drains and releases its large state before the lighter chain’s output has to coexist with it, lowering the peak resident working set. Running it last makes its large state coexist with the already-materialized output of every chain that finished before it.

The ranking also helps when the frontier offers a mix of node kinds. With a blocking operator ready to drain (and reclaim its accumulated state) alongside a fresh Source about to charge a new buffer, draining first reclaims headroom before the new charge lands. Under a tight budget, the engine prefers the runnable node that fits the remaining headroom over one that would overflow it.

The engine ranks the simultaneously-runnable nodes by these rules, in order:

  1. Headroom fit. A node whose predicted_peak fits within the budget’s remaining headroom is preferred over one that does not. Running a node that fits avoids tipping the live working set over the soft threshold and forcing a spill that a different ordering would have avoided. A node with an unknown peak (predicted_peak=0B — no file-size seed reached it) counts as fitting, because 0 is always within any headroom; this keeps an unestimated pipeline on its topological order rather than deprioritizing every node.

  2. Immediate-freed tiebreak. Among nodes that fit equally, the one with the larger predicted_freed runs first. Finishing a node that returns more bytes to the budget the instant it completes maximizes the headroom available to everything still waiting — the same intuition as shortest-remaining-state-first. A ready blocking operator (which reclaims its accumulated state now) therefore wins over a fresh Source (which frees nothing the instant it drains), because the immediate reclaim is the headroom-minimizing choice.

  3. Subtree-reclaim tiebreak. Among nodes that also tie on immediate freed — most importantly the fresh Sources of independent chains, which all free 0 the instant they drain — the one with the larger predicted_subtree_reclaim runs first. This front-loads the chain whose completion eventually frees the most: a Source’s value is the reclaim its downstream Aggregate will release, so the heavier chain’s Source is dispatched ahead of the lighter one even when it sorts later in topological order. Because it ranks below immediate freed, it never elects a fresh heavy Source over a ready light Aggregate (which would raise the peak), only between candidates whose immediate reclaim is equal.

  4. Stable-index tiebreak. If two nodes still tie (equal fit, equal immediate freed, equal subtree reclaim — including the all-unknown case where all are 0), the one with the lower stable node index wins. The index is each node’s position in the plan’s topological order — the exact sequence the executor walks the DAG — so this tiebreak is fully deterministic and independent of the machine, the thread schedule, and the order the runnable set happened to be assembled in.

Fallback to topological order. When no node carries a volume estimate (every predicted_peak is 0B), rules 1–3 are no-ops — every node fits and every node frees the same 0 — so rule 4 alone decides, and the engine runs nodes in exactly the lowest-index / topological order it used before any volume estimates existed. This is the load-bearing guarantee: scheduling never changes record output or branching order. A pipeline’s data output is byte-identical regardless of the predictions; the estimates only steer which runnable node goes first to reclaim headroom sooner, front-load the heaviest chain, and prefer fitting nodes under pressure, never what each node computes.
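
The four ranking rules map naturally onto a lexicographic sort key. The Python below is a minimal sketch of that idea, not the engine's scheduler: Runnable, rank_key, and pick_next are hypothetical names, and the byte figures are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runnable:
    name: str
    index: int                      # stable topological index
    predicted_peak: int             # 0 means "no estimate"
    predicted_freed: int
    predicted_subtree_reclaim: int


def rank_key(node, headroom):
    """Smaller key runs first; tuple order mirrors rules 1 to 4."""
    fits = node.predicted_peak <= headroom          # rule 1: headroom fit
    return (
        not fits,                                   # fitting nodes sort first
        -node.predicted_freed,                      # rule 2: larger freed first
        -node.predicted_subtree_reclaim,            # rule 3: larger reclaim first
        node.index,                                 # rule 4: lower index first
    )


def pick_next(frontier, headroom):
    """Choose the next node to run from the simultaneously-runnable set."""
    return min(frontier, key=lambda node: rank_key(node, headroom))


light = Runnable("source.light", 0, 1024, 0, 1024)
heavy = Runnable("source.heavy", 2, 65536, 0, 65536)
ready_agg = Runnable("aggregation.light", 1, 1024, 1024, 1024)

print(pick_next([light, heavy], headroom=1 << 20).name)      # source.heavy
print(pick_next([heavy, ready_agg], headroom=1 << 20).name)  # aggregation.light
print(pick_next([light, heavy], headroom=10_000).name)       # source.light
```

The three calls exercise rule 3 (the heavier chain’s Source goes first), rule 2 (a ready Aggregate beats a fresh Source), and rule 1 (under a tight budget only the light Source fits). With every estimate at 0 the first three key fields are equal, so the key collapses to the node index, which is the topological fallback described above.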

Because the predictions are a pure function of the plan shape and the input files’ on-disk sizes (resolved against the pipeline file’s directory, never the process working directory), the scheduling decision is identical on every machine for an identical plan over identically-sized inputs.

Sizing guidelines

| Workload | Recommended limit | Notes |
| --- | --- | --- |
| Small files (<10 MB) | 128M | Minimal memory pressure |
| Medium files (10–50 MB) | 256M | Covers most ETL jobs |
| Large files or complex aggregations | 512M (default) – 1G | Multiple group-by keys, large cardinality |
| Multiple large group-by keys | 1G+ | High-cardinality distinct values |

Target workload: Clinker is optimized for 1–5 input files of up to 100 MB each, processing 10K–2M records per run.

Aggregation strategy interaction

Memory consumption depends heavily on the aggregation strategy the optimizer selects:

  • Hash aggregation accumulates state in a hash map. Memory usage is proportional to the number of distinct group-by values. With high-cardinality keys, this can consume significant memory before spill triggers.

  • Streaming aggregation processes groups in order and emits results as each group completes. Memory usage is minimal (proportional to a single group’s state) but requires the input to be sorted by the group-by keys.

  • strategy: auto (the default) lets the optimizer choose based on the declared sort order of the input. If the data arrives sorted by the group-by keys, streaming aggregation is selected automatically.

To influence strategy selection:

  - type: aggregate
    name: rollup
    input: sorted_data
    config:
      group_by: [department]
      strategy: streaming    # force streaming (input MUST be sorted)
      cxl: |
        emit total = sum(amount)

Only force streaming when you are certain the input is sorted by the group-by keys. If the data is not sorted, results will be incorrect. Use auto when in doubt.
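
The sorted-input requirement is a property of the streaming technique in general, and a few lines of Python make it visible. This sketch uses itertools.groupby as a stand-in for a streaming aggregate and a dict as a stand-in for a hash aggregate. It illustrates the technique only and says nothing about how Clinker itself behaves on unsorted input beyond what the text states.

```python
from itertools import groupby


def hash_aggregate(rows):
    """Sum amounts per department; memory grows with the number of groups."""
    totals = {}
    for dept, amount in rows:
        totals[dept] = totals.get(dept, 0) + amount
    return list(totals.items())


def streaming_aggregate(rows):
    """Emit a total each time the key changes; holds one group at a time."""
    out = []
    for dept, group in groupby(rows, key=lambda row: row[0]):
        out.append((dept, sum(amount for _, amount in group)))
    return out


sorted_rows = [("eng", 10), ("eng", 5), ("ops", 7)]
unsorted_rows = [("eng", 10), ("ops", 7), ("eng", 5)]

print(streaming_aggregate(sorted_rows))    # [('eng', 15), ('ops', 7)]
print(hash_aggregate(unsorted_rows))       # [('eng', 15), ('ops', 7)]
print(streaming_aggregate(unsorted_rows))  # [('eng', 10), ('ops', 7), ('eng', 5)]
```

On unsorted input the streaming version closes the eng group too early and emits it twice, which is the kind of silent error that makes auto the safer choice.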

Compositions

A composition (a reusable sub-pipeline included via use:) does not get its own memory budget. Body operators register with the same arbitrator instance as the parent pipeline, admit through the same paths, and spill to the same temporary directory. The recursion is purely structural. Body-scope consumer registrations are unregistered automatically when the body exits, so a body’s NodeBufferConsumer wrappers do not leak into the parent scope’s policy registry.

When a budget exceedance involves a composition, the error message arrives in one of two shapes:

  • At the composition boundary (records flowing into the body via an input port, or back out into the parent) — the error names the composition’s call-site directly (e.g. enrich_call).
  • Inside the body — the error is wrapped so the user-visible call-site name surfaces alongside the body-internal operator that tripped. The rendered message reads in composition 'enrich_call': ... followed by the inner detail.

See error E310 for the full diagnostic model.

Monitoring memory usage

Use the metrics system to track peak_rss_bytes across runs:

clinker run pipeline.yaml --metrics-spool-dir ./metrics/

The metrics file includes peak_rss_bytes, which shows the maximum resident memory during execution. If this consistently approaches your memory limit, consider increasing the budget or restructuring the pipeline to reduce intermediate state.

Shared server considerations

On servers running JVM applications, memory is often at a premium. Recommendations:

  • Set --memory-limit or memory.limit explicitly rather than relying on the default. Know your budget.
  • Use --threads to limit CPU contention alongside memory limits.
  • Monitor peak_rss_bytes in production metrics to right-size the limit over time.
  • Schedule large pipelines during off-peak hours when JVM heap pressure is lower.

Storage & Spill Location

Blocking operators — Aggregate, sort, and grace-hash Combine — accumulate state in memory up to the configured budget, then spill to disk when a soft or hard memory threshold trips, rather than running the process out of memory. By default those spill files land in the operating system’s temporary directory. The [storage] block in clinker.toml lets you redirect them.

The [storage] block

Storage settings are a property of the workspace, not of an individual pipeline, so they live in clinker.toml at the workspace root rather than in the per-pipeline YAML:

[storage.spill]
dir = "/var/clinker/spill"   # optional; default = OS temp dir
disk_cap_bytes = "10GB"      # optional; default = unlimited
compress = "auto"            # optional; auto | off | on   (default = auto)

[storage.staging]
enabled  = false             # opt-in; default off
dir      = "/var/clinker/staging"   # required when enabled
patterns = ["/mnt/nfs/data/**"]     # which sources to stage

The whole block is optional. With no clinker.toml, or a clinker.toml that omits [storage], Clinker spills to the OS temp directory exactly as it always has.

storage.spill.dir — where spill files go

When dir is set, the per-run spill directory (clinker-spill-<random>/) is created under that path, and every blocking operator writes its spill files there. When dir is omitted, the per-run directory is created under the OS temp directory (std::env::temp_dir, typically $TMPDIR or /tmp).

The directory is validated once at startup, before any input is read. If the path does not exist, is a file, or is not writable, the run fails immediately with a diagnostic naming the setting:

storage.spill.dir /var/clinker/spill does not exist; create it or point at an existing volume

Validating up front — rather than at the first spill — means a misconfigured spill volume fails fast, while the run is cheap to abandon, instead of after minutes of work. (This is the trap DuckDB fell into when its temp-directory setting was honored only lazily, duckdb/duckdb#9401.)
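
A deployment script can run the same kind of check before invoking Clinker at all. The sketch below is an assumption-laden illustration: validate_spill_dir is a hypothetical helper, only the first message mirrors the diagnostic quoted above, and the other two wordings are invented.

```python
import os


def validate_spill_dir(path):
    """Return None when the directory is usable, else a diagnostic string."""
    if not os.path.exists(path):
        return (f"storage.spill.dir {path} does not exist; "
                "create it or point at an existing volume")
    if not os.path.isdir(path):
        # Hypothetical wording for the "is a file" case.
        return f"storage.spill.dir {path} is not a directory"
    if not os.access(path, os.W_OK | os.X_OK):
        # Hypothetical wording for the "not writable" case.
        return f"storage.spill.dir {path} is not writable"
    return None


if __name__ == "__main__":
    problem = validate_spill_dir("/var/clinker/spill")
    print(problem or "spill directory looks usable")
```

Note that os.access reports permissions for the current user, so the check should run as the same account that will run the pipeline.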

Why redirect spill off /tmp

On many Linux hosts — especially systemd-managed ones — /tmp is mounted as tmpfs, which is backed by RAM (and swap), not disk. Spilling there does not actually free physical memory: the spill bytes stay resident, defeating the whole point of the memory budget. If df -T /tmp reports a tmpfs filesystem, point storage.spill.dir at a path on a real block device so spilling moves pressure off RAM and onto disk.
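
The same check can be scripted. As a rough sketch, the function below finds the filesystem type for a path by matching it against the mount table, assuming the Linux /proc/mounts layout (device, mount point, type, options). filesystem_type is a hypothetical helper. It ignores the octal escapes /proc/mounts uses for spaces in mount points and does not resolve symlinks.

```python
def filesystem_type(path, mounts_text):
    """Return the fs type of the longest mount point that contains path."""
    target = path.rstrip("/") + "/"
    best_point, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same point shadow an earlier one.
        if target.startswith(prefix) and len(mount_point) >= len(best_point):
            best_point, best_type = mount_point, fs_type
    return best_type


# Sample mount table, invented for illustration.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
/dev/sdb1 /var/clinker ext4 rw,relatime 0 0
"""

print(filesystem_type("/tmp", SAMPLE_MOUNTS))                # tmpfs
print(filesystem_type("/var/clinker/spill", SAMPLE_MOUNTS))  # ext4
```

On a live Linux host, pass the contents of /proc/mounts as mounts_text. A result of tmpfs for the spill root is the signal to set storage.spill.dir.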

Inspecting the resolved spill root

clinker run --explain prints the resolved spill root and where it came from, so you can confirm the setting took effect before committing to a run:

Spill root: /var/clinker/spill [storage.spill.dir]

…or, with no configuration:

Spill root: /tmp [OS temp dir (default)]

The same --explain output reports the resolved disk cap on the next line:

Spill disk cap: 10000000000 bytes [storage.spill.disk_cap_bytes]

…or, with no cap configured:

Spill disk cap: unlimited (default)

Finally, --explain reports the resolved compression decision per spill-writing operator, so you can see which spills will be LZ4-framed (lz4) and which will be written raw (off) before the run starts. Under auto the choice varies by operator width:

Spill compression: Auto [storage.spill.compress]
  Aggregate 'totals' → lz4
  Sort 'by_amount' → off

Only operators that actually write spill files appear here: the external sort, the hash Aggregate, and the grace-hash / sort-merge Combine. In-memory join strategies (the inline hash build/probe and the IEJoin range join) run their kernel entirely in RAM and never open a spill file, so spill compression does not apply to them and they are omitted from this list — even though they carry a spill priority for memory arbitration.

storage.spill.disk_cap_bytes — cap cumulative spill

By default a run will spill as much as it needs, limited only by the physical space on the spill volume. disk_cap_bytes sets a cumulative budget: the total on-disk size of every spill file a run writes. When that running total would cross the cap, the run aborts with a dedicated diagnostic instead of continuing to fill the volume.

[storage.spill]
dir = "/mnt/fast-ssd/clinker-spill"
disk_cap_bytes = "50GB"

The value accepts the same human-readable byte-size grammar as the source size filters — a bare integer is bytes, and KB/MB/GB suffixes use decimal units (1GB = 1,000,000,000 bytes), matching du, df, and the AWS CLI. Omitting the key leaves spill unlimited, exactly as before.
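To make the grammar concrete, here is a minimal sketch of a parser for the forms described above: a bare integer, or an integer with a decimal KB/MB/GB suffix. parse_byte_size is a hypothetical name, and the details it fixes (case-sensitive suffixes, optional whitespace, no other units) are assumptions, not Clinker's documented behavior.

```python
import re

# Decimal multipliers, as described for disk_cap_bytes (1GB = 1,000,000,000).
_DECIMAL_UNITS = {"": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_byte_size(text):
    """Parse '50GB', '10MB', or a bare integer into a byte count."""
    match = re.fullmatch(r"(\d+)\s*(KB|MB|GB)?", text.strip())
    if match is None:
        raise ValueError(f"not a byte size: {text!r}")
    number, unit = match.group(1), match.group(2) or ""
    return int(number) * _DECIMAL_UNITS[unit]
```

With this reading, "50GB" is 50,000,000,000 bytes, which is about 46.6 GiB; keep that gap in mind when sizing the cap against a volume reported in binary units.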

The cap is a policy ceiling, deliberately independent of both the memory budget and the physical volume size. A run can sit well inside its memory.limit and still exhaust local disk through an unbounded stream of spill files; the cap lets an operator bound that on a shared volume. It is the guard DataFusion shipped without (apache/datafusion#15358) until production runs filled volumes.

storage.spill.compress — LZ4 compression policy

Spill files are postcard-encoded record streams. By default each stream is wrapped in an LZ4 frame, which shrinks large spilled runs. But LZ4 carries a per-frame fixed cost — clearing the compressor’s internal state on every frame reset — and on small spills that cost can outweigh the byte savings. The LZ4 v1.8.2 release notes call this out directly, and Pentaho Kettle ships explicit guidance to turn spill compression off for small rows.

compress controls the policy:

[storage.spill]
compress = "auto"   # auto | off | on   (default = auto)

| Mode | Behavior |
| --- | --- |
| auto (default) | Compress only when a spilled batch is projected large enough to amortize LZ4’s per-frame cost: both ≥ 4 KiB and ≥ 1024 rows. Below either threshold the batch is written raw. The projection comes from the operator’s schema width and the run’s batch_size, so the decision is made per blocking operator. |
| off | Never compress. Postcard records are written straight to disk with no LZ4 frame. Cheapest for small spills; largest on-disk size. |
| on | Always compress with an LZ4 frame. The pre-knob behavior, best for spills of large, compressible rows. |

Each spill file records its compression choice in a one-byte header tag, so the read path always dispatches to the right decoder regardless of the mode the file was written with — changing the knob between runs never breaks re-reading an earlier run’s files.

The 4 KiB / 1024-row thresholds mark the empirical crossover: below them the LZ4 frame’s fixed cost dominates the small amount of compressible payload, and writing raw is faster end-to-end (the spill_compression benchmark sweeps batch sizes from 256 B to 64 KiB and confirms auto tracks the faster of on / off across the range). Most pipelines should leave compress at auto; set on when spilling wide, highly compressible rows to a space-constrained volume, and off when spills are dominated by many small batches.
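The policy reduces to a small decision function. This is a sketch under stated assumptions: spill_compression is a hypothetical name, and the projected byte and row counts are taken as inputs because the exact projection formula (schema width times batch_size) is not spelled out here.

```python
MIN_BYTES = 4 * 1024  # 4 KiB threshold from the auto policy
MIN_ROWS = 1024       # row threshold from the auto policy


def spill_compression(mode, projected_bytes, projected_rows):
    """Return 'lz4' or 'off' for one spill-writing operator."""
    if mode == "on":
        return "lz4"
    if mode == "off":
        return "off"
    if mode != "auto":
        raise ValueError(f"unknown compress mode: {mode!r}")
    # auto: compress only when BOTH thresholds are met.
    large_enough = projected_bytes >= MIN_BYTES and projected_rows >= MIN_ROWS
    return "lz4" if large_enough else "off"
```

A wide batch with few rows and a long batch of tiny rows both fall back to raw writes under auto, because either threshold alone is not enough.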

Observability — what the planner will do before you run

clinker run --explain is plan-only (it reads no input and spills nothing), so it is the safe place to see what a run would do to the spill volume and to the staging dir before committing to it. On top of the resolved spill root, disk cap, and compression decision documented above, --explain surfaces three storage-observability sections, and a real clinker run reports the matching actuals at end-of-run so you can calibrate the estimate.

A note on byte units. Three different unit conventions appear across the storage surface, and it helps to know which is which before comparing figures:

  • Config values you write (disk_cap_bytes = "10GB") use decimal units — 1GB = 1,000,000,000 bytes — matching du, df, and the AWS CLI (see the disk-cap grammar).
  • The === Estimated Spill Volume === section humanizes with binary suffixes — K/M/G = KiB/MiB/GiB — so it lines up with the predicted_peak figure on each stage’s Physical Properties line, which uses the same humanizer.
  • The cap-headroom line and the post-run actuals print raw bytes with no suffix, so the cap-minus-estimate subtraction and the estimate-vs-actual comparison are exact rather than rounded.

When you calibrate the estimate against the post-run actual, convert the binary estimate suffix to bytes first (1K = 1024 bytes, 1M = 1,048,576 bytes) so you are comparing the same unit the actuals report.
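A small helper makes that conversion mechanical. The sketch below assumes the humanized figures look like the ones shown in this chapter (an integer or decimal number followed by B, K, M, or G); humanized_to_bytes and actual_over_estimate are hypothetical names, and the fractional form is an assumption.

```python
import re

# Binary multipliers: K/M/G in the estimate section mean KiB/MiB/GiB.
_BINARY_UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def humanized_to_bytes(text):
    """Convert an estimate such as '4K' or '0B' to a raw byte count."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([BKMG])", text.strip())
    if match is None:
        raise ValueError(f"not a humanized size: {text!r}")
    return int(float(match.group(1)) * _BINARY_UNITS[match.group(2)])


def actual_over_estimate(estimate_text, actual_bytes):
    """Ratio of the post-run actual to the plan-time estimate, or None."""
    estimate = humanized_to_bytes(estimate_text)
    return actual_bytes / estimate if estimate else None
```

A ratio far from 1 for a stage is the signal to revisit that stage's estimate before trusting the total.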

Estimated spill volume per stage

The === Estimated Spill Volume === section lists one line per spill-writing stage (hash Aggregate, external sort, grace-hash / sort-merge Combine) with its plan-time spill-volume estimate, followed by a total. In-memory join strategies (inline hash build/probe, IEJoin) never write spill files, so they do not appear here and do not inflate the total:

=== Estimated Spill Volume ===

Estimated spill volume (per blocking stage):
  [aggregation:hash] dept_totals → 1K
  [sort] by_amount → 4K
  Total: 5K

Each figure is the operator’s coarse predicted peak live state — the same predicted_peak the Physical Properties arbitration line shows — and bytes render in binary units (K/M/G = KiB/MiB/GiB). Summing rather than maxing is the conservative choice for a preflight: two blocking operators can be live and spilled at the same time, so their footprints add.

A streaming-only pipeline (no blocking operator) has nothing that spills, so the section is omitted entirely.

Unknown stages. The estimate is seeded from input file sizes resolved at plan time. A stage whose volume cannot be known before the run renders unknown instead of a misleading 0B, and the total notes that unknown stages are excluded:

  [aggregation:hash] dept_totals → unknown
  Total (known stages): 0B (excludes stages whose volume is unknown at plan time
  — a network source, a missing or unreadable input, or a glob/regex matcher
  whose discovery fails)

The seed is known for every file-backed matcher whose files can be sized at plan time: a single-file path: source, an explicit paths: list, and a glob: or regex: matcher. A glob/regex seed runs the same discovery resolver the run uses — applying its exclude, min_size/max_size, modified_after/before, take, and sort filters — and sums the matched files’ sizes, so the estimate names exactly the bytes the run will read with no second implementation to drift. A glob/regex that matches nothing seeds zero (rendered as unknown, since there is no spill volume to preview). The seed is genuinely unknown for a network source, for a missing or unreadable input file, and for a glob/regex matcher whose discovery itself fails (an invalid pattern, or no match under on_no_match: error) — the run surfaces the same error at startup. Check the post-run actuals below to calibrate any estimate.

Staging plan per source

When storage.staging is enabled, the === Staging Plan === section reports, for each source (and each discovered file under a multi-file matcher): whether it would be staged, the resolved content-addressed staged path, and — under on_existing = reuse — the reuse-if-fresh cache decision (hit if a committed prior copy still matches the live source, miss if it would be re-staged):

=== Staging Plan ===

Source 'orders':
  /data/in/orders-2024.csv → staged: yes, path: /mnt/local/staging/3f2a…b1.staged, reuse: hit
  /data/in/orders-2025.csv → staged: yes, path: /mnt/local/staging/9c4e…07.staged, reuse: miss

The reuse prediction runs the exact freshness check (mtime + size against the committed manifest) the real run makes, read-only — --explain copies nothing. A source that matches no staging pattern reports staged: no (no pattern match, reads in place); a network source reports not stagable (network source reads in place). When staging is disabled the section states that every source reads in place.

Cap headroom

When a spill cap is configured, --explain reports the headroom (cap minus estimate) with the same per-invocation disclaimer the startup cap-headroom preflight carries, and the same 80% warning:

Cap headroom: 5000000000 bytes free (5000000000 estimated of 10000000000 cap, 50%)
  [per invocation — does NOT account for sibling invocations sharing the spill
  volume under partition-and-run]
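The arithmetic behind that line is simple enough to reproduce when checking a plan by hand. A minimal sketch, where cap_headroom and WARN_PCT are hypothetical names and the result keys borrow the field names of the JSON form:

```python
WARN_PCT = 80  # the warning threshold described above


def cap_headroom(estimated_bytes, cap_bytes):
    """Return headroom, percent of cap used, and the 80% warning flag."""
    if cap_bytes <= 0:
        raise ValueError("cap_bytes must be positive")
    return {
        "headroom_bytes": cap_bytes - estimated_bytes,
        "pct_of_cap": estimated_bytes * 100 / cap_bytes,
        # Integer comparison avoids float rounding at the boundary.
        "over_threshold": estimated_bytes * 100 >= WARN_PCT * cap_bytes,
    }
```

Because both inputs are raw bytes, the subtraction is exact; nothing is rounded through a humanized suffix.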

Machine-readable form — --explain json

clinker run --explain json emits the whole plan as JSON for tooling (the Kiln canvas, dashboards, CI gates). The same storage observability the text form prints lives under a structured storage_summary object, so a consumer reads per-stage spill estimates and the cap / staging summary without re-parsing prose:

{
  "schema_version": "1",
  "nodes": [ ... ],
  "node_properties": { ... },
  "storage_summary": {
    "spill_root": { "path": "/mnt/fast-ssd/clinker-spill", "source": "storage.spill.dir" },
    "spill_disk_cap_bytes": 1000000000,
    "estimated_spill": {
      "per_stage": [
        { "node_name": "dept_totals", "display_name": "[aggregation:hash] dept_totals", "estimate_bytes": 1024 },
        { "node_name": "by_amount", "display_name": "[sort] by_amount", "estimate_bytes": 4096 }
      ],
      "total_known_bytes": 5120,
      "any_unknown": false
    },
    "spill_compression": {
      "mode": "auto",
      "per_operator": [
        { "node_name": "dept_totals", "display_name": "[aggregation:hash] dept_totals", "compression": "lz4" },
        { "node_name": "by_amount", "display_name": "[sort] by_amount", "compression": "off" }
      ]
    },
    "cap_headroom": {
      "headroom_bytes": 999994880,
      "estimated_bytes": 5120,
      "cap_bytes": 1000000000,
      "pct_of_cap": 0.000512,
      "over_threshold": false
    },
    "staging": { "enabled": false, "sources": [] }
  }
}

The fields mirror the text sections one-for-one: estimated_spill is the === Estimated Spill Volume === section (a stage whose volume is unknown at plan time carries estimate_bytes: null and sets any_unknown: true), spill_compression is the Spill compression: projection, cap_headroom is the cap-headroom line (omitted when no cap is configured or the estimate is zero), and staging is the === Staging Plan === section. The JSON and DOT formats emit only their machine payload — the human-readable === Resolved Outputs === preamble the text form prints is suppressed so the output parses cleanly.
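A CI gate over this payload can be a few lines. The sketch below reads only fields shown in the example above; storage_gate is a hypothetical name, and treating an absent or null cap_headroom the same way is an assumption about how "omitted" is serialized.

```python
import json


def storage_gate(plan_json):
    """Return a list of human-readable problems found in storage_summary."""
    summary = json.loads(plan_json)["storage_summary"]
    problems = []

    estimate = summary["estimated_spill"]
    if estimate["any_unknown"]:
        unknown = [
            stage["node_name"]
            for stage in estimate["per_stage"]
            if stage["estimate_bytes"] is None
        ]
        problems.append("unknown spill estimate: " + ", ".join(unknown))

    # cap_headroom is omitted when no cap is set or the estimate is zero.
    headroom = summary.get("cap_headroom")
    if headroom and headroom["over_threshold"]:
        problems.append(
            f"estimated spill is {headroom['pct_of_cap']}% of the spill cap"
        )
    return problems
```

A pipeline could feed it the captured output of clinker run --explain json and fail the build when the returned list is non-empty.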

Post-run actuals — calibrating the estimate

A real clinker run that spills prints a per-stage actual spill-volume section at end-of-run, so you can compare it against the --explain estimate for the same stage — the calibration loop that turns a coarse pre-run estimate into a trustworthy one over repeated runs:

=== Spill Volume (actual, per stage) ===
  dept_totals → 1048576 bytes
  by_amount → 4194304 bytes
  Total: 5242880 bytes (compare against the --explain estimate)

The per-stage breakdown sums to the pipeline-wide cumulative spill total. A run that stayed within memory spilled nothing and prints no section. A large estimate-vs-actual delta is the single highest-leverage signal when a pipeline starts spilling unexpectedly (the failure mode behind Polars’ documented 13.5× spill amplification, where an optimizer interaction turned 30 GB of input into 400 GB of spill with no per-stage visibility).

Note on the --explain compression projection. The per-operator spill-compression decision shown under Spill compression: is projected from the same column count the operator’s runtime spill writer sees, so the projected auto verdict matches the file the run actually writes. A hash Aggregate and a grace-hash / sort-merge Combine project against their output schema (engine-stamped identity columns included), exactly the width their dispatch arms resolve compression against; an enforcer sort projects against the width of the records flowing into it — its upstream’s emitted schema — which is the width its sort buffer reads at runtime. The read path also dispatches on each spill file’s own one-byte header tag, so re-reading is robust regardless.

Distinguishing the runtime storage-abort conditions

A run that fails while spilling or staging emits one of several distinct diagnostics so a single glance at the error tells you exactly what to fix — instead of every disk and memory problem rendering as one ambiguous “out of memory” message (the trap DuckDB hit in duckdb/duckdb#14142, where a temp-dir cap was reported as “Out of Memory Error … 187.3 GiB/187.3 GiB used” and users inspected df only to find free space). The aborts split along two axes: the spill side (in-memory operator state landing on disk) and the staging side (matched source files copied to local disk before they are read).

Spill aborts

| Condition | Code | What happened | What to do |
| --- | --- | --- | --- |
| Out of memory | E310 | An operator’s in-RAM state crossed the hard memory.limit (a true RSS overrun). | Raise memory.limit, reduce input, or let the operator spill. |
| Spill cap exceeded | E320 | Cumulative spill bytes crossed storage.spill.disk_cap_bytes. The volume may still have free space; you hit the configured budget. | Raise disk_cap_bytes, point storage.spill.dir at a larger volume, or reduce the spill footprint. |
| Spill volume full | E321 | The OS reported the spill volume out of space (ENOSPC). The physical disk filled. | Free space on the volume, or move storage.spill.dir to a larger mount. |
| Spill directory unavailable | (Spill) | The spill directory went bad mid-run: unmounted, remounted read-only, deleted by a cleaner, or permissions revoked. | Remount/restore the volume; stop the over-eager cleaner. |

The key separations:

  • E310 vs E320 — an OOM is an in-RAM overrun; a cap-exceeded is a disk-budget stop. A run can hit E320 while comfortably inside its memory envelope, so conflating the two would point you at the wrong knob.
  • E320 vs E321 — E320 is the budget you set; E321 is the disk itself running dry. If you removed disk_cap_bytes, an over-large run would no longer trip E320 and would instead spill until the volume filled (E321).

(A future per-operator memory-reservation surface will add a fifth, reservation-exhausted condition; it is not part of the engine yet.)
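The separation between the disk-side conditions can be pictured as two independent checks: a running byte budget that trips before the write, and a classification of whatever error the OS returns from the write itself. This is an illustrative sketch only; SpillBudget, SpillCapExceeded, and classify_os_error are hypothetical names, the errno mapping for the directory-unavailable case is an assumption, and reading "would cross the cap" as strictly greater than the cap is also an assumption.

```python
import errno


class SpillCapExceeded(Exception):
    """Stand-in for the E320 condition (policy budget, not a full disk)."""


class SpillBudget:
    """Cumulative accounting of spill bytes against an optional cap."""

    def __init__(self, cap_bytes=None):
        self.cap_bytes = cap_bytes
        self.total_bytes = 0

    def charge(self, n):
        # Check BEFORE writing, so the cap stops the run, not the volume.
        if self.cap_bytes is not None and self.total_bytes + n > self.cap_bytes:
            raise SpillCapExceeded(
                f"{self.total_bytes + n} bytes would exceed cap {self.cap_bytes}"
            )
        self.total_bytes += n


def classify_os_error(exc):
    """Map an OSError from a spill write to the condition it represents."""
    if exc.errno == errno.ENOSPC:
        return "E321"  # the physical volume is full
    if exc.errno in (errno.ENOENT, errno.EROFS, errno.EACCES):
        return "spill directory unavailable"
    return "other I/O error"
```

Keeping the two paths apart is what lets a cap stop point at disk_cap_bytes while a full disk points at the volume.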

Staging-copy aborts

When storage.staging is enabled, copying a matched source to local disk can fail in three distinct ways. Like the spill split, each has its own code so a content-corruption problem never renders as a budget problem and vice versa. Staging runs before any record flows, so these surface as startup-style validation failures.

| Condition | Code | What happened | What to do |
| --- | --- | --- | --- |
| Staged copy corrupt | E335 | The local copy’s BLAKE3 digest did not match the source: the transport (e.g. a soft-mount NFS share) delivered different bytes than the source holds. | Re-run over a healthy transport, harden the mount, or stage from a stable snapshot. Do not set verify = "none" to silence it; that hides corruption rather than fixing it. |
| Staging cap exceeded | E336 | The cumulative bytes staged this run would cross storage.staging.disk_cap_bytes. The volume may still have free space; you hit the configured budget, not a full disk. | Raise disk_cap_bytes, point storage.staging.dir at a larger volume, narrow storage.staging.patterns, or remove the cap. |
| Staged copy already exists | E337 | A staged copy of this source already exists and on_existing = error refuses to touch it. | Remove the existing copy, or switch on_existing to overwrite (re-stage) or reuse (reuse a fresh copy). |

The same cap-vs-full-disk separation applies here as on the spill side: E336 is the budget you set (mirroring E320), so it must not render as an out-of-space message — a physically full staging volume instead surfaces as a staging I/O error (mirroring E321). E335 is distinct from a generic staging I/O error: an I/O error means the OS reported a fault, whereas E335 means the copy completed cleanly yet still does not match the source.

Startup storage validation

Before a run spawns its first source-ingest thread — after the plan compiles but before any input is read or any byte is spilled or staged — Clinker runs a single comprehensive validation pass over the resolved [storage] configuration. It rejects configurations that are physically wrong for the job, each with a stable diagnostic code, the offending clinker.toml field, and a clinker explain --code <CODE> pointer. Validating up front fails a misconfigured volume while the run is still cheap to abandon, rather than after minutes of work when the first spill or staged copy hits the bad volume.

| Code | Rejected configuration | Why |
| --- | --- | --- |
| E330 | storage.spill.dir on an in-memory filesystem (Linux tmpfs / ramfs, Windows RAM disk). | Spilling there keeps the bytes in RAM, so it frees no physical memory and defeats the memory budget. |
| E331 | storage.spill.dir on a network filesystem (NFS / SMB / CIFS / FUSE). | A spill target on a soft-mounted share risks silent truncation and mmap data loss, the failure modes spill exists to avoid. |
| E332 | storage.staging.dir on a network filesystem. | Staging copies inputs off a flaky share; a staging dir that is itself on a share reintroduces the fragility staging exists to escape. |
| E333 | storage.staging.dir on the same physical device as a matched (staged) source. | The copy moves no I/O off the source volume, so it buys nothing while still spending time and space. Applies only to matched sources. |
| E334 | storage.spill.dir equal to storage.staging.dir. | Spill files and staged source copies are sized and cleaned up differently; sharing one directory makes accounting and cleanup ambiguous. |

The filesystem-class checks (E330–E332) read the volume type through one cross-platform detection layer, so they behave identically on Linux, macOS, and Windows: Linux matches the statfs f_type magic, macOS matches the f_fstypename string, and Windows maps GetDriveTypeW. (macOS has no native tmpfs, so E330 only ever fires on Linux and Windows.) The same-device check (E333) compares the device id on Linux/macOS and the volume serial number on Windows — the very same probe the staging same-volume rule uses, so there is one consistent notion of “same device” across the whole run.

Free-space preflight

Separately from the runtime disk cap (E320) and the full-volume surface (E321), the startup pass runs a free-space preflight: it queries the bytes available on the spill volume and compares them to the run’s estimated spill footprint (the sum of every blocking operator’s predicted peak state, the same estimate --explain surfaces). When the spill volume looks too small, the run prints a warning and continues:

W330: spill volume /var/clinker/spill has 2000000000 bytes free but the run is
estimated to spill up to 8000000000 bytes; the run may abort with a full-volume
error (E321) at the final spill — point storage.spill.dir at a larger volume or
reduce the spill footprint (raise memory.limit, partition the input)

This is advisory, not fatal: the estimate is a coarse upper bound (it ignores spill compression and the streaming drain), so the run may well finish within the available space. The warning exists so a long pipeline that would die at its final spill surfaces that risk before it runs for an hour, rather than after. The free-space query uses a cross-platform probe (statvfs on Unix, GetDiskFreeSpaceExW on Windows) that returns a 64-bit byte count, so the historical 32-bit f_bavail truncation never affects the comparison.
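The same comparison can be run by hand before a long job. A minimal sketch using Python's standard shutil.disk_usage probe; free_bytes and spill_volume_looks_small are hypothetical names, and "free is less than the estimate" is an assumed reading of "looks too small".

```python
import shutil


def free_bytes(path):
    """Bytes available on the volume holding `path`."""
    return shutil.disk_usage(path).free


def spill_volume_looks_small(free, estimated_spill_bytes):
    """True when the estimate exceeds the space currently free."""
    return free < estimated_spill_bytes


# Usage (the path is an example, not a default):
#   if spill_volume_looks_small(free_bytes("/var/clinker/spill"), 8_000_000_000):
#       print("spill volume may fill before the run finishes")
```

As with W330 itself, a positive result is a prompt to look closer, not proof the run will fail, since the estimate is a coarse upper bound.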

Cap-headroom preflight

When storage.spill.disk_cap_bytes is configured, the same startup pass also runs a cap-headroom preflight: it compares the run’s estimated spill volume to the configured cap and warns when the estimate reaches 80% of the cap. Unlike the free-space preflight (which probes the physical volume), this checks the run against the policy ceiling you set, so it fires even on a volume with plenty of free space:

W331: this run is estimated to spill up to 9000000000 bytes, which is 90% of the
configured spill cap storage.spill.disk_cap_bytes (10000000000 bytes); the run
may abort with a spill-cap error (E320) before it finishes — raise disk_cap_bytes
or reduce the spill footprint (raise memory.limit, partition the input). This
headroom is per invocation: if you partition the input and run several clinker
invocations against the same spill volume and cap, they share the cap, so the
real headroom is smaller than this figure

Like W330, this is advisory, not fatal — the estimate is a coarse upper bound, so a run that compresses well or never trips its memory budget may finish comfortably under the cap. It fires on a normal clinker run (before ingestion, at startup), not only under --explain, so an operator sees the signal on the real run even when they did not explicitly inspect the plan first.

Per-invocation accounting. The cap and the headroom figure are scoped to a single clinker invocation. Under the partition-and-run model — where you split a large input by file or key and launch several clinker processes that share one spill volume and one disk_cap_bytes — the physical spill volume is shared by every sibling, so the real headroom is smaller than any one invocation’s figure. The warning text states this explicitly rather than silently presenting a per-invocation number as a whole-volume guarantee. Clinker is single-process by design (one invocation = one OS process), so the engine cannot see its siblings; the disclaimer is the honest stance.

Mid-run spill failures

The startup check guarantees the spill directory is writable when the run begins, but it can still go bad mid-run — an NFS share remounts read-only, a volume unmounts, an over-eager temp-file cleaner deletes the directory, or permissions are revoked. When a spill write fails because the directory has vanished or become read-only, the run aborts cleanly with a distinct diagnostic rather than a generic I/O error or a panic:

spill directory /var/clinker/spill became unavailable mid-run: No such file or directory
(the directory may have been unmounted, remounted read-only, deleted by an
external cleaner, or had its permissions revoked)

This surfaces the directory-level cause directly, so the fix (remount the volume, stop the cleaner, restore permissions) is obvious from the message.

Crash purge of orphaned spill directories

A run’s spill directory (clinker-spill-<random>/) is normally removed when the run ends: a clean exit, an abort on a fatal error, and even a panic that unwinds all delete it, on every platform. (The run holds an advisory lock on a .lock file inside that directory; the lock is always released before the directory is removed, so the removal never trips over its own open handle, which matters on Windows, where an open file handle blocks deletion.) But a SIGKILL, the Linux OOM-killer, or a power loss kills the process before that cleanup runs, leaking the directory and every spill file inside it under the spill root. Over many crashed runs, those leaked directories can fill the spill volume.

To prevent that, a run reaps orphaned spill directories at startup, before it creates its own, but only when a spill directory is explicitly configured (storage.spill.dir). Each live run holds an operating-system advisory lock on a .lock file inside its spill directory for the run’s whole lifetime; the OS releases that lock automatically when the process exits, however it exits. At startup a run scans the configured spill root and, for each clinker-spill-* directory, tries to take that lock: if it succeeds, the owning process is gone, so the directory is an orphan and is removed; if the lock is still held, a concurrent live run owns it, so it is left alone. Asking the kernel “is anyone still holding this?” is robust against PID reuse and never reaps a directory a concurrent run is still using. The purge is best-effort: a failure to reap one directory is logged and the run proceeds.
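The lock-probe idea can be sketched in a few lines. This is an illustration under stated assumptions, not Clinker's implementation: it is Unix-only, uses flock as a stand-in for whatever advisory-lock primitive the engine uses, and skips directories that have no .lock file (a conservative choice that is an assumption). reap_orphans is a hypothetical name.

```python
import fcntl
import os
import shutil


def reap_orphans(spill_root):
    """Remove clinker-spill-* dirs whose .lock no live process holds."""
    reaped = []
    for name in sorted(os.listdir(spill_root)):
        path = os.path.join(spill_root, name)
        lock_path = os.path.join(path, ".lock")
        if not name.startswith("clinker-spill-") or not os.path.isfile(lock_path):
            continue
        fd = os.open(lock_path, os.O_RDWR)
        try:
            try:
                # Non-blocking probe: fails at once if a live run holds it.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                continue  # owned by a concurrent live run; leave it alone
            # Best-effort removal; errors are ignored in this sketch.
            shutil.rmtree(path, ignore_errors=True)
            reaped.append(name)
        finally:
            os.close(fd)  # closing the descriptor also drops the lock
    return reaped
```

The probe never consults a PID, which is why PID reuse cannot make a dead run look alive or a live run look dead.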

When storage.spill.dir is not set, the spill root defaults to the OS temp directory (std::env::temp_dir, typically $TMPDIR or /tmp), and no startup purge runs there. In the default case a run cleans up after itself directly: the per-run spill directory is removed on every exit short of a hard kill — a clean exit, an error-return abort, or a panic that unwinds. Only a SIGKILL, the OOM-killer, or a power loss leaks one, and a directory leaked into the OS temp directory is the operating system’s temp-reaper’s responsibility to clean up, not Clinker’s.

The purge is deliberately confined to a configured spill root because it must never police the shared OS temp directory. That directory is used by every process on the host, so a startup sweep there would race not only concurrent Clinker runs (the lock-based check narrows that window but cannot eliminate it against a peer whose just-created spill directory is not yet locked) but also unrelated programs that happen to use a colliding name prefix. A run owns its configured spill volume and can safely sweep it; it does not own /tmp and so leaves it alone.

storage.staging — opt-in source staging

Reading source files directly from a network share (NFS, SMB) couples every run to the share’s availability and quirks: a soft-mount can silently truncate a read, and latency multiplies across many small files. Source staging copies matched source files to a local volume before the pipeline reads them, so the run works from stable local copies. It is off by default and activated per workspace by pattern match — pipelines that don’t opt in behave exactly as before.

[storage.staging]
enabled        = true
dir            = "/var/clinker/staging"   # required when enabled
patterns       = [
    "/mnt/nfs/data/**",
    "//fileserver/share/**",
]
disk_cap_bytes = "50GB"   # optional; cap on bytes copied per run (default unlimited)
verify         = "blake3" # optional; blake3 | none   (default blake3)
on_existing    = "overwrite" # optional; overwrite | reuse | error (default overwrite)
cleanup        = "on_success" # optional; on_success | always | never (default on_success)

| Key | Default | Meaning |
| --- | --- | --- |
| enabled | false | Master switch. When false, patterns is ignored and every source reads in place. |
| dir |  | Local directory the copies are written under. Required when enabled. |
| patterns | [] | Glob patterns selecting which source paths to stage. A source is staged only when enabled and its path matches at least one pattern. Empty ⇒ nothing is staged. |
| disk_cap_bytes | unlimited | Cumulative cap on bytes copied per run. Same byte-size grammar as the spill cap ("50GB", bare integers are bytes). |
| verify | blake3 | Post-copy integrity check. blake3 hashes source and copy and requires a match; it is the only check that catches a soft-mount’s silent truncation. none skips the check. |
| on_existing | overwrite | What to do when a staged copy of this source already exists from a prior run: overwrite re-copies unconditionally; reuse reuses the existing copy only when it is still fresh (the source’s modification time and size match what was recorded when it was staged), otherwise re-copies; error fails the run rather than touch the existing copy. See The staging cache below. |
| cleanup | on_success | When staged copies are deleted relative to the run’s outcome: on_success removes them after a clean exit but keeps them after a failure so the operator can inspect the exact inputs the failed run saw; always removes them regardless; never keeps them as a persistent reuse cache for a later reuse run. See Cleanup. |

Pattern matching

patterns uses the same glob grammar as a source’s exclude: list. Each pattern is tested against both the full path and the basename, so /mnt/nfs/** matches a deep path by its full path while *.csv matches any CSV by basename. ** crosses directory boundaries; * does not.
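The two rules above (test both full path and basename; ** crosses directories, * does not) can be modelled with a small translator. This is a sketch of those two rules only, assuming forward-slash paths; glob_to_regex and pattern_matches are hypothetical names, and other glob features such as character classes are deliberately left out because they are not described here.

```python
import posixpath
import re


def glob_to_regex(pattern):
    """Translate a glob using only * and ** into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # ** may cross directory boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # * stays within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def pattern_matches(pattern, path):
    """True if the pattern matches the full path or the basename."""
    regex = glob_to_regex(pattern)
    basename = posixpath.basename(path)
    return bool(regex.fullmatch(path) or regex.fullmatch(basename))
```

Under these rules /mnt/nfs/* would not select /mnt/nfs/a/b.csv, while /mnt/nfs/** would.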

Startup validation

When enabled, staging is validated once at startup, before any input is opened, so a misconfiguration fails the run immediately rather than at the first copy. The run is refused when:

  • dir is unset.
  • dir does not exist, is a file, or is not writable (probed with a real create-and-delete, so a read-only mount or restrictive ACL is caught).
  • a patterns entry is not a valid glob.
  • dir sits on the same volume as a matched source. Staging within one volume copies bytes without moving I/O off the slow share — a well-documented anti-pattern — so it is refused up front rather than left to surface as a confusingly slow pipeline. The check compares the source’s and the staging dir’s storage volume (the device id on Linux/macOS, the volume mount root on Windows); point dir at a local disk on a different volume.

The same-volume rule applies only to matched sources: a source the patterns don’t select reads in place, so its volume is irrelevant.

How a file is staged

Each matched source maps to a stable, content-addressed set of three files directly under dir, deterministic across runs of the same source:

  • <source-id>.staged — the local copy the reader opens.
  • <source-id>.manifest.json — a sidecar recording the source’s identity: its path, modification time, size, the BLAKE3 content hash, and the stage time.
  • <source-id>.lock — a small advisory-lock file that serializes concurrent invocations staging the same source (see the staging cache). It carries no data and persists between runs as the per-source coordination point, alongside the cached copy it guards. Once that source’s cache entry is gone and no run holds the lock, a later startup crash purge reclaims the lock too, so a persistent staging root does not accumulate one orphan lock per source that has ever passed through it.

<source-id> is derived from the source’s canonical path, so the same source always resolves to the same staged file. That stability is what makes the reuse cache work (a later run can find the prior copy) and it is why the layout is stable rather than per-run UUIDs.

The copy is built to survive a crash at any point without leaving a corrupt or partial file a later run might trust:

  1. Single-pass copy + hash. The source is read once in ~1 MiB chunks; each chunk is fed to both the BLAKE3 hasher and the destination file in the same pass. The copy never holds the whole file in memory, so it stays a memory-budget no-op regardless of file size.
  2. Atomic publish. Bytes are written to a <source-id>.<run>.partial temp file, flushed and fsync’d, then renamed to <source-id>.staged. A rename is an atomic replace on Linux, macOS, and Windows (Windows 10 1607+), so a reader scanning for .staged files sees either nothing or the complete file — never a half-written one. The <run> segment in the partial name keeps any two in-flight copies of one source on distinct temp files, and the per-source lock (see the staging cache) ensures only one of them ever runs at a time.
  3. Durable rename. On Linux/macOS the parent directory is fsync’d after the rename, because on ext4/xfs a rename is only crash-durable once the directory entry itself is flushed. On Windows the NTFS journal makes the rename durable, so there is no separate directory flush to do.
  4. Verify. With verify = blake3 (the default) the source is independently re-read and hashed, and the two digests must match. A size check cannot catch a soft-mount that silently truncated the read; two content digests can. A mismatch removes the published copy and fails the run with a distinct “staged copy is corrupt” diagnostic (E335) — not a generic I/O error.
  5. Commit the manifest. The identity manifest is written with the same atomic temp-file + rename discipline. The manifest is the commit marker: a .staged file is only trustworthy once its manifest exists. A crash between the copy and the manifest leaves a .staged with no manifest, which the next run’s crash purge reaps as an orphan rather than half-trusting.

If the copy fails partway, the .partial is removed before the error propagates.
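
The copy, publish, and verify steps above can be sketched in a few lines of Python. This is an illustration only, not Clinker's Rust implementation: the function names `stage_copy` and `verify` are hypothetical, and `hashlib.blake2b` stands in for BLAKE3, which is not in the Python standard library.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # ~1 MiB chunks, as described in step 1


def stage_copy(src: str, staging_dir: str, source_id: str, run_id: str) -> str:
    """Copy src into staging_dir atomically; return the digest of the bytes written."""
    partial = os.path.join(staging_dir, f"{source_id}.{run_id}.partial")
    staged = os.path.join(staging_dir, f"{source_id}.staged")
    hasher = hashlib.blake2b()  # stand-in for BLAKE3 (assumption, see lead-in)
    try:
        with open(src, "rb") as fin, open(partial, "wb") as fout:
            while chunk := fin.read(CHUNK):
                hasher.update(chunk)  # hash and write in the same pass
                fout.write(chunk)
            fout.flush()
            os.fsync(fout.fileno())  # bytes are durable before publishing
        os.replace(partial, staged)  # atomic publish
    except BaseException:
        # A failed copy must not leave a .partial behind.
        if os.path.exists(partial):
            os.remove(partial)
        raise
    if os.name == "posix":
        # Durable rename: flush the parent directory entry.
        dir_fd = os.open(staging_dir, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
    return hasher.hexdigest()


def verify(src: str, expected_digest: str) -> bool:
    """Independently re-read the source and compare content digests."""
    hasher = hashlib.blake2b()
    with open(src, "rb") as fin:
        while chunk := fin.read(CHUNK):
            hasher.update(chunk)
    return hasher.hexdigest() == expected_digest
```

The sketch stops before the manifest commit; writing the manifest would reuse the same temp-file, fsync, and rename sequence.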

The staging cache (on_existing)

Because staged copies live at stable paths, a copy from a prior run is still on disk when the next run starts (unless cleanup removed it). on_existing decides what happens when that prior copy is found:

| Mode | Behavior |
|---|---|
| overwrite (default) | Always re-stage. The prior copy and its manifest are removed and the source is copied fresh. The safe default: a copy from a crashed run must not be trusted. |
| reuse | Reuse the prior copy only when it is still fresh: the source’s current modification time and size both match what the manifest recorded. A fresh match skips the copy entirely (no bytes read off the share, nothing charged against the disk cap). A changed mtime or size means the source was rewritten, so the copy is stale and is re-staged. |
| error | Fail the run with a clear diagnostic if a staged copy already exists, rather than overwrite or reuse it. For workflows that want an explicit “the cache is already populated” stop. |

reuse is the mode that turns staging into a cache: re-running the same pipeline over an unchanged network share copies nothing on the second run. The freshness check is mtime + size, not a re-hash, so it is a cheap stat rather than a full read of the source.
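
As a rough illustration, the three modes reduce to a small decision function. The `Manifest` shape and the `decide` name are hypothetical; only the mtime-plus-size comparison and the three mode names come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Manifest:
    """Hypothetical manifest: just the two fields the freshness check needs."""
    mtime_ns: int
    size: int


def decide(on_existing: str, prior: Optional[Manifest],
           cur_mtime_ns: int, cur_size: int) -> str:
    """Return 'copy', 'reuse', or 'fail' for one source."""
    if prior is None:
        return "copy"  # nothing staged yet, every mode copies
    if on_existing == "overwrite":
        return "copy"  # never trust a prior copy
    if on_existing == "error":
        return "fail"  # explicit stop when the cache is populated
    if on_existing == "reuse":
        # Cheap stat-based check: no re-hash of the source.
        fresh = prior.mtime_ns == cur_mtime_ns and prior.size == cur_size
        return "reuse" if fresh else "copy"
    raise ValueError(f"unknown on_existing mode: {on_existing}")
```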

Staging is collision-safe across concurrent invocations. Under the partition-and-run model — several clinker processes over a partitioned input sharing one staging volume — independent runs may stage, reuse, or clean up the same shared source at the same time. The per-source advisory lock (a <source-id>.lock file under the staging root) is a reader-writer lock that keeps every such overlap safe on Linux, macOS, and Windows:

  • Exactly one copy. A run that needs to copy takes the lock exclusively for its copy-and-publish. The first run to take it copies and publishes; every other run blocks, then acquires the lock, finds the now-fresh .staged, and reuses it. So a source is copied exactly once no matter how many invocations race for it.
  • A reader is never yanked. A run reading a staged copy holds the lock in shared mode for as long as it has the file open, and keeps it held across the moment it decides to reuse a copy and the moment it opens that copy — so the file it chose cannot be deleted or replaced in between. Any number of readers share the lock at once, so concurrent runs all read the same copy in parallel.
  • Cleanup and overwrite wait for readers. Removing or re-copying a staged pair takes the lock exclusively, which a live reader’s shared lock blocks. Cleanup probes the lock without waiting: if a concurrent run is still reading the copy, cleanup leaves it in place (the last run to release it, or a later crash purge, reaps it). An overwrite re-stage waits for in-flight readers to finish, then publishes atomically.

On Windows the staged copy is additionally opened with a share mode that permits a concurrent delete or atomic-rename replace (FILE_SHARE_DELETE), so an open reader and a concurrent publish/cleanup interoperate there exactly as they do on POSIX, where an unlinked-but-open file stays readable. The net guarantee: across any mix of concurrent runs sharing a staged source, a reader always sees a complete, coherent .staged file and no run fails spuriously.
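
The shared/exclusive interplay can be demonstrated with POSIX `flock`. This is a Unix-only sketch under the assumption that an advisory `flock` behaves like the lock described here; the text does not say which locking primitive Clinker uses, and the helper names are hypothetical.

```python
import fcntl


def try_lock(fd: int, exclusive: bool) -> bool:
    """Attempt a non-blocking advisory lock on fd; True if it was acquired."""
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        fcntl.flock(fd, mode | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False  # someone else holds a conflicting lock


def unlock(fd: int) -> None:
    """Release whatever lock fd holds."""
    fcntl.flock(fd, fcntl.LOCK_UN)
```

Two readers can both take the lock in shared mode, while a cleanup that probes with `try_lock(fd, exclusive=True)` gets `False` and leaves the copy in place until the readers release it.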

Cleanup (on_success | always | never)

cleanup decides when a run’s staged copies are removed, keyed on the run’s outcome:

| Mode | Behavior |
|---|---|
| on_success (default) | Remove the copies after a clean exit; keep them after a failure (or an interrupted / DLQ-producing run) so the operator can inspect the exact inputs the run saw and re-run without re-fetching. |
| always | Remove the copies when the run ends, success or failure. |
| never | Keep the copies indefinitely as a persistent reuse cache. Combine with on_existing = reuse to make repeated runs over a stable source copy-free. The operator reclaims the staging dir manually (or lets the next run’s crash purge eventually reap stale entries). |

Each staged file’s manifest is removed alongside it, so cleanup never leaves a manifest pointing at a staged file that is gone.

Crash purge of orphaned artifacts

A clean (or panicking) run runs its cleanup. But a SIGKILL, the Linux OOM-killer, or a power loss kills the process before any cleanup runs, leaking its staged artifacts under the staging root. To stop that from accumulating across crashes, every run performs an idempotent crash purge at startup, before it stages anything. It reaps four orphan shapes left under the staging root:

  • a *.partial — an interrupted copy. Reaped only when its owning run is dead (see below), so a concurrent sibling’s in-flight copy is never reaped;
  • a *.staged with no matching manifest — a copy that crashed before it could commit its manifest;
  • a *.manifest.json with no matching staged file;
  • a *.lock whose source has no surviving cache entry — a coordination lock left by a source that is no longer staged (not necessarily from a crash), reclaimed under the liveness and age gates described below.

A clean pair (a .staged with its committed .manifest.json) is the reuse cache and is kept — the purge never removes a complete, trustworthy copy — and the source’s .lock is kept alongside it so a later reuse run has a lock to take.

A .lock whose source has no surviving cache entry (no .staged and no .manifest.json) is itself reclaimed by the purge, under the same liveness gate as a partial: it is removed only when it is acquirable under a try-lock (no live run holds it) and has aged past the creation grace window (so a sibling mid-acquire is not raced), and the removal is performed while the purge holds the lock exclusively. A held lock, a lock still guarding a cached copy, and a freshly created lock are all kept. This bounds what would otherwise be unbounded growth of one zero-byte lock file per distinct source ever staged — relevant for a long-lived persistent cache (on_existing = reuse, cleanup = never) — while never removing a coordination point a live or cached source still needs.

Because several invocations can share one staging volume, the purge must not reap a live sibling’s in-flight .partial. It tells a crash corpse from a live copy the same way the spill-directory purge does: a .partial is reaped only when the source’s .lock is acquirable under a try-lock (no live process holds it, so the owner is gone) and the partial has aged past a short creation grace window (so a copy that has just started but not yet taken the lock is left alone). A partial whose lock is still held, or one too young to have been locked yet, is kept. This asks the operating system “is anyone still staging this?” rather than guessing, so a concurrent purge can never delete a running sibling’s work.
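
The purge rules can be read as a pure function over a directory listing. The sketch below makes several assumptions that are not in the text: file names follow `<source-id>.<suffix>` with no dots in the source id, the grace window is 5 seconds, and lock liveness and file ages are passed in as plain dictionaries rather than probed from the operating system.

```python
from typing import Dict, Iterable, Set

GRACE_SECONDS = 5.0  # hypothetical; the text only says "a short creation grace window"


def source_id_of(name: str) -> str:
    """Assumes the source id is everything before the first dot."""
    return name.split(".", 1)[0]


def plan_purge(names: Iterable[str],
               age_seconds: Dict[str, float],
               lock_free: Dict[str, bool]) -> Set[str]:
    """Return the file names a startup crash purge would reap."""
    names = list(names)
    staged = {source_id_of(n) for n in names if n.endswith(".staged")}
    manifests = {source_id_of(n) for n in names if n.endswith(".manifest.json")}
    reap: Set[str] = set()
    for n in names:
        sid = source_id_of(n)
        # Liveness gate: lock acquirable (owner gone) and past the grace window.
        # Unknown entries default to "keep".
        owner_dead = lock_free.get(sid, False) and age_seconds.get(n, 0.0) > GRACE_SECONDS
        if n.endswith(".partial"):
            if owner_dead:
                reap.add(n)
        elif n.endswith(".staged"):
            if sid not in manifests:  # copy crashed before its manifest commit
                reap.add(n)
        elif n.endswith(".manifest.json"):
            if sid not in staged:  # manifest with no staged file
                reap.add(n)
        elif n.endswith(".lock"):
            if sid not in staged and sid not in manifests and owner_dead:
                reap.add(n)
    return reap
```

A clean pair and its lock never appear in the result, which is the property that keeps the reuse cache intact across purges.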

File permissions

Staged copies hold verbatim source records — potentially PII, credentials, or financial data — and on a shared staging volume they must not be readable by other users. On Unix each staged file and its manifest are created with mode 0o600 (owner-only). On Windows there is no portable mode bit; staged files inherit the staging directory’s ACL, so restrict the directory’s ACL if the volume is shared.
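
On Unix the important detail is that the restrictive mode is applied at creation, not with a later chmod, so the file is never briefly world-readable. A minimal sketch, with a hypothetical helper name:

```python
import os


def create_private(path: str) -> int:
    """Create path with owner-only permissions (0o600) and return its file descriptor."""
    # O_EXCL makes creation fail if the path already exists, so a pre-existing
    # file with looser permissions is never silently reused.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    return os.open(path, flags, 0o600)
```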

Crash durability and the parent-directory fsync

The atomic-rename guarantee only holds across a crash if the rename is durable. On POSIX filesystems (ext4, xfs) a rename’s directory entry can still be in the page cache after rename returns, so Clinker fsyncs the parent directory after the rename. Windows is intentionally exempt: the NTFS metadata journal already makes the rename crash-durable (the semantics MOVEFILE_WRITE_THROUGH requests), and Windows offers no directory-fsync equivalent.

Streaming vs. Blocking Stages

Every node in a pipeline plan is one of two kinds at runtime:

  • Streaming stages hand their output downstream in bounded batches over a back-pressured channel, never crossing an inter-stage buffer that charges the memory budget. The two fused streaming paths additionally hold at most one batch of in-flight events at a time, so their inter-stage memory does not grow with input size. The other streaming stages still build their own result before handing it off — streaming spares them the second copy into a charged buffer and overlaps the writer with downstream work, but their own working set is as large as a blocking stage’s would be.
  • Blocking stages must see their whole input before they can produce any output. They accumulate state inside the memory budget and spill to disk when the soft threshold trips, rather than holding everything in RAM.

This distinction is what makes Clinker a bounded-memory executor: a pipeline’s peak memory is set by its largest live blocking-or-non-fused-streaming stage plus one batch per fused streaming stage, not by the cumulative size of every stage at once. A streaming stage’s output is never separately buffered between dispatch arms, so it is never charged twice: the arbitrator counts each in-flight batch once when the producer flushes it and discharges that charge as the consumer drains it. If RSS still crosses the soft threshold while a single-consumer streaming stage holds batches in flight, the engine spills those batches’ records to disk one batch at a time — the streaming handoff is the per-batch counterpart of a blocking stage’s full-stage spill, not an exemption from spilling.
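
The back-pressured handoff can be illustrated with a bounded queue between two threads. This is a toy model, not the engine's channel: `run_pipeline` and its parameters are hypothetical, and it only shows why the number of batches in flight stays constant as the input grows.

```python
import queue
import threading
from typing import List, Tuple


def run_pipeline(records: List[int], batch_size: int,
                 channel_depth: int = 1) -> Tuple[List[int], int]:
    """Stream records to a consumer in batches over a bounded channel.

    Returns the consumed records and the peak number of batches that were
    queued or being consumed at once.
    """
    channel: queue.Queue = queue.Queue(maxsize=channel_depth)

    def producer() -> None:
        batch: List[int] = []
        for record in records:
            batch.append(record)
            if len(batch) == batch_size:
                channel.put(batch)  # blocks while the channel is full
                batch = []
        if batch:
            channel.put(batch)
        channel.put(None)  # end-of-input marker

    thread = threading.Thread(target=producer)
    thread.start()
    out: List[int] = []
    peak = 0
    while (batch := channel.get()) is not None:
        peak = max(peak, channel.qsize() + 1)  # queued + the one in hand
        out.extend(batch)
    thread.join()
    return out, peak
```

With `channel_depth=1` the peak never exceeds two batches regardless of how many records flow through, because a slow consumer stalls the producer at `put`.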

Which stages stream

A stage streams when its output is handed straight to a single downstream consumer instead of crossing a charged inter-stage buffer. The downstream consumer is a sink Output writer, an Aggregate’s ingest, or a hash build-probe Combine’s probe (driver) side — see Streaming into an Aggregate and Streaming into a Combine probe below.

Two stages stream and bound their own footprint to one batch, because they pull records off a live upstream channel and forward each batch without ever building a full result:

  • Source → Transform → Output fused chains. A non-windowed Transform whose only upstream is a single Source and whose only downstream is a single sink Output consumes that Source’s records directly and hands each batch to the Output’s writer thread over a back-pressured channel; neither the Transform nor the Output materializes the whole record set. A Transform that fans out to multiple consumers, feeds another operator, or roots a window keeps the buffered (materialized) path.
  • Merge in interleave mode fed entirely by Sources. The merge reads each Source’s live stream and forwards records as they arrive.

These stages stream their output to a single downstream consumer too — sparing the second copy and overlapping the consumer — but each still builds its full result first, so its own working set is not bounded to one batch:

  • Single-branch Route. A Route with exactly one branch feeding one sink Output streams that branch’s records to the writer thread. A multi-branch Route forks records across several successor buffers and stays materialized.
  • Merge in concat mode, or interleave fed by non-Source inputs, feeding one sink Output. The merge drains its predecessors’ buffers in order (concat) or round-robin (interleave) into the merged result, then streams it.
  • streaming-strategy Aggregate feeding one sink Output. When the planner certifies the aggregate’s input is pre-sorted on the group key, it finalizes the group rows and streams them rather than buffering them for a downstream arm.
  • Combine probe side (hash build-probe strategy) feeding one sink Output. The build relation stays fully materialized in the hash table; the matched probe output streams to the writer.

Each of these requires the producer to feed exactly one downstream consumer and to root no window; a producer that roots a window keeps the materialized path because the window arena needs the producer’s full output to build.

Finally, every Output streams: a sink writes records to its configured writer and never buffers a whole stage.

Document-boundary punctuations (DocumentOpen / DocumentClose, the signals behind $doc.*) flow inline with records through streaming stages, preserving their order: a document’s close always trails the document’s last record, even when the document’s records span several batches.

Streaming into an Aggregate

The streaming consumer above is usually a sink Output. It can also be an Aggregate’s ingest: when an eligible producer (a fused Source → Transform, a single-branch Route, a non-fused Merge, or a streaming-strategy Aggregate) feeds exactly one downstream Aggregate, the producer streams record-at-a-time into the aggregate’s add_record over a back-pressured channel rather than the aggregate pre-draining the producer’s whole output from a charged buffer. The producer reports buffer: streaming and --explain shows no node_buffer edge between it and the aggregate.

This streams the aggregate’s ingest half only — the producer no longer needs a charged inter-stage slot, and a slow aggregate (one that is spilling, say) paces the producer through the bounded channel. The aggregate’s finalize half stays blocking by nature: a group_by value depends on every member, so the group table accumulates the whole input and emits only after the channel closes (end of input). Spill stays driven by RSS pressure, never by channel depth, exactly as on the materialized path.

Two aggregate shapes keep the materialized ingest, because their finalize is not a single forward pass: a time-windowed aggregate runs a multi-pass per-window algorithm over the whole input, and a relaxed correlation-key aggregate retains its group state for the correlation-commit phase. Both show buffer: materialized on the edge into them.

Streaming into a Combine probe

A producer can also stream into a hash build-probe Combine’s probe (driver) side. When an eligible producer (a fused Source → Transform, a single-branch Route, a non-fused Merge, a streaming-strategy Aggregate, or another hash build-probe Combine) is the Combine’s driver input, the producer streams record-at-a-time into the probe kernel over a back-pressured channel rather than the Combine pre-draining the driver’s whole output from a charged buffer. The driver producer reports buffer: streaming and --explain shows no node_buffer edge between it and the Combine. Only the HashBuildProbe strategy qualifies — the range, sort-merge, and grace-hash kernels re-sort or re-scan the driver and stay materialized.

This streams the Combine’s probe half only. The build side stays fully materialized: the engine builds the complete hash table on the main thread before the driver producer streams its first record, so the probe never matches against an incomplete index. The probe consumer runs on its own thread, so a slow driver paces the probe through the bounded channel and a slow probe (a large fan-out) back-pressures the driver. The build relation’s footprint is the hash table, exactly as on the materialized path; the streaming handoff spares only the driver’s inter-stage slot. Per-source dead-letter rewind, memory accounting, and output are byte-identical to the materialized path.

Which stages block

A stage blocks when its result depends on records it has not seen yet:

  • sort — the full input must be present before the first sorted record is known.
  • Hash Aggregate — a group’s final value depends on every member, so the group table accumulates the whole input. (A streaming-strategy Aggregate over a pre-sorted input is the exception: the planner certifies it can emit a group as soon as the sort key advances.)
  • Combine build side — the build relation is fully indexed before any probe record is matched. The probe side streams against the built index, but the build side materializes.
  • IEJoin / sort-merge Combine — both inputs are sorted and buffered before the band/merge step runs.
  • CorrelationCommit — a correlation group is held until its commit decision (flush or dead-letter) is known.

A blocking stage keeps its full-stage accumulation inside pipeline.memory.limit and spills to disk past the soft threshold; it does not stream batches.
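
The difference between a hash Aggregate and a streaming-strategy Aggregate over pre-sorted input is easy to see in miniature. The sketch below sums a value per key; the function names are illustrative and the records are plain `(key, value)` tuples, which is an assumption about shape, not Clinker's record type.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, int]


def hash_aggregate(records: Iterable[Record]) -> Iterator[Record]:
    """Blocking: the group table holds every key until the input is exhausted."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    yield from totals.items()


def streaming_aggregate(sorted_records: Iterable[Record]) -> Iterator[Record]:
    """Streaming: input must be pre-sorted on the key; emits when the key advances."""
    for key, group in groupby(sorted_records, key=itemgetter(0)):
        yield key, sum(value for _, value in group)
```

The streaming variant holds one group's running state at a time, but it is only correct when the input really is sorted on the group key, which is why the planner has to certify that property first.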

Seeing the classification

clinker run <pipeline>.yaml --explain annotates every node with its class in the Physical Properties section:

output.report:
  buffer: streaming

aggregation.dept_totals:
  buffer: materialized

buffer: streaming marks a stage whose output is consumed without an inter-stage buffer — it charges the budget per in-flight batch and, on a single-consumer edge, spills those batches to disk under pressure; buffer: materialized marks a stage whose output crosses a node_buffers slot that charges the memory budget as one full-stage slot and spills the whole stage. Both classes are spill-eligible; they differ in granularity, not in whether they can spill. The explain annotation is derived from the same classifier the executor uses at runtime, so what --explain reports is exactly what the dispatcher does. See Explain Plans and Memory Tuning for the arbitration model that rides alongside the buffer class.

Tuning the batch size

The number of events handed downstream per batch is set by pipeline.batch_size (default 2048), with an optional per-transform override. For a fused streaming stage — the only kind whose footprint is one batch — smaller batches lower its in-flight footprint at the cost of more per-batch bookkeeping; larger batches do the reverse. For the other streaming stages the batch size sets only the in-flight slice handed across the channel; the producer’s own result is built in full regardless, so batch_size does not cap their footprint. The batch size changes only the memory profile of streaming handoffs — never their output, and never the behavior of blocking stages.

Metrics & Monitoring

Clinker writes per-execution metrics as JSON files to a spool directory. These files can be collected into an NDJSON archive for ingestion into monitoring systems.

Enabling metrics

There are three ways to enable metrics collection, listed from highest to lowest priority:

CLI flag:

clinker run pipeline.yaml --metrics-spool-dir ./metrics/

Environment variable:

export CLINKER_METRICS_SPOOL_DIR=./metrics/
clinker run pipeline.yaml

YAML config:

pipeline:
  metrics:
    spool_dir: "./metrics/"

When metrics are enabled, each execution writes one JSON file to the spool directory, named <execution_id>.json.
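
The precedence order can be written as a short resolver. This is a sketch of the documented ordering only; the function name is hypothetical, and treating an empty string as "not set" is an assumption.

```python
from typing import Any, Mapping, Optional


def resolve_spool_dir(cli_flag: Optional[str],
                      env: Mapping[str, str],
                      yaml_config: Mapping[str, Any]) -> Optional[str]:
    """Pick the metrics spool directory: CLI flag, then env var, then YAML."""
    if cli_flag:
        return cli_flag  # --metrics-spool-dir wins
    env_value = env.get("CLINKER_METRICS_SPOOL_DIR")
    if env_value:
        return env_value
    # pipeline.metrics.spool_dir, or None when metrics are not enabled
    return yaml_config.get("pipeline", {}).get("metrics", {}).get("spool_dir")
```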

Metrics schema

Each metrics file follows schema version 1:

{
  "execution_id": "01912345-6789-7abc-def0-123456789abc",
  "schema_version": 1,
  "pipeline_name": "customer_etl",
  "config_path": "/opt/clinker/pipelines/daily_etl.yaml",
  "hostname": "prod-etl-01",
  "started_at": "2026-04-11T10:00:00Z",
  "finished_at": "2026-04-11T10:00:05Z",
  "duration_ms": 5000,
  "exit_code": 0,
  "records_total": 50000,
  "records_ok": 49950,
  "records_dlq": 50,
  "execution_mode": "streaming",
  "peak_rss_bytes": 134217728,
  "thread_count": 4,
  "input_files": ["./data/customers.csv"],
  "output_files": ["./output/enriched.csv"],
  "dlq_path": "./output/errors.csv",
  "error": null
}

Field reference

| Field | Type | Description |
|---|---|---|
| execution_id | string | UUID v7 or custom --batch-id value |
| schema_version | integer | Always 1 for this release |
| pipeline_name | string | The name from the pipeline YAML |
| config_path | string | Absolute path to the config file |
| hostname | string | Machine hostname |
| started_at | string | ISO 8601 UTC timestamp |
| finished_at | string | ISO 8601 UTC timestamp |
| duration_ms | integer | Wall-clock duration in milliseconds |
| exit_code | integer | Process exit code (see Exit Codes) |
| records_total | integer | Total records read from all sources |
| records_ok | integer | Records that reached an output node |
| records_dlq | integer | Records routed to the dead-letter queue |
| execution_mode | string | streaming or batch |
| peak_rss_bytes | integer | Maximum resident set size during execution |
| thread_count | integer | Thread pool size used |
| input_files | array | Paths to all source files |
| output_files | array | Paths to all output files written |
| dlq_path | string/null | Path to the DLQ file, or null if none |
| error | string/null | Error message on failure, or null on success |

Collecting metrics

The spool directory accumulates one file per execution. Use clinker metrics collect to sweep them into an NDJSON archive:

clinker metrics collect \
  --spool-dir ./metrics/ \
  --output-file ./metrics/archive.ndjson \
  --delete-after-collect

This appends all spool files to the archive (one JSON object per line) and removes the originals. The NDJSON format is compatible with most log aggregation and monitoring tools.

Preview without writing:

clinker metrics collect \
  --spool-dir ./metrics/ \
  --output-file ./metrics/archive.ndjson \
  --dry-run

Integration with monitoring systems

Grafana / Prometheus

Parse the NDJSON archive with a log shipper (Promtail, Filebeat, Vector) and create dashboards tracking:

  • duration_ms – execution time trends
  • records_dlq – data quality over time
  • peak_rss_bytes – memory utilization

Datadog

Ship NDJSON to Datadog Logs, then create metrics from log attributes:

# Example: tail the archive and ship to Datadog
tail -f ./metrics/archive.ndjson | datadog-agent log-stream

ELK Stack

Filebeat can ingest NDJSON directly:

# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/clinker/metrics.ndjson
    json.keys_under_root: true

Simple alerting with jq

For environments without a full monitoring stack, use jq to query the archive directly:

# Find all runs with DLQ entries (no time filter; add one on finished_at if needed)
jq 'select(.records_dlq > 0)' metrics/archive.ndjson

# Find runs that exceeded 400 MiB RSS (400 * 1024 * 1024 bytes)
jq 'select(.peak_rss_bytes > 419430400)' metrics/archive.ndjson

# Average duration by pipeline
jq -s 'group_by(.pipeline_name) | map({
  pipeline: .[0].pipeline_name,
  avg_ms: (map(.duration_ms) | add / length)
})' metrics/archive.ndjson
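
Where jq is not available, the same per-pipeline average can be computed with a few lines of Python. This is a sketch that assumes only the `pipeline_name` and `duration_ms` fields from the schema above; the function name is illustrative.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def average_duration_by_pipeline(lines: Iterable[str]) -> Dict[str, float]:
    """Average duration_ms per pipeline_name over NDJSON lines."""
    totals = defaultdict(lambda: [0, 0])  # name -> [sum_ms, run_count]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the archive
        record = json.loads(line)
        entry = totals[record["pipeline_name"]]
        entry[0] += record["duration_ms"]
        entry[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}
```

Because an open file iterates line by line, passing `open("metrics/archive.ndjson")` as `lines` reads the archive without loading it whole.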

Operational recommendations

  • Always enable metrics in production. The overhead is negligible (one small JSON write at the end of each run).
  • Run metrics collect --delete-after-collect on a schedule (e.g., hourly) to prevent spool directory growth.
  • Use --batch-id with meaningful identifiers to correlate metrics across retries and environments.
  • Alert on records_dlq > 0 to catch data quality regressions early.
  • Track peak_rss_bytes trends to anticipate when memory limits need adjustment.

Exit Codes & Error Diagnosis

Clinker uses structured exit codes to communicate the outcome of a pipeline run. These codes are designed for integration with schedulers, cron, CI systems, and monitoring tools.

Exit code reference

| Code | Meaning | Description |
|---|---|---|
| 0 | Success | Pipeline completed. All records processed successfully. |
| 1 | Configuration error | Invalid YAML, CXL syntax error, type mismatch, or DAG wiring problem. Fix the pipeline configuration. |
| 2 | Partial success | Pipeline ran to completion, but some records were routed to the dead-letter queue. Check the DLQ file. |
| 3 | Evaluation error | CXL runtime error during record processing (e.g., division by zero, type coercion failure). |
| 4 | I/O error | File not found, permission denied, disk full, or input format mismatch. |

Understanding exit code 2

Exit code 2 is not a crash. It means:

  • The pipeline started and ran to completion.
  • All viable records were processed and written to output files.
  • Some records could not be processed and were diverted to the dead-letter queue.

Your scheduler should treat exit code 2 as a warning, not a failure. The DLQ file contains the problematic records along with the error that caused each one to be rejected.

To control when exit code 2 escalates to a hard failure, use --error-threshold:

# Abort if more than 100 records hit the DLQ
clinker run pipeline.yaml --error-threshold 100

With a threshold set, the pipeline aborts (exit code 3) when the DLQ count exceeds the threshold, rather than continuing to completion.
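
A scheduler wrapper typically collapses these codes into three outcomes. The mapping below follows the exit code table and the guidance that code 2 is a warning; the names `MEANING` and `classify` are hypothetical.

```python
MEANING = {
    0: "Success",
    1: "Configuration error",
    2: "Partial success",
    3: "Evaluation error",
    4: "I/O error",
}


def classify(exit_code: int) -> str:
    """Map a Clinker exit code to 'ok', 'warning', or 'failure'."""
    if exit_code == 0:
        return "ok"
    if exit_code == 2:
        return "warning"  # ran to completion, some records went to the DLQ
    return "failure"  # 1, 3, 4, and anything unexpected (e.g. killed by a signal)
```

Treating unknown codes as failures is a deliberate choice in this sketch: a run killed by a signal reports a code outside the table and should not pass silently.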

Diagnosing failures

Exit code 1: Configuration error

The error message includes a span-annotated diagnostic pointing to the exact location of the problem:

Error: CXL type error in node 'transform_1'
  --> pipeline.yaml:25:15
   |
25 |   emit total = amount + name
   |                ^^^^^^^^^^^^^ cannot add Int and String

Action: Fix the YAML or CXL expression indicated in the diagnostic, then re-run with --dry-run to confirm the fix.

Exit code 2: Partial success (DLQ entries)

Check the DLQ file for details:

# The DLQ path is shown in the run output and in metrics
cat output/errors.csv

Common causes:

  • Null values in fields that a CXL expression does not handle
  • Data that does not match the declared schema (e.g., non-numeric value in an integer column)
  • Coercion failures between types

Action: Review the DLQ records, fix the data or add null handling to CXL expressions, and re-run.

Exit code 3: Evaluation error

A CXL expression failed at runtime. The error message includes the failing expression and the record that triggered it:

Error: division by zero in node 'compute_ratio'
  expression: emit ratio = total / count
  record: {total: 500, count: 0}

Action: Add guard conditions to the CXL expression:

emit ratio = if count == 0 then 0 else total / count

Exit code 4: I/O error

File system or format errors:

Error: file not found: ./data/customers.csv
  --> pipeline.yaml:8:12

Common causes:

  • Input file does not exist or path is wrong
  • Permission denied on input or output directories
  • Output file already exists (use --force to overwrite)
  • Disk full during output writing
  • Input file format does not match the declared type (e.g., invalid CSV)

Action: Fix file paths, permissions, or disk space, then re-run.

Plan-time diagnostic codes

The process exit codes above tell a scheduler whether the run succeeded. The E### codes below appear inside the structured Error: messages a configuration error (exit code 1) prints, and identify the specific compile-time check that rejected the pipeline. The codes below cover the event-time watermark and time-windowed aggregate surface (issue #61); related code sets live in Pipeline Variables, Channels, and Correlation Keys.

| Code | Trigger | Remediation |
|---|---|---|
| E154 | A source declares watermark.column: <col> but <col> is not present in that source’s schema: block. | Add the column to schema:, or remove the watermark: block. |
| E155 | A source declares watermark.column: <col> and the column exists, but its declared CXL type is not date_time or date. | Change the column’s type: to date_time or date, or point watermark.column at a column that already has one of those types. |
| E156 | An aggregate declares time_window: but at least one upstream-reachable source does not declare watermark.column. | Add watermark: { column: <event-time-column> } to each listed source, or remove time_window: from the aggregate. Without a watermark on every upstream source, min_across_sources never advances past None and the window can never close. |

See Source Nodes → Watermarks and Aggregate Nodes → Time-windowed aggregates for the field semantics each code is enforcing.

DLQ category: LateRecord

When a time-windowed aggregate sees a record whose event time falls inside an already-closed window (window_end + allowed_lateness < min_across_sources), the engine routes the record to the DLQ instead of attempting to fold it into a finalized accumulator. This mirrors Flink’s sideOutputLateData and Spark Structured Streaming’s late-data drop.

The DLQ row carries:

  • _cxl_dlq_error_category = late_record
  • _cxl_dlq_stage = time_window:<aggregate-name>
  • _cxl_dlq_error_detail — the closed window’s [start, end) bounds as i64 nanoseconds since the Unix epoch

Tune watermark.delay (source-side, applies before any aggregate) or allowed_lateness (operator-side, applies per aggregate) to absorb expected out-of-order tails before they reach this path.
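
The closed-window test is a single comparison. A minimal sketch, assuming all three quantities are integer nanoseconds since the Unix epoch (the text states this unit only for the window bounds) and that an unset watermark is represented as `None`:

```python
from typing import Optional


def is_late(window_end_ns: int, allowed_lateness_ns: int,
            min_across_sources_ns: Optional[int]) -> bool:
    """True when the record's window has already closed."""
    if min_across_sources_ns is None:
        return False  # no watermark yet, so no window can have closed
    return window_end_ns + allowed_lateness_ns < min_across_sources_ns
```

Raising `allowed_lateness` pushes the left-hand side up, so the same record stops being late; that is the operator-side knob described above.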

Scheduler integration

Cron script

#!/bin/bash
set -euo pipefail

PIPELINE=/opt/clinker/pipelines/daily_etl.yaml
METRICS_DIR=/var/spool/clinker/

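# Capture the exit code explicitly: under `set -e` a bare non-zero exit
# would abort the script before the case statement below runs.
EXIT=0
clinker run "$PIPELINE" \
  --memory-limit 512M \
  --log-level warn \
  --metrics-spool-dir "$METRICS_DIR" \
  --force || EXIT=$?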

case $EXIT in
  0)
    echo "$(date): Success" >> /var/log/clinker/daily_etl.log
    ;;
  2)
    echo "$(date): Warning - DLQ entries produced" >> /var/log/clinker/daily_etl.log
    mail -s "Clinker ETL Warning: DLQ entries" ops@company.com < /dev/null
    ;;
  *)
    echo "$(date): FAILURE (exit code $EXIT)" >> /var/log/clinker/daily_etl.log
    mail -s "Clinker ETL FAILURE (exit $EXIT)" ops@company.com < /dev/null
    ;;
esac

exit $EXIT

CI pipeline (GitHub Actions)

- name: Validate pipeline configuration
  run: clinker run pipeline.yaml --dry-run
  # Exit code 1 fails the build on config errors

- name: Smoke test with real data
  run: clinker run pipeline.yaml --dry-run -n 100
  # Catches runtime evaluation errors

Systemd

Systemd Type=oneshot services interpret non-zero exit codes as failures. To allow exit code 2 (partial success) without triggering service failure:

[Service]
Type=oneshot
SuccessExitStatus=2
ExecStart=/opt/clinker/bin/clinker run /opt/clinker/pipelines/daily_etl.yaml --force

Production Deployment

Clinker is a single statically-linked binary with no runtime dependencies. Deployment is straightforward: copy the binary to the server.

Installation

# Copy the binary
scp target/release/clinker user@server:/opt/clinker/bin/

# Verify it runs
ssh user@server /opt/clinker/bin/clinker --version

No JVM, no Python, no container runtime required. A typical layout under /opt/clinker:

/opt/clinker/
  bin/
    clinker                    # The binary
  pipelines/
    daily_etl.yaml             # Pipeline configs
    weekly_report.yaml
  data/                        # Input data (or symlinks to data locations)
  output/                      # Output files
  rules/                       # CXL module files (for use statements)
  metrics/                     # Metrics spool directory

Create a dedicated user:

sudo useradd --system --home-dir /opt/clinker --shell /usr/sbin/nologin clinker
sudo chown -R clinker:clinker /opt/clinker

Systemd service

For scheduled one-shot execution:

[Unit]
Description=Clinker ETL - Daily Customer Processing
After=network.target

[Service]
Type=oneshot
ExecStart=/opt/clinker/bin/clinker run /opt/clinker/pipelines/daily_etl.yaml \
  --memory-limit 512M \
  --log-level warn \
  --metrics-spool-dir /var/spool/clinker/ \
  --force
WorkingDirectory=/opt/clinker
User=clinker
Group=clinker
SuccessExitStatus=2

# Resource limits
MemoryMax=1G
CPUQuota=200%

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=clinker-daily

[Install]
WantedBy=multi-user.target

Pair with a systemd timer for scheduling:

[Unit]
Description=Run Clinker daily ETL at 2 AM

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target

Enable and start the timer:

sudo systemctl enable --now clinker-daily.timer

Note: SuccessExitStatus=2 tells systemd that exit code 2 (partial success with DLQ entries) is not a service failure. See Exit Codes for the full reference.

Cron scheduling

# Run daily at 2 AM, log to syslog
0 2 * * * /opt/clinker/bin/clinker run \
  /opt/clinker/pipelines/daily_etl.yaml \
  --log-level warn --force \
  2>&1 | logger -t clinker

# Collect metrics hourly
0 * * * * /opt/clinker/bin/clinker metrics collect \
  --spool-dir /var/spool/clinker/ \
  --output-file /var/log/clinker/metrics.ndjson \
  --delete-after-collect

Environment-based configuration

Use the CLINKER_ENV variable or --env flag to activate environment-specific overrides:

# Production
CLINKER_ENV=production clinker run pipeline.yaml

# Staging
CLINKER_ENV=staging clinker run pipeline.yaml

Combined with channel overrides in the pipeline YAML, this allows a single pipeline definition to target different file paths, connection strings, or thresholds per environment.

Logging

Log levels for production

| Level | Use case |
|---|---|
| warn | Recommended for production cron jobs. Prints warnings and errors only. |
| info | Default. Includes progress messages. Useful during initial deployment. |
| error | Minimal output. Only prints when something fails. |
| debug | Troubleshooting. Generates significant output. |
| trace | Development only. Extremely verbose. |

Directing logs

To syslog via logger:

clinker run pipeline.yaml --log-level warn 2>&1 | logger -t clinker

To a log file:

clinker run pipeline.yaml --log-level warn 2>> /var/log/clinker/etl.log

The systemd journal captures stdout and stderr automatically when Clinker runs as a service.

DLQ monitoring

When a pipeline exits with code 2, records that could not be processed are written to the dead-letter queue file. Set up a daily check:

#!/bin/bash
# Check for non-empty DLQ files modified in the last 24 hours
DLQ_DIR=/opt/clinker/output/
DLQ_FILES=$(find "$DLQ_DIR" -name "*_errors.csv" -mtime 0 -size +0c)

if [ -n "$DLQ_FILES" ]; then
    # The heredoc supplies the message body on stdin.
    mail -s "Clinker DLQ Alert" ops@company.com <<EOF
The following DLQ files were produced today:

$DLQ_FILES

Review the files and address data quality issues.
EOF
fi
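
When the alert should feed a monitoring system rather than mail, the same check can be sketched in Python. This is a minimal sketch, not part of Clinker: the function name find_recent_dlq_files is hypothetical, and it assumes the *_errors.csv suffix and the 24-hour window used by the shell script above.

```python
import time
from pathlib import Path


def find_recent_dlq_files(dlq_dir, max_age_seconds=24 * 60 * 60, now=None):
    """Return non-empty *_errors.csv files modified within max_age_seconds.

    Hypothetical helper: mirrors `find -name "*_errors.csv" -mtime 0 -size +0c`.
    """
    now = time.time() if now is None else now
    recent = []
    for path in sorted(Path(dlq_dir).rglob("*_errors.csv")):
        if not path.is_file():
            continue
        stat = path.stat()
        # Keep only files that contain data and were touched recently.
        if stat.st_size > 0 and now - stat.st_mtime < max_age_seconds:
            recent.append(path)
    return recent
```

A caller would pass the DLQ directory, for example find_recent_dlq_files("/opt/clinker/output/"), and raise an alert when the returned list is not empty.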

Batch ID for tracing

Use --batch-id with a meaningful, consistent naming scheme:

# Date-based
clinker run pipeline.yaml --batch-id "daily-$(date +%Y-%m-%d)"

# Include environment
clinker run pipeline.yaml --batch-id "prod-daily-$(date +%Y-%m-%d)"

The batch ID appears in metrics output and log lines, making it easy to correlate a specific run across logs, metrics, and DLQ files. On retries, use a different batch ID (e.g., append -retry-1) to distinguish attempts.
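
If a scheduler wrapper generates the ID, the naming scheme can be kept in one place. The sketch below is illustrative only: make_batch_id and its parameters are hypothetical names, and the format simply follows the examples above (optional environment prefix, job name, ISO date, optional -retry-N suffix).

```python
from datetime import date


def make_batch_id(job, run_date, env=None, attempt=0):
    """Build IDs such as 'prod-daily-2026-01-15' or 'daily-2026-01-15-retry-1'."""
    parts = [env] if env else []
    parts += [job, run_date.isoformat()]
    if attempt > 0:
        # Retries get a distinct suffix so attempts can be told apart.
        parts.append(f"retry-{attempt}")
    return "-".join(parts)
```

The resulting string is what would be passed to --batch-id.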

Upgrades

To upgrade Clinker:

  1. Validate the new version against your pipelines:
    /opt/clinker/bin/clinker-new run pipeline.yaml --dry-run

  2. Replace the binary:
    cp /opt/clinker/bin/clinker-new /opt/clinker/bin/clinker

  3. Verify:
    /opt/clinker/bin/clinker --version

There is no configuration migration. Pipeline YAML files are forward-compatible within the same major version.

CSV-to-CSV Transform

This recipe reads employee data from a CSV file, computes salary tiers using CXL expressions, and writes the enriched result to a new CSV file.

Input data

employees.csv:

id,name,department,salary
1,Alice Chen,Engineering,95000
2,Bob Martinez,Marketing,62000
3,Carol Johnson,Engineering,88000
4,Dave Williams,Sales,71000
5,Eva Brown,Marketing,58000
6,Frank Lee,Engineering,102000

Pipeline

salary_tiers.yaml:

pipeline:
  name: salary_tiers

nodes:
  - type: source
    name: employees
    config:
      name: employees
      type: csv
      path: "./employees.csv"
      schema:
        - { name: id, type: int }
        - { name: name, type: string }
        - { name: department, type: string }
        - { name: salary, type: int }

  - type: transform
    name: classify
    input: employees
    config:
      cxl: |
        emit id = id
        emit name = name
        emit department = department
        emit salary = salary
        emit level = if salary >= 90000 then "senior" else "junior"
        emit salary_band = match {
          salary >= 100000 => "100k+",
          salary >= 90000 => "90-100k",
          salary >= 70000 => "70-90k",
          _ => "under 70k"
        }

  - type: output
    name: report
    input: classify
    config:
      name: salary_report
      type: csv
      path: "./output/salary_report.csv"

error_handling:
  strategy: fail_fast

Run it

# Validate first
clinker run salary_tiers.yaml --dry-run

# Preview output
clinker run salary_tiers.yaml --dry-run -n 3

# Full run
clinker run salary_tiers.yaml

Expected output

output/salary_report.csv:

id,name,department,salary,level,salary_band
1,Alice Chen,Engineering,95000,senior,90-100k
2,Bob Martinez,Marketing,62000,junior,under 70k
3,Carol Johnson,Engineering,88000,junior,70-90k
4,Dave Williams,Sales,71000,junior,70-90k
5,Eva Brown,Marketing,58000,junior,under 70k
6,Frank Lee,Engineering,102000,senior,100k+

Key points

Schema declaration. The source node declares the schema explicitly with typed columns. This enables compile-time type checking of CXL expressions – if you write salary + name, the type checker catches the error before any data is read.

Emit statements. Each emit in the transform produces one output column. The output schema is defined entirely by the emit statements – input columns that are not emitted are dropped. This is intentional: explicit output schemas prevent accidental data leakage.

Match expressions. The match block evaluates conditions top to bottom and returns the value of the first matching arm. The _ wildcard is the default case and must appear last.
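
The first-match rule is why the arms are ordered from the highest threshold down. The following Python sketch models that evaluation order for the salary_band arms above; it is an illustration of the semantics, not CXL and not Clinker's implementation.

```python
def salary_band(salary):
    """Model of the match block: return the value of the first arm that matches."""
    arms = [
        (lambda s: s >= 100000, "100k+"),
        (lambda s: s >= 90000, "90-100k"),
        (lambda s: s >= 70000, "70-90k"),
    ]
    for condition, value in arms:
        if condition(salary):
            return value
    return "under 70k"  # the `_` wildcard arm


bands = [salary_band(s) for s in (95000, 62000, 88000, 71000, 58000, 102000)]
```

If the salary >= 70000 arm were listed first, it would also capture 95000 and 102000, and the two higher bands would never be reached.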

Error handling. The fail_fast strategy aborts the pipeline on the first record error. For production pipelines processing dirty data, consider dead_letter_queue instead – see Error Handling & DLQ.

Variations

Filtering records

Add a filter statement to exclude records:

  - type: transform
    name: classify
    input: employees
    config:
      cxl: |
        filter salary >= 60000
        emit id = id
        emit name = name
        emit salary = salary

Records where salary < 60000 are dropped silently – they do not appear in the output or the DLQ.

Computed columns with type conversion

      cxl: |
        emit id = id
        emit name = name
        emit monthly_salary = (salary.to_float() / 12.0).round(2)
        emit salary_display = "$" + salary.to_string()

The .to_float() conversion is required because salary is declared as int and division by a float literal requires matching types.

Multi-Input Combine

This recipe enriches order records with product metadata from a separate catalog stream using a combine node. Combine is a first-class N-ary operator: every input is declared up front, and the where expression uses qualified field references (orders.product_id, products.product_id) to express the join.

Input data

orders.csv:

order_id,product_id,quantity,unit_price
ORD-001,PROD-A,5,29.99
ORD-002,PROD-B,2,149.99
ORD-003,PROD-A,1,29.99
ORD-004,PROD-C,10,9.99
ORD-005,PROD-B,3,149.99

products.csv:

product_id,product_name,category
PROD-A,Widget Pro,Hardware
PROD-B,DataSync License,Software
PROD-C,Cable Kit,Hardware

Pipeline

order_enrichment.yaml:

pipeline:
  name: order_enrichment

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: csv
      path: "./orders.csv"
      schema:
        - { name: order_id, type: string }
        - { name: product_id, type: string }
        - { name: quantity, type: int }
        - { name: unit_price, type: float }

  - type: source
    name: products
    config:
      name: products
      type: csv
      path: "./products.csv"
      schema:
        - { name: product_id, type: string }
        - { name: product_name, type: string }
        - { name: category, type: string }

  - type: combine
    name: enrich
    input:
      orders: orders
      products: products
    config:
      where: "orders.product_id == products.product_id"
      match: first
      on_miss: null_fields
      cxl: |
        emit order_id = orders.order_id
        emit product_id = orders.product_id
        emit product_name = products.product_name
        emit category = products.category
        emit quantity = orders.quantity
        emit unit_price = orders.unit_price
        emit line_total = orders.quantity.to_float() * orders.unit_price
      propagate_ck: driver

  - type: output
    name: result
    input: enrich
    config:
      name: enriched_orders
      type: csv
      path: "./output/enriched_orders.csv"

Run it

clinker run order_enrichment.yaml --dry-run
clinker run order_enrichment.yaml --dry-run -n 3
clinker run order_enrichment.yaml

Expected output

output/enriched_orders.csv:

order_id,product_id,product_name,category,quantity,unit_price,line_total
ORD-001,PROD-A,Widget Pro,Hardware,5,29.99,149.95
ORD-002,PROD-B,DataSync License,Software,2,149.99,299.98
ORD-003,PROD-A,Widget Pro,Hardware,1,29.99,29.99
ORD-004,PROD-C,Cable Kit,Hardware,10,9.99,99.90
ORD-005,PROD-B,DataSync License,Software,3,149.99,449.97

How combine works

A combine node declares every input in its input: map, binding each upstream stream to a qualifier used inside expressions:

- type: combine
  name: enrich
  input:
    orders: orders        # qualifier: upstream_node
    products: products
  config:
    where: "orders.product_id == products.product_id"
    propagate_ck: driver

The config: block carries four fields that shape behavior:

  • where – a CXL boolean expression. Every field reference must be qualified with its input name. The expression must contain at least one cross-input equality (e.g. orders.product_id == products.product_id); additional range or arbitrary conjuncts can be combined with and.
  • match – first (default), all, or collect. See below.
  • on_miss – null_fields (default), skip, or error. Applies only to records on the driving input that find no match.
  • cxl – emit statements that shape the output row. Under match: collect, this field must be empty; the combine node auto-derives the output schema.

Match modes

match: first

Emit one output row per driver record, using the first matching build-side record. This is the standard 1:1 enrichment. When no match exists, the behavior is governed by on_miss.

config:
  where: "orders.product_id == products.product_id"
  match: first

match: all

Emit one output row for every matching build-side record. This is 1:N fan-out – if a driver record matches three build records, three rows are emitted.

- type: combine
  name: expand_benefits
  input:
    employees: employees
    benefits: benefits
  config:
    where: "employees.department == benefits.department"
    match: all
    cxl: |
      emit employee_id = employees.employee_id
      emit benefit = benefits.benefit_name
    propagate_ck: driver

An employee in a department with three benefits produces three output records.

match: collect

Gather every matching build-side record into a single Array-typed field on the output row. The driver record appears once; the build matches are aggregated into a list. The cxl: body must be empty under match: collect – the combine node synthesizes the output as { driver fields..., <build_qualifier>: Array }.

- type: combine
  name: gather
  input:
    orders: orders
    products: products
  config:
    where: "orders.product_id == products.product_id"
    match: collect
    cxl: ""
    propagate_ck: driver

Use collect when you need the set of matches as a single structured value (e.g. every price history row for an order). Use all when you need one flat row per match.

Unmatched records (on_miss)

on_miss controls what happens to driver records with zero matches:

config:
  where: "orders.product_id == products.product_id"
  on_miss: null_fields   # default: emit with build fields set to null

config:
  where: "orders.product_id == products.product_id"
  on_miss: skip          # inner-join semantics: drop unmatched drivers

config:
  where: "orders.product_id == products.product_id"
  on_miss: error         # fail the pipeline on first unmatched driver

Use skip for inner-join semantics, null_fields for left-join semantics, and error for strict referential integrity where any miss should halt processing.
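
The interaction between match and on_miss can be modeled in a few lines of Python. This is a simplified sketch of the semantics described above for a single equality key, not Clinker's implementation; the combine function and its parameters are hypothetical. One labeled assumption: under collect, a driver with zero matches is given an empty list, which the text above does not specify.

```python
from collections import defaultdict


def combine(drivers, builds, key, match="first", on_miss="null_fields",
            build_fields=(), build_name="build"):
    """Hash-join model: index the build side, then probe once per driver row."""
    index = defaultdict(list)
    for row in builds:
        index[row[key]].append(row)

    out = []
    for driver in drivers:
        matches = index.get(driver[key], [])
        if match == "collect":
            # Assumption: zero matches yields an empty list.
            out.append({**driver, build_name: list(matches)})
            continue
        if not matches:
            if on_miss == "skip":
                continue  # inner-join semantics
            if on_miss == "error":
                raise LookupError(f"no match for {key}={driver[key]!r}")
            # null_fields: left-join semantics, build fields set to None
            out.append({**driver, **{f: None for f in build_fields}})
            continue
        selected = matches[:1] if match == "first" else matches
        for build in selected:
            out.append({**build, **driver})
    return out
```

With two build rows sharing a key, first emits one row per driver, all emits two, and collect emits one row carrying a two-element list.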

Composite keys

Chain multiple equalities with and to combine on more than one field. Each conjunct is a separate cross-input equality:

- type: combine
  name: match_by_region
  input:
    sales: sales
    targets: targets
  config:
    where: |
      sales.department == targets.department
      and sales.region == targets.region
    cxl: |
      emit department = sales.department
      emit region = sales.region
      emit actual = sales.amount
      emit goal = targets.goal
    propagate_ck: driver

Both equalities must hold for a record pair to match.

Equi plus residual filter

The where clause can mix equi predicates with additional filter conjuncts. Non-equality conjuncts are applied as a residual filter after the equi match:

- type: combine
  name: high_value_enrichment
  input:
    orders: orders
    products: products
  config:
    where: |
      orders.product_id == products.product_id
      and orders.amount >= 100
    match: first
    on_miss: skip
    cxl: |
      emit order_id = orders.order_id
      emit product_name = products.product_name
      emit amount = orders.amount
    propagate_ck: driver

The equi conjunct drives the hash lookup; the amount >= 100 conjunct is evaluated as a post-filter. At least one cross-input equality is required in every combine.

Multi-input combine (three or more)

Combine accepts any number of inputs. Each pair of inputs that should be related needs an explicit equality in the where clause:

- type: combine
  name: fully_enriched
  input:
    orders: orders
    products: products
    categories: categories
  config:
    where: |
      orders.product_id == products.product_id
      and products.category_id == categories.category_id
    match: first
    on_miss: null_fields
    cxl: |
      emit order_id = orders.order_id
      emit product_name = products.product_name
      emit category_name = categories.name
      emit amount = orders.amount
    propagate_ck: driver

Input order in the input: map is preserved, and downstream reasoning treats the first input as the default driving side unless a drive: hint overrides it.

Choosing the driving input

By default the planner picks a driving (probe) input and builds hash tables for the rest. Use drive: to force a specific input to be the driver – typically the larger stream, or the one whose ordering you want to preserve:

- type: combine
  name: product_driven
  input:
    orders: orders
    products: products
  config:
    where: "orders.product_id == products.product_id"
    match: first
    drive: products
    cxl: |
      emit product_id = products.product_id
      emit product_name = products.product_name
      emit sample_order_id = orders.order_id
    propagate_ck: driver

With drive: products, the pipeline emits one row per product enriched with a matching order, instead of one row per order enriched with its product.

Memory considerations

Build-side inputs are materialized in memory as hash tables keyed by the equi columns. For each non-driving input, plan for roughly 1.5-2x the raw CSV size in heap. A 50 MB product catalog typically uses 75-100 MB of hash-table memory. Tune with --memory-limit; see Memory Tuning for spill thresholds and strategy overrides.
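
As a rough planning aid, the 1.5-2x rule of thumb can be applied to every build-side input and summed. The helper below is a hypothetical back-of-the-envelope calculation based only on the factors quoted above; actual usage depends on the data and should be confirmed against Clinker's own metrics.

```python
def build_side_heap_estimate_mb(raw_csv_sizes_mb, low_factor=1.5, high_factor=2.0):
    """Return (low, high) heap estimates in MB for all non-driving inputs."""
    total = sum(raw_csv_sizes_mb)
    return total * low_factor, total * high_factor


# A single 50 MB product catalog on the build side.
low, high = build_side_heap_estimate_mb([50])
```

Comparing the high estimate with the value passed to --memory-limit gives an early hint about whether the build side is likely to spill.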

Routing to Multiple Outputs

This recipe splits a stream of order records into separate output files based on business rules. High-value orders go to one file, standard orders to another.

Input data

orders.csv:

order_id,customer,amount,region
ORD-001,Acme Corp,15000,US
ORD-002,Globex,450,EU
ORD-003,Initech,8500,US
ORD-004,Umbrella,22000,APAC
ORD-005,Stark Ind,950,US
ORD-006,Wayne Ent,3200,EU

Pipeline

order_routing.yaml:

pipeline:
  name: order_routing
  vars:
    high_value_threshold: 5000

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: csv
      path: "./orders.csv"
      schema:
        - { name: order_id, type: string }
        - { name: customer, type: string }
        - { name: amount, type: float }
        - { name: region, type: string }

  - type: route
    name: split_by_value
    input: orders
    config:
      mode: exclusive
      conditions:
        high: "amount >= $vars.high_value_threshold"
      default: standard

  - type: output
    name: high_value_output
    input: split_by_value.high
    config:
      name: high_value_orders
      type: csv
      path: "./output/high_value.csv"

  - type: output
    name: standard_output
    input: split_by_value.standard
    config:
      name: standard_orders
      type: csv
      path: "./output/standard.csv"

Run it

clinker run order_routing.yaml --dry-run
clinker run order_routing.yaml

Expected output

output/high_value.csv:

order_id,customer,amount,region
ORD-001,Acme Corp,15000,US
ORD-003,Initech,8500,US
ORD-004,Umbrella,22000,APAC

output/standard.csv:

order_id,customer,amount,region
ORD-002,Globex,450,EU
ORD-005,Stark Ind,950,US
ORD-006,Wayne Ent,3200,EU

How routing works

Port syntax

Route nodes produce named output ports. Downstream nodes reference these ports using dot syntax: split_by_value.high and split_by_value.standard.

The port names come from two places:

  • Condition names in the conditions map (here, high)
  • The default field (here, standard)

Exclusive mode

With mode: exclusive, each record goes to exactly one branch. Conditions are evaluated top to bottom – the first matching condition wins, and the record is sent to that port. Records that match no condition go to the default port.
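
Exclusive routing amounts to a first-match scan over the conditions with a fallback port. The Python sketch below models that behavior with the order data from this recipe; route_exclusive is a hypothetical name and the predicates stand in for CXL conditions.

```python
def route_exclusive(record, conditions, default):
    """Return the port of the first matching condition, or the default port."""
    for port, predicate in conditions.items():  # dicts keep declaration order
        if predicate(record):
            return port
    return default


HIGH_VALUE_THRESHOLD = 5000
conditions = {"high": lambda r: r["amount"] >= HIGH_VALUE_THRESHOLD}

orders = [
    {"order_id": "ORD-001", "amount": 15000},
    {"order_id": "ORD-002", "amount": 450},
    {"order_id": "ORD-003", "amount": 8500},
]
ports = {o["order_id"]: route_exclusive(o, conditions, "standard") for o in orders}
```

Because the scan stops at the first hit, overlapping conditions are resolved by declaration order and no record is ever sent to two ports.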

Pipeline variables

The threshold is defined in pipeline.vars and referenced in the CXL expression as $vars.high_value_threshold. This makes it easy to adjust the threshold without editing the route condition, and channel overrides can change it per environment.

Variations

Multiple branches

Route nodes can have any number of named branches:

  - type: route
    name: split_by_region
    input: orders
    config:
      mode: exclusive
      conditions:
        us: "region == \"US\""
        eu: "region == \"EU\""
        apac: "region == \"APAC\""
      default: other

  - type: output
    name: us_output
    input: split_by_region.us
    config:
      name: us_orders
      type: csv
      path: "./output/us_orders.csv"

  - type: output
    name: eu_output
    input: split_by_region.eu
    config:
      name: eu_orders
      type: csv
      path: "./output/eu_orders.csv"

  # ... additional outputs for apac, other

Transform before output

Insert a transform between the route and output to shape the data differently per branch:

  - type: transform
    name: enrich_high_value
    input: split_by_value.high
    config:
      cxl: |
        emit order_id = order_id
        emit customer = customer
        emit amount = amount
        emit priority = "URGENT"
        emit review_required = true

  - type: output
    name: high_value_output
    input: enrich_high_value
    config:
      name: high_value_orders
      type: csv
      path: "./output/high_value.csv"

Combining routing with aggregation

Route first, then aggregate each branch independently:

  - type: aggregate
    name: high_value_summary
    input: split_by_value.high
    config:
      group_by: [region]
      cxl: |
        emit total = sum(amount)
        emit count = count(*)

This produces a per-region summary of high-value orders only.

Aggregation & Rollups

This recipe demonstrates grouping records and computing summary statistics. The pipeline filters active sales records, then rolls them up by department.

Input data

sales.csv:

id,department,amount,status,rep
1,Engineering,5000,active,Alice
2,Marketing,3000,active,Bob
3,Engineering,7000,active,Carol
4,Sales,4000,inactive,Dave
5,Marketing,2000,active,Eva
6,Engineering,9500,active,Frank
7,Sales,6000,active,Grace
8,Marketing,1500,inactive,Hank

Pipeline

dept_rollup.yaml:

pipeline:
  name: dept_rollup

nodes:
  - type: source
    name: sales
    config:
      name: sales
      type: csv
      path: "./sales.csv"
      schema:
        - { name: id, type: int }
        - { name: department, type: string }
        - { name: amount, type: float }
        - { name: status, type: string }
        - { name: rep, type: string }

  - type: transform
    name: active_only
    input: sales
    config:
      cxl: |
        filter status == "active"

  - type: aggregate
    name: rollup
    input: active_only
    config:
      group_by: [department]
      cxl: |
        emit total = sum(amount)
        emit count = count(*)
        emit average = avg(amount)
        emit maximum = max(amount)
        emit minimum = min(amount)

  - type: output
    name: report
    input: rollup
    config:
      name: dept_totals
      type: csv
      path: "./output/dept_totals.csv"

Run it

clinker run dept_rollup.yaml --dry-run
clinker run dept_rollup.yaml

Expected output

output/dept_totals.csv:

department,total,count,average,maximum,minimum
Engineering,21500,3,7166.67,9500,5000
Marketing,5000,2,2500,3000,2000
Sales,6000,1,6000,6000,6000

One row per department. The inactive records (Dave’s $4000, Hank’s $1500) are excluded by the filter.

How aggregation works

Group-by keys

The group_by field lists the columns that define each group. Records with the same values for all group-by columns are aggregated together. The group-by columns appear automatically in the output – you do not need to emit them.

Aggregate functions

Available aggregate functions in CXL:

| Function | Description |
|----------|-------------|
| sum(expr) | Sum of values |
| count(*) | Number of records |
| avg(expr) | Arithmetic mean |
| min(expr) | Minimum value |
| max(expr) | Maximum value |
| first(expr) | First value encountered |
| last(expr) | Last value encountered |

Strategy selection

Clinker offers two aggregation strategies:

  • Hash aggregation (default): Builds an in-memory hash map keyed by the group-by columns. Works with any input order. Memory usage is proportional to the number of distinct groups.

  • Streaming aggregation: Processes records in order, emitting each group’s result as soon as the next group starts. Requires input sorted by the group-by keys. Uses minimal memory regardless of the number of groups.

The default strategy (auto) selects streaming when the optimizer can prove the input is sorted by the group-by keys, and hash otherwise. You can force a strategy:

    config:
      group_by: [department]
      strategy: streaming   # requires sorted input

See Memory Tuning for details on memory implications.
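
The difference between the two strategies can be illustrated with a small Python model. This is a sketch of the general techniques, not Clinker's code: hash_aggregate keeps one accumulator per distinct group, while streaming_aggregate keeps only the current group and therefore depends on sorted input. The function names are hypothetical.

```python
from itertools import groupby


def hash_aggregate(rows, key, value):
    """One accumulator per distinct group; input order does not matter."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return list(totals.items())


def streaming_aggregate(rows, key, value):
    """Emit a group as soon as the key changes; only one group is held at a time."""
    for group, members in groupby(rows, key=lambda r: r[key]):
        yield group, sum(m[value] for m in members)


active = [
    {"department": "Engineering", "amount": 5000},
    {"department": "Marketing", "amount": 3000},
    {"department": "Engineering", "amount": 7000},
    {"department": "Marketing", "amount": 2000},
    {"department": "Engineering", "amount": 9500},
    {"department": "Sales", "amount": 6000},
]
by_hash = hash_aggregate(active, "department", "amount")
by_stream = list(streaming_aggregate(
    sorted(active, key=lambda r: r["department"]), "department", "amount"))
```

In this model, feeding the unsorted rows to streaming_aggregate yields six partial groups instead of three, which is why the streaming strategy is only appropriate when the sort order is guaranteed.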

Variations

Multiple group-by keys

    config:
      group_by: [department, region]
      cxl: |
        emit total = sum(amount)
        emit count = count(*)

Produces one row per unique (department, region) combination.

Pre-aggregation transform

Compute derived fields before aggregating:

  - type: transform
    name: prepare
    input: sales
    config:
      cxl: |
        filter status == "active"
        emit department = department
        emit amount = amount
        emit is_large = amount >= 5000

  - type: aggregate
    name: rollup
    input: prepare
    config:
      group_by: [department]
      cxl: |
        emit total = sum(amount)
        emit large_count = sum(if is_large then 1 else 0)
        emit small_count = sum(if not is_large then 1 else 0)

Aggregation followed by routing

Aggregate first, then route the summary rows:

  - type: aggregate
    name: rollup
    input: active_only
    config:
      group_by: [department]
      cxl: |
        emit total = sum(amount)

  - type: route
    name: split_by_total
    input: rollup
    config:
      mode: exclusive
      conditions:
        large: "total >= 10000"
      default: small

This routes departments with at least $10,000 in total sales to one output and the rest to another.

No group-by (grand total)

Omit group_by to aggregate all records into a single output row:

    config:
      cxl: |
        emit grand_total = sum(amount)
        emit record_count = count(*)
        emit average_amount = avg(amount)

Time-windowed rollups

When the grouping dimension is an event-time bucket, declare a watermark: on every source and a time_window: on the aggregate. Three patterns cover the common shapes; all three ship as runnable pipelines under examples/pipelines/.

Tumbling: hourly click counts

Non-overlapping one-hour buckets per user. Use when each record should contribute to exactly one reporting bucket.

examples/pipelines/tumbling_clicks.yaml:

pipeline:
  name: tumbling_clicks

nodes:
  - type: source
    name: clicks
    description: Per-user click stream with an event-time column.
    config:
      name: clicks
      type: csv
      path: ./data/tumbling_clicks.csv
      options:
        has_header: true
      watermark:
        column: event_ts
      schema:
        - { name: user_id, type: string }
        - { name: event_ts, type: date_time }
        - { name: kind, type: string }

  - type: aggregate
    name: hourly_clicks
    description: Per-user click count, bucketed by event-time hour.
    input: clicks
    config:
      group_by: [user_id]
      time_window:
        tumbling: { size: 1h }
      cxl: |
        emit user_id = user_id
        emit n = count(*)

  - type: output
    name: results
    input: hourly_clicks
    config:
      name: results
      type: csv
      path: ./output/tumbling_clicks.csv

error_handling:
  strategy: fail_fast

Run:

cargo run -p clinker -- run examples/pipelines/tumbling_clicks.yaml

The source’s watermark advances with each record’s event_ts; each hour-aligned bucket emits one row per user_id as soon as the watermark crosses bucket_end. Records observed out of order land in the DLQ as late_record. Add delay: on the source or allowed_lateness: on the aggregate if the input has a known out-of-order tail.
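
Assigning a record to its tumbling bucket is a floor operation on the event time. The Python sketch below shows one way to compute it, assuming buckets are aligned to the Unix epoch and are half-open intervals [start, end); both are assumptions made for illustration, and tumbling_bucket is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_bucket(event_ts, size):
    """Return (bucket_start, bucket_end) for the bucket that contains event_ts."""
    offset = (event_ts - EPOCH) % size  # time elapsed since the bucket opened
    start = event_ts - offset
    return start, start + size


ts = datetime(2026, 1, 15, 10, 42, 7, tzinfo=timezone.utc)
bucket = tumbling_bucket(ts, timedelta(hours=1))
```

Every record maps to exactly one bucket, which is what makes tumbling windows suitable for non-overlapping reports.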

Hopping: 1-hour sums advanced every 5 minutes

Overlapping one-hour windows that move forward every 5 minutes. Use for moving averages and rolling sums where one record should contribute to multiple overlapping reports.

examples/pipelines/hopping_sliding_5m_1h.yaml:

pipeline:
  name: hopping_sliding_5m_1h

nodes:
  - type: source
    name: clicks
    config:
      name: clicks
      type: csv
      path: ./data/hopping_clicks.csv
      options:
        has_header: true
      watermark:
        column: event_ts
        delay: 5s
      schema:
        - { name: user_id, type: string }
        - { name: event_ts, type: date_time }
        - { name: amount, type: int }

  - type: aggregate
    name: sliding_amount
    input: clicks
    config:
      group_by: [user_id]
      time_window:
        hopping:
          size: 1h
          slide: 5m
      allowed_lateness: 30s
      cxl: |
        emit user_id = user_id
        emit total = sum(amount)
        emit n = count(*)

  - type: output
    name: results
    input: sliding_amount
    config:
      name: results
      type: csv
      path: ./output/hopping_sliding_5m_1h.csv

error_handling:
  strategy: fail_fast

Run:

cargo run -p clinker -- run examples/pipelines/hopping_sliding_5m_1h.yaml

Each record fans into ceil(size / slide) = 12 overlapping windows, so the output row count is roughly 12× the active-window record count. The source’s delay: 5s plus the aggregate’s allowed_lateness: 30s give the pipeline 35 seconds of total grace beyond strict event-time order before a record drops to the DLQ.
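
The fan-out of 12 follows from enumerating every slide-aligned window that contains a given timestamp. The sketch below does that in Python under the same illustrative assumptions as before (epoch-aligned, half-open windows); hopping_windows is a hypothetical name, not a Clinker API.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hopping_windows(event_ts, size, slide):
    """Return every (start, end) window of width `size` that contains event_ts."""
    # Latest slide-aligned window start at or before the event.
    start = event_ts - ((event_ts - EPOCH) % slide)
    windows = []
    while start + size > event_ts:  # window still covers the event
        windows.append((start, start + size))
        start -= slide
    return list(reversed(windows))


ts = datetime(2026, 1, 15, 10, 42, 7, tzinfo=timezone.utc)
windows = hopping_windows(ts, timedelta(hours=1), timedelta(minutes=5))

# Total grace before a record is treated as late: source delay + allowed lateness.
total_grace = timedelta(seconds=5) + timedelta(seconds=30)
```

For a 1-hour size and a 5-minute slide the list holds 12 windows, the earliest starting at 09:45 and the latest at 10:40 for the timestamp shown.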

Session: per-user multi-source login sessions

Variable-duration windows bounded by inactivity, computed across two independent sources. Use for activity grouping where the window length is data-driven rather than clock-aligned.

examples/pipelines/multi_source_session.yaml:

pipeline:
  name: multi_source_session

nodes:
  - type: source
    name: src_web
    description: Web login events.
    config:
      name: src_web
      type: csv
      path: ./data/session_logins.csv
      options:
        has_header: true
      watermark:
        column: event_ts
      schema:
        - { name: user_id, type: string }
        - { name: event_ts, type: date_time }
        - { name: source, type: string }

  - type: source
    name: src_mobile
    description: Mobile login events.
    config:
      name: src_mobile
      type: csv
      path: ./data/session_mobile.csv
      options:
        has_header: true
      watermark:
        column: event_ts
      schema:
        - { name: user_id, type: string }
        - { name: event_ts, type: date_time }
        - { name: source, type: string }

  - type: merge
    name: all_logins
    inputs: [src_web, src_mobile]

  - type: aggregate
    name: user_sessions
    input: all_logins
    config:
      group_by: [user_id]
      time_window:
        session: { gap: 5m }
      allowed_lateness: 30s
      cxl: |
        emit user_id = user_id
        emit logins = count(*)

  - type: output
    name: results
    input: user_sessions
    config:
      name: results
      type: csv
      path: ./output/multi_source_session.csv

error_handling:
  strategy: fail_fast

Run:

cargo run -p clinker -- run examples/pipelines/multi_source_session.yaml

Each source declares its own watermark.column independently. The aggregate’s close decision reads min_across_sources across both sources’ partitions: a session can’t emit until both src_web and src_mobile have advanced past session_end + allowed_lateness. Drop the watermark: block on either source and the planner rejects the pipeline with E156.
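
Two ideas from this pattern can be sketched in Python: gap-based session grouping over the merged events, and a combined watermark taken as the minimum across sources. This is a simplified model, not Clinker's implementation. The names are hypothetical, and it assumes an event extends a session when it arrives within gap (inclusive) of the session's last event; the exact boundary rule is not stated above.

```python
from datetime import datetime, timedelta


def sessionize(events, gap):
    """events: (user_id, event_ts) pairs. Returns (user_id, start, end, count) tuples."""
    sessions = []
    open_by_user = {}
    for user_id, ts in sorted(events, key=lambda e: e[1]):
        current = open_by_user.get(user_id)
        if current is not None and ts - current[2] <= gap:
            current[2] = ts      # extend the open session
            current[3] += 1
        else:
            current = [user_id, ts, ts, 1]  # inactivity exceeded: new session
            open_by_user[user_id] = current
            sessions.append(current)
    return [tuple(s) for s in sessions]


def combined_watermark(per_source_watermarks):
    """A window may only close once the slowest source has advanced past it."""
    return min(per_source_watermarks.values())


day = datetime(2026, 1, 15)
merged = [
    ("u1", day.replace(hour=10, minute=0)),   # web
    ("u1", day.replace(hour=10, minute=3)),   # mobile
    ("u2", day.replace(hour=10, minute=1)),   # web
    ("u1", day.replace(hour=10, minute=20)),  # web, after a 17-minute gap
]
sessions = sessionize(merged, timedelta(minutes=5))
```

Here u1 ends up with two sessions (two logins, then one) and u2 with a single one-login session.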

When to pick each

| Kind | Bucket shape | Typical use |
|------|--------------|-------------|
| tumbling | Disjoint, clock-aligned, fixed width | Hourly metrics, daily rollups, billing periods. |
| hopping | Overlapping, clock-aligned, fixed width | Moving averages, sliding sums, anomaly detection where each record should affect multiple reports. |
| session | Variable width, gap-bounded, per-key | User sessions, telemetry burst grouping, activity envelopes where the window length is data-driven. |

File Splitting

This recipe demonstrates splitting large output files into smaller chunks, optionally keeping related records together.

Basic record-count splitting

Split output into files of at most 5,000 records each:

pipeline:
  name: monthly_report

nodes:
  - type: source
    name: transactions
    config:
      name: transactions
      type: csv
      path: "./data/transactions.csv"
      schema:
        - { name: id, type: int }
        - { name: date, type: string }
        - { name: department, type: string }
        - { name: amount, type: float }
        - { name: description, type: string }

  - type: output
    name: split_output
    input: transactions
    config:
      name: monthly_report
      type: csv
      path: "./output/report.csv"
      split:
        max_records: 5000
        naming: "{stem}_{seq:04}.{ext}"
        repeat_header: true

Output files

output/report_0001.csv  (5000 records + header)
output/report_0002.csv  (5000 records + header)
output/report_0003.csv  (remaining records + header)

Naming pattern variables

| Variable | Description | Example |
|----------|-------------|---------|
| {stem} | Base filename without extension | report |
| {ext} | File extension | csv |
| {seq:04} | Zero-padded sequence number (width 4) | 0001 |

The path field provides the template: ./output/report.csv means stem is report and ext is csv.
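
The expansion can be modeled in Python, whose str.format happens to accept the same {seq:04} width notation. This is only an illustration of how the variables resolve; chunk_path is a hypothetical helper, and Clinker's own template engine may differ in details not covered here.

```python
from pathlib import Path


def chunk_path(template_path, naming, seq):
    """Expand {stem}, {ext} and {seq:04} against the configured output path."""
    path = Path(template_path)
    name = naming.format(stem=path.stem, ext=path.suffix.lstrip("."), seq=seq)
    return path.with_name(name)  # same directory, expanded file name


first_chunk = chunk_path("./output/report.csv", "{stem}_{seq:04}.{ext}", 1)
```

The directory portion of path is kept, so the chunks land next to where the unsplit file would have been written.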

Header behavior

When repeat_header: true, each output file includes the CSV header row. This is the recommended setting – each file is self-contained and can be processed independently.

Grouped splitting

Keep all records with the same group key value in the same file:

      split:
        max_records: 5000
        group_key: "department"
        naming: "{stem}_{seq:04}.{ext}"
        repeat_header: true
        oversize_group: warn

With group_key: "department", the splitter keeps each run of consecutive records that share a department value in the same output file. A new file starts only at a group boundary (when the department value changes), even if the current file has not reached max_records yet.

Oversize group policy

If a single group contains more records than max_records, the oversize_group setting controls behavior:

| Policy | Behavior |
|--------|----------|
| warn (default) | Log a warning and write all records for the group into one file, exceeding the limit |
| error | Stop the pipeline with an error |
| allow | Silently allow the oversized file |

For example, if max_records is 5,000 but the Engineering department has 7,000 records, the warn policy produces a file with 7,000 records and logs a warning.

Byte-based splitting

Split by file size instead of record count:

      split:
        max_bytes: 10485760  # 10 MB per file
        naming: "{stem}_{seq:04}.{ext}"
        repeat_header: true

The splitter estimates the current file size and starts a new file when the limit is approached. The actual file size may slightly exceed the limit because the current record is always completed before splitting.

Combined limits

Use both max_records and max_bytes together – whichever limit is reached first triggers a new file:

      split:
        max_records: 10000
        max_bytes: 5242880   # 5 MB
        naming: "{stem}_{seq:04}.{ext}"
        repeat_header: true

This is useful when record sizes vary widely. Short records might produce a tiny file at 10,000 records, while long records might hit the byte limit well before 10,000.
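
The "whichever limit is reached first" rule can be sketched as a small Python generator. It is a simplified model rather than Clinker's splitter: split_records is a hypothetical name, record size is approximated as the UTF-8 length plus a newline, header bytes are ignored, and the current record is always written before the limits are checked.

```python
def split_records(records, max_records=None, max_bytes=None):
    """Yield chunks of records, rolling over when either limit is reached."""
    chunk, chunk_bytes = [], 0
    for record in records:
        chunk.append(record)  # the current record is always completed
        chunk_bytes += len(record.encode("utf-8")) + 1  # +1 for the newline
        hit_records = max_records is not None and len(chunk) >= max_records
        hit_bytes = max_bytes is not None and chunk_bytes >= max_bytes
        if hit_records or hit_bytes:
            yield chunk
            chunk, chunk_bytes = [], 0
    if chunk:
        yield chunk  # remaining records form the last file


rows = ["aaaa"] * 5  # 5 bytes each including the newline
sizes_by_count = [len(c) for c in split_records(rows, max_records=3)]
sizes_by_both = [len(c) for c in split_records(rows, max_records=3, max_bytes=8)]
```

With both limits set, the byte limit triggers first for these rows, so the chunks are smaller than max_records alone would produce.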

Full pipeline example

A complete pipeline that reads a large transaction file, filters it, and splits the output:

pipeline:
  name: split_transactions

nodes:
  - type: source
    name: transactions
    config:
      name: transactions
      type: csv
      path: "./data/all_transactions.csv"
      schema:
        - { name: id, type: int }
        - { name: date, type: string }
        - { name: department, type: string }
        - { name: category, type: string }
        - { name: amount, type: float }

  - type: transform
    name: current_year
    input: transactions
    config:
      cxl: |
        filter date.starts_with("2026")

  - type: output
    name: chunked
    input: current_year
    config:
      name: transactions_2026
      type: csv
      path: "./output/transactions_2026.csv"
      split:
        max_records: 5000
        group_key: "department"
        naming: "{stem}_{seq:04}.{ext}"
        repeat_header: true
        oversize_group: warn

Run the pipeline:

clinker run split_transactions.yaml --force

Practical considerations

  • Downstream consumers. Splitting is useful when the receiving system has file size limits (e.g., an upload API that accepts files up to 10 MB) or when parallel processing of chunks is desired.

  • Record ordering. Records within each output file maintain their original order from the pipeline. Across files, the sequence number ({seq}) indicates the order.

  • Group key sorting. group_key keeps a group together only when its records are adjacent in the input, so the input should be sorted by the group key. If it is not sorted, records for the same group may land in multiple files. Pre-sort with a transform if needed, or accept the split-group behavior.

  • Overwrite behavior. Use --force when re-running a pipeline with splitting enabled. Without it, the pipeline aborts if any of the output chunk files already exist.
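For the upload-limit case in the first bullet, one option is to set max_bytes somewhat below the receiver's hard limit, because the byte-based splitter finishes the current record before rotating and can slightly overshoot. A minimal sketch; the 9 MB figure is an illustrative assumption, not a recommended value:

```yaml
      split:
        max_bytes: 9437184   # 9 MB (9 * 1024 * 1024), headroom under a 10 MB upload cap
        naming: "{stem}_{seq:04}.{ext}"
        repeat_header: true
```

How much headroom is enough depends on the size of the largest record, so it is worth checking the resulting chunk sizes against the receiver's limit after a first run.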

Intra-Record Closures

This recipe shows the complete intra-record fan-out shape: an NDJSON source where each record carries an array of line items, a transform that filters items by price and then fans each remaining item into its own output record, and a flat NDJSON sink ready for downstream billing.

The pieces involved:

  • Arrow-syntax closures for predicates and projections.
  • Array methods (filter, map) for in-place transformation.
  • Bracket-index access (it["sku"]) for reading fields off each map element.
  • emit each for fan-out.
  • The Output node’s include_unmapped flag for controlling which fields reach the sink.

Input data

orders.ndjson – one JSON object per line, each carrying a nested items array:

{"order_id":"O-1","customer":"alice@example.com","items":[{"sku":"a","price":10,"qty":2},{"sku":"b","price":20,"qty":1},{"sku":"c","price":3,"qty":5}]}
{"order_id":"O-2","customer":"bob@example.com","items":[{"sku":"a","price":10,"qty":1},{"sku":"d","price":50,"qty":1}]}

Each record has two order-level fields (order_id, customer) and an items array whose elements are maps with sku, price, and qty.

Goal

For each order:

  1. Drop items priced under $5 (a sub-threshold cutoff).
  2. Fan the surviving items into one output record each, carrying the order-level identifiers plus the per-item fields.
  3. Compute the per-line revenue (unit_price * qty) for each output record.

Pipeline

billing_lines.yaml:

pipeline:
  name: billing_lines

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: json
      options:
        format: ndjson
      path: "./orders.ndjson"
      schema:
        - { name: order_id, type: string }
        - { name: customer, type: string }
        - { name: items, type: any }

  - type: transform
    name: filter_lines
    input: orders
    config:
      cxl: |
        emit order_id = order_id
        emit customer = customer
        emit item_count = items.length()
        emit kept = items.filter(it => it["price"] >= 5)

  - type: transform
    name: explode
    input: filter_lines
    config:
      max_expansion: 10000
      cxl: |
        emit each it in kept {
          emit order_id = order_id
          emit customer = customer
          emit sku = it["sku"]
          emit unit_price = it["price"]
          emit qty = it["qty"]
          emit line_total = it["price"] * it["qty"]
        }

  - type: output
    name: lines_out
    input: explode
    config:
      name: lines_out
      type: json
      path: "./output/billing_lines.ndjson"
      options:
        format: ndjson
      include_unmapped: false
      exclude: [items, kept]

error_handling:
  strategy: continue

Run it

# Validate first
clinker run billing_lines.yaml --dry-run

# Preview the first few output records
clinker run billing_lines.yaml --dry-run -n 3

# Full run
clinker run billing_lines.yaml

Expected output

output/billing_lines.ndjson:

{"order_id":"O-1","customer":"alice@example.com","sku":"a","unit_price":10,"qty":2,"line_total":20}
{"order_id":"O-1","customer":"alice@example.com","sku":"b","unit_price":20,"qty":1,"line_total":20}
{"order_id":"O-2","customer":"bob@example.com","sku":"a","unit_price":10,"qty":1,"line_total":10}
{"order_id":"O-2","customer":"bob@example.com","sku":"d","unit_price":50,"qty":1,"line_total":50}

Order O-1’s three input items collapse to two output records (the sku=c line was filtered out because its price was below $5). Order O-2’s two items both survive the filter and produce two output records.

How it works

Filter stage. The filter_lines transform reads each order, runs items.filter(it => it["price"] >= 5) to drop sub-threshold items, and stashes the survivors in a kept field. The closure body uses bracket indexing (it["price"]) because each it is a map; bracket indexing returns null for missing keys without aborting. The same record also carries an item_count projection so downstream nodes could route or audit on the original (pre-filter) item count.

Explode stage. The explode transform contains one emit each block over kept. For each surviving item, the body emits a flat record with the order-level identifiers (order_id, customer) repeated, plus the per-item fields lifted out of it. The body has no filter or nested emit each – those are forbidden inside the block; pre-filter upstream as we did, or post-filter in a downstream transform.

include_unmapped: false. The default Output policy is to pass every unmapped input field through. Here we set it to false so the order-level items array (carried through from the source), the item_count projection, and the intermediate kept array (used only as the fan-out source) do not leak into the per-line output. The exclude: [items, kept] list provides a belt-and-suspenders defense against future renaming.

max_expansion: 10000. Caps how many output records a single input order may produce. The default is 10000; we set it explicitly here so the value is visible in the YAML. Orders with arrays larger than the cap route to the DLQ with category expansion_limit_exceeded (see Transform Nodes -> Expansion Cap).

Variations

Pass through every input field

Remove include_unmapped: false (or set it to true), and take items and kept out of the exclude list so they are not filtered back out. The original items array, the item_count projection, and the intermediate kept array will then appear on every output record. Useful when downstream consumers expect a complete record context, or when you need to audit what was filtered.
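A sketch of the Output node for this variation, using only the keys from the pipeline above. It assumes the exclude list would still drop items and kept if left in place, so the list is removed along with the flag change:

```yaml
  - type: output
    name: lines_out
    input: explode
    config:
      name: lines_out
      type: json
      path: "./output/billing_lines.ndjson"
      options:
        format: ndjson
      include_unmapped: true   # pass unmapped upstream fields through to the sink
      # exclude list removed (assumption: it would otherwise still drop items and kept)
```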

Emit a single record per order with the kept-items array

Drop the explode transform and route filter_lines directly to the Output. Each output record stays at order grain, with kept carrying the post-filter array. This is the same pipeline minus the fan-out step.
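A sketch of the Output node for this variation. The source and filter_lines nodes stay as they are and the explode node is deleted. The node name and path are hypothetical, and kept is taken out of the exclude list on the assumption that exclude would otherwise remove it from the order-grain record:

```yaml
  - type: output
    name: orders_out                        # hypothetical node name
    input: filter_lines                     # read order-grain records directly; no explode node
    config:
      name: orders_out
      type: json
      path: "./output/orders_kept.ndjson"   # hypothetical path
      options:
        format: ndjson
      include_unmapped: false
      exclude: [items]                      # kept is no longer excluded, so the filtered array reaches the sink
```

Because filter_lines emits order_id, customer, item_count, and kept, those are the fields expected at the sink under these assumptions.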

Reach for .flat_map instead of two transforms

When the per-element transformation is simple enough to fit in a single closure body, flat_map collapses the filter + project + explode pattern into one expression. It produces a flat array, which downstream nodes still see as a single field on the input record; the explicit emit each is what produces multiple output records.

Rewrite a nested field in place with .set

When you want to keep the record at order grain but mutate a value buried inside it, the set map method takes a dotted/indexed path and rewrites a single leaf, leaving every sibling untouched:

    cxl: |
      emit order = order.set("items[0].sku", "A-100").set("ship.region", "us-east")

The first set overwrites the SKU of the first item; the second writes ship.region, auto-creating the ship map if the order had no ship field yet. Because set is copy-on-write, this builds a fresh order document without disturbing the upstream binding. A path that conflicts with the existing shape (descending into a scalar, or an array index past the end) yields null for that set rather than partially writing – guard with catch if a path may not match every record.

See also