Clinker

Clinker is a pure-Rust, bounded-memory batch DAG executor for CSV, JSON, XML, and fixed-width data. It reads finite inputs, drives them through a directed acyclic graph of transformation nodes one record at a time, and exits when the inputs are drained. It ships as a single static binary with no interpreter, no runtime, and no install dependencies.

Pipelines are declared in YAML. Data transformation logic is written in CXL, a custom expression language purpose-built for ETL. Together they replace legacy tools like Informatica, SSIS, Talend, and NiFi with something deterministic, lightweight, and easy to reason about.

What Clinker is, plainly

A finite batch executor with per-record streaming evaluation, not a long-running stream processor. A pipeline run is a job: Sources read until EOF, the DAG drains, the process exits. Within a run, stateless operators (Transform, Route, most Combine probe-side work, Output) evaluate records one at a time without accumulating per-record state. Every stage is charged against the configured RSS budget. Fused Source → Transform → Output paths run streaming with no per-stage materialization; non-fused boundaries (Route fan-out, Merge fan-in, Composition bodies, diamond DAGs) materialize records into per-stage buffers that charge against the same envelope. The engine spills buffers to disk at 80% of the limit and fails fast with E310 MemoryBudgetExceeded at the hard limit, naming the offending producer. Blocking operators (Aggregate, sort, grace-hash Combine) accumulate state inside that same budget and spill to disk when soft and hard memory thresholds trip, rather than OOM-killing the process.
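
The soft and hard thresholds described above can be modelled in a few lines. This is a Python sketch for illustration only, not Clinker's Rust implementation: the `Budget` class, its `charge` method, and the choice to model a spill as releasing the producer's resident bytes are all assumptions made for the example. Only the 80% spill point and the E310 MemoryBudgetExceeded name come from the text.

```python
from dataclasses import dataclass, field


class MemoryBudgetExceeded(Exception):
    """Stand-in for the E310 diagnostic; the message names the producer."""


@dataclass
class Budget:
    limit_bytes: int
    soft_fraction: float = 0.8                    # spill point from the text
    charged: dict = field(default_factory=dict)   # producer -> resident bytes
    spilled: list = field(default_factory=list)   # producers that spilled

    def total(self) -> int:
        return sum(self.charged.values())

    def charge(self, producer: str, nbytes: int) -> str:
        """Charge nbytes to a producer; return 'resident' or 'spilled'."""
        if self.total() + nbytes > self.limit_bytes:
            # Hard limit: fail before allocating, naming the producer.
            raise MemoryBudgetExceeded(f"E310 MemoryBudgetExceeded: {producer}")
        self.charged[producer] = self.charged.get(producer, 0) + nbytes
        if self.total() >= self.soft_fraction * self.limit_bytes:
            # Soft threshold: model the spill by releasing resident bytes.
            self.charged[producer] = 0
            self.spilled.append(producer)
            return "spilled"
        return "resident"
```

With a 1000-byte limit, charging 500 bytes stays resident, a further 300 bytes reaches the 80% mark and spills, and a later 600-byte request is rejected before anything is allocated.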

If you have used Flink, Kafka Streams, or Beam in unbounded mode: Clinker is not that. There are no watermarks against wall-clock time, no infinite-source semantics, no exactly-once delivery across restarts. The closest prior art is Pentaho Kettle / Apache Hop, Embulk, Singer, Benthos in batch mode, and Vector running file-to-file – finite ETL jobs with per-record evaluation and a hard memory ceiling.

Three pillars of what Clinker is:

Finite inputs. Files (CSV, JSON, XML, fixed-width, EDIFACT, X12, HL7 v2, SWIFT MT) are the canonical shape. Finite-cursor network sources (paginated REST APIs with hard page/record caps) fit the same model – they exhaust their cursor and EOF. Unbounded sources (Kafka topics, Kinesis streams, Server-Sent Events, webhooks, tail -f-style file followers) are out of scope and will remain so.
Finite jobs. A pipeline run begins when you invoke clinker run, drains the DAG, and exits with a status code. No long-running daemon, no service surface, no infinite event loop.
Single process. One clinker binary invocation is one operating-system process. Parallelism happens inside the process via threads (std::thread, Rayon). Clinker does not spawn worker processes, does not coordinate a cluster, and does not shuffle data between machines. Scale by giving the host more cores, more RAM, and more disk – the DuckDB / Polars / Kettle model. If a single host genuinely can’t fit the work, partition the input by file or by key and run multiple clinker invocations from a shell script; that’s a five-line bash script, not an architectural addition.

Why Clinker?

Single binary, zero dependencies. Download it, run it. No JVM, no Python, no package manager. Runs on Linux, macOS, and Windows out of the box — CI builds and tests on all three, and the spill, staging, and RSS-sampling layers have platform-specific paths so behavior is consistent across them.

Good neighbor on busy servers. Clinker enforces a strict memory ceiling (default 512 MB) so it can run alongside JVM applications, databases, and other services without competing for RAM. Aggregation spills to disk when memory pressure rises.

Reproducible output. Given the same input and pipeline, Clinker produces byte-identical output across runs. No nondeterminism from thread scheduling, hash randomization, or floating-point reordering.

Operability-first design. Per-stage metrics, dead-letter queues for error records, explain plans for understanding execution, and structured exit codes for scripting. Built for production from day one.

Two binaries:

Binary	Purpose
`clinker`	Run pipelines against real data
`cxl`	Check, evaluate, and format CXL expressions interactively

A taste of Clinker

Here is a complete pipeline that reads a customer CSV, filters to active customers, classifies them into tiers, and writes the result:

pipeline:
  name: customer_etl

nodes:
  - type: source
    name: customers
    config:
      name: customers
      type: csv
      path: "./data/customers.csv"
      schema:
        - { name: customer_id, type: int }
        - { name: first_name, type: string }
        - { name: last_name, type: string }
        - { name: status, type: string }
        - { name: lifetime_value, type: float }

  - type: transform
    name: enrich
    input: customers
    config:
      cxl: |
        filter status == "active"
        emit customer_id = customer_id
        emit full_name = first_name + " " + last_name
        emit tier = if lifetime_value >= 10000 then "gold" else "standard"

  - type: output
    name: result
    input: enrich
    config:
      name: enriched
      type: csv
      path: "./output/enriched_customers.csv"

Run it:

clinker run customer_etl.yaml

That is the entire workflow. No project scaffolding, no separate configuration files, no compile step. One YAML file, one command.

Next steps

Installation – download the binary and verify it works
Your First Pipeline – build and run a pipeline step by step
Key Concepts – understand the mental model behind Clinker pipelines

Non-Goals

This page lists what Clinker is deliberately not. These are architectural commitments — design surfaces Clinker will not grow into, not just features that haven’t been built yet.

If you arrived here because you were considering Clinker for one of the scenarios below, a different tool is the better choice. Each non-goal names the kind of tool that fits instead.

Not an unbounded stream processor

Clinker reads sources that have an end. A pipeline run is a finite job: Sources read until EOF, the DAG drains, the process exits.

Out of scope:

Kafka topics, Kinesis streams, Pub/Sub subscriptions (long-running consumers without a natural end).
Server-Sent Events, WebSocket subscriptions, webhooks-as-input.
tail -f-style file followers.
Watermarking against wall-clock time.
Exactly-once delivery across process restarts.
Stateful infinite-stream windowing (tumbling / sliding / session windows over event time without a finite boundary).

Right fit instead: Apache Flink, Kafka Streams, Apache Beam in unbounded mode, Vector with streaming sources, Benthos with streaming inputs, Apache NiFi.

Not a multi-process or distributed engine

One clinker run invocation is one operating-system process. Clinker does not spawn worker processes, does not coordinate a cluster, and does not shuffle data between machines.

Out of scope:

Worker-process pools on a single machine.
Multi-machine sharded execution.
Network shuffle between executors.
Cluster managers (Kubernetes operators, YARN, Mesos integrations).
Distributed memory accounting.
Partial-failure recovery across worker boundaries.

Right fit instead: Apache Spark, Trino / Presto, Apache Flink in cluster mode, Apache Beam on Dataflow, Hadoop MapReduce.

Scaling Clinker: give the host more cores, more RAM, more disk — the DuckDB / Polars / Kettle / Hop model. If a single host genuinely can’t fit the work, partition the input by file or by key and run multiple clinker invocations from a shell script. That’s a five-line script, not an architectural addition.
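
The "partition by key" step can be sketched as follows. This is an illustrative Python helper, not part of Clinker: `partition_csv` and its parameters are hypothetical names, and the partitions are held in memory here only so the example is self-contained (a real script would write one file per partition). How each resulting file is handed to its own clinker run invocation, for example one pipeline YAML per partition, is left open because the text does not specify it.

```python
import csv
import io
import zlib


def partition_csv(source, key, n_parts):
    """Split CSV text into n_parts CSV texts by a stable hash of `key`."""
    reader = csv.DictReader(source)
    sinks = [io.StringIO() for _ in range(n_parts)]
    writers = []
    for sink in sinks:
        writer = csv.DictWriter(
            sink, fieldnames=reader.fieldnames, lineterminator="\n"
        )
        writer.writeheader()
        writers.append(writer)
    for row in reader:
        # crc32 is stable across processes, unlike Python's str hash().
        index = zlib.crc32(row[key].encode("utf-8")) % n_parts
        writers[index].writerow(row)
    return [sink.getvalue() for sink in sinks]
```

Because the hash depends only on the key value, every record for a given key lands in the same partition, so per-key aggregates computed by separate invocations do not need to be combined afterwards.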

Not a long-running service

Clinker is a CLI binary, not a server. There is no daemon mode, no HTTP control plane, no JDBC/ODBC listener, no UI server, no scheduled job runner inside Clinker itself.

Out of scope:

HTTP API exposing pipeline execution.
Built-in cron / scheduler / orchestrator.
Persistent connection pool living across pipeline runs.
A long-lived process accepting new pipeline submissions over a socket.

Right fit instead:

For scheduling: cron, systemd timers, Airflow, Dagster, Prefect, Temporal.
For HTTP-fronted ETL: any of the above orchestrators wrapping clinker run invocations.
For interactive queries against finite data: DuckDB, Polars, or any embedded query engine.

Orchestration by Temporal (and the others above) is supported via a shell-out contract — the orchestrator runs clinker run as a child process and reads its exit code, logs, and metrics. Clinker embeds no Temporal client or worker; that coupling is a decided non-goal (issue #622). See Running Under a Workflow Orchestrator for the exit-code, cancellation, and output-atomicity guarantees that contract depends on.

Not an OLAP / SQL query engine

Clinker is a per-record expression engine built from explicit nodes in a DAG. It does not parse SQL, does not apply cost-based join optimization across the whole pipeline, and does not present a relational table model.

Out of scope:

SQL parsing (the CXL language is the surface; no SELECT ... FROM is accepted).
Cost-based join reordering across more than the local Combine node.
Materialized views or query caching.
Interactive query latencies under a second.
ANSI-SQL semantics for NULL, type coercion, or aggregate behavior.

Right fit instead: DuckDB, ClickHouse, DataFusion, Trino, Postgres, or any RDBMS. If you want SQL-driven transformation over files, DuckDB is the closest single-binary alternative to Clinker for the cases where SQL is the right surface.

Not a connector marketplace

Clinker ships with a deliberately small set of source and sink types: CSV, JSON, XML, fixed-width, EDIFACT, X12, HL7 v2, and SWIFT MT files, plus a finite-cursor REST source. Writing to a network endpoint is not supported: a REST Output sink (issue #224) and finite-cursor SQL sources and sinks (#225, #226) are tracked but unbuilt. There is no plugin registry, no third-party connector store, no SaaS-API catalog.

Out of scope:

Hundreds of pre-built SaaS integrations (Salesforce, HubSpot, Stripe, etc.).
A central registry of community-maintained connectors.
Schema discovery against arbitrary external APIs.
Change-data-capture (CDC) sources.

Right fit instead: Airbyte, Fivetran, Stitch, Singer with its tap ecosystem, dlt (data load tool).

Not a streaming-CDC engine

Clinker treats each pipeline run as a fresh, finite pass over the input. It does not maintain a persistent log of source changes, does not replicate row-level changes from a database, and does not produce an append-only stream of inserts / updates / deletes.

Out of scope:

Postgres logical replication subscriptions.
MySQL binlog tailing.
Debezium-style CDC stream production.
Maintaining a target database in continuous sync with a source.

Right fit instead: Debezium, Maxwell, AWS DMS, Striim, Estuary Flow, or vendor-native CDC like Snowflake Streams.

What Clinker is

For the positive framing, see the Introduction and Key Concepts. The short version:

A pure-Rust, single-binary, bounded-memory batch DAG executor for finite file and finite-cursor inputs.
Per-record evaluation through a directed acyclic graph of Source, Transform, Aggregate, Route, Merge, Combine, Output, and Composition nodes.
Pipelines declared in YAML, transformation logic written in CXL (a custom per-record expression language).
One process, finite job, EOF-then-exit. Disk spill under memory pressure rather than OOM.

Installation

Clinker is a single static binary with no runtime dependencies. Download it, put it on your PATH, and you are ready to go.

Binaries

Clinker ships two binaries:

clinker – the pipeline executor. This is the main tool you use to validate and run pipelines against data.
cxl – the CXL expression checker, evaluator, and formatter. Use it during development to test expressions interactively, check types, and format CXL blocks.

Verify installation

After placing the binaries on your PATH, confirm they work:

clinker --version

clinker 0.1.0

cxl --version

cxl 0.1.0

Both commands should print a version string and exit. If you see command not found, check that the directory containing the binaries is in your PATH.

Building from source

Clinker requires Rust 1.91+ (edition 2024). If you have a Rust toolchain installed, build and install both binaries directly from the repository:

# Clone the repository
git clone https://github.com/rustpunk/clinker.git
cd clinker

# Install the pipeline executor
cargo install --path crates/clinker

# Install the CXL expression tool
cargo install --path crates/cxl-cli

This compiles release-optimized binaries and places them in ~/.cargo/bin/, which is typically already on your PATH.

To verify the build:

cargo test --workspace

This runs the full test suite (approximately 1100 tests) and confirms everything is working correctly on your system.

Rust toolchain

The repository includes a rust-toolchain.toml that pins the exact Rust version. If you use rustup, it will automatically download the correct toolchain when you build.

Requirement	Value
Rust edition	2024
Minimum version	1.91
C dependencies	None

Your First Pipeline

This walkthrough builds a pipeline from scratch, runs it, and explores the tools Clinker provides for validating and understanding pipelines before they touch real data.

1. Create sample data

Save the following as employees.csv:

id,name,department,salary
1,Alice Chen,Engineering,95000
2,Bob Martinez,Marketing,62000
3,Carol Johnson,Engineering,88000
4,Dave Williams,Sales,71000

2. Write the pipeline

Save the following as my_first_pipeline.yaml:

pipeline:
  name: salary_report

nodes:
  - type: source
    name: employees
    config:
      name: employees
      type: csv
      path: "./employees.csv"
      schema:
        - { name: id, type: int }
        - { name: name, type: string }
        - { name: department, type: string }
        - { name: salary, type: int }

  - type: transform
    name: classify
    input: employees
    config:
      cxl: |
        emit id = id
        emit name = name
        emit department = department
        emit salary = salary
        emit level = if salary >= 90000 then "senior" else "junior"

  - type: output
    name: report
    input: classify
    config:
      name: salary_report
      type: csv
      path: "./salary_report.csv"

This pipeline has three nodes:

employees (source) – reads the CSV file and declares the schema.
classify (transform) – passes all fields through and adds a level field based on salary.
report (output) – writes the result to a new CSV file.

The input: field on each consumer node wires the DAG together. Data flows from employees through classify to report.

3. Validate before running

Before processing any data, check that the pipeline is well-formed:

clinker run my_first_pipeline.yaml --dry-run

Dry-run parses the YAML, resolves the DAG, and type-checks all CXL expressions against the declared schemas. If there are errors – a typo in a field name, a type mismatch, a missing input: reference – Clinker reports them with source-location diagnostics and stops. No data is read.

4. Preview records

To see what the output will look like without writing files, preview a few records:

clinker run my_first_pipeline.yaml --dry-run -n 2

This reads the first 2 records from the source, runs them through the pipeline, and prints the results to the terminal. Useful for sanity-checking transformations before committing to a full run.

5. Understand the execution plan

To see how Clinker will execute the pipeline:

clinker run my_first_pipeline.yaml --explain

The explain plan shows the DAG topology, the order nodes will execute, per-node parallelism strategy, and schema propagation through the pipeline. This is valuable for understanding complex pipelines with routes, merges, and aggregations.

6. Run it

clinker run my_first_pipeline.yaml

Clinker reads employees.csv, applies the transform, and writes salary_report.csv. The output:

id,name,department,salary,level
1,Alice Chen,Engineering,95000,senior
2,Bob Martinez,Marketing,62000,junior
3,Carol Johnson,Engineering,88000,junior
4,Dave Williams,Sales,71000,junior

Alice’s salary of 95,000 is at or above the 90,000 threshold, so she is classified as senior. Everyone else falls below it and is junior.

What just happened

The pipeline executed as a streaming process:

The source node read employees.csv one record at a time.
Each record flowed through the classify transform, which evaluated the CXL block to produce the output fields.
The output node wrote each transformed record to salary_report.csv.

At no point was the entire dataset loaded into memory. This is how Clinker processes files of any size under its memory ceiling.
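
The record-at-a-time flow of this walkthrough can be imitated with Python generators. This is a model of the idea, not of Clinker's internals: `read_records`, `classify`, and `write_records` are hypothetical stand-ins for the three nodes, and the input is inlined as a string so the sketch runs on its own.

```python
import csv
import io

EMPLOYEES = """id,name,department,salary
1,Alice Chen,Engineering,95000
2,Bob Martinez,Marketing,62000
3,Carol Johnson,Engineering,88000
4,Dave Williams,Sales,71000
"""


def read_records(source):
    """Source: yield one typed record at a time."""
    for row in csv.DictReader(source):
        yield {
            "id": int(row["id"]),
            "name": row["name"],
            "department": row["department"],
            "salary": int(row["salary"]),
        }


def classify(records):
    """Transform: mirror the emit statements of the classify node."""
    for rec in records:
        yield {
            "id": rec["id"],
            "name": rec["name"],
            "department": rec["department"],
            "salary": rec["salary"],
            "level": "senior" if rec["salary"] >= 90000 else "junior",
        }


def write_records(records, sink):
    """Output: write each record as soon as it arrives."""
    writer = None
    for rec in records:
        if writer is None:
            writer = csv.DictWriter(
                sink, fieldnames=list(rec), lineterminator="\n"
            )
            writer.writeheader()
        writer.writerow(rec)


out = io.StringIO()
write_records(classify(read_records(io.StringIO(EMPLOYEES))), out)
print(out.getvalue(), end="")
```

Each generator pulls a single record from the stage before it, so only one record is in flight at any moment regardless of how long the input is.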

Next steps

Key Concepts – understand the building blocks of Clinker pipelines
Pipeline YAML Structure – full reference for pipeline configuration
CXL Overview – learn the expression language in depth

Key Concepts

This page covers the mental model behind Clinker pipelines. If you have experience with other ETL tools, most of this will feel familiar – but pay attention to where Clinker diverges, especially around CXL, per-record evaluation, and the memory budget.

Batch jobs, not unbounded streams

A Clinker run is a finite batch job. Source nodes read their files until EOF, the DAG drains, and the process exits. There are no watermarks against wall-clock time, no infinite-source semantics, no exactly-once delivery across restarts. If you have used Flink, Kafka Streams, or Beam in unbounded mode: Clinker is not that.

The word “streaming” in Clinker’s documentation always refers to per-record evaluation within a single batch run – records flow through the graph one at a time rather than being materialized as a whole table – not to long-running stream-processor semantics. Internal identifiers in the codebase (function names like streaming_output_task, config fields like strategy: streaming, error messages, log lines) use the word in the same row-by-row sense; if you see it in a stack trace, it is not Flink leaking through.

Finite inputs only

Clinker reads sources that have an end. Files are the canonical shape, and finite-cursor network sources (paginated REST APIs with hard page/record caps) fit the same model – they exhaust their cursor and EOF. Unbounded sources (Kafka, Kinesis, Server-Sent Events, webhooks, tail -f-style file followers) are explicitly out of scope and will remain so.

Single process, ever

One clinker run invocation is one OS process. Parallelism happens inside that process via threads. Clinker does not spawn worker processes, does not coordinate a cluster, and does not shuffle data between machines. Scale by giving the host more cores, more RAM, more disk – the DuckDB / Polars / Kettle model. If a single host genuinely can’t fit the work, partition the input by file or by key and run multiple clinker invocations from a shell script; that’s a five-line script, not an architectural addition to Clinker.

For the full list of what Clinker deliberately does not do, see Non-Goals.

Pipelines are DAGs

A pipeline is a directed acyclic graph of nodes. Data flows from sources, through processing nodes, to outputs. There are no cycles – a node cannot consume its own output, directly or indirectly.

You define the graph by setting input: on each consumer node, naming the upstream node it reads from. Clinker resolves these references, validates that the graph is acyclic, and determines execution order automatically.
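
Resolving references, rejecting cycles, and deriving an execution order is a topological sort. The sketch below uses Python's standard graphlib module to show the idea; `execution_order` and the dictionary shape (node name mapped to its upstream names) are assumptions for the example and say nothing about how Clinker's planner is actually written.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(nodes):
    """nodes: mapping of node name -> list of upstream node names."""
    for name, upstreams in nodes.items():
        for upstream in upstreams:
            if upstream not in nodes:
                raise ValueError(
                    f"node '{name}' reads from unknown node '{upstream}'"
                )
    try:
        # static_order() yields every node after all of its upstreams.
        return list(TopologicalSorter(nodes).static_order())
    except CycleError as err:
        # args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"pipeline graph has a cycle: {err.args[1]}") from err
```

For the customer pipeline from the introduction, `execution_order({"customers": [], "enrich": ["customers"], "result": ["enrich"]})` returns the nodes in source-to-output order, while a graph in which two nodes read from each other raises an error before any data is touched.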

The `nodes:` list

Every pipeline has a single flat list of nodes. Each node has a type: discriminator that determines its behavior. The eight node types are:

Type	Purpose
`source`	Read data from a file (CSV, JSON, XML, fixed-width)
`transform`	Apply CXL logic to reshape, filter, or enrich records
`aggregate`	Group records and compute summary values (sum, count, etc.)
`route`	Split a stream into named ports based on conditions
`merge`	Concatenate multiple streams that share a schema
`combine`	Join records across N inputs with cross-input predicates
`output`	Write data to a file
`composition`	Embed a reusable sub-pipeline

You can have as many nodes of each type as your pipeline requires. The only constraint is that the resulting graph must be a valid DAG.

CXL is not SQL

CXL is a per-record expression language. Each record flows through a CXL block independently – there is no table-level context, no SELECT, no FROM, no JOIN. Think of it as a programmable row mapper.

The core statements:

emit name = expr – produce a field in the output record. Only emitted fields appear downstream. If you want to pass a field through unchanged, you must emit it explicitly: emit id = id.
let name = expr – bind a local variable for use in later expressions. Local variables do not appear in the output.
filter condition – discard the record if the condition is false. A filtered record produces no output and is not counted as an error.
distinct / distinct by field – deduplicate records. distinct deduplicates on all output fields; distinct by field deduplicates on a specific field.

CXL uses and, or, and not for boolean logic – not && or ||. String concatenation uses +. Conditional expressions use if ... then ... else ... syntax.

System namespaces use a $ prefix: $pipeline.*, $source.*, $record.*, $window.*, and $vars.*. These provide access to pipeline and source context, per-record scoped state, window function state, and static configuration respectively.

Per-record evaluation and the memory budget

Within a run, stateless operators evaluate records one at a time: a CXL block sees exactly one record, with no table-level context and no per-record state carried across records. Clinker does not load an entire file into memory before processing it. This is what “streaming” means in Clinker – row-by-row evaluation inside a finite batch job, not Flink-style unbounded stream processing.

“One at a time” describes the evaluation model, not the transport. For efficiency, records move between stages in bounded batches (default 2048 events, tunable per stage or pipeline-wide via batch_size) rather than as individual messages – but each record is still evaluated independently against the CXL block. The batch is a handoff unit, not a window: an operator never needs the whole batch to process any single record. (Blocking operators – Aggregate, sort, grace-hash Combine – are the exception that do accumulate across records; see below.)
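
The difference between a handoff batch and a window can be shown in a short Python sketch. The helper names `batches` and `transform_stage` are hypothetical; only the default of 2048 comes from the text.

```python
from itertools import islice


def batches(records, batch_size=2048):
    """Group an iterator of records into lists of at most batch_size."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def transform_stage(upstream_batches, fn):
    """Apply fn to each record on its own; the batch is only the transport."""
    for batch in upstream_batches:
        yield [fn(record) for record in batch]
```

Because `fn` sees one record at a time, changing `batch_size` alters how many records travel together between stages but never the records that come out.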

Per-record evaluation keeps per-row memory usage bounded for the stateless parts of the graph (Transform, Route, Merge, most Combine probe-side work, Output). Every stage is charged against the configured RSS budget. Fused Source → Transform → Output paths run streaming, with no per-stage materialization, so a 100 GB CSV passes through with the same footprint as a 100 KB CSV. A stage that hands its output to a single downstream sink Output also avoids a charged inter-stage buffer – single-branch Route, non-fused Merge, streaming Aggregate, and the Combine probe-side stream their result straight to the writer (see Streaming vs. Blocking Stages).

The remaining boundaries – multi-branch Route fan-out, output that forks to several consumers, Composition bodies, diamond DAGs – materialize records into per-stage buffers that charge against the same budget envelope. Every materialized buffer can spill past the soft threshold, including buffers shared by several readers and Route/Cull output-port buffers. Readers run sequentially over the same immutable memory-or-spill backing; each opens one cursor, and the final reader takes the authoritative buffer regardless of declaration or dispatch order.

A consumer that needs a full resident vector reserves that materialization first. If the overlap would exceed the hard limit, the engine fails before allocating with a structured E310 MemoryBudgetExceeded diagnostic that names the consumer.

Use clinker run --explain to see which nodes will materialize (buffer: materialized) versus which will stream (buffer: streaming) before runtime – that label is the canonical “which stages charge the budget” signal. See the --explain reference and the memory-tuning page.

Stateful operators must accumulate. Aggregate, sort, and grace-hash Combine cannot emit until they have seen enough input – sums need every addend, a full sort needs the last row, a hash join needs the build side complete. These operators run inside the configured RSS budget (default 512 MB) and degrade gracefully under pressure rather than exhausting memory:

Aggregate uses hash aggregation by default and spills partitions to disk when soft/hard memory thresholds trip. When the input is already sorted by the group key, the planner picks streaming aggregation, which requires only constant memory.
Sort spills runs to disk and merges them.
Combine picks among in-memory hash join, grace hash join (spilled), and IEJoin / sort-merge depending on predicates and memory pressure. A pure-range Combine (band join with no equality key) runs the block-band IEJoin, which external-sorts each side and spills a matched-output sort to disk so both its input and its result stay inside the budget.
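
The contrast between hash aggregation and streaming aggregation over sorted input can be made concrete. The two Python functions below are an illustrative model with hypothetical names; they compute a per-group sum and ignore spilling entirely.

```python
from itertools import groupby
from operator import itemgetter


def hash_aggregate(records, key, value):
    """Hash aggregation: one accumulator per group, held until input ends."""
    totals = {}
    for rec in records:
        totals[rec[key]] = totals.get(rec[key], 0) + rec[value]
    yield from totals.items()


def streaming_aggregate(sorted_records, key, value):
    """Streaming aggregation: input must already be sorted by `key`.

    Only the current group's accumulator is live, so memory does not grow
    with the number of groups.
    """
    for group, members in groupby(sorted_records, key=itemgetter(key)):
        yield group, sum(rec[value] for rec in members)
```

The hash variant cannot emit anything until the input is exhausted and its table grows with the number of distinct groups, which is why it needs a spill path. The streaming variant can emit a group as soon as the key changes, but it gives wrong answers if the input is not actually sorted by the group key.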

The memory ceiling is a first-class promise. Clinker is designed to share a server with JVM applications, databases, and other services without competing for RAM.

Input wiring

Consumer nodes reference their upstream via the input: field:

- type: transform
  name: enrich
  input: customers    # reads from the node named "customers"

Route nodes produce named output ports. Downstream nodes reference a specific port using dot notation:

- type: route
  name: split_by_region
  input: customers
  config:
    routes:
      us: region == "US"
      eu: region == "EU"
    default: other

- type: output
  name: us_output
  input: split_by_region.us    # reads from the "us" port

Merge nodes accept multiple inputs using inputs: (plural):

- type: merge
  name: combined
  inputs:
    - us_transform
    - eu_transform

Schema declaration

Source nodes require an explicit schema: that declares every column’s name and type:

config:
  schema:
    - { name: customer_id, type: int }
    - { name: email, type: string }
    - { name: balance, type: float }
    - { name: created_at, type: date }

Clinker uses these declarations to type-check CXL expressions at compile time, before any data is read. If a CXL block references a field that does not exist in the upstream schema, or applies an operation to an incompatible type, the error is caught during validation – not at row 5 million of a production run.

Supported types include int, float, string, bool, date, and datetime.
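
A toy version of that validation step shows why a declared schema is enough to catch these mistakes without reading data. Everything here is hypothetical: the `check_comparisons` helper, the representation of an expression as (field, literal) pairs, and the `LITERAL_TYPES` compatibility table are assumptions for the sketch and are not Clinker's type rules.

```python
SCHEMA = {
    "customer_id": "int",
    "email": "string",
    "balance": "float",
    "created_at": "date",
}

# Python literal types accepted for each declared type (sketch assumption).
LITERAL_TYPES = {
    "int": (int,),
    "float": (int, float),
    "string": (str,),
    "bool": (bool,),
}


def check_comparisons(schema, comparisons):
    """comparisons: list of (field, literal) pairs; returns error strings."""
    errors = []
    for field, literal in comparisons:
        declared = schema.get(field)
        if declared is None:
            errors.append(f"unknown field '{field}'")
        elif declared in LITERAL_TYPES and not isinstance(
            literal, LITERAL_TYPES[declared]
        ):
            errors.append(
                f"field '{field}' is {declared}, "
                f"compared with {type(literal).__name__}"
            )
    return errors
```

Checking `("emial", "x")` reports an unknown field and `("customer_id", "42")` reports an int compared with a string, both before a single row has been parsed.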

Error handling

A pipeline picks one error handling strategy, in the top-level error_handling: block:

Strategy	Behavior
`fail_fast`	Stop the pipeline on the first error (default)
`continue`	Route error records to a dead-letter queue file and continue

When using continue, Clinker writes rejected records to a DLQ file alongside the output. Each DLQ entry includes the original record, the error category, the error message, and the node that rejected it. This makes diagnosing production issues straightforward: check the DLQ, fix the data or the pipeline, and rerun. A run that dead-letters at least one record exits with code 2 rather than 0, so a scheduler can tell a clean run from a partial one.
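
A scheduler-side wrapper can act on that exit code. The Python sketch below is one possible shape, with a hypothetical `run_job` helper. The text only documents 0 for a clean run and 2 for a run that dead-lettered records, so treating every other code as a failure is an assumption of this example.

```python
import subprocess


def run_job(argv):
    """Run a command as a child process and classify its exit code."""
    code = subprocess.run(argv).returncode
    if code == 0:
        return "clean"
    if code == 2:
        return "partial"  # at least one record went to the DLQ
    return "failed"       # assumption: any other code is a failure


# Intended use (requires clinker on PATH):
#   status = run_job(["clinker", "run", "my_first_pipeline.yaml"])
```

A caller might publish the output on "clean", publish it and raise an alert pointing at the DLQ file on "partial", and stop the downstream steps on "failed".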

See Error Handling & DLQ for the DLQ columns, the error categories, and the per-source options.

Pipeline YAML Structure

A Clinker pipeline is a single YAML file with three top-level sections: pipeline (metadata), nodes (the processing graph), and optionally error_handling.

Top-level shape

pipeline:
  name: my_pipeline            # Required — pipeline identifier
  memory:                      # Optional — see ops/memory.md
    limit: "256M"              # Optional (K/M/G suffixes), default 512M
    backpressure: pause        # Optional, default `pause`
  vars:                        # Optional key-value pairs
    threshold: 500
    label: "Monthly Report"
  date_formats: ["%Y-%m-%d"]   # Optional — custom date parsing formats
  rules_path: "./rules/"       # Optional — CXL module search path
  concurrency:                 # Optional
    threads: 4
    chunk_size: 1000
  metrics:                     # Optional
    spool_dir: "./metrics/"

nodes:                         # Required — flat list of pipeline nodes
  - type: source
    name: raw_data
    config:
      name: raw_data
      type: csv
      path: "./data/input.csv"
      schema:
        - { name: id, type: int }
        - { name: value, type: string }

  - type: transform
    name: clean
    input: raw_data
    config:
      cxl: |
        emit id = id
        emit value = value.trim()

  - type: output
    name: result
    input: clean
    config:
      name: result
      type: csv
      path: "./output/result.csv"

error_handling:                # Optional
  strategy: fail_fast

Pipeline metadata

The pipeline: block carries global settings that apply to the entire run.

Field	Required	Description
`name`	Yes	Pipeline identifier. Used in logs and metrics.
`memory`	No	Memory-arbitrator tuning. Nested fields: `limit` (RSS budget, `K`/`M`/`G` suffixes, default `512M`) and `backpressure` (`spill`/`pause`/`both`, default `pause`). See Memory Tuning.
`vars`	No	Scalar constants accessible in CXL via `$vars.*`.
`date_formats`	No	List of `strftime`-style patterns for date parsing.
`rules_path`	No	Directory for CXL `use` module resolution.
`concurrency`	No	`threads` and `chunk_size` for parallel chunk processing.
`metrics`	No	`spool_dir` for per-run JSON metric files.
`date_locale`	No	Locale for date formatting.
`include_provenance`	No	Attach provenance metadata to records.
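
To make the `limit` notation concrete, here is a small Python parser for values such as "256M". It is a sketch under stated assumptions: the `parse_limit` name is hypothetical, the suffixes are taken to be binary multiples (1024-based), and values without a suffix are rejected, none of which is confirmed by the table above.

```python
import re

# Assumption: K/M/G are binary multiples.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_limit(text="512M"):
    """Convert a limit such as '256M' into a byte count."""
    match = re.fullmatch(r"(\d+)([KMG])", text.strip())
    if match is None:
        raise ValueError(f"invalid memory limit: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]
```

Under these assumptions `parse_limit("256M")` is 268435456 bytes and the default "512M" is 536870912 bytes.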

The nodes list

Every pipeline has a flat nodes: list. Each entry is a node with a type: discriminator that determines its kind:

Type	Role
`source`	Reads data from a file
`transform`	Applies CXL expressions to each record
`aggregate`	Groups and summarizes records
`route`	Splits records into named branches by condition
`merge`	Concatenates multiple upstream branches that share a schema
`combine`	Joins records across N inputs with `where:` predicates
`output`	Writes records to a file
`composition`	Imports a reusable transform fragment

Node naming

Every node must have a name: field. Names must be unique within the pipeline and must not contain dots – the dot character is reserved for port syntax (see below). Names are used for wiring, logging, and diagnostics.

Wiring: input and inputs

Nodes connect to each other through input: (singular) and inputs: (plural) fields that live at the node’s top level, alongside name: and type:.

Single upstream – used by transform, aggregate, route, and output nodes:

- type: transform
  name: clean
  input: raw_data       # References the source node named "raw_data"
  config: ...

Port syntax – for consuming a specific branch from a route node, use node.port:

- type: output
  name: high_value_out
  input: split.high     # Consumes the "high" branch of route node "split"
  config: ...

Multiple upstreams – merge nodes use inputs: (plural) in place of input: (singular):

- type: merge
  name: combined
  inputs:
    - east_processed
    - west_processed
  config: {}

Source nodes have no input field. They are entry points – adding an input: field to a source is a parse error.

Using inputs: on a non-merge node (or input: on a merge node) is caught at parse time by deny_unknown_fields.
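
The naming and wiring rules in this section can be collected into one check. The Python sketch below models them over plain dictionaries; `check_wiring` and `split_reference` are hypothetical helpers, and Clinker itself enforces these rules during parsing rather than in a separate pass like this.

```python
def check_wiring(nodes):
    """nodes: list of dicts with 'type', 'name', and 'input' or 'inputs'."""
    errors = []
    seen = set()
    for node in nodes:
        name, kind = node["name"], node["type"]
        if "." in name:
            errors.append(f"{name}: node names must not contain dots")
        if name in seen:
            errors.append(f"{name}: duplicate node name")
        seen.add(name)
        if kind == "source" and ("input" in node or "inputs" in node):
            errors.append(f"{name}: source nodes take no input")
        elif kind == "merge" and "input" in node:
            errors.append(f"{name}: merge nodes use inputs:, not input:")
        elif kind not in ("source", "merge") and "inputs" in node:
            errors.append(f"{name}: only merge nodes use inputs:")
    return errors


def split_reference(reference):
    """Split 'node.port' into (node, port); port is None for a plain name."""
    node, _, port = reference.partition(".")
    return node, port or None
```

Reserving the dot for ports is what makes `split_reference` unambiguous: "split.high" can only mean port "high" of node "split", never a node whose name happens to contain a dot.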

Optional fields on all nodes

Every node type supports these optional fields:

description: – human-readable text for documentation. Ignored by the engine.
_notes: – arbitrary metadata (JSON object). Ignored by the engine and available to external tooling.

- type: transform
  name: enrich
  description: "Add customer tier based on lifetime value"
  _notes:
    color: "#4a9eff"
    position: { x: 300, y: 200 }
  input: customers
  config:
    cxl: |
      emit tier = if lifetime_value >= 10000 then "gold" else "standard"

Strict parsing

All config structs use deny_unknown_fields. If you misspell a field name – for example, writing inputt: instead of input: or stratgy: instead of strategy: – the YAML parser rejects it immediately with a diagnostic pointing to the typo. This catches configuration errors before any data processing begins.

Environment variable: CLINKER_ENV

The CLINKER_ENV environment variable can be used for conditional logic outside of pipelines (e.g., selecting channel directories or controlling CLI behavior). It is not directly referenced within pipeline YAML but is available to the channel and workspace systems.

Scoped Variables

Clinker’s scoped-variable system lets a pipeline read and write named values at three lifetimes: the pipeline run, the source, and the record. Each variable is declared on the Transform that writes it via that Transform’s declares: block (type, scope, optional default), written by the same Transform’s CXL with emit $<scope>.<name> = ..., and read inline from any downstream node via the $pipeline.*, $source.*, and $record.* namespaces.

The three scopes

Scope	Lifetime	Reset	Reader namespace
`pipeline`	Entire pipeline run	Never (per run)	`$pipeline.<key>`
`source`	One per source file (`Arc<str>`-keyed)	Per source-file	`$source.<key>`
`record`	A single record as it flows through nodes	Per record	`$record.<key>`

Record-scope variables are the per-record private store: every transform along the row’s path can read them, but they never serialize as output columns unless explicitly re-emitted as a regular column. They are written with emit $record.<key> = ... from a transform that declares them.

Declaring variables

A scoped variable is declared on the Transform that writes it, in that Transform’s config.declares: list. Each entry is named, scoped, typed, and optionally given a default that satisfies reads firing before the writer has run:

- type: transform
  name: enrich
  input: orders
  config:
    declares:
      - { name: cutoff_date,  scope: pipeline, type: date,   default: "2024-01-01" }
      - { name: ingest_label, scope: source,   type: string, default: "prod" }
      - { name: fuzzy_score,  scope: record,   type: float }
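    # $pipeline.canonical_name (read below) is assumed to be declared by an upstream Transform not shown here.
    cxl: |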
      emit id = id
      emit $pipeline.cutoff_date = "2024-01-01"
      emit $source.ingest_label = $source.file.file_stem()
      emit $record.fuzzy_score = fuzzy_match(name, $pipeline.canonical_name)

Allowed types: int, float, string, bool, date, date_time.

Each (scope, name) pair must be declared on exactly one Transform — the same pair declared on two Transforms is rejected at config-validation time, ahead of compilation. $pipeline, $source, and $record are flat shared namespaces; declare each name once and read it from every consumer.

The pipeline’s top-level vars: block is a separate, flat registry for static configuration read via $vars.<key> — it does not carry the nested pipeline: / source: / record: scopes:

pipeline:
  name: order_processing
  vars:
    fuzzy_threshold: { type: float, default: 0.85 }   # read as $vars.fuzzy_threshold

Built-in members of each scope are reserved; declaring a user variable with one of these names is rejected at parse time:

$source.file, $source.name, $source.row, $source.path, $source.count, $source.batch, $source.ingestion_timestamp
$pipeline.start_time, $pipeline.name, $pipeline.execution_id, $pipeline.batch_id, $pipeline.total_count, $pipeline.ok_count, $pipeline.dlq_count, $pipeline.filtered_count, $pipeline.distinct_count

`$source.count` semantics

$source.count is the per-source record total for the Source that produced the current record. The total isn’t known until the source finishes, so you can’t use it during per-record evaluation: a read on a mid-stream record (in a Transform, Route, Window, or Merge) resolves to Null, while reads after the source has finished (such as a terminal aggregate emit) resolve to the per-source total.

This means a streaming denominator like value / $source.count yields Null on mid-stream records. If you need a running row counter, declare a scope: source variable on a Transform and increment it from that Transform’s CXL instead.
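
A minimal sketch of such a counter follows. It rests on two assumptions this page does not confirm: that a Transform may read the variable it declares from its own CXL, and that CXL supports + on int values. The names count_rows, rows_seen, and row_no are hypothetical.

```yaml
- type: transform
  name: count_rows
  input: orders
  config:
    declares:
      # default: 0 satisfies the read on the first record of each source file
      - { name: rows_seen, scope: source, type: int, default: 0 }
    cxl: |
      emit id = id
      emit $source.rows_seen = $source.rows_seen + 1
      emit row_no = $source.rows_seen
```

Because the variable is source-scoped, the counter would reset per source file, in line with the scope table above.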

Reading variables

CXL access is identical for declared and built-in keys:

- type: transform
  name: filter_recent
  input: orders
  config:
    cxl: |
      emit id = id
      filter received_at > $pipeline.cutoff_date
      emit batch = $source.batch_id
      emit confidence = $record.fuzzy_score

Reads of undeclared keys are rejected with E203 (CXL name resolution failed) at compile time, with a “did you mean” suggestion that scans the declared registry.

Writing variables

A scoped variable is written by the Transform that declares it: list the variable in the Transform’s declares: block and assign it from the same Transform’s CXL with emit $<scope>.<name> = <expr>. The Transform still processes records normally — declaring and writing a scoped var is additive to its ordinary emit/filter logic.

- type: transform
  name: capture_header
  input: salesforce_in
  config:
    declares:
      - { name: batch_id,        scope: source, type: string }
      - { name: ingestion_label, scope: source, type: string }
    cxl: |
      emit id = id
      emit $source.batch_id = batch
      emit $source.ingestion_label = $source.file.file_stem()

- type: transform
  name: row_score
  input: enrich
  config:
    declares:
      - { name: fuzzy_score, scope: record, type: float }
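    # $pipeline.canonical_name (read below) is assumed to be declared by an upstream Transform not shown here.
    cxl: |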
      emit id = id
      emit $record.fuzzy_score = fuzzy_match(name, $pipeline.canonical_name)

An emit $<scope>.<name> write to a variable the Transform does not declare is rejected at compile time. Requiring the declares: entry keeps the dependency between writers and readers visible at plan time.

Init phase: pre-runtime population

Set phase: init on a Transform to pre-compute a $pipeline.* or $source.* value from a config-file source before the main run starts:

- type: source
  name: config_src
  config:
    name: config_src
    type: csv
    path: config.csv
    schema:
      - { name: cutoff, type: int }

- type: aggregate
  name: max_agg
  input: config_src
  config:
    group_by: []
    cxl: |
      emit cap = max(cutoff)

- type: transform
  name: precompute_cutoff
  input: max_agg
  config:
    phase: init
    declares:
      - { name: cutoff_date, scope: pipeline, type: int }
    cxl: |
      emit cap = cap
      emit $pipeline.cutoff_date = cap

Init-phase nodes must be terminal — no runtime-phase node may consume from an init-phase Transform. (Init-phase nodes can chain through init-only descendants for compositions.) Use disjoint Sources for init vs runtime when you need both: a Source shared between an init and a runtime branch only feeds the init pass.

Compile-time validation

Scoped variables are checked before the run starts. Every reference and every writer is validated, and every flow from a writer to its readers is checked against the pipeline. Each code below tells you what to fix.

Code	What it catches
E109	Channel targets a composition but carries `vars:` overrides.
E116	Channel var changes an existing type, or any default mismatches its declared type.
E117	Channel var name shadows a reserved system field for that scope.
E118	Channel `vars.source.<src>` references an unknown source-node name.
E164	An init-phase Transform has a runtime descendant.
E171	A reader is not a transitive DAG descendant of its writer.
E172	Bare `$source.<custom>` read downstream of a Merge or Combine.
E173	Composition body reads a parent scoped var without opting in.
E174	Composition `_compose.scoped_vars` declares a different type than the parent.
E175	An init-phase node reads a runtime-only writer’s variable.
E203	A reference to an undeclared scoped variable (resolver-level failure).

Declaring the same (scope, name) pair on two Transforms is likewise rejected before the run starts, because $pipeline, $source, and $record are flat shared namespaces (see Declaring variables).

Each diagnostic points at the exact place you read or wrote the variable, plus the conflicting writer or parent declaration, so the report lands where you can act on it rather than in some unrelated configuration block.
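
As an illustration of E171, the sketch below wires a reader as a sibling of its writer rather than a descendant. The node and column names (flag_big, sibling_report, amount) are hypothetical, and the comparison on amount assumes a numeric column.

```yaml
- type: transform
  name: flag_big
  input: orders
  config:
    declares:
      - { name: is_big, scope: record, type: bool }
    cxl: |
      emit id = id
      emit $record.is_big = amount >= 1000

- type: transform
  name: sibling_report
  input: orders            # reads from orders, not flag_big: not a descendant of the writer -> E171
  config:
    cxl: |
      emit id = id
      emit big = $record.is_big
```

Under the rule in the table, changing sibling_report to input: flag_big would make it a transitive descendant of the writer and clear the diagnostic.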

Post-merge access: qualified `$source.<input>.<key>`

After a Merge or Combine, the bare $source.<custom> form is ambiguous: each record carries its own source’s value, but the reader’s intent is usually to compare across inputs. E172 rejects the unqualified form and the qualified form is the legal alternative:

- type: transform
  name: read_after_merge
  input: merged
  config:
    cxl: |
      emit id = id
      emit lt = $source.left_input.left_label
      emit rt = $source.right_input.right_label

The <input_name> segment matches the named input on the Combine (its IndexMap key) or the upstream node name on the Merge.
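
For context, here is a sketch of the upstream wiring the example above presumes. Two Transforms named left_input and right_input each declare a source-scoped label, and a Merge named merged joins them. The left_src and right_src Source names are hypothetical.

```yaml
- type: transform
  name: left_input
  input: left_src
  config:
    declares:
      - { name: left_label, scope: source, type: string }
    cxl: |
      emit id = id
      emit $source.left_label = $source.file.file_stem()

- type: transform
  name: right_input
  input: right_src
  config:
    declares:
      - { name: right_label, scope: source, type: string }
    cxl: |
      emit id = id
      emit $source.right_label = $source.file.file_stem()

- type: merge
  name: merged             # readers below this node use $source.left_input.* / $source.right_input.*
  inputs:
    - left_input
    - right_input
  config: {}
```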

Composition opt-in

A composition body cannot see parent scoped variables by default — the seal is enforced by E173. To pass values across the boundary, the composition declares the schema of parent vars it consumes in its _compose.scoped_vars block:

# read_pipeline_var.comp.yaml
_compose:
  name: read_pipeline_var
  inputs:
    inp:
      schema:
        - { name: id, type: int }
  outputs:
    out: tap
  scoped_vars:
    pipeline:
      cutoff:
        type: int

nodes:
  - type: transform
    name: tap
    input: inp
    config:
      cxl: |
        emit id = id
        emit cutoff_seen = $pipeline.cutoff

The parent must declare cutoff with the matching type; mismatches raise E174.
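
On the parent side, the matching declaration might look like the sketch below. The set_cutoff and orders names and the literal value are hypothetical, and the composition node that instantiates read_pipeline_var is omitted because its config shape is not covered here.

```yaml
- type: transform
  name: set_cutoff
  input: orders
  config:
    declares:
      # type: int matches _compose.scoped_vars.pipeline.cutoff; any other type would raise E174
      - { name: cutoff, scope: pipeline, type: int, default: 0 }
    cxl: |
      emit id = id
      emit $pipeline.cutoff = 20240101
```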

What scoped variables are not

These are intentional non-features:

No persistence across runs. State is in-memory only. A pipeline run starts with declaration defaults; the writes don’t survive the process.
No undeclared writes. A Transform may only write a scoped variable it lists in declares:; an emit $pipeline.x to an undeclared name is a compile error. Requiring the declaration keeps every writer visible at plan time and the writer→reader dependency explicit in the DAG.
No dynamic var creation. The set of variables is closed at plan time, by design. This bounds memory and makes the validation matrix above tractable.

Channel overrides

A channel can both override a pipeline’s declaration defaults and add new entries across all four registries ($vars.*, $pipeline.*, $source.*, $record.*). Each registry has its own sub-block under vars: on a .channel.yaml, and each entry uses the same { type, default } shape that pipeline-side declarations use:

# Pipeline declarations
pipeline:
  name: orders
  vars:
    fuzzy_threshold: { type: float, default: 0.85 }   # $vars.*
nodes:
  - type: source
    name: orders_src
    config: { name: orders_src, type: csv, path: in.csv,
              schema: [{ name: id, type: int }] }
  - type: transform
    name: enrich
    input: orders_src
    config:
      declares:
        - { name: cutoff_date,  scope: pipeline, type: date,   default: "2024-01-01" }
        - { name: ingest_label, scope: source,   type: string, default: "prod" }
        - { name: tier,         scope: record,   type: string, default: "bronze" }
      cxl: |
        emit id = id

# channel/acme-prod/orders.channel.yaml
channel:
  target: ../../pipeline/orders.yaml
vars:
  static:
    fuzzy_threshold: { type: float, default: 0.95 }
  pipeline:
    cutoff_date: { type: date, default: "2026-01-01" }
  source:
    orders_src:
      ingest_label: { type: string, default: "acme-prod" }
  record:
    tier: { type: string, default: "platinum" }

The overlay lives in the tenant’s folder (channel/acme-prod/) and is applied with --channel acme-prod; the channel.target field is authoritative.

Whether an entry overrides or adds depends on whether its name is already declared:

Override (entry name already declared) – the channel’s type must match the declared type; a mismatch produces E116.
Add (entry name not yet declared) – the registry is extended with a new declaration.

In both cases, a default that does not match the entry’s type also produces E116. $source overrides are keyed by source-node name; an unknown source name produces E118. The reserved-name guard (E117) blocks channels from shadowing system fields like $pipeline.execution_id or $source.path. Channels that target a .comp.yaml may not carry vars: (E109 if they do).
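
A sketch contrasting the cases against the orders pipeline declared above. The currency entry is a hypothetical addition, and the second entry is deliberately wrong to show where E116 would fire.

```yaml
# channel/acme-prod/orders.channel.yaml (variant, for illustration)
channel:
  target: ../../pipeline/orders.yaml
vars:
  static:
    # add: currency is not declared in the pipeline, so this extends $vars.*
    currency: { type: string, default: "USD" }
  pipeline:
    # override with a changed type: cutoff_date is declared as date -> E116
    cutoff_date: { type: string, default: "2026-01-01" }
  source:
    # unknown source-node name (the pipeline's source is orders_src) -> E118
    order_src:
      ingest_label: { type: string, default: "acme-prod" }
```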

See Channels for the full overlay rules and the channel manifest reference.

Channels

Channels make one pipeline serve many tenants. A single base pipeline is authored once; each tenant (a channel) layers its own configuration, variable defaults, and structural changes on top — without copying or editing the base YAML. The system is built for scale: thousands of per-tenant channels against one pipeline, with strict validation and per-value provenance.

A channel is a tenant. A group is a reusable overlay shared by many channels — selected automatically from a channel’s labels, or invoked by name. Everything a channel or group contributes is expressed through two surfaces: value clobber (config: / vars:) and an ordered op list (overrides:).

Workspace layout

Channels live in a channel-centric workspace. A clinker.toml at the workspace root declares the layout roots; the rest is folders of YAML:

workspace/
  clinker.toml                       # declares the [channel] and [group] roots
  pipeline/       *.yaml             # base pipelines  (the pipeline-default layer)
  composition/    *.comp.yaml        # reusable sub-pipelines
  schema/         *.schema.yaml      # shared schemas
  group/          *.group.yaml       # group overlays: selector, priority, overrides
  channel/<tenant>/                  # one folder per channel; the folder name is the channel id
    channel.cfg.yaml                 # channel manifest: labels + channel-wide overlays (optional)
    <target>.channel.yaml            # per-target overlay of a pipeline
    <target>.comp.yaml               # per-target overlay of a composition

The channel folder name is the channel id — channel/globex/ is the globex channel. A --channel globex invocation resolves by a computed path (channel/globex/…), never an O(N) scan of the workspace; the full scan is reserved for channels lint.

`clinker.toml` roots

[channel]
root = "channel"      # per-channel folders live under <root>/<channel-id>/
shard = "none"        # enumeration layout: none (default) | first-char | hash

[group]
root = "group"        # *.group.yaml definitions live here

Both tables are optional; omitting them defaults [channel].root to channel, [channel].shard to none, and [group].root to group. shard is an enumeration-ergonomics choice for very large channel trees (it splits the folder fan-out); a channel is always looked up by computed path regardless of shard scheme, so shard never changes resolution semantics.

The layer model

Every value and every op is attributed to exactly one layer. Layers apply in a fixed semantic order — never lexical or file order:

pipeline-default  <  group(s) by priority  <  channel-wide  <  channel-per-target

pipeline-default — the base pipeline’s own configuration.
group(s) by priority — every group applied to the run, ordered by priority (higher priority applies later and thus wins).
channel-wide — the channel manifest (channel.cfg.yaml): overlays that apply to every pipeline this channel runs.
channel-per-target — the per-target overlay file (<target>.channel.yaml): the highest-precedence layer.

Clobber, never deep-merge

A higher layer’s value replaces the lower layer’s value wholesale. There is no deep-merge and no list-append: overriding a list swaps the entire list. To override individual elements, model them as a keyed map rather than a list, so each element is addressed and replaced by key; the config: surface (keyed by alias.param path) and the patch_schema column map already work this way. Every resolved value maps 1:1 back to the single layer that supplied it, and channels resolve / explain --field report that layer.

Structural ops (overrides:) apply in a total order — layer precedence first, then declaration order within a layer. Collisions are errors, never silent no-ops: adding a node whose name already exists, or targeting a missing or already-removed node, fails with a diagnostic anchored to the offending op.

Overlays are applied pre-compile: layers are resolved, config/vars values are clobbered, the overrides: op streams are concatenated in total order and folded over the base pipeline’s node list, and only then does schema binding and compilation run. One invocation produces one effective plan.

Value clobber: `config` and `vars`

The value-clobber surface carries scalar overrides. It appears identically on a group, a channel manifest, and a per-target overlay.

config: overrides composition config knobs, keyed by alias.param dotted paths (the composition node’s alias, then the parameter name):

config:
  scorer.threshold: 0.95     # override the `threshold` knob of the `scorer` composition node

The override changes executed behavior, not just the rendered provenance: the composition body reads the knob as $config.<param>, which the planner constant-folds to the resolved value for that instantiation at compile time. The winning layer is still recorded in the provenance side-table, so channels resolve / explain --field continue to report which layer supplied the value.

A config: key that matches no parameter in the compiled plan is a hard error (E113) — a misspelled or stale key aborts the run rather than silently doing nothing.

Locking a value: `fixed`

Every layer file (a group, the channel manifest, and a per-target overlay) may carry a fixed: block beside its config: block. Both use the same alias.param dotted-path grammar and the same unknown-key hard error (E113); the difference is precedence. A key under config: is a plain clobber — a higher layer may still override it. A key under fixed: is locked: it holds against every higher-precedence layer, so a lower layer can pin a value the layers above it cannot change.

# channel.cfg.yaml — the channel-wide manifest
channel: { name: globex }
fixed:
  scorer.threshold: 0.9      # locked channel-wide

# order_fulfillment.channel.yaml — the per-target overlay (a higher layer)
channel: { target: ../../pipeline/order_fulfillment.yaml }
config:
  scorer.threshold: 0.95     # ignored: the channel-wide layer locked this key

Here the resolved scorer.threshold is 0.9, not 0.95: the fixed channel-wide value wins even though the per-target layer is higher. When several layers lock the same key, the lowest-precedence lock wins (it pinned the value first). A key present in both config: and fixed: of the same file resolves to the fixed: value. channels resolve marks a locked value with (fixed) next to its winning layer, and explain --field reports the same.
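
The multi-lock rule can be sketched with two layers that both lock the same key. The file contents below are illustrative; the group and channel names reuse the ones from the surrounding examples.

```yaml
# group/enterprise.group.yaml (lower-precedence layer)
group:
  name: enterprise
  match: 'tier == "enterprise"'
  priority: 20
fixed:
  scorer.threshold: 0.8      # lowest-precedence lock: this one holds
---
# channel/globex/channel.cfg.yaml (higher-precedence layer)
channel: { name: globex }
labels: { tier: enterprise }
fixed:
  scorer.threshold: 0.9      # also locked, but a lower layer pinned the key first
```

By the rule stated above, scorer.threshold would resolve to 0.8 here, and channels resolve would mark it (fixed) against the group layer.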

vars: overrides or adds variable defaults across the four registries ($vars.* / $pipeline.* / $source.* / $record.*), with one sub-block per registry. Each leaf is the same { type, default } shape a pipeline declaration uses:

vars:
  static:                    # $vars.*
    currency: { type: string, default: "USD" }
  pipeline:                  # $pipeline.*
    cutoff_date: { type: date, default: "2026-01-01" }
  source:                    # $source.<src>.*  — outer key is the source-node name
    orders:
      ingest_label: { type: string, default: "prod" }
  record:                    # $record.*
    tier: { type: string, default: "bronze" }

See Variables for the scoped-variable model these overlay.

Structural ops: `overrides`

The overrides: surface is an ordered list of discrete, name-addressed ops applied to the base pipeline’s node list before compilation. Each op is a mapping with an op: discriminant. Unknown keys, or keys that belong to a different op kind, are rejected at parse time.

The op vocabulary is add / remove / replace / set / bypass / patch_schema.

`add` — splice in a node

Insert a new node, either inline or as a composition reference. The splice anchor is exactly one of after: / before: / an explicit input:.

overrides:
  # Inline transform, spliced after `normalize` (its former consumers now read `stamp`):
  - op: add
    node:
      type: transform
      name: stamp
      input: normalize
      config:
        cxl: "emit order_id = order_id"
    after: normalize

  # A composition, named by `alias`, with a config knob for the injected node:
  - op: add
    composition: ../composition/fraud_check.comp.yaml
    alias: fraud_check
    after: normalize
    config:
      threshold: 0.8

after: X reads from X and repoints X’s former consumers onto the new node; before: X feeds X, taking over X’s former upstream. An inline node with no splice anchor keeps its own declared input:. Adding a node whose name already exists is an error.
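
A sketch of the before: form, mirroring the after: example. The pre_clean name is hypothetical, and the inline input: assumes that orders is normalize’s current upstream.

```yaml
overrides:
  # Inline transform spliced before `normalize`: it takes over normalize's
  # former upstream, and normalize now reads `pre_clean`.
  - op: add
    node:
      type: transform
      name: pre_clean
      input: orders          # assumed: normalize's former upstream
      config:
        cxl: "emit order_id = order_id"
    before: normalize
```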

`remove` — delete a node and rewire

Delete a node by name, repointing its named consumers through an explicit rewire: map so no dangling reference is left behind:

overrides:
  - op: remove
    target: legacy_audit
    rewire:
      route_priority.input: product_lookup   # <consumer>.input: <new upstream>

Each rewire: key is a <node>.input path; each value is the replacement upstream. Any consumer still referencing the removed node afterward is an error, as is removing a node that does not exist.

`bypass` — remove a linear node

Sugar for remove on a 1-in/1-out node: it auto-rewires the node’s sole consumer onto its sole upstream.

overrides:
  - op: bypass
    target: legacy_audit

bypass only applies to a single-input, single-consumer node; a fan-in/fan-out node must use the explicit remove op with a spelled-out rewire: map.
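
To make the sugar concrete, the sketch below spells out the remove op that the bypass above would stand for. It assumes legacy_audit has product_lookup as its sole upstream and route_priority as its sole consumer, as in the remove example.

```yaml
overrides:
  # Equivalent to `op: bypass, target: legacy_audit` under the assumption above
  - op: remove
    target: legacy_audit
    rewire:
      route_priority.input: product_lookup   # sole consumer -> sole upstream
```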

`replace` — swap a node’s definition

Replace a whole node by name, keeping its identity (and therefore every consumer edge) intact. The replacement node’s own name: must equal target:.

overrides:
  - op: replace
    target: normalize
    node:
      type: transform
      name: normalize
      input: orders
      config:
        cxl: "emit order_id = upper(order_id)"

`set` — set one field within a node

Set a single field within a named node by path. The currently addressable path is config.cxl — the primary CXL body of a transform / aggregate / combine node — so replacing a stage’s logic wholesale is a set, not a special case:

overrides:
  - op: set
    target: route_priority
    field: config.cxl
    value: >
      emit _route = if priority_level == "urgent"
        then "priority_report" else "fulfilled_orders"

Here _route is an ordinary audit field; it does not select an Output. Direct Outputs sharing route_priority each receive every record. To partition rows by destination, add a Route node with conditions that read the field (or express the conditions directly on the Route).

Any other field path is a hard error, never a silent no-op.

`patch_schema` — shape a source’s columns

Add / rename / modify / remove columns on a source node’s declared schema, via a column-name-keyed map (the map key is the column name). Each column carries exactly one op:

overrides:
  - op: patch_schema
    target: orders
    schema:
      amount:      { type: float, scale: 2 }       # modify: set any subset of attrs
      cust_id:     { rename: customer_id }         # rename (a physical->logical alias)
      order_notes: remove                          # drop an existing column (bare scalar)
      region:      { add: { type: string } }       # add a new column (map key = new name)

The modify leaf is a bare attribute map: it sets any subset of the column’s attributes (type, scale, precision, format, width, …), leaf-replace, keeping every attribute it does not name. A typo’d attribute is rejected rather than silently appended. The same grammar applies identically at every override layer (pipeline / group / channel).

The keyed-map shape (rather than a list) is deliberate: a column op is addressed and leaf-replaced by name, with first-class rename / remove / add, exactly matching the source-config schema patch grammar so the two surfaces resolve columns and their diagnostics identically.

rename is a source-column alias, not a bare relabel: the reader still binds the original physical column and re-labels its value under the new name, so downstream CXL and the output see the new name carrying the original column’s data. A missing column, an add that collides with an existing name, or a rename onto an existing name are all errors (E231–E233).
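
A sketch of a rename followed by downstream CXL that uses the new name. It assumes a normalize Transform reads from orders, as in the replace example; ops in one layer apply in declaration order, so the set sees the renamed column.

```yaml
overrides:
  - op: patch_schema
    target: orders
    schema:
      cust_id: { rename: customer_id }   # reader still binds the physical cust_id column
  - op: set
    target: normalize
    field: config.cxl
    value: "emit customer_id = customer_id"   # downstream CXL refers to the new name only
```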

To see which layer set a given attribute on a patched column, trace it with clinker explain <pipeline> --field <source>.<column>.<attribute> (optionally --channel <name>); the output names the winning Base < Pipeline < Group < Channel layer and each shadowed one. See Field provenance.

Groups and selectors

A group (group/<name>.group.yaml) is a reusable overlay layer that sits between the pipeline default and the channel layers. It carries the same two surfaces every layer carries — config: / vars: value clobber and an overrides: op list:

group:
  name: enterprise
  match: 'tier == "enterprise"'   # optional selector
  priority: 20                    # higher priority wins
config:
  scorer.threshold: 0.8
overrides:
  - op: add
    node:
      type: transform
      name: fraud_stamp
      input: normalize
      config:
        cxl: "emit order_id = order_id"
    after: normalize

A group plays two roles under one concept:

Selector-derived — when match: is present, the group is applied automatically to every channel whose labels satisfy the CXL boolean. Multiple matching groups are ordered by priority (higher wins; the default priority is 0).
Standalone / explicit — when match: is absent, the group is never auto-selected; it applies only when invoked by name with --group. Groups are channel-agnostic — their overrides never read channel labels — so any group can run standalone against the base pipeline, with or without a channel.
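
As a sketch of priority ordering, consider two selector-derived groups that both match a channel labelled { region: west, tier: enterprise }. The west group is hypothetical.

```yaml
# group/west.group.yaml
group:
  name: west
  match: 'region == "west"'
  priority: 10
config:
  scorer.threshold: 0.7
---
# group/enterprise.group.yaml
group:
  name: enterprise
  match: 'tier == "enterprise"'
  priority: 20               # applies later than west, so its value wins
config:
  scorer.threshold: 0.8
```

With no channel-level override, scorer.threshold would resolve to 0.8 from the enterprise group; a channel-wide or per-target config: entry would still clobber it.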

Selectors are label-only CXL

match: is a bare CXL boolean expression evaluated in a restricted label-only context: the only names in scope are the channel’s labels. $record / $source / $pipeline / $vars / $doc, window and aggregate calls, now, and wildcards are all rejected, so a selector is a pure, deterministic predicate over labels.

match: 'region == "west" and tier == "enterprise"'

Labels are typed from their YAML/JSON scalar kind (string, bool, int, float), so the typechecker rejects label/literal type mismatches. A selector that references a label a channel does not declare is a hard error, never a silent false — a typo surfaces as an unresolved-identifier error rather than quietly excluding the channel.
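
A sketch of how label typing plays out in a selector. The seats label and the large_west group are hypothetical; the commented-out alternatives show the two failure modes described above.

```yaml
# channel/globex/channel.cfg.yaml
channel: { name: globex }
labels: { region: west, seats: 250 }   # region is a string label, seats an int label
---
# group/large_west.group.yaml
group:
  name: large_west
  match: 'region == "west" and seats >= 100'   # int label compared with an int literal
  # match: 'seats == "250"'                    # int label vs string literal: type mismatch
  # match: 'regoin == "west"'                  # undeclared label: unresolved-identifier error
  priority: 10
```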

The channel manifest

channel.cfg.yaml declares a channel’s identity labels and optional channel-wide overlays (applied to every pipeline this channel runs):

channel:
  name: globex
labels: { region: west, tier: enterprise }   # identity — drives group selectors
config:
  scorer.threshold: 0.9                        # channel-wide value clobber (optional)
vars:
  static:
    currency: { type: string, default: "USD" }
overrides: []                                  # channel-wide op list (optional)

Labels are identity, never a pipeline override. The manifest is optional: a channel with no labels and no channel-wide overlays needs no channel.cfg.yaml at all — its folder name is still its id. But a channel that groups select on must declare the labels those selectors read, otherwise the selector errors on the unresolved label rather than cleanly not matching.

The per-target overlay

<target>.channel.yaml overlays a single pipeline (or <target>.comp.yaml a composition). The channel.target: field is authoritative — the filename suffix is optional and, when present, must agree:

channel:
  target: ../../pipeline/order_fulfillment.yaml
config:
  scorer.threshold: 0.95
overrides:
  - op: patch_schema
    target: orders
    schema:
      tax_exempt: { add: { type: bool } }

CLI surface

Running with overlays

# Run as a tenant: resolves the channel folder and derives matching groups
# from its labels.
clinker run pipeline/order_fulfillment.yaml --channel globex --base-dir .

# Force-include a group by name, with or without a channel.
clinker run pipeline/order_fulfillment.yaml --group enterprise --base-dir .

run resolves the overlay stack from the workspace (rooted at --base-dir, default the current directory) and folds the resolved overrides into the plan before execution. Overlay flags shared across run and explain:

Flag	Meaning
`--group <NAME>`	Force-include a group overlay by name (repeatable), with or without a channel.
`--no-auto-groups`	Suppress selector-derived group membership; only explicit `--group` overlays apply.
`--channel <ID>`	Apply a tenant channel by id (its folder under the channel root), resolved by computed path. Derives matching groups from the channel’s labels and applies the layered `config`/`vars` clobber, the `overrides:` op stream, and `sources:` per-source patches.

explain --field <alias.param> --group <NAME> reports the same overlay stack for provenance lookups, mirroring run.

Inspecting overlays

channels resolve renders the effective post-overlay DAG for one target under a chosen channel and/or groups, with per-value provenance — which layer supplied each value and which group injected which node:

# Resolve the effective plan for the globex channel (derives matching groups from its labels)
clinker channels resolve pipeline/order_fulfillment.yaml --channel globex --base-dir .

# Preview a group overlay standalone (no channel)
clinker channels resolve pipeline/order_fulfillment.yaml --group enterprise --base-dir .

Here --channel is a channel id (the folder name under the channel root), resolved by computed path; resolve derives that channel’s matching groups from its labels unless --no-auto-groups is passed.

channels lint compiles every channel/group overlay in the workspace and reports every failure — the one full-tree scan in the system:

clinker channels lint --base-dir .

Membership and labels

# List the channels a group's selector currently matches
clinker channels group members enterprise --base-dir .

# Stamp/overwrite a label across one or more channels (idempotent)
clinker channels label set tier=enterprise globex initech --base-dir .

channels label set takes a key=value assignment; the value is typed by YAML scalar inference (true/false → bool, integers → int, decimals → float, else string) so numeric and boolean labels compare correctly against selectors.

Renaming a base node

refactor rename-node renames a base node and propagates the rename to every overlay that references it (splice anchors, target:, rewire: keys) across the workspace:

# Preview every file that would change
clinker refactor rename-node pipeline/order_fulfillment.yaml orders purchases --dry-run

# Apply it, then re-lint
clinker refactor rename-node pipeline/order_fulfillment.yaml orders purchases --base-dir .
clinker channels lint --base-dir .

The new name must consist of letters, digits, and _ only.

Source config patches

Independent of the overlay op engine, a channel file can patch a source node’s parsed config directly through a sources: block, applied before validation and compile so the run behaves exactly as if the source YAML had been hand-edited. This is the same column-keyed schema grammar patch_schema reuses, plus multi-value and per-format option patches.

sources:
  transactions:                            # source-node name (unknown -> E230)
    options:
      record_path: batch_records           # set a scalar per-format option (bad key -> E235)
    split_to_rows:                          # keyed by field name
      items:      { mode: split, position_column: line_no }  # add-or-modify
      tags:       { position_column: ~ }    # clear one attribute
      line_items: remove                    # drop an entry (unknown field -> E234)
    split_values:                           # keyed by field name
      codes:      { delimiter: "|" }        # add-or-modify an entry
      tags:       { delimiter: ~ }          # reset to the default delimiter
      notes:      remove                    # drop an entry (unknown field -> E234)
    schema:                                 # keyed by column name
      amount:      { type: float, scale: 2 }
      cust_id:     { rename: customer_id }
      order_notes: remove
      region:      { add: { type: string } }

All ops are keyed and leaf-replace; there is no deep-merge. On an existing split_to_rows / split_values entry a partial map is a modify: an omitted key keeps its current value, and a new entry takes the same defaults hand-written config would. Because an omitted key means “keep current”, clearing an attribute that is already set needs its own form: an explicit YAML null (~). On position_column that removes the attribute; on delimiter, which always holds some separator, it restores the ; default.

options are merged onto the source’s current options and re-validated through the format’s option struct, so an unknown or mistyped key is rejected exactly as in hand-written config.

A schema rename is a source-column alias, the same alias a base column can declare directly through its source_name key:

schema:
  # read the physical `cust_id` column, expose it downstream as `customer_id`
  - { name: customer_id, type: string, source_name: cust_id }
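
The keyed add-or-modify / null-clear / remove rules for split_values can be sketched in Python as below. Plain dictionaries stand in for Clinker’s parsed config, only the delimiter attribute is modeled, and the function name and exception type are hypothetical; treat it as a reading aid for the semantics, not as the engine’s code.

```python
REMOVE = "remove"
DEFAULT_DELIMITER = ";"  # the default named in the text above


def patch_split_values(current: dict, patch: dict) -> dict:
    """Apply a keyed split_values patch and return a new mapping."""
    result = {field: dict(entry) for field, entry in current.items()}
    for field, op in patch.items():
        if op == REMOVE:
            if field not in result:
                raise ValueError(f"E234: no split_values entry for {field!r}")
            del result[field]
            continue
        # Add-or-modify: start from the existing entry, or from defaults.
        entry = result.setdefault(field, {"delimiter": DEFAULT_DELIMITER})
        for key, value in op.items():
            if key == "delimiter" and value is None:
                entry["delimiter"] = DEFAULT_DELIMITER  # explicit null resets
            else:
                entry[key] = value  # leaf replace, no deep merge
    return result
```

An omitted key is simply never visited by the inner loop, which is what makes “omitted means keep current” fall out naturally and why clearing needs the explicit null.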

Format-structure patches (X12 / HL7 v2)

Beyond the format-agnostic ops above, a sources: patch can reshape the format-layer structures an X12 or HL7 source declares in its options: block — with keyed add/modify/remove grammar instead of blob-replacing the whole options map:

sources:
  interchange:                             # an X12 source
    group_section:                         # the GS functional-group declaration
      name: fg                             # rename the section (omit to keep)
      fields:
        e04: int                           # set/add a typed field
        e05: remove                        # drop a declared field
    set_section: remove                    # drop the whole ST declaration
  messages:                                # an HL7 v2 source
    split_fields:                          # keyed by positional field name
      f08: { components: 3 }               # add-or-modify a composite split
      f03: remove                          # drop a declared split

group_section / set_section patch the X12 nested-envelope declarations (the GS functional-group and ST transaction-set levels); split_fields patches the HL7 composite-field splits, keyed by positional field name and resolved by wire position (f8 and f08 address the same split). Each op applies only to a source of the matching format (anything else is E238). The set form is a partial modify on an existing declaration — an omitted name or axis width keeps its current value — and creates the declaration when absent, in which case name (X12) or components (HL7) is required (E240). Removing a declaration, field, or split the source does not carry is E239. These ops apply after the options merge, so they layer on top of an options value that replaces the same declaration in one patch.
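
The “resolved by wire position” rule for split_fields keys can be illustrated with a short Python sketch. The helper name is hypothetical, and the sketch assumes a key is the letter f followed by one or more decimal digits; it shows only why f8 and f08 land on the same split.

```python
import re

_FIELD_KEY = re.compile(r"f(\d+)")


def field_position(key: str) -> int:
    """Map a positional field name such as f8 or f08 to its wire position."""
    m = _FIELD_KEY.fullmatch(key)
    if m is None:
        raise ValueError(f"E240: {key!r} is not a positional fNN name")
    return int(m.group(1))
```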

Multi-record patches (discriminator-driven flat files)

A multi-record flat file interleaves several record layouts in one file, each identified by a discriminator tag. A sources: patch reshapes that layout with records: (keyed by record-type id) and a discriminator: merge, so a tenant’s record set can differ from the base without editing the pipeline:

sources:
  ledger:                                  # a multi-record source
    discriminator: { start: 2 }            # move the tag byte range (partial merge)
    records:
      detail:  { tag: X }                  # retag; a nested `columns:` reshapes fields
      trailer: remove                      # drop a record type
      header:                              # add a record type (map key = its id)
        add:
          tag: H
          columns:
            - { name: hdr_id, type: string, start: 1, width: 8 }

A records entry follows the same keyed grammar as schema: a bare remove drops the record type, { add: { tag, columns, ... } } declares a new one, and a bare attribute map modifies an existing one. A modify sets any subset of the record type’s tag / parent / join_key / description and carries a nested columns: map that runs the column-op grammar (modify / rename / add / remove) against that record type’s own fields. The discriminator: op merges field by field onto the current discriminator — a named field overwrites, an omitted one is kept — and the merged result must be a byte range (start + optional width) XOR a field.

These ops apply only to a multi-record schema (E241). Modifying or removing an unknown record-type id is E242, adding an id that already exists is E243, a merged discriminator that is neither pure byte-range nor pure field is E244, and a discriminator tag shared by two record types after the patch is E245.
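
The discriminator merge and the two post-patch checks can be sketched in Python as follows. Dictionaries stand in for the parsed schema and the function names are hypothetical; the sketch follows the rules stated above (field-by-field overwrite, byte range XOR field, unique tags) and nothing more.

```python
def merge_discriminator(current: dict, patch: dict) -> dict:
    """Merge a discriminator patch field by field, then validate its shape."""
    merged = {**current, **patch}  # named fields overwrite, omitted ones are kept
    is_range = "start" in merged and "field" not in merged
    is_field = "field" in merged and "start" not in merged and "width" not in merged
    if not (is_range or is_field):
        raise ValueError("E244: discriminator must be a byte range XOR a field")
    return merged


def check_unique_tags(records: dict) -> None:
    """Reject a record set in which two record types share a tag."""
    seen = {}
    for record_id, record in records.items():
        tag = record["tag"]
        if tag in seen:
            raise ValueError(f"E245: tag {tag!r} shared by {seen[tag]!r} and {record_id!r}")
        seen[tag] = record_id
```

With a base of start 0 and width 1, the patch `{ start: 2 }` from the example yields start 2 and width 1, because width was omitted and is therefore kept.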

Sources inside a composition body

A plain sources: key names a top-level source node. To patch a source declared inside a composition body, qualify the key with the composition call-site node name: <composition-node>.<source>. The composition body is expanded during compile, so the patch is applied to the body’s source when the body is bound — before the body typechecks — exactly as a top-level patch shapes a top-level source before it binds:

sources:
  enrich.lookups:                          # source `lookups` inside composition node `enrich`
    schema:
      code: { rename: lookup_code }

Resolution is one level deep. The qualifier must name a composition node in the pipeline: an unknown composition, or a nested a.b.c key naming a source inside a nested composition body, is E230. The source half must name a source node declared in that composition’s body; an unknown one is E230, and the diagnostic names the body file. A plain unqualified key targets a top-level source, and a name that matches no top-level source fails with E230, with a hint at the qualified form when the pipeline has compositions.
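
The resolution rules for plain and qualified keys can be summarized in a Python sketch. The data shapes (a set of top-level source names, and a map from composition node name to the source names in its body) and the function name are assumptions made for illustration; the message wording is invented, only the E230 code comes from the text.

```python
def resolve_patch_key(key, top_level_sources, composition_bodies):
    """Resolve a `sources:` patch key to (composition_or_None, source).

    top_level_sources: set of top-level source node names.
    composition_bodies: {composition node name: set of body source names}.
    """
    parts = key.split(".")
    if len(parts) == 1:
        if key in top_level_sources:
            return (None, key)
        hint = " (did you mean <composition-node>.<source>?)" if composition_bodies else ""
        raise LookupError(f"E230: unknown source {key!r}{hint}")
    if len(parts) > 2:
        raise LookupError(f"E230: nested key {key!r}; resolution is one level deep")
    composition, source = parts
    if composition not in composition_bodies:
        raise LookupError(f"E230: unknown composition {composition!r}")
    if source not in composition_bodies[composition]:
        raise LookupError(f"E230: no source {source!r} in the body of {composition!r}")
    return (composition, source)
```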

Note: a body-declared source binds (its schema seeds the body) but is not yet fed at runtime — the engine ingests only top-level sources — so a channel patch to a body source is applied and observable at compile (--explain), while a data run through a body source awaits body-source runtime support.

When a patch changes the effective source config, the run’s pipeline identity differs from the base and from other patched variants, so their outputs and lineage do not collide.

Diagnostics

Code	Meaning
E113	A `config:` / override key matches no composition parameter in the compiled plan. A misspelled or stale key aborts the run instead of silently doing nothing.
E114	An overlay op failed to apply (missing splice anchor, duplicate node name, missing/removed `target`, invalid `set` field, invalid `bypass` node). The diagnostic is anchored to the offending op’s source span, not the base pipeline.
E230	A source patch (`sources.<src>` or `patch_schema`) targets a source that does not exist: an unknown top-level source, an unknown composition for a qualified `<composition>.<source>` key, a `<composition>.<source>` naming no source in that composition’s body, or a nested (`a.b.c`) key.
E231	A schema `rename` / `modify` / `remove` of a column that does not exist.
E232	A schema `add` of a column name that already exists.
E233	A schema `rename` whose target name collides with an existing column.
E234	A `split_to_rows` / `split_values` `remove` of a field with no matching entry.
E235	An `options` patch sets an unknown or mistyped option key for the source’s format.
E236	A renamed/aliased column’s exposed name collides with a real input field, which would mislocate that field. Raised at read time.
E237	A `schema` patch on a multi-record / generated / external-file schema — column ops apply only to a single-record column list.
E238	A `group_section` / `set_section` patch on a non-X12 source, or a `split_fields` patch on a non-HL7 source.
E239	A `remove` of a nested-section declaration, declared section field, or field split the source does not carry.
E240	A malformed format-structure patch: creating a nested section without a `name`, adding a split without `components`, a split key that is not a positional `fNN` name, or a zero axis width.
E241	A `records` / `discriminator` patch on a single-record / generated / external-file schema — these ops apply only to a multi-record schema.
E242	A `records` `modify` / `remove` of a record-type id the source does not declare.
E243	A `records` `add` of a record-type id that already exists.
E244	A merged `discriminator` that is neither a pure byte range (`start` + optional `width`) nor a pure `field`.
E245	Two record types share a discriminator tag after the patch, which would make the reader’s discriminator dispatch ambiguous.

Compositions

Compositions are reusable pipeline fragments that can be imported into multiple pipelines. They encapsulate common transform patterns – date derivations, address normalization, currency conversion – into self-contained, testable units.

Using a composition

A composition node in your pipeline references an external .comp.yaml file:

- type: composition
  name: fiscal_dates
  input: invoices
  use: "./compositions/fiscal_date.comp.yaml"
  config:
    start_month: 4

The use: field points to the composition definition file. The config: block passes parameters that customize the composition’s behavior for this specific invocation.

Resolving the `use:` path

A use: value names a .comp.yaml in the workspace. It is resolved relative to the directory of the pipeline file being compiled, then against the set of .comp.yaml files discovered under the workspace root, finally falling back to a filename match. A use: that resolves to no .comp.yaml — a typo, a wrong relative prefix, or a file that does not exist — fails compilation with a spanned E103 diagnostic naming the composition node. The whole run aborts loudly; it does not silently drop the composition and write an empty output. The same holds for the other composition-binding errors (E102–E108): an ill-bound call site fails compile rather than producing a run that writes zero records. Run clinker explain --code E103 for details.
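
The three-step resolution order can be sketched in Python over a set of already-discovered files. Several details here are assumptions rather than documented behavior: paths are modeled as workspace-relative POSIX strings, the second step treats the use: value as relative to the workspace root, and the filename fallback picks the first match in sorted order because the text does not say how ties are broken. `resolve_use` is a hypothetical name.

```python
import posixpath


def resolve_use(use: str, pipeline_file: str, discovered: set) -> str:
    """Resolve a use: value against workspace-relative .comp.yaml paths."""
    # Step 1: relative to the directory of the pipeline file being compiled.
    candidate = posixpath.normpath(posixpath.join(posixpath.dirname(pipeline_file), use))
    if candidate in discovered:
        return candidate
    # Step 2 (assumed reading): match against the discovered set directly.
    candidate = posixpath.normpath(use)
    if candidate in discovered:
        return candidate
    # Step 3: fall back to a bare filename match (tie-break is an assumption).
    name = posixpath.basename(use)
    matches = sorted(p for p in discovered if posixpath.basename(p) == name)
    if matches:
        return matches[0]
    raise LookupError(f"E103: use: {use!r} resolves to no .comp.yaml")
```

The important property is the last line: a value that matches nothing raises instead of returning an empty result, mirroring the fail-loud behavior described above.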

Composition definition file

A .comp.yaml file declares the composition’s interface – what fields it requires from upstream and what fields it produces:

# compositions/fiscal_date.comp.yaml
composition:
  name: fiscal_date
  description: "Derive fiscal year, quarter, and period from a date field"

  requires:
    - { name: invoice_date, type: date }

  produces:
    - { name: fiscal_year, type: int }
    - { name: fiscal_quarter, type: string }
    - { name: fiscal_period, type: int }

  params:
    - name: start_month
      type: int
      default: 1
      description: "First month of the fiscal year (1-12)"

Composition fields

Field	Required	Description
`name`	Yes	Composition identifier
`description`	No	Human-readable purpose
`requires`	Yes	Input fields the composition needs from upstream (name + type)
`produces`	Yes	Output fields the composition adds to the record (name + type)
`params`	No	Configurable parameters with optional defaults

Reading config parameters in the body

A composition body reads its own config parameters as $config.<param>. The planner constant-folds each reference to the value resolved for that instantiation — the call site’s config: value, or a channel/group config: override, or the declared default — so the same composition used with different config: compiles to different bodies. Because the resolution happens per instantiation, a channel or group config: override changes what the body computes, not just the reported provenance.

Body validation

Nodes inside a composition body are validated with the same node-scoped config checks as top-level pipeline nodes. A body node that would be rejected at the top level — an envelope wiring the not-yet-supported trailer: port, a transform declaring a reserved variable name or a default that does not match its declared type, an invalid log directive, or a batch_size: 0 — fails compilation with an E115 diagnostic naming the composition call site, the body file, and the violation. Run clinker explain --code E115 for details.

A body source or output that sets a CSV delimiter or quote_char that is not exactly one ASCII byte is likewise rejected at compile time rather than at run time, under the same one-byte rule that applies to top-level nodes.

A body source or output whose schema: names an external .schema.yaml file has that path resolved relative to the composition file’s own directory (not the invoking pipeline’s), and the file’s columns are inlined before the body binds. A body output therefore rounds decimal columns to their declared scale at the write boundary exactly as a top-level output does.

Advanced wiring

For compositions with multiple input or output ports, the node supports explicit port bindings:

- type: composition
  name: enrich_address
  input: customers
  use: "./compositions/address_normalize.comp.yaml"
  inputs:
    primary: customers
    reference: zip_lookup
  outputs:
    normalized: next_stage
  config:
    country_code: "US"
  resources:
    zip_database: "./data/zipcodes.csv"

Port and resource fields

Field	Required	Description
`inputs`	No	Map of composition input ports to upstream node references
`outputs`	No	Map of composition output ports to downstream node references
`config`	No	Parameter overrides (key-value pairs)
`resources`	No	External resource bindings (file paths, connection strings)
`alias`	No	Namespace prefix for expanded node names (avoids collisions)

Complete example

pipeline:
  name: invoice_pipeline

nodes:
  - type: source
    name: invoices
    config:
      name: invoices
      type: csv
      path: "./data/invoices.csv"
      schema:
        - { name: invoice_id, type: int }
        - { name: customer_id, type: int }
        - { name: invoice_date, type: date }
        - { name: amount, type: float }

  - type: composition
    name: fiscal_dates
    input: invoices
    use: "./compositions/fiscal_date.comp.yaml"
    config:
      start_month: 4

  - type: transform
    name: final_enrich
    input: fiscal_dates
    config:
      cxl: |
        emit invoice_id = invoice_id
        emit customer_id = customer_id
        emit amount = amount
        emit fiscal_year = fiscal_year
        emit fiscal_quarter = fiscal_quarter

  - type: output
    name: result
    input: final_enrich
    config:
      name: result
      type: csv
      path: "./output/invoices_enriched.csv"

Correlation Keys

A correlation key declares a set of records from a single source as an atomic group: if any record in the group fails validation or processing, the whole group is sent to the DLQ. This is the right shape for transactional data where partial processing is worse than total rejection – the canonical example is an order with multiple line items where one bad line should reject the entire order.

This page describes how to declare a correlation key and how it behaves through each node that can fan out, fan in, group, or join records.

Declaration

Correlation keys are declared per source. Each source’s config: block carries an optional correlation_key: field naming the column (or list of columns) whose value identifies a record’s correlation group within that source.

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: csv
      path: ./data/orders.csv
      correlation_key: order_id
      schema:
        - { name: order_id, type: string }
        - { name: amount, type: int }

  - type: source
    name: customers
    config:
      name: customers
      type: csv
      path: ./data/customers.csv
      correlation_key: [customer_id, region]   # multi-column key
      schema:
        - { name: customer_id, type: string }
        - { name: region, type: string }
        - { name: name, type: string }

  - type: source
    name: sensor_readings
    config:
      name: sensor_readings
      type: csv
      # No correlation_key: record-level errors land in the DLQ as
      # standalone entries with no group atomicity.
      schema:
        - { name: ts, type: date_time }
        - { name: value, type: float }

A record’s correlation group is identified by the tuple of values for that source’s listed fields. Records sharing the same tuple within the same source belong to the same group. There is no pipeline-level correlation key — declare it on each contributing source.

The group identity is captured at ingest, so rewriting the key column in a later Transform does not change a record’s group — anonymizing or transforming order_id downstream still keeps the original grouping intact.

A source whose declared correlation_key: field names a column not present in its own schema: block is rejected at compile time with diagnostic E153. The fix is to add the field to the schema or remove it from correlation_key:.

DLQ semantics

When a record fails inside a correlation group:

The failing record produces a trigger DLQ entry. Its category reflects the actual failure (e.g. type_error, validation_failed).
Every other record from the same source in that group produces a collateral DLQ entry, carrying the category correlated.
Records belonging to other (clean) groups proceed normally.

A record with a null value for the correlation-key field is treated as its own group: it has no peers, so DLQ atomicity does not span multiple records.

The dlq_count counter sums triggers and collaterals.
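
The trigger / collateral split can be modeled in a few lines of Python. This is a sketch of the semantics described above, not the engine’s algorithm: records are plain dictionaries, a null key is modeled as `None`, and `classify` is a hypothetical name.

```python
from collections import defaultdict


def classify(records, failures):
    """Split records into clean ids and DLQ entries.

    records: dicts with 'id', 'source', and 'key' (a tuple, or None for a null key).
    failures: {record id: failure category} for the records that failed.
    """
    groups = defaultdict(list)
    for rec in records:
        # A null key has no peers: the record forms a group of its own.
        group = ("solo", rec["id"]) if rec["key"] is None else (rec["source"], rec["key"])
        groups[group].append(rec)

    clean, dlq = [], []
    for members in groups.values():
        if not any(r["id"] in failures for r in members):
            clean.extend(r["id"] for r in members)
            continue
        for r in members:
            # The failing record keeps its real category; peers are collateral.
            dlq.append((r["id"], failures.get(r["id"], "correlated")))
    return clean, dlq
```

In this model `len(dlq)` plays the role of dlq_count, since it counts triggers and collaterals together.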

Group buffering

The engine buffers records per correlation group until either the group completes or a failure triggers a flush. The max_group_buffer: field on the pipeline-level error_handling: block caps per-group buffering across every source’s groups:

error_handling:
  max_group_buffer: 100000     # Default: 100,000

Groups that exceed the cap are DLQ’d entirely with a group_size_exceeded trigger plus a collateral entry per buffered record. This is a backpressure boundary, not a hard error.

Per-operator interactions

Route interaction (fan-out)

A correlation group can span multiple route branches. Group atomicity is preserved across branches: if any record in the group fails (in any branch’s transform, or in the route predicate itself), the entire group is rejected from every branch.

For an inclusive route where one record reaches both branches, a single failure DLQ’s that source row exactly once — not once per branch.

Merge interaction (fan-in)

Merge concatenates upstream branches that share a schema. Records keep their correlation identity through the merge, so rows from different sources that share the same key value become one correlation group downstream. A failure on any one of them triggers a DLQ for that group, with collateral fan-out scoped as described under Per-source rollback narrowing.

Per-source rollback narrowing

When two sources contribute records to the same correlation group, a failure originating from one source does not collaterally DLQ records from the other source. The collateral fan-out is scoped to the failing source’s records only.

For example, with [src_a, src_b] → merge → transform → out where both declare correlation_key: id, an error that fires on a src_b row produces a trigger for that row while the src_a row sharing the same id is spared and reaches the output. Single-source pipelines behave exactly as a pipeline-wide collateral DLQ would, since every co-grouped record shares the one source.

Two cases stay group-wide rather than narrowing per source:

max_group_buffer overflow DLQ’s every record in the overflowing group — no single source is to blame for the overflow.
Combine output failures DLQ the synthesized output row, which has no single-source attribution.
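
The narrowing rule and its overflow exception can be sketched in Python. The function below is a hypothetical model: a group is a list of (record id, source name) pairs that share one key value, and the return value is the set of records that would land in the DLQ. The Combine case is not modeled.

```python
def dlq_scope(group, trigger_id, overflow=False):
    """Return the record ids DLQ'd when `trigger_id` fails in `group`.

    group: list of (record_id, source) pairs sharing one correlation key.
    overflow: True when the failure is a max_group_buffer overflow.
    """
    if overflow:
        # Overflow stays group-wide: no single source is to blame.
        return [record_id for record_id, _ in group]
    failing_source = next(src for record_id, src in group if record_id == trigger_id)
    # Collateral fan-out is limited to the failing source's records.
    return [record_id for record_id, src in group if src == failing_source]
```

For the [src_a, src_b] example above, a failure on a src_b row returns only src_b record ids, so the src_a row with the same id continues to the output.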

Aggregate interaction

When an aggregate’s group_by covers every correlation-key field, the aggregate stays on the strict path: each emitted row inherits the correlation identity of its inputs, and any DLQ trigger in the group rolls back every record in the group, including the aggregate output row.

- type: aggregate
  name: order_totals
  input: orders                         # correlation_key: order_id
  config:
    group_by: [order_id]                # covers the key
    cxl: |
      emit total = sum(amount)

When an aggregate’s group_by omits a correlation-key field, the engine automatically retracts only the failing records and recomputes the affected groups, so the surviving contributions still produce a correct aggregate row. You do not configure this — the engine picks the path from the group_by content. (One restriction: this mode cannot be combined with strategy: streaming, which is rejected at compile time.)

- type: aggregate
  name: dept_totals
  input: orders                         # correlation_key: order_id
  config:
    group_by: [department]              # omits the key — surviving rows recomputed
    cxl: |
      emit total = sum(amount)
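
How the path follows from the group_by content can be shown with a Python sketch. The labels "strict" and "retract", the function names, and the row shape are all hypothetical; the sketch only models the decision rule and a recomputation over surviving rows, not the engine’s incremental retraction.

```python
def aggregate_path(group_by, correlation_key, strategy=None):
    """Pick the path from group_by content ('strict' or 'retract')."""
    if set(correlation_key) <= set(group_by):
        return "strict"  # group_by covers every correlation-key field
    if strategy == "streaming":
        raise ValueError("group_by omits a correlation-key field; not valid with strategy: streaming")
    return "retract"


def recompute_totals(rows, dlq_ids, group_field="department"):
    """Recompute sum(amount) per group from the rows that were not DLQ'd."""
    totals = {}
    for row in rows:
        if row["id"] in dlq_ids:
            continue  # retracted contribution
        totals[row[group_field]] = totals.get(row[group_field], 0) + row["amount"]
    return totals
```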

Combine interaction

Every combine declares propagate_ck: to select which correlation-key fields its output rows carry:

propagate_ck: driver — output inherits only the driver input’s correlation identity. The common case; today’s strict-correlation pipelines stay on this setting.
propagate_ck: all — output carries the union of correlation-key fields across every input. Use when the build side carries keys that downstream operators need to read.
propagate_ck: { named: [<field>, ...] } — output carries exactly the named subset. Use to project a multi-field key down after a join.

- type: combine
  name: enriched
  input:
    o: orders                          # driver (correlation_key: employee_id)
    d: departments                     # build side
  config:
    where: "o.employee_id == d.employee_id"
    match: first
    on_miss: skip
    cxl: |
      emit employee_id = o.employee_id
      emit amount = o.amount
      emit dept = d.dept
    propagate_ck: driver

How match mode fills the propagated key:

match: first — the single matched build’s key fills the slot.
match: all — one output row per matched build, each carrying its own build’s key.
match: collect — one row per driver; the first matched build’s key fills the (single-valued) slot, while every matched build’s full payload still rides inside the array column.

Driver wins on a name collision: if both the driver and a build input declare the same key field, the output keeps the driver’s value.

propagate_ck is a required field — every combine must spell out which mode it uses.
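
The way each match mode fills the propagated key can be modeled in Python. The sketch assumes a propagate_ck: all style union, represents each build row as a dictionary with a ck map and a payload, and returns no rows on a miss (as with on_miss: skip in the example above). The function names and row shapes are hypothetical.

```python
def merge_ck(driver_ck: dict, build_ck: dict) -> dict:
    """Union of key fields; the driver's value wins on a name collision."""
    return {**build_ck, **driver_ck}


def combine_rows(driver: dict, matched_builds: list, match: str) -> list:
    """Shape output rows for one driver record and its matched build rows."""
    if not matched_builds:
        return []
    if match == "first":
        b = matched_builds[0]
        return [{"ck": merge_ck(driver["ck"], b["ck"]), "payload": b["payload"]}]
    if match == "all":
        # One row per matched build, each carrying its own build's key.
        return [
            {"ck": merge_ck(driver["ck"], b["ck"]), "payload": b["payload"]}
            for b in matched_builds
        ]
    if match == "collect":
        # One row per driver: the first build's key fills the slot,
        # while every matched payload rides in the array.
        first = matched_builds[0]
        return [{
            "ck": merge_ck(driver["ck"], first["ck"]),
            "payload": [b["payload"] for b in matched_builds],
        }]
    raise ValueError(f"unknown match mode {match!r}")
```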

Composition interaction

A composition’s body operates on records flowing in from the parent pipeline; correlation identity flows into the composition inputs and back out the named ports unchanged. Compositions cannot declare their own correlation key — a key is a property of a source, not of the composition body that consumes a source’s records.

Debugging

Correlation grouping is tracked on internal columns you never write in YAML or CXL, and they are hidden from writer output by default. To surface them for debugging, set include_correlation_keys: true on an output node:

- type: output
  name: debug
  input: any_node
  config:
    type: csv
    path: "./debug.csv"
    include_correlation_keys: true

The output then contains extra columns named $ck.<field> (literal prefix in the CSV header) for each declared correlation-key field.

To investigate DLQ collaterals: every collateral entry’s category is correlated, and the trigger entry in the same group carries the actual failure category and message.

Document Envelope Context (`$doc.*`)

Many enterprise file formats wrap their record body in an envelope: named sections that surround the records and carry document-level metadata — a batch header with a run date and batch id, a trailer with a record count and checksum, or arbitrary sibling sections. Clinker exposes these sections to CXL through the $doc.<section>.<field> namespace.

sources:
  - name: payments
    path: data/payments.xml
    format: xml
    envelope:
      sections:
        BatchInfo:
          extract: { xml_path: "/payments/BatchInfo" }
          fields:
            batch_id: string
            run_date: date
        Summary:
          extract: { xml_path: "/payments/Summary" }
          fields:
            record_count: int
            checksum: string

Like the rest of the pipeline config, the envelope: block is strict: an unknown key at any level (a misspelled sections:, extract:, or fields:) is rejected at plan parse time with a diagnostic naming the bad key, rather than being silently ignored.

A downstream transform reads any declared section field on every body record:

nodes:
  - transform: tag
    inputs: { in: payments }
    project:
      - batch: $doc.BatchInfo.batch_id
      - expected_total: $doc.Summary.record_count
      - amount: amount

Section names are yours

The engine reserves no section names. BatchInfo and Summary above are arbitrary identifiers chosen by the pipeline author — Head / Foot, preamble / trailer, batch_metadata / eob_summary are all equally valid. A section name is whatever string you put in the sections: map; CXL exposes it verbatim as $doc.<that_name>.<field>.

All sections are available everywhere in the body stream

Every declared section is available to every body record, no matter where the section physically sits in the file: a header at the top and a trailer at the bottom are both visible from the first record to the last.

This means a trailer field is available during body processing, not just at end-of-file. A pipeline can compute, on every row, a ratio against the trailer’s total:

project:
  - running_fraction: row_index / $doc.Summary.record_count

Note that an extracted trailer section you read via $doc.* is distinct from the structural counts an EDI reader validates internally (the X12 SE/GE/IEA, EDIFACT UNT/UNZ, HL7 BTS/FTS segment counts). Those trailer counts are checked by the reader against the body it streamed, and a mismatch is a structural-integrity failure — see Malformed envelopes for how dlq_granularity: document dead-letters a malformed file instead of aborting the run.

The pre-scan reads the envelope-bearing segments of the file before body streaming begins. Envelope payloads are small (a few hundred bytes per document is typical), and how much of the file the reader retains to reach a trailing section depends on the format:

JSON streams the pre-scan. The reader walks the document once and deserializes only the subtrees the declared sections point at — every other key (including a multi-megabyte body array) is parsed-and-skipped without being stored. The retained sections live in a bounded document index capped by max_index_bytes (see below), so retained section memory scales with the declared sections, not the document size.
XML streams the pre-scan too. The reader event-walks the document once and flattens only the declared section subtrees; every other element, including a multi-megabyte body, is event-walked and dropped without being materialized. The retained sections live in the same bounded document index capped by max_index_bytes. The reader still holds the file’s raw bytes for the lifetime of the read (one disk read backs both the body parser and the pre-scan), but the pre-scan does not materialize the undeclared section subtrees.

Only the envelope sections live in the document context — body records still flow through the pipeline one at a time, for every format.

Bounding envelope retention with `max_index_bytes`

For JSON and XML sources, the document index is capped so a pathologically large declared section fails loud rather than exhausting memory. The cap is charged incrementally as each section is parsed — byte by byte while the section’s subtree is built — so even a single oversized declared section aborts mid-parse, before its whole subtree materializes, naming the offending section and the cap. (Undeclared siblings, including a multi-megabyte body, are skipped without being parsed into the index at all.)

- type: source
  name: events
  config:
    name: events
    type: json
    path: "./data/events.json"
    options:
      record_path: data.rows
      max_index_bytes: 64MB   # cap on retained envelope sections

The same max_index_bytes option applies to an XML source’s options: block, capping its envelope pre-scan identically.

max_index_bytes accepts a decimal size string (64MB, 500KB) or a bare byte count. It is optional; when omitted the reader applies a documented finite default of 64MB. Only the declared sections a program actually reads are retained, so envelope metadata sits far below this ceiling in practice — the cap exists to convert an unbounded mistake into a clear error.
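
A Python sketch of the two ideas in this section, size parsing and incremental charging, is given below. It rests on assumptions that the text does not spell out: “decimal” is read as powers of 1000 (so 64MB is 64,000,000 bytes), the set of accepted suffixes is a guess, and `parse_size` and `IndexBudget` are hypothetical names.

```python
import re

# Assumed decimal multipliers; the accepted suffix set is a guess.
_UNITS = {"": 1, "B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(value) -> int:
    """Parse a size string such as 64MB, or pass a bare byte count through."""
    if isinstance(value, int):
        return value
    m = re.fullmatch(r"(\d+)\s*([A-Za-z]*)", value.strip())
    if m is None or m.group(2).upper() not in _UNITS:
        raise ValueError(f"invalid size {value!r}")
    return int(m.group(1)) * _UNITS[m.group(2).upper()]


class IndexBudget:
    """Charge retained section bytes against a cap as they are parsed."""

    def __init__(self, cap_bytes: int):
        self.cap_bytes = cap_bytes
        self.used = 0

    def charge(self, section: str, nbytes: int) -> None:
        self.used += nbytes
        if self.used > self.cap_bytes:
            # Fail mid-parse, naming the section and the cap.
            raise RuntimeError(
                f"section {section!r} exceeds max_index_bytes={self.cap_bytes}"
            )
```

Calling `charge` as each piece of a section is built, rather than once at the end, is what lets an oversized section abort before its whole subtree exists in memory.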

Extract rules per format

Each section declares how the reader locates its payload:

Format	`extract:` key	Value
XML	`xml_path`	Slash-path to the section element, e.g. `/doc/Head`
JSON	`json_pointer`	RFC 6901 pointer — empty (whole document) or leading `/`, e.g. `/Head`
EDIFACT	`segment`	A service-segment tag — only `UNB`
X12	`segment`	A service-segment tag — only `ISA` (GS/ST surface as nested levels)
HL7 v2	`segment`	A header-segment tag — only `FHS` (BHS/MSH surface as nested levels)
Multi-record CSV / fixed-width	`record_type`	A header record-type tag, e.g. `H`

xml_path and the source-level record_path option are both slash-paths over XML but root differently: xml_path tolerates a leading / (/doc/Head is its documented form), while record_path rejects one. They locate different things and are deliberately not aligned — see XML Format → record_path and xml_path root differently.

Declaring an xml_path section against a JSON source (or vice versa), a segment extract against XML/JSON, a record_type extract against any format other than multi-record CSV / fixed-width, or any envelope: block at all on a plain (single-schema) CSV or fixed-width source is a configuration error that fails fast rather than silently producing empty sections.

A json_pointer must be a valid RFC 6901 pointer: either empty ("", naming the whole document) or a /-introduced path such as /Head or /batch/summary. A slashless value like Head is a typo — it would decode to zero segments and silently match the root document — so it is rejected at validation rather than resolving to the wrong metadata.
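
The validation rule follows directly from RFC 6901 and can be sketched in Python. The function names are hypothetical and the error text is invented; the escape handling (~1 for /, ~0 for ~, decoded in that order) is the standard’s.

```python
def parse_json_pointer(pointer: str) -> list[str]:
    """Split an RFC 6901 pointer into decoded reference tokens."""
    if pointer == "":
        return []  # the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"invalid json_pointer {pointer!r}: must be empty or start with '/'")
    # Decode ~1 before ~0, as RFC 6901 requires.
    return [seg.replace("~1", "/").replace("~0", "~") for seg in pointer[1:].split("/")]


def resolve_pointer(document, pointer: str):
    """Walk a parsed JSON document along the pointer."""
    node = document
    for token in parse_json_pointer(pointer):
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node
```

Rejecting the slashless form up front is what prevents Head from being treated like the empty pointer and quietly returning the root document.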

A plain (single-schema) CSV or fixed-width source carries no envelope — there is no header/trailer structure to extract. Declaring envelope: sections on one is a configuration error (E356): the sections would never be populated and every $doc.<section>.<field> against the source would resolve to null, so the compiler rejects it rather than accepting an inert declaration. Envelope extraction on a flat file — the record_type extract — applies only to a multi-record source, one declaring a discriminator: + records: block; declare that schema if the file genuinely carries header/trailer records.

Network (REST) sources carry no `$doc` context

A rest source pulls its records page by page over HTTP. It has no single buffered document with head and tail sections, so it carries no envelope context. Envelope sections are a file-document concept, so the compiler rejects them on a REST source rather than letting them silently resolve to null:

Declaring an envelope: block on a rest source is an error (E349) — the declaration would be inert.
Reading $doc.<section>.<field> from a node fed by a rest source is an error (E349) — the access can never resolve.

Pull the document-level metadata into record fields through the API’s own response shape (record_path, split_to_rows) instead, so the value travels as a normal field rather than as document-envelope context.

EDIFACT `segment` extract

An EDIFACT source exposes its interchange header UNB as an envelope section. The section’s field names are the positional element keys e01, e02, … :

envelope:
  sections:
    interchange:
      extract: { segment: "UNB" }
      fields:
        e05: string          # interchange control reference

Only the UNB header is extractable. Trailer segments (UNT, UNZ) that arrive after the body are not envelope sections — their control counts are validated by the reader instead. A mismatch between a trailer’s declared count and the body the reader streamed is a structural-integrity failure: by default it aborts the run, and under a source’s dlq_granularity: document opt-in it dead-letters the whole file to the DLQ (see Malformed envelopes). A segment extract naming any tag other than UNB is rejected at startup. See EDIFACT Format for the full reference.

Multi-record `record_type` extract

A multi-record CSV or fixed-width source — one that declares a discriminator: and a records: list — exposes a header record type as an envelope section through the record_type extract. The tag names which of the source’s declared record types carries the section’s payload; the matched header row’s named fields become the section’s fields.

schema:
  discriminator: { start: 0, width: 1 }
  records:
    - { id: header,  tag: H, columns: [ { name: batch_id, type: string, start: 1, width: 9 } ] }
    - { id: detail,  tag: D, columns: [ { name: amount,   type: int, start: 1, width: 9 } ] }
    - { id: trailer, tag: T, columns: [ { name: count,    type: int, start: 1, width: 9 } ] }
  structure:
    - { record: trailer, count: count }
envelope:
  sections:
    head:
      extract: { record_type: H }       # the H header row → $doc.head.*
      fields:
        batch_id: string

Only a header record type — one whose rows precede the body at the file head — is extractable as a $doc section; the reader captures the first such row in a bounded pre-scan and excludes it from the body stream. A trailer record type (one named by a structure: constraint) arrives after the body it closes, so it is not an envelope section — its declared count is validated against the streamed body count instead, the same structural-integrity check the EDI trailers use. See CSV and Fixed-Width for the full reference.
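To make the header capture and the trailer count check concrete, here is a minimal Python sketch over the H/D/T layout declared above (discriminator at offset 0, one 9-character column at offset 1). It illustrates the described behavior and is not the engine's reader: `read_multi_record` is a hypothetical name, it reads the whole input rather than pre-scanning, and ignoring any later H row is an assumption.

```python
def read_multi_record(lines):
    """Return (head_section, body_rows); raise ValueError on a structural failure."""
    head, body, declared = None, [], None
    for line in lines:
        tag = line[0:1]                # discriminator: { start: 0, width: 1 }
        payload = line[1:10].strip()   # every column: start 1, width 9
        if declared is not None:
            raise ValueError("record after the trailer")
        if tag == "H":
            if head is None:           # only the first header row is captured
                head = {"batch_id": payload}
        elif tag == "D":
            body.append({"amount": int(payload)})
        elif tag == "T":
            declared = int(payload)    # the trailer's declared body count
        else:
            raise ValueError(f"unknown record-type discriminator {tag!r}")
    if declared is not None and declared != len(body):
        raise ValueError(
            f"trailer declares {declared} records, body streamed {len(body)}"
        )
    return head, body


# Hypothetical sample file: one header, two details, a matching trailer.
sample = [f"H{'RUN-001':<9}", f"D{10:>9}", f"D{20:>9}", f"T{2:>9}"]
head, body = read_multi_record(sample)
```

Here `head` is the value a record would see as `$doc.head.*`, and `body` holds only the detail rows; the header row never appears in the body stream.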

A JSON example:

sources:
  - name: payments
    path: data/payments.json
    format: json
    record_path: records
    envelope:
      sections:
        Head:
          extract: { json_pointer: "/Head" }
          fields:
            batch_id: string
        Foot:
          extract: { json_pointer: "/Foot" }
          fields:
            count: int

against:

{
  "Head": { "batch_id": "RUN-001" },
  "records": [ { "amount": 10 }, { "amount": 20 } ],
  "Foot": { "count": 2 }
}

Typed fields

Each section’s fields: map declares the field name and its type, drawn from the same small vocabulary as source schemas: string, int, float, bool, date, date_time. The extracted raw value is coerced to the declared type at pre-scan time; a value that cannot coerce (e.g. a non-numeric string declared int) fails the source with a diagnostic naming the section, field, and offending value.

A field that the document does not carry resolves to null — $doc.* follows the same missing-value convention as $source.* and $pipeline.*. A section that the document does not carry at all is simply absent from the context; any $doc.<missing_section>.<field> resolves to null.
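The coercion and missing-value rules above can be sketched as follows. This is a hedged Python illustration, not Clinker's code: the accepted textual forms (ISO 8601 dates, `true` / `false` booleans) are assumptions, and `coerce_section` and its error wording are hypothetical.

```python
from datetime import date, datetime


def _to_bool(raw):
    # Assumed textual forms; the real accepted spellings are not specified here.
    if isinstance(raw, bool):
        return raw
    text = str(raw).strip().lower()
    if text in ("true", "false"):
        return text == "true"
    raise ValueError(f"not a bool: {raw!r}")


COERCERS = {
    "string": str,
    "int": int,
    "float": float,
    "bool": _to_bool,
    "date": lambda raw: date.fromisoformat(str(raw)),
    "date_time": lambda raw: datetime.fromisoformat(str(raw)),
}


def coerce_section(section, declared, raw):
    """Coerce one section's raw values; a field the document lacks becomes None."""
    out = {}
    for field, type_name in declared.items():
        if raw.get(field) is None:
            out[field] = None
            continue
        try:
            out[field] = COERCERS[type_name](raw[field])
        except (ValueError, TypeError) as exc:
            # The diagnostic names the section, the field, and the offending value.
            raise ValueError(
                f"section {section!r}, field {field!r}: "
                f"cannot coerce {raw[field]!r} to {type_name}"
            ) from exc
    return out
```

Calling `coerce_section("Foot", {"count": "int"}, {"count": "2"})` returns the integer 2 under `count`, while a raw value of `"abc"` raises with the section and field in the message.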

Declared-path validation

Every $doc.<section>.<field> reference is cross-checked at compile time against the schema the feeding source’s reader will actually serve. A reference that can never resolve — almost always a typo — is rejected at compile time, pointing at the node that made it, rather than resolving silently to null. How the path is checked depends on the source:

Closed-schema sources (XML, JSON) — the envelope: block is the complete schema: the reader extracts exactly the sections and fields it declares. A reference naming a section the source does not declare, or a field the declared section does not declare, is rejected with error E341 ($doc.Summry.total against a declared Summary). Run clinker explain --code E341 for the full write-up.

Multi-record CSV / fixed-width sources — a source whose schema declares discriminator: + records: exposes a header record type as a $doc section through the record_type extract. The reader coerces the matched header record’s columns through the section’s declared fields: and serves exactly those fields, so the section is closed just like an XML/JSON one: a reference naming an undeclared section, or a field the section does not declare, is rejected with error E341. (A plain single-schema CSV / fixed-width source has no such structure — declaring an envelope: on one is rejected with error E356.)

Segment/positional sources (X12, EDIFACT, HL7) — the file-level header (ISA/UNB/FHS) is declared through envelope: and is closed, but the reader also synthesizes nested envelope levels the config never names — X12’s functional_group / transaction_set, HL7’s batch / transaction_set — keyed by positional eNN / fNN elements bounded by the source’s max_elements / max_fields. A $doc path is checked against that synthesized vocabulary plus any section/field you declared, so a legitimate wire-derived path ($doc.functional_group.e06) is accepted while a misspelled section ($doc.functonal_group.e06) or an out-of-range positional element ($doc.transaction_set.e99) is rejected with error E348. Run clinker explain --code E348 for the full write-up.

REST sources carry no document and reject $doc outright — see Network (REST) sources carry no $doc context above (E349).

Plain (single-schema) CSV / fixed-width and SWIFT MT sources are not statically checked. A plain flat file synthesizes no $doc sections, and SWIFT serves declared sections under user-chosen or default block labels — neither fits the closed or positional model cleanly. (A multi-record CSV / fixed-width source is checked — see above.)
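For the closed-schema case (XML, JSON, and multi-record flat files), the check amounts to a two-level lookup against the declared `envelope:` sections. The Python sketch below illustrates only that idea; `check_doc_path` and the message text are hypothetical, and the positional check behind E348 is not modeled.

```python
def check_doc_path(declared, section, field):
    """Return None when $doc.<section>.<field> is declared, else an E341-style message."""
    if section not in declared:
        return f"E341: $doc.{section}.{field}: section {section!r} is not declared"
    if field not in declared[section]:
        return (
            f"E341: $doc.{section}.{field}: "
            f"field {field!r} is not declared in section {section!r}"
        )
    return None


# Declared sections and their typed fields, as an envelope: block would list them.
declared = {"Summary": {"total": "int"}}
```

Against this declaration, `check_doc_path(declared, "Summary", "total")` passes, and the misspelled `Summry` is reported rather than resolving to null.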

Indexed `$doc` access

A section field that holds an array or a map can be indexed inline, the same way any record value is — see Nested paths for the full bracket-index reference. Integer indices select array elements; string keys select map entries; the two compose into a chain:

project:
  - first_line:   $doc.Header.line_items[0]      # array element
  - run_date:     $doc.Header.meta["run_date"]   # map entry
  - first_sku:    $doc.Header.line_items[0]["sku"]  # array-of-maps chain

An out-of-range array index or a missing map key resolves to null — it never errors or panics, and a null mid-chain short-circuits the rest of the chain to null rather than failing. This is the same missing-value convention $doc.<missing_field> and $source.* follow.
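The null-propagation rule can be pictured with a small Python sketch. It is an illustration only: `index_chain` is a hypothetical helper, and treating a negative index as out of range is an assumption.

```python
def index_chain(value, indices):
    """Apply a chain of literal indices; any miss short-circuits to None."""
    for idx in indices:
        if value is None:
            return None                     # null mid-chain ends the chain
        if isinstance(idx, int) and isinstance(value, list):
            value = value[idx] if 0 <= idx < len(value) else None
        elif isinstance(idx, str) and isinstance(value, dict):
            value = value.get(idx)          # missing key resolves to None
        else:
            return None                     # index kind does not fit the value
    return value


# Hypothetical section content standing in for $doc.Header.
header = {
    "line_items": [{"sku": "A1"}],
    "meta": {"run_date": "2024-01-31"},
}
first_sku = index_chain(header["line_items"], [0, "sku"])
missing = index_chain(header["line_items"], [5, "sku"])
```

`first_sku` holds the first item's SKU, and `missing` is `None` because the out-of-range index 5 nulls the rest of the chain instead of raising.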

Only literal paths are readable

Every index segment must be a literal — a constant integer or string written in the program text. The section and field are always literal identifiers (the grammar requires it), so a literal index is the last piece a reader needs to know, before reading any input, exactly which envelope paths a run will consume. The pre-scan extracts precisely those statically-resolvable paths and nothing else.

A computed index — one derived from runtime data, such as $doc.Header.line_items[row_index] — is not statically resolvable: the reader cannot pre-scan a row-dependent element. clinker rejects it at compile time with a diagnostic pointing at the offending index, rather than reading it at run time. Use a literal index, or pull the value into a record field upstream and index that instead.

This composes with the declared-path rule above: a $doc read of a section or field the source does not declare is a compile-time error (E341), and a $doc read with a computed index is a compile-time diagnostic; neither reaches run time as a silent null. Only a literal path over a declared section is pre-scanned and readable.

One document per file

Each source file is its own document with its own envelope context. When a source matches multiple files (via glob: / paths:), each file gets a fresh document context with its own section values. Records from different files never share a context — a record’s $doc.* always reflects the file that record came from.

Document boundaries flow through the pipeline so that document-scoped operators fire at exactly the right point. A document-scoped operator fires exactly once per document, even when that document arrives across several inputs that a Merge or Combine brings together.

Per-document aggregation

A grouped or global Aggregate reading a multi-document source produces one set of grouped rows per document, not a single aggregate spanning every document. When a document closes, the Aggregate finalizes and emits the groups belonging to that document, then drops their state before the next document accumulates — so a glob: source over twelve monthly files through a group_by Aggregate yields twelve independent monthly roll-ups, and only one document’s groups are ever live at once (the others have already been emitted and freed, or have not yet started).
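A minimal Python sketch of that finalize-per-document behavior, assuming records arrive document by document and carry a `document` field standing in for the engine's document context (both are assumptions made for illustration; `per_document_sums` is a hypothetical name):

```python
from itertools import groupby


def per_document_sums(records, key, value):
    """Yield (document, group, total), finalizing at each document boundary."""
    for document, rows in groupby(records, key=lambda r: r["document"]):
        totals = {}                        # state for the one open document
        for row in rows:
            totals[row[key]] = totals.get(row[key], 0) + row[value]
        for group, total in totals.items():
            yield document, group, total   # emit, then the state is dropped


# Hypothetical two-file input.
records = [
    {"document": "2024-01.csv", "region": "east", "amount": 10},
    {"document": "2024-01.csv", "region": "east", "amount": 5},
    {"document": "2024-02.csv", "region": "east", "amount": 7},
]
rollups = list(per_document_sums(records, "region", "amount"))
```

The result keeps January and February separate: the `east` group appears once per file, never as a single total spanning both.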

This applies only when a document boundary actually reaches the Aggregate. A plain single-file source is one document, so it still emits one aggregate. A Merge that combines several distinct single-document sources flushes those sources independently downstream — one roll-up per source document, exactly as feeding each source to its own Aggregate would. This holds for every Merge mode.

A Combine (join) preserves document boundaries on every strategy, so a per-document Aggregate downstream of a join also rolls up per driver document.

Nested (multi-level) envelopes

Some formats wrap their records in several envelope levels, one inside another. EDI X12 is the canonical example and the first format that implements this: an interchange (ISA/IEA) contains one or more functional groups (GS/GE), each containing one or more transaction sets (ST/SE), each containing the records. A single file can carry multiple interchanges back to back. See X12 Format for the full reference.

HL7 v2 is the second multi-level format: an optional file (FHS/FTS) contains optional batches (BHS/BTS), each containing one or more messages (MSH..), each containing the segment records. The tiers map onto the same nested levels — the FHS file header is a declared segment: "FHS" section, while the BHS batch and the MSH message surface automatically as the reader-supplied sections batch and transaction_set. Every tier is optional, and a level’s section exists only when its header segment is present in the input — the reader synthesizes a batch section only where it reads a BHS, and the message level only where it reads an MSH.

A bare MSH-led stream — messages with no BHS batch and no FHS file wrapper — therefore opens only the message level. $doc.transaction_set.* resolves against each message, but $doc.batch.* resolves to null: no BHS was read, so no batch section was synthesized. Whether a file carries a BHS batch wrapper is a property of the input bytes, not the config, so a $doc.batch.* path is never a compile error — it follows the same missing-value convention an absent $doc section follows (resolve null, never error), and populates as soon as a BHS wrapper is present:

# Bare MSH stream (no BHS/FHS) — only the message level opens:
- msg_type: $doc.transaction_set.f08   # MSH-9 message type — populated
- batch_id: $doc.batch.f01             # null — no BHS, so no batch tier

# BHS-wrapped stream — the BHS opens a batch level, so both populate:
- msg_type: $doc.transaction_set.f08   # MSH-9 message type — populated
- batch_id: $doc.batch.f01             # BHS field — now populated

See HL7 v2 Format for the full reference.

A reader for such a format opens and closes each nested level as it crosses the corresponding envelope boundary mid-file. Each level contributes its own sections to $doc. There is no new $doc syntax for nesting — every level’s sections are read through the same two-level $doc.<section>.<field> lookup. A record inside the innermost level sees every enclosing level’s sections at once. For X12 the interchange header is a declared segment: "ISA" envelope section (you choose its name), while the GS group and ST set surface automatically as the reader-supplied sections functional_group and transaction_set, each keyed by positional eNN elements:

project:
  - interchange_control: $doc.interchange.e13        # ISA13, declared section
  - functional_id:       $doc.functional_group.e01   # GS01 (reader-supplied)
  - transaction_type:    $doc.transaction_set.e01     # ST01 (reader-supplied)
  - claim_amount:        amount                       # body field

A record streamed inside the ST level resolves the ST section, the enclosing GS section, and the outermost ISA section, all at once: each inner level inherits every enclosing level’s sections as siblings in one flat namespace. If two levels declare a section with the same name, the innermost wins for records inside it — the same shadowing rule a nested scope follows in any language. Picking distinct per-level names (as above) keeps every level independently visible.
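The flat namespace with innermost-wins shadowing behaves like a chain of dictionaries searched from the innermost level outward. A hedged Python sketch using `collections.ChainMap` follows; the section values are made-up sample data and `doc_lookup` is a hypothetical helper, not an engine API.

```python
from collections import ChainMap

# One mapping of section name -> fields per envelope level (sample values).
interchange = {"interchange": {"e13": "000000905"}, "ctl": {"level": "ISA"}}
group = {"functional_group": {"e01": "HC"}}
tx_set = {"transaction_set": {"e01": "837"}, "ctl": {"level": "ST"}}

# Innermost level first, so a same-named section shadows the outer one.
doc = ChainMap(tx_set, group, interchange)


def doc_lookup(ctx, section, field):
    """Two-level $doc.<section>.<field> lookup; anything absent resolves to None."""
    fields = ctx.get(section)
    return None if fields is None else fields.get(field)
```

A record inside the ST level sees all three levels at once, and `doc_lookup(doc, "ctl", "level")` returns the ST value because both ISA and ST declare a section named `ctl`.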

The reader-supplied default names are not the only option: an X12 source can name the GS and ST levels itself and give each a typed field schema, so a nested level is addressable under a chosen name with coerced fields exactly like the declared ISA section. The declaration lives on the source’s options (not the envelope: block, which is reserved for the pre-scannable file-level header), and each nested level is named independently:

type: x12
options:
  group_section:
    name: functional_group       # your choice — the engine reserves no name
    fields: { e06: int }         # GS06 group control number, typed
  set_section:
    name: transaction_set        # your choice
    fields: { e01: int }         # ST01 transaction-set id, typed

Omit a level’s declaration and it falls back to its reader-supplied default name keyed by untyped positional eNN strings. See X12 Format for the full reference.

Boundaries nest correctly through the pipeline: each level opens before the records inside it and closes after them, in strict innermost-first order. A level that arrives across several branches is still handled once where a Merge or Combine brings those branches together — exactly like a single-level document.

Header-only interchanges

A multi-level envelope file can legitimately carry an interchange whose body is empty — envelope structure (an interchange header, and possibly inner group headers) with zero records inside. Such an interchange still opens a document and emits its open/close boundaries, so downstream operators and trailer-count validation observe it just like any other document. The interchange’s $doc.* sections are extracted and the boundaries flow even though no body record ever streams from it.

The same holds for an empty inner envelope — an open/close pair with no records between — and for an inner envelope that opens or closes after the file’s last body record. Every envelope boundary a reader signals is applied, whether or not a record follows it, so the document frame stays balanced end to end.

Error Handling & DLQ

Clinker provides structured error handling with a dead-letter queue (DLQ) for records that fail processing. The error_handling: block at the top level of the pipeline YAML controls the behavior.

Configuration

error_handling:
  strategy: continue
  dlq:
    path: "./output/errors.csv"
    include_reason: true
    include_source_row: true

Strategies

error_handling.strategy is pipeline-wide – it is set once at the top level, not per node. It controls what happens when a record fails:

Strategy	Behavior	Exit code
`fail_fast`	Default. Abort the run on the first record failure.	Non-zero, by the class of the aborting error (`3` for an evaluation failure, `4` for an I/O failure – see Exit Codes)
`continue`	Route the failing record to the DLQ and keep processing.	`2` if any record was dead-lettered, `0` otherwise

There are exactly two, because the engine makes exactly one decision at each record failure: propagate it and stop, or dead-letter it and carry on.

fail_fast

The safest strategy. Any record-level error (type coercion failure, validation error, missing required field) halts the pipeline immediately, with a non-zero exit and no DLQ file. Use this when data quality is critical and you prefer to fix issues before reprocessing.

Some failures abort the run under either strategy, because they are not record-scoped: an unwritable output path, a config or CXL compile error, and the DLQ-rate ceiling (dlq.max_rate, E315/E316) all end the run regardless of the strategy.

An executor invariant failure also aborts under either strategy with exit code 1. In particular, if a planned materialized input is unavailable when its consumer runs, Clinker stops instead of treating that input as a legitimate zero-row result. The message names the consuming node and planned producer (including the producer port when applicable) and says the input was not treated as empty. A source or stage that really emits zero rows remains valid; it carries an explicit empty buffer and completes normally. Report any missing-input internal error as an engine defect rather than routing it to the DLQ.

continue

The production workhorse. Bad records are written to the DLQ file with diagnostic metadata, and the pipeline continues processing remaining records. After the run completes, inspect the DLQ to understand and correct failures.

A pipeline that completes with DLQ entries exits with code 2 – this signals “pipeline completed successfully but some records were rejected.” It is not a crash or internal error. A continue run that dead-letters nothing exits 0, exactly like a clean fail_fast run.

Migrating from best_effort. The removed best_effort spelling was a third name for the continue behavior: it wrote the same DLQ entries and produced the same exit code, because the runtime never distinguished the two. Replace it with strategy: continue. A pipeline still carrying best_effort is rejected at config-validation time with a message naming the replacement.

DLQ configuration

The DLQ is always written as CSV, regardless of the pipeline’s input/output formats.

  dlq:
    path: "./output/errors.csv"
    include_reason: true
    include_source_row: true

Field	Required	Default	Description
`path`	No	–	File path for DLQ output. If omitted, DLQ records are logged but not written to file.
`include_reason`	No	–	Include `_cxl_dlq_error_category` and `_cxl_dlq_error_detail` columns.
`include_source_row`	No	–	Include original source fields alongside DLQ metadata.

DLQ columns

Every DLQ record includes these metadata columns:

Column	Description
`_cxl_dlq_id`	UUID v7 (time-ordered unique identifier)
`_cxl_dlq_timestamp`	RFC 3339 timestamp of when the error occurred
`_cxl_dlq_source_file`	Input filename that produced the failing record
`_cxl_dlq_source_row`	1-based row number in the source file
`_cxl_dlq_stage`	Name of the transform or aggregate node where the error occurred
`_cxl_dlq_route`	Route branch name (if the error occurred after routing)
`_cxl_dlq_trigger`	Validation rule name that triggered the rejection

When include_reason: true is set, two additional columns appear:

Column	Description
`_cxl_dlq_error_category`	Machine-readable error classification
`_cxl_dlq_error_detail`	Human-readable error description

Error categories

The _cxl_dlq_error_category column contains one of these values:

Category	Description
`missing_required_field`	A required field is absent from the record
`type_coercion_failure`	A value could not be converted to the expected type
`required_field_conversion_failure`	A required field exists but its value cannot be converted
`nan_in_output_field`	A computation produced NaN
`aggregate_type_error`	An aggregate function received an incompatible type
`validation_failure`	A declarative validation check failed
`aggregate_finalize`	An aggregate function failed during finalization
`correlated`	A non-failing record was DLQ’d as collateral because another record in its correlation group failed
`group_size_exceeded`	A correlation-key group exceeded the configured `max_group_buffer` limit
`document_rejected`	A non-failing record was DLQ’d as collateral because another record in its document failed under a source’s `dlq_granularity: document` policy
`late_record`	A record arrived at a time-windowed aggregate after its event-time window had already closed
`expansion_limit_exceeded`	A transform’s `emit each` fan-out produced more output records than its `max_expansion` ceiling allows
`combine_output_row`	A Combine output-stage eval failed for one driver row (probe-key, residual, or matched / `on_miss: null_fields` body); the entry carries the contributing-build lineage and rewinds both the driver and matched build source’s rollback cursor. Routed to the DLQ under `continue` across every Combine join mode; `fail_fast` propagates the eval error
`structural_validation`	An envelope trailer’s declared count did not match the body the reader streamed (X12 `SE`/`GE`/`IEA`, EDIFACT `UNT`/`UNZ`, HL7 `BTS`/`FTS`, a multi-record flat-file trailer), or a multi-record flat file broke a non-count structural rule (an unknown record-type discriminator, a body record after the trailer). The root-cause `trigger: true` entry for a malformed document rejected under a source’s `dlq_granularity: document` policy; every already-streamed record of the same file is a `document_rejected` collateral
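Because the DLQ is plain CSV with a machine-readable category column, a post-run triage script can stay small. The Python sketch below counts entries per category; it assumes `include_reason: true` was set, and the sample rows and their column order are invented for illustration.

```python
import csv
import io
from collections import Counter


def summarize_dlq(csv_text):
    """Count DLQ entries per _cxl_dlq_error_category."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["_cxl_dlq_error_category"] for row in reader)


# Hypothetical DLQ content; a real file carries more metadata columns.
sample = (
    "_cxl_dlq_source_row,_cxl_dlq_error_category,_cxl_dlq_error_detail\n"
    "3,validation_failure,Customer email is required\n"
    "7,type_coercion_failure,amount is not a float\n"
    "9,validation_failure,Order amount must be positive\n"
)
counts = summarize_dlq(sample)
```

For a file on disk, the same function can be fed the result of reading the path configured under `dlq.path`.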

Advanced options

Type error threshold

Abort the pipeline if the fraction of failing records exceeds a threshold:

  type_error_threshold: 0.05    # Abort if >5% of records fail

This acts as a circuit breaker – if your input data is unexpectedly corrupt, the pipeline stops early rather than filling the DLQ with millions of entries.

Correlation key

Group DLQ rejections by a key field. When any record in a correlation group fails, records from the failing source’s contribution to that group are routed to the DLQ:

  correlation_key: order_id

For compound keys:

  correlation_key: [order_id, customer_id]

This is useful for transactional data where partial processing of a group is worse than rejecting the entire group. For example, if one line item in an order fails validation, you may want to reject the entire order.
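The whole-group rejection can be sketched in Python for the single-source case. This is an illustration, not the engine's algorithm: `split_by_correlation` and the `is_valid` callback are hypothetical, and buffering limits are ignored.

```python
def split_by_correlation(records, key_fields, is_valid):
    """Return (accepted, rejected); a group is rejected whole if any member fails."""
    groups = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)   # supports compound keys
        groups.setdefault(key, []).append(rec)
    accepted, rejected = [], []
    for members in groups.values():
        ok = all(is_valid(r) for r in members)
        (accepted if ok else rejected).extend(members)
    return accepted, rejected


# Hypothetical order lines: order 1 has one bad line, order 2 is clean.
lines = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": -4.0},
    {"order_id": 2, "amount": 3.5},
]
accepted, rejected = split_by_correlation(
    lines, ["order_id"], lambda r: r["amount"] > 0
)
```

Both lines of order 1 land in `rejected`, including the one that passed validation on its own, while order 2 is accepted.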

Under multi-source ingest, the collateral fan-out narrows to the failing source: a src_b trigger does NOT DLQ records from src_a that share the same correlation key. Single-source pipelines see bit-identical behavior to today’s pipeline-wide collateral DLQ. See Per-source rollback narrowing for the full semantic and the two documented exceptions (max_group_buffer overflow and Combine output failures).

For the full lifecycle and per-operator semantics (route, merge, aggregate, combine), see Correlation Keys.

Max group buffer

Limit the number of records buffered per correlation group:

  max_group_buffer: 100000     # Default: 100,000

Groups exceeding this limit are DLQ’d entirely with a group_size_exceeded summary entry.

Document-level DLQ

By default a record failure dead-letters only that record (dlq_granularity: record). A source can instead reject an entire document when any of its records fails; the granularity is declared per source:

nodes:
  - type: source
    name: claims
    config:
      name: claims
      type: x12
      glob: ./claims/*.edi
      dlq_granularity: document   # record (default) | document

Under dlq_granularity: document and the continue strategy, when any record of a document fails:

the failing record becomes the root-cause DLQ entry (_cxl_dlq_trigger = true, carrying its original error category);
every other record of the same document becomes a collateral entry (_cxl_dlq_trigger = false, category document_rejected);
no record of that document reaches the success sink.

Clean documents in the same run stream through untouched, and records from sibling sources still on the default record granularity keep per-record semantics — the policy is per source.
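The trigger and collateral split listed above can be sketched in Python for one document's records. It is a hedged illustration: `document_dlq` and `error_of` are hypothetical names, and marking every failing record as a trigger (rather than only the first) is an assumption.

```python
def document_dlq(records, error_of):
    """Split one document's records into (sink_rows, dlq_entries).

    error_of(record) returns an error category, or None for a clean record.
    """
    errors = [error_of(r) for r in records]
    if all(e is None for e in errors):
        return list(records), []            # clean document streams through
    entries = []
    for rec, err in zip(records, errors):
        entries.append({
            "record": rec,
            "_cxl_dlq_trigger": err is not None,
            "_cxl_dlq_error_category": err if err is not None else "document_rejected",
        })
    return [], entries                      # nothing reaches the success sink


# Hypothetical document with one failing record.
doc_rows = [{"amount": 5}, {"amount": -1}, {"amount": 9}]
sink, dlq = document_dlq(
    doc_rows, lambda r: "validation_failure" if r["amount"] <= 0 else None
)
```

Here `sink` is empty and `dlq` carries three entries: one trigger with its original category and two `document_rejected` collaterals.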

This is the document-shaped analogue of correlation keys: use it when partial processing of a document (an EDI interchange, a batch file with a header/trailer) is worse than rejecting the whole document. Unlike correlation keys, which group across files by a key value, document-level DLQ scopes rejection to a single document’s records.

Document grain. The document is the outermost level: the source file. For a flat format (CSV, JSON, plain XML) each input file is one document. For a nested-envelope format (an X12 ISA → GS → ST interchange, an EDIFACT UNB → UNG → UNH) the document is the whole interchange / file, not an inner functional group or transaction set: a failure anywhere in the interchange rejects the entire interchange, including the transaction sets that validated cleanly. Rejecting at an inner-level grain is not currently offered; the grain is fixed at the file. To get finer rejection, partition the input so each interchange is its own file.

DLQ rate. Every emitted entry — the trigger and each collateral — counts toward the DLQ-rate denominator the type error threshold circuit breaker measures, matching the correlated-collateral precedent. A rejected 1000-record document contributes 1000 entries, so size any type_error_threshold with whole-document rejection in mind.
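A quick worked calculation shows why whole-document rejection matters when sizing the threshold. The Python sketch assumes the rate is DLQ entries divided by records processed and that the breaker trips when the rate strictly exceeds the threshold; both are assumptions about the exact formula, and the function names are hypothetical.

```python
def dlq_rate(dlq_entries, total_records):
    """Assumed rate: dead-lettered entries over records processed."""
    return dlq_entries / total_records if total_records else 0.0


def breaker_trips(dlq_entries, total_records, threshold):
    """True when the assumed rate strictly exceeds the threshold."""
    return dlq_rate(dlq_entries, total_records) > threshold


# One rejected 1000-record document in runs of two different sizes.
small_run = breaker_trips(1000, 10_000, 0.05)   # rate 0.10
large_run = breaker_trips(1000, 50_000, 0.05)   # rate 0.02
```

Under these assumptions a single bad document is enough to abort the 10,000-record run at a 5% threshold, while the 50,000-record run absorbs it.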

Memory. The engine buffers each open document’s records until its boundary, then flushes the document clean to the sink or rejects it and drops the buffer. Peak memory scales with the concurrently-open documents, not the total input; a single very large document spills its buffer to disk under the run’s memory budget rather than holding everything in RAM. See Streaming vs blocking for the spill model.

Output restriction. Document-level DLQ flushes each whole document to a single output writer, so it cannot be combined with a per-source-file output (a {source_file} / {source_path} path template over a multi-file source). The two are rejected together at compile time (E343); use a single output path, or set dlq_granularity: record if per-file output is the requirement.

Strategy requirement. dlq_granularity: document requires error_handling.strategy: continue. It is incompatible with the default fail_fast: document-level dead-lettering keeps the run going past a bad document, which contradicts fail-fast’s abort-on-first-error. The combination is rejected at compile time (E344) — set strategy: continue to dead-letter bad documents, or keep fail_fast with the default dlq_granularity: record.

Spilling stages. Document identity survives memory pressure end to end. The per-document buffer identifies each document before buffering and spills under the memory budget, and a blocking stage (Sort, hash Aggregate, grace-hash Combine) between the source and the output preserves each record’s document context — including the source file the grain keys on — across its own spill round-trip. A document whose records pass through a spilling stage is therefore still grouped and rejected as one document under memory pressure, exactly as it would be in memory.

Malformed envelopes (structural validation)

Envelope formats carry their own structural-integrity claims: an X12 interchange declares a segment count in each SE/GE/IEA trailer, EDIFACT in each UNT/UNZ, HL7 batch/file in each BTS/FTS, and a multi-record flat file’s trailer record declares a body count via its structure: constraint. When the declared count does not match the body the reader actually streamed, the file is structurally invalid. A multi-record flat file can also break a non-count structural rule — a line whose record-type discriminator matches no declared records: entry (E345), or a body record appearing after the trailer that closes the document; these are classified separately from a count mismatch but carry the same disposition.

Under dlq_granularity: document, such a structural failure dead-letters the whole source file to the DLQ rather than aborting the run:

the file’s records dead-letter as one structural_validation root-cause entry (_cxl_dlq_trigger = true) plus a document_rejected collateral for every other already-streamed record of the file;
no record of the malformed file reaches the success sink.

nodes:
  - type: source
    name: claims
    config:
      name: claims
      type: x12
      glob: ./claims/*.edi
      dlq_granularity: document   # reuse the document opt-in — no separate config

The opt-in is the same dlq_granularity: document knob that governs per-record document rejection above; there is no separate validation: block. A malformed envelope is simply one more reason a source under the document policy condemns a whole document.

Honest timing — rejected at the sink boundary, not before the first record. The trailer that carries the count arrives at the end of the file, after every body record it counts has already streamed through the DAG. Clinker is a bounded-memory streaming engine — it does not buffer the whole file up front to pre-validate it (that would defeat the streaming model). So the count mismatch is detected mid-stream, the file is marked failed, and the document-level DLQ buffer rejects every already-streamed record of the file at its close. The user-visible outcome is the same — no record of a malformed envelope is ever written to the output — but the rejection lands at the sink boundary, not literally before the file’s first record streams.

Grain — the whole file. An SE-level mismatch (one transaction set inside a larger interchange) rejects the entire interchange / file, not just that one transaction set, because the document grain is the outermost source file (see Document grain above). Split the input so each interchange is its own file if you need finer rejection.

Multiple files keep flowing. When a glob / paths source matches several files and one is malformed, only that file dead-letters — ingestion continues to the remaining files, so the clean files after a bad one still reach the sink. (This is unlike the default record granularity, where a count mismatch aborts the whole run and no file’s records are written.) Dead-lettering one malformed file never silently drops the good files around it.

Default behavior unchanged. Under the default dlq_granularity: record, a structural failure still aborts the run exactly as before — the document-DLQ disposition is strictly opt-in. Genuine corruption (a truncated stream, a bad delimiter, a control-number echo mismatch, a segment after an X12/EDIFACT/HL7 envelope trailer) always aborts, even under the document opt-in: only the trailer-count claims and the multi-record structural failures above are reclassified to the DLQ; structural corruption that makes the stream un-parseable is never silently dead-lettered.

Cryptographic integrity (checksums / signatures) is not yet validated. Envelope formats can also carry a SHA-256 body hash, a JWS-signed JSON payload, or an XML Signature. Clinker extracts these envelope sections but does not yet verify them. Tracked for a future release.

Exit codes

Code	Meaning
0	Pipeline completed successfully, no errors
1	Configuration error – the pipeline never started
2	Pipeline completed, but DLQ entries were produced
3	Data error halted the run: a `fail_fast` evaluation/accumulator failure, or the DLQ-rate ceiling
4	I/O, format, or spill failure

Exit code 2 is not a failure – it means the pipeline ran to completion and handled errors according to the configured strategy. Check the DLQ file for details. See Exit Codes & Error Diagnosis for the full reference and the orchestrator retry policy.
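A wrapper script or orchestrator hook can encode the table directly. The Python sketch below only restates the codes listed above; the helper names are hypothetical and no retry policy is implied.

```python
EXIT_CODES = {
    0: "completed, no errors",
    1: "configuration error, pipeline never started",
    2: "completed, DLQ entries produced",
    3: "data error halted the run",
    4: "I/O, format, or spill failure",
}


def describe_exit(code):
    """Map a process exit code to its documented meaning."""
    return EXIT_CODES.get(code, "unknown exit code")


def run_completed(code):
    """Codes 0 and 2 both mean the pipeline ran to completion."""
    return code in (0, 2)


def needs_dlq_review(code):
    """Only code 2 signals that rejected records are waiting in the DLQ."""
    return code == 2
```

Treating code 2 as a completed run with follow-up work, rather than as a failure, matches the distinction drawn above.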

Complete example

pipeline:
  name: order_processing
  memory: { limit: "512M" }

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: csv
      path: "./data/orders.csv"
      schema:
        - { name: order_id, type: int }
        - { name: customer_id, type: int }
        - { name: amount, type: float }
        - { name: email, type: string }

  - type: transform
    name: validate_orders
    input: orders
    config:
      cxl: |
        emit order_id = order_id
        emit customer_id = customer_id
        emit amount = amount
        emit email = email
      validations:
        - field: email
          check: "not_empty"
          severity: error
          message: "Customer email is required"
        - check: "amount > 0"
          severity: error
          message: "Order amount must be positive"

  - type: output
    name: valid_orders
    input: validate_orders
    config:
      name: valid_orders
      type: csv
      path: "./output/valid_orders.csv"

error_handling:
  strategy: continue
  dlq:
    path: "./output/rejected_orders.csv"
    include_reason: true
    include_source_row: true
  type_error_threshold: 0.10
  correlation_key: order_id

Nodes

A Clinker pipeline is a single flat nodes: list. Every entry carries a type: discriminator that selects a node kind from the unified node taxonomy. There is no separate “join section” or “filter section”: records flow through one homogeneous graph of typed nodes, wired together by input: / inputs:. This part documents the ten record-processing node kinds listed in the table below; an eleventh, Composition, is a call-site that inlines a reusable sub-pipeline and is covered under Pipelines.

The pages in this part are ordered the way data flows through a DAG — a record enters at a Source, fans through the record-level and combining nodes, and leaves at an Output:

Node	Role	Arity	Streaming vs blocking
Source	Reads records from a file or network cursor; the entry point.	0 → 1	Streaming
Transform	Record-level CXL projection, filter, and lookup.	1 → 1	Streaming
Route	Predicate-based fan-out into named branches.	1 → N	Streaming
Merge	Streamwise concatenation of inputs that share a schema.	N → 1	Streaming
Combine	N-ary record combining with mixed predicates (equi + range + arbitrary CXL).	N → 1	Blocking (build side)
Aggregate	Grouped or windowed reduction.	1 → 1	Blocking (or streaming when sorted)
Reshape	Pivot / unpivot between wide and long record shapes.	1 → 1	Blocking
Cull	Per-correlation-group removal on a group-level predicate, with a `removed_to` side-output port.	1 → 2	Blocking
Envelope	Frames a body stream into per-document documents; a composable framing stage.	1 → 1	Streaming
Output	Writes records to a sink; the exit point.	1 → 0	Streaming

Streaming vs blocking

Stateless nodes (Transform, Route, Merge, the Combine probe side, Output) evaluate records one at a time without accumulating per-record state. Blocking nodes (Aggregate, sort, the grace-hash Combine build side) accumulate state inside the RSS budget and spill to disk rather than OOM the process. The Streaming vs. Blocking Stages page in the Operations Guide is the full memory model.

Wiring and naming

Every node needs a unique name: (no dots — the dot is reserved for port syntax). Single-input nodes use input:; Merge and Combine use inputs:. Route branches are consumed downstream as route_name.port. The Pipeline YAML Structure page covers the full wiring grammar, optional fields (description:, _notes:), and strict parsing rules.

Source Nodes

Source nodes read data from files or, with the rest transport, from a paginated HTTP endpoint, and are the entry points of every pipeline. They have no input: field – they produce records, they do not consume them.

Basic structure

- type: source
  name: customers
  config:
    name: customers
    type: csv
    path: "./data/customers.csv"
    schema:
      - { name: customer_id, type: int }
      - { name: name, type: string }
      - { name: email, type: string }
      - { name: status, type: string }
      - { name: amount, type: float }

Schema declaration

The schema: field is required on every source node. Clinker does not infer types from data – you must declare each column’s name and CXL type explicitly. This schema drives compile-time type checking across the entire pipeline.

Each entry is a { name, type } pair:

schema:
  - { name: employee_id, type: string }
  - { name: salary, type: int }
  - { name: hired_at, type: date_time }
  - { name: is_active, type: bool }
  - { name: notes, type: nullable(string) }

Available types

Type	Description
`string`	UTF-8 text
`int`	64-bit signed integer
`float`	64-bit IEEE 754 floating point
`bool`	Boolean (`true` / `false`)
`date`	Calendar date
`date_time`	Date with time component
`array`	Ordered sequence of values
`any`	Unknown type – field used in type-agnostic contexts
`nullable(T)`	Nullable wrapper around any inner type (e.g. `nullable(int)`)

A source column’s declared type must be concrete: numeric — the inference-only int | float union CXL resolves during type unification — is not a valid source column type. Declaring one is rejected at compile with E158; declare int or float explicitly.

`long_unique` — storage hint for high-cardinality text

A string column may carry an optional long_unique: true flag. It is an advisory, opt-in hint, not a type change: it tells Clinker the column’s values are long and effectively unique — never repeated across records — so the run uses less memory for that column. Typical candidates are UUIDs rendered as text, street addresses, and free-text comment or note fields.

schema:
  - { name: ticket_id,  type: string, long_unique: true }   # 36-char UUID
  - { name: notes,      type: string, long_unique: true }   # free text
  - { name: department, type: string }                      # low-cardinality, default

The flag lowers memory use only. A value’s content and its comparison, grouping, join, sort, and output behavior are all unchanged — a long_unique value behaves identically to the same text in any other column. Omitting the flag (the common case) leaves the default behavior untouched. Set it only when you know a column is genuinely high-cardinality free text; on a column whose values repeat, leave it off.

`source_name` — read a differently-named physical column

A column may carry an optional source_name naming the physical input column it reads from, when that differs from the exposed name. The reader matches input fields by physical name and re-labels the value under name, so downstream CXL and the output see name carrying the physical column’s data.

schema:
  # read the physical `cust_id` column, expose it downstream as `customer_id`
  - { name: customer_id, type: string, source_name: cust_id }

Omitting source_name (the common case) reads the input field whose key equals name, unchanged from before. A channel schema patch’s rename op sets this alias automatically (see Channels).
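
The relabelling rule can be pictured with a small sketch. This is an illustrative model, not Clinker's reader: the function name `relabel` is hypothetical, type coercion is ignored, and undeclared input fields are simply skipped rather than handled by on_unmapped.

```python
def relabel(raw: dict, schema: list) -> dict:
    """Map physical input fields onto the exposed column names.

    Each schema entry is a dict with "name" and, optionally, "source_name".
    """
    out = {}
    for column in schema:
        # Fall back to the exposed name when no source_name is declared.
        physical = column.get("source_name", column["name"])
        if physical in raw:
            out[column["name"]] = raw[physical]
    return out
```

Given an input row `{"cust_id": "C-1"}` and the schema entry above, the result carries the value under `customer_id`, which is the only name downstream CXL sees.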

Transport vs format

A source declaration has two independent layers:

Transport (transport:) selects where the records come from. Two transports exist: file — read bytes from the filesystem, resolved through one of the file matchers (path / glob / regex / paths) — and rest — pull records from a paginated HTTP endpoint under a hard page/record cap (see Network Sources (REST)). transport: is optional and defaults to file, so a source that omits it reads from disk exactly as before.
Format (type:) selects how the bytes decode into records: csv, json, xml, fixed_width, edifact, x12, hl7, swift.

- type: source
  name: orders
  config:
    name: orders
    transport: file        # optional; this is the default
    type: csv              # the on-disk format
    path: "./data/orders.csv"
    schema:
      - { name: order_id, type: int }

A file transport requires exactly one file matcher (path, glob, regex, or paths). Declaring none fails validation with E211; declaring more than one fails with E210. Both are reported at config-load time, before any file is opened.
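
The exactly-one rule is easy to state as code. The sketch below is a hedged model of the check, assuming the source config has already been parsed into a dict; the function name `check_file_matchers` is hypothetical, while the matcher keys and the two error codes are the ones named above.

```python
FILE_MATCHERS = ("path", "glob", "regex", "paths")


def check_file_matchers(config: dict):
    """Return the diagnostic code for a source config, or None when it is valid."""
    # transport: is optional and defaults to file; other transports are not checked here.
    if config.get("transport", "file") != "file":
        return None
    declared = [key for key in FILE_MATCHERS if key in config]
    if not declared:
        return "E211"  # no file matcher declared
    if len(declared) > 1:
        return "E210"  # more than one file matcher declared
    return None
```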

Format types

The type: field inside config: selects the on-disk format. Each format has its own reference page covering its options and decoding model:

`type:`	Format	Reference
`csv`	Delimited text (RFC 4180)	CSV Format
`json`	Array / NDJSON / wrapper object	JSON Format
`xml`	Element-path-selected record elements	XML Format
`fixed_width`	Column-positioned legacy extracts	Fixed-Width Format
`edifact`	UN/EDIFACT interchanges	EDIFACT Format
`x12`	ANSI ASC X12 interchanges	X12 Format
`hl7`	HL7 v2.x pipe-and-hat messages	HL7 v2 Format
`swift`	SWIFT MT (FIN) messages	SWIFT MT Format

The same schema: rules apply regardless of format: the reader maps each decoded record onto the declared schema, and undeclared input fields fall under the on_unmapped policy below.

`on_unmapped` — undeclared input fields

The per-source on_unmapped policy decides what to do with input fields the source’s schema: block does not name. Three modes — auto_widen (default), drop, reject:

- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: "./data/orders.csv"
    on_unmapped:
      mode: auto_widen     # default; other values: drop, reject
    schema:
      - { name: order_id, type: string }
      - { name: amount, type: float }

See Auto-Widen & Schema Drift for the full specification: how undeclared columns flow through each downstream node type, the include_unmapped Output flag, E315 merge-policy mismatch, and fixed-width behavior.

Sort order

If your source data is pre-sorted, declare the sort order so the optimizer can use streaming aggregation instead of hash aggregation:

- type: source
  name: sorted_transactions
  config:
    name: sorted_transactions
    type: csv
    path: "./data/transactions_sorted.csv"
    schema:
      - { name: account_id, type: string }
      - { name: txn_date, type: date }
      - { name: amount, type: float }
    sort_order:
      - { field: "account_id", order: asc }
      - { field: "txn_date", order: asc }

Sort order declarations are trusted – Clinker does not verify that the data is actually sorted. If the data violates the declared order, downstream streaming aggregation may produce incorrect results.

The shorthand form is also accepted – a bare string defaults to ascending:

    sort_order:
      - "account_id"
      - { field: "txn_date", order: desc }
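
One way to picture the shorthand is as a normalization step that turns every entry into the full mapping form. The sketch below is illustrative only; the function name is hypothetical, and treating a mapping that omits order: as ascending is an assumption the text above does not state.

```python
def normalize_sort_order(entries: list) -> list:
    """Expand bare field names into full {field, order} mappings."""
    normalized = []
    for entry in entries:
        if isinstance(entry, str):
            # A bare string defaults to ascending.
            normalized.append({"field": entry, "order": "asc"})
        else:
            # Assumption: a mapping without `order` is also ascending.
            normalized.append({"field": entry["field"], "order": entry.get("order", "asc")})
    return normalized
```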

Watermarks

An event-time watermark declares which column on the source carries each record’s event time — the wall-clock instant the event happened, distinct from when Clinker read the row. When set, Clinker takes the column on every record, subtracts the source’s delay, and uses the result to track event-time progress so downstream time windows know when to close. The delay-corrected value is also stamped on every record as $source.event_time, the column a downstream time-windowed aggregate uses to assign records to windows.

- type: source
  name: clicks
  config:
    name: clicks
    type: csv
    path: "./data/clicks.csv"
    options:
      has_header: true
    watermark:
      column: event_ts       # must be date_time or date
      delay: 5s              # bounded out-of-order tolerance
      idle_timeout: 30s      # flip partitions to idle if quiet
    schema:
      - { name: user_id, type: string }
      - { name: event_ts, type: date_time }
      - { name: amount, type: int }

Fields:

column (required) — the schema column whose value is each record’s event time. The column’s declared type must be date_time or date. A column: that names a field absent from schema: raises E154; a column: whose declared type is neither raises E155.
delay (optional duration, default unset) — bounded out-of-order tolerance. Each record’s event time is shifted earlier by delay before being folded into the watermark, so the source’s effective watermark trails its observed max event time by this amount. Mirrors Flink’s BoundedOutOfOrdernessWatermarks. Without delay, the watermark advances strictly to the observed max — a single late record routes to the DLQ.
idle_timeout (optional duration, default unset) — if a source stays quiet longer than this, it stops holding back downstream window-close progress, so windows keep closing when one source pauses. Unset means the source never goes idle.
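
The effect of delay on the watermark can be sketched in a few lines. This is a simplified model under stated assumptions, not the engine's implementation: it tracks a single source, ignores idle_timeout, and the function name `fold_watermark` is hypothetical.

```python
from datetime import datetime, timedelta


def fold_watermark(event_times, delay=timedelta(0)):
    """Yield the watermark after each record: the max of (event_time - delay) seen so far."""
    watermark = None
    for event_time in event_times:
        shifted = event_time - delay  # shift earlier by the tolerated disorder
        if watermark is None or shifted > watermark:
            watermark = shifted
        yield watermark
```

With a 5-second delay, records stamped 12:00:00, 12:00:10, and 12:00:04 leave the watermark at 11:59:55, 12:00:05, and 12:00:05: the third record arrives out of order but does not move the watermark backwards.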

Durations use the suffixes ms, s, m, h, d. ms is matched before the single-character s, so 500ms reads as 500 milliseconds, not 500 seconds with a stray m.
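
The suffix-ordering point is easiest to see in a small parser. The sketch below is an illustration, not Clinker's parser: the function name is hypothetical, and accepting only whole-number amounts is an assumption.

```python
from datetime import timedelta

# "ms" is tried before "s": otherwise "500ms" would match the "s" suffix
# and leave "500m" behind as the (invalid) amount.
_UNITS = (
    ("ms", "milliseconds"),
    ("s", "seconds"),
    ("m", "minutes"),
    ("h", "hours"),
    ("d", "days"),
)


def parse_duration(text: str) -> timedelta:
    """Parse strings such as "500ms", "5s", "30m", "2h", or "1d"."""
    for suffix, unit in _UNITS:
        if text.endswith(suffix):
            amount = text[: -len(suffix)]
            if not (amount.isascii() and amount.isdigit()):
                raise ValueError(f"invalid duration amount in {text!r}")
            return timedelta(**{unit: int(amount)})
    raise ValueError(f"unknown duration suffix in {text!r}")
```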

A pipeline whose aggregate declares time_window: must have a watermark.column on every upstream-reachable source. Without it, event-time progress can never advance and the window can never close — the planner rejects this with E156.

Multi-value fields

A field that holds more than one value is declared on the schema column, not on the pipeline:

schema:
  - { name: order_id, type: string }
  - { name: tags, type: string, multiple: true }

multiple: true says the column holds zero or more values of its declared type. Reading collects every occurrence of the field into one array — a single occurrence is still an array, so downstream code never has to branch on how many values happened to arrive. A field absent from a record has no column at all and resolves to null, exactly as any other absent column does. CXL sees the column as an array; the declared type: describes each element and drives coercion. The declaration describes the shape of the data, so it serves both directions: a writer that can encode repetition reads the same declaration.
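
A minimal sketch of the collection rule, assuming a decoded record is a plain dict. The helper name `read_multiple` is hypothetical and element coercion to the declared type: is left out. The treatment of an explicit null follows the JSON rule given under Format notes.

```python
def read_multiple(record: dict, field: str):
    """Return the array a multiple: true column would hold, or None when absent."""
    if field not in record or record[field] is None:
        return None  # absent or explicitly null: the column resolves to null
    value = record[field]
    if isinstance(value, list):
        return value  # already a sequence of occurrences
    return [value]  # a single occurrence is still an array
```

Downstream code can therefore treat any non-null value of the column as a list without checking how many values arrived.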

The split_to_rows, split_values, and join_values blocks below accept a compact shorthand (a bare field name, or a mapping that omits defaults). To see the fully-materialized form the engine actually runs — every default spelled out — print the canonical config with clinker config --resolved. It rewrites only those shorthand blocks and leaves the rest of the file untouched.

Both ends of the declaration are checked at compile, so a shape the formats cannot carry fails before a run starts rather than mid-stream:

Format	As a source	As an output
`json`	native — an array	native — an array
`xml`	native — repeated child elements	not yet (issue 916)
`csv`	delimited cell via `split_values`	delimited cell via `join_values`
`fixed_width`	delimited cell via `split_values`	not yet (issue 918)
`edifact`, `x12`, `hl7`, `swift`	no — repetition is positional	no — repetition is positional

A multiple: true column reaching an output that cannot encode it is E359; one on a source that cannot produce it is E361. E359 covers an output’s own schema: block too: the attribute is direction-neutral, but the remaining writers do not encode repetition yet, so without the check, declaring it on such a sink would be silently accepted and ignored. Run clinker explain --code E361 for the full remediation of either.

csv and fixed_width read a multiple: true column through a split_values entry. Neither wire format repeats a field, but a cell’s text may hold several values separated by a delimiter. Declare that delimiter with a split_values entry and the reader parses the cell into the array the column holds. A multiple: true column no entry covers is rejected by E361 — the reader would have no delimiter and deliver the raw cell; either add the entry or leave the column single-valued and split it in a transform (tags.split(";")). The entry is read only on a single-schema source: a multi-record source of either format runs a backend that does not consume it. On the output side, a CSV sink joins a multiple: field into one delimited cell with join_values (defaulting to ; / on_conflict: error); fixed_width output is still pending (#918).

A split_values entry also recovers a CSV cell a sink wrote under join_values on_conflict: escape or encode_json: add escape: "\\" to un-escape an escaped delimiter, or json: true to read the whole cell as an embedded JSON array.

The segment formats are a permanent no, not a pending one. Repetition there is a positional coordinate rather than a list: a repeated composite is written as two axes interleaved in one element (11:B:1^12:B:2), which a flat array cannot represent without losing the component axis. The faithful shape is one column per coordinate — for HL7, that is what options.split_fields produces, with a writer that reassembles the wire field byte-for-byte.

One record per value: `split_to_rows`

split_to_rows fans a record out to one record per occurrence of a repeated field. Each entry is either a bare field name or a full mapping, and the two forms mix freely in one list:

- type: source
  name: invoices
  config:
    name: invoices
    type: json
    path: "./data/invoices.json"
    schema:
      - { name: invoice_id, type: int }
      - { name: customer, type: string }
      - { name: line_item, type: string }
      - { name: line_amount, type: float }
      - { name: line_no, type: int }
    split_to_rows:
      - tags                      # shorthand: field name, all defaults
      - field: line_items         # full form
        keep_empty: true
        mode: extract
        position_column: line_no

Key	Default	Meaning
`field`	—	The repeated field, as a flattened dotted name
`keep_empty`	`true`	Whether a record whose field is empty or absent survives
`mode`	`extract`	`extract` — the occurrence becomes the record; `split` — the record shape is kept
`position_column`	none	Column receiving each occurrence’s 1-based position

The field is named as it appears in the input document, not as the schema exposes it: a column declared source_name: is addressed by that source_name. The same rule applies to split_values below.

keep_empty defaults to true. A record whose field holds an empty array, or carries no such field at all, is emitted with that field unset rather than disappearing. Several widely used engines drop the record instead; a vanished row is the costliest failure mode there is, so dropping is opt-in here.

mode: extract (the default) makes the occurrence the record: its own fields are lifted out from under the field name and every field outside the group is merged onto each output. {"orders": [{"id": 1}]} yields a top-level id, and repeated <Item><name> children yield name. When lifting lands an occurrence’s field on a name an outside field already occupies, the occurrence wins — it is the record, so its own value is not shadowed by the parent it was merged with. A position_column wins over both: you named it, so a field of the same name inside or outside the occurrence gives way to the index.

mode: split preserves the record shape: the occurrence’s fields keep their dotted path (orders.id, Item.name) and each output carries exactly one occurrence.
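
The difference between the two modes, together with keep_empty and position_column, can be modelled in a short function. This is a sketch under simplifying assumptions, not the reader's implementation: it handles one top-level field whose occurrences are objects, the function name is hypothetical, and whether an empty record receives a position value is not specified above, so the sketch leaves it unset.

```python
def split_to_rows(record, field, mode="extract", keep_empty=True, position_column=None):
    """Fan one record out to one record per occurrence of `field`."""
    if mode not in ("extract", "split"):
        raise ValueError(f"unknown mode: {mode}")
    value = record.get(field)
    rest = {k: v for k, v in record.items() if k != field}
    if value is None or value == []:
        occurrences = []
    elif isinstance(value, list):
        occurrences = value
    else:
        occurrences = [value]  # a lone object counts as one occurrence

    if not occurrences:
        # keep_empty: the record survives with the field unset.
        return [rest] if keep_empty else []

    rows = []
    for position, occurrence in enumerate(occurrences, start=1):
        if mode == "extract":
            row = {**rest, **occurrence}  # the occurrence wins on a name collision
        else:
            row = {**rest, **{f"{field}.{k}": v for k, v in occurrence.items()}}
        if position_column is not None:
            row[position_column] = position  # wins over both sides
        rows.append(row)
    return rows
```

For `{"customer": "acme", "orders": [{"id": 10}, {"id": 20}]}`, extract yields rows with a top-level `id`, while split yields rows keyed `orders.id`.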

Entries apply in declaration order, so two entries multiply. Declaring the same field twice is rejected at compile (E358), as is fanning out a field the schema also declares multiple: true — the attribute collects the occurrences into one array, the fan-out spends them one per record, and a field cannot be both. On an XML source, two entries may not name nested element groups either (Item and Item.part): that reader assigns each element to one occurrence group by document position, and a nested pair leaves the inner group’s membership ambiguous.

A JSON source accepts a nested pair to produce a two-level expansion, but only when the outer entry declares mode: split. mode: extract lifts the occurrence’s own keys to the top level, which removes the dotted path the inner entry addresses — the inner entry would then match nothing and fan nothing out, so the pairing is rejected (E358):

    split_to_rows:
      - { field: orders, mode: split }
      - { field: orders.items, mode: split }

Several values in one cell: `split_values`

split_values parses a delimited cell into the several values a multiple: column holds. It takes the same bare-name-or-mapping shorthand:

    split_values:
      - tags                      # shorthand: default delimiter `;`
      - field: codes              # full form
        delimiter: "|"
    schema:
      - { name: tags,  type: string, multiple: true }
      - { name: codes, type: string, multiple: true }

The delimiter defaults to ;. A split_values field the schema does not declare multiple: true is rejected at compile (E358): splitting produces several values, and only a multi-value column can hold them. So is an entry naming a column’s exposed name when that column reads a differently-named input field — the split runs against the document’s own field names, so name the source_name.

The entry is read by the JSON and XML readers, and — on a single-schema source — by the CSV and fixed-width readers. On a multi-record CSV or fixed-width source, or on any segment format, declaring it is rejected (E358) rather than silently ignored: those readers are never handed it, so the cell would arrive unsplit with nothing to say so.
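
As an illustration of what a split_values entry asks the reader to do, the sketch below splits one cell on a delimiter and optionally honors an escape character. It is a hedged model: the function name is hypothetical, the escape path assumes a single-character delimiter, treating an escape followed by any character as that literal character is an assumption, and empty-cell handling and element coercion are not addressed.

```python
from typing import List, Optional


def split_cell(cell: str, delimiter: str = ";", escape: Optional[str] = None) -> List[str]:
    """Split a delimited cell into its values."""
    if escape is None:
        return cell.split(delimiter)
    values, current, i = [], [], 0
    while i < len(cell):
        ch = cell[i]
        if ch == escape and i + 1 < len(cell):
            current.append(cell[i + 1])  # keep the escaped character literally
            i += 2
        elif ch == delimiter:
            values.append("".join(current))  # close the current value
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    values.append("".join(current))
    return values
```

With the default delimiter, `red;green;blue` becomes three values; with `escape="\\"`, the cell `a\;b;c` becomes `a;b` and `c`.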

Migrating from `array_paths`

array_paths: was the earlier form of these declarations. It is no longer read, and a source still carrying it is rejected at compile (E360) rather than running with the fan-out silently dropped. An explode path becomes a split_to_rows: entry, a delimited cell becomes a split_values: entry, and a path kept as an array becomes multiple: true on the schema column.

mode: extract (the default) reproduces the old projection — the element’s own fields lifted to the top level of each output record. It does not reproduce the old cardinality: explode dropped a record whose array was empty, while keep_empty defaults to true here and keeps it with the element’s fields unset. Add keep_empty: false to the entry to migrate row-for-row.

Format notes

split_to_rows is honored by the JSON and XML readers, over a file path and over a rest response body alike. split_values is honored by those two and also by the single-schema CSV and fixed-width readers. Declaring a knob on a reader that is never handed it — split_to_rows on any delimited-cell or segment format, split_values on a multi-record or segment source — is rejected at compile (E358) rather than accepted and inert, and a multiple: true column no split_values entry can cover is rejected by E361.

Composition body files are gated by the same four checks as the pipeline that calls them.

JSON — the field names the key holding the array (line_items, or order.line_items for an array nested under an object). A field present but holding a single object or scalar rather than an array counts as one occurrence, and is projected exactly as a one-element array would be — many producers unwrap a lone element, and XML cannot express the difference at all, so the two readers agree on this. A field that is absent, holds an empty array, or is explicitly null has no occurrence and is governed by keep_empty: an explicit null is how many producers write “no value”, and it counts as none rather than as one. For the same reason a multiple: true column holding an explicit null stays null rather than becoming [null], so size() over it reads the same as it does for a field the document omits.

XML — the field is the repeated child element’s dotted path relative to the record element (Item, or Items.Item when nested). XML cannot distinguish an empty group from an absent one, so a record with no occurrence of the element is governed by keep_empty exactly as an empty array is. See XML Format for the full rules.

Transform Nodes

Transform nodes apply CXL expressions to each record, producing new fields, filtering records, or both. They process one record at a time in streaming fashion with constant memory overhead.

Basic structure

- type: transform
  name: enrich
  input: customers
  config:
    cxl: |
      emit full_name = first_name + " " + last_name
      emit tier = if lifetime_value >= 10000 then "gold" else "standard"
      filter status == "active"

The cxl: field is required and contains a CXL program. The three core CXL statements for transforms are:

emit – produces an output field. Only emitted fields appear in downstream nodes.
filter – drops records that do not match the boolean condition.
let – binds a local variable for use in subsequent expressions (not emitted).

    cxl: |
      let margin = revenue - cost
      emit product_id = product_id
      emit margin = margin
      emit margin_pct = if revenue > 0 then margin / revenue * 100 else 0
      filter margin > 0
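
Read as ordinary code, the program above behaves roughly like the function below. This is a loose Python analogy rather than CXL semantics: the function name is hypothetical, numeric typing rules are not modelled, and the example assumes float inputs.

```python
def run_transform(record: dict):
    """Return the emitted fields, or None when the filter drops the record."""
    margin = record["revenue"] - record["cost"]  # let: local only, not emitted by itself
    out = {
        "product_id": record["product_id"],  # emit product_id
        "margin": margin,  # emit margin
        "margin_pct": margin / record["revenue"] * 100 if record["revenue"] > 0 else 0,
    }
    if not margin > 0:  # filter margin > 0
        return None
    return out  # fields that were never emitted do not appear downstream
```

An input carrying an extra `region` field produces an output without it, because only emitted fields reach downstream nodes.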

Analytic window

The analytic_window field enables cross-source lookups by joining a secondary dataset into the transform. The secondary source is loaded into memory and indexed by the join key.

- type: transform
  name: enrich_orders
  input: orders
  config:
    analytic_window:
      source: products
      on: product_id
      group_by: [product_id]
    cxl: |
      emit order_id = order_id
      emit product_name = $window.first()
      emit quantity = quantity
      emit line_total = quantity * price

The $window.* namespace provides access to the windowed data. Functions like $window.first(), $window.last(), and $window.count() operate over the matched group.
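
Conceptually, the lookup resembles an in-memory index from join key to matched rows. The sketch below is only a mental model: `build_index`, `Window`, and `window_for` are hypothetical names, and what first() and last() return in CXL (a whole row here) is an assumption rather than documented behavior.

```python
from collections import defaultdict


def build_index(secondary_rows, key):
    """Group the secondary dataset's rows by join key, keeping input order."""
    index = defaultdict(list)
    for row in secondary_rows:
        index[row[key]].append(row)
    return index


class Window:
    """Hypothetical stand-in for $window over one matched group."""

    def __init__(self, group):
        self._group = group

    def first(self):
        return self._group[0] if self._group else None

    def last(self):
        return self._group[-1] if self._group else None

    def count(self):
        return len(self._group)


def window_for(index, record, key):
    """Fetch the group matching the record's key; an unmatched key gives an empty group."""
    return Window(index.get(record[key], []))
```

Because the whole secondary dataset sits in the index, its size counts against the memory budget regardless of how many primary records stream past it.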

Validations

Declarative validation checks can be attached to a transform. They run against each record and either route failures to the DLQ (severity error) or log a warning and continue (severity warn).

- type: transform
  name: validate_orders
  input: raw_orders
  config:
    cxl: |
      emit order_id = order_id
      emit amount = amount
      emit email = email
    validations:
      - field: email
        check: "not_empty"
        severity: error
        message: "Email is required"
      - check: "amount > 0"
        severity: warn
        message: "Non-positive amount"
      - field: order_id
        check: "not_empty"
        severity: error

Validation fields

Field	Required	Description
`field`	No	Restrict the check to a single field
`check`	Yes	Validation name (e.g. `"not_empty"`) or CXL boolean expression
`severity`	No	`error` (default) routes to DLQ; `warn` logs and continues
`message`	No	Custom error message for DLQ entries
`name`	No	Validation name for DLQ reporting. Auto-derived from field + check if omitted
`args`	No	Additional arguments as key-value pairs
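
The severity rule can be sketched as follows. This is an illustrative model, not the engine's evaluator: each check is represented by a Python callable standing in for a named check or CXL expression, the function name is hypothetical, and collecting every failing rule rather than stopping at the first error is an assumption.

```python
def apply_validations(record: dict, validations: list):
    """Return (disposition, error_messages, warning_messages) for one record."""
    errors, warnings = [], []
    for rule in validations:
        if rule["check"](record):
            continue  # the check passed
        message = rule.get("message", "validation failed")
        # severity defaults to error, as in the table above.
        if rule.get("severity", "error") == "error":
            errors.append(message)
        else:
            warnings.append(message)
    disposition = "dlq" if errors else "pass"
    return disposition, errors, warnings
```

A record that fails only warn rules still passes; a single failed error rule is enough to send it to the DLQ.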

Expansion cap (`max_expansion`)

When a transform body contains an emit each statement, every input record can fan out into multiple output records. The max_expansion field caps how many output records a single input record may produce – a safety bound against unexpectedly large arrays.

- type: transform
  name: explode_items
  input: orders
  config:
    max_expansion: 5000      # default: 10000
    cxl: |
      emit each it in items {
        emit order_id = order_id
        emit sku = it["sku"]
        emit price = it["price"]
      }

Field	Type	Default	Description
`max_expansion`	`u64`	`10000`	Maximum cumulative output records per input record.

If a single input record’s emit each block produces more than max_expansion output records, the originating record routes to the DLQ with category expansion_limit_exceeded instead of producing a truncated or unbounded result. No partial output is emitted for that record – the cap is enforced eagerly so the writer never sees records from a runaway expansion.
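
The eager, all-or-nothing nature of the cap can be sketched like this. It is a simplified model of the documented behavior, with a hypothetical function name and a fixed projection borrowed from the example above.

```python
def expand(record: dict, items, max_expansion: int = 10_000):
    """Return (outputs, dlq_category); outputs is empty when the cap is exceeded."""
    outputs = []
    for item in items:
        if len(outputs) >= max_expansion:
            # One more output would exceed the cap: discard everything for this record.
            return [], "expansion_limit_exceeded"
        outputs.append({
            "order_id": record["order_id"],
            "sku": item["sku"],
            "price": item["price"],
        })
    return outputs, None
```

Producing exactly max_expansion records is allowed; only the record after that trips the cap, and nothing partial is handed on.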

When to tune

Lower (e.g. 100, 1000) when input arrays are bounded by a known business rule and you want hostile or malformed input to surface as a DLQ entry rather than as a flood of downstream records.
Higher (e.g. 100000, 1000000) when legitimate input carries large arrays – for example, an order with a long line-item list or an event carrying a per-second pricing curve.

The DLQ category expansion_limit_exceeded is distinct from generic CXL evaluation failures, so DLQ-side filters and metrics can target expansion runaway specifically. See Error Handling & DLQ for the wider DLQ contract.

Batch size (`batch_size`)

A streaming-eligible transform hands its output downstream in bounded batches rather than accumulating the whole stage before the next stage runs. batch_size sets how many events (records plus document-boundary punctuations) a batch holds. A per-transform batch_size overrides the pipeline-level pipeline.batch_size for this one stage; omit it to inherit the pipeline value (or the built-in default of 2048).

- type: transform
  name: enrich
  input: orders
  config:
    batch_size: 512         # override pipeline.batch_size for this stage
    cxl: |
      emit order_id = order_id
      emit total = quantity * unit_price

Field	Type	Default	Description
`batch_size`	`usize`	inherits `pipeline.batch_size` (else 2048)	Events per streaming batch for this transform. Must be `>= 1`.

A batch_size of 0 is rejected at config load (a zero-event batch never flushes). Smaller batches lower the in-flight memory of a streaming stage at the cost of more per-batch bookkeeping; larger batches amortize the bookkeeping at the cost of a larger live working set. The default suits typical record widths — tune it only when a profiling run shows a streaming stage’s per-batch footprint matters. See Streaming vs. Blocking Stages for which stages stream and which fully materialize.
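
The batching behavior itself is simple to picture. The sketch below is a generic chunking helper with a hypothetical name, offered as an analogy for how a stream is handed downstream in bounded pieces; it says nothing about Clinker's internal buffer types.

```python
from itertools import islice


def batches(events, batch_size: int = 2048):
    """Group an event stream into lists of at most `batch_size` events."""
    if batch_size < 1:
        # Mirrors the config-load rejection: a zero-event batch never flushes.
        raise ValueError("batch_size must be >= 1")

    def generate():
        iterator = iter(events)
        while True:
            batch = list(islice(iterator, batch_size))
            if not batch:
                return
            yield batch

    return generate()
```

Only one batch is live at a time, which is why a smaller batch_size lowers the working set and a larger one reduces the number of hand-offs.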

Log directives

Log directives control diagnostic output during transform execution:

- type: transform
  name: process
  input: validated
  config:
    cxl: |
      emit id = id
      emit result = compute(value)
    log:
      - level: info
        when: per_record
        every: 1000
        message: "Processed record"
      - level: warn
        when: on_error
        message: "Record failed processing"
      - level: debug
        when: before_transform
        message: "Starting transform"

Log directive fields

Field	Required	Description
`level`	Yes	`trace`, `debug`, `info`, `warn`, or `error`
`when`	Yes	`before_transform`, `after_transform`, `per_record`, or `on_error`
`message`	Yes	Log message text
`every`	No	Only log every N records (for `per_record` timing)
`condition`	No	CXL boolean expression – only log when true
`fields`	No	List of field names to include in the log output
`log_rule`	No	Reference to an external log rule definition

Complete example

- type: source
  name: employees
  config:
    name: employees
    type: csv
    path: "./data/employees.csv"
    schema:
      - { name: employee_id, type: string }
      - { name: first_name, type: string }
      - { name: last_name, type: string }
      - { name: department, type: string }
      - { name: salary, type: int }
      - { name: hire_date, type: date }

- type: transform
  name: enrich_employees
  description: "Compute display name and annual bonus"
  input: employees
  config:
    cxl: |
      emit employee_id = employee_id
      emit display_name = last_name + ", " + first_name
      emit department = department.upper()
      emit salary = salary
      emit annual_bonus = if salary >= 80000 then salary * 0.15
        else salary * 0.10
    validations:
      - field: employee_id
        check: "not_empty"
        severity: error
        message: "Employee ID is required"
      - check: "salary > 0"
        severity: warn
        message: "Salary should be positive"
    log:
      - level: info
        when: per_record
        every: 5000
        message: "Processing employees"

Route Nodes

Route nodes split a stream of records into named branches based on CXL boolean conditions. Each branch becomes an independent output port that downstream nodes can wire to using port syntax.

Basic structure

- type: route
  name: split_by_value
  input: orders
  config:
    mode: exclusive
    conditions:
      high: "amount.to_int() > 1000"
      medium: "amount.to_int() > 100"
    default: low

This creates three output ports: split_by_value.high, split_by_value.medium, and split_by_value.low.

Conditions

The conditions: field is an ordered map of branch names to CXL boolean expressions. Each expression is evaluated against the incoming record.

    conditions:
      priority: "urgency == \"high\" and amount > 500"
      standard: "urgency == \"medium\""
      bulk: "quantity > 100"
    default: other

Condition keys become the port names used in downstream input: wiring.

Compile-time checking

Branch conditions are typechecked when the pipeline is compiled – the plan-building pass you trigger with clinker run pipeline.yaml --explain or a bare --dry-run (see Validation & Dry Run). Each condition is checked against the concrete column types of the route’s input, so a condition that references an unknown column or compares incompatible types fails at compile time rather than partway through a run.

Compile failures surface as a CXL diagnostic keyed to the failure class – E202 for a branch condition that does not parse, E203 for one that references an unknown column, and E200 for one that compares incompatible types – and name the offending branch, for example split_by_value (branch high), so in a multi-branch route the error points at the specific branch rather than just the route node.

Default branch

The default: field is required. Records that match no condition are routed to the default branch. The default branch name must not collide with any condition key.

Routing modes

Exclusive (default)

In exclusive mode, conditions are evaluated in declaration order and the first matching condition wins. A record appears in exactly one branch. Order matters – put more specific conditions first.

    mode: exclusive
    conditions:
      vip: "lifetime_value > 100000"
      high: "lifetime_value > 10000"
      medium: "lifetime_value > 1000"
    default: standard

A customer with lifetime_value = 50000 fails the vip condition (50000 is not > 100000), matches high, and stops there. The medium condition would also be true, but exclusive mode never evaluates it once high has matched, so the record appears in high only.

Inclusive

In inclusive mode, all matching conditions route the record. A single record can appear in multiple branches simultaneously.

    mode: inclusive
    conditions:
      needs_review: "amount > 10000"
      flagged: "status == \"flagged\""
      international: "country != \"US\""
    default: standard

A flagged international order over 10000 would appear in needs_review, flagged, and international – three copies routed to three branches.
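
The difference between the two modes can be sketched in a few lines of Python. This is an illustrative model only, not Clinker's implementation: the route helper and the lambda predicates are hypothetical stand-ins for the CXL conditions, and the branch names are taken from the exclusive example above.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Condition = Callable[[Record], bool]


def route(record: Record, conditions: Dict[str, Condition],
          default: str, mode: str = "exclusive") -> List[str]:
    """Return the branch names a record is sent to (hypothetical helper)."""
    matched: List[str] = []
    # Python dicts keep insertion order, mirroring the ordered conditions: map.
    for branch, predicate in conditions.items():
        if predicate(record):
            matched.append(branch)
            if mode == "exclusive":
                break  # first match wins; later conditions are not evaluated
    # A record that matches no condition goes to the default branch.
    return matched or [default]


tiers: Dict[str, Condition] = {
    "vip": lambda r: r["lifetime_value"] > 100000,
    "high": lambda r: r["lifetime_value"] > 10000,
    "medium": lambda r: r["lifetime_value"] > 1000,
}
customer = {"lifetime_value": 50000}

exclusive_branches = route(customer, tiers, "standard", "exclusive")
inclusive_branches = route(customer, tiers, "standard", "inclusive")
default_branches = route({"lifetime_value": 10}, tiers, "standard")
```

With the same conditions, the 50000 customer lands in high only under exclusive mode, and in both high and medium under inclusive mode.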

Downstream wiring

Downstream nodes reference route branches using port syntax: route_name.branch_name.

- type: route
  name: classify
  input: transactions
  config:
    mode: exclusive
    conditions:
      high: "amount > 1000"
      medium: "amount > 100"
    default: low

- type: transform
  name: high_value_processing
  input: classify.high
  config:
    cxl: |
      emit txn_id = txn_id
      emit amount = amount
      emit review_flag = true

- type: transform
  name: standard_processing
  input: classify.medium
  config:
    cxl: |
      emit txn_id = txn_id
      emit amount = amount

- type: output
  name: low_value_out
  input: classify.low
  config:
    name: low_value_out
    type: csv
    path: "./output/low_value.csv"

Constraints

At least 1 condition is required.
Maximum 256 branches (conditions + default).
Branch names must be unique.
The default name must not collide with any condition key.
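
As a rough illustration, the constraints above can be written as a small pre-flight check. The check_route function and its messages are hypothetical and are not Clinker diagnostics; the sketch assumes the conditions have already been parsed into a mapping, which makes condition keys unique by construction.

```python
from typing import Dict, List

MAX_BRANCHES = 256  # conditions + default, per the constraint list


def check_route(conditions: Dict[str, str], default: str) -> List[str]:
    """Return human-readable problems for a route config (hypothetical helper)."""
    problems: List[str] = []
    if len(conditions) < 1:
        problems.append("at least 1 condition is required")
    if len(conditions) + 1 > MAX_BRANCHES:
        problems.append("more than 256 branches (conditions + default)")
    if default in conditions:
        problems.append(f"default '{default}' collides with a condition key")
    return problems


ok = check_route({"high": "amount > 1000", "medium": "amount > 100"}, "low")
collision = check_route({"low": "amount < 10"}, "low")
```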

Complete example

pipeline:
  name: order_routing

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: csv
      path: "./data/orders.csv"
      schema:
        - { name: order_id, type: int }
        - { name: region, type: string }
        - { name: amount, type: float }
        - { name: priority, type: string }

  - type: route
    name: by_region
    input: orders
    config:
      mode: exclusive
      conditions:
        domestic: "region == \"US\" or region == \"CA\""
        emea: "region == \"UK\" or region == \"DE\" or region == \"FR\""
        apac: "region == \"JP\" or region == \"AU\" or region == \"SG\""
      default: other

  - type: output
    name: domestic_orders
    input: by_region.domestic
    config:
      name: domestic_orders
      type: csv
      path: "./output/domestic.csv"

  - type: output
    name: emea_orders
    input: by_region.emea
    config:
      name: emea_orders
      type: csv
      path: "./output/emea.csv"

  - type: output
    name: apac_orders
    input: by_region.apac
    config:
      name: apac_orders
      type: csv
      path: "./output/apac.csv"

  - type: output
    name: other_orders
    input: by_region.other
    config:
      name: other_orders
      type: csv
      path: "./output/other_regions.csv"

Merge Nodes

Merge nodes concatenate multiple upstream branches into a single stream. They are the counterpart to route nodes – where a route splits one stream into many, a merge joins many streams back into one.

Merge is for streamwise concatenation of inputs that share a schema. For record-level joining across inputs that have different schemas, see Combine Nodes.

Basic structure

- type: merge
  name: combined
  inputs:
    - east_data
    - west_data
  config: {}

Note the key differences from other node types:

Uses inputs: (plural), not input: (singular).
The config: block is empty – all wiring is on the node header.
Using input: (singular) on a merge node is a parse error.

Wiring

The inputs: field is a list of upstream node references. These can be bare node names or port references from route nodes:

- type: merge
  name: rejoin
  inputs:
    - process_high
    - process_medium
    - classify.low           # Port syntax for a route branch
  config: {}

Downstream nodes wire to the merge as a normal single-input reference:

- type: output
  name: final_output
  input: rejoin
  config:
    name: final_output
    type: csv
    path: "./output/combined.csv"

Modes

Merge’s cross-input ordering discipline is selected by config.mode. Two modes exist; concat is the default.

`concat` (default)

Predecessor records drain in declaration order: inputs[0] flows to output first, then inputs[1], then inputs[2], and so on. Within a single predecessor, per-source FIFO order is preserved. Output is reproducible run-to-run.

- type: merge
  name: combined
  inputs: [east, west]
  config:
    mode: concat

`interleave`

Records flow to output as they become available from any predecessor. Per-source FIFO is preserved within each input; cross-input order follows wall-clock arrival and is non-deterministic.

- type: merge
  name: combined
  inputs: [east, west]
  config:
    mode: interleave

Seeded interleave — `interleave_seed:`

Snapshot tests and benchmarks that need reproducible cross-input ordering can opt into a deterministic schedule:

- type: merge
  name: combined
  inputs: [east, west]
  config:
    mode: interleave
    interleave_seed: 42

With a seed, the cross-input order is reproducible from run to run regardless of upstream timing. To get there, the Merge reads all of its inputs into memory before emitting, so a seeded interleave buffers more than the other modes — use it for tests and benchmarks, not high-volume production merges.
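
The ordering guarantees of concat and a seeded interleave can be modelled as follows. This is a sketch under stated assumptions: the document does not specify the schedule Clinker derives from the seed, so a seeded random pick among non-empty inputs stands in for it here. What the sketch does preserve is the documented behavior: all inputs are buffered first, per-input FIFO order is kept, and the same seed gives the same output.

```python
import random
from itertools import chain
from typing import List, Sequence


def concat(inputs: Sequence[Sequence[str]]) -> List[str]:
    """Drain inputs[0] fully, then inputs[1], and so on."""
    return list(chain.from_iterable(inputs))


def seeded_interleave(inputs: Sequence[Sequence[str]], seed: int) -> List[str]:
    """Buffer every input, then emit in a seed-determined order.

    The random schedule is an assumption; only reproducibility and
    per-input FIFO order are taken from the documentation.
    """
    rng = random.Random(seed)
    buffered = [list(records) for records in inputs]  # read everything first
    cursors = [0] * len(buffered)
    out: List[str] = []
    while True:
        live = [i for i, buf in enumerate(buffered) if cursors[i] < len(buf)]
        if not live:
            return out
        i = rng.choice(live)
        out.append(buffered[i][cursors[i]])  # next record of the chosen input
        cursors[i] += 1


east = ["e1", "e2", "e3"]
west = ["w1", "w2"]
merged_concat = concat([east, west])
run_a = seeded_interleave([east, west], seed=42)
run_b = seeded_interleave([east, west], seed=42)
```

The buffering step is what makes the seeded variant independent of upstream timing, and also why it costs more memory than the streaming modes.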

Choosing a mode

Mode	Order	When to use
`concat`	Inputs emitted in declaration order, each fully drained before the next.	Downstream depends on a stable, declaration-ordered sequence (byte-identical output, contiguous time partitions).
`interleave` (unseeded)	Records emitted as they arrive; per-input order preserved, cross-input order varies.	Lowest latency and the consumer is order-insensitive (e.g. an aggregator grouping by key, or a writer that doesn’t assert on row order).
`interleave` (seeded)	Reproducible cross-input order.	Tests and benchmarks that assert on exact row sequence. Buffers all inputs in memory.

For high-volume merges, prefer concat or unseeded interleave — both stream their inputs and let a slow downstream consumer naturally throttle the upstream readers, so memory stays bounded. A seeded interleave does not, because it buffers everything first.

Record ordering

Records arrive in the order described by the mode in use — see Modes and Choosing a mode above. If you need sorted output regardless of merge mode, apply a sort_order on the downstream output node.

Use cases

Reuniting route branches

The most common pattern is routing records through different processing paths and then merging them back together:

- type: route
  name: classify
  input: orders
  config:
    mode: exclusive
    conditions:
      high: "amount > 1000"
    default: standard

- type: transform
  name: process_high
  input: classify.high
  config:
    cxl: |
      emit order_id = order_id
      emit amount = amount
      emit surcharge = amount * 0.02
      emit tier = "premium"

- type: transform
  name: process_standard
  input: classify.standard
  config:
    cxl: |
      emit order_id = order_id
      emit amount = amount
      emit surcharge = 0
      emit tier = "standard"

- type: merge
  name: all_orders
  inputs:
    - process_high
    - process_standard
  config: {}

- type: output
  name: result
  input: all_orders
  config:
    name: result
    type: csv
    path: "./output/all_orders.csv"

Unioning multiple sources

Merge nodes can combine records from multiple source files that share the same schema:

- type: source
  name: jan_sales
  config:
    name: jan_sales
    type: csv
    path: "./data/sales_jan.csv"
    schema:
      - { name: sale_id, type: int }
      - { name: amount, type: float }
      - { name: region, type: string }

- type: source
  name: feb_sales
  config:
    name: feb_sales
    type: csv
    path: "./data/sales_feb.csv"
    schema:
      - { name: sale_id, type: int }
      - { name: amount, type: float }
      - { name: region, type: string }

- type: merge
  name: all_sales
  inputs:
    - jan_sales
    - feb_sales
  config: {}

- type: aggregate
  name: totals
  input: all_sales
  config:
    group_by: [region]
    cxl: |
      emit total = sum(amount)
      emit count = count(*)

Schema constraints across inputs

Merge concatenates streams positionally against the merge node’s output_schema (taken from the first input). Every input must therefore agree on column shape — same column names, same on_unmapped policy, same correlation_key set.

Disagreement on the $widened auto_widen sidecar (one source uses auto_widen, another uses drop / reject) fails compilation with E315. See Auto-Widen & Schema Drift → E315 for the full diagnostic shape and remediation.

Combine Nodes

Combine nodes are the N-ary record-combining operator. Every input is declared up front and bound to a qualifier; the where: expression matches records across inputs using qualified field references (e.g. orders.product_id == products.product_id); the cxl: body shapes the output row.

Combine is distinct from merge: merge concatenates upstream branches that share a schema, while combine joins records across inputs that have different schemas.

Basic structure

- type: combine
  name: enrich
  input:
    orders: orders         # qualifier: upstream node name
    products: products
  config:
    where: "orders.product_id == products.product_id"
    match: first
    on_miss: null_fields
    cxl: |
      emit order_id = orders.order_id
      emit product_name = products.product_name
      emit amount = orders.amount
    propagate_ck: driver

Note the differences from other node types:

Uses input: as a map, binding qualifier names to upstream node references. Other nodes use input: as a single string or inputs: as a list of strings.
Every field reference inside where: and cxl: must be qualified (<qualifier>.<field>). Bare field names are a compile error.
Using inputs: (plural list) on a combine node is a parse error.

Wiring

Each entry in the input: map binds a qualifier to an upstream node:

  input:
    orders: orders                  # qualifier "orders" -> source node "orders"
    products: products
    high_priority: classify.high    # qualifier "high_priority" -> route port

Qualifiers are local names used inside where: and cxl:; they do not need to match the upstream node name. Upstream references can be bare node names or port references from a route node.

Iteration order in the input: map is preserved and used as the default driver-selection order (see Choosing the driving input below).

Configuration fields

Field	Required	Default	Description
`where`	Yes	–	CXL boolean expression matching records across inputs. Must contain at least one cross-input equality or range conjunct (a predicate with neither is rejected at plan time — see Predicate requirements).
`match`	No	`first`	Match cardinality: `first`, `all`, or `collect`.
`on_miss`	No	`null_fields`	Driver-record handling on zero predicate matches: `null_fields`, `skip`, or `error`.
`cxl`	Yes (except under `match: collect`)	–	Emit statements defining the output row. Empty under `match: collect`.
`drive`	No	first input	Explicit driver-input qualifier. Overrides the iteration-order default.
`strategy`	No	`auto`	Execution strategy hint: `auto` or `grace_hash`.
`propagate_ck`	Yes	–	Selects which correlation-key columns ride onto the output. `driver` keeps the driver’s CK only; `all` unions every input’s CK columns; `{ named: [<field>, ...] }` carries an explicit subset. See Correlation-key propagation below.
`max_output_rows`	No	unlimited	Opt-in cap on the number of rows this combine may emit. When set, the run fails loud (diagnostic `E325`) the moment the output would exceed the cap — it never truncates to a partial result. See Output-size cap.

The `where:` predicate

The where: expression is a CXL boolean expression evaluated for every candidate record pair across inputs. It must contain at least one cross-input equality or range comparison, that is, a comparison whose two sides reference fields from two different inputs. The simplest case is a single equality:

  where: "orders.product_id == products.product_id"

Compound predicates combine multiple conjuncts with and. Each conjunct is classified by the planner:

Equi conjunct – a cross-input equality (a.x == b.y). Drives the hash lookup or sort-merge join.
Range conjunct – a cross-input ordered comparison such as a.start <= b.ts. A band like a.start <= b.ts and b.ts <= a.end is two range conjuncts. Handled by the IEJoin algorithm when no equi conjunct constrains the same input pair.
Residual conjunct – any other CXL predicate (intra-input filter, function call, etc.). Applied as a post-filter after the equi/range match.

  where: |
    orders.product_id == products.product_id
    and orders.amount >= 100
    and products.region == "us-east"

Above: the equi conjunct drives the join; orders.amount >= 100 and products.region == "us-east" are applied as residuals.

Every combine predicate must carry at least one cross-input equality or range conjunct. A predicate with neither — a pure residual with no decomposable cross-input comparison — is rejected at plan time with diagnostic E313; there is no supported execution strategy for it. Pure-range predicates without an equi conjunct are fully supported via IEJoin.

Non-orderable range keys

A range conjunct compares values that must be orderable at runtime: integers, finite floats, exact decimals, dates, and datetimes. When a record’s range key evaluates to a non-orderable value (SQL NULL, a non-finite float such as NaN or infinity, or any other type), that record can never satisfy the range comparison, so it is routed out of the range match rather than joined. Which path it takes depends on whether it is a driver or a build record; the two cases are listed at the end of this section.

Decimal and mixed-numeric range keys. Exact fixed-point decimal range keys are fully supported on every join strategy — a monetary band join (amount >= tier.floor and amount < tier.ceiling) matches correctly. A range conjunct that mixes an integer and a float operand across the two inputs is also supported: the integer is compared as a float, exactly as the >=/< operators compare it elsewhere.

Decimal range keys are placed on a shared fixed-point grid with up to 18 fractional digits and an integer magnitude up to roughly 1.7 × 10²⁰ — well beyond any realistic monetary value. A decimal range value outside that grid (more than 18 fractional digits, which would truncate, or a magnitude that would overflow the grid) is not silently dropped: the run stops with diagnostic E326 naming the combine, so a wrong or empty result is never emitted. Rescale or narrow the compared values if you hit it.
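
The grid limits can be approximated with a short check. This is a sketch, not the engine's code: it assumes the grid is a signed 128-bit integer count of 10⁻¹⁸ units, which is consistent with the stated magnitude of roughly 1.7 × 10²⁰ but is not confirmed by this document. The name fits_range_grid is hypothetical.

```python
from decimal import Decimal
from fractions import Fraction

FRACTIONAL_DIGITS = 18
# Assumption: signed 128-bit grid, about 1.7e38 units of 1e-18 each.
GRID_MAX = 2**127 - 1


def fits_range_grid(value: Decimal) -> bool:
    """True if the value lands exactly on the grid without overflow."""
    if not value.is_finite():
        return False  # sketch choice: NaN/Infinity decimals never fit
    scaled = Fraction(value) * 10**FRACTIONAL_DIGITS  # exact arithmetic
    if scaled.denominator != 1:
        return False  # more than 18 fractional digits would truncate
    return abs(scaled.numerator) <= GRID_MAX


price_ok = fits_range_grid(Decimal("19.99"))
too_precise = fits_range_grid(Decimal("0.1234567890123456789"))  # 19 digits
too_large = fits_range_grid(Decimal("2e20"))
```

In the engine, a value for which such a check fails stops the run with E326 rather than being dropped.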

Datetime range keys. date and datetime range keys compare at their native resolution — a datetime to the nanosecond, across the full representable calendar (no microsecond rounding; instants before 1677 or after 2262 stay exact). Two timestamps that differ only below the microsecond therefore match, sort, and group as distinct instants on every join strategy, so a sub-microsecond as-of or band lookup neither drops a boundary match nor merges two near-simultaneous events.

abs/min/max/clamp and rejected range keys (E327). abs, min, max, and clamp return the numeric supertype (int | float). When both operands of the range conjunct recover the same concrete type — abs(int) >= abs(int), or a min/max/clamp whose result can only be one type (all-int or all-float, recovered through nested calls and arithmetic such as abs(a.x + 1) or abs(min(a.i, b.i))) — the axis is exactly as safe as a plain matching-typed key and the join runs normally. Otherwise the conjunct cannot be reduced to one exact numeric axis and is rejected at plan time with E327, rather than routed to the join where it could silently drop rows or mismatch. E327 fires when such an operand stays genuinely ambiguous — a mixed pair like abs(int) >= abs(float), or a per-row-ambiguous result like min(int, float) that may return either the integer or the float operand — or on a non-orderable pairing such as a string comparison or a decimal compared against a float/numeric. Compare a matching-typed range key instead. (A supported int/float/decimal/date/datetime range key, including a mixed int/float field pair, never triggers E327.)

A driver record with a non-orderable range key is treated as a zero-match driver and handled by on_miss (null_fields / skip / error).
A build record with a non-orderable range key is dropped (it can match nothing).

This is a runtime routing decision on the data, distinct from the plan-time E313 rejection above (which is about the predicate shape, not the values).
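
The runtime routing described in this section can be pictured as a small decision function. The names is_orderable and route_range_key are hypothetical, and Python types stand in for CXL value types. Excluding booleans and non-finite decimals is a choice made for the sketch, since the text only lists the orderable types.

```python
import math
from datetime import date, datetime
from decimal import Decimal


def is_orderable(value: object) -> bool:
    """Orderable range keys: ints, finite floats, decimals, dates, datetimes."""
    if value is None or isinstance(value, bool):
        return False  # null is never orderable; bool excluded by assumption
    if isinstance(value, float):
        return math.isfinite(value)  # NaN and infinity are rejected
    if isinstance(value, Decimal):
        return value.is_finite()
    return isinstance(value, (int, date, datetime))


def route_range_key(side: str, value: object) -> str:
    """Where a record goes, given its side ('driver' or 'build')."""
    if is_orderable(value):
        return "join"
    # Non-orderable: a driver is a zero-match driver, a build record is dropped.
    return "on_miss" if side == "driver" else "dropped"


null_driver = route_range_key("driver", None)
nan_build = route_range_key("build", float("nan"))
```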

Match modes

`match: first`

Emit one output row per driver record, using the first matching build-side record. Standard 1:1 enrichment. Default.

  config:
    where: "orders.product_id == products.product_id"
    match: first
    cxl: |
      emit order_id = orders.order_id
      emit product_name = products.product_name

The where: predicate selects the match; the cxl: body is a post-match projection that runs once on the chosen build record. Selection and projection are separate steps: if the body filters the row out (a filter that fails, or a body that emits nothing), that one output row is dropped. The combine does not fall back to a later matching build, and the driver is not treated as unmatched — it matched the predicate, the body just produced no row. on_miss (below) never fires for such a driver; it fires only when the predicate matched nothing at all. This holds identically for every join strategy the planner may pick.

Behavior change. This is a change in observable output for existing pipelines whose where: predicate carries a range, equi+range, or single-inequality comparison — the shapes the planner runs as a sort-merge join or an IEJoin (both the pure-range block-band path and the equi+range hash-partitioned path). On any of those three strategies, a driver that matched the predicate but whose body skipped every candidate was previously routed to on_miss — firing null_fields, skip, or error. It now silently produces no row, matching the pure-equality strategies (in-memory hash and grace hash), which already behaved this way and are unchanged. A pipeline that relied on the old routing (for example, on_miss: error tripping on a body-skipped driver) no longer sees it.

`match: all`

Emit one output row for every matching build-side record. 1:N fan-out – if a driver record matches three build records, three rows are emitted.

  config:
    where: "employees.department == benefits.department"
    match: all
    cxl: |
      emit employee_id = employees.employee_id
      emit benefit = benefits.benefit_name

`match: collect`

Gather every matching build-side record into a single Array-typed field on the output row. The driver record appears once; the build matches are aggregated into an array. The cxl: body must be empty under collect – the combine node synthesizes the output as { driver fields..., <build_qualifier>: Array }.

  config:
    where: "orders.product_id == products.product_id"
    match: collect
    cxl: ""

Each collected array is limited to 10,000 entries per driver record, which prevents unbounded growth.

Use collect when you need the set of matches as a single structured value; use all when you need a flat row per match.
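
To compare the three modes side by side, here is a minimal hash-join model for drivers that have at least one match. It is a sketch, not the engine: combine_matches is a hypothetical helper, the emitted fields are hard-coded in place of a cxl: body, and "first" is assumed to mean the first build record in input order.

```python
from typing import Dict, List

Row = Dict[str, object]


def combine_matches(drivers: List[Row], builds: List[Row], key: str,
                    match: str, build_qualifier: str = "products") -> List[Row]:
    index: Dict[object, List[Row]] = {}
    for b in builds:  # build side is materialized as a hash table
        index.setdefault(b[key], []).append(b)
    out: List[Row] = []
    for d in drivers:  # driver records flow through one at a time
        hits = index.get(d[key], [])
        if not hits:
            continue  # zero-match drivers are on_miss territory, not shown here
        if match == "first":
            out.append({"order_id": d["order_id"],
                        "product_name": hits[0]["product_name"]})
        elif match == "all":
            out.extend({"order_id": d["order_id"],
                        "product_name": h["product_name"]} for h in hits)
        elif match == "collect":
            # Driver fields plus one array field named after the build qualifier.
            out.append({**d, build_qualifier: hits})
    return out


orders = [{"order_id": "o1", "product_id": "p1"}]
products = [
    {"product_id": "p1", "product_name": "Widget"},
    {"product_id": "p1", "product_name": "Widget v2"},
]
first_rows = combine_matches(orders, products, "product_id", "first")
all_rows = combine_matches(orders, products, "product_id", "all")
collect_rows = combine_matches(orders, products, "product_id", "collect")
```

One driver with two matches yields one row under first, two rows under all, and one row carrying a two-element array under collect.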

Unmatched records (`on_miss`)

on_miss controls what happens to driver records with zero predicate matches — drivers for which no build-side record satisfied where:. A driver that matched the predicate but whose cxl: body skipped the row (see match: first) is not a miss and never reaches on_miss; it simply produces no output row. On sort-merge and IEJoin strategies this is a recent change — see the behavior-change note under match: first.

Value	Semantics
`null_fields` (default)	Build-side fields resolve to null. Driver record is still emitted. Equivalent to left-join.
`skip`	Driver record is dropped. Equivalent to inner-join.
`error`	Pipeline fails on the first unmatched driver record.

  config:
    where: "orders.product_id == products.product_id"
    on_miss: skip

on_miss: error is useful for strict referential integrity where any miss should halt processing. on_miss: skip is the inner-join shape. on_miss: null_fields is the left-join shape and the default.
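
The three on_miss values map onto familiar join shapes. The sketch below models only the zero-match case; apply_on_miss and UnmatchedDriver are hypothetical names, and the list of build field names is passed in explicitly for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]


class UnmatchedDriver(Exception):
    """Stand-in for the pipeline failure raised by on_miss: error."""


def apply_on_miss(driver: Row, build_fields: List[str], on_miss: str) -> List[Row]:
    """Output rows for a driver record that matched no build record."""
    if on_miss == "null_fields":
        # Left-join shape: driver is kept, build-side fields resolve to null.
        return [{**driver, **{field: None for field in build_fields}}]
    if on_miss == "skip":
        return []  # inner-join shape: the driver is dropped
    if on_miss == "error":
        raise UnmatchedDriver(f"no match for driver record {driver!r}")
    raise ValueError(f"unknown on_miss value: {on_miss}")


order = {"order_id": "o9", "product_id": "missing"}
left_join_rows = apply_on_miss(order, ["product_name"], "null_fields")
inner_join_rows = apply_on_miss(order, ["product_name"], "skip")
```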

Composite keys

Chain multiple cross-input equalities with and:

  config:
    where: |
      sales.department == targets.department
      and sales.region == targets.region
    cxl: |
      emit department = sales.department
      emit region = sales.region
      emit actual = sales.amount
      emit goal = targets.goal

All conjuncts must hold for a record pair to match.

Multi-input combine (three or more)

Combine accepts any number of inputs. Each pair of inputs that should be related needs an explicit cross-input equality:

- type: combine
  name: fully_enriched
  input:
    orders: orders
    products: products
    categories: categories
  config:
    where: |
      orders.product_id == products.product_id
      and products.category_id == categories.category_id
    match: first
    on_miss: null_fields
    cxl: |
      emit order_id = orders.order_id
      emit product_name = products.product_name
      emit category_name = categories.name
      emit amount = orders.amount
    propagate_ck: driver

The planner builds a join tree by walking equalities pairwise and ordering the joins by selectivity.

Choosing the driving input

The driver is the input whose records flow through one at a time during execution; the other inputs are materialized as build-side hash tables (or IEJoin index structures). By default the first input in the input: map is the driver.

Use drive: to override:

  config:
    where: "orders.product_id == products.product_id"
    drive: products
    cxl: |
      emit product_id = products.product_id
      emit product_name = products.product_name
      emit sample_order_id = orders.order_id

With drive: products, the pipeline emits one row per product enriched with a matching order, instead of one row per order enriched with its product. Pick the driver based on which side you want to iterate over (typically the larger stream, or the one whose ordering you want to preserve).

Strategy hint

Value	Behavior
`auto` (default)	Planner picks a strategy from the predicate shape. Hash join for equi predicates; IEJoin for pure-range predicates.
`grace_hash`	Force grace hash join (disk-spilling partitioned hash). Applies only to pure-equi predicates; ignored on predicates with range conjuncts.

You rarely need to set this — auto already spills a large join to disk when it would otherwise exceed the memory budget. Use grace_hash as an explicit assertion when you know the build side is larger than memory but fits on disk after partitioning.

Correlation-key propagation

Combine declares which correlation-key columns its output rows carry via the required propagate_ck field.

- type: combine
  name: enriched
  input:
    orders: orders
    products: products
  config:
    where: "orders.product_id == products.product_id"
    cxl: |
      emit order_id = orders.order_id
      emit product_name = products.name
    propagate_ck: driver        # driver-only (today's behavior)

    # Alternative forms; a combine takes exactly one propagate_ck value:
    # propagate_ck: all         # union of every input's $ck.* columns
    # propagate_ck:
    #   named: [order_id]       # explicit subset (intersected with upstream)

driver – output carries only the driver input’s correlation-key columns. Build-side records contribute body fields, but their group identity is consumed by the match.
all – output carries every input’s correlation-key columns. Use when the build side carries keys that downstream operators need to read.
named: [<field>, ...] – an explicit subset. Use to project a multi-field key down to a single field after a join.

Driver wins on a name collision: if both the driver and a build input declare the same key field, the output keeps the driver’s value. See the Correlation-key combine interaction reference for how each match mode fills the propagated key (especially match: collect).

propagate_ck is required on every combine; pipelines without an explicit value fail to compile. Existing pipelines migrate by adding propagate_ck: driver, which is bit-for-bit equivalent to today’s behavior.

Output-size cap (`max_output_rows`)

max_output_rows is an opt-in ceiling on how many rows a combine may emit. It defaults to unlimited; set it to guard against a permissive or mis-specified predicate that would explode a small pair of inputs into a huge result (for example, a range join over an unexpectedly hot key producing a near cross product):

  config:
    where: "orders.ts >= prices.effective_from"
    cxl: |
      emit order_id = orders.order_id
      emit price = prices.amount
    propagate_ck: driver
    match: all
    max_output_rows: 1000000

Semantics:

Fail-loud, never truncate. The moment the combine would emit more than the cap, the run stops with diagnostic E325 naming the node and the cap. It does not produce a capped or partial result — a silently truncated join would corrupt downstream data.
Independent of the memory budget. This is a result-size guard, not memory pressure. A runaway join can be perfectly bounded in memory (its output spills to disk) yet still produce far more rows than intended; max_output_rows caps the row count regardless of bytes.
Covers the whole output, on every strategy. The cap counts every emitted output row across all match modes and any on_miss rows, and is enforced identically whichever join strategy the planner picks (hash build-probe, grace-hash, sort-merge, or the IEJoin block-band).
collect counts driver rows. Under match: collect a combine emits one output row per driver row (each carrying an array of up to 10,000 collected matches), so max_output_rows bounds the driver-row count, not the number of collected array elements.
Dead-lettered rows are not counted. A driver whose cxl: body or residual raises a recoverable eval failure is routed to the DLQ, not the output, so it does not count toward the cap.
N-ary combines cap the final output. A combine whose where: spans three or more inputs is decomposed into a chain of binary steps; max_output_rows guards the final combined output, not the intermediate chain steps.

If the large result is expected, raise the cap (or omit the field). If it is not, tighten the where: predicate. Run clinker explain --code E325 for the full remediation guide.
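
The fail-loud behavior can be illustrated with a row counter that raises instead of truncating. This is a sketch only: OutputCapExceeded stands in for the E325 failure and collect_capped is a hypothetical helper. It buffers rows in a list purely so that no partial result is ever returned, which says nothing about how the engine stores its output.

```python
from typing import Iterable, List, Optional


class OutputCapExceeded(Exception):
    """Stand-in for the E325 failure (hypothetical name)."""


def collect_capped(rows: Iterable[object], max_output_rows: Optional[int],
                   node: str) -> List[object]:
    """Return all rows, or raise as soon as the cap would be exceeded."""
    out: List[object] = []
    for row in rows:
        if max_output_rows is not None and len(out) == max_output_rows:
            # One more row would exceed the cap: fail, never truncate.
            raise OutputCapExceeded(
                f"{node}: output would exceed max_output_rows={max_output_rows}"
            )
        out.append(row)
    return out


within_cap = collect_capped(range(3), 3, "enrich")
unlimited = collect_capped(range(5), None, "enrich")
```

A result of exactly the cap size passes; the failure triggers only on the row that would go over.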

Memory considerations

Each non-driving (build-side) input is held in memory while the join runs, so plan for roughly 1.5–2× its file size — a 50 MB lookup table needs about 75–100 MB. If a build side is larger than the memory budget, the join spills to disk automatically rather than failing. Set the budget with pipeline.memory.limit; see Memory Tuning.
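
The sizing rule is simple arithmetic. The helper below is hypothetical and only restates the 1.5–2× rule of thumb from the paragraph above:

```python
from typing import Tuple


def build_side_estimate_mb(file_size_mb: float) -> Tuple[float, float]:
    """Rough in-memory range for one build-side input, in megabytes."""
    return (file_size_mb * 1.5, file_size_mb * 2.0)


low_mb, high_mb = build_side_estimate_mb(50)  # the 50 MB lookup table
```

For a combine with several build-side inputs, the estimates add up, since each one is held while the join runs.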

Range and equi+range predicates (the IEJoin block-band strategy) are bounded on both input axes and the output: each side is drained into disk-backed, key-sorted blocks, and the emitted rows accumulate in a spillable sort buffer rather than a resident vector. So a range join whose inputs — or whose result — exceed the memory budget spills automatically and completes, rather than failing. Even a single hot key whose block-pair is a near cross product streams through a bounded nested loop instead of materializing the whole candidate set. Use max_output_rows if you want such a runaway result to stop rather than spill.

Document boundaries

A Combine passes document boundaries through to its output, so a per-document Aggregate after a join still rolls up per document. A driver source that carries several documents (a glob: over monthly files, say) produces one roll-up per driver document after the join, not one fold spanning all of them. A document carried on both the driver and the build side opens and closes exactly once downstream. See Document Context & Envelopes for the per-document aggregation model.

Complete example

pipeline:
  name: order_enrichment

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: csv
      path: "./data/orders.csv"
      schema:
        - { name: order_id, type: string }
        - { name: product_id, type: string }
        - { name: amount, type: float }

  - type: source
    name: products
    config:
      name: products
      type: csv
      path: "./data/products.csv"
      schema:
        - { name: product_id, type: string }
        - { name: product_name, type: string }
        - { name: category, type: string }

  - type: combine
    name: enrich
    input:
      orders: orders
      products: products
    config:
      where: "orders.product_id == products.product_id"
      match: first
      on_miss: null_fields
      cxl: |
        emit order_id = orders.order_id
        emit product_id = orders.product_id
        emit product_name = products.product_name
        emit category = products.category
        emit amount = orders.amount
      propagate_ck: driver

  - type: output
    name: result
    input: enrich
    config:
      name: result
      type: csv
      path: "./output/enriched_orders.csv"

Aggregate Nodes

Aggregate nodes group records by one or more fields and compute summary values using CXL aggregate functions. They consume all input records in a group before emitting a single summary record per group.

Basic structure

- type: aggregate
  name: dept_totals
  input: employees
  config:
    group_by: [department]
    cxl: |
      emit total_salary = sum(salary)
      emit headcount = count(*)
      emit avg_salary = avg(salary)

Group-by fields pass through automatically – you do not need to emit them. In this example, the output records contain department, total_salary, headcount, and avg_salary.
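
A plain-Python model of this example shows how the group-by field passes through alongside the emitted values. It is an illustrative sketch: aggregate_salaries is a hypothetical helper, the three aggregates are hard-coded in place of the cxl: body, and groups come out in first-seen order, which the documentation does not promise.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def aggregate_salaries(records: List[Row], group_by: List[str]) -> List[Row]:
    groups: Dict[Tuple[object, ...], Dict[str, float]] = {}
    for r in records:
        key = tuple(r[f] for f in group_by)
        acc = groups.setdefault(key, {"sum": 0, "count": 0})
        acc["sum"] += r["salary"]
        acc["count"] += 1
    out: List[Row] = []
    for key, acc in groups.items():
        row: Row = dict(zip(group_by, key))  # group-by fields pass through
        row["total_salary"] = acc["sum"]
        row["headcount"] = acc["count"]
        row["avg_salary"] = acc["sum"] / acc["count"]
        out.append(row)
    return out


employees = [
    {"department": "eng", "salary": 100},
    {"department": "eng", "salary": 200},
    {"department": "ops", "salary": 50},
]
dept_totals = aggregate_salaries(employees, ["department"])
```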

Group-by fields

The group_by: field is a list of field names from the input schema. Records sharing the same values for all group-by fields are placed in the same group.

    group_by: [region, department]
    cxl: |
      emit total_salary = sum(salary)
      emit max_salary = max(salary)

This produces one output record per unique (region, department) combination.

Global aggregation

An empty group_by list treats the entire input as a single group, producing exactly one output record:

- type: aggregate
  name: grand_totals
  input: orders
  config:
    group_by: []
    cxl: |
      emit grand_total = sum(amount)
      emit record_count = count(*)
      emit avg_order = avg(amount)

Aggregate functions

The following aggregate functions are available in CXL:

Function	Description
`sum(field)`	Sum of all values in the group
`count(*)`	Number of records in the group
`avg(field)`	Arithmetic mean
`min(field)`	Minimum value
`max(field)`	Maximum value
`collect(field)`	Collect all values into an array
`weighted_avg(value, weight)`	Weighted average

Strategy hint

The strategy: field controls how aggregation is executed:

- type: aggregate
  name: totals
  input: sorted_data
  config:
    group_by: [account_id]
    strategy: streaming
    cxl: |
      emit total = sum(amount)

Strategy	Behavior
`auto`	Default. The optimizer chooses based on whether the input is provably sorted for the group-by keys.
`hash`	Force hash aggregation. Works on any input ordering. Holds all groups in memory (with disk spill if memory budget is exceeded).
`streaming`	Require streaming aggregation. Processes one group at a time with O(1) memory per group. Compile-time error if the input is not provably sorted for the group-by keys.

When to use streaming

If your source declares a sort_order: that covers the group-by fields, the optimizer will automatically choose streaming aggregation. Use strategy: streaming as an explicit assertion – it turns a silent fallback to hash aggregation into a compile error, which is useful for catching sort-order regressions.

When to use hash

Hash aggregation works on unsorted input and is the safe choice whenever the sort order is not guaranteed. It uses more memory than streaming aggregation but handles any data ordering. Memory-aware disk spill kicks in when RSS approaches the pipeline’s memory.limit.
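
The trade-off between the two strategies can be illustrated with a small Python sketch. It is a conceptual model, not the engine's code: `hash_sum` and `streaming_sum` are hypothetical helpers, and the last call shows why streaming aggregation needs a sort-order guarantee.

```python
from itertools import groupby
from operator import itemgetter

def hash_sum(records, key, field):
    """Hash aggregation: one accumulator per distinct key, any input order."""
    totals = {}
    for rec in records:
        totals[rec[key]] = totals.get(rec[key], 0) + rec[field]
    return totals

def streaming_sum(records, key, field):
    """Streaming aggregation: assumes records arrive sorted by key.

    Only the current group's accumulator is live; a group is emitted as
    soon as the key changes.
    """
    for k, group in groupby(records, key=itemgetter(key)):
        yield k, sum(rec[field] for rec in group)

sorted_rows = [
    {"account_id": "a", "amount": 5},
    {"account_id": "a", "amount": 7},
    {"account_id": "b", "amount": 1},
]
unsorted_rows = [sorted_rows[0], sorted_rows[2], sorted_rows[1]]

print(hash_sum(unsorted_rows, "account_id", "amount"))
# {'a': 12, 'b': 1}
print(list(streaming_sum(sorted_rows, "account_id", "amount")))
# [('a', 12), ('b', 1)]
print(list(streaming_sum(unsorted_rows, "account_id", "amount")))
# [('a', 5), ('b', 1), ('a', 7)]  (a split group: wrong on unsorted input)
```

The split group in the last line is the failure that a compile-time check on sort order rules out.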

Correlation-key interaction

When a pipeline’s sources declare correlation_key: fields, an aggregate behaves one of two ways depending on its group_by:

group_by covers the correlation key — if any record in a group fails, the whole group (including the aggregate output row) is sent to the DLQ.
group_by omits a correlation-key field — only the failing records are dropped and the affected groups are recomputed, so the surviving rows still produce a correct aggregate.

You do not configure this; the engine picks the behavior from your group_by. The one restriction is that the second case cannot be combined with strategy: streaming. See Correlation Keys for the full rules.

Time-windowed aggregates

When time_window: is set on the aggregate body, records are grouped not just by group_by but also by event-time window. Each record is placed into one or more windows by its event time (the $source.event_time value derived from the source’s watermark), and each window emits one row per group once it closes.

Every upstream-reachable source must declare a watermark: so the engine knows when a window is complete. Without one, no window could ever close, so the planner rejects the pipeline with E156.

The engine emits user-declared columns only — window bounds do not appear in the output unless you compute and emit them yourself. The emit order is ascending window_start (deterministic), so output rows naturally group by window.

Tumbling windows

Non-overlapping fixed-size buckets. Each record lands in exactly one window [floor(t / size) * size, floor(t / size) * size + size).

time_window:
  tumbling: { size: 1h }

Input (tumbling_demo.csv):

user_id,event_ts,kind
u1,2026-05-14T10:05:00,click
u2,2026-05-14T10:30:00,click
u1,2026-05-14T10:42:00,click
u1,2026-05-14T11:03:00,click
u2,2026-05-14T11:15:00,click
u2,2026-05-14T11:50:00,click

Output with tumbling: { size: 1h }, group_by: [user_id], emit n = count(*):

user_id,n
u1,2
u2,1
u1,1
u2,2

Reading top-to-bottom: the first two rows are the [10:00, 11:00) bucket (u1’s 10:05 and 10:42, then u2’s 10:30); the next two are the [11:00, 12:00) bucket (u1’s 11:03, then u2’s 11:15 and 11:50). Each input record contributes to exactly one window.
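
The bucket arithmetic can be checked with a short Python sketch over the same six records. Two things here are assumptions rather than documented behavior: windows are aligned to the Unix epoch, and rows inside one window are ordered by user_id (the engine only documents ascending window_start).

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def tumbling_start(ts, size):
    """Start of the one window [start, start + size) that contains ts."""
    return EPOCH + ((ts - EPOCH) // size) * size

events = [
    ("u1", "2026-05-14T10:05:00"),
    ("u2", "2026-05-14T10:30:00"),
    ("u1", "2026-05-14T10:42:00"),
    ("u1", "2026-05-14T11:03:00"),
    ("u2", "2026-05-14T11:15:00"),
    ("u2", "2026-05-14T11:50:00"),
]

size = timedelta(hours=1)
counts = Counter()
for user_id, event_ts in events:
    window_start = tumbling_start(datetime.fromisoformat(event_ts), size)
    counts[(window_start, user_id)] += 1

# Emit in ascending window_start; the user_id tie-break is an assumption.
rows = [(user_id, n) for (_, user_id), n in sorted(counts.items())]
print(rows)  # [('u1', 2), ('u2', 1), ('u1', 1), ('u2', 2)]
```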

Hopping windows

Overlapping fixed-size buckets advanced by slide. When slide divides size evenly, each record lands in exactly size / slide windows; in general it lands in at most ceil(size / slide). slide < size produces overlap, slide == size degenerates to tumbling, and slide > size produces gaps where some records fall in zero windows.

time_window:
  hopping: { size: 1h, slide: 30m }

Input (hopping_demo.csv):

user_id,event_ts,amount
u1,2026-05-14T10:05:00,10
u1,2026-05-14T10:42:00,20
u1,2026-05-14T11:10:00,15

Output with group_by: [user_id], emit total = sum(amount), emit n = count(*):

user_id,total,n
u1,10,1
u1,30,2
u1,35,2
u1,15,1

Three input records, four output rows — each record fans into two overlapping size: 1h, slide: 30m windows:

[09:30, 10:30) — just 10:05 → total=10, n=1
[10:00, 11:00) — 10:05 + 10:42 → total=30, n=2
[10:30, 11:30) — 10:42 + 11:10 → total=35, n=2
[11:00, 12:00) — just 11:10 → total=15, n=1
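
The fan-out can be reproduced with a Python sketch over the same three records. It assumes window starts are aligned to multiples of slide counted from the Unix epoch, which matches the window bounds listed above; `hopping_starts` is an illustrative helper, not an engine API.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def hopping_starts(ts, size, slide):
    """Starts of every window [start, start + size) that contains ts."""
    # Latest slide-aligned start at or before ts, then walk backwards.
    start = EPOCH + ((ts - EPOCH) // slide) * slide
    starts = []
    while start + size > ts:
        starts.append(start)
        start -= slide
    return sorted(starts)

events = [
    ("u1", "2026-05-14T10:05:00", 10),
    ("u1", "2026-05-14T10:42:00", 20),
    ("u1", "2026-05-14T11:10:00", 15),
]

size, slide = timedelta(hours=1), timedelta(minutes=30)
totals = {}
for user_id, event_ts, amount in events:
    ts = datetime.fromisoformat(event_ts)
    for start in hopping_starts(ts, size, slide):
        total, n = totals.get((start, user_id), (0, 0))
        totals[(start, user_id)] = (total + amount, n + 1)

rows = [(user_id, total, n) for (_, user_id), (total, n) in sorted(totals.items())]
print(rows)  # [('u1', 10, 1), ('u1', 30, 2), ('u1', 35, 2), ('u1', 15, 1)]
```

With slide larger than size the same helper returns an empty list for a record that falls in a gap, for example 10:45 under size 30m and slide 1h.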

Session windows

Per-key gap-bounded sessions. A new record extends its key’s current session if its event time is within gap of the session’s last event time; otherwise it starts a new session. The boundary is data-driven, not clock-aligned.

time_window:
  session: { gap: 10m }

Input (session_demo.csv):

user_id,event_ts,action
u1,2026-05-14T10:00:00,login
u1,2026-05-14T10:07:00,click
u1,2026-05-14T10:13:00,click
u1,2026-05-14T10:50:00,login
u1,2026-05-14T10:55:00,click

Output with group_by: [user_id], emit n = count(*):

user_id,n
u1,3
u1,2

u1’s first three rows form one session (10:00 → 10:07 → 10:13, consecutive gaps ≤ 10m). The 37-minute idle stretch exceeds gap, so 10:50 starts a fresh session that runs through 10:55. Two sessions, two output rows.
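
A minimal Python sketch of the gap rule, using the same five events. It assumes events are processed in event-time order and treats "within gap" as less than or equal to gap; `session_counts` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def session_counts(events, gap):
    """Count events per session, where a session ends after an idle gap."""
    sessions = {}  # key -> list of [last_event_time, count]
    for key, ts in sorted(events, key=lambda e: e[1]):
        runs = sessions.setdefault(key, [])
        if runs and ts - runs[-1][0] <= gap:
            runs[-1][0] = ts      # extend the current session
            runs[-1][1] += 1
        else:
            runs.append([ts, 1])  # idle too long: start a new session
    return {key: [n for _, n in runs] for key, runs in sessions.items()}

stamps = ["10:00", "10:07", "10:13", "10:50", "10:55"]
events = [("u1", datetime.fromisoformat(f"2026-05-14T{t}:00")) for t in stamps]
result = session_counts(events, timedelta(minutes=10))
print(result)  # {'u1': [3, 2]}
```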

Allowed lateness

allowed_lateness extends how long a window stays open past its end before it emits, giving late-arriving records a grace period to still be counted. It is distinct from the source-side watermark.delay. Records that arrive after a window’s end + allowed_lateness route to the DLQ as LateRecord with stage label time_window:<aggregate-name>. See DLQ category: LateRecord for the DLQ row layout.

- type: aggregate
  name: hourly
  input: clicks
  config:
    group_by: [user_id]
    time_window:
      tumbling: { size: 1h }
    allowed_lateness: 30s
    cxl: |
      emit n = count(*)

Default (unset) means no grace beyond the watermark — windows close the instant min_across_sources crosses window_end. Set allowed_lateness when the source’s watermark.delay alone is too small to absorb the observed out-of-order tail.
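
The close condition can be sketched in Python as follows. This is a simplified model under stated assumptions: times are plain epoch seconds, `window_closed` is a made-up helper, and reading "crosses" as greater-than-or-equal is a guess at the boundary behavior.

```python
def window_closed(window_end, source_watermarks, allowed_lateness=0):
    """Return True once the slowest source has reached the close point.

    All arguments are epoch seconds (an illustrative unit). The `>=`
    comparison is an assumption about what "crosses" means.
    """
    return min(source_watermarks) >= window_end + allowed_lateness

window_end = 3600  # end of the [0, 3600) window

print(window_closed(window_end, [3610, 3590]))       # False: one source lags
print(window_closed(window_end, [3610, 3600]))       # True: no grace period
print(window_closed(window_end, [3610, 3600], 30))   # False: grace still open
print(window_closed(window_end, [3640, 3630], 30))   # True
```

Taking the minimum across sources is what lets the slowest source pace the window.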

Worked example: multi-source session window

This pipeline merges two independent login feeds and groups per-user events into gap-bounded sessions. When several sources feed one time-windowed aggregate, a window cannot close until every source has advanced past the window’s end — the slowest source paces the others, so no window emits before all of its records have arrived.

pipeline:
  name: multi_source_session

nodes:
  - type: source
    name: src_web
    description: Web login events.
    config:
      name: src_web
      type: csv
      path: ./data/session_logins.csv
      options:
        has_header: true
      watermark:
        column: event_ts
      schema:
        - { name: user_id, type: string }
        - { name: event_ts, type: date_time }
        - { name: source, type: string }

  - type: source
    name: src_mobile
    description: Mobile login events.
    config:
      name: src_mobile
      type: csv
      path: ./data/session_mobile.csv
      options:
        has_header: true
      watermark:
        column: event_ts
      schema:
        - { name: user_id, type: string }
        - { name: event_ts, type: date_time }
        - { name: source, type: string }

  - type: merge
    name: all_logins
    inputs: [src_web, src_mobile]

  - type: aggregate
    name: user_sessions
    input: all_logins
    config:
      group_by: [user_id]
      time_window:
        session: { gap: 5m }
      allowed_lateness: 30s
      cxl: |
        emit user_id = user_id
        emit logins = count(*)

  - type: output
    name: results
    input: user_sessions
    config:
      name: results
      type: csv
      path: ./output/multi_source_session.csv

Both sources declare their own watermark.column independently, and each source’s records carry an event time regardless of which feed delivered them — so the aggregate does not care which column name a given source used. A session cannot emit until both src_web and src_mobile have advanced past the session’s end plus its allowed_lateness. Drop the watermark: block on either source and the planner rejects the pipeline with E156.

Run it from the repo:

cargo run -p clinker -- run examples/pipelines/multi_source_session.yaml

Complete example

- type: source
  name: transactions
  config:
    name: transactions
    type: csv
    path: "./data/transactions.csv"
    schema:
      - { name: account_id, type: string }
      - { name: txn_date, type: date }
      - { name: amount, type: float }
      - { name: category, type: string }
    sort_order:
      - { field: "account_id", order: asc }

- type: aggregate
  name: account_summary
  input: transactions
  config:
    group_by: [account_id]
    strategy: streaming
    cxl: |
      emit total_amount = sum(amount)
      emit txn_count = count(*)
      emit avg_amount = avg(amount)
      emit max_amount = max(amount)
      emit categories = collect(category)

- type: output
  name: summary_output
  input: account_summary
  config:
    name: summary_output
    type: json
    path: "./output/account_summary.json"

The output is JSON because collect(category) emits an array. The CSV, XML, and fixed-width writers reject an array-valued field (a stray collection reaching a tabular sink is treated as a routing bug); route such a pipeline to JSON, which serializes arrays natively, or coerce the array to a scalar first with a downstream Transform (for example emit categories = categories.join(";")).

Reshape Nodes

Reshape nodes observe a whole correlation group and, per group, mutate the rows whose state caused a rule to fire while synthesizing new rows derived from those trigger rows. They are the node for “look at everything an entity did, then fix one record and insert the record that should have been there” — work no other node can do:

Aggregate reduces a group to one summary row.
Transform emits 0 or 1 record per input record.
Combine joins records across sources.

None of those produce new records derived from a group’s observed state while preserving the originals. Reshape does.

Reshape is a blocking grouping operator: it buffers every record of a group before any output row leaves, because a rule cannot decide what to synthesize until it has seen the whole group. It has a single output.

Basic structure

- type: reshape
  name: backfill_plans
  input: plans
  config:
    partition_by: [employee_id]
    order_by:
      - { field: plan_start, order: asc }
    rules:
      - name: fix_long_plan_years
        when: "plan_end - plan_start > 365"
        mutate:
          set:
            plan_end: "plan_start"
        synthesize:
          copy_from: trigger
          overrides:
            status: "'synthesized'"

For each employee_id group, every row where plan_end - plan_start > 365 (the trigger) has its plan_end rewritten and a new row synthesized from it with status overridden to synthesized.

`partition_by`

A list of field names. Records sharing the same values for all partition_by fields form one group, and every rule observes and acts within a single group. This is the correlation key the operator reasons over.

Whole-input grouping (`partition_by: []`)

An empty list is the degenerate case of that key: every record shares it, so the entire input forms one group and the rules apply across the whole dataset rather than per entity.

- type: reshape
  name: relabel
  input: rows
  config:
    partition_by: []          # one group: the whole input
    order_by:
      - { field: amount, order: asc }
    rules:
      - name: flag_large
        when: "amount > 100"
        mutate:
          set:
            label: "'large'"

order_by then sorts the whole input as a unit, and the no-cascade contract applies across every record at once. Two consequences follow from there being only one group:

The whole input must fit memory.limit at finalize, since a single group has to be resident when its rules fire (see Limits). Whole-input grouping is for datasets that fit the budget, not for bulk row-at-a-time work — a per-record mutation with no cross-record dependency belongs in a Transform, which streams.
A mutation conflict rolls back the whole group, so one conflict rolls back the entire run’s records rather than one entity’s.

Values Reshape cannot key

Reshape groups a record under a single null group whenever it cannot build a key from the partition value. Today that covers all of:

the column being absent from the record
an explicit null
an empty string ("")
a NaN float
an array- or map-valued cell (a multi-value column)

So account="" and account=null land in the same Reshape group. Note that Cull does not fold empty strings — there, account="" and account=null are two groups. The difference is unintentional and tracked in #1022; until it is resolved, do not assume one node’s grouping matches the other’s for blank values.

If a blank-heavy column is your partition key, expect one large null group. Normalize blanks upstream (a Transform that maps "" to a real sentinel) when you want them grouped separately.

`order_by`

Optional. A list of sort fields ({ field, order }, where order is asc or desc) applied within each group before rules run, so order-dependent synthesis is deterministic. Nulls sort last. Arrival order breaks ties.
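
A small Python sketch of that ordering contract for a single sort field. It assumes nulls sort last for both asc and desc (the text does not distinguish), and relies on Python's stable sort to keep arrival order on ties; `order_group` is illustrative only.

```python
def order_group(rows, field, descending=False):
    """Sort rows by field, nulls last; ties keep their arrival order."""
    present = [r for r in rows if r[field] is not None]
    nulls = [r for r in rows if r[field] is None]
    # list.sort is stable, including with reverse=True.
    present.sort(key=lambda r: r[field], reverse=descending)
    return present + nulls

group = [
    {"id": 1, "plan_start": None},
    {"id": 2, "plan_start": 5},
    {"id": 3, "plan_start": 3},
    {"id": 4, "plan_start": 5},
]

print([r["id"] for r in order_group(group, "plan_start")])
# [3, 2, 4, 1]
print([r["id"] for r in order_group(group, "plan_start", descending=True)])
# [2, 4, 3, 1]
```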

Rules

Each entry in rules: is a declarative rule with a name, a trigger predicate, and optional mutation and synthesis actions. Rules are evaluated in declaration order — but every rule observes the same original group snapshot (see No cascade below).

`when` — the trigger predicate

when is a CXL boolean expression evaluated against each row in the group. A row for which when is true is a trigger row for that rule: its mutate rewrites it, and its synthesize derives new rows from it. CXL boolean operators are and / or / not (Clinker’s expression language is not SQL).

`mutate` — in-place trigger-row mutation

mutate:
  set:
    plan_end: "plan_start"
    note: "concat(note, ' (corrected)')"

Each set: entry is field: <CXL expression>. The expression evaluates against the original trigger row and overwrites that field’s value on the row.

Two restrictions are enforced at compile time:

A set: target must already exist in the upstream schema. Reshape mutates existing columns; it does not add new ones. Emit the column from an upstream Transform first if you need it.
A set: may not write a partition_by field — group identity must survive Reshape.

`synthesize` — deriving new rows

synthesize:
  copy_from: trigger
  overrides:
    plan_date: "'2024-01-01'"
    status: "'synthesized'"

For each trigger row, synthesize emits one new row:

copy_from: trigger — the new row starts as a copy of the trigger row’s values, then overrides are applied on top.
copy_from: none — the new row starts all-null; overrides must supply every column (enforced at compile time, so a synthesized row is never silently empty).

Each overrides: entry is field: <CXL expression>, evaluated against the trigger row.
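
The two copy_from modes can be modeled with a short Python sketch. The `synthesize` function and the column list are hypothetical; overrides are plain callables standing in for CXL expressions, and the ValueError stands in for the compile-time rejection.

```python
def synthesize(trigger, columns, overrides, copy_from="trigger"):
    """Build one synthesized row from a trigger row.

    overrides maps field -> callable(trigger_row), standing in for a CXL
    expression evaluated against the trigger row.
    """
    if copy_from == "trigger":
        new_row = dict(trigger)
    elif copy_from == "none":
        missing = [c for c in columns if c not in overrides]
        if missing:
            # Clinker rejects this at compile time; raising is a stand-in.
            raise ValueError(f"overrides must supply every column, missing {missing}")
        new_row = dict.fromkeys(columns)
    else:
        raise ValueError(f"unknown copy_from: {copy_from!r}")
    for field, expr in overrides.items():
        new_row[field] = expr(trigger)
    return new_row

columns = ["employee_id", "plan_date", "status"]
trigger = {"employee_id": "e1", "plan_date": "2023-06-01", "status": "active"}
overrides = {
    "plan_date": lambda row: "2024-01-01",
    "status": lambda row: "synthesized",
}

row = synthesize(trigger, columns, overrides)
print(row)
# {'employee_id': 'e1', 'plan_date': '2024-01-01', 'status': 'synthesized'}
```

The trigger row itself is left untouched; the synthesized row is a separate record.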

No cascade

Every rule observes the original group state. A row mutated by rule A is not re-observed by rule B, and rule B’s when predicate sees the row’s original values, not rule A’s edits. This is a deliberate guarantee:

Determinism — cascade would make rule order silently change output.
Single observation — the group is observed once.

To sequence dependent transformations, chain two Reshape nodes in the DAG so the second observes the first’s output.
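
The no-cascade contract is easy to see in a Python sketch. This is a conceptual model with made-up rule objects (callables in place of CXL), not the engine's data structures: every predicate and expression reads from a frozen snapshot, while writes go to a separate output copy.

```python
import copy

def apply_rules(group, rules):
    """Apply every rule against the original snapshot of the group."""
    snapshot = copy.deepcopy(group)   # what every `when` and `set` reads
    output = copy.deepcopy(group)     # rows that will be emitted
    for rule in rules:
        for i, original in enumerate(snapshot):
            if rule["when"](original):
                for field, expr in rule["set"].items():
                    output[i][field] = expr(original)
    return output

rules = [
    {"name": "a", "when": lambda r: r["amount"] > 100,
     "set": {"label": lambda r: "large"}},
    # Rule b would fire only if it could see rule a's edit. It cannot.
    {"name": "b", "when": lambda r: r["label"] == "large",
     "set": {"note": lambda r: "flagged"}},
]

group = [{"amount": 150, "label": "", "note": ""}]
result = apply_rules(group, rules)
print(result)  # [{'amount': 150, 'label': 'large', 'note': ''}]
```

Swapping the two rules gives the same result, which is the determinism the contract is after.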

Mutation conflicts

If two rules write the same field on the same row, that is a mutation conflict. Some conflicts are caught at compile time when the rules’ selectors statically overlap; content-dependent collisions that cannot be proven at compile time are caught at runtime.

A runtime conflict routes a dead-letter-queue entry under the mutation_conflict category, and the whole correlation group rolls back — none of that group’s mutated or synthesized rows reach the output. The DLQ entry’s stage label is reshape:<node>:<rule_a>+<rule_b>, naming the colliding rule pair. See Error Handling & DLQ.
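
One way to picture the runtime check is the following Python sketch. The write log, the `find_conflict` helper, and the choice of which rule is named first in the label are all assumptions made for illustration; only the label shape comes from the text above.

```python
def find_conflict(writes):
    """Return the first pair of rules that write the same field on the same row.

    writes is an iterable of (rule_name, row_index, field) tuples in the
    order the writes were recorded.
    """
    first_writer = {}
    for rule, row, field in writes:
        owner = first_writer.setdefault((row, field), rule)
        if owner != rule:
            return owner, rule
    return None

def stage_label(node, pair):
    """Format the DLQ stage label reshape:<node>:<rule_a>+<rule_b>."""
    return f"reshape:{node}:{pair[0]}+{pair[1]}"

writes = [
    ("fix_a", 0, "plan_end"),
    ("fix_b", 1, "plan_end"),   # different row: no conflict
    ("fix_b", 0, "plan_end"),   # same row and field as fix_a: conflict
]

pair = find_conflict(writes)
print(pair)                                 # ('fix_a', 'fix_b')
print(stage_label("backfill_plans", pair))  # reshape:backfill_plans:fix_a+fix_b
```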

Audit stamps

Reshape stamps three engine-written columns on its output records so the provenance of a synthesized or mutated row is queryable downstream:

Column	Meaning
`$meta.synthetic`	`true` on a synthesized row, `false` on an original (including a mutated trigger row)
`$meta.synthesized_by`	the `<node>:<rule>` that synthesized the row (empty on originals)
`$meta.mutated_by`	comma-separated `<node>:<rule>` labels of every rule that mutated the row (empty if none)

Like the $ck.* correlation columns, these $meta.* columns stay out of the default writer output — they are available for downstream CXL and audit, not silently dumped into your output files.

Memory model

Reshape is a blocking, grouped operator: it groups every input record by partition_by before any rule fires, because each rule must observe its whole correlation group (the no-cascade contract forbids folding a group incrementally). It therefore cannot stream — the full group set materializes before the first output row leaves.

That per-group buffer is governed by the same central memory arbitrator every other blocking operator polls (see Memory & Spill). As records are grouped, Reshape tracks the live in-memory footprint and, whenever the run crosses the soft spill threshold (80% of memory.limit by default), it spills buffered groups to disk:

What spills: the raw input records, never the post-processed output rows. On reload at finalize, mutation and synthesis re-run in memory exactly as they would have without spilling, so the output is identical whether a group stayed resident or round-tripped through disk — including its within-group row order, which is restored to arrival order after a reload. (Two caveats apply; see Limits below.) Spilling input records (rather than output rows) is also what keeps a copy_from: none synthesized row — built against the wider output schema — from being reconstructed against the wrong schema; input records all share one uniform schema.
Spill priority: 15, between grace-hash Combine (10) and external sort (20). A grouped record buffer costs more to evict than grace partitions (reload pays the re-synthesis CPU) but less than an external-sort merge. Reshape cannot back-pressure — once its predecessor has drained there is no upstream channel to pause — so under memory pressure it always spills its own buffer in-thread rather than pausing a producer.
Largest-first, stop at the threshold: when the budget trips, Reshape evicts the largest resident groups first and stops as soon as the resident footprint drops back under the soft threshold — it does not drain every group. The cross-group resident peak is bounded near the soft limit.
Skew (one giant group): a single correlation group whose resident tail alone exceeds the budget is spilled incrementally — sliced by the upper bits of each record’s admission sequence so successive spill waves evict fractions of the one group — while smaller groups stay resident. The ingest-time resident peak therefore stays bounded even under one giant skewed group.
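
The largest-first policy can be sketched in a few lines of Python. Byte counts, group names, and the `pick_groups_to_spill` helper are invented for the example; skew slicing of a single giant group is not modeled.

```python
def pick_groups_to_spill(resident_bytes, soft_threshold):
    """Choose groups to evict, largest first, until under the threshold.

    resident_bytes maps group key -> bytes currently held in memory.
    Returns the evicted keys and the remaining resident footprint.
    """
    total = sum(resident_bytes.values())
    spilled = []
    by_size = sorted(resident_bytes.items(), key=lambda kv: kv[1], reverse=True)
    for key, size in by_size:
        if total < soft_threshold:
            break                 # back under the soft limit: stop evicting
        spilled.append(key)
        total -= size
    return spilled, total

memory_limit = 1000
soft_threshold = memory_limit * 80 // 100   # 80% soft spill threshold

groups = {"a": 500, "b": 300, "c": 150, "d": 50}
print(pick_groups_to_spill(groups, soft_threshold))  # (['a'], 500)
```

Evicting the one largest group is enough here, so the three smaller groups stay resident.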

Reshape’s buffer is byte-budget-only: it does not honor the error_handling.max_group_buffer record-count cap. That cap bounds the correlation-commit group buffer used by the retraction machinery, a different buffer; Reshape’s footprint is bounded by memory.limit and the spill path, not by a per-group record count.

The on-disk spill volume Reshape produces is surfaced per stage in clinker run --explain (the Estimated spill volume and Spill compression sections) and, after a run, in the actual per-stage spill totals.

Limits

Two current limitations qualify the “identical whether spilled or resident” guarantee above:

A single correlation group must fit the memory budget at finalize. The no-cascade contract requires the whole group to be resident when its rules fire, so even though cross-group and ingest-time peaks spill to disk, the finalize reload of one group needs that group to fit. Skew slicing bounds the ingest peak, but a single correlation group larger than memory.limit has no in-budget representation. Rather than risk an out-of-memory crash, the run fails loud with E310 MemoryBudgetExceeded, naming the Reshape node, the offending partition_by group, and its footprint against the budget:
```
E310 backfill: arena exceeded budget (128000/8192) [one Reshape correlation
group [employee_id="employee-00000"] does not fit; the reported use is that
group's reload footprint alone. ...]
```
Raising memory.limit is the only fix that leaves your output unchanged. Raise it clear of the reported figure — finalize also holds the run’s remaining groups, so that figure is a floor, not a target.

The other two levers both change what you get, and are worth knowing only so you can weigh them deliberately:
- Dropping columns this node does not read, in an upstream Transform, shrinks each buffered record. But Reshape’s output row is the upstream columns plus its $meta.* audit columns — it never drops a column itself — so anything you strip upstream is also missing from the written output.
- Narrowing partition_by to a finer key shrinks each group too, but that key defines the group the rules evaluate against, so a finer key changes which records each rule sees and therefore changes your results. Treat the grouping key as a modelling decision, never as a memory knob.
(A future two-pass finalize could lift this limit.)

Run clinker explain --code E310 for remediation keyed to whichever memory surface overran.

Reshape rules cannot reference $doc document context. Because the spill round-trip does not yet preserve document envelope context, a $doc.* reference in a rule’s when, mutate.set, or synthesize expression would resolve to the real envelope for a resident group but to null for a spilled one, which would make the output depend on the memory budget. A pipeline whose Reshape rules reference $doc is rejected at compile time. Move the $doc lookup into an upstream Transform that copies the value into a record column, then reference that column in the Reshape rule.

Cull Nodes

Cull nodes observe a whole correlation group and remove the entire group when a group-level predicate holds, routing the removed records to a side-output port instead of discarding them. They are the node for “look at everything an entity did, then set the whole entity aside for review or reprocessing” — work that operates on an aggregate property of a group, not on individual rows:

Route fans out per record on a row predicate.
Reshape mutates rows and synthesizes new ones within a group.
Aggregate reduces a group to one summary row.

None of those remove a whole correlation group based on an aggregate property of the group and emit the removed rows on a second stream. Cull does.

Cull is a blocking grouping operator: it buffers every record of a group before any output leaves, because a group-level predicate (e.g. “this group has more than 100 rows”) cannot be decided until the whole group is seen. It has two output ports: the main port (kept groups) and the removed_to side-output port (removed groups).

Basic structure

- type: cull
  name: flag_large_histories
  input: backfill
  config:
    partition_by: [employee_id]
    removed_to: review
    rules:
      - name: too_many_plans
        drop_group_when: "count(*) > 3"

For each employee_id group, if the group holds more than three rows the whole group is routed to the review side output; every other group flows to the main output.

`partition_by`

A list of field names. Records sharing the same values for all partition_by fields form one group, and the removal predicate observes one whole group at a time. This is the correlation key the operator reasons over. partition_by must cover every visible correlation-key field so group identity is preserved on both output ports.

Whole-input grouping (`partition_by: []`)

An empty list is the degenerate case of that key: every record shares it, so the entire input forms one group and drop_group_when decides the whole dataset at once — every record is kept, or every record is routed to removed_to.

- type: cull
  name: drop_bad
  input: events
  config:
    partition_by: []          # one group: the whole input
    removed_to: removed
    rules:
      - name: drop_any_error
        drop_group_when: "sum(if status == 'error' then 1 else 0) > 0"

That pipeline routes all records to removed if any single record has status: error. Keyed by [account] the same rule would remove only the offending account’s records — so reach for whole-input grouping when the decision genuinely concerns the batch (an all-or-nothing gate on a delivery), not when you meant a per-entity rule.

The whole input must then fit memory.limit at finalize, since a single group has to be resident when its predicate runs (see the limit below). The sibling group-count bound does not apply: with only one group, the decision state is already as small as it can be.

Values Cull cannot key

Cull groups a record under a single null group in two cases: the column is absent from the record, or its value is an explicit null.

Everything else is either its own group or a hard error:

An empty string ("") is its own group, distinct from account=null.
A NaN float, or an array- or map-valued cell, aborts the run rather than grouping — a partition key must be a single scalar value. This abort currently presents as an internal error, but it is a data condition, not an engine defect: fix the offending column rather than treating the message as an engine invariant failure.

Reshape treats both of those differently: it folds empty strings, NaNs, and multi-value cells all into its null group instead. The blank-versus-null divergence is tracked in #1022. The unkeyable-value behavior is separate; until both rules are deliberately aligned or retained, do not assume one node’s grouping matches the other’s.
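
The current divergence between the two nodes can be summarized in a Python sketch for a single partition field. Both functions are illustrative models of the documented behavior, not engine code, and the ValueError is only a stand-in for the run abort.

```python
import math

def is_unkeyable(value):
    """NaN floats and array- or map-valued cells cannot act as a scalar key."""
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, (list, dict))

def reshape_key(record, field):
    """Reshape: absent, null, "", NaN, and multi-value cells share the null group."""
    value = record.get(field)
    if value is None or value == "" or is_unkeyable(value):
        return None
    return value

def cull_key(record, field):
    """Cull: only absent and null share the null group; "" is its own group."""
    value = record.get(field)
    if is_unkeyable(value):
        # Stand-in for the run abort; the real error type is not modeled here.
        raise ValueError(f"partition key {field!r} must be a single scalar value")
    return value

blank, null, absent = {"account": ""}, {"account": None}, {}

print(reshape_key(blank, "account") == reshape_key(null, "account"))  # True
print(cull_key(blank, "account") == cull_key(null, "account"))        # False
print(cull_key(absent, "account") is None)                            # True
```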

`order_by`

Optional. A list of sort fields ({ field, order }, where order is asc or desc) applied within each group before its predicate runs, so an order-sensitive predicate is deterministic. Nulls sort last. Arrival order breaks ties.

Rules

Each entry in rules: is a declarative removal rule with a name and a drop_group_when predicate. A group is removed when any rule’s predicate holds (the rules are OR-combined).
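
As a conceptual Python sketch (not the engine's implementation; `cull` and the sample data are invented), whole-group removal with OR-combined rules looks like this. Each rule is a callable over the full list of a group's rows, standing in for an aggregate CXL predicate.

```python
from collections import defaultdict

def cull(records, partition_by, rules):
    """Split records into (kept, removed) by whole-group predicates."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[f] for f in partition_by)].append(rec)
    kept, removed = [], []
    for rows in groups.values():
        # A group is removed when ANY rule holds over the whole group.
        target = removed if any(rule(rows) for rule in rules) else kept
        target.extend(rows)
    return kept, removed

rules = [
    lambda rows: len(rows) > 3,                                          # count(*) > 3
    lambda rows: sum(1 if r["status"] == "error" else 0 for r in rows) > 0,
]

records = (
    [{"account": "a", "status": "ok"}] * 2
    + [{"account": "b", "status": "ok"}, {"account": "b", "status": "error"}]
    + [{"account": "c", "status": "ok"}] * 4
)

kept, removed = cull(records, ["account"], rules)
print(sorted({r["account"] for r in kept}))     # ['a']
print(sorted({r["account"] for r in removed}))  # ['b', 'c']
print(len(kept) + len(removed) == len(records)) # True: nothing is discarded
```

Account b is removed by the error rule and account c by the row-count rule; every record still leaves on exactly one of the two streams.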

`drop_group_when` — the group-level removal predicate

drop_group_when is a CXL boolean expression evaluated in aggregate context over the whole group (group-by = partition_by). Because it is an aggregate expression, it uses CXL’s aggregate functions:

Aggregate	Meaning
`count(*)`	number of rows in the group
`sum(<expr>)`	sum of an expression over the group
`min(<expr>)` / `max(<expr>)`	minimum / maximum over the group
`avg(<expr>)`	mean over the group

rules:
  - name: too_many_plans
    drop_group_when: "count(*) > 3"
  - name: high_total
    drop_group_when: "sum(amount) > 10000"

CXL’s bare aggregate vocabulary is sum / count / min / max / avg / collect / weighted_avg — there is no bare any() aggregate. To express “remove the group if any row matches a condition”, sum an indicator and compare to zero:

rules:
  - name: drop_error_groups
    # Remove any account group containing at least one `error` row.
    drop_group_when: "sum(if status == 'error' then 1 else 0) > 0"

Ordered comparisons (>, <, >=, <=) work over every comparable aggregate type, not just numbers — the predicate uses the same comparison rules as a Transform. Numbers, strings, and dates all order:

rules:
  - name: late_alphabet
    drop_group_when: "max(name) > 'M'"          # string ordering
  - name: recent_hire
    drop_group_when: "max(hired) >= #2020-01-01#" # date ordering (`#YYYY-MM-DD#` literal)

A group whose aggregate operand is null (for example max(...) over an all-null column) compares as false — a null operand never removes the group and never errors.

Comments in a predicate

A drop_group_when predicate may carry a # line-comment to explain the rule inline. Each rule’s predicate is parsed on its own, so a trailing comment applies only to that rule:

rules:
  - name: drop_error_groups
    drop_group_when: "sum(if status == 'error' then 1 else 0) > 0  # any error row removes the group"
  - name: high_total
    drop_group_when: "sum(amount) > 10000                          # large accounts"

The comment is source text only — it never changes the compiled decision. (A #YYYY-MM-DD# date literal is unaffected: it lexes as a date, not a comment.)

Output ports: main and `removed_to`

Cull has two producer-side output ports, the same mechanism a Route uses for its branches — not the dead-letter queue. Removed records are valid rows the operator deliberately partitions onto a second stream, not errors.

Downstream nodes draw from the two ports by reference:

The main output (kept groups) is referenced by the Cull node’s bare name: input: flag_large_histories.
The side output (removed groups) is referenced as <cull>.<removed_to>: input: flag_large_histories.review.

  - type: output
    name: kept
    input: flag_large_histories            # main port — kept groups
    config: { name: kept, type: csv, path: kept.csv }

  - type: output
    name: review
    input: flag_large_histories.review     # side-output port — removed groups
    config: { name: review, type: csv, path: review.csv }

removed_to must be a non-empty name distinct from the Cull node’s own name (enforced at compile time, so the two ports are always distinguishable).

A single downstream node may draw from both ports — for example a Merge recombining the kept and removed streams (inputs: [flag_large_histories, flag_large_histories.review]) — and receives the union of both ports’ records.

`removed_to` is not the DLQ

The removed_to port carries the unchanged upstream schema — Cull does not widen, and both ports emit exactly the input columns. Removed records are not DlqEntrys and never appear in the dead-letter queue or its counters; they flow down a normal data edge to whatever node draws the removed_to port (an audit sink, a reprocessing branch, another transform). Use the DLQ for errors; use a Cull side output for valid records you want to handle separately. See Error Handling & DLQ for the error path.

Memory model

Cull is a blocking, grouped operator: it groups every input record by partition_by before any record leaves, because the group-level drop_group_when predicate is an aggregate property of the whole group and cannot be folded into a per-record keep/remove decision. It therefore cannot stream — the full group set materializes before the first output row leaves.

That per-group buffer is governed by the same central memory arbitrator every other blocking operator polls (see Memory & Spill). As records are grouped, Cull tracks the live in-memory footprint and, whenever the run crosses the soft spill threshold (80% of memory.limit by default), it spills buffered groups to disk:

What spills: the raw input records. On reload at finalize, each group is re-split onto its output port exactly as it would have been without spilling, so the output is identical whether a group stayed resident or round-tripped through disk, including within-group row order, which is restored to arrival order after a reload. The per-group removal decision is computed from an in-memory aggregate over the same records. That aggregate state is O(distinct groups) and is never spilled; only the raw records spill. Because the state cannot spill and Cull has no upstream channel to pause, it is bounded by a hard check instead: if that state plus the run’s other live charged memory would exceed memory.limit, the run fails loud with a memory-budget error rather than letting the state grow unbounded. See the limit below.
Spill priority: 15, between grace-hash Combine (10) and external sort (20), matching Reshape. Cull cannot back-pressure — once its predecessor has drained there is no upstream channel to pause — so under memory pressure it always spills its own buffer in-thread rather than pausing a producer.
Largest-first, stop at the threshold: when the budget trips, Cull evicts the largest resident groups first and stops as soon as the resident footprint drops back under the soft threshold — it does not drain every group.
Skew (one giant group): a single correlation group whose resident tail alone exceeds the budget is spilled incrementally — sliced by the upper bits of each record’s admission sequence — while smaller groups stay resident, so the ingest-time resident peak stays bounded even under one giant skewed group.

The on-disk spill volume Cull produces is surfaced per stage in clinker run --explain (the Estimated spill volume and Spill compression sections) and, after a run, in the actual per-stage spill totals.

Limit: a single group must fit the finalize budget

Cull evaluates its group-level predicate against the whole group at once, so even though cross-group and ingest-time peaks spill to disk, the finalize reload of one group needs that group to fit the memory budget. Skew slicing bounds the ingest peak, but a single correlation group larger than memory.limit has no in-budget representation. Rather than risk an out-of-memory crash, the run fails loud with E310 MemoryBudgetExceeded, naming the Cull node, the offending partition_by group, and its footprint against the budget:

E310 drop_big: arena exceeded budget (512000/8192) [one Cull correlation
group [account="BIG"] does not fit; the reported use is that group's reload
footprint alone. ...]

Raising memory.limit is the only fix that leaves your output unchanged. Raise it clear of the reported figure — finalize also holds the run’s remaining groups and the per-group decision map, so that figure is a floor, not a target.

The other two levers both change what you get, and are worth knowing only so you can weigh them deliberately:

Dropping columns this node does not read, in an upstream Transform, shrinks each buffered record. But Cull filters rows, never columns — both its ports carry the unchanged upstream schema — so anything you strip upstream is also missing from the main and removed_to outputs.
Narrowing partition_by shrinks the group too, but that key defines the group drop_group_when evaluates over. With a rule like count(*) > 100, splitting one account across a finer key drops each resulting group below the threshold, so an account that should have been removed is emitted on the main port instead — the run “works” and quietly returns a different result set. Treat the grouping key as a modelling decision, never as a memory knob.
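
To see why narrowing the key changes the result, here is a small Python model of a count(*) > 100 rule evaluated over two different grouping keys. It models the grouping semantics only; the `cull` function and the `region` column are hypothetical.

```python
from collections import Counter


def cull(records, key_fields, threshold=100):
    """Split records into (main, removed) using a count(*) > threshold rule.

    A sketch of the grouping semantics only; field names are hypothetical.
    """
    def key(record):
        return tuple(record[f] for f in key_fields)

    counts = Counter(key(r) for r in records)
    main = [r for r in records if counts[key(r)] <= threshold]
    removed = [r for r in records if counts[key(r)] > threshold]
    return main, removed


# One account with 150 rows, spread evenly over two regions.
rows = [{"account": "BIG", "region": "east" if i % 2 else "west"} for i in range(150)]

main, removed = cull(rows, ["account"])
print(len(main), len(removed))   # 0 150

main, removed = cull(rows, ["account", "region"])
print(len(main), len(removed))   # 150 0
```

With the finer key each group holds 75 rows, so neither crosses the threshold and the whole account lands on the main port.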

Run clinker explain --code E310 for remediation keyed to whichever memory surface overran, or see the memory guide.

There is a second, symmetric bound on the number of groups. The per-group removal decision is held in an in-memory aggregate that is O(distinct groups) and — unlike the raw records — cannot spill. If a partition key is so high-cardinality that the decision state plus the run’s other live charged memory would exceed memory.limit (many small groups rather than one giant group), the run likewise fails loud with E310, rather than growing that state unbounded. Here, coarsening partition_by is a legitimate fix only if the coarser key is the grouping you actually meant — the same caveat as above applies. Otherwise, raise memory.limit.

Several distinct memory surfaces can raise E310 against a Cull node, so read the [...] detail to see which one overran. The ones documented here are Cull correlation group [...] (the giant-group case above), Cull drop-decision aggregate state (the group-count bound), Cull cross-region tee admission (a downstream stage in a different deferred region forces this node’s output to be parked in memory), and node-buffer materialization overlap (a consumer must collect one of Cull’s sequential port scans into a resident vector). Cull’s main and removed_to handoff buffers themselves are spill-eligible, including when a port fans out. Other surfaces the shared runtime charges may name this node too — the detail string is the authority, not this list.

Envelope Nodes

Envelope nodes group a body stream into framed documents. An Envelope is a discrete, composable stage you can place after any operator (a Transform, a Merge, a Combine, or an Aggregate) to declare “from here on, treat the records as belonging to framed documents.” It mirrors the message/EDI/XML envelope-wrapper pattern (the Enterprise Integration Patterns Envelope Wrapper, XProc’s p:wrap-sequence): the body is the payload, and the envelope is the document boundary around it.

This page documents the preserve and concat framing strategies and the orthogonal header: / footer: synthesis that layers on top of either.

Basic structure

- type: envelope
  name: framed
  body: merged
  config:
    strategy: preserve

The node reads its body: input and emits the same records, framed per document. A downstream Output with reconstruct_envelope: true then writes one framed document per body grain.

Inputs: `body`, the wired `header`, and the not-yet-wired `trailer`

Input	Required	Description
`body`	yes	The records to frame into documents.
`header`	no	A 1-row-per-grain header stream. A wired value replaces each body document’s header with the matching header record, attached by document grain.
`trailer`	no	A stream whose records append to each framed document. Accepted in config but not yet wired — a wired value is rejected at plan validation this release.

When you omit header:, an Envelope frames each body record using the body’s own ambient envelope — the document context every record already carries from its source.

Wiring a `header:` port replaces the document’s header

A wired header: input is a second stream carrying one header record per document, each on the same document grain as the body it frames. The node attaches a header to a body strictly by grain, so the header record reaching it must carry the body’s grain — and the node then replaces that document’s ambient header with the wired header record (the framing grain is preserved; only the header changes). This is transform-in-place header replacement: rewrite a header’s values upstream — for example, override a batch id or stamp a run date — and frame the body with the rewritten header instead of the source’s original.

nodes:
  # `rewrite_header` rewrites the source header's values while keeping each
  # record's grain, so the rewritten header still grounds to its body document.
  - type: envelope
    name: framed
    body: payments
    header: rewrite_header
    config: { strategy: preserve }

The header record must carry a body document’s grain. A grain-preserving Transform of the source’s promoted header keeps it; a replacement from a different source establishes it via a business-key join against the body. A header record whose grain matches no in-flight body document (or carries the synthetic, ungrounded grain a Transform stamps onto a record it builds from scratch) cannot be placed, so the run fails with E351 (run clinker explain --code E351 for the full write-up):

envelope 'framed': a wired header record carries document grain <grain>, which
matches no in-flight body grain (or is a synthetic / ungrounded grain). The node
attaches a header to a body strictly by grain, so it cannot place a header that
grounds to no body document.

Exactly one header record may carry each body document’s grain. When the wired header stream carries two or more records on the same grain, the node has no rule to fold a second header onto an already-framed document, so it refuses to silently keep one and drop the rest. The run fails with E352 (run clinker explain --code E352 for the full write-up):

envelope 'framed': the wired header input carries two or more records for
document grain <grain> — exactly one header record per document grain is
required.

Deduplicate the header stream to one record per grain upstream — an aggregate or distinct on the grain’s business key, or a Transform that emits a single rewritten header per source document.
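
The two grain rules (every header must match a body grain, and each grain takes exactly one header) can be sketched as follows. This is a hedged model: `EnvelopeError`, `attach_headers`, and the data shapes are hypothetical, and only the error codes come from the text above.

```python
class EnvelopeError(Exception):
    """Carries the diagnostic code alongside the message."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def attach_headers(body_grains, header_records):
    """Map each body grain to its single wired header record.

    `body_grains` is an iterable of grain ids and `header_records` a list of
    (grain, header) pairs. Both shapes are assumptions for illustration.
    """
    body = set(body_grains)
    placed = {}
    for grain, header in header_records:
        if grain not in body:
            raise EnvelopeError("E351", f"grain {grain!r} matches no body grain")
        if grain in placed:
            raise EnvelopeError("E352", f"two or more headers for grain {grain!r}")
        placed[grain] = header
    return placed


print(attach_headers(["doc1", "doc2"],
                     [("doc1", {"batch": "A"}), ("doc2", {"batch": "B"})]))
# {'doc1': {'batch': 'A'}, 'doc2': {'batch': 'B'}}
```

The sketch leaves a body grain with no wired header out of the map; the text above does not say how the node treats that case, so it is not modelled here.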

A wired trailer: input is still rejected this release with a clear “not yet supported” message:

envelope node 'framed': explicit `trailer` input wiring is not yet supported —
omit it to frame with the body's own envelope

`strategy: preserve`

preserve emits one framed document per body grain. It is a transparent framing stage: body records pass through with their document context and grain unchanged, and the document-boundary signals are forwarded verbatim. Inserting a preserve Envelope between a body stage and an Output produces output byte-identical to today’s per-document framing. Its value is being the explicit, composable stage that later strategies extend, not a change in output.

preserve is the default, so config: { strategy: preserve } and an empty config: {} are equivalent.

Framing is keyed on the document grain, never the source file

The grain is the level at which one logical document is reconstructed — and it is not always one-per-file:

A nested X12 interchange frames once per interchange. The GS functional-group and ST transaction-set levels inherit the interchange grain, so an ISA … IEA interchange is one framed document regardless of how many groups or transaction sets it nests.
An HL7 multi-message file frames once per message. Each MSH message opens its own grain, so a single file containing several messages produces several framed documents.

Because framing keys on the grain rather than the source file, splitting or combining files never silently changes the document count.

`strategy: concat`

concat does the opposite of preserve: it collapses a multi-document body into one framed document. Every body record is re-stamped onto a single consolidated document context, so the body opens and closes exactly once regardless of how many documents fed in. This is the strategy to use when several source documents — say two files joined by a Merge — should write as one consolidated document with a single header and footer.

nodes:
  - type: merge
    name: both
    inputs: [file_a, file_b]
  - type: envelope
    name: framed
    body: both
    config: { strategy: concat }
  - type: output
    name: out
    input: framed
    config:
      name: out
      type: csv
      path: out.csv
      reconstruct_envelope: true

Re-stamping changes only the framing (the grain) and the ambient $doc.* view a record sees — it does not disturb per-record fields. In particular $source.file is a real column stamped when each record is read, so it still reports the record’s own originating file after a concat. Concat is lossless on per-record provenance; it changes only which document the record is framed inside.

The consolidated header, and the two-headers conflict

One consolidated document can carry only one envelope header. concat derives it from the headers of the documents that contribute body records, taking one header per document:

Every header agrees (or there is only one) → the consolidated document carries that common header.
No document carries a header → the consolidated document is headerless.
A headed document and a headerless document → the single header wins; the headerless document coexists with it (no conflict).

Only documents that contribute body records take part: a document that carries a header but no body records frames nothing once consolidated, so it never enters the comparison. Header identity is structural — two documents share a header when they declare the same sections, in the same order, with the same field values (including any raw content the reader preserves). Two files that differ only in an embedded control number therefore count as distinct headers.

When the body carries two or more distinct non-empty headers, concat refuses to silently keep one and drop the rest. The run fails with E350 (run clinker explain --code E350 for the full write-up):

envelope 'framed': concat collapses the body into one framed document, but the
body carried 2 distinct non-empty envelope headers — one document can frame only
one header, so concat will not silently drop the rest. Make the headers identical
upstream, or add a header-folding strategy that declares which header the
consolidated document keeps.

To resolve a conflict, keep the documents separate with preserve, make the headers identical upstream (project them to the same sections and values), or use header: synthesis (below) to regenerate a fresh consolidated header that does not depend on the source headers agreeing.
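
The consolidation rules can be modelled compactly. The sketch below is an illustration under stated assumptions: a header is represented as a nested tuple of (section, fields) pairs so that equality is structural and order-sensitive, and `consolidate_header` is a hypothetical name, not an engine function.

```python
def consolidate_header(documents):
    """Pick the single header for a concat-ed document.

    `documents` is a list of (header, body_records) pairs, where `header` is
    None for a headerless document or a tuple of (section, fields) pairs.
    This representation is an assumption made for illustration.
    """
    distinct = []
    for header, body in documents:
        if not body:          # contributes no body records: ignored
            continue
        if header is None:    # headerless documents never conflict
            continue
        if header not in distinct:
            distinct.append(header)
    if len(distinct) > 1:
        raise ValueError(
            f"E350: body carried {len(distinct)} distinct non-empty envelope headers"
        )
    return distinct[0] if distinct else None


h1 = (("interchange", (("sender", "ACME"), ("control", "0001"))),)
h2 = (("interchange", (("sender", "ACME"), ("control", "0002"))),)

print(consolidate_header([(h1, [1, 2]), (h1, [3])]) == h1)   # True
print(consolidate_header([(h1, [1]), (None, [2])]) == h1)    # True
print(consolidate_header([(None, [1]), (None, [2])]))        # None
print(consolidate_header([(h1, [1]), (h2, [])]) == h1)       # True
```

Passing `[(h1, [1]), (h2, [2])]` raises, because the two headers differ in their control number and both documents contribute body records.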

Synthesizing a header and footer

header: and footer: are orthogonal to the framing strategy. The strategy decides how many output documents there are (preserve = one per body grain, concat = one consolidated); synthesis decides what header and footer each of those documents carries. The node computes a fresh header (declarative scalar expressions) and footer (streaming aggregates over the framed body) per output document, stamps them as named sections into the document’s envelope, and the same header_from_doc / footer_from_doc writer path renders them.

Both maps are keyed section -> field -> CXL expression. The section name is user-chosen — it is the envelope section a downstream Output renders via header_from_doc / footer_from_doc. The inner field map preserves declaration order, which is the rendered cell order. A synthesized section overrides an existing same-named section on the document (a regenerate); other sections ride through untouched.

- type: envelope
  name: framed
  body: merged
  config:
    strategy: concat            # or preserve — synthesis works the same on either
    header:                     # section -> field -> scalar CXL
      interchange:
        sender: $vars.sender_id
        created: $pipeline.run_date
    footer:                     # section -> field -> aggregate CXL
      interchange:
        record_count: count()
        total: sum(amount)

Header fields: evaluated once at document open

A header: field is a scalar expression evaluated once per output document, before the body streams. It may read only inputs known at document open — pipeline configuration ($vars), per-document provenance ($source), pipeline-scope state ($pipeline), and the ambient envelope ($doc.*). It may not read a body column, because there is no “current body record” when the header is emitted; a body-column reference is rejected at compile time with E353 (run clinker explain --code E353). Put body-derived values in a footer aggregate instead.

Footer fields: streaming aggregates over the body

A footer: field folds the document’s body records into a footer value at the document’s close. The fold is an O(1) accumulator per open document, so it supports exactly the streaming distributive/algebraic aggregates over a bare field or *:

Aggregate	Example
`count`	`count()`, `count(*)`
`sum`	`sum(amount)`
`avg`	`avg(amount)`
`min`	`min(amount)`
`max`	`max(amount)`

Anything else is rejected at compile time. A holistic or unbounded aggregate (collect, any, weighted_avg) or a composed/multi-argument aggregate (sum(amount * 1.1), weighted_avg(value, weight)) raises E354 (run clinker explain --code E354); a non-aggregate function (median, mode) fails earlier as an unknown function. Project a composed value in an upstream Transform, or compute a holistic value in an upstream Aggregate, then aggregate the bare column here.
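
What makes the allowed aggregates streaming is that each needs only fixed-size state. A minimal Python sketch of such a fold follows; `FooterFold` is a hypothetical class, and null handling is deliberately not modelled because the text above does not specify it.

```python
from decimal import Decimal


class FooterFold:
    """Fixed-size state for count / sum / avg / min / max over one field."""

    def __init__(self):
        self.count = 0
        self.total = Decimal(0)
        self.minimum = None
        self.maximum = None

    def push(self, value):
        # Each record updates four scalars; nothing is buffered.
        value = Decimal(value)
        self.count += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    def avg(self):
        # avg is algebraic: it is derived from sum and count at close.
        return self.total / self.count if self.count else None


fold = FooterFold()
for amount in ["1.00", "1.00", "2.00"]:
    fold.push(amount)
print(fold.count, fold.total, fold.minimum, fold.maximum)  # 3 4.00 1.00 2.00
```

A median, by contrast, would need every value retained until the document closes, which is why it has no place in this kind of accumulator.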

Because the strategy sets the grain cardinality, the same footer: { interchange: { record_count: count() } } produces a different result on each strategy over the same body:

Under strategy: preserve, each body document frames its own output document, so each footer’s record_count is that document’s body count.
Under strategy: concat, the whole body collapses to one document, so the single footer’s record_count is the merged body count.
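
The difference is easy to see in a small model. In this sketch the body is a list of (grain, record) pairs and the `"consolidated"` key is a made-up label for the single concat document; neither is an engine data structure.

```python
from collections import Counter


def footer_record_counts(body, strategy):
    """Return {output document: record_count} for a body of (grain, record) pairs."""
    if strategy == "concat":
        # Every record is re-stamped onto one consolidated document.
        return {"consolidated": sum(1 for _ in body)}
    if strategy == "preserve":
        # One output document per body grain.
        return dict(Counter(grain for grain, _ in body))
    raise ValueError(f"unknown strategy: {strategy}")


body = [("file_a", 1), ("file_a", 2), ("file_b", 3)]
print(footer_record_counts(body, "preserve"))  # {'file_a': 2, 'file_b': 1}
print(footer_record_counts(body, "concat"))    # {'consolidated': 3}
```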

Placement

An Envelope with only its body: input wired is a normal single-input, single-output node. Put it anywhere a record stream flows:

nodes:
  # … sources, a Combine that joins two streams into `merged` …
  - type: envelope
    name: framed
    body: merged
    config: { strategy: preserve }
  - type: output
    name: out
    input: framed
    config:
      name: out
      type: csv
      path: out.csv
      reconstruct_envelope: true

Placing the Envelope after a Combine or Aggregate is the intended use: it declares the document framing for the combined or reduced result, which is exactly where the later consolidation and synthesizing strategies do their work.

Memory model

Both strategies re-park the body into the node’s own buffer slot, which the engine’s memory arbitrator governs and spills to disk under pressure, so neither strategy requires the whole input to fit in RAM.

preserve is a transparent framing pass-through: it forwards records and their document boundaries unchanged. concat additionally re-stamps each record onto the one consolidated document context and replaces the per-document boundaries with a single open/close pair. Before that, it consolidates the headers. Body records are grouped by document (one document’s worth of body records shares one grain and one header), so concat makes one pass over the records to collect the distinct headers and compares only one envelope per document. The comparison work is bounded by the number of documents, not the number of body rows.

header: / footer: synthesis adds, on top of the materialized body, one O(1) accumulator per footer field per open document — every allowed footer aggregate (count / sum / avg / min / max) holds a fixed-size state regardless of how many body rows it folds. So a document’s footer state is independent of its body-row count, and synthesis stays within the node’s bounded-memory model.

Output Nodes

Output nodes write processed records to files. They are the terminal nodes of a pipeline – every pipeline path must end at an output (or records are silently dropped).

Basic structure

- type: output
  name: result
  input: transform_node
  config:
    name: output_stage
    type: csv
    path: "./output/result.csv"

The type: field selects the output format: csv, json, xml, fixed_width, edifact, x12, hl7, or swift. The edifact, x12, and swift writers reconstruct one interchange/message envelope around emitted records; the hl7 writer re-emits HL7 v2 segments and optionally wraps them in batch/file envelopes. See EDIFACT Format, X12 Format, HL7 v2 Format, and SWIFT MT Format.

Structured single-writer outputs (edifact, x12, hl7, and swift) accept one concrete document grain per output file. A multi-file source or multi-input merge feeding one of these outputs is rejected instead of being silently written as one merged envelope. To write multiple structured documents, consolidate them deliberately with an Envelope node first or route each document to a separate output path.

Direct broadcast to several outputs

Several Output nodes may name the same input. This is a broadcast: every Output receives every upstream record, regardless of node declaration order. The run report counts one write per sink, so five input records feeding a CSV and a JSON Output produce records_written: 10.

Use a Route node when outputs should receive different subsets. Writing a field such as _route does not select a destination; it is an ordinary output column unless a Route condition explicitly reads it.

Field control

Output nodes can either pass every upstream field through to the writer or restrict output to the fields the upstream transform explicitly emitted. Several options control which fields appear and how they are named.

Unmapped input field passthrough

    include_unmapped: false    # Default: true

When true (the default), every field on an input record that the upstream transform did not explicitly emit still passes through to the output unchanged. This includes fields the source’s on_unmapped: auto_widen policy absorbed into the per-record $widened sidecar map – their contents expand back to top-level columns at the sink.

When false, only fields named by an emit statement in the upstream transform appear in the output. The $widened sidecar slot is stripped and undeclared input fields are dropped.

When true, how a carried-along column reaches the writer depends on the output format. Self-describing formats (JSON / NDJSON / XML) write each record’s own keys. A CSV output widens its header to the union of every record’s columns when it can materialize the batch. When it cannot, it fails loudly with a SchemaDrift error rather than dropping a column it cannot fit under its already-committed header. That applies on a bounded-memory streaming path (a Merge, a fused Transform, a single-branch Route, a streaming-strategy Aggregate, or the probe side of a hash-build-probe Combine feeding the output) and on an envelope-reconstructing path. A fixed-width output has no room for an undeclared column and likewise raises SchemaDrift. See Auto-Widen & Schema Drift → Schema drift across records.

Migration notice

The default flipped from false to true in a recent release (see issue #90). Pipelines that relied on the previous behavior – where output records contained only the fields explicitly emitted upstream – must now set include_unmapped: false explicitly to restore that shape.

The flag composes independently with include_correlation_keys: true – see below. See Auto-Widen & Schema Drift -> Output controls for the full specification and cross-format flow examples.

Worked example

Suppose the upstream source emits records with order_id, customer_id, amount, and region, and a transform that emits only one derived field:

- type: transform
  name: classify
  input: orders
  config:
    cxl: |
      emit amount_bucket = if amount >= 1000 then "high" else "low"

With include_unmapped: true (the default), each output record carries order_id, customer_id, amount, region, and amount_bucket. With include_unmapped: false, each output record carries only amount_bucket. The transform’s CXL is unchanged in both cases – the Output node decides the field set.

Include correlation-key shadow columns

    include_correlation_keys: true    # Default: false

When a source declares a correlation_key:, the engine tracks correlation-group identity on hidden columns that are stripped from output by default. Set include_correlation_keys: true to surface them in the writer output — typically for debugging correlation-group routing or auditing DLQ behavior. See Correlation Keys.

include_correlation_keys does not surface auto-widened columns – include_unmapped is the separate flag for that. The two are independent: each, both, or neither can be set.

Nested columns and non-JSON writers

The CSV, XML, fixed-width, EDIFACT, X12, and HL7 writers can only write flat scalar columns; a nested value reaching one of them fails with an UnserializableMapValue error. JSON writes nested values natively. The usual cause is an auto-widened column reaching a non-JSON writer with include_unmapped: true, the setting under which the $widened sidecar expands back to top-level columns. See Auto-Widen & Schema Drift for the fix.

Field mapping

mapping: declares the columns the file carries – which columns, under what names, in what order – without changing upstream CXL. It is a sequence, one item per output column:

    mapping:
      - order_id                  # carried through under its own name
      - sold_to: customer_id      # written as `sold_to`, read from `customer_id`
      - contact_email: customer_email
      - channel
      - sku

Two item shapes:

A bare column name emits that column unchanged. This is the common case, and it costs one line naming the column once.
A single-key pair renames. The output name is on the left, in the same position the bare form’s name occupies, and the source column is on the right. Reading an item left to right always tells you what appears in the file first.

The renames are the only items carrying a colon, so in a wide output they are found by scanning for structure rather than by comparing two names per line.

Order and selection

Declaration order is the output column order. Listed columns are written first, in the order the block declares them, whatever order they arrive in.

include_unmapped governs everything the block does not list. With include_unmapped: true (the default) unlisted columns are appended after the declared ones, in their existing relative order. With include_unmapped: false they are dropped, so the block becomes the complete statement of the output:

    include_unmapped: false
    mapping:
      - department
      - surname: last_name
      - first_name

Given upstream columns first_name, last_name, department, that writes exactly department,surname,first_name.

Every record carries every declared column. When a record does not supply an item’s source column, that column is still written, empty. The file’s shape follows the block, not the data — so a stream whose records differ in shape (a multi-record-type source, a column arriving through auto_widen, a composition body’s open row) still produces one stable column set in declaration order, rather than one that depends on which record happened to arrive first.
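
The ordering and selection rules can be captured in a short Python model. It is a sketch rather than the writer's code: `apply_mapping` is a hypothetical name, and it assumes that a source column read by a rename counts as listed (so it is not appended a second time under include_unmapped: true), which the text above does not state.

```python
def apply_mapping(record, mapping, include_unmapped=True):
    """Return the ordered (column, value) pairs written for one record.

    `mapping` is a list of bare names or single-key {output: source} dicts,
    mirroring the YAML sequence.
    """
    out = []
    consumed = set()
    for item in mapping:
        if isinstance(item, dict):
            [(name, source)] = item.items()
        else:
            name = source = item
        # A declared column is always written; a missing source yields "".
        out.append((name, record.get(source, "")))
        # Assumption: a column read by an item is treated as listed.
        consumed.add(source)
    if include_unmapped:
        # Unlisted columns follow, in their existing relative order.
        out.extend((k, v) for k, v in record.items() if k not in consumed)
    return out


record = {"first_name": "Ada", "last_name": "Lovelace", "department": "R&D"}
mapping = ["department", {"surname": "last_name"}, "first_name"]
columns = apply_mapping(record, mapping, include_unmapped=False)
print(",".join(name for name, _ in columns))  # department,surname,first_name
print(apply_mapping({"department": "R&D"}, ["department", "badge"], False))
# [('department', 'R&D'), ('badge', '')]
```

The second call shows the stable-shape rule: `badge` is absent from the record but is still written, empty.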

One upstream column may feed two output columns – - sku and - item_code: sku – because names must be unique on the output side, not the source side. Declaring the same output name twice is rejected (E364): a file cannot carry two columns under one header.

For the same reason, an output name that include_unmapped: true would also carry through is rejected. If upstream already has a sold_to column, writing - sold_to: customer_id under include_unmapped: true would put two sold_to columns in the file and readers would resolve the wrong one. Rename the mapped column, exclude the upstream one, or set include_unmapped: false.

Where the compiler cannot enumerate the upstream columns, the same collision reaches the run. The mapped value wins – the block is your explicit statement of what the file carries – and the displaced upstream column is named in a W366 warning at the end of the run. Applying one of the three fixes above silences it.

Diagnostics

A mapping: item naming a column that does not exist at that point in the pipeline is rejected at compile time (E365), with the available column list and a did you mean when the name is a near miss. Nothing is renamed silently.

The compiler cannot always see the column set. Inside a composition body the rows are open by construction, and under on_unmapped: auto_widen a column can reach the sink through the sidecar without being declared anywhere. There an item naming an unknown column compiles even when its name resembles a declared column: spelling similarity cannot prove that a dynamic field is absent. W365 reports it after the run if no written record supplied it.

What catches the rest is the end of the run: if no record supplied an item’s source column, that item wrote an empty column in every row, and the run reports it as W365, naming the column to correct. An item some records supply and others do not is a sparse column, not a mistake, and is not reported.

Both W365 and W366 are advisory. They print to standard error when the run finishes and do not change the exit code – the file is written and readable either way, and by the time a stream ends the run’s other outputs have already been flushed.

A column absent from the source’s schema: reaches the sink only through the auto_widen sidecar, which is expanded to top-level columns only under include_unmapped: true. A mapping: item may name such a column when that flag is set; under include_unmapped: false it cannot resolve and is rejected at compile time.

An empty block – mapping: {} or mapping: [] – is rejected (E364): it declares an output with no columns. To write every upstream column, remove the mapping: key rather than emptying it.

Writing the block as a YAML map instead of a sequence is rejected (E364); the message prints your own block already rewritten. Run clinker explain --code E364 for the migration, and read the direction note there before pasting: releases before this one documented output_name: source_field but executed the reverse, so the rewrite swaps each pair’s two sides to preserve what the pipeline was actually writing.

Excluding fields

Remove specific fields from output:

    exclude: [internal_id, _debug_flag, temp_calc]

exclude: matches incoming column names, and runs before mapping:. Two consequences:

The columns that survive keep their relative order. Upstream a, b, c, d with exclude: [b] writes a, c, d.
Naming a column that a mapping: item also produces is not a conflict – the exclusion removes the upstream column of that name and leaves the mapped one standing. That is the fix for the two-columns-under-one-header collision above: - sold_to: customer_id with exclude: [sold_to] writes one sold_to column, carrying customer_id’s value.

Excluding a column a mapping: item reads is a different matter, and is rejected (E364): the exclusion removes the column before the item can read it, so the item could never resolve.

Header control (CSV)

    include_header: true      # Default: true

Set to false to omit the CSV header row.

Null handling

    preserve_nulls: false     # Default: false

When false, null values are written as empty strings. When true, nulls are preserved in the output format’s native null representation (e.g., null in JSON).

Rounding decimals to a declared scale

An Output node’s optional schema: may declare a column type: decimal with a scale. A decimal value landing in that column is rounded to the declared number of fractional places on write, using banker’s rounding — the same boundary contract a decimal source column applies on read.

    schema:
      - { name: dept,    type: string }
      - { name: total,   type: decimal, scale: 2 }
      - { name: average, type: decimal, scale: 2 }

Decimals compute at full precision inside the pipeline (division and avg keep every digit), so a declared output scale is how you pin a computed result to fixed places at the sink: avg(amount) over 1.00, 1.00, 2.00 writes 1.33 into a scale: 2 column, while sum(amount) — already at scale 2 — stays 4.00. This works for every format (CSV, JSON, fixed-width); an output column with no declared scale, or an output with no schema: block at all, keeps the full-precision value. Only decimal values in decimal-declared columns are affected — no other type is coerced. See Decimal — arithmetic rules for the full boundary-contract model.
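
The arithmetic can be checked with Python's decimal module, whose ROUND_HALF_EVEN mode is banker's rounding. This is an illustration of the rounding rule only; `round_to_scale` is a hypothetical helper, not a Clinker function.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_to_scale(value, scale):
    """Round a Decimal to `scale` fractional places with banker's rounding."""
    quantum = Decimal(1).scaleb(-scale)   # scale=2 -> Decimal('0.01')
    return value.quantize(quantum, rounding=ROUND_HALF_EVEN)


amounts = [Decimal("1.00"), Decimal("1.00"), Decimal("2.00")]
average = sum(amounts) / len(amounts)        # full precision: 1.3333...
print(round_to_scale(average, 2))            # 1.33
print(round_to_scale(sum(amounts), 2))       # 4.00
print(round_to_scale(Decimal("2.345"), 2))   # 2.34 (tie goes to the even digit)
print(round_to_scale(Decimal("2.355"), 2))   # 2.36
```

The last two lines show what distinguishes banker's rounding from round-half-up: an exact tie rounds toward the even neighbour, so 2.345 goes down while 2.355 goes up.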

The same rounding applies to an Output node declared inside a composition body. When its schema: names an external .schema.yaml file, the path resolves relative to the composition file’s own directory (not the invoking pipeline’s).

Output format options

CSV

- type: output
  name: csv_out
  input: processed
  config:
    name: csv_out
    type: csv
    path: "./output/result.csv"
    options:
      delimiter: "|"

delimiter is a single byte on the wire, so it must be exactly one ASCII character (for example ,, |, or \t). An empty, multi-character, or non-ASCII value is rejected at plan validation rather than silently truncated to its first byte.
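
The rule amounts to a two-part check, sketched here in Python. `validate_delimiter` is a hypothetical helper that mirrors the stated rule; the error wording is invented.

```python
def validate_delimiter(value):
    """Return the delimiter as one byte, or raise if it is not one ASCII char."""
    if len(value) != 1 or not value.isascii():
        raise ValueError(
            f"delimiter must be exactly one ASCII character, got {value!r}"
        )
    return value.encode("ascii")


print(validate_delimiter("|"))    # b'|'
print(validate_delimiter("\t"))   # b'\t'
```

An empty string, `"||"`, and `"é"` all raise: the first two fail the length check and the third is one character but not one byte in ASCII.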

JSON

- type: output
  name: json_out
  input: processed
  config:
    name: json_out
    type: json
    path: "./output/result.json"
    options:
      format: ndjson           # array | ndjson
      pretty: true             # Pretty-print JSON

array (default) – writes a single JSON array containing all records.
ndjson – writes one JSON object per line.

JSON numbers cannot represent non-finite floats; a record carrying NaN or an infinity fails the write with a JSON error instead of silently becoming null. See JSON Format.
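
Python's json module shows the same two behaviours and makes a convenient illustration. This is not Clinker's writer: `render` is a hypothetical function, and `allow_nan=False` is the Python switch that turns a non-finite float into an error instead of emitting invalid JSON.

```python
import json


def render(records, fmt="array"):
    """Serialize records as one JSON array or as NDJSON; reject NaN/Infinity."""
    if fmt == "array":
        return json.dumps(records, allow_nan=False)
    if fmt == "ndjson":
        return "".join(json.dumps(r, allow_nan=False) + "\n" for r in records)
    raise ValueError(f"unknown format: {fmt}")


rows = [{"id": 1}, {"id": 2}]
print(render(rows))                 # [{"id": 1}, {"id": 2}]
print(render(rows, "ndjson"), end="")
# {"id": 1}
# {"id": 2}

try:
    render([{"ratio": float("nan")}])
except ValueError as err:
    print("rejected:", err)
```

The ndjson form can be written one record at a time, while the array form needs an opening and a closing bracket around the whole run.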

XML

- type: output
  name: xml_out
  input: processed
  config:
    name: xml_out
    type: xml
    path: "./output/result.xml"
    options:
      root_element: "data"
      record_element: "row"
      attribute_prefix: "@"    # emit @-prefixed fields as XML attributes

Fields whose final path segment carries the attribute_prefix (default @, matching the XML source option) are emitted as XML attributes of their enclosing element, so attribute fields read from an XML source round-trip. See XML Format for details.

Fixed-width

- type: output
  name: fw_out
  input: processed
  config:
    name: fw_out
    type: fixed_width
    path: "./output/result.dat"
    schema: "./schemas/output.schema.yaml"
    options:
      line_separator: crlf

Fixed-width output requires a format schema defining field positions and widths. Fields land at their declared byte ranges with gaps space-filled — see Fixed-Width Format for the layout semantics.

EDIFACT

- type: output
  name: edi_out
  input: messages
  config:
    name: edi_out
    type: edifact
    path: "./out/result.edi"
    options:
      interchange: ["UNOA:1", "SENDER", "RECEIVER", "240101:1200", "REF1"]
      message_type: "ORDERS:D:96A:UN"
      write_una: false
      segment_newline: true

The EDIFACT writer reconstructs the interchange envelope around emitted records, recomputing the UNT/UNZ control counts and echoing the control references, and release-escapes any element data that carries a service character. The UNB header comes from interchange (literal elements) or interchange_from_doc (echoed from a $doc section). An interchange is a single envelope, so an edifact output cannot be combined with a split: block — the combination is rejected at config-validation time (E323). See EDIFACT Format for the full option reference, the record schema, and the round-trip semantics.
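
Release-escaping means prefixing each service character in element data with the release character. The sketch below assumes the default EDIFACT service characters (`+`, `:`, `'`, with `?` as the release character) and no UNA override; `release_escape` is a hypothetical name.

```python
def release_escape(text, service_chars="+:'?"):
    """Prefix every service character with the release character '?'.

    Assumes the default service characters; a UNA segment can redefine them.
    """
    return "".join("?" + ch if ch in service_chars else ch for ch in text)


print(release_escape("10+5=15?"))   # 10?+5=15??
print(release_escape("O'Brien"))    # O?'Brien
```

Working character by character means the release character itself is escaped exactly once, so a reader that strips one `?` per escaped character recovers the original data.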

HL7 v2

- type: output
  name: hl7_out
  input: messages
  config:
    name: hl7_out
    type: hl7
    path: "./out/result.hl7"
    options:
      file_header: ["^~\\&", "LAB", "HOSP", "EHR", "HOSP", "20240102", "FILE7"]
      batch_header: ["^~\\&", "LAB", "HOSP", "EHR", "HOSP", "20240102", "BATCH3"]
      segment_newline: true

The HL7 writer re-emits the MSH and body segments from the record stream, escaping any field data that carries a delimiter character (| → \F\, ^ → \S\, and so on). When a file_header (or file_header_from_doc) or batch_header is configured the writer wraps the messages in an FHS..FTS file or BHS..BTS batch and recomputes the closing BTS/FTS counts. A batch/file envelope is a single structure, so an hl7 output cannot be combined with a split: block — the combination is rejected at config-validation time (E339). See HL7 v2 Format for the full option reference, the record schema, the MSH off-by-one, and the round-trip semantics.

Sort order

Sort records before writing:

    sort_order:
      - { field: "name", order: asc }
      - { field: "amount", order: desc, null_order: last }

Sort option	Values	Default
`order`	`asc`, `desc`	`asc`
`null_order`	`first`, `last`, `drop`	`last`

first – nulls sort before all non-null values.
last – nulls sort after all non-null values.
drop – records with null sort keys are excluded from output.

Shorthand: a bare string defaults to ascending with nulls last:

    sort_order:
      - "name"
      - { field: "amount", order: desc }
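
The ordering rules above can be modelled in a few lines of Python. This is a minimal sketch, not Clinker's implementation: `sort_records` is a hypothetical helper, records are plain dicts, and it assumes that null placement is independent of `asc` / `desc`, which the text above implies but does not state outright.

```python
from functools import cmp_to_key

def sort_records(records, sort_order):
    """Sort dict records by a list of sort keys (a model of `sort_order`)."""
    # Normalize the bare-string shorthand into the full form with defaults.
    keys = []
    for entry in sort_order:
        if isinstance(entry, str):
            entry = {"field": entry}
        keys.append({"order": "asc", "null_order": "last", **entry})

    # `drop` excludes any record whose sort key is null.
    kept = [
        r for r in records
        if not any(k["null_order"] == "drop" and r.get(k["field"]) is None
                   for k in keys)
    ]

    def compare(a, b):
        for k in keys:
            x, y = a.get(k["field"]), b.get(k["field"])
            if x is None and y is None:
                continue
            if x is None or y is None:
                # Assumption: null placement ignores asc/desc.
                null_first = k["null_order"] == "first"
                a_before = (x is None) == null_first
                return -1 if a_before else 1
            if x != y:
                result = -1 if x < y else 1
                return result if k["order"] == "asc" else -result
        return 0

    return sorted(kept, key=cmp_to_key(compare))
```

Called with `["name", {"field": "amount", "order": "desc"}]`, the sketch orders by name ascending, then by amount descending, with null amounts after the non-null ones inside each name.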

File splitting

Split output into multiple files based on record count, byte size, or group boundaries:

- type: output
  name: split_output
  input: processed
  config:
    name: split_output
    type: csv
    path: "./output/result.csv"
    split:
      max_records: 10000
      max_bytes: 10485760           # 10 MB
      group_key: "department"       # Never split mid-group
      naming: "{stem}_{seq:04}.{ext}"
      repeat_header: true           # Repeat CSV header in each file
      oversize_group: warn          # warn | error | allow

Split configuration fields

Field	Required	Default	Description
`max_records`	No	–	Soft record count limit per file
`max_bytes`	No	–	Soft byte size limit per file
`group_key`	No	–	Field name – never split within a group sharing this key value
`naming`	No	`"{stem}_{seq:04}.{ext}"`	File naming pattern. `{stem}` is the base name, `{seq:04}` is a zero-padded sequence number, `{ext}` is the file extension
`repeat_header`	No	`true`	Repeat CSV header row in each split file
`oversize_group`	No	`warn`	What to do when a single key group exceeds file limits

Splitting has no effect unless at least one of max_records or max_bytes is set.

For formats whose output wraps the whole file in framing – a JSON array or an XML root element – each split file is a complete, independently valid document: the framing is closed at rotation and reopened for the next file.

Oversize group policies

warn (default) – log a warning and allow the oversized file.
error – stop the pipeline.
allow – silently allow the oversized file.

When group_key is set, the split point is the first group boundary after the threshold is reached (greedy). Without group_key, files are split at the exact limit.
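
The split-point rule can be illustrated with a small Python model. It is a sketch under stated assumptions: `split_records` is a hypothetical name, only `max_records` is modelled (not `max_bytes` or `oversize_group`), and the input is assumed to arrive already grouped by the key.

```python
def split_records(records, max_records, group_key=None):
    """Partition records into per-file lists by record count only."""
    files, current = [], []
    for record in records:
        if len(current) >= max_records:
            # Threshold reached: without a group key rotate immediately;
            # with one, wait for the next change in the key value.
            at_boundary = (
                group_key is None
                or record[group_key] != current[-1][group_key]
            )
            if at_boundary:
                files.append(current)
                current = []
        current.append(record)
    if current:
        files.append(current)
    return files
```

With six records in departments A, A, A, B, B, C and `max_records=2`, the ungrouped call yields files of exactly two records each, while `group_key="department"` yields files of three, two, and one records, because the first file stays open until the department changes.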

Streaming writes after an interleave Merge

When a single Output sits directly after a Merge with mode: interleave whose inputs are all Sources, records are written to disk as they arrive rather than being buffered until the merge finishes. This keeps memory flat and lets a slow writer naturally pace the upstream readers.

- type: source
  name: src_a
  config: { type: csv, path: a.csv, schema: ... }
- type: source
  name: src_b
  config: { type: csv, path: b.csv, schema: ... }
- type: merge
  name: merged
  inputs: [src_a, src_b]
  config:
    mode: interleave        # required
- type: output
  name: out
  input: merged
  config:
    name: out
    type: csv
    path: out.csv

This is automatic — there is no setting to enable it. It applies only to this exact shape: one interleave Merge of Sources feeding one non-splitting Output, in a pipeline without correlation keys. Any other topology buffers as usual, with identical output either way.

Complete example

- type: output
  name: department_reports
  input: enriched_employees
  config:
    name: department_reports
    type: csv
    path: "./output/employees.csv"
    # `include_unmapped: false` makes the mapping the whole output: these four
    # columns, in this order, and nothing else. Without it every unlisted
    # upstream column would still be appended after them, and an `exclude:`
    # would be needed to keep any of them out.
    include_unmapped: false
    mapping:
      - "Employee ID": employee_id
      - "Full Name": display_name
      - department
      - "Annual Salary": salary
    include_header: true
    sort_order:
      - { field: "department", order: asc }
      - { field: "display_name", order: asc }
    split:
      max_records: 5000
      group_key: "department"
      naming: "employees_{seq:03}.csv"
      repeat_header: true

CSV Format

CSV is the default file format and the most common Clinker input. The reader decodes each line into a record whose fields are matched positionally (or by header name) against the source’s declared schema:, and the writer reverses the process. CSV pairs with the file transport; see Source Nodes for the transport / format split and the schema rules every source shares.

- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: "./data/orders.csv"
    schema:
      - { name: order_id, type: int }
      - { name: customer_id, type: int }
      - { name: amount, type: float }
      - { name: order_date, type: date }
    options:
      delimiter: ","         # default ","
      quote_char: "\""       # default "\""
      has_header: true        # default true
      encoding: "utf-8"      # default "utf-8"

Options

All CSV options are optional. With no options: block, Clinker uses standard RFC 4180 defaults.

Option	Default	Description
`delimiter`	`,`	Field separator, exactly one ASCII byte. Set to `\t` for TSV, `;` for semicolon-delimited exports.
`quote_char`	`"`	Quote character that escapes delimiters and newlines inside a field, exactly one ASCII byte.
`has_header`	`true`	When `true`, the first line names the columns and is consumed, not emitted. When `false`, fields bind to `schema:` positionally.
`encoding`	`utf-8`	Character set each field — including the header row — is decoded through. Supported values are `utf-8` (the default) and `iso-8859-1` (aliases `latin-1`, `latin1`). See Encoding.

delimiter and quote_char are each a single byte on the wire, so each must be exactly one ASCII character. An empty, multi-character, or non-ASCII value (for example "||" or "→") is rejected at plan validation — it is never silently truncated to its first byte.

Encoding

The reader decodes every field through the source’s declared encoding:

utf-8 (the default) is strict — a byte sequence that is not valid UTF-8 fails the run loudly rather than substituting replacement characters, so a mis-declared encoding is caught instead of silently corrupting data.
iso-8859-1 (Latin-1; also spelled latin-1 or latin1) maps each byte 0xNN to codepoint U+00NN, so high bytes such as 0xE9 (é) from legacy exports decode correctly.

An unsupported encoding is rejected at startup with a precise error naming the value and the supported set — it is never silently ignored.
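
Python's codecs show the same strict-versus-Latin-1 behaviour, which makes the difference easy to see. This is an illustration only: `decode_field` is a hypothetical helper, and the error wording is invented for the sketch rather than taken from Clinker.

```python
def decode_field(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode one field strictly; unknown encodings are refused up front."""
    aliases = {
        "utf-8": "utf-8",
        "iso-8859-1": "iso-8859-1",
        "latin-1": "iso-8859-1",
        "latin1": "iso-8859-1",
    }
    if encoding not in aliases:
        raise ValueError(
            f"unsupported encoding {encoding!r}; supported: {sorted(aliases)}"
        )
    # errors="strict" raises UnicodeDecodeError on invalid bytes instead of
    # substituting U+FFFD replacement characters.
    return raw.decode(aliases[encoding], errors="strict")
```

The byte string `b"caf\xe9"` decodes to `café` under `latin1`, and raises under the default `utf-8` because a lone 0xE9 is not a complete UTF-8 sequence.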

Multi-record CSV sources (those whose schema: is a map with a records: list, described below) are decoded as UTF-8 only. Declaring a non-UTF-8 encoding on such a source is rejected at startup; split the file into a single-schema CSV source if it needs a non-UTF-8 charset.

Header handling

With has_header: true, the header row’s names bind input columns to the schema: entries — column order in the file may differ from the schema. With has_header: false, binding is strictly positional, so the schema order must match the file’s column order.

Input columns the schema does not name are governed by the source’s on_unmapped policy, the same as every other format.

Multi-value cells (`split_values`)

A CSV cell holds one string, but that string may pack several values behind a delimiter (1,a;b;c). Declare the column multiple: true and add a split_values entry naming the field and its delimiter, and the reader parses the cell into an array:

- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: ./orders.csv
    split_values:
      - { field: tags, delimiter: ";" }
    schema:
      - { name: order_id, type: string }
      - { name: tags, type: string, multiple: true }

tags reads as ["a", "b", "c"]. An empty cell is an empty array; a cell with no delimiter is a one-element array; each element is coerced to the column’s declared type:. A quoted cell is unquoted first, so a delimiter inside the quotes is not a boundary. A multiple: true column with no covering split_values entry is rejected at compile (E361). The entry is read only on a single-schema source, not the multi-record reader below. See split_values in the Source reference for the full grammar.
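
The three cell shapes described above (empty, no delimiter, several values) reduce to a short rule. The sketch below is a Python model with a hypothetical `split_cell` helper; it works on a cell that the CSV layer has already unquoted and does not model the `escape:` or `json:` variants.

```python
def split_cell(cell: str, delimiter: str = ";", coerce=str) -> list:
    """Parse one already-unquoted cell into a list of typed values."""
    if cell == "":
        return []  # empty cell: empty array
    # A cell without the delimiter yields a one-element list.
    return [coerce(part) for part in cell.split(delimiter)]
```

Passing `coerce=int` stands in for a column whose declared `type:` is `int`.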

Writing CSV

On output, the writer emits one row per record with cells in the output schema’s column order — the same order as the header row — regardless of how an upstream node ordered the record’s fields. An output-schema column the record does not carry emits an empty cell, the same as an explicit null; with include_unmapped: false a record field the output schema does not name is not written. See Output Nodes for header control, field mapping, and null handling.

Writing multi-value cells (`join_values`)

A multiple: true field is joined into one delimited cell on write, the write-side inverse of split_values. The default needs no configuration: values join with ;, and a value that itself contains the delimiter is a hard error rather than a cell that would split back wrongly.

- type: output
  name: report
  input: orders
  config:
    name: report
    type: csv
    path: ./out/report.csv
    join_values:
      - tags                              # delimiter ";", on_conflict: error
      - { field: notes, delimiter: "|", on_conflict: escape, escape: "\\" }

A field with no join_values entry still joins, with the defaults. An entry overrides, per field:

delimiter — the separator written between values (default ;).
on_conflict — what to do when a value contains the delimiter:
- error (default) — dead-letter the record, naming the field and the offending value, rather than emit a cell that splits back wrongly. This is what makes a defaulted delimiter safe. Under error_handling.strategy: continue, the offending record goes to the dead-letter queue (category multi_value_join_collision) and the run continues; under fail_fast it aborts.
- escape — prefix each delimiter (and each escape character) inside a value with escape (default \), so a matching split_values escape: recovers the original. Lossless. delimiter and escape must each be a single character.
- encode_json — encode the whole field as an embedded JSON array, recovered by a matching split_values json: true. Preserves every value’s text exactly, including ones carrying the delimiter, quotes, or newlines — nothing is lost or mis-split. (A decimal/date/datetime element serializes as its JSON string form and reads back as a string, re-typed by the column’s declared type:, the same round trip every CSV cell takes.)

An empty field emits an empty cell — and, under the delimited policies (error, escape), a single empty-string value [""] emits an empty cell too, which reads back as zero values: the delimited encoding cannot tell an empty field from one empty value. Use encode_json when that distinction matters. A single non-empty value emits that value with no delimiter. The joined cell is quoted by the normal CSV rules when it contains the field delimiter, a quote, or a newline. Declaring join_values on a non-CSV output is rejected at compile (E362).

Round trip. on_conflict: escape and encode_json are recovered exactly by a matching source split_values entry:

# write side
join_values:
  - { field: tags, on_conflict: escape, escape: "\\" }
# read side (a later pipeline)
split_values:
  - { field: tags, escape: "\\" }
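
Why the escape policy is lossless can be checked with a small model of both sides. This is a sketch, assuming the escaping works exactly as described above (each delimiter and each escape character inside a value is prefixed with the escape character); `join_escaped` and `split_escaped` are hypothetical names, not Clinker functions.

```python
def join_escaped(values, delimiter=";", escape="\\"):
    """Join values, prefixing each delimiter and escape character."""
    def esc(value):
        out = []
        for ch in value:
            if ch in (delimiter, escape):
                out.append(escape)
            out.append(ch)
        return "".join(out)
    return delimiter.join(esc(v) for v in values)

def split_escaped(cell, delimiter=";", escape="\\"):
    """Inverse of join_escaped: split on unescaped delimiters only."""
    if cell == "":
        return []
    values, current, i = [], [], 0
    while i < len(cell):
        ch = cell[i]
        if ch == escape and i + 1 < len(cell):
            current.append(cell[i + 1])  # take the escaped character literally
            i += 2
            continue
        if ch == delimiter:
            values.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    values.append("".join(current))
    return values
```

The model also reproduces the caveat noted earlier: `join_escaped([""])` is the empty cell, which `split_escaped` reads back as zero values.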

Header widening under auto-widen

When auto_widen is in effect and the Output leaves include_unmapped at its default of true, different records can carry different carried-along columns. The header must still be shared by every row, so Clinker widens it to the union of every record’s columns in first-seen order: a column that first appears on a later record still gets its own header slot, and the earlier rows write an empty cell for it. This pre-scan runs on the buffered output path, where the record batch is materialized.
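
The widening rule is a first-seen-order union, which a short Python model makes concrete. The helper names are hypothetical and records are plain dicts; the sketch only shows the header computation on an already materialized batch.

```python
def widened_header(records):
    """Union of every record's column names, in first-seen order."""
    header = {}
    for record in records:
        for column in record:
            header.setdefault(column, None)  # dicts preserve insertion order
    return list(header)

def rows_for(records):
    """Return the shared header and one row of cells per record."""
    header = widened_header(records)
    # A column a record lacks is written as an empty cell.
    return header, [[str(r.get(c, "")) for c in header] for r in records]
```

For two records carrying `id, a` and `id, b`, the header becomes `id, a, b`, and each row leaves an empty cell for the column it does not carry.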

An output that streams under a bounded-memory budget cannot pre-scan the whole batch, so it commits its columns to the first record. This applies to a CSV output fused directly after a Merge/Transform, a single-branch Route, a streaming-strategy Aggregate, or the probe side of a hash-build-probe Combine, and to a CSV output reconstructing an envelope (which suppresses the shared header and streams a headerless body). A later record carrying a column the first record lacked then fails the run with a SchemaDrift error naming the column, rather than silently dropping it. To avoid this, declare the column in the source or output schema: so every record carries it, or route to a self-describing format (JSON / NDJSON / XML). A reconstruct_envelope CSV output therefore requires a stable body shape: every record must carry the same columns.

Multi-record files (header / trailer / body)

Some CSV exports interleave multiple record types in one file — a header row, many body rows, and a trailer row — each distinguished by a discriminator column. Declare these with a map-form schema: carrying a discriminator: and a records: list, instead of the single column-list schema:. Each record type names its tag (the discriminator value that identifies it) and its own columns:; the discriminator field must sit at the same column in every type (usually the first). The reader derives the runtime superset schema (a lead record_type column plus the union of every record type’s columns) automatically.

- type: source
  name: payments
  config:
    name: payments
    type: csv
    path: "./data/payments.csv"
    schema:                                     # one multi-record schema (map form)
      discriminator: { field: rec_type }        # the physical column carrying the type tag
      records:
        - { id: header,  tag: H, columns: [ { name: rec_type, type: string }, { name: batch_id, type: string } ] }
        - { id: detail,  tag: D, columns: [ { name: rec_type, type: string }, { name: id, type: int }, { name: amount, type: int } ] }
        - { id: trailer, tag: T, columns: [ { name: rec_type, type: string }, { name: count, type: int } ] }
      structure:
        - { record: trailer, count: count }     # validate T's count against the body count
    envelope:
      sections:
        head:
          extract: { record_type: H }          # the H record type surfaces as $doc.head.*
          fields:
            batch_id: string

The reader streams one record per line on a single superset schema whose lead record_type column carries the matched type’s id. A downstream Route discriminates on that column. Rows of different record types may carry different column counts (ragged rows) — the reader validates the column count per record type, not file-wide. A textual column-header row is skipped when has_header is true (the default), so a leading record_type,name,amount line is not mistaken for a record of an unknown type. Each declared field honors its own type / trim / pad, the same as a single-record CSV field.

Header rows declared as an envelope: section via the record_type extract surface as $doc.<section>.* and are excluded from the body stream (see Envelopes & Document Context).
Trailer rows named by a structure: constraint are validated as they stream — the declared count field is checked against the actual body-record count at document close — and excluded from the body stream. A declared trailer that never appears is an incomplete-document error; a body row after the trailer is rejected as content past the document close.
Blank lines (empty or whitespace-only, common after concatenation) are skipped rather than parsed.
An unknown discriminator value (a tag no records: entry declares) is a structural-integrity failure, classified separately from a trailer count mismatch: it aborts the run by default, or under dlq_granularity: document condemns the whole file to the dead-letter sink and the run continues.
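
The rules in this list can be traced through a toy reader for the H / D / T layout of the payments example. It is only a sketch: `read_payments` is hypothetical, it splits on commas without CSV quoting, it emits a simplified body record rather than the full superset schema, and its error messages are invented for illustration.

```python
def read_payments(lines):
    """Return (head metadata, body records) for an H/D/T file."""
    head, body, closed = {}, [], False
    for line in lines:
        if line.strip() == "":
            continue  # blank or whitespace-only line: skipped
        fields = line.rstrip("\n").split(",")  # naive split, no quoting
        tag = fields[0]
        if closed:
            raise ValueError("content past the document close")
        if tag == "H":
            head["batch_id"] = fields[1]  # metadata, not a body row
        elif tag == "D":
            body.append({"record_type": "detail",
                         "id": int(fields[1]),
                         "amount": int(fields[2])})
        elif tag == "T":
            declared = int(fields[1])
            if declared != len(body):
                raise ValueError(
                    f"trailer count {declared} != body count {len(body)}"
                )
            closed = True
        else:
            raise ValueError(f"unknown discriminator value {tag!r}")
    if not closed:
        raise ValueError("incomplete document: trailer never appeared")
    return head, body
```

Only the D rows reach the body; the H row becomes metadata and the T row is consumed by the count check.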

JSON Format

The JSON reader turns a JSON document into a record stream. It handles three physical shapes — a single array of objects, newline-delimited objects (NDJSON), or a wrapper object that nests the records under a path — and auto-detects the shape when you do not declare it. Each object is matched against the source’s declared schema:; see Source Nodes for the shared schema and transport rules.

- type: source
  name: events
  config:
    name: events
    type: json
    path: "./data/events.json"
    schema:
      - { name: event_id, type: string }
      - { name: timestamp, type: date_time }
      - { name: payload, type: string }
    options:
      format: object          # array | ndjson | object (auto-detect if omitted)
      record_path: "data"     # dot-separated keys to the records array
      max_index_bytes: 64MB   # cap on retained envelope sections (optional)

Physical shapes

`format`	Layout
`array`	The file is a single JSON array of objects.
`ndjson`	One JSON object per line (newline-delimited JSON).
`object`	A single top-level object; `record_path` locates the records array within it.

If format is omitted, Clinker auto-detects the shape from the file content. Declare it explicitly when the file is large enough that you want to skip detection, or when an object wrapper needs a record_path.

`record_path`

record_path is a dot-separated path of object keys, descended from the document root. data.rows selects the array at {"data": {"rows": [ … ]}}, and each of its elements becomes one record. This is the canonical statement of the grammar; other pages link here rather than restate it.

The rules, in full:

No $. root marker. It is not JSONPath. Write data.rows, not $.data.rows. Only the exact leading $. is rejected, so a key that merely starts with $ ($schema.rows) is still addressable.
No leading /. A leading slash is how a JSON Pointer is anchored; record_path is already anchored at the document root.
No empty segments — no doubled separator (data..rows) and no trailing one (data.).
Omitting record_path entirely lets the reader auto-detect the document shape. That is not the same as record_path: "", which is a path naming a key called “” and is rejected.

A value breaking any of these fails at compile time with E363, before any input is opened. The diagnostic names the corrected path where one can be derived.
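
The grammar is small enough to model directly. The following Python sketch uses hypothetical helpers and invented error text (the real diagnostic is E363), and validates a path before descending an already parsed document:

```python
def parse_record_path(path: str) -> list[str]:
    """Split a dot-separated record_path, rejecting the forms listed above."""
    if path.startswith("$."):
        raise ValueError(f"record_path is not JSONPath; write {path[2:]!r}")
    if path.startswith("/"):
        raise ValueError("record_path takes no leading '/'")
    segments = path.split(".")
    if any(segment == "" for segment in segments):
        raise ValueError("record_path has an empty segment")
    return segments

def select_records(document, path):
    """Descend from the document root, one object key per segment."""
    node = document
    for key in parse_record_path(path):
        node = node[key]
    return node
```

Note that `"$schema.rows"` passes validation, since only the exact leading `$.` is refused, and that the empty string fails as a path with one empty segment.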

record_path takes precedence over format:. When both are declared the reader navigates the path and streams the array it finds, whatever format: says — so pair record_path with format: object (or leave format: off). Declaring format: ndjson alongside a record_path does not read NDJSON.

Because a JSON key may contain any character, the two rejected prefixes give up a sliver of addressing: a top-level key literally named $ followed by a nested key, and a top-level key whose name starts with /, are not reachable through record_path.

Nested arrays

JSON records frequently embed arrays — line items on an invoice, tags on a product. Three source-level declarations decide what happens to them, all documented on the Source Nodes page:

split_to_rows fans the array out to one record per element. mode: extract (the default) hoists an object element’s keys onto the output record; mode: split keeps the record shape, flattening the element back under the field name (orders.id). An array of scalars keeps the value under the field’s own name under both modes.
A schema column declared multiple: true keeps the array as an array, and normalizes a lone scalar into a one-element array so the column’s shape never depends on what a particular document happened to carry.
split_values parses a delimited string cell into several values.

A record whose declared field holds an empty array, is explicitly null, or carries no such field at all, is preserved by default — keep_empty defaults to true, and setting it to false drops such a record. An explicit null is how many producers write “no value”, so it counts as no occurrence rather than one; for the same reason a multiple: true column holding an explicit null stays null rather than becoming [null].

A field that is present but holds a single object or scalar rather than an array counts as one occurrence, projected exactly as a one-element array would be. Producers routinely unwrap a lone element, so a feed where some documents carry "line_items": [{…}, {…}] and others carry "line_items": {…} fans both out the same way and every output record ends up with the same columns. The XML reader, where a document cannot express the difference at all, behaves the same way.
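
The occurrence counting described in the last two paragraphs comes down to three cases. This is a minimal Python model with a hypothetical `occurrences` helper; it says nothing about how the fan-out itself projects each occurrence.

```python
def occurrences(record: dict, field: str) -> list:
    """Return the list of occurrences a fan-out field contributes."""
    value = record.get(field)  # absent field and explicit null both give None
    if value is None:
        return []              # no occurrence
    if isinstance(value, list):
        return value           # an array: one occurrence per element
    return [value]             # lone object or scalar: one occurrence
```

A document with `"line_items": {"sku": "A"}` and one with `"line_items": [{"sku": "A"}]` therefore produce the same occurrence list.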

Two declared fan-out fields apply in declaration order and multiply. A nested pair (orders then orders.items) produces the two-level expansion when the outer entry declares mode: split:

    split_to_rows:
      - { field: orders, mode: split }
      - { field: orders.items, mode: split }

Under mode: extract the outer entry lifts the occurrence’s keys to the top level, which removes the orders.items path the inner entry addresses — so that pairing is rejected at compile (E358) rather than silently fanning out only one level. A duplicated field is rejected too.

Flattened-name collisions

The reader dissolves nested objects into dotted keys, so {"a": {"b": 1}} becomes the field a.b. When two distinct keys flatten to the same name — for example a nested {"a": {"b": 1}} alongside a literal {"a.b": 2} in the same record — only one value could survive, and keeping one while dropping the other is silent data loss. The reader refuses the record instead, naming the colliding field. This mirrors the XML reader’s treatment of a repeated element: both formats now fail loud on an undeclared collision rather than one keeping the first value and the other the last. If the collision is intentional (both values belong together), declare the column multiple: true to collect them into an array in document order; otherwise rename one of the source keys so they no longer collide. As with XML, detection is per document at read time, so the run aborts under fail_fast and dead-letters the document under continue with dlq_granularity: document.

Detection covers two distinct source keys that flatten to the same dotted name — the nested {"a": {"b": 1}} plus literal {"a.b": 2} case above. It does not cover a key that is literally duplicated within one JSON object ({"tags": "x", "tags": "y"}): the JSON parser collapses such duplicates last-wins (keeping "y") before the record reaches collision detection, so that repeat is silently dropped rather than reported. A collision inside an array element that a split_to_rows: extract fan-out lifts to the top level (one element key clashing with a parent field or with another element key) is likewise not yet detected and still resolves last-wins — tracked by issue 920.
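
Both behaviours can be reproduced in Python as an analogy. The `flatten` helper is a hypothetical model of the dotted-key flattening with collision refusal; it is not Clinker's reader. The second point, a parser collapsing a literally duplicated key, is also how Python's standard `json` module behaves.

```python
import json

def flatten(obj, prefix="", out=None):
    """Dissolve nested objects into dotted keys, refusing collisions."""
    out = {} if out is None else out
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flatten(value, name, out)
        elif name in out:
            raise ValueError(f"flattened-name collision on field {name!r}")
        else:
            out[name] = value
    return out

# A literally duplicated key never reaches flatten(): the parser has
# already kept only the last value.
duplicated = json.loads('{"tags": "x", "tags": "y"}')
```

Here `duplicated` holds only `"y"`, so there is nothing left for a later collision check to see.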

Bounding envelope retention: `max_index_bytes`

When a source declares an envelope: and a pipeline reads $doc.* paths from it, the JSON reader runs a streaming pre-scan that walks the document once and retains only the declared section subtrees — every other key, including a multi-megabyte body array, is parsed-and-skipped without being stored. The retained sections live in a bounded document index.

max_index_bytes caps that index. It is charged incrementally as each section is parsed, so even a single oversized declared section aborts mid-parse (naming the section and the cap) rather than risking an out-of-memory failure. It accepts a decimal size string (64MB, 500KB) or a bare byte count; optional, defaulting to 64MB. Only the declared sections a program actually reads are retained, so envelope metadata sits far below this ceiling in practice — the cap exists to convert an unbounded mistake into a clear error. See Document Envelope Context for the full model.

Non-finite floats

JSON numbers cannot represent NaN, +infinity, or -infinity. Writing a record (or an envelope section field) that holds a non-finite float to a JSON output fails with a JSON error naming the value, rather than silently substituting null — a substituted null would be indistinguishable from a genuine source null on read-back. Filter such records or replace the value in a transform before the JSON output.
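
Python's `json` module offers the same strictness through `allow_nan=False`, which makes a convenient illustration of the fail-or-filter choice. The helpers below are hypothetical and only inspect top-level fields.

```python
import json
import math

def has_non_finite(record: dict) -> bool:
    """True when any top-level float in the record is NaN or an infinity."""
    return any(
        isinstance(v, float) and not math.isfinite(v)
        for v in record.values()
    )

def write_json_line(record: dict) -> str:
    # allow_nan=False makes json.dumps raise ValueError instead of
    # emitting the non-standard tokens NaN / Infinity.
    return json.dumps(record, allow_nan=False)
```

Filtering with `has_non_finite` before the write is the analogue of filtering or replacing the value in a transform ahead of the JSON output.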

Writing JSON

A JSON output writes one object per record, in schema-column order, either as a single array (format: array, the default) or one object per line (format: ndjson).

- type: output
  name: enriched
  input: processed
  config:
    name: enriched
    type: json
    path: "./output/enriched.json"
    options:
      format: ndjson     # array | ndjson
      pretty: false      # indent the emitted objects

Dotted column names become nested objects

A column name containing a . expands back into nesting, the same way the XML writer expands one into nested elements. Columns Address.City and Address.State write as one object:

{"Address":{"City":"Boston","State":"MA"},"name":"Ada"}

This is what makes a JSON-in / JSON-out pipeline reproduce its input shape: the reader flattened {"Address":{"City":…}} into the column Address.City, and the writer puts it back. It applies to every JSON output, with no option to turn it off — a flag would mean the same column name meant different things at different outputs.

Three points follow from the rule, all shared with the XML writer and specified in full on Field Paths:

Grouping. Columns sharing a prefix collect into one object, positioned where that prefix first appeared, even when the schema interleaves them.
Absent children. Under preserve_nulls: false a null column emits no key, and an object whose every descendant is absent emits no key at all rather than an empty {} — so it reads back as the absent column it stands for.
Values are untouched. A column holding a map or an array still serializes as that map or array. Expansion adds structure above the value, never inside it.
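
These three points fall out of a straightforward expansion loop. The sketch below is a Python model with a hypothetical `expand` helper; it deliberately omits the `\.` escape and the refusal of clashing column names, both covered later on this page.

```python
def expand(record: dict, preserve_nulls: bool = False) -> dict:
    """Expand dotted column names into nested objects."""
    root = {}
    for column, value in record.items():
        if value is None and not preserve_nulls:
            continue  # a null column emits no key, and creates no parents
        *parents, leaf = column.split(".")
        node = root
        for segment in parents:
            # setdefault keeps the object where its prefix first appeared.
            node = node.setdefault(segment, {})
        node[leaf] = value  # maps and arrays are stored untouched
    return root
```

Given the columns `Address.City`, `name`, `Address.State` in that interleaved order, the result places `Address` first with both children, matching the example object above.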

Keeping a literal `.` in a key

To emit a key that genuinely contains a ., escape the separator in the column name. The column a\.b writes the single key "a.b":

      schema:
        - { name: "a\\.b", type: string }   # emits {"a.b": …}
        - { name: "a.b",   type: string }   # emits {"a": {"b": …}}

A [ in a column name is currently literal but reserved; write \[ if you want it to stay literal indefinitely. See Field Paths for why.

Note that this is a write-side escape. A source key that literally contains a . still arrives from the reader as an unescaped column name (the Flattened-name collisions section above covers what the reader does), so {"a.b": 1} read and written back comes out as {"a": {"b": 1}}. Closing that is tracked by issue 920.

Column names that cannot both be written

Two columns can describe places that cannot both exist in one object — a column a holding a value alongside a column a.b that needs a to be an object. Rather than keep one and drop the other, the writer refuses the whole column set before emitting a byte, naming both columns and, where escaping would resolve it, the escaped spelling to use. Field Paths lists every clashing shape.

A column name carrying a malformed escape — a \ that is not part of \., \[, or \\, as in a column literally named C:\temp — is refused the same way, with the corrected spelling (C:\\temp) in the message.
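
The escape rules can be modelled as a small segment splitter. This is a sketch with a hypothetical `split_column_name` helper; the suggestion in the error message is invented for illustration and only repairs the first malformed backslash.

```python
def split_column_name(name: str) -> list[str]:
    """Split a column name on unescaped dots, honouring \\. \\[ and \\\\."""
    segments, current, i = [], [], 0
    while i < len(name):
        ch = name[i]
        if ch == "\\":
            nxt = name[i + 1] if i + 1 < len(name) else ""
            if nxt not in (".", "[", "\\"):
                fixed = name[:i] + "\\\\" + name[i + 1:]
                raise ValueError(
                    f"malformed escape in {name!r}; did you mean {fixed!r}?"
                )
            current.append(nxt)  # keep the escaped character literally
            i += 2
        elif ch == ".":
            segments.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    segments.append("".join(current))
    return segments
```

In this model `a.b` yields two segments, `a\.b` yields the single segment `a.b`, and a column named `C:\temp` is refused because `\t` is not one of the three recognised escapes.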

XML Format

The XML reader selects record elements by a slash-separated path of element names and maps each one onto the source’s declared schema:. Child elements bind to fields by name; attributes bind under a configurable prefix. Namespaces are stripped by default so schema field names stay clean. See Source Nodes for the shared schema and transport rules.

- type: source
  name: catalog
  config:
    name: catalog
    type: xml
    path: "./data/catalog.xml"
    schema:
      - { name: product_id, type: int }
      - { name: name, type: string }
      - { name: price, type: float }
    options:
      record_path: "catalog/product"    # slash-separated element path
      attribute_prefix: "@"             # prefix for XML attribute fields
      namespace_handling: strip         # strip | qualify
      max_index_bytes: 64MB             # cap on retained envelope sections (optional)

Options

Option	Default	Description
`record_path`	—	Slash-separated path of element names selecting the elements that each become one record; see `record_path`. When omitted, every top-level element becomes one record.
`attribute_prefix`	`@`	Prefix that distinguishes an element’s attributes from its child elements when both map to schema fields.
`namespace_handling`	`strip`	`strip` removes namespace prefixes from element and attribute names; `qualify` preserves the namespace-qualified names.
`max_index_bytes`	`64MB`	Cap on the bytes the envelope pre-scan retains while extracting declared `$doc.*` sections.

`record_path`

record_path is a slash-separated path of XML element names, matched level by level starting at the document element. catalog/product selects every <product> that is a child of the document element <catalog>. This is the canonical statement of the grammar; other pages link here rather than restate it.

The rules, in full:

The path is already anchored at the document element, so it carries no leading /. Write Orders/Order, not /Orders/Order.
No //. It is not XPath: there is no descendant-or-any-depth step. Name every enclosing element.
No empty segments — no doubled separator (Orders//Order) and no trailing one (Orders/).
No XPath predicates, axes, or wildcards (product[@id='7'], child::product, *). Select the elements by path and filter the records in a transform.
Every segment must be a legal XML element name. Under namespace_handling: qualify element names keep their prefix, so a qualified segment (ns:Order) is allowed and is what matches; under the default strip the prefix is gone and the segment is the local name.
Omitting record_path entirely makes every top-level element one record. That is not the same as record_path: "", which is a path naming an element called “” and is rejected.

A value breaking any of these fails at compile time with E363, before any input is opened. The diagnostic names the corrected path where one can be derived.
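
A Python model of the grammar, using the standard library's ElementTree, may help make the rules concrete. It rests on several assumptions: the helper names are hypothetical, the element-name check is a simplified ASCII-oriented approximation of the XML Name rule, and the selection step handles only documents without namespaces.

```python
import re
import xml.etree.ElementTree as ET

# Simplified stand-in for an XML Name, with one optional prefix (ns:Order).
_NAME = re.compile(r"[A-Za-z_][\w.\-]*(?::[A-Za-z_][\w.\-]*)?")

def parse_xml_record_path(path: str) -> list[str]:
    """Split a slash-separated record_path, rejecting non-path syntax."""
    if path.startswith("/"):
        raise ValueError("record_path takes no leading '/'")
    segments = path.split("/")
    if any(s == "" for s in segments):
        raise ValueError("record_path has an empty segment")
    for s in segments:
        if not _NAME.fullmatch(s):
            raise ValueError(f"{s!r} is not a plain element name")
    return segments

def select_elements(xml_text: str, path: str) -> list:
    """Match the path level by level, starting at the document element."""
    first, *rest = parse_xml_record_path(path)
    root = ET.fromstring(xml_text)
    if root.tag != first:
        return []
    nodes = [root]
    for name in rest:
        nodes = [c for node in nodes for c in node if c.tag == name]
    return nodes
```

Predicates, axes, and wildcards all fail the name check, and `//` fails as an empty segment, so none of them need a dedicated rule in this model.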

`record_path` and `xml_path` root differently

The envelope option extract: { xml_path: … } is also a slash-path over XML, but it tolerates a leading / — /doc/Head is its documented form. record_path rejects one.

The two are separate grammars addressing separate things: xml_path locates a single envelope section anywhere in the document, record_path locates the record elements the body streams. They are deliberately not aligned — writing record_path: "/catalog/product" is an error, and writing xml_path: "/doc/Head" is correct.

Truncated input

A truncated XML document — one whose input ends before an open element’s closing tag — is rejected with a format error rather than yielding the partial fields read so far. This holds for a record cut off mid-element, a skipped-over sibling subtree cut off before it closes, and an envelope section cut off during the pre-scan (which then attaches no $doc metadata). This matches the general contract that a truncated stream always aborts rather than silently dropping data.

Writing XML

The XML writer expands dotted field names to nested elements, by the same rule the JSON writer expands them into nested objects — grouping, ordering, absent-child pruning, and the \. escape for a literal dot are all specified once on Field Paths. What is specific to XML is layered on top of that decoding, not instead of it.

The attribute_prefix convention applies in reverse: a field whose final path segment carries the prefix is emitted as an XML attribute of its enclosing element instead of a child element. A top-level @id attaches to the record element’s start tag; a nested Address.@type attaches to the <Address> element. Records read from an XML source therefore round-trip — <Record id="7"><name>A</name></Record> reads and writes back unchanged, and the writer never emits an @-named element.

Each decoded segment must also be a well-formed XML Name, so a segment that begins with a digit or contains a space is rejected. A literal dot survives — . is a legal XML name character, so a column declared a\.b emits the single element <a.b> rather than nesting.

Two column names that cannot both be expanded — a column a holding a value alongside a column a.b needing a to be a container — are refused before any byte of the record is written, naming both columns. Earlier versions emitted two sibling <a> elements for that column set, which this reader then refused on the way back in.

- type: output
  name: xml_out
  input: processed
  config:
    name: xml_out
    type: xml
    path: "./output/result.xml"
    options:
      root_element: "Root"              # default Root
      record_element: "Record"          # default Record
      attribute_prefix: "@"             # matches the source-side prefix

Option	Default	Description
`root_element`	`Root`	Name of the document root element wrapping all records.
`record_element`	`Record`	Name of the element emitted per record.
`attribute_prefix`	`@`	Prefix marking a field as an attribute of its enclosing element. Set it to the same value as the source-side prefix when round-tripping; an empty string disables attribute classification (every field emits as an element).

Attribute handling details:

A null attribute field is dropped even under preserve_nulls: true — a null element round-trips as a self-closing tag, but an attribute has no form that reads back as null.
A field with children nested under an attribute-prefixed segment (e.g. @a.b) is rejected with a format error: an XML attribute is a leaf and cannot contain elements.
The attribute name (the segment after the prefix) must be a well-formed XML name — a letter, _, or : followed by letters, digits, _, -, ., or : (plus the XML 1.0 Unicode name ranges). A name with a space, =, quote, /, >, or a leading digit (e.g. @foo bar, @1st) is rejected with a format error rather than emitting a malformed start tag. Non-ASCII letters are accepted, so an attribute name read from a source document round-trips unchanged.
An element with only attribute fields and no children self-closes: Address.@type alone emits <Address type="home"/>.
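
The expansion and attribute rules can be illustrated with a small sketch. It is an approximation under stated assumptions: the helper names (split_path, build_tree, render) are hypothetical, XML name validation is left out, and null element values are simply omitted rather than modelling preserve_nulls.

```python
from xml.sax.saxutils import escape, quoteattr

def split_path(name):
    """Split on unescaped dots; a backslash-dot pair is a literal dot."""
    parts, current, i = [], "", 0
    while i < len(name):
        if name[i] == "\\" and name[i + 1:i + 2] == ".":
            current, i = current + ".", i + 2
        elif name[i] == ".":
            parts.append(current)
            current, i = "", i + 1
        else:
            current, i = current + name[i], i + 1
    return parts + [current]

def is_attr(key, prefix):
    return bool(prefix) and key.startswith(prefix)   # empty prefix disables attributes

def build_tree(record, prefix="@"):
    """Group flat columns into nested dicts, refusing leaf/container clashes."""
    tree = {}
    for name, value in record.items():
        *parents, leaf = split_path(name)
        node = tree
        for seg in parents:
            if is_attr(seg, prefix):
                raise ValueError(f"{name}: an attribute cannot contain elements")
            node = node.setdefault(seg, {})
            if not isinstance(node, dict):
                raise ValueError(f"{name} needs {seg} to be a container")
        if isinstance(node.get(leaf), dict):
            raise ValueError(f"{name} holds a value but is also a container")
        node[leaf] = value
    return tree

def render(tag, node, prefix="@"):
    attrs = "".join(f" {k[len(prefix):]}={quoteattr(str(v))}"
                    for k, v in node.items()
                    if is_attr(k, prefix) and v is not None)   # null attribute dropped
    children = "".join(
        render(k, v, prefix) if isinstance(v, dict) else f"<{k}>{escape(str(v))}</{k}>"
        for k, v in node.items() if not is_attr(k, prefix) and v is not None)
    if not children:
        return f"<{tag}{attrs}/>"                     # attributes only: self-close
    return f"<{tag}{attrs}>{children}</{tag}>"

print(render("Record", build_tree({"@id": 7, "name": "A"})))
```

The conflict check runs while the tree is built, before anything is rendered, which mirrors the rule that a clashing column set is refused before any byte of the record is written.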

Writing multi-value fields (repeated elements)

A multiple: field is written as repeated child elements, one per value, in order — the XML counterpart to the CSV writer’s delimited join_values cell, and the write-side inverse of reading multiple: true. The default needs no configuration:

<Order><id>1</id><tags>a</tags><tags>b</tags></Order>

The value itself carries whether it repeats, so no per-field declaration is needed to decide whether to repeat — only to rename the elements. A field with one value emits exactly one element, byte-identical to a scalar field’s output; a field with an empty array emits nothing (no element, and no container even when one is configured); an empty-string value emits a self-closing item element (<tags/>).

A multiple: true column whose name maps to an XML attribute (e.g. @tags) is rejected at compile with E359: an XML attribute holds a single value and cannot repeat, and the writer emits repetition only as child elements. A runtime array reaching an attribute field is likewise rejected by the writer.

To rename the elements, add a join_values entry — the same block the CSV writer reads, sharing the field key. The XML writer reads two keys from it and ignores the CSV-only delimiter / on_conflict / escape:

- type: output
  name: xml_out
  input: processed
  config:
    name: xml_out
    type: xml
    path: "./output/result.xml"
    join_values:
      - field: tags
        repeat_as: Tag      # per-item element name; defaults to the field name
        wrap_in: Tags       # optional container; omit for bare repeats

repeat_as — the element name emitted per item. Defaults to the field’s own element name.
wrap_in — a container element bracketing the repeated items. Omit it for bare repeats with no container.

A scalar value on a field that carries a join_values entry is treated as a one-element sequence: it receives the same repeat_as / wrap_in naming an array of length one would, so the emitted shape does not depend on whether a lone value arrived wrapped ([a]) or bare (a) — mirroring how the reader normalizes a lone scalar into a one-element array. A field with no entry emits the plain <field>value</field> element.

The two keys combine into four arrangements; no other key is involved:

`repeat_as`	`wrap_in`	Output for `tags = [a, b]`
—	—	`<tags>a</tags><tags>b</tags>`
`Tag`	—	`<Tag>a</Tag><Tag>b</Tag>`
—	`Tags`	`<Tags><tags>a</tags><tags>b</tags></Tags>`
`Tag`	`Tags`	`<Tags><Tag>a</Tag><Tag>b</Tag></Tags>`

repeat_as and wrap_in must each be a well-formed XML name, validated the same way as the root_element / record_element names. Declaring join_values on an output format that is neither csv nor xml is rejected at compile (E362).
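
The four arrangements in the table, plus the empty-array and lone-scalar rules, fit in one short function. This is a sketch only: emit_multi is a hypothetical name, and name validation is omitted.

```python
from xml.sax.saxutils import escape

def emit_multi(field, values, repeat_as=None, wrap_in=None):
    """Render one multi-value field as repeated elements."""
    if not isinstance(values, list):
        values = [values]            # a lone scalar acts as a one-element sequence
    if not values:
        return ""                    # empty array: no items and no container
    item = repeat_as or field        # repeat_as defaults to the field name
    items = "".join(
        f"<{item}/>" if v == "" else f"<{item}>{escape(str(v))}</{item}>"
        for v in values)
    return f"<{wrap_in}>{items}</{wrap_in}>" if wrap_in else items

print(emit_multi("tags", ["a", "b"], repeat_as="Tag", wrap_in="Tags"))
```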

Round trip. A document read into a multiple: true column with the default naming writes back to the identical repeated elements — reading <Order><id>1</id><tags>a</tags><tags>b</tags></Order> into a tags column and writing it to an XML output with record_element: Order reproduces the input byte-for-byte.

Repeated elements

When a record element contains repeated child elements, two source-level declarations decide what happens to them, and both take the flattened dotted field name — see Source Nodes → Multi-value fields for the shared grammar. The XML-specific matching rules are below.

A declared field is the repeated element’s dotted path relative to the record element — the same form the flattened field names use. For a record element <Order> containing repeated <Item> children, the field is Item; for <Order><Items><Item>…, it is Items.Item.

One record per occurrence: `split_to_rows`

- type: source
  name: orders
  config:
    name: orders
    type: xml
    path: "./data/orders.xml"
    options:
      record_path: "Orders/Order"
    schema:
      - { name: id, type: int }
      - { name: "Item.name", type: string }
      - { name: "Item.qty", type: int }
    split_to_rows:
      - field: "Item"
        mode: split            # one output record per <Item> occurrence

Each output carries one occurrence’s fields plus every field outside the group, duplicated onto each record.

Under mode: split the occurrence’s fields keep their full dotted names (Item.name, Item.@sku), including the element’s attributes. Under the default mode: extract the declared field’s prefix is lifted off, so the same document yields name and qty; a repeated scalar element (<Tag>a</Tag>) has no remainder to lift and takes the declared field’s last segment, so Tags.Tag yields Tag under extract and stays Tags.Tag under split.

Lifting a prefix off can land an occurrence’s field on a name a field outside the group already occupies — <Order><name> alongside <Item><name>. The occurrence wins: under extract it is the record, so its own field is not shadowed by the parent it was merged with. Use mode: split when you need both values, which keeps them at name and Item.name.

A declared position_column wins over any field of that name, inside the occurrence or outside it. position_column: line_no against an <Item> that carries its own <line_no> child yields the occurrence’s index, not the document’s value — you named the column, so the index is what it holds.

An occurrence with no content (<Item></Item>) still emits a record, one carrying only the fields outside the group. A record with no occurrence of the element is governed by keep_empty: XML cannot distinguish an empty repetition from an absent element, and the default keep_empty: true passes the record through unchanged.
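
The difference between the two modes is easiest to see on a small document. The sketch below is illustrative only: it parses the record with Python's ElementTree instead of streaming, handles a single-segment field, leaves every value as text, and does not model position_column. The function name is hypothetical, and lifting an attribute to a bare @name under extract is an assumption drawn from the prefix-lifting rule.

```python
import xml.etree.ElementTree as ET

def split_to_rows(record_xml, field, mode="extract"):
    """Fan one record out to one row per occurrence of a repeated child."""
    record = ET.fromstring(record_xml)
    outside = {child.tag: child.text for child in record if child.tag != field}
    occurrences = record.findall(field)
    if not occurrences:
        return [outside]                      # keep_empty: true passes it through
    prefix = "" if mode == "extract" else field + "."
    rows = []
    for occ in occurrences:
        own = {f"{prefix}@{k}": v for k, v in occ.attrib.items()}
        own.update({prefix + child.tag: child.text for child in occ})
        if len(occ) == 0 and occ.text:
            own[field] = occ.text             # repeated scalar element
        rows.append({**outside, **own})       # the occurrence wins a name clash
    return rows

DOC = ("<Order><id>1</id><name>N</name>"
       "<Item><name>A</name><qty>2</qty></Item>"
       "<Item><name>B</name><qty>3</qty></Item></Order>")
for row in split_to_rows(DOC, "Item", mode="split"):
    print(row)
```

Under extract the outer name is replaced by the occurrence's own name; under split both survive, as name and Item.name.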

Entries apply in declaration order, so two declared fields multiply. Fields must name disjoint element groups: a duplicated field, or one extending another (Item and Item.part), is rejected at compile (E358), before the source opens. The disjointness rule is this reader’s: it assigns each element to one occurrence group by document position, which is sound exactly when the declared groups do not nest. A JSON source has no such constraint.

All occurrences in one field: `multiple: true`

Declaring a schema column multiple: true collects every occurrence of that flattened field into one array, in document order, rather than rejecting the repeat as undeclared:

    schema:
      - { name: id, type: int }
      - { name: "Tag", type: string, multiple: true }

<Tag>a</Tag><Tag>b</Tag> yields ["a", "b"], and a single <Tag> still yields a one-element array. Declaring the flattened children of a repeated container (Item.name, Item.qty) collects each of them independently.

An empty occurrence — an empty-body <Tag></Tag> or a self-closing <Tag/> — is a real array element, collected in position as a null: <Tag>a</Tag><Tag></Tag><Tag>b</Tag> yields ["a", null, "b"] rather than squeezing the empty element out, so the array round-trips its per-item shape. (An empty text value reads as null, the same rule the reader applies elsewhere; the self-closing and empty-body forms behave identically.)
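
A short sketch of the collection rule, using Python's ElementTree for brevity. The function name is hypothetical, and the real reader streams the document instead of parsing a record into a tree.

```python
import xml.etree.ElementTree as ET

def collect(record_xml, field):
    """Gather every occurrence of a dotted field, in document order."""
    record = ET.fromstring(record_xml)
    path = field.replace(".", "/")             # Items.Item -> Items/Item
    # Empty-body and self-closing elements both have no text, so both read as None.
    return [(el.text or None) for el in record.findall(path)]

print(collect("<Order><Tag>a</Tag><Tag></Tag><Tag>b</Tag></Order>", "Tag"))
```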

A field cannot be both collected and fanned out: naming a multiple: true column in split_to_rows is rejected at compile (E358).

A repeated element named by neither a split_to_rows entry nor a multiple: column is a loud error, not a silent drop. Keeping the first occurrence and discarding the rest would lose data without warning, so the reader refuses the record and names the offending field, pointing at the two ways to handle a repeat on purpose: declare the column multiple: true to collect every occurrence into an array, or add a split_to_rows entry to fan each occurrence out to its own record. Detection is per document at read time — a plan cannot know in advance that a particular document repeats a field. Under the default fail_fast strategy the run aborts with the diagnostic; under continue with dlq_granularity: document the offending document is routed to the dead-letter queue and the run continues.

Delimited text in one element: `split_values`

split_values parses <Tag>a;b;c</Tag> into ["a", "b", "c"]. The field must also be declared multiple: true.

Bounding envelope retention: `max_index_bytes`

When a source declares an envelope: and a pipeline reads $doc.* paths from it, the XML reader runs an event-driven streaming pre-scan that walks the document once and retains only the declared section subtrees — every other element, including a multi-megabyte body, is event-walked and dropped without being flattened into memory. The retained sections live in a bounded document index.

max_index_bytes caps that index. It is charged incrementally as each section is built, so even a single oversized declared section aborts mid-parse (naming the section and the cap) rather than risking an out-of-memory failure. It accepts a decimal size string (64MB, 500KB) or a bare byte count; optional, defaulting to 64MB. Only the declared sections a program actually reads are retained, so envelope metadata sits far below this ceiling in practice — the cap exists to convert an unbounded mistake into a clear error.

The reader holds no whole-document buffer: the body walks the document element-at-a-time, and the envelope pre-scan opens the source a second time to walk it independently — a file source is read twice, never buffered. Peak memory is the bounded section index plus a single live record, not the input size. See Document Envelope Context for the full model.

Fixed-Width Format

Fixed-width files carry no delimiters — each field occupies a fixed column range on every line, the layout common to mainframe extracts and legacy COBOL exports. Because the byte layout is not self-describing, each column in a fixed-width source’s schema: carries its byte layout (start + width) alongside its CXL type — one unified declaration drives both the physical slice and compile-time type checking. See Source Nodes for the shared transport rules.

- type: source
  name: legacy_data
  config:
    name: legacy_data
    type: fixed_width
    path: "./data/mainframe.dat"
    schema:
      - { name: account_id,  type: string, start: 0,  width: 12 }
      - { name: balance,     type: float,  start: 12, width: 10 }
      - { name: status_code, type: string, start: 22, width: 2 }
    options:
      line_separator: crlf    # line-ending style

The column layout

Each column pins itself to a byte range with start (a 0-based offset) and width (a byte count); end (exclusive) may be given instead of width. Optional per-column formatting keys — justify, pad, trim, truncation — control padding and trimming on read and write. Because the same column declaration carries both the byte range and the CXL type, the physical layout and the types can never drift apart. A layout shared across pipelines can live in an external .schema.yaml file referenced by schema: layout.schema.yaml.

Writing fixed-width output

A fixed-width output node declares the same column layout in its schema:. The writer places every field at its declared byte range — start plus width (or end), resolved exactly as the reader slices — regardless of the order the columns are declared in, so a file written with a schema reads back under that same schema. Byte ranges the layout leaves undeclared (a gap between fields) are filled with spaces. A column that omits start continues at the previous column’s end, so a width-only schema lays its fields out sequentially. Two columns whose byte ranges overlap have no consistent layout; the writer rejects such a schema when the output opens, naming both columns and their ranges.
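
The layout rules (explicit start, start defaulting to the previous column's end, space-filled gaps, overlap rejection) can be sketched as follows. The function names are hypothetical, values are assumed to be ASCII and left-justified, and an over-long value is simply an error here rather than following the per-column truncation setting.

```python
def resolve_layout(columns):
    """Return (name, start, end) triples; end is exclusive."""
    resolved, cursor = [], 0
    for col in columns:
        start = col.get("start", cursor)       # omitted start: continue at previous end
        end = col["end"] if "end" in col else start + col["width"]
        resolved.append((col["name"], start, end))
        cursor = end
    ordered = sorted(resolved, key=lambda c: c[1])
    for (a, a0, a1), (b, b0, b1) in zip(ordered, ordered[1:]):
        if b0 < a1:
            raise ValueError(f"{a} [{a0},{a1}) overlaps {b} [{b0},{b1})")
    return resolved

def write_line(columns, record):
    """Place each field at its byte range; undeclared bytes stay as spaces."""
    layout = resolve_layout(columns)
    line = bytearray(b" " * max(end for _, _, end in layout))
    for name, start, end in layout:
        cell = str(record[name]).encode("ascii")
        if len(cell) > end - start:
            raise ValueError(f"{name}: value longer than {end - start} bytes")
        line[start:start + len(cell)] = cell
    return bytes(line)

cols = [
    {"name": "account_id", "start": 0, "width": 12},
    {"name": "balance", "start": 12, "width": 10},
    {"name": "status_code", "start": 22, "width": 2},
]
rec = {"account_id": "ACC1", "balance": "10.5", "status_code": "OK"}
print(write_line(cols, rec))
```

Because every field is written to its resolved range, declaring the same columns in a different order produces the same bytes.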

Widths are byte counts, matching how the reader slices. When a value is longer than its field, truncation cuts at a UTF-8 character boundary at or below the width, so a multi-byte character is never split: the emitted cell is always valid UTF-8 of exactly width bytes (it may hold fewer characters than the width when a trailing multi-byte character does not fit, with the freed bytes pad-filled). Because padding fills exact byte counts on write and is stripped one character at a time on read, pad must be a single-byte (ASCII) character — a multi-byte character (such as ·) or a multi-character string (such as "0 ") is rejected when the schema is resolved, on both the read and write sides. An absent or empty pad defaults to a space. Under truncation: error an over-long value is still a hard error before any slicing.
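
A minimal sketch of boundary-safe truncation, assuming a left-justified field whose truncation setting permits cutting. The function name fit_cell is hypothetical.

```python
def fit_cell(value, width, pad=" "):
    """Return exactly `width` bytes of valid UTF-8 for one field."""
    pad = pad or " "                           # absent or empty pad defaults to a space
    if len(pad.encode("utf-8")) != 1:
        raise ValueError("pad must be a single-byte (ASCII) character")
    raw = value.encode("utf-8")
    cut = min(len(raw), width)
    # If the first dropped byte is a continuation byte (0b10xxxxxx), the cut
    # would split a character, so back up to that character's first byte.
    while 0 < cut < len(raw) and (raw[cut] & 0xC0) == 0x80:
        cut -= 1
    return raw[:cut] + pad.encode("ascii") * (width - cut)

print(fit_cell("café", 4))   # the two-byte é does not fit, so one pad byte replaces it
```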

A type: decimal output column with a scale rounds its values to that many fractional places on write (banker’s rounding), the same contract a decimal source column applies on read. This matters here: a computed decimal such as avg(amount) is a full-precision quotient that would overflow a narrow numeric field — a hard error under the default truncation: error for numeric fields — so declaring the field’s scale shrinks it to fixed places that fit. For example, avg over 1.00, 1.00, 2.00 written into { name: average, type: decimal, scale: 2, width: 6 } emits 1.33; without the scale the 28-digit quotient overflows the 6-byte field and fails.
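
The arithmetic in that example can be reproduced with Python's decimal module, used here only as a stand-in for Clinker's decimal type. Python's default context carries 28 significant digits, which happens to match the quotient described above; the helper name to_scale is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def to_scale(value, scale):
    """Round to `scale` fractional places with banker's rounding."""
    return value.quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_EVEN)

amounts = [Decimal("1.00"), Decimal("1.00"), Decimal("2.00")]
average = sum(amounts) / len(amounts)      # full-precision quotient

print(len(str(average)))                   # far wider than a 6-byte field
print(to_scale(average, 2))                # fits once the scale is applied
```

Half-even rounding only differs from the schoolbook rule on exact ties: 0.125 rounds to 0.12, while 0.135 rounds to 0.14.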

Options

Option	Default	Description
`line_separator`	platform	Line-ending style (`lf` / `crlf`) used to split the file into records.

Under lf or crlf, the reader buffers each physical line only up to the declared record width plus a line-terminator allowance. A physical line wider than the declared width — trailing filler beyond the last declared field, or a schema that maps only a prefix of a wider fixed-length record — reads its declared-width portion; the remaining bytes are discarded up to the next line terminator and the reader continues with the following record. Because the buffered portion is capped, a malformed file (a corrupt or missing newline) cannot grow a single record until end of input: memory stays bounded regardless of how long the physical line runs. A final line with no trailing newline reads normally as long as its declared fields fit within the width.

Multi-value cells (`split_values`)

A fixed-width field holds one value, but its text may pack several behind a delimiter within the field’s byte range. Declare the column multiple: true and add a split_values entry, and the reader splits the (padding-stripped) field text and coerces each part to the column’s declared type:

    split_values:
      - { field: tags, delimiter: ";" }
    schema:
      - { name: order_id, type: string, start: 0, width: 4 }
      - { name: tags, type: string, start: 4, width: 20, multiple: true }

Because the fixed-width reader is the sole coercion pass, each element is typed here — a multiple: true int field over 1;2;3 reads as [1, 2, 3]. A blank field is an empty array; a field with no delimiter is a one-element array. A multiple: true column with no covering split_values entry is rejected at compile (E361). The entry is read only on a single-schema source, not the multi-record reader below.
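
A sketch of the split-and-coerce step for one field. The function name and the both-sides padding strip are assumptions; the real strip follows the column's justify and pad settings.

```python
def read_multi(line, start, width, delimiter=";", cast=str, pad=" "):
    """Slice one field, strip its padding, split, and type each part."""
    text = line[start:start + width].strip(pad)
    if text == "":
        return []                              # blank field: empty array
    return [cast(part) for part in text.split(delimiter)]

line = "0001" + "1;2;3".ljust(20)
print(read_multi(line, 4, 20, cast=int))
```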

Schema drift

Fixed-width is inert with respect to auto-widen: because every byte is accounted for by the format schema, there are no “unmapped” trailing columns to absorb. The on_unmapped policy has no effect on a fixed-width source.

Multi-record files (header / trailer / body)

Mainframe and banking extracts often interleave multiple record types in one file — a header line, many body lines, and a trailer line — each identified by a discriminator at a fixed byte position (commonly the first character). Declare these with a map-form schema: carrying a discriminator: byte range and a records: list, instead of a flat column list. Each record type names its tag (the discriminator value) and its own byte-positioned columns:; the reader synthesizes the lead record_type column automatically.

- type: source
  name: payments
  config:
    name: payments
    type: fixed_width
    path: "./data/payments.dat"
    schema:                                    # one multi-record schema (map form)
      discriminator: { start: 0, width: 1 }    # the type tag occupies byte 0
      records:
        - { id: header,  tag: H, columns: [ { name: batch_id, type: string, start: 1, width: 9 } ] }
        - { id: detail,  tag: D, columns: [ { name: id, type: int, start: 1, width: 5 }, { name: amount, type: int, start: 6, width: 4 } ] }
        - { id: trailer, tag: T, columns: [ { name: count, type: int, start: 1, width: 5 } ] }
      structure:
        - { record: trailer, count: count }     # validate T's count against the body count
    envelope:
      sections:
        head:
          extract: { record_type: H }          # the H line surfaces as $doc.head.*
          fields:
            batch_id: string

Header lines declared as an envelope: section via the record_type extract surface as $doc.<section>.* and are excluded from the body stream (see Envelopes & Document Context).
Trailer lines named by a structure: constraint are validated as they stream — the declared count field is checked against the actual body-record count at document close — and excluded from the body stream. A declared trailer that never appears is an incomplete-document error; a body line after the trailer is rejected as content past the document close.
Blank lines (empty or whitespace-only, common after concatenation) are skipped rather than rejected; a line whose declared field range is cut off mid-value is a truncation error, not a silently-partial read. Field parsing — type coercion, padding strip, justification — is shared with the single-record fixed-width reader, so a declared type parses identically on both paths.
An unknown discriminator value (a tag no records: entry declares) is a structural-integrity failure, classified separately from a trailer count mismatch: it aborts the run by default, or under dlq_granularity: document condemns the whole file to the dead-letter sink and the run continues.
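
The dispatch and trailer check for the payments layout above can be sketched as a single pass over the lines. This is an illustration, not the engine's reader: the function name and error wording are hypothetical, the synthesized record_type column is omitted, and every failure simply raises instead of consulting dlq_granularity.

```python
def read_payments(lines):
    """Return (header_fields, detail_records) for an H / D / T file."""
    head, details, closed = None, [], False
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line.strip() == "":
            continue                               # blank lines are skipped
        if closed:
            raise ValueError(f"line {number}: content past the document close")
        tag = line[0:1]                            # discriminator at byte 0
        if tag == "H":
            head = {"batch_id": line[1:10].strip()}
        elif tag == "D":
            if len(line) < 10:
                raise ValueError(f"line {number}: truncated detail record")
            details.append({"id": int(line[1:6]), "amount": int(line[6:10])})
        elif tag == "T":
            declared = int(line[1:6])
            if declared != len(details):
                raise ValueError(f"trailer count {declared} != {len(details)} body records")
            closed = True
        else:
            raise ValueError(f"line {number}: unknown discriminator {tag!r}")
    if not closed:
        raise ValueError("incomplete document: trailer never appeared")
    return head, details

print(read_payments(["HBATCH0001", "D000010100", "", "D000020250", "T00002"]))
```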

EDIFACT Format

Clinker reads and writes UN/EDIFACT interchanges alongside CSV, JSON, XML, and fixed-width. An interchange is a finite file: it opens with an optional UNA service-string advice and a mandatory UNB header, wraps one or more UNH..UNT messages, and closes with a UNZ trailer. The reader streams one segment at a time and the writer reconstructs the envelope around emitted records. The reader decodes release-escape sequences into clean data values and the writer re-escapes them on output, so a reader → writer → reader round-trip preserves the data values and the envelope control references.

Delimiters and the UNA service string

Each segment is terminated by the segment terminator; within a segment, data elements split on the element separator and components on the component separator. A release character escapes a delimiter that occurs as literal data.

When the file begins with a 9-byte UNA prefix, its six service characters override the defaults in this fixed order: component, element, decimal, release, repetition, terminator. When UNA is absent, the syntax Level-A defaults apply:

Role	Level-A default
Component separator	`:`
Element separator	`+`
Decimal notation	`.`
Release / escape	`?`
Repetition	space (inactive)
Segment terminator	`'`

UNA is optional — a parser that requires it would fail on the common no-UNA interchange, so Clinker assumes Level-A when it is absent.
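
Delimiter discovery amounts to reading six characters in the fixed order listed above, or falling back to the Level-A table. A minimal sketch, working on text for readability (a real reader operates on bytes); the type and function names are hypothetical.

```python
from typing import NamedTuple

class Delimiters(NamedTuple):
    # Field order matches the UNA service-character order; defaults are Level-A.
    component: str = ":"
    element: str = "+"
    decimal: str = "."
    release: str = "?"
    repetition: str = " "
    terminator: str = "'"

def read_delimiters(data):
    """Return (delimiters, remaining data) for the start of an interchange."""
    if data.startswith("UNA") and len(data) >= 9:
        return Delimiters(*data[3:9]), data[9:]
    return Delimiters(), data

print(read_delimiters("UNA*|,! ~UNB|UNOA*1~")[0])
```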

Release character

The release character (default ?) marks the following byte as literal data rather than a delimiter: ?+ is a literal + inside an element, ?' is a literal apostrophe (not a terminator), and ?? is a literal ?. The reader decodes these sequences into clean data values, so a downstream CSV/JSON sink, a CXL string comparison, or a $doc field sees O'BRIEN, never the wire form O?'BRIEN. The writer re-escapes on output: any element value that carries the element separator, the segment terminator, or the release character is release-escaped automatically, so a value computed by a Transform or sourced from CSV — never EDIFACT-escaped to begin with — does not corrupt the interchange. A reader → writer → reader round-trip therefore preserves the data values exactly.

The component separator inside an element (e.g. the : in the composite UNOA:1) is kept as part of the element’s text and is not escaped — the positional element model works above component resolution, so a composite element round-trips unchanged. A literal colon in free-text data is the one ambiguity this introduces: because components are not split into separate fields, a : in a value re-reads as a component boundary. Repeating elements ride inside one element string intact and are likewise never truncated to their first repetition.
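
The read-side decoding and write-side escaping can be sketched as a pair of functions. The sketch assumes the segment has already been isolated (terminator removed) and uses the Level-A defaults; the function names are hypothetical.

```python
def split_elements(segment, element="+", release="?"):
    """Split on unescaped element separators and decode release sequences."""
    fields, current, i = [], "", 0
    while i < len(segment):
        ch = segment[i]
        if ch == release:
            if i + 1 >= len(segment):
                raise ValueError("dangling release character")
            current += segment[i + 1]          # the next character is literal data
            i += 2
        elif ch == element:
            fields.append(current)
            current = ""
            i += 1
        else:
            current += ch
            i += 1
    fields.append(current)
    return fields

def escape_element(value, element="+", terminator="'", release="?"):
    """Release-escape the characters the writer escapes; ':' is left alone."""
    return "".join(release + ch if ch in (element, terminator, release) else ch
                   for ch in value)

print(split_elements("NAD+BY+O?'BRIEN+1?+1=2"))
```

Escaping each value and joining on the element separator reproduces a segment that splits back to the same values, which is the round-trip property described above. The component separator is passed through untouched in both directions.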

Newlines between segments

Some producers insert CR/LF after each segment terminator for readability. Those bytes are insignificant and are stripped between segments; CR/LF that appears inside an element is preserved.

Record shape

Each non-service segment becomes one record under a fixed positional schema:

Column	Meaning
`seg_id`	The segment tag (`BGM`, `NAD`, …)
`msg_ref`	The enclosing message reference (the `UNH` element 1)
`msg_type`	The message type (the `UNH` element 2, full composite)
`e01`, `e02`, …	The segment’s positional data elements (release sequences decoded)

Envelope service segments (UNB, UNZ, UNT) are consumed by the reader to drive envelope state and validation; they are never emitted as body records. UNH is the exception: the segment that opens a message is emitted as a body record (its seg_id is UNH), carrying the message reference in msg_ref and the message-type composite in msg_type, with its full positional element list also stamped onto e01, e02, …, so any UNH element past the message type (a common access reference, a message subset identification, and so on) is available as e03 onward and is reconstructed on write.

The number of eNN columns is controlled by the source max_elements option (default 32). A segment carrying more data elements than that is rejected with guidance rather than silently truncated. Absent trailing elements read as null.

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: edifact
      glob: ./inbox/*.edi
      options:
        max_elements: 48      # widen the positional schema for exotic segments
      schema:
        - { name: seg_id, type: string }
        - { name: msg_ref, type: string }
        - { name: e01, type: string }

Envelope sections over UNB

The interchange header UNB is extractable as a document envelope section, exposing its positional elements to CXL as $doc.<section>.<field>. Use the segment extract rule with the section field names matching the positional keys e01, e02, …:

envelope:
  sections:
    interchange:
      extract: { segment: "UNB" }
      fields:
        e05: string          # interchange control reference (UNB element 5)

A Transform can then read $doc.interchange.e05 on every body record.

Only the UNB header is extractable as an envelope section. Trailer segments (UNT, UNZ) arrive after the body and cannot become $doc fields without buffering the whole interchange — their control counts are instead validated inline by the reader (see below). A segment extract naming any tag other than UNB, or an xml_path / json_pointer extract against an EDIFACT source, is rejected at startup.

Control-count validation

The reader validates the structural integrity claims carried in the trailers as they arrive, failing the run on a mismatch (a truncation or corruption signal):

UNT segment count — must equal the actual number of segments in the message, counting the UNH and UNT themselves.
UNT message reference — must echo the opening UNH reference.
UNZ message count — must equal the actual number of UNH messages in the interchange.
UNZ control reference — must echo the UNB control reference.

Clinker locates the UNB control reference correctly even when the header carries an empty optional element or transmits the date/time of preparation as two separate parts rather than one combined element, so such interchanges still validate and round-trip with the trailer echoing the correct reference.

A missing UNZ at end of input is a truncation error; content after the UNZ trailer is rejected — including a lone stray release character with no following segment, which is treated as unterminated (truncated) content rather than silently dropped.
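
The four trailer checks can be expressed as one streaming pass that keeps only counters. This sketch assumes segments arrive already split into element lists and that the UNB is a conformant header with the control reference as its fifth element; the irregular header shapes Clinker tolerates are not modelled. The function name is hypothetical.

```python
def validate_interchange(segments):
    """Check UNT / UNZ counts and reference echoes; return the message count."""
    control_ref = msg_ref = None
    seg_count = msg_count = 0
    closed = False
    for seg in segments:
        tag = seg[0]
        if closed:
            raise ValueError("content after UNZ")
        if tag == "UNB":
            control_ref = seg[5]
        elif tag == "UNH":
            msg_ref, seg_count = seg[1], 1     # UNH itself counts
            msg_count += 1
        elif tag == "UNT":
            seg_count += 1                     # UNT itself counts
            if int(seg[1]) != seg_count:
                raise ValueError(f"UNT declares {seg[1]} segments, saw {seg_count}")
            if seg[2] != msg_ref:
                raise ValueError("UNT reference does not echo UNH")
        elif tag == "UNZ":
            if int(seg[1]) != msg_count:
                raise ValueError(f"UNZ declares {seg[1]} messages, saw {msg_count}")
            if seg[2] != control_ref:
                raise ValueError("UNZ reference does not echo UNB")
            closed = True
        else:
            seg_count += 1
    if not closed:
        raise ValueError("missing UNZ: truncated interchange")
    return msg_count

ok = [
    ["UNB", "UNOA:1", "SENDER", "RECEIVER", "240101:1200", "REF1"],
    ["UNH", "1", "ORDERS:D:96A:UN"],
    ["BGM", "220", "PO1"],
    ["UNT", "3", "1"],
    ["UNZ", "1", "REF1"],
]
print(validate_interchange(ok))
```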

Routing a count mismatch to the DLQ

By default a UNT/UNZ count mismatch aborts the run. A source declaring dlq_granularity: document instead dead-letters the whole interchange / file to the DLQ — the file’s records become a structural_validation trigger plus document_rejected collaterals, and no record of the malformed file reaches the sink. The count is only known at the trailer, after the body has streamed, so the rejection lands at the sink boundary (no record is written out), not literally before the first record. The grain is the whole file. The control-reference echo mismatches and every other corruption (truncation, post-trailer content) always abort, even under the opt-in. See Malformed envelopes.

Writing EDIFACT

An EDIFACT Output node reconstructs the envelope around emitted records. Records map by the same positional columns (seg_id, msg_ref, msg_type, eNN); trailing null/empty elements are trimmed so no fabricated delimiters appear, and a column the writer does not recognize is an error (project the record to the EDIFACT columns first). Engine-internal $-namespaced columns are excluded automatically.

nodes:
  - type: output
    name: out
    input: messages
    config:
      name: out
      type: edifact
      path: ./out/result.edi
      options:
        interchange: ["UNOA:1", "SENDER", "RECEIVER", "240101:1200", "REF1"]
        message_type: "ORDERS:D:96A:UN"
        write_una: false
        segment_newline: true

Output options:

Option	Meaning
`interchange`	Literal `UNB` data elements (release-escaped as needed on write).
`interchange_from_doc`	Name of a `$doc` section to echo the `UNB` elements from (round-trip).
`message_type`	Fallback `UNH` message type when a record carries no `msg_type` value.
`write_una`	Emit a leading `UNA` segment (default `false`).
`segment_newline`	Write a newline after each segment terminator (default `true`).

Consecutive records are grouped into UNH..UNT messages on msg_ref transitions. The writer recomputes the UNT segment count and UNZ message count, and echoes the message and interchange control references, so the output passes its own count validation on re-read.

interchange_from_doc echoes the header from a record’s document context. That context is populated by a source’s UNB envelope section (declare a segment: "UNB" envelope section on the source) and travels with every body record through the pipeline — including to a sink that sits directly downstream of the source with no intervening Transform. The reader stashes the complete, ordered UNB element list (empty middle elements included), so the reconstructed header is faithful even when a middle element is empty and the user declares only the fields they care about. Supply interchange literal elements instead when the records have no source UNB section to echo.

Character set

EDIFACT names its body character repertoire in-band: the UNB header’s syntax identifier (data element S001, component 1 — the UNOA in UNOA:1) declares the repertoire for the whole interchange. There is therefore no encoding option on an EDIFACT source or sink; the reader discovers the repertoire from the UNB and the writer re-derives it from the UNB it emits, so a read → write round-trip is byte-faithful without any configuration.

`UNB` syntax level	Repertoire
`UNOA`, `UNOB`	ASCII; a byte `>= 0x80` is an error
`UNOC`	ISO-8859-1 (Latin-1), one byte per character
`UNOY`	UTF-8; invalid byte sequences are an error

Decoding stays streaming and per-segment — the interchange is never buffered whole. The syntax identifier is itself ASCII, so it is read straight from the raw UNB bytes before any text is decoded; the UNB and every body segment are then decoded through the negotiated repertoire. The UNB’s own sender and recipient identification elements may legitimately carry non-ASCII text under UNOC or UNOY, and they decode (and re-encode on output) under that repertoire too — so a UNOC interchange whose header or body carries Latin-1 high bytes (for example accented characters in a party name) parses without error, surfaces the correct text in $doc.UNB.*, and round-trips byte-for-byte.

The repertoire is enforced loudly. A UNOA/UNOB interchange whose body carries a high byte fails (“outside the ASCII repertoire”) rather than silently reinterpreting it; a UNOY interchange with invalid UTF-8 fails (“not valid UTF-8”); and a UNB declaring an unsupported syntax level (UNOD..UNOX) fails at startup with a precise error naming the level, never falling back to a guessed encoding or substituting replacement characters. On output the writer encodes element text through the same repertoire the UNB declares; a character the repertoire cannot represent (a non-ASCII character under UNOA/UNOB, or a codepoint above U+00FF under UNOC) is rejected rather than emitted truncated.
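
The repertoire table maps directly onto strict codecs. A minimal sketch using Python's built-in codecs as stand-ins, with hypothetical function names and Level-A delimiters assumed; strict decoding and encoding raise on anything outside the repertoire instead of substituting replacement characters.

```python
REPERTOIRES = {"UNOA": "ascii", "UNOB": "ascii", "UNOC": "latin-1", "UNOY": "utf-8"}

def codec_for(unb_bytes, element=b"+", component=b":"):
    """Read the syntax identifier straight from the raw UNB bytes."""
    syntax_id = unb_bytes.split(element)[1].split(component)[0].decode("ascii")
    if syntax_id not in REPERTOIRES:
        raise ValueError(f"unsupported syntax level {syntax_id}")
    return REPERTOIRES[syntax_id]

def decode_segment(raw, codec):
    return raw.decode(codec)       # strict: raises UnicodeDecodeError

def encode_segment(text, codec):
    return text.encode(codec)      # strict: raises UnicodeEncodeError

codec = codec_for(b"UNB+UNOC:3+SENDER+RECEIVER+240101:1200+REF1'")
print(codec, decode_segment(b"NAD+BY+JOS\xc9", codec))
```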

Limitations

Functional groups. A single UNB..UNZ interchange is supported; UNG/UNE functional-group segments are rejected with a precise error.
Output splitting. An interchange is a single UNB..UNZ envelope and cannot be divided across files. An edifact output combined with a split: block is rejected at config-validation time (diagnostic E323) rather than emitting a structurally corrupt interchange.
Rare degenerate headers. A few unusual header shapes — those that combine an empty date/time slot with a date-only date/time where the control reference normally sits — may not round-trip byte-for-byte on re-emit. Conformant headers and ordinary variations are unaffected.

X12 Format

Clinker reads and writes ANSI ASC X12 interchanges alongside CSV, JSON, XML, fixed-width, and EDIFACT. An X12 interchange is a finite file with a three-tier envelope: an ISA..IEA interchange wraps one or more GS..GE functional groups, and each functional group wraps one or more ST..SE transaction sets. The reader streams one segment at a time and the writer reconstructs the three envelope tiers around emitted records.

The three tiers surface as nested document-context levels: the ISA interchange becomes the file-level $doc document, and each GS group and ST set opens a nested level whose $doc sections layer over the enclosing tiers. A body record therefore sees every enclosing tier’s fields through one $doc.<section>.<field> lookup.

Delimiters and the ISA header

Unlike EDIFACT’s optional UNA service-string advice, X12 declares its delimiters in a fixed-length 106-byte ISA header. Three delimiter bytes live at structural positions within it:

Role	Source in the ISA
Element (data) separator	The byte immediately after the `ISA` tag
Sub-element (component) separator	`ISA16`, the last single-byte ISA element
Segment terminator	The byte immediately after `ISA16`

The reader reads these three bytes from the header rather than assuming a fixed delimiter set, so an interchange that uses */:/~, |/^/newline, or any other producer-chosen delimiters parses correctly. The ISA13 interchange control number is located as the 13th element of the header split on the discovered element separator — structurally, not by an absolute byte offset — so producer padding quirks do not misalign it.
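
A minimal Python sketch of that structural discovery follows. It assumes the header bytes are already in hand; `IsaDelimiters` and `discover_isa` are hypothetical names used for illustration, not the reader's actual API.

```python
from typing import NamedTuple


class IsaDelimiters(NamedTuple):
    element: bytes
    component: bytes
    segment: bytes


def discover_isa(header: bytes) -> tuple[IsaDelimiters, str]:
    """Read the delimiters and ISA13 by structure, not by absolute offset."""
    if not header.startswith(b"ISA") or len(header) < 5:
        raise ValueError("input does not open with an ISA header")
    element = header[3:4]  # the byte immediately after the ISA tag
    # Splitting 16 times yields "ISA", ISA01..ISA15, and the remainder.
    parts = header.split(element, 16)
    if len(parts) != 17 or len(parts[16]) < 2:
        raise ValueError("ISA header is truncated")
    component = parts[16][0:1]  # ISA16
    segment = parts[16][1:2]    # the byte immediately after ISA16
    control_number = parts[13].decode("ascii")  # ISA13
    return IsaDelimiters(element, component, segment), control_number
```

Because the control number is taken as the 13th split element, the sketch behaves the same for `*`/`:`/`~` and for `|`/`^`/newline headers.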

On output, an X12 sink that echoes the header via interchange_from_doc also adopts the source’s discovered delimiter set, so a reconstructed interchange keeps the exact element separator, sub-element separator, and segment terminator bytes it arrived with (see Writing X12). The literal interchange option keeps the writer’s */:/~ defaults.

No escape character

X12 has no release/escape character (EDIFACT’s ? has no X12 equivalent). A data value that contains a delimiter byte is therefore unrepresentable. On output the writer rejects any element value carrying the element separator or the segment terminator with a precise error rather than silently corrupting the interchange; re-encode the value or choose delimiters the data does not contain.

The sub-element (component) separator inside an element (e.g. the : in a composite A:B:C) is kept as part of the element’s text and is not split — the positional element model works above component resolution, so a composite element round-trips unchanged.
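
The output-side check can be sketched in a few lines of Python. This is a hedged illustration with a hypothetical `render_segment` helper, not the writer's real code: it rejects an element carrying the element separator or the segment terminator and leaves the component separator untouched.

```python
def render_segment(seg_id, elements, element_sep="*", segment_term="~"):
    """Join one segment, refusing values the wire format cannot represent."""
    checks = (("element separator", element_sep),
              ("segment terminator", segment_term))
    for position, value in enumerate(elements, start=1):
        for name, delim in checks:
            if delim in value:
                raise ValueError(
                    f"{seg_id} e{position:02d} contains the {name} {delim!r}"
                )
    # The component separator is ordinary element text at this level.
    return element_sep.join([seg_id, *elements]) + segment_term
```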

Newlines between segments

Some producers insert CR/LF after each segment terminator for readability. Those bytes are insignificant and are stripped between segments; CR/LF that appears inside an element is preserved.

Record shape

Each non-service segment becomes one record under a fixed positional schema:

Column	Meaning
`seg_id`	The segment tag (`BEG`, `PO1`, …)
`set_ref`	The enclosing transaction set control number (`ST02`)
`set_type`	The transaction set identifier code (`ST01`, e.g. `850`)
`e01`, `e02`, …	The segment’s positional data elements

The reader does not stamp the enclosing functional-group control number as a column; it surfaces through $doc.functional_group.e06 (see Envelope sections over the three tiers). An optional group_ref column drives multi-group output — project it from that $doc field when reconstructing several functional groups (see Writing X12).

Service segments (ISA, IEA, GS, GE, SE) are consumed by the reader to drive the envelope and validation — they are never emitted as body records. The ST segment that opens a transaction set is emitted as a body record (its seg_id is ST), carrying the set reference and type.

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: x12
      glob: ./inbox/*.x12
      options:
        max_elements: 48      # widen the positional schema for exotic segments
        encoding: iso-8859-1  # decode body element text as Latin-1
      schema:
        - { name: seg_id, type: string }
        - { name: set_ref, type: string }
        - { name: e01, type: string }
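
To make the record shape concrete, here is a small Python sketch of the segment-to-record mapping described above. It assumes segments have already been separated on the terminator; `SERVICE_TAGS` and `to_records` are hypothetical names, and the real reader streams rather than working on a list.

```python
SERVICE_TAGS = {"ISA", "IEA", "GS", "GE", "SE"}


def to_records(segments, element_sep="*"):
    """Yield one dict per non-service segment, stamped with its ST context."""
    set_ref = set_type = None
    for segment in segments:
        tag, *elements = segment.split(element_sep)
        if tag == "ST":
            # ST01 is the set type, ST02 the set control number.
            set_type, set_ref = elements[0], elements[1]
        if tag in SERVICE_TAGS:
            continue  # consumed for the envelope, never emitted
        record = {"seg_id": tag, "set_ref": set_ref, "set_type": set_type}
        for index, value in enumerate(elements, start=1):
            record[f"e{index:02d}"] = value
        yield record
```

Note that `ST` is not in the service set, so it is emitted as a body record carrying its own reference and type.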

Character set

X12 carries no in-band element that names the body character repertoire (unlike EDIFACT’s UNB syntax identifier). Element text is therefore decoded through the charset the source declares in its encoding option, defaulting to UTF-8. The ISA header’s control fields are ASCII, so the declared charset affects only the body element text.

`encoding` value	Repertoire
`utf-8` (default)	UTF-8; invalid bytes are an error
`iso-8859-1` (aliases `latin-1`, `latin1`)	ISO-8859-1 (Latin-1), one byte per character

Decoding stays streaming and per-element — the interchange is never buffered whole. An interchange whose body carries Latin-1 high bytes (for example accented characters in free-text name or address fields) parses without error once the source declares encoding: iso-8859-1, and the decoded element text matches the source bytes under Latin-1.

A source that omits the option and meets non-UTF-8 bytes fails explicitly (“segment is not valid UTF-8”) rather than corrupting the data silently; a source that declares an unsupported encoding fails at startup with a precise error naming the value. On output, set the same encoding on the X12 sink so the round-trip is byte-faithful; a character the chosen charset cannot represent (for example a non-Latin-1 codepoint under iso-8859-1) is rejected rather than emitted truncated.

Envelope sections over the three tiers

The interchange header ISA is extractable as a file-level document envelope section, exposing its positional elements to CXL as $doc.<section>.<field>. Use the segment extract rule with the section field names matching the positional keys e01, e02, …:

envelope:
  sections:
    interchange:
      extract: { segment: "ISA" }
      fields:
        e13: string          # interchange control number (ISA13)

The GS functional group and the ST transaction set surface automatically as the nested $doc sections functional_group and transaction_set, each keyed by positional eNN elements — no envelope declaration is needed for them. A Transform on any body record can read all three tiers at once:

emit isa13 = $doc.interchange.e13       # interchange control number
emit gs06  = $doc.functional_group.e06  # group control number (GS06)
emit st02  = $doc.transaction_set.e02   # set control number (ST02)

Naming and typing the nested levels

The ISA header is declared through envelope: because a bounded pre-scan resolves it before any body streams. The GS group and ST set exist only mid-file, so they cannot be declared the same way — instead, name them and give them a typed field schema under the X12 source’s options. The reader applies the declaration each time it crosses a group or set boundary, so $doc.<your-name>.<field> exposes the level’s elements under the name you chose, coerced to the types you declared — the same way a declared ISA field is typed and coerced:

type: x12
options:
  group_section:
    name: functional_group   # your choice — the engine reserves no name
    fields:
      e01: string            # GS01 functional identifier code
      e06: int               # GS06 group control number
  set_section:
    name: transaction_set    # your choice
    fields:
      e01: int               # ST01 transaction-set identifier code
      e02: string            # ST02 set control number

emit functional_id = $doc.functional_group.e01   # typed string
emit group_control = $doc.functional_group.e06    # typed int
emit txn_type      = $doc.transaction_set.e01     # typed int

The two levels are declared independently — name one, both, or neither. A declared field schema is the contract: only the elements it lists surface in the typed section, and an element the wire carries but the schema omits is absent from $doc. An element that cannot coerce to its declared type (declaring the alphabetic GS01 code as an int, say) fails the run with a precise error. Omit a level’s declaration and it keeps its default name (functional_group / transaction_set) keyed by untyped positional eNN strings, unchanged from before this option existed.
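
The "schema is the contract" rule can be illustrated with a short Python sketch. The helper `typed_section` and the `COERCIONS` table are hypothetical, only the `string` and `int` types shown in the example above are modelled, and treating an absent element as `None` is an assumption of this sketch rather than documented behavior.

```python
COERCIONS = {"string": str, "int": int}


def typed_section(elements, fields):
    """Project wire elements (e01 first) onto a declared field schema."""
    section = {}
    for name, type_name in fields.items():
        index = int(name[1:]) - 1  # "e06" -> position 5
        if index >= len(elements):
            section[name] = None  # assumption: absent element reads as null
            continue
        raw = elements[index]
        try:
            section[name] = COERCIONS[type_name](raw)
        except ValueError:
            raise ValueError(
                f"{name}: cannot coerce {raw!r} to {type_name}"
            ) from None
    return section
```

Only declared keys appear in the result, and declaring the alphabetic `GS01` as `int` raises instead of producing a partial section.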

Only the ISA header is extractable through the envelope: block. Trailer segments (SE, GE, IEA) arrive after the body they close and cannot become $doc fields without buffering the whole interchange — their control counts are instead validated inline by the reader (see below). A segment extract naming any tag other than ISA, or an xml_path / json_pointer extract against an X12 source, is rejected at startup.

Control-count validation

The reader validates the structural integrity claims carried in the trailers as they arrive, failing the run on a mismatch (a truncation or corruption signal):

SE segment count (SE01) — must equal the number of segments in the transaction set, counting the ST and SE themselves.
SE set control number (SE02) — must echo the opening ST02.
GE transaction-set count (GE01) — must equal the number of ST sets in the functional group.
GE group control number (GE02) — must echo the GS06.
IEA functional-group count (IEA01) — must equal the number of GS groups in the interchange.
IEA control number (IEA02) — must echo the ISA13.

A missing IEA at end of input is a truncation error; content after the IEA trailer is rejected.
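
The six checks above fit in a single streaming pass. The Python sketch below is an illustration under simplifying assumptions (segments arrive pre-split, one interchange per input); `validate_interchange` and `_check` are hypothetical names, not the reader's API.

```python
def _check(name, claimed, actual):
    if claimed != actual:
        raise ValueError(f"{name} mismatch: trailer says {claimed!r}, saw {actual!r}")


def validate_interchange(segments, element_sep="*"):
    """Validate trailer counts and control-number echoes as trailers arrive."""
    isa13 = gs06 = st02 = None
    groups = sets = seg_count = 0
    closed = False
    for segment in segments:
        if closed:
            raise ValueError("content after the IEA trailer")
        tag, *e = segment.split(element_sep)
        if tag == "ISA":
            isa13 = e[12]
        elif tag == "GS":
            gs06, sets = e[5], 0
            groups += 1
        elif tag == "ST":
            st02, seg_count = e[1], 1  # the ST itself counts
            sets += 1
        elif tag == "SE":
            seg_count += 1             # and so does the SE
            _check("SE01", int(e[0]), seg_count)
            _check("SE02", e[1], st02)
        elif tag == "GE":
            _check("GE01", int(e[0]), sets)
            _check("GE02", e[1], gs06)
        elif tag == "IEA":
            _check("IEA01", int(e[0]), groups)
            _check("IEA02", e[1], isa13)
            closed = True
        else:
            seg_count += 1
    if not closed:
        raise ValueError("missing IEA trailer: input is truncated")
```

Only running counters and the three open control numbers are held, which is why the checks need no buffering of the body.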

Routing a count mismatch to the DLQ

By default an SE/GE/IEA count mismatch aborts the run. A source declaring dlq_granularity: document instead dead-letters the whole interchange / file to the DLQ: the file’s records become a structural_validation trigger plus document_rejected collaterals, and no record of the malformed file reaches the sink. The count is only known at the trailer, after the body has streamed, so the rejection lands at the sink boundary (no record is written out), not literally before the first record. The grain is the whole file: an SE-level mismatch rejects the entire interchange, not just that transaction set. The control-number echo mismatches (SE02/GE02/IEA02) and every other corruption (truncation, post-trailer content) always abort, even under the opt-in. See Malformed envelopes.

Writing X12

An X12 Output node reconstructs the three-tier envelope around emitted records. Records map by the same positional columns (seg_id, set_ref, set_type, eNN, and the optional group_ref); trailing null/empty elements are trimmed so no fabricated delimiters appear, and a column the writer does not recognize is an error (project the record to the X12 columns first). Engine-internal $-namespaced columns are excluded automatically.

nodes:
  - type: output
    name: out
    input: orders
    config:
      name: out
      type: x12
      path: ./out/result.x12
      options:
        interchange:
          ["00", "          ", "00", "          ", "ZZ", "SENDER         ",
           "ZZ", "RECEIVER       ", "240101", "1200", "U", "00401",
           "000000001", "0", "P", ":"]
        group_header: ["PO", "SENDER", "RECEIVER", "20240101", "1200", "1", "X", "004010"]
        set_type: "850"
        segment_newline: true

Output options:

Option	Meaning
`interchange`	Literal `ISA` data elements (the 16 fixed-width ISA fields).
`interchange_from_doc`	Name of a `$doc` section to echo the `ISA` elements from (round-trip).
`group_header`	Literal `GS01..GS08` elements (`GS06` control number recomputed per group).
`set_type`	Fallback `ST01` set type when a record carries no `set_type` value.
`segment_newline`	Write a newline after each segment terminator (default `true`).
`encoding`	Character set element text is encoded through (default `utf-8`).

Consecutive records are grouped into ST..SE transaction sets on set_ref transitions and into GS..GE functional groups on group_ref transitions — the group discriminator is the outer-tier analog of set_ref. The writer recomputes the SE segment count, the per-group GE transaction-set count, and the IEA functional-group count, and echoes the set, group, and interchange control numbers, so the output passes its own count validation on re-read.

Multiple functional groups

A real interchange can carry several functional groups — purchase orders (PO) and invoices (IN) in one ISA..IEA, each in its own GS..GE with a distinct functional identifier code. Project a group_ref column and the writer opens a fresh GS..GE group every time its value changes, echoing that value as the group control number (GS06/GE02) and recomputing each group’s own GE01 transaction-set count. With no group_ref column the whole stream collapses to one functional group, byte-identical to the single-group shape, so the column is opt-in. To split a source interchange into the same groups it arrived in, project group_ref from $doc.functional_group.e06:

nodes:
  - type: transform
    name: regroup
    input: orders
    config:
      cxl: |
        emit seg_id    = seg_id
        emit group_ref = $doc.functional_group.e06   # functional group control number (GS06)
        emit set_ref   = set_ref
        emit set_type  = set_type
        emit e01       = e01
        # … remaining eNN columns …
  - type: output
    name: out
    input: regroup
    config:
      name: out
      type: x12
      path: ./out/result.x12
      options:
        interchange_from_doc: interchange
        group_header: ["PO", "SENDER", "RECEIVER", "20240101", "1200", "1", "X", "004010"]

Records that share a group_ref value must arrive consecutively, exactly as records sharing a set_ref must: the writer streams and closes a group the moment the discriminator changes, so an interleaved stream would reopen a group it already closed. Sort upstream by group_ref (then set_ref) when the record order does not already guarantee it.
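
The effect of an interleaved stream is easy to see with Python's `itertools.groupby`, which, like the streaming writer described above, only groups consecutive equal keys. The `functional_groups` helper is hypothetical and models just the grouping and the per-group `GE01` count, not the writer itself.

```python
from itertools import groupby


def functional_groups(records):
    """Return (group_ref, GE01) pairs in the order groups would be opened."""
    opened = []
    for group_ref, members in groupby(records, key=lambda r: r["group_ref"]):
        # Each run of equal set_ref values is one ST..SE transaction set.
        set_runs = [ref for ref, _ in groupby(members, key=lambda r: r["set_ref"])]
        opened.append((group_ref, len(set_runs)))
    return opened


interleaved = [
    {"group_ref": "1", "set_ref": "0001"},
    {"group_ref": "2", "set_ref": "0003"},
    {"group_ref": "1", "set_ref": "0002"},
]
ordered = sorted(interleaved, key=lambda r: (r["group_ref"], r["set_ref"]))
```

With these inputs, `functional_groups(interleaved)` opens group `1` twice, while `functional_groups(ordered)` opens each group once.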

interchange_from_doc echoes the header from a record’s document context. That context is populated by a source’s ISA envelope section (declare a segment: "ISA" envelope section on the source) and travels with every body record through the pipeline — including to a sink that sits directly downstream of the source with no intervening Transform. The reader stashes the complete, ordered ISA element list together with the delimiter set it discovered from the header, so the reconstructed header is faithful and the whole output interchange — header, envelopes, and body — is emitted with the original element separator, sub-element separator, and segment terminator rather than the writer’s */:/~ defaults. Supply interchange literal elements instead when the records have no source ISA section to echo; that path keeps the default delimiters.

Limitations

Charset. Element text is decoded through the source’s encoding option (UTF-8 by default, or ISO-8859-1). An unsupported encoding value, or non-UTF-8 bytes met with no encoding declared, fails explicitly rather than being silently corrupted (see Character set above).
No escape character. X12 has no release mechanism, so a data value that contains a delimiter byte is rejected on output rather than corrupting the interchange.
Consecutive grouping. Both GS..GE functional groups (group_ref) and ST..SE transaction sets (set_ref) close the moment their discriminator changes, so records sharing a value must arrive together; sort upstream when the source order does not already guarantee it.
Output splitting. An interchange is a single ISA..IEA envelope and cannot be divided across files. An x12 output combined with a split: block is rejected at config-validation time (diagnostic E338) rather than emitting a structurally corrupt interchange.

HL7 v2 Format

Clinker reads and writes HL7 v2.x pipe-and-hat messages alongside CSV, JSON, XML, fixed-width, EDIFACT, and X12. An HL7 v2 file is a finite stream of carriage-return-terminated segments. The smallest unit is one message, which always begins with an MSH (message header) segment; messages may optionally be wrapped in a BHS..BTS batch and an FHS..FTS file envelope. The reader streams one segment at a time and the writer re-emits the segments, optionally reconstructing the batch/file envelopes.

The optional envelope tiers surface as nested document-context levels: an FHS file header becomes the file-level $doc document, and each BHS batch and MSH message opens a nested level whose $doc sections layer over the enclosing tiers. A body record therefore sees every enclosing tier’s fields through one $doc.<section>.<field> lookup. All tiers are optional — a bare stream of MSH messages with no batch or file wrapping is a valid HL7 v2 file.

Delimiters and the MSH header

HL7 declares its delimiters in the MSH header rather than assuming a fixed set. The byte immediately after the MSH tag is the field separator (MSH-1), and the four bytes that follow are the encoding characters (MSH-2), in this fixed order:

Role	Source in MSH-2	Conventional byte
Component separator	first encoding character	`^`
Repetition separator	second encoding character	`~`
Escape character	third encoding character	`\`
Sub-component separator	fourth encoding character	`&`

The reader reads these bytes from the header, so a message that uses the conventional |^~\& or any other producer-chosen delimiters parses correctly. The segment terminator is always a carriage return (0x0D) — unlike the field and encoding delimiters, it is never producer-chosen.

When a file opens with an FHS or BHS header instead of MSH, the delimiters are read from that header; HL7 requires the file’s FHS/BHS encoding characters to match its messages’ MSH.
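
As a rough Python sketch of that header read (hypothetical names `Hl7Delimiters` and `discover_hl7`, operating on an already-decoded string for brevity):

```python
from typing import NamedTuple


class Hl7Delimiters(NamedTuple):
    field: str
    component: str
    repetition: str
    escape: str
    subcomponent: str


def discover_hl7(header: str) -> Hl7Delimiters:
    """Read the five delimiters from an MSH, FHS, or BHS header."""
    if header[:3] not in ("MSH", "FHS", "BHS") or len(header) < 8:
        raise ValueError("expected an MSH, FHS, or BHS header")
    # Byte 4 is the field separator; the next four are the encoding
    # characters in their fixed order.
    return Hl7Delimiters(header[3], *header[4:8])
```

The carriage-return segment terminator is deliberately absent from the tuple, since it is fixed rather than discovered.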

The discovered delimiter set travels with each message through the pipeline, and an HL7 Output re-emits every message with the set its header declared — a custom-delimiter file round-trips byte-faithfully, never silently rewritten to the conventional |^~\& (see Writing HL7).

The MSH off-by-one

MSH-1 is the field separator, so it is implicit — it never appears as a data field. When the MSH segment is split on the field separator, the encoding-characters field (MSH-2) is the first data field. As a result, for any MSH field number N ≥ 2, the positional column is f<N-1>: MSH-2 is f01, the sending application MSH-3 is f02, the message type MSH-9 is f08, and the message control id MSH-10 is f09. The same off-by-one applies to FHS and BHS.
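
The mapping is simple arithmetic, sketched here in Python with two hypothetical helpers: `msh_column` converts an MSH field number to its positional column, and `split_header` shows why the shift arises when the segment is split on the field separator.

```python
def msh_column(field_number: int) -> str:
    """Positional column for MSH-N (also FHS-N / BHS-N), for N >= 2."""
    if field_number < 2:
        raise ValueError("MSH-1 is the field separator and has no column")
    return f"f{field_number - 1:02d}"


def split_header(segment: str) -> dict:
    """Split a header segment on its own declared field separator."""
    sep = segment[3]
    seg_id, *fields = segment.split(sep)
    row = {"seg_id": seg_id}
    for index, value in enumerate(fields, start=1):
        row[f"f{index:02d}"] = value  # f01 is MSH-2, the encoding characters
    return row


row = split_header("MSH|^~\\&|LAB|HOSP|EHR|HOSP|20240102||ADT^A01|MSG0001")
message_type = row[msh_column(9)]
control_id = row[msh_column(10)]
```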

Escape sequences

Field data escapes a literal delimiter character with an escape sequence \X\, where X names the delimiter: \F\ field separator, \S\ component separator, \T\ sub-component separator, \R\ repetition separator, \E\ the escape character itself. The reader decodes these into their literal data byte, so downstream consumers — CSV/JSON output, CXL string predicates, $doc fields — see clean data, never the wire escapes. The writer re-escapes any delimiter byte in field data on output, so the reader → writer → reader round-trip is byte-faithful.

An application escape the positional reader does not decode (e.g. the formatting escape \.br\) is kept verbatim rather than dropped, so no data is lost.
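
The decode and re-escape steps can be sketched in Python as below. This is an illustration with hypothetical names (`CONVENTIONAL`, `decode_escapes`, `encode_escapes`), assuming the conventional delimiter set by default. The encode half handles plain delimiter characters only; it makes no attempt to recognise a preserved application escape such as `\.br\`, which a faithful writer would have to leave alone.

```python
CONVENTIONAL = {"F": "|", "S": "^", "T": "&", "R": "~", "E": "\\"}


def decode_escapes(text, delims=CONVENTIONAL):
    """Turn \\F\\, \\S\\, \\T\\, \\R\\, \\E\\ into their literal characters."""
    esc = delims["E"]
    out, i = [], 0
    while i < len(text):
        if text[i] == esc:
            end = text.find(esc, i + 1)
            if end != -1:
                code = text[i + 1:end]
                # A known code becomes the literal; anything else stays verbatim.
                out.append(delims.get(code, text[i:end + 1]))
                i = end + 1
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


def encode_escapes(text, delims=CONVENTIONAL):
    """Escape every delimiter character found in clean field data."""
    esc = delims["E"]
    codes = {char: code for code, char in delims.items()}
    return "".join(
        f"{esc}{codes[ch]}{esc}" if ch in codes else ch for ch in text
    )
```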

The component, repetition, and sub-component separators inside a field (e.g. the ^ in a composite PATID^^^HOSP^MR) are kept as part of the field’s text and are not split — the positional field model works above component resolution, so a composite field round-trips unchanged.

Newlines between segments

Some producers (and CRLF-normalizing transports) add a line feed after each carriage-return terminator. Those bytes are insignificant and are stripped between segments. A producer that omits the trailing carriage return on the final segment is accepted — that shape is common in practice.

Record shape

Each segment becomes one record under a fixed positional schema:

Column	Meaning
`seg_id`	The segment tag (`MSH`, `PID`, `OBX`, …)
`set_ref`	The enclosing message’s control id (`MSH-10`)
`set_type`	The enclosing message’s type (`MSH-9`, e.g. `ADT^A01`)
`f01`, `f02`, …	The segment’s positional data fields

The MSH header segment is emitted as a body record (its seg_id is MSH), carrying the message’s fields positionally. Batch/file envelope segments (FHS, FTS, BHS, BTS) are consumed by the reader to drive the document levels and validate counts — they are never emitted as body records.

The number of fNN columns is controlled by the source max_fields option (default 64). A segment carrying more data fields than that is rejected with guidance rather than silently truncated. Absent trailing fields read as null.

nodes:
  - type: source
    name: messages
    config:
      name: messages
      type: hl7
      glob: ./inbox/*.hl7
      options:
        max_fields: 128       # widen the positional schema for large OBX segments
      schema:
        - { name: seg_id, type: string }
        - { name: set_ref, type: string }
        - { name: f01, type: string }

Component splitting (optional)

By default a composite field rides inside one fNN column with its component (^), repetition (~), and sub-component (&) separators intact — the positional model deliberately works above component resolution. When you want component-level access (the message code MSH-9.1 vs the trigger event MSH-9.2) without writing CXL string-splitting downstream, opt one or more fields into splitting with split_fields. The reader explodes the named field into structured columns, and an HL7 Output re-assembles the exact wire field from them, so an HL7→HL7 round-trip stays byte-identical.

options:
  split_fields:
    - { field: f08, components: 2 }                       # MSH-9 → message code + trigger
    - { field: f03, components: 5 }                       # PID-3 (CX) → its components
    - { field: f04, components: 2, subcomponents: 3 }     # also expose sub-components
    - { field: f13, components: 1, repetitions: 4 }       # repeating field → per-repetition

Each split fixes the column width on three structural axes: components (required, the ^ axis), subcomponents (default 1, the & axis), and repetitions (default 1, the ~ axis). The schema stays static — it never varies with per-record data. A field whose data carries more structure on any axis than the declaration reserves is rejected with guidance, the same posture as a max_fields overflow; raise the axis count or leave the field unsplit.

The exploded columns name the path from the field down to a leaf with the axis letters r, c, s, all 1-based. The repetition and sub-component segments are elided when that axis is left at its default count of 1, so the common component-only case stays clean:

Declaration	Columns for `f08`
`components: 2`	`f08_c1`, `f08_c2`
`components: 1, subcomponents: 2`	`f08_c1_s1`, `f08_c1_s2`
`components: 1, repetitions: 2`	`f08_r1_c1`, `f08_r2_c1`
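
The naming rule in the table can be reproduced with a few lines of Python. The `split_columns` helper is hypothetical; it assumes an axis is elided from the name exactly when its declared count is 1, and that repetitions enumerate outermost, which is consistent with the rows above but is not taken from the engine's source.

```python
def split_columns(field, components, subcomponents=1, repetitions=1):
    """List the exploded column names for one split_fields declaration."""
    names = []
    for r in range(1, repetitions + 1):
        for c in range(1, components + 1):
            for s in range(1, subcomponents + 1):
                parts = [field]
                if repetitions > 1:
                    parts.append(f"r{r}")
                parts.append(f"c{c}")
                if subcomponents > 1:
                    parts.append(f"s{s}")
                names.append("_".join(parts))
    return names
```

The column count is always `repetitions * components * subcomponents`, which is what keeps the schema static regardless of per-record data.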

The verbatim fNN column is replaced by the structured columns. Declare the exploded column names in the source schema: block (or rely on on_unmapped) the same way you would any other column:

options:
  split_fields:
    - { field: f08, components: 2 }
schema:
  - { name: seg_id, type: string }
  - { name: f08_c1, type: string }   # MSH-9.1 message code
  - { name: f08_c2, type: string }   # MSH-9.2 trigger event

emit code    = f08_c1
emit trigger = f08_c2

Splitting respects the escape rules: an escaped separator (e.g. \S\, a literal ^ in data) is not treated as a component boundary — the split runs on the raw bytes before the escape decodes, so the literal stays inside one component. On output the writer re-joins the leaves on the separators verbatim (never escaping ^/~/&) and still escapes any field-separator, escape, or carriage-return byte inside a leaf, so the round-trip is byte-faithful.
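
The ordering matters: split first on the raw wire text, then decode each leaf. A minimal Python sketch, assuming the conventional delimiter set and using hypothetical helpers `decode_leaf` and `split_components`:

```python
import re

ESCAPES = {"F": "|", "S": "^", "T": "&", "R": "~", "E": "\\"}


def decode_leaf(raw):
    """Decode the delimiter escapes inside one already-split leaf."""
    return re.sub(r"\\([FSTRE])\\", lambda m: ESCAPES[m.group(1)], raw)


def split_components(raw_field, components):
    """Split on the raw component separator, then decode each leaf."""
    leaves = raw_field.split("^")
    if len(leaves) > components:
        raise ValueError(
            f"field carries {len(leaves)} components, declaration reserves {components}"
        )
    return [decode_leaf(leaf) for leaf in leaves]
```

Because an escaped `^` is spelled `\S\` on the wire, the raw split never sees it as a boundary, and it reappears as a literal only after the leaf is decoded.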

Envelope sections over the tiers

The file header FHS is extractable as a file-level document envelope section, exposing its positional fields to CXL as $doc.<section>.<field>. Use the segment extract rule with the field names matching the positional keys f01, f02, … :

envelope:
  sections:
    file:
      extract: { segment: "FHS" }
      fields:
        f07: string          # file name / id (FHS-8 under the off-by-one)

The BHS batch and the MSH message surface automatically as the nested $doc sections batch and transaction_set, each keyed by positional fNN fields — no envelope declaration is needed for them. A Transform on any body record can read all available tiers at once:

emit file_id  = $doc.file.f07            # FHS file id (declared section)
emit batch_id = $doc.batch.f07           # BHS batch id (auto section)
emit mtype    = $doc.transaction_set.f08 # MSH-9 message type (auto section)
emit ctrl     = $doc.transaction_set.f09 # MSH-10 control id (auto section)

Only the FHS header is extractable as a declared envelope section, and only when the file actually opens with one; a bare MSH-led file has no file-level envelope, so declaring an FHS section against it is rejected at startup. Trailer segments (BTS, FTS) arrive after the body they close and cannot become $doc fields without buffering the whole file — their counts are instead validated inline by the reader (see below). A segment extract naming any tag other than FHS, or an xml_path / json_pointer extract against an HL7 source, is rejected at startup.

Control-count validation

The reader validates the structural integrity claims carried in the batch/file trailers as they arrive, failing the run on a mismatch (a truncation or corruption signal):

BTS batch message count (BTS-1) — must equal the number of MSH messages in the batch. An empty BTS-1 disables the check (the count is optional in practice).
FTS file batch count (FTS-1) — must equal the number of BHS batches in the file. An empty FTS-1 disables the check.

Content after the FTS file trailer is rejected. A bare MSH-led file needs no trailers and validates with no count checks.
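
A hedged Python sketch of these checks, including the rule that an empty count disables its check. The names `validate_batches` and `_check_count` are hypothetical, and segments are assumed to be pre-split on the carriage return.

```python
def _check_count(name, claimed, actual):
    if claimed == "":
        return  # an empty count disables the check
    if int(claimed) != actual:
        raise ValueError(f"{name} says {claimed}, counted {actual}")


def validate_batches(segments, field_sep="|"):
    """Validate BTS-1 and FTS-1 as the trailers arrive."""
    batches = messages = 0
    closed = False
    for segment in segments:
        if closed:
            raise ValueError("content after the FTS trailer")
        tag, *fields = segment.split(field_sep)
        if tag == "BHS":
            batches += 1
            messages = 0
        elif tag == "MSH":
            messages += 1
        elif tag == "BTS":
            _check_count("BTS-1", fields[0] if fields else "", messages)
        elif tag == "FTS":
            _check_count("FTS-1", fields[0] if fields else "", batches)
            closed = True
```

A bare MSH-led list passes through without touching either check, matching the last sentence above.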

Routing a count mismatch to the DLQ

By default a BTS/FTS count mismatch aborts the run. A source declaring dlq_granularity: document instead dead-letters the whole file to the DLQ — the file’s records become a structural_validation trigger plus document_rejected collaterals, and no record of the malformed file reaches the sink. The count is only known at the trailer, after the body has streamed, so the rejection lands at the sink boundary (no record is written out), not literally before the first record. The grain is the whole file. Other corruption (truncation, post-trailer content) always aborts, even under the opt-in. An empty BTS-1/FTS-1 still disables the check entirely. See Malformed envelopes.

Writing HL7

An HL7 Output node re-emits the MSH and body segments from the record stream, escaping any field data that carries a delimiter byte. Records map by the same positional columns (seg_id, fNN); trailing null/empty fields are trimmed so no fabricated delimiters appear, and a column the writer does not recognize is an error (project the record to the HL7 columns first). Split-leaf columns (f08_c1, f03_r2_c1_s3) are recognized too — the writer groups them by field and re-assembles the wire value from the column names alone, so no output option is needed to round-trip a split source. The reader-stamped set_ref/set_type echoes and engine-internal $-namespaced columns are excluded automatically.

The writer re-emits each message with the delimiter set its source header declared: the reader stamps the discovered field separator and encoding characters into the message’s document context, and the writer adopts that set at every MSH — field joining, escaping, and split re-assembly all use it, so a custom-delimiter message round-trips byte-faithfully. A record that did not come from an HL7 source (or whose document context was dropped upstream) is written with the conventional |^~\& set.

nodes:
  - type: output
    name: out
    input: messages
    config:
      name: out
      type: hl7
      path: ./out/result.hl7
      options:
        file_header: ["^~\\&", "LAB", "HOSP", "EHR", "HOSP", "20240102", "FILE7"]
        batch_header: ["^~\\&", "LAB", "HOSP", "EHR", "HOSP", "20240102", "BATCH3"]
        segment_newline: true

Output options:

Option	Meaning
`file_header`	Literal `FHS` fields; opens an `FHS..FTS` file envelope.
`file_header_from_doc`	Name of a `$doc` section to echo the `FHS` fields from (round-trip).
`batch_header`	Literal `BHS` fields; wraps the messages in a `BHS..BTS` batch.
`segment_newline`	Write a newline after each segment terminator (default `true`).

When a file or batch header is configured, the writer recomputes the BTS batch message count and the FTS file batch count and emits the trailers at end of stream, so the output passes its own count validation on re-read. The encoding-characters field of each header (FHS-2 / BHS-2, the counterpart of MSH-2) is written verbatim so the delimiter declaration round-trips. With no header options set, the writer emits a bare stream of messages with no batch/file envelope.

file_header_from_doc echoes the FHS header from a record’s document context. That context is populated by a source’s FHS envelope section (declare a segment: "FHS" envelope section on the source) and travels with every body record through the pipeline. The reader stashes the complete, ordered FHS field list, so the reconstructed header is faithful. Supply file_header literal fields instead when the records have no source FHS section to echo.

Limitations

Charset. Field text is decoded as UTF-8. Non-UTF-8 messages are rejected explicitly rather than silently corrupted.
Positional fields by default; opt-in component columns. Components, repetitions, and sub-components ride inside one positional fNN field verbatim unless that field is named in split_fields (see Component splitting above), in which case the reader explodes it into structured columns and the writer re-assembles them. The reader always decodes the \X\ delimiter escapes regardless.
HL7 v3 and FHIR are out of scope. HL7 v3 is XML (use the xml format) and FHIR is JSON/REST (use the json format); this format handles HL7 v2.x pipe-and-hat encoding only.
Output splitting. A batch/file envelope is a single FHS..FTS structure and cannot be divided across files. An hl7 output combined with a split: block is rejected at config-validation time (diagnostic E339) rather than emitting a structurally corrupt file.

SWIFT MT Format

Clinker reads and writes SWIFT MT (FIN) messages alongside CSV, JSON, XML, fixed-width, EDIFACT, X12, and HL7 v2. A SWIFT MT message is a finite file built from brace-balanced blocks. The reader streams the message body one field at a time and surfaces the service blocks as document-envelope sections; the writer inverts the reader exactly, re-framing the block structure around emitted records so a read → write → read round-trip is byte-faithful.

Block structure

A SWIFT MT message is a sequence of top-level blocks, each {n:...} where n is a numeric block id:

Block	Role	Contents
`1`	Basic header	A fixed header string (application id, BIC, …)
`2`	Application header	Input/output direction, message type, recipient
`3`	User header	Optional; may carry nested `{tag:value}` sub-blocks
`4`	Message text	The message body — a run of `:tag:value` fields
`5`	Trailer	Optional; may carry nested `{tag:value}` sub-blocks

Unlike the flat delimiter-structured EDI formats (HL7, X12, EDIFACT), SWIFT framing is brace-balanced rather than terminator-delimited. Because of that, the nested sub-blocks of blocks 3 and 5 (for example {3:{108:MSGREF}}) are kept intact inside their parent block rather than mistaken for top-level blocks.

The `-}` text-block trailer

Block 4 is special. Its body is opaque line-structured free text — a field value (a :77E: envelope, a :79: narrative, an :86: information line) legitimately contains {, }, and even -} as data. So braces inside block 4 are treated as data, not framing: the block closes only on a line-anchored -} trailer — a -} that begins a line. An interior {, }, or a -} in the middle of a value is data, not a frame boundary. Both the framing braces and the closing -} trailer are stripped from the stored values, so a record carries clean tag/value data.
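
The two framing rules can be sketched together in Python. This is an illustration only: `split_blocks` is a hypothetical helper that works on a whole in-memory string (the real reader streams), and the sample identifiers used when trying it out are made up.

```python
def split_blocks(message: str):
    """Return (block_id, body) pairs for the top-level blocks of one message."""
    blocks, i = [], 0
    while i < len(message):
        if message[i].isspace():
            i += 1  # inter-block whitespace is insignificant
            continue
        if message[i] != "{":
            raise ValueError(f"expected '{{' at offset {i}")
        colon = message.index(":", i)
        block_id = message[i + 1:colon]
        if block_id == "4":
            # Braces are data here; only a -} that begins a line closes the block.
            end = message.find("\n-}", colon)
            if end == -1:
                raise ValueError("block 4 has no line-anchored -} trailer")
            body = message[colon + 1:end]
            body = body.removeprefix("\r\n").removeprefix("\n").removesuffix("\r")
            i = end + len("\n-}")
        else:
            # Service blocks are brace-balanced, so nested sub-blocks stay inside.
            depth, j = 1, colon + 1
            while depth:
                if j >= len(message):
                    raise ValueError(f"block {block_id} has unbalanced braces")
                if message[j] == "{":
                    depth += 1
                elif message[j] == "}":
                    depth -= 1
                j += 1
            body = message[colon + 1:j - 1]
            i = j
        blocks.append((block_id, body))
    return blocks
```

A value such as `note with {braces} and -} inside` stays whole in block 4, while `{3:{108:MSGREF}}` yields a single block 3 whose body is `{108:MSGREF}`.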

Whitespace between blocks

Producers often insert CR/LF between blocks for readability, in addition to the \r\n that separates block-4 fields. Inter-block whitespace is insignificant and is skipped; the line breaks inside block 4 are significant, because they delimit the :tag:value fields.
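
The framing rules above can be sketched in a few lines of Python. This is a minimal illustration, not Clinker's reader: the function name split_blocks and the error wording are hypothetical. It shows brace-depth matching for the service blocks, the line-anchored -} close for block 4, and the skipping of inter-block whitespace.

```python
def split_blocks(text: str) -> list[tuple[str, str]]:
    """Split a SWIFT MT message into (block_id, body) pairs (illustrative only)."""
    blocks = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isspace():            # inter-block whitespace is insignificant
            i += 1
            continue
        if text[i] != "{":
            raise ValueError(f"expected '{{' at offset {i}")
        colon = text.find(":", i)
        block_id = text[i + 1:colon] if colon != -1 else ""
        if not (block_id.isascii() and block_id.isdigit()):
            raise ValueError("missing or non-numeric block id")
        start = colon + 1
        if block_id == "4":
            # Braces inside block 4 are data; only a "-}" that begins a line closes it.
            end = start
            while True:
                end = text.find("-}", end)
                if end == -1:
                    raise ValueError("block 4 truncated: missing -} trailer")
                if end == start or text[end - 1] == "\n":
                    break
                end += 2                 # a mid-line "-}" is data, keep scanning
            blocks.append((block_id, text[start:end]))
            i = end + 2
        else:
            # Service blocks are brace-balanced, so nested sub-blocks stay intact.
            depth, j = 1, start
            while j < n and depth > 0:
                if text[j] == "{":
                    depth += 1
                elif text[j] == "}":
                    depth -= 1
                j += 1
            if depth != 0:
                raise ValueError(f"block {block_id} truncated: unbalanced brace")
            blocks.append((block_id, text[start:j - 1]))
            i = j
    return blocks
```

In this sketch the block-4 body still carries its framing line breaks (the one after {4: and the one before -}); splitting the body into fields is a separate step.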

Record shape

Each :tag:value line of block 4 becomes one record under a fixed positional schema — the same one-line-one-record model the X12 and HL7 readers use:

Column	Meaning
`block`	The block id the line came from (always `4` for body fields)
`tag`	The SWIFT field tag without its surrounding colons (`20`, `32A`, `61`)
`value`	The field value, with continuation lines folded in verbatim

A multi-line field (a :50K: ordering-customer block, a :77E: / :86: narrative) keeps its continuation lines: any line of block 4 that does not begin a new :tag: is folded into the current field’s value with its line break preserved — including a blank line inside the value, so a narrative with an internal blank line round-trips faithfully. A repeated tag (the :61: / :86: statement lines of an MT940, for instance) streams as one record per occurrence, in order.
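
The folding rule can be illustrated with a short Python sketch. The function name parse_block4 is hypothetical, and the sketch assumes \r\n line breaks and a body that still carries the framing line breaks after {4: and before -}; it is not the engine's reader.

```python
def parse_block4(body: str) -> list[dict[str, str]]:
    """Turn a block-4 body into one record per :tag:value field (illustrative only)."""
    lines = body.split("\r\n")
    # The line break after "{4:" and the one before "-}" are framing, not data.
    if lines and lines[0] == "":
        lines = lines[1:]
    if lines and lines[-1] == "":
        lines = lines[:-1]

    records: list[dict[str, str]] = []
    for line in lines:
        if line.startswith(":"):
            tag, sep, value = line[1:].partition(":")
            if not sep or not tag:
                raise ValueError(f"malformed :tag:value line: {line!r}")
            records.append({"block": "4", "tag": tag, "value": value})
        elif records:
            # Continuation line (possibly blank): fold it in with its line break.
            records[-1]["value"] += "\r\n" + line
        else:
            # Assumption of this sketch: a continuation with no field to join is an error.
            raise ValueError("continuation line before the first field")
    return records
```

A repeated tag simply appends another record, so the two :61: lines of a statement arrive as two records in input order.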

The service blocks (1, 2, 3, 5) are consumed by the reader to serve envelope sections and drive the message-level document context — they are never emitted as body records.

nodes:
  - type: source
    name: payments
    config:
      name: payments
      type: swift
      glob: ./inbox/*.swift
      options:
        max_fields: 20000   # raise the block-4 field ceiling for large messages
      schema:
        - { name: block, type: string }
        - { name: tag, type: string }
        - { name: value, type: string }

The max_fields option caps the number of block-4 field lines a single message may carry (default 10000). A message exceeding it is rejected with guidance rather than streamed unbounded — a corruption guard, since a real MT message is well under the cap.

Envelope sections over the service blocks

A SWIFT MT message is a single envelope: the four service blocks surface as file-level $doc sections, exposing each block’s text to CXL as $doc.<section>.body. Every body record can read the enclosing message’s headers through a $doc.<section>.body lookup.

Declare the sections on the source with a segment extract rule that names the block id. The whole block body surfaces under the field name body, because a SWIFT service block carries free-form text (a header string, nested {tag:value} sub-blocks) rather than positional elements:

envelope:
  sections:
    basic:
      extract: { segment: "1" }   # block 1, the basic header
    app:
      extract: { segment: "2" }   # block 2, the application header
    user:
      extract: { segment: "3" }   # block 3, the user header (nested sub-blocks kept verbatim)
    trailer:
      extract: { segment: "5" }   # block 5, the trailer

A Transform on any body record can read every enclosing block at once:

emit tag       = tag
emit value     = value
emit basic_hdr = $doc.basic.body     # block 1 header string
emit app_hdr   = $doc.app.body       # block 2 header string
emit user_hdr  = $doc.user.body      # block 3 body, nested {108:...} kept verbatim

The section names are entirely your choice — the engine reserves none. A segment extract may name a block either by its numeric id ("1", "3") or by the stable default label ("basic_header", "app_header", "user_header", "trailer"); both resolve the same block.

Block 4 is the message-text body streamed as records, not an envelope section — a segment: "4" extract is rejected at startup. An xml_path or json_pointer extract against a SWIFT source is likewise rejected, because those rules belong to the tree formats.

Malformed-message handling

A structurally broken message fails the run with a precise SWIFT error rather than producing garbled records:

Unbalanced brace — a block that never closes (or whose brace depth never returns to zero) is a truncation error naming the offending block.
Missing -} trailer — a block 4 that runs to end of input without its -} trailer is a truncation error.
Missing or non-numeric block id — a block without a numeric id after the { (or with no : separating the id from the body) is rejected.
Malformed :tag:value line — a block-4 line with no second colon closing the tag, or an empty tag, is rejected.
Repeated service block — a second {1:...} (or any repeated service block) in one message is rejected.

A header-only message (no block 4, or an empty block 4) is valid: it produces no body records and drains cleanly.

Writing SWIFT MT

A SWIFT Output node inverts the reader exactly. It re-emits each block-4 record as a :tag:value line and re-frames the single message envelope around them: the service blocks 1/2/3 first, then block 4 ({4: … -}), then the optional block-5 trailer. Block-4 free text is opaque, so values are written verbatim with no escaping — an interior {, }, a mid-line -}, a folded continuation break, and an interior blank line all reproduce as data. The reader strips exactly the structural separators (braces, the leading :, the -} trailer, the line breaks) and keeps every other byte; the writer re-adds exactly those separators and nothing else, so a read → write → read round-trip returns byte-identical field values.

Records map by the tag and value columns. The block column is the constant 4 discriminator (an empty block is treated as block 4, so a Transform that projects only tag/value writes fine); a record carrying a block other than 4 is rejected, because service blocks are never emitted as records — they ride the document context.

nodes:
  - type: output
    name: out
    input: messages
    config:
      name: out
      type: swift
      path: ./out/message.swift
      options:
        basic_header_from_doc: basic
        app_header_from_doc: app
        user_header_from_doc: user
        trailer_from_doc: trailer

Each service block is written from a literal body or echoed from a user-declared $doc section:

Option	Meaning
`basic_header`	Literal block-1 body, written verbatim as `{1:<body>}`.
`basic_header_from_doc`	Name of a `$doc` section to echo the block-1 body from.
`app_header`	Literal block-2 body.
`app_header_from_doc`	Name of a `$doc` section to echo the block-2 body from.
`user_header`	Literal block-3 body (nested `{tag:value}` content kept verbatim).
`user_header_from_doc`	Name of a `$doc` section to echo the block-3 body from.
`trailer`	Literal block-5 body, written after block 4 closes.
`trailer_from_doc`	Name of a `$doc` section to echo the block-5 body from.

The *_from_doc options name the section the user declared on the source; the engine reserves no section name. A literal option (basic_header, app_header, user_header, or trailer) wins over its *_from_doc companion when both are set; a service block with neither is omitted. The *_from_doc echo reads the block body verbatim from the section’s body field, the same single-field shape the reader writes, so a SWIFT source’s service blocks (declared as segment envelope sections) round-trip unchanged when their section names are passed back to the writer here.

The document context that carries the $doc sections rides on each body record, so the *_from_doc echoes require at least one block-4 record to read from. A document with zero block-4 records emits a valid empty message — {4: immediately closed by -} — wrapped only by the literal-configured service blocks (basic_header, app_header, user_header, trailer); the *_from_doc echoes are skipped because no record carries the document context. Use the literal options when a zero-record output must still emit its service blocks.
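
The resolution order for service blocks can be sketched as follows. This is a hedged illustration: write_message, the dict shapes for options and doc, and the exact placement of line breaks around block 4 are assumptions of the example, not the writer's actual implementation.

```python
SERVICE_BLOCKS = (("1", "basic_header"), ("2", "app_header"), ("3", "user_header"))

def write_message(records, options, doc=None):
    """Frame block-4 records into one SWIFT MT message (illustrative only)."""
    # The document context rides on body records, so with zero records
    # there is nothing to echo from.
    ctx = doc if records else None

    def resolve(key):
        literal = options.get(key)
        if literal is not None:
            return literal                       # a literal wins over *_from_doc
        section = options.get(key + "_from_doc")
        if section is not None and ctx is not None:
            return ctx[section]["body"]          # echo the section's body verbatim
        return None                              # neither available: block omitted

    parts = []
    for block_id, key in SERVICE_BLOCKS:
        body = resolve(key)
        if body is not None:
            parts.append("{" + block_id + ":" + body + "}")
    lines = "".join(f":{r['tag']}:{r['value']}\r\n" for r in records)
    parts.append("{4:" + ("\r\n" + lines if records else "") + "-}")
    trailer = resolve("trailer")
    if trailer is not None:
        parts.append("{5:" + trailer + "}")
    return "".join(parts)
```

With zero records only the literal options survive, which is why the literal form is the safer choice when an empty output must still carry its headers.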

Block-4 free text has no escape mechanism, so values are written verbatim. Almost any value round-trips faithfully, but two shapes are unrepresentable when the value is built from arbitrary records (CSV/JSON → Transform → SWIFT): a value whose continuation line — a line after a folded line break — begins with the block terminator -} (which would re-read as an early block close) or with a : tag marker (which would re-read as a spurious field). The writer rejects such a value with a clear error rather than emitting silently-corrupt output. Values read from a SWIFT source can never take these shapes, so a read → write → read round-trip is always safe.
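
The two unrepresentable shapes can be checked before writing. The sketch below uses a hypothetical check_writable helper and invented error text; it is slightly conservative in that it treats any continuation line starting with a colon as a tag marker.

```python
def check_writable(value: str) -> None:
    """Raise if a block-4 value would re-read differently (illustrative only)."""
    lines = value.replace("\r\n", "\n").split("\n")
    # Only continuation lines matter: the first line follows ":tag:" on the
    # same line, so it can never sit at the start of a line in the output.
    for line in lines[1:]:
        if line.startswith("-}"):
            raise ValueError("continuation line begins with '-}' and would close block 4 early")
        if line.startswith(":"):
            raise ValueError("continuation line begins with ':' and would re-read as a new field")
```

A mid-line -}, interior braces, and blank continuation lines all pass, matching the verbatim-write behavior described above.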

A SWIFT MT message is a single indivisible envelope, so a swift output cannot be combined with a byte-limit split: block — the pairing is rejected at config-validation time (diagnostic E342).

Limitations

UTF-8 only. SWIFT MT messages are decoded as UTF-8; a non-UTF-8 block body is rejected explicitly rather than corrupted silently.
Field-content parsing. The reader exposes each :tag:value line as a tag/value pair verbatim. Parsing a field’s internal structure (the sub-fields of a :32A: value-date/currency/amount, say) is a CXL concern downstream of the source, not a reader responsibility.

Network Sources (REST)

A Source reads from the filesystem by default. To pull records from a network endpoint instead, declare a transport: block on the Source. The transport selects where records come from; the type: key still selects the format, which for a REST source determines how the response bodies decode.

A network transport is a finite-pull source: it runs on its own thread, drives a synchronous client to cursor exhaustion, then exits. There is no daemon, no event loop, and no async runtime: it follows the same single-process, run-to-drain model as a file pipeline. Finiteness is a hard property of the reader: a REST source caps its pull with an explicit page/record limit, so an unbounded endpoint cannot keep it running forever.

A network source still requires a schema: block. That authored schema is the row-to-record target: the reader maps each decoded object onto it, coercing values leniently. A per-row value that cannot coerce is left unchanged at the reader and routed to the dead-letter queue at the Transform stage — identical to file-source semantics. A network source declares no file matcher (path / glob / regex / paths); declaring one is a configuration error (E219).

Because a network source has no file path, its $source.file provenance column and the {source_file} output template both resolve to a stable synthetic identifier, <source:NAME>, where NAME is the Source node’s name.

REST sources

A rest source issues paginated HTTP GETs against a base URL, decoding each response body through the declared json or xml format. (Other formats are rejected with E220 — a REST body is a multi-record document, not a flat CSV/fixed-width stream.)

nodes:
  - type: source
    name: orders_api
    config:
      name: orders_api
      type: json
      options:
        format: array        # each page body is a JSON array of objects
      transport:
        kind: rest
        url: https://api.example.com/v1/orders
        max_pages: 50         # HARD page cap — required
        pagination:
          strategy: link_header
        auth:
          scheme: bearer
          token: "${ORDERS_TOKEN}"
      schema:
        - { name: order_id, type: int }
        - { name: total,    type: float }
        - { name: placed_at, type: date_time }

Pagination strategies

The pagination.strategy selects how the reader advances pages and detects the last one. Whatever the strategy, the pull always stops at the max_pages / max_records cap, even when the server keeps offering more.

none (default) — a single GET; the body is the whole result.
offset — ?offset=N&limit=L, advancing the offset by the page size each request. The last page is the one that returns fewer rows than limit.
```
pagination:
  strategy: offset
  limit: 200
  offset_param: offset     # optional, defaults shown
  limit_param: limit
```
cursor_token — the reader reads a continuation token from a JSON pointer in each response and sends it back on the next request. Paging stops when the token field is absent or null.
```
pagination:
  strategy: cursor_token
  cursor_param: page_token
  next_token_pointer: /meta/next_page   # RFC 6901 JSON pointer
```
link_header — the reader follows the URL in the response’s RFC 5988 Link: <…>; rel="next" header until no such link is present.
```
pagination:
  strategy: link_header
```
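
The interaction between a strategy's own stop condition and the hard caps can be sketched for the offset strategy. This is a minimal model, not the engine's client: pull_offset is a hypothetical name, and fetch_page is a stand-in callable for the HTTP GET with ?offset=N&limit=L.

```python
from typing import Callable, Iterator, Optional

def pull_offset(
    fetch_page: Callable[[int, int], list],
    limit: int,
    max_pages: int,
    max_records: Optional[int] = None,
) -> Iterator:
    """Yield rows page by page, never exceeding the hard caps (illustrative only)."""
    emitted = 0
    offset = 0
    for _ in range(max_pages):              # hard page cap, regardless of the server
        rows = fetch_page(offset, limit)
        for row in rows:
            if max_records is not None and emitted >= max_records:
                return                      # hard record cap
            yield row
            emitted += 1
        if len(rows) < limit:               # a short page is the last page
            return
        offset += limit                     # advance by the page size
```

Because the loop is bounded by max_pages, a server that always returns a full page still cannot keep the pull running.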

Authentication

auth.scheme selects the credential sent on every request:

none (default) — no auth header.
bearer — sends Authorization: Bearer <token>.

header — sends an arbitrary static header, e.g. an API key.

auth:
  scheme: header
  name: X-API-Key
  value: "${API_KEY}"

Reliability and finiteness knobs

Key	Default	Meaning
`max_pages`	—	Required. Hard ceiling on pages fetched, regardless of the server.
`max_records`	none	Optional hard ceiling on records emitted.
`retries`	`3`	Bounded retries on a transient failure (5xx, connect/timeout error). A 4xx is fatal — retrying cannot help.
`timeout_secs`	`30`	Per-request timeout. Bounds in-flight time so an interrupt lands within the shutdown window.
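
The retry rule in the table can be modeled as below. The names fetch_with_retries and FatalHttpError are hypothetical, send stands in for one HTTP request, and the sketch assumes that retries counts attempts after the first one; it omits any backoff, which the text above does not describe.

```python
class FatalHttpError(Exception):
    """A response that retrying cannot fix (for example a 4xx)."""

def fetch_with_retries(send, retries: int = 3):
    """Call send() until it succeeds, retrying only transient failures (illustrative only)."""
    last_error = None
    for _attempt in range(retries + 1):      # first try plus the bounded retries
        try:
            status, body = send()
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc                  # connect/timeout error: transient
            continue
        if 200 <= status < 300:
            return body
        if 500 <= status < 600:
            last_error = FatalHttpError(status)   # 5xx: transient, try again
            continue
        raise FatalHttpError(status)          # 4xx and anything else: fatal
    raise RuntimeError(f"gave up after {retries} retries") from last_error
```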

A partial-page decode failure routes that page’s offending rows to the DLQ per-row, exactly like a file source; it does not abort the pull.

Shutdown

On SIGINT/SIGTERM the reader polls its cancellation handle at each page boundary and stops cleanly with a normal end-of-input — the same graceful drain a file source performs. The timeout_secs per-request bound caps how long a single in-flight request can delay that stop.
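
Polling at page boundaries can be pictured with a threading.Event standing in for the cancellation handle. Both pull_pages and the use of an Event are assumptions of this sketch; the engine's handle type is not described here.

```python
import threading
from typing import Callable, Iterator

def pull_pages(
    fetch_page: Callable[[int], list],
    max_pages: int,
    cancel: threading.Event,
) -> Iterator:
    """Stop with a normal end-of-input once cancellation is observed (illustrative only)."""
    for page in range(max_pages):
        if cancel.is_set():      # checked once per page boundary
            return               # clean stop: downstream sees ordinary end-of-input
        rows = fetch_page(page)
        if not rows:
            return
        yield from rows
```

A request already in flight is not interrupted in this model, which is why the per-request timeout bounds how long the stop can be delayed.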

Auto-Widen & Schema Drift

When an input file carries columns the source’s declared schema: block does not name, Clinker decides what to do with them via the per-source on_unmapped policy. The default is auto_widen, which preserves the extra columns end-to-end so schema drift never silently breaks a pipeline. This page covers the three modes, how undeclared columns flow downstream, the output controls, and the related diagnostics.

The three modes

- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: "./data/orders.csv"
    on_unmapped:
      mode: auto_widen     # default; other values: drop, reject
    schema:
      - { name: order_id, type: string }
      - { name: amount, type: float }

auto_widen (default) — undeclared input fields are carried along with each record and re-expanded to top-level columns at the output (when the Output node’s include_unmapped is left at its default of true). Nothing is silently lost, and you don’t have to declare every column up front. How a widened column reaches the output depends on the output format — self-describing formats (JSON / NDJSON / XML) carry it per record, tabular CSV widens its header to the union of every record’s columns, and fixed-width (a positional layout with no room for undeclared columns) fails loudly rather than dropping it. See Output controls below.
drop — undeclared input fields are silently stripped at read time. The source carries only its declared schema:.
reject — any record carrying a field not in the declared schema fails the source with a diagnostic naming the offending field. The strict choice when unexpected columns should be treated as errors.
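
As a rough model of the three modes, the sketch below splits each row into declared fields and carried-along extras. The names apply_on_unmapped and expand_at_output are hypothetical, and a plain dict stands in for whatever sidecar structure the engine uses.

```python
def apply_on_unmapped(row: dict, declared: list, mode: str = "auto_widen"):
    """Return (declared_fields, carried_fields) for one input row (illustrative only)."""
    known = {k: v for k, v in row.items() if k in declared}
    extra = {k: v for k, v in row.items() if k not in declared}
    if mode == "auto_widen":
        return known, extra                  # extras ride along with the record
    if mode == "drop":
        return known, {}                     # extras stripped at read time
    if mode == "reject":
        if extra:
            raise ValueError(f"undeclared field: {next(iter(extra))}")
        return known, {}
    raise ValueError(f"unknown on_unmapped mode: {mode}")

def expand_at_output(known: dict, carried: dict, include_unmapped: bool = True) -> dict:
    """Re-expand carried fields to top-level columns at the sink."""
    return {**known, **carried} if include_unmapped else dict(known)
```

Only the first element of the pair would be visible to expressions in this model, which mirrors the rule that CXL reads declared fields only.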

CXL expressions can only read fields you declared in schema: — carried-along undeclared fields are not visible to CXL, only to the output. To use an undeclared field in an expression, add it to the source schema:.

How undeclared columns flow downstream

Carried-along columns follow these rules through each node type:

Node type	Behavior
Transform	Passed through unchanged (transforms are row-preserving).
Aggregate	Dropped — per-row extra columns have no meaning on a grouped row. To keep one, add it to `group_by` or emit it explicitly.
Combine	The driver’s carried columns ride through; build-side ones are dropped. To keep a build-side field, emit it explicitly in the combine body via `<build_qualifier>.<field>`.
Route / Merge	Passed through. `Merge` requires every input to share the same `on_unmapped` policy — mixing fails with E315 (see below).
Composition	The body inherits the parent’s carried columns and whatever the body’s last node carries flows back out.
Output	Expanded to top-level columns when `include_unmapped: true` (the default); stripped when `false`. How a widened column reaches a CSV / XML / fixed-width writer depends on the format — see Schema drift across records.

Output controls

- type: output
  name: out
  input: src
  config:
    name: out
    type: json
    path: out.json
    include_unmapped: true    # default: true

When true (the default), undeclared fields the source carried along are expanded back to top-level columns at the sink — useful for pass-through pipelines where every original column should reach the output. Set include_unmapped: false to write only the columns explicitly emitted upstream.

include_unmapped is independent of include_correlation_keys: each can be set on its own, and include_correlation_keys never surfaces auto-widened columns.

Cross-format flow

Expansion happens before the writer runs, so a CSV source with auto_widen feeding a JSON output with include_unmapped: true produces JSON objects whose keys include both the declared columns and the absorbed ones:

input.csv:    id,extra,city
              1,foo,Paris

output.json:  {"id": "1", "extra": "foo", "city": "Paris"}

Schema drift across records (tabular formats)

Different records can carry different auto-widened columns — for example a Merge of two sources where one carries region and the other category, so region appears only on the first source’s rows and category only on the second’s. Each output format handles that heterogeneity differently:

JSON / NDJSON / XML are self-describing: each record writes its own keys/elements, so a column present on only some records is simply absent from the others. Nothing is lost.
CSV needs one header shared by every row. When the output can be materialized (the common buffered path), Clinker pre-scans the batch and writes a header that is the union of every record’s columns, in first-seen order; rows that lack a later-appearing column write an empty cell for it. Nothing is lost.
On a bounded-memory CSV path (a streaming output fused directly after a Merge/Transform, a single-branch Route, a streaming-strategy Aggregate, or the probe side of a hash-build-probe Combine; or an envelope-reconstructing output), the writer commits its header from the first record before it has seen the rest, so a union is impossible. A later record carrying a column the header lacks then fails the run loudly with a SchemaDrift error naming the format and column, rather than silently writing a narrower row. Declare the column in the source (or output) schema: so every record carries it, or route to a self-describing format.
Fixed-width is positional — every column occupies a declared byte range, and there is no room for an undeclared one — so any carried-along column reaching a fixed-width output is a SchemaDrift error. Fixed-width sources never auto-widen (see below), so this only arises when a fixed-width output sits downstream of a source that does.
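
The difference between the buffered and streaming CSV paths can be shown with Python's csv module. This is an illustration of the two behaviors only: the function names are hypothetical, and the SchemaDrift class here is a local stand-in for the engine's error.

```python
import csv
import io

class SchemaDrift(Exception):
    """Local stand-in for the engine's SchemaDrift error."""

def write_csv_union(records: list) -> str:
    """Buffered path: pre-scan, then write a union header in first-seen order."""
    header, seen = [], set()
    for rec in records:
        for col in rec:
            if col not in seen:
                seen.add(col)
                header.append(col)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=header, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)        # missing columns become empty cells
    return buf.getvalue()

def write_csv_streaming(records, out) -> None:
    """Streaming path: the header is committed from the first record."""
    writer, header = None, []
    for rec in records:
        if writer is None:
            header = list(rec)
            writer = csv.DictWriter(out, fieldnames=header, restval="", lineterminator="\n")
            writer.writeheader()
        extra = [c for c in rec if c not in header]
        if extra:
            raise SchemaDrift(f"csv: column {extra[0]!r} is not in the committed header")
        writer.writerow(rec)
```

Feeding both functions the same two records, one carrying region and the other category, yields a three-column file on the buffered path and an error on the streaming path.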

Writer errors on unexpanded columns

The CSV, XML, and fixed-width writers can only write flat scalar columns. If a $widened sidecar map reaches one of these writers without being expanded — which happens when you set include_unmapped: false but a nested value is still present — the write fails with an UnserializableMapValue error naming the format and column. (JSON has no such limit; it writes nested values natively.)

The fix is to either leave include_unmapped at its default of true, so the columns are expanded to top-level before writing, or to convert the value to a scalar in CXL before emitting it. The error message lists both routes.

E315 — Merge inputs must agree on policy

Merge concatenates its inputs positionally, so every input must agree on column shape — same column names, same on_unmapped policy, same correlation_key set. If two upstream sources disagree on whether they carry auto-widened columns (one uses auto_widen, another uses drop / reject), compilation fails:

E315: merge "merged": input schemas disagree on the `$widened` auto_widen sidecar column.

The fix is to set every merge upstream source to the same on_unmapped policy.

Fixed-width sources

Fixed-width sources are positional — the reader only sees the byte ranges your schema defines, so there are never any “extra” columns to absorb. auto_widen has no effect on a fixed-width source; use on_unmapped: drop (or reject) to make that explicit and silence the informational log the engine emits otherwise.

CXL Overview

CXL (Clinker Expression Language) is a per-record expression language designed for ETL transformations. Every CXL program operates on one record at a time, producing output fields, filtering records, or computing derived values.

CXL is not SQL. There are no SELECT, FROM, or WHERE keywords. CXL programs are sequences of statements – emit, let, filter, distinct – that execute top to bottom against the current record.

Key differences from SQL

SQL	CXL
`SELECT col AS alias`	`emit alias = col`
`WHERE condition`	`filter condition`
`AND` / `OR` / `NOT`	`and` / `or` / `not` (keywords)
`&&` / `\|\|` / `!`	Not supported – use keywords
`COALESCE(a, b)`	`a ?? b`
`CASE WHEN ... THEN ... END`	`if ... then ... else ...` or `match { }`

Boolean operators are keywords

CXL uses English keywords for boolean logic, not symbols:

$ cxl eval -e 'emit result = true and false' --field dummy=1

{
  "result": false
}

The operators &&, ||, and ! are syntax errors in CXL. Always use and, or, and not.

System namespaces use `$` prefix

CXL provides built-in namespaces for accessing pipeline state, metadata, and window functions. All system namespaces are prefixed with $:

$pipeline.* – pipeline execution context (name, counters, provenance) and pipeline-scope declared state
$source.* – per-source context and source-scope declared state
$record.* – per-record scoped state (travels with the record, never an output column)
$window.* – window function calls
$vars.* – static, channel-overridable configuration
$config.* – a composition’s config parameters, read inside its body (constant-folded per instantiation)

$ cxl eval -e 'emit name = $pipeline.name'

{
  "name": "cxl-eval"
}

Compile-time type checking

CXL is statically type-checked, so type errors are caught before any data is processed. Run cxl check to validate a transform before a run. Errors come with source locations and fix suggestions.

$ cxl check transform.cxl
ok: transform.cxl is valid

If there are type errors, the checker reports them with spans:

error[typecheck]: cannot apply '+' to String and Int (at transform.cxl:12)
  help: convert one operand — use .to_int() or .to_string()

A minimal CXL program

emit greeting = "hello"
emit doubled = amount * 2
filter amount > 0

This program:

Emits a constant string field greeting
Emits doubled as twice the input amount
Filters out records where amount is not positive

Try it:

$ cxl eval -e 'emit greeting = "hello"' -e 'emit doubled = amount * 2' \
    --field amount=5

{
  "greeting": "hello",
  "doubled": 10
}

Statement order matters

CXL statements execute sequentially. Later statements can reference fields produced by earlier emit or let statements:

$ cxl eval -e 'let tax_rate = 0.21' -e 'emit tax = price * tax_rate' \
    --field price=100

{
  "tax": 21.0
}

A filter statement short-circuits execution – if the condition is false, remaining statements do not run and the record is excluded from output.

Types & Literals

CXL has 10 value types. Every field value, literal, and expression result is one of these types.

Value types

Type	Rust backing	Description
Null	`Value::Null`	Missing or absent value
Bool	`bool`	`true` or `false`
Integer	`i64`	64-bit signed integer
Float	`f64`	64-bit double-precision float
Decimal	`Decimal`	Exact base-10 fixed-point number for money/financials
String	`Box<str>`	UTF-8 text
Date	`NaiveDate`	Calendar date without timezone
DateTime	`NaiveDateTime`	Date and time without timezone
Array	`Vec<Value>`	Ordered collection of values
Map	`IndexMap<Box<str>, Value>`	Key-value pairs

Literal syntax

Integers

Standard decimal notation. Negative values use the unary minus operator.

$ cxl eval -e 'emit a = 42' -e 'emit b = -5' -e 'emit c = 0'

{
  "a": 42,
  "b": -5,
  "c": 0
}

Floats

Decimal notation with a dot. Must have digits on both sides of the decimal point.

$ cxl eval -e 'emit a = 3.14' -e 'emit b = -0.5'

{
  "a": 3.14,
  "b": -0.5
}

Strings

Double-quoted or single-quoted. Supports escape sequences: \\, \", \', \n, \t, \r.

$ cxl eval -e 'emit greeting = "hello world"'

{
  "greeting": "hello world"
}

Booleans

The keywords true and false.

$ cxl eval -e 'emit flag = true' -e 'emit neg = not flag'

{
  "flag": true,
  "neg": false
}

Dates

Hash-delimited ISO 8601 format: #YYYY-MM-DD#.

$ cxl eval -e 'emit d = #2024-01-15#'

{
  "d": "2024-01-15"
}

Null

The keyword null.

$ cxl eval -e 'emit nothing = null'

{
  "nothing": null
}

Schema types

When declaring column types in YAML pipeline schemas, use these type names:

Schema type	CXL type	Description
`string`	String	Text values
`int`	Integer	64-bit integers
`float`	Float	64-bit floats
`decimal`	Decimal	Exact base-10 fixed-point (money) — see below
`bool`	Bool	Boolean values
`date`	Date	Calendar dates
`date_time`	DateTime	Date and time
`array`	Array	Ordered collections
`numeric`	Integer or Float	Union type – accepts either
`any`	Any	Unknown type – no type constraints
`nullable(T)`	Nullable(T)	Wrapper – value may be null

Example YAML schema declaration:

schema:
  employee_id: int
  name: string
  salary: nullable(float)
  start_date: date

Type promotion

CXL automatically promotes types in mixed expressions:

Int + Float promotes to Float:

$ cxl eval -e 'emit result = 2 + 3.5'

{
  "result": 5.5
}

Null + T produces Nullable(T): Any operation involving null produces a nullable result.

$ cxl eval -e 'emit result = null + 5'

{
  "result": null
}

Nullable(A) + B unifies to Nullable(unified): When a nullable value meets a non-nullable value, the result type wraps the unified inner type in Nullable.

The `decimal` type

float is an IEEE-754 binary float: it cannot represent most base-10 fractions exactly, so 0.1 + 0.2 is 0.30000000000000004, not 0.3. That rounding is unacceptable for money. The decimal type is an exact base-10 fixed-point number — 0.10 + 0.20 is exactly 0.30 — and is the correct type for monetary amounts, prices, tax, and any figure that must round like decimal arithmetic on paper.

Declare a decimal column with type: decimal and a scale (the number of fractional digits). precision (total significant digits) is optional validation metadata:

schema:
  - { name: amount, type: decimal, scale: 2 }
  - { name: tax_rate, type: decimal, scale: 4 }

A decimal column parses its raw text into an exact value and rounds off any excess precision to the column scale (round-half-to-even, the unbiased “banker’s rounding” used in accounting), so a scale: 2 column stores 2.567 as 2.57. This is one edge of a boundary contract: a declared scale pins a value to that many places at the boundary it is declared on — a source column’s scale on read, an output column’s scale on write (see Aggregating decimals) — while decimals keep full precision inside the pipeline.
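
Python's decimal module follows the same arithmetic model and makes the read-side rounding easy to try out. The helper name parse_decimal is hypothetical; this is an analogy for the behavior described above, not CXL or Clinker code.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def parse_decimal(text: str, scale: int) -> Decimal:
    """Parse raw text exactly, then round to `scale` places with banker's rounding."""
    quantum = Decimal(10) ** -scale          # scale=2 gives Decimal('0.01')
    return Decimal(text).quantize(quantum, rounding=ROUND_HALF_EVEN)

# Binary floats drift; exact decimals do not.
float_drifts = (0.1 + 0.2) != 0.3
decimal_exact = Decimal("0.10") + Decimal("0.20") == Decimal("0.30")

# Ties go to the even neighbor: 2.565 rounds down, 2.575 rounds up.
examples = [str(parse_decimal(t, 2)) for t in ("2.567", "2.565", "2.575", "2.5")]
```

The tie cases are where round-half-to-even differs from the schoolbook round-half-up, which would turn 2.565 into 2.57.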

Arithmetic rules

decimal ⊗ decimal → decimal — exact.
decimal ⊗ int → decimal — the integer widens exactly, so amount + 1 and price * quantity stay exact decimals.
decimal ⊗ float is a type error. Mixing an exact decimal with a binary float would silently lose precision, so CXL rejects it and asks for an explicit cast. Choose the trade-off deliberately:
- amount.to_float() * rate — opt into binary float precision.
- rate.to_decimal() * amount — bring the float into exact decimal math (the float→decimal step is the one acknowledged lossy conversion).
Division and avg compute at full precision — the exact quotient, not a binary-float approximation. Inside the pipeline a computed decimal keeps every digit; it is pinned to a fixed number of places only at a boundary that declares a scale. Declaring the output column type: decimal with a scale rounds the value to that many places on write (banker’s rounding), exactly as a decimal source column rounds on read — so avg(amount) emitted into a scale: 2 output column writes 1.33, not the full quotient. With no declared output scale the full precision is preserved; use an explicit round in CXL when you need fixed places mid-pipeline.

Comparisons follow the same rule: decimal < int is fine, decimal < float requires a cast.

Casting

x.to_decimal() converts an int, string, or float into a decimal (try_decimal is the lenient form that yields null on failure). d.to_int(), d.to_float(), and d.to_string() convert a decimal back out.

Worked example — an exact invoice total

$ cxl eval -e 'emit total = ("19.99".to_decimal() * 3) + "4.80".to_decimal()'

{
  "total": "64.77"
}

19.99 * 3 = 59.97, + 4.80 = 64.77 — exact, with no binary-float drift. (In a pipeline, declare the source columns type: decimal instead of casting; JSON output renders a decimal as a scale-preserving string.)

Aggregating decimals

sum, avg, min, max, count, and distinct all work over a decimal column and stay exact — no binary float ever touches a running total:

sum(amount) and avg(amount) return a decimal. The sum is the exact total to the cent; the average is the exact full-precision quotient.
min / max return the exact extremum, and count returns an integer.
Group-by and distinct keys are scale-normalized: two decimals that are numerically equal group together regardless of scale, so 2.50 and 2.5 fall in one group. This holds even when the aggregation spills to disk.

To pin an aggregate result to fixed places on the way out, declare the output column type: decimal with a scale: the value is rounded to that scale on write (banker’s rounding), so avg(amount) over 1.00, 1.00, 2.00 writes 1.33 into a scale: 2 output column while sum(amount) stays 4.00. An output column with no declared scale keeps the full-precision quotient. This applies to every format — CSV, JSON, and fixed-width — because the rounding happens as the record is projected onto the output. For fixed-width output it is often required: a full-precision quotient overflows a narrow numeric field, which is a hard error, whereas the rounded value fits.
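
The 1.00, 1.00, 2.00 example can be reproduced with Python's decimal module as an analogy. The helper names pin and group_sums are hypothetical, and Python's default 28-digit context only approximates the full-precision quotient the text describes.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def pin(value: Decimal, scale: int) -> Decimal:
    """Round to `scale` places on the way out, as a declared output scale would."""
    return value.quantize(Decimal(10) ** -scale, rounding=ROUND_HALF_EVEN)

def group_sums(rows):
    """Sum amounts per key, with keys normalized so 2.50 and 2.5 share a group."""
    groups = {}
    for key, amount in rows:
        k = key.normalize()
        groups[k] = groups.get(k, Decimal(0)) + amount
    return groups

amounts = [Decimal("1.00"), Decimal("1.00"), Decimal("2.00")]
total = sum(amounts, Decimal(0))     # exact total
average = total / len(amounts)       # long quotient, not yet pinned
avg_out = pin(average, 2)            # what a scale: 2 output column would hold
sum_out = pin(total, 2)
```

Pinning happens once, at the end, so intermediate sums never accumulate rounding error.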

weighted_avg also stays exact over decimals: a decimal value or weight (or both) gives an exact sum(value * weight) / sum(weight) at full division precision, and a zero total weight returns null. A decimal in one position mixed with a binary float in the other is a type error, matching the decimal ⊗ float arithmetic rule — cast with .to_decimal() or .to_float() so the value and weight share one numeric domain.
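
A small Python model of these rules, using the decimal module as a stand-in for CXL decimals. The function is a sketch written for this page, not the engine's aggregate; the explicit float check mirrors the type error described above.

```python
from decimal import Decimal

def weighted_avg(pairs):
    """Exact sum(value * weight) / sum(weight); None when the total weight is zero."""
    numerator = Decimal(0)
    denominator = Decimal(0)
    for value, weight in pairs:
        if isinstance(value, float) or isinstance(weight, float):
            # Mirrors the decimal-with-float rule: no silent mixing of domains.
            raise TypeError("decimal and float cannot be mixed; cast one side first")
        numerator += value * weight
        denominator += weight
    if denominator == 0:
        return None
    return numerator / denominator
```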

Type unification rules

When two types meet in an expression, CXL unifies them according to these rules:

Numbers combine: mixing an integer and a float gives a float (2 + 3.5 is 5.5).
Anything combined with null is null.
Mismatched types are an error: String + Int fails. Convert first with .to_int() or .to_string() so both sides are the same type.
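
The unification rules, together with the decimal rules from earlier on this page, can be modeled over type names. This is a toy sketch: unify is a hypothetical function operating on strings such as "Int" and "Nullable(Float)", and it covers only the cases the text states.

```python
def unify(a: str, b: str) -> str:
    """Return the result type of combining two operand types (illustrative only)."""
    nullable = False

    def strip(t: str) -> str:
        nonlocal nullable
        if t.startswith("Nullable(") and t.endswith(")"):
            nullable = True
            return t[len("Nullable("):-1]
        return t

    a, b = strip(a), strip(b)
    if a == "Null" or b == "Null":
        nullable = True                       # anything combined with null is nullable
        inner = b if a == "Null" else a
    elif a == b:
        inner = a
    elif {a, b} == {"Int", "Float"}:
        inner = "Float"                       # integer promotes to float
    elif {a, b} == {"Int", "Decimal"}:
        inner = "Decimal"                     # integer widens exactly
    else:
        raise TypeError(f"cannot unify {a} and {b}; convert one side first")
    if inner == "Null":
        return "Null"
    return f"Nullable({inner})" if nullable else inner
```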

Operators & Expressions

CXL provides arithmetic, comparison, boolean, null coalescing, and string operators. Boolean logic uses keywords (and, or, not), not symbols.

Arithmetic operators

Operator	Description	Example
`+`	Addition (or string concatenation)	`2 + 3`
`-`	Subtraction	`10 - 4`
`*`	Multiplication	`3 * 5`
`/`	Division	`10 / 3`
`%`	Modulo (remainder)	`10 % 3`

$ cxl eval -e 'emit result = 2 + 3 * 4'

{
  "result": 14
}

Multiplication binds tighter than addition, so 2 + 3 * 4 is 2 + (3 * 4) = 14, not (2 + 3) * 4 = 20.

$ cxl eval -e 'emit result = 10 % 3'

{
  "result": 1
}

Comparison operators

Operator	Description	Example
`==`	Equal	`x == 0`
`!=`	Not equal	`x != 0`
`>`	Greater than	`x > 10`
`<`	Less than	`x < 10`
`>=`	Greater than or equal	`x >= 10`
`<=`	Less than or equal	`x <= 10`

$ cxl eval -e 'emit result = 5 > 3' --field dummy=1

{
  "result": true
}

Boolean operators

CXL uses keywords for boolean logic. The symbols &&, ||, and ! are not valid CXL syntax.

Operator	Description	Example
`and`	Logical AND	`a and b`
`or`	Logical OR	`a or b`
`not`	Logical NOT (unary)	`not a`

$ cxl eval -e 'emit result = true and not false'

{
  "result": true
}

$ cxl eval -e 'emit result = 5 > 3 or 10 < 2'

{
  "result": true
}

Null coalesce operator

The ?? operator returns its left operand if non-null, otherwise its right operand.

$ cxl eval -e 'emit result = null ?? "default"'

{
  "result": "default"
}

$ cxl eval -e 'emit result = "present" ?? "default"'

{
  "result": "present"
}

String concatenation

The + operator concatenates strings when both operands are strings.

$ cxl eval -e 'emit result = "hello" + " " + "world"'

{
  "result": "hello world"
}

Unary operators

Operator	Description	Example
`-`	Numeric negation	`-x`
`not`	Boolean negation	`not done`

$ cxl eval -e 'emit result = -42'

{
  "result": -42
}

Method calls

Methods are called on a receiver using dot notation:

$ cxl eval -e 'emit result = "hello".upper()'

{
  "result": "HELLO"
}

Methods can be chained:

$ cxl eval -e 'emit result = "  hello  ".trim().upper()'

{
  "result": "HELLO"
}

Field references

Bare identifiers reference fields from the input record:

$ cxl eval -e 'emit result = price * qty' \
    --field price=10 \
    --field qty=3

{
  "result": 30
}

Qualified field references use dot notation for multi-source pipelines: source.field.

Operator precedence

From highest (binds tightest) to lowest:

Precedence	Operators	Associativity
1 (highest)	`.` (method calls, field access)	Left
2	`-` (unary), `not`	Prefix
3	`*` `/` `%`	Left
4	`+` `-`	Left
5	`==` `!=` `>` `<` `>=` `<=`	Left
6	`and`	Left
7	`or`	Left
8 (lowest)	`??`	Right

Use parentheses to override precedence:

$ cxl eval -e 'emit result = (2 + 3) * 4'

{
  "result": 20
}

Comments

Line comments start with # (when not followed by a digit – digit-prefixed # starts a date literal):

# This is a comment
emit total = price * qty  # inline comment
emit deadline = #2024-12-31#  # this is a date literal, not a comment

Statements

CXL programs are sequences of statements that execute top-to-bottom against each input record. Statement order matters – later statements can reference values produced by earlier ones.

emit

The emit statement produces an output field. Each emit becomes a column in the output record.

emit name = expression

$ cxl eval -e 'emit greeting = "hello"' -e 'emit doubled = 21 * 2'

{
  "greeting": "hello",
  "doubled": 42
}

Multiple emit statements build up the output record field by field:

$ cxl eval -e 'emit first = "Alice"' -e 'emit last = "Smith"' \
    -e 'emit full = first + " " + last'

{
  "first": "Alice",
  "last": "Smith",
  "full": "Alice Smith"
}

let

The let statement creates a local variable binding. The variable is available to subsequent statements but is NOT included in the output record.

let name = expression

$ cxl eval -e 'let tax_rate = 0.21' -e 'emit tax = 100 * tax_rate'

{
  "tax": 21.0
}

Note that tax_rate does not appear in the output – only emit statements produce output fields.

filter

The filter statement excludes records where the condition evaluates to false. When a filter excludes a record, remaining statements do not execute (short-circuit).

filter condition

$ cxl eval -e 'filter amount > 0' -e 'emit result = amount * 2' \
    --field amount=5

{
  "result": 10
}

When the filter condition is false, the entire record is dropped and no output is produced.

Filters can appear anywhere in the statement sequence. Place them early to skip unnecessary computation:

filter status == "active"
let discount = if tier == "gold" then 0.2 else 0.1
emit final_price = price * (1 - discount)

distinct

The distinct statement deduplicates records. The bare form deduplicates on all emitted fields. The by form deduplicates on a specific field.

distinct
distinct by field_name

In a pipeline, distinct tracks values seen so far and drops records that have already been emitted with the same key.
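
The seen-key tracking can be pictured with a short Python generator. It is an illustrative model only: the function name and the dictionary record shape are assumptions, and the engine's actual storage and memory accounting are not shown.

```python
from typing import Dict, Iterable, Iterator, Optional

def distinct(records: Iterable[Dict[str, object]],
             by: Optional[str] = None) -> Iterator[Dict[str, object]]:
    """Yield each record the first time its key is seen; drop later repeats."""
    seen = set()
    for record in records:
        # Bare form keys on every emitted field; the `by` form keys on one field.
        key = record[by] if by is not None else tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        yield record
```

In this sketch the set of seen keys grows with the number of distinct keys, so a low-cardinality by field keeps the tracked state small.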

emit to a scoped namespace

An emit whose target is a $pipeline.*, $source.*, or $record.* name writes a producer-declared scoped variable instead of an output column. The variable must be listed in the writing Transform’s config.declares: block. $record.* is the per-record store that travels with the record but never serializes as an output column.

emit $record.quality_flag = if amount < 0 then "suspect" else "ok"

Read it downstream via the same namespace:

filter $record.quality_flag == "ok"

See Scoped Variables for the declaration model and the three scopes’ lifetimes.

trace

The trace statement emits debug logging. It has no effect on the output record. Trace messages are only visible when tracing is enabled at the appropriate level.

trace "processing record"
trace warn "unusual value detected"
trace info if amount > 10000 then "high value transaction"

Trace levels: trace (default), debug, info, warn, error. An optional guard condition (via if) limits when the trace fires.

Statement ordering

Statements execute sequentially. A statement can reference any field or variable defined by a preceding emit or let:

$ cxl eval -e 'let base = 100' -e 'let rate = 0.15' \
    -e 'emit subtotal = base * rate' \
    -e 'emit total = base + subtotal'

{
  "subtotal": 15.0,
  "total": 115.0
}

Referencing a name before it is defined is a resolve-time error:

emit total = base + tax    # error: 'base' is not defined yet
let base = 100
let tax = base * 0.21

use

The use statement imports a CXL module for reuse. See Modules & use for details.

use shared.dates as d
emit fy = d::fiscal_year(invoice_date)

Conditionals

CXL provides two conditional expression forms: if/then/else and match. Both are expressions – they return values and can be used anywhere an expression is expected.

If / then / else

The basic conditional expression:

if condition then value else alternative

$ cxl eval -e 'emit label = if amount > 100 then "high" else "low"' \
    --field amount=250

{
  "label": "high"
}

The else branch is optional. When omitted, records where the condition is false produce null:

$ cxl eval -e 'emit bonus = if score > 90 then score * 0.1' \
    --field score=80

{
  "bonus": null
}

Chained conditionals

Chain multiple conditions with else if:

$ cxl eval -e 'emit tier = if amount > 1000 then "platinum"
    else if amount > 500 then "gold"
    else if amount > 100 then "silver"
    else "bronze"' \
    --field amount=750

{
  "tier": "gold"
}

Nested usage

Since if/then/else is an expression, it can be used inside other expressions:

$ cxl eval -e 'emit price = base * (if member then 0.8 else 1.0)' \
    --field base=100 \
    --field member=true

{
  "price": 80.0
}

Match

The match expression provides pattern matching. It comes in two forms: value matching (with a subject) and condition matching (without a subject).

Value form (with subject)

Match a subject expression against literal patterns:

match subject {
  pattern1 => result1,
  pattern2 => result2,
  _ => default
}

$ cxl eval -e 'emit label = match status {
    "A" => "Active",
    "I" => "Inactive",
    "P" => "Pending",
    _ => "Unknown"
  }' \
    --field status=A

{
  "label": "Active"
}

The wildcard _ is the catch-all arm. It matches any value not covered by preceding arms.

Condition form (without subject)

When no subject is provided, each arm’s pattern is evaluated as a boolean condition. This is CXL’s equivalent of SQL’s CASE WHEN:

match {
  condition1 => result1,
  condition2 => result2,
  _ => default
}

$ cxl eval -e 'emit tier = match {
    amount > 1000 => "high",
    amount > 100 => "medium",
    _ => "low"
  }' \
    --field amount=500

{
  "tier": "medium"
}

Practical examples

Tiered pricing:

emit discount = match {
  qty >= 1000 => 0.25,
  qty >= 100  => 0.15,
  qty >= 10   => 0.05,
  _           => 0.0
}

Status code mapping:

emit status_text = match http_code {
  200 => "OK",
  201 => "Created",
  400 => "Bad Request",
  404 => "Not Found",
  500 => "Internal Server Error",
  _   => "HTTP " + http_code.to_string()
}

Region classification:

emit region = match country {
  "US" => "North America",
  "CA" => "North America",
  "MX" => "North America",
  "GB" => "Europe",
  "DE" => "Europe",
  "FR" => "Europe",
  _    => "Other"
}

Match arms are evaluated in order

The first matching arm wins. Place more specific conditions before general ones:

# Correct: specific before general
emit category = match {
  amount > 10000 => "enterprise",
  amount > 1000  => "business",
  _              => "personal"
}

# Wrong: the most general arm comes first and shadows the arms below it
emit category = match {
  amount > 0     => "personal",    # catches everything positive
  amount > 1000  => "business",    # never reached
  amount > 10000 => "enterprise",  # never reached
  _              => "unknown"
}
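
First-match-wins evaluation can be mimicked in Python by walking a list of (condition, result) pairs in order. The helper match_first is hypothetical and only models the ordering behaviour described here.

```python
from typing import Callable, List, Tuple

Arm = Tuple[Callable[[float], bool], str]

def match_first(amount: float, arms: List[Arm], default: str) -> str:
    """Return the result of the first arm whose condition holds, else the default."""
    for condition, result in arms:
        if condition(amount):
            return result
    return default

specific_first: List[Arm] = [
    (lambda a: a > 10000, "enterprise"),
    (lambda a: a > 1000, "business"),
]
general_first: List[Arm] = [
    (lambda a: a > 0, "personal"),
    (lambda a: a > 1000, "business"),
    (lambda a: a > 10000, "enterprise"),
]

print(match_first(50000, specific_first, "personal"))  # enterprise
print(match_first(50000, general_first, "unknown"))    # personal
```

The second call shows the shadowing problem: an amount of 50000 is labelled personal because the broad arm is checked first.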

Built-in Methods

CXL provides built-in scalar methods organized into categories. Methods are called on a receiver value using dot notation: receiver.method(args).

Null propagation

Most methods return null when the receiver is null. This means null values flow through method chains without causing errors. The exceptions are documented in Introspection & Debug.
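
A minimal Python model of null propagation through a chain, with None standing in for null. The helper names are hypothetical, and the sketch covers only the propagating methods, not the null-aware exceptions.

```python
from typing import Callable, Optional

def call(receiver: Optional[str], method: Callable[[str], str]) -> Optional[str]:
    """Apply method unless the receiver is None, which propagates unchanged."""
    if receiver is None:
        return None
    return method(receiver)

def trim_upper(value: Optional[str]) -> Optional[str]:
    # Models value.trim().upper(): each step short-circuits on None.
    return call(call(value, str.strip), str.upper)
```

Here trim_upper("  hello  ") gives "HELLO", while trim_upper(None) gives None without raising.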

Method categories

String Methods (24 methods)

Text manipulation: case conversion, trimming, padding, searching, splitting, regex matching.

Method	Description
`upper`, `lower`	Case conversion
`trim`, `trim_start`, `trim_end`	Whitespace removal
`starts_with`, `ends_with`, `contains`	Substring testing
`replace`	Find and replace
`substring`, `left`, `right`	Extraction
`pad_left`, `pad_right`	Padding
`repeat`, `reverse`	Repetition and reversal
`length`	Character count
`split`, `join`	Splitting and joining
`matches`, `find`, `capture`	Regex operations
`format`, `concat`	Formatting and concatenation

Numeric Methods (8 methods)

Rounding, clamping, and comparison for integers and floats.

Method	Description
`abs`	Absolute value
`ceil`, `floor`	Ceiling and floor
`round`, `round_to`	Rounding to decimal places
`clamp`	Constrain to range
`min`, `max`	Pairwise minimum/maximum

Date & Time Methods (13 methods)

Date component extraction, arithmetic, and formatting.

Method	Description
`year`, `month`, `day`	Date component extraction
`hour`, `minute`, `second`	Time component extraction (DateTime only)
`add_days`, `add_months`, `add_years`	Date arithmetic
`diff_days`, `diff_months`, `diff_years`	Date difference
`format_date`	Custom date formatting

Conversion Methods (13 methods)

Type conversion in strict (error on failure) and lenient (null on failure) variants.

Method	Description
`to_int`, `to_float`, `to_decimal`, `to_string`, `to_bool`	Strict conversion
`to_date`, `to_datetime`	Strict date parsing
`try_int`, `try_float`, `try_decimal`, `try_bool`	Lenient conversion
`try_date`, `try_datetime`	Lenient date parsing

Introspection & Debug (5 methods)

Type inspection, null checking, and debugging. These are the only methods that accept null receivers without propagating null.

Method	Description
`type_of`	Returns the type name as a string
`is_null`	Tests for null
`is_empty`	Tests for empty string, empty array, or null
`catch`	Null fallback (equivalent to `??`)
`debug`	Passthrough with tracing side effect

Path Methods (5 methods)

File path component extraction.

Method	Description
`file_name`	Full filename with extension
`file_stem`	Filename without extension
`extension`	File extension
`parent`	Parent directory path
`parent_name`	Parent directory name

Array Methods

Traversal and transformation over nested arrays. Closure-bearing methods take an arrow-syntax closure and evaluate it per element.

Method	Description
`filter`, `map`, `find`, `any`, `flat_map`	Closure-bearing traversal
`remove`	Drop the element at a given index
`length`, `join`	Cross-listed on arrays (also defined on strings)

Map Methods

Builders and accessors for Value::Map payloads. All map methods return new maps – they never mutate the receiver.

Method	Description
`keys`, `values`	List map keys / values as arrays
`merge`	Union of two maps (right wins on conflict)
`set`	Insert / replace an entry, by single key or by a nested `a.b[0].c` path
`remove_field`	Drop a single entry by top-level key
`unset`	Delete an entry by single key or by a nested `a.b[0].c` path (array index removes-and-shifts; missing path is a no-op)

String Methods

CXL provides 24 built-in methods for string manipulation. All string methods return null when the receiver is null (null propagation).

Case conversion

upper()

Converts all characters to uppercase.

$ cxl eval -e 'emit result = "hello world".upper()'

{
  "result": "HELLO WORLD"
}

lower()

Converts all characters to lowercase.

$ cxl eval -e 'emit result = "Hello World".lower()'

{
  "result": "hello world"
}

Whitespace trimming

trim()

Removes leading and trailing whitespace.

$ cxl eval -e 'emit result = "  hello  ".trim()'

{
  "result": "hello"
}

trim_start()

Removes leading whitespace only.

$ cxl eval -e 'emit result = "  hello  ".trim_start()'

{
  "result": "hello  "
}

trim_end()

Removes trailing whitespace only.

$ cxl eval -e 'emit result = "  hello  ".trim_end()'

{
  "result": "  hello"
}

Substring testing

starts_with(prefix: String) -> Bool

Tests whether the string starts with the given prefix.

$ cxl eval -e 'emit result = "hello world".starts_with("hello")'

{
  "result": true
}

ends_with(suffix: String) -> Bool

Tests whether the string ends with the given suffix.

$ cxl eval -e 'emit result = "report.csv".ends_with(".csv")'

{
  "result": true
}

contains(substring: String) -> Bool

Tests whether the string contains the given substring.

$ cxl eval -e 'emit result = "hello world".contains("lo wo")'

{
  "result": true
}

Find and replace

replace(find: String, replacement: String) -> String

Replaces all occurrences of find with replacement.

$ cxl eval -e 'emit result = "foo-bar-baz".replace("-", "_")'

{
  "result": "foo_bar_baz"
}

Extraction

substring(start: Int [, length: Int]) -> String

Extracts a substring starting at start (0-based character index). If length is provided, takes at most that many characters. If omitted, takes all remaining characters.

$ cxl eval -e 'emit result = "hello world".substring(6)'

{
  "result": "world"
}

$ cxl eval -e 'emit result = "hello world".substring(0, 5)'

{
  "result": "hello"
}

left(n: Int) -> String

Returns the first n characters.

$ cxl eval -e 'emit result = "hello world".left(5)'

{
  "result": "hello"
}

right(n: Int) -> String

Returns the last n characters.

$ cxl eval -e 'emit result = "hello world".right(5)'

{
  "result": "world"
}

Padding

pad_left(width: Int [, char: String]) -> String

Left-pads the string to the given width. Default pad character is a space.

$ cxl eval -e 'emit result = "42".pad_left(5, "0")'

{
  "result": "00042"
}

$ cxl eval -e 'emit result = "hi".pad_left(6)'

{
  "result": "    hi"
}

pad_right(width: Int [, char: String]) -> String

Right-pads the string to the given width. Default pad character is a space.

$ cxl eval -e 'emit result = "hi".pad_right(6, ".")'

{
  "result": "hi...."
}

Repetition and reversal

repeat(n: Int) -> String

Repeats the string n times.

$ cxl eval -e 'emit result = "ab".repeat(3)'

{
  "result": "ababab"
}

reverse() -> String

Reverses the characters in the string.

$ cxl eval -e 'emit result = "hello".reverse()'

{
  "result": "olleh"
}

Length

length() -> Int

Returns the number of characters in the string. Also works on arrays, returning the number of elements.

$ cxl eval -e 'emit result = "hello".length()'

{
  "result": 5
}

Splitting and joining

split(delimiter: String) -> Array

Splits the string by the delimiter, returning an array of strings.

$ cxl eval -e 'emit result = "a,b,c".split(",")'

{
  "result": ["a", "b", "c"]
}

join(delimiter: String) -> String

Joins an array of values into a string with the given delimiter. The receiver must be an array.

$ cxl eval -e 'emit result = "a,b,c".split(",").join(" - ")'

{
  "result": "a - b - c"
}

Regex operations

matches(pattern: String) -> Bool

Tests whether the string fully matches the given regex pattern.

$ cxl eval -e 'emit result = "abc123".matches("^[a-z]+[0-9]+$")'

{
  "result": true
}

find(pattern: String) -> Bool

Tests whether the string contains a substring matching the given regex pattern (partial match).

$ cxl eval -e 'emit result = "hello world 42".find("[0-9]+")'

{
  "result": true
}

capture(pattern: String [, group: Int]) -> String

Extracts a capture group from the first regex match. Default group is 0 (the full match).

$ cxl eval -e 'emit result = "order-12345".capture("order-([0-9]+)", 1)'

{
  "result": "12345"
}
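
The difference between the three regex methods can be illustrated with Python's re module. This is an analogy rather than the CXL engine: Python's regex dialect may differ in details, and returning None when capture finds no match is an assumption of the sketch.

```python
import re
from typing import Optional

def matches(text: str, pattern: str) -> bool:
    """Whole-string match, as described for matches()."""
    return re.fullmatch(pattern, text) is not None

def find(text: str, pattern: str) -> bool:
    """Partial match anywhere in the string, as described for find()."""
    return re.search(pattern, text) is not None

def capture(text: str, pattern: str, group: int = 0) -> Optional[str]:
    """Return the requested group of the first match, or None if nothing matches."""
    found = re.search(pattern, text)
    return found.group(group) if found else None
```

In this model matches("abc123", "[a-z]+") is False because the digits are left over, while find("abc123", "[a-z]+") is True.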

Formatting and concatenation

format(fmt: String) -> String

Formats the receiver value as a string.

$ cxl eval -e 'emit result = 42.format("")'

{
  "result": "42"
}

concat(args: String…) -> String

Concatenates the receiver with one or more string arguments. Null arguments are treated as empty strings.

$ cxl eval -e 'emit result = "hello".concat(" ", "world")'

{
  "result": "hello world"
}

This is variadic – it accepts any number of string arguments:

$ cxl eval -e 'emit result = "a".concat("b", "c", "d")'

{
  "result": "abcd"
}

Numeric Methods

CXL provides 8 built-in methods for numeric operations. These methods work on both Integer and Float values (the Numeric receiver type). All return null when the receiver is null.

abs, ceil, floor, round, and round_to also accept an exact decimal receiver and return an exact decimal result — d.round(2) rounds d to two fractional digits using banker’s rounding, staying in the decimal domain rather than converting to a float. The type checker infers decimal for these calls as well, so the result composes with other decimals without a cast: amount.round_to(2) + fee typechecks when both columns are decimal.

abs() -> Numeric

Returns the absolute value. Preserves the original type (Int stays Int, Float stays Float).

$ cxl eval -e 'emit result = (-42).abs()'

{
  "result": 42
}

$ cxl eval -e 'emit result = (-3.14).abs()'

{
  "result": 3.14
}

ceil() -> Int

Rounds up to the nearest integer. Returns the value unchanged for integers.

$ cxl eval -e 'emit result = 3.2.ceil()'

{
  "result": 4
}

$ cxl eval -e 'emit result = (-3.2).ceil()'

{
  "result": -3
}

floor() -> Int

Rounds down to the nearest integer. Returns the value unchanged for integers.

$ cxl eval -e 'emit result = 3.8.floor()'

{
  "result": 3
}

$ cxl eval -e 'emit result = (-3.2).floor()'

{
  "result": -4
}

round([decimals: Int]) -> Float

Rounds to the specified number of decimal places. Default is 0 decimal places. On a decimal receiver the result is a decimal (exact banker’s rounding), not a float.

$ cxl eval -e 'emit result = 3.456.round()'

{
  "result": 3.0
}

$ cxl eval -e 'emit result = 3.456.round(2)'

{
  "result": 3.46
}

round_to(decimals: Int) -> Float

Rounds to the specified number of decimal places. Unlike round(), the decimals argument is required. On a decimal receiver the result is a decimal (exact banker’s rounding), not a float.

$ cxl eval -e 'emit result = 3.14159.round_to(3)'

{
  "result": 3.142
}

Use round_to when you want to be explicit about precision in financial or scientific calculations:

$ cxl eval -e 'emit price = 19.995.round_to(2)'

{
  "price": 20.0
}

clamp(min: Numeric, max: Numeric) -> Numeric

Constrains the value to the given range. Returns min if the value is below it, max if above, or the value itself if within range.

$ cxl eval -e 'emit result = 150.clamp(0, 100)'

{
  "result": 100
}

$ cxl eval -e 'emit result = (-5).clamp(0, 100)'

{
  "result": 0
}

$ cxl eval -e 'emit result = 50.clamp(0, 100)'

{
  "result": 50
}
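
The clamp rule reduces to a min followed by a max, shown here as a Python sketch. The text does not say what happens when min exceeds max, so the sketch does not special-case it.

```python
def clamp(value, low, high):
    """Return low if value is below it, high if above, otherwise value."""
    return max(low, min(value, high))

print(clamp(150, 0, 100))  # 100
print(clamp(-5, 0, 100))   # 0
print(clamp(50, 0, 100))   # 50
```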

min(other: Numeric) -> Numeric

Returns the smaller of the receiver and the argument.

$ cxl eval -e 'emit result = 10.min(20)'

{
  "result": 10
}

$ cxl eval -e 'emit result = 10.min(5)'

{
  "result": 5
}

max(other: Numeric) -> Numeric

Returns the larger of the receiver and the argument.

$ cxl eval -e 'emit result = 10.max(20)'

{
  "result": 20
}

$ cxl eval -e 'emit result = 10.max(5)'

{
  "result": 10
}

Practical examples

Clamp a percentage:

emit pct = (completed / total * 100).clamp(0, 100).round_to(1)

Absolute difference:

emit diff = (actual - expected).abs()

Floor division for batch numbering:

emit batch = (row_number / 1000).floor()

Date & Time Methods

CXL provides 13 built-in methods for date and time manipulation. These methods work on Date and DateTime values. All return null when the receiver is null.

Component extraction

year() -> Int

Returns the year component.

$ cxl eval -e 'emit result = #2024-03-15#.year()'

{
  "result": 2024
}

month() -> Int

Returns the month component (1-12).

$ cxl eval -e 'emit result = #2024-03-15#.month()'

{
  "result": 3
}

day() -> Int

Returns the day-of-month component (1-31).

$ cxl eval -e 'emit result = #2024-03-15#.day()'

{
  "result": 15
}

hour() -> Int

Returns the hour component (0-23). DateTime only – returns null for Date values.

$ cxl eval -e 'emit result = "2024-03-15T14:30:00".to_datetime().hour()'

{
  "result": 14
}

minute() -> Int

Returns the minute component (0-59). DateTime only – returns null for Date values.

$ cxl eval -e 'emit result = "2024-03-15T14:30:00".to_datetime().minute()'

{
  "result": 30
}

second() -> Int

Returns the second component (0-59). DateTime only – returns null for Date values.

$ cxl eval -e 'emit result = "2024-03-15T14:30:45".to_datetime().second()'

{
  "result": 45
}

Date arithmetic

add_days(n: Int) -> Date

Adds n days to the date. Use negative values to subtract. Works on both Date and DateTime.

$ cxl eval -e 'emit result = #2024-01-15#.add_days(10)'

{
  "result": "2024-01-25"
}

$ cxl eval -e 'emit result = #2024-01-15#.add_days(-5)'

{
  "result": "2024-01-10"
}

add_months(n: Int) -> Date

Adds n months to the date. Day is clamped to the last day of the target month if necessary.

$ cxl eval -e 'emit result = #2024-01-31#.add_months(1)'

{
  "result": "2024-02-29"
}

$ cxl eval -e 'emit result = #2024-03-15#.add_months(-2)'

{
  "result": "2024-01-15"
}

add_years(n: Int) -> Date

Adds n years to the date. Leap day (Feb 29) is clamped to Feb 28 in non-leap years.

$ cxl eval -e 'emit result = #2024-02-29#.add_years(1)'

{
  "result": "2025-02-28"
}
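
The end-of-month clamping used by add_months and add_years can be sketched in Python with the standard calendar module. The functions below are a model of the documented behaviour for Date values, not the engine's code.

```python
import calendar
from datetime import date

def add_months(d: date, n: int) -> date:
    """Shift d by n months, clamping the day to the target month's last day."""
    month_index = d.year * 12 + (d.month - 1) + n
    year, month = divmod(month_index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

def add_years(d: date, n: int) -> date:
    """Shift d by n years; Feb 29 clamps to Feb 28 in non-leap years."""
    return add_months(d, 12 * n)

print(add_months(date(2024, 1, 31), 1))  # 2024-02-29
print(add_years(date(2024, 2, 29), 1))   # 2025-02-28
```

Counting months from a single index keeps negative offsets and year rollovers in one code path.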

Date difference

diff_days(other: Date) -> Int

Returns the number of days between the receiver and the argument (receiver - other). Positive when the receiver is later.

$ cxl eval -e 'emit result = #2024-03-15#.diff_days(#2024-03-01#)'

{
  "result": 14
}

$ cxl eval -e 'emit result = #2024-01-01#.diff_days(#2024-03-15#)'

{
  "result": -74
}
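
The sign convention (receiver minus argument) matches plain date subtraction, as this small Python check of the two examples shows. The function name mirrors the CXL method for readability only.

```python
from datetime import date

def diff_days(receiver: date, other: date) -> int:
    """Whole days from other to receiver; positive when receiver is later."""
    return (receiver - other).days

print(diff_days(date(2024, 3, 15), date(2024, 3, 1)))   # 14
print(diff_days(date(2024, 1, 1), date(2024, 3, 15)))   # -74
```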

diff_months(other: Date) -> Int

Returns the difference in months between two dates.

Note: This method currently returns null (unimplemented). Use diff_days and divide by 30 as an approximation.

diff_years(other: Date) -> Int

Returns the difference in years between two dates.

Note: This method currently returns null (unimplemented). Use diff_days and divide by 365 as an approximation.

Formatting

format_date(format: String) -> String

Formats the date/datetime using a chrono format string. See chrono format syntax.

Common format specifiers:

Specifier	Description	Example
`%Y`	4-digit year	`2024`
`%m`	2-digit month	`03`
`%d`	2-digit day	`15`
`%H`	Hour (24h)	`14`
`%M`	Minute	`30`
`%S`	Second	`00`
`%B`	Full month name	`March`
`%b`	Abbreviated month	`Mar`
`%A`	Full weekday	`Friday`

$ cxl eval -e 'emit result = #2024-03-15#.format_date("%B %d, %Y")'

{
  "result": "March 15, 2024"
}

$ cxl eval -e 'emit result = #2024-03-15#.format_date("%Y/%m/%d")'

{
  "result": "2024/03/15"
}

Practical examples

Fiscal year calculation (April start):

let d = invoice_date
emit fiscal_year = if d.month() < 4 then d.year() - 1 else d.year()
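
The same fiscal-year rule, written as a Python function to make the boundary explicit. It assumes the convention used in the snippet above, where a fiscal year is labelled by the calendar year in which it starts; the start_month parameter is a hypothetical generalisation.

```python
from datetime import date

def fiscal_year(invoice_date: date, start_month: int = 4) -> int:
    """Label the fiscal year by the calendar year in which it starts."""
    if invoice_date.month < start_month:
        return invoice_date.year - 1
    return invoice_date.year

print(fiscal_year(date(2024, 3, 31)))  # 2023
print(fiscal_year(date(2024, 4, 1)))   # 2024
```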

Age in days:

emit days_since = now.diff_days(created_date)

Quarter:

emit quarter = match {
  invoice_date.month() <= 3  => "Q1",
  invoice_date.month() <= 6  => "Q2",
  invoice_date.month() <= 9  => "Q3",
  _                          => "Q4"
}

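ISO week format (%G is the ISO week-based year that pairs with the %V week number; %Y is the calendar year and can disagree with it in the days around January 1):

emit formatted = order_date.format_date("%G-W%V")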
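
ISO week numbers belong to the ISO week-based year, which can differ from the calendar year near January 1. The Python sketch below uses date.isocalendar() to show the effect; the helper name is hypothetical.

```python
from datetime import date

def iso_week_label(d: date) -> str:
    """Format d as ISO week-based year and week number, e.g. 2025-W01."""
    iso_year, iso_week, _weekday = d.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"

d = date(2024, 12, 30)    # a Monday that opens ISO week 1 of 2025
print(iso_week_label(d))  # 2025-W01
print(d.year)             # 2024, so the calendar year differs
```

Pairing the calendar year with the ISO week number for this date would label it 2024-W01, which points at the wrong week.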

Conversion Methods

CXL provides two families of conversion methods: strict (7 methods) and lenient (6 methods). Strict conversions raise an error on failure, halting pipeline execution. Lenient conversions return null on failure, allowing graceful handling of dirty data.

All conversion methods accept any receiver type (Any).

Strict conversions

Use strict conversions for required fields where invalid data should halt processing.

to_int() -> Int

Converts the receiver to an integer. Errors on failure.

Float: truncates toward zero
String: parses as integer
Bool: true becomes 1, false becomes 0

$ cxl eval -e 'emit result = "42".to_int()'

{
  "result": 42
}

$ cxl eval -e 'emit result = 3.9.to_int()'

{
  "result": 3
}

to_float() -> Float

Converts the receiver to a float. Errors on failure.

Integer: promotes to float
String: parses as float

$ cxl eval -e 'emit result = "3.14".to_float()'

{
  "result": 3.14
}

$ cxl eval -e 'emit result = 42.to_float()'

{
  "result": 42.0
}

to_decimal() -> Decimal

Converts the receiver to an exact decimal. Errors on failure.

Integer: converts exactly
String: parses base-10 exactly ("19.99" → 19.99, never via a binary float)
Float: converts via the binary value — this is the one lossy direction, made explicit precisely because decimal * float is otherwise a type error

Use to_decimal() to bring a value into exact decimal arithmetic. JSON output renders a decimal as a scale-preserving string.

$ cxl eval -e 'emit result = "0.10".to_decimal() + "0.20".to_decimal()'

{
  "result": "0.30"
}
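
Why parsing base-10 text directly matters can be seen in Python, whose decimal module makes the same exact-versus-binary distinction. This illustrates the general numeric point and is not a description of Clinker's internals.

```python
from decimal import Decimal

# Parsed from text: exact, and the scale of the inputs is preserved.
exact = Decimal("0.10") + Decimal("0.20")
print(exact)                            # 0.30

# Binary floats cannot represent 0.1 or 0.2 exactly.
print(0.1 + 0.2 == 0.3)                 # False

# Converting a float carries its binary error into the decimal domain.
print(Decimal(0.1) == Decimal("0.1"))   # False
```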

to_string() -> String

Converts any value to its string representation. Never fails.

$ cxl eval -e 'emit result = 42.to_string()'

{
  "result": "42"
}

$ cxl eval -e 'emit result = true.to_string()'

{
  "result": "true"
}

to_bool() -> Bool

Converts the receiver to a boolean. Errors on failure.

String: "true", "1", "yes" become true; "false", "0", "no" become false (case-insensitive)
Integer: 0 is false, everything else is true

$ cxl eval -e 'emit result = "yes".to_bool()'

{
  "result": true
}

$ cxl eval -e 'emit result = 0.to_bool()'

{
  "result": false
}

to_date([format: String]) -> Date

Parses a string to a Date. Without a format argument, expects ISO 8601 (YYYY-MM-DD). With a format, uses chrono strftime syntax.

$ cxl eval -e 'emit result = "2024-03-15".to_date()'

{
  "result": "2024-03-15"
}

$ cxl eval -e 'emit result = "15/03/2024".to_date("%d/%m/%Y")'

{
  "result": "2024-03-15"
}

to_datetime([format: String]) -> DateTime

Parses a string to a DateTime. Without a format argument, expects ISO 8601 (YYYY-MM-DDTHH:MM:SS). With a format, uses chrono strftime syntax.

$ cxl eval -e 'emit result = "2024-03-15T14:30:00".to_datetime()'

{
  "result": "2024-03-15T14:30:00"
}

Lenient conversions

Use lenient conversions for optional or dirty data fields. They return null instead of raising errors, making them safe to combine with ?? for fallback values.

try_int() -> Int

Attempts to convert to integer. Returns null on failure.

$ cxl eval -e 'emit a = "42".try_int()' -e 'emit b = "abc".try_int()'

{
  "a": 42,
  "b": null
}
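
The strict and lenient families differ only in how failure surfaces. A Python sketch of the pair, assuming None stands in for null and an exception stands in for a halting error; Python's own integer parsing rules are used here and may accept inputs that CXL rejects.

```python
from typing import Optional

def to_int(raw: str) -> int:
    """Strict: raise ValueError on input that is not an integer."""
    return int(raw)

def try_int(raw: Optional[str]) -> Optional[int]:
    """Lenient: return None instead of raising."""
    if raw is None:
        return None
    try:
        return to_int(raw)
    except ValueError:
        return None

parsed = try_int("abc")
fallback = parsed if parsed is not None else 0  # models try_int() ?? 0
print(fallback)  # 0
```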

try_float() -> Float

Attempts to convert to float. Returns null on failure.

$ cxl eval -e 'emit a = "3.14".try_float()' -e 'emit b = "N/A".try_float()'

{
  "a": 3.14,
  "b": null
}

try_decimal() -> Decimal

Attempts to convert to an exact decimal. Returns null on failure.

$ cxl eval -e 'emit a = "19.99".try_decimal()' -e 'emit b = "N/A".try_decimal()'

{
  "a": "19.99",
  "b": null
}

try_bool() -> Bool

Attempts to convert to boolean. Returns null on failure.

$ cxl eval -e 'emit a = "yes".try_bool()' -e 'emit b = "maybe".try_bool()'

{
  "a": true,
  "b": null
}

try_date([format: String]) -> Date

Attempts to parse a string as a Date. Returns null on failure.

$ cxl eval -e 'emit a = "2024-03-15".try_date()' \
    -e 'emit b = "not a date".try_date()'

{
  "a": "2024-03-15",
  "b": null
}

try_datetime([format: String]) -> DateTime

Attempts to parse a string as a DateTime. Returns null on failure.

$ cxl eval -e 'emit a = "2024-03-15T14:30:00".try_datetime()' \
    -e 'emit b = "invalid".try_datetime()'

{
  "a": "2024-03-15T14:30:00",
  "b": null
}

When to use each

Strict conversions (to_*) for:

Required fields that must be valid
Schema-enforced data where bad input should halt the pipeline
Fields already validated upstream

Lenient conversions (try_*) for:

Optional fields that may be missing or malformed
Dirty data with mixed formats
Fields where a fallback value is acceptable

Practical patterns

Safe numeric parsing with fallback:

emit amount = raw_amount.try_float() ?? 0.0

Parse dates from multiple formats:

emit parsed = raw_date.try_date("%Y-%m-%d")
    ?? raw_date.try_date("%m/%d/%Y")
    ?? raw_date.try_date("%d-%b-%Y")
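
The fallback chain above tries each format in turn and keeps the first success. A Python sketch of the same idea using datetime.strptime follows; the helper names are hypothetical, and month-name parsing with %b depends on the active locale in Python.

```python
from datetime import date, datetime
from typing import Optional, Sequence

def try_date(raw: str, fmt: str) -> Optional[date]:
    """Return the parsed date, or None when raw does not fit fmt."""
    try:
        return datetime.strptime(raw, fmt).date()
    except ValueError:
        return None

def parse_first(raw: str, formats: Sequence[str]) -> Optional[date]:
    """Return the first successful parse, mirroring a chain of ?? fallbacks."""
    for fmt in formats:
        parsed = try_date(raw, fmt)
        if parsed is not None:
            return parsed
    return None

FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y")
```

Order matters when formats overlap: an ambiguous value such as 01/02/2024 is claimed by whichever day/month ordering appears first in the list.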

Strict conversion for required fields:

emit employee_id = raw_id.to_int()    # halts on bad data -- correct behavior
emit salary = raw_salary.to_float()   # must be numeric

Lenient conversion for optional fields:

emit bonus = raw_bonus.try_float()    # null if missing or non-numeric
emit total = salary + (bonus ?? 0.0)  # safe arithmetic

Introspection & Debug

CXL provides 4 introspection methods and 1 debug method. These are the only methods that accept null receivers without propagating null – they are designed specifically for inspecting and handling null values.

type_of() -> String

Returns the type name of the receiver as a string. Works on any value, including null.

Type name strings: "String", "Int", "Float", "Bool", "Date", "DateTime", "Null", "Array", "Map".

$ cxl eval -e 'emit a = 42.type_of()' -e 'emit b = "hello".type_of()' \
    -e 'emit c = null.type_of()'

{
  "a": "Int",
  "b": "String",
  "c": "Null"
}

Useful for branching on dynamic types:

emit formatted = match value.type_of() {
  "Int"   => value.to_string() + " (integer)",
  "Float" => value.round_to(2).to_string() + " (float)",
  _       => value.to_string()
}

is_null() -> Bool

Returns true if the receiver is null, false otherwise. This is the primary way to test for null values – it is NOT subject to null propagation.

$ cxl eval -e 'emit a = null.is_null()' -e 'emit b = 42.is_null()'

{
  "a": true,
  "b": false
}

Use in filter statements:

filter not field.is_null()

is_empty() -> Bool

Returns true for empty strings, empty arrays, or null values. Returns false for all other values.

$ cxl eval -e 'emit a = "".is_empty()' -e 'emit b = "hello".is_empty()' \
    -e 'emit c = null.is_empty()'

{
  "a": true,
  "b": false,
  "c": true
}

Useful for filtering out blank or missing records:

filter not name.is_empty()

catch(fallback: Any) -> Any

Returns the receiver if it is non-null, otherwise returns the fallback value. This is the method equivalent of the ?? operator.

$ cxl eval -e 'emit a = null.catch("default")' \
    -e 'emit b = "present".catch("default")'

{
  "a": "default",
  "b": "present"
}

catch and ?? are interchangeable:

# These two are equivalent:
emit name = raw_name.catch("Unknown")
emit name = raw_name ?? "Unknown"

debug(label: String) -> Any

Passes the receiver through unchanged while emitting a trace log with the given label. Zero overhead when tracing is disabled. The return value is always the receiver, making it safe to insert into any expression chain.

$ cxl eval -e 'emit result = 42.debug("check value")'

{
  "result": 42
}

Insert debug anywhere in a method chain for inspection without affecting the output:

emit total = price.debug("price")
    * qty.debug("qty")

When tracing is enabled, this produces log lines like:

TRACE source_row=1 source_file=input.csv: price: Integer(100)
TRACE source_row=1 source_file=input.csv: qty: Integer(5)

Null-safe summary

Method	Null receiver behavior
`type_of()`	Returns `"Null"`
`is_null()`	Returns `true`
`is_empty()`	Returns `true`
`catch(x)`	Returns `x`
`debug(l)`	Passes through `null`, logs it
All other methods	Return `null` (propagation)

Path Methods

CXL provides 5 built-in methods for extracting components from file path strings. All path methods take a string receiver and return a string. They return null when the receiver is null or when the requested component does not exist.

file_name() -> String

Returns the full filename (with extension) from the path.

$ cxl eval -e 'emit result = "/data/reports/sales.csv".file_name()'

{
  "result": "sales.csv"
}

file_stem() -> String

Returns the filename without the extension.

$ cxl eval -e 'emit result = "/data/reports/sales.csv".file_stem()'

{
  "result": "sales"
}

extension() -> String

Returns the file extension (without the leading dot).

$ cxl eval -e 'emit result = "/data/reports/sales.csv".extension()'

{
  "result": "csv"
}

Returns null when no extension is present:

$ cxl eval -e 'emit result = "/data/reports/README".extension()'

{
  "result": null
}

parent() -> String

Returns the parent directory path.

$ cxl eval -e 'emit result = "/data/reports/sales.csv".parent()'

{
  "result": "/data/reports"
}

parent_name() -> String

Returns just the name of the parent directory (not the full path).

$ cxl eval -e 'emit result = "/data/reports/sales.csv".parent_name()'

{
  "result": "reports"
}

Practical examples

Organize output by source directory:

emit source_dir = $pipeline.source_file.parent_name()
emit source_type = $pipeline.source_file.extension()

Extract file identifiers:

emit file_id = $pipeline.source_file.file_stem()
emit is_csv = $pipeline.source_file.extension() == "csv"

Route by file type:

let ext = input_path.extension()
emit format = match ext {
  "csv"  => "delimited",
  "json" => "structured",
  "xml"  => "markup",
  _      => "unknown"
}
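
For readers who want to predict results outside a pipeline, the five path methods can be approximated with Python's pathlib. This is a sketch under stated assumptions: the function names mirror the CXL methods but are defined here for illustration, None stands in for null, and edge cases not shown in this section (relative paths, names with several dots) follow pathlib and may differ from Clinker.

```python
from pathlib import PurePosixPath

def _component(path, pick):
    # A null receiver or a missing component both yield None (CXL null).
    if path is None:
        return None
    value = pick(PurePosixPath(path))
    return value if value else None

def file_name(path):
    return _component(path, lambda p: p.name)

def file_stem(path):
    return _component(path, lambda p: p.stem)

def extension(path):
    # Drop the leading dot; a file with no extension yields None.
    return _component(path, lambda p: p.suffix[1:])

def parent(path):
    return _component(path, lambda p: str(p.parent))

def parent_name(path):
    return _component(path, lambda p: p.parent.name)

print(file_stem("/data/reports/sales.csv"))    # sales
print(extension("/data/reports/README"))       # None
print(parent_name("/data/reports/sales.csv"))  # reports
```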

Array Methods

CXL provides closure-bearing and non-closure array builtins for traversing and transforming nested arrays carried on a single record. The closure-bearing methods take an arrow-syntax closure and evaluate it once per element.

Null propagation

Every array method returns null when the receiver is null. The closure body is not invoked on a null receiver.

Closure-bearing methods

filter(it => Bool) -> Array

Returns a new array containing the elements for which the closure body evaluates to true.

- type: transform
  name: filter_items
  input: orders
  config:
    cxl: |
      emit kept = items.filter(it => it["price"] > 5)

For an input record where items is [{"sku":"a","price":10},{"sku":"b","price":20},{"sku":"c","price":5}], kept is [{"sku":"a","price":10},{"sku":"b","price":20}].

map(it => T) -> Array

Returns a new array whose elements are the closure body’s value for each input element. The element type need not match the input element type.

    cxl: |
      emit skus = items.map(it => it["sku"])
      emit doubled_prices = items.map(it => it["price"] * 2)

skus is ["a", "b", "c"]; doubled_prices is [20, 40, 10].

find(it => Bool) -> Element | Null

Returns the first element for which the closure body evaluates to true. Returns null if no element matches.

    cxl: |
      emit first_premium = items.find(it => it["price"] > 15)

first_premium is {"sku":"b","price":20} for the running example.

any(it => Bool) -> Bool

Returns true if the closure body evaluates to true for at least one element. Returns false if no element matches (including on an empty array).

    cxl: |
      emit has_cheap = items.any(it => it["price"] < 10)

has_cheap is true.
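
The four methods above can be checked against the running example with a small Python model. It is a sketch, not Clinker code: the cxl_ prefixed names are hypothetical, Python lambdas stand in for it => ... closures, and None stands in for null.

```python
# Running example from this section.
ITEMS = [
    {"sku": "a", "price": 10},
    {"sku": "b", "price": 20},
    {"sku": "c", "price": 5},
]

def cxl_filter(arr, body):
    # Null receiver: return null without calling the body.
    return None if arr is None else [it for it in arr if body(it) is True]

def cxl_map(arr, body):
    return None if arr is None else [body(it) for it in arr]

def cxl_find(arr, body):
    if arr is None:
        return None
    # First matching element, or None when nothing matches.
    return next((it for it in arr if body(it) is True), None)

def cxl_any(arr, body):
    if arr is None:
        return None
    return any(body(it) is True for it in arr)

print(cxl_map(ITEMS, lambda it: it["sku"]))          # ['a', 'b', 'c']
print(cxl_find(ITEMS, lambda it: it["price"] > 15))  # {'sku': 'b', 'price': 20}
print(cxl_any([], lambda it: True))                  # False
```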

flat_map(it => Array) -> Array

Like map, but the closure body returns an array per input element; the results are concatenated into a single flat array. A null body result contributes no elements; a non-array body result contributes a single element.

    cxl: |
      emit all_tags = items.flat_map(it => it["tags"])

For input items carrying tags arrays (e.g. [{"sku":"a","tags":["new"]},{"sku":"b","tags":["sale","new"]}]), all_tags is ["new","sale","new"].
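
The three cases for a body result (array, null, anything else) are easiest to see side by side. The following Python function is a minimal model of the rule as described above; the name flat_map is reused for readability and None stands in for null.

```python
def flat_map(arr, body):
    """Model of flat_map: concatenate per-element results into one array."""
    if arr is None:
        return None              # null receiver propagates
    out = []
    for it in arr:
        result = body(it)
        if result is None:
            continue             # null body result contributes nothing
        if isinstance(result, list):
            out.extend(result)   # array result is spliced in
        else:
            out.append(result)   # non-array result contributes one element
    return out

items = [
    {"sku": "a", "tags": ["new"]},
    {"sku": "b", "tags": ["sale", "new"]},
    {"sku": "c"},                # no tags: the lookup yields None
]
print(flat_map(items, lambda it: it.get("tags")))  # ['new', 'sale', 'new']
print(flat_map(items, lambda it: it["sku"]))       # ['a', 'b', 'c']
```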

Non-closure methods

remove(index: Int) -> Array

Returns a new array with the element at the given 0-based index removed. The original array is unchanged.

    cxl: |
      emit shifted = items.remove(1)

shifted is [{"sku":"a","price":10},{"sku":"c","price":5}] – index 0 is preserved, index 2 shifts down to index 1.

If the index is negative or out of range, remove returns the receiver array unchanged.

length() -> Int

Returns the number of elements in the array. length is also defined on strings (see String Methods).

    cxl: |
      emit item_count = items.length()

item_count is 3.

join(separator: String) -> String

Joins an array of values into a single string with the given separator between elements. Defined as a string method (see String Methods) but accepts array receivers.

    cxl: |
      emit sku_list = items.map(it => it["sku"]).join(", ")

sku_list is "a, b, c".

Bracket indexing vs `.remove`

Bracket indexing (items[0]) reads an element by position and returns null when out of range. .remove(idx) returns a new array with the element dropped; out-of-range indices leave the array unchanged. See Nested Paths for the index-access surface.
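
The contrast can be sketched in Python. The helper names index and remove are hypothetical stand-ins for items[i] and items.remove(i), None stands in for null, and the negative-index behavior of bracket indexing follows the Nested Paths page.

```python
def index(arr, i):
    # Read: out-of-range and negative positions yield null.
    if arr is None or i < 0 or i >= len(arr):
        return None
    return arr[i]

def remove(arr, i):
    # Drop: returns a new array; a bad index leaves the contents unchanged.
    if arr is None:
        return None
    if i < 0 or i >= len(arr):
        return list(arr)
    return arr[:i] + arr[i + 1:]

skus = ["a", "b", "c"]
print(index(skus, 1))    # b
print(index(skus, 9))    # None
print(remove(skus, 1))   # ['a', 'c']
print(remove(skus, 9))   # ['a', 'b', 'c']
print(skus)              # ['a', 'b', 'c'] (the receiver is never mutated)
```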

Map Methods

CXL provides six built-in methods for working with map values (key-value pairs). Maps arise naturally from JSON object inputs, from the set builder below, and from upstream emits that produce nested structures.

All map methods return new values – they never mutate the receiver. This is copy-on-write semantics: chaining .set then .remove_field produces a fresh map at each step, leaving the upstream binding untouched.

Null propagation

Every map method returns null when the receiver is null or is not a Value::Map.

Method reference

keys() -> Array

Returns the map’s keys as an array of strings, preserving insertion order.

- type: transform
  name: list_keys
  input: rows
  config:
    cxl: |
      emit field_names = profile.keys()

For an input record where profile is {"name":"Alice","tier":"gold","since":"2021-04"}, field_names is ["name","tier","since"].

values() -> Array

Returns the map’s values as an array, preserving insertion order. Value types are heterogeneous – the array carries each value as-is.

    cxl: |
      emit field_values = profile.values()

field_values is ["Alice","gold","2021-04"].

merge(other: Map) -> Map

Returns a new map containing every key from the receiver and from other. On conflicting keys, other’s value wins.

    cxl: |
      emit enriched = profile.merge(overrides)

For profile = {"name":"Alice","tier":"gold"} and overrides = {"tier":"platinum","since":"2021-04"}, enriched is {"name":"Alice","tier":"platinum","since":"2021-04"}.
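
A Python dict merge happens to behave the same way on this example, which makes it a convenient mental model. The sketch assumes other is a map (the behavior for a null argument is not covered here) and uses a hypothetical merge helper.

```python
def merge(receiver, other):
    # Null or non-map receiver propagates null.
    if not isinstance(receiver, dict):
        return None
    # New map: receiver keys keep their position, other's values win.
    return {**receiver, **other}

profile = {"name": "Alice", "tier": "gold"}
overrides = {"tier": "platinum", "since": "2021-04"}
enriched = merge(profile, overrides)
print(enriched)   # {'name': 'Alice', 'tier': 'platinum', 'since': '2021-04'}
print(profile)    # {'name': 'Alice', 'tier': 'gold'}
```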

set(key: String, value: Any) -> Map

Returns a new map with key set to value. If the key was already present, its value is replaced; insertion order is preserved.

    cxl: |
      emit stamped = profile.set("region", "us-east")

stamped is {"name":"Alice","tier":"gold","since":"2021-04","region":"us-east"}.

Nested paths

key may be a dotted/indexed path that descends into nested maps and arrays, so a single set writes into a deep document. Dots separate map keys; a [n] suffix indexes an array.

    cxl: |
      emit moved = profile.set("address.city", "NYC")
      emit relabel = order.set("items[0].sku", "A-100")

Auto-create. Missing intermediate map segments are created as empty maps, so a path can build structure that does not yet exist. {}.set("a.b.c", 7) returns {"a":{"b":{"c":7}}}. This is what lets set assemble a nested document from scratch (matching jq setpath and Bloblang assignment).
Type conflict -> null. If an intermediate segment already exists but is the wrong kind for the next step – descending into a key whose value is a scalar, indexing a map with [n], or naming a field on an array – the whole operation returns null. Nothing is partially written.
Array index past the end -> null. Indexing past the last element returns null for the whole operation; arrays are never silently grown. The path can only overwrite an array slot that already exists.
A bare key is a single key, not a path. "region" writes the top-level region. Only . and [n] introduce nesting; a key with neither behaves exactly as before.

For profile = {"name":"Alice","address":{"city":"LA"}}, profile.set("address.city", "NYC") is {"name":"Alice","address":{"city":"NYC"}} – the sibling name and any other address keys are preserved.
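
The four rules above (auto-create, type conflict, index past the end, bare key) can be modeled in a few lines of Python. This is a sketch of the documented behavior, not the engine's code: parse_path and set_path are hypothetical names, None stands in for null, and malformed paths or empty segments are not handled.

```python
import copy
import re

_TOKEN = re.compile(r"\[(\d+)\]|([^.\[\]]+)")

def parse_path(key):
    """Split 'items[0].sku' into ['items', 0, 'sku'] (no escape handling)."""
    if "." not in key and "[" not in key:
        return [key]                       # bare key: one top-level segment
    segments = [int(i) if i else name for i, name in _TOKEN.findall(key)]
    return segments or [key]

def set_path(receiver, key, value):
    if not isinstance(receiver, dict):
        return None                        # null or non-map receiver
    root = copy.deepcopy(receiver)         # copy-on-write: input is untouched
    *parents, last = parse_path(key)
    node = root
    for seg in parents:
        if isinstance(seg, int):
            if not isinstance(node, list) or seg >= len(node):
                return None                # type conflict or index past end
            node = node[seg]
        else:
            if not isinstance(node, dict):
                return None                # field segment against a non-map
            node = node.setdefault(seg, {})  # auto-create missing maps
    if isinstance(last, int):
        if not isinstance(node, list) or last >= len(node):
            return None                    # arrays are never grown
    elif not isinstance(node, dict):
        return None
    node[last] = value
    return root

print(set_path({}, "a.b.c", 7))            # {'a': {'b': {'c': 7}}}
print(set_path({"a": 1}, "a.b", 2))        # None
```

Because the whole write happens on a deep copy that is returned only on success, a failed path leaves nothing partially written, which is the property the type-conflict rule describes.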

Known limitation. Because . and [ are path syntax, set cannot yet target a key whose name literally contains a . or [ (for example a JSON field literally named "a.b"). Column names solve this with a backslash escape — a\.b is one segment named a.b, per Field Paths — and the decided direction is for set / unset keys to read that same grammar, since both are flat strings addressing a path. Until they do: to write such a key, build it with merge and a map literal; to remove it, use remove_field, which matches the exact key string.

remove_field(key: String) -> Map

Returns a new map without key. If the key was absent, the receiver is returned unchanged.

    cxl: |
      emit slim = profile.remove_field("since")

slim is {"name":"Alice","tier":"gold"}.

unset(key: String) -> Map

Returns a new map with the entry addressed by key removed. unset is the deletion counterpart to set and reuses the same dotted/indexed path grammar: dots separate map keys, a [n] suffix indexes an array. A bare key (no . or [n]) drops a top-level entry, exactly like remove_field.

    cxl: |
      emit pruned = profile.unset("address.city")
      emit dropped = order.unset("items[0]")

Array element removes and shifts. unset("items[0]") deletes element 0 and shifts the remaining elements down, so the array shrinks by one (matching jq del). This is deliberately distinct from set("items[0]", null), which leaves a null hole — unset means delete.
Missing or conflicting path is a no-op. A path that does not resolve — a missing intermediate, a missing final key, an array index past the end, or a type conflict (a field segment against an array, an index segment against a map) — returns the receiver unchanged. This mirrors remove_field on an absent key, and is the opposite of set, which returns null on a conflicting path.
Copy-on-write. Like every map method, unset never mutates the receiver; the upstream binding is untouched.

For profile = {"name":"Alice","address":{"city":"LA","zip":"90001"}}, profile.unset("address.city") is {"name":"Alice","address":{"zip":"90001"}} – the sibling zip and the top-level name are preserved.
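
The asymmetry with set (no-op instead of null on a path that does not resolve) shows up clearly in a Python model. As with the other sketches, the names are hypothetical, None stands in for null, and escapes or malformed paths are not handled.

```python
import copy
import re

_TOKEN = re.compile(r"\[(\d+)\]|([^.\[\]]+)")
_MISSING = object()

def parse_path(key):
    """Split 'items[0].sku' into ['items', 0, 'sku'] (no escape handling)."""
    if "." not in key and "[" not in key:
        return [key]
    segments = [int(i) if i else name for i, name in _TOKEN.findall(key)]
    return segments or [key]

def _step(node, seg):
    """Resolve one segment, or _MISSING when it does not resolve."""
    if isinstance(seg, int):
        ok = isinstance(node, list) and seg < len(node)
    else:
        ok = isinstance(node, dict) and seg in node
    return node[seg] if ok else _MISSING

def unset_path(receiver, key):
    if not isinstance(receiver, dict):
        return None                      # null or non-map receiver
    root = copy.deepcopy(receiver)       # copy-on-write
    *parents, last = parse_path(key)
    node = root
    for seg in parents:
        node = _step(node, seg)
        if node is _MISSING:
            return receiver              # unresolved path: no-op
    if _step(node, last) is _MISSING:
        return receiver
    del node[last]                       # list deletion shifts later elements down
    return root

print(unset_path({"items": ["a", "b", "c"]}, "items[0]"))  # {'items': ['b', 'c']}
print(unset_path({"a": 1}, "b.c"))                          # {'a': 1}
```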

Worked example: chained set + remove_field

Map methods compose naturally because each returns a new map.

- type: transform
  name: rewrite_profile
  input: rows
  config:
    cxl: |
      emit profile =
        profile.set("region", "us-east").remove_field("internal_id")

For profile = {"name":"Alice","internal_id":"ix-77","tier":"gold"}, the emitted profile is {"name":"Alice","tier":"gold","region":"us-east"}. The internal_id slot is removed and the region slot is appended; both happen on a fresh map so the upstream record’s profile is unaffected for any other downstream branch.

Parentheses are required

All map methods are method calls and must be written with parentheses, even the zero-argument ones:

profile.keys()         -- ok
profile.keys           -- parses as a field lookup, not a method call

profile.keys parses as a dotted path – a lookup for a field literally named keys inside profile. That path almost certainly returns null. Always include the parentheses when invoking a map method.

Using map methods inside array closures

Map methods compose with closure-bearing array builtins when the array elements are themselves maps.

    cxl: |
      emit enriched_items = items.map(it => it.set("region", "us-east"))
      emit item_keys = items.map(it => it.keys())

Each it is a map; the closure body invokes a map method on it. enriched_items is an array where every element gained a region field. item_keys is an array of key-name arrays, one per element.

Window Functions

Window functions allow CXL expressions to access aggregated values across a set of records within an analytic window. Unlike aggregate functions (which collapse groups into single rows), window functions attach computed values to each individual record.

Window functions are accessed via the $window.* namespace and require an analytic_window: configuration on the transform node.

Configuring an analytic window

Window functions are only available in transform nodes that declare an analytic_window: section in YAML:

nodes:
  - name: ranked_sales
    type: transform
    input: raw_sales
    analytic_window:
      group_by: [region]
      sort_by:
        - field: amount
          order: desc
    cxl: |
      emit region = region
      emit amount = amount
      emit running_total = $window.sum(amount)
      emit rank_position = $window.count()

Window configuration fields

Field	Description
`group_by`	List of fields to partition the window by (the SQL `PARTITION BY` axis).
`sort_by`	List of `{ field, order }` ordering specifications (`order` is `asc` or `desc`).
`source`	Optional explicit source-name reference for cross-source windows.
`on`	Optional cross-source partition-lookup field.

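Frame specification (frame: { rows: ... } / frame: { range: ... }) is not yet plumbed through the YAML parser; today every window evaluates with rows: unbounded_preceding..current_row semantics. This has the same extent as the SQL default frame for the listed window functions; note that the SQL default is RANGE-based rather than ROWS-based, so rows that tie on sort_by may see different running values than they would in SQL. See the deferred-work tracker for status of explicit frame syntax.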
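
To make the current frame concrete, the sketch below models a running sum and count in Python: partition by the group_by fields, order each partition by one sort field, then accumulate from the first row up to the current one. The function name, its parameters, and the output field names are hypothetical, and the order in which partitions are emitted is an assumption of the model rather than a statement about the engine.

```python
from itertools import groupby

def running_window(rows, group_by, sort_field, descending, value_field):
    """Attach running sum/count per partition (unbounded_preceding..current_row)."""
    key = lambda r: tuple(r[f] for f in group_by)
    out = []
    # Model assumption: partitions are visited in sorted key order.
    for _, part in groupby(sorted(rows, key=key), key=key):
        ordered = sorted(part, key=lambda r: r[sort_field], reverse=descending)
        total = 0
        for n, row in enumerate(ordered, start=1):
            total += row[value_field]    # frame ends at the current row
            out.append({**row, "running_total": total, "running_count": n})
    return out

sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 30},
]
for row in running_window(sales, ["region"], "amount", True, "amount"):
    print(row)
```

With the data above, the east partition yields running totals 30 then 40, and the single west row yields 5.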

Aggregate window functions

These compute aggregate values over the window frame.

$window.sum(field)

Sum of the field values in the window frame.

emit running_total = $window.sum(amount)

$window.avg(field)

Average of the field values in the window frame. Returns Float.

emit moving_avg = $window.avg(amount)

$window.min(field)

Minimum value in the window frame.

emit window_min = $window.min(amount)

$window.max(field)

Maximum value in the window frame.

emit window_max = $window.max(amount)

$window.count()

Count of records in the window frame. Takes no arguments.

emit window_size = $window.count()

$window.first_value(field)

Returns the value of field at the first record of the window frame (ordered by sort_by). Equivalent to SQL FIRST_VALUE(field).

emit opening_amount = $window.first_value(amount)

$window.last_value(field)

Returns the value of field at the last record of the window frame (ordered by sort_by). Equivalent to SQL LAST_VALUE(field).

emit closing_amount = $window.last_value(amount)

Ranking window functions

Zero-argument integer functions that return the current row’s rank within its partition.

$window.row_number()

1-indexed position of the current record within its partition.

emit row_idx = $window.row_number()

$window.rank()

SQL RANK(): rows that share the same sort_by tuple receive the same rank, and the rank of the next distinct row jumps by the size of the tie group. For example, sort keys 100, 90, 90, 80 rank as 1, 2, 2, 4.

emit sales_rank = $window.rank()

$window.dense_rank()

SQL DENSE_RANK(): ties share a rank with no gaps between distinct ranks. For example, sort keys 100, 90, 90, 80 rank as 1, 2, 2, 3.

emit sales_dense_rank = $window.dense_rank()
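
The three ranking functions differ only in how they treat ties, which a short Python model makes explicit. The function rankings is a hypothetical helper that takes one partition's sort_by values, already in window order, and returns the three numbers per row.

```python
def rankings(sort_keys):
    """Return (row_number, rank, dense_rank) for each row of one partition."""
    out = []
    rank = dense = 0
    prev = object()                  # sentinel that equals no real key
    for row_number, key in enumerate(sort_keys, start=1):
        if key != prev:
            rank = row_number        # skips ahead by the size of the last tie group
            dense += 1               # advances by exactly one per distinct key
            prev = key
        out.append((row_number, rank, dense))
    return out

for row in rankings([100, 90, 90, 80]):
    print(row)
# (1, 1, 1)
# (2, 2, 2)
# (3, 2, 2)
# (4, 4, 3)
```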

Positional window functions

These access specific records by position within the window frame.

$window.first()

Returns the value of the current field from the first record in the window frame.

emit first_amount = $window.first()

$window.last()

Returns the value of the current field from the last record in the window frame.

emit last_amount = $window.last()

$window.lag(n)

Returns the value from n records before the current record. Returns null if there is no record at that offset.

emit prev_amount = $window.lag(1)
emit two_back = $window.lag(2)

$window.lead(n)

Returns the value from n records after the current record. Returns null if there is no record at that offset.

emit next_amount = $window.lead(1)
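
The offset arithmetic behind lag and lead can be sketched in Python by treating one partition as an ordered list of values. The helpers below are hypothetical and leave out how the engine selects which field the value is drawn from; None stands in for null.

```python
def lag(values, current, n):
    # Value n positions before the current row, or None past the start.
    i = current - n
    return values[i] if 0 <= i < len(values) else None

def lead(values, current, n):
    # Value n positions after the current row, or None past the end.
    i = current + n
    return values[i] if 0 <= i < len(values) else None

revenue = [100, 120, 90]
for pos, today in enumerate(revenue):
    previous = lag(revenue, pos, 1)
    # Mirrors `revenue - (lag ?? revenue)`: the first row falls back to itself.
    delta = today - (previous if previous is not None else today)
    print(pos, previous, lead(revenue, pos, 1), delta)
# 0 None 120 0
# 1 100 90 20
# 2 120 None -30
```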

Iterable window functions

These evaluate predicates or collect values across the window.

$window.any(predicate)

Returns true if the predicate is true for any record in the window.

emit has_high = $window.any(amount > 1000)

$window.every(predicate)

Returns true if the predicate is true for every record in the window.

emit all_positive = $window.every(amount > 0)

$window.exists(predicate)

Returns true if the predicate is true for at least one record in the window — a SQL-fluency alias of $window.any.

emit any_high = $window.exists(amount > 1000)

$window.not_exists(predicate)

Returns true if no record in the window satisfies the predicate. Equivalent to not $window.exists(predicate) and to $window.every(not predicate).

emit none_negative = $window.not_exists(amount < 0)
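
The equivalences among the four predicate functions can be checked with a small Python model. It assumes the predicate always evaluates to true or false (null predicate results are not modeled) and treats the window as a plain list of records; the window_ prefixed names are hypothetical.

```python
def window_any(window, pred):
    return any(pred(r) for r in window)

def window_every(window, pred):
    return all(pred(r) for r in window)

# exists is an alias of any.
window_exists = window_any

def window_not_exists(window, pred):
    return not window_exists(window, pred)

amounts = [{"amount": 50}, {"amount": 1200}, {"amount": 0}]
negative = lambda r: r["amount"] < 0

print(window_any(amounts, lambda r: r["amount"] > 1000))    # True
print(window_every(amounts, lambda r: r["amount"] > 0))     # False
print(window_not_exists(amounts, negative))                 # True
# Same answer via the every(not predicate) form:
print(window_every(amounts, lambda r: not negative(r)))     # True
```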

$window.collect(field)

Collects all values of the field in the window into an array.

emit all_amounts = $window.collect(amount)

$window.distinct(field)

Collects distinct values of the field in the window into an array.

emit unique_regions = $window.distinct(region)

$window.collect and $window.distinct emit arrays. The CSV, XML, and fixed-width writers reject an array-valued field, so route such a result to a JSON output, or coerce the emitted array to a scalar in a downstream Transform (for example emit regions = unique_regions.join(";")) before a tabular sink.

Complete example

nodes:
  - name: sales_analysis
    type: transform
    input: daily_sales
    analytic_window:
      group_by: [store_id]
      sort_by:
        - field: sale_date
          order: asc
    cxl: |
      emit store_id = store_id
      emit sale_date = sale_date
      emit daily_revenue = revenue
      emit running_avg = $window.avg(revenue)
      emit running_total = $window.sum(revenue)
      emit prev_day_revenue = $window.lag(1)
      emit day_over_day = revenue - ($window.lag(1) ?? revenue)

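This computes per-store running averages and totals: each row sees its store’s history up to and including the current row, because every window currently evaluates over unbounded_preceding..current_row. On the first row of each store, prev_day_revenue is null and day_over_day is 0, since the ?? fallback substitutes the row’s own revenue.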

Correlation-key error handling

Window functions work correctly when a pipeline uses correlation keys for group-atomic error handling: if records are retracted from an upstream group, the window recomputes the affected partitions so its output stays consistent. There is nothing to configure.

Aggregate Functions

Aggregate functions operate across grouped record sets in aggregate nodes, collapsing multiple input records into summary rows. They are distinct from window functions, which attach computed values to each individual record.

Aggregate functions

CXL provides 7 aggregate functions. These are called as free-standing function calls (not method calls) within the CXL block of an aggregate node.

Function	Argument types	Returns	Description
`sum(expr)`	Numeric	Numeric	Sum of values
`count(*)`	–	Int	Count of records in the group
`avg(expr)`	Numeric	Float	Arithmetic mean
`min(expr)`	Any	Any	Minimum value
`max(expr)`	Any	Any	Maximum value
`collect(expr)`	Any	Array	All values collected into an array
`weighted_avg(value, weight)`	Numeric, Numeric	Float / Decimal	Weighted arithmetic mean

YAML aggregate node

Aggregate functions are used inside the cxl: block of a node with type: aggregate. The node must declare group_by: fields.

nodes:
  - name: dept_summary
    type: aggregate
    input: employees
    group_by: [department]
    cxl: |
      emit total_salary = sum(salary)
      emit headcount = count(*)
      emit avg_salary = avg(salary)
      emit max_salary = max(salary)
      emit min_salary = min(salary)

Group-by fields pass through automatically

Fields listed in group_by: are automatically included in the output. You do NOT need to emit them – they are carried through as group keys.

In the example above, department is automatically present in every output record without an explicit emit department = department statement.
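
A Python model of the node shows both points at once: one output row per group, with the group key carried through without an explicit emit. It is a sketch with labeled assumptions. The function and output field names are hypothetical; null skipping for min and max, the results for a group whose values are all null, and the output order of groups are assumptions of the model, not documented behavior.

```python
def aggregate(rows, group_by, field):
    """Group rows, then compute sum/count/avg/max/min of `field` per group."""
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[k] for k in group_by), []).append(row)
    out = []
    for key, members in groups.items():
        # sum and avg skip nulls (documented); min/max skipping is assumed.
        values = [r[field] for r in members if r[field] is not None]
        record = dict(zip(group_by, key))   # group keys pass through
        record.update(
            total=sum(values),
            headcount=len(members),         # count(*) counts records
            average=sum(values) / len(values) if values else None,
            maximum=max(values, default=None),
            minimum=min(values, default=None),
        )
        out.append(record)
    return out

employees = [
    {"department": "eng", "salary": 100},
    {"department": "eng", "salary": None},
    {"department": "eng", "salary": 200},
    {"department": "ops", "salary": 50},
]
for row in aggregate(employees, ["department"], "salary"):
    print(row)
```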

Function details

sum(expr) -> Numeric

Computes the sum of the expression across all records in the group. Null values are skipped.

cxl: |
  emit total_revenue = sum(price * quantity)

count(*) -> Int

Counts the number of records in the group. The argument is the wildcard *.

cxl: |
  emit num_orders = count(*)

avg(expr) -> Float

Computes the arithmetic mean. Null values are skipped. Returns Float.

cxl: |
  emit avg_order_value = avg(order_total)

min(expr) -> Any

Returns the minimum value in the group. Works on numeric, string, and date types.

cxl: |
  emit earliest_order = min(order_date)
  emit lowest_price = min(unit_price)

max(expr) -> Any

Returns the maximum value in the group. Works on numeric, string, and date types.

cxl: |
  emit latest_order = max(order_date)
  emit highest_price = max(unit_price)

collect(expr) -> Array

Collects all values of the expression into an array. Useful for building lists of values per group.

cxl: |
  emit all_order_ids = collect(order_id)

Because collect emits an array, route the result to a JSON output (which serializes arrays natively) or coerce it to a scalar in a downstream Transform (for example emit ids = all_order_ids.join(";")). The CSV, XML, and fixed-width writers reject an array-valued field.

weighted_avg(value, weight) -> Float or Decimal

Computes a weighted average: sum(value * weight) / sum(weight). Takes two arguments.

cxl: |
  emit weighted_price = weighted_avg(unit_price, quantity)

Returns Float for int/float inputs. When either the value or the weight is a decimal, the result is an exact decimal computed entirely in the decimal domain (no binary float touches the running totals), at full division precision. A zero total weight returns null. Mixing a decimal with a binary float across the two arguments is a type error – cast with .to_decimal() or .to_float() so both share one numeric domain.
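
The domain rules can be illustrated with Python's float and decimal.Decimal types standing in for CXL's Float and Decimal. This is a sketch only: weighted_avg here is a plain Python function over (value, weight) pairs, null inputs are not modeled, and Python's Decimal context precision is not a claim about the precision Clinker uses.

```python
from decimal import Decimal

def weighted_avg(pairs):
    """pairs: iterable of (value, weight). Returns float, Decimal, or None."""
    pairs = list(pairs)
    kinds = {type(x) for pair in pairs for x in pair}
    if Decimal in kinds and float in kinds:
        # One numeric domain only: cast one side before calling.
        raise TypeError("cannot mix decimal and binary float arguments")
    # Seed the running totals in the domain the result should have.
    zero = Decimal(0) if Decimal in kinds else 0.0
    numerator = sum((v * w for v, w in pairs), zero)
    denominator = sum((w for _, w in pairs), zero)
    return None if denominator == 0 else numerator / denominator

print(weighted_avg([(10, 1), (20, 3)]))                        # 17.5
print(weighted_avg([(Decimal("1.10"), 2), (Decimal("2.20"), 2)]))  # 1.65
print(weighted_avg([(10, 0), (20, 0)]))                        # None
```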

Aggregates vs. windows

Feature	Aggregate node	Window function
Record output	One row per group	One row per input record
Syntax	`sum(field)` (free-standing)	`$window.sum(field)` (namespace)
Configuration	`type: aggregate` + `group_by:`	`type: transform` + `analytic_window:`
Use case	Summarize groups	Enrich records with group context

Combining aggregates with expressions

Aggregate function calls can be mixed with regular CXL expressions in emit statements:

nodes:
  - name: category_stats
    type: aggregate
    input: products
    group_by: [category]
    cxl: |
      emit total_revenue = sum(price * quantity)
      emit avg_price = avg(price)
      emit margin_pct = (sum(revenue) - sum(cost)) / sum(revenue) * 100
      emit product_count = count(*)
      emit has_premium = max(price) > 100

Restrictions

let bindings in aggregate transforms are restricted to row-pure expressions (no aggregate function calls in let).
filter in aggregate transforms runs pre-aggregation – it filters input records before grouping.
distinct is not permitted inside aggregate transforms. Place a separate distinct transform upstream.

Complete example

pipeline:
  name: sales_summary
  nodes:
    - name: raw_sales
      type: source
      format: csv
      path: sales.csv

    - name: monthly_summary
      type: aggregate
      input: raw_sales
      group_by: [region, month]
      cxl: |
        emit total_sales = sum(amount)
        emit order_count = count(*)
        emit avg_order = avg(amount)
        emit top_sale = max(amount)
        emit all_reps = collect(sales_rep)

    - name: output
      type: output
      input: monthly_summary
      format: json
      path: summary.json

This pipeline writes JSON because all_reps = collect(sales_rep) emits an array, which the CSV, XML, and fixed-width writers reject. To keep a CSV sink, drop the collect binding or coerce it to a scalar in a downstream Transform.

Closures

CXL supports arrow-syntax closures as arguments to closure-bearing array builtins like filter, map, find, any, and flat_map. They give CXL a way to express element-by-element predicates and projections over nested arrays carried inside a single record – without writing a separate transform node per element.

Syntax

it => expression

A closure has one parameter, named it, and a single expression body. The arrow => separates them.

- type: transform
  name: filter_items
  input: orders
  config:
    cxl: |
      emit kept = items.filter(it => it["price"] > 5)

The body is an expression, not a block of statements. Use if/then/else or match if you need branching inside a closure.

    cxl: |
      emit price_buckets = items.map(it =>
        if it["price"] >= 100 then "premium"
        else if it["price"] >= 10 then "standard"
        else "value")

Parameter name

The parameter is always it. Other identifiers are not accepted as the closure binding:

items.filter(item => item["price"] > 5)   -- parse error
items.filter(it => it["price"] > 5)       -- ok

it is recognized in expression position only inside a closure body. Outside of one, it has no special meaning.

Lexical capture

Inside the closure body, the outer record’s fields and let bindings remain visible. For each iteration the closure parameter it is bound to the current element, the body evaluates, and the binding is dropped before the next iteration.

    cxl: |
      let threshold = 10
      emit kept = items.filter(it => it["price"] > threshold)

Here the closure body reads both it (the current array element) and threshold (an outer let binding). The record’s fields are also reachable by name – a closure over items can still read customer_id, region, or any other field on the same record.
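
One way to picture the scoping is as a per-element environment layered over the outer one. The Python sketch below models that idea for filter; filter_with_scope is a hypothetical name, the environment is a plain dict of the record's fields and let bindings, and the body receives that dict instead of bare identifiers.

```python
def filter_with_scope(env, array, body):
    """Model of items.filter(it => body) with lexical capture."""
    if array is None:
        return None
    kept = []
    for element in array:
        # Outer fields and lets stay visible; `it` is added for this element.
        scope = {**env, "it": element}
        if body(scope) is True:
            kept.append(element)
        # `scope` is discarded here, so `it` never leaks into `env`.
    return kept

env = {
    "customer_id": "C-1",
    "threshold": 10,   # stands in for `let threshold = 10`
}
items = [{"sku": "a", "price": 10}, {"sku": "b", "price": 20}, {"sku": "c", "price": 5}]

kept = filter_with_scope(env, items, lambda s: s["it"]["price"] > s["threshold"])
print(kept)          # [{'sku': 'b', 'price': 20}]
print("it" in env)   # False
```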

Where closures appear

Closures are valid only as method-call arguments to closure-bearing builtins. They cannot be assigned to variables, stored in fields, or passed to non-closure builtins:

let f = it => it * 2          -- rejected at resolve time
emit doubler = it => it * 2   -- rejected at resolve time

If you need to share a closure across multiple call sites, repeat the literal closure expression. CXL has no first-class function values.

Null propagation

Closure-bearing builtins applied to a null receiver return null without evaluating the body, so records where the array is null never invoke the closure:

    cxl: |
      emit kept = items.filter(it => it["price"] > 5)
      -- when `items` is null, `kept` is null; the body never runs

This matches the default null-propagation policy for builtins (the null-safe methods such as is_null and catch are the exceptions) – see Null Handling for the wider rules.

Worked example: filter and map over a nested array

Suppose each input record carries an items array of objects, each with sku and price:

{"order_id":"O-1","items":[{"sku":"a","price":10},{"sku":"b","price":20},{"sku":"c","price":5}]}

A transform that drops cheap items and projects the remaining SKUs:

- type: transform
  name: filter_items
  input: orders
  config:
    cxl: |
      emit order_id = order_id
      emit kept = items.filter(it => it["price"] > 5)
      emit kept_skus = items.filter(it => it["price"] > 5).map(it => it["sku"])

For the input above, the transform produces:

{
  "order_id": "O-1",
  "kept": [{"sku": "a", "price": 10}, {"sku": "b", "price": 20}],
  "kept_skus": ["a", "b"]
}

Bracket-index access (it["price"]) reaches into each map element. See Nested Paths for the full traversal surface.

Nested Paths

CXL records can carry nested arrays and maps as field values (for example, a JSON input where each record has an items array of objects). Reaching into that structure uses two complementary forms: dotted paths and bracket indices.

Dotted paths

A dotted identifier path reads a static field name from a map.

doc.metadata.tenant

Each segment must be a valid identifier. Dotted paths are resolved at compile time – the typechecker walks the structure declared in the source schema and reports a missing-field error if any segment doesn’t exist.

- type: transform
  name: project_tenant
  input: events
  config:
    cxl: |
      emit tenant = doc.metadata.tenant
      emit user_id = doc.user.id

Use dotted paths for structures whose shape is fixed and known at authoring time.

Bracket indices

A bracket index reads a runtime-computed key. The receiver may be an array (integer index) or a map (string index).

items[0]
profile["name"]
items.map(it => it["sku"])

Bracket indices are dynamic – the index expression evaluates per record. The typechecker treats the result as Any and does not assert that the key is present.

Integer index on an array

- type: transform
  name: first_item
  input: orders
  config:
    cxl: |
      emit head = items[0]
      emit second = items[1]

For items = [{"sku":"a"},{"sku":"b"},{"sku":"c"}], head is {"sku":"a"} and second is {"sku":"b"}.

Out-of-range indices return null. Negative indices also return null (CXL does not support negative indexing).

String index on a map

    cxl: |
      emit name = profile["name"]
      emit tier = profile["tier"]

Missing keys return null – the lookup never raises an error. This is the same null-propagation policy closure builtins use on their receivers.

Mixing forms

The two forms compose in either order:

    cxl: |
      emit first_sku = items[0]["sku"]
      emit profile_email = users.profile["email"]

items[0]["sku"] is two bracket indices chained – an integer index against the array, then a string index against the resulting map. users.profile["email"] walks a dotted path to reach profile (a map field on users), then bracket-indexes into it for a runtime key.

Null propagation

Every nested-access form propagates null end-to-end. If the receiver is null, the result is null without evaluating the index expression:

    cxl: |
      emit sku = items[0]["sku"]
      -- when `items` is null, `sku` is null
      -- when `items[0]` is null, `sku` is also null

This matches the null behavior on dotted paths and on method-call receivers. Records with missing intermediate structure produce nulls in their derived fields rather than aborting the transform.
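
The end-to-end propagation can be modeled with a single Python helper that every bracket step goes through. The name index is hypothetical, None stands in for null, and returning null when indexing a scalar is an assumption of the sketch. The model takes an already-evaluated key, so it does not show the skipped evaluation of the index expression.

```python
def index(receiver, key):
    """Model of receiver[key] with null propagation."""
    if receiver is None:
        return None                       # null receiver short-circuits
    if isinstance(receiver, list):
        ok = isinstance(key, int) and 0 <= key < len(receiver)
        return receiver[key] if ok else None
    if isinstance(receiver, dict):
        return receiver.get(key)          # missing key yields null
    return None                           # assumption: scalars are not indexable

record = {"items": [{"sku": "a"}, None]}

print(index(index(record["items"], 0), "sku"))   # a
print(index(index(record["items"], 1), "sku"))   # None (items[1] is null)
print(index(index(None, 0), "sku"))              # None (items is null)
print(index(index(record["items"], 7), "sku"))   # None (out of range)
```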

Method calls on indexed values

A bracket-indexed expression is a regular value, so it composes with any method or further index:

    cxl: |
      emit head_sku_upper = items[0]["sku"].upper()
      emit cheap_skus = items.filter(it => it["price"] < 10).map(it => it["sku"])

The first chain reads a string out of nested structure and uppercases it. The second filters an array of maps by a numeric field and projects the SKU strings out.

Field Paths

A column name is not opaque. An unescaped . inside it separates path segments, so the column Address.City addresses the path Address → City. Readers produce such names when they flatten nested input, writers expand them back into nested output, and the rule for reading them is the same everywhere in Clinker — one grammar, not one per format.

This page is the reference for that grammar. The places it applies:

Column names declared in a source or output schema: section.
Column names a self-describing reader infers (the JSON reader flattens {"a":{"b":1}} to the column a.b; the XML reader flattens <a><b>1</b></a> the same way).
Column names a writer expands back into nesting — see Writing JSON and Writing XML.

It does not cover CXL expressions. Reaching into a value — a map or array held in one column — uses the expression forms on Nested Paths (profile.city, profile["a.b"], items[0]). The two are different surfaces over the same idea: a path is an ordered list of segments either way, but a column name spells it with . and a backslash escape, while an expression spells it with dots and brackets.

The rule

Reading a name left to right:

Input	Meaning
`.`	Ends the current segment, starts the next.
`\.`	A literal `.` inside the current segment.
`\\`	A literal `\`.
`\[`	A literal `[`.
`\` before anything else	An error. `\]`, `\t`, and a name ending in `\` are all rejected.
Any other character	A literal character of the current segment — including a bare `[`, and `]`, `@`, `$`, `/`, and whitespace.

So Address.City is two segments, and a\.b is one segment named a.b.

Three consequences worth stating outright:

An unrecognized escape is an error, never a literal. A \ must be followed by ., [, or \ — nothing else, and not the end of the name. C:\temp is rejected, because silently treating \t as the two characters \ and t would make the encoding ambiguous, and silently dropping the \ would rename the column without saying so. Write C:\\temp. The same applies to \], which is covered below.
Empty segments are real. a..b is three segments — a, the empty name, and b. An empty key is a value a document can genuinely carry, so the grammar does not reject it.
A name may nest at most 64 levels deep. This matches the depth at which the JSON reader stops flattening, so any name that reader produces is a name a writer can expand. The XML reader flattens with no depth bound, so an extraordinarily deep XML document can produce a name past the cap; writing it back out fails with a clear error rather than recursing without limit.
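The table and the three consequences above can be modeled in a few lines. This is an illustrative Python sketch of the rule as documented, not Clinker's own (Rust) implementation: `decode_column_name`, `ColumnNameError`, and `MAX_DEPTH` are hypothetical names, and reading the 64-level cap as "at most 64 segments" is an assumption.

```python
MAX_DEPTH = 64  # nesting cap from the text; "64 segments" is an assumed reading


class ColumnNameError(ValueError):
    """Hypothetical error type for a name the grammar rejects."""


def decode_column_name(name):
    """Split a column name into its path segments, reading left to right."""
    segments = []
    current = []
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == "\\":
            # A backslash must be followed by '.', '[' or '\'; nothing else.
            if i + 1 >= len(name):
                raise ColumnNameError(f"{name!r}: name ends in a backslash")
            nxt = name[i + 1]
            if nxt not in ".[\\":
                raise ColumnNameError(f"{name!r}: unrecognized escape \\{nxt}")
            current.append(nxt)
            i += 2
        elif ch == ".":
            # An unescaped dot closes the current segment, even an empty one.
            segments.append("".join(current))
            current = []
            i += 1
        else:
            # Everything else, including a bare '[' or ']', is literal.
            current.append(ch)
            i += 1
    segments.append("".join(current))
    if len(segments) > MAX_DEPTH:
        raise ColumnNameError(f"{name!r}: nests deeper than {MAX_DEPTH} levels")
    return segments


print(decode_column_name("Address.City"))  # ['Address', 'City']
print(decode_column_name(r"a\.b"))         # ['a.b']
print(decode_column_name("a..b"))          # ['a', '', 'b']
```

Under this sketch `C:\temp` and `a\[0\]` both raise, while `C:\\temp` decodes to the single segment `C:\temp`.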

Writing a literal bracket

[ is currently a literal character, so a column named a[0] works. It is nonetheless reserved: bracket indexing may later be given meaning inside a flat name, matching the [n] form CXL expressions already use. Writing \[ means “a literal [” today and will keep meaning exactly that. A bare [ is not guaranteed to.

So the column a[0] future-proofs as a\[0] — escaping only the opening bracket:

Spelling	Result
`a\[0]`	One segment named `a[0]`. Correct, and stable across the reserved-`[` change.
`a\[0\]`	Rejected — `\]` is not an escape.
`a[0]`	One segment named `a[0]` today, but a bare `[` is the form that is not guaranteed to keep that meaning.

Only the opening bracket is ever escaped. ] is never escaped and never needs to be: it would carry meaning only as the close of an unescaped [, so a literal ] standing on its own is already unambiguous. Escaping it would add noise without removing any ambiguity.

Writing \] is an error rather than a silently-accepted no-op, for the same reason C:\temp is: an escape that quietly meant nothing would let two different names decode to the same path, and one of them would be dropped. The error names the offending escape and the rule.

If a column name of yours contains [, escaping it now costs nothing and makes it future-proof.

Expansion on write

A writer that can express nesting rebuilds it from the decoded paths.

Grouping. Columns sharing a prefix collect into one container, positioned where that prefix first appeared — even when the schema interleaves them. Columns Address.City, name, Address.State write as:

{"Address":{"City":"Boston","State":"MA"},"name":"Ada"}

Absent children. A column omitted for a record (a null under preserve_nulls: false) contributes nothing, and a container whose every descendant is omitted emits no key at all rather than an empty object. That keeps the round trip honest: reading {"a":{}} produces no column, so writing no column must produce no "a".

Values are untouched. Expansion adds structure above a column’s value; a Value::Map or array held in that column still serializes as itself. A column Items.Item holding [1,2] writes as {"Items":{"Item":[1,2]}}.

No carve-outs. The rule reads the column-name string and nothing else, so it applies identically to engine-stamped columns. Under include_correlation_keys: true the column $ck.customer_id writes as {"$ck":{"customer_id":…}}.
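The grouping and absent-children rules can be sketched as a small Python function. It is a simplified model, not the writer's real code: `expand` and the `OMITTED` marker are hypothetical, the input is assumed to be already-decoded paths from a clash-free column set, and a container is positioned at its first non-omitted child.

```python
import json

OMITTED = object()  # hypothetical marker: column omitted for this record


def expand(columns):
    """Rebuild nesting from (path_segments, value) pairs in schema order."""
    root = {}
    for segments, value in columns:
        if value is OMITTED:
            # Contributes nothing, so no empty container is ever created.
            continue
        node = root
        for segment in segments[:-1]:
            # dicts keep insertion order, so a container stays where its
            # prefix first appeared even when the schema interleaves columns.
            node = node.setdefault(segment, {})
        node[segments[-1]] = value  # the value itself is written untouched
    return root


record = expand([
    (["Address", "City"], "Boston"),
    (["name"], "Ada"),
    (["Address", "State"], "MA"),
])
print(json.dumps(record, separators=(",", ":")))
# {"Address":{"City":"Boston","State":"MA"},"name":"Ada"}
```

Because the function looks only at the path segments, a column such as `$ck.customer_id` nests exactly like any other.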

When two names clash

Two columns can describe places that cannot both exist. The writer refuses the whole column set before writing a single byte, naming both columns — it never silently keeps one.

Columns	Why
`a` and `a.b`	`a` holds a value and is also the container `a.b` sits inside.
`a.b` and `a.b.c`	The same clash, one level down.
`a[b` and `a\[b`	Two spellings of the identical path.

a.b and a\.b do not clash: they address a → b and the single segment a.b. That is what the escape is for.

When a clash comes from a . you meant literally, the diagnostic offers the escaped spelling:

XML writer cannot expand this output's column names into nested output: field
names `a.b` and `a.b.c` cannot both be written: `a.b` holds a value and is also
the container `a.b.c` nests inside. Rename one of them, or — if the `.` in `a.b`
is part of the name rather than a nesting separator — declare it as `a\.b`.
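One way to picture the clash check: two decoded paths clash when one is a prefix of the other (identical paths included). The Python sketch below is an assumption about how such a check could look, with a hypothetical `find_clash` helper; it is not taken from the writer's source.

```python
def find_clash(paths):
    """Return the indices of the first two clashing paths, or None.

    `paths` holds decoded segment lists, one per column, in schema order.
    """
    for i, first in enumerate(paths):
        for j in range(i + 1, len(paths)):
            second = paths[j]
            shorter, longer = sorted((first, second), key=len)
            # A value cannot also be a container, and the same path
            # cannot be written twice.
            if longer[:len(shorter)] == shorter:
                return (i, j)
    return None


print(find_clash([["a"], ["a", "b"]]))    # (0, 1)
print(find_clash([["a", "b"], ["a.b"]]))  # None
```

The second call shows why `a.b` and `a\.b` coexist: after decoding they are the two-segment path a → b and the one-segment path `a.b`, and neither is a prefix of the other.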

Where the grammar does not reach yet

Two surfaces read column names without this grammar. Both are tracked, and both are stated here so the current behavior is a known position rather than a surprise:

Flat writers emit the raw name. CSV and fixed-width have no nesting to expand into, so they write the column name verbatim — a column declared a\.b appears as the literal CSV header a\.b, backslash included. Whether flat writers should emit the decoded name instead is a separate decision.
Readers join without escaping. The JSON and XML readers join flattened path segments with a plain ., so a source key that literally contains a . ({"a.b": 1}) arrives as the column a.b — indistinguishable from a nested {"a":{"b":1}}, and it writes back nested. Closing that means escaping each key as the reader joins it, tracked by issue 920.

set and unset in CXL also use a path grammar of their own that does not yet support escaping — see the known limitation on Map Methods.
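The reader-side gap is easy to reproduce in a model. The Python sketch below joins flattened keys with a plain `.`, as the text describes; `flatten` is a hypothetical helper, arrays are treated as leaf values by assumption, and the reader's depth bound is left out for brevity.

```python
def flatten(obj, prefix=()):
    """Flatten nested objects into columns, joining keys with a plain '.'."""
    columns = {}
    for key, value in obj.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            # An empty object contributes no column at all.
            columns.update(flatten(value, path))
        else:
            columns[".".join(path)] = value  # no escaping of '.' in keys
    return columns


print(flatten({"a": {"b": 1}}))  # {'a.b': 1}
print(flatten({"a.b": 1}))       # {'a.b': 1}
```

Both inputs produce the same column name, which is why a literal-dot key currently writes back as nested output.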

Emit Each

The emit each statement fans one input record into multiple output records – one per element of an array on the input. The body emits the fields each output record carries. A trailing outer modifier preserves the trigger row when the array is empty or null.

Syntax

emit each <binding> in <source> {
  <statements>
}

<binding> is the identifier the body uses to refer to the current array element. The conventional name is it (same as the closure parameter), but any identifier is accepted.
<source> is any expression producing an array. Typically a field reference on the input record.
The body is a block of let and emit statements that produce one output record per iteration.

Worked example

Suppose each input record carries an items array of objects, each with sku and price:

{"order_id":"O-1","items":[{"sku":"a","price":10},{"sku":"b","price":20},{"sku":"c","price":5}]}

A transform that fans each input into one record per item:

- type: transform
  name: explode
  input: orders
  config:
    cxl: |
      emit each it in items {
        emit order_id = order_id
        emit sku = it["sku"]
        emit price = it["price"]
      }

For the input above, the transform produces three output records:

{"order_id":"O-1","sku":"a","price":10}
{"order_id":"O-1","sku":"b","price":20}
{"order_id":"O-1","sku":"c","price":5}

The body reads both it (the current element) and order_id (an outer record field). Outer-record fields remain visible inside the body for every iteration.

Cardinality

If the source array has N elements, emit each produces exactly N output records. Empty array sources produce zero records. A null source also produces zero records – no DLQ entry, no error – mirroring the explode-on-null convention used elsewhere in CXL.

When fan-out nests, the cardinalities multiply: an outer array of M elements whose inner arrays have N elements each produces up to M×N records. The cumulative max_expansion cap bounds that product.

A non-array, non-null source raises a runtime type-mismatch error and routes the originating record to the DLQ.

Preserving the trigger row: `outer`

A trailing outer modifier switches emit each to its outer-join variant. The grammar is identical except for the keyword after the source:

emit each <binding> in <source> outer {
  <statements>
}

The only behavioral difference is what happens when the source is null or an empty array. Plain emit each drops the trigger row entirely (zero output records). The outer variant instead emits the trigger row once, with <binding> bound to null:

Source	`emit each ...`	`emit each ... outer`
3-element	3 records	3 records (identical)
empty array	0 records	1 record, binding = `null`
`null`	0 records	1 record, binding = `null`

This is the shape SQL engines spell LATERAL VIEW OUTER EXPLODE (Spark, Hive) or an outer UNNEST (DuckDB): “for each tag on this article emit a tagged row, but keep articles that have no tags.”

Using the worked example above with an order that carries no items:

{"order_id":"O-2","items":[]}

- type: transform
  name: explode_outer
  input: orders
  config:
    cxl: |
      emit each it in items outer {
        emit order_id = order_id
        emit sku = it["sku"]
        emit price = it["price"]
      }

produces a single record that keeps order_id while the per-item fields read through the null binding:

{"order_id":"O-2","sku":null,"price":null}

Outer-record fields (like order_id) and any emit statements preceding the block still apply to the preserved trigger row, so an outer row is never bare.

The source type rule is slightly wider than plain emit each: a statically-null source is accepted (it is the case the variant exists to handle), alongside arrays and Any. Everything else in this page — the cumulative max_expansion cap, the nesting rules, the body-statement restrictions — applies unchanged to the outer variant. The two variants compose freely: an outer block may nest inside a plain emit each block and vice versa.
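The single-level cardinality rules for both variants can be modeled compactly. This Python sketch mirrors the worked example; `bindings` and `explode` are hypothetical names, and raising `TypeError` stands in for the DLQ routing of a non-array, non-null source.

```python
def bindings(source, outer=False):
    """Return the values the binding takes for one trigger row."""
    if source is not None and not isinstance(source, list):
        # Non-array, non-null source: a runtime type mismatch in the text.
        raise TypeError("emit each source must be an array or null")
    if source:
        return list(source)         # N elements give N records
    return [None] if outer else []  # empty or null source


def explode(order, outer=False):
    """One output record per item, as in the worked example."""
    rows = []
    for it in bindings(order.get("items"), outer):
        item = {} if it is None else it  # reads through a null binding give null
        rows.append({
            "order_id": order["order_id"],
            "sku": item.get("sku"),
            "price": item.get("price"),
        })
    return rows


empty_order = {"order_id": "O-2", "items": []}
print(explode(empty_order))              # []
print(explode(empty_order, outer=True))  # [{'order_id': 'O-2', 'sku': None, 'price': None}]
```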

Output schema

The body’s emit statements define the output record’s field set, the same way emit does in a regular transform body. Fields the body does not emit fall under the Output node’s include_unmapped policy (see Output Nodes).

Fields written by the body shadow same-named fields on the originating input record.

Nested fan-out: fan-out within fan-out

An emit each body may itself contain emit each blocks — fan-out within fan-out for one trigger row. This is the canonical “for each article, for each section, for each tag, emit a row” shape:

emit each section in article["sections"] {
  emit each tag in section["tags"] {
    emit article_id = article_id
    emit section = section["name"]
    emit tag = tag
  }
}

For one input article, this produces one output record per (section, tag) pair. The inner binding (tag) reads the current inner element; the outer binding (section) and any outer-record field (article_id) stay visible inside the inner body. A field name reused as both an outer and inner binding shadows lexically — the inner binding wins inside the inner body, and the outer value is restored when the inner block finishes.

Emits are positional: an emit placed in the outer body before a nested block applies to every leaf record that block produces, but an emit placed after a nested block does not retroactively reach the records that block already emitted. Put the fields shared across leaves above the nested block.

Plain and outer blocks compose in any order. An inner plain emit each over an empty or null array contributes no records for that branch, while an inner emit each ... outer preserves one trigger row (inner binding bound to null) — exactly the per-level semantics from the single-level table, applied at each level.

Nesting is bounded to 32 levels so that adversarially deep input cannot exhaust the parser stack; legitimate document fan-out is only a few levels deep. Beyond that bound, parsing fails with a “nesting too deep” diagnostic.

The flat-array workaround (precompute a flattened array with .flat_map and use a single emit each) is still available and may be clearer for a simple two-level cartesian product, but is no longer required.

Body-statement restrictions

Within the body, let, emit, trace, and nested emit each / emit each ... outer are accepted. filter and distinct are rejected at evaluation time – a body filter would split work between branches the engine can’t represent. Move filter/distinct logic into a downstream transform, or pre-filter the source array with .filter before the emit each block.

Safety cap: `max_expansion`

By default, emit each can fan one input record into at most 10,000 output records. This limit is counted cumulatively across all nesting levels, so nested fan-out cannot multiply past it. A record that exceeds the limit routes to the DLQ with category expansion_limit_exceeded instead of producing an unbounded result. Override the limit with the max_expansion field in the transform config.

See Transform Nodes -> Expansion Cap for the YAML field and tuning guidance.
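To see why a cumulative count stops nested fan-out from multiplying past the cap, consider this Python model of the article/section/tag example. It is a sketch under assumptions: the counter is taken to count leaf output records, `ExpansionLimitExceeded` is a hypothetical stand-in for the `expansion_limit_exceeded` DLQ category, and what happens to rows produced before the cap trips is not modeled.

```python
class ExpansionLimitExceeded(Exception):
    """Hypothetical stand-in for the expansion_limit_exceeded DLQ category."""


def fan_out(article, max_expansion=10_000):
    """Nested fan-out for one trigger row with a single shared counter."""
    rows = []
    for section in article.get("sections") or []:
        for tag in section.get("tags") or []:
            # One counter for the whole trigger row, not one per level.
            if len(rows) == max_expansion:
                raise ExpansionLimitExceeded(
                    f"more than {max_expansion} records from one input record"
                )
            rows.append({
                "article_id": article["article_id"],
                "section": section["name"],
                "tag": tag,
            })
    return rows


article = {
    "article_id": "A-1",
    "sections": [{"name": f"s{i}", "tags": ["w", "x", "y", "z"]} for i in range(3)],
}
print(len(fan_out(article)))  # 12
```

With `max_expansion=11` the same article raises on its twelfth record, even though no single level has more than four elements.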

System Variables

CXL provides several system variable namespaces prefixed with $. These give CXL expressions access to pipeline execution context, user-defined variables, per-record metadata, and the current time.

$pipeline.* – Pipeline context

Pipeline variables are accessed via $pipeline.member_name. Some are frozen at pipeline start; others update per record.

Stable (frozen at pipeline start)

Variable	Type	Description
`$pipeline.name`	String	Pipeline name from YAML config
`$pipeline.execution_id`	String	UUID v7, unique per pipeline run
`$pipeline.batch_id`	String	From `--batch-id` CLI flag, or auto-generated UUID v7
`$pipeline.start_time`	DateTime	Frozen at pipeline start, deterministic within a run

$ cxl eval -e 'emit name = $pipeline.name' \
    -e 'emit exec = $pipeline.execution_id'

{
  "name": "cxl-eval",
  "exec": "00000000-0000-0000-0000-000000000000"
}

Counters

Variable	Type	Description
`$pipeline.total_count`	Int	Total records processed so far
`$pipeline.ok_count`	Int	Records that passed successfully
`$pipeline.dlq_count`	Int	Records sent to dead-letter queue
`$pipeline.filtered_count`	Int	Records excluded by `filter` statements
`$pipeline.distinct_count`	Int	Records excluded by `distinct` statements

trace info if $pipeline.total_count % 10000 == 0 then "processed " + $pipeline.total_count.to_string() + " records"

$source.* – Per-record source lineage

$source.* exposes engine-stamped columns that travel with every record from its origin Source node downstream through merges, combines, and transforms. They identify where the record came from and when, in event time, it occurred. All three columns are filtered out of default Output projections; reference them explicitly with emit if you need them in your output schema.

Variable	Type	Description
`$source.file`	String	Path of the input file the current record was read from.
`$source.name`	String	Name of the Source node that produced the current record. Survives through `merge` / `combine` so downstream nodes can branch on origin.
`$source.event_time`	DateTime	Engine-stamped event time, delay-corrected by the source’s `watermark.delay`. `Null` when the source has no `watermark:` block, or when the per-record value did not parse.

filter $source.name == "src_web"
emit origin = $source.name
emit ingest_file = $source.file
emit ts = $source.event_time

$source.event_time is the column a time-windowed aggregate reads to assign records to windows. It is only populated for records from a source that declares watermark: — otherwise it holds Null.

$vars.* – User-defined variables

User-defined variables are declared in the YAML pipeline config under pipeline.vars: and accessed via $vars.name in CXL expressions.

YAML declaration

pipeline:
  name: invoice_processing
  vars:
    high_value_threshold: 10000
    tax_rate: 0.21
    output_currency: "USD"
    fiscal_year_start_month: 4

CXL usage

filter amount > $vars.high_value_threshold
emit tax = amount * $vars.tax_rate
emit currency = $vars.output_currency

Variables provide a clean way to externalize configuration from CXL logic. Combined with channels, different variable sets can parameterize the same pipeline for different environments or clients.

$config.* – Composition config parameters

$config.<param> reads a composition’s declared config parameter from inside that composition’s body. It is only available in a composition body — a top-level pipeline declares no config schema, so $config.* there is a compile error.

Each parameter is declared in the composition’s _compose.config_schema: block, then read from the body’s CXL:

# in fraud_check.comp.yaml
_compose:
  name: fraud_check
  config_schema:
    threshold: { type: float, default: 0.8 }
nodes:
  - type: transform
    name: flag
    input: inp
    config:
      cxl: |
        emit order_id = order_id
        emit flagged = score >= $config.threshold

Unlike $vars.* (which flows to the executor as a runtime value), $config.<param> is constant-folded at compile time: each reference is replaced by the value resolved for that instantiation, so two call sites of the same composition with different config: compile to different bodies. The resolution precedence, highest first, is a channel/group config: clobber, then the call site’s config:, then the signature default.

Because the value is resolved per instantiation, a channel or group config: clobber changes what the composition body actually computes: the override takes effect at execution, not only in reporting. The winning layer is still recorded in the provenance side-table for channels resolve / explain --field.
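The precedence order for a single parameter can be written out as a short Python sketch. `resolve_config`, the `_UNSET` marker, and the layer labels are hypothetical; only the ordering (clobber, then call site, then signature default) comes from the text.

```python
_UNSET = object()  # hypothetical marker for "this layer sets nothing"


def resolve_config(signature_default, call_site=_UNSET, clobber=_UNSET):
    """Return (value, winning_layer), checking the highest layer first."""
    for layer, value in (("clobber", clobber), ("call_site", call_site)):
        if value is not _UNSET:
            return value, layer
    return signature_default, "default"


print(resolve_config(0.8))                               # (0.8, 'default')
print(resolve_config(0.8, call_site=0.9))                # (0.9, 'call_site')
print(resolve_config(0.8, call_site=0.9, clobber=0.95))  # (0.95, 'clobber')
```

Returning the winning layer alongside the value reflects the idea that the resolution is recorded for later inspection as well as applied.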

$record.* – Per-record scoped state

$record.* is a per-record key-value store that travels with the record through the pipeline but never serializes as an output column. It is the mechanism for tagging records with quality flags, routing hints, or audit information that should not appear in the final output unless explicitly re-emitted as a regular column.

Each $record variable is declared in the writing Transform’s config.declares: block (scope: record) and written from that Transform’s CXL:

Writing record state

- type: transform
  name: classify
  input: orders
  config:
    declares:
      - { name: quality, scope: record, type: string }
    cxl: |
      emit order_id = order_id
      emit $record.quality = if amount < 0 then "suspect" else "ok"

Reading record state

Any downstream node reads it via $record.<key>:

filter $record.quality == "ok"
emit audit_quality = $record.quality

See Scoped Variables for the full declaration model and the pipeline / source / record lifetimes.

now – Current time

The now keyword returns the current wall-clock time as a DateTime value. It is evaluated fresh per record, so each record gets the actual time of its processing.

$ cxl eval -e 'emit timestamp = now'

{
  "timestamp": "2026-04-11T15:30:00"
}

now is useful for timestamping records:

emit processed_at = now
emit days_old = now.diff_days(created_date)

Note: now is a keyword, not a function call. Write now, not now().

Complete example

pipeline:
  name: order_enrichment
  vars:
    discount_threshold: 500
    tax_rate: 0.08

  nodes:
    - name: orders
      type: source
      format: csv
      path: orders.csv

    - name: enrich
      type: transform
      input: orders
      cxl: |
        emit order_id = order_id
        emit amount = amount
        emit discount = if amount > $vars.discount_threshold then 0.1 else 0.0
        emit tax = amount * $vars.tax_rate
        emit total = amount * (1 - discount) + tax
        emit processed_at = now
        emit source_file = $source.file
        emit pipeline_run = $pipeline.execution_id

    - name: output
      type: output
      input: enrich
      format: csv
      path: enriched_orders.csv

Null Handling

Null values in CXL represent missing or absent data. CXL uses null propagation – most operations on null produce null – with specific tools for detecting and handling nulls.

Null propagation

When a method receives a null receiver, it returns null without executing. This is called null propagation, and it applies to every method except the five exemptions listed under Null propagation exceptions.

$ cxl eval -e 'emit result = null.upper()'

{
  "result": null
}

Propagation flows through method chains:

$ cxl eval -e 'emit result = null.trim().upper().length()'

{
  "result": null
}

Null propagation exceptions

Five methods are exempt from null propagation and actively handle null receivers:

Method	Null behavior
`is_null()`	Returns `true`
`type_of()`	Returns `"Null"`
`is_empty()`	Returns `true`
`catch(x)`	Returns `x`
`debug(l)`	Passes through null, logs it

$ cxl eval -e 'emit a = null.is_null()' -e 'emit b = null.type_of()' \
    -e 'emit c = null.catch("fallback")'

{
  "a": true,
  "b": "Null",
  "c": "fallback"
}

Null coalesce operator (??)

The ?? operator returns its left operand if non-null, otherwise its right operand. It is the primary tool for providing default values.

$ cxl eval -e 'emit a = null ?? "default"' \
    -e 'emit b = "present" ?? "default"'

{
  "a": "default",
  "b": "present"
}

Chain multiple ?? operators for fallback chains:

$ cxl eval -e 'emit result = null ?? null ?? "last resort"'

{
  "result": "last resort"
}
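For readers who think in a general-purpose language, propagation and `??` behave roughly like the two Python helpers below, with `None` standing in for null. The names `call` and `coalesce` are hypothetical, and unlike an operator, the Python function evaluates all of its operands eagerly.

```python
def call(receiver, method, *args):
    """Null propagation: a null receiver short-circuits to null."""
    if receiver is None:
        return None
    return getattr(receiver, method)(*args)


def coalesce(*operands):
    """Model of `a ?? b ?? c`: the first non-null operand, else null."""
    for value in operands:
        if value is not None:
            return value
    return None


print(call(call(None, "strip"), "upper"))    # None
print(coalesce(None, None, "last resort"))   # last resort
print(coalesce(0, 1))                        # 0
```

The last line is the point to remember: only null triggers the fallback, so a present `0`, `false`, or empty string is kept.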

Three-valued logic

Boolean operations with null follow three-valued logic (like SQL):

and

Left	Right	Result
`true`	`null`	`null`
`false`	`null`	`false`
`null`	`true`	`null`
`null`	`false`	`false`
`null`	`null`	`null`

The key insight: false and null is false because the result is false regardless of the unknown value.

or

Left	Right	Result
`true`	`null`	`true`
`false`	`null`	`null`
`null`	`true`	`true`
`null`	`false`	`null`
`null`	`null`	`null`

The key insight: true or null is true because the result is true regardless of the unknown value.

not

Operand	Result
`true`	`false`
`false`	`true`
`null`	`null`
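The three truth tables reduce to a pair of short-circuit rules, shown here as a Python sketch with `None` standing in for null. The function names are illustrative only.

```python
def and3(a, b):
    """Three-valued AND: a definite False wins over an unknown."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a, b):
    """Three-valued OR: a definite True wins over an unknown."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def not3(a):
    """Three-valued NOT: an unknown stays unknown."""
    return None if a is None else not a


print(and3(False, None))  # False
print(or3(True, None))    # True
print(not3(None))         # None
```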

Arithmetic with null

Any arithmetic operation involving null produces null:

$ cxl eval -e 'emit result = 5 + null'

{
  "result": null
}

Comparison with null

Comparisons involving null produce null (not false):

$ cxl eval -e 'emit result = null == null'

{
  "result": null
}

To test for null, use is_null():

$ cxl eval -e 'emit result = null.is_null()'

{
  "result": true
}

Practical patterns

Fallback values with ??

emit name = raw_name ?? "Unknown"
emit amount = raw_amount ?? 0
emit active = is_active ?? false

Safe conversion with try_* and ??

emit price = raw_price.try_float() ?? 0.0
emit qty = raw_qty.try_int() ?? 1

Explicit null testing

filter not amount.is_null()
emit has_email = not email.is_null()

Catch method (equivalent to ??)

emit name = raw_name.catch("Unknown")

Conditional null handling

emit status = if amount.is_null() then "missing"
    else if amount < 0 then "invalid"
    else "ok"

Filter blank or null

# Filter out records where name is null or empty string
filter not name.is_empty()

Null-safe chaining

When working with fields that may be null, place the null check early or use ??:

# Safe: coalesce first, then transform
emit normalized = (raw_name ?? "").trim().upper()

# Safe: test before use
emit name = if raw_name.is_null() then "N/A" else raw_name.trim()

Modules & use

CXL supports a module system for organizing reusable expressions. Modules contain function declarations and constant bindings that can be imported into CXL programs.

Module files

A module is a .cxl file containing fn declarations and let constants. Module files live in the rules path (default: ./rules/).

Function declarations

Functions are pure, single-expression bodies with named parameters:

fn fiscal_year(d) = if d.month() < 4 then d.year() - 1 else d.year()

fn full_name(first, last) = first.trim() + " " + last.trim()

fn clamp_pct(value) = value.clamp(0, 100).round_to(1)

Functions are pure – they have no side effects and always return a value.

Module constants

Constants are let bindings at the module level:

let tax_rate = 0.21
let max_retries = 3
let default_currency = "USD"

Example module file

File: rules/shared/dates.cxl

fn fiscal_year(d) = if d.month() < 4 then d.year() - 1 else d.year()

fn quarter(d) = match {
  d.month() <= 3  => 1,
  d.month() <= 6  => 2,
  d.month() <= 9  => 3,
  _               => 4
}

fn fiscal_quarter(d) = quarter(d.add_months(-3))

let fiscal_start_month = 4

Importing modules

Use the use statement to import a module. Module paths use dot notation (not ::):

use shared.dates as d

This imports the module at rules/shared/dates.cxl and binds it to the alias d.

Import syntax

use module.path
use module.path as alias

The as alias clause is optional. When omitted, the last segment of the path becomes the default name.

use shared.dates          # access as dates::fiscal_year(...)
use shared.dates as d     # access as d::fiscal_year(...)

Path resolution

Module paths are resolved relative to the rules path:

Import	File path
`use shared.dates`	`rules/shared/dates.cxl`
`use transforms.normalize`	`rules/transforms/normalize.cxl`
`use utils`	`rules/utils.cxl`

The rules path defaults to ./rules/ and can be overridden with --rules-path.
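The mapping in the table is mechanical: dots become directory separators and `.cxl` is appended. A minimal Python sketch, assuming POSIX-style paths and using hypothetical helper names `module_file` and `default_alias`:

```python
from pathlib import PurePosixPath


def module_file(module_path, rules_path="./rules/"):
    """Map a dotted module path to its file under the rules path."""
    parts = module_path.split(".")
    return PurePosixPath(rules_path, *parts[:-1], parts[-1] + ".cxl")


def default_alias(module_path):
    """Without `as`, the last path segment is the name used before `::`."""
    return module_path.rsplit(".", 1)[-1]


print(module_file("shared.dates"))    # rules/shared/dates.cxl
print(module_file("utils"))           # rules/utils.cxl
print(default_alias("shared.dates"))  # dates
```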

Using imported functions and constants

After importing, reference module members with :: (double colon) syntax:

use shared.dates as d
use shared.finance as f

emit fiscal_year = d::fiscal_year(invoice_date)
emit quarter = d::quarter(invoice_date)
emit tax = amount * f::tax_rate
emit net = amount - tax

Functions

Call functions with alias::function_name(args):

use shared.dates as d
emit fy = d::fiscal_year(order_date)

Constants

Access constants with alias::constant_name:

use shared.finance as f
emit tax = amount * f::tax_rate

Restrictions

No wildcard imports. use shared.* is not supported. Import modules explicitly.
Dot separator only. Module paths use ., not ::. The :: syntax is reserved for member access after import.
Single expression bodies. Functions must be a single expression – no multi-statement bodies.
Pure functions. Functions cannot use emit, filter, distinct, or other statement forms. They are pure computations.
No recursion. Functions cannot call themselves (directly or indirectly).

Complete example

File: rules/etl/clean.cxl

fn normalize_name(name) = name.trim().upper()

fn safe_amount(raw) = raw.try_float() ?? 0.0

fn flag_suspicious(amount, threshold) =
  if amount > threshold then "review" else "ok"

let max_amount = 999999.99

Pipeline CXL block:

use etl.clean as c

emit customer = c::normalize_name(raw_customer)
emit amount = c::safe_amount(raw_amount)
filter amount <= c::max_amount
emit review_flag = c::flag_suspicious(amount, 10000)

The cxl CLI Tool

The cxl command-line tool validates, evaluates, and formats CXL source files. It is the standalone companion to the Clinker pipeline engine, useful for testing expressions, validating transforms, and debugging CXL logic.

Commands

cxl check

Parse, resolve, and type-check a .cxl file. Reports errors with source locations and fix suggestions.

$ cxl check transform.cxl
ok: transform.cxl is valid

On errors:

error[parse]: expected expression, found '}' (at transform.cxl:12)
  help: check for missing operand or extra closing brace

error[resolve]: unknown field 'amoutn' (at transform.cxl:5)
  help: did you mean 'amount'?

error[typecheck]: cannot apply '+' to String and Int (at transform.cxl:8)
  help: convert one operand — use .to_int() or .to_string()

cxl eval

Evaluate CXL expressions against provided data and print the result as JSON.

Inline expression:

$ cxl eval -e 'emit result = 1 + 2'

{
  "result": 3
}

From a file with field values:

$ cxl eval transform.cxl \
    --field Price=10.5 \
    --field Qty=3

From a file with JSON input:

$ cxl eval transform.cxl --record '{"price": 10.5, "qty": 3}'

Multiple inline statements:

$ cxl eval -e 'let tax = 0.21
emit net = price * (1 - tax)' --field price=100

{
  "net": 79.0
}

cxl fmt

Parse and pretty-print a .cxl file in canonical format with normalized whitespace and consistent styling.

$ cxl fmt transform.cxl

Output is printed to stdout. Redirect to overwrite:

$ cxl fmt transform.cxl > transform.cxl.tmp && mv transform.cxl.tmp transform.cxl

Input data

--field name=value

Provide individual field values as key-value pairs. Values are automatically type-inferred:

Input	Inferred type	Example
Integer pattern	Int	`--field count=42`
Decimal pattern	Float	`--field price=10.5`
`true` / `false`	Bool	`--field active=true`
`null`	Null	`--field value=null`
Anything else	String	`--field name=Alice`

$ cxl eval -e 'emit t = amount.type_of()' --field amount=42

{
  "t": "Int"
}

$ cxl eval -e 'emit t = name.type_of()' --field name=Alice

{
  "t": "String"
}
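The inference table can be approximated with a few ordered checks. This Python sketch is an assumption about the patterns involved (for example, that a leading `-` is allowed and that exponent forms are not numbers); `infer_field` and `parse_field_arg` are hypothetical names, not part of the `cxl` tool.

```python
import re


def infer_field(raw):
    """Infer a typed value from the text after `name=` in --field."""
    if raw == "null":
        return None
    if raw in ("true", "false"):
        return raw == "true"
    if re.fullmatch(r"-?[0-9]+", raw):           # assumed integer pattern
        return int(raw)
    if re.fullmatch(r"-?[0-9]+\.[0-9]+", raw):   # assumed decimal pattern
        return float(raw)
    return raw                                   # anything else stays a string


def parse_field_arg(arg):
    """Split `name=value` on the first '=' and infer the value's type."""
    name, _, raw = arg.partition("=")
    return name, infer_field(raw)


print(parse_field_arg("count=42"))    # ('count', 42)
print(parse_field_arg("price=10.5"))  # ('price', 10.5)
print(parse_field_arg("name=Alice"))  # ('name', 'Alice')
```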

--record JSON

Provide a full JSON object as input. Mutually exclusive with --field.

$ cxl eval -e 'emit total = price * qty' \
    --record '{"price": 10.5, "qty": 3}'

{
  "total": 31.5
}

JSON types map directly:

JSON type	CXL type
`null`	Null
`true` / `false`	Bool
integer number	Int
decimal number	Float
`"string"`	String
`[array]`	Array
`{object}`	Map

Output format

Output is always JSON. Each emit statement produces a key-value pair:

$ cxl eval -e 'emit a = 1
emit b = "two"
emit c = true'

{
  "a": 1,
  "b": "two",
  "c": true
}

Date and DateTime values are serialized as ISO 8601 strings:

$ cxl eval -e 'emit d = #2024-03-15#'

{
  "d": "2024-03-15"
}

Exit codes

Code	Meaning
0	Success (or warnings only)
1	Parse, resolve, type-check, or evaluation errors
2	I/O error (file not found, invalid JSON, etc.)

Pipeline context in eval mode

When running cxl eval, a minimal pipeline context is provided:

Variable	Value
`$pipeline.name`	`"cxl-eval"`
`$pipeline.execution_id`	Zeroed UUID
`$pipeline.batch_id`	Zeroed UUID
`$pipeline.start_time`	Current wall-clock time
`$pipeline.source_file`	Filename or `"<inline>"`
`$pipeline.source_row`	`1`
`now`	Current wall-clock time (live)

Practical usage

Quick expression testing:

$ cxl eval -e 'emit result = "hello world".upper().split(" ").length()'

{
  "result": 2
}

Validate a transform file:

$ cxl check transforms/enrich_orders.cxl && echo "Valid"

Test conditional logic:

$ cxl eval -e 'emit tier = match {
    amount > 1000 => "high",
    amount > 100 => "med",
    _ => "low"
  }' \
    --field amount=500

{
  "tier": "med"
}

Test date operations:

$ cxl eval -e 'emit year = d.year()
emit month = d.month()
emit next_week = d.add_days(7)' \
    --record '{"d": "2024-03-15"}'

Test null handling:

$ cxl eval -e 'emit safe = raw.try_int() ?? 0' --field raw=abc

{
  "safe": 0
}

CLI Reference

Clinker ships two command-line tools: clinker (the pipeline runner) and cxl (the expression checker/evaluator/formatter, covered in the CXL CLI chapter). This page is the complete reference for clinker.

clinker run

Execute a pipeline.

clinker run [OPTIONS] <CONFIG>

Positional arguments

Argument	Description
`<CONFIG>`	Path to the pipeline YAML configuration file (required)

Options

Flag	Default	Description
`--memory-limit <SIZE>`	YAML `memory.limit`, else `512M`	Memory budget for the execution. Uses the same grammar as the YAML `memory.limit`: a byte count with an optional binary (1024-based) `K`/`M`/`G` suffix (`K` = 1024 bytes, `M` = 1024², `G` = 1024³), where a bare integer is bytes. Other forms — a decimal `GB`, an explicit `GiB`, or a fractional value such as `1.5G` — are rejected. When the limit is approached, aggregation operators spill to disk rather than crashing. When passed, this value overrides any `memory.limit` set in the pipeline YAML; when omitted, the YAML value applies (or the `512M` default when the YAML is also silent). An empty or whitespace-only value — as an ops wrapper produces when it forwards an unset variable, e.g. `--memory-limit "$CLINKER_MEM"` with `CLINKER_MEM` unset — is treated the same as omitting the flag. A non-empty malformed value (for example the decimal `4GB` rather than the binary `4G`) is rejected at the CLI boundary with an error naming `--memory-limit` and echoing the value, so a typo fails loudly instead of silently falling back to the default and shrinking a larger YAML budget. Because the flag simply populates `pipeline.memory.limit`, a startup budget error (`E312`) for a value you passed via `--memory-limit` refers to that same limit.
`--threads <N>`	number of CPUs	Size of the thread pool used for parallel node execution.
`--error-threshold <N>`	`0` (unlimited)	Maximum number of records routed to the dead-letter queue before the pipeline aborts. `0` means no limit – the pipeline will run to completion regardless of DLQ volume.
`--batch-id <ID>`	UUID v7	Custom execution identifier. Appears in metrics output and log lines. Use a meaningful value (e.g. `daily-2026-04-11`) for correlation across retries.
`--explain [FORMAT]`	`text`	Print the execution plan and exit without processing data. Accepted formats: `text`, `json`, `dot`. See Explain Plans.
`--lineage <PATH>`	–	Build column lineage and write it as OpenLineage NDJSON, then exit without processing data. Give a file path, or `-` for stdout. See Column Lineage.
`--lineage-events <PATH>`	–	Run the pipeline and emit live OpenLineage run events (a `START` at run begin, then a terminal `COMPLETE` / `FAIL` / `ABORT` with real timing and row counts) as NDJSON to a file path, or `-` for stdout. Cannot be combined with `--lineage`, `--explain`, `--dry-run`, or `-n`. See Live run events.
`--dry-run`	–	Validate the configuration (YAML structure, CXL syntax, type checking, DAG wiring) without reading any data.
`-n, --dry-run-n <N>`	–	Process only the first `N` records from each source through the full pipeline. Implies `--dry-run`.
`--dry-run-output <FILE>`	stdout	Redirect dry-run output to a file instead of stdout. Only meaningful with `-n`.
`--rules-path <DIR>`	`./rules/`	Search path for CXL module files referenced by `use` statements.
`--base-dir <DIR>`	directory of the config file	Base directory for resolving relative paths in the YAML config.
`--allow-absolute-paths`	–	Permit absolute file paths in the pipeline YAML. By default, absolute paths are rejected to encourage portable configs.
`--env <NAME>`	–	Set the active environment. Equivalent to setting `CLINKER_ENV`. Used by `when:` conditions in channel overrides.
`--quiet`	–	Suppress progress output. Errors are still printed to stderr.
`--force`	–	Allow output files to be overwritten if they already exist. Without this flag, the pipeline aborts rather than clobbering existing output.
`--log-level <LEVEL>`	`info`	Logging verbosity. One of: `error`, `warn`, `info`, `debug`, `trace`.
`--metrics-spool-dir <DIR>`	–	Directory for per-execution metrics files. See Metrics & Monitoring.
`--group <NAME>`	–	Force-include a group overlay by name (repeatable). Applies the group’s `overrides` op stream and `config`/`vars` clobber before the pipeline compiles, regardless of the group’s selector. Use `clinker channels resolve` to preview the effective plan.
`--no-auto-groups`	–	Suppress selector-derived group membership; only groups named with `--group` apply.
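
The size grammar described for `--memory-limit` is small enough to sketch. The Python below is an illustrative re-implementation of the rules in the table, not Clinker's own parser: the function name `parse_memory_limit` is hypothetical, and accepting only uppercase suffixes is an assumption the table does not settle.

```python
import re
from typing import Optional

# Binary (1024-based) multipliers, as described for --memory-limit.
_MULTIPLIER = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}

# Digits followed by at most one of K/M/G. Uppercase-only is an assumption.
_SIZE = re.compile(r"([0-9]+)([KMG]?)")


def parse_memory_limit(raw: Optional[str]) -> Optional[int]:
    """Return the limit in bytes, or None when the flag counts as omitted."""
    if raw is None or raw.strip() == "":
        # An empty or whitespace-only value behaves like an omitted flag.
        return None
    match = _SIZE.fullmatch(raw)
    if match is None:
        # Name the flag and echo the value so a typo fails loudly.
        raise ValueError(f"--memory-limit: invalid size {raw!r}")
    digits, suffix = match.groups()
    return int(digits) * _MULTIPLIER[suffix]
```

Under these assumptions `parse_memory_limit("512M")` returns 536870912, while `4GB`, `4GiB`, and `1.5G` all raise instead of falling back to a default.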

Examples

# Basic execution
clinker run pipeline.yaml

# Production run with memory budget and forced overwrite
clinker run pipeline.yaml --memory-limit 512M --force --log-level warn

# Validate without processing
clinker run pipeline.yaml --dry-run

# Preview first 10 records
clinker run pipeline.yaml --dry-run -n 10

# Show execution plan as Graphviz
clinker run pipeline.yaml --explain dot | dot -Tpng -o plan.png

# Run with a custom batch ID for tracing
clinker run pipeline.yaml --batch-id "daily-2026-04-11" --metrics-spool-dir ./metrics/

clinker metrics collect

Sweep per-execution metrics files from a spool directory into a single NDJSON archive.

clinker metrics collect [OPTIONS]

Options

Flag	Description
`--spool-dir <DIR>`	Spool directory to sweep (required).
`--output-file <FILE>`	NDJSON archive destination (required). If the file exists, new entries are appended.
`--delete-after-collect`	Remove spool files after they have been successfully written to the archive.
`--dry-run`	Preview which files would be collected without writing anything.

Examples

# Collect and archive, then clean up spool
clinker metrics collect \
  --spool-dir /var/spool/clinker/ \
  --output-file /var/log/clinker/metrics.ndjson \
  --delete-after-collect

# Preview what would be collected
clinker metrics collect \
  --spool-dir ./metrics/ \
  --output-file ./archive.ndjson \
  --dry-run

clinker channels

Inspect and validate the channel/group multi-tenant overlay system.

clinker channels resolve <TARGET> [OPTIONS]
clinker channels lint [OPTIONS]
clinker channels group members <GROUP> [OPTIONS]
clinker channels label set <KEY>=<VALUE> <CHANNEL_ID>... [OPTIONS]

clinker channels resolve

Renders the effective post-overlay plan for one target — the DAG plus per-value provenance (which layer supplied each value, and which group injected which node). This answers “what does tenant X actually run?”.

Flag	Default	Description
`<TARGET>`	–	Path to the base pipeline (or composition) YAML to resolve (required).
`--channel <ID>`	–	Channel id to resolve for (a folder under the channel root). Matching groups are derived from the channel’s labels.
`--group <NAME>`	–	Force-include a group overlay by name (repeatable), with or without a channel.
`--no-auto-groups`	–	Suppress selector-derived group membership.
`--base-dir <DIR>`	`.`	Workspace root holding `clinker.toml` and the channel/group roots.

Exits non-zero when the overlay raises an error (e.g. a config key matching no parameter), so resolve doubles as a targeted check for one tenant.

clinker channels lint

Compiles every (target × overlay) combination across the workspace and reports failures — the CI safety net for base-change blast radius. This is where the full-tree scan lives; the run path resolves a single channel by computed lookup.

Flag	Default	Description
`--base-dir <DIR>`	`.`	Workspace root to lint.

Exits non-zero if any combination fails to compile or apply. Dangling splice anchors (an op referencing a missing node) and config keys matching no parameter are reported per combination.

clinker channels group members

Lists the channels whose labels currently satisfy a group’s selector — “who is in this group right now?”. Because membership is derived from labels, this evaluates the group’s match: selector against each channel’s manifest labels through the same derivation the overlay resolver uses.

Flag	Default	Description
`<GROUP>`	–	Group name (the `group.name` of a `*.group.yaml`).
`--base-dir <DIR>`	`.`	Workspace root holding `clinker.toml` and the channel/group roots.

A group with no match: selector is explicit-only and reports no derived members. A channel whose labels make the selector ill-typed or reference an undeclared label is reported as a selector error (never a silent non-match), and the command exits non-zero when any such error occurs.

clinker channels label set

Stamps (or overwrites) one label across the named channels by editing each channel’s channel.cfg.yaml manifest in place. Idempotent: re-running with the same value writes nothing. Only the manifest’s labels: block is rewritten; other keys and comments are preserved. A channel with no manifest yet gets one created (with its folder name as the channel name).

Flag	Default	Description
`<KEY>=<VALUE>`	–	Label assignment. `KEY` must be an identifier (letters, digits, `_`) so a selector can reference it. `VALUE` is typed by YAML scalar inference (`true`/`false` → bool, integers → int, decimals → float, otherwise string).
`<CHANNEL_ID>...`	–	One or more channel ids (tenant folder names) to stamp.
`--base-dir <DIR>`	`.`	Workspace root holding `clinker.toml` and the channel root.

Because group membership is attribute-derived, label set is the maintenance operation for group membership: set a label once and every group whose selector matches gains the channel — no membership list to hand-edit.
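
The `<KEY>=<VALUE>` rules in the table can be illustrated with a short Python sketch. It covers only the four inference cases the table names (real YAML scalar inference recognises more forms), the function name `parse_label_assignment` is hypothetical, and allowing a key to start with a digit is an assumption.

```python
import re

# Letters, digits, and underscore, per the KEY rule in the table.
_IDENT = re.compile(r"[A-Za-z0-9_]+")


def parse_label_assignment(arg: str):
    """Split KEY=VALUE and type VALUE by simple scalar inference."""
    key, sep, value = arg.partition("=")
    if not sep:
        raise ValueError(f"expected KEY=VALUE, got {arg!r}")
    if not _IDENT.fullmatch(key):
        raise ValueError(f"label key {key!r} is not an identifier")
    if value in ("true", "false"):
        return key, value == "true"      # bool
    if re.fullmatch(r"[+-]?[0-9]+", value):
        return key, int(value)           # int
    if re.fullmatch(r"[+-]?[0-9]+\.[0-9]+", value):
        return key, float(value)         # float
    return key, value                    # everything else stays a string
```

With this sketch, `tier=enterprise` yields a string label, `seats=25` an integer, and a key such as `my-key` is rejected because a selector could not reference it.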

Examples

# What does tenant `globex` actually run for this pipeline?
clinker channels resolve pipeline/order_fulfillment.yaml --channel globex

# Preview a group overlay standalone (no channel)
clinker channels resolve pipeline/order_fulfillment.yaml --group enterprise

# Compile every channel/group overlay in the workspace and report failures
clinker channels lint

# Which channels are currently in the `enterprise` group?
clinker channels group members enterprise

# Onboard two tenants into the enterprise tier in one shot
clinker channels label set tier=enterprise globex acme-corp

clinker refactor

Structural refactors that span a base pipeline and every channel/group overlay that references it.

clinker refactor rename-node <TARGET> <OLD> <NEW> [OPTIONS]

clinker refactor rename-node

Renames a base node and propagates the rename to every overlay reference. The overlay op model addresses base nodes by name, so renaming a node otherwise breaks every overlay that referenced it. This command rewrites, in one operation:

the base node’s name and every consumer’s input: / inputs: / body:/header:/trailer: reference;
a Combine’s named-input map (qualifier key and/or upstream value) and — when the Combine draws from the renamed node under a same-named qualifier — its where: / cxl: bodies, rewritten via the CXL parser so only true source qualifiers are touched (a method receiver like region.contains(...) is left alone);
across every group / channel-manifest / per-target overlay file: op target, after, before, injected alias, explicit input, rewire keys and values, an inline node, a set config.cxl value’s CXL, and top-level config dotted-path prefixes (old.param → new.param).
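
The last item, rewriting dotted-path prefixes (old.param → new.param), hinges on matching a whole path segment rather than a string prefix. The Python sketch below shows that one idea only; it is not the command's implementation, and the config keys used in the usage note (`orders.batch_size` and so on) are hypothetical.

```python
def rename_config_prefix(config: dict, old: str, new: str) -> dict:
    """Rewrite dotted keys whose first segment is exactly `old`."""
    renamed = {}
    for key, value in config.items():
        head, dot, rest = key.partition(".")
        if dot and head == old:
            # Only a full first segment matches; `orders_eu.x` is not `orders.x`.
            key = f"{new}.{rest}"
        renamed[key] = value
    return renamed
```

Renaming `orders` to `purchases` over `{"orders.batch_size": 100, "orders_eu.batch_size": 50}` changes the first key and leaves the second alone, which is the behaviour a naive string-prefix replace would get wrong.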

Flag	Default	Description
`<TARGET>`	–	Path to the base pipeline (or composition) YAML that declares the node.
`<OLD>`	–	Current node name (must exist in the target).
`<NEW>`	–	New node name — identifier only (letters, digits, `_`); must not already exist in the target.
`--dry-run`	–	Print the diff of every file that would change without writing anything.
`--base-dir <DIR>`	`.`	Workspace root holding `clinker.toml` and the channel/group roots.

Ambiguity is guarded: renaming to a name that already exists in the target, or renaming a node that does not exist, is a hard error. A Combine where:/cxl: body that must be rewritten but does not parse aborts the whole operation before anything is written. After a real (non-dry-run) run the command re-runs channels lint so an incomplete rename fails loudly.

Scope: a per-target overlay (<target>.channel.yaml / .comp.yaml) is rewritten only when it overlays the renamed pipeline, so a different pipeline that happens to share the node name is left alone. Target-agnostic layers — channel-wide manifest overrides and group files — are rewritten wherever they reference the old name; if two pipelines share a node name and the same group or channel-wide override targets it, review the --dry-run diff and rely on channels lint to catch any pipeline the rename should not have touched.

Files are rewritten by re-serializing their YAML: key order is preserved, but comments and incidental scalar styling are normalized. Use --dry-run to review the exact on-disk diff first.

Examples

# Preview a rename across the base pipeline and every overlay that references it
clinker refactor rename-node pipeline/order_fulfillment.yaml orders purchases --dry-run

# Apply it, then re-lint the workspace
clinker refactor rename-node pipeline/order_fulfillment.yaml orders purchases

clinker config

Inspect a pipeline configuration file.

clinker config --resolved <CONFIG>

clinker config --resolved

Prints the config with the multi-value shorthand expanded to canonical form. The bare-field forms of split_to_rows:, split_values:, and join_values: are rewritten to full mappings with every default spelled out — so you can see exactly what the engine runs:

a bare - line_items under split_to_rows: becomes - { field: line_items, keep_empty: true, mode: extract };
a bare - tags under split_values: becomes - { field: tags, delimiter: ";" };
a bare - tags under join_values: becomes - { field: tags, delimiter: ";", on_conflict: error, escape: "\\" }.

The rewrite is surgical: only those shorthand blocks change. Comments, key order, indentation, and every other surface are preserved byte-for-byte, so the output parses to a plan semantically identical to the input, and running config --resolved on the result is a no-op. Schema columns are already canonical (multiple: true is always written explicitly), so the schema block is left untouched.
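
The expansion and its idempotence can be sketched on already-parsed data. This Python is a simplified illustration under stated assumptions: the real command rewrites the file text surgically and preserves comments, which this sketch does not attempt, and the names `DEFAULTS` and `expand_shorthand` are hypothetical. The default values are copied from the three expansions listed above.

```python
# Defaults taken from the three documented expansions.
DEFAULTS = {
    "split_to_rows": {"keep_empty": True, "mode": "extract"},
    "split_values": {"delimiter": ";"},
    "join_values": {"delimiter": ";", "on_conflict": "error", "escape": "\\"},
}


def expand_shorthand(block: str, items: list) -> list:
    """Expand bare field names into full mappings; leave mappings untouched."""
    expanded = []
    for item in items:
        if isinstance(item, str):
            # A bare field name becomes a mapping with every default spelled out.
            item = {"field": item, **DEFAULTS[block]}
        expanded.append(item)
    return expanded
```

Because an expanded entry is already a mapping, running `expand_shorthand` on its own output returns the same list, mirroring the no-op property described above.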

This is config canonicalization for the pipeline file itself. It is distinct from clinker channels resolve, which renders the effective post-overlay plan for a specific tenant.

A few surfaces are deliberately left as written rather than expanded, since regenerating them would lose information: a shorthand block that carries an interior comment or blank line between its items is passed through unchanged (so the comment is never dropped), and a value written as a YAML alias (*anchor) is left in place — the anchor it points to is expanded at its definition, so the alias still resolves to the expanded value. The output uses the input file’s line endings (LF or CRLF).

Flag	Default	Description
`<CONFIG>`	–	Path to the pipeline YAML config file (required). The file is validated before it is rewritten, so a malformed config fails with a config error rather than emitting a half-expanded document.
`--resolved`	–	Print the fully-expanded canonical form to stdout. Currently the only mode; required.

Examples

# Show the fully-expanded canonical form
clinker config --resolved pipeline.yaml

# Materialize the shorthand into a new file
clinker config --resolved pipeline.yaml > pipeline.canonical.yaml

See Source Nodes → Multi-value fields for the shorthand these forms expand from.

Environment Variables

Variable	Description
`CLINKER_ENV`	Active environment name. Equivalent to `--env`. Used by `when:` conditions in channel overrides to select environment-specific configuration.
`CLINKER_METRICS_SPOOL_DIR`	Default metrics spool directory. Overridden by `--metrics-spool-dir`.

Precedence (highest to lowest): CLI flag, environment variable, YAML config value.
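
The precedence rule reads as a first-match lookup. The Python sketch below illustrates it; the function name `resolve_setting` is hypothetical, and treating only an absent value as unset (rather than also an empty string) is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_setting(cli_value: Optional[str], env_name: str,
                    yaml_value: Optional[str],
                    environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first value that is set, from highest to lowest precedence."""
    for candidate in (cli_value, environ.get(env_name), yaml_value):
        if candidate is not None:
            return candidate
    return None  # nothing supplied at any layer
```

For example, a `--metrics-spool-dir` value passed on the command line wins over `CLINKER_METRICS_SPOOL_DIR`, and the environment variable is consulted only when the flag is absent.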

Validation & Dry Run

Clinker provides two levels of pre-flight validation so you can catch problems before committing to a full run.

Compile validation

clinker run pipeline.yaml --dry-run

This validates everything that can be checked without reading data:

YAML structure and required fields
CXL syntax and compile-time type checking
Schema compatibility between connected nodes
DAG wiring (no cycles, no dangling inputs, no missing nodes)
Plan-time source and output configuration gates

No records are read and no output files are created. Bare --dry-run does not perform runtime source discovery or require an input file to be readable. Planning may still inspect available file metadata or evaluate matchers for cost estimates. The command exits with code 0 on success or code 1 with a diagnostic message on failure.

Use this after every YAML edit. It runs in milliseconds and catches the majority of configuration mistakes.

Record preview

clinker run pipeline.yaml --dry-run -n 10

This reads the first 10 records from each source and processes them through the full pipeline – transforms, aggregations, routing, and output formatting. Results are printed to stdout.

The record preview exercises the runtime evaluation path, catching issues that compile validation cannot:

CXL expressions that are syntactically valid but fail at runtime (e.g., calling a string method on an integer)
Data format mismatches between the declared schema and actual file contents
Unexpected null values in required fields

Save preview to file

clinker run pipeline.yaml --dry-run -n 100 --dry-run-output preview.csv

The preview is written in the format the pipeline’s output node produces, so preview.csv shows how the full run’s output will look for the first 100 records from each source.

Recommended workflow

Use both validation levels in sequence before every production run:

--dry-run – catch configuration and type errors instantly.
--dry-run -n 10 – verify output shape and values against real data.
Full run – execute with confidence.

This three-step pattern is especially valuable when:

Editing CXL expressions in transform or aggregate nodes
Changing source schemas or swapping input files
Adding or removing nodes from the pipeline DAG
Modifying route conditions

Combining with explain

You can also inspect the execution plan before running:

clinker run pipeline.yaml --explain

This shows the DAG structure, parallelism strategy, and node ordering without reading any data. See Explain Plans for details.

The typical full pre-flight sequence is:

clinker run pipeline.yaml --explain          # inspect the DAG
clinker run pipeline.yaml --dry-run          # validate the compiled plan
clinker run pipeline.yaml --dry-run -n 10    # preview with data
clinker run pipeline.yaml --force            # run for real
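
In CI, the same sequence can be driven by a small wrapper that stops at the first failing step. The Python sketch below is one way to do that, not part of Clinker. It relies on the documented exit codes for `--dry-run` and assumes the other steps also signal failure with a non-zero exit code; `preflight` and `run_command` are hypothetical names.

```python
import subprocess
from typing import Callable, List, Sequence

# The four steps of the pre-flight sequence, as extra arguments to `clinker run`.
STEPS = (
    ["--explain"],
    ["--dry-run"],
    ["--dry-run", "-n", "10"],
    ["--force"],
)


def run_command(argv: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def preflight(config: str,
              run: Callable[[Sequence[str]], int] = run_command) -> List[List[str]]:
    """Run each step in order; stop at the first non-zero exit code."""
    completed = []
    for extra in STEPS:
        argv = ["clinker", "run", config, *extra]
        if run(argv) != 0:
            raise RuntimeError(f"pre-flight step failed: {' '.join(argv)}")
        completed.append(argv)
    return completed
```

The `run` parameter is injectable so the ordering logic can be exercised without the binary; calling `preflight("pipeline.yaml")` with the default runner executes the real commands.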

Explain Plans

The --explain flag prints the execution plan – the DAG of nodes, their connections, and the parallelism strategy the optimizer has chosen – without reading any data.

Text format

clinker run pipeline.yaml --explain
# or explicitly:
clinker run pipeline.yaml --explain text

The text format shows a human-readable summary of the execution plan:

Execution Plan: customer_etl
============================

Node 0: customers (Source, parallel: file-chunked)
  -> transform_1

Node 1: transform_1 (Transform, parallel: record)
  -> route_1

Node 2: route_1 (Route, parallel: record)
  -> [high] output_high
  -> [default] output_standard

Node 3: output_high (Output, parallel: serial)

Node 4: output_standard (Output, parallel: serial)

Key information shown:

Node index and name – the topological position in the DAG
Node type – Source, Transform, Aggregate, Route, Merge, Output, Composition
Parallelism strategy – how the optimizer plans to execute the node
Connections – downstream nodes, with port labels for route branches
Buffer class (Physical Properties section) – buffer: streaming for a node that hands its output straight to a single downstream consumer, or buffer: materialized for one that holds a whole stage’s output in an inter-stage buffer. See Streaming vs. Blocking Stages for the distinction.

The buffer class is a pre-runtime signal for memory pressure: a materialized node holds its rows against pipeline.memory.limit and may spill to disk once the budget is tight, while a streaming node holds only a small in-flight slice. Use the annotation alongside --memory-limit / pipeline.memory.limit to predict which stages will dominate memory before running the pipeline.

JSON format

clinker run pipeline.yaml --explain json

Produces a machine-readable JSON object for programmatic consumption. Useful for:

CI pipelines that need to assert plan properties
Custom dashboards that visualize execution plans
Diffing plans between config versions

# Compare plans before and after a config change
clinker run old.yaml --explain json > plan_old.json
clinker run new.yaml --explain json > plan_new.json
diff plan_old.json plan_new.json
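
A plain `diff` compares text, so it can report differences that are only key order or whitespace. Whether the JSON output has a stable key order is not stated here, so the Python sketch below normalizes both plans before diffing. It makes no assumption about the plan's field names; `normalize` and `diff_plans` are hypothetical helpers.

```python
import difflib
import json


def normalize(plan_text: str) -> list:
    """Re-serialize JSON with sorted keys so key order cannot cause diff noise."""
    plan = json.loads(plan_text)
    return json.dumps(plan, indent=2, sort_keys=True).splitlines()


def diff_plans(old_text: str, new_text: str) -> list:
    """Return unified-diff lines between two plans; empty when equivalent."""
    return list(difflib.unified_diff(
        normalize(old_text), normalize(new_text),
        fromfile="plan_old.json", tofile="plan_new.json", lineterm=""))
```

A CI job could read `plan_old.json` and `plan_new.json`, call `diff_plans` on their contents, and fail the build when the returned list is non-empty.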

Graphviz DOT format

clinker run pipeline.yaml --explain dot

Produces a Graphviz DOT graph. Pipe it to dot to render an image:

# PNG
clinker run pipeline.yaml --explain dot | dot -Tpng -o pipeline.png

# SVG (scalable, good for documentation)
clinker run pipeline.yaml --explain dot | dot -Tsvg -o pipeline.svg

# PDF
clinker run pipeline.yaml --explain dot | dot -Tpdf -o pipeline.pdf

This requires the graphviz package to be installed on the system.

The resulting diagram shows:

Nodes as labeled boxes with type and parallelism annotations
Edges as arrows with port labels where applicable
Branch/merge fan-out and fan-in structure

When to use explain

During development – verify the DAG shape matches your mental model before writing test data.
After adding route or merge nodes – confirm branch wiring is correct.
When tuning parallelism – check which strategy the optimizer selected for each node.
In code review – generate a DOT diagram and include it in the PR for visual confirmation.

Explain parses the YAML and builds the plan without opening runtime readers or processing records. Planning may inspect source metadata or matchers for cost estimates, but it does not create pipeline outputs.

clinker run pipeline.yaml --explain       # parse, compile, print the plan
clinker run pipeline.yaml --dry-run       # parse and compile without printing the plan

Both commands perform the same compile-time checks: schema binding, CXL type checking, DAG wiring, and plan-time source and output gates. --explain also renders the compiled plan; bare --dry-run is the quieter validation form. Neither command opens runtime readers, processes records, or creates pipeline outputs.

Retraction section

If at least one Aggregate has a group_by that omits a correlation-key field, the output includes a === Retraction === block. It lists which aggregates and windows use group-atomic retraction (see Correlation Keys) and a rough per-row memory estimate for each, so you can gauge the memory cost before a production run. The block is absent on pipelines that don’t use this mode.

Exact group sizes are unknown until the pipeline runs, so treat the estimates as a planning aid and confirm the live shape with clinker metrics collect after the first run.

Statistics

When the plan carries column statistics, the output ends with a === Statistics === section. Each figure is tagged with where it came from:

Row counts – one figure per source. A [file metadata] figure is an estimate derived from the input file’s size before any record is read; an [exec sketch] figure is an exact count measured during an actual run. These row counts are what the optimizer uses to pick a Combine’s join strategy.
Column sketches – distinct-value counts and frequent-value hints that a Combine gathers over its join keys while records flow, used to speed up matching.

A statistic that was never gathered renders as null rather than a fabricated zero. A source for which nothing could be gathered (for example, a multi-file glob source or a network source whose size cannot be read) adds no Statistics section at all.

Field provenance

clinker explain <pipeline> --field <path> traces where a single resolved value comes from across every configuration layer, printing the winning layer plus each shadowed layer and its source span. The path arity selects what is traced:

<node>.<param> (two parts) — a composition config parameter, resolved across composition defaults and channel/group overlays.
<source>.<column>.<attribute> (three parts) — a source-schema attribute (type, scale, precision, format, width, required, …), resolved across the schema-provenance layers Base < Pipeline < Group < Channel. Base is the source’s own declared schema:; the higher layers are the patch_schema overlay ops each channel/group applies.

# Where does the `scale` on the orders source's `amount` column come from?
clinker explain pipeline.yaml --field orders.amount.scale

# Resolve the same attribute with a channel overlay applied first.
clinker explain pipeline.yaml --field orders.amount.scale --channel acme_prod

Field: orders.amount.scale

  Resolved value: 2

  Provenance chain (outermost to innermost):
  [WON] Channel               →  2  (line 12)
        Pipeline              →  0  (shadowed)  (line 5)
        Base                  →  0  (shadowed)

The [WON] marker names the layer whose value survives; shadowed layers show what they proposed. An unknown source, column, or attribute is rejected with a hint listing the valid names at that level.
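
The layer ordering Base < Pipeline < Group < Channel amounts to "the highest layer that proposes a value wins". The Python sketch below illustrates that rule and reproduces the shape of the chain shown above. It is not Clinker's resolver, and `resolve_attribute` is a hypothetical name.

```python
# Schema-provenance layers from lowest to highest precedence.
LAYERS = ("Base", "Pipeline", "Group", "Channel")


def resolve_attribute(proposals: dict):
    """Pick the highest layer that proposes a value; the rest are shadowed."""
    present = [layer for layer in LAYERS if layer in proposals]
    if not present:
        raise KeyError("no layer supplies this attribute")
    winner = present[-1]
    # List the chain from the outermost (highest) layer down to Base.
    chain = [
        (layer, proposals[layer], "WON" if layer == winner else "shadowed")
        for layer in reversed(present)
    ]
    return proposals[winner], chain
```

Given `{"Base": 0, "Pipeline": 0, "Channel": 2}`, the sketch resolves to 2 with Channel marked WON and Pipeline and Base shadowed, matching the example output.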

Reading a plan-time failure

A pipeline that fails a plan-time check never reads any input. The failure is printed before the run starts, and it carries four things:

E363

  × source 'src': `record_path` "$.rows" starts with the JSONPath root marker
  │ `$.`, which is not part of the grammar; `record_path` is a dot-separated
  │ path of object keys, descended from the document root (for example
  │ `data.rows`). Write "rows" instead
   ╭─[pipeline.yaml:4:1]
 3 │ nodes:
 4 │   - type: source
   · ────────┬───────
   ·         ╰── declared here
 5 │     name: src
   ╰────
  help: `record_path` on a `json` source is a dot-separated path of object
        keys descended from the document root: no `$.` JSONPath root marker,
        no leading `/`, and no empty segments. It takes precedence over
        `format:`, so pair it with `format: object` or leave `format:` off.
        Omit `record_path` entirely and the reader auto-detects the document
        shape. Run `clinker explain --code E363` for the full grammar.

The code (E363) heads the report. Where a page exists for it, hand it to clinker explain --code for the worked example.
The message names the offending input and the rule it broke.
The source line is quoted from your YAML, with the offending node underlined.
The help: paragraph names the fix. When the gate does not already say so, a See: clinker explain --code <CODE> line is appended.

Warnings are reported the same way but marked ⚠ rather than ×, so an advisory is distinguishable from the diagnostic that stopped the run.

The same report is printed under --explain, which compiles the plan before printing it, and under bare --dry-run, which compiles the plan without reading source data.

Two notes on where the snippet comes from:

A pipeline that pulls in a composition body is reported without the quoted source line. A plan-time diagnostic carries a line number but not which file it belongs to, so rather than risk underlining an unrelated line, the report gives the code, message and help alone.
A channel/group overlay suppresses the snippet only when it rewrites the compiled config through structural ops, source patches, or composition config: values. A selection that contributes only runtime vars leaves the pipeline document unchanged, so its snippet remains safe and is retained.

Looking up diagnostic codes

clinker explain --code <CODE> prints the documentation page for a code that has one, including retraction-specific codes:

clinker explain --code E15Y   # retraction-mode aggregate incompatible with strategy: streaming

Not every code that can head a report has a page yet — pages are written per condition, and the code set is larger. The report itself tells you which: the See: clinker explain --code <CODE> line is appended only when that code has a page, so a report carrying it is a code this command can answer for. Passing a code with no page reports it as unknown and lists every code that does have one.

Column Lineage

The --lineage flag builds the pipeline’s column-level lineage – which source columns each output column is derived from, and which source columns influence the output as a whole – and writes it as OpenLineage events. Like --explain, it compiles the plan and exits without reading any data, so the lineage is derived statically from the pipeline definition.

# Write to a file
clinker run pipeline.yaml --lineage lineage.ndjson

# Write to stdout (pipe into other tooling)
clinker run pipeline.yaml --lineage -

There are two emission modes:

--lineage – a static, plan-derived export. It compiles the plan and exits without reading data, so it runs instantly and describes the pipeline’s lineage rather than a specific execution.
--lineage-events – live run-lifecycle emission. It runs the pipeline and emits a START when the run begins and a terminal COMPLETE / FAIL / ABORT when it ends, carrying real timing and row counts. See Live run events below.

Both modes share the same column-lineage facet and the same on-the-wire OpenLineage shape; the live mode wraps it in real run-lifecycle events. Pushing those events to a live OpenLineage HTTP endpoint (a catalog such as Marquez) is a separate, planned transport and is not yet available.

Output format

The output is NDJSON (one JSON object per line) conforming to the OpenLineage 2-0-2 core spec. A run is described by a START event followed by a COMPLETE event that share one runId:

{"eventType":"START","run":{"runId":"019f030d-0b3e-7ee1-86ec-1bb5b4a2776b"},"job":{"namespace":"clinker","name":"audit_join","facets":{"clinker_pipeline":{"sourceHash":"7fd096a9..."}}}, ...}
{"eventType":"COMPLETE","run":{"runId":"019f030d-0b3e-7ee1-86ec-1bb5b4a2776b"}, "inputs":[...], "outputs":[{"namespace":"file","name":".../audit_report.csv","facets":{"columnLineage":{ ... }}}]}

runId is a UUID v7 minted for this export and shared by both events. Because --lineage is a static, plan-derived export, the START/COMPLETE pair describes the pipeline’s lineage, not an executed data run — no rows are processed and the two events share one timestamp. A separate clinker run mints its own runId. (For real timing and row counts tied to an actual execution, use --lineage-events.)
job.namespace is clinker; job.name is the pipeline name. The pipeline’s content hash rides in the clinker_pipeline job facet (sourceHash), not the job name – so the name stays stable across edits while runs of the same definition remain correlatable.
inputs are the source datasets; outputs are the sink datasets. Filesystem datasets use the file namespace with the resolved path as the name; a network source falls back to the clinker namespace plus the node name.
The columnLineage facet is attached to each output dataset on the COMPLETE event.

The facet has two parts, mirroring the OpenLineage ColumnLineageDatasetFacet:

"columnLineage": {
  "fields": {
    "amount": { "inputFields": [
      { "namespace":"file", "name":".../audit_orders.csv", "field":"amount",
        "transformations":[{"type":"DIRECT","subtype":"IDENTITY"}] }
    ]}
  },
  "dataset": [
    { "namespace":"file", "name":".../audit_orders.csv", "field":"order_id",
      "transformations":[{"type":"INDIRECT","subtype":"JOIN"}] }
  ]
}

fields – DIRECT (value-derivation) lineage, keyed per output column: the source columns each output column’s value is computed from. A rename (emit full = name), a multi-hop chain, or a path through a composition body (including nested compositions) collapses to the originating source column. A column whose value derives from an envelope read ($doc.<section>.<field>, bare / indexed / inside a larger expression) gets a DIRECT input field on the originating source dataset whose field is the rendered $doc.… path – so envelope-derived columns trace back to the document section they came from.
dataset – INDIRECT (influence) lineage for the dataset as a whole: source columns that shaped which rows exist, via filtering, joining, grouping, or sorting – collected once rather than duplicated across every column.

Each transformation carries a type (DIRECT / INDIRECT) and a subtype (IDENTITY, TRANSFORMATION, AGGREGATION, JOIN, GROUP_BY, FILTER, SORT, CONDITIONAL).
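
For downstream tooling, the `fields` half of the facet is straightforward to read back. The Python sketch below walks an NDJSON export and collects, per output column, the source columns it derives from. It relies only on the keys shown in the examples above; `direct_lineage` is a hypothetical helper, and the result's "dataset/column" key format is a choice made for this sketch.

```python
import json


def direct_lineage(ndjson_text: str) -> dict:
    """Map 'output dataset/column' to the (dataset, field) pairs it derives from."""
    result = {}
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("eventType") != "COMPLETE":
            continue  # the facet is attached on the COMPLETE event
        for output in event.get("outputs", []):
            facet = output.get("facets", {}).get("columnLineage", {})
            for column, entry in facet.get("fields", {}).items():
                sources = [
                    (src["name"], src["field"])
                    for src in entry.get("inputFields", [])
                ]
                result[f"{output['name']}/{column}"] = sources
    return result
```

An impact-analysis script could invert this mapping to answer which output columns depend on a given source column.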

Multi-record sources

A multi-record flat file carries several record shapes in one physical file, discriminated by a lead record_type column. Lineage attributes each record type to its own logical dataset, rather than collapsing every shape onto one flat superset dataset:

Each record type is a dataset named <file>#<id> – the physical file path with the record type’s id as a # fragment (e.g. .../payments.txt#detail). Its columns are exactly that record type’s declared columns, so an output column that derives from a detail-record field traces to <file>#detail, and one from a header field traces to <file>#header.
A column declared by several record types (unified into one superset column) lists each owning <file>#<id> dataset as an input field, so a derived output column traces to every record type it could have come from.
The engine-stamped record_type discriminator lead column belongs to no record type, so it stays on the base file dataset (<file>, no fragment) – a Route that branches on record_type still references {file, <path>, record_type}.
The run’s inputs list the base file dataset followed by each <file>#<id> record-type dataset, in record-type declaration order.

A record type’s parent / join_key – the intra-file hierarchy linking a child record type to its parent – is carried implicitly in the per-record-type dataset identities; it is not emitted as a synthetic lineage edge, since no plan operation performs that join.
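
The naming and ordering rules above can be stated compactly in code. The Python sketch below builds the input dataset list for one multi-record file. It is an illustration of the documented convention only; `record_type_datasets` is a hypothetical name, and the path and the `trailer` id in the usage note are made up.

```python
def record_type_datasets(path: str, record_type_ids: list) -> list:
    """List the input datasets for one multi-record file, in emission order."""
    # The base file dataset (no fragment) comes first.
    base = {"namespace": "file", "name": path}
    per_type = [
        {"namespace": "file", "name": f"{path}#{type_id}"}
        for type_id in record_type_ids  # declaration order is preserved
    ]
    return [base, *per_type]
```

For a file `/data/payments.txt` declaring `header`, `detail`, and `trailer`, this yields the base dataset followed by `/data/payments.txt#header`, `/data/payments.txt#detail`, and `/data/payments.txt#trailer`.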

Live run events

--lineage-events <PATH> runs the pipeline and emits OpenLineage run events tied to that actual execution, as NDJSON to a file path (or - for stdout):

clinker run pipeline.yaml --lineage-events events.ndjson

Unlike --lineage (which exits before reading data), this processes data, so it cannot be combined with --lineage, --explain, --dry-run, or -n.

Prefer a file path for a clean stream. With - (stdout), the run’s own stdout output — for example the per-stage spill-volume summary — interleaves with the event lines, so stdout is not pure NDJSON. Writing to a file keeps the events unmixed.

A run emits a START when it begins, then exactly one terminal event when it ends:

START – written and flushed before the run body executes, so a crash mid-run still leaves an observable open run. It carries the input and output datasets by identity (no facets yet).
COMPLETE – the run finished. It carries the input datasets and the output datasets with their columnLineage facets, exactly like the static export.
FAIL – the run errored. It carries the standard OpenLineage errorMessage run facet with the failure message.
ABORT – the run was interrupted (e.g. a SIGINT/SIGTERM shutdown) and drained what it could before unwinding.

{"eventType":"START","eventTime":"2026-07-03T17:00:00Z","run":{"runId":"019f..."}, "inputs":[...], "outputs":[...]}
{"eventType":"COMPLETE","eventTime":"2026-07-03T17:00:04Z","run":{"runId":"019f...","facets":{"clinker_runStats":{"recordsRead":1000,"recordsWritten":970,"recordsDlq":30,"durationMs":4210}}}, "outputs":[{"...":"...","facets":{"columnLineage":{ ... }}}]}

Key differences from the static export:

runId is the run’s execution_id (a UUID v7) — the same identity used across clinker’s provenance sidecars and metrics spool, so an orchestrator can correlate the lineage events with the run’s other artifacts.
The START and terminal events carry distinct eventTimes (run begin and run end), not one shared timestamp.
The terminal event carries a clinker_runStats run facet — a clinker-defined facet with recordsRead, recordsWritten, recordsDlq, and durationMs. Counts are pipeline-wide run totals, not per-output.
On FAIL, the run also carries the standard errorMessage run facet (ErrorMessageRunFacet 1-0-0) with the failure message.

A terminal event is always emitted for a started run: if the process fails to reach a clean terminal state (for example, an output-commit error after the executor finished), the run is still closed out as a FAIL so no START is left dangling. Emission is best-effort once the run’s data outputs are committed: a lineage-sink write error is logged as a warning and does not fail a run whose outputs already landed.
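A downstream consumer can lean on the START-then-one-terminal guarantee to spot runs that are still open. The following is a minimal sketch, not part of Clinker: it assumes an events file with one JSON object per line carrying eventType and run.runId as in the examples above. The function name summarize_runs and the run-a / run-b ids are hypothetical (real ids are UUID v7 values).

```python
import json

TERMINAL = {"COMPLETE", "FAIL", "ABORT"}

def summarize_runs(lines):
    """Map runId -> terminal eventType, or None while only START has been seen."""
    status = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        run_id = event["run"]["runId"]
        kind = event["eventType"]
        if kind == "START":
            status.setdefault(run_id, None)
        elif kind in TERMINAL:
            status[run_id] = kind
    return status

# Hypothetical sample: one closed run and one that is still open.
sample = [
    '{"eventType":"START","eventTime":"2026-07-03T17:00:00Z","run":{"runId":"run-a"}}',
    '{"eventType":"COMPLETE","eventTime":"2026-07-03T17:00:04Z","run":{"runId":"run-a"}}',
    '{"eventType":"START","eventTime":"2026-07-03T18:00:00Z","run":{"runId":"run-b"}}',
]
runs = summarize_runs(sample)
```

A run whose value is still None after the process has exited would be worth investigating. The sketch expects a clean event file; with - (stdout) the interleaved summary lines would fail json.loads.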

When to use

Impact analysis – before changing a source schema, see which outputs and columns depend on it.
Auditing & governance – feed the OpenLineage events into a catalog (e.g. Marquez) to track data provenance.
Review – attach the lineage of a new pipeline to a PR to confirm the intended derivations.

Because --lineage reads no data, its cost is plan compilation only, and it works on a pipeline whose inputs do not yet exist.

Limitations

Lineage is derived from the compiled plan, so a few constructs are approximated:

A column-grain $doc read is traced as DIRECT lineage (see fields above) in a transform projection, a combine body, a composition body, and an aggregate emit, attributed only to a source whose envelope declares the section. A $doc read in an influence predicate – a route condition, a cull drop_group_when, or a combine where – is surfaced as INDIRECT influence (FILTER for route and cull, JOIN for combine). Two $doc cases remain uncovered: a whole-section envelope echo (an output header/footer regenerated from a source document section, with no output column or expression); and any $doc reference in a Reshape rule, which the compiler rejects outright (Reshape re-runs its rules after a per-group spill that drops envelope context), so there is no Reshape envelope lineage to produce.
A match: collect combine declared without a projection body produces coarse column lineage: each collected column derives (as TRANSFORMATION) from every build-side column, because there is no body expression to pin the exact source column.
INDIRECT influence covers route/cull predicates, join keys, aggregate grouping, and correlation sort over record columns (and $doc envelope terms in route/cull/combine predicates, as above). An aggregate’s pre-aggregation row filter, a transform-inline filter, and Reshape order_by / partition_by are not (yet) attributed as influence.
Constant and count(*) columns (which have no source input) are omitted from fields; engine-stamped columns ($ck.*, $meta.*, $source.*) are skipped, mirroring the default writer.

Memory Tuning

Clinker is designed to be a good neighbor on shared servers. Rather than consuming all available memory, it works within a configurable budget and reaches for back-pressure or disk spill before it runs out.

The `memory:` block

All pipeline-level memory tuning lives under a single optional block:

pipeline:
  name: my_pipeline
  memory:
    limit: "1G"          # optional — defaults to 512M
    backpressure: pause  # optional — defaults to pause

The entire block is optional. A pipeline with no opinions about memory writes nothing:

pipeline:
  name: my_pipeline

…and gets the runtime defaults (512 MB hard limit, backpressure: pause).

Individual fields are also optional. Setting just one is fine:

pipeline:
  name: my_pipeline
  memory:
    limit: "2G"

Setting the memory limit

CLI flag (highest priority):

clinker run pipeline.yaml --memory-limit 512M

YAML config:

pipeline:
  memory:
    limit: "512M"

When --memory-limit is passed it overrides pipeline.memory.limit for that run; omit the flag and the YAML value applies unchanged, falling back to the 512 MB default only when neither is set. An empty or whitespace-only flag value — as an ops wrapper produces when it forwards an unset variable, so --memory-limit "$CLINKER_MEM" expands to --memory-limit "" — is treated exactly like omitting the flag: the YAML value (or default) applies, rather than the run aborting. Suffixes are binary (1024-based): K = 1024 bytes, M = 1024², G = 1024³; a bare integer is bytes. (This differs from the decimal KB/MB/GB used by min_size/max_size, which are 1000-based.)
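The suffix grammar can be illustrated with a short parser. This is a sketch under stated assumptions rather than Clinker's own parser: the function name parse_binary_size is hypothetical, and accepting only uppercase K/M/G is an assumption of the sketch.

```python
import re

# Binary (1024-based) multipliers, as described for memory.limit.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}

def parse_binary_size(text):
    """Parse '512M', '1G', or a bare integer (bytes) into a byte count."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip())
    if match is None:
        raise ValueError(f"malformed size: {text!r}")
    number, suffix = match.groups()
    return int(number) * _UNITS[suffix]
```

Under this grammar a decimal-style value such as 4GB does not parse, which is why it is treated as malformed rather than read as 4G.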

Default: 512M (with the binary suffix, 512 × 1024² = 536,870,912 bytes).

Invalid values: the two entry points treat a malformed limit differently. An empty or unparseable memory.limit in the YAML (for example a stray non-numeric value) falls back to the 512 MB default. A non-empty malformed --memory-limit flag, by contrast, is rejected up front with a config error that names --memory-limit and echoes the value you passed — so a typo such as the decimal 4GB (the binary suffix is 4G) fails loudly instead of silently collapsing to the default and shrinking a larger budget set in your YAML. (An empty or whitespace-only flag value is not malformed: it is treated as if the flag were omitted, as noted above.) Either way, a value whose size is well-formed but too large to represent — its scaled byte count exceeds the maximum a 64-bit counter can hold — is rejected rather than wrapping to a small budget (the YAML overflow error names memory.limit; the flag overflow error names --memory-limit). Pick a limit that fits your host’s real memory.
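The precedence and error rules above can be summarized as one resolution function. This is an illustrative sketch, not the engine's code: resolve_limit, the error wording, and the reading of the 512 MB default as 512 × 1024² bytes are assumptions made for the example.

```python
import re

DEFAULT_LIMIT = 512 * 1024**2   # assumed byte value of the default
U64_MAX = 2**64 - 1             # largest value a 64-bit counter can hold
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}

def _parse(text):
    """Return a byte count, or None when the text is not a well-formed size."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip())
    if match is None:
        return None
    return int(match.group(1)) * _UNITS[match.group(2)]

def resolve_limit(flag=None, yaml_value=None):
    """Pick the effective limit: flag first, then YAML, then the default."""
    if flag is not None and flag.strip():
        size = _parse(flag)
        if size is None:
            # A non-empty malformed flag is rejected, never silently defaulted.
            raise ValueError(f"--memory-limit: malformed value {flag!r}")
        if size > U64_MAX:
            raise ValueError(f"--memory-limit: value {flag!r} is too large")
        return size
    # An empty or whitespace-only flag falls through as if it were omitted.
    if yaml_value is not None:
        size = _parse(yaml_value)
        if size is None:
            return DEFAULT_LIMIT  # malformed YAML value falls back to the default
        if size > U64_MAX:
            raise ValueError(f"memory.limit: value {yaml_value!r} is too large")
        return size
    return DEFAULT_LIMIT
```

For instance, resolve_limit(flag="", yaml_value="2G") yields the YAML value, while resolve_limit(flag="4GB", yaml_value="2G") raises instead of shrinking the budget.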

A well-formed but undersized value is a different case: it is not a malformed flag, so it clears the boundary check, and whether it aborts the run depends on the backpressure policy. Under a producer-pausing policy (pause, the default, or both) a value below the process’s baseline resident memory is rejected at startup by the budget gate as E312. Under the non-pausing spill policy that startup gate does not fire: the run proceeds and relies on spilling to stay within the budget rather than aborting. Because --memory-limit simply populates pipeline.memory.limit, that E312 — which names the limit and echoes the offending byte value — refers to the same limit you passed via the flag.

Choosing a backpressure policy

When memory use approaches the limit (the soft threshold is 80% of the limit), the engine must relieve the pressure. The backpressure knob chooses how:

Value	Behavior
`pause` (default)	Where possible, pause an upstream reader so it stops producing until pressure eases. When a stage is about to need a paused reader’s records, the engine first spills downstream state and then proceeds, so a pause never stalls the run.
`spill`	Never pause a producer — always free memory by spilling a stage to disk.
`both`	Pause where possible, otherwise spill whichever stage is holding the most memory.

pause is the right default for most pipelines: pausing a fast Source feeding a slow downstream stage is cheaper than writing its buffered records to disk. Reach for spill or both only when you have a specific reason to prefer a different posture — for example, both when one large stage dominates the budget and you want it spilled first.

How pause and resume work

Under pause (and both), a producer paused because memory crossed the soft threshold is resumed automatically once memory recedes — it is never left parked. Pause and resume use two watermarks to avoid flapping (a hysteresis band):

Pause when live memory rises above the soft threshold, 0.80 × limit.
Resume when live memory falls back below the lower resume watermark, resume_threshold × limit (default 0.70 × limit).

Between the two watermarks nothing changes, so a normal batch-to-batch swing in memory cannot make a producer flap between paused and resumed on every poll.
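The two-watermark behavior is an ordinary hysteresis gate. The sketch below illustrates the idea only: the class name PauseGate and the idea of polling with a single live-bytes figure are assumptions for the example, not the engine's internals.

```python
class PauseGate:
    """Two-watermark gate: pause above soft, resume below the resume watermark."""

    def __init__(self, limit, soft=0.80, resume=0.70):
        self.soft_bytes = soft * limit
        self.resume_bytes = resume * limit
        self.paused = False

    def observe(self, live_bytes):
        """Update and return the paused state for one memory reading."""
        if not self.paused and live_bytes > self.soft_bytes:
            self.paused = True
        elif self.paused and live_bytes < self.resume_bytes:
            self.paused = False
        # Readings between the two watermarks leave the state unchanged.
        return self.paused

gate = PauseGate(limit=1000)
trace = [gate.observe(b) for b in (750, 810, 790, 720, 690, 750)]
```

With a limit of 1000, the readings 790 and 720 sit inside the band, so the producer stays paused until the reading drops below 700. The final reading of 750 does not re-pause it, because the pause edge is 800.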

A paused reader also never blocks the run. When the engine reaches a stage that needs a paused reader’s records, it first sheds reclaimable downstream state to disk and then resumes the reader and proceeds — so pause throttles producers under pressure but degrades to spill-and-continue at the point it would otherwise wait, rather than stalling.

`resume_threshold`

resume_threshold tunes the low watermark of that band, as a fraction of the hard limit:

pipeline:
  memory:
    limit: "1G"
    resume_threshold: 0.65   # optional — defaults to 0.70

Default: 0.70 (omit the field to take it).
Valid range: strictly greater than 0 and strictly less than the 0.80 soft threshold, so the resume point always sits below the pause point. A value outside (0, 0.80) — including 0.0, a negative, or anything ≥ 0.80 — is rejected at plan time with E324. A misspelled key is rejected as an unknown field.
Lower widens the band: a paused producer stays paused longer (smoother, but slower to re-open).
Higher narrows the band: faster to re-open, but more prone to flapping.

Only the pausing policies (pause, both) use this watermark; under spill no producer is ever paused, so it has no effect.
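As a worked example of the band for the configuration above, the following sketch computes both watermarks in bytes and applies the (0, 0.80) range rule. The function name watermarks is hypothetical, and rounding down to whole bytes is an assumption of the sketch.

```python
SOFT_FRACTION = 0.80  # pause watermark, fixed at 80% of the limit

def watermarks(limit_bytes, resume_threshold=0.70):
    """Return (pause_bytes, resume_bytes), or raise for an out-of-range threshold."""
    if not (0.0 < resume_threshold < SOFT_FRACTION):
        raise ValueError("E324: resume_threshold must be in (0, 0.80)")
    pause_bytes = int(SOFT_FRACTION * limit_bytes)
    resume_bytes = int(resume_threshold * limit_bytes)
    return pause_bytes, resume_bytes

# limit: "1G" with resume_threshold: 0.65
pause_at, resume_at = watermarks(1024**3, resume_threshold=0.65)
```

For a 1G limit this gives a pause point a little under 860 million bytes and a resume point a little under 700 million bytes, so the band is roughly 15% of the limit wide.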

Streaming batch size (`batch_size`)

pipeline.batch_size sets how many events (records plus document-boundary punctuations) a streaming-eligible stage hands off to its downstream consumer at a time over a back-pressured channel. For a fused stage (Source → Transform → Output, Merge.interleave of Sources) it bounds the in-flight working set to one batch rather than the whole stage, because the stage pulls records off a live upstream channel without ever building a full result. The other streaming stages build their full result first and stream it in batches; there the knob sizes only the inter-stage slice, not the producer’s footprint. The knob is optional; omit it to use the built-in default of 2048 events. See Streaming vs. Blocking Stages for the distinction.

pipeline:
  name: orders_rollup
  batch_size: 1024          # optional; default 2048

A per-transform override is available on a Transform’s config.batch_size (see Transform Nodes); it takes precedence over the pipeline value for that one stage. A batch_size of 0 is rejected at config load. The knob affects only the memory profile of streaming stages, never their output — blocking stages (sort, hash Aggregate, Combine build side) ignore it and continue to fully materialize. See Streaming vs. Blocking Stages for the full model.
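The effect of batching on a fused stage can be pictured with a lazy chunking generator: only one batch is resident at a time because the next one is pulled on demand. This is a sketch of the general technique, not Clinker's channel implementation; the name batches is hypothetical.

```python
from itertools import islice

def batches(events, batch_size=2048):
    """Yield lists of at most batch_size events from any iterable, lazily."""
    if batch_size <= 0:
        raise ValueError("batch_size must be greater than 0")
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# 5000 events at the default batch size produce two full batches and a remainder.
sizes = [len(b) for b in batches(range(5000), batch_size=2048)]
```

A smaller batch_size lowers the in-flight working set of such a stage at the cost of more handoffs; the records produced are the same either way.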

Behavior under memory pressure

You don’t manage memory by hand — the engine does it within the budget you set. What this means in practice:

Spillable stages always complete if disk space is available, regardless of input size. When a blocking stage (sort, hash Aggregate, large Combine) outgrows the budget, it spills to disk instead of failing.
- Range and equality+range Combine also spill. A Combine whose where: joins the two inputs on an inequality (<, <=, >, >=) — a pure band join such as orders.amount >= bands.lo and orders.amount < bands.hi, or an equality-plus-range join such as orders.region == bands.region and orders.amount >= bands.lo — runs the block-band strategy, which is spill-bounded on both axes: it external-sorts each input side to disk and accumulates its matched output in a spillable sort as well, so it completes on inputs and result sizes larger than the budget. When an equality key is present, records are grouped by that key (via its hash) before the range walk and only same-key pairs are joined; a single very common key is spread across many disk-backed blocks and joined a bounded pair at a time, so even a heavily skewed key stays within the budget. The join still aborts with E310 MemoryBudgetExceeded only as a last resort — when a single indivisible unit of work (one pair of input blocks plus the join’s scratch arrays) cannot fit the hard limit even on its own. If you hit that, give the pipeline more headroom or narrow the predicate.
Performance degrades gracefully. Under pressure you’ll see slower execution and possibly disk I/O — not a crash.
The limit is enforced reactively, not as an allocation-time wall. Momentary spikes may briefly exceed it before the engine reacts. Only if memory blows past the limit outright does the run abort with E310 MemoryBudgetExceeded, which names the stage that overran.
Shared handoff buffers spill and re-scan sequentially. A stage that fans out to several consumers, or feeds a composition input port, records the exact reader count for each output port and may spill that shared slot like any other materialized buffer. Readers run one at a time over immutable backing; spilled and mixed buffers open one file at a time for each scan, and the final reader takes the authoritative slot. Clinker never pre-forks one copy per destination. A consumer that must collect the scan into a full resident vector reserves that materialization before allocating it. If the projected overlap exceeds the hard limit, it aborts with E310 MemoryBudgetExceeded, names that consumer, and reports NodeBuffer as the budget category. More spill space can keep the shared backing off-heap, but it cannot eliminate an individual operator’s required resident working set.
One oversized correlation group trips E310 in Reshape and Cull. Both apply their rules to a whole group at once, so a group must fit the budget at finalize even though cross-group and ingest-time peaks spill. A group larger than memory.limit has no in-budget representation, so the run aborts with E310 MemoryBudgetExceeded naming the node and the offending partition_by group. Raising memory.limit clear of the reported figure is the only fix that leaves your output unchanged (finalize also holds the run’s other groups, so that figure is a floor, not a target). Dropping unread columns in an upstream Transform also shrinks the group, but both nodes write every input column through, so those columns leave the output too. Do not narrow partition_by to clear it — that key defines the group the rules evaluate over, so a narrower key makes the run succeed by changing your results. Both nodes also hold per-group bookkeeping that is O(distinct groups) and cannot spill — the spill path evicts buffered records, never group entries — but they handle an extreme partition cardinality differently, and the difference is the diagnostic you get:
- Cull fails loud. Its in-memory drop-decision state is combined with the run’s other live charged memory and checked against the budget on every admission, so a cardinality that would breach memory.limit aborts with E310 MemoryBudgetExceeded naming the decision state instead of a group.
- Reshape has no such gate. Its group map and group-order list grow with the distinct-group count unchecked, so a high-cardinality Reshape can be OOM-killed rather than reporting E310; the resulting status is platform-dependent. Keep the group count well within the budget on a Reshape; there is no engine check to catch it for you. This gap is tracked in #1027.
clinker explain --code E310 covers every surface. The rendered diagnostic names which one overran in its [...] detail; the explain page keys its remediation to that.
A Reshape partition value it cannot key is folded into the null group, silently. Where Cull aborts on a NaN, array, or map partition_by value, Reshape groups that record under null — alongside every record whose partition column is missing, explicitly null, or an empty string — and the run exits 0. Nothing in the output marks the merge, so if any of those shapes can occur in your partition column, normalize or filter it in an upstream Transform. See Values Reshape cannot key.
These are limits, not bugs. Every case above is an ordinary operational condition with a documented code, so it reports through E310 rather than as an internal error. Four conditions do still present as an internal error today despite being about your data or your host, not an engine defect:
- a NaN, array, or map value in a Cull partition_by column (Reshape folds these into the null group instead, per the bullet above);
- a runtime data error inside a Cull drop_group_when expression;
- a runtime data error inside a Reshape rule’s when, mutate.set, or synthesize expression — the direct analogue of the Cull case above. Like it, this aborts the run rather than routing the record to the dead-letter queue, even under strategy: continue;
- a spill read/write failure in a Reshape or Cull group buffer, including the spill volume filling up; #1021 tracks preserving its spill diagnostic and exit classification.
Any other internal error is a Clinker defect worth reporting.

Some stages stream (they hold only a small in-flight slice of records) and some materialize (they hold a whole stage’s worth before emitting). clinker run --explain annotates each node with buffer: streaming or buffer: materialized so you can see which stages will dominate the budget before you run. See Streaming vs. Blocking Stages for which is which.

Sizing guidelines

Workload	Recommended limit	Notes
Small files (<10 MB)	128M	Minimal memory pressure
Medium files (10–50 MB)	256M	Covers most ETL jobs
Large files or complex aggregations	512M (default) – 1G	Multiple group-by keys, large cardinality
Multiple large group-by keys	1G+	High-cardinality distinct values

Target workload: Clinker is optimized for 1–5 input files of up to 100 MB each, processing 10K–2M records per run.

Aggregation strategy interaction

Memory consumption depends heavily on the aggregation strategy the optimizer selects:

Hash aggregation accumulates state in a hash map. Memory usage is proportional to the number of distinct group-by values. With high-cardinality keys, this can consume significant memory before spill triggers.
Streaming aggregation processes groups in order and emits results as each group completes. Memory usage is minimal (proportional to a single group’s state) but requires the input to be sorted by the group-by keys.
strategy: auto (the default) lets the optimizer choose based on the declared sort order of the input. If the data arrives sorted by the group-by keys, streaming aggregation is selected automatically.

To influence strategy selection:

  - type: aggregate
    name: rollup
    input: sorted_data
    config:
      group_by: [department]
      strategy: streaming    # force streaming (input MUST be sorted)
      cxl: |
        emit total = sum(amount)

Only force streaming when you are certain the input is sorted by the group-by keys. If the data is not sorted, results will be incorrect. Use auto when in doubt.
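The difference between the two strategies, and why forced streaming needs sorted input, can be shown in a few lines. This is a generic sketch of the two techniques, not Clinker's operators; the function names and sample rows are made up for illustration.

```python
from collections import defaultdict
from itertools import groupby

def hash_sum(rows):
    """Hash aggregation: one accumulator per distinct key, any input order."""
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return sorted(totals.items())

def streaming_sum(rows):
    """Streaming aggregation: close a group as soon as the key changes."""
    return [
        (key, sum(amount for _, amount in group))
        for key, group in groupby(rows, key=lambda row: row[0])
    ]

sorted_rows = [("eng", 10), ("eng", 5), ("ops", 7)]
unsorted_rows = [("eng", 10), ("ops", 7), ("eng", 5)]
```

On sorted_rows both functions agree. On unsorted_rows the streaming version emits eng twice, once per run of adjacent rows, because it holds only the current group. How Clinker's output is wrong on unsorted input is not specified here; the split group is simply what this sketch does.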

Oversized single rows

An aggregate that keeps min, max, avg, or another value-buffering binding holds each contributing row’s raw values until the group finalizes. If a single input row’s buffered footprint is larger than the entire memory.limit, no amount of spilling can hold it — spill would only re-read the same oversized row. The engine surfaces this per-row overflow rather than absorbing it:

With error_handling.strategy: fail_fast (the default), the run aborts with E310 MemoryBudgetExceeded, naming the aggregate stage and reporting the offending row’s byte footprint against the budget.
With strategy: continue, the offending record is routed to the dead-letter queue under the aggregate_finalize category, and the run proceeds.

This is almost always a sign the budget is set far too low for the record shape — raise memory.limit so a typical row fits comfortably.

Compositions

A composition (a reusable sub-pipeline included via use:) does not get its own memory budget — its operators share the parent pipeline’s budget and spill to the same temporary directory. A spilled composition input is scanned sequentially, and its materialization stays charged continuously as ownership moves into the body; it is not briefly dropped from the accounting or charged twice. During body-input schema conversion, the old input and new records are both charged for the short interval when both allocations exist. If that materialization would exceed the hard limit, E310 names the composition call-site directly. If a later budget overrun happens inside the composition, the error names that same call-site (e.g. enrich_call) so you can locate it, prefixing the message with in composition 'enrich_call': ... when the overrun is internal to the body.

Monitoring memory usage

Use the metrics system to track peak_rss_bytes across runs:

clinker run pipeline.yaml --metrics-spool-dir ./metrics/

The metrics file includes peak_rss_bytes, which shows the maximum resident memory during execution. If this consistently approaches your memory limit, consider increasing the budget or restructuring the pipeline to reduce intermediate state.

Shared server considerations

On servers running JVM applications, memory is often at a premium. Recommendations:

Set --memory-limit or memory.limit explicitly rather than relying on the default. Know your budget.
Use --threads to limit CPU contention alongside memory limits.
Monitor peak_rss_bytes in production metrics to right-size the limit over time.
Schedule large pipelines during off-peak hours when JVM heap pressure is lower.

Storage & Spill Location

Blocking operators — Aggregate, sort, and grace-hash Combine — accumulate state in memory up to the configured budget, then spill to disk when a soft or hard memory threshold trips, rather than running the process out of memory. By default those spill files land in the operating system’s temporary directory. The [storage] block in clinker.toml lets you redirect them.

The `[storage]` block

Storage settings are a property of the workspace, not of an individual pipeline, so they live in clinker.toml at the workspace root rather than in the per-pipeline YAML:

[storage.spill]
dir = "/var/clinker/spill"   # optional; default = OS temp dir
disk_cap_bytes = "10GB"      # optional; default = unlimited
compress = "auto"            # optional; auto | off | on   (default = auto)

[storage.staging]
enabled  = false             # opt-in; default off
dir      = "/var/clinker/staging"   # required when enabled
patterns = ["/mnt/nfs/data/**"]     # which sources to stage

The whole block is optional. With no clinker.toml, or a clinker.toml that omits [storage], Clinker spills to the OS temp directory exactly as it always has.

`storage.spill.dir` — where spill files go

When dir is set, the per-run spill directory (clinker-spill-<random>/) is created under that path, and every blocking operator writes its spill files there. When dir is omitted, the per-run directory is created under the OS temp directory (std::env::temp_dir, typically $TMPDIR or /tmp).

The directory is validated once at startup, before any input is read. If the path does not exist, is a file, or is not writable, the run fails immediately with a diagnostic naming the setting:

storage.spill.dir /var/clinker/spill does not exist; create it or point at an existing volume

Validating up front — rather than at the first spill — means a misconfigured spill volume fails fast, while the run is cheap to abandon, instead of after minutes of work. (This is the trap DuckDB fell into when its temp-directory setting was honored only lazily, duckdb/duckdb#9401.)

Why redirect spill off `/tmp`

On many Linux hosts — especially systemd-managed ones — /tmp is mounted as tmpfs, which is backed by RAM (and swap), not disk. Spilling there does not actually free physical memory: the spill bytes stay resident, defeating the whole point of the memory budget. If df -T /tmp reports a tmpfs filesystem, point storage.spill.dir at a path on a real block device so spilling moves pressure off RAM and onto disk.

Inspecting the resolved spill root

clinker run --explain prints the resolved spill root and where it came from, so you can confirm the setting took effect before committing to a run:

Spill root: /var/clinker/spill [storage.spill.dir]

…or, with no configuration:

Spill root: /tmp [OS temp dir (default)]

The same --explain output reports the resolved disk cap on the next line:

Spill disk cap: 10000000000 bytes [storage.spill.disk_cap_bytes]

…or, with no cap configured:

Spill disk cap: unlimited (default)

Finally, --explain reports the resolved compression decision per spill-writing operator, so you can see which spills will be LZ4-framed (lz4) and which will be written raw (off) before the run starts. Under auto the choice varies by operator width:

Spill compression: Auto [storage.spill.compress]
  Aggregate 'totals' → lz4
  Sort 'by_amount' → off

Only operators that actually write spill files appear here: the external sort, the hash Aggregate, the grace-hash / sort-merge Combine, and the pure-range (block-band) IEJoin Combine, which external-sorts each side and writes its min/max-tagged blocks to disk and spills its matched-output sort runs the same way. The remaining in-memory join strategies — the inline hash build/probe and the equi+range IEJoin (hash-partitioned range join) — run their kernel entirely in RAM and never open a spill file, so spill compression does not apply to them and they are omitted from this list, even though they carry a spill priority for memory arbitration.

`storage.spill.disk_cap_bytes` — cap concurrent spill

By default a run will spill as much as it needs, limited only by the physical space on the spill volume. disk_cap_bytes sets a budget on the spill the run holds at once: the on-disk size of the spill files live at any moment. When that footprint would cross the cap, the run aborts with a dedicated diagnostic instead of continuing to fill the volume. Because the cap tracks what is concurrently on disk, an operator that deletes intermediate spill files as it consumes them (such as the merge that folds a heavily fragmented external sort back together) does not count those transient files twice — only the disk a run actually occupies at once is charged against the cap.

[storage.spill]
dir = "/mnt/fast-ssd/clinker-spill"
disk_cap_bytes = "50GB"

The value accepts the same human-readable byte-size grammar as the source size filters — a bare integer is bytes, and KB/MB/GB suffixes use decimal units (1GB = 1,000,000,000 bytes), matching du, df, and the AWS CLI. Omitting the key leaves spill unlimited, exactly as before.

The cap is a policy ceiling, deliberately independent of both the memory budget and the physical volume size. A run can sit well inside its memory.limit and still exhaust local disk through an unbounded stream of spill files; the cap lets an operator bound that on a shared volume. It is the guard DataFusion shipped without (apache/datafusion#15358) until production runs filled volumes.
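Charging only what is concurrently on disk amounts to a counter that goes up on write and down on delete. The sketch below illustrates that accounting; SpillCapTracker, its method names, and the error text are hypothetical, and treating a footprint exactly equal to the cap as allowed is an assumption.

```python
class SpillCapTracker:
    """Track bytes concurrently on disk and refuse writes that cross the cap."""

    def __init__(self, cap_bytes=None):
        self.cap_bytes = cap_bytes  # None means unlimited
        self.live_bytes = 0

    def charge(self, nbytes):
        """Account for a spill file about to be written."""
        if self.cap_bytes is not None and self.live_bytes + nbytes > self.cap_bytes:
            raise RuntimeError("spill disk cap exceeded")
        self.live_bytes += nbytes

    def release(self, nbytes):
        """Account for a spill file that was deleted."""
        self.live_bytes -= nbytes

tracker = SpillCapTracker(cap_bytes=100)
tracker.charge(60)    # a sort run is written
tracker.release(60)   # the run is deleted once the merge has consumed it
tracker.charge(80)    # fits: only bytes live at the same time count
```

A cumulative counter would have rejected the last write (60 + 80 exceeds 100); the concurrent counter accepts it because the first file is already gone.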

`storage.spill.compress` — LZ4 compression policy

Spill files are postcard-encoded record streams. By default each stream is wrapped in an LZ4 frame, which shrinks large spilled runs. But LZ4 carries a per-frame fixed cost — clearing the compressor’s internal state on every frame reset — and on small spills that cost can outweigh the byte savings. The LZ4 v1.8.2 release notes call this out directly, and Pentaho Kettle ships explicit guidance to turn spill compression off for small rows.

compress controls the policy:

[storage.spill]
compress = "auto"   # auto | off | on   (default = auto)

Mode	Behavior
`auto` (default)	Compress only when a spilled batch is projected large enough to amortize LZ4’s per-frame cost — both ≥ 4 KiB and ≥ 1024 rows. Below either threshold the batch is written raw. The projection comes from the operator’s schema width and the run’s `batch_size`, so the decision is made per blocking operator.
`off`	Never compress. Postcard records are written straight to disk with no LZ4 frame. Cheapest for small spills; largest on-disk size.
`on`	Always compress with an LZ4 frame. The pre-knob behavior, best for spills of large, compressible rows.

Each spill file records its compression choice in a one-byte header tag, so the read path always dispatches to the right decoder regardless of the mode the file was written with — changing the knob between runs never breaks re-reading an earlier run’s files.

The 4 KiB / 1024-row thresholds mark the empirical crossover: below them the LZ4 frame’s fixed cost dominates the small amount of compressible payload, and writing raw is faster end-to-end (the spill_compression benchmark sweeps batch sizes from 256 B to 64 KiB and confirms auto tracks the faster of on / off across the range). Most pipelines should leave compress at auto; set on when spilling wide, highly compressible rows to a space-constrained volume, and off when spills are dominated by many small batches.
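The three modes reduce to a small decision function. In this sketch the projected byte and row counts are taken as inputs, because how the engine derives them from schema width and batch_size is not spelled out here; the function name should_compress is hypothetical.

```python
MIN_BYTES = 4 * 1024   # 4 KiB
MIN_ROWS = 1024

def should_compress(mode, projected_bytes, projected_rows):
    """Return True when a spilled batch would be written inside an LZ4 frame."""
    if mode == "on":
        return True
    if mode == "off":
        return False
    if mode == "auto":
        # Both thresholds must be met; falling below either writes raw.
        return projected_bytes >= MIN_BYTES and projected_rows >= MIN_ROWS
    raise ValueError(f"unknown compress mode: {mode!r}")
```

Under auto, a batch projected at 8 KiB but only 512 rows is written raw, as is a 2048-row batch projected below 4 KiB.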

Observability — what the planner will do before you run

clinker run --explain is plan-only (it reads no input and spills nothing), so it is the safe place to see what a run would do to the spill volume and to the staging dir before committing to it. On top of the resolved spill root, disk cap, and compression decision documented above, --explain surfaces three storage-observability sections, and a real clinker run reports the matching actuals at end-of-run so you can calibrate the estimate.

A note on byte units. Three different unit conventions appear across the storage surface, and it helps to know which is which before comparing figures:

Config values you write (disk_cap_bytes = "10GB") use decimal units — 1GB = 1,000,000,000 bytes — matching du, df, and the AWS CLI (see the disk-cap grammar).
The === Estimated Spill Volume === section humanizes with binary suffixes — K/M/G = KiB/MiB/GiB — so it lines up with the predicted_peak figure on each stage’s Physical Properties line, which uses the same humanizer.
The cap-headroom line and the post-run actuals print raw bytes with no suffix, so the cap-minus-estimate subtraction and the estimate-vs-actual comparison are exact rather than rounded.

When you calibrate the estimate against the post-run actual, convert the binary estimate suffix to bytes first (1K = 1024 bytes, 1M = 1,048,576 bytes) so you are comparing the same unit the actuals report.
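The conversion can be done by hand or with a few lines of code. The helpers below are a sketch: the names are hypothetical, and they assume the estimate renders as a whole number followed by one of B/K/M/G, as in the sample output in this section.

```python
BINARY = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3}
DECIMAL = {"": 1, "KB": 1000, "MB": 1000**2, "GB": 1000**3}

def estimate_to_bytes(text):
    """Convert a binary-humanized estimate such as '5K' or '0B' to bytes."""
    return int(text[:-1]) * BINARY[text[-1]]

def cap_to_bytes(text):
    """Convert a decimal config value such as '10GB' (or a bare integer) to bytes."""
    digits = text.rstrip("KMGB")
    return int(digits) * DECIMAL[text[len(digits):]]

# Headroom for a 10GB cap against a 5K total estimate, in raw bytes.
headroom = cap_to_bytes("10GB") - estimate_to_bytes("5K")
```

Mixing the conventions skews the comparison: 5K is 5120 bytes under the binary humanizer, while a config value of 5KB would be 5000 bytes.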

Estimated spill volume per stage

The === Estimated Spill Volume === section lists one line per spill-writing stage (hash Aggregate, external sort, grace-hash / sort-merge Combine, and the pure-range block-band IEJoin Combine) with its plan-time spill-volume estimate, followed by a total. The remaining in-memory join strategies (inline hash build/probe, equi+range IEJoin) never write spill files, so they do not appear here and do not inflate the total:

=== Estimated Spill Volume ===

Estimated spill volume (per blocking stage):
  [aggregation:hash] dept_totals → 1K
  [sort] by_amount → 4K
  Total: 5K

Each figure is the operator’s coarse predicted peak live state — the same predicted_peak the Physical Properties arbitration line shows — and bytes render in binary units (K/M/G = KiB/MiB/GiB). Summing rather than maxing is the conservative choice for a preflight: two blocking operators can be live and spilled at the same time, so their footprints add.

A streaming-only pipeline (no blocking operator) has nothing that spills, so the section is omitted entirely.

Unknown stages. The estimate is seeded from input file sizes resolved at plan time. A stage whose volume cannot be known before the run renders unknown instead of a misleading 0B, and the total notes that unknown stages are excluded:

  [aggregation:hash] dept_totals → unknown
  Total (known stages): 0B (excludes stages whose volume is unknown at plan time
  — a network source, a missing or unreadable input, or a glob/regex matcher
  whose discovery fails)

The seed is known for every file-backed matcher whose files can be sized at plan time: a single-file path: source, an explicit paths: list, and a glob: or regex: matcher. A glob/regex seed runs the same discovery resolver the run uses — applying its exclude, min_size/max_size, modified_after/before, take, and sort filters — and sums the matched files’ sizes, so the estimate names exactly the bytes the run will read with no second implementation to drift. A glob/regex that matches nothing seeds zero (rendered as unknown, since there is no spill volume to preview). The seed is genuinely unknown for a network source, for a missing or unreadable input file, and for a glob/regex matcher whose discovery itself fails (an invalid pattern, or no match under on_no_match: error) — the run surfaces the same error at startup. Check the post-run actuals below to calibrate any estimate.

Staging plan per source

When storage.staging is enabled, the === Staging Plan === section reports, for each source (and each discovered file under a multi-file matcher): whether it would be staged, the resolved content-addressed staged path, and — under on_existing = reuse — the reuse-if-fresh cache decision (hit if a committed prior copy still matches the live source, miss if it would be re-staged):

=== Staging Plan ===

Source 'orders':
  /data/in/orders-2024.csv → staged: yes, path: /mnt/local/staging/3f2a…b1.staged, reuse: hit
  /data/in/orders-2025.csv → staged: yes, path: /mnt/local/staging/9c4e…07.staged, reuse: miss

The reuse prediction runs the exact freshness check (mtime + size against the committed manifest) the real run makes, read-only — --explain copies nothing. A source that matches no staging pattern reports staged: no (no pattern match, reads in place); a network source reports not stagable (network source reads in place). When staging is disabled the section states that every source reads in place.

Cap headroom

When a spill cap is configured, --explain reports the headroom (cap minus estimate) with the same per-invocation disclaimer the startup cap-headroom preflight carries, and the same 80% warning:

Cap headroom: 5000000000 bytes free (5000000000 estimated of 10000000000 cap, 50%)
  [per invocation — does NOT account for sibling invocations sharing the spill
  volume under partition-and-run]
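
The arithmetic behind that line can be sketched as follows. This is an illustration of cap minus estimate and the 80% threshold as described here, not Clinker's implementation; the function name and return shape are hypothetical:

```python
def cap_headroom(estimated_bytes: int, cap_bytes: int, warn_pct: float = 80.0):
    """Return (headroom_bytes, pct_of_cap, over_threshold) for one invocation."""
    headroom = cap_bytes - estimated_bytes
    pct = estimated_bytes / cap_bytes * 100.0
    # The warning fires once the estimate reaches the threshold.
    return headroom, pct, pct >= warn_pct

headroom, pct, warn = cap_headroom(5_000_000_000, 10_000_000_000)
print(f"{headroom} bytes free, {pct:.0f}% of cap, warn={warn}")
```

Because both inputs are raw bytes, the subtraction is exact; only the percentage is a floating-point figure.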

Machine-readable form — `--explain json`

clinker run --explain json emits the whole plan as JSON for tooling (the canvas, dashboards, CI gates). The same storage observability the text form prints lives under a structured storage_summary object, so a consumer reads per-stage spill estimates and the cap / staging summary without re-parsing prose:

{
  "schema_version": "1",
  "nodes": [ ... ],
  "node_properties": { ... },
  "storage_summary": {
    "spill_root": { "path": "/mnt/fast-ssd/clinker-spill", "source": "storage.spill.dir" },
    "spill_disk_cap_bytes": 1000000000,
    "estimated_spill": {
      "per_stage": [
        { "node_name": "dept_totals", "display_name": "[aggregation:hash] dept_totals", "estimate_bytes": 1024 },
        { "node_name": "by_amount", "display_name": "[sort] by_amount", "estimate_bytes": 4096 }
      ],
      "total_known_bytes": 5120,
      "any_unknown": false
    },
    "spill_compression": {
      "mode": "auto",
      "per_operator": [
        { "node_name": "dept_totals", "display_name": "[aggregation:hash] dept_totals", "compression": "lz4" },
        { "node_name": "by_amount", "display_name": "[sort] by_amount", "compression": "off" }
      ]
    },
    "cap_headroom": {
      "headroom_bytes": 999994880,
      "estimated_bytes": 5120,
      "cap_bytes": 1000000000,
      "pct_of_cap": 0.000512,
      "over_threshold": false
    },
    "staging": { "enabled": false, "sources": [] }
  }
}

The fields mirror the text sections one-for-one: estimated_spill is the === Estimated Spill Volume === section (a stage whose volume is unknown at plan time carries estimate_bytes: null and sets any_unknown: true), spill_compression is the Spill compression: projection, cap_headroom is the cap-headroom line (omitted when no cap is configured or the estimate is zero), and staging is the === Staging Plan === section. The JSON and DOT formats emit only their machine payload — the human-readable === Resolved Outputs === preamble the text form prints is suppressed so the output parses cleanly.
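
As an illustration of a consumer such as a CI gate, the sketch below reads only the storage_summary fields named above. The sample payload is trimmed and hand-written for the example (its numbers are invented), and storage_findings is a hypothetical helper:

```python
import json

def storage_findings(plan: dict) -> list:
    """Collect human-readable findings from a parsed --explain json plan."""
    summary = plan["storage_summary"]
    findings = []
    spill = summary["estimated_spill"]
    if spill["any_unknown"]:
        unknown = [s["node_name"] for s in spill["per_stage"]
                   if s["estimate_bytes"] is None]
        findings.append("unknown spill estimate for: " + ", ".join(unknown))
    # cap_headroom is omitted when no cap is configured or the estimate is zero.
    headroom = summary.get("cap_headroom")
    if headroom and headroom["over_threshold"]:
        findings.append(
            f"estimated {headroom['estimated_bytes']} of "
            f"{headroom['cap_bytes']} cap bytes is over the warning threshold"
        )
    return findings

# Hand-written sample, not real Clinker output.
sample_plan = json.loads("""
{
  "storage_summary": {
    "estimated_spill": {
      "per_stage": [
        {"node_name": "dept_totals", "estimate_bytes": null},
        {"node_name": "by_amount", "estimate_bytes": 4096}
      ],
      "total_known_bytes": 4096,
      "any_unknown": true
    },
    "cap_headroom": {
      "headroom_bytes": 904, "estimated_bytes": 4096, "cap_bytes": 5000,
      "pct_of_cap": 81.92, "over_threshold": true
    }
  }
}
""")

for finding in storage_findings(sample_plan):
    print(finding)
```

A gate built this way could exit non-zero whenever the list is non-empty, or treat unknown estimates as a softer warning than a cap overrun.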

Post-run actuals — calibrating the estimate

A real clinker run that spills prints a per-stage actual spill-volume section at end-of-run, so you can compare it against the --explain estimate for the same stage — the calibration loop that turns a coarse pre-run estimate into a trustworthy one over repeated runs:

=== Spill Volume (actual, per stage) ===
  dept_totals → 1048576 bytes
  by_amount → 4194304 bytes
  Total: 5242880 bytes (compare against the --explain estimate)

The per-stage breakdown sums to the pipeline-wide cumulative spill total. A run that stayed within memory spilled nothing and prints no section. A large estimate-vs-actual delta is the single highest-leverage signal when a pipeline starts spilling unexpectedly (the failure mode behind Polars’ documented 13.5× spill amplification, where an optimizer interaction turned 30 GB of input into 400 GB of spill with no per-stage visibility).
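
One way to automate the comparison is to parse the actuals section and divide by the estimate per stage. The sketch assumes the line layout shown in the sample above and stage names without spaces. The estimates dictionary is illustrative and uses the byte values from the --explain json example:

```python
import re

# Matches lines such as "  dept_totals → 1048576 bytes"; the header and
# the Total line carry no arrow, so they are skipped.
_LINE = re.compile(r"^\s*(\S+) → (\d+) bytes\s*$")

def parse_actuals(section: str) -> dict:
    """Map stage name to actual spilled bytes."""
    actuals = {}
    for line in section.splitlines():
        m = _LINE.match(line)
        if m:
            actuals[m.group(1)] = int(m.group(2))
    return actuals

def amplification(estimates: dict, actuals: dict) -> dict:
    """Actual-to-estimate ratio per stage, for stages with a nonzero estimate."""
    return {name: actuals[name] / est
            for name, est in estimates.items()
            if est and name in actuals}

sample = """=== Spill Volume (actual, per stage) ===
  dept_totals → 1048576 bytes
  by_amount → 4194304 bytes
  Total: 5242880 bytes (compare against the --explain estimate)
"""
actuals = parse_actuals(sample)
estimates = {"dept_totals": 1024, "by_amount": 4096}  # illustrative, in bytes
print(amplification(estimates, actuals))
```

Tracking the ratio per stage over repeated runs shows which operator's estimate drifts, rather than only that the total grew.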

Note on the --explain compression projection. The per-operator spill-compression decision shown under Spill compression: is projected from the same column count the operator’s runtime spill writer sees, so the projected auto verdict matches the file the run actually writes. A hash Aggregate and a grace-hash / sort-merge Combine project against their output schema (engine-stamped identity columns included), exactly the width their dispatch arms resolve compression against; an enforcer sort projects against the width of the records flowing into it — its upstream’s emitted schema — which is the width its sort buffer reads at runtime. The read path also dispatches on each spill file’s own one-byte header tag, so re-reading is robust regardless.

Distinguishing the runtime storage-abort conditions

A run that fails while spilling or staging emits one of several distinct diagnostics so a single glance at the error tells you exactly what to fix — instead of every disk and memory problem rendering as one ambiguous “out of memory” message (the trap DuckDB hit in duckdb/duckdb#14142, where a temp-dir cap was reported as “Out of Memory Error … 187.3 GiB/187.3 GiB used” and users inspected df only to find free space). The aborts split along two axes: the spill side (in-memory operator state landing on disk) and the staging side (matched source files copied to local disk before they are read).

Spill aborts

Condition	Code	What happened	What to do
Out of memory	E310	An operator’s in-RAM state crossed the hard `memory.limit` (a true RSS overrun).	Raise `memory.limit`, reduce input, or let the operator spill.
Spill cap exceeded	E320	Cumulative spill bytes crossed `storage.spill.disk_cap_bytes`. The volume may still have free space — you hit the configured budget.	Raise `disk_cap_bytes`, point `storage.spill.dir` at a larger volume, or reduce the spill footprint.
Spill volume full	E321	The OS reported the spill volume out of space (`ENOSPC`). The physical disk filled.	Free space on the volume, or move `storage.spill.dir` to a larger mount.
Spill directory unavailable	(Spill)	The spill directory went bad mid-run — unmounted, remounted read-only, deleted by a cleaner, or permissions revoked.	Remount/restore the volume; stop the over-eager cleaner.

The key separations:

E310 vs E320 — an OOM is an in-RAM overrun; a cap-exceeded is a disk-budget stop. A run can hit E320 while comfortably inside its memory envelope, so conflating the two would point you at the wrong knob.
E320 vs E321 — E320 is the budget you set; E321 is the disk itself running dry. If you removed disk_cap_bytes, an over-large run would no longer trip E320 and would instead spill until the volume filled (E321).

(A future per-operator memory-reservation surface will add a fifth, reservation-exhausted condition; it is not part of the engine yet.)

Staging-copy aborts

When storage.staging is enabled, copying a matched source to local disk can fail in three distinct ways. Like the spill split, each has its own code so a content-corruption problem never renders as a budget problem and vice versa. Staging runs before any record flows, so these surface as startup-style validation failures.

Condition	Code	What happened	What to do
Staged copy corrupt	E335	The local copy’s BLAKE3 digest did not match the source — the transport (e.g. a soft-mount NFS share) delivered different bytes than the source holds.	Re-run over a healthy transport, harden the mount, or stage from a stable snapshot. Do not set `verify = "none"` to silence it — that hides the corruption rather than fixing it.
Staging cap exceeded	E336	The cumulative bytes staged this run would cross `storage.staging.disk_cap_bytes`. The volume may still have free space — you hit the configured budget, not a full disk.	Raise `disk_cap_bytes`, point `storage.staging.dir` at a larger volume, narrow `storage.staging.patterns`, or remove the cap.
Staged copy already exists	E337	A staged copy of this source already exists and `on_existing = error` refuses to touch it.	Remove the existing copy, or switch `on_existing` to `overwrite` (re-stage) or `reuse` (reuse a fresh copy).

The same cap-vs-full-disk separation applies here as on the spill side: E336 is the budget you set (mirroring E320), so it must not render as an out-of-space message — a physically full staging volume instead surfaces as a staging I/O error (mirroring E321). E335 is distinct from a generic staging I/O error: an I/O error means the OS reported a fault, whereas E335 means the copy completed cleanly yet still does not match the source.

Startup storage validation

Before a run spawns its first source-ingest thread — after the plan compiles but before any input is read or any byte is spilled or staged — Clinker runs a single comprehensive validation pass over the resolved [storage] configuration. It rejects configurations that are physically wrong for the job, each with a stable diagnostic code, the offending clinker.toml field, and a clinker explain --code <CODE> pointer. Validating up front fails a misconfigured volume while the run is still cheap to abandon, rather than after minutes of work when the first spill or staged copy hits the bad volume.

Code	Rejected configuration	Why
E330	`storage.spill.dir` on an in-memory filesystem (Linux tmpfs / ramfs, Windows RAM disk).	Spilling there keeps the bytes in RAM, so it frees no physical memory and defeats the memory budget.
E331	`storage.spill.dir` on a network filesystem (NFS / SMB / CIFS / FUSE).	A spill target on a soft-mounted share risks silent truncation and mmap data loss — the failure modes spill exists to avoid.
E332	`storage.staging.dir` on a network filesystem.	Staging copies inputs off a flaky share; a staging dir that is itself on a share reintroduces the fragility staging exists to escape.
E333	`storage.staging.dir` on the same physical device as a matched (staged) source.	The copy moves no I/O off the source volume, so it buys nothing while still spending time and space. Applies only to matched sources.
E334	`storage.spill.dir` equal to `storage.staging.dir`.	Spill files and staged source copies are sized and cleaned up differently; sharing one directory makes accounting and cleanup ambiguous.

The filesystem-class checks (E330–E332) read the volume type through one cross-platform detection layer, so they behave identically on Linux, macOS, and Windows: Linux matches the statfs f_type magic, macOS matches the f_fstypename string, and Windows maps GetDriveTypeW. (macOS has no native tmpfs, so E330 only ever fires on Linux and Windows.) The same-device check (E333) compares the device id on Linux/macOS and the volume serial number on Windows — the very same probe the staging same-volume rule uses, so there is one consistent notion of “same device” across the whole run.
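
For intuition, the Linux/macOS half of the same-device comparison can be reproduced with a device-id check. This is a sketch of the general technique, not Clinker's probe, and it does not cover the Windows path; same_device is a hypothetical name:

```python
import os

def same_device(path_a: str, path_b: str) -> bool:
    """True when both paths resolve to the same device id (Unix st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

Running this against a staging directory and a matched source before a run is a quick way to predict whether E333 would fire.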

Free-space preflight

Separately from the runtime disk cap (E320) and the full-volume surface (E321), the startup pass runs a free-space preflight: it queries the bytes available on the spill volume and compares them to the run’s estimated spill footprint (the sum of every blocking operator’s predicted peak state, the same estimate --explain surfaces). When the spill volume looks too small, the run prints a warning and continues:

W330: spill volume /var/clinker/spill has 2000000000 bytes free but the run is
estimated to spill up to 8000000000 bytes; the run may abort with a full-volume
error (E321) at the final spill — point storage.spill.dir at a larger volume or
reduce the spill footprint (raise memory.limit, partition the input)

This is advisory, not fatal: the estimate is a coarse upper bound (it ignores spill compression and the streaming drain), so the run may well finish within the available space. The warning exists so a long pipeline that would die at its final spill surfaces that risk before it runs for an hour, rather than after. The free-space query uses a cross-platform probe (statvfs on Unix, GetDiskFreeSpaceExW on Windows) that returns a 64-bit byte count, so the historical 32-bit f_bavail truncation never affects the comparison.
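
The comparison behind the warning can be approximated outside Clinker. The sketch uses Python's shutil.disk_usage for the free-space query and assumes the warning fires when free bytes fall below the estimate; the exact rule Clinker applies is not stated here, and free_space_warning is a hypothetical helper:

```python
import shutil

def free_space_warning(spill_dir: str, estimated_spill_bytes: int):
    """Return a warning string when free space is below the estimate, else None."""
    free = shutil.disk_usage(spill_dir).free
    if free < estimated_spill_bytes:
        return (f"spill volume {spill_dir} has {free} bytes free but the run "
                f"is estimated to spill up to {estimated_spill_bytes} bytes")
    return None
```

Returning a message instead of raising mirrors the advisory stance: the caller decides whether to continue.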

Cap-headroom preflight

When storage.spill.disk_cap_bytes is configured, the same startup pass also runs a cap-headroom preflight: it compares the run’s estimated spill volume to the configured cap and warns when the estimate reaches 80% of the cap. Unlike the free-space preflight (which probes the physical volume), this checks the run against the policy ceiling you set, so it fires even on a volume with plenty of free space:

W331: this run is estimated to spill up to 9000000000 bytes, which is 90% of the
configured spill cap storage.spill.disk_cap_bytes (10000000000 bytes); the run
may abort with a spill-cap error (E320) before it finishes — raise disk_cap_bytes
or reduce the spill footprint (raise memory.limit, partition the input). This
headroom is per invocation: if you partition the input and run several clinker
invocations against the same spill volume and cap, they share the cap, so the
real headroom is smaller than this figure

Like W330, this is advisory, not fatal — the estimate is a coarse upper bound, so a run that compresses well or never trips its memory budget may finish comfortably under the cap. It fires on a normal clinker run (before ingestion, at startup), not only under --explain, so an operator sees the signal on the real run even when they did not explicitly inspect the plan first.

Per-invocation accounting. The cap and the headroom figure are scoped to a single clinker invocation. Under the partition-and-run model — where you split a large input by file or key and launch several clinker processes that share one spill volume and one disk_cap_bytes — the physical spill volume is shared by every sibling, so the real headroom is smaller than any one invocation’s figure. The warning text states this explicitly rather than silently presenting a per-invocation number as a whole-volume guarantee. Clinker is single-process by design (one invocation = one OS process), so the engine cannot see its siblings; the disclaimer is the honest stance.

Mid-run spill failures

The startup check guarantees the spill directory is writable when the run begins, but it can still go bad mid-run — an NFS share remounts read-only, a volume unmounts, an over-eager temp-file cleaner deletes the directory, or permissions are revoked. When a spill write fails because the directory has vanished or become read-only, the run aborts cleanly with a distinct diagnostic rather than a generic I/O error or a panic:

spill directory /var/clinker/spill became unavailable mid-run: No such file or directory
(the directory may have been unmounted, remounted read-only, deleted by an
external cleaner, or had its permissions revoked)

This surfaces the directory-level cause directly, so the fix (remount the volume, stop the cleaner, restore permissions) is obvious from the message.
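
A wrapper that reacts to spill I/O failures could separate the directory-level causes from a full volume by errno. The grouping below is an assumption made for illustration; the set of OS errors Clinker maps to this diagnostic is not specified here:

```python
import errno

# Assumed errno values for "the directory went away or became read-only".
_DIR_UNAVAILABLE = {errno.ENOENT, errno.EROFS, errno.EACCES, errno.EPERM}

def classify_spill_error(err: OSError) -> str:
    """Bucket an OSError from a spill write into a coarse cause."""
    if err.errno == errno.ENOSPC:
        return "volume-full"             # the E321 condition
    if err.errno in _DIR_UNAVAILABLE:
        return "directory-unavailable"   # the mid-run condition above
    return "other-io"
```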

Crash purge of orphaned spill directories

A run’s spill directory (clinker-spill-<random>/) is normally removed when the run ends — a clean exit, a run that aborts with a fatal error, or even a panic all delete it. But a SIGKILL, the Linux OOM-killer, or a power loss kills the process before that cleanup runs, leaking the directory and every spill file inside it. Over many crashed runs that fills the spill volume.

To prevent that, a run cleans up orphaned spill directories at startup — but only when a spill directory is explicitly configured (storage.spill.dir), before it creates its own. It removes only directories left by dead runs and never touches one a concurrent run is still using.

When storage.spill.dir is not set, the spill root defaults to the OS temp directory (std::env::temp_dir, typically $TMPDIR or /tmp), and no startup purge runs there. In the default case a run still cleans up its own spill directory on every exit short of a hard kill; a directory leaked into the OS temp directory by a hard kill is the operating system’s temp-reaper’s responsibility, not Clinker’s. The purge is confined to a configured spill root because Clinker owns that volume but does not own the shared OS temp directory.

`storage.staging` — opt-in source staging

Reading source files directly from a network share (NFS, SMB) couples every run to the share’s availability and quirks: a soft-mount can silently truncate a read, and latency multiplies across many small files. Source staging copies matched source files to a local volume before the pipeline reads them, so the run works from stable local copies. It is off by default and activated per workspace by pattern match — pipelines that don’t opt in behave exactly as before.

[storage.staging]
enabled        = true
dir            = "/var/clinker/staging"   # required when enabled
patterns       = [
    "/mnt/nfs/data/**",
    "//fileserver/share/**",
]
disk_cap_bytes = "50GB"   # optional; cap on bytes copied per run (default unlimited)
verify         = "blake3" # optional; blake3 | none   (default blake3)
on_existing    = "overwrite" # optional; overwrite | reuse | error (default overwrite)
cleanup        = "on_success" # optional; on_success | always | never (default on_success)

Key	Default	Meaning
`enabled`	`false`	Master switch. When `false`, `patterns` is ignored and every source reads in place.
`dir`	—	Local directory the copies are written under. Required when `enabled`.
`patterns`	`[]`	Glob patterns selecting which source paths to stage. A source is staged only when `enabled` and its path matches at least one pattern. Empty ⇒ nothing is staged.
`disk_cap_bytes`	unlimited	Cumulative cap on bytes copied per run. Same byte-size grammar as the spill cap (`"50GB"`, bare integers are bytes).
`verify`	`blake3`	Post-copy integrity check. `blake3` hashes source and copy and requires a match — the only check that catches a soft-mount’s silent truncation. `none` skips the check.
`on_existing`	`overwrite`	What to do when a staged copy of this source already exists from a prior run: `overwrite` re-copies unconditionally; `reuse` reuses the existing copy only when it is still fresh (the source’s modification time and size match what was recorded when it was staged), otherwise re-copies; `error` fails the run rather than touch the existing copy. See The staging cache below.
`cleanup`	`on_success`	When staged copies are deleted relative to the run’s outcome: `on_success` removes them after a clean exit but keeps them after a failure so the operator can inspect the exact inputs the failed run saw; `always` removes them regardless; `never` keeps them as a persistent reuse cache for a later `reuse` run. See Cleanup.

Pattern matching

patterns uses the same glob grammar as a source’s exclude: list. Each pattern is tested against both the full path and the basename, so /mnt/nfs/** matches a deep path by its full path while *.csv matches any CSV by basename. ** crosses directory boundaries; * does not.
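
The two rules (full path or basename; ** crosses separators, * does not) can be modelled with a small glob-to-regex translation. This sketch covers only * and ** and treats every other character literally, so it is an approximation of the grammar rather than Clinker's matcher:

```python
import posixpath
import re

def glob_to_regex(pattern: str) -> "re.Pattern":
    """Translate a glob using only '*' and '**' into an anchored regex."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")        # '**' may cross '/' boundaries
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # '*' stays within one path segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))

def matches(pattern: str, path: str) -> bool:
    """Test the pattern against the full path, then against the basename."""
    rx = glob_to_regex(pattern)
    return bool(rx.fullmatch(path) or rx.fullmatch(posixpath.basename(path)))
```

Under this model, /mnt/nfs/** selects every file below /mnt/nfs, while /mnt/* selects only direct children of /mnt.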

Startup validation

When enabled, staging is validated once at startup, before any input is opened, so a misconfiguration fails the run immediately rather than at the first copy. The run is refused when:

dir is unset.
dir does not exist, is a file, or is not writable (probed with a real create-and-delete, so a read-only mount or restrictive ACL is caught).
a patterns entry is not a valid glob.
dir sits on the same volume as a matched source. Staging within one volume copies bytes without moving I/O off the slow share — a well-documented anti-pattern — so it is refused up front rather than left to surface as a confusingly slow pipeline. The check compares the source’s and the staging dir’s storage volume (the device id on Linux/macOS, the volume mount root on Windows); point dir at a local disk on a different volume.

The same-volume rule applies only to matched sources: a source the patterns don’t select reads in place, so its volume is irrelevant.

How a file is staged

Staging copies the matched source to your local staging directory once, then verifies the copy against the source (with verify = blake3, the default, a content mismatch fails the run with E335). From then on the pipeline reads from the local copy. The same source always resolves to the same staged file, so a later run can find and reuse a prior copy.
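
The copy-then-verify step can be sketched in a few lines. BLAKE3 is not in Python's standard library, so this example substitutes SHA-256 purely as a stand-in; the function names are hypothetical and the error raised is a plain OSError, not E335:

```python
import hashlib
import shutil

def file_digest(path: str) -> str:
    """Hash a file in 1 MiB chunks (SHA-256 here, standing in for BLAKE3)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def stage_and_verify(source: str, staged: str) -> None:
    """Copy source to staged, then fail if the two digests differ."""
    shutil.copyfile(source, staged)
    if file_digest(source) != file_digest(staged):
        raise OSError(f"staged copy of {source} does not match the source")
```

Hashing both sides after the copy is what catches a transport that returned short or altered bytes without reporting an error.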

The staging cache (`on_existing`)

Because staged copies live at stable paths, a copy from a prior run is still on disk when the next run starts (unless cleanup removed it). on_existing decides what happens when that prior copy is found:

Mode	Behavior
`overwrite` (default)	Always re-stage. The prior copy is removed and the source is copied fresh. The safe default: a copy from a crashed run must not be trusted.
`reuse`	Reuse the prior copy only when it is still fresh — the source’s current modification time and size both match what was recorded when it was staged. A fresh match skips the copy entirely (no bytes read off the share, nothing charged against the disk cap). A changed mtime or size means the source was rewritten, so the copy is stale and is re-staged.
`error`	Fail the run with a clear diagnostic if a staged copy already exists, rather than overwrite or reuse it. For workflows that want an explicit “the cache is already populated” stop.

reuse is the mode that turns staging into a cache: re-running the same pipeline over an unchanged network share copies nothing on the second run. The freshness check is mtime + size, not a re-hash, so it is cheap.
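
A freshness check of this kind needs only two stat fields. The manifest layout below (a dictionary with mtime_ns and size) is a hypothetical stand-in for whatever Clinker records:

```python
import os

def record_manifest(source: str) -> dict:
    """Capture the fields the freshness check compares."""
    st = os.stat(source)
    return {"mtime_ns": st.st_mtime_ns, "size": st.st_size}

def is_fresh(source: str, manifest: dict) -> bool:
    """True when the live source still matches the recorded mtime and size."""
    st = os.stat(source)
    return (st.st_mtime_ns == manifest["mtime_ns"]
            and st.st_size == manifest["size"])
```

The trade-off is the usual one for mtime-based caches: a rewrite that preserves both the size and the modification time would go unnoticed, which is the price of not re-reading the source.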

Staging is safe to run from several clinker invocations at once over a shared staging volume: a source is copied exactly once no matter how many runs race for it, a run always reads a complete copy, and no run fails because a sibling was reading, cleaning up, or re-staging the same source.

Cleanup (`on_success` | `always` | `never`)

cleanup decides when a run’s staged copies are removed, keyed on the run’s outcome:

Mode	Behavior
`on_success` (default)	Remove the copies after a clean exit; keep them after a failure (or an interrupted / DLQ-producing run) so the operator can inspect the exact inputs the run saw and re-run without re-fetching.
`always`	Remove the copies when the run ends, success or failure.
`never`	Keep the copies indefinitely as a persistent reuse cache. Combine with `on_existing = reuse` to make repeated runs over a stable source copy-free. The operator reclaims the staging dir manually (or lets the next run’s crash purge eventually reap stale entries).

Each staged file’s manifest is removed alongside it, so cleanup never leaves a manifest pointing at a staged file that is gone.
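
The table reduces to a small decision function. Here run_succeeded means a clean exit; a failed, interrupted, or DLQ-producing run counts as not succeeded. The function name is hypothetical:

```python
def should_remove_staged(cleanup: str, run_succeeded: bool) -> bool:
    """Decide whether staged copies are removed at the end of a run."""
    if cleanup == "always":
        return True
    if cleanup == "never":
        return False
    if cleanup == "on_success":
        return run_succeeded
    raise ValueError(f"unknown cleanup mode: {cleanup!r}")
```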

Crash purge of orphaned artifacts

A SIGKILL, the Linux OOM-killer, or a power loss can kill a run before its cleanup runs, leaving half-finished staging artifacts behind. To stop those from accumulating, every run cleans up leftover artifacts from dead runs at startup, before it stages anything. A complete staged copy is the reuse cache and is always kept; only incomplete leftovers are reclaimed.

File permissions

Staged copies hold verbatim source records — potentially PII, credentials, or financial data — so on Unix they are created with owner-only permissions. On Windows staged files inherit the staging directory’s permissions, so restrict the directory if the volume is shared with other users.

Crash durability and the parent-directory fsync

Staged copies survive a crash: a later run finds a complete file or nothing at all, never a half-written one.
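
The general POSIX technique that gives this all-or-nothing property is write to a temporary name, fsync the file, rename into place, then fsync the parent directory. The sketch below shows that pattern on Unix; it is not a description of Clinker's exact steps, and the .partial suffix is a hypothetical naming choice:

```python
import os

def durable_write(data: bytes, final_path: str) -> None:
    """Write data so a crash leaves either the complete file or no file."""
    directory = os.path.dirname(os.path.abspath(final_path))
    tmp_path = final_path + ".partial"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())            # file contents reach the disk
    os.replace(tmp_path, final_path)    # atomic rename within one filesystem
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                # the rename itself reaches the disk
    finally:
        os.close(dir_fd)
```

Without the final directory fsync, the file's bytes can be durable while the directory entry that names it is not, so the file may vanish after a power loss.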

Streaming vs. Blocking Stages

Every node in a pipeline is one of two kinds at runtime, and the difference is what keeps Clinker’s memory bounded:

Streaming stages pass records through without holding the whole input. Their memory footprint stays small no matter how large the input is.
Blocking stages must see their entire input before they can produce any output, so they accumulate state. They stay within the memory budget and spill to disk when it gets tight, rather than holding everything in RAM.

A pipeline’s peak memory is therefore set by its largest blocking stage, not by the total size of all stages combined.

Which stages stream

Source → Transform → Output chains — records flow straight from the reader through the transform to the writer.
Output — a sink always streams its records to the configured writer.
Route — predicate fan-out passes records through.
Merge — concatenation or interleaving passes records through.
Aggregate with strategy: streaming — when the input is pre-sorted on the group key, each group is emitted as soon as the key advances, so the whole input is never held. (See Aggregate Nodes.)
The probe (driver) side of a hash Combine — the driver streams against the already-built lookup table.

Document boundaries (the signals behind $doc.*) flow inline with records through streaming stages, so a document’s close always trails its last record.

Which stages block

A stage blocks when its result depends on records it has not seen yet:

sort — the full input must be present before the first sorted record is known.
Hash Aggregate — a group’s final value depends on every member, so the group table (one entry per distinct group key) must be held until the whole input has been read. (A streaming-strategy Aggregate over pre-sorted input is the exception above.)
A Combine’s build side — the lookup table is built in full before any driver record is matched. The probe side streams; the build side materializes.
Time-windowed and correlation-key Aggregates — these hold their group state for windowing or for the correlation commit, so they materialize.

A blocking stage keeps its accumulated state inside pipeline.memory.limit and spills to disk when the budget gets tight.
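
The distinction is easy to see with generators. In this toy model (which has no memory budget or spilling), the streaming transform pulls one record per output, while the sort must consume its whole input before yielding anything. All names are illustrative:

```python
def counting_source(n, log):
    """Yield n records, appending to log each time one is read."""
    for i in range(n):
        log.append(i)
        yield {"id": i, "amount": float(n - i)}

def streaming_transform(records):
    """Streaming: each output depends only on the current record."""
    for rec in records:
        yield {**rec, "amount_cents": round(rec["amount"] * 100)}

def blocking_sort(records, key):
    """Blocking: the first output is unknown until every record is seen."""
    return iter(sorted(records, key=key))

log = []
first = next(streaming_transform(counting_source(5, log)))
print("streaming read", len(log), "record(s) before first output")

log = []
first = next(blocking_sort(counting_source(5, log), key=lambda r: r["amount"]))
print("sort read", len(log), "record(s) before first output")
```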

Seeing the classification

clinker run <pipeline>.yaml --explain annotates every node with its class in the Physical Properties section:

output.report:
  buffer: streaming

aggregation.dept_totals:
  buffer: materialized

buffer: streaming marks a stage that holds only a small in-flight slice; buffer: materialized marks one that holds a whole stage’s output and may spill it. The annotation comes from the same classifier the executor uses at runtime, so what --explain reports is exactly what happens. See Explain Plans and Memory Tuning.

Tuning the batch size

The number of records a streaming stage hands downstream at a time is set by pipeline.batch_size (default 2048), with an optional per-transform override. Smaller batches lower in-flight memory at the cost of more per-batch overhead; larger batches do the reverse. The batch size changes only the memory profile of streaming handoffs — never their output, and never the behavior of blocking stages.
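
Batching of this kind changes how many records are in flight, not what comes out. A minimal model, with a hypothetical batches helper:

```python
from itertools import islice

def batches(records, batch_size=2048):
    """Yield lists of up to batch_size records, preserving order."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

print([len(b) for b in batches(range(5000))])
```

Concatenating the batches reproduces the input exactly, whatever the batch size.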

Optimizing Pipelines

Clinker keeps memory bounded and spills to disk automatically, so most pipelines run fine with no tuning at all. When you do need a pipeline to run faster or in less memory, a handful of authoring choices do nearly all the work. This page is the practical checklist; the engine mechanics behind each tip live in the separate Engine Internals book.

Let stages stream instead of buffer

The cheapest pipeline is one where records flow straight through without being held in memory. A Source → Transform → Output chain streams end to end — no intermediate stage is materialized. You get this automatically; the things that break it are fan-out (a Route with several branches, an output that forks) and blocking operators (sort, hash aggregation, the build side of a Combine).

Practical implication: keep the hot path simple. A filter-and-reshape job that’s just Source → Transform → Output already runs at minimal memory. See Streaming vs. Blocking Stages for which operators stream and which block.

Make aggregation stream with `sort_order`

A hash Aggregate holds one entry per distinct group key in memory — fine for low-cardinality keys, expensive for high-cardinality ones. If your input is already sorted on the group-by keys, declare it:

- type: source
  name: txns
  config:
    type: csv
    path: ./data/transactions_sorted.csv
    sort_order:
      - { field: account_id, order: asc }
    schema:
      - { name: account_id, type: string }
      - { name: amount, type: float }

- type: aggregate
  name: per_account
  input: txns
  config:
    group_by: [account_id]
    cxl: |
      emit total = sum(amount)

With a matching sort_order, the optimizer switches the aggregate to streaming — it emits each group as the key advances and holds only one group at a time, regardless of cardinality. To make the requirement explicit (and turn a silent fallback to hash aggregation into a compile error), set strategy: streaming. See Aggregate Nodes → Strategy hint.

sort_order is trusted, not verified — if the data isn’t actually sorted, streaming aggregation produces wrong results. Only declare it when you’re sure.
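
The failure mode is easy to reproduce with a key-advance aggregator. The sketch uses itertools.groupby, which (like a streaming aggregate) closes a group whenever the key changes; it models the idea only and is not Clinker's operator:

```python
from itertools import groupby
from operator import itemgetter

def streaming_sum(records, key="account_id", value="amount"):
    """Emit (key, total) each time the key advances; holds one group at a time."""
    for k, group in groupby(records, key=itemgetter(key)):
        yield k, sum(r[value] for r in group)

sorted_rows = [
    {"account_id": "A", "amount": 10},
    {"account_id": "A", "amount": 5},
    {"account_id": "B", "amount": 7},
]
unsorted_rows = [sorted_rows[0], sorted_rows[2], sorted_rows[1]]

print(list(streaming_sum(sorted_rows)))
print(list(streaming_sum(unsorted_rows)))
```

On the unsorted input, account A is emitted twice with partial totals, which is exactly the wrong result a false sort_order declaration would produce.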

Choose the Combine driver side deliberately

A Combine holds each non-driving (build-side) input in memory as a lookup table, then streams the driver against it. So:

Put the smaller relation on the build side and drive with the larger stream — you iterate the big input once and keep only the small one resident. Plan for roughly 1.5–2× the build file’s size in memory.
The driver also sets output order and which side’s correlation identity propagates, so pick it for those reasons too. See Combine Nodes and Correlation Keys → Combine interaction.

A large build side isn’t a failure — the join spills to disk automatically — but spilling is slower than staying in memory, so choosing the driver so that the smaller input lands on the build side is the main lever.
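
The build/probe shape behind this advice looks roughly like the following in-memory sketch. It assumes an inner equi-join on one key and leaves out spilling entirely; hash_combine is a hypothetical name:

```python
from collections import defaultdict

def hash_combine(build_rows, driver_rows, key):
    """Materialize the build side as a lookup table, then stream the driver."""
    table = defaultdict(list)
    for row in build_rows:              # build side: held in memory in full
        table[row[key]].append(row)
    for row in driver_rows:             # driver side: read once, in order
        for match in table.get(row[key], ()):
            yield {**match, **row}

departments = [{"dept": 1, "name": "eng"}]
employees = [{"dept": 1, "emp": "a"}, {"dept": 2, "emp": "b"}, {"dept": 1, "emp": "c"}]
print(list(hash_combine(departments, employees, "dept")))
```

Resident memory scales with the build side only, and output order follows the driver, which is why the smaller relation belongs on the build side.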

Size the memory budget

The default budget is 512 MB. Raise it when a pipeline does high-cardinality aggregation or large joins and you have the RAM; lower it to be a good neighbor on a shared box. Blocking stages spill to disk as they approach the budget rather than failing; the run aborts with E310 only if in-RAM state still crosses the hard limit.

pipeline:
  name: my_pipeline
  memory:
    limit: "1G"

Full sizing guidance and the backpressure knob are in Memory Tuning.

Reduce intermediate state

Less data in flight means less to buffer and spill:

Filter early. Drop records you don’t need in the first Transform, before they reach a blocking stage.
Project narrowly. Emit only the fields downstream stages actually use; carrying wide records through a sort or aggregate costs memory per row.
Aggregate before joining when you can — feeding a small rolled-up relation into a Combine is cheaper than joining raw rows and aggregating after.

Confirm with `--explain` and metrics

Before running, clinker run pipeline.yaml --explain annotates each node with buffer: streaming or buffer: materialized, so you can see which stages will dominate memory. After a run, the metrics spool reports peak_rss_bytes — if it consistently approaches your limit, raise the budget or cut intermediate state. See Explain Plans.

Metrics & Monitoring

Clinker writes per-execution metrics as JSON files to a spool directory. These files can be collected into an NDJSON archive for ingestion into monitoring systems.

Enabling metrics

There are three ways to enable metrics collection, listed from highest to lowest priority:

CLI flag:

clinker run pipeline.yaml --metrics-spool-dir ./metrics/

Environment variable:

export CLINKER_METRICS_SPOOL_DIR=./metrics/
clinker run pipeline.yaml

YAML config:

pipeline:
  metrics:
    spool_dir: "./metrics/"

When metrics are enabled, each execution writes one JSON file to the spool directory, named <execution_id>.json.
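The priority order above (CLI flag, then environment variable, then YAML) can be pictured as a simple fallback chain. This is a sketch of the documented precedence, not Clinker's actual implementation; the function name is hypothetical, and treating an empty string as "unset" is an assumption.

```python
def resolve_spool_dir(cli_flag, env, yaml_value):
    """Pick the metrics spool directory by documented priority.

    cli_flag:   value of --metrics-spool-dir, or None
    env:        mapping of environment variables (e.g. os.environ)
    yaml_value: pipeline.metrics.spool_dir from the YAML, or None
    """
    if cli_flag:
        return cli_flag
    env_value = env.get("CLINKER_METRICS_SPOOL_DIR")
    if env_value:
        return env_value
    # May be None, in which case metrics collection stays disabled.
    return yaml_value
```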

Metrics schema

Each metrics file follows schema version 3. The collector rejects spool files written under an older schema version, so upgrading clinker across a schema bump means draining the spool first.

{
  "execution_id": "01912345-6789-7abc-def0-123456789abc",
  "schema_version": 3,
  "pipeline_name": "customer_etl",
  "config_path": "/opt/clinker/pipelines/daily_etl.yaml",
  "hostname": "prod-etl-01",
  "started_at": "2026-04-11T10:00:00Z",
  "finished_at": "2026-04-11T10:00:05Z",
  "duration_ms": 5000,
  "exit_code": 0,
  "records_total": 50000,
  "records_ok": 49950,
  "records_written": 49950,
  "records_dlq": 50,
  "execution_mode": "Streaming",
  "peak_rss_bytes": 134217728,
  "thread_count": 4,
  "input_files": ["./data/customers.csv"],
  "output_files": ["./output/enriched.csv"],
  "dlq_path": "./output/errors.csv",
  "error": null,
  "retraction": {
    "groups_recomputed": 0,
    "partitions_dispatched": 0,
    "iterations": 0,
    "degrade_fallback_count": 0,
    "synthetic_ck_columns_emitted_total": 0,
    "synthetic_ck_fanout_lookups_total": 0,
    "synthetic_ck_fanout_rows_expanded_total": 0
  },
  "per_source_record_counts": { "customers": 50000 },
  "per_source_dlq_counts": { "customers": 50 }
}

Field reference

Field	Type	Description
`execution_id`	string	UUID v7 or custom `--batch-id` value
`schema_version`	integer	Schema version of this payload; currently `3`
`pipeline_name`	string	The `name` from the pipeline YAML
`config_path`	string	Absolute path to the config file
`hostname`	string	Machine hostname
`started_at`	string	ISO 8601 UTC timestamp
`finished_at`	string	ISO 8601 UTC timestamp
`duration_ms`	integer	Wall-clock duration in milliseconds
`exit_code`	integer	Process exit code (see Exit Codes)
`records_total`	integer	Total records read from the primary source
`records_ok`	integer	Distinct source records that reached at least one output. Under inclusive Route fan-out one input matching N branches counts once
`records_written`	integer	Total writes across all sinks. Equals `records_ok` for single-output exclusive pipelines; exceeds it under inclusive Route fan-out or multiple Output sinks
`records_dlq`	integer	Records routed to the dead-letter queue
`execution_mode`	string	DAG-derived execution summary: `Streaming` (no full-stage materialization required) or `TwoPass` (a blocking stage forces an accumulation pass)
`peak_rss_bytes`	integer/null	Peak resident set size in bytes, sampled across chunk boundaries on Linux, macOS, and Windows. `null` on platforms where RSS sampling is unavailable
`thread_count`	integer	Thread pool size used
`input_files`	array	Paths to all source files
`output_files`	array	Paths to all output files written
`dlq_path`	string/null	Path to the DLQ file, or null if none
`error`	string/null	Error message on exit 1/3/4, or null on success (exit 0) and partial success (exit 2)
`retraction`	object	Correlation-key retraction counters (see below). All-zero on strict pipelines, which never enter the relaxed loop
`per_source_record_counts`	object	Ingest record count per Source node, keyed by node name. A source that read zero records is present with a count of `0`
`per_source_dlq_counts`	object	DLQ entry count per Source node; sources with zero DLQ entries are absent. The values sum to at most `records_dlq` — see the note below

The sum of per_source_dlq_counts values is at most records_dlq, and can be less: a failure in a Combine emit or a post-aggregate row is not traceable to a single declared source, so it is counted in records_dlq but not in this per-source breakdown. For pipelines whose dead-letters all originate at a declared source, the two match exactly.

The retraction object carries the relaxed correlation-key retraction orchestrator’s counters: groups_recomputed, partitions_dispatched, iterations, degrade_fallback_count, synthetic_ck_columns_emitted_total, synthetic_ck_fanout_lookups_total, and synthetic_ck_fanout_rows_expanded_total. Every field is 0 on strict pipelines and on relaxed pipelines that never trigger a retraction. See Correlation Keys for the underlying mechanism.
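A few of the relationships stated in the field reference can be checked mechanically before a record is trusted by a dashboard. The following is a minimal sketch, assuming a record has already been parsed from JSON into a dict; the function name and the wording of the messages are hypothetical.

```python
def check_metrics(record: dict) -> list[str]:
    """Return human-readable problems found in one parsed metrics record."""
    problems = []

    # The field reference documents schema version 3.
    if record.get("schema_version") != 3:
        problems.append(f"unexpected schema_version: {record.get('schema_version')}")

    # per_source_dlq_counts sums to at most records_dlq.
    per_source_dlq = sum(record.get("per_source_dlq_counts", {}).values())
    if per_source_dlq > record["records_dlq"]:
        problems.append("per_source_dlq_counts sum exceeds records_dlq")

    # error is null on exit 0 and 2, and set on exit 1, 3, and 4.
    if record["exit_code"] in (0, 2) and record["error"] is not None:
        problems.append("error should be null on exit 0 or 2")
    if record["exit_code"] in (1, 3, 4) and record["error"] is None:
        problems.append("error should be set on exit 1, 3, or 4")

    return problems
```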

Collecting metrics

The spool directory accumulates one file per execution. Use clinker metrics collect to sweep them into an NDJSON archive:

clinker metrics collect \
  --spool-dir ./metrics/ \
  --output-file ./metrics/archive.ndjson \
  --delete-after-collect

This appends all spool files to the archive (one JSON object per line) and removes the originals. The NDJSON format is compatible with most log aggregation and monitoring tools.

Preview without writing:

clinker metrics collect \
  --spool-dir ./metrics/ \
  --output-file ./metrics/archive.ndjson \
  --dry-run

Integration with monitoring systems

Grafana / Prometheus

Parse the NDJSON archive with a log shipper (Promtail, Filebeat, Vector) and create dashboards tracking:

duration_ms – execution time trends
records_dlq – data quality over time
peak_rss_bytes – memory utilization

Datadog

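Ship NDJSON to Datadog Logs, then create metrics from log attributes. One option is to have the Datadog Agent tail the archive through a custom log source (this requires logs_enabled: true in datadog.yaml):

# /etc/datadog-agent/conf.d/clinker.d/conf.yaml
logs:
  - type: file
    path: /var/log/clinker/metrics.ndjson
    service: clinker
    source: clinker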

ELK Stack

Filebeat can ingest NDJSON directly:

# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/clinker/metrics.ndjson
    json.keys_under_root: true

Simple alerting with jq

For environments without a full monitoring stack, use jq to query the archive directly:

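# Find all runs with DLQ entries in the last 24 hours
# (assumes whole-second UTC timestamps, as in the sample payload)
jq 'select(.records_dlq > 0 and (.started_at | fromdateiso8601) > (now - 86400))' metrics/archive.ndjson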

# Find runs that exceeded 400MB RSS
jq 'select(.peak_rss_bytes > 419430400)' metrics/archive.ndjson

# Average duration by pipeline
jq -s 'group_by(.pipeline_name) | map({
  pipeline: .[0].pipeline_name,
  avg_ms: (map(.duration_ms) | add / length)
})' metrics/archive.ndjson
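Where jq is not available, the same per-pipeline average can be computed with a few lines of Python. This is a sketch under the assumption that every line of the archive is one metrics object carrying pipeline_name and duration_ms; the function name is hypothetical.

```python
import json
from collections import defaultdict


def average_duration_by_pipeline(ndjson_lines):
    """Average duration_ms per pipeline_name over an iterable of NDJSON lines."""
    totals = defaultdict(lambda: [0, 0])  # name -> [sum of duration_ms, run count]
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the archive
        record = json.loads(line)
        bucket = totals[record["pipeline_name"]]
        bucket[0] += record["duration_ms"]
        bucket[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}
```

A file object can be passed directly, for example `average_duration_by_pipeline(open("metrics/archive.ndjson"))`.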

Operational recommendations

Always enable metrics in production. The overhead is negligible (one small JSON write at the end of each run).
Run clinker metrics collect --delete-after-collect on a schedule (e.g., hourly) to prevent spool directory growth.
Use --batch-id with meaningful identifiers to correlate metrics across retries and environments.
Alert on records_dlq > 0 to catch data quality regressions early.
Track peak_rss_bytes trends to anticipate when memory limits need adjustment.

Exit Codes & Error Diagnosis

Clinker uses structured exit codes to communicate the outcome of a pipeline run. These codes are designed for integration with schedulers, cron, CI systems, and monitoring tools.

Exit code reference

Code	Meaning	Description
0	Success	Pipeline completed. All records processed successfully.
1	Configuration error	Invalid YAML, CXL syntax error, type mismatch, or DAG wiring problem. Fix the pipeline configuration.
2	Partial success	Pipeline ran to completion, but some records were routed to the dead-letter queue. Check the DLQ file.
3	Evaluation error	CXL runtime error during record processing (e.g., division by zero, type coercion failure).
4	I/O error	File not found, permission denied, disk full, or input format mismatch.

Understanding exit code 2

Exit code 2 is not a crash. It means:

The pipeline started and ran to completion.
All viable records were processed and written to output files.
Some records could not be processed and were diverted to the dead-letter queue.

Your scheduler should treat exit code 2 as a warning, not a failure. The DLQ file contains the problematic records along with the error that caused each one to be rejected.

To control when exit code 2 escalates to a hard failure, use --error-threshold:

# Abort if more than 100 records hit the DLQ
clinker run pipeline.yaml --error-threshold 100

With a threshold set, the pipeline aborts (exit code 3) when the DLQ count exceeds the threshold, rather than continuing to completion.

Diagnosing failures

Exit code 1: Configuration error

The error message includes a span-annotated diagnostic pointing to the exact location of the problem:

Error: CXL type error in node 'transform_1'
  --> pipeline.yaml:25:15
   |
25 |   emit total = amount + name
   |                ^^^^^^^^^^^^^ cannot add Int and String

Action: Fix the YAML or CXL expression indicated in the diagnostic, then re-run with --dry-run to confirm the fix.

Exit code 2: Partial success (DLQ entries)

Check the DLQ file for details:

# The DLQ path is shown in the run output and in metrics
cat output/errors.csv

Common causes:

Null values in fields that a CXL expression does not handle
Data that does not match the declared schema (e.g., non-numeric value in an integer column)
Coercion failures between types

Action: Review the DLQ records, fix the data or add null handling to CXL expressions, and re-run.

Exit code 3: Evaluation error

A CXL expression failed at runtime. The error message includes the failing expression and the record that triggered it:

Error: division by zero in node 'compute_ratio'
  expression: emit ratio = total / count
  record: {total: 500, count: 0}

Action: Add guard conditions to the CXL expression:

emit ratio = if count == 0 then 0 else total / count

Exit code 4: I/O error

File system or format errors:

Error: file not found: ./data/customers.csv
  --> pipeline.yaml:8:12

Common causes:

Input file does not exist or path is wrong
Permission denied on input or output directories
Output file already exists (use --force to overwrite)
Disk full during output writing
Input file format does not match the declared type (e.g., invalid CSV)

Action: Fix file paths, permissions, or disk space, then re-run.

Plan-time diagnostic codes

The process exit codes above tell a scheduler whether the run succeeded. The E### codes below appear inside the structured Error: messages that a configuration error (exit code 1) prints, and identify the specific compile-time check that rejected the pipeline. The codes below cover the event-time watermark and time-windowed aggregate surface (issue #61) together with the source-schema checks listed alongside them; related code sets live in Pipeline Variables, Channels, and Correlation Keys.

Code	Trigger	Remediation
E154	A source declares `watermark.column: <col>` but `<col>` is not present in that source’s `schema:` block.	Add the column to `schema:`, or remove the `watermark:` block.
E155	A source declares `watermark.column: <col>` and the column exists, but its declared CXL type is not `date_time` or `date`.	Change the column’s `type:` to `date_time` or `date`, or point `watermark.column` at a column that already has one of those types.
E156	An aggregate declares `time_window:` but at least one upstream-reachable source does not declare `watermark.column`.	Add `watermark: { column: <event-time-column> }` to each listed source, or remove `time_window:` from the aggregate. Without a watermark on every upstream source, `min_across_sources` never advances past `None` and the window can never close.
E157	A source declares an external `schema:` file (`schema: path.schema.yaml`) that could not be read or parsed as a `SourceSchema`.	Fix the file path or its contents. A schema file is a bare column list or a multi-record `discriminator:`/`records:` map — it may not itself point at another schema file.
E158	A source column’s declared type is (or wraps) the inference-only `numeric` union.	Declare a concrete `int` or `float`. `numeric` is `int | float` resolved during type unification and never carries into a compiled source schema.
E159	A source pairs a `generated` schema with a non-EDI format.	`generated` (engine-synthesized positional columns) is valid only for the EDI-family formats (`edifact`, `x12`, `hl7`, `swift`). Declare an explicit column list for any other format.

See Source Nodes → Watermarks and Aggregate Nodes → Time-windowed aggregates for the field semantics each code is enforcing.

DLQ category: LateRecord

When a time-windowed aggregate sees a record whose event time falls inside an already-closed window (window_end + allowed_lateness < min_across_sources), the engine routes the record to the DLQ instead of attempting to fold it into a finalized accumulator. This mirrors Flink’s sideOutputLateData and Spark Structured Streaming’s late-data drop.

The DLQ row carries:

_cxl_dlq_error_category = late_record
_cxl_dlq_stage = time_window:<aggregate-name>
_cxl_dlq_error_detail — the closed window’s [start, end) bounds as i64 nanoseconds since the Unix epoch

Tune watermark.delay (source-side, applies before any aggregate) or allowed_lateness (operator-side, applies per aggregate) to absorb expected out-of-order tails before they reach this path.
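The closed-window condition quoted above is a single comparison. The sketch below restates it in Python to make the roles of the three quantities explicit; the function name is hypothetical, all values are assumed to be nanoseconds since the Unix epoch, and treating a missing watermark (None) as "never late" is an assumption based on the E156 description.

```python
def window_is_closed(window_end_ns, allowed_lateness_ns, min_across_sources_ns):
    """True when a record landing in this window would be a late record.

    Implements: window_end + allowed_lateness < min_across_sources
    """
    if min_across_sources_ns is None:
        # No watermark has advanced yet, so no window can have closed.
        return False
    return window_end_ns + allowed_lateness_ns < min_across_sources_ns
```

Raising allowed_lateness moves the left-hand side forward, so the same watermark closes the window later and fewer records reach the DLQ.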

Scheduler integration

For running Clinker under a workflow orchestrator (Temporal, Airflow, Dagster) — mapping these exit codes onto a retry policy, plus the cancellation and output-atomicity guarantees — see Running Under a Workflow Orchestrator.

Cron script

#!/bin/bash
set -euo pipefail

PIPELINE=/opt/clinker/pipelines/daily_etl.yaml
METRICS_DIR=/var/spool/clinker/

# Capture the exit code; a bare non-zero exit would abort the script under `set -e`
EXIT=0
clinker run "$PIPELINE" \
  --memory-limit 512M \
  --log-level warn \
  --metrics-spool-dir "$METRICS_DIR" \
  --force || EXIT=$?

case $EXIT in
  0)
    echo "$(date): Success" >> /var/log/clinker/daily_etl.log
    ;;
  2)
    echo "$(date): Warning - DLQ entries produced" >> /var/log/clinker/daily_etl.log
    mail -s "Clinker ETL Warning: DLQ entries" ops@company.com < /dev/null
    ;;
  *)
    echo "$(date): FAILURE (exit code $EXIT)" >> /var/log/clinker/daily_etl.log
    mail -s "Clinker ETL FAILURE (exit $EXIT)" ops@company.com < /dev/null
    ;;
esac

exit $EXIT

CI pipeline (GitHub Actions)

- name: Validate pipeline config
  run: clinker run pipeline.yaml --dry-run
  # Exit code 1 fails the build on config errors

- name: Smoke test with real data
  run: clinker run pipeline.yaml --dry-run -n 100
  # Catches runtime evaluation errors

Systemd

Systemd Type=oneshot services interpret non-zero exit codes as failures. To allow exit code 2 (partial success) without triggering service failure:

[Service]
Type=oneshot
SuccessExitStatus=2
ExecStart=/opt/clinker/bin/clinker run /opt/clinker/pipelines/daily_etl.yaml --force

Production Deployment

Clinker is a single statically-linked binary with no runtime dependencies. Deployment is straightforward: copy the binary to the server.

Installation

# Copy the binary
scp target/release/clinker user@server:/opt/clinker/bin/

# Verify it runs
ssh user@server /opt/clinker/bin/clinker --version

No JVM, no Python, no container runtime required.

Recommended directory structure

/opt/clinker/
  bin/
    clinker                    # The binary
  pipelines/
    daily_etl.yaml             # Pipeline configs
    weekly_report.yaml
  data/                        # Input data (or symlinks to data locations)
  output/                      # Output files
  rules/                       # CXL module files (for use statements)
  metrics/                     # Metrics spool directory

Create a dedicated user:

sudo useradd --system --home-dir /opt/clinker --shell /usr/sbin/nologin clinker
sudo chown -R clinker:clinker /opt/clinker

Systemd service

For scheduled one-shot execution:

[Unit]
Description=Clinker ETL - Daily Customer Processing
After=network.target

[Service]
Type=oneshot
ExecStart=/opt/clinker/bin/clinker run /opt/clinker/pipelines/daily_etl.yaml \
  --memory-limit 512M \
  --log-level warn \
  --metrics-spool-dir /var/spool/clinker/ \
  --force
WorkingDirectory=/opt/clinker
User=clinker
Group=clinker
SuccessExitStatus=2

# Resource limits
MemoryMax=1G
CPUQuota=200%

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=clinker-daily

[Install]
WantedBy=multi-user.target

Pair with a systemd timer for scheduling:

[Unit]
Description=Run Clinker daily ETL at 2 AM

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target

sudo systemctl enable --now clinker-daily.timer

Note: SuccessExitStatus=2 tells systemd that exit code 2 (partial success with DLQ entries) is not a service failure. See Exit Codes for the full reference.

Cron scheduling

# Each crontab entry must stay on one line; cron does not support backslash continuation

# Run daily at 2 AM, log to syslog
0 2 * * * /opt/clinker/bin/clinker run /opt/clinker/pipelines/daily_etl.yaml --log-level warn --force 2>&1 | logger -t clinker

# Collect metrics hourly
0 * * * * /opt/clinker/bin/clinker metrics collect --spool-dir /var/spool/clinker/ --output-file /var/log/clinker/metrics.ndjson --delete-after-collect

Environment-based configuration

Use the CLINKER_ENV variable or --env flag to activate environment-specific overrides:

# Production
CLINKER_ENV=production clinker run pipeline.yaml

# Staging
CLINKER_ENV=staging clinker run pipeline.yaml

Combined with channel overrides in the pipeline YAML, this allows a single pipeline definition to target different file paths, connection strings, or thresholds per environment.

Logging

Log levels for production

Level	Use case
`warn`	Recommended for production cron jobs. Prints warnings and errors only.
`info`	Default. Includes progress messages. Useful during initial deployment.
`error`	Minimal output. Only prints when something fails.
`debug`	Troubleshooting. Generates significant output.
`trace`	Development only. Extremely verbose.

Directing logs

To syslog via logger:

clinker run pipeline.yaml --log-level warn 2>&1 | logger -t clinker

To a log file:

clinker run pipeline.yaml --log-level warn 2>> /var/log/clinker/etl.log

Systemd journal captures stdout and stderr automatically when running as a service.

DLQ monitoring

When a pipeline exits with code 2, records that could not be processed are written to the dead-letter queue file. Set up a daily check:

#!/bin/bash
# Check for DLQ files produced today
DLQ_DIR=/opt/clinker/output/
DLQ_FILES=$(find "$DLQ_DIR" -name "*_errors.csv" -mtime 0 -size +0c)

if [ -n "$DLQ_FILES" ]; then
    mail -s "Clinker DLQ Alert" ops@company.com <<EOF
The following DLQ files were produced today:

$DLQ_FILES

Review the files and address data quality issues.
EOF
fi

Batch ID for tracing

Use --batch-id with a meaningful, consistent naming scheme:

# Date-based
clinker run pipeline.yaml --batch-id "daily-$(date +%Y-%m-%d)"

# Include environment
clinker run pipeline.yaml --batch-id "prod-daily-$(date +%Y-%m-%d)"

The batch ID appears in metrics output and log lines, making it easy to correlate a specific run across logs, metrics, and DLQ files. On retries, keep the same base ID and append an attempt suffix (e.g., -retry-1) so that attempts remain distinguishable while still sharing a common prefix.

Upgrades

To upgrade Clinker:

Validate the new version against your pipelines:

/opt/clinker/bin/clinker-new run pipeline.yaml --dry-run

Replace the binary:

cp /opt/clinker/bin/clinker-new /opt/clinker/bin/clinker

Verify:

/opt/clinker/bin/clinker --version

There is no configuration migration. Pipeline YAML files are forward-compatible within the same major version.

Running Under a Workflow Orchestrator

Clinker is a finite batch job wrapped in a single CLI binary. That makes it a natural activity body (or task, or step) for an external workflow orchestrator — Temporal, Airflow, Dagster, Prefect, or a plain systemd timer. The orchestrator owns scheduling, retries, timeouts, and cross-job state; Clinker owns one bounded-memory pass over finite input.

This page is the behavioral contract that boundary depends on: how a clinker run reports its outcome, how it responds to cancellation, and what it guarantees about its output files. It is written so an orchestration author can wire a retry policy and a cancellation timeout against stable, intended behavior rather than reverse-engineering it.

Temporal is used as the worked example throughout because its RetryPolicy, heartbeat_timeout, and cancellation model map cleanly onto Clinker’s exit codes and signal handling. The same contract applies to any orchestrator that runs Clinker as a child process.

The shell-out model

Clinker embeds no Temporal client, no orchestrator SDK, and no worker runtime. It does not connect to a Temporal service, poll a task queue, or report activity results over any wire protocol. Running Clinker under an orchestrator is deliberately a shell-out: the orchestrator’s own worker spawns clinker run … as a child process and reads back three things —

the process exit code (the primary signal; see below),
log output on stderr (--log-level), and
optionally the metrics spool file written on completion (--metrics-spool-dir; see Metrics & Monitoring).

Keeping the orchestrator client out of Clinker is a decided non-goal: the coupling lives in a thin worker you write, not in the engine.

Exit codes → retry policy

Clinker’s exit codes are stable and structured for exactly this purpose. The full diagnostic reference is Exit Codes & Error Diagnosis; the table below adds the retry decision an orchestrator should make for each.

Exit	Meaning	What happened	Orchestrator handling (Temporal)
`0`	Success	All records processed; every output committed.	Activity success.
`2`	Partial success	Ran to completion, but some records were routed to the dead-letter queue.	Activity success — the run finished. Inspect the DLQ counts; escalate only under your own data-quality policy.
`1`	Configuration / compile error	Rejected deterministically before any data was written: invalid YAML, CXL type or syntax error, schema-binding failure, DAG-wiring problem, or an unsatisfiable memory budget.	Non-retryable `ApplicationError`. The same config and input will fail identically — retrying wastes attempts.
`3`	Fatal data error	`fail_fast` tripped, the `--error-threshold` or DLQ-rate ceiling was exceeded, a NaN appeared in a `group_by` key, or a CXL runtime / accumulator error halted the run.	Non-retryable by default — deterministic on the same input. Fix the data or the pipeline, then re-run.
`4`	I/O / system error	File not found, permission denied, disk full, spill cap exceeded, thread-pool failure, or a malformed input file.	Retryable with bounded backoff — many causes are transient. Cap `maximum_attempts`, because some (a genuinely malformed input, a spill cap smaller than the working set) are deterministic and will exhaust the budget.
`130`	Interrupted	A SIGINT or SIGTERM drained the run and it exited.	Treat as cancellation, not failure (see below).

Reading the table into a `RetryPolicy`

Fail fast (do not retry): 1 and 3. These are deterministic on the same input. Surface them as non-retryable so the workflow fails immediately instead of burning the retry budget. In Temporal, raise a non-retryable ApplicationError (or list its error type in the RetryPolicy’s non_retryable_error_types).
Retry with backoff: 4. Transient infrastructure faults (a not-yet-arrived input file, a full or contended volume, a locked resource) usually clear on their own. A bounded exponential backoff is safe here because single-file output is atomic (see Output atomicity and retry-safety): a failed attempt leaves nothing half-written for the retry to trip over. Keep maximum_attempts finite so a permanently-bad input eventually fails the workflow rather than retrying forever.
Success, with a caveat: 2. Exit 2 means the pipeline finished — it is success-with-warnings, not a crash. The rejected records are in the configured DLQ file; the counts (records_dlq, dlq_path, per_source_dlq_counts) are in the metrics spool (Metrics & Monitoring). Do not treat DLQ presence as a hard failure unless you choose to. To make DLQ volume a hard failure inside Clinker, set --error-threshold N, which converts an overflow into exit 3. To make it a workflow decision, let the activity succeed and branch on the DLQ count in your workflow code.
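The retry decisions above reduce to a small lookup that a worker can apply to the child's exit code. The sketch below is one way to express it; the Outcome names and the classify_exit function are hypothetical, and raising on an unrecognized code (for example a negative return code after SIGKILL) is an assumption that leaves that decision to the caller.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    SUCCESS_WITH_DLQ = "success_with_dlq"  # exit 2: finished, inspect DLQ counts
    NON_RETRYABLE = "non_retryable"        # exit 1 and 3: deterministic failures
    RETRYABLE = "retryable"                # exit 4: retry with bounded backoff
    CANCELLED = "cancelled"                # exit 130: interrupted, not a failure


_EXIT_TABLE = {
    0: Outcome.SUCCESS,
    2: Outcome.SUCCESS_WITH_DLQ,
    1: Outcome.NON_RETRYABLE,
    3: Outcome.NON_RETRYABLE,
    4: Outcome.RETRYABLE,
    130: Outcome.CANCELLED,
}


def classify_exit(code: int) -> Outcome:
    """Map a clinker exit code onto the handling described in the table."""
    try:
        return _EXIT_TABLE[code]
    except KeyError:
        # Not a documented exit code; the caller decides how to treat it.
        raise ValueError(f"undocumented exit code: {code}") from None
```

In a Temporal activity, NON_RETRYABLE would typically become a non-retryable ApplicationError, RETRYABLE an ordinary raised error that the RetryPolicy handles, and the two success outcomes a normal return value.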

Cancellation and SIGTERM contract

Clinker installs a process-wide handler for both SIGINT and SIGTERM (the ctrlc crate’s “termination” feature). Either signal requests a graceful drain: in-flight work finishes, worker threads join, already completed outputs are committed, and the process exits 130 with the run marked interrupted.

Cancellation is bounded-latency, not instantaneous. The shutdown request is a flag that the executor polls at well-defined points:

at operator chunk boundaries during streaming dispatch, and
every 4096 records while a blocking operator (sort, aggregate) builds its in-memory arena.

Between two poll points the run keeps working, so the worst-case delay from signal to exit is roughly one chunk (or one 4096-record build slice) of processing time. Size your timeouts accordingly:

Set Temporal’s heartbeat_timeout and the cancellation grace period comfortably above the worst-case time to process one chunk on your largest input. A grace period shorter than that will escalate a healthy, draining run to a hard kill.
The worker must translate a Temporal cancellation into a child SIGTERM (which Clinker drains cleanly), and use SIGKILL only as a last resort after the grace period expires. SIGKILL cannot be caught, so it skips the graceful drain — but the atomic-output guarantee below still protects the final files.
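The SIGTERM-then-SIGKILL escalation described above can be sketched with the standard library alone. This is an illustration of the worker-side pattern on POSIX systems, not part of Clinker; the function name is hypothetical and grace_seconds stands in for whatever grace period you have sized against your largest chunk.

```python
import signal
import subprocess


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float) -> int:
    """Ask a child process to drain, and hard-kill only after the grace period.

    Returns the child's return code (negative signal number if it died
    from a signal it did not handle).
    """
    if proc.poll() is None:
        # SIGTERM requests the graceful drain.
        proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # Last resort: SIGKILL cannot be caught, so no drain happens.
        proc.kill()
        return proc.wait()
```

A Clinker child that drains in time would return 130 here, which the worker should report as a cancellation rather than a failure.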

What a cancelled run leaves behind

If the interrupt lands during streaming dispatch, the run drains what is in flight and any completed single-file output is atomically renamed into place — durable, never truncated. That output reflects a partial pass (fewer input records than a full run), so a retry should overwrite it, not append to it.
If the interrupt lands during a blocking-operator build, no final output is produced and the writing temp file is preserved for inspection.

Either way the exit code is 130 and there is never a truncated final file. A partial DLQ plus an interrupt still reports 130 — the interrupt takes precedence over the DLQ-partial code 2.

Output atomicity and retry-safety

Single-file outputs are written atomically: each output streams into a sibling temp file in the same directory, and only after the pipeline completes successfully is that temp file renamed into the final path (followed by an fsync of the parent directory for crash durability). This is exercised by the CLI’s atomic_output_test suite.

The consequences an orchestrator can rely on:

A killed or failed attempt leaves no half-written final file. On failure the temp file is kept and its path logged at WARN for operator inspection, but the final path is untouched — a retry sees a clean slate.
A retry can overwrite cleanly. By default a pre-existing output aborts the run (exit 4); pass --force to allow the new attempt to replace a previous attempt’s output. Alternatively, an output’s if_exists: unique_suffix policy hands each attempt a distinct, race-safe path (the reservation uses create_new, so concurrent attempts never clobber one another).

Caveat — multi-file outputs are not yet atomic. Fan-out outputs (one file per source file) and split: outputs write directly to their final paths rather than through the temp-then-rename path. A killed attempt can leave a partial fan-out or split file behind. The atomic guarantee above applies to single-file outputs; for multi-file outputs, a retry should treat the output directory as untrusted and clear it first.
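For readers unfamiliar with the temp-then-rename pattern, the sketch below shows the general technique in Python on a POSIX filesystem. It is an illustration of the pattern the text describes, not Clinker's Rust implementation; the function name and the ".tmp-" prefix are hypothetical.

```python
import os
import tempfile


def atomic_write(final_path: str, data: bytes) -> None:
    """Write data so that final_path is either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(final_path))

    # 1. Stream into a sibling temp file in the same directory (same filesystem).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
        tmp.flush()
        os.fsync(tmp.fileno())

    # 2. Rename into place; the rename is atomic on POSIX filesystems.
    os.replace(tmp_path, final_path)

    # 3. fsync the parent directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If the write in step 1 fails, the exception propagates and the temp file is left behind while the final path stays untouched, which matches the behavior described above for failed attempts.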

Idempotency: what re-running a batch means today

Clinker treats every run as a fresh, finite pass over its input. There is no checkpoint, no resume cursor, and no incremental state carried between attempts: a retry reprocesses all input from the start. “Cancel + retry” therefore means discard the partial output and recompute the whole batch — which, combined with atomic single-file output and --force, is safe to wire into a RetryPolicy.

Two conditions make retries idempotent:

Stable input. Clinker reads whatever is on disk at run time. Retries are idempotent only if the input files do not change between attempts — stage inputs immutably (or point each logical batch at a fixed snapshot) so attempt n+1 sees exactly what attempt n saw.
A stable correlation key. Pass --batch-id <ID> to carry a meaningful identifier — e.g. the Temporal workflow or run id — across every attempt of one logical batch. It appears in the metrics spool and log lines, so all attempts correlate under one key. (Without it, each invocation generates its own UUID v7.)
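Both conditions are easiest to hold when every attempt is launched from the same argument list. The helper below is a sketch of that idea using only flags that appear in this guide; the function name and its parameters are hypothetical, and the caller is assumed to pass a snapshot path that does not change between attempts.

```python
def clinker_command(pipeline_path, batch_id, spool_dir, binary="clinker"):
    """Build the argv for one attempt of a logical batch.

    The same batch_id is reused across attempts so they correlate in
    metrics and logs; --force lets a retry replace a previous attempt's output.
    """
    return [
        binary, "run", pipeline_path,
        "--batch-id", batch_id,
        "--metrics-spool-dir", spool_dir,
        "--force",
    ]
```

The resulting list can be handed to subprocess.run or subprocess.Popen as-is, with the workflow or run id supplied as batch_id.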

Heartbeating a long run

For a long-running activity, Temporal relies on heartbeats to distinguish a slow-but-healthy run from a wedged one well before start_to_close_timeout expires. Two rules apply to Clinker:

Size heartbeat_timeout above the cancellation latency described above. Because cancellation is bounded-latency (chunk / 4096-record granularity), a heartbeat timeout tighter than one processing slice can reap a run that is making progress.
Heartbeat details are advisory progress, not a resume cursor. Clinker cannot resume from a heartbeated offset — every attempt is a full pass. Use heartbeat details for liveness and observability only; never feed them back as a “start from record N” input.

What is observable for liveness today is log output (--log-level) and the metrics spool written at completion (--metrics-spool-dir). A structured, machine-readable run report and a live progress stream suitable for richer heartbeat details are tracked upstream in issue #622 and are not available yet; do not build a contract on them until they land.

CSV-to-CSV Transform

This recipe reads employee data from a CSV file, computes salary tiers using CXL expressions, and writes the enriched result to a new CSV file.

Input data

employees.csv:

id,name,department,salary
1,Alice Chen,Engineering,95000
2,Bob Martinez,Marketing,62000
3,Carol Johnson,Engineering,88000
4,Dave Williams,Sales,71000
5,Eva Brown,Marketing,58000
6,Frank Lee,Engineering,102000

Pipeline

salary_tiers.yaml:

pipeline:
  name: salary_tiers

nodes:
  - type: source
    name: employees
    config:
      name: employees
      type: csv
      path: "./employees.csv"
      schema:
        - { name: id, type: int }
        - { name: name, type: string }
        - { name: department, type: string }
        - { name: salary, type: int }

  - type: transform
    name: classify
    input: employees
    config:
      cxl: |
        emit id = id
        emit name = name
        emit department = department
        emit salary = salary
        emit level = if salary >= 90000 then "senior" else "junior"
        emit salary_band = match {
          salary >= 100000 => "100k+",
          salary >= 90000 => "90-100k",
          salary >= 70000 => "70-90k",
          _ => "under 70k"
        }

  - type: output
    name: report
    input: classify
    config:
      name: salary_report
      type: csv
      path: "./output/salary_report.csv"

error_handling:
  strategy: fail_fast

Run it

# Validate first
clinker run salary_tiers.yaml --dry-run

# Preview output
clinker run salary_tiers.yaml --dry-run -n 3

# Full run
clinker run salary_tiers.yaml

Expected output

output/salary_report.csv:

id,name,department,salary,level,salary_band
1,Alice Chen,Engineering,95000,senior,90-100k
2,Bob Martinez,Marketing,62000,junior,under 70k
3,Carol Johnson,Engineering,88000,junior,70-90k
4,Dave Williams,Sales,71000,junior,70-90k
5,Eva Brown,Marketing,58000,junior,under 70k
6,Frank Lee,Engineering,102000,senior,100k+

Key points

Schema declaration. The source node declares the schema explicitly with typed columns. This enables compile-time type checking of CXL expressions – if you write salary + name, the type checker catches the error before any data is read.

Emit statements. Each emit in the transform produces one output column. The output schema is defined entirely by the emit statements – input columns that are not emitted are dropped. This is intentional: explicit output schemas prevent accidental data leakage.

Match expressions. The match block evaluates conditions top to bottom and returns the value of the first matching arm. The _ wildcard is the default case and must appear last.

Error handling. The fail_fast strategy aborts the pipeline on the first record error. For production pipelines processing dirty data, consider continue instead, which routes the failing record to the dead-letter queue and keeps going – see Error Handling & DLQ.
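
The first-match-wins rule for match blocks can be modeled in a few lines of Python. This is only a sketch of the evaluation order described under "Match expressions", not Clinker's evaluator; the function name salary_band and the arms list are illustrative.

```python
# Illustrative model of a CXL match block: arms are checked top to bottom
# and the first arm whose condition holds supplies the value.
def salary_band(salary: int) -> str:
    arms = [
        (lambda s: s >= 100000, "100k+"),
        (lambda s: s >= 90000, "90-100k"),
        (lambda s: s >= 70000, "70-90k"),
    ]
    for condition, value in arms:
        if condition(salary):
            return value
    return "under 70k"  # plays the role of the `_` wildcard arm


for salary in (95000, 62000, 102000):
    print(salary, salary_band(salary))
```

Arm order matters: if the 70000 arm were listed first, a salary of 95000 would land in "70-90k" because that arm would match before the narrower ones were tried.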

Variations

Filtering records

Add a filter statement to exclude records:

  - type: transform
    name: classify
    input: employees
    config:
      cxl: |
        filter salary >= 60000
        emit id = id
        emit name = name
        emit salary = salary

Records where salary < 60000 are dropped silently – they do not appear in the output or the DLQ.

Computed columns with type conversion

      cxl: |
        emit id = id
        emit name = name
        emit monthly_salary = (salary.to_float() / 12.0).round(2)
        emit salary_display = "$" + salary.to_string()

The .to_float() conversion is required because salary is declared as int and division by a float literal requires matching types.
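
As a quick check of the arithmetic, the same two expressions can be mirrored in Python for one sample salary. This is a stand-in for the CXL, not CXL itself; in particular, how CXL's round behaves at exact ties is not stated in this guide, so the Python result is only indicative.

```python
# Python stand-in for the two CXL expressions above, using Alice's salary.
salary = 95000  # int, as declared in the source schema

# salary.to_float() / 12.0, then .round(2)
monthly_salary = round(float(salary) / 12.0, 2)

# "$" + salary.to_string()
salary_display = "$" + str(salary)

print(monthly_salary, salary_display)
```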

Multi-Input Combine

This recipe enriches order records with product metadata from a separate catalog stream using a combine node. Combine is a first-class N-ary operator: every input is declared up front, and the where expression uses qualified field references (orders.product_id, products.product_id) to express the join.

Input data

orders.csv:

order_id,product_id,quantity,unit_price
ORD-001,PROD-A,5,29.99
ORD-002,PROD-B,2,149.99
ORD-003,PROD-A,1,29.99
ORD-004,PROD-C,10,9.99
ORD-005,PROD-B,3,149.99

products.csv:

product_id,product_name,category
PROD-A,Widget Pro,Hardware
PROD-B,DataSync License,Software
PROD-C,Cable Kit,Hardware

Pipeline

order_enrichment.yaml:

pipeline:
  name: order_enrichment

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: csv
      path: "./orders.csv"
      schema:
        - { name: order_id, type: string }
        - { name: product_id, type: string }
        - { name: quantity, type: int }
        - { name: unit_price, type: float }

  - type: source
    name: products
    config:
      name: products
      type: csv
      path: "./products.csv"
      schema:
        - { name: product_id, type: string }
        - { name: product_name, type: string }
        - { name: category, type: string }

  - type: combine
    name: enrich
    input:
      orders: orders
      products: products
    config:
      where: "orders.product_id == products.product_id"
      match: first
      on_miss: null_fields
      cxl: |
        emit order_id = orders.order_id
        emit product_id = orders.product_id
        emit product_name = products.product_name
        emit category = products.category
        emit quantity = orders.quantity
        emit unit_price = orders.unit_price
        emit line_total = orders.quantity.to_float() * orders.unit_price
      propagate_ck: driver

  - type: output
    name: result
    input: enrich
    config:
      name: enriched_orders
      type: csv
      path: "./output/enriched_orders.csv"

Run it

clinker run order_enrichment.yaml --dry-run
clinker run order_enrichment.yaml --dry-run -n 3
clinker run order_enrichment.yaml

Expected output

output/enriched_orders.csv:

order_id,product_id,product_name,category,quantity,unit_price,line_total
ORD-001,PROD-A,Widget Pro,Hardware,5,29.99,149.95
ORD-002,PROD-B,DataSync License,Software,2,149.99,299.98
ORD-003,PROD-A,Widget Pro,Hardware,1,29.99,29.99
ORD-004,PROD-C,Cable Kit,Hardware,10,9.99,99.90
ORD-005,PROD-B,DataSync License,Software,3,149.99,449.97

How combine works

A combine node declares every input in its input: map, binding each upstream stream to a qualifier used inside expressions:

- type: combine
  name: enrich
  input:
    orders: orders        # qualifier: upstream_node
    products: products
  config:
    where: "orders.product_id == products.product_id"
    propagate_ck: driver

The config: block carries four fields that shape behavior:

where – a CXL boolean expression. Every field reference must be qualified with its input name. The expression must contain at least one cross-input equality (e.g. orders.product_id == products.product_id); additional range or arbitrary conjuncts can be combined with and.
match – first (default), all, or collect. See below.
on_miss – null_fields (default), skip, or error. Applies only to records on the driving input that find no match.
cxl – emit statements that shape the output row. Under match: collect, this field must be empty; the combine node auto-derives the output schema.

Match modes

`match: first`

Emit one output row per driver record, using the first matching build-side record. This is the standard 1:1 enrichment. When no match exists, the behavior is governed by on_miss.

config:
  where: "orders.product_id == products.product_id"
  match: first

`match: all`

Emit one output row for every matching build-side record. This is 1:N fan-out – if a driver record matches three build records, three rows are emitted.

- type: combine
  name: expand_benefits
  input:
    employees: employees
    benefits: benefits
  config:
    where: "employees.department == benefits.department"
    match: all
    cxl: |
      emit employee_id = employees.employee_id
      emit benefit = benefits.benefit_name
    propagate_ck: driver

An employee in a department with three benefits produces three output records.

`match: collect`

Gather every matching build-side record into a single Array-typed field on the output row. The driver record appears once; the build matches are aggregated into a list. The cxl: body must be empty under match: collect – the combine node synthesizes the output as { driver fields..., <build_qualifier>: Array }.

- type: combine
  name: gather
  input:
    orders: orders
    products: products
  config:
    where: "orders.product_id == products.product_id"
    match: collect
    cxl: ""
    propagate_ck: driver

Use collect when you need the set of matches as a single structured value (e.g. every price history row for an order). Use all when you need one flat row per match.
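
The difference between the three match modes is easiest to see side by side. The following Python sketch models them with a toy hash join; it is not Clinker's implementation, the function name combine is hypothetical, and unmatched drivers are simply dropped here so that on_miss does not distract from the comparison.

```python
from collections import defaultdict


def combine(drivers, build, key, match="first", qualifier="build"):
    """Toy hash join. Unmatched drivers are dropped to keep the focus on `match`."""
    table = defaultdict(list)
    for row in build:
        table[row[key]].append(row)  # build side, keyed by the equi column

    out = []
    for driver in drivers:
        hits = table.get(driver[key], [])
        if not hits:
            continue  # on_miss handling is out of scope in this sketch
        if match == "first":
            out.append({**hits[0], **driver})  # one row per driver
        elif match == "all":
            out.extend({**hit, **driver} for hit in hits)  # one row per match
        elif match == "collect":
            out.append({**driver, qualifier: list(hits)})  # matches gathered into a list
        else:
            raise ValueError(f"unknown match mode: {match}")
    return out


employees = [{"employee_id": "E1", "department": "Eng"}]
benefits = [
    {"department": "Eng", "benefit_name": "dental"},
    {"department": "Eng", "benefit_name": "vision"},
    {"department": "Eng", "benefit_name": "gym"},
]

first_rows = combine(employees, benefits, "department", match="first")
all_rows = combine(employees, benefits, "department", match="all")
collect_rows = combine(employees, benefits, "department", match="collect", qualifier="benefits")
print(len(first_rows), len(all_rows), len(collect_rows[0]["benefits"]))
```

With one employee and three benefits in the same department, first yields one row, all yields three rows, and collect yields one row carrying a three-element list.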

Unmatched records (`on_miss`)

on_miss controls what happens to driver records with zero matches:

config:
  where: "orders.product_id == products.product_id"
  on_miss: null_fields   # default: emit with build fields set to null

config:
  where: "orders.product_id == products.product_id"
  on_miss: skip          # inner-join semantics: drop unmatched drivers

config:
  where: "orders.product_id == products.product_id"
  on_miss: error         # fail the pipeline on first unmatched driver

Use skip for inner-join semantics, null_fields for left-join semantics, and error for strict referential integrity where any miss should halt processing.
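
The three policies can be summarized as a small decision function. This Python sketch only models the behavior described above; the helper name resolve_miss, the sample order, and the use of LookupError are illustrative assumptions, not Clinker internals.

```python
def resolve_miss(driver, build_fields, on_miss="null_fields"):
    """Decide what an unmatched driver record becomes. Returns a row, or None to drop it."""
    if on_miss == "null_fields":
        # left-join shape: keep the driver, fill every build-side field with null
        return {**driver, **{field: None for field in build_fields}}
    if on_miss == "skip":
        return None  # inner-join shape: drop the driver
    if on_miss == "error":
        raise LookupError(f"no match for driver record {driver!r}")
    raise ValueError(f"unknown on_miss policy: {on_miss}")


orphan = {"order_id": "ORD-009", "product_id": "PROD-Z"}  # hypothetical unmatched order
build_fields = ["product_name", "category"]

print(resolve_miss(orphan, build_fields, "null_fields"))
print(resolve_miss(orphan, build_fields, "skip"))
```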

Composite keys

Chain multiple equalities with and to combine on more than one field. Each conjunct is a separate cross-input equality:

- type: combine
  name: match_by_region
  input:
    sales: sales
    targets: targets
  config:
    where: |
      sales.department == targets.department
      and sales.region == targets.region
    cxl: |
      emit department = sales.department
      emit region = sales.region
      emit actual = sales.amount
      emit goal = targets.goal
    propagate_ck: driver

Both equalities must hold for a record pair to match.

Equi plus residual filter

The where clause can mix equi predicates with additional filter conjuncts. Non-equality conjuncts are applied as a residual filter after the equi match:

- type: combine
  name: high_value_enrichment
  input:
    orders: orders
    products: products
  config:
    where: |
      orders.product_id == products.product_id
      and orders.amount >= 100
    match: first
    on_miss: skip
    cxl: |
      emit order_id = orders.order_id
      emit product_name = products.product_name
      emit amount = orders.amount
    propagate_ck: driver

The equi conjunct drives the hash lookup; the amount >= 100 conjunct is evaluated as a post-filter. At least one cross-input equality is required in every combine.
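
The two-phase evaluation (hash lookup on the equality, then a post-filter on the remaining conjunct) can be sketched in Python as follows. This is a simplified model under the same settings as the example (match: first, on_miss: skip); the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def enrich_high_value(orders, products, min_amount=100):
    """Equi conjunct via hash lookup, residual conjunct as a post-filter."""
    table = defaultdict(list)
    for product in products:
        table[product["product_id"]].append(product)

    out = []
    for order in orders:
        # Phase 1: the equality conjunct selects candidates from the hash table.
        candidates = table.get(order["product_id"], [])
        # Phase 2: the residual conjunct filters the candidate pairs.
        matches = [p for p in candidates if order["amount"] >= min_amount]
        if not matches:
            continue  # on_miss: skip
        product = matches[0]  # match: first
        out.append({
            "order_id": order["order_id"],
            "product_name": product["product_name"],
            "amount": order["amount"],
        })
    return out


products = [{"product_id": "PROD-A", "product_name": "Widget Pro"}]
orders = [
    {"order_id": "ORD-001", "product_id": "PROD-A", "amount": 149.95},
    {"order_id": "ORD-003", "product_id": "PROD-A", "amount": 29.99},   # fails the residual
    {"order_id": "ORD-900", "product_id": "PROD-Z", "amount": 500.0},   # fails the equi lookup
]
enriched = enrich_high_value(orders, products)
print(enriched)
```

An order is emitted only when both phases succeed: ORD-003 finds its product but is below the amount threshold, and ORD-900 clears the threshold but has no product to join to.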

Multi-input combine (three or more)

Combine accepts any number of inputs. Each pair of inputs that should be related needs an explicit equality in the where clause:

- type: combine
  name: fully_enriched
  input:
    orders: orders
    products: products
    categories: categories
  config:
    where: |
      orders.product_id == products.product_id
      and products.category_id == categories.category_id
    match: first
    on_miss: null_fields
    cxl: |
      emit order_id = orders.order_id
      emit product_name = products.product_name
      emit category_name = categories.name
      emit amount = orders.amount
    propagate_ck: driver

Input order in the input: map is preserved, and the first input is treated as the default driving side unless a drive: hint overrides it.

Choosing the driving input

By default the planner picks a driving (probe) input and builds hash tables for the rest. Use drive: to force a specific input to be the driver – typically the larger stream, or the one whose ordering you want to preserve:

- type: combine
  name: product_driven
  input:
    orders: orders
    products: products
  config:
    where: "orders.product_id == products.product_id"
    match: first
    drive: products
    cxl: |
      emit product_id = products.product_id
      emit product_name = products.product_name
      emit sample_order_id = orders.order_id
    propagate_ck: driver

With drive: products, the pipeline emits one row per product enriched with a matching order, instead of one row per order enriched with its product.

Memory considerations

Build-side inputs are materialized in memory as hash tables keyed by the equi columns. For each non-driving input, plan for roughly 1.5-2x the raw CSV size in heap. A 50 MB product catalog typically uses 75-100 MB of hash-table memory. Tune with --memory-limit; see Memory Tuning for spill thresholds and strategy overrides.
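
The rule of thumb above is simple enough to script when sizing a budget. The helper names below are hypothetical and the 1.5x to 2x multipliers are the rough figures from this section, not measured values.

```python
def build_side_heap_mb(raw_csv_mb):
    """Return the (low, high) heap estimate in MB for one non-driving input."""
    return raw_csv_mb * 1.5, raw_csv_mb * 2.0


def total_build_heap_mb(raw_sizes_mb):
    """Sum the per-input ranges across every non-driving input."""
    lows, highs = zip(*(build_side_heap_mb(size) for size in raw_sizes_mb))
    return sum(lows), sum(highs)


print(build_side_heap_mb(50))          # the 50 MB catalog from the text
print(total_build_heap_mb([50, 10]))   # hypothetical: catalog plus a 10 MB lookup
```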

Document boundaries

When the driver carries document boundaries – a glob: source where each file is its own document – the Combine forwards those boundaries to its output, so a per-document Aggregate after the join rolls up per driver document. See Combine – Document boundaries.

Routing to Multiple Outputs

This recipe splits a stream of order records into separate output files based on business rules. High-value orders go to one file, standard orders to another.

Input data

orders.csv:

order_id,customer,amount,region
ORD-001,Acme Corp,15000,US
ORD-002,Globex,450,EU
ORD-003,Initech,8500,US
ORD-004,Umbrella,22000,APAC
ORD-005,Stark Ind,950,US
ORD-006,Wayne Ent,3200,EU

Pipeline

order_routing.yaml:

pipeline:
  name: order_routing
  vars:
    high_value_threshold: 5000

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: csv
      path: "./orders.csv"
      schema:
        - { name: order_id, type: string }
        - { name: customer, type: string }
        - { name: amount, type: float }
        - { name: region, type: string }

  - type: route
    name: split_by_value
    input: orders
    config:
      mode: exclusive
      conditions:
        high: "amount >= $vars.high_value_threshold"
      default: standard

  - type: output
    name: high_value_output
    input: split_by_value.high
    config:
      name: high_value_orders
      type: csv
      path: "./output/high_value.csv"

  - type: output
    name: standard_output
    input: split_by_value.standard
    config:
      name: standard_orders
      type: csv
      path: "./output/standard.csv"

Run it

clinker run order_routing.yaml --dry-run
clinker run order_routing.yaml

Expected output

output/high_value.csv:

order_id,customer,amount,region
ORD-001,Acme Corp,15000,US
ORD-003,Initech,8500,US
ORD-004,Umbrella,22000,APAC

output/standard.csv:

order_id,customer,amount,region
ORD-002,Globex,450,EU
ORD-005,Stark Ind,950,US
ORD-006,Wayne Ent,3200,EU

How routing works

Port syntax

Route nodes produce named output ports. Downstream nodes reference these ports using dot syntax: split_by_value.high and split_by_value.standard.

The port names come from two places:

Condition names in the conditions map (here, high)
The default field (here, standard)

Exclusive mode

With mode: exclusive, each record goes to exactly one branch. Conditions are evaluated top to bottom – the first matching condition wins, and the record is sent to that port. Records that match no condition go to the default port.
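
Exclusive routing amounts to "first matching condition, else default". The Python sketch below models that rule against the recipe's six orders; route_exclusive is a hypothetical helper, and the lambda stands in for the CXL condition.

```python
def route_exclusive(record, conditions, default):
    """Return the single port a record is sent to under mode: exclusive."""
    for port, predicate in conditions.items():  # dicts keep declaration order
        if predicate(record):
            return port  # first matching condition wins
    return default


HIGH_VALUE_THRESHOLD = 5000  # mirrors pipeline.vars.high_value_threshold
conditions = {"high": lambda r: r["amount"] >= HIGH_VALUE_THRESHOLD}

orders = [
    {"order_id": "ORD-001", "amount": 15000},
    {"order_id": "ORD-002", "amount": 450},
    {"order_id": "ORD-003", "amount": 8500},
    {"order_id": "ORD-004", "amount": 22000},
    {"order_id": "ORD-005", "amount": 950},
    {"order_id": "ORD-006", "amount": 3200},
]

ports = {"high": [], "standard": []}
for order in orders:
    ports[route_exclusive(order, conditions, "standard")].append(order["order_id"])
print(ports)
```

Every record lands in exactly one list, which matches the split shown in the expected output files.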

Pipeline variables

The threshold is defined in pipeline.vars and referenced in the CXL expression as $vars.high_value_threshold. This makes it easy to adjust the threshold without editing the route condition, and channel overrides can change it per environment.

Variations

Multiple branches

Route nodes can have any number of named branches:

  - type: route
    name: split_by_region
    input: orders
    config:
      mode: exclusive
      conditions:
        us: "region == \"US\""
        eu: "region == \"EU\""
        apac: "region == \"APAC\""
      default: other

  - type: output
    name: us_output
    input: split_by_region.us
    config:
      name: us_orders
      type: csv
      path: "./output/us_orders.csv"

  - type: output
    name: eu_output
    input: split_by_region.eu
    config:
      name: eu_orders
      type: csv
      path: "./output/eu_orders.csv"

  # ... additional outputs for apac, other

Transform before output

Insert a transform between the route and output to shape the data differently per branch:

  - type: transform
    name: enrich_high_value
    input: split_by_value.high
    config:
      cxl: |
        emit order_id = order_id
        emit customer = customer
        emit amount = amount
        emit priority = "URGENT"
        emit review_required = true

  - type: output
    name: high_value_output
    input: enrich_high_value
    config:
      name: high_value_orders
      type: csv
      path: "./output/high_value.csv"

Combining routing with aggregation

Route first, then aggregate each branch independently:

  - type: aggregate
    name: high_value_summary
    input: split_by_value.high
    config:
      group_by: [region]
      cxl: |
        emit total = sum(amount)
        emit count = count(*)

This produces a per-region summary of high-value orders only.

Aggregation & Rollups

This recipe demonstrates grouping records and computing summary statistics. The pipeline filters active sales records, then rolls them up by department.

Input data

sales.csv:

id,department,amount,status,rep
1,Engineering,5000,active,Alice
2,Marketing,3000,active,Bob
3,Engineering,7000,active,Carol
4,Sales,4000,inactive,Dave
5,Marketing,2000,active,Eva
6,Engineering,9500,active,Frank
7,Sales,6000,active,Grace
8,Marketing,1500,inactive,Hank

Pipeline

dept_rollup.yaml:

pipeline:
  name: dept_rollup

nodes:
  - type: source
    name: sales
    config:
      name: sales
      type: csv
      path: "./sales.csv"
      schema:
        - { name: id, type: int }
        - { name: department, type: string }
        - { name: amount, type: float }
        - { name: status, type: string }
        - { name: rep, type: string }

  - type: transform
    name: active_only
    input: sales
    config:
      cxl: |
        filter status == "active"

  - type: aggregate
    name: rollup
    input: active_only
    config:
      group_by: [department]
      cxl: |
        emit total = sum(amount)
        emit count = count(*)
        emit average = avg(amount)
        emit maximum = max(amount)
        emit minimum = min(amount)

  - type: output
    name: report
    input: rollup
    config:
      name: dept_totals
      type: csv
      path: "./output/dept_totals.csv"

Run it

clinker run dept_rollup.yaml --dry-run
clinker run dept_rollup.yaml

Expected output

output/dept_totals.csv:

department,total,count,average,maximum,minimum
Engineering,21500,3,7166.67,9500,5000
Marketing,5000,2,2500,3000,2000
Sales,6000,1,6000,6000,6000

One row per department. The inactive records (Dave’s $4000, Hank’s $1500) are excluded by the filter.

How aggregation works

Group-by keys

The group_by field lists the columns that define each group. Records with the same values for all group-by columns are aggregated together. The group-by columns appear automatically in the output – you do not need to emit them.

Aggregate functions

Available aggregate functions in CXL:

Function	Description
`sum(expr)`	Sum of values
`count(*)`	Number of records
`avg(expr)`	Arithmetic mean
`min(expr)`	Minimum value
`max(expr)`	Maximum value
`first(expr)`	First value encountered
`last(expr)`	Last value encountered
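
To make the grouping concrete, here is a Python sketch that reproduces the recipe's roll-up with one running accumulator per group. It is a model of hash-style aggregation, not Clinker's code; hash_rollup is a hypothetical name, and the two-decimal rounding of the average mirrors the expected output rather than a documented CXL rule.

```python
def hash_rollup(rows, key, value):
    """One accumulator per distinct key; input order does not matter."""
    acc = {}
    for row in rows:
        x = row[value]
        state = acc.get(row[key])
        if state is None:
            acc[row[key]] = {"total": x, "count": 1, "maximum": x, "minimum": x}
        else:
            state["total"] += x
            state["count"] += 1
            state["maximum"] = max(state["maximum"], x)
            state["minimum"] = min(state["minimum"], x)
    return [
        {key: k, **s, "average": round(s["total"] / s["count"], 2)}
        for k, s in acc.items()
    ]


sales = [
    ("Engineering", 5000.0, "active"), ("Marketing", 3000.0, "active"),
    ("Engineering", 7000.0, "active"), ("Sales", 4000.0, "inactive"),
    ("Marketing", 2000.0, "active"), ("Engineering", 9500.0, "active"),
    ("Sales", 6000.0, "active"), ("Marketing", 1500.0, "inactive"),
]
active = [{"department": d, "amount": a} for d, a, status in sales if status == "active"]
report = hash_rollup(active, "department", "amount")
for line in report:
    print(line)
```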

Per-document aggregation

Per-document aggregation works after any upstream that forwards document boundaries — a document-aware source (a glob: / paths: source that treats each file as its own document, or an enveloped format like XML or EDI), a Merge, or a Combine. The Aggregate produces one set of grouped rows per document rather than a single aggregate spanning every file. Each document’s groups finalize and emit at that document’s close boundary, so a glob over twelve monthly files yields twelve independent monthly roll-ups. A plain single-file source is one document and still emits a single aggregate. This holds whether the Aggregate’s upstream streams or materializes its output — both flush per document.

A Merge of distinct single-document sources forwards each source’s per-document close in every mode (concat, seeded interleave, and the fused unseeded all-Source interleave fast path), so each source flushes its own roll-up. Per-document aggregation also works after a Combine on any join strategy: for example, a driver glob: source joined to a lookup table, then a group-by Aggregate, yields one roll-up per driver document:

nodes:
  - type: source            # driver: each monthly file is a document
    name: orders
    config: { name: orders, type: csv, glob: "./orders/*.csv", schema: [ ... ] }
  - type: source            # small lookup table
    name: products
    config: { name: products, type: csv, path: "./products.csv", schema: [ ... ] }
  - type: combine
    name: enrich
    input: { orders: orders, products: products }
    config: { where: "orders.product_id == products.product_id", match: first, on_miss: skip, propagate_ck: driver, cxl: "..." }
  - type: aggregate
    name: monthly_totals    # one roll-up per driver document (per month)
    input: enrich
    config: { group_by: [category], cxl: "..." }

See Envelopes & Document Context for the boundary rules across every Merge mode and Combine strategy.
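
The effect of document boundaries on an Aggregate can be modeled as "reset the groups at every document close". The Python sketch below assumes each document arrives as a (name, rows) pair; the file names and the per_document_totals helper are hypothetical.

```python
def per_document_totals(documents, key, value):
    """documents: iterable of (document_name, rows). Yields one roll-up per document."""
    for name, rows in documents:
        totals = {}  # fresh group state for each document
        for row in rows:
            totals[row[key]] = totals.get(row[key], 0) + row[value]
        yield name, totals  # groups finalize at the document's close boundary


documents = [
    ("orders/2024-01.csv", [
        {"category": "Hardware", "amount": 100},
        {"category": "Software", "amount": 300},
        {"category": "Hardware", "amount": 50},
    ]),
    ("orders/2024-02.csv", [
        {"category": "Hardware", "amount": 70},
    ]),
]
rollups = list(per_document_totals(documents, "category", "amount"))
print(rollups)
```

Two input documents give two independent roll-ups; Hardware is totaled once per month instead of once across both files.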

Strategy selection

Clinker offers two aggregation strategies:

Hash aggregation (what auto falls back to when sorted input cannot be proven): Builds an in-memory hash map keyed by the group-by columns. Works with any input order. Memory usage is proportional to the number of distinct groups.
Streaming aggregation: Processes records in order, emitting each group’s result as soon as the next group starts. Requires input sorted by the group-by keys. Uses minimal memory regardless of the number of groups.

The default strategy (auto) selects streaming when the optimizer can prove the input is sorted by the group-by keys, and hash otherwise. You can force a strategy:

    config:
      group_by: [department]
      strategy: streaming   # requires sorted input

See Memory Tuning for details on memory implications.
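
Why streaming aggregation needs sorted input is easy to demonstrate. The Python sketch below uses itertools.groupby, which (like the streaming strategy described above) only ever holds the current group; it is an analogy, not Clinker's implementation, and streaming_sum is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter


def streaming_sum(rows, key, value):
    """Emit each group's total as soon as the key changes; holds one group at a time."""
    for k, group in groupby(rows, key=itemgetter(key)):
        yield k, sum(row[value] for row in group)


sorted_rows = [
    {"department": "Engineering", "amount": 5000},
    {"department": "Engineering", "amount": 7000},
    {"department": "Marketing", "amount": 3000},
]
unsorted_rows = [sorted_rows[0], sorted_rows[2], sorted_rows[1]]

print(list(streaming_sum(sorted_rows, "department", "amount")))
print(list(streaming_sum(unsorted_rows, "department", "amount")))
```

On sorted input each department appears once. On unsorted input Engineering is emitted twice with partial totals, which is the failure mode that forcing strategy: streaming on unsorted data would risk.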

Variations

Multiple group-by keys

    config:
      group_by: [department, region]
      cxl: |
        emit total = sum(amount)
        emit count = count(*)

Produces one row per unique (department, region) combination.

Pre-aggregation transform

Compute derived fields before aggregating:

  - type: transform
    name: prepare
    input: sales
    config:
      cxl: |
        filter status == "active"
        emit department = department
        emit amount = amount
        emit is_large = amount >= 5000

  - type: aggregate
    name: rollup
    input: prepare
    config:
      group_by: [department]
      cxl: |
        emit total = sum(amount)
        emit large_count = sum(if is_large then 1 else 0)
        emit small_count = sum(if not is_large then 1 else 0)

Aggregation followed by routing

Aggregate first, then route the summary rows:

  - type: aggregate
    name: rollup
    input: active_only
    config:
      group_by: [department]
      cxl: |
        emit total = sum(amount)

  - type: route
    name: split_by_total
    input: rollup
    config:
      mode: exclusive
      conditions:
        large: "total >= 10000"
      default: small

This routes departments with at least $10,000 in total sales to one output and the rest to another.

No group-by (grand total)

Omit group_by to aggregate all records into a single output row:

    config:
      cxl: |
        emit grand_total = sum(amount)
        emit record_count = count(*)
        emit average_amount = avg(amount)

Time-windowed rollups

To group records into event-time buckets, declare a watermark: on every source and a time_window: on the aggregate. Each window emits one rollup per group when it closes. Three patterns cover the common shapes; all three ship as runnable pipelines under examples/pipelines/.

Tumbling: hourly click counts

Non-overlapping one-hour buckets per user. Use when each record should contribute to exactly one reporting bucket.

examples/pipelines/tumbling_clicks.yaml:

pipeline:
  name: tumbling_clicks

nodes:
  - type: source
    name: clicks
    description: Per-user click stream with an event-time column.
    config:
      name: clicks
      type: csv
      path: ./data/tumbling_clicks.csv
      options:
        has_header: true
      watermark:
        column: event_ts
      schema:
        - { name: user_id, type: string }
        - { name: event_ts, type: date_time }
        - { name: kind, type: string }

  - type: aggregate
    name: hourly_clicks
    description: Per-user click count, bucketed by event-time hour.
    input: clicks
    config:
      group_by: [user_id]
      time_window:
        tumbling: { size: 1h }
      cxl: |
        emit user_id = user_id
        emit n = count(*)

  - type: output
    name: results
    input: hourly_clicks
    config:
      name: results
      type: csv
      path: ./output/tumbling_clicks.csv

error_handling:
  strategy: fail_fast

Run:

cargo run -p clinker -- run examples/pipelines/tumbling_clicks.yaml

Each hour-aligned bucket emits one row per user_id once its time window has passed. Records that arrive out-of-order land in the DLQ as late_record — add delay: on the source or allowed_lateness: on the aggregate if the input has a known out-of-order tail.
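
Assigning a record to its tumbling bucket is a floor operation on the timestamp. The Python sketch below assumes buckets are half-open [start, end) intervals aligned to the Unix epoch in UTC; that alignment is an assumption made for illustration, and tumbling_window is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window(ts, size):
    """Return the single [start, end) bucket that contains ts."""
    start = ts - ((ts - EPOCH) % size)  # floor ts to a multiple of size
    return start, start + size


event_ts = datetime(2024, 1, 1, 10, 42, 7, tzinfo=timezone.utc)
start, end = tumbling_window(event_ts, timedelta(hours=1))
print(start.isoformat(), end.isoformat())
```

Because the buckets are disjoint, every record maps to exactly one (user_id, bucket) group.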

Hopping: 1-hour sums advanced every 5 minutes

Overlapping one-hour windows that move forward every 5 minutes. Use for moving averages and rolling sums where one record should contribute to multiple overlapping reports.

examples/pipelines/hopping_sliding_5m_1h.yaml:

pipeline:
  name: hopping_sliding_5m_1h

nodes:
  - type: source
    name: clicks
    config:
      name: clicks
      type: csv
      path: ./data/hopping_clicks.csv
      options:
        has_header: true
      watermark:
        column: event_ts
        delay: 5s
      schema:
        - { name: user_id, type: string }
        - { name: event_ts, type: date_time }
        - { name: amount, type: int }

  - type: aggregate
    name: sliding_amount
    input: clicks
    config:
      group_by: [user_id]
      time_window:
        hopping:
          size: 1h
          slide: 5m
      allowed_lateness: 30s
      cxl: |
        emit user_id = user_id
        emit total = sum(amount)
        emit n = count(*)

  - type: output
    name: results
    input: sliding_amount
    config:
      name: results
      type: csv
      path: ./output/hopping_sliding_5m_1h.csv

error_handling:
  strategy: fail_fast

Run:

cargo run -p clinker -- run examples/pipelines/hopping_sliding_5m_1h.yaml

Each record fans into ceil(size / slide) = 12 overlapping windows, so the output row count is roughly 12× the active-window record count. The source’s delay: 5s plus the aggregate’s allowed_lateness: 30s give the pipeline 35 seconds of total grace beyond strict event-time order before a record drops to the DLQ.
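
The fan-out of 12 can be checked by enumerating the windows that contain one timestamp. As a sketch, the Python below assumes half-open [start, end) windows whose starts are aligned to multiples of slide from the Unix epoch; the alignment and the hopping_windows helper are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hopping_windows(ts, size, slide):
    """Return every [start, end) window containing ts, earliest first."""
    start = ts - ((ts - EPOCH) % slide)  # latest slide-aligned start at or before ts
    windows = []
    while start + size > ts:  # this window still covers ts
        windows.append((start, start + size))
        start -= slide
    return windows[::-1]


ts = datetime(2024, 1, 1, 10, 42, 7, tzinfo=timezone.utc)
windows = hopping_windows(ts, timedelta(hours=1), timedelta(minutes=5))
print(len(windows), windows[0][0].time(), windows[-1][0].time())
```

A record at 10:42:07 falls into the windows starting every five minutes from 09:45 through 10:40, twelve in total, so it contributes to twelve separate roll-ups.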

Session: login sessions across two sources

Variable-duration windows bounded by inactivity, computed across two independent sources. Use for activity grouping where the window length is data-driven rather than clock-aligned.

examples/pipelines/multi_source_session.yaml:

pipeline:
  name: multi_source_session

nodes:
  - type: source
    name: src_web
    description: Web login events.
    config:
      name: src_web
      type: csv
      path: ./data/session_logins.csv
      options:
        has_header: true
      watermark:
        column: event_ts
      schema:
        - { name: user_id, type: string }
        - { name: event_ts, type: date_time }
        - { name: source, type: string }

  - type: source
    name: src_mobile
    description: Mobile login events.
    config:
      name: src_mobile
      type: csv
      path: ./data/session_mobile.csv
      options:
        has_header: true
      watermark:
        column: event_ts
      schema:
        - { name: user_id, type: string }
        - { name: event_ts, type: date_time }
        - { name: source, type: string }

  - type: merge
    name: all_logins
    inputs: [src_web, src_mobile]

  - type: aggregate
    name: user_sessions
    input: all_logins
    config:
      group_by: [user_id]
      time_window:
        session: { gap: 5m }
      allowed_lateness: 30s
      cxl: |
        emit user_id = user_id
        emit logins = count(*)

  - type: output
    name: results
    input: user_sessions
    config:
      name: results
      type: csv
      path: ./output/multi_source_session.csv

error_handling:
  strategy: fail_fast

Run:

cargo run -p clinker -- run examples/pipelines/multi_source_session.yaml

Each source declares its own watermark.column independently. A session can’t emit until both src_web and src_mobile have caught up past session_end + allowed_lateness, so the rollup waits for the slower source before closing. Drop the watermark: block on either source and the pipeline is rejected at plan time with E156.
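
Two ideas from this example can be sketched in Python: splitting a user's merged events into sessions by inactivity gap, and holding a session open until the slowest source has caught up. Both functions are simplified models with hypothetical names. Whether a gap of exactly 5 minutes starts a new session, how session_end is derived, and whether the watermark comparison is strict are all assumptions here.

```python
from datetime import datetime, timedelta


def sessionize(events, gap):
    """events: iterable of (user_id, timestamp). Returns (user_id, first, last, count) tuples."""
    by_user = {}
    for user_id, ts in events:
        by_user.setdefault(user_id, []).append(ts)

    sessions = []
    for user_id, stamps in by_user.items():
        stamps.sort()
        first = last = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - last > gap:  # inactivity exceeded: close the current session
                sessions.append((user_id, first, last, count))
                first, count = ts, 0
            last = ts
            count += 1
        sessions.append((user_id, first, last, count))
    return sessions


def can_close(session_end, allowed_lateness, watermarks):
    """True once every source's watermark has reached session_end + allowed_lateness."""
    return min(watermarks.values()) >= session_end + allowed_lateness


def t(hour, minute):
    return datetime(2024, 1, 1, hour, minute)


web = [("u1", t(10, 0)), ("u1", t(10, 20))]
mobile = [("u1", t(10, 3))]
sessions = sessionize(web + mobile, gap=timedelta(minutes=5))  # web + mobile models the merge
print(sessions)

lagging = {"src_web": t(10, 30), "src_mobile": t(10, 5)}
caught_up = {"src_web": t(10, 30), "src_mobile": t(10, 9)}
print(can_close(t(10, 8), timedelta(seconds=30), lagging))
print(can_close(t(10, 8), timedelta(seconds=30), caught_up))
```

The web and mobile logins at 10:00 and 10:03 form one session even though they come from different sources, and the close decision follows the minimum watermark, so a lagging src_mobile holds the session open regardless of how far src_web has advanced.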

When to pick each

Kind	Bucket shape	Typical use
`tumbling`	Disjoint, clock-aligned, fixed width	Hourly metrics, daily rollups, billing periods.
`hopping`	Overlapping, clock-aligned, fixed width	Moving averages, sliding sums, anomaly detection where each record should affect multiple reports.
`session`	Variable width, gap-bounded, per-key	User sessions, telemetry burst grouping, activity envelopes where the window length is data-driven.

Slowly-Changing Dimensions (SCD Type 2)

This recipe splits over-long dimension records into closed historical rows plus a continuation row, the shape an SCD Type 2 backfill produces. It uses a Reshape node to both mutate the record that needs splitting and synthesize the continuation row, per correlation group.

It ships as a runnable pipeline at examples/pipelines/scd_type2.yaml.

The problem

Each subject (here, an employee) has a history of dimension records — benefit plans, addresses, price tiers — each with a validity window. A record whose window runs longer than it should (it was never split when the underlying fact changed) needs to be closed at the boundary, with a fresh continuation record carrying the rest of the window forward. That is a per-group transformation: the whole employee’s history is one correlation group, and the split must observe the original group, not a half-mutated one.

Reshape fits exactly: it groups by partition_by, and within each group a rule can both mutate the trigger record and synthesize a derived one — all against the original group snapshot (the no-cascade contract).

Input data

data/scd_plans.csv — each employee’s plan history, with start/end day-numbers:

employee_id,plan_start,plan_end,status
E001,100,90,baseline
E001,1000,100,baseline
E002,200,150,baseline
E002,2000,300,baseline
E003,300,250,baseline
E004,400,380,baseline
E004,1500,400,baseline
E005,500,450,baseline

E001, E002, and E004 each have one row whose window exceeds the one-year boundary (plan_start - plan_end > 365); the others are already short enough.

Pipeline

scd_type2.yaml:

pipeline:
  name: scd_type2_backfill
  # A small budget forces the spill path even on this tiny fixture, so the
  # example also demonstrates bounded-memory Reshape. Raise or remove it for
  # production volumes.
  memory: { limit: "16K", backpressure: spill }

nodes:
  - type: source
    name: plans
    config:
      name: plans
      type: csv
      path: ./data/scd_plans.csv
      options:
        has_header: true
      schema:
        - { name: employee_id, type: string }
        - { name: plan_start, type: int }
        - { name: plan_end, type: int }
        - { name: status, type: string }

  - type: reshape
    name: backfill
    input: plans
    config:
      partition_by: [employee_id]
      order_by:
        - { field: plan_start, order: asc }
      rules:
        - name: split_long_plan
          when: "plan_start - plan_end > 365"
          mutate:
            set:
              plan_end: "plan_start"        # close the over-long window
          synthesize:
            copy_from: none
            overrides:
              employee_id: "employee_id"
              plan_start: "plan_start"
              plan_end: "plan_end"          # the rest of the window
              status: "'synthesized'"

  - type: output
    name: out
    input: backfill
    config:
      name: out
      type: csv
      path: ./output/scd_type2.csv

error_handling:
  strategy: continue

Run it

cargo run -p clinker -- run examples/pipelines/scd_type2.yaml

Expected output

output/scd_type2.csv:

employee_id,plan_start,plan_end,status
E001,100,90,baseline
E001,1000,1000,baseline
E001,1000,100,synthesized
E002,200,150,baseline
E002,2000,2000,baseline
E002,2000,300,synthesized
E003,300,250,baseline
E004,400,380,baseline
E004,1500,1500,baseline
E004,1500,400,synthesized
E005,500,450,baseline

Eight input rows produce eleven output rows: the three trigger rows have their plan_end closed at plan_start, and each emits one status=synthesized continuation row. The untriggered rows (E003, E005, and the short rows of E001, E002, and E004) pass through unchanged.

How it works

One rule, two actions

The single rule’s when predicate selects the trigger rows. For each trigger row:

mutate.set rewrites plan_end to plan_start, closing the over-long window at its start boundary. The mutated row keeps its identity and is emitted in place.
synthesize derives a brand-new continuation row. copy_from: none starts from an all-null base and every column is supplied by an overrides expression, so the continuation row is fully constructed from the trigger’s values rather than copied. It is marked status=synthesized so downstream stages can tell originals from engine-derived rows.

Because Reshape applies every rule against the original group snapshot, the mutate and the synthesize both read the trigger row as it arrived — the mutation never feeds back into the synthesis.
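
The snapshot behavior can be modeled in Python for a single employee's group. This is a sketch of the rule in this recipe only, not of Reshape in general; reshape_group is a hypothetical name, and emitting the mutated row immediately before its synthesized row is inferred from the expected output above.

```python
def reshape_group(rows):
    """Apply the split_long_plan rule to one partition, reading only the original rows."""
    snapshot = [dict(row) for row in rows]  # original group; never modified below
    out = []
    for original in sorted(snapshot, key=lambda r: r["plan_start"]):  # order_by plan_start asc
        if original["plan_start"] - original["plan_end"] > 365:       # the rule's `when`
            # mutate.set: close the window, computed from the original row
            mutated = {**original, "plan_end": original["plan_start"]}
            # synthesize with copy_from: none, so every column comes from an override
            synthesized = {
                "employee_id": original["employee_id"],
                "plan_start": original["plan_start"],
                "plan_end": original["plan_end"],  # original value, not the mutated one
                "status": "synthesized",
            }
            out.extend([mutated, synthesized])
        else:
            out.append(dict(original))
    return out


e001 = [
    {"employee_id": "E001", "plan_start": 100, "plan_end": 90, "status": "baseline"},
    {"employee_id": "E001", "plan_start": 1000, "plan_end": 100, "status": "baseline"},
]
result = reshape_group(e001)
for row in result:
    print(row)
```

The synthesized row keeps plan_end = 100 because its overrides read the trigger row as it arrived; had they read the mutated row, it would carry plan_end = 1000 instead.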

Audit provenance

Reshape stamps $meta.synthetic, $meta.synthesized_by, and $meta.mutated_by on every output row (see Audit stamps). These stay out of the default CSV output but are available for downstream CXL — a follow-on Route or Transform can filter on $meta.synthetic to handle generated rows separately.

Bounded memory and spill

The example’s memory.limit: "16K" with backpressure: spill is deliberately tiny so the run exercises Reshape’s disk-spill path on a small fixture. Reshape buffers each employee’s group, and when the budget trips it spills the raw input records to disk and re-runs synthesis on reload — the output is identical whether a group stayed in memory or round-tripped through disk. The per-stage spill volume appears in clinker run --explain and in the post-run spill summary. For real workloads, drop the artificial limit (the default budget is 512 MB) and Reshape stays in memory until it genuinely needs to spill.

Two limits apply: a single correlation group must still fit the memory budget at finalize (the no-cascade contract reloads the whole group to apply its rules — a group larger than the budget fails loud rather than crashing), and Reshape rules cannot reference $doc document context while spill is in play (such a pipeline is rejected at compile time). Each employee’s group in this example is tiny, so neither limit is reached here. See Reshape’s memory model and Memory & Spill for the full picture.

Idempotence

Re-running this pipeline over its own output does not re-trigger: a closed row has plan_start - plan_end == 0, and a synthesized continuation row likewise sits inside the boundary, so the when predicate fires only on genuinely over-long windows. That makes the backfill safe to apply repeatedly.

Backfill, Then Cull for Review

This recipe chains two grouping operators: a Reshape node backfills each subject’s history, then a Cull node sets aside whole subjects that need manual review — routing them to a second output stream instead of dropping them.

It ships as a runnable pipeline at examples/pipelines/employee_plan_backfill.yaml.

The problem

You have run an SCD-style backfill over each employee’s benefit-plan history (closing over-long windows and synthesizing continuation rows). After backfilling, some employees end up with a large or otherwise unusual plan history that an analyst should eyeball before it lands in the clean dataset. You want two outputs:

a clean stream for the employees whose history looks fine, and
a review stream for the flagged employees — their records intact, not discarded, not errored.

That is a per-group decision based on an aggregate property of the whole group (“this employee has more than three plan rows”), and the flagged records belong on a second valid data stream, not in the dead-letter queue. Cull fits exactly: it groups by partition_by, evaluates a group-level drop_group_when predicate, and emits removed groups on a first-class removed_to side-output port.

Input data

data/employee_plans.csv — each employee’s plan history, with start/end day-numbers:

employee_id,plan_start,plan_end,status
E001,100,90,baseline
E001,1000,100,baseline
E002,200,150,baseline
E003,50,40,baseline
E003,300,250,baseline
E003,600,550,baseline
E003,900,850,baseline

E001 has one over-long window (1000 - 100 > 365); E003 has four plan rows.

The pipeline

nodes:
  - type: source
    name: plans
    config:
      name: plans
      type: csv
      path: ./data/employee_plans.csv
      schema:
        - { name: employee_id, type: string }
        - { name: plan_start, type: int }
        - { name: plan_end, type: int }
        - { name: status, type: string }

  - type: reshape
    name: backfill
    input: plans
    config:
      partition_by: [employee_id]
      order_by:
        - { field: plan_start, order: asc }
      rules:
        - name: split_long_plan
          when: "plan_start - plan_end > 365"
          mutate:
            set:
              plan_end: "plan_start"
          synthesize:
            copy_from: trigger
            overrides:
              status: "'synthesized'"

  - type: cull
    name: flag_large_histories
    input: backfill
    config:
      partition_by: [employee_id]
      removed_to: review
      rules:
        - name: too_many_plans
          drop_group_when: "count(*) > 3"

  - type: output
    name: out
    input: flag_large_histories         # main port — kept employees
    config: { name: out, type: csv, path: ./output/employee_plans_clean.csv }

  - type: output
    name: review
    input: flag_large_histories.review  # side-output port — flagged employees
    config: { name: review, type: csv, path: ./output/employee_plans_review.csv }

How it works

Reshape (backfill) groups by employee_id and, for the over-long window, closes it (plan_end = plan_start) and synthesizes a continuation row marked status=synthesized. E001 gains a synthesized row, ending up with three rows.
Cull (flag_large_histories) groups the backfilled rows by employee_id again and evaluates count(*) > 3 over each whole group. E003 has four rows, so the whole E003 group is routed to the review side-output port; E001 (three rows) and E002 (one row) flow to the main output.
Two outputs draw the two ports: out references the Cull node by name (the main port, kept employees); review references flag_large_histories.review (the side-output port, flagged employees).

The main output (employee_plans_clean.csv) carries E001 (with its synthesized continuation row) and E002; the review output (employee_plans_review.csv) carries all four of E003’s rows. Both streams carry the unchanged input schema — Cull does not widen, and the flagged records are valid rows on a normal data edge, never DLQ entries.
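
The row counts above can be checked by hand with a small Python model of the two stages. This is a sketch of the described semantics under stated assumptions, not the engine's code: backfill and cull are hypothetical helper names, and the position of the synthesized row within its group is an assumption (only the counts are asserted).

```python
from collections import defaultdict

# The sample input from data/employee_plans.csv.
ROWS = [
    ("E001", 100, 90, "baseline"),
    ("E001", 1000, 100, "baseline"),
    ("E002", 200, 150, "baseline"),
    ("E003", 50, 40, "baseline"),
    ("E003", 300, 250, "baseline"),
    ("E003", 600, 550, "baseline"),
    ("E003", 900, 850, "baseline"),
]


def group_by_employee(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return groups


def backfill(rows):
    """Model of the reshape node: close over-long windows, add a continuation."""
    out = []
    for group in group_by_employee(rows).values():
        for emp, start, end, status in sorted(group, key=lambda r: r[1]):
            if start - end > 365:                         # the rule's `when`
                out.append((emp, start, start, status))   # mutate: plan_end = plan_start
                out.append((emp, start, end, "synthesized"))  # copy_from: trigger + override
            else:
                out.append((emp, start, end, status))
    return out


def cull(rows, limit=3):
    """Model of the cull node: whole groups with count(*) > limit go to review."""
    kept, review = [], []
    for group in group_by_employee(rows).values():
        (review if len(group) > limit else kept).extend(group)
    return kept, review


kept, review = cull(backfill(ROWS))
```

Here kept holds four rows (three for E001, one for E002) and review holds the four E003 rows, matching the two output files.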

Expressing group-level conditions

drop_group_when is an aggregate predicate over the whole group. CXL’s bare aggregates are sum / count / min / max / avg / collect / weighted_avg — there is no bare any(). To flag a group when any row matches a condition, sum an indicator and compare to zero:

rules:
  # Flag the whole employee for review if any plan row is still flagged
  # `status == 'error'` after backfill.
  - name: any_error
    drop_group_when: "sum(if status == 'error' then 1 else 0) > 0"

See the Cull node reference for the full predicate vocabulary, the producer-side port model, and the bounded-memory spill behavior.
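
If the indicator-sum idiom looks unfamiliar, the following Python sketch shows why it behaves like an "any" test. It models only the arithmetic of the predicate; any_error_via_sum and the sample groups are hypothetical names and data.

```python
def any_error_via_sum(group):
    # Mirrors: sum(if status == 'error' then 1 else 0) > 0
    return sum(1 if row["status"] == "error" else 0 for row in group) > 0


clean = [{"status": "baseline"}, {"status": "synthesized"}]
flagged = clean + [{"status": "error"}]
```

Each matching row contributes 1 to the sum, so the total is positive exactly when at least one row matches; the same trick with a different comparison (for example, equal to the group's count) expresses an "all rows match" condition.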

File Splitting

This recipe demonstrates splitting large output files into smaller chunks, optionally keeping related records together.

Basic record-count splitting

Split output into files of at most 5,000 records each:

pipeline:
  name: monthly_report

nodes:
  - type: source
    name: transactions
    config:
      name: transactions
      type: csv
      path: "./data/transactions.csv"
      schema:
        - { name: id, type: int }
        - { name: date, type: string }
        - { name: department, type: string }
        - { name: amount, type: float }
        - { name: description, type: string }

  - type: output
    name: split_output
    input: transactions
    config:
      name: monthly_report
      type: csv
      path: "./output/report.csv"
      split:
        max_records: 5000
        naming: "{stem}_{seq:04}.{ext}"
        repeat_header: true

Output files

output/report_0001.csv  (5000 records + header)
output/report_0002.csv  (5000 records + header)
output/report_0003.csv  (remaining records + header)

Naming pattern variables

Variable	Description	Example
`{stem}`	Base filename without extension	`report`
`{ext}`	File extension	`csv`
`{seq:04}`	Zero-padded sequence number (width 4)	`0001`

The path field provides the template: ./output/report.csv means stem is report and ext is csv.
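
As an illustration of how the three variables combine, the sketch below expands the pattern in Python. It happens that Python's str.format accepts the same {seq:04} spelling, but that is a convenience of this sketch, not a statement about how Clinker parses the pattern; chunk_path is a hypothetical helper, and numbering from 1 follows the file listing above.

```python
from pathlib import PurePosixPath


def chunk_path(path, naming, seq):
    """Expand a naming pattern against the stem and extension of `path`."""
    p = PurePosixPath(path)
    name = naming.format(stem=p.stem, ext=p.suffix.lstrip("."), seq=seq)
    return str(p.with_name(name))


first = chunk_path("./output/report.csv", "{stem}_{seq:04}.{ext}", 1)
third = chunk_path("./output/report.csv", "{stem}_{seq:04}.{ext}", 3)
```

With the path from the example, first is output/report_0001.csv and third is output/report_0003.csv.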

Header behavior

When repeat_header: true, each output file includes the CSV header row. This is the recommended setting – each file is self-contained and can be processed independently.

Grouped splitting

Keep all records with the same group key value in the same file:

      split:
        max_records: 5000
        group_key: "department"
        naming: "{stem}_{seq:04}.{ext}"
        repeat_header: true
        oversize_group: warn

With group_key: "department", the splitter ensures that all records for a given department land in the same output file. A new file starts only at a group boundary (when the department value changes), even if the current file has not reached max_records yet.

Oversize group policy

If a single group contains more records than max_records, the oversize_group setting controls behavior:

Policy	Behavior
`warn` (default)	Log a warning and write all records for the group into one file, exceeding the limit
`error`	Stop the pipeline with an error
`allow`	Silently allow the oversized file

For example, if max_records is 5,000 but the Engineering department has 7,000 records, the warn policy produces a file with 7,000 records and logs a warning.
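
The interaction between group boundaries and the oversize policy can be sketched in Python. This model rests on one assumption about the rotation rule that this page does not spell out: a new file is started at a group boundary when the next group would not fit in the current file. The function name, the tuple records, and the warning text are all hypothetical, and the input is assumed to be sorted by the group key.

```python
from itertools import groupby


def split_grouped(records, key, max_records, oversize_group="warn"):
    """Split key-sorted records into files without breaking a group apart."""
    files, current, warnings = [], [], []
    for value, members in groupby(records, key=key):
        group = list(members)
        if len(group) > max_records:
            if oversize_group == "error":
                raise ValueError(f"group {value!r} exceeds max_records")
            if oversize_group == "warn":
                warnings.append(f"group {value!r}: {len(group)} > {max_records}")
            # "allow" falls through silently
        # Assumed rule: rotate at the boundary if this group would not fit.
        if current and len(current) + len(group) > max_records:
            files.append(current)
            current = []
        current.extend(group)
    if current:
        files.append(current)
    return files, warnings


records = (
    [("Engineering", i) for i in range(7000)]
    + [("Sales", i) for i in range(2000)]
    + [("Support", i) for i in range(4000)]
)
files, warnings = split_grouped(records, key=lambda r: r[0], max_records=5000)
sizes = [len(f) for f in files]
```

Under these assumptions the Engineering group lands in one 7,000-record file with a single warning, and Sales and Support each start a fresh file at their group boundary.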

Byte-based splitting

Split by file size instead of record count:

      split:
        max_bytes: 10485760  # 10 MB per file
        naming: "{stem}_{seq:04}.{ext}"
        repeat_header: true

The splitter estimates the current file size and starts a new file when the limit is approached. The actual file size may slightly exceed the limit because the current record is always completed before splitting.

Byte-based splitting works with every output format. For formats that wrap the whole file in framing – a JSON array or an XML root element – each rotation closes the current file’s framing and reopens it in the next, so every chunk is a complete, independently valid document (its own [ ... ] array or <Root> ... </Root> tree), never a fragment.
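
A small Python sketch shows what "complete document per chunk" means for a JSON array. It is an illustration only: split_json_by_bytes is a hypothetical name, the size estimate counts element bytes and ignores brackets and commas, and the rotation check runs after each record is written, which is one way to model "the current record is always completed".

```python
import json


def split_json_by_bytes(records, max_bytes):
    """Return a list of JSON documents, each a complete array."""
    documents, current, size = [], [], 0
    for record in records:
        encoded = json.dumps(record)
        current.append(encoded)                 # finish the current record first
        size += len(encoded.encode("utf-8"))
        if size >= max_bytes:                   # then rotate if the limit is hit
            documents.append("[" + ",".join(current) + "]")   # close the framing
            current, size = [], 0
    if current:
        documents.append("[" + ",".join(current) + "]")
    return documents


records = [{"id": i} for i in range(5)]
documents = split_json_by_bytes(records, max_bytes=20)
parsed = [json.loads(doc) for doc in documents]
```

Every entry in documents parses on its own, and concatenating the parsed arrays gives back the original records in order.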

Combined limits

Use both max_records and max_bytes together – whichever limit is reached first triggers a new file:

      split:
        max_records: 10000
        max_bytes: 5242880   # 5 MB
        naming: "{stem}_{seq:04}.{ext}"
        repeat_header: true

This is useful when record sizes vary widely. Short records might produce a tiny file at 10,000 records, while long records might hit the byte limit well before 10,000.
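
The "whichever comes first" rule can be modeled in a few lines of Python. This is a sketch under simplifying assumptions (line-oriented records, one byte per newline, header bytes not counted); split_lines and the reason labels are hypothetical and exist only to show which limit fired.

```python
def split_lines(lines, max_records=None, max_bytes=None):
    """Rotate when either limit is reached; report which one triggered."""
    files, current, size = [], [], 0
    for line in lines:
        current.append(line)
        size += len(line.encode("utf-8")) + 1   # +1 for the newline
        hit_records = max_records is not None and len(current) >= max_records
        hit_bytes = max_bytes is not None and size >= max_bytes
        if hit_records or hit_bytes:
            files.append((current, "records" if hit_records else "bytes"))
            current, size = [], 0
    if current:
        files.append((current, "eof"))
    return files


short = split_lines(["x"] * 10, max_records=4, max_bytes=100)
long = split_lines(["y" * 60] * 3, max_records=4, max_bytes=100)
```

Short records rotate on the record limit (files of 4, 4, and 2 lines), while the 60-character records rotate on the byte limit after only two lines.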

Full pipeline example

A complete pipeline that reads a large transaction file, filters it, and splits the output:

pipeline:
  name: split_transactions

nodes:
  - type: source
    name: transactions
    config:
      name: transactions
      type: csv
      path: "./data/all_transactions.csv"
      schema:
        - { name: id, type: int }
        - { name: date, type: string }
        - { name: department, type: string }
        - { name: category, type: string }
        - { name: amount, type: float }

  - type: transform
    name: current_year
    input: transactions
    config:
      cxl: |
        filter date.starts_with("2026")

  - type: output
    name: chunked
    input: current_year
    config:
      name: transactions_2026
      type: csv
      path: "./output/transactions_2026.csv"
      split:
        max_records: 5000
        group_key: "department"
        naming: "{stem}_{seq:04}.{ext}"
        repeat_header: true
        oversize_group: warn

clinker run split_transactions.yaml --force

Practical considerations

Downstream consumers. Splitting is useful when the receiving system has file size limits (e.g., an upload API that accepts files up to 10 MB) or when parallel processing of chunks is desired.
Record ordering. Records within each output file maintain their original order from the pipeline. Across files, the sequence number ({seq}) indicates the order.
Group key sorting. group_key keeps a group together only when its records arrive contiguously, so the input should be sorted by the group key. If the input is not sorted, records for the same group may appear in multiple files. Pre-sort upstream if needed, or accept the split-group behavior.
Overwrite behavior. Use --force when re-running a pipeline with splitting enabled. Without it, the pipeline aborts if any of the output chunk files already exist.

Intra-Record Closures

This recipe shows the complete intra-record fan-out shape: an NDJSON source where each record carries an array of line items, a transform that filters items by price and then fans each remaining item into its own output record, and a flat NDJSON sink ready for downstream billing.

The pieces involved:

Arrow-syntax closures for predicates and projections.
Array methods (filter, map) for in-place transformation.
Bracket-index access (it["sku"]) for reading fields off each map element.
emit each for fan-out.
The Output node’s include_unmapped flag for controlling which fields reach the sink.

Input data

orders.ndjson – one JSON object per line, each carrying a nested items array:

{"order_id":"O-1","customer":"alice@example.com","items":[{"sku":"a","price":10,"qty":2},{"sku":"b","price":20,"qty":1},{"sku":"c","price":3,"qty":5}]}
{"order_id":"O-2","customer":"bob@example.com","items":[{"sku":"a","price":10,"qty":1},{"sku":"d","price":50,"qty":1}]}

Each record has two order-level fields (order_id, customer) and an items array whose elements are maps with sku, price, and qty.

Goal

For each order:

Drop items priced under $5 (a sub-threshold cutoff).
Fan the surviving items into one output record each, carrying the order-level identifiers plus the per-item fields.
Compute the per-line revenue (unit_price * qty) for each output record.

Pipeline

billing_lines.yaml:

pipeline:
  name: billing_lines

nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: json
      options:
        format: ndjson
      path: "./orders.ndjson"
      schema:
        - { name: order_id, type: string }
        - { name: customer, type: string }
        - { name: items, type: any }

  - type: transform
    name: filter_lines
    input: orders
    config:
      cxl: |
        emit order_id = order_id
        emit customer = customer
        emit item_count = items.length()
        emit kept = items.filter(it => it["price"] >= 5)

  - type: transform
    name: explode
    input: filter_lines
    config:
      max_expansion: 10000
      cxl: |
        emit each it in kept {
          emit order_id = order_id
          emit customer = customer
          emit sku = it["sku"]
          emit unit_price = it["price"]
          emit qty = it["qty"]
          emit line_total = it["price"] * it["qty"]
        }

  - type: output
    name: lines_out
    input: explode
    config:
      name: lines_out
      type: json
      path: "./output/billing_lines.ndjson"
      options:
        format: ndjson
      include_unmapped: false
      exclude: [items, kept]

error_handling:
  strategy: continue

Run it

# Validate first
clinker run billing_lines.yaml --dry-run

# Preview the first few output records
clinker run billing_lines.yaml --dry-run -n 3

# Full run
clinker run billing_lines.yaml

Expected output

output/billing_lines.ndjson:

{"order_id":"O-1","customer":"alice@example.com","sku":"a","unit_price":10,"qty":2,"line_total":20}
{"order_id":"O-1","customer":"alice@example.com","sku":"b","unit_price":20,"qty":1,"line_total":20}
{"order_id":"O-2","customer":"bob@example.com","sku":"a","unit_price":10,"qty":1,"line_total":10}
{"order_id":"O-2","customer":"bob@example.com","sku":"d","unit_price":50,"qty":1,"line_total":50}

Order O-1’s three input items collapse to two output records (the sku=c line was filtered out because its price was below $5). Order O-2’s two items both survive the filter and produce two output records.
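
The expected output can be reproduced with a plain Python model of the two transforms. This is a sketch of the data flow, not of CXL evaluation: filter_lines and explode here are ordinary Python functions that borrow the node names, and direct key access stands in for bracket indexing because every sample item carries all three keys.

```python
ORDERS = [
    {"order_id": "O-1", "customer": "alice@example.com", "items": [
        {"sku": "a", "price": 10, "qty": 2},
        {"sku": "b", "price": 20, "qty": 1},
        {"sku": "c", "price": 3, "qty": 5},
    ]},
    {"order_id": "O-2", "customer": "bob@example.com", "items": [
        {"sku": "a", "price": 10, "qty": 1},
        {"sku": "d", "price": 50, "qty": 1},
    ]},
]


def filter_lines(order):
    """Stage 1: keep order grain, add item_count and the filtered `kept` array."""
    kept = [it for it in order["items"] if it["price"] >= 5]
    return {**order, "item_count": len(order["items"]), "kept": kept}


def explode(record):
    """Stage 2: one flat output record per surviving item."""
    for it in record["kept"]:
        yield {
            "order_id": record["order_id"],
            "customer": record["customer"],
            "sku": it["sku"],
            "unit_price": it["price"],
            "qty": it["qty"],
            "line_total": it["price"] * it["qty"],
        }


lines = [line for order in ORDERS for line in explode(filter_lines(order))]
```

The model yields four records with line totals 20, 20, 10, and 50, in the same order as output/billing_lines.ndjson.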

How it works

Filter stage. The filter_lines transform reads each order, runs items.filter(it => it["price"] >= 5) to drop sub-threshold items, and stashes the survivors in a kept field. The closure body uses bracket indexing (it["price"]) because each it is a map; bracket indexing returns null for missing keys without aborting. The same record also carries an item_count projection so downstream nodes could route or audit on the original (pre-filter) item count.

Explode stage. The explode transform contains one emit each block over kept. For each surviving item, the body emits a flat record with the order-level identifiers (order_id, customer) repeated, plus the per-item fields lifted out of it. The body has no filter or nested emit each – those are forbidden inside the block; pre-filter upstream as we did, or post-filter in a downstream transform.

include_unmapped: false. The default Output policy is to pass every unmapped input field through. Here we set it to false so the order-level items array (carried through from the source), the item_count projection, and the intermediate kept array (used only as the fan-out source) do not leak into the per-line output. The exclude: [items, kept] list provides a belt-and-suspenders defense against future renaming.

max_expansion: 10000. Caps how many output records a single input order may produce. The default is 10000; we set it explicitly here so the value is visible in the YAML. Orders with arrays larger than the cap route to the DLQ with category expansion_limit_exceeded (see Transform Nodes -> Expansion Cap).

Variations

Pass through every input field

Remove include_unmapped: false (or set it to true) and drop the exclude: [items, kept] list, and the unmapped upstream fields (the items array, the item_count projection, and the intermediate kept array) will appear on every output record. Useful when downstream consumers expect a complete record context, or when you need to audit what was filtered.

Emit a single record per order with the kept-items array

Drop the explode transform and route filter_lines directly to the Output. Each output record stays at order grain, with kept carrying the post-filter array. This is the same pipeline minus the fan-out step.

Reach for `.flat_map` instead of two transforms

When the per-element transformation is simple enough to fit in a single closure body, flat_map collapses the filter + project + explode pattern into one expression. It produces a flat array, which downstream nodes still see as a single field on the input record; the explicit emit each is what produces multiple output records.
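
The difference between a flat array and a fan-out is easy to see in a Python model. The flat_map below is the conventional "map each element to a list, then concatenate" operation and is assumed to match the CXL method only in that general shape; billable and the lines field are hypothetical names.

```python
def flat_map(items, fn):
    """Map each element to a list and concatenate the results."""
    return [out for item in items for out in fn(item)]


def billable(it):
    # Filter and project in one closure: drop cheap items, keep a slim shape.
    if it["price"] < 5:
        return []
    return [{"sku": it["sku"], "line_total": it["price"] * it["qty"]}]


order = {"order_id": "O-1", "items": [
    {"sku": "a", "price": 10, "qty": 2},
    {"sku": "b", "price": 20, "qty": 1},
    {"sku": "c", "price": 3, "qty": 5},
]}
record = {"order_id": order["order_id"], "lines": flat_map(order["items"], billable)}
```

The result is still one record at order grain whose lines field holds two elements; turning those elements into two output records is the job of emit each.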

Rewrite a nested field in place with `.set`

When you want to keep the record at order grain but mutate a value buried inside it, the set map method takes a dotted/indexed path and rewrites a single leaf, leaving every sibling untouched:

    cxl: |
      emit order = order.set("items[0].sku", "A-100").set("ship.region", "us-east")

The first set overwrites the SKU of the first item; the second writes ship.region, auto-creating the ship map if the order had no ship field yet. Because set is copy-on-write, this builds a fresh order document without disturbing the upstream binding. A path that conflicts with the existing shape (descending into a scalar, or an array index past the end) yields null for that set rather than partially writing – guard with catch if a path may not match every record.
