Clinker
Clinker is a pure-Rust, bounded-memory batch DAG executor for CSV, JSON, XML, and fixed-width data. It reads finite inputs, drives them through a directed acyclic graph of transformation nodes one record at a time, and exits when the inputs are drained. It ships as a single static binary with no interpreter, no runtime, and no install dependencies.
Pipelines are declared in YAML. Data transformation logic is written in CXL, a custom expression language purpose-built for ETL. Together they replace legacy tools like Informatica, SSIS, Talend, and NiFi with something deterministic, lightweight, and easy to reason about.
What Clinker is, plainly
A finite batch executor with per-record streaming evaluation, not a
long-running stream processor. A pipeline run is a job: Sources read until
EOF, the DAG drains, the process exits. Within a run, stateless operators
(Transform, Route, most Combine probe-side work, Output) evaluate records one
at a time without accumulating per-record state. Every stage is charged
against the configured RSS budget. Fused Source → Transform → Output paths
run streaming with no per-stage materialization; non-fused boundaries
(Route fan-out, Merge fan-in, Composition bodies, diamond DAGs) materialize
records into per-stage buffers that charge against the same envelope. The
engine spills buffers to disk at 80% of the limit and fails fast with
E310 MemoryBudgetExceeded at the hard limit, naming the offending
producer. Blocking operators (Aggregate, sort, grace-hash Combine)
accumulate state inside that same budget and spill to disk when soft
and hard memory thresholds trip, rather than OOM-killing the process.
If you have used Flink, Kafka Streams, or Beam in unbounded mode: Clinker is not that. There are no watermarks against wall-clock time, no infinite-source semantics, no exactly-once delivery across restarts. The closest prior art is Pentaho Kettle / Apache Hop, Embulk, Singer, Benthos in batch mode, and Vector running file-to-file – finite ETL jobs with per-record evaluation and a hard memory ceiling.
Three pillars of what Clinker is:
- Finite inputs. Files (CSV, JSON, XML, fixed-width) are the canonical shape. Finite-cursor network sources (paginated REST APIs, SQL SELECT cursors) fit the same model – they exhaust their cursor and EOF. Unbounded sources (Kafka topics, Kinesis streams, Server-Sent Events, webhooks, tail -f-style file followers) are out of scope and will remain so.
- Finite jobs. A pipeline run begins when you invoke clinker run, drains the DAG, and exits with a status code. No long-running daemon, no service surface, no infinite event loop.
- Single process. One clinker binary invocation is one operating-system process. Parallelism happens inside the process via threads (std::thread, Rayon). Clinker does not spawn worker processes, does not coordinate a cluster, and does not shuffle data between machines. Scale by giving the host more cores, more RAM, and more disk – the DuckDB / Polars / Kettle model. If a single host genuinely can’t fit the work, partition the input by file or by key and run multiple clinker invocations from a shell script; that’s a five-line bash script, not an architectural addition.
Why Clinker?
Single binary, zero dependencies. Download it, run it. No JVM, no Python, no package manager. Works on any Linux server out of the box.
Good neighbor on busy servers. Clinker enforces a strict memory ceiling (default 512 MB) so it can run alongside JVM applications, databases, and other services without competing for RAM. Aggregation spills to disk when memory pressure rises.
Reproducible output. Given the same input and pipeline, Clinker produces byte-identical output across runs. No nondeterminism from thread scheduling, hash randomization, or floating-point reordering.
Operability-first design. Per-stage metrics, dead-letter queues for error records, explain plans for understanding execution, and structured exit codes for scripting. Built for production from day one.
Two binaries:
| Binary | Purpose |
|---|---|
| clinker | Run pipelines against real data |
| cxl | Check, evaluate, and format CXL expressions interactively |
A taste of Clinker
Here is a complete pipeline that reads a customer CSV, filters to active customers, classifies them into tiers, and writes the result:
pipeline:
  name: customer_etl
nodes:
  - type: source
    name: customers
    config:
      name: customers
      type: csv
      path: "./data/customers.csv"
      schema:
        - { name: customer_id, type: int }
        - { name: first_name, type: string }
        - { name: last_name, type: string }
        - { name: status, type: string }
        - { name: lifetime_value, type: float }
  - type: transform
    name: enrich
    input: customers
    config:
      cxl: |
        filter status == "active"
        emit customer_id = customer_id
        emit full_name = first_name + " " + last_name
        emit tier = if lifetime_value >= 10000 then "gold" else "standard"
  - type: output
    name: result
    input: enrich
    config:
      name: enriched
      type: csv
      path: "./output/enriched_customers.csv"
Run it:
clinker run customer_etl.yaml
That is the entire workflow. No project scaffolding, no configuration files, no compile step. One YAML file, one command.
Next steps
- Installation – download the binary and verify it works
- Your First Pipeline – build and run a pipeline step by step
- Key Concepts – understand the mental model behind Clinker pipelines
Non-Goals
This page lists what Clinker is deliberately not. These are architectural commitments — design surfaces Clinker will not grow into, not just features that haven’t been built yet.
If you arrived here because you were considering Clinker for one of the scenarios below, the answer is “a different tool is the right fit.” Each non-goal is paired with the kind of tool that is the right fit.
Not an unbounded stream processor
Clinker reads sources that have an end. A pipeline run is a finite job: Sources read until EOF, the DAG drains, the process exits.
Out of scope:
- Kafka topics, Kinesis streams, Pub/Sub subscriptions (long-running consumers without a natural end).
- Server-Sent Events, WebSocket subscriptions, webhooks-as-input.
- tail -f-style file followers.
- Watermarking against wall-clock time.
- Exactly-once delivery across process restarts.
- Stateful infinite-stream windowing (tumbling / sliding / session windows over event time without a finite boundary).
Right fit instead: Apache Flink, Kafka Streams, Apache Beam in unbounded mode, Vector with streaming sources, Benthos with streaming inputs, Apache NiFi.
Not a multi-process or distributed engine
One clinker run invocation is one operating-system process. Clinker
does not spawn worker processes, does not coordinate a cluster, and does
not shuffle data between machines.
Out of scope:
- Worker-process pools on a single machine.
- Multi-machine sharded execution.
- Network shuffle between executors.
- Cluster managers (Kubernetes operators, YARN, Mesos integrations).
- Distributed memory accounting.
- Partial-failure recovery across worker boundaries.
Right fit instead: Apache Spark, Trino / Presto, Apache Flink in cluster mode, Apache Beam on Dataflow, Hadoop MapReduce.
Scaling Clinker: give the host more cores, more RAM, more disk — the
DuckDB / Polars / Kettle / Hop model. If a single host genuinely can’t
fit the work, partition the input by file or by key and run multiple
clinker invocations from a shell script. That’s a five-line script,
not an architectural addition.
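The partition-and-rerun approach can be sketched as follows. This is an illustration in Python rather than the shell script mentioned above, and it assumes one pipeline YAML per partition (the file names in the comments are hypothetical). The only Clinker interface it relies on is the documented clinker run command; the injectable runner exists so the logic can be exercised without the binary.

```python
import subprocess
from pathlib import Path


def build_commands(pipeline_files):
    """Return one `clinker run` argv per partition pipeline, in sorted order."""
    return [["clinker", "run", str(Path(p))] for p in sorted(pipeline_files)]


def run_all(commands, runner=subprocess.run):
    """Run each invocation in turn; stop at the first non-zero exit code."""
    for argv in commands:
        result = runner(argv)
        if result.returncode != 0:
            return result.returncode
    return 0


# Hypothetical usage, assuming partitions/part_00.yaml, part_01.yaml, ...:
#   exit_code = run_all(build_commands(Path("partitions").glob("*.yaml")))
```

Each invocation is still an independent single-process job; the wrapper only sequences them and propagates the first failing exit code.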
Not a long-running service
Clinker is a CLI binary, not a server. There is no daemon mode, no HTTP control plane, no JDBC/ODBC listener, no UI server, no scheduled job runner inside Clinker itself.
Out of scope:
- HTTP API exposing pipeline execution.
- Built-in cron / scheduler / orchestrator.
- Persistent connection pool living across pipeline runs.
- A long-lived process accepting new pipeline submissions over a socket.
Right fit instead:
- For scheduling: cron, systemd timers, Airflow, Dagster, Prefect, Temporal.
- For HTTP-fronted ETL: any of the above orchestrators wrapping clinker run invocations.
- For interactive queries against finite data: DuckDB, Polars, or any embedded query engine.
Not an OLAP / SQL query engine
Clinker is a per-record expression engine with explicit nodes in
a DAG. It does not parse SQL, does not optimize joins via cost-based
optimization across the whole pipeline, and does not present a relational
table model.
Out of scope:
- SQL parsing (the CXL language is the surface; no SELECT ... FROM is accepted).
- Cost-based join reordering across more than the local Combine node.
- Materialized views or query caching.
- Interactive query latencies under a second.
- ANSI-SQL semantics for NULL, type coercion, or aggregate behavior.
Right fit instead: DuckDB, ClickHouse, DataFusion, Trino, Postgres, or any RDBMS. If you want SQL-driven transformation over files, DuckDB is the closest single-binary alternative to Clinker for the cases where SQL is the right surface.
Not a connector marketplace
Clinker ships with a deliberately small set of source and sink types: CSV, JSON, XML, fixed-width files in the current release; finite-cursor REST and SQL sources on the roadmap. There is no plugin registry, no third-party connector store, no SaaS-API catalog.
Out of scope:
- Hundreds of pre-built SaaS integrations (Salesforce, HubSpot, Stripe, etc.).
- A central registry of community-maintained connectors.
- Schema discovery against arbitrary external APIs.
- Change-data-capture (CDC) sources.
Right fit instead: Airbyte, Fivetran, Stitch, Singer with its tap ecosystem, dlt (data load tool).
Not a streaming-CDC engine
Clinker treats each pipeline run as a fresh, finite pass over the input. It does not maintain a persistent log of source changes, does not replicate row-level changes from a database, and does not produce an append-only stream of inserts / updates / deletes.
Out of scope:
- Postgres logical replication subscriptions.
- MySQL binlog tailing.
- Debezium-style CDC stream production.
- Maintaining a target database in continuous sync with a source.
Right fit instead: Debezium, Maxwell, AWS DMS, Striim, Estuary Flow, or vendor-native CDC like Snowflake Streams.
What Clinker is
For the positive framing, see the Introduction and Key Concepts. The short version:
- A pure-Rust, single-binary, bounded-memory batch DAG executor for finite file and finite-cursor inputs.
- Per-record evaluation through a directed acyclic graph of Source, Transform, Aggregate, Route, Merge, Combine, Output, and Composition nodes.
- Pipelines declared in YAML, transformation logic written in CXL (a custom per-record expression language).
- One process, finite job, EOF-then-exit. Disk spill under memory pressure rather than OOM.
Installation
Clinker is a single static binary with no runtime dependencies. Download it,
put it on your PATH, and you are ready to go.
Binaries
Clinker ships two binaries:
- clinker – the pipeline executor. This is the main tool you use to validate and run pipelines against data.
- cxl – the CXL expression checker, evaluator, and formatter. Use it during development to test expressions interactively, check types, and format CXL blocks.
Verify installation
After placing the binaries on your PATH, confirm they work:
clinker --version
clinker 0.1.0
cxl --version
cxl 0.1.0
Both commands should print a version string and exit. If you see
command not found, check that the directory containing the binaries is in
your PATH.
Building from source
Clinker requires Rust 1.91+ (edition 2024). If you have a Rust toolchain installed, build and install both binaries directly from the repository:
# Clone the repository
git clone https://github.com/rustpunk/clinker.git
cd clinker
# Install the pipeline executor
cargo install --path crates/clinker
# Install the CXL expression tool
cargo install --path crates/cxl-cli
This compiles release-optimized binaries and places them in ~/.cargo/bin/,
which is typically already on your PATH.
To verify the build:
cargo test --workspace
This runs the full test suite (approximately 1100 tests) and confirms everything is working correctly on your system.
Rust toolchain
The repository includes a rust-toolchain.toml that pins the exact Rust
version. If you use rustup, it will automatically download the correct
toolchain when you build.
| Requirement | Value |
|---|---|
| Rust edition | 2024 |
| Minimum version | 1.91 |
| C dependencies | None |
Your First Pipeline
This walkthrough builds a pipeline from scratch, runs it, and explores the tools Clinker provides for validating and understanding pipelines before they touch real data.
1. Create sample data
Save the following as employees.csv:
id,name,department,salary
1,Alice Chen,Engineering,95000
2,Bob Martinez,Marketing,62000
3,Carol Johnson,Engineering,88000
4,Dave Williams,Sales,71000
2. Write the pipeline
Save the following as my_first_pipeline.yaml:
pipeline:
  name: salary_report
nodes:
  - type: source
    name: employees
    config:
      name: employees
      type: csv
      path: "./employees.csv"
      schema:
        - { name: id, type: int }
        - { name: name, type: string }
        - { name: department, type: string }
        - { name: salary, type: int }
  - type: transform
    name: classify
    input: employees
    config:
      cxl: |
        emit id = id
        emit name = name
        emit department = department
        emit salary = salary
        emit level = if salary >= 90000 then "senior" else "junior"
  - type: output
    name: report
    input: classify
    config:
      name: salary_report
      type: csv
      path: "./salary_report.csv"
This pipeline has three nodes:
- employees (source) – reads the CSV file and declares the schema.
- classify (transform) – passes all fields through and adds a level field based on salary.
- report (output) – writes the result to a new CSV file.
The input: field on each consumer node wires the DAG together. Data flows
from employees through classify to report.
3. Validate before running
Before processing any data, check that the pipeline is well-formed:
clinker run my_first_pipeline.yaml --dry-run
Dry-run parses the YAML, resolves the DAG, and type-checks all CXL expressions
against the declared schemas. If there are errors – a typo in a field name, a
type mismatch, a missing input: reference – Clinker reports them with
source-location diagnostics and stops. No data is read.
4. Preview records
To see what the output will look like without writing files, preview a few records:
clinker run my_first_pipeline.yaml --dry-run -n 2
This reads the first 2 records from the source, runs them through the pipeline, and prints the results to the terminal. Useful for sanity-checking transformations before committing to a full run.
5. Understand the execution plan
To see how Clinker will execute the pipeline:
clinker run my_first_pipeline.yaml --explain
The explain plan shows the DAG topology, the order nodes will execute, per-node parallelism strategy, and schema propagation through the pipeline. This is valuable for understanding complex pipelines with routes, merges, and aggregations.
6. Run it
clinker run my_first_pipeline.yaml
Clinker reads employees.csv, applies the transform, and writes
salary_report.csv. The output:
id,name,department,salary,level
1,Alice Chen,Engineering,95000,senior
2,Bob Martinez,Marketing,62000,junior
3,Carol Johnson,Engineering,88000,junior
4,Dave Williams,Sales,71000,junior
Alice’s salary of 95,000 meets the threshold, so she is classified as
senior. Everyone else is junior.
What just happened
The pipeline executed as a streaming process:
- The source node read employees.csv one record at a time.
- Each record flowed through the classify transform, which evaluated the CXL block to produce the output fields.
- The output node wrote each transformed record to salary_report.csv.
At no point was the entire dataset loaded into memory. This is how Clinker processes files of any size under its memory ceiling.
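The record-at-a-time flow described here can be imitated with Python generators. This is only a conceptual sketch of the source, transform, and output stages for the walkthrough data, not how Clinker is implemented; the function names are made up for the illustration.

```python
import csv
import io


def read_records(text):
    """Source: yield one record (dict) at a time instead of loading all rows."""
    for row in csv.DictReader(io.StringIO(text)):
        row["id"] = int(row["id"])          # mirrors the declared `int` type
        row["salary"] = int(row["salary"])  # mirrors the declared `int` type
        yield row


def classify(records):
    """Transform: evaluate each record independently and emit output fields."""
    for r in records:
        yield {
            "id": r["id"],
            "name": r["name"],
            "department": r["department"],
            "salary": r["salary"],
            "level": "senior" if r["salary"] >= 90000 else "junior",
        }


def write_records(records, out):
    """Output: write each record as soon as it arrives."""
    writer = None
    for r in records:
        if writer is None:
            writer = csv.DictWriter(out, fieldnames=list(r), lineterminator="\n")
            writer.writeheader()
        writer.writerow(r)


SAMPLE = """id,name,department,salary
1,Alice Chen,Engineering,95000
2,Bob Martinez,Marketing,62000
3,Carol Johnson,Engineering,88000
4,Dave Williams,Sales,71000
"""

out = io.StringIO()
write_records(classify(read_records(SAMPLE)), out)
print(out.getvalue(), end="")
```

Because each stage pulls one record from the stage before it, only a single record is in flight at any moment, regardless of how many rows the input has.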
Next steps
- Key Concepts – understand the building blocks of Clinker pipelines
- Pipeline YAML Structure – full reference for pipeline configuration
- CXL Overview – learn the expression language in depth
Key Concepts
This page covers the mental model behind Clinker pipelines. If you have experience with other ETL tools, most of this will feel familiar – but pay attention to where Clinker diverges, especially around CXL, per-record evaluation, and the memory budget.
Batch jobs, not unbounded streams
A Clinker run is a finite batch job. Source nodes read their files until EOF, the DAG drains, and the process exits. There are no watermarks against wall-clock time, no infinite-source semantics, no exactly-once delivery across restarts. If you have used Flink, Kafka Streams, or Beam in unbounded mode: Clinker is not that.
The word “streaming” in Clinker’s documentation always refers to per-record
evaluation within a single batch run – records flow through the graph one
at a time rather than being materialized as a whole table – not to
long-running stream-processor semantics. Internal identifiers in the codebase
(function names like streaming_output_task, config fields like
strategy: streaming, error messages, log lines) use the word in the same
row-by-row sense; if you see it in a stack trace, it is not Flink leaking
through.
Finite inputs only
Clinker reads sources that have an end. Files are the canonical shape, and
finite-cursor network sources (paginated REST APIs, SQL SELECT cursors)
fit the same model – they exhaust their cursor and EOF. Unbounded sources
(Kafka, Kinesis, Server-Sent Events, webhooks, tail -f-style file
followers) are explicitly out of scope and will remain so.
Single process, ever
One clinker run invocation is one OS process. Parallelism happens inside
that process via threads. Clinker does not spawn worker processes, does not
coordinate a cluster, and does not shuffle data between machines. Scale by
giving the host more cores, more RAM, more disk – the DuckDB / Polars /
Kettle model. If a single host genuinely can’t fit the work, partition the
input by file or by key and run multiple clinker invocations from a shell
script; that’s a five-line script, not an architectural addition to
Clinker.
For the full list of what Clinker deliberately does not do, see Non-Goals.
Pipelines are DAGs
A pipeline is a directed acyclic graph of nodes. Data flows from sources, through processing nodes, to outputs. There are no cycles – a node cannot consume its own output, directly or indirectly.
You define the graph by setting input: on each consumer node, naming the
upstream node it reads from. Clinker resolves these references, validates that
the graph is acyclic, and determines execution order automatically.
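As a rough illustration of what resolving references and checking for cycles involves, the sketch below builds a dependency graph from input / inputs fields and orders it with Python's standard graphlib module. It is an assumption-laden stand-in for the idea, not Clinker's planner, and the function name is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(nodes):
    """Resolve `input`/`inputs` references and return a valid execution order.

    `nodes` is a list of dicts shaped like the YAML node entries. Port
    references such as "split.us" are reduced to the node name "split".
    """
    names = {n["name"] for n in nodes}
    graph = {}
    for n in nodes:
        refs = list(n.get("inputs", []))
        if "input" in n:
            refs.append(n["input"])
        upstream = {ref.split(".", 1)[0] for ref in refs}
        missing = upstream - names
        if missing:
            raise ValueError(f"{n['name']}: unknown input {sorted(missing)}")
        graph[n["name"]] = upstream
    # static_order() raises CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())
```

For the three-node walkthrough pipeline this yields employees, classify, report, whatever order the nodes are listed in.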
The nodes: list
Every pipeline has a single flat list of nodes. Each node has a type:
discriminator that determines its behavior. The eight node types are:
| Type | Purpose |
|---|---|
| source | Read data from a file (CSV, JSON, XML, fixed-width) |
| transform | Apply CXL logic to reshape, filter, or enrich records |
| aggregate | Group records and compute summary values (sum, count, etc.) |
| route | Split a stream into named ports based on conditions |
| merge | Concatenate multiple streams that share a schema |
| combine | Join records across N inputs with cross-input predicates |
| output | Write data to a file |
| composition | Embed a reusable sub-pipeline |
You can have as many nodes of each type as your pipeline requires. The only constraint is that the resulting graph must be a valid DAG.
CXL is not SQL
CXL is a per-record expression language. Each record flows through a CXL block
independently – there is no table-level context, no SELECT, no FROM, no
JOIN. Think of it as a programmable row mapper.
The core statements:
- emit name = expr – produce a field in the output record. Only emitted fields appear downstream. If you want to pass a field through unchanged, you must emit it explicitly: emit id = id.
- let name = expr – bind a local variable for use in later expressions. Local variables do not appear in the output.
- filter condition – discard the record if the condition is false. A filtered record produces no output and is not counted as an error.
- distinct / distinct by field – deduplicate records. distinct deduplicates on all output fields; distinct by field deduplicates on a specific field.
CXL uses and, or, and not for boolean logic – not && or ||. String
concatenation uses +. Conditional expressions use
if ... then ... else ... syntax.
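For readers who think in general-purpose languages, the "programmable row mapper" idea can be sketched in Python. The CXL lines appear as comments above their rough Python equivalents; the let binding is a hypothetical addition for illustration, and distinct is not covered. This shows the semantics described on this page, not how the CXL evaluator works.

```python
def enrich(record):
    """Per-record mapper: returns emitted fields, or None if filtered out."""
    # filter status == "active"
    if not (record["status"] == "active"):
        return None
    # let is_gold = lifetime_value >= 10000   (local only; never in the output)
    is_gold = record["lifetime_value"] >= 10000
    return {
        # emit customer_id = customer_id   (pass-through must be explicit)
        "customer_id": record["customer_id"],
        # emit full_name = first_name + " " + last_name
        "full_name": record["first_name"] + " " + record["last_name"],
        # emit tier = if is_gold then "gold" else "standard"
        "tier": "gold" if is_gold else "standard",
    }


def run(records):
    """Apply the mapper to each record independently, dropping filtered ones."""
    return [out for out in map(enrich, records) if out is not None]
```

Note that status and lifetime_value never reach the output because they are not emitted, and that a filtered record simply disappears rather than raising an error.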
System namespaces use a $ prefix: $pipeline.*, $window.*, $meta.*.
These provide access to pipeline metadata, window function state, and record
metadata respectively.
Per-record evaluation and the memory budget
Within a run, records flow through the pipeline one at a time. Clinker does not load an entire file into memory before processing it. A source reads one record, pushes it through the downstream nodes, and then reads the next. This is what “streaming” means in Clinker – row-by-row evaluation inside a finite batch job, not Flink-style unbounded stream processing.
Per-record evaluation keeps per-row memory usage bounded for the
stateless parts of the graph (Transform, Route, Merge, most Combine
probe-side work, Output). Every stage is charged against the configured
RSS budget. Fused Source → Transform → Output paths run streaming, with
no per-stage materialization, so a 100 GB CSV passes through with the
same footprint as a 100 KB CSV. A stage that hands its output to a single
downstream sink Output also avoids a charged inter-stage buffer –
single-branch Route, non-fused Merge, streaming Aggregate, and the Combine
probe-side stream their result straight to the writer (see
Streaming vs. Blocking Stages).
The remaining boundaries – multi-branch Route fan-out, output that forks
to several consumers, Composition bodies, diamond DAGs – materialize
records into per-stage buffers that charge against the same budget
envelope. When a buffer would push cumulative usage past the soft
threshold (80% of the limit), the engine spills the buffer to disk; when
it would exceed the hard limit, the engine fails fast with a structured
E310 MemoryBudgetExceeded diagnostic that names the offending
producer.
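The soft and hard thresholds can be pictured with a small accounting sketch. It follows the rule exactly as stated in this section (spill past 80% of the limit, fail past the limit) and nothing more: the class and method names are hypothetical, and the real arbitrator is certainly more involved.

```python
from dataclasses import dataclass


class MemoryBudgetExceeded(Exception):
    """Stands in for the E310 diagnostic; carries the offending producer."""

    def __init__(self, producer, requested, limit):
        super().__init__(
            f"E310 MemoryBudgetExceeded: {producer} ({requested} > {limit})"
        )
        self.producer = producer


@dataclass
class Budget:
    limit: int                # hard limit in bytes
    soft_ratio: float = 0.8   # soft threshold as a fraction of the limit
    used: int = 0             # bytes currently charged in memory

    def charge(self, producer, nbytes):
        """Return "memory" or "spill" for a buffer of `nbytes`, or raise."""
        projected = self.used + nbytes
        if projected > self.limit:
            raise MemoryBudgetExceeded(producer, projected, self.limit)
        if projected > self.limit * self.soft_ratio:
            return "spill"    # goes to disk, so nothing is charged here
        self.used = projected
        return "memory"
```

With a limit of 100 bytes, a 50-byte buffer is admitted, a further 40-byte buffer is spilled (90 is past the 80-byte soft threshold), and a 60-byte buffer fails with the producer's name attached.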
Use clinker run --explain to see which nodes will materialize
(buffer: materialized) versus which will stream (buffer: streaming)
before runtime – that label is the canonical “which stages charge the
budget” signal. See the --explain reference and
the memory-tuning page.
Stateful operators must accumulate. Aggregate, sort, and grace-hash Combine cannot emit until they have seen enough input – sums need every addend, a full sort needs the last row, a hash join needs the build side complete. These operators run inside a configured RSS budget (default 512 MB) and degrade gracefully under pressure rather than OOM:
- Aggregate uses hash aggregation by default and spills partitions to disk when soft/hard memory thresholds trip. When the input is already sorted by the group key, the planner picks streaming aggregation, which requires only constant memory.
- Sort spills runs to disk and merges them.
- Combine picks among in-memory hash join, grace hash join (spilled), and IEJoin / sort-merge depending on predicates and memory pressure.
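The difference between hash aggregation and streaming aggregation over pre-sorted input is easy to see in miniature. The following is a generic sketch of the two techniques in Python, not Clinker's operators; field names are taken from the walkthrough data.

```python
from itertools import groupby
from operator import itemgetter


def hash_aggregate(records, key, value):
    """Sum `value` per `key` with a hash table: one entry per distinct group."""
    totals = {}
    for r in records:
        totals[r[key]] = totals.get(r[key], 0) + r[value]
    return totals


def streaming_aggregate(sorted_records, key, value):
    """Sum `value` per `key` when the input is already sorted by `key`.

    Only the current group's running total is held, so each result can be
    emitted as soon as the key changes.
    """
    for k, group in groupby(sorted_records, key=itemgetter(key)):
        yield k, sum(r[value] for r in group)
```

The hash version must hold every group until the input ends, which is why it needs a spill path; the streaming version holds one group at a time but is only correct when equal keys are adjacent.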
The memory ceiling is a first-class promise. Clinker is designed to share a server with JVM applications, databases, and other services without competing for RAM.
Input wiring
Consumer nodes reference their upstream via the input: field:
- type: transform
  name: enrich
  input: customers    # reads from the node named "customers"
Route nodes produce named output ports. Downstream nodes reference a specific port using dot notation:
- type: route
  name: split_by_region
  input: customers
  config:
    routes:
      us: region == "US"
      eu: region == "EU"
    default: other
- type: output
  name: us_output
  input: split_by_region.us    # reads from the "us" port
Merge nodes accept multiple inputs using inputs: (plural):
- type: merge
  name: combined
  inputs:
    - us_transform
    - eu_transform
Schema declaration
Source nodes require an explicit schema: that declares every column’s name
and type:
config:
  schema:
    - { name: customer_id, type: int }
    - { name: email, type: string }
    - { name: balance, type: float }
    - { name: created_at, type: date }
Clinker uses these declarations to type-check CXL expressions at compile time, before any data is read. If a CXL block references a field that does not exist in the upstream schema, or applies an operation to an incompatible type, the error is caught during validation – not at row 5 million of a production run.
Supported types include int, float, string, bool, date, and
date_time.
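To make "caught during validation" concrete, here is a toy check for an expression of the form field >= literal against a declared schema. It is a minimal sketch under simplifying assumptions (only int and float count as comparable, and the function name is invented); CXL's actual type rules are not described on this page.

```python
NUMERIC = {"int", "float"}


def check_comparison(schema, field, literal):
    """Validate `field >= literal` against the schema without reading data.

    Returns None when the expression is well-typed, else an error message.
    """
    types = {col["name"]: col["type"] for col in schema}
    if field not in types:
        return f"unknown field '{field}'"
    if types[field] not in NUMERIC:
        return f"'{field}' has type {types[field]}, expected a numeric type"
    if isinstance(literal, bool) or not isinstance(literal, (int, float)):
        return f"cannot compare '{field}' with {literal!r}"
    return None
```

The point is that both failure modes, a misspelled field and a mismatched type, are decidable from the schema alone, so no input rows are needed to find them.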
Error handling
Each node can specify an error handling strategy:
| Strategy | Behavior |
|---|---|
| fail_fast | Stop the pipeline on the first error (default) |
| continue | Route error records to a dead-letter queue file and continue |
| best_effort | Log errors and continue without writing error records |
When using continue, Clinker writes rejected records to a DLQ file alongside
the output. Each DLQ entry includes the original record, the error category,
the error message, and the node that rejected it. This makes diagnosing
production issues straightforward: check the DLQ, fix the data or the
pipeline, and rerun.
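The three strategies differ only in what happens when a record fails. The sketch below models that in Python, with a DLQ entry carrying the four pieces of information listed above; the dictionary keys and the function name are illustrative choices, not Clinker's DLQ file format.

```python
def run_node(node_name, records, fn, strategy="fail_fast"):
    """Apply `fn` to each record under one of the three strategies.

    Returns (outputs, dlq). `dlq` is only populated for "continue".
    """
    if strategy not in {"fail_fast", "continue", "best_effort"}:
        raise ValueError(f"unknown strategy: {strategy}")
    outputs, dlq = [], []
    for record in records:
        try:
            outputs.append(fn(record))
        except Exception as exc:
            if strategy == "fail_fast":
                raise
            if strategy == "continue":
                dlq.append({
                    "record": record,                 # the original record
                    "category": type(exc).__name__,   # error category
                    "message": str(exc),              # error message
                    "node": node_name,                # node that rejected it
                })
            # best_effort: a real engine would log here; nothing is written
    return outputs, dlq
```

Under continue, the good records still reach the output and each bad one is preserved with enough context to fix the data or the pipeline and rerun.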
Pipeline YAML Structure
A Clinker pipeline is a single YAML file with three top-level sections: pipeline (metadata), nodes (the processing graph), and optionally error_handling.
Top-level shape
pipeline:
  name: my_pipeline              # Required: pipeline identifier
  memory:                        # Optional: see ops/memory.md
    limit: "256M"                # Optional (K/M/G suffixes), default 512M
    backpressure: pause          # Optional, default `pause`
  vars:                          # Optional key-value pairs
    threshold: 500
    label: "Monthly Report"
  date_formats: ["%Y-%m-%d"]     # Optional: custom date parsing formats
  rules_path: "./rules/"         # Optional: CXL module search path
  concurrency:                   # Optional
    threads: 4
    chunk_size: 1000
  metrics:                       # Optional
    spool_dir: "./metrics/"
nodes:                           # Required: flat list of pipeline nodes
  - type: source
    name: raw_data
    config:
      name: raw_data
      type: csv
      path: "./data/input.csv"
      schema:
        - { name: id, type: int }
        - { name: value, type: string }
  - type: transform
    name: clean
    input: raw_data
    config:
      cxl: |
        emit id = id
        emit value = value.trim()
  - type: output
    name: result
    input: clean
    config:
      name: result
      type: csv
      path: "./output/result.csv"
error_handling:                  # Optional
  strategy: fail_fast
Pipeline metadata
The pipeline: block carries global settings that apply to the entire run.
| Field | Required | Description |
|---|---|---|
| name | Yes | Pipeline identifier. Used in logs and metrics. |
| memory | No | Memory-arbitrator tuning. Nested fields: limit (RSS budget, K/M/G suffixes, default 512M) and backpressure (spill/pause/both, default pause). See Memory Tuning. |
| vars | No | Scalar constants accessible in CXL via $vars.*. |
| date_formats | No | List of strftime-style patterns for date parsing. |
| rules_path | No | Directory for CXL use module resolution. |
| concurrency | No | threads and chunk_size for parallel chunk processing. |
| metrics | No | spool_dir for per-run JSON metric files. |
| date_locale | No | Locale for date formatting. |
| include_provenance | No | Attach provenance metadata to records. |
The nodes list
Every pipeline has a flat nodes: list. Each entry is a node with a type: discriminator that determines its kind:
| Type | Role |
|---|---|
| source | Reads data from a file |
| transform | Applies CXL expressions to each record |
| aggregate | Groups and summarizes records |
| route | Splits records into named branches by condition |
| merge | Concatenates multiple upstream branches that share a schema |
| combine | Joins records across N inputs with where: predicates |
| output | Writes records to a file |
| composition | Imports a reusable transform fragment |
Node naming
Every node must have a name: field. Names must be unique within the pipeline and must not contain dots – the dot character is reserved for port syntax (see below). Names are used for wiring, logging, and diagnostics.
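The naming rules are simple enough to state as code. This sketch checks a flat node list for missing, dotted, or duplicate names and shows why the dot has to be reserved: an input reference is split on it to separate node from port. Function names and message wording are hypothetical.

```python
def validate_names(nodes):
    """Return a list of naming errors for a flat `nodes` list."""
    errors, seen = [], set()
    for i, node in enumerate(nodes):
        name = node.get("name")
        if not name:
            errors.append(f"node #{i}: missing name")
        elif "." in name:
            errors.append(f"node '{name}': names must not contain dots")
        elif name in seen:
            errors.append(f"node '{name}': duplicate name")
        else:
            seen.add(name)
    return errors


def split_reference(ref):
    """Split an input reference into (node, port); port is None if absent."""
    node, _, port = ref.partition(".")
    return node, (port or None)
```

If a node could be named "split.high", the reference "split.high" would be ambiguous between that node and the high port of a node called split.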
Wiring: input and inputs
Nodes connect to each other through input: (singular) and inputs: (plural) fields that live at the node’s top level, alongside name: and type:.
Single upstream – used by transform, aggregate, route, and output nodes:
- type: transform
  name: clean
  input: raw_data    # References the source node named "raw_data"
  config: ...
Port syntax – for consuming a specific branch from a route node, use node.port:
- type: output
  name: high_value_out
  input: split.high    # Consumes the "high" branch of route node "split"
  config: ...
Multiple upstreams – merge nodes use inputs: (plural) instead of input::
- type: merge
  name: combined
  inputs:
    - east_processed
    - west_processed
  config: {}
Source nodes have no input field. They are entry points – adding an input: field to a source is a parse error.
Using inputs: on a non-merge node (or input: on a merge node) is caught at parse time by deny_unknown_fields.
Optional fields on all nodes
Every node type supports these optional fields:
- description: – human-readable text for documentation. Ignored by the engine.
- _notes: – arbitrary metadata (JSON object). Ignored by the engine, used by the Kiln IDE for visual annotations and inspector panels.
- type: transform
  name: enrich
  description: "Add customer tier based on lifetime value"
  _notes:
    color: "#4a9eff"
    position: { x: 300, y: 200 }
  input: customers
  config:
    cxl: |
      emit tier = if lifetime_value >= 10000 then "gold" else "standard"
Strict parsing
All config structs use deny_unknown_fields. If you misspell a field name – for example, writing inputt: instead of input: or stratgy: instead of strategy: – the YAML parser rejects it immediately with a diagnostic pointing to the typo. This catches configuration errors before any data processing begins.
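The effect of rejecting unknown fields can be mimicked in a few lines of Python. The allowed set below is just the top-level node fields that appear on this page, treated as complete for the sake of the example, and the "did you mean" hint is an illustrative extra rather than a description of Clinker's diagnostics.

```python
import difflib

# Top-level node fields mentioned on this page; assumed complete here
# purely for illustration.
NODE_FIELDS = {"type", "name", "input", "inputs", "config", "description", "_notes"}


def check_unknown_fields(node, allowed=NODE_FIELDS):
    """Reject any key outside `allowed`, suggesting the closest known field."""
    errors = []
    for key in node:
        if key not in allowed:
            hint = difflib.get_close_matches(key, sorted(allowed), n=1)
            suffix = f" (did you mean '{hint[0]}'?)" if hint else ""
            errors.append(f"unknown field '{key}'{suffix}")
    return errors
```

The alternative, silently ignoring unrecognized keys, would let a misspelled input: produce a node with no upstream and a much more confusing failure later.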
Environment variable: CLINKER_ENV
The CLINKER_ENV environment variable can be used for conditional logic outside of pipelines (e.g., selecting channel directories or controlling CLI behavior). It is not directly referenced within pipeline YAML but is available to the channel and workspace systems.
Source Nodes
Source nodes read data from files and are the entry points of every pipeline. They have no input: field – they produce records, they do not consume them.
Basic structure
- type: source
  name: customers
  config:
    name: customers
    type: csv
    path: "./data/customers.csv"
    schema:
      - { name: customer_id, type: int }
      - { name: name, type: string }
      - { name: email, type: string }
      - { name: status, type: string }
      - { name: amount, type: float }
Schema declaration
The schema: field is required on every source node. Clinker does not infer types from data – you must declare each column’s name and CXL type explicitly. This schema drives compile-time type checking across the entire pipeline.
Each entry is a { name, type } pair:
schema:
  - { name: employee_id, type: string }
  - { name: salary, type: int }
  - { name: hired_at, type: date_time }
  - { name: is_active, type: bool }
  - { name: notes, type: nullable(string) }
Available types
| Type | Description |
|---|---|
| string | UTF-8 text |
| int | 64-bit signed integer |
| float | 64-bit IEEE 754 floating point |
| bool | Boolean (true / false) |
| date | Calendar date |
| date_time | Date with time component |
| array | Ordered sequence of values |
| numeric | Union of int and float – resolved during type unification |
| any | Unknown type – field used in type-agnostic contexts |
| nullable(T) | Nullable wrapper around any inner type (e.g. nullable(int)) |
long_unique — storage hint for high-cardinality text
A string column may carry an optional long_unique: true flag. It is an
advisory, opt-in storage hint, not a type change: it tells the engine the
column’s values are long and effectively unique — never repeated across
records — so it stores them in a leaner, header-free representation that drops
the per-value bookkeeping the default representation keeps for values that
might be shared. Typical candidates are UUIDs rendered as text, street
addresses, and free-text comment or note fields.
schema:
  - { name: ticket_id, type: string, long_unique: true }   # 36-char UUID
  - { name: notes, type: string, long_unique: true }       # free text
  - { name: department, type: string }                     # low-cardinality, default
The flag changes only the in-memory footprint of the annotated column. The
value’s content, its comparison/grouping/join/sort behavior, and the on-disk
encoding when a run spills to disk are all unchanged — a long_unique value
compares and groups identically to the same text in any other column. Omitting
the flag (the common case) leaves the default behavior untouched. Set it only
when you know a column is genuinely high-cardinality free text; on a column
whose values repeat, the default representation is the better choice because it
shares repeated values instead of storing each copy independently.
Transport vs format
A source declaration has two independent layers:
- Transport (transport:) selects where the records come from. The only transport today is file – read bytes from the filesystem, resolved through one of the file matchers (path / glob / regex / paths). transport: is optional and defaults to file, so a source that omits it reads from disk exactly as before.
- Format (type:) selects how the bytes decode into records: csv, json, xml, fixed_width, edifact, x12.
- type: source
  name: orders
  config:
    name: orders
    transport: file   # optional; this is the default
    type: csv         # the on-disk format
    path: "./data/orders.csv"
    schema:
      - { name: order_id, type: int }
A file transport requires exactly one file matcher (path, glob, regex, or paths). Declaring none fails validation with E211; declaring more than one fails with E210. Both are reported at config-load time, before any file is opened.
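The exactly-one-matcher rule can be pictured as a small check over the keys of a source config. The Python sketch below is illustrative only: check_file_matchers is a hypothetical helper, not Clinker’s validator, and it models just the two diagnostics named above.

```python
from typing import Optional

# The four file matchers named in this section.
MATCHERS = ("path", "glob", "regex", "paths")


def check_file_matchers(config: dict) -> Optional[str]:
    """Return a diagnostic code, or None when exactly one matcher is declared."""
    declared = [key for key in MATCHERS if key in config]
    if not declared:
        return "E211"  # no file matcher declared
    if len(declared) > 1:
        return "E210"  # more than one file matcher declared
    return None
```

Because the check looks only at which keys are present, it can run at config-load time, before any file is opened.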
Format types
The type: field inside config: selects the on-disk format. Supported values: csv, json, xml, fixed_width, edifact, x12.
The edifact format reads UN/EDIFACT interchanges; it has its own
reference page covering the segment-record schema, delimiter discovery,
envelope sections, and control-count validation. See
EDIFACT Format. Its one input option is max_elements
(default 32) — the number of positional eNN element columns on the
record schema; a segment with more data elements than that is rejected
rather than truncated.
The x12 format reads ANSI ASC X12 interchanges with their three-tier
ISA..IEA → GS..GE → ST..SE envelope, surfacing all three tiers as
nested $doc sections. It shares the same max_elements input option and
has its own reference page. See X12 Format.
CSV
- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: "./data/orders.csv"
    schema:
      - { name: order_id, type: int }
      - { name: customer_id, type: int }
      - { name: amount, type: float }
      - { name: order_date, type: date }
    options:
      delimiter: ","      # Default: ","
      quote_char: "\""    # Default: "\""
      has_header: true    # Default: true
      encoding: "utf-8"   # Default: "utf-8"
All CSV options are optional. With no options: block, Clinker uses standard RFC 4180 defaults.
JSON
- type: source
  name: events
  config:
    name: events
    type: json
    path: "./data/events.json"
    schema:
      - { name: event_id, type: string }
      - { name: timestamp, type: date_time }
      - { name: payload, type: string }
    options:
      format: ndjson          # array | ndjson | object (auto-detect if omitted)
      record_path: "$.data"   # JSONPath to records array
- array – the file is a single JSON array of objects.
- ndjson – one JSON object per line (newline-delimited JSON).
- object – single top-level object; use record_path to locate the records array within it.
If format is omitted, Clinker auto-detects based on file content.
XML
- type: source
  name: catalog
  config:
    name: catalog
    type: xml
    path: "./data/catalog.xml"
    schema:
      - { name: product_id, type: int }
      - { name: name, type: string }
      - { name: price, type: float }
    options:
      record_path: "//product"    # XPath to record elements
      attribute_prefix: "@"       # Prefix for XML attribute fields
      namespace_handling: strip   # strip | qualify
- strip (default) – removes namespace prefixes from element and attribute names.
- qualify – preserves namespace-qualified names.
Fixed-width
- type: source
  name: legacy_data
  config:
    name: legacy_data
    type: fixed_width
    path: "./data/mainframe.dat"
    schema:
      - { name: account_id, type: string }
      - { name: balance, type: float }
      - { name: status_code, type: string }
    options:
      line_separator: crlf   # Line ending style
Fixed-width sources require a separate format schema (.schema.yaml file) that defines field positions, widths, and padding. The schema: on the source body declares CXL types for compile-time checking; the format schema defines the physical layout.
on_unmapped — undeclared input fields
The per-source on_unmapped policy decides what to do with input fields the source’s schema: block does not name. Three modes — auto_widen (default), drop, reject:
- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: "./data/orders.csv"
    on_unmapped:
      mode: auto_widen   # default; other values: drop, reject
    schema:
      - { name: order_id, type: string }
      - { name: amount, type: numeric }
See Auto-Widen & Schema Drift for the full
specification: the $widened sidecar absorber design, propagation
rules per downstream node type, the include_unmapped Output flag,
E315 merge-policy mismatch, and fixed-width inertness.
Sort order
If your source data is pre-sorted, declare the sort order so the optimizer can use streaming aggregation instead of hash aggregation:
- type: source
  name: sorted_transactions
  config:
    name: sorted_transactions
    type: csv
    path: "./data/transactions_sorted.csv"
    schema:
      - { name: account_id, type: string }
      - { name: txn_date, type: date }
      - { name: amount, type: float }
    sort_order:
      - { field: "account_id", order: asc }
      - { field: "txn_date", order: asc }
Sort order declarations are trusted – Clinker does not verify that the data is actually sorted. If the data violates the declared order, downstream streaming aggregation may produce incorrect results.
The shorthand form is also accepted – a bare string defaults to ascending:
sort_order:
  - "account_id"
  - { field: "txn_date", order: desc }
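Returning to the trust caveat on declared sort order: the Python sketch below shows why a streaming aggregate depends on it. It is a generic illustration built on itertools.groupby, not Clinker’s aggregation code; streaming_sum and the sample rows are made up for the example.

```python
from itertools import groupby


def streaming_sum(records, key_field, value_field):
    """Sum value_field per key while holding only the current group's rows.

    Assumes records arrive grouped by key_field, which is what a declared
    sort order promises. Nothing here verifies that promise.
    """
    for key, group in groupby(records, key=lambda r: r[key_field]):
        yield key, sum(r[value_field] for r in group)


sorted_rows = [
    {"account_id": "A", "amount": 10.0},
    {"account_id": "A", "amount": 5.0},
    {"account_id": "B", "amount": 7.0},
]
# Same rows, but the declared order is violated.
unsorted_rows = [sorted_rows[0], sorted_rows[2], sorted_rows[1]]

sorted_totals = list(streaming_sum(sorted_rows, "account_id", "amount"))
unsorted_totals = list(streaming_sum(unsorted_rows, "account_id", "amount"))
```

With sorted input each account yields one total. With the violated order, account A is emitted twice with partial sums, which is the kind of incorrect result the warning describes.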
Watermarks
An event-time watermark declares which column on the source carries
each record’s event time — the wall-clock instant the event
happened, distinct from when Clinker read the row. When set, the
engine reads the column on every record, subtracts the source’s
delay, and folds the result into a per-source monotonic watermark.
The delay-corrected value is also stamped on every record as
$source.event_time, the column a
downstream time-windowed aggregate
uses to assign records to windows.
- type: source
  name: clicks
  config:
    name: clicks
    type: csv
    path: "./data/clicks.csv"
    options:
      has_header: true
    watermark:
      column: event_ts    # must be date_time or date
      delay: 5s           # bounded out-of-order tolerance
      idle_timeout: 30s   # flip partitions to idle if quiet
    schema:
      - { name: user_id, type: string }
      - { name: event_ts, type: date_time }
      - { name: amount, type: int }
Fields:
- column (required) – the schema column whose value is each record’s event time. The column’s declared type must be date_time or date. A column: that names a field absent from schema: raises E154; a column: whose declared type is neither raises E155.
- delay (optional duration, default unset) – bounded out-of-order tolerance. Each record’s event time is shifted earlier by delay before being folded into the watermark, so the source’s effective watermark trails its observed max event time by this amount. Mirrors Flink’s BoundedOutOfOrdernessWatermarks. Without delay, the watermark advances strictly to the observed max, so a single late record routes to the DLQ.
- idle_timeout (optional duration, default unset) – if a live source’s receiver stays quiet longer than this, its partitions flip to idle and stop holding back min_across_sources. Lets downstream windows keep closing when one source pauses. None means never go idle, preserving the prior behaviour for pipelines without a window-close consumer.
Durations use the suffixes ms, s, m, h, d. ms is
matched before the single-character s, so 500ms reads as 500
milliseconds, not 500 seconds with a stray m.
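The suffix-ordering point can be made concrete with a short Python sketch. It is an assumption-laden illustration, not Clinker’s parser: parse_duration is a hypothetical name, and it accepts only a whole number followed by one suffix.

```python
from datetime import timedelta

# "ms" is checked before "s" and "m" so the two-character suffix wins.
_SUFFIXES = (
    ("ms", "milliseconds"),
    ("s", "seconds"),
    ("m", "minutes"),
    ("h", "hours"),
    ("d", "days"),
)


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '500ms', '5s', or '1d' into a timedelta."""
    text = text.strip()
    for suffix, unit in _SUFFIXES:
        if text.endswith(suffix):
            digits = text[: -len(suffix)]
            if not (digits.isascii() and digits.isdigit()):
                raise ValueError(f"invalid duration: {text!r}")
            return timedelta(**{unit: int(digits)})
    raise ValueError(f"invalid duration: {text!r}")
```

If the single-character s were tried first, 500ms would be split into the number 500m and the suffix s, and the parse would fail on the stray m.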
A pipeline whose aggregate declares time_window: must have a
watermark.column on every upstream-reachable source. Without it,
min_across_sources over the source set stays at None and the
window can never close — the planner rejects this with
E156.
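The watermark arithmetic in this section can be sketched in a few lines of Python. This is a simplified model under stated assumptions: SourceWatermark and the min_across_sources function below are illustrative names, idle handling is left out, and a source that has not yet produced a watermark is represented as None.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


class SourceWatermark:
    """Per-source monotonic watermark over delay-corrected event times."""

    def __init__(self, delay: timedelta = timedelta(0)):
        self.delay = delay
        self.current: Optional[datetime] = None

    def observe(self, event_time: datetime) -> datetime:
        """Fold one record in and return its delay-corrected event time."""
        corrected = event_time - self.delay
        if self.current is None or corrected > self.current:
            self.current = corrected  # the watermark never moves backwards
        return corrected


def min_across_sources(sources: Iterable[SourceWatermark]) -> Optional[datetime]:
    """Stay at None until every source has produced a watermark."""
    values = [source.current for source in sources]
    if not values or any(value is None for value in values):
        return None
    return min(values)
```

In this model a source without a watermark keeps the minimum at None indefinitely, which mirrors why the planner insists on a watermark column for every source feeding a time-windowed aggregate.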
Array paths
For nested data (JSON/XML sources with embedded arrays), array_paths controls how nested arrays are handled:
- type: source
  name: invoices
  config:
    name: invoices
    type: json
    path: "./data/invoices.json"
    schema:
      - { name: invoice_id, type: int }
      - { name: customer, type: string }
      - { name: line_item, type: string }
      - { name: line_amount, type: float }
    array_paths:
      - path: "$.line_items"
        mode: explode     # One output record per array element
      - path: "$.tags"
        mode: join        # Concatenate array elements into a string
        separator: ","
- explode (default) – produces one output record per array element, with parent fields repeated.
- join – concatenates array elements into a single string using the specified separator.
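The difference between the two array modes is easiest to see on a single record. The Python sketch below is only a model of the behaviour described above: the function names, the handling of an empty array, and the way the output field is named are assumptions, not Clinker’s implementation.

```python
def explode(record: dict, field: str) -> list:
    """One output record per array element, with parent fields repeated."""
    parent = {key: value for key, value in record.items() if key != field}
    return [{**parent, field: item} for item in record.get(field, [])]


def join(record: dict, field: str, separator: str = ",") -> dict:
    """Concatenate the array elements into a single string."""
    joined = separator.join(str(item) for item in record.get(field, []))
    return {**record, field: joined}


invoice = {"invoice_id": 1, "line_items": ["widget", "gadget"], "tags": ["a", "b"]}
exploded = explode(invoice, "line_items")   # two records, invoice_id repeated
joined = join(invoice, "tags")              # one record, tags collapsed to "a,b"
```

Explode changes the record count, while join keeps one record per input and changes only the field’s value.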
EDIFACT Format
Clinker reads and writes UN/EDIFACT interchanges alongside CSV, JSON,
XML, and fixed-width. An interchange is a finite file: it opens with an
optional UNA service-string advice and a mandatory UNB header, wraps
one or more UNH..UNT messages, and closes with a UNZ trailer. The
reader streams one segment at a time and the writer reconstructs the
envelope around emitted records. The reader decodes release-escape
sequences into clean data values and the writer re-escapes them on
output, so a reader → writer → reader round-trip preserves the data
values and the envelope control references.
Delimiters and the UNA service string
Each segment is terminated by the segment terminator; within a segment, data elements split on the element separator and components on the component separator. A release character escapes a delimiter that occurs as literal data.
When the file begins with a 9-byte UNA prefix, its six service
characters override the defaults in this fixed order: component,
element, decimal, release, repetition, terminator. When UNA is absent,
the syntax Level-A defaults apply:
| Role | Level-A default |
|---|---|
| Component separator | : |
| Element separator | + |
| Decimal notation | . |
| Release / escape | ? |
| Repetition | space (inactive) |
| Segment terminator | ' |
UNA is optional — a parser that requires it would fail on the common
no-UNA interchange, so Clinker assumes Level-A when it is absent.
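Delimiter discovery as described here amounts to reading six characters in a fixed order when the UNA prefix is present and falling back to the Level-A set otherwise. The Python sketch below works on text rather than raw bytes and uses hypothetical names (Delimiters, discover_delimiters); it is not the reader’s actual code.

```python
from typing import NamedTuple, Tuple


class Delimiters(NamedTuple):
    component: str
    element: str
    decimal: str
    release: str
    repetition: str
    terminator: str


# Level-A defaults, used when no UNA service string is present.
LEVEL_A = Delimiters(":", "+", ".", "?", " ", "'")


def discover_delimiters(data: str) -> Tuple[Delimiters, str]:
    """Return the delimiter set and the remainder of the interchange."""
    if data.startswith("UNA") and len(data) >= 9:
        # Six service characters follow the tag in a fixed order.
        return Delimiters(*data[3:9]), data[9:]
    return LEVEL_A, data
```

Returning the remainder lets the caller start segment parsing at UNB whether or not the prefix was present.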
Release character
The release character (default ?) marks the following byte as literal
data rather than a delimiter: ?+ is a literal + inside an element,
?' is a literal apostrophe (not a terminator), and ?? is a literal
?. The reader decodes these sequences into clean data values, so a
downstream CSV/JSON sink, a CXL string comparison, or a $doc field sees
O'BRIEN, never the wire form O?'BRIEN. The writer re-escapes on
output: any element value that carries the element separator, the segment
terminator, or the release character is release-escaped automatically, so
a value computed by a Transform or sourced from CSV — never
EDIFACT-escaped to begin with — does not corrupt the interchange. A
reader → writer → reader round-trip therefore preserves the data values
exactly.
The component separator inside an element (e.g. the : in the composite
UNOA:1) is kept as part of the element’s text and is not escaped — the
positional element model works above component resolution, so a composite
element round-trips unchanged. A literal colon in free-text data is the
one ambiguity this introduces: because components are not split into
separate fields, a : in a value re-reads as a component boundary.
Repeating elements ride inside one element string intact and are likewise
never truncated to their first repetition.
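The release-character handling in this section can be sketched as a pair of functions: one that splits a segment into decoded elements, and one that re-escapes a value on output. This is a minimal Python illustration with made-up names (split_elements, escape_element) and the Level-A delimiters as defaults. It leaves the component separator alone, matching the positional element model described above.

```python
def split_elements(segment: str, element_sep: str = "+", release: str = "?") -> list:
    """Split one segment body into elements, decoding release sequences."""
    elements, current, i = [], [], 0
    while i < len(segment):
        ch = segment[i]
        if ch == release and i + 1 < len(segment):
            current.append(segment[i + 1])  # the next character is literal data
            i += 2
        elif ch == element_sep:
            elements.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    elements.append("".join(current))
    return elements


def escape_element(value: str, element_sep: str = "+",
                   terminator: str = "'", release: str = "?") -> str:
    """Release-escape the characters that would otherwise act as delimiters."""
    special = {element_sep, terminator, release}
    return "".join(release + ch if ch in special else ch for ch in value)
```

Feeding the output of escape_element back through split_elements yields the original value as a single element, which is the round-trip property the section relies on.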
Newlines between segments
Some producers insert CR/LF after each segment terminator for readability. Those bytes are insignificant and are stripped between segments; CR/LF that appears inside an element is preserved.
Record shape
Each non-service segment becomes one record under a fixed positional schema:
| Column | Meaning |
|---|---|
| seg_id | The segment tag (BGM, NAD, …) |
| msg_ref | The enclosing message reference (the UNH element 1) |
| msg_type | The message type (the UNH element 2, full composite) |
| e01, e02, … | The segment’s positional data elements (release sequences decoded) |
Service segments (UNB, UNZ, UNT) are consumed by the reader
to drive envelope state and validation and are never emitted as body
records. The UNH segment that opens a message is emitted as a body
record (its seg_id is UNH), carrying the message reference and type.
The number of eNN columns is controlled by the source max_elements
option (default 32). A segment carrying more data elements than that is
rejected with guidance rather than silently truncated. Absent trailing
elements read as null.
nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: edifact
      glob: ./inbox/*.edi
      options:
        max_elements: 48   # widen the positional schema for exotic segments
      schema:
        - { name: seg_id, type: string }
        - { name: msg_ref, type: string }
        - { name: e01, type: string }
Envelope sections over UNB
The interchange header UNB is extractable as a document envelope
section, exposing its positional elements to CXL as
$doc.<section>.<field>. Use the segment extract rule with the section
field names matching the positional keys e01, e02, …:
envelope:
  sections:
    interchange:
      extract: { segment: "UNB" }
      fields:
        e05: string   # interchange control reference (UNB element 5)
A Transform can then read $doc.interchange.e05 on every body record.
Only the UNB header is extractable as an envelope section. Trailer
segments (UNT, UNZ) arrive after the body and cannot become $doc
fields without buffering the whole interchange — their control counts
are instead validated inline by the reader (see below). A segment
extract naming any tag other than UNB, or an xml_path / json_pointer
extract against an EDIFACT source, is rejected at startup.
Control-count validation
The reader validates the structural integrity claims carried in the trailers as they arrive, failing the run on a mismatch (a truncation or corruption signal):
- UNT segment count – must equal the actual number of segments in the message, counting the UNH and UNT themselves.
- UNT message reference – must echo the opening UNH reference.
- UNZ message count – must equal the actual number of UNH messages in the interchange.
- UNZ control reference – must echo the UNB control reference.
The UNB control reference (data element 0020) is located by its
structural position — the first data element after the four mandatory
leading composites (syntax identifier, sender, recipient, date/time) —
rather than at a fixed element index. An interchange that carries an
empty optional element ahead of the control reference (shifting it past
the fifth position) therefore validates and round-trips correctly: the
reader reads the real reference and the writer echoes the same one into
UNZ, so the trailer never contradicts its own header.
A missing UNZ at end of input is a truncation error; content after the
UNZ trailer is rejected.
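The message-level half of these checks fits in a few lines. The Python sketch below validates one UNH..UNT message given as a list of already-split segments; validate_message and the sample message are illustrative, and the real reader performs the checks incrementally as trailers arrive rather than on a buffered list.

```python
def validate_message(segments: list) -> None:
    """Check the UNT claims for one message; raise ValueError on a mismatch.

    Each segment is a list of elements whose first entry is the segment tag.
    """
    unh, unt = segments[0], segments[-1]
    if unh[0] != "UNH" or unt[0] != "UNT":
        raise ValueError("message must open with UNH and close with UNT")
    declared_count, trailer_ref = int(unt[1]), unt[2]
    if declared_count != len(segments):  # the count includes UNH and UNT
        raise ValueError(
            f"UNT declares {declared_count} segments, found {len(segments)}"
        )
    if trailer_ref != unh[1]:
        raise ValueError("UNT message reference does not echo UNH")


message = [
    ["UNH", "1", "ORDERS:D:96A:UN"],
    ["BGM", "220", "PO1"],
    ["UNT", "3", "1"],
]
validate_message(message)  # passes: three segments, reference "1" echoed
```

The interchange-level checks on UNZ follow the same pattern, with a message counter in place of the segment counter.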
Writing EDIFACT
An EDIFACT Output node reconstructs the envelope around emitted records.
Records map by the same positional columns (seg_id, msg_ref,
msg_type, eNN); trailing null/empty elements are trimmed so no
fabricated delimiters appear, and a column the writer does not recognize
is an error (project the record to the EDIFACT columns first).
Engine-internal $-namespaced columns are excluded automatically.
nodes:
  - type: output
    name: out
    input: messages
    config:
      name: out
      type: edifact
      path: ./out/result.edi
      options:
        interchange: ["UNOA:1", "SENDER", "RECEIVER", "240101:1200", "REF1"]
        message_type: "ORDERS:D:96A:UN"
        write_una: false
        segment_newline: true
Output options:
| Option | Meaning |
|---|---|
| interchange | Literal UNB data elements (release-escaped as needed on write). |
| interchange_from_doc | Name of a $doc section to echo the UNB elements from (round-trip). |
| message_type | Fallback UNH message type when a record carries no msg_type value. |
| write_una | Emit a leading UNA segment (default false). |
| segment_newline | Write a newline after each segment terminator (default true). |
Consecutive records are grouped into UNH..UNT messages on msg_ref
transitions. The writer recomputes the UNT segment count and UNZ
message count, and echoes the message and interchange control references,
so the output passes its own count validation on re-read.
interchange_from_doc echoes the header from a record’s document
context. That context is populated by a source’s UNB envelope section
(declare a segment: "UNB" envelope section on the source) and travels
with every body record through the pipeline — including to a sink that
sits directly downstream of the source with no intervening Transform. The
reader stashes the complete, ordered UNB element list (empty middle
elements included), so the reconstructed header is faithful even when a
middle element is empty and the user declares only the fields they care
about. Supply interchange literal elements instead when the records
have no source UNB section to echo.
Limitations
- Charset. Element text is decoded as UTF-8. Non-UTF-8 interchanges (UNOA/UNOB/Latin-1 high bytes) are rejected explicitly rather than silently corrupted.
- Functional groups. A single UNB..UNZ interchange is supported; UNG/UNE functional-group segments are rejected with a precise error.
- UNH composite fidelity. The reader stamps the UNH reference (element 1) and the full message-type composite (element 2). A UNH carrying additional elements (e.g. a common access reference) is reconstructed as a two-element UNH on round-trip.
- Output splitting. An interchange is a single UNB..UNZ envelope and cannot be divided across files. An edifact output combined with a split: block is rejected at config-validation time (diagnostic E323) rather than emitting a structurally corrupt interchange.
X12 Format
Clinker reads and writes ANSI ASC X12 interchanges alongside CSV, JSON,
XML, fixed-width, and EDIFACT. An X12 interchange is a finite file with a
three-tier envelope: an ISA..IEA interchange wraps one or more GS..GE
functional groups, and each functional group wraps one or more ST..SE
transaction sets. The reader streams one segment at a time and the writer
reconstructs the three envelope tiers around emitted records.
The three tiers surface as nested document-context levels: the ISA
interchange becomes the file-level $doc document, and each GS group and
ST set opens a nested level whose $doc sections layer over the
enclosing tiers. A body record therefore sees every enclosing tier’s
fields through one $doc.<section>.<field> lookup.
Delimiters and the ISA header
Unlike EDIFACT’s optional UNA service-string advice, X12 declares its
delimiters in a fixed-length 106-byte ISA header. Three delimiter bytes
live at structural positions within it:
| Role | Source in the ISA |
|---|---|
| Element (data) separator | The byte immediately after the ISA tag |
| Sub-element (component) sep. | ISA16, the last single-byte ISA element |
| Segment terminator | The byte immediately after ISA16 |
The reader reads these three bytes from the header rather than assuming a
fixed delimiter set, so an interchange that uses */:/~,
|/^/newline, or any other producer-chosen delimiters parses correctly.
The ISA13 interchange control number is located as the 13th element of
the header split on the discovered element separator — structurally, not by
an absolute byte offset — so producer padding quirks do not misalign it.
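A rough Python sketch of this discovery step follows. It assumes the standard ISA element widths and works on text rather than raw bytes. Both build_isa and discover_delimiters are hypothetical helpers written for the example, not part of Clinker.

```python
# Standard widths of ISA01..ISA16.
ISA_WIDTHS = (2, 10, 2, 10, 2, 15, 2, 15, 6, 4, 1, 5, 9, 1, 1, 1)


def build_isa(values, element_sep="*", terminator="~"):
    """Assemble a fixed-width ISA header for the demonstration below."""
    body = "".join(element_sep + v.ljust(w) for v, w in zip(values, ISA_WIDTHS))
    return "ISA" + body + terminator


def discover_delimiters(header: str) -> dict:
    """Read the three delimiters and ISA13 from a 106-character ISA header."""
    if not header.startswith("ISA") or len(header) < 106:
        raise ValueError("expected a 106-character ISA header")
    element_sep = header[3]                      # character right after the tag
    fields = header[:106].split(element_sep)     # fields[0] is "ISA"
    if len(fields) != 17 or len(fields[16]) != 2:
        raise ValueError("malformed ISA header")
    component_sep, terminator = fields[16]       # ISA16, then the terminator
    return {
        "element": element_sep,
        "component": component_sep,
        "terminator": terminator,
        "control_number": fields[13],            # ISA13, found by splitting
    }


values = ["00", "", "00", "", "ZZ", "SENDER", "ZZ", "RECEIVER",
          "240101", "1200", "U", "00401", "000000001", "0", "P", ":"]
star = discover_delimiters(build_isa(values))
pipe = discover_delimiters(
    build_isa(values[:15] + ["^"], element_sep="|", terminator="\n")
)
```

Both headers parse with the same code path because nothing in it assumes a particular delimiter character.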
No escape character
X12 has no release/escape character (EDIFACT’s ? has no X12
equivalent). A data value that contains a delimiter byte is therefore
unrepresentable. On output the writer rejects any element value carrying
the element separator or the segment terminator with a precise error rather
than silently corrupting the interchange; re-encode the value or choose
delimiters the data does not contain.
The sub-element (component) separator inside an element (e.g. the : in a
composite A:B:C) is kept as part of the element’s text and is not split —
the positional element model works above component resolution, so a
composite element round-trips unchanged.
Newlines between segments
Some producers insert CR/LF after each segment terminator for readability. Those bytes are insignificant and are stripped between segments; CR/LF that appears inside an element is preserved.
Record shape
Each non-service segment becomes one record under a fixed positional schema:
| Column | Meaning |
|---|---|
| seg_id | The segment tag (BEG, PO1, …) |
| set_ref | The enclosing transaction set control number (ST02) |
| set_type | The transaction set identifier code (ST01, e.g. 850) |
| e01, e02, … | The segment’s positional data elements |
Service segments (ISA, IEA, GS, GE, SE) are consumed by the
reader to drive the envelope and validation — they are never emitted as
body records. The ST segment that opens a transaction set is emitted
as a body record (its seg_id is ST), carrying the set reference and
type.
The number of eNN columns is controlled by the source max_elements
option (default 32). A segment carrying more data elements than that is
rejected with guidance rather than silently truncated. Absent trailing
elements read as null.
nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: x12
      glob: ./inbox/*.x12
      options:
        max_elements: 48   # widen the positional schema for exotic segments
      schema:
        - { name: seg_id, type: string }
        - { name: set_ref, type: string }
        - { name: e01, type: string }
Envelope sections over the three tiers
The interchange header ISA is extractable as a file-level document
envelope section, exposing its positional elements to CXL as
$doc.<section>.<field>. Use the segment extract rule with the section
field names matching the positional keys e01, e02, …:
envelope:
  sections:
    interchange:
      extract: { segment: "ISA" }
      fields:
        e13: string   # interchange control number (ISA13)
The GS functional group and the ST transaction set surface
automatically as the nested $doc sections functional_group and
transaction_set, each keyed by positional eNN elements — no envelope
declaration is needed for them. A Transform on any body record can read
all three tiers at once:
emit isa13 = $doc.interchange.e13 # interchange control number
emit gs06 = $doc.functional_group.e06 # group control number (GS06)
emit st02 = $doc.transaction_set.e02 # set control number (ST02)
Only the ISA header is extractable as a declared envelope section.
Trailer segments (SE, GE, IEA) arrive after the body they close and
cannot become $doc fields without buffering the whole interchange — their
control counts are instead validated inline by the reader (see below). A
segment extract naming any tag other than ISA, or an xml_path /
json_pointer extract against an X12 source, is rejected at startup.
Control-count validation
The reader validates the structural integrity claims carried in the trailers as they arrive, failing the run on a mismatch (a truncation or corruption signal):
- SE segment count (SE01) – must equal the number of segments in the transaction set, counting the ST and SE themselves.
- SE set control number (SE02) – must echo the opening ST02.
- GE transaction-set count (GE01) – must equal the number of ST sets in the functional group.
- GE group control number (GE02) – must echo the GS06.
- IEA functional-group count (IEA01) – must equal the number of GS groups in the interchange.
- IEA control number (IEA02) – must echo the ISA13.
A missing IEA at end of input is a truncation error; content after the
IEA trailer is rejected.
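Because the three tiers nest, the checks above reduce to three counters and three remembered control numbers. The Python sketch below walks a list of already-split segments and raises on the first mismatch. It is a simplified model with a hypothetical name (validate_x12), not the reader’s streaming implementation.

```python
def validate_x12(segments: list) -> None:
    """Check SE, GE, and IEA claims; each segment is [tag, e01, e02, ...]."""
    isa13 = gs06 = st02 = None
    groups = sets = set_segments = 0
    closed = False
    for seg in segments:
        tag = seg[0]
        if closed:
            raise ValueError("content after the IEA trailer")
        if tag == "ISA":
            isa13, groups = seg[13], 0
        elif tag == "GS":
            gs06, sets = seg[6], 0
            groups += 1
        elif tag == "ST":
            st02, set_segments = seg[2], 0
            sets += 1
        if st02 is not None:
            set_segments += 1  # counts ST, the body segments, and SE
        if tag == "SE":
            if int(seg[1]) != set_segments or seg[2] != st02:
                raise ValueError("SE does not match its transaction set")
            st02 = None
        elif tag == "GE":
            if int(seg[1]) != sets or seg[2] != gs06:
                raise ValueError("GE does not match its functional group")
        elif tag == "IEA":
            if int(seg[1]) != groups or seg[2] != isa13:
                raise ValueError("IEA does not match its interchange")
            closed = True
    if not closed:
        raise ValueError("missing IEA: input is truncated")


interchange = [
    ["ISA", "00", "", "00", "", "ZZ", "SENDER", "ZZ", "RECEIVER",
     "240101", "1200", "U", "00401", "000000001", "0", "P", ":"],
    ["GS", "PO", "SENDER", "RECEIVER", "20240101", "1200", "1", "X", "004010"],
    ["ST", "850", "0001"],
    ["BEG", "00", "NE", "PO1"],
    ["SE", "3", "0001"],
    ["GE", "1", "1"],
    ["IEA", "1", "000000001"],
]
validate_x12(interchange)  # passes: every trailer agrees with its header
```

Each counter resets when its tier opens, so any number of groups and sets can be checked without buffering the interchange.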
Writing X12
An X12 Output node reconstructs the three-tier envelope around emitted
records. Records map by the same positional columns (seg_id, set_ref,
set_type, eNN); trailing null/empty elements are trimmed so no
fabricated delimiters appear, and a column the writer does not recognize is
an error (project the record to the X12 columns first). Engine-internal
$-namespaced columns are excluded automatically.
nodes:
  - type: output
    name: out
    input: messages
    config:
      name: out
      type: x12
      path: ./out/result.x12
      options:
        interchange:
          ["00", "          ", "00", "          ", "ZZ", "SENDER         ",
           "ZZ", "RECEIVER       ", "240101", "1200", "U", "00401",
           "000000001", "0", "P", ":"]
        group_header: ["PO", "SENDER", "RECEIVER", "20240101", "1200", "1", "X", "004010"]
        set_type: "850"
        segment_newline: true
Output options:
| Option | Meaning |
|---|---|
| interchange | Literal ISA data elements (the 16 fixed-width ISA fields). |
| interchange_from_doc | Name of a $doc section to echo the ISA elements from (round-trip). |
| group_header | Literal GS01..GS08 elements (GS06 control number recomputed). |
| set_type | Fallback ST01 set type when a record carries no set_type value. |
| segment_newline | Write a newline after each segment terminator (default true). |
Consecutive records are grouped into ST..SE transaction sets on set_ref
transitions, and all sets are wrapped in a single GS..GE functional
group. The writer recomputes the SE segment count, the GE
transaction-set count, and the IEA functional-group count, and echoes the
set, group, and interchange control numbers, so the output passes its own
count validation on re-read.
interchange_from_doc echoes the header from a record’s document context.
That context is populated by a source’s ISA envelope section (declare a
segment: "ISA" envelope section on the source) and travels with every
body record through the pipeline — including to a sink that sits directly
downstream of the source with no intervening Transform. The reader stashes
the complete, ordered ISA element list, so the reconstructed header is
faithful. Supply interchange literal elements instead when the records
have no source ISA section to echo.
Limitations
- Charset. Element text is decoded as UTF-8. Non-UTF-8 interchanges are rejected explicitly rather than silently corrupted.
- No escape character. X12 has no release mechanism, so a data value that contains a delimiter byte is rejected on output rather than corrupting the interchange.
- One functional group on output. The writer wraps all transaction sets in a single GS..GE functional group; the reader handles any number of groups on input. A multi-group output shape requires multiple runs.
- Output splitting. An interchange is a single ISA..IEA envelope and cannot be divided across files. An x12 output combined with a split: block is rejected at config-validation time (diagnostic E338) rather than emitting a structurally corrupt interchange.
Network Sources (REST)
A Source reads from the filesystem by default. To pull records from a
network endpoint instead, declare a transport: block on the Source. The
transport selects where records come from; it sits above the on-disk
type: (the format), which for a REST source still selects how the
response bodies decode.
A network transport is a finite-pull source: it runs on its own thread, drives a synchronous client to cursor exhaustion, then exits. There is no daemon, no event loop, and no async runtime – the same single-process, run-to-drain model as a file pipeline. Finiteness is a hard property of the reader: a REST source caps its pull with an explicit page/record limit, so an unbounded endpoint cannot keep it running forever.
A network source still requires a schema: block. That authored
schema is the row-to-record target: the reader maps each decoded object
onto it, coercing values leniently. A per-row value that cannot coerce is
left unchanged at the reader and routed to the dead-letter queue at the
Transform stage — identical to file-source semantics. A network source
declares no file matcher (path / glob / regex / paths);
declaring one is a configuration error (E219).
Because a network source has no file path, its $source.file provenance
column and the {source_file} output template both resolve to a stable
synthetic identifier, <source:NAME>, where NAME is the Source node’s
name.
REST sources
A rest source issues paginated HTTP GETs against a base URL, decoding
each response body through the declared json or xml format. (Other
formats are rejected with E220 — a REST body is a multi-record
document, not a flat CSV/fixed-width stream.)
nodes:
  - type: source
    name: orders_api
    config:
      name: orders_api
      type: json
      options:
        format: array   # each page body is a JSON array of objects
      transport:
        kind: rest
        url: https://api.example.com/v1/orders
        max_pages: 50   # HARD page cap (required)
        pagination:
          strategy: link_header
        auth:
          scheme: bearer
          token: "${ORDERS_TOKEN}"
      schema:
        - { name: order_id, type: int }
        - { name: total, type: float }
        - { name: placed_at, type: date_time }
Pagination strategies
The pagination.strategy selects how the reader advances pages and
detects the last one. Whatever the strategy, the pull always stops at the
max_pages / max_records cap, even when the server keeps offering more.
- none (default) – a single GET; the body is the whole result.
- offset – ?offset=N&limit=L, advancing the offset by the page size each request. The last page is the one that returns fewer rows than limit.
  pagination:
    strategy: offset
    limit: 200
    offset_param: offset   # optional, defaults shown
    limit_param: limit
- cursor_token – the reader reads a continuation token from a JSON pointer in each response and sends it back on the next request. Paging stops when the token field is absent or null.
  pagination:
    strategy: cursor_token
    cursor_param: page_token
    next_token_pointer: /meta/next_page   # RFC 6901 JSON pointer
- link_header – the reader follows the URL in the response’s RFC 5988 Link: <…>; rel="next" header until no such link is present.
  pagination:
    strategy: link_header
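To show how a strategy’s own stop condition combines with the hard cap, here is a Python sketch of the offset strategy. It is a model only: pull_offset is a hypothetical name, and fetch_page stands in for the HTTP GET that the real reader would issue.

```python
from typing import Callable, Iterator, List


def pull_offset(fetch_page: Callable[[int, int], List[dict]],
                limit: int, max_pages: int) -> Iterator[dict]:
    """Yield rows page by page until a short page or the max_pages cap."""
    offset = 0
    for _ in range(max_pages):      # hard cap, whatever the server offers
        rows = fetch_page(offset, limit)
        yield from rows
        if len(rows) < limit:       # a short page is the last page
            return
        offset += limit


# A finite fake endpoint with five rows, and one that never runs dry.
data = [{"id": i} for i in range(5)]
finite = lambda offset, limit: data[offset:offset + limit]
endless = lambda offset, limit: [{"id": offset + i} for i in range(limit)]

all_rows = list(pull_offset(finite, limit=2, max_pages=50))
capped_rows = list(pull_offset(endless, limit=2, max_pages=3))
```

The finite endpoint stops on its short third page. The endless one stops only because the loop is bounded by max_pages, which is what keeps the pull finite.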
Authentication
auth.scheme selects the credential sent on every request:
- none (default) – no auth header.
- bearer – sends Authorization: Bearer <token>.
- header – sends an arbitrary static header, e.g. an API key.
  auth:
    scheme: header
    name: X-API-Key
    value: "${API_KEY}"
Reliability and finiteness knobs
| Key | Default | Meaning |
|---|---|---|
| max_pages | – | Required. Hard ceiling on pages fetched, regardless of the server. |
| max_records | none | Optional hard ceiling on records emitted. |
| retries | 3 | Bounded retries on a transient failure (5xx, connect/timeout error). A 4xx is fatal – retrying cannot help. |
| timeout_secs | 30 | Per-request timeout. Bounds in-flight time so an interrupt lands within the shutdown window. |
A partial-page decode failure routes that page’s offending rows to the DLQ per-row, exactly like a file source; it does not abort the pull.
Shutdown
On SIGINT/SIGTERM the reader polls its cancellation handle at each
page boundary and stops cleanly with a normal end-of-input — the same
graceful drain a file source performs. The timeout_secs per-request
bound caps how long a single in-flight request can delay that stop.
Auto-Widen & Schema Drift
When an input file carries columns the source’s declared schema:
block does not name, Clinker decides what to do with them via the
per-source on_unmapped policy. The engine-wide default is
auto_widen — schema drift is preserved end-to-end without
user-visible breakage. This chapter is the single source of truth
for the absorber design, propagation rules, output controls, and
diagnostics.
The three modes
- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: "./data/orders.csv"
    on_unmapped:
      mode: auto_widen   # default; other values: drop, reject
    schema:
      - { name: order_id, type: string }
      - { name: amount, type: numeric }
- auto_widen (default) – per-record undeclared fields are absorbed into a Value::Map payload carried by an engine-stamped $widened sidecar column appended to the source’s schema. The sidecar’s payload propagates through downstream nodes and the sink expands it back to top-level columns whenever include_unmapped: true is set on the Output node (the default). Pattern precedent: Databricks Auto Loader’s _rescued_data sidecar and ClickHouse’s JSON column type.
- drop – undeclared input fields are silently stripped at read time. No sidecar; the source’s plan-time schema equals the declared schema:. Matches Snowflake’s MATCH_BY_COLUMN_NAME='CASE_INSENSITIVE' with ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE and dbt’s on_schema_change=ignore.
- reject – any input record carrying a key not in the declared schema fails the source with a FormatError::UndeclaredField diagnostic naming the offending field. Strict; matches dlt’s freeze mode.
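The three modes differ only in what happens to the fields that fall outside the declared schema. The Python sketch below models that split on a single row. The function name, the use of ValueError, and the empty map for a row with no extras are assumptions made for illustration, not Clinker’s behaviour.

```python
def apply_on_unmapped(row: dict, declared: list, mode: str = "auto_widen") -> dict:
    """Split a row into declared and undeclared fields, then apply the mode."""
    known = {key: value for key, value in row.items() if key in declared}
    extra = {key: value for key, value in row.items() if key not in declared}
    if mode == "drop":
        return known                          # undeclared fields are stripped
    if mode == "reject":
        if extra:
            raise ValueError(f"undeclared field: {sorted(extra)[0]}")
        return known
    if mode == "auto_widen":
        return {**known, "$widened": extra}   # undeclared fields ride in the sidecar
    raise ValueError(f"unknown mode: {mode}")


row = {"order_id": "A1", "amount": 9.5, "coupon": "SPRING"}
widened = apply_on_unmapped(row, ["order_id", "amount"])
dropped = apply_on_unmapped(row, ["order_id", "amount"], mode="drop")
```

Only auto_widen keeps the undeclared coupon value available to the sink; drop discards it and reject stops the source.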
The $widened sidecar absorber
auto_widen is implemented as an on-schema sidecar: the
engine appends a single column named $widened to the source’s
schema, marked with FieldMetadata::WidenedSidecar. Each record’s
undeclared input fields are stored as the sidecar’s
Value::Map payload — keyed by input field name, valued by the
read scalar.
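As a rough illustration of the three policies, the sketch below models a record as a plain Python dict. The function name apply_on_unmapped, the UndeclaredField exception class, and the use of an empty dict when a record has no extras are assumptions made for this example; they are not Clinker's internal API.

```python
from typing import Any

WIDENED = "$widened"  # sidecar column name used in the text


class UndeclaredField(Exception):
    """Hypothetical stand-in for FormatError::UndeclaredField."""


def apply_on_unmapped(record: dict[str, Any], declared: list[str],
                      mode: str = "auto_widen") -> dict[str, Any]:
    """Split one input record into declared columns plus an optional sidecar map."""
    extras = {k: v for k, v in record.items() if k not in declared}
    out = {name: record.get(name) for name in declared}
    if mode == "auto_widen":
        # Undeclared fields ride along in a single map-valued sidecar column.
        out[WIDENED] = extras
    elif mode == "drop":
        # Undeclared fields are stripped; the result matches the declared schema.
        pass
    elif mode == "reject":
        if extras:
            # Name one offending field, as the diagnostic in the text does.
            raise UndeclaredField(sorted(extras)[0])
    else:
        raise ValueError(f"unknown on_unmapped mode: {mode}")
    return out


row = {"order_id": "A1", "amount": "9.50", "coupon": "SPRING"}
declared = ["order_id", "amount"]
widened = apply_on_unmapped(row, declared)          # sidecar holds {"coupon": "SPRING"}
dropped = apply_on_unmapped(row, declared, "drop")  # coupon is gone, no sidecar
```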
The on-schema design is deliberate. An off-schema sidecar (a
parallel data structure outside Schema) is a silent-loss bug
class: any code path that reconstructs a Record from
schema.columns() and a value vector silently drops the
side-channel. The on-schema slot inherits the same
serialization, span propagation, sort/spill, and projection
machinery as user-declared columns — there is no “remember to
copy the sidecar” obligation on every consumer. CXL expressions
cannot read or write the sidecar (the typechecker is blind to
its contents); see System variables → $widened
for the parser-level rejection.
Propagation through the DAG
The $widened sidecar follows these rules through downstream
nodes:
| Node type | Sidecar behavior |
|---|---|
| Transform | Inherits unchanged from input (transforms are row-preserving). |
| Aggregate | Output’s $widened slot is Value::Null — per-row payloads have no canonical aggregation. Users who need an unmapped field at aggregate output must add it to group_by or emit it explicitly via an aggregate function. |
| Combine | Driver’s sidecar rides through; build-side sidecars are dropped (mirrors propagate_ck: Driver). Build-side iter_user_fields() filters every engine-stamped column from match: collect array payloads, so build $widened cannot leak into the collect array. Users can lift a build-side unmapped field via <build_qualifier>.<field> in the combine body’s CXL. |
| Route / Merge | Row-preserving — sidecar passes through. Merge requires every input source to share the same on_unmapped policy; mixing fails compile with E315 (see below). |
| Composition | Body inherits the parent’s sidecar via the synthetic input port; whatever the body’s terminal node carries flows back to the parent. The body’s terminal-node propagation rule applies (e.g. an Aggregate terminal yields Value::Null at the parent boundary, a match: first Combine terminal carries the driver’s payload). |
| Output | Sidecar expands to top-level columns when include_unmapped: true (the default). Set include_unmapped: false to strip the sidecar (and every other unmapped input field) so only explicitly-emitted columns reach the writer. |
Output controls
- type: output
name: out
input: src
config:
name: out
type: json
path: out.json
include_unmapped: true # default: true
When true (the default), fields the source absorbed into
$widened are expanded back to top-level columns at the sink.
Useful for pass-through pipelines where every original input
field should reach the output regardless of whether it was
declared in schema:. Set include_unmapped: false to strip
the sidecar (and every other input field not explicitly emitted
upstream) so the writer sees only user-declared columns.
include_unmapped composes independently with
include_correlation_keys: true — each, both, or neither can be
set. include_correlation_keys does not surface
$widened; the two flags are orthogonal.
Cross-format flow
The expansion happens at the projection layer, before the
writer sees the record. So a CSV source with auto_widen plus
a JSON output with include_unmapped: true produces JSON
objects whose top-level keys include both declared columns and
absorbed input columns:
input.csv: id,extra,city
1,foo,Paris
output.json: {"id": "1", "extra": "foo", "city": "Paris"}
The literal $widened slot is stripped during expansion; the
writer never sees a Value::Map.
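The projection step can be pictured with a small model, following the Output controls description above. The helper name project_for_sink is hypothetical, records are plain dicts, and output column ordering is deliberately not modeled.

```python
def project_for_sink(record: dict, include_unmapped: bool = True) -> dict:
    """Model of the sink-side projection: drop the literal sidecar slot,
    optionally lifting its entries to top-level columns first."""
    out = {k: v for k, v in record.items() if k != "$widened"}
    if include_unmapped:
        # Lift absorbed fields back to the top level.
        out.update(record.get("$widened") or {})
    return out


absorbed = {"id": "1", "$widened": {"extra": "foo", "city": "Paris"}}
expanded = project_for_sink(absorbed)                         # id, extra, city
stripped = project_for_sink(absorbed, include_unmapped=False)  # id only
```

In both cases the map-valued slot itself is gone before the writer runs, which is the property the paragraph above relies on.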
Writer rejection of Value::Map payloads
CSV, XML, and fixed-width writers refuse records carrying a
Value::Map payload at any column slot, raising
FormatError::UnserializableMapValue { format, column }. The
rejection lives in each writer’s value-to-string helper —
single point of truth, no defensive prechecks. JSON serializes
Value::Map natively as a nested object and does not raise.
The most common cause: the $widened sidecar reaches the
writer because the Output node set include_unmapped: false.
Remediation is either to leave include_unmapped at its default
of true (so the projection layer expands the map to top-level
columns before write) or to coerce the map to a scalar in CXL
before the emit. The error message lists both routes.
The DLQ writer applies the same filter at its own layer:
dlq::dlq_user_columns strips any column tagged
FieldMetadata::WidenedSidecar, so the DLQ CSV header never
contains $widened even when the DLQ entry’s
original_record retains the auto_widen schema shape.
Correlation-lattice columns ($ck.*) are retained in the DLQ
output for collateral debugging.
E315 — Merge inputs must agree on policy
Merge concatenates streams positionally against the merge node’s
output_schema (taken from the first input). Every input must
agree on column shape — same column names, same on_unmapped
policy, same correlation_key set.
If two upstream sources disagree on whether they carry the
$widened sidecar (one source uses auto_widen, another uses
drop / reject), compile fails:
E315: merge "merged": input schemas disagree on the `$widened` auto_widen sidecar column.
Remediation: set every merge upstream source to the same
on_unmapped policy. The engine-wide default is auto_widen;
for sources that should explicitly omit the sidecar, declare
on_unmapped: { mode: drop } (or reject) on each.
Fixed-width sources are structurally inert
Fixed-width sources are positional — the schema is constructed
from width / start..end byte ranges, and bytes outside the
declared ranges are invisible to the reader. auto_widen
therefore can never populate the sidecar for fixed-width
sources; the slot stays Value::Null for every record.
The executor emits a tracing::info diagnostic at source-reader
construction time when auto_widen is the policy on a
fixed-width source, naming the source. The diagnostic fires
once per reader instance; a source used as a combine
build-side input across multiple combines may produce one log
per combine. To avoid the noise, switch to on_unmapped: drop
(or reject) for explicit scalar semantics, or accept the
empty sidecar.
Transform Nodes
Transform nodes apply CXL expressions to each record, producing new fields, filtering records, or both. They process one record at a time in streaming fashion with constant memory overhead.
Basic structure
- type: transform
name: enrich
input: customers
config:
cxl: |
emit full_name = first_name + " " + last_name
emit tier = if lifetime_value >= 10000 then "gold" else "standard"
filter status == "active"
The cxl: field is required and contains a CXL program. The three core CXL statements for transforms are:
- emit – produces an output field. Only emitted fields appear in downstream nodes.
- filter – drops records that do not match the boolean condition.
- let – binds a local variable for use in subsequent expressions (not emitted).
cxl: |
let margin = revenue - cost
emit product_id = product_id
emit margin = margin
emit margin_pct = if revenue > 0 then margin / revenue * 100 else 0
filter margin > 0
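For readers more used to general-purpose code, the program above behaves roughly like the following Python function. This is an illustrative model only: margin_transform is a made-up name, and CXL's actual evaluation order and numeric types are not represented.

```python
def margin_transform(record: dict):
    """Rough Python equivalent of the let / emit / filter program above."""
    margin = record["revenue"] - record["cost"]       # let margin = revenue - cost
    if not margin > 0:                                # filter margin > 0
        return None                                   # record is dropped
    return {                                          # only emitted fields survive
        "product_id": record["product_id"],
        "margin": margin,
        "margin_pct": margin / record["revenue"] * 100 if record["revenue"] > 0 else 0,
    }


kept = margin_transform({"product_id": "p1", "revenue": 200, "cost": 150, "note": "x"})
dropped = margin_transform({"product_id": "p2", "revenue": 100, "cost": 120})
```

The un-emitted note field and the let-bound variable do not appear in the result unless they are emitted explicitly.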
Analytic window
The analytic_window field enables cross-source lookups by joining a secondary dataset into the transform. The secondary source is loaded into memory and indexed by the join key.
- type: transform
name: enrich_orders
input: orders
config:
analytic_window:
source: products
on: product_id
group_by: [product_id]
cxl: |
emit order_id = order_id
emit product_name = $window.first()
emit quantity = quantity
emit line_total = quantity * price
The $window.* namespace provides access to the windowed data. Functions like $window.first(), $window.last(), and $window.count() operate over the matched group.
Validations
Declarative validation checks can be attached to a transform. They run against each record and either route failures to the DLQ (severity error) or log a warning and continue (severity warn).
- type: transform
name: validate_orders
input: raw_orders
config:
cxl: |
emit order_id = order_id
emit amount = amount
emit email = email
validations:
- field: email
check: "not_empty"
severity: error
message: "Email is required"
- check: "amount > 0"
severity: warn
message: "Non-positive amount"
- field: order_id
check: "not_empty"
severity: error
Validation fields
| Field | Required | Description |
|---|---|---|
| field | No | Restrict the check to a single field |
| check | Yes | Validation name (e.g. "not_empty") or CXL boolean expression |
| severity | No | error (default) routes to DLQ; warn logs and continues |
| message | No | Custom error message for DLQ entries |
| name | No | Validation name for DLQ reporting. Auto-derived from field + check if omitted |
| args | No | Additional arguments as key-value pairs |
Expansion cap (max_expansion)
When a transform body contains an emit each statement, every input record can fan out into multiple output records. The max_expansion field caps how many output records a single input record may produce – a safety bound against unexpectedly large arrays.
- type: transform
name: explode_items
input: orders
config:
max_expansion: 5000 # default: 10000
cxl: |
emit each it in items {
emit order_id = order_id
emit sku = it["sku"]
emit price = it["price"]
}
| Field | Type | Default | Description |
|---|---|---|---|
| max_expansion | u64 | 10000 | Maximum cumulative output records per input record. |
If a single input record’s emit each block produces more than max_expansion output records, the originating record routes to the DLQ with category expansion_limit_exceeded instead of producing a truncated or unbounded result. No partial output is emitted for that record – the cap is enforced eagerly so the writer never sees records from a runaway expansion.
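The all-or-nothing behavior can be sketched as follows. This is a simplified model under stated assumptions: expand_record and the shape of the DLQ entry are invented for illustration, and the engine's real buffering strategy may differ.

```python
def expand_record(record: dict, max_expansion: int = 10000):
    """Fan one order out into per-item rows, or reject the whole record.

    Returns (rows, dlq_entry); exactly one of the two is non-empty.
    """
    rows = []
    for item in record["items"]:
        if len(rows) >= max_expansion:
            # One more row would exceed the cap: discard everything produced
            # for this input record and report it instead.
            return [], {"category": "expansion_limit_exceeded", "record": record}
        rows.append({"order_id": record["order_id"],
                     "sku": item["sku"],
                     "price": item["price"]})
    return rows, None


order = {"order_id": "o1",
         "items": [{"sku": "a", "price": 1}, {"sku": "b", "price": 2}, {"sku": "c", "price": 3}]}
ok_rows, ok_dlq = expand_record(order, max_expansion=3)       # exactly at the cap: allowed
bad_rows, bad_dlq = expand_record(order, max_expansion=2)     # over the cap: DLQ, no rows
```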
When to tune
- Lower (e.g. 100, 1000) when input arrays are bounded by a known business rule and you want hostile or malformed input to surface as a DLQ entry rather than as a flood of downstream records.
- Higher (e.g. 100000, 1000000) when legitimate input carries large arrays – for example, an order with a long line-item list or an event carrying a per-second pricing curve.
The DLQ category expansion_limit_exceeded is distinct from generic CXL evaluation failures, so DLQ-side filters and metrics can target expansion runaway specifically. See Error Handling & DLQ for the wider DLQ contract.
Batch size (batch_size)
A streaming-eligible transform hands its output downstream in bounded batches rather than accumulating the whole stage before the next stage runs. batch_size sets how many events (records plus document-boundary punctuations) a batch holds. A per-transform batch_size overrides the pipeline-level pipeline.batch_size for this one stage; omit it to inherit the pipeline value (or the built-in default of 2048).
- type: transform
name: enrich
input: orders
config:
batch_size: 512 # override pipeline.batch_size for this stage
cxl: |
emit order_id = order_id
emit total = quantity * unit_price
| Field | Type | Default | Description |
|---|---|---|---|
| batch_size | usize | inherits pipeline.batch_size (else 2048) | Events per streaming batch for this transform. Must be >= 1. |
A batch_size of 0 is rejected at config load (a zero-event batch never flushes). Smaller batches lower the in-flight memory of a streaming stage at the cost of more per-batch bookkeeping; larger batches amortize the bookkeeping at the cost of a larger live working set. The default suits typical record widths — tune it only when a profiling run shows a streaming stage’s per-batch footprint matters. See Streaming vs. Blocking Stages for which stages stream and which fully materialize.
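The override chain reads naturally as a small resolution function. The sketch below is an assumption-laden model: resolve_batch_size is a hypothetical name, and validating both levels up front is a guess at how the config-load check behaves.

```python
DEFAULT_BATCH_SIZE = 2048  # built-in default named in the text


def resolve_batch_size(transform_value=None, pipeline_value=None) -> int:
    """Pick the effective batch size: transform override, then pipeline, then default."""
    for label, value in (("transform", transform_value), ("pipeline", pipeline_value)):
        if value is not None and value < 1:
            # A zero-event batch would never flush, so it is rejected.
            raise ValueError(f"{label} batch_size must be >= 1, got {value}")
    if transform_value is not None:
        return transform_value
    if pipeline_value is not None:
        return pipeline_value
    return DEFAULT_BATCH_SIZE


override = resolve_batch_size(transform_value=512, pipeline_value=4096)
inherited = resolve_batch_size(pipeline_value=4096)
fallback = resolve_batch_size()
```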
Log directives
Log directives control diagnostic output during transform execution:
- type: transform
name: process
input: validated
config:
cxl: |
emit id = id
emit result = compute(value)
log:
- level: info
when: per_record
every: 1000
message: "Processed record"
- level: warn
when: on_error
message: "Record failed processing"
- level: debug
when: before_transform
message: "Starting transform"
Log directive fields
| Field | Required | Description |
|---|---|---|
| level | Yes | trace, debug, info, warn, or error |
| when | Yes | before_transform, after_transform, per_record, or on_error |
| message | Yes | Log message text |
| every | No | Only log every N records (for per_record timing) |
| condition | No | CXL boolean expression – only log when true |
| fields | No | List of field names to include in the log output |
| log_rule | No | Reference to an external log rule definition |
Complete example
- type: source
name: employees
config:
name: employees
type: csv
path: "./data/employees.csv"
schema:
- { name: employee_id, type: string }
- { name: first_name, type: string }
- { name: last_name, type: string }
- { name: department, type: string }
- { name: salary, type: int }
- { name: hire_date, type: date }
- type: transform
name: enrich_employees
description: "Compute display name and tenure"
input: employees
config:
cxl: |
emit employee_id = employee_id
emit display_name = last_name + ", " + first_name
emit department = department.upper()
emit salary = salary
emit annual_bonus = if salary >= 80000 then salary * 0.15
else salary * 0.10
validations:
- field: employee_id
check: "not_empty"
severity: error
message: "Employee ID is required"
- check: "salary > 0"
severity: warn
message: "Salary should be positive"
log:
- level: info
when: per_record
every: 5000
message: "Processing employees"
Aggregate Nodes
Aggregate nodes group records by one or more fields and compute summary values using CXL aggregate functions. They consume all input records in a group before emitting a single summary record per group.
Basic structure
- type: aggregate
name: dept_totals
input: employees
config:
group_by: [department]
cxl: |
emit total_salary = sum(salary)
emit headcount = count(*)
emit avg_salary = avg(salary)
Group-by fields pass through automatically – you do not need to emit them. In this example, the output records contain department, total_salary, headcount, and avg_salary.
Group-by fields
The group_by: field is a list of field names from the input schema. Records sharing the same values for all group-by fields are placed in the same group.
group_by: [region, department]
cxl: |
emit total_salary = sum(salary)
emit max_salary = max(salary)
This produces one output record per unique (region, department) combination.
Global aggregation
An empty group_by list treats the entire input as a single group, producing exactly one output record:
- type: aggregate
name: grand_totals
input: orders
config:
group_by: []
cxl: |
emit grand_total = sum(amount)
emit record_count = count(*)
emit avg_order = avg(amount)
Aggregate functions
The following aggregate functions are available in CXL:
| Function | Description |
|---|---|
| sum(field) | Sum of all values in the group |
| count(*) | Number of records in the group |
| avg(field) | Arithmetic mean |
| min(field) | Minimum value |
| max(field) | Maximum value |
| collect(field) | Collect all values into an array |
| weighted_avg(value, weight) | Weighted average |
Strategy hint
The strategy: field controls how aggregation is executed:
- type: aggregate
name: totals
input: sorted_data
config:
group_by: [account_id]
strategy: streaming
cxl: |
emit total = sum(amount)
| Strategy | Behavior |
|---|---|
| auto | Default. The optimizer chooses based on whether the input is provably sorted for the group-by keys. |
| hash | Force hash aggregation. Works on any input ordering. Holds all groups in memory (with disk spill if memory budget is exceeded). |
| streaming | Require streaming aggregation. Processes one group at a time with O(1) memory per group. Compile-time error if the input is not provably sorted for the group-by keys. |
When to use streaming
If your source declares a sort_order: that covers the group-by fields, the optimizer will automatically choose streaming aggregation. Use strategy: streaming as an explicit assertion – it turns a silent fallback to hash aggregation into a compile error, which is useful for catching sort-order regressions.
When to use hash
Hash aggregation works on unsorted input and is the safe default. It uses more memory but handles any data ordering. Memory-aware disk spill kicks in when RSS approaches the pipeline’s memory.limit.
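The difference between the two strategies, and the reason streaming demands sorted input, can be seen in a toy model. Both functions below are illustrative sketches with made-up names; they ignore spilling and the engine's real accumulator machinery.

```python
from itertools import groupby


def hash_aggregate(rows, key, value):
    """Keep one running total per group; input order does not matter."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return [{key: k, "total": v} for k, v in totals.items()]


def streaming_aggregate(rows, key, value):
    """Emit a total each time the key changes; only correct on sorted input."""
    return [{key: k, "total": sum(r[value] for r in group)}
            for k, group in groupby(rows, key=lambda r: r[key])]


sorted_rows = [{"account_id": "a", "amount": 1},
               {"account_id": "a", "amount": 2},
               {"account_id": "b", "amount": 5}]
unsorted_rows = [sorted_rows[0], sorted_rows[2], sorted_rows[1]]

same_on_sorted = hash_aggregate(sorted_rows, "account_id", "amount") == \
    streaming_aggregate(sorted_rows, "account_id", "amount")
split_groups = streaming_aggregate(unsorted_rows, "account_id", "amount")
```

On the unsorted input the streaming version emits account a twice, which is the kind of silent error the compile-time sortedness check exists to prevent.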
Correlation-key interaction
In a pipeline whose sources declare correlation_key: fields, the engine inspects each aggregate’s group_by against the upstream CK lattice (the union of $ck.* shadow columns visible at the aggregate’s input):
- group_by covers every upstream CK field – strict-collateral path. The aggregate emits one row per group, the row inherits the correlation identity of its inputs, and a DLQ trigger anywhere in the group rolls back the whole group including the aggregate output. Zero retraction overhead.
- group_by omits any upstream CK field – retraction protocol path. A single correlation group may span multiple aggregate groups; CK fields omitted from group_by stop being visible to downstream consumers of this aggregate’s output as user-named columns. The engine retracts only the failing records and refinalizes affected groups.
Authors do not configure this — the engine selects the path automatically based on group_by content. A retraction-mode aggregate is incompatible with strategy: streaming (rejected with E15Y, because streaming aggregates emit at group-boundary close before the terminal correlation commit and that defeats the rollback window). See Correlation Keys for the full lattice rules.
A retraction-mode aggregate emits one engine-managed $ck.aggregate.<name> shadow column on its output schema, alongside [group_by_columns] ++ [emitted_binding_columns]. The column carries the aggregator’s per-group index at finalize and costs ~16 bytes per emitted row (the Value::Integer payload plus its slot overhead); it is hidden from default writer output. The synthetic column is the lineage hook that lifts the post-aggregate retract path: a Transform or Output that fails on an aggregate output row carries the column on the failing record, the orchestrator’s detect phase decodes the index back to the contributing source row ids, and the recompute phase retracts those source rows so the failing aggregate row’s contributors are removed from the writer payload — matching the upstream-failure DLQ-fan-out semantic. See Correlation Keys → Where retraction triggers are sourced and the runnable demo at examples/pipelines/retract-demo/.
The retraction protocol carries a per-aggregate cost — Reversible accumulators use a per-row lineage map, BufferRequired accumulators hold raw contributions until commit. Both paths additionally pay ~16 bytes per output row for the synthetic-CK shadow column. The operator-by-operator retraction cost reference has the per-operator breakdown; clinker run --explain reports the live per-aggregate detail including the synthetic-CK line.
Time-windowed aggregates
When time_window: is set on the aggregate body, the operator
groups records not just by group_by but also by event-time
window. Each record is assigned to one or more windows by the
engine-stamped $source.event_time
column; state accumulates per (group_by, window); a window closes
once min_across_sources >= window_end + allowed_lateness and emits
one row per group it saw. The shape parallels Flink SQL
Window TVFs,
Spark Structured Streaming
window / session_window,
and Beam
windowing.
Every upstream-reachable source must declare a
watermark:. Without one, min_across_sources would stay at None
and no window could ever close, so the planner rejects the
pipeline up front with
E156.
The engine emits user-declared columns only — window bounds do not
appear in the output unless you compute and emit them yourself. The
emit order is ascending window_start (deterministic), so output
rows naturally group by window.
Tumbling windows
Non-overlapping fixed-size buckets. Each record lands in exactly one
window [floor(t / size) * size, floor(t / size) * size + size).
time_window:
tumbling: { size: 1h }
Input (tumbling_demo.csv):
user_id,event_ts,kind
u1,2026-05-14T10:05:00,click
u2,2026-05-14T10:30:00,click
u1,2026-05-14T10:42:00,click
u1,2026-05-14T11:03:00,click
u2,2026-05-14T11:15:00,click
u2,2026-05-14T11:50:00,click
Output with tumbling: { size: 1h }, group_by: [user_id],
emit n = count(*):
user_id,n
u1,2
u2,1
u1,1
u2,2
Reading top-to-bottom: the first two rows are the [10:00, 11:00)
bucket (u1’s 10:05 and 10:42, then u2’s 10:30); the next two are
the [11:00, 12:00) bucket (u1’s 11:03, then u2’s 11:15 and 11:50).
Each input record contributes to exactly one window.
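The bucket formula and the table above can be reproduced with a few lines of Python. Treat this as a sketch: aligning buckets to the Unix epoch and ordering rows by user_id inside a window are assumptions of the example, not documented engine behavior.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed alignment origin


def tumbling_window(ts: datetime, size: timedelta):
    """Return the [start, end) bucket containing ts: floor(t / size) * size."""
    start = EPOCH + ((ts - EPOCH) // size) * size
    return start, start + size


events = [
    ("u1", "2026-05-14T10:05:00"), ("u2", "2026-05-14T10:30:00"),
    ("u1", "2026-05-14T10:42:00"), ("u1", "2026-05-14T11:03:00"),
    ("u2", "2026-05-14T11:15:00"), ("u2", "2026-05-14T11:50:00"),
]

counts = Counter()
for user_id, event_ts in events:
    window_start, _ = tumbling_window(datetime.fromisoformat(event_ts), timedelta(hours=1))
    counts[(window_start, user_id)] += 1

# Ascending window_start, then user_id within a window (the latter is an assumption).
rows = [(user_id, n) for (_, user_id), n in sorted(counts.items())]
# rows == [("u1", 2), ("u2", 1), ("u1", 1), ("u2", 2)]
```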
Hopping windows
Overlapping fixed-size buckets advanced by slide. Each record
lands in at most ceil(size / slide) windows (exactly size / slide
when slide divides size): slide < size produces overlap,
slide == size degenerates to tumbling, and slide > size produces
gaps where some records fall in zero windows.
time_window:
hopping: { size: 1h, slide: 30m }
Input (hopping_demo.csv):
user_id,event_ts,amount
u1,2026-05-14T10:05:00,10
u1,2026-05-14T10:42:00,20
u1,2026-05-14T11:10:00,15
Output with group_by: [user_id], emit total = sum(amount),
emit n = count(*):
user_id,total,n
u1,10,1
u1,30,2
u1,35,2
u1,15,1
Three input records, four output rows — each record fans into two
overlapping size: 1h, slide: 30m windows:
- [09:30, 10:30) – just 10:05 → total=10, n=1
- [10:00, 11:00) – 10:05 + 10:42 → total=30, n=2
- [10:30, 11:30) – 10:42 + 11:10 → total=35, n=2
- [11:00, 12:00) – just 11:10 → total=15, n=1
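The window breakdown above can be checked with a short model of hopping assignment. The function name hopping_windows and the epoch-aligned window starts are assumptions of this sketch rather than documented engine details.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed alignment origin


def hopping_windows(ts, size, slide):
    """Return every [start, end) window containing ts, earliest first."""
    start = EPOCH + ((ts - EPOCH) // slide) * slide  # latest window start <= ts
    windows = []
    while start + size > ts:          # walk back while the window still covers ts
        windows.append((start, start + size))
        start -= slide
    return windows[::-1]


events = [("u1", "2026-05-14T10:05:00", 10),
          ("u1", "2026-05-14T10:42:00", 20),
          ("u1", "2026-05-14T11:10:00", 15)]

state = {}  # (window_start, user_id) -> [total, n]
for user_id, event_ts, amount in events:
    ts = datetime.fromisoformat(event_ts)
    for window_start, _ in hopping_windows(ts, timedelta(hours=1), timedelta(minutes=30)):
        acc = state.setdefault((window_start, user_id), [0, 0])
        acc[0] += amount
        acc[1] += 1

rows = [(user_id, total, n) for (_, user_id), (total, n) in sorted(state.items())]
# rows == [("u1", 10, 1), ("u1", 30, 2), ("u1", 35, 2), ("u1", 15, 1)]
```

With slide larger than size the same function returns an empty list for timestamps that fall into a gap.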
Session windows
Per-key gap-bounded sessions. A new record extends its key’s current
session if its event time is within gap of the session’s last
event time; otherwise it starts a new session. The boundary is
data-driven, not clock-aligned.
time_window:
session: { gap: 10m }
Input (session_demo.csv):
user_id,event_ts,action
u1,2026-05-14T10:00:00,login
u1,2026-05-14T10:07:00,click
u1,2026-05-14T10:13:00,click
u1,2026-05-14T10:50:00,login
u1,2026-05-14T10:55:00,click
Output with group_by: [user_id], emit n = count(*):
user_id,n
u1,3
u1,2
u1’s first three rows form one session (10:00 → 10:07 → 10:13,
consecutive gaps ≤ 10m). The 37-minute idle stretch exceeds gap,
so 10:50 starts a fresh session that runs through 10:55. Two
sessions, two output rows.
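A minimal sessionizer that reproduces this result is sketched below. It assumes events can be processed in event-time order and treats a gap exactly equal to the limit as still inside the session; merging of out-of-order sessions is not modeled, and sessionize is a name made up for the example.

```python
from datetime import datetime, timedelta


def sessionize(events, gap):
    """Group (user_id, ts) pairs into gap-bounded sessions; return (user_id, n) per session."""
    sessions = []        # each entry: [user_id, first_ts, last_ts, n]
    current = {}         # user_id -> index of that user's open session
    for user_id, ts in sorted(events, key=lambda e: e[1]):
        idx = current.get(user_id)
        if idx is not None and ts - sessions[idx][2] <= gap:
            sessions[idx][2] = ts          # extend the open session
            sessions[idx][3] += 1
        else:
            current[user_id] = len(sessions)
            sessions.append([user_id, ts, ts, 1])  # start a fresh session
    return [(user_id, n) for user_id, _, _, n in sessions]


raw = ["10:00:00", "10:07:00", "10:13:00", "10:50:00", "10:55:00"]
events = [("u1", datetime.fromisoformat(f"2026-05-14T{t}")) for t in raw]
rows = sessionize(events, timedelta(minutes=10))
# rows == [("u1", 3), ("u1", 2)]
```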
Allowed lateness
allowed_lateness is an operator-side knob, distinct from the
source-side watermark.delay. A window with time_window: closes
when min_across_sources >= window_end + allowed_lateness. Records
arriving after a window’s end + allowed_lateness route to the DLQ
as LateRecord with stage label time_window:<aggregate-name>.
See DLQ category: LateRecord
for the DLQ row layout.
- type: aggregate
name: hourly
input: clicks
config:
group_by: [user_id]
time_window:
tumbling: { size: 1h }
allowed_lateness: 30s
cxl: |
emit n = count(*)
Default (unset) means no grace beyond the watermark — windows close
the instant min_across_sources crosses window_end. Set
allowed_lateness when the source’s watermark.delay alone is too
small to absorb the observed out-of-order tail.
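The close rule and the late-record decision reduce to one comparison. The sketch below uses hypothetical helper names (window_closed, route_record) and plain return strings in place of the engine's DLQ machinery.

```python
from datetime import datetime, timedelta


def window_closed(min_across_sources, window_end, allowed_lateness=timedelta(0)):
    """A window closes once the slowest watermark passes window_end + allowed_lateness."""
    if min_across_sources is None:
        return False
    return min_across_sources >= window_end + allowed_lateness


def route_record(window_end, min_across_sources, allowed_lateness=timedelta(0)):
    """Decide what happens to a record whose window ends at window_end."""
    if window_closed(min_across_sources, window_end, allowed_lateness):
        return "late_record"   # would go to the DLQ as LateRecord
    return "accumulate"


end = datetime(2026, 5, 14, 11, 0, 0)
grace = timedelta(seconds=30)
still_open = window_closed(datetime(2026, 5, 14, 11, 0, 10), end, grace)   # inside the grace period
now_closed = window_closed(datetime(2026, 5, 14, 11, 0, 30), end, grace)   # grace exhausted
no_grace = window_closed(datetime(2026, 5, 14, 11, 0, 0), end)             # default: closes at window_end
```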
Worked example: multi-source session window
This pipeline merges two independent login feeds and groups per-user
events into gap-bounded sessions. Issue
#61 tracks the
multi-source synchronisation contract: a window cannot close until
every upstream source has advanced its watermark past
window_end + allowed_lateness.
pipeline:
name: multi_source_session
nodes:
- type: source
name: src_web
description: Web login events.
config:
name: src_web
type: csv
path: ./data/session_logins.csv
options:
has_header: true
watermark:
column: event_ts
schema:
- { name: user_id, type: string }
- { name: event_ts, type: date_time }
- { name: source, type: string }
- type: source
name: src_mobile
description: Mobile login events.
config:
name: src_mobile
type: csv
path: ./data/session_mobile.csv
options:
has_header: true
watermark:
column: event_ts
schema:
- { name: user_id, type: string }
- { name: event_ts, type: date_time }
- { name: source, type: string }
- type: merge
name: all_logins
inputs: [src_web, src_mobile]
- type: aggregate
name: user_sessions
input: all_logins
config:
group_by: [user_id]
time_window:
session: { gap: 5m }
allowed_lateness: 30s
cxl: |
emit user_id = user_id
emit logins = count(*)
- type: output
name: results
input: user_sessions
config:
name: results
type: csv
path: ./output/multi_source_session.csv
Both sources declare their own watermark.column independently. At
ingest, each record gets the engine-stamped $source.event_time
column, so the aggregate is column-name-agnostic about which source
delivered any given record. The aggregate’s close decision reads
min_across_sources across both sources’ partitions: a session
cannot emit until both src_web and src_mobile have advanced past
the session’s end + allowed_lateness. Drop the watermark: block
on either source and the planner rejects the pipeline with
E156.
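The synchronisation rule can be modeled as taking the minimum over per-source watermarks. In the sketch below the function names and the dict-of-watermarks shape are assumptions, and the session end is passed in as a given value because this section does not spell out how it is derived.

```python
from datetime import datetime, timedelta


def min_across_sources(watermarks: dict):
    """Slowest watermark across sources; None if any source has not reported one."""
    values = list(watermarks.values())
    if not values or any(v is None for v in values):
        return None
    return min(values)


def can_close(watermarks: dict, window_end, allowed_lateness):
    low = min_across_sources(watermarks)
    return low is not None and low >= window_end + allowed_lateness


session_end = datetime(2026, 5, 14, 10, 10, 0)
grace = timedelta(seconds=30)

lagging = {"src_web": datetime(2026, 5, 14, 10, 20), "src_mobile": datetime(2026, 5, 14, 10, 5)}
caught_up = {"src_web": datetime(2026, 5, 14, 10, 20), "src_mobile": datetime(2026, 5, 14, 10, 10, 30)}
missing = {"src_web": datetime(2026, 5, 14, 10, 20), "src_mobile": None}
```

Here can_close is false for lagging and missing and true for caught_up: the fast web feed alone never closes a session.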
Run it from the repo:
cargo run -p clinker -- run examples/pipelines/multi_source_session.yaml
Complete example
- type: source
name: transactions
config:
name: transactions
type: csv
path: "./data/transactions.csv"
schema:
- { name: account_id, type: string }
- { name: txn_date, type: date }
- { name: amount, type: float }
- { name: category, type: string }
sort_order:
- { field: "account_id", order: asc }
- type: aggregate
name: account_summary
input: transactions
config:
group_by: [account_id]
strategy: streaming
cxl: |
emit total_amount = sum(amount)
emit txn_count = count(*)
emit avg_amount = avg(amount)
emit max_amount = max(amount)
emit categories = collect(category)
- type: output
name: summary_output
input: account_summary
config:
name: summary_output
type: csv
path: "./output/account_summary.csv"
Route Nodes
Route nodes split a stream of records into named branches based on CXL boolean conditions. Each branch becomes an independent output port that downstream nodes can wire to using port syntax.
Basic structure
- type: route
name: split_by_value
input: orders
config:
mode: exclusive
conditions:
high: "amount.to_int() > 1000"
medium: "amount.to_int() > 100"
default: low
This creates three output ports: split_by_value.high, split_by_value.medium, and split_by_value.low.
Conditions
The conditions: field is an ordered map of branch names to CXL boolean expressions. Each expression is evaluated against the incoming record.
conditions:
priority: "urgency == \"high\" and amount > 500"
standard: "urgency == \"medium\""
bulk: "quantity > 100"
default: other
Condition keys become the port names used in downstream input: wiring.
Default branch
The default: field is required. Records that match no condition are routed to the default branch. The default branch name must not collide with any condition key.
Routing modes
Exclusive (default)
In exclusive mode, conditions are evaluated in declaration order and the first matching condition wins. A record appears in exactly one branch. Order matters – put more specific conditions first.
mode: exclusive
conditions:
vip: "lifetime_value > 100000"
high: "lifetime_value > 10000"
medium: "lifetime_value > 1000"
default: standard
A customer with lifetime_value = 50000 fails the vip check (50000 is not > 100000), matches high, and stops there. The record would also satisfy medium, but exclusive mode never evaluates conditions after the first match, so it lands in high only.
Inclusive
In inclusive mode, all matching conditions route the record. A single record can appear in multiple branches simultaneously.
mode: inclusive
conditions:
needs_review: "amount > 10000"
flagged: "status == \"flagged\""
international: "country != \"US\""
default: standard
A flagged international order over 10000 would appear in needs_review, flagged, and international – three copies routed to three branches.
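Both modes fit in one small dispatcher. The sketch below stands in Python lambdas for CXL conditions and uses a hypothetical route function; it is meant to illustrate the first-match versus all-matches rule, not the engine's implementation.

```python
def route(record: dict, conditions: dict, default: str, mode: str = "exclusive"):
    """Return the branch names a record is sent to."""
    matched = []
    for name, predicate in conditions.items():  # dicts keep declaration order
        if predicate(record):
            matched.append(name)
            if mode == "exclusive":
                break                            # first match wins
    return matched or [default]                  # no match: default branch


tiers = {
    "vip": lambda r: r["lifetime_value"] > 100000,
    "high": lambda r: r["lifetime_value"] > 10000,
    "medium": lambda r: r["lifetime_value"] > 1000,
}
flags = {
    "needs_review": lambda r: r["amount"] > 10000,
    "flagged": lambda r: r["status"] == "flagged",
    "international": lambda r: r["country"] != "US",
}

exclusive_hit = route({"lifetime_value": 50000}, tiers, "standard")
exclusive_default = route({"lifetime_value": 500}, tiers, "standard")
inclusive_hit = route({"amount": 20000, "status": "flagged", "country": "FR"},
                      flags, "standard", mode="inclusive")
```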
Downstream wiring
Downstream nodes reference route branches using port syntax: route_name.branch_name.
- type: route
name: classify
input: transactions
config:
mode: exclusive
conditions:
high: "amount > 1000"
medium: "amount > 100"
default: low
- type: transform
name: high_value_processing
input: classify.high
config:
cxl: |
emit txn_id = txn_id
emit amount = amount
emit review_flag = true
- type: transform
name: standard_processing
input: classify.medium
config:
cxl: |
emit txn_id = txn_id
emit amount = amount
- type: output
name: low_value_out
input: classify.low
config:
name: low_value_out
type: csv
path: "./output/low_value.csv"
Constraints
- At least 1 condition is required.
- Maximum 256 branches (conditions + default).
- Branch names must be unique.
- The default name must not collide with any condition key.
Complete example
pipeline:
name: order_routing
nodes:
- type: source
name: orders
config:
name: orders
type: csv
path: "./data/orders.csv"
schema:
- { name: order_id, type: int }
- { name: region, type: string }
- { name: amount, type: float }
- { name: priority, type: string }
- type: route
name: by_region
input: orders
config:
mode: exclusive
conditions:
domestic: "region == \"US\" or region == \"CA\""
emea: "region == \"UK\" or region == \"DE\" or region == \"FR\""
apac: "region == \"JP\" or region == \"AU\" or region == \"SG\""
default: other
- type: output
name: domestic_orders
input: by_region.domestic
config:
name: domestic_orders
type: csv
path: "./output/domestic.csv"
- type: output
name: emea_orders
input: by_region.emea
config:
name: emea_orders
type: csv
path: "./output/emea.csv"
- type: output
name: apac_orders
input: by_region.apac
config:
name: apac_orders
type: csv
path: "./output/apac.csv"
- type: output
name: other_orders
input: by_region.other
config:
name: other_orders
type: csv
path: "./output/other_regions.csv"
Merge Nodes
Merge nodes concatenate multiple upstream branches into a single stream. They are the counterpart to route nodes – where a route splits one stream into many, a merge joins many streams back into one.
Merge is for streamwise concatenation of inputs that share a schema. For record-level joining across inputs that have different schemas, see Combine Nodes.
Basic structure
- type: merge
name: combined
inputs:
- east_data
- west_data
config: {}
Note the key differences from other node types:
- Uses inputs: (plural), not input: (singular).
- The config: block can be left empty – all wiring is on the node header (the optional mode setting is covered under Modes).
- Using input: (singular) on a merge node is a parse error.
Wiring
The inputs: field is a list of upstream node references. These can be bare node names or port references from route nodes:
- type: merge
name: rejoin
inputs:
- process_high
- process_medium
- classify.low # Port syntax for a route branch
config: {}
Downstream nodes wire to the merge as a normal single-input reference:
- type: output
name: final_output
input: rejoin
config:
name: final_output
type: csv
path: "./output/combined.csv"
Modes
Merge’s cross-input ordering discipline is selected by config.mode. Two modes exist; concat is the default.
concat (default)
Predecessor records drain in declaration order: inputs[0] flows to output first, then inputs[1], then inputs[2], and so on. Within a single predecessor, per-source FIFO order is preserved. Output is reproducible run-to-run.
- type: merge
name: combined
inputs: [east, west]
config:
mode: concat
interleave
Records flow to output as they become available from any predecessor. Per-source FIFO is preserved within each input; cross-input order follows wall-clock arrival and is non-deterministic.
- type: merge
name: combined
inputs: [east, west]
config:
mode: interleave
When every direct predecessor of an unseeded interleave merge is a Source node, the executor fuses the Merge into the source ingest loop — predecessor channels are polled directly and Merge consumption proceeds at live ingest rate without any intermediate buffering tier.
Seeded interleave — interleave_seed:
Snapshot tests and benchmarks that need reproducible cross-input ordering can opt into a deterministic schedule:
- type: merge
name: combined
inputs: [east, west]
config:
mode: interleave
interleave_seed: 42
A seeded interleave bypasses the fused live-channel path. The Merge instead pre-buffers each predecessor’s output into a Vec, then emits records in fastrand-driven order seeded by interleave_seed. Output is reproducible regardless of upstream timing — at the cost of opting out of live back-pressure across this Merge (see below).
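The idea of a pre-buffered, seed-driven schedule can be sketched as below. This uses Python's random module rather than fastrand, and picking uniformly among non-exhausted inputs is an assumed policy; only the properties (reproducible, per-source FIFO preserved) mirror the text.

```python
import random


def seeded_interleave(inputs, seed: int):
    """Merge fully buffered inputs in a pseudo-random but reproducible order."""
    rng = random.Random(seed)
    buffers = [list(records) for records in inputs]   # pre-buffer every predecessor
    cursors = [0] * len(buffers)
    live = [i for i, buf in enumerate(buffers) if buf]
    out = []
    while live:
        i = rng.choice(live)                 # seeded pick among inputs with records left
        out.append(buffers[i][cursors[i]])   # always take that input's next record (FIFO)
        cursors[i] += 1
        if cursors[i] == len(buffers[i]):
            live.remove(i)
    return out


east = ["e1", "e2", "e3"]
west = ["w1", "w2"]
first_run = seeded_interleave([east, west], seed=42)
second_run = seeded_interleave([east, west], seed=42)
```

Because every input is materialized before the first record is emitted, nothing here can push back on a producer, which is the trade-off the paragraph above describes.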
Back-pressure semantics
How a slow consumer or slow upstream reader propagates back through the DAG depends on the merge mode.
concat
Each Source ingest task pushes into its own bounded mpsc channel (capacity 1024 records per Source). Peer sources produce concurrently up to that capacity — the dispatch arm just consumes from inputs[0]’s channel before turning to inputs[1]’s.
Consequences:
- Memory: a non-leading input can hold up to one channel’s worth of buffered records before its producer blocks. Multi-input concat over N Sources may carry up to (N - 1) × 1024 records in flight even while only one input is being drained.
- Latency: a record produced by inputs[1] while inputs[0] is still draining will not reach output until inputs[0] finishes, regardless of how fast it was produced.
- Producer-side back-pressure: when a non-leading input’s channel fills, its reader blocks at blocking_send, propagating pressure back to the upstream file/network reader. The upstream is throttled even though it is not the currently-consumed input.
concat is the right choice when downstream consumers depend on declaration-ordered records (e.g. snapshot tests asserting on byte-identical output) or when the inputs represent ordered time partitions that must remain contiguous.
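The memory bound for concat is simple arithmetic, worked here for a few input counts. The function name is hypothetical; the 1024-record capacity is the per-Source figure quoted above.

```python
CHANNEL_CAPACITY = 1024  # per-Source channel capacity quoted in the text

def max_parked_records(n_sources, capacity=CHANNEL_CAPACITY):
    """Upper bound on records buffered by non-leading inputs under concat."""
    if n_sources < 1:
        raise ValueError("a merge needs at least one input")
    return (n_sources - 1) * capacity

for n in (2, 3, 8):
    print(n, "sources ->", max_parked_records(n), "records")
```

The bound counts records, not bytes, so wide records raise the real memory cost proportionally.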
interleave (unseeded)
Fused with Source predecessors, the Merge arm polls every predecessor’s channel concurrently. Live back-pressure flows end-to-end:
- A slow downstream operator delays Merge consumption, which fills the predecessor channels, which blocks the Source reader tasks.
- A fast input does not wait on a slow peer — the Merge schedules whichever channel has a ready record.
When predecessors are not all Sources (e.g. Transform → Merge), fusion does not apply and the Merge consumes pre-buffered predecessor outputs in round-robin order; live back-pressure across the Merge boundary itself is unavailable in that shape, though the upstream operator’s own bounded buffer still throttles its predecessors.
Unseeded interleave is the right choice when end-to-end latency matters and the downstream consumer is order-insensitive (e.g. an aggregator grouping on a key, or a writer that does not assert on row sequencing).
interleave (seeded)
The seeded path does not preserve live back-pressure across the Merge: it pre-buffers each predecessor’s full output into a Vec before emitting in fastrand-driven order. A slow consumer downstream of a seeded Merge will not throttle the Source readers while the buffers are still filling.
If you need both run-to-run determinism and live back-pressure, prefer asserting on the multiset of records rather than their sequence and use unseeded interleave, or fall back to concat over deterministically-declared inputs.
Record ordering
Records arrive in the order described by the mode in use — see Modes and Back-pressure semantics above. If you need sorted output regardless of merge mode, apply a sort_order on the downstream output node.
Use cases
Reuniting route branches
The most common pattern is routing records through different processing paths and then merging them back together:
- type: route
name: classify
input: orders
config:
mode: exclusive
conditions:
high: "amount > 1000"
default: standard
- type: transform
name: process_high
input: classify.high
config:
cxl: |
emit order_id = order_id
emit amount = amount
emit surcharge = amount * 0.02
emit tier = "premium"
- type: transform
name: process_standard
input: classify.standard
config:
cxl: |
emit order_id = order_id
emit amount = amount
emit surcharge = 0
emit tier = "standard"
- type: merge
name: all_orders
inputs:
- process_high
- process_standard
config: {}
- type: output
name: result
input: all_orders
config:
name: result
type: csv
path: "./output/all_orders.csv"
Unioning multiple sources
Merge nodes can combine records from multiple source files that share the same schema:
- type: source
name: jan_sales
config:
name: jan_sales
type: csv
path: "./data/sales_jan.csv"
schema:
- { name: sale_id, type: int }
- { name: amount, type: float }
- { name: region, type: string }
- type: source
name: feb_sales
config:
name: feb_sales
type: csv
path: "./data/sales_feb.csv"
schema:
- { name: sale_id, type: int }
- { name: amount, type: float }
- { name: region, type: string }
- type: merge
name: all_sales
inputs:
- jan_sales
- feb_sales
config: {}
- type: aggregate
name: totals
input: all_sales
config:
group_by: [region]
cxl: |
emit total = sum(amount)
emit count = count(*)
Schema constraints across inputs
Merge concatenates streams positionally against the merge node’s output_schema (taken from the first input). Every input must therefore agree on column shape — same column names, same on_unmapped policy, same correlation_key set.
Disagreement on the $widened auto_widen sidecar (one source uses auto_widen, another uses drop / reject) fails compile with E315. See Auto-Widen & Schema Drift → E315 for the full diagnostic shape and remediation.
Combine Nodes
Combine nodes are the N-ary record-combining operator. Every input is declared up front and bound to a qualifier; the where: expression matches records across inputs using qualified field references (e.g. orders.product_id == products.product_id); the cxl: body shapes the output row.
Combine is distinct from merge: merge concatenates upstream branches that share a schema, while combine joins records across inputs that have different schemas.
Basic structure
- type: combine
name: enrich
input:
orders: orders # qualifier: upstream node name
products: products
config:
where: "orders.product_id == products.product_id"
match: first
on_miss: null_fields
cxl: |
emit order_id = orders.order_id
emit product_name = products.product_name
emit amount = orders.amount
propagate_ck: driver
Note the differences from other node types:
- Uses input: as a map, binding qualifier names to upstream node references. Other nodes use input: as a single string or inputs: as a list of strings.
- Every field reference inside where: and cxl: must be qualified (<qualifier>.<field>). Bare field names are a compile error.
- Using inputs: (plural list) on a combine node is a parse error.
Wiring
Each entry in the input: map binds a qualifier to an upstream node:
input:
orders: orders # qualifier "orders" -> source node "orders"
products: products
high_priority: classify.high # qualifier "high_priority" -> route port
Qualifiers are local names used inside where: and cxl:; they do not need to match the upstream node name. Upstream references can be bare node names or port references from a route node.
Iteration order in the input: map is preserved and used as the default driver-selection order (see Choosing the driving input below).
Configuration fields
| Field | Required | Default | Description |
|---|---|---|---|
| where | Yes | – | CXL boolean expression matching records across inputs. Must contain at least one cross-input equality. |
| match | No | first | Match cardinality: first, all, or collect. |
| on_miss | No | null_fields | Driver-record handling on zero matches: null_fields, skip, or error. |
| cxl | Yes (except under match: collect) | – | Emit statements defining the output row. Empty under match: collect. |
| drive | No | first input | Explicit driver-input qualifier. Overrides the iteration-order default. |
| strategy | No | auto | Execution strategy hint: auto or grace_hash. |
| propagate_ck | Yes | – | Selects which correlation-key columns ride onto the output. driver keeps the driver’s CK only; all unions every input’s CK columns; { named: [<field>, ...] } carries an explicit subset. See Correlation-key propagation below. |
The where: predicate
The where: expression is a CXL boolean expression evaluated for every candidate record pair across inputs. It must contain at least one cross-input equality – an equality with field references from two different inputs:
where: "orders.product_id == products.product_id"
Compound predicates combine multiple conjuncts with and. Each conjunct is classified by the planner:
- Equi conjunct – a cross-input equality (a.x == b.y). Drives the hash lookup or sort-merge join.
- Range conjunct – a cross-input ordered comparison (a.start <= b.ts and b.ts <= a.end). Handled by the IEJoin algorithm when no equi conjunct constrains the same input pair.
- Residual conjunct – any other CXL predicate (intra-input filter, function call, etc.). Applied as a post-filter after the equi/range match.
where: |
orders.product_id == products.product_id
and orders.amount >= 100
and products.region == "us-east"
Above: the equi conjunct drives the join; orders.amount >= 100 and products.region == "us-east" are applied as residuals.
At least one cross-input equality is required for every combine. Pure-range predicates without an equi conjunct are also supported via IEJoin.
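The three-way classification can be illustrated with a small Python sketch. It is not the planner’s code: the `Field` type, the tuple encoding of a conjunct, and the `classify` function are hypothetical, and the rule that a range conjunct is only routed to IEJoin when no equi conjunct covers the same input pair is not modeled.

```python
from dataclasses import dataclass

ORDERED_OPS = {"<", "<=", ">", ">="}

@dataclass(frozen=True)
class Field:
    """A qualified field reference such as orders.product_id."""
    qualifier: str
    name: str

def classify(left, op, right):
    """Label one conjunct as 'equi', 'range', or 'residual'."""
    cross_input = (
        isinstance(left, Field)
        and isinstance(right, Field)
        and left.qualifier != right.qualifier
    )
    if cross_input and op == "==":
        return "equi"
    if cross_input and op in ORDERED_OPS:
        return "range"
    return "residual"   # literals, same-input comparisons, everything else

conjuncts = [
    (Field("orders", "product_id"), "==", Field("products", "product_id")),
    (Field("orders", "amount"), ">=", 100),
    (Field("products", "region"), "==", "us-east"),
]
kinds = [classify(*c) for c in conjuncts]
```

For the compound predicate shown earlier, `kinds` comes out as one equi conjunct followed by two residuals, matching the description in the text.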
Match modes
match: first
Emit one output row per driver record, using the first matching build-side record. Standard 1:1 enrichment. Default.
config:
where: "orders.product_id == products.product_id"
match: first
cxl: |
emit order_id = orders.order_id
emit product_name = products.product_name
match: all
Emit one output row for every matching build-side record. 1:N fan-out – if a driver record matches three build records, three rows are emitted.
config:
where: "employees.department == benefits.department"
match: all
cxl: |
emit employee_id = employees.employee_id
emit benefit = benefits.benefit_name
match: collect
Gather every matching build-side record into a single Array-typed field on the output row. The driver record appears once; the build matches are aggregated into an array. The cxl: body must be empty under collect – the combine node synthesizes the output as { driver fields..., <build_qualifier>: Array }.
config:
where: "orders.product_id == products.product_id"
match: collect
cxl: ""
A per-group entry limit of 10,000 prevents unbounded growth.
Use collect when you need the set of matches as a single structured value; use all when you need a flat row per match.
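The difference between the three match modes is easiest to see on a driver record with two build-side matches. The following Python sketch is illustrative only: `combine` and its tuple-shaped output are hypothetical, the join is a plain in-memory hash lookup, and unmatched drivers are skipped because that case belongs to on_miss.

```python
from collections import defaultdict

def combine(drivers, builds, key, match="first"):
    """Pair driver rows with build rows on `key` under one match mode."""
    index = defaultdict(list)
    for b in builds:                       # build side: hash table on the key
        index[b[key]].append(b)
    out = []
    for d in drivers:                      # driver side: one record at a time
        hits = index.get(d[key], [])
        if not hits:
            continue                       # unmatched handling is on_miss, left out
        if match == "first":
            out.append((d, hits[0]))       # one row, first match only
        elif match == "all":
            out.extend((d, h) for h in hits)   # one row per match
        elif match == "collect":
            out.append((d, list(hits)))    # one row, matches gathered in a list
        else:
            raise ValueError(f"unknown match mode: {match}")
    return out

orders = [{"order_id": 1, "product_id": "p1"}]
products = [
    {"product_id": "p1", "name": "A"},
    {"product_id": "p1", "name": "B"},
]
```

With this data, first yields one pair, all yields two pairs, and collect yields one pair whose second element is a two-item list. The 10,000-entry collect limit is not modeled.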
Unmatched records (on_miss)
on_miss controls what happens to driver records with zero matches:
| Value | Semantics |
|---|---|
| null_fields (default) | Build-side fields resolve to null. Driver record is still emitted. Equivalent to left-join. |
| skip | Driver record is dropped. Equivalent to inner-join. |
| error | Pipeline fails on the first unmatched driver record. |
config:
where: "orders.product_id == products.product_id"
on_miss: skip
on_miss: error is useful for strict referential integrity where any miss should halt processing. on_miss: skip is the inner-join shape. on_miss: null_fields is the left-join shape and the default.
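The three on_miss policies map onto familiar join shapes. A minimal Python sketch follows, assuming a single build input already indexed by product_id; `enrich` and `UnmatchedDriverError` are hypothetical names, not Clinker identifiers.

```python
class UnmatchedDriverError(Exception):
    """Raised by this sketch when on_miss is 'error' and a driver has no match."""

def enrich(orders, products_by_id, on_miss="null_fields"):
    if on_miss not in ("null_fields", "skip", "error"):
        raise ValueError(f"unknown on_miss: {on_miss}")
    out = []
    for order in orders:
        product = products_by_id.get(order["product_id"])
        if product is None:
            if on_miss == "skip":
                continue                               # inner-join shape
            if on_miss == "error":
                raise UnmatchedDriverError(order["order_id"])
            product = {"product_name": None}           # left-join shape
        out.append({
            "order_id": order["order_id"],
            "product_name": product["product_name"],
        })
    return out

orders = [
    {"order_id": "o1", "product_id": "p1"},
    {"order_id": "o2", "product_id": "missing"},
]
products_by_id = {"p1": {"product_name": "Widget"}}
```

Under null_fields both orders are emitted and the second carries a null product_name; under skip only o1 survives; under error the run stops at o2.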
Composite keys
Chain multiple cross-input equalities with and:
config:
where: |
sales.department == targets.department
and sales.region == targets.region
cxl: |
emit department = sales.department
emit region = sales.region
emit actual = sales.amount
emit goal = targets.goal
All conjuncts must hold for a record pair to match.
Multi-input combine (three or more)
Combine accepts any number of inputs. Each pair of inputs that should be related needs an explicit cross-input equality:
- type: combine
name: fully_enriched
input:
orders: orders
products: products
categories: categories
config:
where: |
orders.product_id == products.product_id
and products.category_id == categories.category_id
match: first
on_miss: null_fields
cxl: |
emit order_id = orders.order_id
emit product_name = products.product_name
emit category_name = categories.name
emit amount = orders.amount
propagate_ck: driver
The planner builds a join tree by walking equalities pairwise and ordering the joins by selectivity.
Choosing the driving input
The driver is the input whose records flow through one at a time during execution; the other inputs are materialized as build-side hash tables (or IEJoin index structures). By default the first input in the input: map is the driver.
Use drive: to override:
config:
where: "orders.product_id == products.product_id"
drive: products
cxl: |
emit product_id = products.product_id
emit product_name = products.product_name
emit sample_order_id = orders.order_id
With drive: products, the pipeline emits one row per product enriched with a matching order, instead of one row per order enriched with its product. Pick the driver based on which side you want to iterate over (typically the larger stream, or the one whose ordering you want to preserve).
Strategy hint
| Value | Behavior |
|---|---|
| auto (default) | Planner picks a strategy from the predicate shape. Hash join for equi predicates; IEJoin for pure-range predicates. |
| grace_hash | Force grace hash join (disk-spilling partitioned hash). Applies only to pure-equi predicates; ignored on predicates with range conjuncts. |
grace_hash is the right hint when build-side inputs are larger than the memory budget but fit on disk after partitioning. The planner falls back automatically to grace-hash spill when an in-memory hash table approaches the RSS soft limit, so strategy: grace_hash is mostly an explicit assertion for performance reasoning.
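For readers unfamiliar with the technique, here is the textbook grace hash join in miniature. It is a sketch of the general algorithm, not of Clinker’s spill code: the function name and the partition count are assumptions, and in-memory lists stand in for the partition files a real engine would write to disk.

```python
from collections import defaultdict

def grace_hash_join(drivers, builds, key, partitions=4):
    """Partition both sides by hash of the key, then join partition by partition."""
    driver_parts = defaultdict(list)
    build_parts = defaultdict(list)
    for row in drivers:
        driver_parts[hash(row[key]) % partitions].append(row)
    for row in builds:
        build_parts[hash(row[key]) % partitions].append(row)

    out = []
    for p in range(partitions):
        # Only one partition's build rows need to be resident at a time.
        table = defaultdict(list)
        for b in build_parts[p]:
            table[b[key]].append(b)
        for d in driver_parts[p]:
            for b in table.get(d[key], []):
                out.append((d, b))
    return out

orders = [
    {"order_id": 1, "product_id": "p1"},
    {"order_id": 2, "product_id": "p2"},
    {"order_id": 3, "product_id": "p1"},
]
products = [
    {"product_id": "p1", "name": "A"},
    {"product_id": "p2", "name": "B"},
]
joined = grace_hash_join(orders, products, "product_id")
```

Rows with equal keys always land in the same partition, so the per-partition joins together produce the same pairs as a single large hash join. Note that output order follows partition order rather than driver order in this sketch.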
Correlation-key propagation
Combine declares which correlation-key columns its output rows carry via the required propagate_ck field. The choice shapes both the combine’s compile-time output schema and the runtime record builder.
- type: combine
name: enriched
input:
orders: orders
products: products
config:
where: "orders.product_id == products.product_id"
cxl: |
emit order_id = orders.order_id
emit product_name = products.name
propagate_ck: driver # driver-only (today's behavior)
# Alternatives; a config carries exactly one propagate_ck entry:
# propagate_ck: all # union of every input's $ck.* columns
# propagate_ck:
#   named: [order_id] # explicit subset (intersected with upstream)
- driver – output schema carries only the driver input’s $ck.<field> columns. Build-side records contribute body fields; their CK identity is consumed by the match.
- all – output schema carries every input’s $ck.<field> columns; the runtime copies build-side values onto each output row alongside the body’s emit columns. Use when the build side carries CK fields downstream operators need to read.
- named: [<field>, ...] – explicit subset, intersected with what’s actually present upstream. Use to project a multi-field CK down to a single field after a join.
Driver wins on a name collision: if both the driver and a build input declare $ck.<field>, the column appears once on the output schema and the runtime keeps the driver’s value. See the Correlation-key combine interaction reference for match-mode interaction details (especially match: collect, where the propagated slot is single-valued and the array column preserves full lineage).
propagate_ck is required on every combine; pipelines without an explicit value fail to compile. Existing pipelines migrate by adding propagate_ck: driver, which is bit-for-bit equivalent to today’s behavior.
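The three policies and the driver-wins rule can be expressed as a small function over dictionaries of $ck.* columns. This Python sketch is an illustration under stated assumptions: `propagated_ck` is a hypothetical name, "upstream" for named is taken to mean the union of all inputs, and precedence between two colliding build inputs is not specified in the text, so the sketch simply lets the later one win.

```python
def propagated_ck(mode, driver_ck, build_cks, named=None):
    """Return the $ck.* columns an output row carries under one policy.

    driver_ck: dict of the driver's CK columns; build_cks: list of such dicts.
    """
    union = {}
    for ck in build_cks:
        union.update(ck)               # build-vs-build precedence is an assumption
    union.update(driver_ck)            # driver wins on a name collision
    if mode == "driver":
        return dict(driver_ck)
    if mode == "all":
        return union
    if mode == "named":
        wanted = {f"$ck.{field}" for field in (named or [])}
        return {col: val for col, val in union.items() if col in wanted}
    raise ValueError(f"unknown propagate_ck mode: {mode}")

driver = {"$ck.order_id": "o1"}
builds = [{"$ck.order_id": "stale", "$ck.product_id": "p9"}]
```

Under all, the output carries both columns and $ck.order_id keeps the driver’s value. Under named, a requested field that no input carries is silently absent, which is the "intersected with upstream" behavior.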
Memory considerations
Build-side inputs are materialized in memory as hash tables keyed by the equi columns. For each non-driving input, plan for roughly 1.5-2x the raw CSV size in heap. A 50 MB product catalog typically uses 75-100 MB of hash-table memory. Tune with pipeline.memory.limit at the pipeline level; see Memory Tuning for spill thresholds, the backpressure knob, and strategy overrides.
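The 1.5-2x rule of thumb can be turned into a quick budget check. The helper names below are hypothetical, and the multipliers are the approximate figures from the paragraph above rather than measured values.

```python
def build_side_heap_mb(raw_csv_mb, low=1.5, high=2.0):
    """Rule-of-thumb heap range (MB) for one build-side input."""
    return raw_csv_mb * low, raw_csv_mb * high

def total_build_heap_mb(raw_sizes_mb):
    """Sum the low and high estimates over every non-driving input."""
    lows, highs = zip(*(build_side_heap_mb(size) for size in raw_sizes_mb))
    return sum(lows), sum(highs)

catalog_low, catalog_high = build_side_heap_mb(50)   # the 50 MB catalog example
```

A combine with a 50 MB and a 20 MB build input would therefore be planned at roughly 105-140 MB of hash-table memory before any spill threshold is considered.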
Complete example
pipeline:
name: order_enrichment
nodes:
- type: source
name: orders
config:
name: orders
type: csv
path: "./data/orders.csv"
schema:
- { name: order_id, type: string }
- { name: product_id, type: string }
- { name: amount, type: float }
- type: source
name: products
config:
name: products
type: csv
path: "./data/products.csv"
schema:
- { name: product_id, type: string }
- { name: product_name, type: string }
- { name: category, type: string }
- type: combine
name: enrich
input:
orders: orders
products: products
config:
where: "orders.product_id == products.product_id"
match: first
on_miss: null_fields
cxl: |
emit order_id = orders.order_id
emit product_id = orders.product_id
emit product_name = products.product_name
emit category = products.category
emit amount = orders.amount
propagate_ck: driver
- type: output
name: result
input: enrich
config:
name: result
type: csv
path: "./output/enriched_orders.csv"
See also
- Multi-Input Combine – recipe-style walkthrough with input data and expected output.
- Merge Nodes – streamwise concatenation; the right operator when inputs share a schema and no per-record matching is needed.
- Memory Tuning – memory budget, spill thresholds, and strategy overrides.
Output Nodes
Output nodes write processed records to files. They are the terminal nodes of a pipeline – every pipeline path must end at an output (or records are silently dropped).
Basic structure
- type: output
name: result
input: transform_node
config:
name: output_stage
type: csv
path: "./output/result.csv"
The type: field selects the output format: csv, json, xml, fixed_width, edifact, or x12. The edifact and x12 writers reconstruct their EDI interchange envelopes around emitted records; see EDIFACT Format and X12 Format.
Field control
Output nodes can either pass every upstream field through to the writer or restrict output to the fields the upstream transform explicitly emitted. Several options control which fields appear and how they are named.
Unmapped input field passthrough
include_unmapped: false # Default: true
When true (the default), every field on an input record that the upstream transform did not explicitly emit still passes through to the output unchanged. This includes fields the source’s on_unmapped: auto_widen policy absorbed into the per-record $widened sidecar map – their contents expand back to top-level columns at the sink.
When false, only fields named by an emit statement in the upstream transform appear in the output. The $widened sidecar slot is stripped and undeclared input fields are dropped.
Migration notice
The default flipped from false to true in a recent release (see issue #90). Pipelines that relied on the previous behavior – where output records contained only the fields explicitly emitted upstream – must now set include_unmapped: false explicitly to restore that shape.
The flag composes independently with include_correlation_keys: true – see below. See Auto-Widen & Schema Drift -> Output controls for the full specification, cross-format flow examples, and the writer-rejection contract for Value::Map payloads.
Worked example
Suppose the upstream source emits records with order_id, customer_id, amount, and region, and a transform that emits only one derived field:
- type: transform
name: classify
input: orders
config:
cxl: |
emit amount_bucket = if amount >= 1000 then "high" else "low"
With include_unmapped: true (the default), each output record carries order_id, customer_id, amount, region, and amount_bucket. With include_unmapped: false, each output record carries only amount_bucket. The transform’s CXL is unchanged in both cases – the Output node decides the field set.
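The worked example reduces to a choice between two field sets. A minimal Python sketch, with hypothetical names and made-up record values; the column order shown (input fields first, emitted fields last) is an assumption.

```python
def output_fields(input_record, emitted, include_unmapped=True):
    """Field set the Output node writes for one record."""
    if include_unmapped:
        return {**input_record, **emitted}   # pass-through plus emitted fields
    return dict(emitted)                     # only explicitly emitted fields

record = {"order_id": "o1", "customer_id": "c7", "amount": 1200, "region": "emea"}
emitted = {"amount_bucket": "high" if record["amount"] >= 1000 else "low"}

with_passthrough = output_fields(record, emitted)
emitted_only = output_fields(record, emitted, include_unmapped=False)
```

`with_passthrough` carries five fields and `emitted_only` carries just amount_bucket, while the transform logic that computes the bucket is identical in both cases.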
Include correlation-key shadow columns
include_correlation_keys: true # Default: false
When the pipeline declares error_handling.correlation_key: <field>, the engine adds shadow columns named $ck.<field> to the schema. These shadow columns preserve correlation-group identity through transforms that may rewrite the user-declared field. They are an internal engine namespace and are stripped from output by default.
Set include_correlation_keys: true to surface the shadow columns in the writer output – typically for debugging correlation-group routing or auditing DLQ behavior. See Correlation Keys for the full lifecycle.
include_correlation_keys does not surface the $widened sidecar – include_unmapped is the separate flag for that. The two are independent: each, both, or neither can be set.
Writer rejection of Value::Map payloads
CSV, XML, fixed-width, EDIFACT, and X12 writers refuse records carrying a Value::Map payload at any column slot, raising FormatError::UnserializableMapValue { format, column }. JSON serializes Value::Map natively as a nested object.
The typical cause is a $widened sidecar reaching a non-JSON writer because the Output node set include_unmapped: false. See Auto-Widen & Schema Drift -> Writer rejection for the rejection contract and remediation routes.
Field mapping
Rename fields at output time without changing upstream CXL:
mapping:
"Customer Name": "full_name"
"Order Total": "amount"
Keys are output column names; values are the source field names from upstream.
Excluding fields
Remove specific fields from output:
exclude: [internal_id, _debug_flag, temp_calc]
Header control (CSV)
include_header: true # Default: true
Set to false to omit the CSV header row.
Null handling
preserve_nulls: false # Default: false
When false, null values are written as empty strings. When true, nulls are preserved in the output format’s native null representation (e.g., null in JSON).
Metadata inclusion
Control whether per-record $meta.* metadata fields appear in output:
include_metadata: all # Include all metadata fields
include_metadata: none # Default -- strip all metadata
include_metadata:
- source_file # Include only listed metadata keys
- source_row
Metadata fields are prefixed with meta. in the output.
Output format options
CSV
- type: output
name: csv_out
input: processed
config:
name: csv_out
type: csv
path: "./output/result.csv"
options:
delimiter: "|"
JSON
- type: output
name: json_out
input: processed
config:
name: json_out
type: json
path: "./output/result.json"
options:
format: ndjson # array | ndjson
pretty: true # Pretty-print JSON
- array (default) – writes a single JSON array containing all records.
- ndjson – writes one JSON object per line.
XML
- type: output
name: xml_out
input: processed
config:
name: xml_out
type: xml
path: "./output/result.xml"
options:
root_element: "data"
record_element: "row"
Fixed-width
- type: output
name: fw_out
input: processed
config:
name: fw_out
type: fixed_width
path: "./output/result.dat"
schema: "./schemas/output.schema.yaml"
options:
line_separator: crlf
Fixed-width output requires a format schema defining field positions and widths.
EDIFACT
- type: output
name: edi_out
input: messages
config:
name: edi_out
type: edifact
path: "./out/result.edi"
options:
interchange: ["UNOA:1", "SENDER", "RECEIVER", "240101:1200", "REF1"]
message_type: "ORDERS:D:96A:UN"
write_una: false
segment_newline: true
The EDIFACT writer reconstructs the interchange envelope around emitted
records, recomputing the UNT/UNZ control counts and echoing the
control references, and release-escapes any element data that carries a
service character. The UNB header comes from interchange (literal
elements) or interchange_from_doc (echoed from a $doc section). An
interchange is a single envelope, so an edifact output cannot be
combined with a split: block — the combination is rejected at
config-validation time (E323). See EDIFACT Format for the
full option reference, the record schema, and the round-trip semantics.
Sort order
Sort records before writing:
sort_order:
- { field: "name", order: asc }
- { field: "amount", order: desc, null_order: last }
| Sort option | Values | Default |
|---|---|---|
| order | asc, desc | asc |
| null_order | first, last, drop | last |
- first – nulls sort before all non-null values.
- last – nulls sort after all non-null values.
- drop – records with null sort keys are excluded from output.
Shorthand: a bare string defaults to ascending with nulls last:
sort_order:
- "name"
- { field: "amount", order: desc }
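The interaction of order, null_order, and the bare-string shorthand can be modeled with a chain of stable sorts. This Python sketch is one way to obtain the documented ordering, not a description of the engine’s sorter; `sort_records` is a hypothetical name.

```python
def sort_records(records, sort_order):
    """Apply a sort_order list; each entry is a field name or a dict."""
    rows = list(records)
    # Sort by the last key first so stable sorts compose into a multi-key sort.
    for entry in reversed(sort_order):
        spec = {"field": entry} if isinstance(entry, str) else entry
        field = spec["field"]
        descending = spec.get("order", "asc") == "desc"
        null_order = spec.get("null_order", "last")
        nulls = [r for r in rows if r.get(field) is None]
        values = [r for r in rows if r.get(field) is not None]
        values.sort(key=lambda r: r[field], reverse=descending)
        if null_order == "drop":
            rows = values                  # records with a null key are excluded
        elif null_order == "first":
            rows = nulls + values
        else:
            rows = values + nulls          # default: nulls last
    return rows

people = [
    {"name": "bo", "amount": 5},
    {"name": "al", "amount": None},
    {"name": "al", "amount": 9},
    {"name": "al", "amount": 3},
]
by_name_then_amount = sort_records(people, ["name", {"field": "amount", "order": "desc"}])
```

In the result, the three "al" rows come first with amounts 9, 3, and then the null, followed by "bo". Nulls land last within their name group even though the amount key is descending, because null placement is independent of sort direction.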
File splitting
Split output into multiple files based on record count, byte size, or group boundaries:
- type: output
name: split_output
input: processed
config:
name: split_output
type: csv
path: "./output/result.csv"
split:
max_records: 10000
max_bytes: 10485760 # 10 MB
group_key: "department" # Never split mid-group
naming: "{stem}_{seq:04}.{ext}"
repeat_header: true # Repeat CSV header in each file
oversize_group: warn # warn | error | allow
Split configuration fields
| Field | Required | Default | Description |
|---|---|---|---|
| max_records | No | – | Soft record count limit per file |
| max_bytes | No | – | Soft byte size limit per file |
| group_key | No | – | Field name – never split within a group sharing this key value |
| naming | No | "{stem}_{seq:04}.{ext}" | File naming pattern. {stem} is the base name, {seq:04} is a zero-padded sequence number, {ext} is the file extension |
| repeat_header | No | true | Repeat CSV header row in each split file |
| oversize_group | No | warn | What to do when a single key group exceeds file limits |
At least one of max_records or max_bytes should be specified for splitting to have any effect.
Oversize group policies
- warn (default) – log a warning and allow the oversized file.
- error – stop the pipeline.
- allow – silently allow the oversized file.
When group_key is set, the split point is the first group boundary after the threshold is reached (greedy). Without group_key, files are split at the exact limit.
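The greedy split rule is easier to follow on a concrete sequence. The Python sketch below models max_records only (max_bytes and oversize_group are left out), assumes records arrive already grouped by the key, and uses a hypothetical function name.

```python
def split_records(records, max_records, group_key=None):
    """Cut records into files; with group_key, cut only at group boundaries."""
    files, current = [], []
    for rec in records:
        if len(current) >= max_records:        # threshold already reached
            at_boundary = (
                group_key is None
                or rec[group_key] != current[-1][group_key]
            )
            if at_boundary:
                files.append(current)
                current = []
        current.append(rec)
    if current:
        files.append(current)
    return files

rows = [{"department": d} for d in ["A", "A", "A", "B", "B", "C"]]
grouped = split_records(rows, max_records=2, group_key="department")
ungrouped = split_records(rows, max_records=2)
```

With group_key set, the first file grows to three records because department A has not ended when the limit of two is reached. Without it, every file holds exactly two records.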
Streaming writes under fused Merge.interleave
When a single Output sits directly downstream of a Merge whose mode is interleave and whose every direct predecessor is a Source, the executor takes a streaming path: a bounded tokio::sync::mpsc::channel connects the Merge arm to the writer task, and Writer::write_record fires per record as Merge emits, concurrent with Merge production.
The buffered alternative, which still runs for every other Output topology, waits until the Merge arm has accumulated every record before invoking the writer. With a slow upstream Source, that buffering defeats the live back-pressure that the Merge.interleave fusion provides at the Source-channel layer: each record sits in node_buffers[merge] until the slow Source finishes.
Topology
- type: source
name: src_a
config: { type: csv, path: a.csv, schema: ... }
- type: source
name: src_b
config: { type: csv, path: b.csv, schema: ... }
- type: merge
name: merged
inputs: [src_a, src_b]
config:
mode: interleave # required
- type: output
name: out
input: merged
config:
name: out
type: csv
path: out.csv
The streaming path is selected automatically — there is no opt-in setting. Pipelines that don’t match the topology keep the buffered path.
Eligibility
Every condition must hold for the streaming path to engage; if any fails, the buffered path runs:
- The Output has exactly one incoming edge, and that predecessor is a Merge with mode: interleave.
- Every direct predecessor of that Merge is a Source (same predicate the fused Merge.interleave arm uses for its live tokio::select!).
- The Merge has no other downstream consumer besides this one Output (no fan-out).
- The Output is not in the init-phase ancestor closure.
- The OutputConfig has no split: block – splitting writers manage their own file rotation lifecycle.
- The writer is registered in the single-file writer registry (not fan_out_per_source_file).
- No Source in the pipeline declares a correlation key – the correlation-buffered output path defers writes to CorrelationCommit and is incompatible with per-record write.
Back-pressure flow
Under the streaming path, back-pressure flows end-to-end:
writer slow → mpsc::Sender::send().await yields
→ Merge arm yields
→ Source mpsc::Receiver fills
→ Source ingest task blocks on send
The bounded handoff channel between Merge and Output (256 slots) and the existing per-Source ingest channels (issue #67) form a single pace-bound chain from the underlying Write sink back to the source reader. A slow file system, a saturated network sink, or a deliberately-paced writer no longer accumulates records in pipeline-internal Vecs; the upstream readers slow down to match.
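The mechanism behind this chain is an ordinary bounded queue: once it is full, the producer cannot hand off another record until the consumer takes one. The Python sketch below shows only that mechanism. It uses the standard library’s `queue.Queue` with a capacity of 2 in place of the 256-slot tokio channel, and a non-blocking send so the full state is observable without threads; in the engine a full channel makes the sender wait rather than fail.

```python
import queue

handoff = queue.Queue(maxsize=2)   # tiny stand-in for the bounded handoff channel

def try_send(channel, record):
    """Return True if the record was accepted, False if the channel is full."""
    try:
        channel.put_nowait(record)
        return True
    except queue.Full:
        return False

accepted = [try_send(handoff, n) for n in range(3)]   # third send finds it full
drained = handoff.get_nowait()                        # the writer consumes one
accepted_after_drain = try_send(handoff, 3)           # the producer may resume
```

Chaining several such queues gives the pace-bound behavior described above: the slowest consumer determines how fast every upstream stage can make progress.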
Counter semantics
Counter behavior under the streaming path matches the buffered Output arm exactly: records_written increments once per Writer::write_record call, ok_count counts distinct source row_nums reaching the Output, and dlq_count is unaffected (DLQ entries originate upstream). Stage metrics (SchemaScan, Write, Projection) accumulate into the same fields the buffered path uses; the dispatcher folds the streaming task’s per-task accounting back into the run-wide totals at end of DAG.
Complete example
- type: output
name: department_reports
input: enriched_employees
config:
name: department_reports
type: csv
path: "./output/employees.csv"
mapping:
"Employee ID": "employee_id"
"Full Name": "display_name"
"Department": "department"
"Annual Salary": "salary"
exclude: [internal_flags]
include_header: true
sort_order:
- { field: "department", order: asc }
- { field: "display_name", order: asc }
split:
max_records: 5000
group_key: "department"
naming: "employees_{seq:03}.csv"
repeat_header: true
Error Handling & DLQ
Clinker provides structured error handling with a dead-letter queue (DLQ) for records that fail processing. The error_handling: block at the top level of the pipeline YAML controls the behavior.
Configuration
error_handling:
strategy: continue
dlq:
path: "./output/errors.csv"
include_reason: true
include_source_row: true
Strategies
The strategy: field controls what happens when a record fails:
| Strategy | Behavior |
|---|---|
| fail_fast | Default. Stop the pipeline on the first error. |
| continue | Route bad records to the DLQ and keep processing good records. |
| best_effort | Continue processing with partial results, even if some stages produce incomplete output. |
fail_fast
The safest strategy. Any record-level error (type coercion failure, validation error, missing required field) halts the pipeline immediately. Use this when data quality is critical and you prefer to fix issues before reprocessing.
continue
The production workhorse. Bad records are written to the DLQ file with diagnostic metadata, and the pipeline continues processing remaining records. After the run completes, inspect the DLQ to understand and correct failures.
A pipeline that completes with DLQ entries exits with code 2 – this signals “pipeline completed successfully but some records were rejected.” It is not a crash or internal error.
best_effort
The most lenient strategy. Processing continues even with partial results. Use this for exploratory data analysis where completeness is less important than progress.
DLQ configuration
The DLQ is always written as CSV, regardless of the pipeline’s input/output formats.
dlq:
path: "./output/errors.csv"
include_reason: true
include_source_row: true
| Field | Required | Default | Description |
|---|---|---|---|
| path | No | – | File path for DLQ output. If omitted, DLQ records are logged but not written to file. |
| include_reason | No | – | Include _cxl_dlq_error_category and _cxl_dlq_error_detail columns. |
| include_source_row | No | – | Include original source fields alongside DLQ metadata. |
DLQ columns
Every DLQ record includes these metadata columns:
| Column | Description |
|---|---|
| _cxl_dlq_id | UUID v7 (time-ordered unique identifier) |
| _cxl_dlq_timestamp | RFC 3339 timestamp of when the error occurred |
| _cxl_dlq_source_file | Input filename that produced the failing record |
| _cxl_dlq_source_row | 1-based row number in the source file |
| _cxl_dlq_stage | Name of the transform or aggregate node where the error occurred |
| _cxl_dlq_route | Route branch name (if the error occurred after routing) |
| _cxl_dlq_trigger | Validation rule name that triggered the rejection |
When include_reason: true is set, two additional columns appear:
| Column | Description |
|---|---|
| _cxl_dlq_error_category | Machine-readable error classification |
| _cxl_dlq_error_detail | Human-readable error description |
Error categories
The _cxl_dlq_error_category column contains one of these values:
| Category | Description |
|---|---|
| missing_required_field | A required field is absent from the record |
| type_coercion_failure | A value could not be converted to the expected type |
| required_field_conversion_failure | A required field exists but its value cannot be converted |
| nan_in_output_field | A computation produced NaN |
| aggregate_type_error | An aggregate function received an incompatible type |
| validation_failure | A declarative validation check failed |
| aggregate_finalize | An aggregate function failed during finalization |
| correlated | A non-failing record was DLQ’d as collateral because another record in its correlation group failed |
| group_size_exceeded | A correlation-key group exceeded the configured max_group_buffer limit |
| late_record | A record arrived at a time-windowed aggregate after its event-time window had already closed |
| expansion_limit_exceeded | A transform’s emit each fan-out produced more output records than its max_expansion ceiling allows |
| combine_output_row | A Combine output-stage eval failed for one driver row (probe-key, residual, or matched / on_miss: null_fields body); the entry carries the contributing-build lineage and rewinds both the driver and matched build source’s rollback cursor. Routed to the DLQ under continue / best_effort across every Combine join mode; fail_fast propagates the eval error |
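Because the DLQ is plain CSV, a first triage pass can be a tally of rows per category. The sketch below assumes include_reason: true was set (so the category column exists); the sample rows, including the placeholder ids and detail strings, are hypothetical and only the column names come from the tables above.

```python
import csv
import io
from collections import Counter

# Hypothetical DLQ content; only the column names are taken from the docs.
SAMPLE_DLQ = """\
_cxl_dlq_id,_cxl_dlq_source_row,_cxl_dlq_stage,_cxl_dlq_error_category,_cxl_dlq_error_detail
id-1,4,validate_orders,validation_failure,Order amount must be positive
id-2,5,validate_orders,correlated,collateral of row 4
id-3,9,validate_orders,type_coercion_failure,cannot convert value to float
"""


def count_by_category(dlq_file) -> Counter:
    """Tally DLQ rows per _cxl_dlq_error_category value."""
    reader = csv.DictReader(dlq_file)
    return Counter(row["_cxl_dlq_error_category"] for row in reader)


counts = count_by_category(io.StringIO(SAMPLE_DLQ))
```

For a real run, passing an open file handle for the configured dlq.path in place of the StringIO should work the same way.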
Advanced options
Type error threshold
Abort the pipeline if the fraction of failing records exceeds a threshold:
type_error_threshold: 0.05 # Abort if >5% of records fail
This acts as a circuit breaker – if your input data is unexpectedly corrupt, the pipeline stops early rather than filling the DLQ with millions of entries.
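The arithmetic behind the circuit breaker can be sketched as follows. This is an illustration only: the function name is hypothetical, the strict greater-than comparison is inferred from the ">5%" comment above, and when the engine actually evaluates the check during a run is not specified here.

```python
def exceeds_type_error_threshold(failed: int, total: int, threshold: float) -> bool:
    """Return True when the failing fraction is strictly above the threshold."""
    if total == 0:
        # No records seen yet, so there is no fraction to compare.
        return False
    return failed / total > threshold
```

With type_error_threshold: 0.05, five failures in 100 records stay under the line and six trip it.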
Correlation key
Group DLQ rejections by a key field, declared with correlation_key: on a source’s config: block (see Correlation Keys). When any record in a correlation group fails, the records that the failing source contributed to that group are routed to the DLQ:
correlation_key: order_id
For compound keys:
correlation_key: [order_id, customer_id]
This is useful for transactional data where partial processing of a group is worse than rejecting the entire group. For example, if one line item in an order fails validation, you may want to reject the entire order.
Under multi-source ingest, the collateral fan-out narrows to the failing source: a src_b trigger does NOT DLQ records from src_a that share the same correlation key. Single-source pipelines see bit-identical behavior to today’s pipeline-wide collateral DLQ. See Per-source rollback narrowing for the full semantics and the two documented exceptions (max_group_buffer overflow and Combine output failures).
For the full lifecycle and per-operator semantics (route, merge, aggregate, combine), see Correlation Keys.
Max group buffer
Limit the number of records buffered per correlation group:
max_group_buffer: 100000 # Default: 100,000
Groups exceeding this limit are DLQ’d entirely with a group_size_exceeded summary entry.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Pipeline completed successfully, no errors |
| 1 | Pipeline failed (internal error, config error, or fail_fast triggered) |
| 2 | Pipeline completed, but DLQ entries were produced |
Exit code 2 is not a failure – it means the pipeline ran to completion and handled errors according to the configured strategy. Check the DLQ file for details.
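A wrapper script or scheduler hook can encode this table directly. The sketch below is a hypothetical helper, not part of Clinker; it assumes any code outside 0 to 2 (for example, termination by a signal) should also count as a failure.

```python
from enum import IntEnum


class ClinkerExit(IntEnum):
    OK = 0           # completed, no errors
    FAILED = 1       # internal error, config error, or fail_fast triggered
    OK_WITH_DLQ = 2  # completed, but DLQ entries were produced


def run_succeeded(returncode: int) -> bool:
    """Treat 0 and 2 as a completed run; anything else is a failure."""
    return returncode in (ClinkerExit.OK, ClinkerExit.OK_WITH_DLQ)


def needs_dlq_review(returncode: int) -> bool:
    """Exit code 2 means the run finished but the DLQ file has entries."""
    return returncode == ClinkerExit.OK_WITH_DLQ
```

The returncode would typically come from subprocess.run(...).returncode for whatever command line launches the pipeline.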
Complete example
pipeline:
name: order_processing
memory: { limit: "512M" }
nodes:
- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: "./data/orders.csv"
    correlation_key: order_id
    schema:
      - { name: order_id, type: int }
      - { name: customer_id, type: int }
      - { name: amount, type: float }
      - { name: email, type: string }
- type: transform
  name: validate_orders
  input: orders
  config:
    cxl: |
      emit order_id = order_id
      emit customer_id = customer_id
      emit amount = amount
      emit email = email
    validations:
      - field: email
        check: "not_empty"
        severity: error
        message: "Customer email is required"
      - check: "amount > 0"
        severity: error
        message: "Order amount must be positive"
- type: output
  name: valid_orders
  input: validate_orders
  config:
    name: valid_orders
    type: csv
    path: "./output/valid_orders.csv"
error_handling:
  strategy: continue
  dlq:
    path: "./output/rejected_orders.csv"
    include_reason: true
    include_source_row: true
  type_error_threshold: 0.10
Correlation Keys
A correlation key declares a set of records from a single source as an atomic group: if any record in the group fails validation or processing, the whole group is sent to the DLQ. This is the right shape for transactional data where partial processing is worse than total rejection – the canonical example is an order with multiple line items where one bad line should reject the entire order.
This page describes the full lifecycle of a correlation key and how it interacts with each operator that can fan out, fan in, group, or join records.
Declaration
Correlation keys are declared per source. Each source’s config: block carries an optional correlation_key: field naming the column (or list of columns) whose value identifies a record’s correlation group within that source. The engine widens each declaring source’s schema with one $ck.<field> shadow column per field and stamps the user-declared value into it at ingest.
nodes:
  - type: source
    name: orders
    config:
      name: orders
      type: csv
      path: ./data/orders.csv
      correlation_key: order_id
      schema:
        - { name: order_id, type: string }
        - { name: amount, type: int }
  - type: source
    name: customers
    config:
      name: customers
      type: csv
      path: ./data/customers.csv
      correlation_key: [customer_id, region]
      schema:
        - { name: customer_id, type: string }
        - { name: region, type: string }
        - { name: name, type: string }
  - type: source
    name: sensor_readings
    config:
      name: sensor_readings
      type: csv
      path: ./data/sensors.csv
      # No correlation_key field declared. This source carries no
      # $ck.* widening; record-level errors land in the DLQ as
      # standalone entries with no group atomicity.
      schema:
        - { name: ts, type: date_time }
        - { name: value, type: float }
A record’s correlation group is identified by the tuple of values for that source’s listed fields. Records sharing the same tuple within the same source belong to the same group. There is no pipeline-level correlation key — the previous error_handling.correlation_key: field has been removed; pipelines that previously declared it move the field down to each contributing source.
A source whose declared correlation_key: field names a column not present in its own schema: block is rejected at compile time with diagnostic E153.
Lifecycle
The engine adds a shadow column named $ck.<field> (one per correlation-key field) to every declaring source’s schema and copies the field’s value into it at ingest. From that point on, the shadow column is the authoritative group identity – if a downstream transform rewrites the user-declared correlation field, the shadow column is untouched and the group identity is preserved.
Shadow columns are an internal engine namespace. You never write $ck.<field> in YAML or CXL – the engine manages them. They are stripped from default writer output. To surface them for debugging, set include_correlation_keys: true on an output node:
- type: output
  name: debug_out
  input: validate
  config:
    name: debug_out
    type: csv
    path: "./debug.csv"
    include_correlation_keys: true
Multi-source pipelines
Different sources can declare different correlation-key fields. The engine treats each source’s CK identity as locally consistent: a record from customers is a member of the customer-id group named in its row, and a record from orders is a member of the order-id group named in its row, regardless of whether customer_id appears in orders or vice versa. Combine and Merge nodes that join across sources negotiate which CK columns survive into the joined output via the Combine node’s propagate_ck: field (see Combine interaction below).
A source that declares no correlation_key: carries no $ck.* widening. Records from such a source flow through the pipeline without group identity; per-record errors DLQ on a per-record basis with no group fan-out. The orchestrator’s relaxed-aggregate retraction protocol still activates if any other source on the same DAG carries a CK field that an aggregate’s group_by omits — the retraction protocol scope is the DAG’s lattice of $ck.* columns, not any single source’s declaration.
DLQ semantics
When a record fails inside a correlation group:
- The failing record produces a trigger DLQ entry. Its category reflects the actual failure (e.g. type_coercion_failure, validation_failure).
- Every other record from a source that contributed a trigger to the same group produces a collateral DLQ entry. Collaterals carry the category correlated.
- Records belonging to other (clean) groups proceed normally.
A record with a null value for the correlation-key field is treated as its own per-record group: it has no peers and DLQ atomicity does not span multiple records.
A Combine output-row eval failure that the engine recovers from (under continue / best_effort, in the hash build-probe inline arm) produces entries under the combine_output_row category — distinct from the upstream-Transform type_coercion_failure because the entry carries the contributing-build lineage and rewinds both the driver and the matched build source’s rollback cursor. See Per-source rollback narrowing below for the cursor-rewind detail.
The dlq_count counter sums triggers and collaterals.
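The trigger / collateral split above can be modelled in a few lines. This is a simplified single-source sketch, not the engine's implementation: the function and field names are hypothetical, None stands in for a null correlation-key value, and the fan-out policy is assumed to DLQ every peer in a dirty group.

```python
from collections import defaultdict


def dlq_entries(records):
    """Map DLQ'd row numbers to their category; absent rows proceed normally.

    records: dicts with 'row', 'key' (None for a null key), and
    'error' (a category string, or None when the record is clean).
    """
    groups = defaultdict(list)
    for rec in records:
        # A null key forms its own per-record group, so it never has peers.
        if rec["key"] is None:
            group_id = ("null", rec["row"])
        else:
            group_id = ("key", rec["key"])
        groups[group_id].append(rec)

    dlq = {}
    for members in groups.values():
        if any(m["error"] for m in members):
            for m in members:
                # Triggers keep their own category; clean peers are collaterals.
                dlq[m["row"]] = m["error"] or "correlated"
    return dlq


sample = [
    {"row": 1, "key": "ORD-1", "error": None},
    {"row": 2, "key": "ORD-1", "error": "validation_failure"},
    {"row": 3, "key": "ORD-2", "error": None},
    {"row": 4, "key": None, "error": "type_coercion_failure"},
    {"row": 5, "key": None, "error": None},
]
result = dlq_entries(sample)
dlq_count = len(result)  # triggers plus collaterals
```

Row 1 is dragged in as a collateral of row 2, row 3's group stays clean, and the two null-keyed rows are judged independently.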
Per-source rollback narrowing
When two sources contribute records to the same correlation group, a failure originating from one source does NOT collaterally DLQ records from the OTHER source. The collateral fan-out is scoped to the failing source’s records only.
Concretely, consider [src_a, src_b] → merge → tfm → out with both sources declaring correlation_key: id. A mid-stream Transform error fires on every src_b record but leaves src_a records untouched:
- type: transform
  name: tfm
  input: m
  config:
    cxl: |
      emit id = id
      emit ratio = if($source.name == "src_b") then (1 / 0) else amt
Under per-source rollback, the dirty correlation group for each id value contains:
- One trigger DLQ entry: the src_b row that hit 1 / 0.
- The src_a row sharing the same id is spared and reaches the output.
The engine identifies origin per record via the engine-stamped $source.name column. Within the failing source’s records, the existing CorrelationFanoutPolicy (Any / All / Primary) determines which records DLQ — the policy semantics are unchanged. Single-source pipelines see bit-identical behavior to the pre-narrowing engine because every co-grouped record shares the failing source by construction.
Records that carry no single-source attribution — synthetic aggregate emits and Combine output rows — are NOT spared by per-source narrowing. They flow through the existing collateral path because their stamp falls back to the merged-source identity which is ambiguous about origin.
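A minimal model of the narrowing rule, under stated assumptions: every record carries a single-source stamp (so synthetic aggregate emits and Combine output rows are out of scope), and the fan-out policy DLQs every same-source peer of a trigger. The function name and record shape are hypothetical.

```python
def narrowed_dlq(records):
    """Return (source, row, role) for each DLQ'd record.

    records: dicts with 'source', 'row', 'key', and a boolean 'failed'.
    """
    # A (key, source) pair is dirty when that source contributed a trigger.
    dirty = {(r["key"], r["source"]) for r in records if r["failed"]}
    dlq = []
    for r in records:
        if (r["key"], r["source"]) in dirty:
            role = "trigger" if r["failed"] else "collateral"
            dlq.append((r["source"], r["row"], role))
    return dlq


merged = [
    {"source": "src_a", "row": 1, "key": 7, "failed": False},
    {"source": "src_b", "row": 1, "key": 7, "failed": True},
    {"source": "src_b", "row": 2, "key": 7, "failed": False},
    {"source": "src_b", "row": 3, "key": 8, "failed": False},
]
narrowed = narrowed_dlq(merged)
```

The src_a row that shares key 7 is absent from the result, while the clean src_b row in the same group is pulled in as a collateral.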
The engine also surfaces a per_source_rollback_cursors map on the ExecutionReport, keyed by source name and carrying the highest source row number that cleanly exited a forward operator. The map advances per record at the clean exit of Transform / Route / Aggregate, and rewinds per contributing source on max_group_buffer overflow to the lowest row_num any group member of that source contributed. Sources whose records all DLQ never land in the map. The map is the replay anchor for per-source resume: a downstream rerun reads each source’s cursor as the floor for what must be reprocessed.
On max_group_buffer overflow, every record in the overflowing group still lands in DLQ (one GroupSizeExceeded trigger plus per-row collaterals), but the per-source rollback cursor rewinds independently per contributing source. Attributing the overflow failure itself to one source would be a fiction — every contributing source shared blame proportionally — so the DLQ shape stays group-wide while the rewind narrows per source.
The relaxed-CK aggregator’s per-row lineage carries (row_id, source_name) pairs so a finalize-time retract scoped to one source rewinds only that source’s contributions to each affected group. The source half of the pair is load-bearing under multi-source ingest: each source numbers its rows from its own monotonic counter, so two sources that both feed the same aggregate group can contribute records at identical row_id values. Pairing the row id with its source keeps src_a’s row 1 distinct from src_b’s row 1 when both land in one group, so a retract that must remove both reaches each one instead of collapsing the colliding ids and stranding the second source’s contribution.

Combine input snapshots are captured at fold start and cleared at every Combine arm’s exit (inline, IEJoin, GraceHash, SortMerge). When a Combine output-row eval fails recoverably under continue / best_effort in the hash build-probe (inline) arm (a probe-key, residual-filter, or matched / on_miss: null_fields body failure on one driver row), the snapshot restores each contributing source’s rollback cursor to the value it held at the start of the fold (its pre-fold floor), lowering the cursor only if it had since advanced, then routes the row to the DLQ under the combine_output_row category. Only the sources that fed the failing row rewind; co-folded sources that did not contribute keep their forward progress. The IEJoin, grace-hash, and sort-merge arms propagate an output-eval failure as fail-fast regardless of strategy.
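The row-id collision that the (row_id, source_name) pairing guards against is easy to reproduce with plain dictionaries. The data and variable names below are hypothetical; the sketch only shows why a lineage map keyed by row id alone loses a contribution when two sources number their rows independently.

```python
# (source, row_id, amount) contributions to one aggregate group.
contributions = [("src_a", 1, 10), ("src_b", 1, 5), ("src_a", 2, 7)]

# Keyed by row id alone: src_b's row 1 overwrites src_a's row 1.
by_row_id = {row_id: amount for _source, row_id, amount in contributions}

# Keyed by (row_id, source_name): every contribution stays addressable.
by_pair = {(row_id, source): amount for source, row_id, amount in contributions}

# Retracting both row-1 contributions reaches each one through the pair key.
retracted = by_pair.pop((1, "src_a")) + by_pair.pop((1, "src_b"))
remaining_total = sum(by_pair.values())
```

With the pair key the group total drops from 22 to 7; with the row-id key only one of the two row-1 amounts is still recorded to retract.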
Group buffering
The engine buffers records per correlation group until either the group completes (all source records observed) or a failure triggers a flush. The max_group_buffer: field on the pipeline-level error_handling: block caps per-group buffering across every source’s groups:
error_handling:
  max_group_buffer: 100000 # Default: 100,000
Groups that exceed the cap are DLQ’d entirely with a group_size_exceeded trigger plus a collateral entry per buffered record. This is a backpressure boundary, not a hard error.
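The overflow shape (one trigger plus one collateral per buffered record) can be sketched as below. This is a simplification that looks at a whole group at once rather than streaming it; the function name is hypothetical, the limit is assumed to be exceeded only when the group is strictly larger than max_group_buffer, and collaterals are assumed to carry the correlated category like other collaterals.

```python
def flush_group(rows, max_group_buffer=100_000):
    """Return (rows_released_downstream, dlq_entries) for one group."""
    if len(rows) <= max_group_buffer:
        return list(rows), []
    # One summary trigger for the group, then a collateral per buffered row.
    dlq = [("group_size_exceeded", None)]
    dlq.extend(("correlated", row) for row in rows)
    return [], dlq


released, dlq = flush_group(["r1", "r2", "r3", "r4"], max_group_buffer=3)
```

A group of four rows against a cap of three releases nothing and yields five DLQ entries.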
Compile-time constraints
Two compile-time invariants are enforced:
- CK field must exist in source schema (E153). A source that declares correlation_key: <field> must list <field> in its own schema: block; otherwise the engine emits E153 pointing at the offending source declaration. The remediation is to either add the field to the source’s schema: block or remove the field from correlation_key:.
- Arena execution incompatible. The arena-evaluated execution path is incompatible with correlation grouping. Combinations are rejected at compile time.
Aggregates whose group_by covers the upstream CK lattice stay on the strict-collateral path; aggregates that omit any CK field visible upstream activate the retraction protocol automatically. Authors do not configure this — the engine inspects the configuration and picks the correct path. See Aggregate interaction below.
Per-operator interactions
Transform interaction
A transform that rewrites the user-declared correlation-key field does not change a record’s group identity. The shadow column captured at ingest is what the buffer-key extractor reads, not the live field value.
- type: source
  name: orders
  config:
    correlation_key: order_id
    schema:
      - { name: order_id, type: string }
      - { name: amount, type: float }
# At ingest each record gets $ck.order_id = order_id
- type: transform
  name: anonymize
  input: orders
  config:
    cxl: |
      emit order_id = "REDACTED" # writes the live field
      emit amount = amount
# Group identity is still the original order_id from $ck.order_id;
# anonymize does not collapse records into a single null-keyed group.
This makes the correlation-key declaration robust against routine field-rewrite logic in transforms.
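The same behavior in miniature, with records as plain dictionaries. The helper names are hypothetical and the sketch only mirrors the described contract: the shadow column is stamped once at ingest and group identity is read from it, never from the live field.

```python
def ingest(record, correlation_key):
    """Copy each correlation-key field into a $ck.<field> shadow column."""
    stamped = dict(record)
    for field in correlation_key:
        stamped[f"$ck.{field}"] = record[field]
    return stamped


def group_identity(record, correlation_key):
    """Read the group identity from the shadow columns, not the live fields."""
    return tuple(record[f"$ck.{field}"] for field in correlation_key)


row = ingest({"order_id": "ORD-42", "amount": 9.5}, ["order_id"])
row["order_id"] = "REDACTED"  # what the anonymize transform does
identity = group_identity(row, ["order_id"])
```

After the rewrite the live field reads REDACTED, but the identity tuple still holds the original order id.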
Route interaction (fan-out)
A correlation group can span multiple route branches. Group atomicity is preserved across branches: if any record in the group fails (in any branch’s transform, or in the route predicate itself), the entire group is rejected from every branch.
- type: route
  name: split
  input: validate
  config:
    mode: inclusive
    conditions:
      a: 'priority == "high"'
      b: 'priority == "low"'
    default: a
- type: output
  name: out_a
  input: split.a
  config: { ... }
- type: output
  name: out_b
  input: split.b
  config: { ... }
For an inclusive route where one record reaches both branches, a single failure in the source still DLQ’s that source row exactly once – not once per (row, output) pair. The group identity dedupes the DLQ entries at the source-row level.
A route predicate that itself fails to evaluate (e.g. type error inside the condition expression) is treated like any other failure: it triggers DLQ atomicity for the whole correlation group.
Merge interaction (fan-in)
Merge concatenates upstream branches that share a schema. Each record carries its $ck.<field> shadow column unchanged through the merge. Groups originating from different upstream sources but sharing the same correlation-key value are treated as a single correlation domain downstream:
- type: source
  name: east_orders
  config: { ... }
- type: source
  name: west_orders
  config: { ... }
- type: merge
  name: all_orders
  inputs: [east_orders, west_orders]
  config: {}
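If east_orders and west_orders both contain rows for order_id = ORD-42, all of those rows are members of the same correlation group post-merge. Under per-source rollback narrowing, a failure on one of them DLQ’s the failing source’s rows in that group, while rows from the other source are spared, apart from the documented group-wide exceptions (see Per-source rollback narrowing).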
Aggregate interaction
When an aggregate’s group_by covers every CK field visible upstream, the aggregate stays on the strict-collateral path: each emitted row inherits the correlation identity of its inputs and any DLQ trigger in the group rolls back every record in the group, including the aggregate output row. This is the zero-overhead default.
- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: ./data/orders.csv
    correlation_key: order_id
    schema:
      - { name: order_id, type: string }
      - { name: amount, type: int }
- type: aggregate
  name: order_totals
  input: orders
  config:
    group_by: [order_id] # strict -- covers the upstream CK
    cxl: |
      emit total = sum(amount)
When an aggregate’s group_by omits any CK field visible upstream, the engine routes the aggregate through the retraction protocol automatically. A single correlation group may span multiple aggregate groups; CK fields omitted from group_by stop being visible to downstream consumers of this aggregate’s output as user-named columns. Authors do not configure this — the engine inspects the configuration and picks the correct path.
- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: ./data/orders.csv
    correlation_key: order_id
    schema:
      - { name: order_id, type: string }
      - { name: department, type: string }
      - { name: amount, type: int }
- type: aggregate
  name: dept_totals
  input: orders
  config:
    group_by: [department] # retraction protocol is active
    cxl: |
      emit total = sum(amount)
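The classification rule behind the two examples above reduces to a set comparison. The sketch below is illustrative only; the function name and return shape are hypothetical, and it treats the upstream CK lattice as a flat list of field names.

```python
def aggregate_path(group_by, upstream_ck_fields):
    """Return ('strict', set()) when group_by covers every upstream CK field,
    otherwise ('retraction', omitted_fields)."""
    omitted = set(upstream_ck_fields) - set(group_by)
    if omitted:
        return "retraction", omitted
    return "strict", set()


strict_case = aggregate_path(["order_id"], ["order_id"])
relaxed_case = aggregate_path(["department"], ["order_id"])
```

order_totals lands on the strict path; dept_totals omits order_id and so lands on the retraction path.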
Aggregate output rows on the strict path inherit the correlation meta of the records that fed them. If any input record in a correlation group fails, the surviving records in that group still flow through the aggregator and produce one aggregate row – but that aggregate row is itself DLQ’d as a collateral and never reaches the writer.
On the retraction path, the engine retracts only the failing records and refinalizes affected groups, so the aggregate output row reflects the surviving contributions. The retraction protocol’s runtime constraint (E15Y for strategy: streaming on a retraction-mode aggregate) is enforced automatically once the engine has classified the aggregate. Operators downstream of a retraction-mode aggregate run only at commit time on the post-recompute aggregate emits, so non-deterministic CXL builtins (e.g. now) evaluate exactly once per output row and need no special-casing.
Synthetic correlation column
A retraction-mode aggregate emits one engine-managed $ck.aggregate.<name> column on its output schema, alongside the user-emitted bindings. The column carries the aggregator’s per-group index at finalize and is the lineage hook that lifts the post-aggregate retract path: a Transform or Output that fails on an aggregate output row carries the synthetic column on the failing record, the orchestrator’s detect phase decodes the index back to the contributing source row ids via the retained aggregator’s input_rows table, and the recompute phase retracts those source rows just as it would retract a directly-failing source record. Authors never write or read $ck.aggregate.<name> — the column is hidden from default writer output (mirroring the source-CK shadow column posture) and lives outside any user-visible CXL surface.
Where retraction triggers are sourced
Retraction is fine-grained for failures upstream of a retraction-mode aggregate (Source ingest, Transform evaluation, Combine probe, Validation): the failing record carries $ck.<field> shadow columns, the engine identifies its correlation group from those columns, and retract_row removes that record’s specific contribution from every affected aggregate group while leaving every other contributing record intact.
Failures downstream of a retraction-mode aggregate (a Transform that fails on an aggregate output row, an Output writer that rejects an aggregate row) carry the synthetic $ck.aggregate.<name> lineage column described above. The detect phase resolves that column to the contributing source row ids and feeds them into the same recompute pipeline as upstream failures. The end-to-end demo at examples/pipelines/retract-demo/ runs both surfaces in one pipeline.
Combine interaction
Every combine declares propagate_ck: to select which correlation-key fields its output rows carry:
- propagate_ck: driver – output inherits only the driver input’s correlation identity. Build-side records contribute fields to the output but their group identity is consumed by the match. Default-equivalent behavior; today’s strict-correlation pipelines stay on this setting.
- propagate_ck: all – output carries the union of correlation-key fields across every input. Use when the build side carries CK fields that downstream operators need to read (for example, a build-side stream is also subject to correlation-driven DLQ on its own keys).
- propagate_ck: { named: [<field>, ...] } – output carries exactly the named subset, intersected with what is actually present upstream. Use to project a multi-field correlation key down to a single field after a join.
- type: source
  name: orders
  config:
    name: orders
    type: csv
    path: ./data/orders.csv
    correlation_key: employee_id
    schema:
      - { name: employee_id, type: string }
      - { name: amount, type: float }
- type: source
  name: departments
  config:
    name: departments
    type: csv
    path: ./data/departments.csv
    correlation_key: employee_id
    schema:
      - { name: employee_id, type: string }
      - { name: dept, type: string }
- type: combine
  name: enriched
  input:
    o: orders # driver
    d: departments # build side
  config:
    where: "o.employee_id == d.employee_id"
    match: first
    on_miss: skip
    cxl: |
      emit employee_id = o.employee_id
      emit amount = o.amount
      emit dept = d.dept
    propagate_ck: driver
- type: combine
  name: enriched_all
  input:
    o: orders # both sources declare correlation_key
    d: departments
  config:
    where: "o.employee_id == d.employee_id"
    cxl: |
      emit employee_id = o.employee_id
      emit dept = d.dept
    propagate_ck: all # union of every input's CK columns
Under propagate_ck: driver, output rows from enriched carry the $ck.employee_id value from the driver record, regardless of which department record matched. A trigger error on a driver record DLQ’s that driver’s whole correlation group, including any combine output rows that were already produced for that group.
Under propagate_ck: all (or { named: [...] }), the combine widens its output schema with the build-side $ck.<field> columns it propagates, and the runtime copies the matched build record’s values into those columns. Driver wins on a name collision: if both the driver and a build input declare $ck.<field>, the column appears once on the output schema and the runtime keeps the driver’s value – the build’s value would only land if the driver’s slot was null, which never happens for a same-named CK field that the driver itself observes.
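The three propagation modes and the driver-wins rule can be sketched with lists and dictionaries. This is a model under simplifying assumptions, not the engine's code: one driver and one build input, hypothetical function names, and None standing in for a null slot.

```python
def propagated_ck(mode, driver_ck, build_ck, named=()):
    """Return the CK fields the combine output carries, driver fields first."""
    if mode == "driver":
        return list(driver_ck)
    union = list(driver_ck) + [f for f in build_ck if f not in driver_ck]
    if mode == "all":
        return union
    if mode == "named":
        # Named subset, intersected with what is actually present upstream.
        return [f for f in union if f in named]
    raise ValueError(f"unknown propagate_ck mode: {mode!r}")


def output_ck_values(fields, driver_row, build_row):
    """Fill each $ck.<field> slot; the driver wins on a name collision."""
    values = {}
    for field in fields:
        column = f"$ck.{field}"
        if driver_row.get(column) is not None:
            values[column] = driver_row[column]
        else:
            values[column] = build_row.get(column)
    return values


fields_all = propagated_ck("all", ["order_id"], ["customer_id", "region"])
collision = output_ck_values(
    ["employee_id"], {"$ck.employee_id": "E1"}, {"$ck.employee_id": "E9"}
)
```

In the collision case both inputs declare employee_id, the column appears once, and the driver's E1 is kept.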
Match-mode interaction:
- match: first – one matched build per driver row; that build’s $ck.<field> fills the propagated slot.
- match: all – one output row per matched build; each row carries its own matched build’s $ck.<field>.
- match: collect – one synthesized output row per driver. The propagated $ck.<field> slot is single-valued: the first matched build’s CK fills it. Every matched build’s full payload still rides inside the array column via Value::Map, so per-build lineage is preserved at the cost of single-valued addressing on the propagated slot.
This rule holds across all combine execution paths: the hash-join path, the IEJoin range-predicate path, the grace-hash spill path, the sort-merge path, and chained combines (combine consuming the output of another combine).
The drive: field on a combine selects which input is the driver. Choose the side that carries the authoritative group identity for downstream DLQ routing – typically the larger or more transactional stream.
propagate_ck is a required field with no default value – every combine must spell out which propagation mode it uses. Existing pipelines migrate by adding propagate_ck: driver to keep today’s behavior.
Composition interaction
A composition’s body operates on records flowing in from the parent pipeline. The correlation-key shadow columns flow into composition inputs and back out the named ports unchanged. Compositions cannot declare their own correlation key — CK is a property of a source’s identity, not of the composition body that consumes records from one.
Operator-by-operator retraction cost reference
An aggregate whose group_by omits any upstream CK field activates the retraction protocol automatically. Each operator on the post-source DAG carries a different cost profile under retraction; the table below summarizes the per-operator footprint so you can size memory and pick propagate_ck settings before pipelines hit production.
| Operator | Retraction cost |
|---|---|
| Source | None at retraction time. The CK shadow columns are stamped at ingest; replay never re-reads the source file. |
| Transform | Runs only at commit time on post-recompute aggregate emits when sitting inside a deferred region. Cost = O(rows_emitted_post_recompute) per region member, no extra state held. Non-deterministic CXL builtins (e.g. now) evaluate exactly once per output row, same as on a non-retraction pipeline. |
| Aggregate (strict, group_by covers upstream CK lattice) | None. Strict aggregates short-circuit to today’s two-phase commit body and pay zero retraction overhead. |
| Aggregate (retraction-mode, Reversible bindings) | Per-row lineage map (input_row_id → group_index) carried alongside accumulator state — ~8 bytes/row plus the per-group input_rows Vec inline cost — plus one synthetic $ck.aggregate.<name> shadow column on every output row at ~16 bytes/row. Retract is O(retracted_rows) reverse-op calls plus one finalize_in_place. Reversible accumulators: sum, count, collect, any. |
| Aggregate (retraction-mode, BufferRequired bindings) | Per-group raw contributions held until commit, plus one synthetic $ck.aggregate.<name> shadow column on every output row at ~16 bytes/row. Memory cost = O(input_rows × Σ binding_value_size) plus the synthetic-column tail. Retract recomputes affected groups from contributions − retracted_rows. BufferRequired accumulators: min, max, avg, weighted_avg. |
| Combine (driver propagation) | One propagated $ck.<field> slot from the driver record. No retraction state held by the combine itself; replay carries upstream deltas through. |
| Combine (propagate_ck: all / named: [...]) | Same per-row cost as driver propagation, plus the widened output schema’s $ck.<field> columns must be re-populated on replay. Cost scales with the output schema width, not retraction frequency. |
| Window (streaming) | None — streaming windows are incompatible with a retraction-mode aggregate whose dropped CK fields overlap partition_by. The plan-time derivation switches such windows into buffer mode. |
| Window (buffer-mode) | Per-partition raw row buffers held until commit. Memory cost = O(largest partition × per-row-size). Retract reruns the configured $window.* evaluation over partition − retracted_rows. Covers all 13 $window.* builtins uniformly via wholesale recompute. |
| Output | Holds retracted rows in correlation_buffers until commit. Replay substitutes the post-retract row in place; clean records flush to the writer, dirty records DLQ per the resolved correlation_fanout_policy. |
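The memory difference between the Reversible and BufferRequired rows comes down to whether an accumulator can undo a contribution from its running state alone. A minimal sketch with hypothetical class names: sum can subtract a retracted value, while min has to keep the raw contributions so it can recompute once the current minimum is removed.

```python
class ReversibleSum:
    """Running state only; retract is a reverse operation."""

    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value

    def retract(self, value):
        self.total -= value


class BufferedMin:
    """Holds raw contributions until commit so it can recompute after a retract."""

    def __init__(self):
        self.contributions = {}  # row_id -> value

    def add(self, row_id, value):
        self.contributions[row_id] = value

    def retract(self, row_id):
        del self.contributions[row_id]

    def finalize(self):
        return min(self.contributions.values()) if self.contributions else None


total = ReversibleSum()
for amount in (10, 5, 7):
    total.add(amount)
total.retract(5)

lowest = BufferedMin()
for row_id, amount in ((1, 10), (2, 5), (3, 7)):
    lowest.add(row_id, amount)
lowest.retract(2)  # removes the current minimum; the next one must be recomputed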
The --explain output’s === Retraction === section reports the live per-aggregate / per-window detail derived from the current pipeline, including the per-aggregate synthetic-CK column and its 16-byte/output-row cost. The clinker metrics collect spool reports the runtime counterpart: correlation.retract.groups_recomputed, .partitions_recomputed, .subdag_replay_rows, .output_rows_retracted_total, .degrade_fallback_count, .synthetic_ck_columns_emitted_total, .synthetic_ck_fanout_lookups_total, .synthetic_ck_fanout_rows_expanded_total. Use the explain block for plan-time capacity sizing, the metrics spool for post-run confirmation.
When retraction’s preconditions break at runtime (an aggregate spilled before retract reached it, or a window partition exceeded the memory budget), the orchestrator degrades to “DLQ entire affected group/partition” — the same strict-collateral DLQ shape every aggregate uses on the strict path. Each degrade increments correlation.retract.degrade_fallback_count; persistent non-zero values point at a tighter memory budget or a smaller correlation key cardinality.
Debugging
To see correlation-key shadow columns in writer output:
- type: output
  name: debug
  input: any_node
  config:
    type: csv
    path: "./debug.csv"
    include_correlation_keys: true
The output will contain extra columns named $ck.<field> (literal $ck. prefix in the CSV header) for each correlation-key field declared on the source whose records reach this output. The synthetic $ck.aggregate.<name> shadow column emitted by retraction-mode aggregates is also surfaced when this flag is enabled.
To investigate DLQ collaterals: every collateral entry’s category is correlated. The trigger entry in the same group carries the actual failure category and message.
See also
- Error Handling & DLQ – general DLQ configuration, fail-fast vs continue, type-error thresholds.
- Aggregate Nodes – group-by semantics and the strategy hint.
- Combine Nodes – driver selection and match modes.
- Output Nodes – include_correlation_keys and other field-control flags.
Scoped Variables
Clinker’s scoped-variable system lets a pipeline read and write
named values at three lifetimes: the pipeline run, the source, and
the record. Variables are declared statically at the top level of
the pipeline with their type and scope, read inline from CXL via the
$pipeline.*, $source.*, and $record.* namespaces, and written
exclusively by a dedicated state node.
The three scopes
| Scope | Lifetime | Reset | Reader namespace |
|---|---|---|---|
| pipeline | Entire pipeline run | Never (per run) | $pipeline.<key> |
| source | One per source file (Arc<str>-keyed) | Per source-file | $source.<key> |
| record | A single record as it flows through nodes | Per record | $record.<key> |
$record.<key> is a separate namespace from $meta.<key>. Metadata
is written via emit $meta.x = ... from a transform and survives only
to the immediate downstream operator. Record-scope vars survive the
whole row pipeline (every transform along the row’s path can read
them) but never serialize as output columns unless explicitly emitted
as a regular column.
Declaring variables
Every scoped variable must be declared in the pipeline’s top-level
vars: block, named, scoped, typed, and optionally given a default:
pipeline:
name: order_processing
vars:
pipeline:
cutoff_date:
type: date
default: "2024-01-01"
fuzzy_threshold:
type: float
default: 0.85
source:
batch_id:
type: string
ingestion_label:
type: string
record:
fuzzy_score:
type: float
Allowed types: int, float, string, bool, date, date_time.
Built-in members of each scope ($source.file, $source.row,
$source.path, $source.count, $source.batch,
$source.ingestion_timestamp; $pipeline.start_time,
$pipeline.name, $pipeline.execution_id, $pipeline.batch_id,
$pipeline.total_count, $pipeline.ok_count, $pipeline.dlq_count,
$pipeline.filtered_count, $pipeline.distinct_count) are reserved —
declaring a user variable with one of those names is rejected at
parse time.
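The declaration rules above (allowed types, reserved built-in names) can be expressed as a small validator. This is a sketch, not Clinker's parser: the function name and error strings are hypothetical, and the reserved sets are copied from the list of built-in members above.

```python
ALLOWED_TYPES = {"int", "float", "string", "bool", "date", "date_time"}

# Built-in members per scope, as listed above.
RESERVED = {
    "source": {"file", "row", "path", "count", "batch", "ingestion_timestamp"},
    "pipeline": {
        "start_time", "name", "execution_id", "batch_id", "total_count",
        "ok_count", "dlq_count", "filtered_count", "distinct_count",
    },
}


def validate_vars(vars_block):
    """vars_block: {scope: {name: {"type": ...}}}. Returns a list of errors."""
    errors = []
    for scope, declarations in vars_block.items():
        for name, spec in declarations.items():
            if name in RESERVED.get(scope, set()):
                errors.append(f"{scope}.{name}: reserved built-in name")
            if spec.get("type") not in ALLOWED_TYPES:
                errors.append(f"{scope}.{name}: unsupported type {spec.get('type')!r}")
    return errors


ok = validate_vars({"source": {"batch_id": {"type": "string"}}})
clash = validate_vars({"pipeline": {"batch_id": {"type": "string"}}})
```

Note that batch_id is free to declare at source scope but collides with the built-in $pipeline.batch_id at pipeline scope.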
$source.count semantics
$source.count is the finalized per-source record total for the
Source that produced the current record. It is observable only after
that Source’s input stream closes:
- Mid-stream reads (records emitted before the Source’s input closes, typical of Transform / Route / Window / Merge per-record evaluation) resolve to Null. The final count cannot be known before every record has been observed; the engine does not speculate or block.
- Post-close reads (terminal aggregate emits, commit-time deferred dispatch, post-recompute paths, any record emitted after the originating Source’s mpsc::Receiver returned None) resolve to the per-source total.
Pipelines that previously used $source.count as a streaming
denominator (e.g. value / $source.count) will now see Null from
that division on mid-stream records. If you need a streaming row
counter, declare a scope: source variable and increment it from a
state writer — that gives you a running count instead of waiting
for the final.
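The difference between the finalized count and a running counter can be modeled in a few lines of Python. This is a sketch of the semantics described above, not engine code: `SourceCounter` and `ratio` are hypothetical names, and Python's `None` stands in for CXL Null.

```python
class SourceCounter:
    """Toy stand-in for a Source whose total is known only after close."""

    def __init__(self):
        self._seen = 0
        self._closed = False

    def observe(self):
        self._seen += 1

    def close(self):
        self._closed = True

    @property
    def count(self):
        # Mid-stream reads resolve to Null (None here); post-close reads
        # resolve to the finalized total.
        return self._seen if self._closed else None


def ratio(value, count):
    # Division by Null yields Null, mirroring value / $source.count.
    return None if count is None else value / count


src = SourceCounter()
running = 0            # the "running count" alternative from a state writer
mid_stream = []
for value in [10, 20, 30]:
    src.observe()
    running += 1
    mid_stream.append((ratio(value, src.count), running))
src.close()
post_close = ratio(30, src.count)
```

Every mid-stream ratio is `None` while the running counter advances 1, 2, 3; only the post-close read divides by the final total of 3.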
Reading variables
CXL access is identical for declared and built-in keys:
- type: transform
name: filter_recent
input: orders
config:
cxl: |
emit id = id
filter received_at > $pipeline.cutoff_date
emit batch = $source.batch_id
emit confidence = $record.fuzzy_score
Reads of undeclared keys are rejected with E200 (CXL name resolution failed) at compile time, with a “did you mean” suggestion that scans the declared registry.
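A "did you mean" lookup over a closed registry can be sketched as follows. The matching strategy shown here (Python's `difflib.get_close_matches`) is an assumption chosen for illustration; the text does not say which similarity measure Clinker uses, and `resolve` and `DECLARED` are hypothetical names.

```python
import difflib

# Declared registry, mirroring the vars: example earlier in this section.
DECLARED = {
    "pipeline": ["cutoff_date", "fuzzy_threshold"],
    "source": ["batch_id", "ingestion_label"],
    "record": ["fuzzy_score"],
}


def resolve(scope, key, declared=DECLARED):
    """Return the qualified name, or raise with a suggestion if undeclared."""
    names = declared.get(scope, [])
    if key in names:
        return f"${scope}.{key}"
    message = f"E200: unknown variable ${scope}.{key}"
    hint = difflib.get_close_matches(key, names, n=1)
    if hint:
        message += f"; did you mean ${scope}.{hint[0]}?"
    raise NameError(message)
```

Because the set of variables is closed at plan time, the candidate list is small and the scan can run once per unresolved reference during compilation.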
Writing variables: the state node
The only way to mutate a scoped variable is a dedicated state
node. The node is a pass-through for records — its input record
forwards unchanged on the output edge — but evaluates its set:
assignments and writes the results into the appropriate scope-keyed
runtime registry.
- type: state
name: capture_header
input: salesforce_in
config:
scope: source
set:
- var: batch_id
cxl: "first(this.batch)"
- var: ingestion_label
cxl: "$source.file.file_stem()"
- type: state
name: row_score
input: enrich
config:
scope: record
set:
- var: fuzzy_score
cxl: "fuzzy_match(this.name, $pipeline.canonical_name)"
Inline mutation from a regular transform (emit $pipeline.x = ...)
is a parse error. The dedicated-node design keeps the dependency
between writers and readers visible at plan time.
Init phase: pre-runtime population
A state node may declare phase: init to run to completion
before any runtime-phase node sees a record:
- type: source
name: config_src
config:
name: config_src
type: csv
path: config.csv
schema:
- { name: cutoff, type: int }
- type: aggregate
name: max_agg
input: config_src
config:
group_by: []
cxl: |
emit cap = max(cutoff)
- type: state
name: precompute_cutoff
input: max_agg
config:
scope: pipeline
phase: init
set:
- var: cutoff_date
cxl: "cap"
Init-phase nodes must be terminal — no runtime-phase node may consume from an init-phase state node. (Init-phase state nodes can chain through init-only descendants for compositions.) Use disjoint Sources for init vs runtime when you need both, since a Source shared between an init and a runtime branch only feeds the init pass.
Compile-time validation
Scoped variables earn their architectural payoff at plan time. Every reference and every writer is checked against a static registry, and every cross-DAG flow is verified against the topology.
| Code | What it catches |
|---|---|
| E107 | Channel var override declares a different type than the pipeline. |
| E109 | Channel targets a composition but carries vars: overrides. |
| E110 | Channel var name shadows a reserved system field for that scope. |
| E111 | Channel vars.source.<src> references an unknown source-node name. |
| E164 | An init-phase state node has a runtime descendant. |
| E171 | A reader is not a transitive DAG descendant of its writer. |
| E172 | Bare $source.<custom> read downstream of a Merge or Combine. |
| E173 | Composition body reads a parent scoped var without opting in. |
| E174 | Composition _compose.scoped_vars declares a different type than the parent. |
| E175 | An init-phase node reads a runtime-only writer’s variable. |
| E200 | A reference to an undeclared scoped variable (resolver-level failure). |
Cross-Transform duplicate declares: (the same (scope, name) declared
on two Transforms) is rejected at config-validation time, ahead of
compilation. $pipeline, $source, and $record are flat shared
namespaces; declare each name once and reference it from every consumer.
Each diagnostic carries the offending span plus secondary spans pointing at the conflicting writer or the parent declaration, so the report shows up where the user is reading or writing — not in some unrelated configuration block.
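The E171 rule (a reader must be a transitive DAG descendant of its writer) amounts to a reachability check. A minimal sketch, assuming the plan is available as an adjacency map; the function names and the `side_branch` node are hypothetical, and the real planner's data structures are not described in this document.

```python
from collections import deque


def descendants(edges, start):
    """All nodes reachable from start via directed edges (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def check_readers(edges, writers, readers):
    """Return (reader, var) pairs that are not downstream of the var's writer."""
    errors = []
    for reader, var in readers:
        if reader not in descendants(edges, writers[var]):
            errors.append((reader, var))
    return errors


edges = {
    "orders": ["capture_header", "side_branch"],
    "capture_header": ["enrich"],
    "enrich": ["out"],
}
writers = {"batch_id": "capture_header"}
readers = [("enrich", "batch_id"), ("side_branch", "batch_id")]
violations = check_readers(edges, writers, readers)
```

Here `enrich` sits downstream of the writer and passes, while `side_branch` forks off before the state node and would be reported.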
Post-merge access: qualified $source.<input>.<key>
After a Merge or Combine, the bare $source.<custom> form is
ambiguous: each record carries its own source’s value, but the
reader’s intent is usually to compare across inputs. E172 rejects
the unqualified form and the qualified form is the legal alternative:
- type: transform
name: read_after_merge
input: merged
config:
cxl: |
emit id = id
emit lt = $source.left_input.left_label
emit rt = $source.right_input.right_label
The <input_name> segment matches the named input on the Combine
(its IndexMap key) or the upstream node name on the Merge.
Composition opt-in
A composition body cannot see parent scoped variables by default —
the seal is enforced by E173. To pass values across the boundary,
the composition declares the schema of parent vars it consumes in
its _compose.scoped_vars block:
# read_pipeline_var.comp.yaml
_compose:
name: read_pipeline_var
inputs:
inp:
schema:
- { name: id, type: int }
outputs:
out: tap
scoped_vars:
pipeline:
cutoff:
type: int
nodes:
- type: transform
name: tap
input: inp
config:
cxl: |
emit id = id
emit cutoff_seen = $pipeline.cutoff
The parent must declare cutoff with the matching type; mismatches
raise E174.
What scoped variables are not
These are intentional non-features:
- No persistence across runs. State is in-memory only. A pipeline run starts with declaration defaults; the writes don’t survive the process.
- No inline emit $pipeline.x writes. Convenience-style mutation from a transform body is forbidden; empirical evidence from comparable engines shows it leads to race conditions and hidden DAG dependencies.
- No dynamic var creation. The set of variables is closed at plan time, by design. This bounds memory and makes the validation matrix above tractable.
Channel overrides
A channel can both override a pipeline’s declaration defaults and add
new entries across all four registries ($vars.*, $pipeline.*,
$source.*, $record.*). Each registry has its own sub-block under
vars: on a .channel.yaml, and each entry uses the same
{ type, default } shape that pipeline-side declarations use:
# Pipeline declarations
pipeline:
name: orders
vars:
fuzzy_threshold: { type: float, default: 0.85 } # $vars.*
nodes:
- type: source
name: orders_src
config: { name: orders_src, type: csv, path: in.csv,
schema: [{ name: id, type: int }] }
- type: transform
name: enrich
input: orders_src
config:
declares:
- { name: cutoff_date, scope: pipeline, type: date, default: "2024-01-01" }
- { name: ingest_label, scope: source, type: string, default: "prod" }
- { name: tier, scope: record, type: string, default: "bronze" }
cxl: |
emit id = id
# channels/acme-prod.channel.yaml
channel:
name: acme-prod
target: ./pipelines/orders.yaml
vars:
static:
fuzzy_threshold: { type: float, default: 0.95 }
pipeline:
cutoff_date: { type: date, default: "2026-01-01" }
source:
orders_src:
ingest_label: { type: string, default: "acme-prod" }
record:
tier: { type: string, default: "platinum" }
Override semantics (entry name already declared) require the channel’s
type to match the declared type — mismatches produce E107. Add
semantics (entry name not yet declared) extend the registry with a new
declaration. $source overrides are keyed by source-node name; an
unknown source name produces E111. The reserved-name guard
(E110) blocks channels from shadowing system fields like
$pipeline.execution_id or $source.path. Channels that target a
.comp.yaml may not carry vars: (E109 if they do).
See Channels for the full overlay rules and the channel manifest reference.
Document Envelope Context ($doc.*)
Many enterprise file formats wrap their record body in an envelope:
named sections that surround the records and carry document-level
metadata — a batch header with a run date and batch id, a trailer with
a record count and checksum, or arbitrary sibling sections. Clinker
exposes these sections to CXL through the $doc.<section>.<field>
namespace.
sources:
- name: payments
path: data/payments.xml
format: xml
envelope:
sections:
BatchInfo:
extract: { xml_path: "/payments/BatchInfo" }
fields:
batch_id: string
run_date: date
Summary:
extract: { xml_path: "/payments/Summary" }
fields:
record_count: int
checksum: string
A downstream transform reads any declared section field on every body record:
nodes:
- transform: tag
inputs: { in: payments }
project:
- batch: $doc.BatchInfo.batch_id
- expected_total: $doc.Summary.record_count
- amount: amount
Section names are yours
The engine reserves no section names. BatchInfo and Summary
above are arbitrary identifiers chosen by the pipeline author — Head
/ Foot, preamble / trailer, batch_metadata / eob_summary are
all equally valid. A section name is whatever string you put in the
sections: map; CXL exposes it verbatim as $doc.<that_name>.<field>.
All sections are available everywhere in the body stream
Before the first body record streams from a file, the reader runs a
one-time envelope pre-scan that extracts every declared section —
no matter where it physically sits in the file. A header at the top and
a trailer at the bottom are both pulled out up front. The result: every
body record sees every $doc.<section>.<field> value, from the first
record to the last.
This means a trailer field is available during body streaming, not just at end-of-file. A pipeline can compute, on every row, a ratio against the trailer’s total:
project:
- running_fraction: row_index / $doc.Summary.record_count
The pre-scan reads the envelope-bearing segments of the file before body streaming begins. Envelope payloads are small (a few hundred bytes per document is typical), but reaching a trailing section requires the reader to have buffered the file — so envelope-aware sources hold the source file’s bytes in memory for the lifetime of the read. Body records still stream one at a time; only the envelope sections (not the body) live in the document context.
$doc.* is not the file in memory. It holds the parsed envelope
sections only — body records flow through the pipeline one at a time,
and the only stages that buffer multiple records are the usual blocking
operators (Aggregate, Sort, grace-hash Combine) under the standard RSS
budget.
Extract rules per format
Each section declares how the reader locates its payload:
| Format | extract: key | Value |
|---|---|---|
| XML | xml_path | Slash-path to the section element, e.g. /doc/Head |
| JSON | json_pointer | RFC 6901 pointer, e.g. /Head |
| EDIFACT | segment | A service-segment tag — only UNB |
| X12 | segment | A service-segment tag — only ISA (GS/ST surface as nested levels) |
Declaring an xml_path section against a JSON source (or vice versa),
or a segment extract against XML/JSON, is a configuration error and
fails fast when the source opens, rather than silently producing empty
sections. CSV and fixed-width sources do not yet support envelope
extraction; declaring envelope sections on those formats is a no-op
today.
EDIFACT segment extract
An EDIFACT source exposes its interchange header UNB as an envelope
section. The section’s field names are the positional element keys
e01, e02, … :
envelope:
sections:
interchange:
extract: { segment: "UNB" }
fields:
e05: string # interchange control reference
Only the UNB header is extractable. EDIFACT is scanned as a flat byte
stream with only the header pre-read, so trailer segments (UNT, UNZ)
that arrive after the body are not envelope sections — their control
counts are validated inline by the reader instead. A segment extract
naming any tag other than UNB is rejected at startup. See
EDIFACT Format for the full reference.
A JSON example:
sources:
- name: payments
path: data/payments.json
format: json
record_path: records
envelope:
sections:
Head:
extract: { json_pointer: "/Head" }
fields:
batch_id: string
Foot:
extract: { json_pointer: "/Foot" }
fields:
count: int
against:
{
"Head": { "batch_id": "RUN-001" },
"records": [ { "amount": 10 }, { "amount": 20 } ],
"Foot": { "count": 2 }
}
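The JSON example above can be modeled in Python to show why a trailer field is visible on the first body record. This is a simplified sketch, not the reader's implementation: it loads the whole document for brevity, whereas the text describes body records streaming one at a time. `resolve_pointer` and `stream_with_envelope` are hypothetical helper names.

```python
import json


def resolve_pointer(doc, pointer):
    """Minimal RFC 6901 lookup; returns None when a step is missing."""
    node = doc
    for token in pointer.split("/")[1:]:
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(node, dict) and token in node:
            node = node[token]
        elif isinstance(node, list) and token.isdigit() and int(token) < len(node):
            node = node[int(token)]
        else:
            return None
    return node


def stream_with_envelope(text, sections, record_path):
    """Pre-scan every declared section, then yield (record, context) pairs."""
    doc = json.loads(text)
    context = {name: resolve_pointer(doc, ptr) for name, ptr in sections.items()}
    for record in doc[record_path]:
        yield record, context


TEXT = (
    '{"Head": {"batch_id": "RUN-001"},'
    ' "records": [{"amount": 10}, {"amount": 20}],'
    ' "Foot": {"count": 2}}'
)
SECTIONS = {"Head": "/Head", "Foot": "/Foot"}

rows = [
    {
        "batch": ctx["Head"]["batch_id"],
        "amount": rec["amount"],
        "fraction": index / ctx["Foot"]["count"],
    }
    for index, (rec, ctx) in enumerate(
        stream_with_envelope(TEXT, SECTIONS, "records"), start=1
    )
]
```

Both sections are extracted before the first record is yielded, so the first row already divides by the trailer's count.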
Typed fields
Each section’s fields: map declares the field name and its type, drawn
from the same small vocabulary as source schemas: string, int,
float, bool, date, date_time. The extracted raw value is coerced
to the declared type at pre-scan time; a value that cannot coerce
(e.g. a non-numeric string declared int) fails the source with a
diagnostic naming the section, field, and offending value.
A field that the document does not carry resolves to null — $doc.*
follows the same missing-value convention as $source.* and
$pipeline.*. A section that the document does not carry at all is
simply absent from the context; any $doc.<missing_section>.<field>
resolves to null.
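The coercion and missing-value rules can be sketched in Python as below. The accepted spellings for each type (for example `true`/`false` for bool and ISO 8601 for dates) are assumptions made for this example, and `coerce_section` is a hypothetical name; the engine's exact parsing rules are not given here.

```python
from datetime import date, datetime

# Assumed coercers for the six declared field types.
COERCERS = {
    "string": str,
    "int": int,
    "float": float,
    "bool": lambda raw: {"true": True, "false": False}[str(raw).lower()],
    "date": date.fromisoformat,
    "date_time": datetime.fromisoformat,
}


def coerce_section(section, raw, fields):
    """Coerce one extracted section; absent fields resolve to None (null)."""
    out = {}
    for name, type_name in fields.items():
        if raw is None or name not in raw:
            out[name] = None
            continue
        try:
            out[name] = COERCERS[type_name](raw[name])
        except (ValueError, KeyError, TypeError) as err:
            # The diagnostic names the section, field, and offending value.
            raise ValueError(
                f"envelope section {section!r}, field {name!r}: "
                f"cannot coerce {raw[name]!r} to {type_name}"
            ) from err
    return out


summary = coerce_section(
    "Summary",
    {"record_count": "2"},
    {"record_count": "int", "checksum": "string"},
)
```

The present field is coerced to an integer, while the field the document does not carry comes back as `None`.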
One document per file
Each source file is its own document with its own envelope context.
When a source matches multiple files (via glob: / paths:), each file
gets a fresh document context with its own section values. Records from
different files never share a context — a record’s $doc.* always
reflects the file that record came from.
Document boundaries flow through the pipeline as inline punctuation signals (one when a document opens, one when it closes). These signals let document-scoped operators — for example a future per-document aggregate flush or trailer-count validation — fire at exactly the right point. Today the signals propagate through Source, Transform, Route, Sort, and Combine, and are reconciled at Merge (a document that fans in through several branches closes downstream exactly once).
Nested (multi-level) envelopes
Some formats wrap their records in several envelope levels, one inside another. EDI X12 is the canonical example and the first format that implements this: an interchange (ISA/IEA) contains one or more functional groups (GS/GE), each containing one or more transaction sets (ST/SE), each containing the records. A single file can carry multiple interchanges back to back. See X12 Format for the full reference.
A reader for such a format opens and closes each nested level as it
crosses the corresponding envelope boundary mid-file. Each level
contributes its own sections to $doc. There is no new $doc syntax for
nesting — every level’s sections are read through the same two-level
$doc.<section>.<field> lookup. A record inside the innermost level sees
every enclosing level’s sections at once. For X12 the interchange header is
a declared segment: "ISA" envelope section (you choose its name), while
the GS group and ST set surface automatically as the reader-supplied
sections functional_group and transaction_set, each keyed by positional
eNN elements:
project:
- interchange_control: $doc.interchange.e13 # ISA13, declared section
- functional_id: $doc.functional_group.e01 # GS01 (reader-supplied)
- transaction_type: $doc.transaction_set.e01 # ST01 (reader-supplied)
- claim_amount: amount # body field
A record streamed inside the ST level resolves the ST section, the enclosing GS section, and the outermost ISA section, all at once: each inner level inherits every enclosing level’s sections as siblings in one flat namespace. If two levels declare a section with the same name, the innermost wins for records inside it — the same shadowing rule a nested scope follows in any language. Picking distinct per-level names (as above) keeps every level independently visible.
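The flat namespace with innermost-wins shadowing behaves like a chained lookup. A minimal Python model using `collections.ChainMap`, which searches its mappings left to right; the element values are made-up sample data, and this is an analogy for the lookup rule rather than a description of the engine's data structure.

```python
from collections import ChainMap

# One mapping per envelope level; values are invented for illustration.
isa_level = {"interchange": {"e13": "000000905"}}
gs_level = {"functional_group": {"e01": "HC"}}
st_level = {"transaction_set": {"e01": "837"}}

# Innermost level first, so it wins when two levels share a section name.
doc = ChainMap(st_level, gs_level, isa_level)

# Shadowing: both levels declare a section called "header".
outer = {"header": {"level": "outer"}}
inner = {"header": {"level": "inner"}}
shadowed = ChainMap(inner, outer)
```

A record inside the ST level sees all three sections through `doc`, and `shadowed["header"]` resolves to the inner declaration.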
Boundaries nest correctly through the pipeline: each level opens before the records inside it and closes after them, so levels open outermost-first and close innermost-first. A level that fans in through several branches is still reconciled once at Merge, exactly like a single-level document.
Header-only interchanges
A multi-level envelope file can legitimately carry an interchange whose
body is empty — envelope structure (an interchange header, and possibly
inner group headers) with zero records inside. Such an interchange still
opens a document and emits its open/close boundaries, so downstream
operators and trailer-count validation observe it just like any other
document. The interchange’s $doc.* sections are extracted and the
boundaries flow even though no body record ever streams from it.
The same holds for an empty inner envelope — an open/close pair with no records between — and for an inner envelope that opens or closes after the file’s last body record. Every envelope boundary a reader signals is applied, whether or not a record follows it, so the document frame stays balanced end to end.
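The "balanced frame" property can be checked with a stack. The event tuples below are a hypothetical encoding of the open/close punctuation signals, chosen only for this sketch; the engine's actual signal type is not shown in this document.

```python
def check_balanced(events):
    """True if every open has a matching close in innermost-first order."""
    stack = []
    for kind, level in events:
        if kind == "open":
            stack.append(level)
        elif kind == "close":
            if not stack or stack[-1] != level:
                return False
            stack.pop()
        # "record" events do not affect the envelope stack.
    return not stack


# A header-only interchange: envelope structure with zero records.
header_only = [
    ("open", "ISA"), ("open", "GS"),
    ("close", "GS"), ("close", "ISA"),
]

# An interchange with one record inside the innermost level.
with_body = [
    ("open", "ISA"), ("open", "GS"), ("open", "ST"),
    ("record", "claim"),
    ("close", "ST"), ("close", "GS"), ("close", "ISA"),
]

# Closing an outer level before its inner level is unbalanced.
crossed = [("open", "ISA"), ("open", "GS"), ("close", "ISA"), ("close", "GS")]
```

The header-only sequence is balanced even though no record event appears between the boundaries.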
Channels
Channels enable multi-tenant pipeline customization. A single pipeline definition can be run with different configurations per client, environment, or business unit – without duplicating or modifying the base YAML.
A .channel.yaml file declares a target pipeline (or composition), composition-level config knobs, and overrides/adds for the four scoped-variable registries the pipeline reads.
Channel manifest
# channels/staging.channel.yaml
channel:
name: staging
target: ./pipelines/my_pipeline.yaml
# Composition-level config knobs (DottedPath keys: alias.param)
config:
default:
enrich1.fuzzy_threshold: 0.85
fixed:
enrich1.lookup_table: "s3://acme/lookups/staging.csv"
# Variable overrides / adds (issue #45)
vars:
static: # overrides + adds for $vars.*
fuzzy_threshold:
type: float
default: 0.92
pipeline: # overrides + adds for $pipeline.*
cutoff_date:
type: date
default: "2026-01-01"
source: # per-source-name overrides + adds for $source.*
orders:
ingest_label:
type: string
default: "staging"
record: # overrides + adds for $record.*
tier:
type: string
default: "bronze"
Top-level fields
| Field | Required | Description |
|---|---|---|
| channel.name | Yes | Channel identifier; used in --channel, path templates, and the channel-identity stamp on the compiled plan. |
| channel.target | Yes | Path to the target pipeline (*.yaml) or composition (*.comp.yaml). |
| config.default / config.fixed | No | Composition-config overlays. default can be overridden by a higher layer; fixed cannot. |
| vars.* | No | See Variable overrides below. |
Running with a channel
clinker run pipeline.yaml --channel ./channels/staging.channel.yaml
--channel loads the binding once, validates it against the compiled plan, applies the overlay, and seeds the executor’s eval context before any record-stream-phase node runs. The channel name is also available as the {channel} token in output path templates.
If channel.target does not match the loaded <config> path, clinker emits W104 and proceeds — the operator may have a legitimate reason to run a sibling pipeline against the same channel.
Variable overrides
A pipeline exposes four scoped-variable registries:
| Read syntax | Lifetime | Pipeline declaration site |
|---|---|---|
| $vars.<key> | Frozen at pipeline start | Top-level vars: { key: { type, default } } |
| $pipeline.<key> | Pipeline-wide, mutable | Transform declares: [{ name, scope: pipeline, type, default? }] |
| $source.<key> | Per-source-file, mutable | Transform declares: [{ name, scope: source, type, default? }] |
| $record.<key> | Per-record, mutable | Transform declares: [{ name, scope: record, type, default? }] |
Each registry has a corresponding sub-block under vars: on a channel YAML. Every entry uses the same { type, default } shape the pipeline declarations use:
- Override: the entry name already exists in the registry. The channel-supplied type MUST equal the declared type (mismatch → E107). The channel default replaces the declared default after passing the same typecheck pipeline declarations use.
- Add: the entry name is not yet declared. The full { type, default } becomes the new declaration in that registry.
Source overrides are keyed by source-node name (vars.source.<src>.<var>). Adds and overrides on $source apply to every file the named source ingests; an unknown source name produces E111.
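The override/add rules for a single registry can be sketched in Python. This is an illustrative model only: `apply_overlay` and `ChannelError` are hypothetical names, `RESERVED` lists just the reserved fields quoted in this document, and the variable `region` is invented for the example.

```python
# Subset of reserved names quoted in the text; the full lists live in the engine.
RESERVED = {
    "pipeline": {"execution_id", "start_time"},
    "source": {"path", "row"},
}


class ChannelError(Exception):
    def __init__(self, code, detail):
        super().__init__(f"{code}: {detail}")
        self.code = code


def apply_overlay(scope, declared, overlay):
    """Return a new registry with channel entries overriding or extending it."""
    merged = {name: dict(entry) for name, entry in declared.items()}
    for name, entry in overlay.items():
        if name in RESERVED.get(scope, set()):
            raise ChannelError("E110", f"${scope}.{name} is a reserved system field")
        current = declared.get(name)
        if current is not None and current["type"] != entry["type"]:
            raise ChannelError(
                "E107",
                f"{name}: declared {current['type']}, override {entry['type']}",
            )
        merged[name] = dict(entry)  # override or add
    return merged


declared = {"cutoff_date": {"type": "date", "default": "2024-01-01"}}
overlay = {
    "cutoff_date": {"type": "date", "default": "2026-01-01"},  # override
    "region": {"type": "string", "default": "emea"},           # add
}
merged = apply_overlay("pipeline", declared, overlay)
```

The override replaces only the default, the add extends the registry, and the original declarations are left untouched so the base pipeline stays reusable across channels.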
Reserved-system fields
Each scope has a small set of reserved field names that the engine populates (e.g. $pipeline.execution_id, $source.path, $source.row, $pipeline.start_time). Channels cannot shadow these — attempting it produces E110, naming the offending scope and field. The full lists live in crates/clinker-core/src/config/mod.rs (RESERVED_PIPELINE_NAMES, RESERVED_SOURCE_NAMES, RESERVED_RECORD_NAMES); $vars.* has no reserved subset.
Composition-target channels
Channels that target a .comp.yaml may not carry a vars: block (composition var overlay is out of scope today) — the binding emits E109 if vars: is non-empty. Channel-config knobs (config: block) on composition targets continue to work as before.
Diagnostic codes
| Code | Meaning |
|---|---|
| E107 | Var override type mismatch (declared T, override declared U). |
| E109 | Var overrides not supported on composition channels. |
| E110 | Channel var shadows reserved system field for that scope. |
| E111 | vars.source.<src> references a source-node name not declared in the pipeline. |
| W103 | Channel config.* key did not match any composition parameter in the compiled plan. |
| W104 | channel.target does not match the <config> argument passed to clinker run. |
Cross-Transform declaration uniqueness
$pipeline, $source, and $record are flat shared namespaces. The same name declared on more than one Transform’s declares: is a config-validation error — clinker mirrors the fail-fast posture of Beam, Flink, Kafka Streams, Dagster, and post-fix dbt for shared-namespace key collisions. Authors who want shared state declare it once and reference everywhere.
Workspace discovery
Channels are part of the broader workspace system. Clinker discovers workspaces via clinker.toml files, which can define the channel directory layout and other workspace-level settings.
Compositions
Compositions are reusable pipeline fragments that can be imported into multiple pipelines. They encapsulate common transform patterns – date derivations, address normalization, currency conversion – into self-contained, testable units.
Using a composition
A composition node in your pipeline references an external .comp.yaml file:
- type: composition
name: fiscal_dates
input: invoices
use: "./compositions/fiscal_date.comp.yaml"
config:
start_month: 4
The use: field points to the composition definition file. The config: block passes parameters that customize the composition’s behavior for this specific invocation.
Composition definition file
A .comp.yaml file declares the composition’s interface – what fields it requires from upstream and what fields it produces:
# compositions/fiscal_date.comp.yaml
composition:
name: fiscal_date
description: "Derive fiscal year, quarter, and period from a date field"
requires:
- { name: invoice_date, type: date }
produces:
- { name: fiscal_year, type: int }
- { name: fiscal_quarter, type: string }
- { name: fiscal_period, type: int }
params:
- name: start_month
type: int
default: 1
description: "First month of the fiscal year (1-12)"
Composition fields
| Field | Required | Description |
|---|---|---|
| name | Yes | Composition identifier |
| description | No | Human-readable purpose |
| requires | Yes | Input fields the composition needs from upstream (name + type) |
| produces | Yes | Output fields the composition adds to the record (name + type) |
| params | No | Configurable parameters with optional defaults |
Advanced wiring
For compositions with multiple input or output ports, the node supports explicit port bindings:
- type: composition
name: enrich_address
input: customers
use: "./compositions/address_normalize.comp.yaml"
inputs:
primary: customers
reference: zip_lookup
outputs:
normalized: next_stage
config:
country_code: "US"
resources:
zip_database: "./data/zipcodes.csv"
Port and resource fields
| Field | Required | Description |
|---|---|---|
| inputs | No | Map of composition input ports to upstream node references |
| outputs | No | Map of composition output ports to downstream node references |
| config | No | Parameter overrides (key-value pairs) |
| resources | No | External resource bindings (file paths, connection strings) |
| alias | No | Namespace prefix for expanded node names (avoids collisions) |
Complete example
pipeline:
name: invoice_pipeline
nodes:
- type: source
name: invoices
config:
name: invoices
type: csv
path: "./data/invoices.csv"
schema:
- { name: invoice_id, type: int }
- { name: customer_id, type: int }
- { name: invoice_date, type: date }
- { name: amount, type: float }
- type: composition
name: fiscal_dates
input: invoices
use: "./compositions/fiscal_date.comp.yaml"
config:
start_month: 4
- type: transform
name: final_enrich
input: fiscal_dates
config:
cxl: |
emit invoice_id = invoice_id
emit customer_id = customer_id
emit amount = amount
emit fiscal_year = fiscal_year
emit fiscal_quarter = fiscal_quarter
- type: output
name: result
input: final_enrich
config:
name: result
type: csv
path: "./output/invoices_enriched.csv"
Current status
Note: Composition support is being built in Phase 16c. The YAML shape parses and validates, but compilation currently returns a diagnostic (E100) per composition node. The documentation above reflects the intended design. Full compilation and expansion will land when Phase 16c is complete.
CXL Overview
CXL (Clinker Expression Language) is a per-record expression language designed for ETL transformations. Every CXL program operates on one record at a time, producing output fields, filtering records, or computing derived values.
CXL is not SQL. There are no SELECT, FROM, or WHERE keywords. CXL programs are sequences of statements – emit, let, filter, distinct – that execute top to bottom against the current record.
Key differences from SQL
| SQL | CXL |
|---|---|
| SELECT col AS alias | emit alias = col |
| WHERE condition | filter condition |
| AND / OR / NOT | and / or / not (keywords) |
| && / \|\| / ! | Not supported – use keywords |
| COALESCE(a, b) | a ?? b |
| CASE WHEN ... THEN ... END | if ... then ... else ... or match { } |
Boolean operators are keywords
CXL uses English keywords for boolean logic, not symbols:
$ cxl eval -e 'emit result = true and false' --field dummy=1
{
"result": false
}
The operators &&, ||, and ! are syntax errors in CXL. Always use and, or, and not.
System namespaces use $ prefix
CXL provides built-in namespaces for accessing pipeline state, metadata, and window functions. All system namespaces are prefixed with $:
- $pipeline.* – pipeline execution context (name, counters, provenance)
- $meta.* – per-record metadata
- $window.* – window function calls
- $vars.* – user-defined pipeline variables
$ cxl eval -e 'emit name = $pipeline.name'
{
"name": "cxl-eval"
}
Compile-time type checking
CXL catches type errors before data processing begins. The compilation pipeline runs four phases:
- Parse – tokenize and build an AST from CXL source text
- Resolve – bind field references, validate method names, check arity
- Typecheck – infer types, validate operator compatibility, check method receiver types
- Eval – execute the typed program against each record
Errors at any phase produce rich diagnostics with source locations and fix suggestions via miette.
$ cxl check transform.cxl
ok: transform.cxl is valid
If there are type errors, the checker reports them with spans:
error[typecheck]: cannot apply '+' to String and Int (at transform.cxl:12)
help: convert one operand — use .to_int() or .to_string()
A minimal CXL program
emit greeting = "hello"
emit doubled = amount * 2
filter amount > 0
This program:
- Emits a constant string field greeting
- Emits doubled as twice the input amount
- Filters out records where amount is not positive
Try it:
$ cxl eval -e 'emit greeting = "hello"' -e 'emit doubled = amount * 2' \
--field amount=5
{
"greeting": "hello",
"doubled": 10
}
Statement order matters
CXL statements execute sequentially. Later statements can reference fields produced by earlier emit or let statements:
$ cxl eval -e 'let tax_rate = 0.21' -e 'emit tax = price * tax_rate' \
--field price=100
{
"tax": 21.0
}
A filter statement short-circuits execution – if the condition is false, remaining statements do not run and the record is excluded from output.
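Sequential execution with a short-circuiting filter can be modeled in a few lines of Python. This is a toy interpreter for illustration, not how the CXL evaluator is built: statements are represented as `(kind, name, function)` tuples, which is an encoding invented for this sketch.

```python
def run(statements, record):
    """Evaluate statements top to bottom; return None if a filter rejects."""
    scope = dict(record)   # names visible to later expressions
    out = {}
    for kind, name, expr in statements:
        if kind == "filter":
            if not expr(scope):
                return None          # short-circuit: record is excluded
            continue
        value = expr(scope)
        scope[name] = value          # let and emit both bind the name
        if kind == "emit":
            out[name] = value        # only emit produces an output field
    return out


program = [
    ("let", "tax_rate", lambda s: 0.21),
    ("filter", None, lambda s: s["price"] > 0),
    ("emit", "tax", lambda s: s["price"] * s["tax_rate"]),
]

kept = run(program, {"price": 100})
dropped = run(program, {"price": -5})
```

The `let` binding is visible to the later `emit` but does not appear in the output, and the record with a negative price never reaches the `emit` at all.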
Types & Literals
CXL has 9 value types. Every field value, literal, and expression result is one of these types.
Value types
| Type | Rust backing | Description |
|---|---|---|
| Null | Value::Null | Missing or absent value |
| Bool | bool | true or false |
| Integer | i64 | 64-bit signed integer |
| Float | f64 | 64-bit double-precision float |
| String | Box<str> | UTF-8 text |
| Date | NaiveDate | Calendar date without timezone |
| DateTime | NaiveDateTime | Date and time without timezone |
| Array | Vec<Value> | Ordered collection of values |
| Map | IndexMap<Box<str>, Value> | Key-value pairs |
Literal syntax
Integers
Standard decimal notation. Negative values use the unary minus operator.
$ cxl eval -e 'emit a = 42' -e 'emit b = -5' -e 'emit c = 0'
{
"a": 42,
"b": -5,
"c": 0
}
Floats
Decimal notation with a dot. Must have digits on both sides of the decimal point.
$ cxl eval -e 'emit a = 3.14' -e 'emit b = -0.5'
{
"a": 3.14,
"b": -0.5
}
Strings
Double-quoted or single-quoted. Supports escape sequences: \\, \", \', \n, \t, \r.
$ cxl eval -e 'emit greeting = "hello world"'
{
"greeting": "hello world"
}
Booleans
The keywords true and false.
$ cxl eval -e 'emit flag = true' -e 'emit neg = not flag'
{
"flag": true,
"neg": false
}
Dates
Hash-delimited ISO 8601 format: #YYYY-MM-DD#.
$ cxl eval -e 'emit d = #2024-01-15#'
{
"d": "2024-01-15"
}
Null
The keyword null.
$ cxl eval -e 'emit nothing = null'
{
"nothing": null
}
Schema types
When declaring column types in YAML pipeline schemas, use these type names:
| Schema type | CXL type | Description |
|---|---|---|
| string | String | Text values |
| int | Integer | 64-bit integers |
| float | Float | 64-bit floats |
| bool | Bool | Boolean values |
| date | Date | Calendar dates |
| date_time | DateTime | Date and time |
| array | Array | Ordered collections |
| numeric | Int or Float | Union type – accepts either |
| any | Any | Unknown type – no type constraints |
| nullable(T) | Nullable(T) | Wrapper – value may be null |
Example YAML schema declaration:
schema:
employee_id: int
name: string
salary: nullable(float)
start_date: date
Type promotion
CXL automatically promotes types in mixed expressions:
Int + Float promotes to Float:
$ cxl eval -e 'emit result = 2 + 3.5'
{
"result": 5.5
}
Null + T produces Nullable(T): Any operation involving null produces a nullable result.
$ cxl eval -e 'emit result = null + 5'
{
"result": null
}
Nullable(A) + B unifies to Nullable(unified): When a nullable value meets a non-nullable value, the result type wraps the unified inner type in Nullable.
Type unification rules
The type checker follows these rules when two types meet in an expression:
- Same types unify to themselves: Int + Int produces Int
- Any unifies with anything: Any + T produces T
- Numeric resolves to the concrete type: Numeric + Int produces Int, Numeric + Float produces Float
- Int promotes to Float: Int + Float produces Float
- Null wraps: Null + T produces Nullable(T)
- Nullable propagates: Nullable(A) + B produces Nullable(unified(A, B))
- Incompatible types fail: String + Int is a type error
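The unification rules listed above can be written out as a small Python function. This is a sketch under stated assumptions: types are plain strings, `Nullable(T)` is encoded as the tuple `("Nullable", T)`, and the order in which rules are tried is a choice made for this example rather than the type checker's documented order.

```python
NUMBERS = {"Int", "Float"}


def is_nullable(t):
    return isinstance(t, tuple) and t[0] == "Nullable"


def nullable(t):
    # Wrapping an already-nullable type leaves it unchanged.
    return t if is_nullable(t) else ("Nullable", t)


def inner(t):
    return t[1] if is_nullable(t) else t


def unify(a, b):
    if a == b:                       # same types unify to themselves
        return a
    if a == "Any":                   # Any unifies with anything
        return b
    if b == "Any":
        return a
    if a == "Null":                  # Null wraps the other side
        return nullable(b)
    if b == "Null":
        return nullable(a)
    if is_nullable(a) or is_nullable(b):   # Nullable propagates
        return nullable(unify(inner(a), inner(b)))
    if a == "Numeric" and b in NUMBERS:    # Numeric resolves to concrete
        return b
    if b == "Numeric" and a in NUMBERS:
        return a
    if {a, b} == NUMBERS:            # Int promotes to Float
        return "Float"
    raise TypeError(f"cannot unify {a} and {b}")
```

For instance, `unify(("Nullable", "Int"), "Float")` first unifies the inner `Int` with `Float` and then re-wraps the result, giving `("Nullable", "Float")`.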
Operators & Expressions
CXL provides arithmetic, comparison, boolean, null coalescing, and string operators. Boolean logic uses keywords (and, or, not), not symbols.
Arithmetic operators
| Operator | Description | Example |
|---|---|---|
| + | Addition (or string concatenation) | 2 + 3 |
| - | Subtraction | 10 - 4 |
| * | Multiplication | 3 * 5 |
| / | Division | 10 / 3 |
| % | Modulo (remainder) | 10 % 3 |
$ cxl eval -e 'emit result = 2 + 3 * 4'
{
"result": 14
}
Multiplication binds tighter than addition, so 2 + 3 * 4 is 2 + (3 * 4) = 14, not (2 + 3) * 4 = 20.
$ cxl eval -e 'emit result = 10 % 3'
{
"result": 1
}
Comparison operators
| Operator | Description | Example |
|---|---|---|
| == | Equal | x == 0 |
| != | Not equal | x != 0 |
| > | Greater than | x > 10 |
| < | Less than | x < 10 |
| >= | Greater than or equal | x >= 10 |
| <= | Less than or equal | x <= 10 |
$ cxl eval -e 'emit result = 5 > 3' --field dummy=1
{
"result": true
}
Boolean operators
CXL uses keywords for boolean logic. The symbols &&, ||, and ! are not valid CXL syntax.
| Operator | Description | Example |
|---|---|---|
| and | Logical AND | a and b |
| or | Logical OR | a or b |
| not | Logical NOT (unary) | not a |
$ cxl eval -e 'emit result = true and not false'
{
"result": true
}
$ cxl eval -e 'emit result = 5 > 3 or 10 < 2'
{
"result": true
}
Null coalesce operator
The ?? operator returns its left operand if non-null, otherwise its right operand.
$ cxl eval -e 'emit result = null ?? "default"'
{
"result": "default"
}
$ cxl eval -e 'emit result = "present" ?? "default"'
{
"result": "present"
}
String concatenation
The + operator concatenates strings when both operands are strings.
$ cxl eval -e 'emit result = "hello" + " " + "world"'
{
"result": "hello world"
}
Unary operators
| Operator | Description | Example |
|---|---|---|
| - | Numeric negation | -x |
| not | Boolean negation | not done |
$ cxl eval -e 'emit result = -42'
{
"result": -42
}
Method calls
Methods are called on a receiver using dot notation:
$ cxl eval -e 'emit result = "hello".upper()'
{
"result": "HELLO"
}
Methods can be chained:
$ cxl eval -e 'emit result = " hello ".trim().upper()'
{
"result": "HELLO"
}
Field references
Bare identifiers reference fields from the input record:
$ cxl eval -e 'emit result = price * qty' \
--field price=10 \
--field qty=3
{
"result": 30
}
Qualified field references use dot notation for multi-source pipelines: source.field.
Operator precedence
From highest (binds tightest) to lowest:
| Precedence | Operators | Associativity |
|---|---|---|
| 1 (highest) | . (method calls, field access) | Left |
| 2 | - (unary), not | Prefix |
| 3 | * / % | Left |
| 4 | + - | Left |
| 5 | == != > < >= <= | Left |
| 6 | and | Left |
| 7 | or | Left |
| 8 (lowest) | ?? | Right |
Use parentheses to override precedence:
$ cxl eval -e 'emit result = (2 + 3) * 4'
{
"result": 20
}
Comments
Line comments start with # (when not followed by a digit – digit-prefixed # starts a date literal):
# This is a comment
emit total = price * qty # inline comment
emit deadline = #2024-12-31# # this is a date literal, not a comment
Statements
CXL programs are sequences of statements that execute top-to-bottom against each input record. Statement order matters – later statements can reference values produced by earlier ones.
emit
The emit statement produces an output field. Each emit becomes a column in the output record.
emit name = expression
$ cxl eval -e 'emit greeting = "hello"' -e 'emit doubled = 21 * 2'
{
"greeting": "hello",
"doubled": 42
}
Multiple emit statements build up the output record field by field:
$ cxl eval -e 'emit first = "Alice"' -e 'emit last = "Smith"' \
-e 'emit full = first + " " + last'
{
"first": "Alice",
"last": "Smith",
"full": "Alice Smith"
}
let
The let statement creates a local variable binding. The variable is available to subsequent statements but is NOT included in the output record.
let name = expression
$ cxl eval -e 'let tax_rate = 0.21' -e 'emit tax = 100 * tax_rate'
{
"tax": 21.0
}
Note that tax_rate does not appear in the output – only emit statements produce output fields.
filter
The filter statement excludes records where the condition evaluates to false. When a filter excludes a record, remaining statements do not execute (short-circuit).
filter condition
$ cxl eval -e 'filter amount > 0' -e 'emit result = amount * 2' \
--field amount=5
{
"result": 10
}
When the filter condition is false, the entire record is dropped and no output is produced.
Filters can appear anywhere in the statement sequence. Place them early to skip unnecessary computation:
filter status == "active"
let discount = if tier == "gold" then 0.2 else 0.1
emit final_price = price * (1 - discount)
distinct
The distinct statement deduplicates records. The bare form deduplicates on all emitted fields. The by form deduplicates on a specific field.
distinct
distinct by field_name
In a pipeline, distinct tracks values seen so far and drops records that have already been emitted with the same key.
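As a rough illustration of "tracks values seen so far", here is a minimal Python sketch of streaming deduplication. The `distinct` function and its `by` parameter are hypothetical stand-ins for the statement forms above; how Clinker stores or spills its seen-set is not described here and is not modelled.

```python
# Hypothetical model of `distinct` and `distinct by field_name`.
def distinct(records, by=None):
    """Yield each record whose key has not been seen earlier in the stream."""
    seen = set()
    for record in records:
        if by is None:
            # Bare form: the key is every emitted field of the record.
            key = tuple(record.items())
        else:
            # `by` form: the key is a single field.
            key = record[by]
        if key in seen:
            continue  # duplicate: drop the record
        seen.add(key)
        yield record
```

In this sketch the first record carrying a given key is kept and later ones are dropped, and the seen-set grows with the number of distinct keys.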
emit meta
The emit meta statement writes a value to the $meta.* namespace – per-record metadata that is not part of the output columns. Metadata can be read by downstream nodes via $meta.field.
emit meta quality_flag = if amount < 0 then "suspect" else "ok"
Access metadata downstream:
filter $meta.quality_flag == "ok"
trace
The trace statement emits debug logging. It has no effect on the output record. Trace messages are only visible when tracing is enabled at the appropriate level.
trace "processing record"
trace warn "unusual value detected"
trace info if amount > 10000 then "high value transaction"
Trace levels: trace (default), debug, info, warn, error. An optional guard condition (via if) limits when the trace fires.
Statement ordering
Statements execute sequentially. A statement can reference any field or variable defined by a preceding emit or let:
$ cxl eval -e 'let base = 100' -e 'let rate = 0.15' \
-e 'emit subtotal = base * rate' \
-e 'emit total = base + subtotal'
{
"subtotal": 15.0,
"total": 115.0
}
Referencing a name before it is defined is a resolve-time error:
emit total = base + tax # error: 'base' is not defined yet
let base = 100
let tax = base * 0.21
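The sequential model described in this section (let binds a name, emit binds a name and writes an output field, filter stops evaluation) can be sketched in Python. The tuple encoding of statements and the `run` function are hypothetical. Unlike CXL, which rejects a forward reference at resolve time, this sketch would only fail at run time with a `KeyError`.

```python
# Hypothetical model of top-to-bottom statement execution for one record.
def run(program, record):
    """Return the output record, or None if a filter dropped the input."""
    env = dict(record)   # names visible to expressions: input fields first
    output = {}          # only emit statements write here
    for stmt in program:
        kind = stmt[0]
        if kind == "filter":
            if not stmt[1](env):
                return None          # short-circuit: skip remaining statements
        elif kind == "let":
            env[stmt[1]] = stmt[2](env)
        elif kind == "emit":
            value = stmt[2](env)
            env[stmt[1]] = value     # later statements can reference it
            output[stmt[1]] = value
        else:
            raise ValueError(f"unknown statement kind: {kind}")
    return output


program = [
    ("filter", lambda e: e["amount"] > 0),
    ("let", "rate", lambda e: 0.5),
    ("emit", "half", lambda e: e["amount"] * e["rate"]),
    ("emit", "total", lambda e: e["amount"] + e["half"]),
]
```

Running `run(program, {"amount": 10})` produces `{"half": 5.0, "total": 15.0}`; the let-bound `rate` is absent from the output, and a record with a non-positive `amount` yields `None`.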
use
The use statement imports a CXL module for reuse. See Modules & use for details.
use shared.dates as d
emit fy = d::fiscal_year(invoice_date)
Conditionals
CXL provides two conditional expression forms: if/then/else and match. Both are expressions – they return values and can be used anywhere an expression is expected.
If / then / else
The basic conditional expression:
if condition then value else alternative
$ cxl eval -e 'emit label = if amount > 100 then "high" else "low"' \
--field amount=250
{
"label": "high"
}
The else branch is optional. When it is omitted and the condition is false, the expression evaluates to null:
$ cxl eval -e 'emit bonus = if score > 90 then score * 0.1' \
--field score=80
{
"bonus": null
}
Chained conditionals
Chain multiple conditions with else if:
$ cxl eval -e 'emit tier = if amount > 1000 then "platinum"
else if amount > 500 then "gold"
else if amount > 100 then "silver"
else "bronze"' \
--field amount=750
{
"tier": "gold"
}
Nested usage
Since if/then/else is an expression, it can be used inside other expressions:
$ cxl eval -e 'emit price = base * (if member then 0.8 else 1.0)' \
--field base=100 \
--field member=true
{
"price": 80.0
}
Match
The match expression provides pattern matching. It comes in two forms: value matching (with a subject) and condition matching (without a subject).
Value form (with subject)
Match a subject expression against literal patterns:
match subject {
pattern1 => result1,
pattern2 => result2,
_ => default
}
$ cxl eval -e 'emit label = match status {
"A" => "Active",
"I" => "Inactive",
"P" => "Pending",
_ => "Unknown"
}' \
--field status=A
{
"label": "Active"
}
The wildcard _ is the catch-all arm. It matches any value not covered by preceding arms.
Condition form (without subject)
When no subject is provided, each arm’s pattern is evaluated as a boolean condition. This is CXL’s equivalent of SQL’s CASE WHEN:
match {
condition1 => result1,
condition2 => result2,
_ => default
}
$ cxl eval -e 'emit tier = match {
amount > 1000 => "high",
amount > 100 => "medium",
_ => "low"
}' \
--field amount=500
{
"tier": "medium"
}
Practical examples
Tiered pricing:
emit discount = match {
qty >= 1000 => 0.25,
qty >= 100 => 0.15,
qty >= 10 => 0.05,
_ => 0.0
}
Status code mapping:
emit status_text = match http_code {
200 => "OK",
201 => "Created",
400 => "Bad Request",
404 => "Not Found",
500 => "Internal Server Error",
_ => "HTTP " + http_code.to_string()
}
Region classification:
emit region = match country {
"US" => "North America",
"CA" => "North America",
"MX" => "North America",
"GB" => "Europe",
"DE" => "Europe",
"FR" => "Europe",
_ => "Other"
}
Match arms are evaluated in order
The first matching arm wins. Place more specific conditions before general ones:
# Correct: specific before general
emit category = match {
amount > 10000 => "enterprise",
amount > 1000 => "business",
_ => "personal"
}
# Wrong: the general first arm shadows the more specific arms below it
emit category = match {
amount > 0 => "personal", # catches everything positive
amount > 1000 => "business", # never reached
amount > 10000 => "enterprise", # never reached
_ => "unknown"
}
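The first-match-wins behaviour can be reproduced with an ordered list of predicates. This Python sketch mirrors the two examples above; the `first_match` helper and the arm lists are hypothetical and only model the evaluation order.

```python
# Hypothetical model of condition-form match: arms are tried in order.
def first_match(value, arms, default):
    """Return the result of the first arm whose predicate accepts value."""
    for predicate, result in arms:
        if predicate(value):
            return result
    return default  # the wildcard arm


correct_arms = [
    (lambda a: a > 10000, "enterprise"),
    (lambda a: a > 1000, "business"),
]
wrong_arms = [
    (lambda a: a > 0, "personal"),        # accepts every positive amount
    (lambda a: a > 1000, "business"),     # unreachable for positive amounts
    (lambda a: a > 10000, "enterprise"),  # unreachable for positive amounts
]
```

With an amount of 50000, `first_match(50000, correct_arms, "personal")` returns `"enterprise"`, while `first_match(50000, wrong_arms, "unknown")` returns `"personal"`.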
Built-in Methods
CXL provides built-in scalar methods organized into categories. Methods are called on a receiver value using dot notation: receiver.method(args).
Null propagation
Most methods return null when the receiver is null. This means null values flow through method chains without causing errors. The exceptions are documented in Introspection & Debug.
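A minimal Python sketch of this dispatch rule follows, using `None` for null. The `call_method` helper is hypothetical; the set of null-safe names is taken from the Introspection & Debug category.

```python
# Hypothetical model of null propagation through method calls.
NULL_SAFE = {"type_of", "is_null", "is_empty", "catch", "debug"}


def call_method(receiver, name, impl, *args):
    """Invoke impl on receiver, short-circuiting to None for null receivers."""
    if receiver is None and name not in NULL_SAFE:
        return None  # null propagates; impl is never invoked
    return impl(receiver, *args)


# A chain such as value.trim().upper() stays null end to end.
def trim_upper(value):
    trimmed = call_method(value, "trim", str.strip)
    return call_method(trimmed, "upper", str.upper)
```

Here `trim_upper(" hello ")` returns `"HELLO"` and `trim_upper(None)` returns `None` without raising.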
Method categories
String Methods (24 methods)
Text manipulation: case conversion, trimming, padding, searching, splitting, regex matching.
| Method | Description |
|---|---|
| upper, lower | Case conversion |
| trim, trim_start, trim_end | Whitespace removal |
| starts_with, ends_with, contains | Substring testing |
| replace | Find and replace |
| substring, left, right | Extraction |
| pad_left, pad_right | Padding |
| repeat, reverse | Repetition and reversal |
| length | Character count |
| split, join | Splitting and joining |
| matches, find, capture | Regex operations |
| format, concat | Formatting and concatenation |
Numeric Methods (8 methods)
Rounding, clamping, and comparison for integers and floats.
| Method | Description |
|---|---|
| abs | Absolute value |
| ceil, floor | Ceiling and floor |
| round, round_to | Rounding to decimal places |
| clamp | Constrain to range |
| min, max | Pairwise minimum/maximum |
Date & Time Methods (13 methods)
Date component extraction, arithmetic, and formatting.
| Method | Description |
|---|---|
| year, month, day | Date component extraction |
| hour, minute, second | Time component extraction (DateTime only) |
| add_days, add_months, add_years | Date arithmetic |
| diff_days, diff_months, diff_years | Date difference |
| format_date | Custom date formatting |
Conversion Methods (11 methods)
Type conversion in strict (error on failure) and lenient (null on failure) variants.
| Method | Description |
|---|---|
| to_int, to_float, to_string, to_bool | Strict conversion |
| to_date, to_datetime | Strict date parsing |
| try_int, try_float, try_bool | Lenient conversion |
| try_date, try_datetime | Lenient date parsing |
Introspection & Debug (5 methods)
Type inspection, null checking, and debugging. These are the only methods that accept null receivers without propagating null.
| Method | Description |
|---|---|
| type_of | Returns the type name as a string |
| is_null | Tests for null |
| is_empty | Tests for empty string, empty array, or null |
| catch | Null fallback (equivalent to ??) |
| debug | Passthrough with tracing side effect |
Path Methods (5 methods)
File path component extraction.
| Method | Description |
|---|---|
| file_name | Full filename with extension |
| file_stem | Filename without extension |
| extension | File extension |
| parent | Parent directory path |
| parent_name | Parent directory name |
Array Methods
Traversal and transformation over nested arrays. Closure-bearing methods take an arrow-syntax closure and evaluate it per element.
| Method | Description |
|---|---|
| filter, map, find, any, flat_map | Closure-bearing traversal |
| remove | Drop the element at a given index |
| length, join | Cross-listed on arrays (also defined on strings) |
Map Methods
Builders and accessors for Value::Map payloads. All map methods return new maps – they never mutate the receiver.
| Method | Description |
|---|---|
| keys, values | List map keys / values as arrays |
| merge | Union of two maps (right wins on conflict) |
| set | Insert / replace an entry, by single key or by a nested a.b[0].c path |
| remove_field | Drop a single entry by key |
String Methods
CXL provides 24 built-in methods for string manipulation. All string methods return null when the receiver is null (null propagation).
Case conversion
upper()
Converts all characters to uppercase.
$ cxl eval -e 'emit result = "hello world".upper()'
{
"result": "HELLO WORLD"
}
lower()
Converts all characters to lowercase.
$ cxl eval -e 'emit result = "Hello World".lower()'
{
"result": "hello world"
}
Whitespace trimming
trim()
Removes leading and trailing whitespace.
$ cxl eval -e 'emit result = " hello ".trim()'
{
"result": "hello"
}
trim_start()
Removes leading whitespace only.
$ cxl eval -e 'emit result = " hello ".trim_start()'
{
"result": "hello "
}
trim_end()
Removes trailing whitespace only.
$ cxl eval -e 'emit result = " hello ".trim_end()'
{
"result": " hello"
}
Substring testing
starts_with(prefix: String) -> Bool
Tests whether the string starts with the given prefix.
$ cxl eval -e 'emit result = "hello world".starts_with("hello")'
{
"result": true
}
ends_with(suffix: String) -> Bool
Tests whether the string ends with the given suffix.
$ cxl eval -e 'emit result = "report.csv".ends_with(".csv")'
{
"result": true
}
contains(substring: String) -> Bool
Tests whether the string contains the given substring.
$ cxl eval -e 'emit result = "hello world".contains("lo wo")'
{
"result": true
}
Find and replace
replace(find: String, replacement: String) -> String
Replaces all occurrences of find with replacement.
$ cxl eval -e 'emit result = "foo-bar-baz".replace("-", "_")'
{
"result": "foo_bar_baz"
}
Extraction
substring(start: Int [, length: Int]) -> String
Extracts a substring starting at start (0-based character index). If length is provided, takes at most that many characters. If omitted, takes all remaining characters.
$ cxl eval -e 'emit result = "hello world".substring(6)'
{
"result": "world"
}
$ cxl eval -e 'emit result = "hello world".substring(0, 5)'
{
"result": "hello"
}
left(n: Int) -> String
Returns the first n characters.
$ cxl eval -e 'emit result = "hello world".left(5)'
{
"result": "hello"
}
right(n: Int) -> String
Returns the last n characters.
$ cxl eval -e 'emit result = "hello world".right(5)'
{
"result": "world"
}
Padding
pad_left(width: Int [, char: String]) -> String
Left-pads the string to the given width. Default pad character is a space.
$ cxl eval -e 'emit result = "42".pad_left(5, "0")'
{
"result": "00042"
}
$ cxl eval -e 'emit result = "hi".pad_left(6)'
{
"result": " hi"
}
pad_right(width: Int [, char: String]) -> String
Right-pads the string to the given width. Default pad character is a space.
$ cxl eval -e 'emit result = "hi".pad_right(6, ".")'
{
"result": "hi...."
}
Repetition and reversal
repeat(n: Int) -> String
Repeats the string n times.
$ cxl eval -e 'emit result = "ab".repeat(3)'
{
"result": "ababab"
}
reverse() -> String
Reverses the characters in the string.
$ cxl eval -e 'emit result = "hello".reverse()'
{
"result": "olleh"
}
Length
length() -> Int
Returns the number of characters in the string. Also works on arrays, returning the number of elements.
$ cxl eval -e 'emit result = "hello".length()'
{
"result": 5
}
Splitting and joining
split(delimiter: String) -> Array
Splits the string by the delimiter, returning an array of strings.
$ cxl eval -e 'emit result = "a,b,c".split(",")'
{
"result": ["a", "b", "c"]
}
join(delimiter: String) -> String
Joins an array of values into a string with the given delimiter. The receiver must be an array.
$ cxl eval -e 'emit result = "a,b,c".split(",").join(" - ")'
{
"result": "a - b - c"
}
Regex operations
matches(pattern: String) -> Bool
Tests whether the string fully matches the given regex pattern.
$ cxl eval -e 'emit result = "abc123".matches("^[a-z]+[0-9]+$")'
{
"result": true
}
find(pattern: String) -> Bool
Tests whether the string contains a substring matching the given regex pattern (partial match).
$ cxl eval -e 'emit result = "hello world 42".find("[0-9]+")'
{
"result": true
}
capture(pattern: String [, group: Int]) -> String
Extracts a capture group from the first regex match. Default group is 0 (the full match).
$ cxl eval -e 'emit result = "order-12345".capture("order-([0-9]+)", 1)'
{
"result": "12345"
}
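The difference between the three regex methods (full match, partial match, group extraction) can be illustrated with Python's `re` module. This is only an analogy: Python's regex dialect is assumed to agree with CXL's for these simple patterns, and returning `None` from `capture` when nothing matches is an assumption of the sketch.

```python
import re


# Hypothetical Python analogues of matches, find, and capture.
def matches(text, pattern):
    """True only if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


def find(text, pattern):
    """True if any substring matches the pattern."""
    return re.search(pattern, text) is not None


def capture(text, pattern, group=0):
    """Return the requested group of the first match (0 = full match)."""
    found = re.search(pattern, text)
    return found.group(group) if found else None  # None on no match: assumed
```

For instance, `matches("abc123!", "[a-z]+[0-9]+")` is `False` because of the trailing `!`, whereas `find("abc123!", "[a-z]+[0-9]+")` is `True`.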
Formatting and concatenation
format(fmt: String) -> String
Formats the receiver value as a string.
$ cxl eval -e 'emit result = 42.format("")'
{
"result": "42"
}
concat(args: String…) -> String
Concatenates the receiver with one or more string arguments. Null arguments are treated as empty strings.
$ cxl eval -e 'emit result = "hello".concat(" ", "world")'
{
"result": "hello world"
}
This is variadic – it accepts any number of string arguments:
$ cxl eval -e 'emit result = "a".concat("b", "c", "d")'
{
"result": "abcd"
}
Numeric Methods
CXL provides 8 built-in methods for numeric operations. These methods work on both Integer and Float values (the Numeric receiver type). All return null when the receiver is null.
abs() -> Numeric
Returns the absolute value. Preserves the original type (Int stays Int, Float stays Float).
$ cxl eval -e 'emit result = (-42).abs()'
{
"result": 42
}
$ cxl eval -e 'emit result = (-3.14).abs()'
{
"result": 3.14
}
ceil() -> Int
Rounds up to the nearest integer. Returns the value unchanged for integers.
$ cxl eval -e 'emit result = 3.2.ceil()'
{
"result": 4
}
$ cxl eval -e 'emit result = (-3.2).ceil()'
{
"result": -3
}
floor() -> Int
Rounds down to the nearest integer. Returns the value unchanged for integers.
$ cxl eval -e 'emit result = 3.8.floor()'
{
"result": 3
}
$ cxl eval -e 'emit result = (-3.2).floor()'
{
"result": -4
}
round([decimals: Int]) -> Float
Rounds to the specified number of decimal places. Default is 0 decimal places.
$ cxl eval -e 'emit result = 3.456.round()'
{
"result": 3.0
}
$ cxl eval -e 'emit result = 3.456.round(2)'
{
"result": 3.46
}
round_to(decimals: Int) -> Float
Rounds to the specified number of decimal places. Unlike round(), the decimals argument is required.
$ cxl eval -e 'emit result = 3.14159.round_to(3)'
{
"result": 3.142
}
Use round_to when you want to be explicit about precision in financial or scientific calculations:
$ cxl eval -e 'emit price = 19.995.round_to(2)'
{
"price": 20.0
}
clamp(min: Numeric, max: Numeric) -> Numeric
Constrains the value to the given range. Returns min if the value is below it, max if above, or the value itself if within range.
$ cxl eval -e 'emit result = 150.clamp(0, 100)'
{
"result": 100
}
$ cxl eval -e 'emit result = (-5).clamp(0, 100)'
{
"result": 0
}
$ cxl eval -e 'emit result = 50.clamp(0, 100)'
{
"result": 50
}
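The clamp rule reduces to a min followed by a max. A Python sketch, with a hypothetical `clamp` function whose parameters `lo` and `hi` stand for the min and max arguments (the behaviour when the lower bound exceeds the upper bound is not specified above and is not handled here):

```python
# Hypothetical model of clamp(min, max); assumes lo <= hi.
def clamp(value, lo, hi):
    """Return lo if value < lo, hi if value > hi, otherwise value."""
    return max(lo, min(value, hi))
```

This reproduces the three cases shown: `clamp(150, 0, 100)` is `100`, `clamp(-5, 0, 100)` is `0`, and `clamp(50, 0, 100)` is `50`.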
min(other: Numeric) -> Numeric
Returns the smaller of the receiver and the argument.
$ cxl eval -e 'emit result = 10.min(20)'
{
"result": 10
}
$ cxl eval -e 'emit result = 10.min(5)'
{
"result": 5
}
max(other: Numeric) -> Numeric
Returns the larger of the receiver and the argument.
$ cxl eval -e 'emit result = 10.max(20)'
{
"result": 20
}
$ cxl eval -e 'emit result = 10.max(5)'
{
"result": 10
}
Practical examples
Clamp a percentage:
emit pct = (completed / total * 100).clamp(0, 100).round_to(1)
Absolute difference:
emit diff = (actual - expected).abs()
Floor division for batch numbering:
emit batch = (row_number / 1000).floor()
Date & Time Methods
CXL provides 13 built-in methods for date and time manipulation. These methods work on Date and DateTime values. All return null when the receiver is null.
Component extraction
year() -> Int
Returns the year component.
$ cxl eval -e 'emit result = #2024-03-15#.year()'
{
"result": 2024
}
month() -> Int
Returns the month component (1-12).
$ cxl eval -e 'emit result = #2024-03-15#.month()'
{
"result": 3
}
day() -> Int
Returns the day-of-month component (1-31).
$ cxl eval -e 'emit result = #2024-03-15#.day()'
{
"result": 15
}
hour() -> Int
Returns the hour component (0-23). DateTime only – returns null for Date values.
$ cxl eval -e 'emit result = "2024-03-15T14:30:00".to_datetime().hour()'
{
"result": 14
}
minute() -> Int
Returns the minute component (0-59). DateTime only – returns null for Date values.
$ cxl eval -e 'emit result = "2024-03-15T14:30:00".to_datetime().minute()'
{
"result": 30
}
second() -> Int
Returns the second component (0-59). DateTime only – returns null for Date values.
$ cxl eval -e 'emit result = "2024-03-15T14:30:45".to_datetime().second()'
{
"result": 45
}
Date arithmetic
add_days(n: Int) -> Date
Adds n days to the date. Use negative values to subtract. Works on both Date and DateTime.
$ cxl eval -e 'emit result = #2024-01-15#.add_days(10)'
{
"result": "2024-01-25"
}
$ cxl eval -e 'emit result = #2024-01-15#.add_days(-5)'
{
"result": "2024-01-10"
}
add_months(n: Int) -> Date
Adds n months to the date. Day is clamped to the last day of the target month if necessary.
$ cxl eval -e 'emit result = #2024-01-31#.add_months(1)'
{
"result": "2024-02-29"
}
$ cxl eval -e 'emit result = #2024-03-15#.add_months(-2)'
{
"result": "2024-01-15"
}
add_years(n: Int) -> Date
Adds n years to the date. Leap day (Feb 29) is clamped to Feb 28 in non-leap years.
$ cxl eval -e 'emit result = #2024-02-29#.add_years(1)'
{
"result": "2025-02-28"
}
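The day-clamping rule for add_months and add_years can be sketched with Python's standard library. The function names mirror the CXL methods but the code is a hypothetical model, not Clinker's implementation; treating add_years as twelve-month steps is an assumption that happens to give the Feb 29 behaviour described above.

```python
import calendar
from datetime import date


# Hypothetical model of month arithmetic with end-of-month clamping.
def add_months(d, n):
    """Shift d by n months, clamping the day to the target month's length."""
    index = d.year * 12 + (d.month - 1) + n   # months since year 0
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def add_years(d, n):
    """Shift d by n years; Feb 29 clamps to Feb 28 in non-leap years."""
    return add_months(d, 12 * n)
```

`add_months(date(2024, 1, 31), 1)` gives `date(2024, 2, 29)`, and `add_years(date(2024, 2, 29), 1)` gives `date(2025, 2, 28)`.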
Date difference
diff_days(other: Date) -> Int
Returns the number of days between the receiver and the argument (receiver - other). Positive when the receiver is later.
$ cxl eval -e 'emit result = #2024-03-15#.diff_days(#2024-03-01#)'
{
"result": 14
}
$ cxl eval -e 'emit result = #2024-01-01#.diff_days(#2024-03-15#)'
{
"result": -74
}
diff_months(other: Date) -> Int
Returns the difference in months between two dates.
Note: This method currently returns null (unimplemented). Use diff_days and divide by 30 as an approximation.
diff_years(other: Date) -> Int
Returns the difference in years between two dates.
Note: This method currently returns null (unimplemented). Use diff_days and divide by 365 as an approximation.
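To show what the day-based approximations yield, and where they drift, here is a Python sketch. The helper names are hypothetical, and truncating the quotient toward zero is an assumption; the text above only says to divide.

```python
from datetime import date


# Hypothetical helpers modelling the diff_days-based approximations.
def diff_days(a, b):
    """Days from b to a (positive when a is later)."""
    return (a - b).days


def approx_diff_months(a, b):
    """Approximate month difference: days / 30, truncated toward zero."""
    return int(diff_days(a, b) / 30)


def approx_diff_years(a, b):
    """Approximate year difference: days / 365, truncated toward zero."""
    return int(diff_days(a, b) / 365)
```

The approximation is coarse near boundaries: `approx_diff_months(date(2024, 3, 1), date(2024, 2, 1))` is `0`, because February 2024 has only 29 days even though the two dates are one calendar month apart.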
Formatting
format_date(format: String) -> String
Formats the date/datetime using a chrono format string. See chrono format syntax.
Common format specifiers:
| Specifier | Description | Example |
|---|---|---|
| %Y | 4-digit year | 2024 |
| %m | 2-digit month | 03 |
| %d | 2-digit day | 15 |
| %H | Hour (24h) | 14 |
| %M | Minute | 30 |
| %S | Second | 00 |
| %B | Full month name | March |
| %b | Abbreviated month | Mar |
| %A | Full weekday | Friday |
$ cxl eval -e 'emit result = #2024-03-15#.format_date("%B %d, %Y")'
{
"result": "March 15, 2024"
}
$ cxl eval -e 'emit result = #2024-03-15#.format_date("%Y/%m/%d")'
{
"result": "2024/03/15"
}
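The specifiers listed in the table also exist in Python's `strftime`, which makes it a convenient way to preview a format string. This is an analogy only: chrono and Python differ on other specifiers, and the `format_date` function below is a hypothetical stand-in that assumes the default (English) locale.

```python
from datetime import date, datetime


# Hypothetical Python analogue of format_date; None models a null receiver.
def format_date(value, fmt):
    """Format a date or datetime with strftime-style specifiers."""
    if value is None:
        return None
    return value.strftime(fmt)


long_form = format_date(date(2024, 3, 15), "%B %d, %Y")
slashed = format_date(date(2024, 3, 15), "%Y/%m/%d")
clock = format_date(datetime(2024, 3, 15, 14, 30, 0), "%A %H:%M:%S")
```

Under these assumptions `long_form` is `"March 15, 2024"`, `slashed` is `"2024/03/15"`, and `clock` is `"Friday 14:30:00"`.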
Practical examples
Fiscal year calculation (April start):
let d = invoice_date
emit fiscal_year = if d.month() < 4 then d.year() - 1 else d.year()
Age in days:
emit days_since = now.diff_days(created_date)
Quarter:
emit quarter = match {
invoice_date.month() <= 3 => "Q1",
invoice_date.month() <= 6 => "Q2",
invoice_date.month() <= 9 => "Q3",
_ => "Q4"
}
ISO week format:
emit formatted = order_date.format_date("%G-W%V") # %G is the ISO week-numbering year, which pairs with %V
Conversion Methods
CXL provides two families of conversion methods: strict (6 methods) and lenient (5 methods). Strict conversions raise an error on failure, halting pipeline execution. Lenient conversions return null on failure, allowing graceful handling of dirty data.
All conversion methods accept any receiver type (Any).
Strict conversions
Use strict conversions for required fields where invalid data should halt processing.
to_int() -> Int
Converts the receiver to an integer. Errors on failure.
- Float: truncates toward zero
- String: parses as integer
- Bool: true becomes 1, false becomes 0
$ cxl eval -e 'emit result = "42".to_int()'
{
"result": 42
}
$ cxl eval -e 'emit result = 3.9.to_int()'
{
"result": 3
}
to_float() -> Float
Converts the receiver to a float. Errors on failure.
- Integer: promotes to float
- String: parses as float
$ cxl eval -e 'emit result = "3.14".to_float()'
{
"result": 3.14
}
$ cxl eval -e 'emit result = 42.to_float()'
{
"result": 42.0
}
to_string() -> String
Converts any value to its string representation. Never fails.
$ cxl eval -e 'emit result = 42.to_string()'
{
"result": "42"
}
$ cxl eval -e 'emit result = true.to_string()'
{
"result": "true"
}
to_bool() -> Bool
Converts the receiver to a boolean. Errors on failure.
- String: "true", "1", "yes" become true; "false", "0", "no" become false (case-insensitive)
- Integer: 0 is false, everything else is true
$ cxl eval -e 'emit result = "yes".to_bool()'
{
"result": true
}
$ cxl eval -e 'emit result = 0.to_bool()'
{
"result": false
}
to_date([format: String]) -> Date
Parses a string to a Date. Without a format argument, expects ISO 8601 (YYYY-MM-DD). With a format, uses chrono strftime syntax.
$ cxl eval -e 'emit result = "2024-03-15".to_date()'
{
"result": "2024-03-15"
}
$ cxl eval -e 'emit result = "15/03/2024".to_date("%d/%m/%Y")'
{
"result": "2024-03-15"
}
to_datetime([format: String]) -> DateTime
Parses a string to a DateTime. Without a format argument, expects ISO 8601 (YYYY-MM-DDTHH:MM:SS). With a format, uses chrono strftime syntax.
$ cxl eval -e 'emit result = "2024-03-15T14:30:00".to_datetime()'
{
"result": "2024-03-15T14:30:00"
}
Lenient conversions
Use lenient conversions for optional or dirty data fields. They return null instead of raising errors, making them safe to combine with ?? for fallback values.
try_int() -> Int
Attempts to convert to integer. Returns null on failure.
$ cxl eval -e 'emit a = "42".try_int()' -e 'emit b = "abc".try_int()'
{
"a": 42,
"b": null
}
try_float() -> Float
Attempts to convert to float. Returns null on failure.
$ cxl eval -e 'emit a = "3.14".try_float()' -e 'emit b = "N/A".try_float()'
{
"a": 3.14,
"b": null
}
try_bool() -> Bool
Attempts to convert to boolean. Returns null on failure.
$ cxl eval -e 'emit a = "yes".try_bool()' -e 'emit b = "maybe".try_bool()'
{
"a": true,
"b": null
}
try_date([format: String]) -> Date
Attempts to parse a string as a Date. Returns null on failure.
$ cxl eval -e 'emit a = "2024-03-15".try_date()' \
-e 'emit b = "not a date".try_date()'
{
"a": "2024-03-15",
"b": null
}
try_datetime([format: String]) -> DateTime
Attempts to parse a string as a DateTime. Returns null on failure.
$ cxl eval -e 'emit a = "2024-03-15T14:30:00".try_datetime()' \
-e 'emit b = "invalid".try_datetime()'
{
"a": "2024-03-15T14:30:00",
"b": null
}
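The relationship between a strict method and its lenient twin can be sketched as "same conversion, different failure mode". The Python below is a hypothetical model of to_int and try_int using the rules listed for to_int; Python's own `int()` parser accepts some inputs (surrounding whitespace, underscores) that CXL may reject, so only the structure should be read from it.

```python
# Hypothetical model of the strict/lenient pairing for integer conversion.
def to_int(value):
    """Strict: convert or raise ValueError."""
    if value is None:
        return None                   # null receiver propagates
    if isinstance(value, bool):       # check bool before int (bool is an int)
        return 1 if value else 0
    if isinstance(value, int):
        return value
    if isinstance(value, float):
        return int(value)             # truncates toward zero
    if isinstance(value, str):
        return int(value)             # raises ValueError on bad input
    raise ValueError(f"cannot convert {value!r} to int")


def try_int(value):
    """Lenient: same conversion, but failure becomes None (null)."""
    try:
        return to_int(value)
    except (ValueError, OverflowError):
        return None
```

With this shape, `to_int("abc")` raises and would halt processing, while `try_int("abc")` returns `None` and leaves the decision to a later fallback.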
When to use each
Strict conversions (to_*) for:
- Required fields that must be valid
- Schema-enforced data where bad input should halt the pipeline
- Fields already validated upstream
Lenient conversions (try_*) for:
- Optional fields that may be missing or malformed
- Dirty data with mixed formats
- Fields where a fallback value is acceptable
Practical patterns
Safe numeric parsing with fallback:
emit amount = raw_amount.try_float() ?? 0.0
Parse dates from multiple formats:
emit parsed = raw_date.try_date("%Y-%m-%d")
?? raw_date.try_date("%m/%d/%Y")
?? raw_date.try_date("%d-%b-%Y")
Strict conversion for required fields:
emit employee_id = raw_id.to_int() # halts on bad data -- correct behavior
emit salary = raw_salary.to_float() # must be numeric
Lenient conversion for optional fields:
emit bonus = raw_bonus.try_float() # null if missing or non-numeric
emit total = salary + (bonus ?? 0.0) # safe arithmetic
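The multi-format date pattern above is a chain of lenient parses where the first non-null result wins. A Python sketch of the same idea, with hypothetical `try_date` and `parse_any` helpers built on `datetime.strptime` (whose format dialect is assumed to agree with chrono's for these three specifiers):

```python
from datetime import date, datetime

# Formats are tried in order, mirroring the chained ?? expression.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y")


def try_date(text, fmt="%Y-%m-%d"):
    """Lenient parse: return a date, or None when text does not fit fmt."""
    if text is None:
        return None
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


def parse_any(raw):
    """Return the first successful parse, or None if every format fails."""
    for fmt in FORMATS:
        parsed = try_date(raw, fmt)
        if parsed is not None:
            return parsed
    return None
```

Each of `"2024-03-15"`, `"03/15/2024"`, and `"15-Mar-2024"` parses to `date(2024, 3, 15)` here, and an unparseable string yields `None`. Order matters when formats overlap: an ambiguous value such as `"01/02/2024"` is read by whichever matching format comes first.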
Introspection & Debug
CXL provides 4 introspection methods and 1 debug method. These are the only methods that accept null receivers without propagating null – they are designed specifically for inspecting and handling null values.
type_of() -> String
Returns the type name of the receiver as a string. Works on any value, including null.
Type name strings: "String", "Int", "Float", "Bool", "Date", "DateTime", "Null", "Array", "Map".
$ cxl eval -e 'emit a = 42.type_of()' -e 'emit b = "hello".type_of()' \
-e 'emit c = null.type_of()'
{
"a": "Int",
"b": "String",
"c": "Null"
}
Useful for branching on dynamic types:
emit formatted = match value.type_of() {
"Int" => value.to_string() + " (integer)",
"Float" => value.round_to(2).to_string() + " (decimal)",
_ => value.to_string()
}
is_null() -> Bool
Returns true if the receiver is null, false otherwise. This is the primary way to test for null values – it is NOT subject to null propagation.
$ cxl eval -e 'emit a = null.is_null()' -e 'emit b = 42.is_null()'
{
"a": true,
"b": false
}
Use in filter statements:
filter not field.is_null()
is_empty() -> Bool
Returns true for empty strings, empty arrays, or null values. Returns false for all other values.
$ cxl eval -e 'emit a = "".is_empty()' -e 'emit b = "hello".is_empty()' \
-e 'emit c = null.is_empty()'
{
"a": true,
"b": false,
"c": true
}
Useful for filtering out blank or missing records:
filter not name.is_empty()
catch(fallback: Any) -> Any
Returns the receiver if it is non-null, otherwise returns the fallback value. This is the method equivalent of the ?? operator.
$ cxl eval -e 'emit a = null.catch("default")' \
-e 'emit b = "present".catch("default")'
{
"a": "default",
"b": "present"
}
catch and ?? are interchangeable:
# These two are equivalent:
emit name = raw_name.catch("Unknown")
emit name = raw_name ?? "Unknown"
debug(label: String) -> Any
Passes the receiver through unchanged while emitting a trace log with the given label. Zero overhead when tracing is disabled. The return value is always the receiver, making it safe to insert into any expression chain.
$ cxl eval -e 'emit result = 42.debug("check value")'
{
"result": 42
}
Insert debug anywhere in a method chain for inspection without affecting the output:
emit total = price.debug("price")
* qty.debug("qty")
When tracing is enabled, this produces log lines like:
TRACE source_row=1 source_file=input.csv: price: Integer(100)
TRACE source_row=1 source_file=input.csv: qty: Integer(5)
Null-safe summary
| Method | Null receiver behavior |
|---|---|
| type_of() | Returns "Null" |
| is_null() | Returns true |
| is_empty() | Returns true |
| catch(x) | Returns x |
| debug(l) | Passes through null, logs it |
| All other methods | Return null (propagation) |
Path Methods
CXL provides 5 built-in methods for extracting components from file path strings. All path methods take a string receiver and return a string. They return null when the receiver is null or when the requested component does not exist.
file_name() -> String
Returns the full filename (with extension) from the path.
$ cxl eval -e 'emit result = "/data/reports/sales.csv".file_name()'
{
"result": "sales.csv"
}
file_stem() -> String
Returns the filename without the extension.
$ cxl eval -e 'emit result = "/data/reports/sales.csv".file_stem()'
{
"result": "sales"
}
extension() -> String
Returns the file extension (without the leading dot).
$ cxl eval -e 'emit result = "/data/reports/sales.csv".extension()'
{
"result": "csv"
}
Returns null when no extension is present:
$ cxl eval -e 'emit result = "/data/reports/README".extension()'
{
"result": null
}
parent() -> String
Returns the parent directory path.
$ cxl eval -e 'emit result = "/data/reports/sales.csv".parent()'
{
"result": "/data/reports"
}
parent_name() -> String
Returns just the name of the parent directory (not the full path).
$ cxl eval -e 'emit result = "/data/reports/sales.csv".parent_name()'
{
"result": "reports"
}
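For readers who know Python's `pathlib`, the five path methods map onto familiar attributes. The sketch below is an analogy with a hypothetical `path_parts` helper; edge cases such as dotfiles, trailing slashes, or multiple extensions may behave differently in CXL.

```python
from pathlib import PurePosixPath


# Hypothetical pathlib analogue of the five CXL path methods.
def path_parts(path):
    """Return each component, using None where the component is absent."""
    if path is None:
        return None
    p = PurePosixPath(path)
    return {
        "file_name": p.name or None,
        "file_stem": p.stem or None,
        "extension": p.suffix[1:] or None,   # drop the leading dot
        "parent": str(p.parent),
        "parent_name": p.parent.name or None,
    }
```

For `"/data/reports/sales.csv"` this yields `sales.csv`, `sales`, `csv`, `/data/reports`, and `reports`; for `"/data/reports/README"` the `extension` entry is `None`.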
Practical examples
Organize output by source directory:
emit source_dir = $pipeline.source_file.parent_name()
emit source_type = $pipeline.source_file.extension()
Extract file identifiers:
emit file_id = $pipeline.source_file.file_stem()
emit is_csv = $pipeline.source_file.extension() == "csv"
Route by file type:
let ext = input_path.extension()
emit format = match ext {
"csv" => "delimited",
"json" => "structured",
"xml" => "markup",
_ => "unknown"
}
Array Methods
CXL provides closure-bearing and non-closure array builtins for traversing and transforming nested arrays carried on a single record. The closure-bearing methods take an arrow-syntax closure and evaluate it once per element.
Null propagation
Every array method returns null when the receiver is null. The closure body is not invoked on a null receiver.
Closure-bearing methods
filter(it => Bool) -> Array
Returns a new array containing the elements for which the closure body evaluates to true.
- type: transform
name: filter_items
input: orders
config:
cxl: |
emit kept = items.filter(it => it["price"] > 5)
For an input record where items is [{"sku":"a","price":10},{"sku":"b","price":20},{"sku":"c","price":5}], kept is [{"sku":"a","price":10},{"sku":"b","price":20}].
map(it => T) -> Array
Returns a new array whose elements are the closure body’s value for each input element. The element type need not match the input element type.
cxl: |
emit skus = items.map(it => it["sku"])
emit doubled_prices = items.map(it => it["price"] * 2)
skus is ["a", "b", "c"]; doubled_prices is [20, 40, 10].
find(it => Bool) -> Element | Null
Returns the first element for which the closure body evaluates to true. Returns null if no element matches.
cxl: |
emit first_premium = items.find(it => it["price"] > 15)
first_premium is {"sku":"b","price":20} for the running example.
any(it => Bool) -> Bool
Returns true if the closure body evaluates to true for at least one element. Returns false if no element matches (including on an empty array).
cxl: |
emit has_cheap = items.any(it => it["price"] < 10)
has_cheap is true.
flat_map(it => Array) -> Array
Like map, but the closure body returns an array per input element; the results are concatenated into a single flat array. A null body result contributes no elements; a non-array body result contributes a single element.
cxl: |
emit all_tags = items.flat_map(it => it["tags"])
For input items carrying tags arrays (e.g. [{"sku":"a","tags":["new"]},{"sku":"b","tags":["sale","new"]}]), all_tags is ["new","sale","new"].
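The closure-bearing methods can be modelled in a few lines of Python. This is a sketch of the documented behavior, not Clinker's implementation; the cxl_-prefixed function names are hypothetical, and Python callables stand in for the it => body closures.

```python
def cxl_filter(arr, pred):
    # Null receiver: return null without invoking the closure.
    return None if arr is None else [x for x in arr if pred(x)]

def cxl_map(arr, fn):
    return None if arr is None else [fn(x) for x in arr]

def cxl_find(arr, pred):
    if arr is None:
        return None
    # First matching element, or null when nothing matches.
    return next((x for x in arr if pred(x)), None)

def cxl_any(arr, pred):
    return None if arr is None else any(pred(x) for x in arr)

def cxl_flat_map(arr, fn):
    if arr is None:
        return None
    out = []
    for x in arr:
        result = fn(x)
        if result is None:
            continue              # null body result contributes nothing
        if isinstance(result, list):
            out.extend(result)    # array body result is concatenated
        else:
            out.append(result)    # non-array result contributes one element
    return out

items = [{"sku": "a", "price": 10}, {"sku": "b", "price": 20}, {"sku": "c", "price": 5}]
kept = cxl_filter(items, lambda it: it["price"] > 5)
skus = cxl_map(items, lambda it: it["sku"])
first_premium = cxl_find(items, lambda it: it["price"] > 15)
has_cheap = cxl_any(items, lambda it: it["price"] < 10)
```

The values bound at the bottom reproduce the running example used throughout this section.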
Non-closure methods
remove(index: Int) -> Array
Returns a new array with the element at the given 0-based index removed. The original array is unchanged.
cxl: |
emit shifted = items.remove(1)
shifted is [{"sku":"a","price":10},{"sku":"c","price":5}] – index 0 is preserved, index 2 shifts down to index 1.
If the index is negative or out of range, remove returns the receiver array unchanged.
length() -> Int
Returns the number of elements in the array. length is also defined on strings (see String Methods).
cxl: |
emit item_count = items.length()
item_count is 3.
join(separator: String) -> String
Joins an array of values into a single string with the given separator between elements. Defined as a string method (see String Methods) but accepts array receivers.
cxl: |
emit sku_list = items.map(it => it["sku"]).join(", ")
sku_list is "a, b, c".
Bracket indexing vs .remove
Bracket indexing (items[0]) reads an element by position and returns null when out of range. .remove(idx) returns a new array with the element dropped; out-of-range indices leave the array unchanged. See Nested Paths for the index-access surface.
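The contrast between the two can be sketched in Python. The helper names index_or_null and remove_at are hypothetical; the out-of-range and negative-index behavior follows the rules stated in this section and in Nested Paths.

```python
def index_or_null(arr, i):
    # Bracket indexing: null receiver, negative, or out-of-range index -> null.
    if arr is None or i < 0 or i >= len(arr):
        return None
    return arr[i]

def remove_at(arr, i):
    # .remove(idx): null receiver -> null; bad index -> receiver unchanged.
    if arr is None:
        return None
    if i < 0 or i >= len(arr):
        return arr
    # Build a new list so the original array is left untouched.
    return arr[:i] + arr[i + 1:]

items = [{"sku": "a", "price": 10}, {"sku": "b", "price": 20}, {"sku": "c", "price": 5}]
shifted = remove_at(items, 1)
```

Note the asymmetry: a bad index turns a read into null, but turns a remove into a no-op.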
See also
- Closures – the it => body form used by closure-bearing array methods.
- Map Methods – builtins that operate on the map elements typically iterated by these array methods.
- Nested Paths – bracket-index and dotted-path access through nested arrays and maps.
- Emit Each – fan one input record into many output records, one per array element.
Map Methods
CXL provides five built-in methods for working with map values (key-value pairs). Maps arise naturally from JSON object inputs, from the set builder below, and from upstream emits that produce nested structures.
All map methods return new values – they never mutate the receiver. This is copy-on-write semantics: chaining .set then .remove_field produces a fresh map at each step, leaving the upstream binding untouched.
Null propagation
Every map method returns null when the receiver is null or is not a Value::Map.
Method reference
keys() -> Array
Returns the map’s keys as an array of strings, preserving insertion order.
- type: transform
name: list_keys
input: rows
config:
cxl: |
emit field_names = profile.keys()
For an input record where profile is {"name":"Alice","tier":"gold","since":"2021-04"}, field_names is ["name","tier","since"].
values() -> Array
Returns the map’s values as an array, preserving insertion order. Value types are heterogeneous – the array carries each value as-is.
cxl: |
emit field_values = profile.values()
field_values is ["Alice","gold","2021-04"].
merge(other: Map) -> Map
Returns a new map containing every key from the receiver and from other. On conflicting keys, other’s value wins.
cxl: |
emit enriched = profile.merge(overrides)
For profile = {"name":"Alice","tier":"gold"} and overrides = {"tier":"platinum","since":"2021-04"}, enriched is {"name":"Alice","tier":"platinum","since":"2021-04"}.
set(key: String, value: Any) -> Map
Returns a new map with key set to value. If the key was already present, its value is replaced; insertion order is preserved.
cxl: |
emit stamped = profile.set("region", "us-east")
stamped is {"name":"Alice","tier":"gold","since":"2021-04","region":"us-east"}.
Nested paths
key may be a dotted/indexed path that descends into nested maps and arrays, so a single set writes into a deep document. Dots separate map keys; a [n] suffix indexes an array.
cxl: |
emit moved = profile.set("address.city", "NYC")
emit relabel = order.set("items[0].sku", "A-100")
- Auto-create. Missing intermediate map segments are created as empty maps, so a path can build structure that does not yet exist. {}.set("a.b.c", 7) returns {"a":{"b":{"c":7}}}. This is what lets set assemble a nested document from scratch (matching jq setpath and Bloblang assignment).
- Type conflict -> null. If an intermediate segment already exists but is the wrong kind for the next step – descending into a key whose value is a scalar, indexing a map with [n], or naming a field on an array – the whole operation returns null. Nothing is partially written.
- Array index past the end -> null. Indexing past the last element returns null for the whole operation; arrays are never silently grown. The path can only overwrite an array slot that already exists.
- A bare key is a single key, not a path. "region" writes the top-level region. Only . and [n] introduce nesting; a key with neither behaves exactly as before.
For profile = {"name":"Alice","address":{"city":"LA"}}, profile.set("address.city", "NYC") is {"name":"Alice","address":{"city":"NYC"}} – the sibling name and any other address keys are preserved.
Known limitation. Because . and [ are path syntax, set cannot target a key whose name literally contains a . or [ (for example a JSON field literally named "a.b"). To write such a key, build it with merge and a map literal; to remove it, use remove_field, which matches the exact key string.
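The nested-path rules above can be modelled as a small Python function. This is a sketch under stated assumptions, not Clinker's implementation: set_path and parse_path are hypothetical names, the path grammar is reduced to dotted keys and [n] suffixes, and Python dicts and lists stand in for CXL maps and arrays.

```python
import copy
import re

# A segment is either a map key (no '.', '[' or ']') or an array index "[n]".
_SEGMENT = re.compile(r"([^.\[\]]+)|\[(\d+)\]")

def parse_path(path):
    return [key if key else int(index) for key, index in _SEGMENT.findall(path)]

def set_path(doc, path, value):
    if not isinstance(doc, dict):
        return None                      # null or non-map receiver
    result = copy.deepcopy(doc)          # never mutate the receiver
    node = result
    segments = parse_path(path)
    for pos, seg in enumerate(segments):
        last = pos == len(segments) - 1
        if isinstance(seg, str):
            if not isinstance(node, dict):
                return None              # type conflict: key on array/scalar
            if last:
                node[seg] = value
            else:
                node = node.setdefault(seg, {})   # auto-create missing maps
        else:
            if not isinstance(node, list) or seg >= len(node):
                return None              # [n] on a non-array, or past the end
            if last:
                node[seg] = value
            else:
                node = node[seg]
    return result
```

Because the write happens on a deep copy that is discarded on any conflict, the "nothing is partially written" guarantee falls out naturally in this model.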
remove_field(key: String) -> Map
Returns a new map without key. If the key was absent, the receiver is returned unchanged.
cxl: |
emit slim = profile.remove_field("since")
slim is {"name":"Alice","tier":"gold"}.
Worked example: chained set + remove_field
Map methods compose naturally because each returns a new map.
- type: transform
name: rewrite_profile
input: rows
config:
cxl: |
emit profile =
profile.set("region", "us-east").remove_field("internal_id")
For profile = {"name":"Alice","internal_id":"ix-77","tier":"gold"}, the emitted profile is {"name":"Alice","tier":"gold","region":"us-east"}. The internal_id slot is removed and the region slot is appended; both happen on a fresh map so the upstream record’s profile is unaffected for any other downstream branch.
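Python dicts preserve insertion order, so they make a convenient model for the copy semantics described here. The sketch below uses hypothetical helper names, handles bare keys only (no nested paths), and assumes the argument to merge is itself a map.

```python
def keys(m):
    return list(m.keys()) if isinstance(m, dict) else None

def values(m):
    return list(m.values()) if isinstance(m, dict) else None

def merge(m, other):
    # On conflicting keys the value from `other` wins.
    return {**m, **other} if isinstance(m, dict) else None

def set_key(m, key, value):
    # An existing key keeps its position; a new key is appended.
    return {**m, key: value} if isinstance(m, dict) else None

def remove_field(m, key):
    if not isinstance(m, dict):
        return None
    return {k: v for k, v in m.items() if k != key}

profile = {"name": "Alice", "internal_id": "ix-77", "tier": "gold"}
rewritten = remove_field(set_key(profile, "region", "us-east"), "internal_id")
```

Each helper builds a new dict, so profile is still the three-key original after the chain runs.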
Parentheses are required
All map methods are method calls and must be written with parentheses, even the zero-argument ones:
profile.keys() -- ok
profile.keys -- parses as a field lookup, not a method call
profile.keys parses as a dotted path – a lookup for a field literally named keys inside profile. That path almost certainly returns null. Always include the parentheses when invoking a map method.
Using map methods inside array closures
Map methods compose with closure-bearing array builtins when the array elements are themselves maps.
cxl: |
emit enriched_items = items.map(it => it.set("region", "us-east"))
emit item_keys = items.map(it => it.keys())
Each it is a map; the closure body invokes a map method on it. enriched_items is an array where every element gained a region field. item_keys is an array of key-name arrays, one per element.
See also
- Closures – arrow-syntax closures often invoke map methods on their it binding.
- Array Methods – closure-bearing array methods commonly carry maps as their elements.
- Nested Paths – bracket-index access (profile["name"]) reads a single key without producing a new map.
Window Functions
Window functions allow CXL expressions to access aggregated values across a set of records within an analytic window. Unlike aggregate functions (which collapse groups into single rows), window functions attach computed values to each individual record.
Window functions are accessed via the $window.* namespace and require an analytic_window: configuration on the transform node.
Configuring an analytic window
Window functions are only available in transform nodes that declare an analytic_window: section in YAML:
nodes:
- name: ranked_sales
type: transform
input: raw_sales
analytic_window:
group_by: [region]
sort_by:
- field: amount
order: desc
cxl: |
emit region = region
emit amount = amount
emit running_total = $window.sum(amount)
emit rank_position = $window.count()
Window configuration fields
| Field | Description |
|---|---|
| group_by | List of fields to partition the window by (the SQL PARTITION BY axis). |
| sort_by | List of { field, order } ordering specifications (order is asc or desc). |
| source | Optional explicit source-name reference for cross-source windows. |
| on | Optional cross-source partition-lookup field. |
Frame specification (frame: { rows: ... } / frame: { range: ... }) is not yet plumbed through the YAML parser; today every window evaluates with rows: unbounded_preceding..current_row semantics, which matches the SQL default for the listed window functions. See the deferred-work tracker for status of explicit frame syntax.
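To make the implicit frame concrete, the following Python sketch partitions rows by one field, sorts each partition, and computes a running sum and count over "everything up to and including the current row". The function name and output field names are hypothetical, and the order in which partitions are emitted is an artifact of the sketch rather than a statement about Clinker's output order.

```python
from collections import defaultdict

def running_window(rows, group_key, sort_key, field, descending=False):
    # Partition (group_by), keeping first-seen partition order.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[group_key]].append(row)

    out = []
    for part in partitions.values():
        # Order within the partition (sort_by).
        part.sort(key=lambda r: r[sort_key], reverse=descending)
        total = 0
        for position, row in enumerate(part, start=1):
            # Frame = unbounded_preceding..current_row, so both values accumulate.
            total += row[field]
            out.append({**row, "running_total": total, "running_count": position})
    return out

rows = [
    {"region": "east", "amount": 30},
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 20},
]
result = running_window(rows, "region", "amount", "amount", descending=True)
```

Under this frame a windowed sum is a running total and a windowed count is the row's 1-based position, which is why the example above can use $window.count() as a position.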
Aggregate window functions
These compute aggregate values over the window frame.
$window.sum(field)
Sum of the field values in the window frame.
emit running_total = $window.sum(amount)
$window.avg(field)
Average of the field values in the window frame. Returns Float.
emit moving_avg = $window.avg(amount)
$window.min(field)
Minimum value in the window frame.
emit window_min = $window.min(amount)
$window.max(field)
Maximum value in the window frame.
emit window_max = $window.max(amount)
$window.count()
Count of records in the window frame. Takes no arguments.
emit window_size = $window.count()
$window.first_value(field)
Returns the value of field at the first record of the window frame
(ordered by sort_by). Equivalent to SQL FIRST_VALUE(field).
emit opening_amount = $window.first_value(amount)
$window.last_value(field)
Returns the value of field at the last record of the window frame
(ordered by sort_by). Equivalent to SQL LAST_VALUE(field).
emit closing_amount = $window.last_value(amount)
Ranking window functions
Zero-argument integer functions that return the current row’s rank within its partition.
$window.row_number()
1-indexed position of the current record within its partition.
emit row_idx = $window.row_number()
$window.rank()
SQL RANK(): rows that share the same sort_by tuple receive the same
rank, and the next distinct row jumps by the size of the tie group.
emit sales_rank = $window.rank()
$window.dense_rank()
SQL DENSE_RANK(): ties share a rank with no gaps between distinct
ranks.
emit sales_dense_rank = $window.dense_rank()
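The difference between the three ranking functions is easiest to see side by side. The sketch below follows the standard SQL definitions cited above; rank_partition is a hypothetical helper that takes one partition's sort_by keys, already in window order.

```python
def rank_partition(sort_keys):
    """Return (row_number, rank, dense_rank) for each row of one partition."""
    out = []
    rank = 0
    dense = 0
    previous = object()  # sentinel that never equals a real key
    for row_number, key in enumerate(sort_keys, start=1):
        if key != previous:
            rank = row_number   # RANK jumps to the row's position after a tie group
            dense += 1          # DENSE_RANK advances by exactly one
            previous = key
        out.append((row_number, rank, dense))
    return out

ranks = rank_partition([100, 90, 90, 80])
```

For the keys 100, 90, 90, 80 the two tied rows share rank 2; the following row gets rank 4 but dense rank 3.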
Positional window functions
These access specific records by position within the window frame.
$window.first()
Returns the value of the current field from the first record in the window frame.
emit first_amount = $window.first()
$window.last()
Returns the value of the current field from the last record in the window frame.
emit last_amount = $window.last()
$window.lag(n)
Returns the value from n records before the current record. Returns null if there is no record at that offset.
emit prev_amount = $window.lag(1)
emit two_back = $window.lag(2)
$window.lead(n)
Returns the value from n records after the current record. Returns null if there is no record at that offset.
emit next_amount = $window.lead(1)
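The offset lookups can be modelled over a plain list. This sketch assumes offsets are counted within the sorted partition; because this section does not say which field's value lag and lead return, the helpers (hypothetical names) simply work on a list of values.

```python
def lag(values, i, n):
    # Value n records before position i, or null when there is none.
    j = i - n
    return values[j] if 0 <= j < len(values) else None

def lead(values, i, n):
    # Value n records after position i, or null when there is none.
    j = i + n
    return values[j] if 0 <= j < len(values) else None

revenue = [100, 120, 90]
day_over_day = []
for i, today in enumerate(revenue):
    previous = lag(revenue, i, 1)
    # Mirrors `revenue - ($window.lag(1) ?? revenue)`: fall back to today's value.
    day_over_day.append(today - (previous if previous is not None else today))
```

The null fallback makes the first row of each partition report a change of zero instead of propagating null.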
Iterable window functions
These evaluate predicates or collect values across the window.
$window.any(predicate)
Returns true if the predicate is true for any record in the window.
emit has_high = $window.any(amount > 1000)
$window.every(predicate)
Returns true if the predicate is true for every record in the window.
emit all_positive = $window.every(amount > 0)
$window.exists(predicate)
Returns true if the predicate is true for at least one record in the
window — a SQL-fluency alias of $window.any.
emit any_high = $window.exists(amount > 1000)
$window.not_exists(predicate)
Returns true if no record in the window satisfies the predicate.
Equivalent to not $window.exists(predicate) and to
$window.every(not predicate).
emit none_negative = $window.not_exists(amount < 0)
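The relationships among the four predicate functions reduce to Python's any and all. The sketch uses hypothetical helper names and ignores null predicate results, which this section does not specify.

```python
def w_any(rows, pred):
    return any(pred(r) for r in rows)

def w_every(rows, pred):
    return all(pred(r) for r in rows)

# exists is documented as an alias of any.
w_exists = w_any

def w_not_exists(rows, pred):
    return not w_exists(rows, pred)

window = [{"amount": 10}, {"amount": 20}, {"amount": 1500}]
is_negative = lambda r: r["amount"] < 0
none_negative = w_not_exists(window, is_negative)
# Same answer via the documented equivalence: every(not predicate).
same_answer = w_every(window, lambda r: not is_negative(r))
```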
$window.collect(field)
Collects all values of the field in the window into an array.
emit all_amounts = $window.collect(amount)
$window.distinct(field)
Collects distinct values of the field in the window into an array.
emit unique_regions = $window.distinct(region)
Complete example
nodes:
- name: sales_analysis
type: transform
input: daily_sales
analytic_window:
group_by: [store_id]
sort_by:
- field: sale_date
order: asc
cxl: |
emit store_id = store_id
emit sale_date = sale_date
emit daily_revenue = revenue
emit running_avg = $window.avg(revenue)
emit running_total = $window.sum(revenue)
emit prev_day_revenue = $window.lag(1)
emit day_over_day = revenue - ($window.lag(1) ?? revenue)
This computes per-store running averages and totals over the partition’s history-up-to-and-including the current row.
Retraction interaction
When a window sits downstream of a relaxed-CK aggregate whose dropped correlation-key fields overlap the window’s group_by, the planner switches the window from streaming-emit to buffer-mode. The window operator stores per-partition raw row buffers until commit; on retraction, it reruns the configured $window.* evaluation over partition − retracted_rows and emits per-output deltas through the replay phase.
Every $window.* function documented above is covered uniformly by wholesale recompute. The operator-by-operator retraction cost reference has the per-operator memory ceilings; clinker run --explain reports the live per-window detail.
Aggregate Functions
Aggregate functions operate across grouped record sets in aggregate nodes, collapsing multiple input records into summary rows. They are distinct from window functions, which attach computed values to each individual record.
Aggregate functions
CXL provides 7 aggregate functions. These are called as free-standing function calls (not method calls) within the CXL block of an aggregate node.
| Function | Signature | Returns | Description |
|---|---|---|---|
| sum(expr) | Numeric | Numeric | Sum of values |
| count(*) | – | Int | Count of records in the group |
| avg(expr) | Numeric | Float | Arithmetic mean |
| min(expr) | Any | Any | Minimum value |
| max(expr) | Any | Any | Maximum value |
| collect(expr) | Any | Array | All values collected into an array |
| weighted_avg(value, weight) | Numeric, Numeric | Float | Weighted arithmetic mean |
YAML aggregate node
Aggregate functions are used inside the cxl: block of a node with type: aggregate. The node must declare group_by: fields.
nodes:
- name: dept_summary
type: aggregate
input: employees
group_by: [department]
cxl: |
emit total_salary = sum(salary)
emit headcount = count(*)
emit avg_salary = avg(salary)
emit max_salary = max(salary)
emit min_salary = min(salary)
Group-by fields pass through automatically
Fields listed in group_by: are automatically included in the output. You do NOT need to emit them – they are carried through as group keys.
In the example above, department is automatically present in every output record without an explicit emit department = department statement.
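A small Python model shows why the group keys need no explicit emit: each output record starts from the key tuple, and the emitted aggregates are added on top. The aggregate function and its emits argument are hypothetical names for this sketch; it is not how Clinker executes aggregate nodes.

```python
def aggregate(rows, group_by, emits):
    # Bucket rows by their group key, keeping first-seen group order.
    groups = {}
    for row in rows:
        key = tuple(row[field] for field in group_by)
        groups.setdefault(key, []).append(row)

    out = []
    for key, members in groups.items():
        record = dict(zip(group_by, key))   # group keys pass through
        for name, fn in emits.items():
            record[name] = fn(members)      # one value per emit statement
        out.append(record)
    return out

employees = [
    {"department": "eng", "salary": 100},
    {"department": "eng", "salary": 120},
    {"department": "ops", "salary": 90},
]
summary = aggregate(
    employees,
    group_by=["department"],
    emits={
        "total_salary": lambda rows: sum(r["salary"] for r in rows),
        "headcount": len,
    },
)
```

The result has one row per department, each carrying department alongside the two emitted fields.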
Function details
sum(expr) -> Numeric
Computes the sum of the expression across all records in the group. Null values are skipped.
cxl: |
emit total_revenue = sum(price * quantity)
count(*) -> Int
Counts the number of records in the group. The argument is the wildcard *.
cxl: |
emit num_orders = count(*)
avg(expr) -> Float
Computes the arithmetic mean. Null values are skipped. Returns Float.
cxl: |
emit avg_order_value = avg(order_total)
min(expr) -> Any
Returns the minimum value in the group. Works on numeric, string, and date types.
cxl: |
emit earliest_order = min(order_date)
emit lowest_price = min(unit_price)
max(expr) -> Any
Returns the maximum value in the group. Works on numeric, string, and date types.
cxl: |
emit latest_order = max(order_date)
emit highest_price = max(unit_price)
collect(expr) -> Array
Collects all values of the expression into an array. Useful for building lists of values per group.
cxl: |
emit all_order_ids = collect(order_id)
weighted_avg(value, weight) -> Float
Computes a weighted average: sum(value * weight) / sum(weight). Takes two arguments.
cxl: |
emit weighted_price = weighted_avg(unit_price, quantity)
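The null-skipping rule for sum and avg and the weighted_avg formula can be checked with a few lines of Python. The helper names are hypothetical. What the real functions return for an empty or all-null group, or for a total weight of zero, is not stated in this section, so returning None for avg and weighted_avg in those cases is an assumption of the sketch.

```python
def agg_sum(values):
    # Null values are skipped.
    return sum(v for v in values if v is not None)

def agg_avg(values):
    present = [v for v in values if v is not None]
    # Assumption: no non-null values -> null.
    return sum(present) / len(present) if present else None

def weighted_avg(pairs):
    # sum(value * weight) / sum(weight)
    numerator = sum(value * weight for value, weight in pairs)
    denominator = sum(weight for _, weight in pairs)
    # Assumption: zero total weight -> null rather than a division error.
    return numerator / denominator if denominator else None

weighted_price = weighted_avg([(10, 1), (20, 3)])  # (10*1 + 20*3) / (1 + 3)
```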
Aggregates vs. windows
| Feature | Aggregate node | Window function |
|---|---|---|
| Record output | One row per group | One row per input record |
| Syntax | sum(field) (free-standing) | $window.sum(field) (namespace) |
| Configuration | type: aggregate + group_by: | type: transform + analytic_window: |
| Use case | Summarize groups | Enrich records with group context |
Combining aggregates with expressions
Aggregate function calls can be mixed with regular CXL expressions in emit statements:
nodes:
- name: category_stats
type: aggregate
input: products
group_by: [category]
cxl: |
emit total_revenue = sum(price * quantity)
emit avg_price = avg(price)
emit margin_pct = (sum(revenue) - sum(cost)) / sum(revenue) * 100
emit product_count = count(*)
emit has_premium = max(price) > 100
Restrictions
- let bindings in aggregate transforms are restricted to row-pure expressions (no aggregate function calls in let).
- filter in aggregate transforms runs pre-aggregation – it filters input records before grouping.
- distinct is not permitted inside aggregate transforms. Place a separate distinct transform upstream.
Complete example
pipeline:
name: sales_summary
nodes:
- name: raw_sales
type: source
format: csv
path: sales.csv
- name: monthly_summary
type: aggregate
input: raw_sales
group_by: [region, month]
cxl: |
emit total_sales = sum(amount)
emit order_count = count(*)
emit avg_order = avg(amount)
emit top_sale = max(amount)
emit all_reps = collect(sales_rep)
- name: output
type: output
input: monthly_summary
format: csv
path: summary.csv
Closures
CXL supports arrow-syntax closures as arguments to closure-bearing array builtins like filter, map, find, any, and flat_map. They give CXL a way to express element-by-element predicates and projections over nested arrays carried inside a single record – without writing a separate transform node per element.
Syntax
it => expression
A closure has one parameter, named it, and a single expression body. The arrow => separates them.
- type: transform
name: filter_items
input: orders
config:
cxl: |
emit kept = items.filter(it => it["price"] > 5)
The body is an expression, not a block of statements. Use if/then/else or match if you need branching inside a closure.
cxl: |
emit price_buckets = items.map(it =>
if it["price"] >= 100 then "premium"
else if it["price"] >= 10 then "standard"
else "value")
Parameter name
The parameter is always it. Other identifiers are not accepted as the closure binding:
items.filter(item => item["price"] > 5) -- parse error
items.filter(it => it["price"] > 5) -- ok
it is recognized in expression position only inside a closure body. Outside of one, it has no special meaning.
Lexical capture
Inside the closure body, the outer record’s fields and let bindings remain visible. For each iteration the closure parameter it is bound to the current element, the body evaluates, and the binding is then dropped before the next iteration.
cxl: |
let threshold = 10
emit kept = items.filter(it => it["price"] > threshold)
Here the closure body reads both it (the current array element) and threshold (an outer let binding). The record’s fields are also reachable by name – a closure over items can still read customer_id, region, or any other field on the same record.
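Lexical capture works the same way in most languages with lambdas. As a rough analogy in Python (the function and record layout are hypothetical), the lambda's parameter plays the role of it while threshold and the record's own fields come from the enclosing scope:

```python
def keep_above(record, threshold):
    # `it` is rebound for every element; `threshold` and `record` are captured
    # from the enclosing scope, like an outer let binding and the record's fields.
    return [
        {**it, "customer_id": record["customer_id"]}
        for it in filter(lambda it: it["price"] > threshold, record["items"])
    ]

order = {
    "customer_id": "C-9",
    "items": [{"sku": "a", "price": 10}, {"sku": "b", "price": 20}, {"sku": "c", "price": 5}],
}
kept = keep_above(order, threshold=10)
```

Unlike Python, CXL does not let the closure itself be stored or passed around, as the next section explains.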
Where closures appear
Closures are valid only as method-call arguments to closure-bearing builtins. They cannot be assigned to variables, stored in fields, or passed to non-closure builtins:
let f = it => it * 2 -- rejected at resolve time
emit doubler = it => it * 2 -- rejected at resolve time
If you need to share a closure across multiple call sites, repeat the literal closure expression. CXL has no first-class function values.
Null propagation
Closure-bearing builtins applied to a null receiver return null without evaluating the body, so the body never runs for records where the array is null:
cxl: |
emit kept = items.filter(it => it["price"] > 5)
-- when `items` is null, `kept` is null; the body never runs
This matches the null-propagation policy on every other builtin – see Null Handling for the wider rules.
Worked example: filter and map over a nested array
Suppose each input record carries an items array of objects, each with sku and price:
{"order_id":"O-1","items":[{"sku":"a","price":10},{"sku":"b","price":20},{"sku":"c","price":5}]}
A transform that drops cheap items and projects the remaining SKUs:
- type: transform
name: filter_items
input: orders
config:
cxl: |
emit order_id = order_id
emit kept = items.filter(it => it["price"] > 5)
emit kept_skus = items.filter(it => it["price"] > 5).map(it => it["sku"])
For the input above, the transform produces:
{
"order_id": "O-1",
"kept": [{"sku": "a", "price": 10}, {"sku": "b", "price": 20}],
"kept_skus": ["a", "b"]
}
Bracket-index access (it["price"]) reaches into each map element. See Nested Paths for the full traversal surface.
See also
- Array Methods – the closure-bearing builtins (filter, map, find, any, flat_map).
- Map Methods – callable on map elements inside a closure body.
- Nested Paths – bracket-index and dotted-path navigation through nested arrays and maps.
- Emit Each – statement that fans one input record into many output records, using a binding similar to the closure parameter.
Nested Paths
CXL records can carry nested arrays and maps as field values (for example, a JSON input where each record has an items array of objects). Reaching into that structure uses two complementary forms: dotted paths and bracket indices.
Dotted paths
A dotted identifier path reads a static field name from a map.
doc.metadata.tenant
Each segment must be a valid identifier. Dotted paths are resolved at compile time – the typechecker walks the structure declared in the source schema and reports a missing-field error if any segment doesn’t exist.
- type: transform
name: project_tenant
input: events
config:
cxl: |
emit tenant = doc.metadata.tenant
emit user_id = doc.user.id
Use dotted paths for structures whose shape is fixed and known at authoring time.
Bracket indices
A bracket index reads a runtime-computed key. The receiver may be an array (integer index) or a map (string index).
items[0]
profile["name"]
items.map(it => it["sku"])
Bracket indices are dynamic – the index expression evaluates per record. The typechecker treats the result as Any and does not assert that the key is present.
Integer index on an array
- type: transform
name: first_item
input: orders
config:
cxl: |
emit head = items[0]
emit second = items[1]
For items = [{"sku":"a"},{"sku":"b"},{"sku":"c"}], head is {"sku":"a"} and second is {"sku":"b"}.
Out-of-range indices return null. Negative indices also return null (CXL does not support negative indexing).
String index on a map
cxl: |
emit name = profile["name"]
emit tier = profile["tier"]
Missing keys return null – the lookup never raises an error. This is the same null-propagation policy closure builtins use on their receivers.
Mixing forms
The two forms compose in either order:
cxl: |
emit first_sku = items[0]["sku"]
emit profile_email = users.profile["email"]
items[0]["sku"] is two bracket indices chained – an integer index against the array, then a string index against the resulting map. users.profile["email"] walks a dotted path to reach profile (a map field on users), then bracket-indexes into it for a runtime key.
Null propagation
Every nested-access form propagates null end-to-end. If the receiver is null, the result is null without evaluating the index expression:
cxl: |
emit sku = items[0]["sku"]
-- when `items` is null, `sku` is null
-- when `items[0]` is null, `sku` is also null
This matches the null behavior on dotted paths and on method-call receivers. Records with missing intermediate structure produce nulls in their derived fields rather than aborting the transform.
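A null-propagating lookup is straightforward to model. In the Python sketch below, get is a hypothetical helper standing in for one bracket index; returning None when a scalar is indexed is an assumption, since this section only specifies the null, array, and map cases.

```python
def get(value, key):
    if value is None:
        return None                      # null receiver propagates
    if isinstance(value, list):
        # Out-of-range and negative indices return null.
        if isinstance(key, int) and 0 <= key < len(value):
            return value[key]
        return None
    if isinstance(value, dict):
        return value.get(key)            # missing key returns null
    return None                          # assumption: indexing a scalar -> null

items = [{"sku": "a"}, {"sku": "b"}, {"sku": "c"}]
sku = get(get(items, 0), "sku")          # items[0]["sku"]
missing = get(get(None, 0), "sku")       # items is null
```

Chaining get calls never raises, which is the behavior that lets records with missing intermediate structure flow through a transform.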
Method calls on indexed values
A bracket-indexed expression is a regular value, so it composes with any method or further index:
cxl: |
emit head_sku_upper = items[0]["sku"].upper()
emit cheap_skus = items.filter(it => it["price"] < 10).map(it => it["sku"])
The first chain reads a string out of nested structure and uppercases it. The second filters an array of maps by a numeric field and projects the SKU strings out.
See also
- Closures – closures over arrays of maps typically use bracket-index on the it binding.
- Array Methods – traversal builtins that consume nested arrays.
- Map Methods – builders and accessors for map values.
- Null Handling – the wider null-propagation rules.
Emit Each
The emit each statement fans one input record into multiple output records – one per element of an array on the input. The body emits the fields each output record carries. A trailing outer modifier preserves the trigger row when the array is empty or null.
Syntax
emit each <binding> in <source> {
<statements>
}
- <binding> is the identifier the body uses to refer to the current array element. The conventional name is it (same as the closure parameter), but any identifier is accepted.
- <source> is any expression producing an array. Typically a field reference on the input record.
- The body is a block of let and emit statements that produce one output record per iteration.
Worked example
Suppose each input record carries an items array of objects, each with sku and price:
{"order_id":"O-1","items":[{"sku":"a","price":10},{"sku":"b","price":20},{"sku":"c","price":5}]}
A transform that fans each input into one record per item:
- type: transform
name: explode
input: orders
config:
cxl: |
emit each it in items {
emit order_id = order_id
emit sku = it["sku"]
emit price = it["price"]
}
For the input above, the transform produces three output records:
{"order_id":"O-1","sku":"a","price":10}
{"order_id":"O-1","sku":"b","price":20}
{"order_id":"O-1","sku":"c","price":5}
The body reads both it (the current element) and order_id (an outer record field). Outer-record fields remain visible inside the body for every iteration.
Cardinality
If the source array has N elements, emit each produces exactly N output records. Empty array sources produce zero records. A null source also produces zero records – no DLQ entry, no error – mirroring the explode-on-null convention used elsewhere in CXL.
When fan-out nests, the cardinalities multiply: an outer array of M elements whose inner arrays have N elements each produces up to M×N records. The cumulative max_expansion cap bounds that product.
A non-array, non-null source raises a runtime type-mismatch error and routes the originating record to the DLQ.
Preserving the trigger row: outer
A trailing outer modifier switches emit each to its outer-join variant. The grammar is identical except for the keyword after the source:
emit each <binding> in <source> outer {
<statements>
}
The only behavioral difference is what happens when the source is null or an empty array. Plain emit each drops the trigger row entirely (zero output records). The outer variant instead emits the trigger row once, with <binding> bound to null:
| Source | emit each ... | emit each ... outer |
|---|---|---|
| 3-element | 3 records | 3 records (identical) |
| empty array | 0 records | 1 record, binding = null |
| null | 0 records | 1 record, binding = null |
This is the shape SQL engines spell LATERAL VIEW OUTER EXPLODE (Spark, Hive) or an outer UNNEST (DuckDB): “for each tag on this article emit a tagged row, but keep articles that have no tags.”
Using the worked example above with an order that carries no items:
{"order_id":"O-2","items":[]}
- type: transform
name: explode_outer
input: orders
config:
cxl: |
emit each it in items outer {
emit order_id = order_id
emit sku = it["sku"]
emit price = it["price"]
}
produces a single record that keeps order_id while the per-item fields read through the null binding:
{"order_id":"O-2","sku":null,"price":null}
Outer-record fields (like order_id) and any emit statements preceding the block still apply to the preserved trigger row, so an outer row is never bare.
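The plain and outer variants differ only in the empty-or-null branch, which a short Python model makes explicit. The function names are hypothetical, and where Clinker routes a non-array source to the DLQ, this sketch simply raises TypeError.

```python
def field(it, key):
    # Reading through a null binding yields null.
    return None if it is None else it.get(key)

def body(record, it):
    return {
        "order_id": record["order_id"],
        "sku": field(it, "sku"),
        "price": field(it, "price"),
    }

def emit_each(record, source_field, outer=False):
    source = record.get(source_field)
    if source is None or source == []:
        # Plain: drop the trigger row. Outer: keep it once, binding = null.
        return [body(record, None)] if outer else []
    if not isinstance(source, list):
        raise TypeError("emit each source must be an array")
    return [body(record, it) for it in source]

full = {"order_id": "O-1", "items": [{"sku": "a", "price": 10}, {"sku": "b", "price": 20}]}
empty = {"order_id": "O-2", "items": []}
```

Calling emit_each(empty, "items") returns no records, while emit_each(empty, "items", outer=True) returns the single preserved row shown above.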
The source type rule is slightly wider than plain emit each: a statically-null source is accepted (it is the case the variant exists to handle), alongside arrays and Any. Everything else in this page — the cumulative max_expansion cap, the nesting rules, the body-statement restrictions — applies unchanged to the outer variant. The two variants compose freely: an outer block may nest inside a plain emit each block and vice versa.
Output schema
The body’s emit statements define the output record’s field set, the same way emit does in a regular transform body. Fields the body does not emit fall under the Output node’s include_unmapped policy (see Output Nodes).
Fields written by the body shadow same-named fields on the originating input record.
Nested fan-out: fan-out within fan-out
An emit each body may itself contain emit each blocks — fan-out within fan-out for one trigger row. This is the canonical “for each article, for each section, for each tag, emit a row” shape:
emit each section in article["sections"] {
emit each tag in section["tags"] {
emit article_id = article_id
emit section = section["name"]
emit tag = tag
}
}
For one input article, this produces one output record per (section, tag) pair. The inner binding (tag) reads the current inner element; the outer binding (section) and any outer-record field (article_id) stay visible inside the inner body. A field name reused as both an outer and inner binding shadows lexically — the inner binding wins inside the inner body, and the outer value is restored when the inner block finishes.
Emits are positional: an emit placed in the outer body before a nested block applies to every leaf record that block produces, but an emit placed after a nested block does not retroactively reach the records that block already emitted. Put the fields shared across leaves above the nested block.
Plain and outer blocks compose in any order. An inner plain emit each over an empty or null array contributes no records for that branch, while an inner emit each ... outer preserves one trigger row (inner binding bound to null) — exactly the per-level semantics from the single-level table, applied at each level.
Nesting is bounded to 32 levels so that adversarially deep input cannot exhaust the parser stack; legitimate document fan-out is only a few levels deep. Beyond that bound, parsing fails with a “nesting too deep” diagnostic.
The flat-array workaround (precompute a flattened array with .flat_map and use a single emit each) is still available and may be clearer for a simple two-level cartesian product, but is no longer required.
Body-statement restrictions
Within the body, let, emit, trace, and nested emit each / emit each ... outer are accepted. filter and distinct are rejected at evaluation time – a body filter would split work between branches the engine can’t represent. Move filter/distinct logic into a downstream transform, or pre-filter the source array with .filter before the emit each block.
Safety cap: max_expansion
To bound fan-out, every transform body carries a max_expansion cap on the cumulative records emit each may produce from a single original input record. The cap is cumulative across all nesting levels: every leaf record a nested fan-out produces charges against one shared budget, so nesting cannot multiply past the cap undetected. If the cap is exceeded, the originating record routes to the DLQ with category expansion_limit_exceeded instead of producing a truncated or unbounded result. The default cap is 10000.
See Transform Nodes -> Expansion Cap for the YAML field and tuning guidance.
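The cumulative nature of the cap can be sketched in Python. This is a simplified model under stated assumptions, not the engine's code: expand_with_cap and the dictionary shapes are hypothetical, and only the category name and the default of 10000 come from the text above.

```python
def expand_with_cap(article, max_expansion=10000):
    """Return (records, dlq_entry); exactly one of the two is populated."""
    records = []
    for section in article["sections"]:
        for tag in section["tags"]:
            # One shared budget across both nesting levels: every leaf counts.
            if len(records) == max_expansion:
                dlq_entry = {
                    "category": "expansion_limit_exceeded",
                    "record": article,
                }
                # No truncated output: the whole originating record is diverted.
                return [], dlq_entry
            records.append({"section": section["name"], "tag": tag})
    return records, None


# Hypothetical record with 2 sections x 2 tags = 4 leaf records.
sample = {
    "sections": [
        {"name": "a", "tags": ["x", "y"]},
        {"name": "b", "tags": ["x", "y"]},
    ]
}
ok_records, ok_dlq = expand_with_cap(sample, max_expansion=4)
no_records, dlq = expand_with_cap(sample, max_expansion=3)
```

With a cap of 4 the record expands normally; with a cap of 3 the fourth leaf exceeds the budget and the record is diverted with no partial output.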
See also
- Closures – closures bind a similar it parameter inside method calls.
- Array Methods – flat_map is the in-expression cousin of emit each.
- Nested Paths – bracket-index access on the body binding.
- Transform Nodes – the max_expansion cap and DLQ routing.
- Error Handling & DLQ – DLQ category semantics.
System Variables
CXL provides several system variable namespaces prefixed with $. These give CXL expressions access to pipeline execution context, user-defined variables, per-record metadata, and the current time.
$pipeline.* – Pipeline context
Pipeline variables are accessed via $pipeline.member_name. Some are frozen at pipeline start; others update per record.
Stable (frozen at pipeline start)
| Variable | Type | Description |
|---|---|---|
| $pipeline.name | String | Pipeline name from YAML config |
| $pipeline.execution_id | String | UUID v7, unique per pipeline run |
| $pipeline.batch_id | String | From --batch-id CLI flag, or auto-generated UUID v7 |
| $pipeline.start_time | DateTime | Frozen at pipeline start, deterministic within a run |
$ cxl eval -e 'emit name = $pipeline.name' \
-e 'emit exec = $pipeline.execution_id'
{
"name": "cxl-eval",
"exec": "00000000-0000-0000-0000-000000000000"
}
Counters
| Variable | Type | Description |
|---|---|---|
| $pipeline.total_count | Int | Total records processed so far |
| $pipeline.ok_count | Int | Records that passed successfully |
| $pipeline.dlq_count | Int | Records sent to dead-letter queue |
| $pipeline.filtered_count | Int | Records excluded by filter statements |
| $pipeline.distinct_count | Int | Records excluded by distinct statements |
trace info if $pipeline.total_count % 10000 == 0 then "processed " + $pipeline.total_count.to_string() + " records"
$source.* – Per-record source lineage
$source.* exposes engine-stamped columns that travel with every
record from its origin Source node downstream through merges,
combines, and transforms. They identify where the record came
from and when in event-time it happened. All three columns are
filtered out of default Output projections — reference them
explicitly with emit if you need them in your output schema.
| Variable | Type | Description |
|---|---|---|
| $source.file | String | Path of the input file the current record was read from. |
| $source.name | String | Name of the Source node that produced the current record. Survives through merge / combine so downstream nodes can branch on origin. |
| $source.event_time | DateTime | Engine-stamped event time, delay-corrected by the source’s watermark.delay. Null when the source has no watermark: block, or when the per-record value did not parse. |
filter $source.name == "src_web"
emit origin = $source.name
emit ingest_file = $source.file
emit ts = $source.event_time
$source.event_time is the column a time-windowed aggregate reads to assign records to windows. It is populated only for records from a source that declares a watermark: block; otherwise it holds Null.
$vars.* – User-defined variables
User-defined variables are declared in the YAML pipeline config under pipeline.vars: and accessed via $vars.name in CXL expressions.
YAML declaration
pipeline:
  name: invoice_processing
  vars:
    high_value_threshold: 10000
    tax_rate: 0.21
    output_currency: "USD"
    fiscal_year_start_month: 4
CXL usage
filter amount > $vars.high_value_threshold
emit tax = amount * $vars.tax_rate
emit currency = $vars.output_currency
Variables provide a clean way to externalize configuration from CXL logic. Combined with channels, different variable sets can parameterize the same pipeline for different environments or clients.
$meta.* – Per-record metadata
Metadata is a per-record key-value store that travels with the record through the pipeline but is not part of the output columns. Write to it with emit meta; read from it with $meta.field.
Writing metadata
emit meta quality = if amount < 0 then "suspect" else "ok"
emit meta source_system = "legacy_erp"
Reading metadata
Downstream nodes can read metadata:
filter $meta.quality == "ok"
emit audit_system = $meta.source_system
Metadata is useful for tagging records with quality flags, routing hints, or audit information that should not appear in the final output unless explicitly emitted.
now – Current time
The now keyword returns the current wall-clock time as a DateTime value. It is evaluated fresh per record, so each record gets the actual time of its processing.
$ cxl eval -e 'emit timestamp = now'
{
"timestamp": "2026-04-11T15:30:00"
}
now is useful for timestamping records:
emit processed_at = now
emit days_old = now.diff_days(created_date)
Note: now is a keyword, not a function call. Write now, not now().
Complete example
pipeline:
  name: order_enrichment
  vars:
    discount_threshold: 500
    tax_rate: 0.08
nodes:
  - name: orders
    type: source
    format: csv
    path: orders.csv
  - name: enrich
    type: transform
    input: orders
    cxl: |
      emit order_id = order_id
      emit amount = amount
      emit discount = if amount > $vars.discount_threshold then 0.1 else 0.0
      emit tax = amount * $vars.tax_rate
      emit total = amount * (1 - discount) + tax
      emit processed_at = now
      emit meta source_file = $source.file
      emit pipeline_run = $pipeline.execution_id
  - name: output
    type: output
    input: enrich
    format: csv
    path: enriched_orders.csv
Null Handling
Null values in CXL represent missing or absent data. CXL uses null propagation – most operations on null produce null – with specific tools for detecting and handling nulls.
Null propagation
When a method receives a null receiver, it returns null without executing. This is called null propagation and applies to every method except the five listed under Null propagation exceptions.
$ cxl eval -e 'emit result = null.upper()'
{
"result": null
}
Propagation flows through method chains:
$ cxl eval -e 'emit result = null.trim().upper().length()'
{
"result": null
}
Null propagation exceptions
Five methods are exempt from null propagation and actively handle null receivers:
| Method | Null behavior |
|---|---|
| is_null() | Returns true |
| type_of() | Returns "Null" |
| is_empty() | Returns true |
| catch(x) | Returns x |
| debug(l) | Passes through null, logs it |
$ cxl eval -e 'emit a = null.is_null()' -e 'emit b = null.type_of()' \
-e 'emit c = null.catch("fallback")'
{
"a": true,
"b": "Null",
"c": "fallback"
}
Null coalesce operator (??)
The ?? operator returns its left operand if non-null, otherwise its right operand. It is the primary tool for providing default values.
$ cxl eval -e 'emit a = null ?? "default"' \
-e 'emit b = "present" ?? "default"'
{
"a": "default",
"b": "present"
}
Chain multiple ?? operators for fallback chains:
$ cxl eval -e 'emit result = null ?? null ?? "last resort"'
{
"result": "last resort"
}
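The fallback chain can be modeled in a few lines of Python, with None standing in for null. The coalesce helper is a hypothetical name for this sketch, not a CXL function. Because the rule tests only for null, values such as 0, false, and the empty string count as present.

```python
def coalesce(*operands):
    """Return the first operand that is not None, or None if all are None."""
    for value in operands:
        if value is not None:
            return value
    return None


a = coalesce(None, "default")              # left operand is null
b = coalesce("present", "default")         # left operand is non-null
c = coalesce(None, None, "last resort")    # chained fallbacks
d = coalesce(0, 99)                        # 0 is non-null, so it is kept
```

This differs from Python's own or operator, which would discard the 0 in the last line.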
Three-valued logic
Boolean operations with null follow three-valued logic (like SQL):
and
| Left | Right | Result |
|---|---|---|
| true | null | null |
| false | null | false |
| null | true | null |
| null | false | false |
| null | null | null |
The key insight: false and null is false because the result is false regardless of the unknown value.
or
| Left | Right | Result |
|---|---|---|
| true | null | true |
| false | null | null |
| null | true | true |
| null | false | null |
| null | null | null |
The key insight: true or null is true because the result is true regardless of the unknown value.
not
| Operand | Result |
|---|---|
| true | false |
| false | true |
| null | null |
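The three tables can be reproduced with a small Python model in which None stands in for null. The names and3, or3, and not3 are illustrative only, not CXL built-ins; the sketch shows the rule behind the tables rather than the engine's implementation.

```python
def and3(left, right):
    """Three-valued AND: a definite False wins, otherwise null is contagious."""
    if left is False or right is False:
        return False
    if left is None or right is None:
        return None
    return True


def or3(left, right):
    """Three-valued OR: a definite True wins, otherwise null is contagious."""
    if left is True or right is True:
        return True
    if left is None or right is None:
        return None
    return False


def not3(operand):
    """Three-valued NOT: null stays null."""
    return None if operand is None else (not operand)
```

Each function checks the short-circuiting value first, which is why false and null is false while true and null stays null.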
Arithmetic with null
Any arithmetic operation involving null produces null:
$ cxl eval -e 'emit result = 5 + null'
{
"result": null
}
Comparison with null
Comparisons involving null produce null (not false):
$ cxl eval -e 'emit result = null == null'
{
"result": null
}
To test for null, use is_null():
$ cxl eval -e 'emit result = null.is_null()'
{
"result": true
}
Practical patterns
Fallback values with ??
emit name = raw_name ?? "Unknown"
emit amount = raw_amount ?? 0
emit active = is_active ?? false
Safe conversion with try_* and ??
emit price = raw_price.try_float() ?? 0.0
emit qty = raw_qty.try_int() ?? 1
Explicit null testing
filter not amount.is_null()
emit has_email = not email.is_null()
Catch method (equivalent to ??)
emit name = raw_name.catch("Unknown")
Conditional null handling
emit status = if amount.is_null() then "missing"
else if amount < 0 then "invalid"
else "ok"
Filter blank or null
# Filter out records where name is null or empty string
filter not name.is_empty()
Null-safe chaining
When working with fields that may be null, place the null check early or use ??:
# Safe: coalesce first, then transform
emit normalized = (raw_name ?? "").trim().upper()
# Safe: test before use
emit name = if raw_name.is_null() then "N/A" else raw_name.trim()
Modules & use
CXL supports a module system for organizing reusable expressions. Modules contain function declarations and constant bindings that can be imported into CXL programs.
Module files
A module is a .cxl file containing fn declarations and let constants. Module files live in the rules path (default: ./rules/).
Function declarations
Functions are pure, single-expression bodies with named parameters:
fn fiscal_year(d) = if d.month() < 4 then d.year() - 1 else d.year()
fn full_name(first, last) = first.trim() + " " + last.trim()
fn clamp_pct(value) = value.clamp(0, 100).round_to(1)
Functions are pure – they have no side effects and always return a value.
Module constants
Constants are let bindings at the module level:
let tax_rate = 0.21
let max_retries = 3
let default_currency = "USD"
Example module file
File: rules/shared/dates.cxl
fn fiscal_year(d) = if d.month() < 4 then d.year() - 1 else d.year()
fn quarter(d) = match {
d.month() <= 3 => 1,
d.month() <= 6 => 2,
d.month() <= 9 => 3,
_ => 4
}
fn fiscal_quarter(d) = quarter(d.add_months(-3))
let fiscal_start_month = 4
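To sanity-check the month arithmetic in this module, here is a Python sketch that works on plain year and month numbers. It assumes add_months(-3) shifts the calendar month back by three with year wrap-around; shift_month is a hypothetical helper standing in for it, and days are ignored.

```python
def fiscal_year(year, month):
    """Fiscal year label for a fiscal year that starts in April."""
    return year - 1 if month < 4 else year


def quarter(month):
    """Calendar quarter, mirroring the match arms in the module."""
    if month <= 3:
        return 1
    if month <= 6:
        return 2
    if month <= 9:
        return 3
    return 4


def shift_month(month, delta):
    """Month number (1-12) after adding delta months, wrapping around the year."""
    return (month - 1 + delta) % 12 + 1


def fiscal_quarter(month):
    """Fiscal quarter: the calendar quarter of the date three months earlier."""
    return quarter(shift_month(month, -3))
```

Under these assumptions April lands in fiscal quarter 1, and January through March land in fiscal quarter 4 of the previous fiscal year.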
Importing modules
Use the use statement to import a module. Module paths use dot notation (not ::):
use shared.dates as d
This imports the module at rules/shared/dates.cxl and binds it to the alias d.
Import syntax
use module.path
use module.path as alias
The as alias clause is optional. When omitted, the last segment of the path becomes the default name.
use shared.dates # access as dates::fiscal_year(...)
use shared.dates as d # access as d::fiscal_year(...)
Path resolution
Module paths are resolved relative to the rules path:
| Import | File path |
|---|---|
| use shared.dates | rules/shared/dates.cxl |
| use transforms.normalize | rules/transforms/normalize.cxl |
| use utils | rules/utils.cxl |
The rules path defaults to ./rules/ and can be overridden with --rules-path.
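The mapping from import path to file can be sketched in Python. This models only the rule shown in the table (dots become directory separators, .cxl is appended, and the last segment is the default alias); module_file and default_alias are hypothetical names, and the real resolver may apply further checks.

```python
from pathlib import PurePosixPath


def module_file(import_path, rules_path="./rules/"):
    """Map a dotted module path such as 'shared.dates' to its .cxl file."""
    segments = import_path.split(".")
    return PurePosixPath(rules_path).joinpath(*segments).with_suffix(".cxl")


def default_alias(import_path):
    """Name bound when no 'as alias' clause is given: the last path segment."""
    return import_path.rsplit(".", 1)[-1]


resolved = str(module_file("shared.dates"))
custom = str(module_file("utils", rules_path="/etc/clinker/rules"))
alias = default_alias("shared.dates")
```

The second call shows the effect of pointing the rules path elsewhere, as --rules-path does; the directory used is a made-up example.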
Using imported functions and constants
After importing, reference module members with :: (double colon) syntax:
use shared.dates as d
use shared.finance as f
emit fiscal_year = d::fiscal_year(invoice_date)
emit quarter = d::quarter(invoice_date)
emit tax = amount * f::tax_rate
emit net = amount - tax
Functions
Call functions with alias::function_name(args):
use shared.dates as d
emit fy = d::fiscal_year(order_date)
Constants
Access constants with alias::constant_name:
use shared.finance as f
emit tax = amount * f::tax_rate
Restrictions
- No wildcard imports. use shared.* is not supported. Import modules explicitly.
- Dot separator only. Module paths use ., not ::. The :: syntax is reserved for member access after import.
- Single expression bodies. Functions must be a single expression – no multi-statement bodies.
- Pure functions. Functions cannot use emit, filter, distinct, or other statement forms. They are pure computations.
- No recursion. Functions cannot call themselves (directly or indirectly).
Complete example
File: rules/etl/clean.cxl
fn normalize_name(name) = name.trim().upper()
fn safe_amount(raw) = raw.try_float() ?? 0.0
fn flag_suspicious(amount, threshold) =
if amount > threshold then "review" else "ok"
let max_amount = 999999.99
Pipeline CXL block:
use etl.clean as c
emit customer = c::normalize_name(raw_customer)
emit amount = c::safe_amount(raw_amount)
filter amount <= c::max_amount
emit review_flag = c::flag_suspicious(amount, 10000)
The cxl CLI Tool
The cxl command-line tool validates, evaluates, and formats CXL source files. It is the standalone companion to the Clinker pipeline engine, useful for testing expressions, validating transforms, and debugging CXL logic.
Commands
cxl check
Parse, resolve, and type-check a .cxl file. Reports errors with source locations and fix suggestions.
$ cxl check transform.cxl
ok: transform.cxl is valid
On errors:
error[parse]: expected expression, found '}' (at transform.cxl:12)
help: check for missing operand or extra closing brace
error[resolve]: unknown field 'amoutn' (at transform.cxl:5)
help: did you mean 'amount'?
error[typecheck]: cannot apply '+' to String and Int (at transform.cxl:8)
help: convert one operand — use .to_int() or .to_string()
cxl eval
Evaluate CXL expressions against provided data and print the result as JSON.
Inline expression:
$ cxl eval -e 'emit result = 1 + 2'
{
"result": 3
}
From a file with field values:
$ cxl eval transform.cxl \
--field Price=10.5 \
--field Qty=3
From a file with JSON input:
$ cxl eval transform.cxl --record '{"price": 10.5, "qty": 3}'
Multiple inline expressions:
$ cxl eval -e 'let tax = 0.21' -e 'emit net = price * (1 - tax)' \
--field price=100
{
"net": 79.0
}
cxl fmt
Parse and pretty-print a .cxl file in canonical format with normalized whitespace and consistent styling.
$ cxl fmt transform.cxl
Output is printed to stdout. Redirect to overwrite:
$ cxl fmt transform.cxl > transform.cxl.tmp && mv transform.cxl.tmp transform.cxl
Input data
--field name=value
Provide individual field values as key-value pairs. Values are automatically type-inferred:
| Input | Inferred type | Example |
|---|---|---|
| Integer pattern | Int | --field count=42 |
| Decimal pattern | Float | --field price=10.5 |
| true / false | Bool | --field active=true |
| null | Null | --field value=null |
| Anything else | String | --field name=Alice |
$ cxl eval -e 'emit t = amount.type_of()' --field amount=42
{
"t": "Int"
}
$ cxl eval -e 'emit t = name.type_of()' --field name=Alice
{
"t": "String"
}
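The inference order in the table can be approximated in Python as below. The exact patterns the tool accepts (signs, exponents, leading zeros) are not stated here, so the regular expressions are assumptions, and infer_field and parse_field_arg are hypothetical names.

```python
import re


def infer_field(raw):
    """Infer a typed value from a --field string, following the table's order."""
    if re.fullmatch(r"-?\d+", raw):          # assumed integer pattern
        return int(raw)
    if re.fullmatch(r"-?\d+\.\d+", raw):     # assumed decimal pattern
        return float(raw)
    if raw in ("true", "false"):
        return raw == "true"
    if raw == "null":
        return None
    return raw                               # anything else stays a string


def parse_field_arg(arg):
    """Split a 'name=value' argument and infer the value's type."""
    name, _, raw = arg.partition("=")
    return name, infer_field(raw)


count = parse_field_arg("count=42")
price = parse_field_arg("price=10.5")
label = parse_field_arg("name=Alice")
```

A value such as abc falls through every pattern and stays a String, which is why try_int() on it yields null.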
--record JSON
Provide a full JSON object as input. Mutually exclusive with --field.
$ cxl eval -e 'emit total = price * qty' \
--record '{"price": 10.5, "qty": 3}'
{
"total": 31.5
}
JSON types map directly:
| JSON type | CXL type |
|---|---|
| null | Null |
| true / false | Bool |
| integer number | Int |
| decimal number | Float |
| "string" | String |
| [array] | Array |
Output format
Output is always JSON. Each emit statement produces a key-value pair:
$ cxl eval -e 'emit a = 1' -e 'emit b = "two"' -e 'emit c = true'
{
"a": 1,
"b": "two",
"c": true
}
Date and DateTime values are serialized as ISO 8601 strings:
$ cxl eval -e 'emit d = #2024-03-15#'
{
"d": "2024-03-15"
}
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success (or warnings only) |
| 1 | Parse, resolve, type-check, or evaluation errors |
| 2 | I/O error (file not found, invalid JSON, etc.) |
Pipeline context in eval mode
When running cxl eval, a minimal pipeline context is provided:
| Variable | Value |
|---|---|
| $pipeline.name | "cxl-eval" |
| $pipeline.execution_id | Zeroed UUID |
| $pipeline.batch_id | Zeroed UUID |
| $pipeline.start_time | Current wall-clock time |
| $pipeline.source_file | Filename or "<inline>" |
| $pipeline.source_row | 1 |
| now | Current wall-clock time (live) |
Practical usage
Quick expression testing:
$ cxl eval -e 'emit result = "hello world".upper().split(" ").length()'
{
"result": 2
}
Validate a transform file:
$ cxl check transforms/enrich_orders.cxl && echo "Valid"
Test conditional logic:
$ cxl eval -e 'emit tier = match {
amount > 1000 => "high",
amount > 100 => "med",
_ => "low"
}' \
--field amount=500
{
"tier": "med"
}
Test date operations:
$ cxl eval -e 'emit year = d.year()' -e 'emit month = d.month()' \
-e 'emit next_week = d.add_days(7)' \
--record '{"d": "2024-03-15"}'
Test null handling:
$ cxl eval -e 'emit safe = raw.try_int() ?? 0' --field raw=abc
{
"safe": 0
}
CLI Reference
Clinker ships two command-line tools: clinker (the pipeline runner) and cxl (the expression REPL, covered in the CXL CLI chapter). This page is the complete reference for clinker.
clinker run
Execute a pipeline.
clinker run [OPTIONS] <CONFIG>
Positional arguments
| Argument | Description |
|---|---|
<CONFIG> | Path to the pipeline YAML configuration file (required) |
Options
| Flag | Default | Description |
|---|---|---|
| --memory-limit <SIZE> | 512M | Memory budget for the execution. Accepts binary (1024-based) K/M/G suffixes (K = 1024 bytes, M = 1024², G = 1024³); a bare integer is bytes. When the limit is approached, aggregation operators spill to disk rather than crashing. CLI value overrides any memory.limit set in the YAML. |
| --threads <N> | number of CPUs | Size of the thread pool used for parallel node execution. |
| --error-threshold <N> | 0 (unlimited) | Maximum number of records routed to the dead-letter queue before the pipeline aborts. 0 means no limit – the pipeline will run to completion regardless of DLQ volume. |
| --batch-id <ID> | UUID v7 | Custom execution identifier. Appears in metrics output and log lines. Use a meaningful value (e.g. daily-2026-04-11) for correlation across retries. |
| --explain [FORMAT] | text | Print the execution plan and exit without processing data. Accepted formats: text, json, dot. See Explain Plans. |
| --dry-run | – | Validate the configuration (YAML structure, CXL syntax, type checking, DAG wiring) without reading any data. |
| -n, --dry-run-n <N> | – | Process only the first N records through the full pipeline. Implies --dry-run. |
| --dry-run-output <FILE> | stdout | Redirect dry-run output to a file instead of stdout. Only meaningful with -n. |
| --rules-path <DIR> | ./rules/ | Search path for CXL module files referenced by use statements. |
| --base-dir <DIR> | – | Base directory for resolving relative paths in the YAML config. Defaults to the directory containing the config file. |
| --allow-absolute-paths | – | Permit absolute file paths in the pipeline YAML. By default, absolute paths are rejected to encourage portable configs. |
| --env <NAME> | – | Set the active environment. Equivalent to setting CLINKER_ENV. Used by when: conditions in channel overrides. |
| --quiet | – | Suppress progress output. Errors are still printed to stderr. |
| --force | – | Allow output files to be overwritten if they already exist. Without this flag, the pipeline aborts rather than clobbering existing output. |
| --log-level <LEVEL> | info | Logging verbosity. One of: error, warn, info, debug, trace. |
| --metrics-spool-dir <DIR> | – | Directory for per-execution metrics files. See Metrics & Monitoring. |
Examples
# Basic execution
clinker run pipeline.yaml
# Production run with memory budget and forced overwrite
clinker run pipeline.yaml --memory-limit 512M --force --log-level warn
# Validate without processing
clinker run pipeline.yaml --dry-run
# Preview first 10 records
clinker run pipeline.yaml --dry-run -n 10
# Show execution plan as Graphviz
clinker run pipeline.yaml --explain dot | dot -Tpng -o plan.png
# Run with a custom batch ID for tracing
clinker run pipeline.yaml --batch-id "daily-2026-04-11" --metrics-spool-dir ./metrics/
clinker metrics collect
Sweep per-execution metrics files from a spool directory into a single NDJSON archive.
clinker metrics collect [OPTIONS]
Options
| Flag | Description |
|---|---|
| --spool-dir <DIR> | Spool directory to sweep (required). |
| --output-file <FILE> | NDJSON archive destination (required). If the file exists, new entries are appended. |
| --delete-after-collect | Remove spool files after they have been successfully written to the archive. |
| --dry-run | Preview which files would be collected without writing anything. |
Examples
# Collect and archive, then clean up spool
clinker metrics collect \
--spool-dir /var/spool/clinker/ \
--output-file /var/log/clinker/metrics.ndjson \
--delete-after-collect
# Preview what would be collected
clinker metrics collect \
--spool-dir ./metrics/ \
--output-file ./archive.ndjson \
--dry-run
Environment Variables
| Variable | Description |
|---|---|
| CLINKER_ENV | Active environment name. Equivalent to --env. Used by when: conditions in channel overrides to select environment-specific configuration. |
| CLINKER_METRICS_SPOOL_DIR | Default metrics spool directory. Overridden by --metrics-spool-dir. |
Precedence (highest to lowest): CLI flag, environment variable, YAML config value.
Validation & Dry Run
Clinker provides two levels of pre-flight validation so you can catch problems before committing to a full run.
Config-only validation
clinker run pipeline.yaml --dry-run
This validates everything that can be checked without reading data:
- YAML structure and required fields
- CXL syntax and compile-time type checking
- Schema compatibility between connected nodes
- DAG wiring (no cycles, no dangling inputs, no missing nodes)
- File path resolution (existence checks for inputs)
No records are read. No output files are created. The command exits with code 0 on success or code 1 with a diagnostic message on failure.
Use this after every YAML edit. It runs in milliseconds and catches the majority of configuration mistakes.
Record preview
clinker run pipeline.yaml --dry-run -n 10
This reads the first 10 records from each source and processes them through the full pipeline – transforms, aggregations, routing, and output formatting. Results are printed to stdout.
The record preview exercises the runtime evaluation path, catching issues that config-only validation cannot:
- CXL expressions that are syntactically valid but fail at runtime (e.g., calling a string method on an integer)
- Data format mismatches between the declared schema and actual file contents
- Unexpected null values in required fields
Save preview to file
clinker run pipeline.yaml --dry-run -n 100 --dry-run-output preview.csv
The output format matches what the pipeline’s output node would produce, so preview.csv shows you exactly what the full run will write.
Recommended workflow
Use both validation levels in sequence before every production run:
1. --dry-run – catch configuration and type errors instantly.
2. --dry-run -n 10 – verify output shape and values against real data.
3. Full run – execute with confidence.
This three-step pattern is especially valuable when:
- Editing CXL expressions in transform or aggregate nodes
- Changing source schemas or swapping input files
- Adding or removing nodes from the pipeline DAG
- Modifying route conditions
Combining with explain
You can also inspect the execution plan before running:
clinker run pipeline.yaml --explain
This shows the DAG structure, parallelism strategy, and node ordering without reading any data. See Explain Plans for details.
The typical full pre-flight sequence is:
clinker run pipeline.yaml --explain # inspect the DAG
clinker run pipeline.yaml --dry-run # validate config
clinker run pipeline.yaml --dry-run -n 10 # preview with data
clinker run pipeline.yaml --force # run for real
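For automation, the same sequence could be driven from a small Python wrapper that stops at the first failing step. This is a sketch under assumptions: it presumes clinker is on PATH and that every step signals failure with a non-zero exit code (stated above for --dry-run, assumed for the others). The function names are hypothetical; only the flags come from this page.

```python
import subprocess


def preflight_commands(config, preview_n=10):
    """Build the four pre-flight command lines, in the order shown above."""
    base = ["clinker", "run", config]
    return [
        base + ["--explain"],                            # inspect the DAG
        base + ["--dry-run"],                            # validate config
        base + ["--dry-run", "-n", str(preview_n)],      # preview with data
        base + ["--force"],                              # run for real
    ]


def run_preflight(config, preview_n=10):
    """Run each step in order; return the first failing argv, or None."""
    for argv in preflight_commands(config, preview_n):
        if subprocess.run(argv).returncode != 0:
            return argv
    return None


steps = preflight_commands("pipeline.yaml")
```

Calling run_preflight("pipeline.yaml") would execute the steps for real; building the list separately keeps the sequence easy to log or review first.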
Explain Plans
The --explain flag prints the execution plan – the DAG of nodes, their connections, and the parallelism strategy the optimizer has chosen – without reading any data.
Text format
clinker run pipeline.yaml --explain
# or explicitly:
clinker run pipeline.yaml --explain text
The text format shows a human-readable summary of the execution plan:
Execution Plan: customer_etl
============================
Node 0: customers (Source, parallel: file-chunked)
  -> transform_1
Node 1: transform_1 (Transform, parallel: record)
  -> route_1
Node 2: route_1 (Route, parallel: record)
  -> [high] output_high
  -> [default] output_standard
Node 3: output_high (Output, parallel: serial)
Node 4: output_standard (Output, parallel: serial)
Key information shown:
- Node index and name – the topological position in the DAG
- Node type – Source, Transform, Aggregate, Route, Merge, Output, Composition
- Parallelism strategy – how the optimizer plans to execute the node
- Connections – downstream nodes, with port labels for route branches
- Buffer class (Physical Properties section) – buffer: streaming for nodes whose output is handed straight to a single downstream consumer rather than crossing a charged inter-stage buffer (fused Source → Transform → Output chains, Merge.interleave of Sources, single-branch Route, non-fused Merge, streaming-strategy Aggregate, hash-build-probe Combine probe-side, and every sink Output); buffer: materialized for nodes whose output sits in an inter-stage buffer between dispatch arms. See Streaming vs. Blocking Stages for which streaming stages bound their footprint to one batch and which only spare the second copy.
- Arbitration parameters (Physical Properties section, plus a === Buffer Edges === block) – each node’s arbitration: spill_priority=.., can_back_pressure=.. line shows which operator the memory arbitrator would spill or pause first. See Reading --explain arbitration output for the full annotation model and a worked example.
The buffer class is a pre-runtime signal for memory pressure: every materialized node charges its in-flight rows against pipeline.memory.limit and may spill to disk once the soft threshold trips. A streaming node’s output crosses no charged inter-stage buffer, so it is never charged twice and never spill-eligible (though a non-fused streaming stage still builds its own result before handing it off — see Streaming vs. Blocking Stages). Use the annotation alongside --memory-limit / pipeline.memory.limit to predict which stages dominate the RSS budget before running the pipeline.
JSON format
clinker run pipeline.yaml --explain json
Produces a machine-readable JSON object for programmatic consumption. Useful for:
- CI pipelines that need to assert plan properties
- Custom dashboards that visualize execution plans
- Diffing plans between config versions
# Compare plans before and after a config change
clinker run old.yaml --explain json > plan_old.json
clinker run new.yaml --explain json > plan_new.json
diff plan_old.json plan_new.json
Graphviz DOT format
clinker run pipeline.yaml --explain dot
Produces a Graphviz DOT graph. Pipe it to dot to render an image:
# PNG
clinker run pipeline.yaml --explain dot | dot -Tpng -o pipeline.png
# SVG (scalable, good for documentation)
clinker run pipeline.yaml --explain dot | dot -Tsvg -o pipeline.svg
# PDF
clinker run pipeline.yaml --explain dot | dot -Tpdf -o pipeline.pdf
This requires the graphviz package to be installed on the system.
The resulting diagram shows:
- Nodes as labeled boxes with type and parallelism annotations
- Edges as arrows with port labels where applicable
- Branch/merge fan-out and fan-in structure
When to use explain
- During development – verify the DAG shape matches your mental model before writing test data.
- After adding route or merge nodes – confirm branch wiring is correct.
- When tuning parallelism – check which strategy the optimizer selected for each node.
- In code review – generate a DOT diagram and include it in the PR for visual confirmation.
Explain runs instantly because it only parses the YAML and builds the plan – no data is touched. Pair it with --dry-run for full config validation:
clinker run pipeline.yaml --explain # inspect plan
clinker run pipeline.yaml --dry-run # validate config
Retraction section
Pipelines in which at least one Aggregate has a group_by that omits a correlation-key field get a === Retraction === block in the text output. The engine selects the retraction-mode path automatically based on group_by content; the block is omitted for every other pipeline, so strict-correlation and non-correlated --explain output is unchanged.
The block opens with a one-line summary – retraction enabled — N relaxed aggregates, M buffer-mode windows, fanout policy: <policy>. – followed by one block per retraction-mode Aggregate and one per buffer-mode window index.
Per retraction-mode Aggregate the block reports:
- the resolved accumulator path (Reversible or BufferRequired),
- the per-row lineage memory cost (~8 bytes/row for Reversible, n/a for BufferRequired, which holds raw contributions instead),
- the worst-case degrade fallback when retraction’s preconditions break at runtime.
Per buffer-mode window index the block reports:
- the source name and partition_by fields,
- the per-row buffer cost in Value slots over the index’s arena fields,
- the worst-case partition memory ceiling under degrade.
Group cardinality is honestly surfaced as “unknown at plan time” – the planner has no group-cardinality side-table to consult before the run. Use the operator-by-operator retraction cost reference and the per-row figures the explain block prints for capacity planning, then confirm the live shape via clinker metrics collect after the first production run.
Looking up diagnostic codes
clinker explain --code <CODE> prints the documentation for any registered error or warning code, including retraction-specific codes:
clinker explain --code E15Y # retraction-mode aggregate incompatible with strategy: streaming
The full set of codes is enumerated in the error returned when an unknown code is passed.
Memory Tuning
Clinker is designed to be a good neighbor on shared servers. Rather than consuming all available memory, it works within a configurable budget and reaches for back-pressure or disk spill before it runs out.
The memory: block
All pipeline-level memory tuning lives under a single optional block:
pipeline:
  name: my_pipeline
  memory:
    limit: "1G" # optional; defaults to 512M
    backpressure: pause # optional; defaults to pause
The entire block is optional. A pipeline with no opinions about memory writes nothing:
pipeline:
  name: my_pipeline
…and gets the runtime defaults (512 MB hard limit, backpressure: pause).
Individual fields are also optional. Setting just one is fine:
pipeline:
  name: my_pipeline
  memory:
    limit: "2G"
Setting the memory limit
CLI flag (highest priority):
clinker run pipeline.yaml --memory-limit 512M
YAML config:
pipeline:
  memory:
    limit: "512M"
The CLI flag overrides the YAML value. Suffixes are binary (1024-based): K = 1024 bytes, M = 1024², G = 1024³; a bare integer is bytes. (This differs from the decimal KB/MB/GB used by min_size/max_size, which are 1000-based.)
Default: 512 MB.
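To make the binary suffixes concrete, the following Python sketch converts a limit string to bytes and derives the soft spill threshold, which this guide places at 80% of the limit. It is an illustration only: parse_memory_limit and soft_threshold are hypothetical names, and accepting only upper-case suffixes is an assumption.

```python
# Binary (1024-based) multipliers, as documented for --memory-limit.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_memory_limit(text):
    """Convert '512M', '1G', or a bare byte count such as '4096' to bytes."""
    text = text.strip()
    suffix = text[-1:]
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)   # a bare integer is bytes


def soft_threshold(limit_bytes):
    """Soft spill threshold: 80% of the limit, rounded down to whole bytes."""
    return limit_bytes * 80 // 100


limit = parse_memory_limit("512M")
soft = soft_threshold(limit)
```

For the 512M default this gives a hard limit of 536870912 bytes and a soft threshold of 429496729 bytes.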
Choosing a backpressure policy
When the soft spill threshold (80 % of limit) trips, somebody has to give up memory. The backpressure knob picks the policy that decides who:
| Value | Active policy | Behavior |
|---|---|---|
| pause (default) | BackPressurePreferred -> Priority | If any consumer can be paused at its inbound channel, pause it. Otherwise pick the lowest-priority consumer (cheapest to spill) and ask it to spill. |
| spill | Priority | Never pause a producer. Pick the lowest-priority consumer and ask it to spill. Closest to the pre-arbitrator react-only behavior, but with deterministic priority-based selection. |
| both | BackPressurePreferred -> LargestFirst | Pause when possible, otherwise force the largest holder to spill regardless of priority. Useful when one operator dominates the budget and a fairness override is wanted. |
pause is the right default for most pipelines: when a fast Source is feeding a slow Combine through a bounded inter-stage buffer, pausing the Source is strictly cheaper than spilling the buffer to disk (no I/O, no serialization round-trip). spill and both exist for the rarer cases where you want a different posture.
To inspect which policy is active before running, use --explain:
=== Execution Plan ===
Mode: ...
DAG nodes: 7
arbitration: BackPressurePreferred -> Priority
The arbitration: line shows the composed policy name. A pipeline with backpressure: spill prints arbitration: Priority; one with backpressure: both prints arbitration: BackPressurePreferred -> LargestFirst.
When to override the default
- Fast Source + slow Combine → leave on pause (the default). The arbitrator pauses the Source when the inter-stage buffer approaches its share of the budget, and no spill files are written.
- Two parallel Aggregate stages, one much larger than the other → consider both. BackPressurePreferred -> LargestFirst pauses where it can, then targets the dominant Aggregate for spill, freeing the most headroom per spill call.
- Pure react-only with deterministic priority → spill. Pauses are disabled; the arbitrator picks the cheapest-to-spill consumer (node_buffers before grace-hash before sort before Aggregate) every time. Closest to the pre-arbitrator behavior.
Streaming batch size (batch_size)
pipeline.batch_size sets how many events (records plus document-boundary punctuations) a streaming-eligible stage hands off to its downstream consumer at a time over a back-pressured channel. For a fused stage (Source → Transform → Output, Merge.interleave of Sources) it bounds the in-flight working set to one batch rather than the whole stage, because the stage pulls records off a live upstream channel without ever building a full result. The other streaming stages build their full result first and stream it in batches; there the knob sizes only the inter-stage slice, not the producer’s footprint. The knob is optional; omit it to use the built-in default of 2048 events. See Streaming vs. Blocking Stages for the distinction.
pipeline:
name: orders_rollup
batch_size: 1024 # optional; default 2048
A per-transform override is available on a Transform’s config.batch_size (see Transform Nodes); it takes precedence over the pipeline value for that one stage. A batch_size of 0 is rejected at config load. The knob affects only the memory profile of streaming stages, never their output — blocking stages (sort, hash Aggregate, Combine build side) ignore it and continue to fully materialize. See Streaming vs. Blocking Stages for the full model.
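As a rough sketch of the resolution order and of what "handing off a batch at a time" means, assuming hypothetical helper names (`effective_batch_size`, `batches`) that are not part of Clinker:

```python
from itertools import islice

DEFAULT_BATCH_SIZE = 2048  # documented built-in default


def effective_batch_size(pipeline_value=None, transform_value=None):
    """Per-transform override wins, then the pipeline value, then the default."""
    for name, value in (("pipeline.batch_size", pipeline_value),
                        ("config.batch_size", transform_value)):
        if value is not None and value < 1:
            # The docs say a batch_size of 0 is rejected at config load.
            raise ValueError(f"{name} must be at least 1, got {value}")
    if transform_value is not None:
        return transform_value
    if pipeline_value is not None:
        return pipeline_value
    return DEFAULT_BATCH_SIZE


def batches(events, size):
    """Yield successive lists of at most `size` events from any iterable.

    Only one batch is held at a time, which is the sense in which a fused
    stage's in-flight working set is bounded by the batch size.
    """
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

The generator never materializes the whole input, so the same loop models a stage pulling from a live upstream channel.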
How it works
Clinker tracks memory in two layers. RSS (resident set size) is sampled at chunk boundaries and supplies the primary spill / abort signal. Alongside RSS, every memory-touching operator (Source ingest channels, Aggregate hash maps, sort buffers, grace-hash partitions, sort-merge accumulators, IEJoin arrays, inline-Combine hash tables, node_buffers slots, and window-runtime arenas) registers a MemoryConsumer wrapper with the pipeline-scoped arbitrator. Each operator owns its live byte counter and updates it on every admit / spill transition; the arbitrator queries current_usage() per consumer at every policy poll. This pull-mode attribution lets the policy distinguish reclaimable bytes (what an operator can give up right now) from currently-held bytes — a grace-hash with on-disk partitions, for instance, reports only its in-memory portion.
Window-runtime arenas (the columnar backing store that analytic-window evaluation reads from) are attributed but not independently spillable: an arena is immutable once built and is freed only indirectly, when the operator that consumes its windows drains to disk. Its wrapper reports the arena’s bytes so the arbitrator’s attribution is complete, but ranks last among spill victims so a policy never elects an arena while any consumer that can actually pause or spill remains.
Per-operator arbitration parameters
Each registered consumer carries two parameters the active policy reads: a spill priority (lower is spilled first under Priority) and a back-pressure flag (whether its producer can be paused instead). The defaults are:
| Operator class | spill_priority | can_back_pressure |
|---|---|---|
| node_buffers slot (inter-stage buffer) | 0 | false |
| grace-hash Combine | 10 | false |
| sort buffer / IEJoin build | 20 | false |
| sort-merge Combine | 25 | false |
| hash Aggregate | 30 | false |
| inline-hash Combine | 30 | false |
| Source ingest | N/A | true |
| streaming Aggregate | N/A | false |
| window arena | last | false |
Lower priority is spilled first, so node_buffers slots (priority 0) are the cheapest victim class — spilling an inter-stage buffer to disk costs one LZ4 + postcard round-trip and frees the most reclaimable bytes per call. The blocking operators climb from there: a grace-hash Combine (10) is preferred over a sort buffer (20), which is preferred over a hash Aggregate or inline-hash Combine (30).
A Source and a streaming Aggregate show spill_priority=N/A because neither operator holds spillable accumulated state. A Source’s try_spill always frees zero bytes — its only real lever is the pause its can_back_pressure=true advertises. A streaming Aggregate emits each group as it completes and never accumulates a spillable group table. The N/A here is about the operator’s own state, not its downstream handoff: when a streaming stage’s output rides a per-batch streaming handoff to a single consumer, that handoff registers a priority-0 consumer just like a node_buffers slot does, and its in-flight batches are spilled to disk one batch at a time if RSS crosses the soft threshold while they are in flight. So a streaming Aggregate’s group table is never a spill victim, but the batches it hands downstream can be.
When memory pressure crosses the soft threshold (80 % of limit), the arbitrator runs the active policy to pick a victim and invokes the corresponding action: pause() on a back-pressureable consumer (its producer’s hot loop parks on a Condvar until resume), or try_spill(target_bytes) on a spillable consumer (the consumer’s wrapper flips a spill-requested flag the operator reads at its next batch boundary). When RSS crosses the hard limit, the engine fails fast with E310 MemoryBudgetExceeded.
This means:
- Pipelines complete regardless of input size as long as pausing and spilling keep RSS under the hard limit and the spill volume has disk space.
- Performance degrades gracefully under memory pressure: once the soft threshold trips you will see slower execution (and possibly disk I/O) before you see a failure.
- The memory limit is not an instantaneous wall. RSS is sampled at chunk boundaries, so momentary spikes may briefly exceed the limit before the policy fires.
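The victim-selection step can be sketched as follows, following the wording of the backpressure policy table. `Consumer` and `pick_action` are hypothetical names, not Clinker's API. How the real arbitrator chooses among several pausable consumers is not specified here, so this sketch simply takes the first, and it models the table's N/A priority as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Consumer:
    name: str
    usage: int                     # bytes the consumer currently reports
    spill_priority: Optional[int]  # None models the table's N/A
    can_back_pressure: bool


def pick_action(backpressure: str, consumers: List[Consumer]):
    """Return ("pause" | "spill", consumer name), or None if nothing can act."""
    if backpressure not in ("pause", "spill", "both"):
        raise ValueError(f"unknown backpressure value: {backpressure!r}")
    pausable = [c for c in consumers if c.can_back_pressure]
    spillable = [c for c in consumers if c.spill_priority is not None]

    # pause and both: BackPressurePreferred pauses whenever a consumer allows it.
    if backpressure in ("pause", "both") and pausable:
        return ("pause", pausable[0].name)
    if not spillable:
        return None
    if backpressure == "both":
        # LargestFirst: the biggest holder spills, regardless of priority.
        victim = max(spillable, key=lambda c: c.usage)
    else:
        # Priority: the lowest spill_priority is spilled first.
        victim = min(spillable, key=lambda c: c.spill_priority)
    return ("spill", victim.name)
```

With a Source (N/A, pausable), an inter-stage buffer (priority 0), and a hash Aggregate (priority 30), `pause` selects the Source for pausing while `spill` selects the buffer.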
Bounded-memory contract for non-fused stages
A stage runs streaming — no charged per-stage node_buffers slot — when it hands its output to a single downstream sink Output and roots no window: fused Source → Transform → Output and Merge.interleave-of-Sources chains, plus single-branch Route, non-fused Merge, streaming-strategy Aggregate, and hash-build-probe Combine probe-side feeding one Output (see Streaming vs. Blocking Stages). The remaining boundaries — multi-branch Route fan-out, a Merge or other operator whose output forks to several consumers, Composition bodies, diamond DAGs, and every blocking strategy — materialize records into per-stage node_buffers. Each slot registers a NodeBufferConsumer with the arbitrator (priority 0 — the cheapest-to-spill victim class), so the active policy’s victim selection is fully attributed.
When a buffer crosses the soft threshold (80 % of the limit) the arbitrator runs the active policy. Under the default pause, the producer feeding the buffer is paused at its inbound channel; under spill or when no consumer can be paused, the slot spills to disk using the same LZ4 + postcard frame format as grace-hash sort partitions. When RSS crosses the hard limit, the engine fails fast with E310 MemoryBudgetExceeded { node } naming the operator whose hot loop polled the abort gate. See error E310 for the full diagnostic model, including the composition-involved two-shape error model.
Spill fires at the producer side of the first slot whose downstream topology permits it — single-consumer, port-less. For a Source feeding a Route, that’s the Source’s own slot, not the Route’s per-branch slots, because the Source has the one outgoing edge that satisfies the topology rule. Per-branch slots can still spill independently when their own row-distribution drives them past the soft threshold, but the canonical case lands at the producer.
Use clinker run --explain to predict which stages will dominate the budget before runtime — each node carries a buffer: streaming | materialized annotation. Materialized nodes charge pipeline.memory.limit as one full-stage slot and spill the whole stage; streaming nodes charge per in-flight batch and, on a single-consumer edge, spill those batches one at a time. Both classes count against the limit and can spill — the annotation tells you the granularity (whole-stage vs. per-batch), not whether a stage is exempt from the budget.
Reading --explain arbitration output
Alongside the buffer: class, every node in the Physical Properties stanza of --explain carries an arbitration: line giving the per-operator parameters the arbitrator would apply at runtime. The numbers are derived at plan time — --explain does no I/O, so there are no live consumers to query — but they mirror the runtime values exactly, so an author can read the spill/pause model before running the pipeline.
For a fast Source feeding a slow Aggregate (the canonical bounded-memory shape), the relevant lines read:
=== Physical Properties ===
source.orders:
buffer: materialized
arbitration: spill_priority=N/A, can_back_pressure=true, predicted_peak=1K, predicted_freed=0B, predicted_subtree_reclaim=1K
aggregation.dept_totals:
buffer: materialized
arbitration: spill_priority=30, can_back_pressure=false, predicted_peak=1K, predicted_freed=1K, predicted_subtree_reclaim=1K
The Source advertises can_back_pressure=true and spill_priority=N/A: when memory pressure rises, the arbitrator pauses the Source rather than asking it to spill (it has nothing to free). The hash Aggregate advertises the opposite — spill_priority=30, can_back_pressure=false — so it is a spill victim, ranked behind any cheaper consumer.
The three predicted_* values are the scheduler’s inputs (see Scheduling below). predicted_peak is the live volume a node is expected to hold at its peak — seeded at a file-backed Source from its path: file’s on-disk size and propagated forward. predicted_freed is what the node returns to the budget the instant it finishes draining: a blocking Aggregate holds its whole accumulated input (predicted_peak=1K) and frees it on drain (predicted_freed=1K), while a streaming Source carries the volume through but frees nothing the instant it drains (predicted_freed=0B). predicted_subtree_reclaim is the largest reclaim the node’s downstream chain eventually unlocks: the Source frees nothing itself, but launching it is the only way to reach the point where its downstream Aggregate can drain, so it inherits that Aggregate’s reclaim (predicted_subtree_reclaim=1K). Propagation of the subtree value stops at a convergence node — the Combine two independent chains feed — so each feeding chain keeps the distinct reclaim it owns up to the join rather than the shared post-join total. All three render 0B when no file-size seed reached the node — a multi-file (glob/regex/paths) or absent/unreadable Source, or any node downstream of one. The bytes are formatted in the same binary-prefix units as memory.limit (1K, 64M, 2G), and the same three values appear in --explain --format json under node_properties.<name>.predicted_peak_bytes, predicted_freed_bytes_on_complete, and predicted_subtree_reclaim_bytes.
A === Buffer Edges === section follows, listing the node_buffers slot between each pair of non-fused stages. Every slot is a priority-0, non-back-pressureable NodeBufferConsumer — the cheapest victim class — and the slot= number is the stable index the executor admits into. The slot carries the producer’s predicted volume (it holds the producer’s materialized output and frees that whole buffer once the consumer drains it):
=== Buffer Edges ===
edge source.orders -> aggregation.dept_totals:
buffer: node_buffer (slot=0)
arbitration: spill_priority=0, can_back_pressure=false, predicted_peak=1K, predicted_freed=1K (producer: source)
Reading top to bottom: under memory pressure the arbitrator first spills the inter-stage buffer (priority 0), then — if the soft threshold is still tripped — pauses the Source before it ever forces the Aggregate (priority 30) to spill. That ordering is exactly what the default pause policy (BackPressurePreferred -> Priority) encodes. Cross-reference the per-operator table to see where any operator in your own pipeline lands.
Scheduling
When a pipeline has several nodes that are simultaneously runnable — every one of their inputs is ready, so the executor could legally run any of them next — the engine picks one deterministically rather than walking topological position blindly. The common case is a single linear chain where only one node is ever runnable at a time, and there is nothing to choose. The choice matters only for a pipeline whose DAG has multiple independent subgraphs (for example, two unrelated Source → Aggregate branches that a later Combine or Merge joins): both branches’ lead nodes become runnable together.
The engine runs one node to completion before dispatching the next. When two independent chains converge (two Source → Aggregate branches that a later Combine joins), both branches’ outputs must be materialized and held until the Combine consumes them, so the chain that runs second builds its working set while the first chain’s output already sits in a buffer. Running the memory-heaviest chain first therefore drains and releases its large state before the lighter chain’s output has to coexist with it, lowering the peak resident working set; running it last makes its large state coexist with the already-materialized output of every chain that finished before it. The ranking also helps when the frontier offers a mix of node kinds: with a blocking operator ready to drain (and reclaim its accumulated state) alongside a fresh Source about to charge a new buffer, draining first reclaims headroom before the new charge lands, and under a tight budget the engine prefers the runnable node that fits the remaining headroom over one that would overflow it.
The engine ranks the simultaneously-runnable nodes by these rules, in order:
1. Headroom fit. A node whose predicted_peak fits within the budget’s remaining headroom is preferred over one that does not. Running a node that fits avoids tipping the live working set over the soft threshold and forcing a spill that a different ordering would have avoided. A node with an unknown peak (predicted_peak=0B, meaning no file-size seed reached it) counts as fitting, because 0 is always within any headroom; this keeps an unestimated pipeline on its topological order rather than deprioritizing every node.
2. Immediate-freed tiebreak. Among nodes that fit equally, the one with the larger predicted_freed runs first. Finishing a node that returns more bytes to the budget the instant it completes maximizes the headroom available to everything still waiting; this is the same intuition as shortest-remaining-state-first. A ready blocking operator (which reclaims its accumulated state now) therefore wins over a fresh Source (which frees nothing the instant it drains), because the immediate reclaim is the headroom-maximizing choice.
3. Subtree-reclaim tiebreak. Among nodes that also tie on immediate freed (most importantly the fresh Sources of independent chains, which all free 0 the instant they drain), the one with the larger predicted_subtree_reclaim runs first. This front-loads the chain whose completion eventually frees the most: a Source’s value is the reclaim its downstream Aggregate will release, so the heavier chain’s Source is dispatched ahead of the lighter one even when it sorts later in topological order. Because it ranks below immediate freed, it never elects a fresh heavy Source over a ready light Aggregate (which would raise the peak), only between candidates whose immediate reclaim is equal.
4. Stable-index tiebreak. If two nodes still tie (equal fit, equal immediate freed, equal subtree reclaim, including the all-unknown case where all are 0), the one with the lower stable node index wins. The index is each node’s position in the plan’s topological order (the exact sequence the executor walks the DAG), so this tiebreak is fully deterministic and independent of the machine, the thread schedule, and the order the runnable set happened to be assembled in.
Fallback to topological order. When no node carries a volume estimate (every predicted_peak is 0B), rules 1–3 are no-ops — every node fits and every node frees the same 0 — so rule 4 alone decides, and the engine runs nodes in exactly the lowest-index / topological order it used before any volume estimates existed. This is the load-bearing guarantee: scheduling never changes record output or branching order. A pipeline’s data output is byte-identical regardless of the predictions; the estimates only steer which runnable node goes first to reclaim headroom sooner, front-load the heaviest chain, and prefer fitting nodes under pressure, never what each node computes.
Because the predictions are a pure function of the plan shape and the input files’ on-disk sizes (resolved against the pipeline file’s directory, never the process working directory), the scheduling decision is identical on every machine for an identical plan over identically-sized inputs.
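The four ranking rules amount to a lexicographic sort key. The sketch below is one way to express that under stated assumptions: `Runnable` and `pick_next` are hypothetical names, and "fits" is taken to mean `predicted_peak <= headroom`, which the text does not pin down at the boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runnable:
    name: str
    index: int                      # stable position in topological order
    predicted_peak: int             # bytes; 0 means unknown
    predicted_freed: int            # bytes returned the instant the node drains
    predicted_subtree_reclaim: int  # largest reclaim unlocked downstream


def pick_next(runnable, headroom):
    """Choose the next node among simultaneously runnable candidates."""
    def rank(node):
        fits = node.predicted_peak <= headroom
        return (
            not fits,                         # rule 1: fitting nodes first
            -node.predicted_freed,            # rule 2: larger immediate reclaim
            -node.predicted_subtree_reclaim,  # rule 3: larger eventual reclaim
            node.index,                       # rule 4: lowest stable index
        )
    return min(runnable, key=rank)
```

When every prediction is 0 the first three key components are equal for all candidates, so the choice collapses to the lowest index, which is the topological-order fallback described above.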
Sizing guidelines
| Workload | Recommended limit | Notes |
|---|---|---|
| Small files (<10 MB) | 128M | Minimal memory pressure |
| Medium files (10–50 MB) | 256M | Covers most ETL jobs |
| Large files or complex aggregations | 512M (default) – 1G | Multiple group-by keys, large cardinality |
| Multiple large group-by keys | 1G+ | High-cardinality distinct values |
Target workload: Clinker is optimized for 1–5 input files of up to 100 MB each, processing 10K–2M records per run.
Aggregation strategy interaction
Memory consumption depends heavily on the aggregation strategy the optimizer selects:
- Hash aggregation accumulates state in a hash map. Memory usage is proportional to the number of distinct group-by values. With high-cardinality keys, this can consume significant memory before spill triggers.
- Streaming aggregation processes groups in order and emits results as each group completes. Memory usage is minimal (proportional to a single group’s state) but requires the input to be sorted by the group-by keys.
- strategy: auto (the default) lets the optimizer choose based on the declared sort order of the input. If the data arrives sorted by the group-by keys, streaming aggregation is selected automatically.
To influence strategy selection:
- type: aggregate
name: rollup
input: sorted_data
config:
group_by: [department]
strategy: streaming # force streaming (input MUST be sorted)
cxl: |
emit total = sum(amount)
Only force streaming when you are certain the input is sorted by the group-by keys. If the data is not sorted, results will be incorrect. Use auto when in doubt.
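To see why the sort requirement matters, here is a small illustration of the two general techniques. It is not Clinker's implementation, and the exact shape of Clinker's incorrect output on unsorted input is not documented; the sketch only shows how a group-at-a-time aggregator splits a group when its rows are not adjacent.

```python
from itertools import groupby
from operator import itemgetter


def hash_aggregate(rows):
    """Sum amounts per key in a dict; memory grows with distinct keys."""
    totals = {}
    for dept, amount in rows:
        totals[dept] = totals.get(dept, 0) + amount
    return list(totals.items())


def streaming_aggregate(rows):
    """Emit each group as soon as the key changes; holds one group at a time."""
    for dept, group in groupby(rows, key=itemgetter(0)):
        yield dept, sum(amount for _, amount in group)


sorted_rows = [("eng", 10), ("eng", 5), ("ops", 7)]
unsorted_rows = [("eng", 10), ("ops", 7), ("eng", 5)]

# Sorted input: both strategies agree.
assert list(streaming_aggregate(sorted_rows)) == hash_aggregate(sorted_rows)

# Unsorted input: the streaming strategy emits "eng" twice with partial sums.
split = list(streaming_aggregate(unsorted_rows))
```

Here `split` contains two separate `eng` entries, while `hash_aggregate(unsorted_rows)` still returns a single `eng` total of 15.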
Compositions
A composition (a reusable sub-pipeline included via use:) does not get its own memory budget. Body operators register with the same arbitrator instance as the parent pipeline, admit through the same paths, and spill to the same temporary directory. The recursion is purely structural. Body-scope consumer registrations are unregistered automatically when the body exits, so a body’s NodeBufferConsumer wrappers do not leak into the parent scope’s policy registry.
When a budget exceedance involves a composition, the error message arrives in one of two shapes:
- At the composition boundary (records flowing into the body via an input port, or back out into the parent): the error names the composition’s call-site directly (e.g. enrich_call).
- Inside the body: the error is wrapped so the user-visible call-site name surfaces alongside the body-internal operator that tripped. The rendered message reads in composition 'enrich_call': ... followed by the inner detail.
See error E310 for the full diagnostic model.
Monitoring memory usage
Use the metrics system to track peak_rss_bytes across runs:
clinker run pipeline.yaml --metrics-spool-dir ./metrics/
The metrics file includes peak_rss_bytes, which shows the maximum resident memory during execution. If this consistently approaches your memory limit, consider increasing the budget or restructuring the pipeline to reduce intermediate state.
Shared server considerations
On servers running JVM applications, memory is often at a premium. Recommendations:
- Set --memory-limit or memory.limit explicitly rather than relying on the default. Know your budget.
- Use --threads to limit CPU contention alongside memory limits.
- Monitor peak_rss_bytes in production metrics to right-size the limit over time.
- Schedule large pipelines during off-peak hours when JVM heap pressure is lower.
Storage & Spill Location
Blocking operators — Aggregate, sort, and grace-hash Combine — accumulate
state in memory up to the configured budget, then spill to disk when a
soft or hard memory threshold trips, rather than running the process out of
memory. By default those spill files land in the operating system’s temporary
directory. The [storage] block in clinker.toml lets you redirect them.
The [storage] block
Storage settings are a property of the workspace, not of an individual
pipeline, so they live in clinker.toml at the workspace root rather than in
the per-pipeline YAML:
[storage.spill]
dir = "/var/clinker/spill" # optional; default = OS temp dir
disk_cap_bytes = "10GB" # optional; default = unlimited
compress = "auto" # optional; auto | off | on (default = auto)
[storage.staging]
enabled = false # opt-in; default off
dir = "/var/clinker/staging" # required when enabled
patterns = ["/mnt/nfs/data/**"] # which sources to stage
The whole block is optional. With no clinker.toml, or a clinker.toml that
omits [storage], Clinker spills to the OS temp directory exactly as it
always has.
storage.spill.dir — where spill files go
When dir is set, the per-run spill directory (clinker-spill-<random>/) is
created under that path, and every blocking operator writes its spill files
there. When dir is omitted, the per-run directory is created under the OS
temp directory (std::env::temp_dir, typically $TMPDIR or /tmp).
The directory is validated once at startup, before any input is read. If the path does not exist, is a file, or is not writable, the run fails immediately with a diagnostic naming the setting:
storage.spill.dir /var/clinker/spill does not exist; create it or point at an existing volume
Validating up front — rather than at the first spill — means a misconfigured spill volume fails fast, while the run is cheap to abandon, instead of after minutes of work. (This is the trap DuckDB fell into when its temp-directory setting was honored only lazily, duckdb/duckdb#9401.)
Why redirect spill off /tmp
On many Linux hosts — especially systemd-managed ones — /tmp is mounted as
tmpfs, which is backed by RAM (and swap), not disk. Spilling there does
not actually free physical memory: the spill bytes stay resident, defeating
the whole point of the memory budget. If df -T /tmp reports a tmpfs
filesystem, point storage.spill.dir at a path on a real block device so
spilling moves pressure off RAM and onto disk.
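The same check that `df -T /tmp` performs can be scripted on Linux by reading the mount table. The helper below is a hypothetical sketch, not part of Clinker: it takes the text of `/proc/mounts` and returns the filesystem type of the longest mount point that contains the path. It ignores the octal escapes `/proc/mounts` uses for unusual characters in mount points.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the fs type of the mount containing `path`, or None if no match."""
    target = os.path.abspath(path) + "/"
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Longest matching mount point wins; later entries shadow earlier ones.
        if target.startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


SAMPLE_MOUNTS = (
    "/dev/sda1 / ext4 rw 0 0\n"
    "tmpfs /tmp tmpfs rw 0 0\n"
    "/dev/sdb1 /var/clinker ext4 rw 0 0\n"
)
```

On a real host, pass the contents of `/proc/mounts` instead of `SAMPLE_MOUNTS`; a result of `tmpfs` for the intended spill directory is the signal to set storage.spill.dir elsewhere.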
Inspecting the resolved spill root
clinker run --explain prints the resolved spill root and where it came from,
so you can confirm the setting took effect before committing to a run:
Spill root: /var/clinker/spill [storage.spill.dir]
…or, with no configuration:
Spill root: /tmp [OS temp dir (default)]
The same --explain output reports the resolved disk cap on the next line:
Spill disk cap: 10000000000 bytes [storage.spill.disk_cap_bytes]
…or, with no cap configured:
Spill disk cap: unlimited (default)
Finally, --explain reports the resolved compression decision per
spill-writing operator, so you can see which spills will be LZ4-framed (lz4)
and which will be written raw (off) before the run starts. Under auto the
choice varies by operator width:
Spill compression: Auto [storage.spill.compress]
Aggregate 'totals' → lz4
Sort 'by_amount' → off
Only operators that actually write spill files appear here: the external sort, the hash Aggregate, and the grace-hash / sort-merge Combine. In-memory join strategies (the inline hash build/probe and the IEJoin range join) run their kernel entirely in RAM and never open a spill file, so spill compression does not apply to them and they are omitted from this list — even though they carry a spill priority for memory arbitration.
storage.spill.disk_cap_bytes — cap cumulative spill
By default a run will spill as much as it needs, limited only by the physical
space on the spill volume. disk_cap_bytes sets a cumulative budget: the
total on-disk size of every spill file a run writes. When that running total
would cross the cap, the run aborts with a dedicated diagnostic instead of
continuing to fill the volume.
[storage.spill]
dir = "/mnt/fast-ssd/clinker-spill"
disk_cap_bytes = "50GB"
The value accepts the same human-readable byte-size grammar as the source
size filters — a bare integer is bytes, and KB/MB/GB suffixes use
decimal units (1GB = 1,000,000,000 bytes), matching du, df, and the AWS
CLI. Omitting the key leaves spill unlimited, exactly as before.
The cap is a policy ceiling, deliberately independent of both the memory
budget and the physical volume size. A run can sit well inside its
memory.limit and still exhaust local disk through an unbounded stream of
spill files; the cap lets an operator bound that on a shared volume. It is the
guard DataFusion shipped without (apache/datafusion#15358) until production
runs filled volumes.
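A minimal sketch of the decimal size grammar and the cumulative accounting described in this section. All names are hypothetical, only the KB/MB/GB suffixes named above are handled, and treating "would cross the cap" as "running total would exceed the cap" is an assumption about the boundary case.

```python
import re

# Decimal (1000-based) units, as used by disk_cap_bytes.
_DECIMAL = {"": 1, "KB": 1000, "MB": 1000**2, "GB": 1000**3}


def parse_decimal_size(text):
    """Parse '10GB', '500MB', or a bare integer byte count into bytes."""
    match = re.fullmatch(r"(\d+)(KB|MB|GB)?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _DECIMAL[match.group(2) or ""]


class SpillCapExceeded(RuntimeError):
    """Raised when a write would push cumulative spill past the cap."""


class SpillBudget:
    def __init__(self, cap_bytes=None):
        self.cap_bytes = cap_bytes  # None models "unlimited"
        self.written = 0            # cumulative bytes across all spill files

    def charge(self, nbytes):
        """Account for a spill write, refusing it if the cap would be crossed."""
        if self.cap_bytes is not None and self.written + nbytes > self.cap_bytes:
            raise SpillCapExceeded(
                f"spill of {nbytes} bytes would exceed cap of {self.cap_bytes}"
            )
        self.written += nbytes
```

The budget is checked before the write is counted, so a refused write leaves the running total unchanged.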
storage.spill.compress — LZ4 compression policy
Spill files are postcard-encoded record streams. By default each stream is wrapped in an LZ4 frame, which shrinks large spilled runs. But LZ4 carries a per-frame fixed cost — clearing the compressor’s internal state on every frame reset — and on small spills that cost can outweigh the byte savings. The LZ4 v1.8.2 release notes call this out directly, and Pentaho Kettle ships explicit guidance to turn spill compression off for small rows.
compress controls the policy:
[storage.spill]
compress = "auto" # auto | off | on (default = auto)
| Mode | Behavior |
|---|---|
| auto (default) | Compress only when a spilled batch is projected large enough to amortize LZ4’s per-frame cost: both ≥ 4 KiB and ≥ 1024 rows. Below either threshold the batch is written raw. The projection comes from the operator’s schema width and the run’s batch_size, so the decision is made per blocking operator. |
| off | Never compress. Postcard records are written straight to disk with no LZ4 frame. Cheapest for small spills; largest on-disk size. |
| on | Always compress with an LZ4 frame. The pre-knob behavior, best for spills of large, compressible rows. |
Each spill file records its compression choice in a one-byte header tag, so the read path always dispatches to the right decoder regardless of the mode the file was written with — changing the knob between runs never breaks re-reading an earlier run’s files.
The 4 KiB / 1024-row thresholds mark the empirical crossover: below them the
LZ4 frame’s fixed cost dominates the small amount of compressible payload, and
writing raw is faster end-to-end (the spill_compression benchmark sweeps
batch sizes from 256 B to 64 KiB and confirms auto tracks the faster of
on / off across the range). Most pipelines should leave compress at
auto; set on when spilling wide, highly compressible rows to a
space-constrained volume, and off when spills are dominated by many small
batches.
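The decision rule in the table can be sketched as below. The function name is hypothetical, and so is the projection itself: the text says it is derived from schema width and batch_size but does not give the formula, so this sketch assumes projected bytes are `row_width_bytes * batch_size` and projected rows are `batch_size`.

```python
MIN_BYTES = 4 * 1024  # 4 KiB threshold from the table
MIN_ROWS = 1024       # row threshold from the table


def spill_compression(mode, row_width_bytes, batch_size):
    """Return 'lz4' or 'off' for one spill-writing operator."""
    if mode == "on":
        return "lz4"
    if mode == "off":
        return "off"
    if mode != "auto":
        raise ValueError(f"unknown compress mode: {mode!r}")
    # Assumed projection: width times batch size. Both thresholds must hold.
    projected_bytes = row_width_bytes * batch_size
    if projected_bytes >= MIN_BYTES and batch_size >= MIN_ROWS:
        return "lz4"
    return "off"
```

Under this assumption, an operator with 64-byte rows and the default batch size of 2048 resolves to `lz4`, while the same operator with a batch size of 512 falls below the row threshold and is written raw.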
Observability — what the planner will do before you run
clinker run --explain is plan-only (it reads no input and spills nothing), so
it is the safe place to see what a run would do to the spill volume and to the
staging dir before committing to it. On top of the resolved spill root, disk
cap, and compression decision documented above, --explain surfaces three
storage-observability sections, and a real clinker run reports the matching
actuals at end-of-run so you can calibrate the estimate.
A note on byte units. Three different unit conventions appear across the storage surface, and it helps to know which is which before comparing figures:
- Config values you write (disk_cap_bytes = "10GB") use decimal units (1GB = 1,000,000,000 bytes), matching du, df, and the AWS CLI (see the disk-cap grammar).
- The === Estimated Spill Volume === section humanizes with binary suffixes (K/M/G = KiB/MiB/GiB) so it lines up with the predicted_peak figure on each stage’s Physical Properties line, which uses the same humanizer.
- The cap-headroom line and the post-run actuals print raw bytes with no suffix, so the cap-minus-estimate subtraction and the estimate-vs-actual comparison are exact rather than rounded.
When you calibrate the estimate against the post-run actual, convert the binary
estimate suffix to bytes first (1K = 1024 bytes, 1M = 1,048,576 bytes) so you
are comparing the same unit the actuals report.
Estimated spill volume per stage
The === Estimated Spill Volume === section lists one line per spill-writing
stage (hash Aggregate, external sort, grace-hash / sort-merge Combine) with its
plan-time spill-volume estimate, followed by a total. In-memory join strategies
(inline hash build/probe, IEJoin) never write spill files, so they do not
appear here and do not inflate the total:
=== Estimated Spill Volume ===
Estimated spill volume (per blocking stage):
[aggregation:hash] dept_totals → 1K
[sort] by_amount → 4K
Total: 5K
Each figure is the operator’s coarse predicted peak live state — the same
predicted_peak the Physical Properties arbitration line shows — and bytes
render in binary units (K/M/G = KiB/MiB/GiB). Summing rather than maxing is
the conservative choice for a preflight: two blocking operators can be live and
spilled at the same time, so their footprints add.
A streaming-only pipeline (no blocking operator) has nothing that spills, so the section is omitted entirely.
Unknown stages. The estimate is seeded from input file sizes resolved at
plan time. A stage whose volume cannot be known before the run renders
unknown instead of a misleading 0B, and the total notes that unknown stages
are excluded:
[aggregation:hash] dept_totals → unknown
Total (known stages): 0B (excludes stages whose volume is unknown at plan time
— a network source, a missing or unreadable input, or a glob/regex matcher
whose discovery fails)
The seed is known for every file-backed matcher whose files can be sized at
plan time: a single-file path: source, an explicit paths: list, and a
glob: or regex: matcher. A glob/regex seed runs the same discovery resolver
the run uses — applying its exclude, min_size/max_size,
modified_after/before, take, and sort filters — and sums the matched
files’ sizes, so the estimate names exactly the bytes the run will read with no
second implementation to drift. A glob/regex that matches nothing seeds zero
(rendered as unknown, since there is no spill volume to preview). The seed is
genuinely unknown for a network source, for a missing or unreadable input
file, and for a glob/regex matcher whose discovery itself fails (an invalid
pattern, or no match under on_no_match: error) — the run surfaces the same
error at startup. Check the post-run actuals below to calibrate any estimate.
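The total-with-unknowns rule is simple enough to state as code. In this sketch an unknown stage is modelled as `None`, and the result keys reuse the `total_known_bytes` and `any_unknown` field names from the JSON form of --explain; the function name and input shape are hypothetical.

```python
def summarize_estimates(per_stage):
    """Sum known per-stage estimates and flag whether any stage is unknown.

    `per_stage` is a list of (node_name, estimate_bytes or None) pairs.
    """
    known = [estimate for _, estimate in per_stage if estimate is not None]
    return {
        "total_known_bytes": sum(known),          # unknown stages are excluded
        "any_unknown": len(known) != len(per_stage),
    }
```

Summing, rather than taking the maximum, matches the conservative preflight choice described earlier: blocking operators can be live and spilled at the same time.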
Staging plan per source
When storage.staging is enabled, the === Staging Plan === section reports,
for each source (and each discovered file under a multi-file matcher): whether
it would be staged, the resolved content-addressed staged path, and — under
on_existing = reuse — the reuse-if-fresh cache decision (hit if a committed
prior copy still matches the live source, miss if it would be re-staged):
=== Staging Plan ===
Source 'orders':
/data/in/orders-2024.csv → staged: yes, path: /mnt/local/staging/3f2a…b1.staged, reuse: hit
/data/in/orders-2025.csv → staged: yes, path: /mnt/local/staging/9c4e…07.staged, reuse: miss
The reuse prediction runs the exact freshness check (mtime + size against the
committed manifest) the real run makes, read-only — --explain copies nothing.
A source that matches no staging pattern reports
staged: no (no pattern match, reads in place); a network source reports
not stagable (network source reads in place). When staging is disabled the
section states that every source reads in place.
Cap headroom
When a spill cap is configured, --explain reports the headroom (cap minus
estimate) with the same per-invocation disclaimer the startup
cap-headroom preflight carries, and the same 80%
warning:
Cap headroom: 5000000000 bytes free (5000000000 estimated of 10000000000 cap, 50%)
[per invocation — does NOT account for sibling invocations sharing the spill
volume under partition-and-run]
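The headroom arithmetic can be sketched in a few lines. This is an illustration only, not the engine's code: the function name is hypothetical, the result keys are borrowed from the JSON form of the same report, and the sketch assumes the percentage is expressed in percent units (as the `50%` in the sample line suggests) and that the warning fires once the estimate reaches 80% of the cap.

```python
WARN_THRESHOLD = 0.80  # the 80% warning level named in the text


def cap_headroom(estimated_bytes: int, cap_bytes: int) -> dict:
    """Headroom figures for ONE invocation (hypothetical helper).

    Sibling invocations sharing the same spill volume are not visible here,
    which is exactly why the report carries its per-invocation disclaimer.
    """
    if cap_bytes <= 0:
        raise ValueError("cap_bytes must be positive")
    fraction = estimated_bytes / cap_bytes
    return {
        "headroom_bytes": cap_bytes - estimated_bytes,  # cap minus estimate
        "pct_of_cap": fraction * 100,                   # assumed percent units
        "over_threshold": fraction >= WARN_THRESHOLD,
    }


report = cap_headroom(5_000_000_000, 10_000_000_000)
print(report["headroom_bytes"], report["pct_of_cap"], report["over_threshold"])
```

Whether the engine clamps a negative headroom (estimate above the cap) is not stated in this section, so the sketch simply reports the raw difference.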
Machine-readable form — --explain json
clinker run --explain json emits the whole plan as JSON for tooling (the Kiln
canvas, dashboards, CI gates). The same storage observability the text form
prints lives under a structured storage_summary object, so a consumer reads
per-stage spill estimates and the cap / staging summary without re-parsing
prose:
{
"schema_version": "1",
"nodes": [ ... ],
"node_properties": { ... },
"storage_summary": {
"spill_root": { "path": "/mnt/fast-ssd/clinker-spill", "source": "storage.spill.dir" },
"spill_disk_cap_bytes": 1000000000,
"estimated_spill": {
"per_stage": [
{ "node_name": "dept_totals", "display_name": "[aggregation:hash] dept_totals", "estimate_bytes": 1024 },
{ "node_name": "by_amount", "display_name": "[sort] by_amount", "estimate_bytes": 4096 }
],
"total_known_bytes": 5120,
"any_unknown": false
},
"spill_compression": {
"mode": "auto",
"per_operator": [
{ "node_name": "dept_totals", "display_name": "[aggregation:hash] dept_totals", "compression": "lz4" },
{ "node_name": "by_amount", "display_name": "[sort] by_amount", "compression": "off" }
]
},
"cap_headroom": {
"headroom_bytes": 999994880,
"estimated_bytes": 5120,
"cap_bytes": 1000000000,
"pct_of_cap": 0.000512,
"over_threshold": false
},
"staging": { "enabled": false, "sources": [] }
}
}
The fields mirror the text sections one-for-one: estimated_spill is the
=== Estimated Spill Volume === section (a stage whose volume is unknown at
plan time carries estimate_bytes: null and sets any_unknown: true),
spill_compression is the Spill compression: projection, cap_headroom is
the cap-headroom line (omitted when no cap is configured or the estimate is
zero), and staging is the === Staging Plan === section. The JSON and DOT
formats emit only their machine payload — the human-readable
=== Resolved Outputs === preamble the text form prints is suppressed so the
output parses cleanly.
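As one way a CI gate might consume this payload, the sketch below reads only fields shown in the example above. The `gate` function, its `max_spill_bytes` policy threshold, and the trimmed sample payload are assumptions made for illustration; a real payload would come from the stdout of `clinker run --explain json`.

```python
import json

# Trimmed sample: only the storage_summary fields this gate reads.
EXPLAIN_JSON = """
{
  "schema_version": "1",
  "storage_summary": {
    "estimated_spill": {
      "per_stage": [
        {"node_name": "dept_totals", "estimate_bytes": 1024},
        {"node_name": "by_amount", "estimate_bytes": null}
      ],
      "total_known_bytes": 1024,
      "any_unknown": true
    }
  }
}
"""


def gate(plan: dict, max_spill_bytes: int) -> list:
    """Return a list of human-readable problems; empty means the gate passes."""
    summary = plan["storage_summary"]
    spill = summary["estimated_spill"]
    problems = []
    if spill["total_known_bytes"] > max_spill_bytes:
        problems.append(
            f"estimated spill {spill['total_known_bytes']} exceeds {max_spill_bytes}"
        )
    for stage in spill["per_stage"]:
        # A stage whose volume is unknown at plan time carries null.
        if stage["estimate_bytes"] is None:
            problems.append(f"{stage['node_name']}: spill volume unknown at plan time")
    # cap_headroom is omitted when no cap is configured or the estimate is zero.
    headroom = summary.get("cap_headroom")
    if headroom and headroom["over_threshold"]:
        problems.append("estimated spill is over the cap-headroom warning threshold")
    return problems


problems = gate(json.loads(EXPLAIN_JSON), max_spill_bytes=10_000)
print(problems)
```

Treating an unknown stage as a gate failure is a policy choice of this sketch; a more lenient gate could log it instead.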
Post-run actuals — calibrating the estimate
A real clinker run that spills prints a per-stage actual spill-volume
section at end-of-run, so you can compare it against the --explain estimate
for the same stage — the calibration loop that turns a coarse pre-run estimate
into a trustworthy one over repeated runs:
=== Spill Volume (actual, per stage) ===
dept_totals → 1048576 bytes
by_amount → 4194304 bytes
Total: 5242880 bytes (compare against the --explain estimate)
The per-stage breakdown sums to the pipeline-wide cumulative spill total. A run that stayed within memory spilled nothing and prints no section. A large estimate-vs-actual delta is the single highest-leverage signal when a pipeline starts spilling unexpectedly (the failure mode behind Polars’ documented 13.5× spill amplification, where an optimizer interaction turned 30 GB of input into 400 GB of spill with no per-stage visibility).
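The calibration loop can be sketched as a small comparison script. It assumes the end-of-run section keeps the line shape shown above (`name → N bytes`) and that the per-stage estimates were already collected, for example from the `--explain json` payload; the function names and the sample estimates are illustrative.

```python
import re

ACTUAL = """\
=== Spill Volume (actual, per stage) ===
dept_totals → 1048576 bytes
by_amount → 4194304 bytes
Total: 5242880 bytes (compare against the --explain estimate)
"""

# Matches only the per-stage lines; the header and Total line have no arrow.
STAGE_LINE = re.compile(r"^(?P<stage>\S+) → (?P<bytes>\d+) bytes$")


def parse_actuals(text: str) -> dict:
    """Extract {stage_name: actual_spill_bytes} from the end-of-run section."""
    actuals = {}
    for line in text.splitlines():
        match = STAGE_LINE.match(line.strip())
        if match:
            actuals[match["stage"]] = int(match["bytes"])
    return actuals


def amplification(estimates: dict, actuals: dict) -> dict:
    """Actual / estimate per stage, skipping unknown (None) or zero estimates."""
    return {
        stage: actuals[stage] / est
        for stage, est in estimates.items()
        if est and stage in actuals
    }


actuals = parse_actuals(ACTUAL)
ratios = amplification({"dept_totals": 1024, "by_amount": 4096}, actuals)
print(ratios)
```

A ratio far from 1 for a single stage points at that stage's estimate (or its runtime behavior) rather than at the pipeline as a whole.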
Note on the --explain compression projection. The per-operator spill-compression decision shown under Spill compression: is projected from the same column count the operator’s runtime spill writer sees, so the projected auto verdict matches the file the run actually writes. A hash Aggregate and a grace-hash / sort-merge Combine project against their output schema (engine-stamped identity columns included), exactly the width their dispatch arms resolve compression against; an enforcer sort projects against the width of the records flowing into it (its upstream’s emitted schema), which is the width its sort buffer reads at runtime. The read path also dispatches on each spill file’s own one-byte header tag, so re-reading is robust regardless.
Distinguishing the runtime storage-abort conditions
A run that fails while spilling or staging emits one of several distinct
diagnostics so a single glance at the error tells you exactly what to fix —
instead of every disk and memory problem rendering as one ambiguous “out of
memory” message (the trap DuckDB hit in duckdb/duckdb#14142, where a temp-dir
cap was reported as “Out of Memory Error … 187.3 GiB/187.3 GiB used” and users
inspected df only to find free space). The aborts split along two axes: the
spill side (in-memory operator state landing on disk) and the staging
side (matched source files copied to local disk before they are read).
Spill aborts
| Condition | Code | What happened | What to do |
|---|---|---|---|
| Out of memory | E310 | An operator’s in-RAM state crossed the hard memory.limit (a true RSS overrun). | Raise memory.limit, reduce input, or let the operator spill. |
| Spill cap exceeded | E320 | Cumulative spill bytes crossed storage.spill.disk_cap_bytes. The volume may still have free space — you hit the configured budget. | Raise disk_cap_bytes, point storage.spill.dir at a larger volume, or reduce the spill footprint. |
| Spill volume full | E321 | The OS reported the spill volume out of space (ENOSPC). The physical disk filled. | Free space on the volume, or move storage.spill.dir to a larger mount. |
| Spill directory unavailable | (Spill) | The spill directory went bad mid-run — unmounted, remounted read-only, deleted by a cleaner, or permissions revoked. | Remount/restore the volume; stop the over-eager cleaner. |
The key separations:
- E310 vs E320: an OOM is an in-RAM overrun; a cap-exceeded is a disk-budget stop. A run can hit E320 while comfortably inside its memory envelope, so conflating the two would point you at the wrong knob.
- E320 vs E321: E320 is the budget you set; E321 is the disk itself running dry. If you removed disk_cap_bytes, an over-large run would no longer trip E320 and would instead spill until the volume filled (E321).
(A future per-operator memory-reservation surface will add a fifth, reservation-exhausted condition; it is not part of the engine yet.)
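The separation between the disk-side conditions can be illustrated with a small classifier. This is a sketch under assumptions, not the engine's dispatch: the function names are hypothetical, and the errno set used for "directory unavailable" (ENOENT, EROFS, EACCES, EPERM) is a guess at how unmounted, read-only, deleted, and permission-revoked directories typically surface on POSIX.

```python
import errno

# Assumed errno values for a spill directory that went bad mid-run.
UNAVAILABLE_ERRNOS = {errno.ENOENT, errno.EROFS, errno.EACCES, errno.EPERM}


def check_spill_cap(cumulative_spill_bytes: int, disk_cap_bytes):
    """Budget check: E320 is a policy stop, independent of free disk space."""
    if disk_cap_bytes is not None and cumulative_spill_bytes > disk_cap_bytes:
        return "E320"
    return None


def classify_spill_write_error(exc: OSError) -> str:
    """Map an OS-level spill write failure to the condition it represents."""
    if exc.errno == errno.ENOSPC:
        return "E321"  # the physical volume filled
    if exc.errno in UNAVAILABLE_ERRNOS:
        return "spill directory unavailable"
    return "generic spill I/O error"


print(check_spill_cap(2_000, 1_000))
print(classify_spill_write_error(OSError(errno.ENOSPC, "No space left on device")))
```

Note that the cap check never looks at an OS error and the error classifier never looks at the cap, which is the point of keeping E320 and E321 distinct.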
Staging-copy aborts
When storage.staging is enabled, copying a matched source to local disk can
fail in three distinct ways. Like the spill split, each has its own code so a
content-corruption problem never renders as a budget problem and vice versa.
Staging runs before any record flows, so these surface as startup-style
validation failures.
| Condition | Code | What happened | What to do |
|---|---|---|---|
| Staged copy corrupt | E335 | The local copy’s BLAKE3 digest did not match the source — the transport (e.g. a soft-mount NFS share) delivered different bytes than the source holds. | Re-run over a healthy transport, harden the mount, or stage from a stable snapshot. Do not set verify = "none" to silence it — that hides corruption, not fixes it. |
| Staging cap exceeded | E336 | The cumulative bytes staged this run would cross storage.staging.disk_cap_bytes. The volume may still have free space — you hit the configured budget, not a full disk. | Raise disk_cap_bytes, point storage.staging.dir at a larger volume, narrow storage.staging.patterns, or remove the cap. |
| Staged copy already exists | E337 | A staged copy of this source already exists and on_existing = error refuses to touch it. | Remove the existing copy, or switch on_existing to overwrite (re-stage) or reuse (reuse a fresh copy). |
The same cap-vs-full-disk separation applies here as on the spill side: E336 is the budget you set (mirroring E320), so it must not render as an out-of-space message — a physically full staging volume instead surfaces as a staging I/O error (mirroring E321). E335 is distinct from a generic staging I/O error: an I/O error means the OS reported a fault, whereas E335 means the copy completed cleanly yet still does not match the source.
Startup storage validation
Before a run spawns its first source-ingest thread — after the plan compiles
but before any input is read or any byte is spilled or staged — Clinker runs a
single comprehensive validation pass over the resolved [storage]
configuration. It rejects configurations that are physically wrong for the job,
each with a stable diagnostic code, the offending clinker.toml field, and a
clinker explain --code <CODE> pointer. Validating up front fails a
misconfigured volume while the run is still cheap to abandon, rather than after
minutes of work when the first spill or staged copy hits the bad volume.
| Code | Rejected configuration | Why |
|---|---|---|
| E330 | storage.spill.dir on an in-memory filesystem (Linux tmpfs / ramfs, Windows RAM disk). | Spilling there keeps the bytes in RAM, so it frees no physical memory and defeats the memory budget. |
| E331 | storage.spill.dir on a network filesystem (NFS / SMB / CIFS / FUSE). | A spill target on a soft-mounted share risks silent truncation and mmap data loss — the failure modes spill exists to avoid. |
| E332 | storage.staging.dir on a network filesystem. | Staging copies inputs off a flaky share; a staging dir that is itself on a share reintroduces the fragility staging exists to escape. |
| E333 | storage.staging.dir on the same physical device as a matched (staged) source. | The copy moves no I/O off the source volume, so it buys nothing while still spending time and space. Applies only to matched sources. |
| E334 | storage.spill.dir equal to storage.staging.dir. | Spill files and staged source copies are sized and cleaned up differently; sharing one directory makes accounting and cleanup ambiguous. |
The filesystem-class checks (E330–E332) read the volume type through one
cross-platform detection layer, so they behave identically on Linux, macOS,
and Windows: Linux matches the statfs f_type magic, macOS matches the
f_fstypename string, and Windows maps GetDriveTypeW. (macOS has no native
tmpfs, so E330 only ever fires on Linux and Windows.) The same-device check
(E333) compares the device id on Linux/macOS and the volume serial number on
Windows — the very same probe the staging same-volume rule uses, so there is
one consistent notion of “same device” across the whole run.
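On Linux and macOS, the same-device comparison behind E333 can be approximated with a device-id check. The sketch below is POSIX-oriented and the helper names are hypothetical; it says nothing about how the Windows probe works.

```python
import os


def same_device(path_a: str, path_b: str) -> bool:
    """True when both paths live on the same device id (POSIX st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def same_device_violations(staging_dir: str, matched_sources: list) -> list:
    """Matched sources that share a device with the staging dir.

    Only matched (staged) sources are checked; sources that read in place
    are irrelevant to the rule.
    """
    return [src for src in matched_sources if same_device(src, staging_dir)]
```

For example, `same_device_violations("/var/clinker/staging", ["/mnt/nfs/data/orders.csv"])` would return an empty list when the share and the staging directory are on different devices.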
Free-space preflight
Separately from the runtime disk cap (E320) and the full-volume surface
(E321), the startup pass runs a free-space preflight: it queries the bytes
available on the spill volume and compares them to the run’s estimated spill
footprint (the sum of every blocking operator’s predicted peak state, the same
estimate --explain surfaces). When the spill volume looks too small, the run
prints a warning and continues:
W330: spill volume /var/clinker/spill has 2000000000 bytes free but the run is
estimated to spill up to 8000000000 bytes; the run may abort with a full-volume
error (E321) at the final spill — point storage.spill.dir at a larger volume or
reduce the spill footprint (raise memory.limit, partition the input)
This is advisory, not fatal: the estimate is a coarse upper bound (it
ignores spill compression and the streaming drain), so the run may well finish
within the available space. The warning exists so a long pipeline that would
die at its final spill surfaces that risk before it runs for an hour, rather
than after. The free-space query uses a cross-platform probe (statvfs on
Unix, GetDiskFreeSpaceExW on Windows) that returns a 64-bit byte count, so
the historical 32-bit f_bavail truncation never affects the comparison.
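A minimal sketch of the free-space comparison, using Python's `shutil.disk_usage` (which wraps `statvfs` on Unix and `GetDiskFreeSpaceExW` on Windows). The function name and the abbreviated message are illustrative; the estimate is assumed to come from the same source `--explain` reports.

```python
import shutil


def free_space_warning(spill_dir: str, estimated_spill_bytes: int):
    """Return an advisory warning string, or None when the volume looks big enough."""
    free = shutil.disk_usage(spill_dir).free  # bytes available to this user
    if free < estimated_spill_bytes:
        return (
            f"W330: spill volume {spill_dir} has {free} bytes free but the run "
            f"is estimated to spill up to {estimated_spill_bytes} bytes"
        )
    return None  # advisory only: the caller continues either way
```

Because the result is a warning rather than an exception, the caller prints it and proceeds, which mirrors the advisory nature of the check.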
Cap-headroom preflight
When storage.spill.disk_cap_bytes is configured, the same startup pass also
runs a cap-headroom preflight: it compares the run’s estimated spill volume
to the configured cap and warns when the estimate reaches 80% of the cap.
Unlike the free-space preflight (which probes the physical volume), this checks
the run against the policy ceiling you set, so it fires even on a volume with
plenty of free space:
W331: this run is estimated to spill up to 9000000000 bytes, which is 90% of the
configured spill cap storage.spill.disk_cap_bytes (10000000000 bytes); the run
may abort with a spill-cap error (E320) before it finishes — raise disk_cap_bytes
or reduce the spill footprint (raise memory.limit, partition the input). This
headroom is per invocation: if you partition the input and run several clinker
invocations against the same spill volume and cap, they share the cap, so the
real headroom is smaller than this figure
Like W330, this is advisory, not fatal — the estimate is a coarse upper
bound, so a run that compresses well or never trips its memory budget may finish
comfortably under the cap. It fires on a normal clinker run (before ingestion,
at startup), not only under --explain, so an operator sees the signal on the
real run even when they did not explicitly inspect the plan first.
Per-invocation accounting. The cap and the headroom figure are scoped to a
single clinker invocation. Under the partition-and-run model — where you
split a large input by file or key and launch several clinker processes that
share one spill volume and one disk_cap_bytes — the physical spill volume is
shared by every sibling, so the real headroom is smaller than any one
invocation’s figure. The warning text states this explicitly rather than
silently presenting a per-invocation number as a whole-volume guarantee. Clinker
is single-process by design (one invocation = one OS process), so the engine
cannot see its siblings; the disclaimer is the honest stance.
Mid-run spill failures
The startup check guarantees the spill directory is writable when the run begins, but it can still go bad mid-run — an NFS share remounts read-only, a volume unmounts, an over-eager temp-file cleaner deletes the directory, or permissions are revoked. When a spill write fails because the directory has vanished or become read-only, the run aborts cleanly with a distinct diagnostic rather than a generic I/O error or a panic:
spill directory /var/clinker/spill became unavailable mid-run: No such file or directory
(the directory may have been unmounted, remounted read-only, deleted by an
external cleaner, or had its permissions revoked)
This surfaces the directory-level cause directly, so the fix (remount the volume, stop the cleaner, restore permissions) is obvious from the message.
Crash purge of orphaned spill directories
A run’s spill directory (clinker-spill-<random>/) is normally removed when the
run ends — a clean exit, a run that aborts with a fatal error, or even a panic
that unwinds all delete it, on every platform. (The run holds an advisory lock on
a .lock file inside that directory; the lock is always released before the
directory is removed, so the removal never trips over its own open handle — which
matters on Windows, where an open file handle blocks deletion.) But a SIGKILL,
the Linux OOM-killer, or a power loss kills the process before that cleanup runs,
leaking the directory and every spill file inside it under the spill root. Over
many crashed runs that fills the spill volume.
To prevent that, a run reaps orphaned spill directories at startup — but
only when a spill directory is explicitly configured (storage.spill.dir),
before it creates its own. Each live run holds an operating-system advisory lock
on a .lock file inside its spill directory for the run’s whole lifetime; the
OS releases that lock automatically when the process exits, however it exits. At
startup a run scans the configured spill root and, for each clinker-spill-*
directory, tries to take that lock: if it succeeds the owning process is gone, so
the directory is an orphan and is removed; if the lock is still held a concurrent
live run owns it, so it is left alone. Asking the kernel “is anyone still
holding this?” is robust against PID reuse and never reaps a directory a
concurrent run is still using. The purge is best-effort: a failure to reap one
directory is logged and the run proceeds.
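The lock-probe idea can be sketched on POSIX with `fcntl.flock`. This is an illustration under assumptions: the text says only "operating-system advisory lock", so the use of `flock` here is a stand-in, the function name is hypothetical, and a directory without a `.lock` file is simply left alone. Unlike the engine, the sketch removes the directory while still holding the lock, which works on POSIX but would not on Windows.

```python
import fcntl
import os
import shutil


def reap_orphans(spill_root: str) -> list:
    """Remove clinker-spill-* directories whose .lock nobody holds."""
    reaped = []
    for name in sorted(os.listdir(spill_root)):
        path = os.path.join(spill_root, name)
        if not (name.startswith("clinker-spill-") and os.path.isdir(path)):
            continue
        try:
            fd = os.open(os.path.join(path, ".lock"), os.O_RDWR)
        except OSError:
            continue  # no lock file: not ours to judge in this sketch
        try:
            # Non-blocking probe: succeeds only if no live process holds it.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)  # still held: a live run owns this directory
            continue
        try:
            shutil.rmtree(path)
            reaped.append(name)
        except OSError as err:
            print(f"could not reap {path}: {err}")  # best-effort: log and go on
        finally:
            os.close(fd)
    return reaped
```

The kernel answers the liveness question here, so a recycled PID can never make a dead run look alive or a live run look dead.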
When storage.spill.dir is not set, the spill root defaults to the OS temp
directory (std::env::temp_dir, typically $TMPDIR or /tmp), and no
startup purge runs there. In the default case a run cleans up after itself
directly: the per-run spill directory is removed on every exit short of a hard
kill — a clean exit, an error-return abort, or a panic that unwinds. Only a
SIGKILL, the OOM-killer, or a power loss leaks one, and a directory leaked into
the OS temp directory is the operating system’s temp-reaper’s responsibility to
clean up, not Clinker’s.
The purge is deliberately confined to a configured spill root because it must
never police the shared OS temp directory. That directory is used by every
process on the host, so a startup sweep there would race not only concurrent
Clinker runs (the lock-based check narrows that window but cannot eliminate it
against a peer whose just-created spill directory is not yet locked) but also
unrelated programs that happen to use a colliding name prefix. A run owns its
configured spill volume and can safely sweep it; it does not own /tmp and so
leaves it alone.
storage.staging — opt-in source staging
Reading source files directly from a network share (NFS, SMB) couples every run to the share’s availability and quirks: a soft-mount can silently truncate a read, and latency multiplies across many small files. Source staging copies matched source files to a local volume before the pipeline reads them, so the run works from stable local copies. It is off by default and activated per workspace by pattern match — pipelines that don’t opt in behave exactly as before.
[storage.staging]
enabled = true
dir = "/var/clinker/staging" # required when enabled
patterns = [
"/mnt/nfs/data/**",
"//fileserver/share/**",
]
disk_cap_bytes = "50GB" # optional; cap on bytes copied per run (default unlimited)
verify = "blake3" # optional; blake3 | none (default blake3)
on_existing = "overwrite" # optional; overwrite | reuse | error (default overwrite)
cleanup = "on_success" # optional; on_success | always | never (default on_success)
| Key | Default | Meaning |
|---|---|---|
| enabled | false | Master switch. When false, patterns is ignored and every source reads in place. |
| dir | (none) | Local directory the copies are written under. Required when enabled. |
| patterns | [] | Glob patterns selecting which source paths to stage. A source is staged only when enabled and its path matches at least one pattern. Empty ⇒ nothing is staged. |
| disk_cap_bytes | unlimited | Cumulative cap on bytes copied per run. Same byte-size grammar as the spill cap ("50GB", bare integers are bytes). |
| verify | blake3 | Post-copy integrity check. blake3 hashes source and copy and requires a match, the only check that catches a soft-mount’s silent truncation. none skips the check. |
| on_existing | overwrite | What to do when a staged copy of this source already exists from a prior run: overwrite re-copies unconditionally; reuse reuses the existing copy only when it is still fresh (the source’s modification time and size match what was recorded when it was staged), otherwise re-copies; error fails the run rather than touch the existing copy. See The staging cache below. |
| cleanup | on_success | When staged copies are deleted relative to the run’s outcome: on_success removes them after a clean exit but keeps them after a failure so the operator can inspect the exact inputs the failed run saw; always removes them regardless; never keeps them as a persistent reuse cache for a later reuse run. See Cleanup. |
Pattern matching
patterns uses the same glob grammar as a source’s exclude: list. Each
pattern is tested against both the full path and the basename, so
/mnt/nfs/** matches a deep path by its full path while *.csv matches any
CSV by basename. ** crosses directory boundaries; * does not.
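The two rules above (test both the full path and the basename; `**` crosses directories while `*` does not) can be approximated as follows. This is a sketch of the described behavior only: it handles just `*` and `**`, treats every other character literally, assumes forward-slash paths, and makes no claim about the rest of Clinker's glob grammar.

```python
import re
from pathlib import PurePosixPath


def glob_to_regex(pattern: str) -> "re.Pattern":
    """Translate a pattern using only * and ** into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # ** may cross directory separators
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # * stops at a directory separator
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(pattern: str, path: str) -> bool:
    """A path matches when either its full form or its basename matches."""
    regex = glob_to_regex(pattern)
    basename = PurePosixPath(path).name
    return bool(regex.fullmatch(path) or regex.fullmatch(basename))


print(matches("/mnt/nfs/**", "/mnt/nfs/2024/q1/orders.csv"))
print(matches("*.csv", "/mnt/nfs/2024/q1/orders.csv"))
print(matches("/mnt/nfs/*", "/mnt/nfs/2024/q1/orders.csv"))
```

The third call is the interesting one: a single `*` cannot reach into `2024/q1/`, and the basename `orders.csv` does not match the full-path pattern either.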
Startup validation
When enabled, staging is validated once at startup, before any input is
opened, so a misconfiguration fails the run immediately rather than at the
first copy. The run is refused when:
- dir is unset.
- dir does not exist, is a file, or is not writable (probed with a real create-and-delete, so a read-only mount or restrictive ACL is caught).
- a patterns entry is not a valid glob.
- dir sits on the same volume as a matched source. Staging within one volume copies bytes without moving I/O off the slow share (a well-documented anti-pattern), so it is refused up front rather than left to surface as a confusingly slow pipeline. The check compares the source’s and the staging dir’s storage volume (the device id on Linux/macOS, the volume mount root on Windows); point dir at a local disk on a different volume.
The same-volume rule applies only to matched sources: a source the patterns don’t select reads in place, so its volume is irrelevant.
How a file is staged
Each matched source maps to a stable, content-addressed pair of files
directly under dir, deterministic across runs of the same source:
- <source-id>.staged: the local copy the reader opens.
- <source-id>.manifest.json: a sidecar recording the source’s identity: its path, modification time, size, the BLAKE3 content hash, and the stage time.
- <source-id>.lock: a small advisory-lock file that serializes concurrent invocations staging the same source (see the staging cache). It carries no data and persists between runs as the per-source coordination point, alongside the cached copy it guards. Once that source’s cache entry is gone and no run holds the lock, a later startup crash purge reclaims the lock too, so a persistent staging root does not accumulate one orphan lock per source that has ever passed through it.
<source-id> is derived from the source’s canonical path, so the same source
always resolves to the same staged file. That stability is what makes the
reuse cache work (a later run can find the prior copy) and it is why the layout
is stable rather than per-run UUIDs.
The copy is built to survive a crash at any point without leaving a corrupt or partial file a later run might trust:
- Single-pass copy + hash. The source is read once in ~1 MiB chunks; each chunk is fed to both the BLAKE3 hasher and the destination file in the same pass. The copy never holds the whole file in memory, so it stays a memory-budget no-op regardless of file size.
- Atomic publish. Bytes are written to a <source-id>.<run>.partial temp file, flushed and fsync’d, then renamed to <source-id>.staged. A rename is an atomic replace on Linux, macOS, and Windows (Windows 10 1607+), so a reader scanning for .staged files sees either nothing or the complete file, never a half-written one. The <run> segment in the partial name keeps any two in-flight copies of one source on distinct temp files, and the per-source lock (see the staging cache) ensures only one of them ever runs at a time.
- Durable rename. On Linux/macOS the parent directory is fsync’d after the rename, because on ext4/xfs a rename is only crash-durable once the directory entry itself is flushed. On Windows the NTFS journal makes the rename durable, so there is no separate directory flush to do.
- Verify. With verify = blake3 (the default) the source is independently re-read and hashed, and the two digests must match. A size check cannot catch a soft-mount that silently truncated the read; two content digests can. A mismatch removes the published copy and fails the run with a distinct “staged copy is corrupt” diagnostic (E335), not a generic I/O error.
- Commit the manifest. The identity manifest is written with the same atomic temp-file + rename discipline. The manifest is the commit marker: a .staged file is only trustworthy once its manifest exists. A crash between the copy and the manifest leaves a .staged with no manifest, which the next run’s crash purge reaps as an orphan rather than half-trusting.
If the copy fails partway, the .partial is removed before the error
propagates.
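The copy discipline above can be sketched as follows. Several things here are assumptions: Python's standard library has no BLAKE3, so `hashlib.blake2b` stands in for it; the function name and arguments are hypothetical; the directory fsync is the POSIX path only; and the manifest commit step is left out to keep the sketch short.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # ~1 MiB, as in the text


def _digest(path: str) -> str:
    """Independent re-read of a file, hashed chunk by chunk."""
    hasher = hashlib.blake2b()  # stand-in for BLAKE3
    with open(path, "rb") as handle:
        while chunk := handle.read(CHUNK):
            hasher.update(chunk)
    return hasher.hexdigest()


def stage_copy(source: str, staging_dir: str, source_id: str, run_id: str) -> str:
    """Copy, publish atomically, make the rename durable, then verify."""
    partial = os.path.join(staging_dir, f"{source_id}.{run_id}.partial")
    staged = os.path.join(staging_dir, f"{source_id}.staged")
    hasher = hashlib.blake2b()
    try:
        # Single pass: every chunk feeds both the hasher and the destination.
        with open(source, "rb") as src, open(partial, "wb") as dst:
            while chunk := src.read(CHUNK):
                hasher.update(chunk)
                dst.write(chunk)
            dst.flush()
            os.fsync(dst.fileno())
        os.replace(partial, staged)  # atomic publish
    except BaseException:
        if os.path.exists(partial):
            os.remove(partial)  # never leave a partial behind on failure
        raise
    # Durable rename (POSIX): flush the parent directory entry.
    dir_fd = os.open(staging_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    # Verify: re-read the source and compare the two digests.
    if _digest(source) != hasher.hexdigest():
        os.remove(staged)
        raise OSError("staged copy is corrupt (E335)")
    return hasher.hexdigest()
```

Memory use is bounded by one chunk regardless of file size, since neither pass ever holds more than `CHUNK` bytes at a time.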
The staging cache (on_existing)
Because staged copies live at stable paths, a copy from a prior run is still on
disk when the next run starts (unless cleanup removed it). on_existing
decides what happens when that prior copy is found:
| Mode | Behavior |
|---|---|
| overwrite (default) | Always re-stage. The prior copy and its manifest are removed and the source is copied fresh. The safe default: a copy from a crashed run must not be trusted. |
| reuse | Reuse the prior copy only when it is still fresh: the source’s current modification time and size both match what the manifest recorded. A fresh match skips the copy entirely (no bytes read off the share, nothing charged against the disk cap). A changed mtime or size means the source was rewritten, so the copy is stale and is re-staged. |
| error | Fail the run with a clear diagnostic if a staged copy already exists, rather than overwrite or reuse it. For workflows that want an explicit “the cache is already populated” stop. |
reuse is the mode that turns staging into a cache: re-running the same
pipeline over an unchanged network share copies nothing on the second run. The
freshness check is mtime + size, not a re-hash, so it is a cheap stat rather
than a full read of the source.
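The freshness check amounts to one `stat` call compared against the manifest. In this sketch the manifest is a plain dict and its key names (`mtime_ns`, `size`) are hypothetical; the real sidecar's field names are not given in this section.

```python
import os


def is_fresh(source_path: str, manifest: dict) -> bool:
    """True when the live source still matches what the manifest recorded.

    Only metadata is consulted, so no bytes are read off the share.
    """
    stat = os.stat(source_path)
    return (
        stat.st_mtime_ns == manifest["mtime_ns"]
        and stat.st_size == manifest["size"]
    )


def record_identity(source_path: str) -> dict:
    """Capture the identity fields a manifest would store at stage time."""
    stat = os.stat(source_path)
    return {"mtime_ns": stat.st_mtime_ns, "size": stat.st_size}
```

The trade-off is the usual one for mtime-plus-size caches: a rewrite that preserves both values goes unnoticed, which is the price of skipping the full re-hash.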
Staging is collision-safe across concurrent invocations. Under the
partition-and-run model — several clinker processes over a partitioned input
sharing one staging volume — independent runs may stage, reuse, or clean up the
same shared source at the same time. The per-source advisory lock (a
<source-id>.lock file under the staging root) is a reader-writer lock that
keeps every such overlap safe on Linux, macOS, and Windows:
- Exactly one copy. A run that needs to copy takes the lock exclusively for its copy-and-publish. The first run to take it copies and publishes; every other run blocks, then acquires the lock, finds the now-fresh .staged, and reuses it. So a source is copied exactly once no matter how many invocations race for it.
- A reader is never yanked. A run reading a staged copy holds the lock in shared mode for as long as it has the file open, and keeps it held across the moment it decides to reuse a copy and the moment it opens that copy, so the file it chose cannot be deleted or replaced in between. Any number of readers share the lock at once, so concurrent runs all read the same copy in parallel.
- Cleanup and overwrite wait for readers. Removing or re-copying a staged pair takes the lock exclusively, which a live reader’s shared lock blocks. Cleanup probes the lock without waiting: if a concurrent run is still reading the copy, cleanup leaves it in place (the last run to release it, or a later crash purge, reaps it). An overwrite re-stage waits for in-flight readers to finish, then publishes atomically.
On Windows the staged copy is additionally opened with a share mode that permits
a concurrent delete or atomic-rename replace (FILE_SHARE_DELETE), so an open
reader and a concurrent publish/cleanup interoperate there exactly as they do on
POSIX, where an unlinked-but-open file stays readable. The net guarantee: across
any mix of concurrent runs sharing a staged source, a reader always sees a
complete, coherent .staged file and no run fails spuriously.
Cleanup (on_success | always | never)
cleanup decides when a run’s staged copies are removed, keyed on the run’s
outcome:
| Mode | Behavior |
|---|---|
| on_success (default) | Remove the copies after a clean exit; keep them after a failure (or an interrupted / DLQ-producing run) so the operator can inspect the exact inputs the run saw and re-run without re-fetching. |
| always | Remove the copies when the run ends, success or failure. |
| never | Keep the copies indefinitely as a persistent reuse cache. Combine with on_existing = reuse to make repeated runs over a stable source copy-free. The operator reclaims the staging dir manually (or lets the next run’s crash purge eventually reap stale entries). |
Each staged file’s manifest is removed alongside it, so cleanup never leaves a manifest pointing at a staged file that is gone.
Crash purge of orphaned artifacts
A clean (or panicking) run runs its cleanup. But a SIGKILL, the Linux
OOM-killer, or a power loss kills the process before any cleanup runs, leaking
its staged artifacts under the staging root. To stop that from accumulating
across crashes, every run performs an idempotent crash purge at startup,
before it stages anything. It reaps four orphan shapes left under the staging root:
- a *.partial: an interrupted copy. Reaped only when its owning run is dead (see below), so a concurrent sibling’s in-flight copy is never reaped;
- a *.staged with no matching manifest: a copy that crashed before it could commit its manifest;
- a *.manifest.json with no matching staged file;
- a *.lock whose source has no surviving cache entry: a coordination lock left by a source that is no longer staged (not necessarily from a crash), reclaimed under the liveness and age gates described below.
A clean pair (a .staged with its committed .manifest.json) is the reuse
cache and is kept — the purge never removes a complete, trustworthy copy —
and the source’s .lock is kept alongside it so a later reuse run has a lock to
take.
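The four orphan shapes can be expressed as a classification over the file names found under the staging root. This sketch only identifies candidates; it deliberately omits the liveness (try-lock) and age gates that decide whether a partial or a lock is actually removed, and the function name and reason strings are hypothetical.

```python
def find_orphan_candidates(names) -> dict:
    """Map each orphan-shaped file name to the reason it is a candidate."""
    names = set(names)

    def stems(suffix: str) -> set:
        return {n[: -len(suffix)] for n in names if n.endswith(suffix)}

    staged = stems(".staged")
    manifests = stems(".manifest.json")
    locks = stems(".lock")

    candidates = {}
    for name in names:
        if name.endswith(".partial"):
            candidates[name] = "interrupted copy (reap only if owner is dead)"
    for sid in staged - manifests:
        candidates[f"{sid}.staged"] = "staged copy with no manifest"
    for sid in manifests - staged:
        candidates[f"{sid}.manifest.json"] = "manifest with no staged copy"
    # A lock is a candidate only when neither half of the cache entry survives.
    for sid in locks - staged - manifests:
        candidates[f"{sid}.lock"] = "lock with no cache entry (liveness and age gated)"
    return candidates


listing = [
    "a.staged", "a.manifest.json", "a.lock",  # clean pair: kept, lock included
    "b.staged", "b.lock",                     # crashed before manifest commit
    "c.manifest.json",
    "d.lock",
    "e.run7.partial",
]
print(sorted(find_orphan_candidates(listing)))
```

In the sample listing, `b.lock` is kept because `b.staged` still exists, even though that copy itself is an orphan.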
A .lock whose source has no surviving cache entry (no .staged and no
.manifest.json) is itself reclaimed by the purge, under the same liveness gate
as a partial: it is removed only when it is acquirable under a try-lock (no live
run holds it) and has aged past the creation grace window (so a sibling
mid-acquire is not raced), and the removal is performed while the purge holds the
lock exclusively. A held lock, a lock still guarding a cached copy, and a
freshly created lock are all kept. This bounds what would otherwise be unbounded
growth of one zero-byte lock file per distinct source ever staged — relevant for
a long-lived persistent cache (on_existing = reuse, cleanup = never) — while
never removing a coordination point a live or cached source still needs.
Because several invocations can share one staging volume, the purge must not
reap a live sibling’s in-flight .partial. It tells a crash corpse from a live
copy the same way the spill-directory purge does: a .partial is reaped only
when the source’s .lock is acquirable under a try-lock (no live process
holds it, so the owner is gone) and the partial has aged past a short
creation grace window (so a copy that has just started but not yet taken the
lock is left alone). A partial whose lock is still held, or one too young to
have been locked yet, is kept. This asks the operating system “is anyone still
staging this?” rather than guessing, so a concurrent purge can never delete a
running sibling’s work.
File permissions
Staged copies hold verbatim source records — potentially PII, credentials, or
financial data — and on a shared staging volume they must not be readable by
other users. On Unix each staged file and its manifest are created with mode
0o600 (owner-only). On Windows there is no portable mode bit; staged files
inherit the staging directory’s ACL, so restrict the directory’s ACL if the
volume is shared.
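On Unix, creating a file that is owner-only from its first byte means passing the mode at creation time rather than tightening it afterwards. A minimal sketch, with a hypothetical helper name:

```python
import os


def create_private(path: str):
    """Create a new file with mode 0o600 and return a binary writer for it.

    O_EXCL makes the call fail if the path already exists, so an existing
    file with looser permissions is never silently reused.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    return os.fdopen(fd, "wb")
```

Setting the mode in the `open` call avoids the window that a create-then-chmod sequence leaves open, during which another user on a shared volume could open the file.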
Crash durability and the parent-directory fsync
The atomic-rename guarantee only holds across a crash if the rename is durable.
On POSIX filesystems (ext4, xfs) a rename’s directory entry can still be in the
page cache after rename returns, so Clinker fsyncs the parent directory
after the rename. Windows is intentionally exempt: the NTFS metadata journal
already makes the rename crash-durable (the semantics MOVEFILE_WRITE_THROUGH
requests), and Windows offers no directory-fsync equivalent.
Streaming vs. Blocking Stages
Every node in a pipeline plan is one of two kinds at runtime:
- Streaming stages hand their output downstream in bounded batches over a back-pressured channel, never crossing an inter-stage buffer that charges the memory budget. The two fused streaming paths additionally hold at most one batch of in-flight events at a time, so their inter-stage memory does not grow with input size. The other streaming stages still build their own result before handing it off — streaming spares them the second copy into a charged buffer and overlaps the writer with downstream work, but their own working set is as large as a blocking stage’s would be.
- Blocking stages must see their whole input before they can produce any output. They accumulate state inside the memory budget and spill to disk when the soft threshold trips, rather than holding everything in RAM.
This distinction is what makes Clinker a bounded-memory executor: a pipeline’s peak memory is set by its largest live blocking-or-non-fused-streaming stage plus one batch per fused streaming stage, not by the cumulative size of every stage at once. A streaming stage’s output is never separately buffered between dispatch arms, so it is never charged twice: the arbitrator counts each in-flight batch once when the producer flushes it and discharges that charge as the consumer drains it. If RSS still crosses the soft threshold while a single-consumer streaming stage holds batches in flight, the engine spills those batches’ records to disk one batch at a time — the streaming handoff is the per-batch counterpart of a blocking stage’s full-stage spill, not an exemption from spilling.
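A bounded, back-pressured handoff can be sketched with a fixed-capacity queue. This is a conceptual Python model, not Clinker's executor; `stream_batches`, `consume`, and `max_in_flight` are hypothetical names.

```python
import queue
import threading


def stream_batches(records, batch_size, consume, max_in_flight=1):
    """Send `records` to `consume` in batches over a bounded channel.

    Returns the number of batches delivered.
    """
    channel = queue.Queue(maxsize=max_in_flight)  # put() blocks when full
    done = object()  # sentinel marking end of input

    def producer():
        batch = []
        for record in records:
            batch.append(record)
            if len(batch) == batch_size:
                channel.put(batch)  # back-pressure: waits for the consumer
                batch = []
        if batch:
            channel.put(batch)  # final short batch
        channel.put(done)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    delivered = 0
    while True:
        item = channel.get()
        if item is done:
            break
        consume(item)
        delivered += 1
    worker.join()
    return delivered
```

Because the queue holds at most `max_in_flight` batches, a slow consumer stalls the producer instead of letting batches pile up, which is why the in-flight footprint of such a handoff does not grow with input size.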
Which stages stream
A stage streams when its output is handed straight to a single downstream consumer instead of crossing a charged inter-stage buffer. The downstream consumer is a sink Output writer, an Aggregate’s ingest, or a hash build-probe Combine’s probe (driver) side — see Streaming into an Aggregate and Streaming into a Combine probe below.
Two stages stream and bound their own footprint to one batch, because they pull records off a live upstream channel and forward each batch without ever building a full result:
- Source → Transform → Output fused chains. A non-windowed Transform whose only upstream is a single Source and whose only downstream is a single sink Output consumes that Source’s records directly and hands each batch to the Output’s writer thread over a back-pressured channel; neither the Transform nor the Output materializes the whole record set. A Transform that fans out to multiple consumers, feeds another operator, or roots a window keeps the buffered (materialized) path.
- Merge in interleave mode fed entirely by Sources. The merge reads each Source’s live stream and forwards records as they arrive.
These stages stream their output to a single downstream consumer too — sparing the second copy and overlapping the consumer — but each still builds its full result first, so its own working set is not bounded to one batch:
- Single-branch Route. A Route with exactly one branch feeding one sink Output streams that branch’s records to the writer thread. A multi-branch Route forks records across several successor buffers and stays materialized.
- Merge in concat mode, or interleave fed by non-Source inputs, feeding one sink Output. The merge drains its predecessors’ buffers in order (concat) or round-robin (interleave) into the merged result, then streams it.
- streaming-strategy Aggregate feeding one sink Output. When the planner certifies the aggregate’s input is pre-sorted on the group key, it finalizes the group rows and streams them rather than buffering them for a downstream arm.
- Combine probe side (hash build-probe strategy) feeding one sink Output. The build relation stays fully materialized in the hash table; the matched probe output streams to the writer.
Each of these requires the producer to feed exactly one downstream consumer and to root no window; a producer that roots a window keeps the materialized path because the window arena needs the producer’s full output to build.
- Every Output. A sink writes records to its configured writer and never buffers a whole stage.
Document-boundary punctuations (DocumentOpen / DocumentClose, the signals behind $doc.*) flow inline with records through streaming stages, preserving their order: a document’s close always trails the document’s last record, even when the document’s records span several batches.
Streaming into an Aggregate
The streaming consumer above is usually a sink Output. It can also be an Aggregate’s ingest: when an eligible producer (a fused Source → Transform, a single-branch Route, a non-fused Merge, or a streaming-strategy Aggregate) feeds exactly one downstream Aggregate, the producer streams record-at-a-time into the aggregate’s add_record over a back-pressured channel rather than the aggregate pre-draining the producer’s whole output from a charged buffer. The producer reports buffer: streaming and --explain shows no node_buffer edge between it and the aggregate.
This streams the aggregate’s ingest half only — the producer no longer needs a charged inter-stage slot, and a slow aggregate (one that is spilling, say) paces the producer through the bounded channel. The aggregate’s finalize half stays blocking by nature: a group_by value depends on every member, so the group table accumulates the whole input and emits only after the channel closes (end of input). Spill stays driven by RSS pressure, never by channel depth, exactly as on the materialized path.
Two aggregate shapes keep the materialized ingest, because their finalize is not a single forward pass: a time-windowed aggregate runs a multi-pass per-window algorithm over the whole input, and a relaxed correlation-key aggregate retains its group state for the correlation-commit phase. Both show buffer: materialized on the edge into them.
Streaming into a Combine probe
A producer can also stream into a hash build-probe Combine’s probe (driver) side. When an eligible producer (a fused Source → Transform, a single-branch Route, a non-fused Merge, a streaming-strategy Aggregate, or another hash build-probe Combine) is the Combine’s driver input, the producer streams record-at-a-time into the probe kernel over a back-pressured channel rather than the Combine pre-draining the driver’s whole output from a charged buffer. The driver producer reports buffer: streaming and --explain shows no node_buffer edge between it and the Combine. Only the HashBuildProbe strategy qualifies — the range, sort-merge, and grace-hash kernels re-sort or re-scan the driver and stay materialized.
This streams the Combine’s probe half only. The build side stays fully materialized: the engine builds the complete hash table on the main thread before the driver producer streams its first record, so the probe never matches against an incomplete index. The probe consumer runs on its own thread, so a slow driver paces the probe through the bounded channel and a slow probe (a large fan-out) back-pressures the driver. The build relation’s footprint is the hash table, exactly as on the materialized path; the streaming handoff spares only the driver’s inter-stage slot. Per-source dead-letter rewind, memory accounting, and output are byte-identical to the materialized path.
Which stages block
A stage blocks when its result depends on records it has not seen yet:
- sort – the full input must be present before the first sorted record is known.
- Hash Aggregate – a group’s final value depends on every member, so the group table accumulates the whole input. (A streaming-strategy Aggregate over a pre-sorted input is the exception: the planner certifies it can emit a group as soon as the sort key advances.)
- Combine build side – the build relation is fully indexed before any probe record is matched. The probe side streams against the built index, but the build side materializes.
- IEJoin / sort-merge Combine – both inputs are sorted and buffered before the band/merge step runs.
- CorrelationCommit – a correlation group is held until its commit decision (flush or dead-letter) is known.
A blocking stage keeps its full-stage accumulation inside pipeline.memory.limit and spills to disk past the soft threshold; it does not stream batches.
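The difference between a hash aggregate and a streaming-strategy aggregate over pre-sorted input can be sketched in Python. This is a simplified model that sums one value per key; the function names are hypothetical and the code is not Clinker's implementation.

```python
def hash_aggregate(rows):
    """Blocking: nothing can be emitted until the input is exhausted."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


def sorted_stream_aggregate(rows):
    """Streaming: yield (key, total) as soon as the key changes.

    Only correct when `rows` arrive sorted (grouped) by key.
    """
    current_key = None
    total = 0
    started = False
    for key, value in rows:
        if started and key != current_key:
            yield current_key, total  # the previous group is complete
            total = 0
        current_key = key
        total += value
        started = True
    if started:
        yield current_key, total  # flush the last group at end of input
```

The streaming variant holds one group's state at a time, while the hash variant holds every group until the end. That gap is why the planner has to certify the sort order before choosing the streaming strategy.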
Seeing the classification
clinker run <pipeline>.yaml --explain annotates every node with its class in the Physical Properties section:
output.report:
buffer: streaming
aggregation.dept_totals:
buffer: materialized
buffer: streaming marks a stage whose output is consumed without an inter-stage buffer — it charges the budget per in-flight batch and, on a single-consumer edge, spills those batches to disk under pressure; buffer: materialized marks a stage whose output crosses a node_buffers slot that charges the memory budget as one full-stage slot and spills the whole stage. Both classes are spill-eligible; they differ in granularity, not in whether they can spill. The explain annotation is derived from the same classifier the executor uses at runtime, so what --explain reports is exactly what the dispatcher does. See Explain Plans and Memory Tuning for the arbitration model that rides alongside the buffer class.
Tuning the batch size
The number of events handed downstream per batch is set by pipeline.batch_size (default 2048), with an optional per-transform override. For a fused streaming stage — the only kind whose footprint is one batch — smaller batches lower its in-flight footprint at the cost of more per-batch bookkeeping; larger batches do the reverse. For the other streaming stages the batch size sets only the in-flight slice handed across the channel; the producer’s own result is built in full regardless, so batch_size does not cap their footprint. The batch size changes only the memory profile of streaming handoffs — never their output, and never the behavior of blocking stages.
Metrics & Monitoring
Clinker writes per-execution metrics as JSON files to a spool directory. These files can be collected into an NDJSON archive for ingestion into monitoring systems.
Enabling metrics
There are three ways to enable metrics collection, listed from highest to lowest priority:
CLI flag:
clinker run pipeline.yaml --metrics-spool-dir ./metrics/
Environment variable:
export CLINKER_METRICS_SPOOL_DIR=./metrics/
clinker run pipeline.yaml
YAML config:
pipeline:
  metrics:
    spool_dir: "./metrics/"
When metrics are enabled, each execution writes one JSON file to the spool directory, named <execution_id>.json.
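The precedence order can be modelled as a small resolver. This is a sketch of the stated priority (CLI flag, then environment variable, then YAML), not Clinker's code; the function name is hypothetical, and treating an empty value as unset is an assumption.

```python
from typing import Mapping, Optional


def resolve_spool_dir(
    cli_value: Optional[str],
    env: Mapping[str, str],
    yaml_config: Mapping,
) -> Optional[str]:
    """Return the effective spool directory, or None if metrics are off."""
    if cli_value:  # 1. --metrics-spool-dir wins
        return cli_value
    env_value = env.get("CLINKER_METRICS_SPOOL_DIR")
    if env_value:  # 2. then the environment variable
        return env_value
    # 3. finally pipeline.metrics.spool_dir from the YAML config
    return yaml_config.get("pipeline", {}).get("metrics", {}).get("spool_dir")
```

For example, with the environment variable set to `./env/` and the YAML set to `./yaml/`, the resolver returns `./env/` unless a CLI value is passed.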
Metrics schema
Each metrics file follows schema version 1:
{
"execution_id": "01912345-6789-7abc-def0-123456789abc",
"schema_version": 1,
"pipeline_name": "customer_etl",
"config_path": "/opt/clinker/pipelines/daily_etl.yaml",
"hostname": "prod-etl-01",
"started_at": "2026-04-11T10:00:00Z",
"finished_at": "2026-04-11T10:00:05Z",
"duration_ms": 5000,
"exit_code": 0,
"records_total": 50000,
"records_ok": 49950,
"records_dlq": 50,
"execution_mode": "streaming",
"peak_rss_bytes": 134217728,
"thread_count": 4,
"input_files": ["./data/customers.csv"],
"output_files": ["./output/enriched.csv"],
"dlq_path": "./output/errors.csv",
"error": null
}
Field reference
| Field | Type | Description |
|---|---|---|
| execution_id | string | UUID v7 or custom --batch-id value |
| schema_version | integer | Always 1 for this release |
| pipeline_name | string | The name from the pipeline YAML |
| config_path | string | Absolute path to the config file |
| hostname | string | Machine hostname |
| started_at | string | ISO 8601 UTC timestamp |
| finished_at | string | ISO 8601 UTC timestamp |
| duration_ms | integer | Wall-clock duration in milliseconds |
| exit_code | integer | Process exit code (see Exit Codes) |
| records_total | integer | Total records read from all sources |
| records_ok | integer | Records that reached an output node |
| records_dlq | integer | Records routed to the dead-letter queue |
| execution_mode | string | streaming or batch |
| peak_rss_bytes | integer | Maximum resident set size during execution |
| thread_count | integer | Thread pool size used |
| input_files | array | Paths to all source files |
| output_files | array | Paths to all output files written |
| dlq_path | string/null | Path to the DLQ file, or null if none |
| error | string/null | Error message on failure, or null on success |
Collecting metrics
The spool directory accumulates one file per execution. Use clinker metrics collect to sweep them into an NDJSON archive:
clinker metrics collect \
--spool-dir ./metrics/ \
--output-file ./metrics/archive.ndjson \
--delete-after-collect
This appends all spool files to the archive (one JSON object per line) and removes the originals. The NDJSON format is compatible with most log aggregation and monitoring tools.
Preview without writing:
clinker metrics collect \
--spool-dir ./metrics/ \
--output-file ./metrics/archive.ndjson \
--dry-run
Integration with monitoring systems
Grafana / Prometheus
Parse the NDJSON archive with a log shipper (Promtail, Filebeat, Vector) and create dashboards tracking:
- duration_ms – execution time trends
- records_dlq – data quality over time
- peak_rss_bytes – memory utilization
Datadog
Ship NDJSON to Datadog Logs, then create metrics from log attributes:
# Example: tail the archive and ship to Datadog
tail -f ./metrics/archive.ndjson | datadog-agent log-stream
ELK Stack
Filebeat can ingest NDJSON directly:
# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/clinker/metrics.ndjson
    json.keys_under_root: true
Simple alerting with jq
For environments without a full monitoring stack, use jq to query the archive directly:
# Find all runs with DLQ entries in the last 24 hours
jq 'select(.records_dlq > 0 and (.finished_at | fromdateiso8601) > (now - 86400))' metrics/archive.ndjson
# Find runs that exceeded 400MB RSS
jq 'select(.peak_rss_bytes > 419430400)' metrics/archive.ndjson
# Average duration by pipeline
jq -s 'group_by(.pipeline_name) | map({
pipeline: .[0].pipeline_name,
avg_ms: (map(.duration_ms) | add / length)
})' metrics/archive.ndjson
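Where jq is not available, the same queries can be approximated in Python. This sketch assumes the schema version 1 fields shown earlier and whole-second `Z` timestamps as in the sample record; the function names are hypothetical.

```python
import json
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def load_ndjson(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def recent_dlq_runs(runs, now, window=timedelta(hours=24)):
    """Runs with DLQ entries that finished within `window` of `now`."""
    cutoff = now - window
    selected = []
    for run in runs:
        # Assumes timestamps like 2026-04-11T10:00:05Z (no fractional part).
        finished = datetime.strptime(
            run["finished_at"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if run["records_dlq"] > 0 and finished >= cutoff:
            selected.append(run)
    return selected


def average_duration_ms(runs):
    """Mean duration_ms per pipeline_name."""
    durations = defaultdict(list)
    for run in runs:
        durations[run["pipeline_name"]].append(run["duration_ms"])
    return {name: sum(v) / len(v) for name, v in durations.items()}
```

A typical call opens the archive and passes the file object straight in: `load_ndjson(open("metrics/archive.ndjson"))`, followed by `recent_dlq_runs(runs, datetime.now(timezone.utc))`.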
Operational recommendations
- Always enable metrics in production. The overhead is negligible (one small JSON write at the end of each run).
- Run metrics collect --delete-after-collect on a schedule (e.g., hourly) to prevent spool directory growth.
- Use --batch-id with meaningful identifiers to correlate metrics across retries and environments.
- Alert on records_dlq > 0 to catch data quality regressions early.
- Track peak_rss_bytes trends to anticipate when memory limits need adjustment.
Exit Codes & Error Diagnosis
Clinker uses structured exit codes to communicate the outcome of a pipeline run. These codes are designed for integration with schedulers, cron, CI systems, and monitoring tools.
Exit code reference
| Code | Meaning | Description |
|---|---|---|
| 0 | Success | Pipeline completed. All records processed successfully. |
| 1 | Configuration error | Invalid YAML, CXL syntax error, type mismatch, or DAG wiring problem. Fix the pipeline configuration. |
| 2 | Partial success | Pipeline ran to completion, but some records were routed to the dead-letter queue. Check the DLQ file. |
| 3 | Evaluation error | CXL runtime error during record processing (e.g., division by zero, type coercion failure). |
| 4 | I/O error | File not found, permission denied, disk full, or input format mismatch. |
Understanding exit code 2
Exit code 2 is not a crash. It means:
- The pipeline started and ran to completion.
- All viable records were processed and written to output files.
- Some records could not be processed and were diverted to the dead-letter queue.
Your scheduler should treat exit code 2 as a warning, not a failure. The DLQ file contains the problematic records along with the error that caused each one to be rejected.
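A scheduler wrapper might map the exit codes to severities along these lines. This is an illustrative Python sketch based only on the table above; the names `EXIT_MEANINGS` and `classify_exit` are hypothetical.

```python
# Meanings copied from the exit code reference table.
EXIT_MEANINGS = {
    0: "Success",
    1: "Configuration error",
    2: "Partial success",
    3: "Evaluation error",
    4: "I/O error",
}


def classify_exit(code: int) -> str:
    """Map a process exit code to a scheduler severity."""
    if code == 0:
        return "ok"
    if code == 2:
        return "warning"  # run completed, but the DLQ has entries
    return "failure"  # includes codes not listed in the table


def describe_exit(code: int) -> str:
    """Human-readable summary for log lines and alert subjects."""
    meaning = EXIT_MEANINGS.get(code, "Unknown")
    return f"exit {code}: {meaning} ({classify_exit(code)})"
```

Treating unlisted codes as failures is the conservative choice here: a signal-terminated or otherwise unexpected exit should page someone rather than pass silently.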
To control when exit code 2 escalates to a hard failure, use --error-threshold:
# Abort if more than 100 records hit the DLQ
clinker run pipeline.yaml --error-threshold 100
With a threshold set, the pipeline aborts (exit code 3) when the DLQ count exceeds the threshold, rather than continuing to completion.
Diagnosing failures
Exit code 1: Configuration error
The error message includes a span-annotated diagnostic pointing to the exact location of the problem:
Error: CXL type error in node 'transform_1'
--> pipeline.yaml:25:15
|
25 | emit total = amount + name
| ^^^^^^^^^^^^^ cannot add Int and String
Action: Fix the YAML or CXL expression indicated in the diagnostic, then re-run with --dry-run to confirm the fix.
Exit code 2: Partial success (DLQ entries)
Check the DLQ file for details:
# The DLQ path is shown in the run output and in metrics
cat output/errors.csv
Common causes:
- Null values in fields that a CXL expression does not handle
- Data that does not match the declared schema (e.g., non-numeric value in an integer column)
- Coercion failures between types
Action: Review the DLQ records, fix the data or add null handling to CXL expressions, and re-run.
Exit code 3: Evaluation error
A CXL expression failed at runtime. The error message includes the failing expression and the record that triggered it:
Error: division by zero in node 'compute_ratio'
expression: emit ratio = total / count
record: {total: 500, count: 0}
Action: Add guard conditions to the CXL expression:
emit ratio = if count == 0 then 0 else total / count
Exit code 4: I/O error
File system or format errors:
Error: file not found: ./data/customers.csv
--> pipeline.yaml:8:12
Common causes:
- Input file does not exist or path is wrong
- Permission denied on input or output directories
- Output file already exists (use --force to overwrite)
- Disk full during output writing
- Input file format does not match the declared type (e.g., invalid CSV)
Action: Fix file paths, permissions, or disk space, then re-run.
Plan-time diagnostic codes
The process exit codes above tell a scheduler whether the run
succeeded. The E### codes below appear inside the structured
Error: messages a configuration error (exit code 1) prints, and
identify the specific compile-time check that rejected the
pipeline. The codes below cover the event-time watermark and
time-windowed aggregate surface
(issue #61);
related code sets live in
Pipeline Variables,
Channels, and
Correlation Keys.
| Code | Trigger | Remediation |
|---|---|---|
| E154 | A source declares watermark.column: <col> but <col> is not present in that source’s schema: block. | Add the column to schema:, or remove the watermark: block. |
| E155 | A source declares watermark.column: <col> and the column exists, but its declared CXL type is not date_time or date. | Change the column’s type: to date_time or date, or point watermark.column at a column that already has one of those types. |
| E156 | An aggregate declares time_window: but at least one upstream-reachable source does not declare watermark.column. | Add watermark: { column: <event-time-column> } to each listed source, or remove time_window: from the aggregate. Without a watermark on every upstream source, min_across_sources never advances past None and the window can never close. |
See Source Nodes → Watermarks and Aggregate Nodes → Time-windowed aggregates for the field semantics each code is enforcing.
DLQ category: LateRecord
When a time-windowed aggregate sees a record whose event time falls
inside an already-closed window
(window_end + allowed_lateness < min_across_sources), the engine
routes the record to the DLQ instead of attempting to fold it into
a finalized accumulator. Mirrors Flink’s
sideOutputLateData
and Spark Structured Streaming’s late-data drop.
The DLQ row carries:
- _cxl_dlq_error_category=late_record
- _cxl_dlq_stage=time_window:<aggregate-name>
- _cxl_dlq_error_detail – the closed window’s [start, end) bounds as i64 nanoseconds since the Unix epoch
Tune watermark.delay (source-side, applies before any aggregate)
or allowed_lateness (operator-side, applies per aggregate) to
absorb expected out-of-order tails before they reach this path.
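The late-record test above reduces to one comparison against the combined watermark. This Python sketch follows the stated predicate and the rule that the watermark stays at None until every source reports one; the function names mirror the prose but the code is an illustration, not the engine's implementation. All values are nanoseconds since the Unix epoch.

```python
from typing import Optional, Sequence


def min_across_sources(watermarks: Sequence[Optional[int]]) -> Optional[int]:
    """Combined watermark: None until every source has reported one."""
    if not watermarks or any(w is None for w in watermarks):
        return None
    return min(watermarks)


def is_late(
    window_end: int, allowed_lateness: int, watermark: Optional[int]
) -> bool:
    """True when the record's window has already closed."""
    if watermark is None:
        return False  # no watermark yet, so no window can have closed
    return window_end + allowed_lateness < watermark
```

With `window_end=100` and `allowed_lateness=10`, a watermark of 110 still admits the record, and 111 routes it to the DLQ. Raising `allowed_lateness` moves that boundary later.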
Scheduler integration
Cron script
#!/bin/bash
set -euo pipefail
PIPELINE=/opt/clinker/pipelines/daily_etl.yaml
METRICS_DIR=/var/spool/clinker/
# Capture the exit code without tripping `set -e` on a non-zero status
EXIT=0
clinker run "$PIPELINE" \
  --memory-limit 512M \
  --log-level warn \
  --metrics-spool-dir "$METRICS_DIR" \
  --force || EXIT=$?
case $EXIT in
0)
echo "$(date): Success" >> /var/log/clinker/daily_etl.log
;;
2)
echo "$(date): Warning - DLQ entries produced" >> /var/log/clinker/daily_etl.log
mail -s "Clinker ETL Warning: DLQ entries" ops@company.com < /dev/null
;;
*)
echo "$(date): FAILURE (exit code $EXIT)" >> /var/log/clinker/daily_etl.log
mail -s "Clinker ETL FAILURE (exit $EXIT)" ops@company.com < /dev/null
;;
esac
exit $EXIT
CI pipeline (GitHub Actions)
- name: Run ETL pipeline
  run: clinker run pipeline.yaml --dry-run
  # Exit code 1 fails the build on config errors
- name: Smoke test with real data
  run: clinker run pipeline.yaml --dry-run -n 100
  # Catches runtime evaluation errors
Systemd
Systemd Type=oneshot services interpret non-zero exit codes as failures. To allow exit code 2 (partial success) without triggering service failure:
[Service]
Type=oneshot
SuccessExitStatus=2
ExecStart=/opt/clinker/bin/clinker run /opt/clinker/pipelines/daily_etl.yaml --force
Production Deployment
Clinker is a single statically-linked binary with no runtime dependencies. Deployment is straightforward: copy the binary to the server.
Installation
# Copy the binary
scp target/release/clinker user@server:/opt/clinker/bin/
# Verify it runs
ssh user@server /opt/clinker/bin/clinker --version
No JVM, no Python, no container runtime required.
Recommended directory structure
/opt/clinker/
  bin/
    clinker              # The binary
  pipelines/
    daily_etl.yaml       # Pipeline configs
    weekly_report.yaml
  data/                  # Input data (or symlinks to data locations)
  output/                # Output files
  rules/                 # CXL module files (for use statements)
  metrics/               # Metrics spool directory
Create a dedicated user:
sudo useradd --system --home-dir /opt/clinker --shell /usr/sbin/nologin clinker
sudo chown -R clinker:clinker /opt/clinker
Systemd service
For scheduled one-shot execution:
[Unit]
Description=Clinker ETL - Daily Customer Processing
After=network.target
[Service]
Type=oneshot
ExecStart=/opt/clinker/bin/clinker run /opt/clinker/pipelines/daily_etl.yaml \
--memory-limit 512M \
--log-level warn \
--metrics-spool-dir /var/spool/clinker/ \
--force
WorkingDirectory=/opt/clinker
User=clinker
Group=clinker
SuccessExitStatus=2
# Resource limits
MemoryMax=1G
CPUQuota=200%
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=clinker-daily
[Install]
WantedBy=multi-user.target
Pair with a systemd timer for scheduling:
[Unit]
Description=Run Clinker daily ETL at 2 AM
[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
[Install]
WantedBy=timers.target
sudo systemctl enable --now clinker-daily.timer
Note: SuccessExitStatus=2 tells systemd that exit code 2 (partial success with DLQ entries) is not a service failure. See Exit Codes for the full reference.
Cron scheduling
# Run daily at 2 AM, log to syslog
0 2 * * * /opt/clinker/bin/clinker run \
/opt/clinker/pipelines/daily_etl.yaml \
--log-level warn --force \
2>&1 | logger -t clinker
# Collect metrics hourly
0 * * * * /opt/clinker/bin/clinker metrics collect \
--spool-dir /var/spool/clinker/ \
--output-file /var/log/clinker/metrics.ndjson \
--delete-after-collect
Environment-based configuration
Use the CLINKER_ENV variable or --env flag to activate environment-specific overrides:
# Production
CLINKER_ENV=production clinker run pipeline.yaml
# Staging
CLINKER_ENV=staging clinker run pipeline.yaml
Combined with channel overrides in the pipeline YAML, this allows a single pipeline definition to target different file paths, connection strings, or thresholds per environment.
Logging
Log levels for production
| Level | Use case |
|---|---|
| warn | Recommended for production cron jobs. Prints warnings and errors only. |
| info | Default. Includes progress messages. Useful during initial deployment. |
| error | Minimal output. Only prints when something fails. |
| debug | Troubleshooting. Generates significant output. |
| trace | Development only. Extremely verbose. |
Directing logs
To syslog via logger:
clinker run pipeline.yaml --log-level warn 2>&1 | logger -t clinker
To a log file:
clinker run pipeline.yaml --log-level warn 2>> /var/log/clinker/etl.log
Systemd journal captures stdout and stderr automatically when running as a service.
DLQ monitoring
When a pipeline exits with code 2, records that could not be processed are written to the dead-letter queue file. Set up a daily check:
#!/bin/bash
# Check for DLQ files produced today
DLQ_DIR=/opt/clinker/output/
DLQ_FILES=$(find "$DLQ_DIR" -name "*_errors.csv" -mtime 0 -size +0c)
if [ -n "$DLQ_FILES" ]; then
mail -s "Clinker DLQ Alert" ops@company.com <<EOF
The following DLQ files were produced today:
$DLQ_FILES
Review the files and address data quality issues.
EOF
fi
Batch ID for tracing
Use --batch-id with a meaningful, consistent naming scheme:
# Date-based
clinker run pipeline.yaml --batch-id "daily-$(date +%Y-%m-%d)"
# Include environment
clinker run pipeline.yaml --batch-id "prod-daily-$(date +%Y-%m-%d)"
The batch ID appears in metrics output and log lines, making it easy to correlate a specific run across logs, metrics, and DLQ files. On retries, use a different batch ID (e.g., append -retry-1) to distinguish attempts.
Upgrades
To upgrade Clinker:
- Validate the new version against your pipelines:
/opt/clinker/bin/clinker-new run pipeline.yaml --dry-run
- Replace the binary:
cp clinker-new /opt/clinker/bin/clinker
- Verify:
/opt/clinker/bin/clinker --version
There is no configuration migration. Pipeline YAML files are forward-compatible within the same major version.
CSV-to-CSV Transform
This recipe reads employee data from a CSV file, computes salary tiers using CXL expressions, and writes the enriched result to a new CSV file.
Input data
employees.csv:
id,name,department,salary
1,Alice Chen,Engineering,95000
2,Bob Martinez,Marketing,62000
3,Carol Johnson,Engineering,88000
4,Dave Williams,Sales,71000
5,Eva Brown,Marketing,58000
6,Frank Lee,Engineering,102000
Pipeline
salary_tiers.yaml:
pipeline:
name: salary_tiers
nodes:
- type: source
name: employees
config:
name: employees
type: csv
path: "./employees.csv"
schema:
- { name: id, type: int }
- { name: name, type: string }
- { name: department, type: string }
- { name: salary, type: int }
- type: transform
name: classify
input: employees
config:
cxl: |
emit id = id
emit name = name
emit department = department
emit salary = salary
emit level = if salary >= 90000 then "senior" else "junior"
emit salary_band = match {
salary >= 100000 => "100k+",
salary >= 90000 => "90-100k",
salary >= 70000 => "70-90k",
_ => "under 70k"
}
- type: output
name: report
input: classify
config:
name: salary_report
type: csv
path: "./output/salary_report.csv"
error_handling:
strategy: fail_fast
Run it
# Validate first
clinker run salary_tiers.yaml --dry-run
# Preview output
clinker run salary_tiers.yaml --dry-run -n 3
# Full run
clinker run salary_tiers.yaml
Expected output
output/salary_report.csv:
id,name,department,salary,level,salary_band
1,Alice Chen,Engineering,95000,senior,90-100k
2,Bob Martinez,Marketing,62000,junior,under 70k
3,Carol Johnson,Engineering,88000,junior,70-90k
4,Dave Williams,Sales,71000,junior,70-90k
5,Eva Brown,Marketing,58000,junior,under 70k
6,Frank Lee,Engineering,102000,senior,100k+
Key points
Schema declaration. The source node declares the schema explicitly with typed columns. This enables compile-time type checking of CXL expressions – if you write salary + name, the type checker catches the error before any data is read.
Emit statements. Each emit in the transform produces one output column. The output schema is defined entirely by the emit statements – input columns that are not emitted are dropped. This is intentional: explicit output schemas prevent accidental data leakage.
Match expressions. The match block evaluates conditions top to bottom and returns the value of the first matching arm. The _ wildcard is the default case and must appear last.
Error handling. The fail_fast strategy aborts the pipeline on the first record error. For production pipelines processing dirty data, consider dead_letter_queue instead – see Error Handling & DLQ.
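The first-match rule for match arms can be checked against the expected output with a small Python model. This mirrors the thresholds in the recipe's CXL; the Python function names are hypothetical and the code is an analogue, not CXL itself.

```python
def salary_band(salary: int) -> str:
    """Evaluate arms top to bottom and return the first match."""
    arms = [
        (lambda s: s >= 100000, "100k+"),
        (lambda s: s >= 90000, "90-100k"),
        (lambda s: s >= 70000, "70-90k"),
        (lambda s: True, "under 70k"),  # the `_` wildcard, always last
    ]
    for condition, label in arms:
        if condition(salary):
            return label
    raise AssertionError("unreachable: the wildcard arm always matches")


def level(salary: int) -> str:
    """Analogue of the if/then/else expression in the recipe."""
    return "senior" if salary >= 90000 else "junior"
```

Arm order carries meaning: 102000 satisfies all three threshold conditions, and it is labelled `100k+` only because that arm comes first. Reordering the arms would change the result.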
Variations
Filtering records
Add a filter statement to exclude records:
- type: transform
name: classify
input: employees
config:
cxl: |
filter salary >= 60000
emit id = id
emit name = name
emit salary = salary
Records where salary < 60000 are dropped silently – they do not appear in the output or the DLQ.
Computed columns with type conversion
cxl: |
emit id = id
emit name = name
emit monthly_salary = (salary.to_float() / 12.0).round(2)
emit salary_display = "$" + salary.to_string()
The .to_float() conversion is required because salary is declared as int and division by a float literal requires matching types.
Multi-Input Combine
This recipe enriches order records with product metadata from a separate catalog stream using a combine node. Combine is a first-class N-ary operator: every input is declared up front, and the where expression uses qualified field references (orders.product_id, products.product_id) to express the join.
Input data
orders.csv:
order_id,product_id,quantity,unit_price
ORD-001,PROD-A,5,29.99
ORD-002,PROD-B,2,149.99
ORD-003,PROD-A,1,29.99
ORD-004,PROD-C,10,9.99
ORD-005,PROD-B,3,149.99
products.csv:
product_id,product_name,category
PROD-A,Widget Pro,Hardware
PROD-B,DataSync License,Software
PROD-C,Cable Kit,Hardware
Pipeline
order_enrichment.yaml:
pipeline:
name: order_enrichment
nodes:
- type: source
name: orders
config:
name: orders
type: csv
path: "./orders.csv"
schema:
- { name: order_id, type: string }
- { name: product_id, type: string }
- { name: quantity, type: int }
- { name: unit_price, type: float }
- type: source
name: products
config:
name: products
type: csv
path: "./products.csv"
schema:
- { name: product_id, type: string }
- { name: product_name, type: string }
- { name: category, type: string }
- type: combine
name: enrich
input:
orders: orders
products: products
config:
where: "orders.product_id == products.product_id"
match: first
on_miss: null_fields
cxl: |
emit order_id = orders.order_id
emit product_id = orders.product_id
emit product_name = products.product_name
emit category = products.category
emit quantity = orders.quantity
emit unit_price = orders.unit_price
emit line_total = orders.quantity.to_float() * orders.unit_price
propagate_ck: driver
- type: output
name: result
input: enrich
config:
name: enriched_orders
type: csv
path: "./output/enriched_orders.csv"
Run it
clinker run order_enrichment.yaml --dry-run
clinker run order_enrichment.yaml --dry-run -n 3
clinker run order_enrichment.yaml
Expected output
output/enriched_orders.csv:
order_id,product_id,product_name,category,quantity,unit_price,line_total
ORD-001,PROD-A,Widget Pro,Hardware,5,29.99,149.95
ORD-002,PROD-B,DataSync License,Software,2,149.99,299.98
ORD-003,PROD-A,Widget Pro,Hardware,1,29.99,29.99
ORD-004,PROD-C,Cable Kit,Hardware,10,9.99,99.90
ORD-005,PROD-B,DataSync License,Software,3,149.99,449.97
How combine works
A combine node declares every input in its input: map, binding each upstream stream to a qualifier used inside expressions:
- type: combine
name: enrich
input:
orders: orders # qualifier: upstream_node
products: products
config:
where: "orders.product_id == products.product_id"
propagate_ck: driver
The config: block carries four fields that shape behavior:
- where – a CXL boolean expression. Every field reference must be qualified with its input name. The expression must contain at least one cross-input equality (e.g. orders.product_id == products.product_id); additional range or arbitrary conjuncts can be combined with and.
- match – first (default), all, or collect. See below.
- on_miss – null_fields (default), skip, or error. Applies only to records on the driving input that find no match.
- cxl – emit statements that shape the output row. Under match: collect, this field must be empty; the combine node auto-derives the output schema.
Match modes
match: first
Emit one output row per driver record, using the first matching build-side record. This is the standard 1:1 enrichment. When no match exists, the behavior is governed by on_miss.
config:
where: "orders.product_id == products.product_id"
match: first
match: all
Emit one output row for every matching build-side record. This is 1:N fan-out – if a driver record matches three build records, three rows are emitted.
- type: combine
name: expand_benefits
input:
employees: employees
benefits: benefits
config:
where: "employees.department == benefits.department"
match: all
cxl: |
emit employee_id = employees.employee_id
emit benefit = benefits.benefit_name
propagate_ck: driver
An employee in a department with three benefits produces three output records.
match: collect
Gather every matching build-side record into a single Array-typed field on the output row. The driver record appears once; the build matches are aggregated into a list. The cxl: body must be empty under match: collect – the combine node synthesizes the output as { driver fields..., <build_qualifier>: Array }.
- type: combine
name: gather
input:
orders: orders
products: products
config:
where: "orders.product_id == products.product_id"
match: collect
cxl: ""
propagate_ck: driver
Use collect when you need the set of matches as a single structured value (e.g. every price history row for an order). Use all when you need one flat row per match.
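The three match modes can be compared side by side in a small Python model. This is a sketch of the semantics described above, not Clinker's implementation: the combine function, its parameters, and the sample price rows are hypothetical, and unmatched drivers are simply dropped here (the on_miss: skip behavior) to keep the example short.

```python
from collections import defaultdict

def combine(drivers, builds, key, match="first", build_qualifier="products"):
    """Toy single-key combine; unmatched drivers are dropped (like on_miss: skip)."""
    # Build side: hash table from key value to every row carrying it.
    table = defaultdict(list)
    for b in builds:
        table[b[key]].append(b)

    out = []
    for d in drivers:
        hits = table.get(d[key], [])
        if not hits:
            continue  # assumption: skip semantics for this sketch
        if match == "first":
            out.append({**d, **hits[0]})           # one row per driver
        elif match == "all":
            out.extend({**d, **b} for b in hits)   # one row per match (1:N)
        elif match == "collect":
            out.append({**d, build_qualifier: list(hits)})  # matches as an array
        else:
            raise ValueError(f"unknown match mode: {match}")
    return out

# Hypothetical sample data: one order with two price-history rows, one with none.
orders = [{"order_id": "ORD-001", "product_id": "PROD-A"},
          {"order_id": "ORD-002", "product_id": "PROD-X"}]
prices = [{"product_id": "PROD-A", "price": 29.99},
          {"product_id": "PROD-A", "price": 27.50}]

first_rows = combine(orders, prices, "product_id", "first")
all_rows = combine(orders, prices, "product_id", "all")
collect_rows = combine(orders, prices, "product_id", "collect", "prices")
```

With this data, first yields one row, all yields two flat rows, and collect yields one row whose prices field holds both matches.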
Unmatched records (on_miss)
on_miss controls what happens to driver records with zero matches:
config:
where: "orders.product_id == products.product_id"
on_miss: null_fields # default: emit with build fields set to null
config:
where: "orders.product_id == products.product_id"
on_miss: skip # inner-join semantics: drop unmatched drivers
config:
where: "orders.product_id == products.product_id"
on_miss: error # fail the pipeline on first unmatched driver
Use skip for inner-join semantics, null_fields for left-join semantics, and error for strict referential integrity where any miss should halt processing.
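As a rough illustration of the three policies, the Python sketch below decides what a single driver record produces when its lookup comes back empty. The function name, its arguments, and the LookupError are stand-ins chosen for this example; the engine's actual error type and row layout are not shown in this recipe.

```python
def apply_on_miss(driver, match, build_fields, on_miss="null_fields"):
    """Return the output rows for one driver, given its match (or None)."""
    if match is not None:
        return [{**driver, **match}]
    if on_miss == "null_fields":
        # Left-join semantics: keep the driver, null out the build-side fields.
        return [{**driver, **{f: None for f in build_fields}}]
    if on_miss == "skip":
        # Inner-join semantics: the unmatched driver produces nothing.
        return []
    if on_miss == "error":
        # Strict referential integrity: halt on the first miss.
        raise LookupError(f"no match for driver record {driver!r}")
    raise ValueError(f"unknown on_miss policy: {on_miss}")

order = {"order_id": "ORD-009", "product_id": "PROD-Z"}
left = apply_on_miss(order, None, ["product_name", "category"])
inner = apply_on_miss(order, None, ["product_name", "category"], "skip")
```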
Composite keys
Chain multiple equalities with and to combine on more than one field. Each conjunct is a separate cross-input equality:
- type: combine
name: match_by_region
input:
sales: sales
targets: targets
config:
where: |
sales.department == targets.department
and sales.region == targets.region
cxl: |
emit department = sales.department
emit region = sales.region
emit actual = sales.amount
emit goal = targets.goal
propagate_ck: driver
Both equalities must hold for a record pair to match.
Equi plus residual filter
The where clause can mix equi predicates with additional filter conjuncts. Non-equality conjuncts are applied as a residual filter after the equi match:
- type: combine
name: high_value_enrichment
input:
orders: orders
products: products
config:
where: |
orders.product_id == products.product_id
and orders.amount >= 100
match: first
on_miss: skip
cxl: |
emit order_id = orders.order_id
emit product_name = products.product_name
emit amount = orders.amount
propagate_ck: driver
The equi conjunct drives the hash lookup; the orders.amount >= 100 conjunct is evaluated as a post-filter on each candidate pair. At least one cross-input equality is required in every combine.
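The split between the equi lookup and the residual filter can be modeled in a few lines of Python. This is a sketch under assumptions: hash_join and its parameters are hypothetical names, the key is a tuple so the same code covers composite keys, and only match: first with on_miss: skip is modeled.

```python
from collections import defaultdict

def hash_join(drivers, builds, driver_keys, build_keys, residual=lambda d, b: True):
    """Yield (driver, build) pairs: equi lookup first, residual filter second."""
    # Build phase: index the build side by the tuple of its equi columns.
    table = defaultdict(list)
    for b in builds:
        table[tuple(b[k] for k in build_keys)].append(b)

    # Probe phase: the equalities pick candidates, the residual prunes them.
    for d in drivers:
        probe = tuple(d[k] for k in driver_keys)
        for b in table.get(probe, []):
            if residual(d, b):
                yield d, b
                break  # match: first stops at the first surviving candidate

orders = [{"order_id": "ORD-001", "product_id": "PROD-A", "amount": 150},
          {"order_id": "ORD-002", "product_id": "PROD-A", "amount": 50}]
products = [{"product_id": "PROD-A", "product_name": "Widget Pro"}]

pairs = list(hash_join(orders, products, ["product_id"], ["product_id"],
                       residual=lambda o, p: o["amount"] >= 100))
```

Both orders find PROD-A in the hash table, but only ORD-001 passes the amount >= 100 residual.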
Multi-input combine (three or more)
Combine accepts any number of inputs. Each pair of inputs that should be related needs an explicit equality in the where clause:
- type: combine
name: fully_enriched
input:
orders: orders
products: products
categories: categories
config:
where: |
orders.product_id == products.product_id
and products.category_id == categories.category_id
match: first
on_miss: null_fields
cxl: |
emit order_id = orders.order_id
emit product_name = products.product_name
emit category_name = categories.name
emit amount = orders.amount
propagate_ck: driver
Input order in the input: map is preserved, and downstream reasoning treats the first input as the default driving side unless a drive: hint overrides it.
Choosing the driving input
By default the planner picks a driving (probe) input and builds hash tables for the rest. Use drive: to force a specific input to be the driver – typically the larger stream, or the one whose ordering you want to preserve:
- type: combine
name: product_driven
input:
orders: orders
products: products
config:
where: "orders.product_id == products.product_id"
match: first
drive: products
cxl: |
emit product_id = products.product_id
emit product_name = products.product_name
emit sample_order_id = orders.order_id
propagate_ck: driver
With drive: products, the pipeline emits one row per product enriched with a matching order, instead of one row per order enriched with its product.
Memory considerations
Build-side inputs are materialized in memory as hash tables keyed by the equi columns. For each non-driving input, plan for roughly 1.5-2x the raw CSV size in heap. A 50 MB product catalog typically uses 75-100 MB of hash-table memory. Tune with --memory-limit; see Memory Tuning for spill thresholds and strategy overrides.
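The rule of thumb is easy to turn into a quick planning calculation. The helper below is only a back-of-the-envelope sketch: the function name is hypothetical, and the 1.5x and 2x factors are the rough multipliers quoted above, not measured values.

```python
MB = 1024 * 1024

def build_side_heap_estimate(csv_sizes_bytes, low=1.5, high=2.0):
    """Return a (low, high) heap estimate in bytes for all build-side inputs."""
    total = sum(csv_sizes_bytes)
    return total * low, total * high

# One 50 MB product catalog as the only non-driving input.
low_est, high_est = build_side_heap_estimate([50 * MB])
print(f"{low_est / MB:.0f}-{high_est / MB:.0f} MB")
```

For a three-input combine, pass the sizes of both non-driving inputs in the list and compare the upper estimate with the value given to --memory-limit.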
Routing to Multiple Outputs
This recipe splits a stream of order records into separate output files based on business rules. High-value orders go to one file, standard orders to another.
Input data
orders.csv:
order_id,customer,amount,region
ORD-001,Acme Corp,15000,US
ORD-002,Globex,450,EU
ORD-003,Initech,8500,US
ORD-004,Umbrella,22000,APAC
ORD-005,Stark Ind,950,US
ORD-006,Wayne Ent,3200,EU
Pipeline
order_routing.yaml:
pipeline:
name: order_routing
vars:
high_value_threshold: 5000
nodes:
- type: source
name: orders
config:
name: orders
type: csv
path: "./orders.csv"
schema:
- { name: order_id, type: string }
- { name: customer, type: string }
- { name: amount, type: float }
- { name: region, type: string }
- type: route
name: split_by_value
input: orders
config:
mode: exclusive
conditions:
high: "amount >= $vars.high_value_threshold"
default: standard
- type: output
name: high_value_output
input: split_by_value.high
config:
name: high_value_orders
type: csv
path: "./output/high_value.csv"
- type: output
name: standard_output
input: split_by_value.standard
config:
name: standard_orders
type: csv
path: "./output/standard.csv"
Run it
clinker run order_routing.yaml --dry-run
clinker run order_routing.yaml
Expected output
output/high_value.csv:
order_id,customer,amount,region
ORD-001,Acme Corp,15000,US
ORD-003,Initech,8500,US
ORD-004,Umbrella,22000,APAC
output/standard.csv:
order_id,customer,amount,region
ORD-002,Globex,450,EU
ORD-005,Stark Ind,950,US
ORD-006,Wayne Ent,3200,EU
How routing works
Port syntax
Route nodes produce named output ports. Downstream nodes reference these ports using dot syntax: split_by_value.high and split_by_value.standard.
The port names come from two places:
- Condition names in the conditions map (here, high)
- The default field (here, standard)
Exclusive mode
With mode: exclusive, each record goes to exactly one branch. Conditions are evaluated top to bottom – the first matching condition wins, and the record is sent to that port. Records that match no condition go to the default port.
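First-match-wins routing can be sketched in Python as follows. The route_exclusive function and the lambda predicate are illustrative stand-ins for the route node and its CXL condition; the order data is taken from orders.csv above.

```python
def route_exclusive(record, conditions, default):
    """Return the name of the single port this record is sent to."""
    # Python dicts preserve insertion order, mirroring top-to-bottom evaluation.
    for port, predicate in conditions.items():
        if predicate(record):
            return port  # first matching condition wins
    return default       # no condition matched

high_value_threshold = 5000
conditions = {"high": lambda r: r["amount"] >= high_value_threshold}

orders = [("ORD-001", 15000), ("ORD-002", 450), ("ORD-003", 8500),
          ("ORD-004", 22000), ("ORD-005", 950), ("ORD-006", 3200)]

ports = {"high": [], "standard": []}
for order_id, amount in orders:
    port = route_exclusive({"amount": amount}, conditions, default="standard")
    ports[port].append(order_id)
```

The resulting port contents match the two expected output files: three orders on high, three on standard.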
Pipeline variables
The threshold is defined in pipeline.vars and referenced in the CXL expression as $vars.high_value_threshold. This makes it easy to adjust the threshold without editing the route condition, and channel overrides can change it per environment.
Variations
Multiple branches
Route nodes can have any number of named branches:
- type: route
name: split_by_region
input: orders
config:
mode: exclusive
conditions:
us: "region == \"US\""
eu: "region == \"EU\""
apac: "region == \"APAC\""
default: other
- type: output
name: us_output
input: split_by_region.us
config:
name: us_orders
type: csv
path: "./output/us_orders.csv"
- type: output
name: eu_output
input: split_by_region.eu
config:
name: eu_orders
type: csv
path: "./output/eu_orders.csv"
# ... additional outputs for apac, other
Transform before output
Insert a transform between the route and output to shape the data differently per branch:
- type: transform
name: enrich_high_value
input: split_by_value.high
config:
cxl: |
emit order_id = order_id
emit customer = customer
emit amount = amount
emit priority = "URGENT"
emit review_required = true
- type: output
name: high_value_output
input: enrich_high_value
config:
name: high_value_orders
type: csv
path: "./output/high_value.csv"
Combining routing with aggregation
Route first, then aggregate each branch independently:
- type: aggregate
name: high_value_summary
input: split_by_value.high
config:
group_by: [region]
cxl: |
emit total = sum(amount)
emit count = count(*)
This produces a per-region summary of high-value orders only.
Aggregation & Rollups
This recipe demonstrates grouping records and computing summary statistics. The pipeline filters active sales records, then rolls them up by department.
Input data
sales.csv:
id,department,amount,status,rep
1,Engineering,5000,active,Alice
2,Marketing,3000,active,Bob
3,Engineering,7000,active,Carol
4,Sales,4000,inactive,Dave
5,Marketing,2000,active,Eva
6,Engineering,9500,active,Frank
7,Sales,6000,active,Grace
8,Marketing,1500,inactive,Hank
Pipeline
dept_rollup.yaml:
pipeline:
name: dept_rollup
nodes:
- type: source
name: sales
config:
name: sales
type: csv
path: "./sales.csv"
schema:
- { name: id, type: int }
- { name: department, type: string }
- { name: amount, type: float }
- { name: status, type: string }
- { name: rep, type: string }
- type: transform
name: active_only
input: sales
config:
cxl: |
filter status == "active"
- type: aggregate
name: rollup
input: active_only
config:
group_by: [department]
cxl: |
emit total = sum(amount)
emit count = count(*)
emit average = avg(amount)
emit maximum = max(amount)
emit minimum = min(amount)
- type: output
name: report
input: rollup
config:
name: dept_totals
type: csv
path: "./output/dept_totals.csv"
Run it
clinker run dept_rollup.yaml --dry-run
clinker run dept_rollup.yaml
Expected output
output/dept_totals.csv:
department,total,count,average,maximum,minimum
Engineering,21500,3,7166.67,9500,5000
Marketing,5000,2,2500,3000,2000
Sales,6000,1,6000,6000,6000
One row per department. The inactive records (Dave’s $4000, Hank’s $1500) are excluded by the filter.
How aggregation works
Group-by keys
The group_by field lists the columns that define each group. Records with the same values for all group-by columns are aggregated together. The group-by columns appear automatically in the output – you do not need to emit them.
Aggregate functions
Available aggregate functions in CXL:
| Function | Description |
|---|---|
| sum(expr) | Sum of values |
| count(*) | Number of records |
| avg(expr) | Arithmetic mean |
| min(expr) | Minimum value |
| max(expr) | Maximum value |
| first(expr) | First value encountered |
| last(expr) | Last value encountered |
Strategy selection
Clinker offers two aggregation strategies:
- Hash aggregation (default): Builds an in-memory hash map keyed by the group-by columns. Works with any input order. Memory usage is proportional to the number of distinct groups.
- Streaming aggregation: Processes records in order, emitting each group’s result as soon as the next group starts. Requires input sorted by the group-by keys. Uses minimal memory regardless of the number of groups.
The default strategy (auto) selects streaming when the optimizer can prove the input is sorted by the group-by keys, and hash otherwise. You can force a strategy:
config:
group_by: [department]
strategy: streaming # requires sorted input
See Memory Tuning for details on memory implications.
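The difference between the two strategies, and the reason streaming needs sorted input, shows up in a short Python model. This is a sketch of the general techniques, not Clinker's code; the function names are hypothetical and the rows are the six active records from sales.csv.

```python
from itertools import groupby

def hash_sum(rows, key, value):
    """Hash strategy: one accumulator per distinct group, any input order."""
    totals = {}
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0) + r[value]
    return totals

def streaming_sum(rows, key, value):
    """Streaming strategy: emit a group as soon as the key changes."""
    for k, group in groupby(rows, key=lambda r: r[key]):
        yield k, sum(r[value] for r in group)

active = [{"department": "Engineering", "amount": 5000},
          {"department": "Marketing", "amount": 3000},
          {"department": "Engineering", "amount": 7000},
          {"department": "Marketing", "amount": 2000},
          {"department": "Engineering", "amount": 9500},
          {"department": "Sales", "amount": 6000}]

by_hash = hash_sum(active, "department", "amount")
unsorted_stream = list(streaming_sum(active, "department", "amount"))
sorted_rows = sorted(active, key=lambda r: r["department"])
sorted_stream = dict(streaming_sum(sorted_rows, "department", "amount"))
```

On the unsorted input the streaming version emits six fragments instead of three groups, which is why forcing strategy: streaming is only safe when the input really is sorted by the group-by keys.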
Variations
Multiple group-by keys
config:
group_by: [department, region]
cxl: |
emit total = sum(amount)
emit count = count(*)
Produces one row per unique (department, region) combination.
Pre-aggregation transform
Compute derived fields before aggregating:
- type: transform
name: prepare
input: sales
config:
cxl: |
filter status == "active"
emit department = department
emit amount = amount
emit is_large = amount >= 5000
- type: aggregate
name: rollup
input: prepare
config:
group_by: [department]
cxl: |
emit total = sum(amount)
emit large_count = sum(if is_large then 1 else 0)
emit small_count = sum(if not is_large then 1 else 0)
Aggregation followed by routing
Aggregate first, then route the summary rows:
- type: aggregate
name: rollup
input: active_only
config:
group_by: [department]
cxl: |
emit total = sum(amount)
- type: route
name: split_by_total
input: rollup
config:
mode: exclusive
conditions:
large: "total >= 10000"
default: small
This routes departments with at least $10,000 in total sales to one output and the rest to another.
No group-by (grand total)
Omit group_by to aggregate all records into a single output row:
config:
cxl: |
emit grand_total = sum(amount)
emit record_count = count(*)
emit average_amount = avg(amount)
Time-windowed rollups
When the grouping dimension is an event-time bucket, declare a
watermark: on every source and a time_window: on the aggregate.
Three patterns cover the common shapes; all three ship as runnable
pipelines under examples/pipelines/.
Tumbling: hourly click counts
Non-overlapping one-hour buckets per user. Use when each record should contribute to exactly one reporting bucket.
examples/pipelines/tumbling_clicks.yaml:
pipeline:
name: tumbling_clicks
nodes:
- type: source
name: clicks
description: Per-user click stream with an event-time column.
config:
name: clicks
type: csv
path: ./data/tumbling_clicks.csv
options:
has_header: true
watermark:
column: event_ts
schema:
- { name: user_id, type: string }
- { name: event_ts, type: date_time }
- { name: kind, type: string }
- type: aggregate
name: hourly_clicks
description: Per-user click count, bucketed by event-time hour.
input: clicks
config:
group_by: [user_id]
time_window:
tumbling: { size: 1h }
cxl: |
emit user_id = user_id
emit n = count(*)
- type: output
name: results
input: hourly_clicks
config:
name: results
type: csv
path: ./output/tumbling_clicks.csv
error_handling:
strategy: fail_fast
Run:
cargo run -p clinker -- run examples/pipelines/tumbling_clicks.yaml
The source’s watermark advances with each record’s event_ts; each
hour-aligned bucket emits one row per user_id as soon as the
watermark crosses bucket_end. Records observed out-of-order land
in the DLQ as late_record — add delay: on the source or
allowed_lateness: on the aggregate if the input has a known
out-of-order tail.
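Bucket assignment and the late-record check can be sketched in Python. Several details here are assumptions made for illustration rather than documented behavior: buckets are aligned to the Unix epoch, each bucket is the half-open interval from start to end, and a bucket counts as closed once the watermark reaches bucket_end plus any allowed lateness.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def tumbling_bucket(event_ts, size):
    """Return (start, end) of the fixed-width bucket containing event_ts."""
    n = (event_ts - EPOCH) // size   # whole buckets elapsed since the epoch
    start = EPOCH + n * size
    return start, start + size

def is_late(event_ts, watermark, size, allowed_lateness=timedelta(0)):
    """True if the record's bucket has already closed (assumed close rule)."""
    _, end = tumbling_bucket(event_ts, size)
    return watermark >= end + allowed_lateness

hour = timedelta(hours=1)
click = datetime(2026, 1, 1, 10, 42, 7, tzinfo=timezone.utc)
start, end = tumbling_bucket(click, hour)

wm = datetime(2026, 1, 1, 11, 5, tzinfo=timezone.utc)
late_strict = is_late(click, wm, hour)
late_with_grace = is_late(click, wm, hour, timedelta(minutes=10))
```

A click at 10:42:07 belongs to the 10:00 to 11:00 bucket. With the watermark at 11:05 it is late under the strict rule, but not when ten minutes of lateness are allowed.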
Hopping: 1-hour sums advanced every 5 minutes
Overlapping one-hour windows that move forward every 5 minutes. Use for moving averages and rolling sums where one record should contribute to multiple overlapping reports.
examples/pipelines/hopping_sliding_5m_1h.yaml:
pipeline:
name: hopping_sliding_5m_1h
nodes:
- type: source
name: clicks
config:
name: clicks
type: csv
path: ./data/hopping_clicks.csv
options:
has_header: true
watermark:
column: event_ts
delay: 5s
schema:
- { name: user_id, type: string }
- { name: event_ts, type: date_time }
- { name: amount, type: int }
- type: aggregate
name: sliding_amount
input: clicks
config:
group_by: [user_id]
time_window:
hopping:
size: 1h
slide: 5m
allowed_lateness: 30s
cxl: |
emit user_id = user_id
emit total = sum(amount)
emit n = count(*)
- type: output
name: results
input: sliding_amount
config:
name: results
type: csv
path: ./output/hopping_sliding_5m_1h.csv
error_handling:
strategy: fail_fast
Run:
cargo run -p clinker -- run examples/pipelines/hopping_sliding_5m_1h.yaml
Each record fans into ceil(size / slide) = 12 overlapping
windows, so the output row count is roughly 12× the active-window
record count. The source’s delay: 5s plus the aggregate’s
allowed_lateness: 30s give the pipeline 35 seconds of total grace
beyond strict event-time order before a record drops to the DLQ.
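The fan-out factor can be checked with a small Python sketch that enumerates every window containing a given timestamp. It assumes epoch-aligned window starts and half-open windows, which this recipe does not state explicitly; hopping_windows is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def hopping_windows(event_ts, size, slide):
    """Return every (start, end) window of the given size that contains event_ts."""
    # Latest slide-aligned window start at or before the event.
    start = EPOCH + ((event_ts - EPOCH) // slide) * slide
    windows = []
    # Walk backwards one slide at a time while the window still covers the event.
    while start + size > event_ts:
        windows.append((start, start + size))
        start -= slide
    return windows

size, slide = timedelta(hours=1), timedelta(minutes=5)
ts = datetime(2026, 1, 1, 10, 42, 7, tzinfo=timezone.utc)
windows = hopping_windows(ts, size, slide)

# Total grace before a record is dropped: source delay + allowed lateness.
total_grace = timedelta(seconds=5) + timedelta(seconds=30)
```

For size 1h and slide 5m the event lands in 12 windows, the earliest starting at 09:45 and the latest at 10:40.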
Session: per-user multi-source login sessions
Variable-duration windows bounded by inactivity, computed across two independent sources. Use for activity grouping where the window length is data-driven rather than clock-aligned.
examples/pipelines/multi_source_session.yaml:
pipeline:
name: multi_source_session
nodes:
- type: source
name: src_web
description: Web login events.
config:
name: src_web
type: csv
path: ./data/session_logins.csv
options:
has_header: true
watermark:
column: event_ts
schema:
- { name: user_id, type: string }
- { name: event_ts, type: date_time }
- { name: source, type: string }
- type: source
name: src_mobile
description: Mobile login events.
config:
name: src_mobile
type: csv
path: ./data/session_mobile.csv
options:
has_header: true
watermark:
column: event_ts
schema:
- { name: user_id, type: string }
- { name: event_ts, type: date_time }
- { name: source, type: string }
- type: merge
name: all_logins
inputs: [src_web, src_mobile]
- type: aggregate
name: user_sessions
input: all_logins
config:
group_by: [user_id]
time_window:
session: { gap: 5m }
allowed_lateness: 30s
cxl: |
emit user_id = user_id
emit logins = count(*)
- type: output
name: results
input: user_sessions
config:
name: results
type: csv
path: ./output/multi_source_session.csv
error_handling:
strategy: fail_fast
Run:
cargo run -p clinker -- run examples/pipelines/multi_source_session.yaml
Each source declares its own watermark.column independently. The
aggregate’s close decision reads min_across_sources across both
sources’ partitions: a session can’t emit until both src_web and
src_mobile have advanced past session_end + allowed_lateness.
Drop the watermark: block on either source and the planner
rejects the pipeline with E156.
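Gap-based session grouping and the minimum-across-sources close rule can be modeled roughly as below. This is a sketch under assumptions: sessionize and can_close are hypothetical names, events exactly one gap apart are treated as the same session, and session_end is passed in as a given value because its exact definition is not spelled out here.

```python
from datetime import datetime, timedelta, timezone

def sessionize(timestamps, gap):
    """Group one user's event times into sessions separated by more than gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the inactivity gap
        else:
            sessions.append([ts])     # gap exceeded: start a new session
    return sessions

def can_close(session_end, source_watermarks, allowed_lateness):
    """A session may emit only once the slowest source has moved past it."""
    return min(source_watermarks.values()) > session_end + allowed_lateness

def at(h, m, s=0):
    return datetime(2026, 1, 1, h, m, s, tzinfo=timezone.utc)

logins = [at(10, 0), at(10, 10), at(10, 3)]   # merged web + mobile events
sessions = sessionize(logins, timedelta(minutes=5))

lateness = timedelta(seconds=30)
blocked = can_close(at(10, 8), {"src_web": at(10, 20), "src_mobile": at(10, 5)}, lateness)
ready = can_close(at(10, 8), {"src_web": at(10, 20), "src_mobile": at(10, 9)}, lateness)
```

The second call shows why one slow source holds back emission: src_web is far ahead, but the session only closes once src_mobile has also passed session_end + allowed_lateness.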
When to pick each
| Kind | Bucket shape | Typical use |
|---|---|---|
| tumbling | Disjoint, clock-aligned, fixed width | Hourly metrics, daily rollups, billing periods. |
| hopping | Overlapping, clock-aligned, fixed width | Moving averages, sliding sums, anomaly detection where each record should affect multiple reports. |
| session | Variable width, gap-bounded, per-key | User sessions, telemetry burst grouping, activity envelopes where the window length is data-driven. |
File Splitting
This recipe demonstrates splitting large output files into smaller chunks, optionally keeping related records together.
Basic record-count splitting
Split output into files of at most 5,000 records each:
pipeline:
name: monthly_report
nodes:
- type: source
name: transactions
config:
name: transactions
type: csv
path: "./data/transactions.csv"
schema:
- { name: id, type: int }
- { name: date, type: string }
- { name: department, type: string }
- { name: amount, type: float }
- { name: description, type: string }
- type: output
name: split_output
input: transactions
config:
name: monthly_report
type: csv
path: "./output/report.csv"
split:
max_records: 5000
naming: "{stem}_{seq:04}.{ext}"
repeat_header: true
Output files
output/report_0001.csv (5000 records + header)
output/report_0002.csv (5000 records + header)
output/report_0003.csv (remaining records + header)
Naming pattern variables
| Variable | Description | Example |
|---|---|---|
| {stem} | Base filename without extension | report |
| {ext} | File extension | csv |
| {seq:04} | Zero-padded sequence number (width 4) | 0001 |
The path field provides the template: ./output/report.csv means stem is report and ext is csv.
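The expansion of the naming template can be mimicked in Python, whose str.format placeholders happen to look the same. This only illustrates how stem, ext, and seq combine; it is not a claim that Clinker's template engine follows Python's format rules in general, and chunk_name is a hypothetical helper.

```python
from pathlib import PurePosixPath

def chunk_name(path, seq, naming="{stem}_{seq:04}.{ext}"):
    """Derive the file name of chunk number seq from the configured output path."""
    p = PurePosixPath(path)
    name = naming.format(stem=p.stem, ext=p.suffix.lstrip("."), seq=seq)
    return str(p.with_name(name))

names = [chunk_name("./output/report.csv", seq) for seq in (1, 2, 3)]
print(names)
```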
Header behavior
When repeat_header: true, each output file includes the CSV header row. This is the recommended setting – each file is self-contained and can be processed independently.
Grouped splitting
Keep all records with the same group key value in the same file:
split:
max_records: 5000
group_key: "department"
naming: "{stem}_{seq:04}.{ext}"
repeat_header: true
oversize_group: warn
With group_key: "department", the splitter ensures that all records for a given department land in the same output file. A new file starts only at a group boundary (when the department value changes), even if the current file has not reached max_records yet.
Oversize group policy
If a single group contains more records than max_records, the oversize_group setting controls behavior:
| Policy | Behavior |
|---|---|
| warn (default) | Log a warning and write all records for the group into one file, exceeding the limit |
| error | Stop the pipeline with an error |
| allow | Silently allow the oversized file |
For example, if max_records is 5,000 but the Engineering department has 7,000 records, the warn policy produces a file with 7,000 records and logs a warning.
Byte-based splitting
Split by file size instead of record count:
split:
max_bytes: 10485760 # 10 MB per file
naming: "{stem}_{seq:04}.{ext}"
repeat_header: true
The splitter estimates the current file size and starts a new file when the limit is approached. The actual file size may slightly exceed the limit because the current record is always completed before splitting.
Combined limits
Use both max_records and max_bytes together – whichever limit is reached first triggers a new file:
split:
max_records: 10000
max_bytes: 5242880 # 5 MB
naming: "{stem}_{seq:04}.{ext}"
repeat_header: true
This is useful when record sizes vary widely. Short records might produce a tiny file at 10,000 records, while long records might hit the byte limit well before 10,000.
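The whichever-limit-first rule can be sketched in Python as a simple rollover loop. The details are assumptions for illustration: records are modeled as already-serialized lines, the size estimate is the UTF-8 length plus a newline, headers are ignored, and a record is always written whole before the limits are checked.

```python
def split_records(lines, max_records=None, max_bytes=None):
    """Group serialized lines into chunks, rolling over at whichever limit hits first."""
    chunks, current, size = [], [], 0
    for line in lines:
        current.append(line)                    # the record is never cut in half
        size += len(line.encode("utf-8")) + 1   # +1 for the newline
        hit_records = max_records is not None and len(current) >= max_records
        hit_bytes = max_bytes is not None and size >= max_bytes
        if hit_records or hit_bytes:
            chunks.append(current)
            current, size = [], 0
    if current:
        chunks.append(current)                  # final partial chunk
    return chunks

short = ["a"] * 5          # 2 bytes per line: the record limit trips first
long_ = ["x" * 59] * 4     # 60 bytes per line: the byte limit trips first

by_count = split_records(short, max_records=2, max_bytes=100)
by_bytes = split_records(long_, max_records=10, max_bytes=100)
```

In the second case each chunk holds two 60-byte lines, so it ends at 120 bytes, slightly over the 100-byte limit, because the record in progress is completed before the split.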
Full pipeline example
A complete pipeline that reads a large transaction file, filters it, and splits the output:
pipeline:
name: split_transactions
nodes:
- type: source
name: transactions
config:
name: transactions
type: csv
path: "./data/all_transactions.csv"
schema:
- { name: id, type: int }
- { name: date, type: string }
- { name: department, type: string }
- { name: category, type: string }
- { name: amount, type: float }
- type: transform
name: current_year
input: transactions
config:
cxl: |
filter date.starts_with("2026")
- type: output
name: chunked
input: current_year
config:
name: transactions_2026
type: csv
path: "./output/transactions_2026.csv"
split:
max_records: 5000
group_key: "department"
naming: "{stem}_{seq:04}.{ext}"
repeat_header: true
oversize_group: warn
clinker run split_transactions.yaml --force
Practical considerations
- Downstream consumers. Splitting is useful when the receiving system has file size limits (e.g., an upload API that accepts files up to 10 MB) or when parallel processing of chunks is desired.
- Record ordering. Records within each output file maintain their original order from the pipeline. Across files, the sequence number ({seq}) indicates the order.
- Group key sorting. For group_key to work correctly, the input should ideally be sorted by the group key. If the input is not sorted, records for the same group may appear in multiple files. Pre-sort with a transform if needed, or accept the split-group behavior.
- Overwrite behavior. Use --force when re-running a pipeline with splitting enabled. Without it, the pipeline aborts if any of the output chunk files already exist.
Intra-Record Closures
This recipe shows the complete intra-record fan-out shape: an NDJSON source where each record carries an array of line items, a transform that filters items by price and then fans each remaining item into its own output record, and a flat NDJSON sink ready for downstream billing.
The pieces involved:
- Arrow-syntax closures for predicates and projections.
- Array methods (filter, map) for in-place transformation.
- Bracket-index access (it["sku"]) for reading fields off each map element.
- emit each for fan-out.
- The Output node’s include_unmapped flag for controlling which fields reach the sink.
Input data
orders.ndjson – one JSON object per line, each carrying a nested items array:
{"order_id":"O-1","customer":"alice@example.com","items":[{"sku":"a","price":10,"qty":2},{"sku":"b","price":20,"qty":1},{"sku":"c","price":3,"qty":5}]}
{"order_id":"O-2","customer":"bob@example.com","items":[{"sku":"a","price":10,"qty":1},{"sku":"d","price":50,"qty":1}]}
Each record has two order-level fields (order_id, customer) and an items array whose elements are maps with sku, price, and qty.
Goal
For each order:
- Drop items priced under $5 (a sub-threshold cutoff).
- Fan the surviving items into one output record each, carrying the order-level identifiers plus the per-item fields.
- Compute the per-line revenue (unit_price * qty) for each output record.
Pipeline
billing_lines.yaml:
pipeline:
name: billing_lines
nodes:
- type: source
name: orders
config:
name: orders
type: json
options:
format: ndjson
path: "./orders.ndjson"
schema:
- { name: order_id, type: string }
- { name: customer, type: string }
- { name: items, type: any }
- type: transform
name: filter_lines
input: orders
config:
cxl: |
emit order_id = order_id
emit customer = customer
emit item_count = items.length()
emit kept = items.filter(it => it["price"] >= 5)
- type: transform
name: explode
input: filter_lines
config:
max_expansion: 10000
cxl: |
emit each it in kept {
emit order_id = order_id
emit customer = customer
emit sku = it["sku"]
emit unit_price = it["price"]
emit qty = it["qty"]
emit line_total = it["price"] * it["qty"]
}
- type: output
name: lines_out
input: explode
config:
name: lines_out
type: json
path: "./output/billing_lines.ndjson"
options:
format: ndjson
include_unmapped: false
exclude: [items, kept]
error_handling:
strategy: continue
Run it
# Validate first
clinker run billing_lines.yaml --dry-run
# Preview the first few output records
clinker run billing_lines.yaml --dry-run -n 3
# Full run
clinker run billing_lines.yaml
Expected output
output/billing_lines.ndjson:
{"order_id":"O-1","customer":"alice@example.com","sku":"a","unit_price":10,"qty":2,"line_total":20}
{"order_id":"O-1","customer":"alice@example.com","sku":"b","unit_price":20,"qty":1,"line_total":20}
{"order_id":"O-2","customer":"bob@example.com","sku":"a","unit_price":10,"qty":1,"line_total":10}
{"order_id":"O-2","customer":"bob@example.com","sku":"d","unit_price":50,"qty":1,"line_total":50}
Order O-1’s three input items collapse to two output records (the sku=c line was filtered out because its price was below $5). Order O-2’s two items both survive the filter and produce two output records.
How it works
Filter stage. The filter_lines transform reads each order, runs items.filter(it => it["price"] >= 5) to drop sub-threshold items, and stashes the survivors in a kept field. The closure body uses bracket indexing (it["price"]) because each it is a map; bracket indexing returns null for missing keys without aborting. The same record also carries an item_count projection so downstream nodes could route or audit on the original (pre-filter) item count.
Explode stage. The explode transform contains one emit each block over kept. For each surviving item, the body emits a flat record with the order-level identifiers (order_id, customer) repeated, plus the per-item fields lifted out of it. The body has no filter or nested emit each – those are forbidden inside the block; pre-filter upstream as we did, or post-filter in a downstream transform.
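For readers who think in general-purpose code, the two stages correspond roughly to the Python below. It is a plain-Python model of the filter-then-fan-out shape, not of the CXL runtime: billing_lines is a hypothetical function, and raising an exception stands in for the DLQ routing that the engine performs when the expansion cap is exceeded.

```python
def billing_lines(order, min_price=5, max_expansion=10000):
    """Filter an order's items by price, then fan each survivor into a flat record."""
    # Filter stage: the equivalent of items.filter(it => it["price"] >= 5).
    kept = [it for it in order["items"] if it["price"] >= min_price]

    # Expansion cap: the engine routes such records to the DLQ instead of raising.
    if len(kept) > max_expansion:
        raise OverflowError("expansion_limit_exceeded")

    # Explode stage: one flat output record per surviving item.
    return [{"order_id": order["order_id"],
             "customer": order["customer"],
             "sku": it["sku"],
             "unit_price": it["price"],
             "qty": it["qty"],
             "line_total": it["price"] * it["qty"]}
            for it in kept]

o1 = {"order_id": "O-1", "customer": "alice@example.com",
      "items": [{"sku": "a", "price": 10, "qty": 2},
                {"sku": "b", "price": 20, "qty": 1},
                {"sku": "c", "price": 3, "qty": 5}]}

lines = billing_lines(o1)
```

Run against order O-1, this produces the same two lines as the expected output: sku a and sku b, each with a line_total of 20.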
include_unmapped: false. The default Output policy is to pass every unmapped input field through. Here we set it to false so the order-level items array (carried through from the source), the item_count projection, and the intermediate kept array (used only as the fan-out source) do not leak into the per-line output. The exclude: [items, kept] list provides a belt-and-suspenders defense against future renaming.
max_expansion: 10000. Caps how many output records a single input order may produce. The default is 10000; we set it explicitly here so the value is visible in the YAML. Orders with arrays larger than the cap route to the DLQ with category expansion_limit_exceeded (see Transform Nodes -> Expansion Cap).
Variations
Pass through every input field
Remove include_unmapped: false (or set it to true) and the original order-level fields plus the intermediate kept array will appear on every output record. Useful when downstream consumers expect a complete record context, or when you need to audit what was filtered.
Emit a single record per order with the kept-items array
Drop the explode transform and route filter_lines directly to the Output. Each output record stays at order grain, with kept carrying the post-filter array. This is the same pipeline minus the fan-out step.
Reach for .flat_map instead of two transforms
When the per-element transformation is simple enough to fit in a single closure body, flat_map collapses the filter + project + explode pattern into one expression. It produces a flat array, which downstream nodes still see as a single field on the input record; the explicit emit each is what produces multiple output records.
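The distinction between a flat array and multiple records can be seen in a generic Python model of flat_map. This sketch shows the general technique only; the exact closure contract of CXL's flat_map is documented under Array Methods, and the lines field name is made up for the example.

```python
def flat_map(items, fn):
    """Apply fn to each element and concatenate the lists it returns."""
    return [out for it in items for out in fn(it)]

order = {"order_id": "O-1",
         "items": [{"sku": "a", "price": 10, "qty": 2},
                   {"sku": "b", "price": 20, "qty": 1},
                   {"sku": "c", "price": 3, "qty": 5}]}

# Filter and project in one closure: return [] to drop, [value] to keep.
order["lines"] = flat_map(
    order["items"],
    lambda it: [{"sku": it["sku"], "line_total": it["price"] * it["qty"]}]
    if it["price"] >= 5 else [],
)
```

The result is still a single order record; lines is one array-valued field on it, and nothing has been fanned out into separate records.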
Rewrite a nested field in place with .set
When you want to keep the record at order grain but mutate a value buried inside it, the set map method takes a dotted/indexed path and rewrites a single leaf, leaving every sibling untouched:
cxl: |
emit order = order.set("items[0].sku", "A-100").set("ship.region", "us-east")
The first set overwrites the SKU of the first item; the second writes ship.region, auto-creating the ship map if the order had no ship field yet. Because set is copy-on-write, this builds a fresh order document without disturbing the upstream binding. A path that conflicts with the existing shape (descending into a scalar, or an array index past the end) yields null for that set rather than partially writing – guard with catch if a path may not match every record.
See also
- Closures – the it => body form.
- Array Methods – filter, map, find, any, flat_map, remove, length, join.
- Map Methods – keys, values, merge, set, remove_field.
- Nested Paths – bracket-index and dotted-path navigation.
- Emit Each – the fan-out statement.
- Transform Nodes – max_expansion and DLQ routing.
- Output Nodes – include_unmapped and field control.