Exit Codes & Error Diagnosis
Clinker uses structured exit codes to communicate the outcome of a pipeline run. These codes are designed for integration with schedulers, cron, CI systems, and monitoring tools.
Exit code reference
| Code | Meaning | Description |
|---|---|---|
| 0 | Success | Pipeline completed. All records processed successfully. |
| 1 | Configuration error | Invalid YAML, CXL syntax error, type mismatch, or DAG wiring problem. Fix the pipeline configuration. |
| 2 | Partial success | Pipeline ran to completion, but some records were routed to the dead-letter queue. Check the DLQ file. |
| 3 | Evaluation error | CXL runtime error during record processing (e.g., division by zero, type coercion failure). |
| 4 | I/O error | File not found, permission denied, disk full, or input format mismatch. |
Understanding exit code 2
Exit code 2 is not a crash. It means:
- The pipeline started and ran to completion.
- All viable records were processed and written to output files.
- Some records could not be processed and were diverted to the dead-letter queue.
Your scheduler should treat exit code 2 as a warning, not a failure. The DLQ file contains the problematic records along with the error that caused each one to be rejected.
To control when exit code 2 escalates to a hard failure, use --error-threshold:
# Abort if more than 100 records hit the DLQ
clinker run pipeline.yaml --error-threshold 100
With a threshold set, the pipeline aborts (exit code 3) when the DLQ count exceeds the threshold, rather than continuing to completion.
Diagnosing failures
Exit code 1: Configuration error
The error message includes a span-annotated diagnostic pointing to the exact location of the problem:
Error: CXL type error in node 'transform_1'
--> pipeline.yaml:25:15
|
25 | emit total = amount + name
| ^^^^^^^^^^^^^ cannot add Int and String
Action: Fix the YAML or CXL expression indicated in the diagnostic, then re-run with --dry-run to confirm the fix.
Exit code 2: Partial success (DLQ entries)
Check the DLQ file for details:
# The DLQ path is shown in the run output and in metrics
cat output/errors.csv
Common causes:
- Null values in fields that a CXL expression does not handle
- Data that does not match the declared schema (e.g., non-numeric value in an integer column)
- Coercion failures between types
Action: Review the DLQ records, fix the data or add null handling to CXL expressions, and re-run.
Exit code 3: Evaluation error
A CXL expression failed at runtime. The error message includes the failing expression and the record that triggered it:
Error: division by zero in node 'compute_ratio'
expression: emit ratio = total / count
record: {total: 500, count: 0}
Action: Add guard conditions to the CXL expression:
emit ratio = if count == 0 then 0 else total / count
Exit code 4: I/O error
File system or format errors:
Error: file not found: ./data/customers.csv
--> pipeline.yaml:8:12
Common causes:
- Input file does not exist or path is wrong
- Permission denied on input or output directories
- Output file already exists (use
--forceto overwrite) - Disk full during output writing
- Input file format does not match the declared type (e.g., invalid CSV)
Action: Fix file paths, permissions, or disk space, then re-run.
Plan-time diagnostic codes
The process exit codes above tell a scheduler whether the run
succeeded. The E### codes below appear inside the structured
Error: messages a configuration error (exit code 1) prints, and
identify the specific compile-time check that rejected the
pipeline. The codes below cover the event-time watermark and
time-windowed aggregate surface
(issue #61);
related code sets live in
Pipeline Variables,
Channels, and
Correlation Keys.
| Code | Trigger | Remediation |
|---|---|---|
| E154 | A source declares watermark.column: <col> but <col> is not present in that source’s schema: block. | Add the column to schema:, or remove the watermark: block. |
| E155 | A source declares watermark.column: <col> and the column exists, but its declared CXL type is not date_time or date. | Change the column’s type: to date_time or date, or point watermark.column at a column that already has one of those types. |
| E156 | An aggregate declares time_window: but at least one upstream-reachable source does not declare watermark.column. | Add watermark: { column: <event-time-column> } to each listed source, or remove time_window: from the aggregate. Without a watermark on every upstream source, min_across_sources never advances past None and the window can never close. |
See Source Nodes → Watermarks and Aggregate Nodes → Time-windowed aggregates for the field semantics each code is enforcing.
DLQ category: LateRecord
When a time-windowed aggregate sees a record whose event time falls
inside an already-closed window
(window_end + allowed_lateness < min_across_sources), the engine
routes the record to the DLQ instead of attempting to fold it into
a finalized accumulator. Mirrors Flink’s
sideOutputLateData
and Spark Structured Streaming’s late-data drop.
The DLQ row carries:
_cxl_dlq_error_category=late_record_cxl_dlq_stage=time_window:<aggregate-name>_cxl_dlq_error_detail— the closed window’s[start, end)bounds as i64 nanoseconds since the Unix epoch
Tune watermark.delay (source-side, applies before any aggregate)
or allowed_lateness (operator-side, applies per aggregate) to
absorb expected out-of-order tails before they reach this path.
Scheduler integration
Cron script
#!/bin/bash
set -euo pipefail
PIPELINE=/opt/clinker/pipelines/daily_etl.yaml
METRICS_DIR=/var/spool/clinker/
clinker run "$PIPELINE" \
--memory-limit 512M \
--log-level warn \
--metrics-spool-dir "$METRICS_DIR" \
--force
EXIT=$?
case $EXIT in
0)
echo "$(date): Success" >> /var/log/clinker/daily_etl.log
;;
2)
echo "$(date): Warning - DLQ entries produced" >> /var/log/clinker/daily_etl.log
mail -s "Clinker ETL Warning: DLQ entries" ops@company.com < /dev/null
;;
*)
echo "$(date): FAILURE (exit code $EXIT)" >> /var/log/clinker/daily_etl.log
mail -s "Clinker ETL FAILURE (exit $EXIT)" ops@company.com < /dev/null
;;
esac
exit $EXIT
CI pipeline (GitHub Actions)
- name: Run ETL pipeline
run: clinker run pipeline.yaml --dry-run
# Exit code 1 fails the build on config errors
- name: Smoke test with real data
run: clinker run pipeline.yaml --dry-run -n 100
# Catches runtime evaluation errors
Systemd
Systemd Type=oneshot services interpret non-zero exit codes as failures. To allow exit code 2 (partial success) without triggering service failure:
[Service]
Type=oneshot
SuccessExitStatus=2
ExecStart=/opt/clinker/bin/clinker run /opt/clinker/pipelines/daily_etl.yaml --force