Metrics & Monitoring

Clinker writes per-execution metrics as JSON files to a spool directory. These files can be collected into an NDJSON archive for ingestion into monitoring systems.

Enabling metrics

There are three ways to enable metrics collection, listed from highest to lowest priority:

CLI flag:

clinker run pipeline.yaml --metrics-spool-dir ./metrics/

Environment variable:

export CLINKER_METRICS_SPOOL_DIR=./metrics/
clinker run pipeline.yaml

YAML config:

pipeline:
  metrics:
    spool_dir: "./metrics/"

When metrics are enabled, each execution writes one JSON file to the spool directory, named <execution_id>.json.
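As a sketch of what a downstream consumer could do with the spool directory (the helper name read_spool is illustrative, not part of Clinker), the per-execution JSON files can be loaded with a few lines of Python:

```python
import json
from pathlib import Path

def read_spool(spool_dir):
    """Yield one parsed metrics record per <execution_id>.json file
    in the spool directory, in sorted filename order."""
    for path in sorted(Path(spool_dir).glob("*.json")):
        yield json.loads(path.read_text())
```

Each yielded record is a plain dict following the schema described below.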

Metrics schema

Each metrics file follows schema version 1:

{
  "execution_id": "01912345-6789-7abc-def0-123456789abc",
  "schema_version": 1,
  "pipeline_name": "customer_etl",
  "config_path": "/opt/clinker/pipelines/daily_etl.yaml",
  "hostname": "prod-etl-01",
  "started_at": "2026-04-11T10:00:00Z",
  "finished_at": "2026-04-11T10:00:05Z",
  "duration_ms": 5000,
  "exit_code": 0,
  "records_total": 50000,
  "records_ok": 49950,
  "records_dlq": 50,
  "execution_mode": "streaming",
  "peak_rss_bytes": 134217728,
  "thread_count": 4,
  "input_files": ["./data/customers.csv"],
  "output_files": ["./output/enriched.csv"],
  "dlq_path": "./output/errors.csv",
  "error": null
}

Field reference

Field            Type          Description
execution_id     string        UUID v7 or custom --batch-id value
schema_version   integer       Always 1 for this release
pipeline_name    string        The name from the pipeline YAML
config_path      string        Absolute path to the config file
hostname         string        Machine hostname
started_at       string        ISO 8601 UTC timestamp
finished_at      string        ISO 8601 UTC timestamp
duration_ms      integer       Wall-clock duration in milliseconds
exit_code        integer       Process exit code (see Exit Codes)
records_total    integer       Total records read from all sources
records_ok       integer       Records that reached an output node
records_dlq      integer       Records routed to the dead-letter queue
execution_mode   string        streaming or batch
peak_rss_bytes   integer       Maximum resident set size during execution
thread_count     integer       Thread pool size used
input_files      array         Paths to all source files
output_files     array         Paths to all output files written
dlq_path         string/null   Path to the DLQ file, or null if none
error            string/null   Error message on failure, or null on success
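Consumers that want to sanity-check records before ingestion can validate against the field reference above. The validator below is a hypothetical consumer-side check built from this table; the function name and exact checks are illustrative, not part of Clinker:

```python
# Expected fields and types for a schema-version-1 metrics record
# (derived from the field reference table; assumption, not an official schema).
REQUIRED_FIELDS = {
    "execution_id": str, "schema_version": int, "pipeline_name": str,
    "config_path": str, "hostname": str, "started_at": str,
    "finished_at": str, "duration_ms": int, "exit_code": int,
    "records_total": int, "records_ok": int, "records_dlq": int,
    "execution_mode": str, "peak_rss_bytes": int, "thread_count": int,
    "input_files": list, "output_files": list,
}

def validate_record(rec):
    """Return a list of problems found in a metrics record (empty = OK)."""
    problems = []
    if rec.get("schema_version") != 1:
        problems.append("unexpected schema_version")
    for field, typ in REQUIRED_FIELDS.items():
        if field not in rec:
            problems.append(f"missing {field}")
        elif not isinstance(rec[field], typ):
            problems.append(f"{field} is not {typ.__name__}")
    # dlq_path and error are nullable: string or null
    for field in ("dlq_path", "error"):
        if rec.get(field) is not None and not isinstance(rec[field], str):
            problems.append(f"{field} must be string or null")
    return problems
```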

Collecting metrics

The spool directory accumulates one file per execution. Use clinker metrics collect to sweep them into an NDJSON archive:

clinker metrics collect \
  --spool-dir ./metrics/ \
  --output-file ./metrics/archive.ndjson \
  --delete-after-collect

This appends all spool files to the archive (one JSON object per line) and removes the originals. The NDJSON format is compatible with most log aggregation and monitoring tools.
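On a host without the Clinker CLI, the same sweep can be approximated in Python. The function below is an illustrative stand-in for clinker metrics collect, not the actual implementation; the skip-the-archive check assumes the archive may live inside the spool directory, as in the example above:

```python
import json
from pathlib import Path

def collect_metrics(spool_dir, archive_path, delete_after=False):
    """Append each spool JSON file to the NDJSON archive as a single line,
    then optionally delete the original (mirrors --delete-after-collect).
    Returns the number of records collected."""
    archive = Path(archive_path)
    count = 0
    with archive.open("a") as out:
        for path in sorted(Path(spool_dir).glob("*.json")):
            if path.resolve() == archive.resolve():
                continue  # never re-ingest the archive itself
            record = json.loads(path.read_text())
            out.write(json.dumps(record) + "\n")  # one JSON object per line
            if delete_after:
                path.unlink()
            count += 1
    return count
```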

Preview without writing:

clinker metrics collect \
  --spool-dir ./metrics/ \
  --output-file ./metrics/archive.ndjson \
  --dry-run

Integration with monitoring systems

Grafana / Prometheus

Parse the NDJSON archive with a log shipper (Promtail, Filebeat, Vector) and create dashboards tracking:

  • duration_ms – execution time trends
  • records_dlq – data quality over time
  • peak_rss_bytes – memory utilization

Datadog

Ship NDJSON to Datadog Logs, then create metrics from log attributes:

# Example: tail the archive and ship to Datadog
tail -f ./metrics/archive.ndjson | datadog-agent log-stream

ELK Stack

Filebeat can ingest NDJSON directly:

# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/clinker/metrics.ndjson
    json.keys_under_root: true

Simple alerting with jq

For environments without a full monitoring stack, use jq to query the archive directly:

# Find all runs with DLQ entries
jq 'select(.records_dlq > 0)' metrics/archive.ndjson

# Find runs that exceeded 400MB RSS
jq 'select(.peak_rss_bytes > 419430400)' metrics/archive.ndjson

# Average duration by pipeline
jq -s 'group_by(.pipeline_name) | map({
  pipeline: .[0].pipeline_name,
  avg_ms: (map(.duration_ms) | add / length)
})' metrics/archive.ndjson
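The same aggregation works without jq in a few lines of Python (the helper name is illustrative):

```python
import json
from collections import defaultdict

def avg_duration_by_pipeline(archive_path):
    """Compute mean duration_ms per pipeline_name across an NDJSON archive."""
    totals = defaultdict(lambda: [0, 0])  # pipeline -> [sum_ms, run_count]
    with open(archive_path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            rec = json.loads(line)
            t = totals[rec["pipeline_name"]]
            t[0] += rec["duration_ms"]
            t[1] += 1
    return {name: s / n for name, (s, n) in totals.items()}
```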

Operational recommendations

  • Always enable metrics in production. The overhead is negligible (one small JSON write at the end of each run).
  • Run metrics collect --delete-after-collect on a schedule (e.g., hourly) to prevent spool directory growth.
  • Use --batch-id with meaningful identifiers to correlate metrics across retries and environments.
  • Alert on records_dlq > 0 to catch data quality regressions early.
  • Track peak_rss_bytes trends to anticipate when memory limits need adjustment.
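As a minimal sketch of the records_dlq alert recommended above (the function name is illustrative; wire its output into whatever notifier you use):

```python
import json

def dlq_alerts(archive_path, threshold=0):
    """Return (execution_id, records_dlq) pairs for runs whose
    dead-letter count exceeds the threshold."""
    hits = []
    with open(archive_path) as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            if rec.get("records_dlq", 0) > threshold:
                hits.append((rec["execution_id"], rec["records_dlq"]))
    return hits
```

Run it from cron after each sweep and page on a non-empty result.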