Network Sources (REST)
A Source reads from the filesystem by default. To pull records from a
network endpoint instead, declare a transport: block on the Source. The
transport selects where records come from; it sits above the on-disk
type: (the format), which for a REST source still selects how the
response bodies decode.
A network transport is a finite-pull source: it runs on its own thread, drives a synchronous client to cursor exhaustion, then exits. There is no daemon, no event loop, and no async runtime — the same single- process, run-to-drain model as a file pipeline. Finiteness is a hard property of the reader: a REST source caps its pull with an explicit page/record limit, so an unbounded endpoint cannot keep it running forever.
A network source still requires a schema: block. That authored
schema is the row-to-record target: the reader maps each decoded object
onto it, coercing values leniently. A per-row value that cannot coerce is
left unchanged at the reader and routed to the dead-letter queue at the
Transform stage — identical to file-source semantics. A network source
declares no file matcher (path / glob / regex / paths);
declaring one is a configuration error (E219).
Because a network source has no file path, its $source.file provenance
column and the {source_file} output template both resolve to a stable
synthetic identifier, <source:NAME>, where NAME is the Source node’s
name.
REST sources
A rest source issues paginated HTTP GETs against a base URL, decoding
each response body through the declared json or xml format. (Other
formats are rejected with E220 — a REST body is a multi-record
document, not a flat CSV/fixed-width stream.)
nodes:
- type: source
name: orders_api
config:
name: orders_api
type: json
options:
format: array # each page body is a JSON array of objects
transport:
kind: rest
url: https://api.example.com/v1/orders
max_pages: 50 # HARD page cap — required
pagination:
strategy: link_header
auth:
scheme: bearer
token: "${ORDERS_TOKEN}"
schema:
- { name: order_id, type: int }
- { name: total, type: float }
- { name: placed_at, type: date_time }
Pagination strategies
The pagination.strategy selects how the reader advances pages and
detects the last one. Whatever the strategy, the pull always stops at the
max_pages / max_records cap, even when the server keeps offering more.
-
none(default) — a singleGET; the body is the whole result. -
offset—?offset=N&limit=L, advancing the offset by the page size each request. The last page is the one that returns fewer rows thanlimit.pagination: strategy: offset limit: 200 offset_param: offset # optional, defaults shown limit_param: limit -
cursor_token— the reader reads a continuation token from a JSON pointer in each response and sends it back on the next request. Paging stops when the token field is absent or null.pagination: strategy: cursor_token cursor_param: page_token next_token_pointer: /meta/next_page # RFC 6901 JSON pointer -
link_header— the reader follows the URL in the response’s RFC 5988Link: <…>; rel="next"header until no such link is present.pagination: strategy: link_header
Authentication
auth.scheme selects the credential sent on every request:
-
none(default) — no auth header. -
bearer— sendsAuthorization: Bearer <token>. -
header— sends an arbitrary static header, e.g. an API key.auth: scheme: header name: X-API-Key value: "${API_KEY}"
Reliability and finiteness knobs
| Key | Default | Meaning |
|---|---|---|
max_pages | — | Required. Hard ceiling on pages fetched, regardless of the server. |
max_records | none | Optional hard ceiling on records emitted. |
retries | 3 | Bounded retries on a transient failure (5xx, connect/timeout error). A 4xx is fatal — retrying cannot help. |
timeout_secs | 30 | Per-request timeout. Bounds in-flight time so an interrupt lands within the shutdown window. |
A partial-page decode failure routes that page’s offending rows to the DLQ per-row, exactly like a file source; it does not abort the pull.
Shutdown
On SIGINT/SIGTERM the reader polls its cancellation handle at each
page boundary and stops cleanly with a normal end-of-input — the same
graceful drain a file source performs. The timeout_secs per-request
bound caps how long a single in-flight request can delay that stop.