Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Network Sources (REST)

A Source reads from the filesystem by default. To pull records from a network endpoint instead, declare a transport: block on the Source. The transport selects where records come from; it sits above the on-disk type: (the format), which for a REST source still selects how the response bodies decode.

A network transport is a finite-pull source: it runs on its own thread, drives a synchronous client to cursor exhaustion, then exits. There is no daemon, no event loop, and no async runtime — the same single- process, run-to-drain model as a file pipeline. Finiteness is a hard property of the reader: a REST source caps its pull with an explicit page/record limit, so an unbounded endpoint cannot keep it running forever.

A network source still requires a schema: block. That authored schema is the row-to-record target: the reader maps each decoded object onto it, coercing values leniently. A per-row value that cannot coerce is left unchanged at the reader and routed to the dead-letter queue at the Transform stage — identical to file-source semantics. A network source declares no file matcher (path / glob / regex / paths); declaring one is a configuration error (E219).

Because a network source has no file path, its $source.file provenance column and the {source_file} output template both resolve to a stable synthetic identifier, <source:NAME>, where NAME is the Source node’s name.

REST sources

A rest source issues paginated HTTP GETs against a base URL, decoding each response body through the declared json or xml format. (Other formats are rejected with E220 — a REST body is a multi-record document, not a flat CSV/fixed-width stream.)

nodes:
  - type: source
    name: orders_api
    config:
      name: orders_api
      type: json
      options:
        format: array        # each page body is a JSON array of objects
      transport:
        kind: rest
        url: https://api.example.com/v1/orders
        max_pages: 50         # HARD page cap — required
        pagination:
          strategy: link_header
        auth:
          scheme: bearer
          token: "${ORDERS_TOKEN}"
      schema:
        - { name: order_id, type: int }
        - { name: total,    type: float }
        - { name: placed_at, type: date_time }

Pagination strategies

The pagination.strategy selects how the reader advances pages and detects the last one. Whatever the strategy, the pull always stops at the max_pages / max_records cap, even when the server keeps offering more.

  • none (default) — a single GET; the body is the whole result.

  • offset?offset=N&limit=L, advancing the offset by the page size each request. The last page is the one that returns fewer rows than limit.

    pagination:
      strategy: offset
      limit: 200
      offset_param: offset     # optional, defaults shown
      limit_param: limit
    
  • cursor_token — the reader reads a continuation token from a JSON pointer in each response and sends it back on the next request. Paging stops when the token field is absent or null.

    pagination:
      strategy: cursor_token
      cursor_param: page_token
      next_token_pointer: /meta/next_page   # RFC 6901 JSON pointer
    
  • link_header — the reader follows the URL in the response’s RFC 5988 Link: <…>; rel="next" header until no such link is present.

    pagination:
      strategy: link_header
    

Authentication

auth.scheme selects the credential sent on every request:

  • none (default) — no auth header.

  • bearer — sends Authorization: Bearer <token>.

  • header — sends an arbitrary static header, e.g. an API key.

    auth:
      scheme: header
      name: X-API-Key
      value: "${API_KEY}"
    

Reliability and finiteness knobs

KeyDefaultMeaning
max_pagesRequired. Hard ceiling on pages fetched, regardless of the server.
max_recordsnoneOptional hard ceiling on records emitted.
retries3Bounded retries on a transient failure (5xx, connect/timeout error). A 4xx is fatal — retrying cannot help.
timeout_secs30Per-request timeout. Bounds in-flight time so an interrupt lands within the shutdown window.

A partial-page decode failure routes that page’s offending rows to the DLQ per-row, exactly like a file source; it does not abort the pull.

Shutdown

On SIGINT/SIGTERM the reader polls its cancellation handle at each page boundary and stops cleanly with a normal end-of-input — the same graceful drain a file source performs. The timeout_secs per-request bound caps how long a single in-flight request can delay that stop.