
AI Worker Architecture

A good worker architecture makes the happy path easy and the failure path explicit. It treats runtime limits and visibility as first-class features, not afterthoughts.

Most production failures are not model failures. They are timeouts, rate limits, partial outputs, tool errors, missing context, and retries that create duplicates. If you are implementing retry logic, treat it as a deliberate, testable pattern.

This page pairs with the worker protocol (a stable request/response contract), and the architecture often benefits from a dedicated validator worker that runs before any side effects.

Key ideas#

  • Schema in, schema out: validate at the boundary (and re-validate before publishing).
  • Hard timeouts and budget caps prevent unbounded execution.
  • Retries must be safe: idempotency keys and backoff strategies.
  • Isolation boundaries reduce blast radius: network, filesystem, and tool permissions.
  • Observability is structured: logs/metrics/traces with stable field names.

Diagram#

+--------------------------------------+
| Worker runtime                       |
|                                      |
|  request -> validate schema          |
|          -> enforce constraints      |
|          -> execute (model/tools)    |
|          -> validate outputs         |
|          -> emit logs/metrics/traces |
|          -> response                 |
+--------------------------------------+
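The pipeline in the diagram can be sketched as a single function that threads a request through hooks. This is a minimal illustration, not a definitive implementation; the hook names (`validate_in`, `execute`, `validate_out`, `log`) are hypothetical.

```python
def run_worker(request, validate_in, execute, validate_out, log):
    """Sketch of the runtime pipeline: validate, execute, re-validate, emit."""
    req = validate_in(request)            # request -> validate schema
    result = execute(req)                 # enforce constraints + execute (model/tools)
    resp = validate_out(result)           # validate outputs before publishing
    log({"status": resp.get("status")})   # emit logs/metrics/traces
    return resp                           # -> response
```

Keeping each stage a separate hook makes the happy path obvious and lets each stage fail with its own stable error code.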

Contract Boundary#

A worker boundary should be strict: validate inputs before execution, and validate outputs before publishing them downstream.

When contracts are explicit, failures become routable: the orchestrator can retry, fall back, or escalate based on stable status and error codes.
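Routable failures can be as simple as a mapping from error code to orchestrator action. The error codes below (`timeout`, `rate_limited`, `schema_mismatch`) are hypothetical examples of a stable code set, not a prescribed vocabulary.

```python
RETRYABLE = {"timeout", "rate_limited"}  # hypothetical transient error codes

def route_failure(error_code: str) -> str:
    """Decide what the orchestrator does with a failed worker response."""
    if error_code in RETRYABLE:
        return "retry"          # transient: safe to try again
    if error_code == "schema_mismatch":
        return "escalate"       # contract violation: a human or fix is needed
    return "fallback"           # unknown/permanent: use a degraded path
```

Because the codes are stable, this routing table can be tested independently of any worker.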

Constraints#

Constraints are hard limits that keep execution bounded.

  • Timeout: hard ceiling for end-to-end execution.
  • Budget: caps on tokens, compute, or external calls.
  • Tools: an allow-list (and ideally per-tool quotas).
  • Rate limits: per-tenant or per-user throttles.
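Timeouts and budgets compose naturally into one object that every tool call checks. A minimal sketch, assuming a wall-clock deadline and a tool-call cap (the class and method names are illustrative):

```python
import time

class Budget:
    """Hard limits for one execution; raises as soon as any cap is exceeded."""

    def __init__(self, deadline_s: float, max_tool_calls: int):
        self.deadline = time.monotonic() + deadline_s
        self.tool_calls_left = max_tool_calls

    def check_time(self):
        if time.monotonic() > self.deadline:
            raise TimeoutError("execution deadline exceeded")

    def spend_tool_call(self):
        """Charge one tool call against the budget before making it."""
        self.check_time()
        if self.tool_calls_left <= 0:
            raise RuntimeError("tool-call budget exhausted")
        self.tool_calls_left -= 1
```

Charging the budget *before* the call keeps execution bounded even when a tool hangs or loops.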

Retries and Idempotency#

Retries are not optional in distributed systems. The question is whether they are safe.

  • Use an idempotency key so duplicates do not cause duplicate side effects.
  • Use exponential backoff (with jitter) for transient failures.
  • Separate retryable failures (timeouts, rate limits) from permanent failures (schema mismatch).
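The three bullets above can be combined into one retry wrapper: exponential backoff with full jitter, an idempotency key threaded through to the operation, and an explicit split between retryable and permanent failures. Which exception types count as transient is an assumption here (`TimeoutError` and `ConnectionError`); adjust for your stack.

```python
import random
import time

def call_with_retries(op, idempotency_key, max_attempts=4, base_delay=0.5):
    """Retry transient failures with exponential backoff and full jitter.

    The idempotency key is passed to `op` so downstream writes can
    deduplicate; permanent failures (any other exception) propagate
    immediately instead of being retried.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op(idempotency_key)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # budget exhausted: surface the transient failure
            # full jitter: sleep a random amount up to the exponential cap
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Full jitter spreads retries out so a burst of failing workers does not hammer the same dependency in lockstep.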

Isolation Boundaries#

Isolation reduces blast radius. Treat a worker like untrusted code: restrict what it can read, write, and call.

  • Network egress controls and domain allow-lists.
  • Filesystem isolation or read-only mounts.
  • Least-privilege credentials for external services.
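An egress allow-list is the simplest of these controls to sketch in code. Real deployments enforce this at the network layer (proxy, firewall, sandbox policy); the in-process check below is illustrative, and the domains are hypothetical.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"api.example.com", "storage.example.com"}  # hypothetical allow-list

def check_egress(url: str) -> str:
    """Reject any outbound call whose host is not explicitly allowed."""
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_DOMAINS:
        raise PermissionError(f"egress blocked: {host!r}")
    return url
```

An allow-list (rather than a block-list) means a new dependency must be added deliberately, which keeps the blast radius of a misbehaving worker small.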

Artifacts and Storage#

Large outputs should be stored out-of-band as artifacts (files, blobs, or records), with the worker returning references (IDs or URLs).

This keeps contracts small, makes retries safer, and avoids passing large payloads between workers.
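One way to sketch this is a content-addressed store: the worker writes bytes, gets back a short reference, and puts only the reference in its response. The in-memory store below is a stand-in for a real blob store, and the `artifact://` reference scheme is an assumption.

```python
import hashlib

class ArtifactStore:
    """In-memory stand-in for an out-of-band blob store."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store a blob and return a small, stable reference."""
        ref = "artifact://" + hashlib.sha256(data).hexdigest()[:16]
        self._blobs[ref] = data
        return ref

    def get(self, ref: str) -> bytes:
        return self._blobs[ref]
```

Content-addressed references are also retry-friendly: re-storing the same bytes yields the same reference, so a retried worker cannot create a divergent duplicate.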

Observability#

Make observability part of the interface.

  • Trace fields: trace_id, span_id, parent_span_id.
  • Logs: structured fields like worker_id, attempt, duration_ms, status, error_code.
  • Metrics: success/failure counts, retry counts, and latency percentiles.
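A structured log event with the stable field names above might look like the sketch below; emitting one JSON object per line (rather than free-form text) is the assumption that makes the fields machine-queryable.

```python
import json
import uuid

def log_event(worker_id, attempt, status, duration_ms, trace_id=None, error_code=None):
    """Emit one structured log line using stable field names."""
    record = {
        "trace_id": trace_id or uuid.uuid4().hex,
        "worker_id": worker_id,
        "attempt": attempt,
        "duration_ms": duration_ms,
        "status": status,
    }
    if error_code is not None:
        record["error_code"] = error_code
    print(json.dumps(record, sort_keys=True))  # one JSON object per line
    return record
```

Because the field names never change, dashboards and alerts built on them survive refactors of the worker itself.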

See also#

New to the terminology? See the AI Worker Glossary for definitions of idempotency, retries, orchestration, contracts, and observability.

FAQ#

Where should schema validation happen?

At the boundary: before execution for inputs, and before returning/publishing for outputs.

How do I prevent duplicate work on retries?

Use idempotency keys and ensure downstream writes are idempotent or transactional.

What observability fields matter most?

At minimum: trace_id, worker_id, attempt, duration_ms, status, and a stable error_code for failures.