
Web Scrape Worker

Problem: Fetch a web page, extract specific fields, and return normalized JSON.

This example follows the core principles described in the AI Worker Design Patterns and uses the standard Worker Protocol schema.

Key ideas

  • Keep the worker single-purpose and explicit about inputs and outputs.
  • Put hard limits in the contract (timeout, retries, tools allowed).
  • Make failures machine-actionable with stable error codes.
  • Emit structured signals so orchestrators can route, retry, or escalate.
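These ideas can be made concrete as a small result envelope. This is a sketch, not part of the Worker Protocol: the status values and the RETRYABLE_CODES set are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical set of transient error codes an orchestrator may retry.
RETRYABLE_CODES = {"fetch_timeout", "rate_limited"}

@dataclass
class WorkerResult:
    status: str                          # "ok", "partial", or "error"
    data: dict = field(default_factory=dict)
    error_code: Optional[str] = None     # stable, machine-readable code
    warnings: list = field(default_factory=list)

    def is_retryable(self) -> bool:
        # Only known-transient codes trigger an automatic retry.
        return self.status == "error" and self.error_code in RETRYABLE_CODES
```

Because error_code is a stable string rather than free-form prose, the retry decision is a set lookup instead of log parsing.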

Diagram

url -> fetch -> parse -> extract -> data
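The diagram's stages can be sketched as one function. Here fetch_url and parse_html are injected stand-ins for the worker's two allowed tools, and the parse result is assumed to behave like a mapping from selector to extracted value:

```python
def run_pipeline(url, extraction_spec, fetch_url, parse_html):
    """Sketch of url -> fetch -> parse -> extract -> data."""
    html = fetch_url(url)                  # fetch: raw page body
    doc = parse_html(html)                 # parse: selector -> value mapping
    data = {name: doc.get(selector)        # extract: fields per the spec
            for name, selector in extraction_spec.items()}
    return {"data": data, "source": {"url": url}, "warnings": []}
```

Keeping the stages as plain function calls makes each one individually testable with stubs.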

Worker spec

worker_id: web-scrape-worker
version: 1.0
purpose: Fetch a web page, extract specific fields, and return normalized JSON.
inputs:
  - url: string
  - extraction_spec: object
  - user_agent: string (optional)
outputs:
  - data: object
  - source: object
  - warnings: array
constraints:
  timeout_seconds: 60
  max_tokens: 1500
  tools_allowed: [fetch_url, parse_html]
retries:
  max_attempts: 2
  backoff: exponential
observability:
  trace_id: required
  log_fields: [worker_id, attempt, duration_ms]

Input schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "url": {
      "type": "string"
    },
    "extraction_spec": {
      "type": "object"
    },
    "user_agent": {
      "type": "string"
    }
  },
  "required": [
    "url",
    "extraction_spec"
  ]
}
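In production a JSON Schema validator library would enforce this schema directly. As a sketch of what the schema's rules mean, the same checks can be hand-rolled for this one schema (the error-message wording is an assumption):

```python
ALLOWED = {"url": str, "extraction_spec": dict, "user_agent": str}
REQUIRED = {"url", "extraction_spec"}

def validate_input(payload):
    """Return a list of validation errors; an empty list means valid."""
    errors = ["missing required field: " + f
              for f in sorted(REQUIRED - payload.keys())]
    for key, value in payload.items():
        if key not in ALLOWED:
            errors.append("unexpected field: " + key)   # additionalProperties: false
        elif not isinstance(value, ALLOWED[key]):
            errors.append("wrong type for field: " + key)
    return errors
```

Rejecting unknown fields up front (the additionalProperties: false rule) catches caller typos like user_agnet before the worker runs.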

Output schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "additionalProperties": true,
  "properties": {
    "data": {
      "type": "object"
    },
    "source": {
      "type": "object"
    },
    "warnings": {
      "type": "array"
    }
  }
}
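An illustrative response conforming to this schema (all values are made up):

{
  "data": {"title": "Example Domain"},
  "source": {"url": "https://example.com", "status_code": 200},
  "warnings": ["selector '.byline' matched no elements"]
}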

Constraints

{
  "timeout_seconds": 60,
  "max_tokens": 1500,
  "retries": {
    "max_attempts": 2,
    "backoff": "exponential"
  },
  "rate_limit": "per-tenant (example: 10/min)",
  "tools_allowed": [
    "fetch_url",
    "parse_html"
  ]
}
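The retry policy above (max_attempts: 2, exponential backoff) can be sketched as a wrapper; the 1-second base delay is an assumption, not part of the contract:

```python
import time

def with_retries(fn, max_attempts=2, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying on exception with exponential backoff (1s, 2s, ...)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise                            # out of attempts: surface the error
            sleep(base_delay * 2 ** (attempt - 1))
```

Injecting sleep keeps the backoff schedule testable without real waits.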

Failure modes & handling

  • Robots or access denied: error_code=access_denied (non-retryable unless policy allows).
  • Timeout fetching: error_code=fetch_timeout (retryable with backoff).
  • HTML structure changed: return a partial extraction, listing the missing fields in warnings.
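These handling rules can be expressed as a routing table an orchestrator acts on. The table and the action names ("fail", "retry", "escalate") are hypothetical; the error codes match the failure modes above:

```python
# Hypothetical policy mapping stable error codes to an orchestrator action.
POLICY = {
    "access_denied": "fail",     # non-retryable unless policy allows
    "fetch_timeout": "retry",    # retryable with backoff
}

def route(error_code, allow_denied_retry=False):
    if error_code == "access_denied" and allow_denied_retry:
        return "retry"           # the "unless policy allows" escape hatch
    return POLICY.get(error_code, "escalate")   # unknown codes escalate
```

Defaulting unknown codes to "escalate" means a new failure mode surfaces to a human instead of silently looping.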

Observability signals

  • logs: worker_id, attempt, duration_ms, status, error_code
  • metrics: success_count, failure_count, retry_count, p95_duration_ms
  • trace fields: trace_id, span_id, upstream_request_id (if present)
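A sketch of emitting one log record carrying the fields listed above; the JSON-lines format and the fallback to a generated trace_id are assumptions:

```python
import json
import time
import uuid

def log_event(worker_id, attempt, started_at, status,
              error_code=None, trace_id=None):
    """Build and print one structured log record (JSON lines)."""
    record = {
        "worker_id": worker_id,
        "attempt": attempt,
        "duration_ms": int((time.monotonic() - started_at) * 1000),
        "status": status,
        "error_code": error_code,
        "trace_id": trace_id or str(uuid.uuid4()),  # trace_id is required
    }
    print(json.dumps(record))
    return record
```

Emitting one JSON object per line lets log pipelines derive the metrics above (success_count, retry_count, p95_duration_ms) by aggregation rather than parsing.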

FAQ

Should the worker return partial results on failure?

If partial results are safe and useful, return them with a stable status and error_code. Otherwise fail fast and let orchestration decide.

Where should large artifacts go?

Store them externally (object storage or DB) and return a reference (URL or artifact ID) in the response.
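A sketch of that reference-returning pattern. Here put_object stands in for whatever object-storage client is in use, and the content-addressed ID scheme is an assumption:

```python
import hashlib

def store_artifact(blob, put_object):
    """Store large bytes externally; return only a reference in the response."""
    artifact_id = hashlib.sha256(blob).hexdigest()[:16]   # content-addressed ID
    uri = put_object(artifact_id, blob)                   # e.g. upload to a bucket
    return {"artifact_id": artifact_id, "uri": uri, "size_bytes": len(blob)}
```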

How should I choose timeouts?

Set a hard ceiling based on SLOs and queue backpressure. Prefer smaller workers with tighter timeouts over monolithic workers.
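One way to enforce such a hard ceiling is a thread-pool wrapper, sketched below. Note the caveat in the comment: a real worker would usually prefer cancellation at the I/O layer, since a thread that overruns the timeout keeps running in the background.

```python
import concurrent.futures

def run_with_timeout(fn, timeout_seconds=60):
    """Run fn with a hard wall-clock ceiling; raises TimeoutError on overrun.
    The overrunning thread is abandoned, not killed."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout_seconds)
    finally:
        pool.shutdown(wait=False)
```

The raised TimeoutError is where the worker would map to the stable fetch_timeout error code.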