Documentation

Run modes

  • Site crawl — fetch known seed URLs and follow links within a scope (allowed domains, depth, max pages), extracting structured records as you go.
  • Discovery — find new domains beyond your seeds using registries, search and the Common Crawl index, probe each for what you're after, and grow a de-duplicated catalog.
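
A site crawl's scope limits can be pictured as a bounded breadth-first traversal. The sketch below is illustrative only: the function name, its parameters, and the `get_links` callable are hypothetical and are not taken from the product's API. It shows how allowed domains, depth, and max pages could interact.

```python
from collections import deque
from typing import Callable, Iterable, List, Set
from urllib.parse import urlparse


def crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], List[str]],
    allowed_domains: Set[str],
    max_depth: int,
    max_pages: int,
) -> List[str]:
    """Breadth-first crawl bounded by domain, link depth, and page count.

    `get_links(url)` is a hypothetical callable that returns the URLs
    linked from `url`; a real crawler would fetch and parse the page.
    """
    seen: Set[str] = set()
    visited: List[str] = []
    queue = deque((seed, 0) for seed in seeds)
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        # Skip duplicates and anything outside the allowed domains.
        if url in seen or urlparse(url).hostname not in allowed_domains:
            continue
        seen.add(url)
        visited.append(url)
        # Only follow links while still inside the depth limit.
        if depth < max_depth:
            queue.extend((link, depth + 1) for link in get_links(url))
    return visited
```

Breadth-first order means pages closest to the seeds are collected first, so a low max-pages limit still covers the top of the site before deeper links.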

Fetching tiers

Each run starts at a tier and escalates only if a target blocks it: plain HTTP → stealth headers/sessions → rotating proxies → headless browser → managed bypass.
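
The escalation order above can be sketched as a loop that moves up one tier whenever the current one is blocked. This is a rough illustration, not the service's implementation: the tier identifiers and the `fetch` callable are made-up names that only mirror the order given in the text.

```python
from typing import Callable, Optional, Tuple

# Tier order as described in the text; the identifiers are illustrative.
TIERS = [
    "plain_http",
    "stealth",
    "rotating_proxy",
    "headless_browser",
    "managed_bypass",
]


def fetch_with_escalation(
    url: str,
    fetch: Callable[[str, str], Optional[str]],
    start_tier: str = "plain_http",
) -> Tuple[str, str]:
    """Try each tier from `start_tier` upward; return (tier, body) on success.

    `fetch(url, tier)` is a hypothetical callable that returns the page
    body, or None when the target blocked the request at that tier.
    """
    for tier in TIERS[TIERS.index(start_tier):]:
        body = fetch(url, tier)
        if body is not None:
            return tier, body
    raise RuntimeError(f"all tiers blocked for {url}")
```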

Outputs

Deliver results by storing them, POSTing an HMAC-signed webhook, or pushing to object storage. Schedule re-scrapes and get notified when monitored content changes.
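
On the receiving side, an HMAC-signed webhook is usually checked by recomputing the signature over the raw request body. The sketch below assumes HMAC-SHA256 with a hex-encoded signature. Both are assumptions, as is the function name, so check your webhook settings for the actual algorithm, encoding, and header.

```python
import hashlib
import hmac


def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True if `signature_hex` is the HMAC-SHA256 of `body`.

    Assumes SHA-256 and hex encoding; neither is confirmed by the docs.
    `body` must be the raw bytes received, before any JSON parsing.
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids timing leaks.
    return hmac.compare_digest(expected, signature_hex)
```

Verifying against the raw bytes matters because re-serializing parsed JSON can change whitespace or key order and break the signature.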

Goal webhooks fire the instant a goal is met — a new artifact of a record type is discovered (a new OpenAPI spec, an MSDS sheet) or a monitored page meaningfully changes (a price change). We POST the single matching item to your URL as soon as it's found, filtered by record type and spider. Configure them on the Goal webhooks page.

API

Submit scrapes and manage spiders and schedules over the REST API with an API key. An OpenAPI spec is published at /openapi.yaml, and a zero-dependency reference Python client is at /sdk/yakspider.py.

A synchronous POST /api/v1/scrape takes a url and a format of html, text, or markdown. The page body comes back in result.content, and markdown gives clean, LLM-ready output. Try requests interactively and copy them as curl / Python / TypeScript in the API Player inside the app.
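
A synchronous scrape might look like the following sketch using only the standard library. The url and format fields and the result.content path come from the description above. The host is a placeholder and the Bearer authorization header is an assumption, so consult /openapi.yaml for the real authentication scheme.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not the real one


def build_scrape_request(url: str, fmt: str = "markdown",
                         api_key: str = "YOUR_API_KEY") -> urllib.request.Request:
    """Build (but do not send) a synchronous scrape request."""
    if fmt not in ("html", "text", "markdown"):
        raise ValueError(f"unsupported format: {fmt}")
    payload = json.dumps({"url": url, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/api/v1/scrape",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; the spec may use a different header.
            "Authorization": f"Bearer {api_key}",
        },
    )


def scrape(url: str, fmt: str = "markdown", api_key: str = "YOUR_API_KEY") -> str:
    """Send the request and return the page body from result.content."""
    request = build_scrape_request(url, fmt, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        data = json.load(response)
    return data["result"]["content"]
```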

More endpoints:

  • POST /api/v1/batch fetches up to 1,000 URLs in one async run; poll GET /api/v1/scrape/{id} for results.
  • POST /api/v1/screenshot renders a page and returns a base64 image.
  • A session parameter on a scrape uses a named Session cookie jar, so a login persists across requests.
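
For URL lists longer than the 1,000-URL batch limit, a client could split the list into chunks and poll each run until it finishes. The helpers below are a sketch under assumptions: the status field, its values, and the `get_run` callable are hypothetical stand-ins for whatever the run-status endpoint actually returns.

```python
import time
from typing import Callable, Iterator, List

MAX_BATCH = 1000  # per-request URL limit stated in the docs


def chunk_urls(urls: List[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def poll_until_done(
    get_run: Callable[[], dict],
    interval: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `get_run` until the run reports a terminal status.

    `get_run` is a hypothetical callable that fetches the run and returns
    decoded JSON. The "status" key and its values are assumptions.
    """
    for _ in range(max_attempts):
        run = get_run()
        if run.get("status") in ("completed", "failed"):
            return run
        sleep(interval)
    raise TimeoutError("run did not finish within the polling budget")
```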

AI & agents

An MCP server is available at POST /mcp (JSON-RPC, same API key) exposing scrape_url, run_spider, batch_scrape, get_run, and search_catalog tools — point Claude or any MCP client at it to drive the whole pipeline. In the app, the spider wizard can also draft a spider from a plain-English description when an LLM is configured.
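
MCP clients invoke tools with a JSON-RPC 2.0 request whose method is tools/call. The sketch below only builds that message body. The helper name is made up, and the argument names passed to a tool (such as url for scrape_url) are assumptions, so list the tools first to see their real input schemas. A full client also performs the MCP initialize handshake before calling tools.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC ids only need to be unique per session


def mcp_tool_call(name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# Example body for POST /mcp; the "url" argument name is an assumption.
body = mcp_tool_call("scrape_url", {"url": "https://example.com"})
```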

Two more AI assists run on the LLM gateway: semantic change alerts fire only when a monitored page's meaning actually moves (not on trivial markup churn), and self-healing selectors propose replacement CSS selectors when a rule stops matching. Approve each proposed selector on the Selector fixes page.

Free tools

No-signup utilities at /tools: a CSS selector tester, an HTML→Markdown converter, a request code generator, and /tools/echo (reflects your request headers as a target sees them).

Credits & plans

Usage is metered in credits, scaling with fetch difficulty (plain HTTP is cheapest; headless rendering and managed bypass cost more). You only pay for success — blocked and failed fetches cost zero credits. Each plan includes a monthly credit allowance, a concurrent-run limit, and the fetch tiers it unlocks; manage your plan on the Billing page and per-target politeness on the Rate limiter page.

Plan-limit responses carry a stable code and a doc link: 402 with credit_exhausted or feature_inactive, 403 with tier_not_allowed, and 429 with concurrency_exceeded.
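
Because the codes are stable, a client can branch on them rather than on message text. The sketch below maps each status/code pair to a suggested reaction. The function name and the reaction labels are illustrative choices, not documented behavior, and how the code is carried in the response body is left out because it is not specified here.

```python
# Status/code pairs as listed in the text above.
PLAN_LIMIT_CODES = {
    402: {"credit_exhausted", "feature_inactive"},
    403: {"tier_not_allowed"},
    429: {"concurrency_exceeded"},
}


def classify_plan_limit(status: int, code: str) -> str:
    """Suggest a client reaction for a plan-limit response.

    Returns "retry_later" when waiting may help (a concurrency slot can
    free up), "needs_account_action" when the plan or credits must change,
    and "other" for anything that is not a known plan-limit pair.
    """
    if code not in PLAN_LIMIT_CODES.get(status, set()):
        return "other"
    if status == 429:
        return "retry_later"
    return "needs_account_action"
```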

Endpoint reference

Generated from the OpenAPI spec.

Method  Path                       Summary
POST    /api/v1/batch              Batch scrape: fetch many URLs in one async ad-hoc run (no saved spider)
GET     /api/v1/catalog            Query the discovered-artifact catalog
POST    /api/v1/preview            Dry-run a spider config against live conditions (no persistence)
POST    /api/v1/scrape             Submit a scrape (sync single-URL, or async by spider_id)
GET     /api/v1/scrape/{id}        Get a run's status and results
POST    /api/v1/screenshot         Render a URL with the headless browser and return a base64 image
GET     /api/v1/spiders            List spiders
POST    /api/v1/spiders            Create a spider
DELETE  /api/v1/spiders/{id}       Delete a spider
GET     /api/v1/spiders/{id}       Get a spider
PATCH   /api/v1/spiders/{id}       Update a spider
GET     /api/v1/spiders/{id}/runs  List a spider's runs