Documentation
Run modes
- Site crawl — fetch known seed URLs and follow links within a scope (allowed domains, depth, max pages), extracting structured records as you go.
- Discovery — find new domains beyond your seeds using registries, search and the Common Crawl index, probe each for what you're after, and grow a de-duplicated catalog.
Fetching tiers
Each run starts at a tier and escalates only if a target blocks it: plain HTTP → stealth headers/sessions → rotating proxies → headless browser → managed bypass.
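The escalation rule above can be sketched as a loop over an ordered tier list. This is only an illustration under stated assumptions: the tier identifiers, `fetch_with_escalation`, and the `fetch` callable are hypothetical names, and the service's real escalation logic is not documented here.

```python
from typing import Callable, Optional, Tuple

# Ordered from cheapest to most expensive. These identifiers are illustrative
# labels for the tiers named above, not documented API values.
TIERS = ["http", "stealth", "proxy", "browser", "managed_bypass"]


def fetch_with_escalation(
    url: str,
    fetch: Callable[[str, str], Optional[str]],
    start_tier: str = "http",
) -> Tuple[str, Optional[str]]:
    """Try each tier from start_tier upward and stop at the first success.

    `fetch(url, tier)` is a hypothetical callable that returns the page body,
    or None when the target blocked that tier.
    """
    for tier in TIERS[TIERS.index(start_tier):]:
        body = fetch(url, tier)
        if body is not None:
            return tier, body
    # Every tier from start_tier upward was blocked.
    return TIERS[-1], None


# Demo with a fake fetcher that is blocked below the "proxy" tier.
def fake_fetch(url: str, tier: str) -> Optional[str]:
    return "<html>ok</html>" if TIERS.index(tier) >= 2 else None


result = fetch_with_escalation("https://example.com", fake_fetch)
```

Starting low and climbing only on a block keeps most fetches on the cheapest tier that works for a given target.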
Outputs
Deliver results by storing them, POSTing an HMAC-signed webhook, or pushing to object storage. Schedule re-scrapes and get notified when monitored content changes.
Goal webhooks fire the instant a goal is met — a new artifact of a record type is discovered (a new OpenAPI spec, an MSDS sheet) or a monitored page meaningfully changes (a price change). We POST the single matching item to your URL as soon as it's found, filtered by record type and spider. Configure them on the Goal webhooks page.
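A receiver for the HMAC-signed webhooks described above would typically recompute the signature over the raw request body and compare it in constant time. The sketch below assumes HMAC-SHA256 with a hex-encoded digest; the algorithm, the encoding, and the way the signature is delivered (for example, in a request header) are assumptions, so check the webhook settings for the actual scheme.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed signing scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature: str) -> bool:
    """Return True when the supplied signature matches the body.

    compare_digest runs in constant time, which avoids leaking how many
    leading characters of the signature were correct.
    """
    expected = sign_body(secret, body).encode("ascii")
    return hmac.compare_digest(expected, signature.encode("utf-8"))


# Illustrative values only; the payload shape is hypothetical.
secret = b"example-secret"
body = b'{"record_type": "openapi_spec", "url": "https://example.com/openapi.yaml"}'
good_signature = sign_body(secret, body)
```

Verify against the exact bytes received, before any JSON parsing, since re-serializing the payload can change whitespace and break the comparison.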
API
Submit scrapes and manage spiders and schedules over the REST API with an API key. An OpenAPI spec is published at /openapi.yaml, and a zero-dependency reference Python client is at /sdk/yakspider.py.
A synchronous POST /api/v1/scrape with a url accepts a format of html, text, or markdown; the page body comes back in result.content, with markdown giving clean, LLM-ready output. Try requests interactively and copy them as curl / Python / TypeScript in the API Player inside the app.
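A minimal sketch of the synchronous scrape call, using only the standard library. The `url` and `format` fields and the `result.content` path come from the description above; the base URL, the `Authorization: Bearer` header, and the helper names are assumptions, so consult /openapi.yaml for the authoritative request shape.

```python
import json
import urllib.request

ALLOWED_FORMATS = {"html", "text", "markdown"}


def build_scrape_request(base_url: str, api_key: str, url: str,
                         fmt: str = "markdown") -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/scrape request."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    payload = json.dumps({"url": url, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/scrape",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; the docs only say "with an API key".
            "Authorization": f"Bearer {api_key}",
        },
    )


def extract_content(response_body: bytes) -> str:
    """Pull the page body out of result.content."""
    return json.loads(response_body)["result"]["content"]


# Placeholder host and key, for illustration only.
req = build_scrape_request("https://yakspider.example", "YOUR_API_KEY",
                           "https://example.com")
```

To send the request, pass `req` to `urllib.request.urlopen` and hand the response bytes to `extract_content`.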
More endpoints: POST /api/v1/batch fetches up to 1,000 URLs in one async run (poll GET /api/v1/scrape/:id for results); POST /api/v1/screenshot renders a page and returns a base64 image; and a session parameter on a scrape uses a named Session cookie jar so a login persists across requests.
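Because a batch takes at most 1,000 URLs and results arrive asynchronously, a client usually needs to split long URL lists and then poll. The sketch below shows both steps. The `get_json` callable, the `status` field, and its `completed` / `failed` values are assumptions for illustration; the real response schema is defined in /openapi.yaml.

```python
import time
from typing import Callable, Iterator, List

MAX_BATCH_URLS = 1000  # per-run limit stated above


def chunk_urls(urls: List[str], size: int = MAX_BATCH_URLS) -> Iterator[List[str]]:
    """Yield consecutive slices that each fit in one batch run."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def poll_path(run_id: str) -> str:
    """Path to poll for a run's results."""
    return f"/api/v1/scrape/{run_id}"


def wait_for_run(get_json: Callable[[str], dict], run_id: str,
                 interval: float = 2.0, max_polls: int = 30) -> dict:
    """Poll until the run reports a terminal status (assumed field and values).

    `get_json(path)` is a hypothetical helper that performs the authenticated
    GET and returns the decoded JSON body.
    """
    for _ in range(max_polls):
        run = get_json(poll_path(run_id))
        if run.get("status") in ("completed", "failed"):
            return run
        time.sleep(interval)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")


urls = [f"https://example.com/page/{i}" for i in range(2500)]
batches = list(chunk_urls(urls))
```

With 2,500 URLs this produces three runs, each of which would be submitted to POST /api/v1/batch and polled separately.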
AI & agents
An MCP server is available at POST /mcp (JSON-RPC, same API key) exposing scrape_url, run_spider, batch_scrape, get_run, and search_catalog tools. Point Claude or any MCP client at it to drive the whole pipeline. In the app, the spider wizard can also draft a spider from a plain-English description when an LLM is configured.
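For a sense of what an MCP client sends to the endpoint above, here is a sketch of a JSON-RPC 2.0 `tools/call` message, the method the Model Context Protocol uses to invoke a tool. The tool name comes from the list above; the argument names (`url`, `format`) are assumptions, and a real client would first perform the MCP initialization handshake and discover each tool's input schema via `tools/list`.

```python
import json
from itertools import count

# JSON-RPC requires an id per request so responses can be matched to calls.
_request_ids = count(1)


def mcp_tool_call(tool: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 message that invokes one MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# Argument names here are hypothetical.
message = mcp_tool_call("scrape_url", {"url": "https://example.com",
                                       "format": "markdown"})
body = json.dumps(message)
```

In practice an MCP-aware client such as Claude builds these messages itself; the sketch is mainly useful for debugging the endpoint by hand.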
Two more AI assists run on the LLM gateway: semantic change alerts fire only when a monitored page's meaning actually moves (not on trivial markup churn), and self-healing selectors propose replacement CSS selectors when a rule stops matching. Approve each proposed selector on the Selector fixes page.
Free tools
No-signup utilities at /tools: a CSS selector tester, an HTML→Markdown converter, a request code generator, and /tools/echo (reflects your request headers as a target sees them).
Credits & plans
Usage is metered in credits, scaling with fetch difficulty (plain HTTP is cheapest; headless rendering and managed bypass cost more). You only pay for success — blocked and failed fetches cost zero credits. Each plan includes a monthly credit allowance, a concurrent-run limit, and the fetch tiers it unlocks; manage your plan on the Billing page and per-target politeness on the Rate limiter page.
Plan-limit responses carry a stable code and a doc link: 402 credit_exhausted / feature_inactive, 403 tier_not_allowed, and 429 concurrency_exceeded.
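Since the codes are stable, a client can branch on the status and code pair rather than on message text. The sketch below assumes the error body is JSON with `code` and `doc` fields; those field names, the `plan_limit_action` helper, and the suggested actions are illustrative assumptions, not documented behavior.

```python
from typing import Optional

# (HTTP status, code) pairs listed above, mapped to a suggested client action.
PLAN_LIMIT_ACTIONS = {
    (402, "credit_exhausted"): "add_credits",
    (402, "feature_inactive"): "change_plan",
    (403, "tier_not_allowed"): "change_plan",
    (429, "concurrency_exceeded"): "retry_later",
}


def plan_limit_action(status: int, body: dict) -> Optional[str]:
    """Return a suggested action, or None if this is not a plan-limit error."""
    return PLAN_LIMIT_ACTIONS.get((status, body.get("code")))


# Hypothetical response body for illustration.
example = {"code": "concurrency_exceeded", "doc": "https://example.com/docs"}
action = plan_limit_action(429, example)
```

Only `concurrency_exceeded` is treated as retryable here, on the reasoning that a slot frees up once an in-flight run finishes, whereas the other codes need a plan or credit change first.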