Tool Specifications¶

These tools let your AI assistant search the web, read pages, find academic papers, track multi-step research, and more — always returning real, verifiable sources. Below are the detailed schemas and behavioral contracts for each tool.

Note: Output schemas describe the JSON shape returned by each tool. See the corresponding internal/tools/*.go file for the implementation. Input schemas are auto-generated from struct jsonschema tags.

Tool Registration Pattern¶

Each tool follows the pattern in internal/tools/registry.go: a typed input struct with jsonschema tags (the SDK auto-generates JSON Schema from these) and a register* function that calls mcp.AddTool. See internal/tools/search.go for a representative example.

Cache-Key Contract¶

Every result-affecting parameter MUST be included in a tool's cache key. This single rule prevents an entire class of cache-collision bugs — two requests that would produce different results must never share a key (e.g. different providers, or a smaller max_length that would serve a later larger request a truncated body).

Canonical implementations: searchCacheKey(...) (internal/tools/search.go) for the search family — keys on the tool name plus every query parameter including provider — and scrapeCacheKey(url, mode, maxLength) (internal/tools/scrape.go) for scrapes.
Each key carries a version segment (e.g. v2); bump it whenever the cached response shape changes so a post-upgrade cache hit can never serve a blob missing a newly-added field.
Enforcement: internal/tools/cachekey_test.go guards today's parameters. When you add a tool or a result-affecting parameter, extend both the key and that test — the test only covers the params it knows about, so a new param can reintroduce the bug without failing any existing assertion.

Large-Payload Linking (resource_link)¶

The heaviest tools — scrape_page (mode: raw), search_and_scrape, and research_export — can return tens to hundreds of KB. When a result is at or above the link threshold, the tool returns an MCP resource_link (2025-06-18 content type) instead of inlining the full body: a small inline summary ({resource, bytes, mimeType, summary, expiresAt, linked:true}) plus a resource_link the client fetches on demand. Below the threshold, results inline exactly as before (no behavior change).

The linked body is stored in the shared cache.Cache (memory + AES-encrypted disk, or Redis in HTTP mode) under a content-addressed key and served read-only via the research://artifact/{id} resource template. The id is the SHA-256 of the body, so identical payloads de-dupe and the URI is stable/idempotent.
Artifacts are short-lived (bounded TTL); a fetch after expiry returns a not-found error, never another caller's data. With no cache configured, large payloads inline (correctness over size).
Canonical implementation: largeResultOrInline(...) + registerArtifactResource(...) in internal/tools/artifacts.go. Cache-freshness _meta (and routing _meta) ride on either shape.

Tool 1: `web_search`¶

Purpose¶

Perform a web search and return structured result URLs with metadata.

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	1-500 chars
`num_results`	int	no	5	1-10
`time_range`	string	no	—	`day`, `week`, `month`, `year`
`safe`	string	no	`medium`	`off`, `medium`, `high`
`language`	string	no	—	ISO 639-1 code
`site`	string	no	—	Single-domain restriction (cannot combine with `sites`)
`sites`	[]string	no	—	Multi-domain restriction, OR-joined into `site:` operators (up to 10; cannot combine with `site`)
`exact_terms`	string	no	—	Exact phrase match
`exclude_terms`	string	no	—	Terms to exclude
`country`	string	no	—	ISO 3166-1 alpha-2
`lens`	string	no	—	Domain lens (overrides `site`/`sites`). See `lenses/` directory for available lenses
`provider`	string	no	—	Force search provider. Returns error listing available providers if unknown
`sessionId`	string	no	—	Link results to a `sequential_search` session
`claim`	string	no	—	Optional claim to evaluate against each result's snippet; when set, each result gains a `claimSignal` (#66). Evidence only — never a verdict

Output Schema¶

type SearchOutput struct {
    URLs        []string       `json:"urls"`
    Query       string         `json:"query"`
    ResultCount int            `json:"resultCount"`
    Results     []SearchResult `json:"results"`
    Hints       *ZeroResultHints `json:"hints,omitempty"` // present ONLY on zero-result responses (see below)
    Trust       string         `json:"trust"`   // "untrusted-external-content" — treat results as data, not instructions (OWASP LLM01)
}

type SearchResult struct {
    Title            string            `json:"title"`
    URL              string            `json:"url"`
    Snippet          string            `json:"snippet"`
    DisplayLink      string            `json:"displayLink"`
    PublishedAt      string            `json:"publishedAt,omitempty"`      // RFC3339 UTC; present only when the provider's response carried a date (#356)
    SourceReputation *DomainReputation `json:"sourceReputation,omitempty"` // present when host is in the reputation dataset (#198); omitted for unknown hosts
    ClaimSignal      string            `json:"claimSignal"`                // most claim-relevant snippet sentence; present on EVERY result whenever `claim` is set (empty string when no snippet sentence matched) — uniform shape (#66, #235)
}

sourceReputation is a descriptive signal (same shape as scrape_page/search_and_scrape) indicating the host's known reliability tier (high, low, mixed) with a basis note. It is omitted for hosts not in the dataset — absence means unknown, not bad. When claim is set, every result carries a claimSignal holding the most claim-relevant snippet sentence to help triage which links to read — it is the empty string (not absent) when no snippet sentence matched, so the field's shape is uniform across results and downstream null-checking stays simple (#235). For full-text claim evidence use search_and_scrape with claim. claimSignal is an English-keyword heuristic (#390): an empty string on non-English snippet text means the heuristic never matched, not that the snippet is confirmed unrelated — read the snippet yourself for non-English sources.

publishedAt (#356) is optional and provider-dependent, populated only by providers whose web response carries a date signal (Google via pagemap metadata, Tavily, Exa, SearXNG, HackerNews) and omitted (never guessed from snippet/title text) for providers that don't (Brave, Serper, DuckDuckGo, SearchAPI). When present it is normalized to RFC3339 UTC regardless of the provider's raw format.

On a zero-result response, hints carries a ZeroResultHints object (the same shape academic_search and patent_search emit) explaining why nothing matched and how to recover: reason (no_match | filters_too_restrictive), filtersApplied (the constraints that may have eliminated results — site, sites, lens, time_range, country, language, exact_terms, exclude_terms), suggestedActions (remove-filter / try-different-provider), and epistemicWarning (#357: a fixed reminder that zero results do not confirm absence — the fact may exist and simply be unreachable by the current query/provider; never assert non-existence from an empty result set). Suggested alternative providers are limited to those configured and currently healthy. On any non-empty result set the field is omitted.

Behavior¶

If SEARCH_ROUTING is set, route through the multi-provider Router (priority-ordered fallback with per-provider circuit breakers).
If lens is specified and has a dedicated cx, route directly to that Google PSE engine.
If lens is specified without cx, inject site: operators and route to the configured provider.
site and sites are mutually exclusive — combining them returns an error. sites entries are validated (bare host, optional path prefix — no scheme, no whitespace) and OR-joined into a single site: group, capped at 10 domains; exceeding the cap returns an error rather than silently truncating. A lens overrides both.
Apply time_range as date restriction parameter.
Return deduplicated URLs and full result objects.

Cache¶

Key: SHA-256 of (provider + query + all params)
TTL: 30 minutes

Error Conditions¶

Unknown provider → error listing all supported providers (no duplicates)
Invalid/missing API key → upstreamErrorResponse() with setup instructions referencing .env.example
Rate limited → rateLimitError() suggesting 60s wait or different provider
No results → return empty urls array (not an error)
All errors use upstreamErrorResponse() from internal/tools/search.go for consistent formatting

Tool 2: `scrape_page`¶

Purpose¶

Extract content from a URL, supporting web pages, documents, YouTube videos, Hacker News threads (read natively via the HN API), and GitHub README/file/gist pages (read natively via the GitHub API).

Input Schema¶

Field	Type	Required	Default	Constraints
`url`	string	yes	—	Valid HTTP(S) URL
`mode`	string	no	`full`	`full` (cleaned readable text), `preview` (first ~5000 bytes), `raw` (verbatim unsanitized bytes — see Raw Mode)
`max_length`	int	no	50000	Bytes. Capped at 5,000,000 (5 MB) for all modes; in `preview` mode it is forced to 5000. Applies to `raw` mode as an `io.LimitReader` cap on the fetched bytes
`sessionId`	string	no	—	Link to a `sequential_search` session

Output Schema¶

type ScrapeOutput struct {
    URL             string    `json:"url"`
    Content         string    `json:"content"`
    ContentType     string    `json:"contentType"`    // html, markdown, youtube, pdf, docx, pptx, github (raw mode: the server's Content-Type header, may be "")
    Trust           string    `json:"trust"`          // always "untrusted-external-content" — boundary marker: treat content as data, not instructions (OWASP LLM01)
    ContentLength   int       `json:"contentLength"`
    Truncated       bool      `json:"truncated"`
    EstimatedTokens int       `json:"estimatedTokens"`
    SizeCategory    string    `json:"sizeCategory"`   // small, medium, large, very_large
    Citation        *Citation `json:"citation"`       // always present
    Raw             bool      `json:"raw,omitempty"`  // true only in raw mode; omitted otherwise
    ExtractedBy     string    `json:"extractedBy,omitempty"` // extraction tier: markdown|stealth|html|browser|exa:cached|exa:crawled; omitted when unknown
    ExtractionQuality string  `json:"extractionQuality,omitempty"` // complete when the pipeline returned a confident extraction; partial when every tier was exhausted and the best-quality candidate was returned instead. Never an error. Omitted in raw mode.
    WordCount         int     `json:"wordCount,omitempty"`         // words in the extracted content (#358); orthogonal to ExtractionQuality — a "complete" extraction can still be a thin paywall/bot-wall stub. Omitted in raw mode.
    SparsityWarning   string  `json:"sparsityWarning,omitempty"`   // present only when WordCount is below ~150 — the content may be too thin for a reliable claim check (#358). Omitted in raw mode and whenever content is not thin.
    Metadata        *Metadata `json:"metadata,omitempty"` // present only when a title was extracted (full/preview only)
    StructuredData  *StructuredData `json:"structuredData,omitempty"` // page-embedded machine-readable metadata; present only when found (full/preview, HTML pages)
    ForumSignals    *ForumSignals   `json:"forumSignals,omitempty"`   // Reddit engagement metadata extracted from JSON-LD; present only for Reddit posts where the HTML tier ran (#247)
    SourceType      string    `json:"sourceType"`     // typed classification (#62): peer_reviewed|official_docs|government|news_publication|blog|forum|wiki|social_media|unknown
    AuthorityTier   string    `json:"authorityTier"`  // banded authority: high|medium|low
    DomainCategory  string    `json:"domainCategory"` // subject area: academic|legal|medical|financial|technical|general
    DetectedDOI      string            `json:"detectedDoi,omitempty"`      // a scholarly DOI the page declares (#199); peer-reviewed pages only; omitted when none
    RetractionStatus *RetractionStatus `json:"retractionStatus,omitempty"` // Crossref integrity status for detectedDoi; omitted when clean/unresolved — never a guess
}

type RetractionStatus struct {
    Retracted bool   `json:"retracted"`           // true for a formal retraction/withdrawal/removal
    Kind      string `json:"kind"`                // retraction | expression_of_concern | correction
    Date      string `json:"date,omitempty"`      // notice date (YYYY-MM-DD) when supplied
    NoticeDOI string `json:"noticeDoi,omitempty"` // DOI of the retraction/correction notice
    Source    string `json:"source,omitempty"`    // provenance: retraction-watch | publisher
}

type Metadata struct {
    Title  string `json:"title"`
    Author string `json:"author"`
}

type ForumSignals struct {
    Platform        string `json:"platform"`                  // Forum platform, e.g. "reddit"
    Upvotes         int    `json:"upvotes"`                   // Vote count from JSON-LD interactionStatistic
    Comments        int    `json:"comments"`                  // Comment count
    DatePublished   string `json:"datePublished,omitempty"`   // ISO 8601 publish date when available
    AuthorName      string `json:"authorName,omitempty"`      // Original poster name when available
    CredibilityNote string `json:"credibilityNote,omitempty"` // Contextual note, e.g. vote-manipulation risk when Upvotes < 20
}

type StructuredData struct {
    JSONLD    []json.RawMessage `json:"jsonLd,omitempty"`    // each <script type="application/ld+json"> block, verbatim
    OpenGraph map[string]string `json:"openGraph,omitempty"` // og:* and article:* meta, keys keep their prefix
    Citation  map[string]string `json:"citation,omitempty"`  // Highwire <meta name="citation_*"> tags
}

type Citation struct {
    URL          string           `json:"url"`
    AccessedDate string           `json:"accessedDate"`
    Metadata     CitationMetadata `json:"metadata"`
    Formatted    CitationFormats  `json:"formatted"`
}

type CitationMetadata struct {
    Title  string `json:"title"`
    Author string `json:"author"`
    Site   string `json:"site"`
    Date   string `json:"date"`
}

type CitationFormats struct {
    APA string `json:"apa"`
    MLA string `json:"mla"`
}

On a cache hit, the result also carries a top-level _meta block with cache-freshness provenance (cached: true, ageSeconds, maxAgeSeconds, freshness) — see Cache Freshness Provenance. Freshly fetched scrapes have no _meta.

In raw mode the output additionally carries "raw": true, and contentType is the server's real Content-Type header (it may be empty). No metadata block is emitted.

Tables in content (#48). HTML <table> elements are rendered as GitHub-flavored markdown pipe tables inside content (header row + --- separator + data rows), preserving row/column structure instead of flattening cells into disconnected fragments. Pipe characters in cells are escaped and multi-line cells are collapsed to a single row. Layout, malformed, single-column, and nested tables degrade gracefully to plain text — never an error, never a panic.

Forum engagement signals (#247). For Reddit posts where the HTML extraction tier ran, the response carries a forumSignals object: platform ("reddit"), upvotes (vote count from JSON-LD interactionStatistic), comments (comment count), datePublished (ISO 8601), authorName (original poster), and credibilityNote (set when upvotes < 20, noting vote-manipulation risk). The field is omitted entirely for non-Reddit URLs, raw mode, and any non-HTML extraction tier (markdown, browser, document, YouTube, Twitter) — never null, only present-or-absent.

Structured data (#46). When the page embeds machine-readable metadata, the response carries a structuredData object alongside content: jsonLd (each <script type="application/ld+json"> block, kept verbatim — invalid JSON is skipped, never failing the scrape), openGraph (og:*/article:* meta, keys keep their prefix), and citation (Highwire citation_* meta — DOI, authors, journal). The whole object is omitted when no such markup is present, and each sub-field is omitted when empty. It is produced by the HTML-extraction tiers only (absent for raw mode, PDFs, YouTube, and markdown-tier results), is independently size-bounded so a pathological page cannot blow the response budget, and is untrusted external data under the same trust boundary as content.

Extraction provenance (extractedBy). When known, the response names the tier that produced the content: markdown, stealth, html, browser, github:raw/github:contents-api/github:gist-api, or — for the paid Exa fallback — exa:cached / exa:crawled. It lets a caller see whether content came from a free local tier or the metered Exa /contents API (Tier 5, present only when EXA_API_KEY is set). Omitted when unknown (e.g. document/YouTube routes).

Content-volume signal (#358). wordCount is emitted for every non-raw scrape and is a cheap, deterministic proxy for how much prose was actually extracted — it is orthogonal to extractionQuality, which reflects pipeline tier exhaustion, not content volume: a complete extraction can still be a thin paywall/bot-wall stub (a few sentences behind a subscribe wall) if that stub was returned confidently by an early tier. When wordCount falls below ~150, sparsityWarning is added with a human-readable note that claim checks against this source may be unreliable. Both fields are omitted in raw mode. verify_citation, audit_bibliography, and search_and_scrape surface the same signal (see their sections below) so a caller can tell a thin stub from a genuinely well-supported claim check.

Typed source classification (#62). Every scrape response (full and raw) carries three categorical fields alongside the numeric content: sourceType (the kind of source — derived from Schema.org @type / Highwire citation_* meta when present, else a domain heuristic, else unknown), authorityTier (high/medium/low, a banding of the internal authority score), and domainCategory (academic/legal/medical/financial/technical/general, from a domain heuristic). They let the model hedge in natural language by source type. They are best-effort hints derived from untrusted page data — treat them as signals, not guarantees. (In raw mode, with no structured-data extraction, sourceType falls back to the host heuristic.)

Scholarly DOI + integrity status (#199). When a page classifies as peer_reviewed or sits on a known academic-journal host (the latter so detection still engages when an extraction tier strips the citation metadata, e.g. the cached-text fallback), the response surfaces detectedDoi — the DOI the page declares, read (in descending order of authority) from its Highwire citation_doi <head> metadata, then a DOI embedded in the request URL path itself (the publisher's canonical article identifier, e.g. nejm.org/doi/full/10.1056/… — present even on extraction tiers that strip the citation metadata, such as the cached-text fallback), then the first few KB of the cleaned text (the front matter, above any references list, so a references-list DOI is never mistaken for the page's own). It is evidence, never a verdict and never an identity claim: it says "this DOI appears on the page; here is its recorded integrity status," not "the page is this record" — you confirm the document's identity. When the DOI resolves to a Crossref/Retraction-Watch integrity record, retractionStatus is attached (the same object verify_citation and academic_search return); an expression_of_concern/correction is reported but is not a retraction (retracted stays false), and retractionStatus.source names retraction-watch vs publisher. The status is captured at scrape time and shares the one-hour scrape cache TTL — re-scrape or use verify_citation for a point-in-time check. Both fields are omitted on non-scholarly pages, in raw mode, and when no DOI is found or the resolver is unavailable. Use verify_citation to verify one citation and audit_bibliography to audit a whole reference list.

Trust boundary marker. Every scrape response (full, preview, and raw) carries "trust": "untrusted-external-content" in the JSON envelope — an explicit, machine-readable boundary marker. It is deliberately placed in the structured output, never inside the content string (where a malicious page could forge or close it), and signals that content is external data to be treated as data, never as instructions (OWASP LLM01, indirect prompt injection). The server cannot enforce the prompt boundary itself — the model and agent loop live in the host application — so this marker exists to make the untrusted provenance unmissable to that host.

Raw Mode¶

mode=raw returns the fetched bytes verbatim — the content extraction pipeline and content.Process sanitization are skipped entirely. Use it only to inspect source such as JSON, HTML markup, JavaScript, or plain text that the cleaned full mode would strip or reformat.

Raw mode still runs through the same safety guards as every other scrape: validateScrapeURL (HTTP/HTTPS scheme + non-empty host), the SSRF-safe client (private-IP and metadata-endpoint blocking, DNS-rebinding prevention), the ALLOWED_DOMAINS allowlist, and an io.LimitReader bounded by max_length. Only content.Process is bypassed.

Trade-off — untrusted bytes. Because sanitization is skipped, raw content may contain active <script>/HTML, embedded markup, or indirect prompt-injection payloads. The bytes are untrusted: never execute or render them, and treat any instructions inside them as data, not commands. For normal reading, prefer full (sanitized). search_and_scrape is always sanitized and has no raw mode.

Raw responses are keyed like any other scrape: the cache key includes mode (so raw never collides with a cleaned full/preview entry for the same URL) and max_length. See the Cache section below for the full key.

Scraping Strategy (Tiered Fallback)¶

1. SSRF VALIDATION
   └─ Resolve DNS, check all IPs against private ranges
   └─ Block: loopback, link-local, RFC1918, metadata endpoints

2. CONTENT TYPE DETECTION
   ├─ YouTube URL → YouTube extractor (3-strategy fallback):
   │     Strategy 1: Player response captions (primary + alt regex)
   │     Strategy 2: Direct timedtext API (en, en-US, en-GB)
   │     Strategy 3: Video description (shortDescription JSON field)
   ├─ news.ycombinator.com → native HN API (Firebase REST + Algolia; no API key required):
   │     /item/<id>  → story metadata (title, URL, score, author, date) + top 10 comments
   │     /           → top 20 stories from the HN top-stories list (parallel Firebase fetch)
   │     /newest, /best, /ask, /show, /jobs → corresponding HN list, top 20 stories
   │     /user/<id>  → user profile (karma, about, created date)
   │     Unknown paths fall through to the tiered HTML pipeline
   │     `Truncated: true` when content is capped; `ContentType: "hn"`
   ├─ github.com / gist.github.com → native GitHub content routing (raw CDN + REST API; GITHUB_TOKEN optional, raises the unauth rate limit):
   │     /{owner}/{repo}          → README (raw.githubusercontent.com/HEAD/README.md; falls back to the
   │                                 Contents API's /readme endpoint on 404, e.g. non-standard casing)
   │     /{owner}/{repo}/blob/{ref}/{path} → the exact file at that ref, fetched directly from the raw CDN
   │     gist.github.com/{id} or /{user}/{id} → gist content via the Gist API (each file's content concatenated)
   │     Reserved top-level paths (settings, topics, marketplace, issues, pulls, …) and any other
   │     shape (issues, PRs, wikis) fall through to the tiered HTML pipeline
   ├─ .pdf / application/pdf → PDF parser
   ├─ .docx / application/vnd.openxmlformats* → DOCX parser
   └─ .pptx / application/vnd.ms-powerpoint → PPTX parser

3. WEB PAGE EXTRACTION (4 free tiers, ordered by speed; + optional paid Exa tier last)
   a) Tier 1: MARKDOWN NEGOTIATION (fastest, ~200ms)
      ├─ Send GET with Accept: text/markdown
      ├─ 5-second timeout
      ├─ Verify response is actually markdown (heuristic check)
      └─ If content-type mismatch or too short → next tier

   b) Tier 2: STEALTH HTTP CLIENT (fast, ~300ms)
      ├─ Browser-like TLS fingerprint (TLS 1.2+, HTTP/2)
      ├─ Full Chrome 131 headers (User-Agent, Sec-Ch-Ua, Sec-Fetch-*)
      ├─ Parse with goquery (article > [role=main] > main > body)
      ├─ Remove: script, style, nav, footer, aside, ads, popups
      ├─ SSRF protection via safe dialer when AllowPrivateIPs=false
      └─ If below 100-char threshold → next tier

   c) Tier 3: HTML EXTRACTION via goquery (standard, ~500ms)
      ├─ Fetch page with standard Accept header
      ├─ Parse with goquery
      ├─ Extract: article > main > body (priority order)
      ├─ Remove: script, style, nav, footer, aside, ads
      ├─ Minimum content: 100 bytes, 10% meaningful text ratio
      └─ If below threshold → next tier

   d) Tier 4: HEADLESS BROWSER via go-rod + stealth (slow, ~5s)
      ├─ Browser pool with lazy init + singleton pattern
      ├─ go-rod/stealth plugin (navigator spoofing, WebGL masking)
      ├─ Used for: Known SPA domains, JS-rendered content, bot challenges
      ├─ Wait for: page stability (2s for SPA domains, 500ms otherwise) OR 30s timeout
      ├─ Extract: rendered DOM via JavaScript evaluation
      └─ Graceful cleanup via Pipeline.Close()

   e) Tier 5: EXA /contents (PAID, opt-in, last resort) — only when EXA_API_KEY is set
      ├─ Neural extractor: POST https://api.exa.ai/contents (x-api-key auth)
      ├─ Runs ONLY after every free tier above failed to extract >100 bytes,
      │  so the common path never incurs Exa cost
      ├─ Recovers bot-blocked / JS-heavy pages the local tiers cannot
      └─ Records provenance into extractedBy: "exa:cached" (served from Exa's
         cache) or "exa:crawled" (freshly fetched by Exa)

4. CONTENT PROCESSING
   ├─ Sanitize: strip hidden text, zero-width chars, dangerous patterns
   ├─ Truncate: at paragraph/sentence boundary if > max_length
   ├─ Estimate tokens: length / 4
   └─ Extract citation: from <meta> tags, URL, response headers

Known SPA Domains (require headless browser)¶

go.dev, pkg.go.dev
patents.google.com, scholar.google.com, news.google.com
trends.google.com, youtube.com
linkedin.com, facebook.com, instagram.com
medium.com, dev.to

Note: twitter.com and x.com are not in the SPA list — they use a dedicated FxTwitter API path, not the browser tier.

Cache¶

Key: SHA-256 of (url + mode + max_length) — max_length is part of the key so a larger request never serves a shorter cached body
TTL: 1 hour

Error Taxonomy (`internal/scraper/errors.go`)¶

All scrape errors are typed as ScrapeError{Kind, Message, Cause, URL, Tier}. The scrapeErrorResponse() function in internal/tools/scrape.go maps each kind to an actionable LLM-facing message:

ErrorKind	Retryable	Trigger	LLM Message (verbatim shape from `scrape.go`)
`ErrValidation`	no (permanent)	Unsupported scheme, empty host, SSRF / private-IP / blocked-hostname denial, domain allowlist denial	"URL rejected for {url}: {detail}. Provide a valid public http(s) URL."
`ErrNetwork`	yes	DNS failure, timeout, connection refused, TLS	"Network error on {url}: {detail}. Check connectivity."
`ErrBlocked`	no (remote refusal)	HTTP 403, remote bot detection / JS-wall interstitial	"Blocked: {url} uses bot detection. Try an alternative source — its content can't be read directly."
`ErrNotFound`	no (dead link)	HTTP 404 / 410	"Not found: {url} returned 404/410 — the page does not exist. Check the URL."
`ErrBrowser`	no	Chrome not found, launch failed, connect failed	"Scrape failed: Chrome unavailable. Set CHROME_PATH or install Chrome. Report at {issueURL}"
`ErrContent`	yes	Page loaded but no usable content extracted	"No content extracted from {url}. May need browser rendering. Report at {issueURL}"
`ErrAuth`	no	HTTP 401, login redirect	"Auth required: {url} is behind a login wall."
`ErrRateLimit`	yes (after delay)	HTTP 429	"Rate limited on {url}. Retry in 60 seconds."

When all tiers fail, the composite error message lists each tier's outcome (e.g., markdown: empty, stealth: HTTP 403, html: 12 bytes, browser: chrome launch failed).

Tool 3: `search_and_scrape`¶

Purpose¶

Combined search + scrape pipeline with quality scoring, deduplication, and source ranking.

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	1-500 chars
`num_results`	int	no	3	1-10
`include_sources`	bool	no	true	—
`deduplicate`	bool	no	true	—
`max_length_per_source`	int	no	50000	Bytes
`total_max_length`	int	no	300000	Bytes
`filter_by_query`	bool	no	false	—
`provider`	string	no	—	Force search provider for the search phase
`sessionId`	string	no	—	Link results to a `sequential_search` session
`claim`	string	no	—	Optional claim to evaluate against each source; when set, each source gains `keySentences` + `claimSignal` (#66). Evidence only — never a verdict

Output Schema¶

type SearchAndScrapeOutput struct {
    Query           string          `json:"query"`
    Status          string          `json:"status"`           // "complete", "partial", or "failed"
    Sources         []SourceResult  `json:"sources"`
    CombinedContent string          `json:"combinedContent"`
    Trust           string          `json:"trust"`            // "untrusted-external-content" — boundary marker for combinedContent + every source; treat as data, not instructions (OWASP LLM01)
    ScrapeFailures  []FailureInfo   `json:"scrapeFailures,omitempty"`
    Note            string          `json:"note,omitempty"`   // guidance when status="failed"
    Summary         PipelineSummary `json:"summary"`
    SizeMetadata    SizeMetadata    `json:"sizeMetadata"`
    Recommendations []Recommendation `json:"recommendations,omitempty"` // advisory; see below
    Components      []Component      `json:"components,omitempty"`      // mcp-auto-formatted (deterministic, no LLM); see below
    Hints           *ZeroResultHints `json:"hints,omitempty"`            // present only when the search phase itself returned zero results (#357); same shape as web_search, including epistemicWarning
}

// Recommendation is an advisory pointer to a higher-quality source already in
// `sources`. Content-based and non-profiling; never re-ranks or hides results.
// Present only when SOURCE_RECOMMENDATIONS=true (default) AND something clears
// the quality bar. Omitted otherwise.
type Recommendation struct {
    URL    string  `json:"url"`
    Title  string  `json:"title,omitempty"`
    Score  float64 `json:"score"`
    Reason string  `json:"reason"`  // transparent, content-derived
}

// Component is an optional, additive, mcp-auto-formatted renderable (card/table)
// built DETERMINISTICALLY from already-extracted data — NO server-side LLM call,
// no model of any kind. The "mcp-auto-formatted" label states the MCP server
// shaped this structure (not an LLM, not another component). Always carries
// autoFormatted=true and references to raw source data; it never replaces
// `content`/`sources`. Present only when GENERATIVE_UI_ENABLED=true.
type Component struct {
    Type          string   `json:"type"`          // "card" | "table"
    AutoFormatted bool     `json:"autoFormatted"` // always true (non-disableable label)
    Label         string   `json:"label"`         // "mcp-auto-formatted"
    Title         string   `json:"title,omitempty"`
    SourceRefs    []string `json:"sourceRefs,omitempty"` // URLs of the underlying raw data
    Card          *Card    `json:"card,omitempty"`
    Table         *Table   `json:"table,omitempty"`
}

type SourceResult struct {
    URL               string        `json:"url"`
    Title             string        `json:"title,omitempty"`
    PublishedAt       string        `json:"publishedAt,omitempty"`       // RFC3339 UTC, carried over from the discovery search result (#356); present only when that provider's response carried a date
    Content           string        `json:"content"`
    ContentType       string        `json:"contentType"`
    Trust             string        `json:"trust"`        // "untrusted-external-content" (see top-level Trust)
    Scores            *QualityScore `json:"scores,omitempty"`
    SourceType        string        `json:"sourceType,omitempty"`        // typed classification (#62): peer_reviewed|official_docs|government|news_publication|blog|forum|wiki|social_media|unknown
    AuthorityTier     string        `json:"authorityTier,omitempty"`     // high|medium|low
    DomainCategory    string        `json:"domainCategory,omitempty"`    // academic|legal|medical|financial|technical|general
    ExtractionQuality string        `json:"extractionQuality,omitempty"` // complete or partial — tier exhaustion, not content volume; see WordCount for that (#358)
    WordCount         int           `json:"wordCount,omitempty"`         // words extracted from this source (#358); below ~150 the source may be a paywall/bot-wall stub — see the summary's SparseSources count
    ClaimSignal       string        `json:"claimSignal,omitempty"`       // strongest claim-relevant sentence; present only when `claim` is set and matched (#66). English-keyword heuristic (#390): a miss on non-English source text means the heuristic didn't match, not that the source is confirmed unrelated
    KeySentences      []string      `json:"keySentences,omitempty"`      // top claim-relevant sentences in document order; present only with `claim`
}

type FailureInfo struct {
    URL             string `json:"url"`
    Kind            string `json:"kind,omitempty"`            // error category (blocked, auth_required, etc.)
    Reason          string `json:"reason"`
    Retryable       bool   `json:"retryable"`
    SuggestedAction string `json:"suggestedAction,omitempty"` // recovery hint
}

type QualityScore struct {
    Overall        float64 `json:"overall"`
    Relevance      float64 `json:"relevance"`
    Freshness      float64 `json:"freshness"`
    Authority      float64 `json:"authority"`
    ContentQuality float64 `json:"contentQuality"`
}

type PipelineSummary struct {
    URLsSearched     int `json:"urlsSearched"`
    URLsScraped      int `json:"urlsScraped"`
    URLsFailed       int `json:"urlsFailed"`
    SparseSources    int `json:"sparseSources"`    // sources with wordCount < ~150 (#358) — a paywall/bot-wall stub can still count toward URLsScraped; counted before `filter_by_query` removes any source, so it stays accurate even when filtering strips the thin ones
    ProcessingTimeMs int `json:"processingTimeMs"`
}

Behavior¶

Execute search (via configured provider)
Scrape all result URLs in parallel (bounded concurrency: 5)
If deduplicate: paragraph-level djb2 hashing, drop blocks whose hash exactly matches one already seen (exact-match dedup, not fuzzy similarity)
Score and rank sources by quality (weighted: relevance 35%, freshness 20%, authority 25%, content 20%)
If filter_by_query: extract keywords, remove sources below relevance threshold
Combine content, truncate to total_max_length
Return structured result with scores and metadata
Optionally append recommendations (advisory, content-based; SOURCE_RECOMMENDATIONS, default on) and components (mcp-auto-formatted renderables, deterministic — no LLM; GENERATIVE_UI_ENABLED, default off) — both derived purely from the quality scores already computed, with no extra scoring pass and no model call

Recommendations & components (additive)¶

recommendations surface the highest-quality related sources from the current result set using the transparent quality signals (authority, relevance, freshness, content). They are advisory only — sources ordering is never changed and the caller can ignore them. Strictly content-based: no user-behavior inputs, no profiling. Toggle with SOURCE_RECOMMENDATIONS (default true). Behavior-based/personalized ranking is explicitly out of scope.
components are optional renderable structures (source cards, a quality-comparison table) assembled deterministically from data already extracted — there is no server-side LLM call and no model of any kind. Every component is labelled autoFormatted: true / "mcp-auto-formatted" (stating the MCP server shaped it, not an LLM) and references the raw source URLs, so nothing is hidden or unverifiable. Off by default (GENERATIVE_UI_ENABLED=false); when off, output is byte-for-byte unchanged. The raw content/sources are always present regardless.

Cache¶

NOT cached as a whole (composed of cached sub-operations)
Individual search and scrape results are cached per their own TTLs

Tool 4: `image_search`¶

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	1-500 chars
`num_results`	int	no	5	1-200 (Brave up to 200; Google up to 10)
`size`	string	no	—	huge, icon, large, medium, small, xlarge, xxlarge. Google/SearchAPI only — Brave ignores it.
`type`	string	no	—	clipart, face, lineart, stock, photo, animated. Google/SearchAPI only — Brave ignores it.
`color_type`	string	no	—	color, gray, mono, trans. Google/SearchAPI only — Brave ignores it.
`dominant_color`	string	no	—	black, blue, brown, gray, green, orange, pink, purple, red, teal, white, yellow. Google/SearchAPI only — Brave ignores it.
`file_type`	string	no	—	jpg, gif, png, bmp, svg, webp. Google/SearchAPI only — Brave ignores it.
`safe`	string	no	`medium`	off, medium, high. On Brave images only `off` and `strict` apply (any non-`off` maps to `strict`).
`country`	string	no	—	ISO 3166-1 alpha-2 (e.g. `us`, `gb`). Honored by Brave and Google.
`language`	string	no	—	BCP 47 / 2-letter code (e.g. `en`, `de`). Honored by Brave (`search_lang`) and Google (`lr`).
`provider`	string	no	—	Force search provider

Output Schema¶

type ImageSearchOutput struct {
    Images      []ImageResult    `json:"images"`
    Query       string           `json:"query"`
    ResultCount int              `json:"resultCount"`
    Hints       *ZeroResultHints `json:"hints,omitempty"` // present ONLY on zero-result responses (#357); same shape as web_search, including epistemicWarning
    Trust       string           `json:"trust"`   // "untrusted-external-content"
}

type ImageResult struct {
    Title         string `json:"title"`
    Link          string `json:"link"`
    ThumbnailLink string `json:"thumbnailLink,omitempty"`
    DisplayLink   string `json:"displayLink"`
    ContextLink   string `json:"contextLink,omitempty"`
    Width         int    `json:"width,omitempty"`
    Height        int    `json:"height,omitempty"`
    FileSize      string `json:"fileSize,omitempty"` // optional, provider-dependent; omitted when the provider does not report it
}

Provider notes¶

size, type, color_type, dominant_color, and file_type are Google/SearchAPI-only filters — they are not documented Brave image params, so the Brave adapter never sends them (Brave would silently drop them). country, language, and safe are honored across providers. The size bucket is a hint the provider applies loosely — returned dimensions may not strictly match the requested bucket; use width/height to filter precisely when exact sizing matters.
fileSize, contextLink, width, and height are optional and provider-dependent — each is emitted only when the configured provider reports it and is omitted (never fabricated) otherwise. No currently-configured provider populates fileSize, so treat it as reserved/best-effort.

Cache¶

Key: SHA-256 of (query + all filter params)
TTL: 30 minutes

Tool 5: `news_search`¶

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	1-500 chars
`num_results`	int	no	5	1-50 (Brave up to 50; Google up to 10)
`time_range`	string	no	`week`	hour, day, week, month, year
`sort_by`	string	no	`relevance`	relevance, date. Google only — Brave news has no sort param and ignores it.
`news_source`	string	no	—	Domain filter (e.g. `reuters.com`). Google only — Brave news has no source filter and ignores it.
`country`	string	no	—	ISO 3166-1 alpha-2 (e.g. `us`, `gb`). Honored by Brave news.
`language`	string	no	—	BCP 47 / 2-letter code (e.g. `en`, `de`). Honored by Brave news (`search_lang`).
`safe`	string	no	—	SafeSearch level: off, moderate, strict. Honored by Brave news.
`provider`	string	no	—	Force search provider
`sessionId`	string	no	—	Link results to a `sequential_search` session

Output Schema¶

type NewsSearchOutput struct {
    Articles    []NewsArticle `json:"articles"`
    Query       string        `json:"query"`
    ResultCount int           `json:"resultCount"`
    Hints       *ZeroResultHints `json:"hints,omitempty"` // present ONLY on zero-result responses (see below)
    Trust       string        `json:"trust"`   // "untrusted-external-content"
}

type NewsArticle struct {
    Title       string `json:"title"`
    URL         string `json:"url"`
    Source      string `json:"source"`
    PublishedAt string `json:"publishedAt,omitempty"` // optional, provider-dependent; ISO-8601 (RFC3339 UTC) when present
    Snippet     string `json:"snippet"`
}

On a zero-result response, hints carries the same ZeroResultHints object as web_search/academic_search/patent_search, including its fixed epistemicWarning (#357 — see web_search above for the full field list). The active freshness window (default week) and any news_source are surfaced in filtersApplied, since an over-narrow recency window is the most common reason news returns nothing; suggested alternative providers are limited to those configured and healthy. Omitted on any non-empty result set.

Behavior¶

Route to configured search provider's news endpoint.
Apply time_range as date restriction.
If news_source specified, add as domain filter (Google only).
Sort by sort_by: relevance (default) uses the provider's native ranking; date requests newest-first ordering (Google only).
Return deduplicated articles.

Provider notes¶

sort_by and news_source are Google-only controls — Brave's news API has no sort or single-source parameter, so the Brave adapter never sends them (the schema descriptions mark them provider-conditional rather than dropping the fields, since Google genuinely honors them). country, language, and safe are honored by Brave news.
publishedAt is optional and provider-dependent: populated when the provider exposes a publish timestamp (Google CSE via page metadata; Brave/Exa/Serper/SearchAPI/SearXNG/Tavily natively), omitted (not fabricated) when the provider supplies none — so treat it as best-effort. When present it is always normalized to ISO-8601 (RFC3339 UTC) regardless of the provider's raw format (RFC1123, relative ages like "3 days ago"/"2h", or bare dates), so values sort and compare consistently across providers; an unparseable timestamp is dropped rather than passed through.
sort_by=date maps to Google's date-sort control; exact ordering and time_range=hour granularity depend on the provider's index and may be approximate. News providers may also surface high-ranking forum/aggregator pages — news_source narrows to a trusted outlet when that matters (Google).

Cache¶

TTL: 15 minutes (news is time-sensitive)

Tool 6: `academic_search`¶

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	1-500 chars
`num_results`	int	no	5	1-10
`year_from`	int	no	—	1900-2030
`year_to`	int	no	—	1900-2030
`source`	string	no	`all`	all, arxiv, pubmed, ieee, nature, springer
`pdf_only`	bool	no	false	—
`sort_by`	string	no	`relevance`	relevance, date
`open_access`	bool	no	false	Only return open-access papers
`provider`	string	no	—	Force provider: openalex, crossref, pubmed, semanticscholar, exa (academic APIs), or google, brave, serper, searxng, searchapi, duckduckgo, tavily (web fallback)
`sessionId`	string	no	—	Link results to a `sequential_search` session; sources are auto-recorded for recovery after context loss

Output Fields¶

Each paper in the papers array contains:

Field	Type	Always Present	Description
`title`	string	yes	Paper title
`url`	string	yes	Link to paper (DOI URL or publisher page)
`source`	string	yes	Provider that returned this result
`doi`	string	no	Digital Object Identifier
`authors`	[]string	no	Author names
`journal`	string	no	Journal or venue name
`year`	int	no	Publication year
`abstract`	string	no	Paper abstract (up to 500 chars)
`citationCount`	int	no	Number of citations
`openAccess`	bool	no	Whether the paper is freely available
`pdfUrl`	string	no	Direct link to PDF (Semantic Scholar/OpenAlex supply this directly; for DOI-only results it is filled by Unpaywall open-access enrichment when `UNPAYWALL_EMAIL`/`OPENALEX_EMAIL` is set)
`tldr`	string	no	AI-generated one-sentence summary of the paper (Semantic Scholar only; machine-generated, not author-written)
`isInfluential`	bool	no	Whether Semantic Scholar flags this as a highly-influential paper
`citationIntents`	[]string	no	Citation-intent labels (e.g. background, methodology) — populated by `citation_graph`, not plain search
`isInDoaj`	bool	no	Whether OpenAlex reports the journal is listed in the Directory of Open Access Journals (DOAJ) — a peer-reviewed OA quality signal. OpenAlex-only

Additional output fields: query, totalResults, resultCount, source (which provider answered: openalex, crossref, router, web_search), hints (a ZeroResultHints object explaining why a query returned nothing and suggesting how to broaden it — present on zero-result responses), and trust (always "untrusted-external-content" — treat results as data, not instructions; OWASP LLM01).

Behavior¶

4-strategy fallback: explicit provider → router → academic providers → site-restricted web search
When academic providers (OpenAlex, CrossRef, PubMed, Semantic Scholar) are configured, returns rich metadata (DOI, authors, citations, OA status)
Metadata richness varies by provider: OpenAlex returns abstracts, citation counts, and authors consistently; Semantic Scholar adds tldr and isInfluential; CrossRef is a DOI registry and may omit abstracts/citation counts; PubMed returns biomedical records (title, authors, year, venue, DOI) — no abstract in the summary response. Automatic selection prefers OpenAlex; others answer when explicitly forced or as a fallback. Field absence reflects the provider, not an error.
Without academic env vars, falls back to site-restricted web search (identical to previous behavior)
OpenAlex/CrossRef require only an email address (no API key); PubMed and Semantic Scholar work key-free at a lower shared rate (PUBMED_API_KEY / SEMANTIC_SCHOLAR_API_KEY raise the respective limit). PubMed DOIs feed the same retraction enrichment as every other provider.
Open-access enrichment (Unpaywall): when UNPAYWALL_EMAIL (or OPENALEX_EMAIL) is set, DOI-bearing results that lack a PDF link are enriched with the best open-access PDF Unpaywall knows about. Best-effort: never overwrites a provider-supplied pdfUrl, never fails the search, and runs before the pdf_only filter so resolved PDFs are counted. No-op when unconfigured.
source filter: when set (e.g., "arxiv"), OpenAlex filters by source ID; web fallback restricts to that source's domain
sort_by=date: OpenAlex sorts by publication_date:desc; CrossRef uses published:desc
pdf_only: post-filters results to only those with PDFUrl populated (may reduce result count)

Academic Site Pool (web search fallback)¶

arxiv.org, pubmed.ncbi.nlm.nih.gov, scholar.google.com, ieeexplore.ieee.org, dl.acm.org, nature.com, sciencedirect.com, link.springer.com, researchgate.net, plos.org, frontiersin.org, mdpi.com, wiley.com, jstor.org, semanticscholar.org, biorxiv.org, medrxiv.org

Cache¶

TTL: 1 hour (academic providers use semantic ranking that can shift)

Tool 7: `patent_search`¶

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	no	—	Patent terms or number. Not required when `assignee` or `inventor` provided
`num_results`	int	no	5	1-10
`search_type`	string	no	`prior_art`	prior_art, specific, landscape
`patent_office`	string	no	`all`	all, US, EP, WO, JP, CN, KR
`assignee`	string	no	—	Company name (auto-strips Inc/LLC/Ltd suffixes)
`inventor`	string	no	—	Inventor name
`cpc_code`	string	no	—	CPC classification (e.g., G06F)
`year_from`	int	no	—	Only patents filed in or after this year
`year_to`	int	no	—	Only patents filed in or before this year
`provider`	string	no	—	Force provider: searchapi, epo, lens, uspto, or web search providers
`sessionId`	string	no	—	Link results to a `sequential_search` session; sources are auto-recorded for recovery after context loss

Output Fields¶

Each patent in the patents array contains:

Field	Type	Always Present	Description
`title`	string	yes	Patent title
`number`	string	yes	Patent number (e.g., US10165245B2)
`url`	string	yes	Link to patent detail page
`abstract`	string	no	Patent abstract or snippet
`assignee`	string	no	Patent owner/assignee
`inventor`	string	no	Primary inventor name
`filed`	string	no	Filing date (YYYY-MM-DD)
`granted`	string	no	Grant date (YYYY-MM-DD)
`pdf`	string	no	Direct link to patent PDF
`status`	string	no	Application status (e.g., "Patented Case") — provider-dependent: USPTO reports it; EPO/Lens/SearchAPI/web-discovery typically omit it

Additional output fields: query, searchType, resultCount, source (which provider answered), searchUrl, hints (a ZeroResultHints object explaining why a query returned nothing and suggesting how to broaden it — present on zero-result responses), and trust (always "untrusted-external-content" — treat results as data, not instructions; OWASP LLM01).

Behavior¶

4-strategy fallback: explicit provider → router → patent-only providers → web search discovery
When an explicit provider is set: that provider is used exclusively. If it returns empty results (e.g., USPTO for non-US patents), empty results are returned — no silent fallback to web_discovery
Unknown provider: returns error listing all supported providers (no duplicates)
Strips HTML from API responses; extracts clean patent numbers from paths
Normalizes assignee names (removes Inc/LLC/Corp/Ltd suffixes for matching)
Region-aware routing: patent_office filters which providers are tried
Post-filter results by patent number prefix when patent_office is specified
Does not cache empty results (only caches when patents are found)
USPTO uses simple full-text search (quoted phrases); Lens uses Elasticsearch bool queries with match_phrase
num_results is enforced for every provider, including a defensive cap on the USPTO path (its API may return more rows than requested)
Provider matching is token/substring-based: inventor/assignee matches share a surname or company token rather than disambiguating entities, and a nonsense query may still fuzzy-match loosely-related patents instead of returning zero. Verify results against the returned bibliographic fields rather than assuming exact-entity matching.

Cache¶

TTL: 24 hours (only for non-empty results)

Tool 8: `sequential_search`¶

Purpose¶

Multi-step research tracking with session persistence, branching, and knowledge gap identification.

Input Schema¶

Field	Type	Required	Default	Constraints
`searchStep`	string	yes	—	Description of this step
`stepNumber`	int	yes	—	Starts at 1
`nextStepNeeded`	bool	yes	—	Whether more steps follow
`sessionId`	string	no	—	Session ID (required for steps 2+)
`researchGoal`	string	no	—	Set on step 1; defines the research objective
`reasoning`	string	no	—	Why this search direction was chosen
`confidence`	string	no	—	Confidence in this step: high, medium, or low
`rejectedApproaches`	[]string	no	—	Approaches considered but rejected
`sessionSummary`	string	no	—	Running summary (used for recovery)
`responseMode`	string	no	auto	Force `full` or `summary` output
`totalStepsEstimate`	int	no	—	Estimated total steps
`isRevision`	bool	no	false	Revising a previous step
`revisesStep`	int	no	—	Step being revised
`branchFromStep`	int	no	—	Branching point
`branchId`	string	no	—	Branch identifier
`knowledgeGap`	string	no	—	Gap identified
`depth`	string	no	`quick`	Iteration-assist level: `quick`, `standard`, or `thorough` (see Iterative Depth below)

Session Management¶

Sessions created on first call (stepNumber=1)
A stepNumber > 1 call with no sessionId is rejected with guidance (pass the sessionId, recover with get_research_session, or restart at step 1) — it does not silently start a new session, so a lost sessionId never orphans the in-flight research trail
Session ID: UUID v4, returned in output
TTL: 4 hours of inactivity (configurable via SESSION_TTL), resets on every access
Max concurrent sessions: 50 per tenant (oldest evicted when exceeded)
Max steps per session: 200 (configurable via SESSION_MAX_STEPS)
Persistence: encrypted disk (AES-256-GCM), survives server restarts
Cleanup: goroutine every 15 minutes removes expired sessions from memory + disk
Per-tenant isolation: sessions keyed by {tenantID}:{sessionID}
Recovery: use get_research_session tool after context loss
Response modes: "full" for ≤8 steps, "summary" for >8 (override with responseMode input)

Output Schema¶

type SequentialSearchOutput struct {
    SessionID          string           `json:"sessionId"`
    ResponseMode       string           `json:"responseMode"`        // "full" or "summary"
    ResearchGoal       string           `json:"researchGoal"`
    CurrentStep        int              `json:"currentStep"`         // echoes the input stepNumber
    TotalStepsEstimate int              `json:"totalStepsEstimate"`
    IsComplete         bool             `json:"isComplete"`          // !nextStepNeeded
    StartedAt          string           `json:"startedAt"`
    CompletedAt        string           `json:"completedAt,omitempty"` // set only when complete
    Warning            string           `json:"warning,omitempty"`     // e.g. max-steps reached
    Trust              string           `json:"trust"`                 // "untrusted-external-content" — echoed source metadata is external data

    // "full" mode (default for <=8 steps):
    Steps              []StepIndexEntry `json:"steps,omitempty"`     // one-liner index, full mode only

    // "summary" mode (default for >8 steps):
    Summary            string           `json:"summary,omitempty"`   // summary mode only
    StepIndex          []StepIndexEntry `json:"stepIndex,omitempty"` // summary mode only

    // Both modes:
    LastSteps          []ResearchStep   `json:"lastSteps,omitempty"` // most recent full steps
    Gaps               []KnowledgeGap   `json:"gaps,omitempty"`
    Sources            []ResearchSource `json:"sources,omitempty"`
}

type StepIndexEntry struct {
    StepNumber int    `json:"stepNumber"`
    OneLiner   string `json:"oneLiner"`
    BranchID   string `json:"branchId"`
    Confidence string `json:"confidence"`
}

The key set depends on responseMode: full mode emits steps; summary mode emits summary + stepIndex instead. Both emit lastSteps, gaps, and sources. This tool does not emit a _meta block (no caching).

Iterative Depth (`depth`)¶

An optional iteration-assist level (#67). The server stays infrastructure, not synthesis — it never writes an answer, only richer metadata and (for thorough) raw results.

Level	Behavior
`quick` (default)	Record the step and return. Byte-for-byte the prior behavior — no extra fields.
`standard`	Also analyze coverage of the sources gathered so far and suggest refinement queries. No auto-execution. Adds `depth`, `coverage`, and `refinementQueries`.
`thorough`	Same as `standard`, plus auto-runs up to 3 of the suggested refinement queries as web searches and attaches their merged, provenance-tagged results. Adds `refinementResults` (and `refinementNote` when the suggestion list was truncated).

Extra output fields (present only for standard/thorough):

coverage — { sourceCount, uniqueDomains, domainSpread, dominantDomain?, sourceTypes, gaps[] }. Descriptive coverage signals (domain spread, source-type balance, thin-coverage flags) computed from the session's recorded sources. Never an answer.
refinementQueries — suggested follow-up search strings derived from knowledge gaps + coverage gaps. The caller's AI decides whether to act on them.
refinementResults (thorough only) — array of { query, resultCount, results[] } (or { query, error }), one per auto-run query. Raw web results tagged with the originating query; not synthesized. Each result is { title, url, snippet, publishedAt? } (publishedAt (#356) present only when the underlying provider's response carried a date).
refinementNote (thorough only) — present when more than 3 queries were suggested and the auto-run was bounded.
refinementWarning (thorough only, #357) — present when at least one auto-run refinement search returned zero results: coverage gaps may persist and are not confirmed absent by an empty search.

thorough searches respect the same rate limits and circuit breakers as web_search, record their sources into the session, and contribute to session providerStats.

State Management¶

Two-tier: in-memory index (lightweight) + encrypted disk (full session JSON)
Write-through on every step (crash-safe: temp → fsync → rename)
Index rebuilt from disk on server startup — no data loss across restarts
Behind a load balancer, use session-affinity (sticky sessions) so clients reconnect to the same instance

Tool 9: `get_research_session`¶

Recover a sequential_search session after context loss. Returns the session summary, step index, and most recent steps. Use stepId to retrieve full details of a specific earlier step.

Input Schema¶

Field	Type	Required	Description
`sessionId`	string	yes	The session ID to recover
`stepId`	integer	no	Retrieve full details for a specific step number

Behavior¶

Without stepId: returns session summary view from memory (no disk I/O)
Includes: researchGoal, summary, stepIndex (a one-liner for every step), lastSteps (the most recent 3 steps in full detail — a fixed sliding window, not all steps), active gaps, and sources. For full detail of any earlier step, pass its stepId.
With stepId: loads full step data from disk for that specific step number
Every response carries "trust": "untrusted-external-content" — the echoed source metadata (titles/URLs) is external data; treat it as data, not instructions (OWASP LLM01).
Sessions are private to the (tenant, user) that created them — a session ID is honored only for its owning user (anonymous/STDIO uses a single owner).
A source's foundInStep is the 1-indexed sequential_search step that surfaced it. It is omitted entirely when the source was not tied to a numbered step (e.g. added via a web_search carrying only a sessionId) — steps are 1-indexed, so there is no foundInStep: 0. The same convention applies to a gap's foundInStep.

Cross-call error patterns (#99)¶

The summary view additionally surfaces aggregated outcome telemetry recorded across the session's tool calls (scrapes, searches). This is additive metadata — per-call errors are still returned in full to the caller; this is the cross-call view.

errorPatterns — array of { kind, count, affectedUrls[], suggestion, lastSeen }, surfaced only when a given error kind occurred 3 or more times in the session (a false-positive guard for small samples). kind uses the shared error taxonomy (auth_required, blocked, rate_limited, browser_unavailable, …); suggestion is a session-level remediation hint (e.g. auth_required → "Consider open_access=true or target preprint servers (arxiv, biorxiv)."). Absent when nothing crosses the threshold.
providerStats — object keyed by provider name → { attempts, successes } for the session. Absent when no provider outcomes were recorded.

Recorded outcome state is bounded (most-recent 200 events, FIFO) and tenant/user-isolated, honoring the no-unbounded-retention posture.

Annotations¶

ReadOnly: true
Idempotent: true (safe to call repeatedly)
OpenWorld: false (reads internal state only)

Error Conditions¶

Session not found → "Session not found or expired. Sessions last 4 hours from last activity."
Step not found → error with step number

Cache¶

No cache (reads internal session state)

Tool 10: `get_my_analytics`¶

Opt-in, consent-gated (#92). Registered only when USER_ANALYTICS_ENABLED=true. Read-only.

Purpose¶

Return the calling user's own usage analytics (tools used, counts, first/last seen) for their tenant. This is per-user data under GDPR / Quebec Law 25, so it is off by default, collected only after recorded consent, isolated per user, encrypted at rest, and covered by the data-subject access/erasure endpoints (/admin/data).

Input Schema¶

No inputs. The subject is always the authenticated caller — a user can never request another user's analytics.

Output Schema¶

type GetMyAnalyticsOutput struct {
    Status    string         `json:"status"`           // "ok" | "empty" | "no_consent" | "unavailable"
    Reason    string         `json:"reason,omitempty"`
    Analytics *UserAnalytics `json:"analytics,omitempty"`
}

type UserAnalytics struct {
    TenantID    string           `json:"tenantId"`
    UserID      string           `json:"userId"`
    TotalCalls  int64            `json:"totalCalls"`
    ToolCounts  map[string]int64 `json:"toolCounts"`
    FirstSeen   string           `json:"firstSeen,omitempty"`
    LastSeen    string           `json:"lastSeen,omitempty"`
    RecentTools []string         `json:"recentTools,omitempty"`
}

Behavior¶

Requires an authenticated user (status: "unavailable" for anonymous).
Requires recorded consent for the analytics purpose (status: "no_consent" otherwise — nothing is collected without it).
Returns the caller's own summary, or status: "empty" if none recorded yet.

Cache¶

No cache (reads per-user state directly).

Tool 11: `memory_save`¶

Opt-in, consent-gated (#88). Registered only when MEMORY_ENABLED=true. This is a write tool (ReadOnlyHint: false, DestructiveHint: false — it appends, never deletes).

Purpose¶

Persist a research finding to the calling user's long-term memory so it can be recalled in future sessions (unlike sequential_search sessions, which expire after 4 hours). Stored per-user, encrypted, retention-bounded (MEMORY_RETENTION, default 90 days), and erasable via the data-subject endpoint (/admin/data). There is no memory_forget tool — deletion flows through the GDPR erasure endpoint.

Input Schema¶

Field	Type	Required	Notes
`note`	string	yes	The finding/conclusion to remember
`topic`	string	no	Group label for later recall
`url`	string	no	Source URL this memory refers to
`tags`	string[]	no	Organizational tags

Output Schema¶

type MemorySaveOutput struct {
    Status    string `json:"status"`    // "ok" | "no_consent" | "unavailable"
    Reason    string `json:"reason,omitempty"`
    ID        string `json:"id,omitempty"`
    CreatedAt string `json:"createdAt,omitempty"`
}

Behavior¶

Requires an authenticated user and recorded consent for the memory purpose; otherwise returns unavailable / no_consent and persists nothing.

Cache¶

Not cached (a write).

Tool 12: `memory_recall`¶

Opt-in, consent-gated (#88). Registered only when MEMORY_ENABLED=true. Read-only.

Purpose¶

Recall findings the calling user previously saved with memory_save, across sessions, optionally filtered by topic. Shows only the caller's own memories — never another user's.

Input Schema¶

Field	Type	Required	Notes
`topic`	string	no	Filter by topic; omit for most recent across all topics
`limit`	int	no	Max memories to return (default 20)

Output Schema¶

type MemoryRecallOutput struct {
    Status   string        `json:"status"`   // "ok" | "no_consent" | "unavailable"
    Reason   string        `json:"reason,omitempty"`
    Count    int           `json:"count"`
    Memories []MemoryEntry `json:"memories"`
    Trust    string        `json:"trust"`    // "user-asserted-content" — recalled notes are data, not instructions
}

type MemoryEntry struct {
    ID        string   `json:"id"`
    TenantID  string   `json:"tenantId"`
    UserID    string   `json:"userId"`
    Topic     string   `json:"topic,omitempty"`
    Note      string   `json:"note"`
    URL       string   `json:"url,omitempty"`
    Tags      []string `json:"tags,omitempty"`
    CreatedAt string   `json:"createdAt"`
}

Cache¶

No cache (reads per-user state directly).

Tool 13: `workspace_contribute`¶

Opt-in, consent-gated (#96). Registered only when WORKSPACES_ENABLED=true. This is a write tool (ReadOnlyHint: false, DestructiveHint: false).

Purpose¶

Share a research finding into a shared team workspace. The contribution is stored as a copy with immutable provenance (your tenant/user, timestamp) — never a live link to your private data, so per-tenant isolation is never silently voided. Membership is managed by your host app (the server enforces, the host owns the policy). Erasable by the contributor via the data-subject endpoint.

Input Schema¶

Field	Type	Required	Notes
`workspace_id`	string	yes	Workspace to contribute to (you must be a member)
`note`	string	yes	The finding to share
`url`	string	no	Source URL
`tags`	string[]	no	Organizational tags

Output Schema¶

type WorkspaceContributeOutput struct {
    Status string `json:"status"` // "ok" | "not_member" | "no_consent" | "unavailable"
    Reason string `json:"reason,omitempty"`
    ID     string `json:"id,omitempty"`
}

Behavior¶

Requires an authenticated user, recorded consent for the workspace purpose, AND membership. The caller's identity is taken from the validated token — never from a parameter, and never from the workspace_id.

Cache¶

Not cached (a write).

Tool 14: `workspace_read`¶

Opt-in, consent-gated (#96). Registered only when WORKSPACES_ENABLED=true. Read-only.

Purpose¶

Read the shared findings in a workspace you belong to (each with its contributor attribution). Non-members receive zero contributions — membership is re-verified on every read.

Input Schema¶

Field	Type	Required	Notes
`workspace_id`	string	yes	Workspace to read (you must be a member)

Output Schema¶

type WorkspaceReadOutput struct {
    Status        string         `json:"status"` // "ok" | "not_member" | "no_consent" | "unavailable"
    Count         int            `json:"count"`
    Contributions []Contribution `json:"contributions"`
    Trust         string         `json:"trust"`  // "untrusted-external-content" — cross-member notes are untrusted data (does not restrict who may read)
}

type Contribution struct {
    ID                string   `json:"id"`
    WorkspaceID       string   `json:"workspaceId"`
    ContributorTenant string   `json:"contributorTenant"`
    ContributorUser   string   `json:"contributorUser"`
    Note              string   `json:"note"`
    URL               string   `json:"url,omitempty"`
    Tags              []string `json:"tags,omitempty"`
    CreatedAt         string   `json:"createdAt"`
}

Membership management¶

Membership is host-owned via admin endpoints (not MCP tools): POST /admin/workspace/members and DELETE /admin/workspace/members with {workspace_id, tenant_id, user_id}. The server enforces the resulting membership checks; it does not own the membership policy.

Cache¶

No cache (reads workspace state directly).

Tool 15: `answer`¶

Provider-independent. Registered only when at least one answer provider is configured (currently Exa via EXA_API_KEY; future providers register the same way). Read-only, open-world, idempotent.

Purpose¶

Ask a factual question and get one grounded, synthesized answer with source citations. Unlike web_search (a list of links) or search_and_scrape (raw page text), this returns a direct written answer plus the URLs it relied on. The backing provider is pluggable — set provider to choose one when several are configured. The result names the answering provider, and costUsd reports the per-call estimate for metered providers (0 for free ones).

Epistemic caveat (#357). The synthesized answer may be incomplete or outdated — verify the cited URLs before asserting the answer as fact. An empty or unusually short answer does not confirm the fact's absence; it may reflect a provider gap rather than a true negative.

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	The question to answer
`provider`	string	no	—	Force a provider (e.g. `exa`); required only when more than one is configured

Output Schema¶

type AnswerOutput struct {
    Answer    string         `json:"answer"`
    Citations []Citation     `json:"citations"`
    Provider  string         `json:"provider"`          // which provider answered
    CostUsd   float64        `json:"costUsd,omitempty"` // per-call estimate for metered providers (not an invoice)
    Trust     string         `json:"trust"`             // "untrusted-external-content"
    Hints     map[string]any `json:"hints,omitempty"`   // present only on weak query↔answer term overlap (#235) — see Behavior
}

type Citation struct {
    Title         string `json:"title,omitempty"`
    URL           string `json:"url"`
    PublishedDate string `json:"publishedDate,omitempty"`
}

Behavior¶

Resolve the search.AnswerSearcher for the requested provider (or the sole configured one).
Call the provider; map its grounded answer + citations + cost into the output.
costUsd and the resolved provider are surfaced into audit metadata (cost_usd, provider).
The answer is external content — trust is always "untrusted-external-content".
hints (#235): the answer provider routes any query to a plausible interpretation. So an off-target or ambiguous query still returns a real, non-fabricated answer — just possibly to a loosely-related question. hints gets added when fewer than half of the query's significant terms (2+ required) appear in the synthesized answer text: {"confidence": "low", "reason": "weak_query_result_overlap", "message": "...", "termsMatched": N, "termsTotal": M}. Omitted when overlap is adequate, or the query's too short to judge.

Cache¶

TTL: 1 hour (keyed by query + provider)

Tool 16: `structured_search`¶

Provider-independent. Registered only when at least one structured-search provider is configured (currently Exa via EXA_API_KEY). Read-only, open-world, idempotent.

Purpose¶

Search the web and extract structured data from each result. Supply a JSON schema to pull specific fields back as JSON per result, and/or a category to focus the search. The backing provider is pluggable (provider field). Valid category values and any schema limits are provider-specific and validated by the chosen provider — an unsupported value returns an error listing the valid options. The result names the provider, and costUsd reports the per-call estimate for metered providers.

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	What to search for (entity name for entity lookups)
`category`	string	no	—	Provider-specific result category (validated by the provider)
`num_results`	int	no	5	1-10
`schema`	object	no	—	JSON Schema for per-result field extraction; provider-specific limits apply
`provider`	string	no	—	Force a provider (e.g. `exa`); required only when more than one is configured

Output Schema¶

type StructuredOutput struct {
    Query       string           `json:"query"`
    Category    string           `json:"category"`
    ResultCount int              `json:"resultCount"`
    Results     []StructuredItem `json:"results"`
    Provider    string           `json:"provider"`          // which provider answered
    CostUsd     float64          `json:"costUsd,omitempty"` // per-call estimate for metered providers
    Trust       string           `json:"trust"`             // "untrusted-external-content"
    Hints       map[string]any   `json:"hints,omitempty"`   // present only on weak query↔results term overlap (#235) — see Behavior
}

type StructuredItem struct {
    Title         string          `json:"title,omitempty"`
    URL           string          `json:"url"`
    PublishedDate string          `json:"publishedDate,omitempty"`
    Author        string          `json:"author,omitempty"`
    Summary       json.RawMessage `json:"summary,omitempty"`    // schema-conforming JSON (best-effort), or plain text summary
    Highlights    []string        `json:"highlights,omitempty"` // verbatim source snippets — the authoritative payload
    Entities      json.RawMessage `json:"entities,omitempty"`   // provider-specific structured entities, when available
}

Behavior¶

Resolve the search.StructuredSearcher for the requested provider (or the sole configured one).
The provider validates its own constraints (category vocabulary, schema limits) before any paid call; a violation returns a validation tool-error, never a wasted upstream request.
When schema is set, each result's summary is JSON conforming to it; otherwise it is a plain text summary. Schema extraction is best-effort and provider-side: the provider's extractor fills each field from the page, and a value it can't confidently resolve comes back null even when that value is visible in highlights. Treat highlights (verbatim source snippets) as the authoritative payload and summary as a convenience — do not assume every schema field is populated. Providers may populate per-result entities for entity categories (e.g. Exa's company).
costUsd and the resolved provider are surfaced into audit metadata. Results are external content — trust is always "untrusted-external-content".
hints (#235): same weak-relevance signal as answer. hints flags {"confidence": "low", "reason": "weak_query_result_overlap", ...} when fewer than half the query's significant terms (2+ required) appear across the combined result text. Omitted when overlap is adequate, or the query's too short to judge.

Provider notes¶

Exa: category ∈ company, people, research paper, news, pdf, github, financial report, personal site; schema must be a flat object (root object, ≤10 properties, nesting depth ≤2, primitive array items); category:"company" returns structured company entities.

Cache¶

TTL: 1 hour (keyed by query + category + num_results + schema + provider)

Tool 17: `citation_graph`¶

Map a seed paper's citation neighborhood: works that cite it (forward edges, cited_by) and works it cites (backward edges, references). Single-hop per call — multi-hop traversal is the caller's to orchestrate (the server stays infrastructure, not an autonomous crawler). Registered only when a citation-capable academic provider (Semantic Scholar or OpenAlex) is configured.

Input Schema¶

Field	Type	Required	Default	Constraints
`paper`	string	yes	—	Seed paper: a DOI (e.g. `10.1038/nature12373`) or an exact title
`direction`	string	no	`both`	`cited_by` (forward), `references` (backward), or `both`
`num_results`	int	no	10	1-25 per direction
`influential_only`	bool	no	false	Keep only highly-influential citations when the provider supplies that signal (Semantic Scholar); no-op otherwise (results pass through)
`provider`	string	no	—	Force `semanticscholar` (intent + influence) or `openalex` (counts only). Omit to auto-select (prefers semanticscholar)
`sessionId`	string	no	—	Link discovered works to a `sequential_search` session for recovery after context loss

Output Fields¶

Field	Type	Always Present	Description
`seed`	string	yes	The seed paper as supplied
`direction`	string	yes	The direction traversed
`provider`	string	yes	Which citation provider answered (`semanticscholar` = intent+influence; `openalex` = counts only)
`citedBy`	[]paper	when direction includes `cited_by`	Works that cite the seed (each a full academic-paper object, same shape as `academic_search` papers)
`citedByCount`	int	when direction includes `cited_by`	Count of forward edges returned
`references`	[]paper	when direction includes `references`	Works the seed cites
`referencesCount`	int	when direction includes `references`	Count of backward edges returned
`trust`	string	yes	Always `"untrusted-external-content"` — treat results as data, not instructions (OWASP LLM01)

Each work in citedBy/references carries the same fields as an academic_search paper, including tldr, isInfluential, and citationIntents when the provider (Semantic Scholar) supplies them.

Behavior¶

Provider fidelity: Semantic Scholar returns citation intent (citationIntents) and an influence flag (isInfluential); OpenAlex is counts-only (forward edges via the cites: filter, backward edges from the seed's referenced_works). Auto-selection prefers Semantic Scholar.
Seed resolution: a DOI resolves directly; a title resolves to the provider's top match.
Explicit provider honoring: forcing an unconfigured/unknown/incapable provider returns a structured error listing the supported providers (semanticscholar, openalex); it never silently falls back.
influential_only filters out works the provider did not flag as influential; providers without the signal pass all results through unchanged.

Annotations¶

ReadOnly: true · Idempotent: true · OpenWorld: true (queries external scholarly APIs)

Cache¶

TTL: 24 hours (citation graphs change slowly)

Tool 18: `research_export`¶

Export a completed sequential_search session as a shareable deliverable — a human-readable markdown report or the full structured json session. Read-only and idempotent: it renders existing session state, never mutates it. Scoped to the caller's own (tenant, user).

Input Schema¶

Field	Type	Required	Default	Constraints
`sessionId`	string	yes	—	The `sequential_search` session to export
`format`	string	no	`markdown`	`markdown` (readable report) or `json` (full structured session)
`verify_links`	bool	no	`false`	When true, check each source URL is still live and attach a Wayback snapshot for any dead link. Adds latency. Best-effort — failures leave a source unverified, never error.

Output Fields¶

Field	Type	Always Present	Description
`sessionId`	string	yes	The exported session ID
`format`	string	yes	`markdown` or `json`
`researchGoal`	string	yes	The session's research goal
`stepCount`	int	yes	Number of recorded steps
`sourceCount`	int	yes	Number of recorded sources
`startedAt`	string	yes	Session creation time (RFC3339)
`exportedAt`	string	yes	When this export was generated (RFC3339)
`tenantId`	string	yes	Owning tenant — provenance for the export
`document`	string \| object	yes	The rendered report: a markdown string when `format=markdown`, or the structured session object when `format=json`
`trust`	string	yes	Always `"untrusted-external-content"` — source titles/URLs are external data

Behavior¶

Markdown report contains: research-goal heading, a metadata block (session id, started, step/source counts), every step with its reasoning/confidence/rejected-approaches/timestamp (revisions and branches are labeled in the step heading), an Open Questions section (knowledge gaps), a numbered Sources list, and a provenance footer (export time, tenant).
Deterministic: same session → byte-identical output aside from the exportedAt stamp (idempotency).
Sessions are private to their owning (tenant, user); a leaked sessionId is honored only for its owner.
verify_links=true: runs a batched SSRF-safe liveness check on every recorded source URL and attaches a Wayback Machine snapshot URL for any dead link. Best-effort — a URL that can't be checked is left unverified, never an error. Off by default (adds latency proportional to source count).

Annotations¶

ReadOnly: true · Idempotent: true · OpenWorld: false (reads internal session state)

Error Conditions¶

Missing sessionId → validation error
Unknown/expired session → "Session not found or expired. Sessions last 4 hours from last activity."
Invalid format → validation error (use markdown or json)

Cache¶

No cache (renders live session state)

Tool 19: `format_bibliography`¶

Turn a set of sources into a formatted bibliography. Pick a human-readable style (APA, MLA) or a reference-manager interchange format (BibTeX, RIS, CSL-JSON) that imports straight into Zotero / EndNote / Mendeley. Sources come from either a sequential_search session (its recorded sources) or an explicit list the caller supplies (e.g. academic_search / citation_graph results — pass their doi so the persistent id survives). Read-only and idempotent.

Input Schema¶

Field	Type	Required	Default	Constraints
`style`	string	no	`apa`	`apa`, `mla`, `bibtex`, `ris`, or `csl-json`
`sessionId`	string	no	—	Build from this session's recorded sources. Provide this or `sources`
`sources`	[]object	no	—	Explicit sources. Provide this or `sessionId`. Each: `url` (required), `title`, `author`, `site`, `date`, `doi`

Output Fields¶

Field	Type	Always Present	Description
`style`	string	yes	The citation style used (`apa`/`mla`/`bibtex`/`ris`/`csl-json`)
`entryCount`	int	yes	Number of unique entries rendered (after de-duplication by URL)
`bibliography`	string	yes	The formatted bibliography. For `apa`/`mla`/`bibtex`/`ris`, records separated by blank lines; for `csl-json`, a JSON array string
`sessionId`	string	no	Echoed when sources were drawn from a session
`trust`	string	yes	Always `"untrusted-external-content"` — source metadata is external data

Behavior¶

Sources are de-duplicated by URL (first occurrence wins) and ordered deterministically: APA/MLA alphabetically by the rendered line; BibTeX/RIS/CSL-JSON by (collision-free) cite key. Same inputs → byte-identical output (interchange formats omit the accessed-date stamp so they stay reproducible).
BibTeX cite keys are surname + year + first-title-word (e.g. vaswani2017attention), made collision-free within the list by appending a/b/c…; BibTeX-significant characters in values are escaped.
RIS records use TY - JOUR when a DOI is present (the entry is almost certainly a journal article); others use TY - ELEC. One AU line per author (split on ; / and), with DO carrying the bare DOI and UR the URL; values are stripped of line breaks so a title can't inject extra RIS tags.
CSL-JSON is a JSON array: entries with a DOI use "type": "article-journal"; others use "type": "webpage". (id = cite key, authors as {"literal": …}, issued date-parts, container-title, DOI, URL); all values are JSON-escaped.
DOI (when supplied) is normalized to the bare 10.x/y form and emitted into bibtex (doi), ris (DO), and csl-json (DOI). It is not network-verified here — use verify_citation for that.
An unrecognized style is rejected at the tool boundary (the lower-level formatter falls back to APA, but the tool validates first).
Entries with no url are skipped. Either sessionId or a non-empty sources list is required.
Session-sourced bibliographies are scoped to the caller's own (tenant, user).

Annotations¶

ReadOnly: true · Idempotent: true · OpenWorld: false (formats supplied/stored data)

Cache¶

No cache (pure formatting of supplied/stored data)

Tool 20: `filing_search`¶

Search SEC EDGAR — the authoritative primary source for US public-company disclosures (10-K/10-Q/8-K/S-1/DEF 14A/…). Registered only when a filing provider is configured (edgar, which needs a contact email for SEC's required User-Agent).

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes*	—	Company name, ticker, or CIK — or free text to full-text search all filings. *Required unless `ticker` is set
`form_type`	string	no	—	Restrict to a filing type (10-K, 10-Q, 8-K, S-1, DEF 14A, …)
`ticker`	string	no	—	Direct ticker lookup; takes precedence over `query` for entity resolution
`date_from`	string	no	—	Only filings on/after this date (YYYY-MM-DD)
`date_to`	string	no	—	Only filings on/before this date (YYYY-MM-DD)
`facts`	bool	no	false	Return structured XBRL company facts (revenue, net income, EPS, assets) instead of a filing list
`num_results`	int	no	5	1-10
`provider`	string	no	—	Force a filing provider: `edgar`
`sessionId`	string	no	—	Link results to a `sequential_search` session

Output Fields¶

Each item in filings[]: company, cik, formType, filingDate, periodOfReport, accession, url (document link; pair with scrape_page), description, source. In facts=true mode each item is one XBRL fact: concept, unit, value (exactly as filed — no rounding). Plus query, resultCount, provider, hints (zero-result), and trust (untrusted-external-content).

Behavior¶

Entity resolution: a ticker/CIK/known-company query resolves to a CIK and lists its recent filings from the submissions API; otherwise a full-text search runs across all filers (EFTS).
facts=true returns a curated set of headline XBRL concepts (revenue, net income, assets, EPS, …), most-recent value each, passed through verbatim.
Required User-Agent: SEC blocks requests without a descriptive UA + contact email; the provider only registers when EDGAR_CONTACT_EMAIL (or OPENALEX_EMAIL) is set. No request is ever made without it.
Ticker→CIK map is fetched once and cached for the process lifetime.

Annotations¶

ReadOnly: true · Idempotent: true · OpenWorld: true (queries the live SEC API)

Cache¶

TTL: 24 hours (only for non-empty results)

Tool 21: `legal_search`¶

Search US court opinions (federal + state) via CourtListener for case-law research and precedent tracing. Registered only when a case provider is configured (courtlistener, which works keyless at a lower rate).

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	Legal topic, case name (e.g. `Miranda v. Arizona`), or statutory reference
`jurisdiction`	string	no	—	Court id: `scotus`, `ca9`, `ny`, …
`date_from`	string	no	—	Only opinions decided on/after this date (YYYY-MM-DD)
`date_to`	string	no	—	Only opinions decided on/before this date (YYYY-MM-DD)
`num_results`	int	no	10	1-20
`provider`	string	no	—	Force a case-law provider: `courtlistener`
`sessionId`	string	no	—	Link results to a `sequential_search` session

Output Fields¶

Each item in cases[]: caseName, citation (Bluebook), court, courtId, dateFiled, docketNumber, citationCount, url (opinion page; scrape_page for full text), source. Plus query, resultCount, provider, hints, and trust (untrusted-external-content).

Behavior¶

Searches the CourtListener v4 opinions index; jurisdiction maps to the court filter, dates to filed_after/filed_before.
Auth: works keyless at ~100 req/day; COURTLISTENER_API_TOKEN raises the limit (~5000/day). The token is sent as an Authorization header and never logged.
Anti-hallucination workflow: pair with the legal lens (web_search with lens: legal, an authority-weighted primary-source pack — see lenses/README.md) for context, and with verify_citation to confirm a cited case actually exists before relying on it.

Annotations¶

ReadOnly: true · Idempotent: true · OpenWorld: true (queries the live CourtListener API)

Cache¶

TTL: 24 hours (only for non-empty results)

Tool 22: `econ_search`¶

Look up macroeconomic and development data. FRED (Federal Reserve Economic Data) — 800K+ US time series (GDP, CPI, unemployment, rates); World Bank Open Data — global development indicators for 200+ economies; OECD (SDMX) — economic indicators for OECD economies (national accounts, prices, labour, trade); Eurostat — official European statistics. World Bank, OECD, and Eurostat are keyless, so econ_search is always registered; FRED adds the US macro series when FRED_API_KEY is set.

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes*	—	Keyword to search series by (matches indicator name for World Bank). *Provide this OR `series_id`
`series_id`	string	yes*	—	A series ID to fetch observations: FRED (`GDP`, `CPIAUCSL`, `UNRATE`), a World Bank indicator (`NY.GDP.MKTP.CD`), an OECD dataflow ref (`agency,dataflow,version` — returned by a keyword search), or a Eurostat dataset code (`une_rt_m`). *Provide this OR `query`
`country`	string	no	`WLD`	ISO code for multi-country providers (`worldbank` default `WLD`; `oecd` → REF_AREA, e.g. `USA`; `eurostat` → geo, e.g. `DE`). Ignored by `fred`
`date_from`	string	no	—	Only observations on/after this date (YYYY-MM-DD or YYYY)
`date_to`	string	no	—	Only observations on/before this date (YYYY-MM-DD or YYYY)
`frequency`	string	no	—	FRED only: resample d, w, m, q, a
`units`	string	no	—	FRED only: units transform (e.g. `pch`, `pc1`); omit for raw levels
`num_results`	int	no	5 (search) / 10 (observations)	—
`provider`	string	no	—	Force an economic-data provider: `fred`, `worldbank`, `oecd`, or `eurostat`

Output Fields¶

mode is series (keyword search) or observations (series_id lookup). In series mode each results[] item: seriesId, title, units, frequency, lastUpdated, notes. In observations mode: seriesId, date, value (exactly as returned — no rounding; missing observations carry no value), plus title and units for multi-dimensional providers (OECD/Eurostat) so interleaved subgroup series — youth vs total, male vs female — are tellable apart (a single FRED/World Bank series carries neither). Plus query, seriesId (echoed in observations mode), country (echoed for a World Bank lookup), resultCount, provider, hints, and trust (untrusted-external-content).

Behavior¶

series_id set → returns that series' observations; otherwise keyword-searches series. With no date_from the window is the most-recent num_results (latest first); with a date_from it is the first num_results on/after that date (anchored at the requested start, oldest first) so the filter is never silently dropped. FRED honors frequency/units; World Bank scopes by country (default WLD) and filters by year.
World Bank / OECD / Eurostat have no server-side keyword search, so keyword mode lists the provider's catalogue (WDI indicators / OECD dataflows / Eurostat datasets) once and filters by name client-side; for OECD/Eurostat the matched id is the series_id to fetch observations with. Multi-word queries use AND-matching (all words must appear in the name — "quarterly GDP growth" matches titles containing each word even when not adjacent); single-word queries require a contiguous substring.
OECD addresses a series by a dataflow ref (agency,dataflow,version) plus a REF_AREA country filter; observations are decoded from SDMX-JSON at the requested time granularity (monthly/quarterly/annual — the period is not truncated to the year). Eurostat addresses a dataset by code plus a geo filter; observations are decoded from the JSON-stat cube by recovering every dimension's coordinate, surfacing its status flag (provisional/estimated) as notes. A dataset is multi-dimensional (sex/age/unit/adjustment/…), so each title carries the dimension labels that distinguish one series from another (e.g. "…— Females, Percentage of population in the labour force") and units carries the unit dimension; values pass through verbatim (no rounding).
Provider honoring: an explicit provider is used exclusively; otherwise the first configured provider answers (FRED if keyed, else World Bank/OECD/Eurostat in order). An error/empty returns a structured zero-result with hints (no silent cross-provider fallback).
Auth: World Bank, OECD, and Eurostat are keyless (always available). FRED_API_KEY (free at fred.stlouisfed.org) enables FRED; it is sent as a query param and never logged.

Annotations¶

ReadOnly: true · Idempotent: true · OpenWorld: true (queries the live FRED / World Bank / OECD / Eurostat APIs)

Cache¶

TTL: 6 hours (only for non-empty results)

Tool 23: `verify_citation`¶

Purpose¶

Verify a single citation before relying on it — confirm it exists, matches a real record, hasn't been retracted, and still resolves. Built to catch AI-fabricated or retracted citations before they ship (legal filings, papers, articles). Composes the retraction enrichment, the link verifier, and the academic searchers; adds no new provider.

Input Schema¶

Field	Type	Required	Description
`citation`	string	yes	A DOI (`10.1038/nature12373`), a URL, or a free-text reference string. The tool auto-detects which.
`claim`	string	no	The assertion this citation is cited for. When set, the source (the live URL, or its Wayback snapshot when dead) is fetched and checked for whether it actually addresses the claim — surfacing evidence sentences and flagging mischaracterization. Coverage + evidence, never a support/refute verdict. Off unless provided; adds one fetch. Omitting it leaves mischaracterization unchecked — the response flags this via `claimCheckSkipped` (#359).

Output Schema¶

input, inputType (doi|url|reference), exists (bool), matchedRecord (the academic record — for a DOI input it is only the work whose DOI exactly equals the input, never a near-neighbor, and is omitted when no exact record exists or exists:false), matchConfidence (high|medium|low|none), detectedDoi (for a URL input that resolves to a scholarly article: the DOI extracted from the page — citation_doi meta, the URL path, or references-safe front matter — which then drives the retraction + title-match checks just like a DOI input; omitted when no scholarly DOI was found), titleMatch (match|mismatch|not_checked: whether a title — text supplied alongside a DOI, or a scholarly page's own title for a URL input — matches the matched record's actual title; mismatch means ≥2 substantive title tokens are absent from the record — the caller may have cited the wrong paper; not_checked when there is no title text or only a bare DOI), retractionStatus (Crossref integrity status when retracted/corrected; omitted when clean), httpStatus + archivedUrl (for URL inputs — live status and a Wayback snapshot for dead links; exists:true with a 403/429/503 httpStatus means the server is reachable but refused the verifier — the resource exists, it is not a dead link), provenance (how each piece of evidence was obtained), and the trust marker. When no claim was supplied, claimCheckSkipped (true) and claimCheckSkippedReason explain that existence/retraction were checked but mischaracterization was not (#359). When a claim was supplied: claim (echo), claimSupport (addressed|partially_addressed|not_addressed|source_unavailable), claimEvidence (claim-relevant source sentences), claimSourceUrl (the URL actually fetched), contrastSignal (true only — a negation/contrast cue, read the evidence yourself), and — only when the fetched source was thin — contentWords (word count) and sparsityNote (#358: the claim check ran against a paywall/bot-wall stub; annotates claimSupport, never changes it).

Behavior¶

Evidence, never a verdict. The tool reports what it found (exists/matches/retracted/resolves); the caller decides whether to cite. It never synthesizes a true/false judgment.
DOI input → existence + retraction via Crossref works/{doi}; the matched record is fetched by exact-DOI entity lookup (the DOIResolver capability, e.g. OpenAlex /works/doi:{doi}) so matchedRecord is always the cited work or nothing — a relevance DOI search returns near-neighbors, which are never shown as this DOI's record. matchedRecord/matchConfidence are omitted when no exact record is found or exists:false. When the citation string also carries a title alongside the DOI, titleMatch compares those tokens against the matched record's title: mismatch fires only when ≥2 substantive tokens supplied are absent from the record (the caller may have paired the wrong title with this DOI), while not_checked means only a bare DOI was given. Zero false positives: a single coincidental token is never flagged mismatch.
URL input → liveness via the SSRF-safe link verifier; a Wayback archivedUrl when the live link is dead. When the URL resolves to a scholarly article (classified peer-reviewed/academic), the page is fetched once and its DOI is extracted (citation_doi meta → URL path → references-safe front matter, the same authority order as scrape_page); a found DOI then drives the full DOI enrichment — detectedDoi, retractionStatus, matchedRecord/matchConfidence (exact-DOI lookup), and titleMatch comparing the page's own title against the matched record. A non-scholarly page stays liveness-only — a DOI-shaped string in prose is never surfaced.
Free-text input → best-match academic lookup with a transparent token-overlap matchConfidence; retraction checked when the match carries a DOI.
Claim check (optional). When a claim is given, the source is fetched (the live URL, a matched record's URL, or a Wayback snapshot) and measured for topical overlap with the same lexical, model-free coverage as audit_bibliography (no model/embedding): claimSupport reports COVERAGE not stance, not_addressed is the mischaracterization signal (only when a source was actually read), and contrastSignal flags a negation cue. source_unavailable when no fetchable source (e.g. a DOI/reference whose matched record carries no URL). Off — and zero added latency — unless a claim is supplied. claimEvidence and contrastSignal are English-keyword heuristics (#390): on non-English source text an empty/false result means the heuristic didn't match, not that the claim is confirmed unaddressed/unopposed — read the source yourself when it isn't English. The same caveat applies to conflictOfInterest, which matches English employment/funding/equity phrases against author bio text.
Degrades gracefully when a resolver is unconfigured (reports the gap in provenance); never panics.
To check a whole reference list at once (a document, an explicit list, or a session), use audit_bibliography — the corpus-level companion that runs these same checks over every entry.

Annotations¶

ReadOnly: true · Idempotent: true · OpenWorld: true (queries live external sources)

Cache¶

Not cached (a verification is a point-in-time liveness/integrity check).

Tool 24: `verify_recommendation`¶

Purpose¶

Audit an AI-generated recommendation list (a listicle, product ranking, or comparison) for anti-sloptimization signals. Given a list of recommendations with optional URLs and author bios, returns per-item evidence: self-promotion patterns (a brand ranking itself first), conflicts of interest (the author is employed by the recommended company), domain reputation, and link liveness. Built to catch GEO (Generative Engine Optimization) and brand-favoring recommendations so you can decide whether the list is trustworthy or gaming you. Compose with web_search + verify_citation to audit sources and claims.

Input Schema¶

Field	Type	Required	Default	Description
`recommendations`	[]object	yes	—	Array of recommendations (max 100). Each has: `title` (the recommendation), `url` (optional), `author` (optional), `authorBio` (optional author affiliation/bio).
`claim`	string	no	—	Optional context describing what the list claims (e.g. "best e-commerce platforms for small businesses"). When set, triggers corroboration searches across independent journalism and tech sources per recommendation.
`numCorroborationResults`	int	no	5	Results to fetch per lens per recommendation when `claim` is set. Max 10.

Output Schema¶

itemCount (recommendations audited), recommendations[] — per item: title (echo), url (echo when provided), author (echo when provided), selfPromotionSignal (present when the recommendation text contains ranking patterns favoring the brand; rare for structured lists, more common on full-page audits), conflictOfInterest (present when the author has a detected financial stake — employment / funding / equity — in the recommended entity), domainReputation (domain reputation when the URL host is in the known sources dataset; omitted for unlisted hosts), linkLive (true when the URL resolves 2xx/3xx; false when dead), httpStatus (live HTTP status, 0 = unreachable/timeout), corroborationSearches[] (present when claim was given — one entry per corroboration lens with query, lens, resultCount, agreeCount, disagreeCount, silentCount, and topResults[]), flags (conflict_of_interest / dead_link / unknown_reputation / low_reputation; empty = no issues), reasons[] (human-readable explanations for any flags), and the trust marker. Top-level aggregateFlags[] (present only when claim is given): no_independent_corroboration fires when zero results across all lenses agreed with any recommendation.

Behavior¶

Evidence, never a verdict. The tool reports what it found (conflicts/reputation/liveness/corroboration); the caller decides whether to trust the recommendation.
Conflict of interest detection: scans author bio for employment, funding, or equity indicators (e.g. "at Shopify", "advisor to") that match entities mentioned in the recommendation text. Conservative — false negatives preferred to false positives.
Self-promotion signal (when present): indicates a ranking list putting its own brand first (a brand blog recommending itself as #1, etc.). Uses lexical matching: the URL's domain token (e.g. shopify.com → "shopify") must appear as the rank-1 item name. Does not detect corporate ownership (e.g. adobe.com ranking "Marketo" #1, since Marketo is an Adobe acquisition but the names don't match). See the open enhancement issue for scope-expansion plans.
Domain reputation: when known, surfaces the source's reputation tier from the embedded allowlist (academic, news, official docs, etc.).
Link liveness: batched SSRF-safe check of all provided URLs; present only when URLs are given.
Corroboration search (#246): when claim is provided and a recommendation has a title, the tool issues one site-scoped search per lens (journalism — government/public-record/filing sources, SEC/court/data.gov; tech — Ars Technica, TechCrunch, The Verge, Wired) against the default provider. Each result is scored: a negation/refutation cue in either the result's claimSignal (the single most claim-relevant snippet sentence) or its title → disagreeCount (checking the title too catches refutation language that lands in a headline but not the snippet); otherwise an empty claimSignal → silentCount (no sentence mentioned the title); any other non-empty claimSignal → agreeCount. The aggregate flag no_independent_corroboration fires when no result across all lenses agreed with any recommendation. Fail-open: a missing lens or provider error leaves corroborationSearches nil without failing the audit. claimSignal, agreeCount/disagreeCount/silentCount, and conflictOfInterest are all English-keyword heuristics (#390): on non-English result/bio text a miss means the heuristic didn't match, not that the signal is confirmed absent — read the underlying title/snippet/bio yourself for non-English sources.
Scope: up to 100 recommendations per call; any excess returns an error.

Annotations¶

ReadOnly: true · Idempotent: true · OpenWorld: true (queries live external sources for link liveness)

Cache¶

Link liveness checks are cached for 1 hour; domain reputation lookups use the embedded dataset (no network).

Tool 25: `clinical_search`¶

Search ClinicalTrials.gov — the NIH registry of 400K+ clinical studies — for evidence-based-medicine and systematic-review research. ClinicalTrials.gov is keyless, so this tool is always registered. Discovery + primary-source retrieval only — not medical advice.

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes*	—	Free-text across trial fields. *Provide at least one of `query`/`condition`/`intervention`/`sponsor`
`condition`	string	yes*	—	Disease/condition (e.g. `covid-19`)
`intervention`	string	yes*	—	Drug/device/treatment (e.g. `remdesivir`)
`sponsor`	string	yes*	—	Lead sponsor / funder
`status`	string	no	—	Recruitment status filter: `RECRUITING`, `COMPLETED`, `TERMINATED`, …
`num_results`	int	no	10	1–100
`provider`	string	no	—	Force a clinical-trials provider: `clinicaltrials`
`sessionId`	string	no	—	Record results as sources on a `sequential_search` session

Output Fields¶

Each trials[] item: nctId, title, status, phases (array), conditions (array), interventions (array), sponsor, startDate, hasResults (bool — whether results are posted), url (study page; scrape_page for the full registration), source. Plus query, resultCount, provider, hints (when empty), and trust (untrusted-external-content).

Behavior¶

Combine query/condition/intervention/sponsor/status to narrow the registry's structured facets; at least one is required.
Provider honoring: an explicit provider is used exclusively; otherwise the first configured provider answers. An error/empty returns a structured zero-result with hints (no silent fallback).
A bad request surfaces as a structured upstream error (the API returns text/plain errors, decoded as a message snippet); a 404/no-match is an empty result, never a panic.
Auth: keyless — ClinicalTrials.gov v2 needs no API key.

Annotations¶

ReadOnly: true · Idempotent: true · OpenWorld: true (queries the live ClinicalTrials.gov API)

Cache¶

TTL: 6 hours (only for non-empty results)

Tool 26: `local_search`¶

Search for physical places (restaurants, cafes, shops, services, points of interest) by local intent query. Backed by Brave's three-call local pipeline: web search with result_filter=locations to collect ephemeral location IDs, then local/pois for structured POI details, then local/descriptions for AI-generated descriptions (best-effort). Requires BRAVE_API_KEY; the tool is not registered when the key is absent. Location IDs are ephemeral — never persisted beyond the request lifecycle.

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	Local intent query (e.g. `'best coffee shops near downtown Seattle'`)
`near`	string	no	—	Free-text location bias (city, neighborhood, region). Used as the location anchor when no coordinates are given (Brave: sent as a location header, not appended to the query). Coordinates take precedence.
`latitude`	number	no	—	WGS-84 latitude (−90…90) of the search anchor. With `longitude`, takes precedence over `near`, anchors the place index, and distance-ranks results nearest-first.
`longitude`	number	no	—	WGS-84 longitude (−180…180). Must be paired with `latitude` to take effect.
`radius`	number	no	0	Distance filter in meters, applied only when `latitude`/`longitude` are set. Drops places farther than this from the anchor. 0 = no filter. Independent of `units`.
`country`	string	no	—	ISO 3166-1 alpha-2 country restriction (e.g. `US`, `GB`)
`units`	string	no	—	`metric` or `imperial` (display only)
`num_results`	int	no	5	1–20
`provider`	string	no	—	Force a local-search provider: `brave`
`sessionId`	string	no	—	Record results as sources on a `sequential_search` session

Output Fields¶

Each places[] item: id (ephemeral), name, address, lat, lon, phone, website (use scrape_page for the full site), categories (array), rating, ratingCount, hours (array, e.g. 'Thursday: 06:59-17:00'), description (absent when unavailable), source. Plus query, resultCount, provider, hints (when empty), and trust (untrusted-external-content).

Behavior¶

Provider honoring: an explicit provider is used exclusively; an unknown or unconfigured provider returns a structured error (no silent fallback).
Location anchoring (Brave): when latitude/longitude are supplied they are sent as X-Loc-Lat/X-Loc-Long headers on the step-1 locations call (header values are never logged) and take precedence over near; otherwise near/country populate the X-Loc-City/X-Loc-Country text-fallback headers. The query is not suffixed with the location. Steps 2/3 (local/pois, local/descriptions) send no location headers, matching Brave's reference client.
Distance ranking: with an anchor coordinate present, returned POIs are sorted nearest-first by haversine distance; radius (meters) additionally drops places beyond that distance. Coordinates are optional and validated at the boundary — latitude and longitude must be given together and within range, and radius must be non-negative, else the call returns a structured validation error.
The descriptions step is best-effort: a failure there does not fail the whole call — results are returned without descriptions.
Returns an empty places array (with hints) when no location results match; never panics.

Annotations¶

ReadOnly: true · Idempotent: true · OpenWorld: true (queries the live Brave Search API)

Cache¶

TTL: 6 hours (only for non-empty results)

Tool 27: `audit_bibliography`¶

Purpose¶

The corpus-level companion to verify_citation: audit a whole bibliography at once. Read a CSL-JSON / RIS / BibTeX document (what format_bibliography exports), an explicit list of references, or a sequential_search session's sources, and run the same trust checks over every entry — does it exist, is it retracted, does its link still resolve, and (optionally, per entry) does the source actually address the claim it's cited for. Built to catch fabricated, retracted, or mischaracterized citations across a full reference list (legal filings, papers, systematic reviews) before they ship. Composes the retraction enrichment, the link verifier, the academic searchers, and claim-evidence extraction; adds no new provider.

Input Schema¶

Field	Type	Required	Default	Constraints
`bibliography`	string	yes*	—	A bibliography document. *Provide one of `bibliography`/`entries`/`sessionId`
`format`	string	no	`auto`	`auto` (detect), `csl-json`, `ris`, or `bibtex`
`entries`	[]object	yes*	—	Explicit references (`url`, `title`, `author`, `site`, `date`, `doi`, `claim`). *One of `bibliography`/`entries`/`sessionId`
`sessionId`	string	yes*	—	Audit a `sequential_search` session's recorded sources. *One of `bibliography`/`entries`/`sessionId`

Precedence when more than one is supplied: entries → bibliography → sessionId. A per-entry claim is honored only in the explicit entries mode (a document/session carries no per-entry claims).

Output Schema¶

source (where entries came from: entries / bibliography:<format> / session), entryCount, summary ({total, retracted, deadLink, notFound, unchecked, mischaracterized, ok, claimCheckSkippedCount, thinContentCount} — claimCheckSkippedCount counts entries with no claim provided (#359: existence/retraction checked, mischaracterization not); thinContentCount counts entries whose claim check ran against thin content, <150 words (#358) — their claimSupport may not reflect the full document), and entries[] — per entry: index, title, doi, url, exists (bool), retractionStatus (when retracted/corrected), linkLive + httpStatus, archivedUrl (Wayback snapshot for a dead link), flags (retracted / dead_link / not_found / unchecked / mischaracterized; empty = clean), reason (a human-readable explanation for a flagged entry), and — when a claim was given — claim, claimSupport (addressed / partially_addressed / not_addressed / source_unavailable), claimEvidence (relevant source sentences), claimSourceUrl (the URL actually fetched), and — only when the fetched source was thin — claimContentWords (word count) and claimSparsityNote (#358: annotates claimSupport, never changes it). Plus checkedAt (RFC 3339 point-in-time stamp), the trust marker, warning (#359: present only when no entry in the corpus carried a claim — the audit checked existence and retraction only, across the whole corpus, not just one entry), and — only when the per-call cap is exceeded — skipped + skippedNote.

Behavior¶

Evidence, never a verdict (same contract as verify_citation). It reports what it found per entry and a corpus summary; the caller decides what to fix.
One pass, bounded. All entry URLs are checked in a single batched, concurrency-bounded link pass; DOI existence+retraction (one Crossref call each), academic existence lookups, and the optional per-entry claim fetch all run concurrently (bounded). A DOI is authoritative for existence+retraction; without one, existence is confirmed by a best-match academic title lookup.
Claim check (optional, #174). When an entry carries a claim, the source page is fetched (the live URL, or its Wayback snapshot when the live link is dead) and measured for topical overlap via transparent term coverage. claimSupport reports coverage, not a stance: addressed (strong overlap — claim-relevant sentences returned in claimEvidence for you to judge direction), partially_addressed (some overlap — evidence shown but not flagged; ambiguous, you judge), not_addressed (the source addresses none of the claim → the mischaracterized flag), or source_unavailable (no fetchable source). It never asserts "supports"/"refutes" — the extractor surfaces sentences, not direction — and the flag fires only on zero overlap of a fetched source, so a real-but-tangential source is never falsely accused (under-flagging is the safe direction). The check is lexical, not semantic (no model/embedding dependency): coverage is measured as the peak overlap within a sentence window (not across the whole page), so a narrow claim whose terms are merely scattered across a long, broad article scores low local coverage rather than a misleading partially_addressed. Borderline partial overlap is still shown as evidence for you to judge, not flagged. Because a source can share a claim's terms while contradicting it, a contrastSignal: true is added when a claim-relevant sentence carries a negation/refutation cue (e.g. "no significant", "did not", "no evidence", "contradicts") — a neutral "read this sentence yourself" heads-up so an addressed result is never mistaken for confirmation. The cue list holds only explicit negation/refutation terms, never bare discourse connectives like "however"/"although"/"whereas" (which contrast two things without opposing the claim and would false-positive on supporting sources). Off unless a claim is given (no added latency otherwise). claimEvidence and contrastSignal are English-keyword heuristics (#390): on non-English source text an empty/false result means the heuristic didn't match, not that the claim is confirmed unaddressed/unopposed — read the source yourself when it isn't English.
Flagging (deliberately distinguishes evidence of a problem from absence of evidence): retracted = the DOI/record is retracted (an expression-of-concern/correction is surfaced in retractionStatus but not flagged retracted); dead_link = a URL was checked and the server did not respond — 4xx-gone or network failure (a Wayback archivedUrl is attached when one exists); a linkLive:true result with a 403/429/503 httpStatus means the server is reachable but blocked the verifier — the resource exists and dead_link is not set; not_found = a DOI was looked up against Crossref and had no match — a possible fabrication; unchecked = the entry could not be corroborated by any check (no identifier, no live link) — absence of evidence, not evidence of absence (e.g. a book, a paywalled or offline source); mischaracterized = a claim was given and the fetched source does not address it. These are never conflated, and each carries a reason so a legitimate uncheckable source is never read as fake.
Capped at the first 200 entries per call; any overflow is reported in skipped/skippedNote (never silently dropped).
Session audits are scoped to the caller's own (tenant, user). Degrades gracefully when a resolver/scraper is unconfigured; never panics.

Annotations¶

ReadOnly: true · Idempotent: true · OpenWorld: true (queries live Crossref, the open web, and the Internet Archive)

Cache¶

Not cached (a point-in-time liveness/integrity audit).

Tool 28: `archive_source`¶

Purpose¶

Capture a fresh Internet Archive (Wayback Machine) snapshot of a URL via Save Page Now, so a source you intend to cite stays verifiable even if the page later changes or disappears. This is the trust suite's one write tool: the rest of the suite can tell you a link is dead and surface an existing snapshot (read-only); this one creates a new snapshot. It makes "stays honest" durable rather than point-in-time.

Input Schema¶

Field	Type	Required	Description
`url`	string	yes	The URL to capture a fresh snapshot of. Must be a public `http`/`https` URL (private/loopback hosts are refused).

Output Schema¶

requestedUrl (echo), snapshotUrl (the https://web.archive.org/web/<timestamp>/<url> snapshot; omitted when pending/unavailable), archivedAt (RFC 3339 — when a fresh capture was confirmed; present only on a fresh capture), captured (bool — true only for a fresh capture by this call), status (archived|existing|pending|unavailable), httpStatus (Save Page Now endpoint status), reason (why no fresh capture, for existing/pending/unavailable), pollUrl (Wayback wildcard URL to check manually once SPN's in-flight ingestion completes; present only when status:"pending" and no existing snapshot was found), source (web.archive.org Save Page Now), provenance, and the trust marker.

Behavior¶

Write tool, non-destructive. It creates a public archive entry; it never deletes or mutates existing data. Annotated ReadOnly:false, Destructive:false, Idempotent:true (Save Page Now dedups within its rate window).
Best-effort and honest. Save Page Now is rate-limited and slow — a fresh capture is not guaranteed. The tool retries with backoff within its ~25 s budget so a slow-but-successful first-time capture is confirmed in-call. When one can't be made, the tool falls back to the most recent existing snapshot and reports captured:false / status:"existing"; when nothing is confirmed it reports status:"pending" with a pollUrl to check once in-flight ingestion completes. It never errors on a slow/throttled capture.
Evidence, never a verdict. It returns the snapshot artifact + provenance; it does not judge the source.
SSRF-safe. The outbound request goes to the fixed web.archive.org host through the SSRF-safe client (every redirect hop IP-revalidated); the submitted URL is additionally validated and private/loopback hosts are refused before the request.
Optional credentials. Keyless Save Page Now works by default; setting IA_ACCESS_KEY + IA_SECRET_KEY authenticates the request for higher reliability (keys are never logged or echoed).
status:"unavailable" when no link verifier is configured (graceful, not an error).
Use verify_citation first to see whether a link is already dead or already archived.

Annotations¶

ReadOnly: false (write) · Destructive: false · Idempotent: true · OpenWorld: false (advisory; the call does reach the Internet Archive)

Cache¶

Not cached (each call is an explicit archive request).

Tool 29: `brand_research`¶

Purpose¶

Research a company's complete brand identity — colors, logos, typography, tone of voice, social handles, and W3C design tokens — from any domain or company name. Probes official brand portals and brand guideline pages; only returns high-confidence structured data found directly on those pages (empty fields = genuinely not found). When a brand portal is found, the fully rendered page text is stored as a resource (brand_portal_resource) that an AI agent can pass to read_resource to analyze colors, typography, and other details directly. When no brand portal is found, the suggestion field guides the AI agent to use scrape_page on the homepage instead.

Use brand_research when you need structured brand JSON. Use brand-guidelines (MCP Prompt) when you want LLM interpretation for a specific use case (landing page, email, video brief).

When to use vs. other tools¶

Use case	Tool
Get raw brand data (colors, logos, fonts as JSON)	`brand_research`
Generate brand-compliant content for a specific use case	`brand-guidelines` prompt
Scrape the brand homepage	`scrape_page`
Search for brand guidelines PDF	`search_and_scrape`

Input Schema¶

Field	Type	Required	Default	Description
`url`	string	no*	—	Domain or URL (e.g. `kaltura.com`, `https://kaltura.com`). Preferred when both are supplied.
`company_name`	string	no*	—	Company name used to resolve domain when `url` is omitted.
`depth`	string	no	`standard`	`quick` (meta only, ~1–2s), `standard` (adds brand-page probe, ~3–6s), `full` (adds web search, ~8–15s).
`include_design_tokens`	bool	no	false	When true, include W3C DTCG `design_tokens` object for Style Dictionary / Tokens Studio / Figma Variables.
`sessionId`	string	no	—	Link to a `sequential_search` session.

*At least one of url or company_name is required.

Output Schema¶

Field	Type	Description
`identity`	object	`name`, `domain`, `tagline`, `description`, `industry`, `founded`, `location`
`colors`	object	`primary`, `secondary`, `accent`, `background`, `text`, `surface`, `text_secondary` (hex strings) + `palette` array with `hex`, `name`, `role`, `brightness`
`logos`	object	`primary`, `dark`, `icon` (each: `url`, `format`, `width`, `height`) + `favicon`, `og_image`
`typography`	object	`heading`, `body`, `mono` (each: `family`, `weights`, `origin`, `origin_id`) + `google_fonts_url`, `scale`
`tone_of_voice`	object	`summary`, `attributes` array, `dos_and_donts` object
`social`	object	`twitter`, `linkedin`, `github`, `youtube`, `facebook`, `instagram` (URLs)
`sources`	array	Which tiers contributed: `name` (`homepage_meta`/`brand_page`/`web_search`), `url`, `fields`, `scrapeQuality` (`"thin"` when that page's content was below ~150 words (#358) — a content-volume signal, not an error; the source may still have contributed real fields)
`guidelines_url`	string	First discovered brand guidelines page URL. Candidate pages are classified with English soft-404/login-page and brand-page keyword matching (#390); a genuine non-English brand portal may be missed or misclassified — verify by reading `brand_portal_resource` yourself when the target site isn't English
`brand_portal_resource`	string	`research://artifact/{id}` URI — pass to `read_resource` to retrieve the full rendered brand portal text for AI analysis
`suggestion`	string	Guidance for the AI agent when no brand portal was found (recommends `scrape_page` on the homepage), OR when a portal URL was found but its content could not be extracted (recommends `scrape_page` on the portal URL)
`design_tokens`	object	W3C DTCG format (`$value`/`$type` per token) — only when `include_design_tokens: true`
`coverage`	object	`colors`, `logos`, `typography` — each `full`/`partial`/`none`/`extraction_blocked` (#358: a brand page was found but its content was too thin to read — a JS-SPA skeleton or bot-wall, distinct from `none` which means no page was found at all); `tone_of_voice` — `found`/`none`
`cache_age`	integer	Seconds since cache was written. `0` = live fetch. Cache TTL: 24 hours.
`trust`	string	Always `untrusted-external-content`

Behavior¶

Three-tier extraction pipeline. Tiers (2) homepage structured-data + meta and (4) brand-page probe (concurrent HEAD requests, 26 patterns: 5 subdomains + 21 paths) run concurrently via goroutines. Tier (5) web search (depth=full only) runs sequentially after they complete. Higher tiers never overwrite lower-tier values.
Authoritative-data-first. Only high-confidence data found explicitly on brand portal pages is returned. Empty fields mean genuinely not found — never inferred or guessed. No CSS heuristics.
Brand-page probe runs all 26 candidates concurrently, picks the highest-priority match (dedicated brand subdomains beat path matches), rejects redirects to the homepage or a different host. For dynamic SPA brand portals (e.g. Corebook.io), the page is rendered via browser tier and navigation links are extracted from the rendered DOM to find color/typography sub-pages.
Brand portal resource. When a brand portal is found, the full rendered page text (including any color sub-pages) is stored as an encrypted artifact (research://artifact/{id}, 30-min TTL) and returned in brand_portal_resource. Pass this URI to read_resource so an AI agent can analyze the raw rendered content for colors, typography, and other details.
Fallback suggestion. When no brand portal is found, suggestion is populated with guidance to use scrape_page on the homepage for fully rendered content analysis. When a portal URL was found but its content could not be extracted (e.g. a gated SPA), suggestion instead recommends scrape_page on the portal URL.
Tone of voice is only extracted from explicit brand guidelines pages — not inferrable from meta or CSS.

Annotations¶

ReadOnly: true · Destructive: false · Idempotent: true · OpenWorld: true

Cache¶

TTL: 24 hours. Key: SHA-256 of domain + depth. cache_age field shows seconds since last fetch.

Tool 30: `awesome_list_search`¶

Search the ecosyste.ms Awesome API for community-curated "awesome-*" lists on a GitHub topic — structured, filterable coverage of the awesome-list ecosystem beyond what the web_search awesome-lists lens offers via free-text search alone. ecosyste.ms is keyless, so this tool is always registered; an optional ECOSYSTEMS_EMAIL raises the caller's rate-limit tier via the "polite pool."

Input Schema¶

Field	Type	Required	Default	Constraints
`topic`	string	yes*	—	GitHub topic slug (e.g. `osint`, `go`, `machine-learning`). *Provide at least one of `topic`/`query`
`query`	string	yes*	—	Free-text fallback used when `topic` is empty or doesn't resolve to a known topic. *Provide at least one of `topic`/`query`
`min_stars`	int	no	—	Minimum GitHub stars on the list's repository
`min_projects`	int	no	—	Minimum number of curated entries in the list
`sort_by`	string	no	`stars`	`stars`, `projects`, or `updated`
`num_results`	int	no	10	1–100
`provider`	string	no	—	Force an awesome-list provider: `ecosystems`
`sessionId`	string	no	—	Record results as sources on a `sequential_search` session

Output Fields¶

Each lists[] item: name, fullName (owner/repo of the list's source repository), url (list page; scrape_page for the full curated list), description, projectsCount (curated-entry count), stars, topics (array), lastSyncedAt (when ecosyste.ms last synced this list from its repository), source. Plus query, resultCount, provider, hints (when empty), and trust (untrusted-external-content).

Behavior¶

Provide topic and/or query — at least one is required. query feeds the same underlying topic match when topic is empty.
The input is lowercased and hyphenated before matching ("Mental Health" → mental-health), since ecosyste.ms's topic matching is exact-string and case-sensitive with no normalization of its own.
If a multi-word input misses as one compound slug, each substantive word (2+ characters, stopwords like "a"/"of"/"the" skipped) is retried independently against the API and the hits are merged and deduped by repository — recovers cases like personal finance (no such compound slug exists, but finance does) without a caller having to know to split it themselves. A genuine single-word miss (e.g. parenting — the real upstream slug is parent) is not retried further; there's nothing left to split.
Archived source repositories are excluded from results.
min_stars/min_projects filter by the list repository's stars and curated-entry count; results are sorted (descending) by sort_by, default stars.
Provider honoring: an explicit provider is used exclusively; otherwise the first configured provider answers. An error/empty returns a structured zero-result with hints (no silent fallback).
A 404/no-match from the API is an empty result, never a panic.
Auth: keyless — ecosyste.ms needs no API key (5,000 req/hour anonymous rate limit). Set ECOSYSTEMS_EMAIL (falls back to OPENALEX_EMAIL) to join the "polite pool" and raise the limit; ECOSYSTEMS_API_KEY is also sent but only takes effect on ecosyste.ms's paid plans, not the free self-service keys from ecosyste.ms/login.

Annotations¶

ReadOnly: true · Idempotent: true · OpenWorld: true (queries the live ecosyste.ms API)

Cache¶

TTL: 6 hours (only for non-empty results)

Cross-Cutting Concerns¶

Timeouts¶

Scrape-tier timeouts are hardcoded in source; only HTTP server timeouts are env-configurable (see docs/DEPLOYMENT.md).

Operation	Default	Max
Search API call	10s	30s
Markdown negotiation	5s	10s
Stealth scrape (http client)	20s	—
HTML scrape (goquery)	15s	30s
Browser scrape (go-rod)	30s	60s
YouTube transcript	30s	60s
Document download	30s	60s
Total tool execution	60s	120s

Content Size Limits¶

Content	Limit
`scrape_page` content — default `max_length`	50 KB
`scrape_page` content — hard cap (`maxScrapeLength`, all modes incl. `raw`)	5 MB
`search_and_scrape` per-source content — default	50 KB
`search_and_scrape` combined content — default `total_max_length`	300 KB
Document download	50 MB default
YouTube transcript	100 KB

Token Estimation¶

Formula: len(content) / 4 (conservative, ~4 chars per token)
Size categories: small (<5K chars), medium (<20K), large (<50K), very_large (>=50K)

Cache Freshness Provenance¶

Cacheable tools attach a top-level MCP _meta block so a client can tell whether a result was served from cache and how stale it is. The fields (set in cachedResultWithMeta / freshResult, internal/tools/errors.go) are:

_meta rides on the MCP CallToolResult envelope — a sibling of content, not a key inside the content JSON body. A client reads it from the result's meta/_meta field, never by parsing the body string. The end-to-end roundtrip guard is TestWebSearchCacheMeta_PresentOnFreshAndCacheHit.

Field	Type	Meaning
`cached`	bool	`true` if served from cache, `false` if freshly fetched
`ageSeconds`	int	Age of the cached entry in seconds (`0` for fresh)
`maxAgeSeconds`	int	The entry's TTL in seconds
`freshness`	string	Human-readable freshness label (e.g. `fresh`)

Which tools emit _meta:

Tool	Fresh result	Cache hit
`web_search`	yes (`cached: false`)	yes (`cached: true`)
`image_search`, `news_search`, `academic_search`, `patent_search`, `scrape_page`	no	yes (`cached: true`)
`search_and_scrape`, `sequential_search`, `get_research_session`	no (not cached as a unit)	n/a

Routing Provenance (`_meta.routing`)¶

When multi-provider routing (SEARCH_ROUTING) is active, the search-family tools attach an operator-facing routing block to the same _meta channel (set in routingMeta / withRoutingMeta, internal/tools/errors.go; captured by the Router via search.RoutingTrace). It coexists with the cache block — withRoutingMeta merges into _meta without clobbering the freshness keys.

This is operator/debug data, not content. It is LLM-invisible (a sibling of content, never fed to the model) and never appears in the result body — the Router's job is to make providers interchangeable to the model. The drift guard TestRoutingMeta_PresentOnResultAndAbsentFromContent fails CI if any routing field leaks into LLM-facing content.

Field	Type	Meaning
`provider_used`	string	The provider that served the result (omitted on a cache hit)
`providers_attempted`	[]string	Providers tried in priority order, up to the one that served
`fallback`	bool	`true` when the served provider was not the first attempted
`fallback_reason`	string	Coarse enum: `circuit_open` or `primary_unavailable` (omitted unless a fallback occurred). No raw breaker counts or upstream error text.
`cache_hit`	bool	`true` when served from cache (provider attribution is then omitted — the cached blob's provenance is not this call's routing)
`latency_ms`	int	Server-side end-to-end latency for the call

The provider name is the disclosure boundary: no upstream URLs, credentials, or breaker internals appear. The block is omitted entirely when there is nothing to observe — a single-provider / no-routing deployment, or a non-routed capability. Routing applies to the Router-routed capabilities only (web / images / news / patents / academic); the synthesis (answer, structured_search), citation_graph, and structured-domain (filing_search, legal_search, econ_search, clinical_search) tools resolve a single provider directly and already name it in the result body's source/provider field — they have no fallback ladder to observe. The same routing summary is also recorded under audit.AuditEvent.Metadata["routing"].

For the aggregate, on-demand operator views (recent errors, live provider/breaker health) see the diagnostics:// MCP Resources and the HTTP-mode dashboard in docs/DEPLOYMENT.md.

MCP Resources¶

Read-only, on-demand views exposed as MCP Resources (not tools). Read with ReadResource(uri).

URI	Name	Description
`stats://tools`	Tool Statistics	Per-tool call count, latency, and error rate
`stats://sessions`	Active Sessions	Count of live research sessions
`stats://rate-limits`	Rate Limit Status	Per-tenant and global quota config + remaining
`stats://providers`	Configured Providers	Every configured provider by name and capability type
`lenses://catalog`	Search Lens Catalog	All available search lenses — name, description, domain count, and whether a dedicated Custom Search Engine is configured. Pass a `name` to `web_search`, `academic_search`, `news_search`, or `image_search` as the `lens` parameter to restrict results to authoritative sources for that domain.
`diagnostics://errors/recent`	Recent Errors	Bounded, newest-first ring of recent tool errors (redacted, tenant-scoped)
`diagnostics://health`	Provider Health	Live circuit-breaker state per provider; empty when multi-provider routing is not enabled
`research://artifact/{id}`	Research Artifact	Large-payload store for `scrape_page` (raw mode), `search_and_scrape`, and `research_export` results served via `resource_link`

Audit & Tenant Scope¶

Every tool call is logged through deps.Auditor.Log() as an audit.AuditEvent (internal/audit/logger.go) carrying tenant_id, user_id, request_id, tool_name, duration_ms, success, and an optional error_code (field names are the JSON tags on AuditEvent). Tenant and user identity are read from the request context (auth.TenantIDFromContext / auth.UserIDFromContext).

Privacy: the raw query text is attached to metadata.query only when AUDIT_INCLUDE_REQUEST_BODY is enabled (Auditor.IncludeRequestBody()); otherwise just metadata.query_length is recorded. All metadata string values and error strings pass through audit.MaskSecrets so credentials never persist. Cache keys and session keys are tenant-scoped, so one tenant cannot read another's cached or session data.

Unified Error Handling¶

All tools use a dual-format error response: a natural-language first line + a JSON block with machine-readable metadata:

Rate limited (google). Wait 60 seconds and retry, or try a different provider.

{"error":{"kind":"rate_limited","retryable":true,"retryAfterSeconds":60,"suggestedAction":"retry_after_delay","provider":"google"}}

Error kinds: rate_limited, auth_required, blocked, network, content_empty, browser_unavailable, config, upstream_unavailable. Each maps to a suggestedAction the LLM can branch on programmatically.

Full details: see docs/ERROR_HANDLING.md — covers the three-layer architecture, all error kinds and actions, and contributor patterns.

Tool Annotations (MCP Protocol)¶

Every tool declares annotations for client consumption (readOnlyAnnotations(idempotent, openWorld) for read tools, writeAnnotations(idempotent) for the three write tools (memory_save, workspace_contribute, archive_source) — all in internal/tools/registry.go). CI enforces tool↔doc consistency via TestAllToolsHaveAnnotations, TestToolsDocMatchesRegistry, TestOutputSchemaMatchesResponse, and TestToolDescriptionQuality (internal/tools/metadata_test.go) — including on docs-only PRs via the standalone docs-drift CI job. No tool is Destructive — deletion is the /admin/data erasure endpoint, never a tool flag (see docs/DEPLOYMENT.md).

Tool	ReadOnly	Idempotent	OpenWorld
web_search	true	true	true
scrape_page	true	true	true
search_and_scrape	true	true	true
image_search	true	true	true
news_search	true	true	true
academic_search	true	true	true
patent_search	true	true	true
sequential_search	true	false	false
get_research_session	true	true	false
answer	true	true	true
structured_search	true	true	true
citation_graph	true	true	true
research_export	true	true	false
format_bibliography	true	true	false
audit_bibliography	true	true	true
verify_citation	true	true	true
verify_recommendation	true	true	true
archive_source	false (write)	true	false
filing_search	true	true	true
legal_search	true	true	true
econ_search	true	true	true
clinical_search	true	true	true
local_search	true	true	true
get_my_analytics	true	true	false
memory_save	false (write)	false	false
memory_recall	true	true	false
workspace_contribute	false (write)	false	false
workspace_read	true	true	false
brand_research	true	true	true

Notes: sequential_search is non-idempotent because it writes session state to disk on every call. memory_save, workspace_contribute, and archive_source are the three write tools (ReadOnly:false). memory_save and workspace_contribute are non-idempotent (each call appends a new record); archive_source is idempotent (archiving the same URL twice is safe). OpenWorld:false marks tools that touch only local/server state (sessions, memory, analytics, workspaces, exports) rather than the open web. Destructive is uniformly false — no tool is annotated destructive.

Provider Resolution¶

When a provider field is set on any search tool: 1. If provider is in the SearchProviders/PatentProviders/AcademicProviders map → use it 2. If provider is known but not configured → error with env var hint 3. If provider is completely unknown → error listing all supported providers (via allSupportedProviders())

Source of truth for supported providers: search.SupportedProviders, search.SupportedPatentProviders, search.SupportedAcademicProviders in internal/search/provider.go.

Known Provider Behaviors (not bugs)¶

These are upstream behaviors we cannot control — they reflect how the underlying APIs work:

Provider	Behavior	Impact
SearchAPI	May return fewer results than `num_results` requested	Query has limited coverage in their index; not an error
Google (news)	`freshness=hour` may return articles 5-10 hours old	Google's "last hour" filter is approximate, not strict
Google (images)	`size=large` may return images as small as 600x600	Google's size thresholds differ from typical expectations
USPTO	Full-text search only (no field-qualified queries)	API rejects field syntax; results rely on relevance ranking
OpenAlex	`pdf_only` may return 0 results for common topics	Not all papers have PDF URLs indexed in their metadata
DuckDuckGo	Rate-limited aggressively from cloud/datacenter IPs	Works well from local/STDIO; may return 0 results from servers
DuckDuckGo	Images and News return empty results	HTML endpoint doesn't support these categories; Router falls through
HackerNews	`web_search` / `news_search` only (no Images); `dateRange` filter via Algolia `numericFilters`; `num_results` 1–100 (values outside that range reset to 10); no API key required (`SEARCH_PROVIDER=hackernews` or `provider: hackernews` per-call)	Algolia HN search index only; not a general-web index

These are not errors in web-researcher-mcp. The tool faithfully passes parameters to the upstream API and returns whatever the API provides.

MCP Prompts¶

MCP Prompts are LLM-facing instruction handlers — they orchestrate tool calls and synthesis for a specific use case. Unlike tools (which return structured JSON), prompts return a natural-language instruction that tells the LLM how to use tool outputs.

`brand-guidelines`¶

Research a company's brand identity and produce use-case-specific brand-compliant guidance. Calls brand_research, interprets the structured JSON, and produces actionable creative direction for the requested use case.

When to use vs. brand_research directly: Call brand_research when you want the raw structured JSON (colors as hex, logos as URLs, fonts as names). Invoke the brand-guidelines prompt when you want an LLM to interpret that JSON and produce formatted creative guidance for a specific task.

Arguments¶

Argument	Required	Default	Description
`company`	yes	—	Company name or domain (e.g. `kaltura.com` or `Kaltura`)
`use_case`	no	`full_guidelines`	`landing_page`, `email`, `social_post`, `video_brief`, `design_tokens`, `full_guidelines`
`depth`	no	`standard`	Passed to `brand_research`: `quick`, `standard`, `full`

Use cases¶

`use_case`	Produces
`landing_page`	Color palette table, typography spec, sample headline + subhead, component guidance
`email`	Email color spec, font stack, sample subject line + preheader, HTML inline-style snippet
`social_post`	Visual identity notes, three sample captions (Twitter/X, LinkedIn, Instagram), hashtag suggestions
`video_brief`	Motion graphics color spec, logo bug placement, voiceover tone direction, music mood descriptor
`design_tokens`	W3C DTCG JSON code block, token mapping table, gap analysis
`full_guidelines`	Comprehensive brand doc: identity, color system, logo usage, typography, tone of voice, coverage summary

`company-recon`¶

Multi-phase OSINT recon over a target company or domain. Orchestrates existing tools (web_search, scrape_page, search_and_scrape, news_search, filing_search, research_export) across up to 9 phases to produce a cited, confidence-tiered intelligence report. Uses the osint lens for web_search calls targeting OSINT data sources.

No new Go dependencies — all data comes from free, publicly accessible endpoints (crt.sh JSON API, Wayback CDX API, HackerTarget, PublicWWW, etc.).

Arguments¶

Argument	Required	Default	Description
`target`	yes	—	Company name, domain, or both — e.g. `"Acme Corp acme.com"`
`depth`	no	`standard`	`quick` (phases 1+6+8+9), `standard` (phases 1–4+6–9), `deep` (all 9 phases)
`focus`	no	(omit for balanced coverage)	`sales_intel`, `security`, `due_diligence`, `brand_protection` — adjusts emphasis in phase instructions

Phase map¶

Phase	Depth	Tools	Goal
1 — Company Profiling	all	`web_search`, `search_and_scrape`, `news_search`	Identity, leadership, recent news
2 — Certificate Transparency	standard+deep	`scrape_page` (crt.sh JSON API)	Subdomain discovery via CT logs
3 — DNS / Infrastructure	standard+deep	`web_search` (osint lens), `scrape_page` (HackerTarget)	IP blocks, ASN, DNS history
4 — Archive Mining	standard+deep	`scrape_page` (Wayback CDX API)	URL patterns: login, API, admin, JS bundles
5 — Code / Config Search	deep only	`web_search` (osint lens, github.com)	SDK usage, config leaks in public repos
6 — Web / Content Discovery	all	`search_and_scrape`, `web_search` (osint lens)	Exposed login surfaces, forgotten subdomains
7 — Analytics Correlation	standard+deep	`scrape_page` (HackerTarget), `web_search` (PublicWWW)	Co-deployed sites via UA/GTM tags
8 — Business Intelligence	all	`web_search`, `news_search`, `filing_search`	Customers, filings, partnerships
9 — Confidence Scoring + Report	all	`research_export`	Consolidated report with CONFIRMED/STRONG/MODERATE/WEAK tiers

Known limitations¶

GA4 analytics IDs (G-XXXXXX) cannot be reverse-correlated — only Universal Analytics (UA-XXXXXX) and Google Tag Manager (GTM-XXXXXX) IDs work via HackerTarget/PublicWWW reverse-analytics lookup.
Live JavaScript inspection requires a Playwright MCP (if available); the prompt falls back to static source-code search.
Censys and BuiltWith depth is limited without API keys — infrastructure data comes from web-searchable pages only.
GitHub Code Search gives higher recall than web_search on github.com; use it if separately available.
filing_search is only available when EDGAR_CONTACT_EMAIL is set — without it, the tool is not registered and Phase 8 SEC filing lookup should be skipped.

Tool Specifications¶

Tool Registration Pattern¶

Cache-Key Contract¶

Large-Payload Linking (resource_link)¶

Tool 1: web_search¶

Purpose¶

Input Schema¶

Output Schema¶

Behavior¶

Cache¶

Error Conditions¶

Tool 2: scrape_page¶

Purpose¶

Input Schema¶

Output Schema¶

Raw Mode¶

Scraping Strategy (Tiered Fallback)¶

Known SPA Domains (require headless browser)¶

Cache¶

Error Taxonomy (internal/scraper/errors.go)¶

Tool 3: search_and_scrape¶

Purpose¶

Input Schema¶

Output Schema¶

Behavior¶

Recommendations & components (additive)¶

Cache¶

Tool 4: image_search¶

Input Schema¶

Output Schema¶

Provider notes¶

Cache¶

Tool 5: news_search¶

Input Schema¶

Output Schema¶

Behavior¶

Provider notes¶

Cache¶

Tool 6: academic_search¶

Input Schema¶

Output Fields¶

Behavior¶

Academic Site Pool (web search fallback)¶

Cache¶

Tool 7: patent_search¶

Input Schema¶

Output Fields¶

Behavior¶

Cache¶

Tool 8: sequential_search¶

Purpose¶

Input Schema¶

Session Management¶

Output Schema¶

Iterative Depth (depth)¶

State Management¶

Tool 9: get_research_session¶

Input Schema¶

Behavior¶

Cross-call error patterns (#99)¶

Annotations¶

Error Conditions¶

Cache¶

Tool 10: get_my_analytics¶

Purpose¶

Input Schema¶

Output Schema¶

Behavior¶

Cache¶

Tool 11: memory_save¶

Purpose¶

Input Schema¶

Output Schema¶

Behavior¶

Cache¶

Tool 12: memory_recall¶

Purpose¶

Input Schema¶

Output Schema¶

Cache¶

Tool 1: `web_search`¶

Tool 2: `scrape_page`¶

Error Taxonomy (`internal/scraper/errors.go`)¶

Tool 3: `search_and_scrape`¶

Tool 4: `image_search`¶

Tool 5: `news_search`¶

Tool 6: `academic_search`¶

Tool 7: `patent_search`¶

Tool 8: `sequential_search`¶

Iterative Depth (`depth`)¶

Tool 9: `get_research_session`¶

Tool 10: `get_my_analytics`¶

Tool 11: `memory_save`¶

Tool 12: `memory_recall`¶

Tool 13: `workspace_contribute`¶

Tool 14: `workspace_read`¶

Tool 15: `answer`¶

Tool 16: `structured_search`¶

Tool 17: `citation_graph`¶

Tool 18: `research_export`¶

Tool 19: `format_bibliography`¶

Tool 20: `filing_search`¶

Tool 21: `legal_search`¶

Tool 22: `econ_search`¶

Tool 23: `verify_citation`¶

Tool 24: `verify_recommendation`¶