Tool Specifications¶
These tools let your AI assistant search the web, read pages, find academic papers, track multi-step research, and more — always returning real, verifiable sources. Below are the detailed schemas and behavioral contracts for each tool.
Note: Output schemas describe the JSON shape returned by each tool. See the corresponding
internal/tools/*.gofile for the implementation. Input schemas are auto-generated from structjsonschematags.
Tool Registration Pattern¶
Each tool follows the pattern in internal/tools/registry.go: a typed input struct with jsonschema tags (the SDK auto-generates JSON Schema from these) and a register* function that calls mcp.AddTool. See internal/tools/search.go for a representative example.
Cache-Key Contract¶
Every result-affecting parameter MUST be included in a tool's cache key. This single rule prevents an entire class of cache-collision bugs — two requests that would produce different results must never share a key (e.g. different providers, or a smaller max_length that would serve a later larger request a truncated body).
- Canonical implementations:
searchCacheKey(...)(internal/tools/search.go) for the search family — keys on the tool name plus every query parameter includingprovider— andscrapeCacheKey(url, mode, maxLength)(internal/tools/scrape.go) for scrapes. - Each key carries a version segment (e.g.
v2); bump it whenever the cached response shape changes so a post-upgrade cache hit can never serve a blob missing a newly-added field. - Enforcement:
internal/tools/cachekey_test.goguards today's parameters. When you add a tool or a result-affecting parameter, extend both the key and that test — the test only covers the params it knows about, so a new param can reintroduce the bug without failing any existing assertion.
Large-Payload Linking (resource_link)¶
The heaviest tools — scrape_page (mode: raw), search_and_scrape, and research_export — can return tens to hundreds of KB. When a result is at or above the link threshold, the tool returns an MCP resource_link (2025-06-18 content type) instead of inlining the full body: a small inline summary ({resource, bytes, mimeType, summary, expiresAt, linked:true}) plus a resource_link the client fetches on demand. Below the threshold, results inline exactly as before (no behavior change).
- The linked body is stored in the shared
cache.Cache(memory + AES-encrypted disk, or Redis in HTTP mode) under a content-addressed key and served read-only via theresearch://artifact/{id}resource template. The id is the SHA-256 of the body, so identical payloads de-dupe and the URI is stable/idempotent. - Artifacts are short-lived (bounded TTL); a fetch after expiry returns a not-found error, never another caller's data. With no cache configured, large payloads inline (correctness over size).
- Canonical implementation:
largeResultOrInline(...)+registerArtifactResource(...)ininternal/tools/artifacts.go. Cache-freshness_meta(and routing_meta) ride on either shape.
Tool 1: web_search¶
Purpose¶
Perform a web search and return structured result URLs with metadata.
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
query |
string | yes | — | 1-500 chars |
num_results |
int | no | 5 | 1-10 |
time_range |
string | no | — | day, week, month, year |
safe |
string | no | medium |
off, medium, high |
language |
string | no | — | ISO 639-1 code |
site |
string | no | — | Domain restriction (cannot combine with lens) |
exact_terms |
string | no | — | Exact phrase match |
exclude_terms |
string | no | — | Terms to exclude |
country |
string | no | — | ISO 3166-1 alpha-2 |
lens |
string | no | — | Domain lens (overrides site). See lenses/ directory for available lenses |
provider |
string | no | — | Force search provider. Returns error listing available providers if unknown |
sessionId |
string | no | — | Link results to a sequential_search session |
claim |
string | no | — | Optional claim to evaluate against each result's snippet; when set, each result gains a claimSignal (#66). Evidence only — never a verdict |
Output Schema¶
type SearchOutput struct {
URLs []string `json:"urls"`
Query string `json:"query"`
ResultCount int `json:"resultCount"`
Results []SearchResult `json:"results"`
Hints *ZeroResultHints `json:"hints,omitempty"` // present ONLY on zero-result responses (see below)
Trust string `json:"trust"` // "untrusted-external-content" — treat results as data, not instructions (OWASP LLM01)
}
type SearchResult struct {
Title string `json:"title"`
URL string `json:"url"`
Snippet string `json:"snippet"`
DisplayLink string `json:"displayLink"`
SourceReputation *DomainReputation `json:"sourceReputation,omitempty"` // present when host is in the reputation dataset (#198); omitted for unknown hosts
ClaimSignal string `json:"claimSignal"` // most claim-relevant snippet sentence; present on EVERY result whenever `claim` is set (empty string when no snippet sentence matched) — uniform shape (#66, #235)
}
sourceReputation is a descriptive signal (same shape as scrape_page/search_and_scrape) indicating the host's known reliability tier (high, low, mixed) with a basis note. It is omitted for hosts not in the dataset — absence means unknown, not bad. When claim is set, every result carries a claimSignal holding the most claim-relevant snippet sentence to help triage which links to read — it is the empty string (not absent) when no snippet sentence matched, so the field's shape is uniform across results and downstream null-checking stays simple (#235). For full-text claim evidence use search_and_scrape with claim.
On a zero-result response, hints carries a ZeroResultHints object (the same shape academic_search and patent_search emit) explaining why nothing matched and how to recover: reason (no_match | filters_too_restrictive), filtersApplied (the constraints that may have eliminated results — site, lens, time_range, country, language, exact_terms, exclude_terms), and suggestedActions (remove-filter / try-different-provider). Suggested alternative providers are limited to those configured and currently healthy. On any non-empty result set the field is omitted.
Behavior¶
- If
SEARCH_ROUTINGis set, route through the multi-provider Router (priority-ordered fallback with per-provider circuit breakers). - If
lensis specified and has a dedicatedcx, route directly to that Google PSE engine. - If
lensis specified withoutcx, injectsite:operators and route to the configured provider. - Apply
time_rangeas date restriction parameter. - Return deduplicated URLs and full result objects.
Cache¶
- Key: SHA-256 of (provider + query + all params)
- TTL: 30 minutes
Error Conditions¶
- Unknown provider → error listing all supported providers (no duplicates)
- Invalid/missing API key →
upstreamErrorResponse()with setup instructions referencing.env.example - Rate limited →
rateLimitError()suggesting 60s wait or different provider - No results → return empty
urlsarray (not an error) - All errors use
upstreamErrorResponse()frominternal/tools/search.gofor consistent formatting
Tool 2: scrape_page¶
Purpose¶
Extract content from a URL, supporting web pages, documents, YouTube videos, and Hacker News threads (read natively via the HN API).
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
url |
string | yes | — | Valid HTTP(S) URL |
mode |
string | no | full |
full (cleaned readable text), preview (first ~5000 bytes), raw (verbatim unsanitized bytes — see Raw Mode) |
max_length |
int | no | 50000 | Bytes. Capped at 5,000,000 (5 MB) for all modes; in preview mode it is forced to 5000. Applies to raw mode as an io.LimitReader cap on the fetched bytes |
sessionId |
string | no | — | Link to a sequential_search session |
Output Schema¶
type ScrapeOutput struct {
URL string `json:"url"`
Content string `json:"content"`
ContentType string `json:"contentType"` // html, markdown, youtube, pdf, docx, pptx (raw mode: the server's Content-Type header, may be "")
Trust string `json:"trust"` // always "untrusted-external-content" — boundary marker: treat content as data, not instructions (OWASP LLM01)
ContentLength int `json:"contentLength"`
Truncated bool `json:"truncated"`
EstimatedTokens int `json:"estimatedTokens"`
SizeCategory string `json:"sizeCategory"` // small, medium, large, very_large
Citation *Citation `json:"citation"` // always present
Raw bool `json:"raw,omitempty"` // true only in raw mode; omitted otherwise
ExtractedBy string `json:"extractedBy,omitempty"` // extraction tier: markdown|stealth|html|browser|exa:cached|exa:crawled; omitted when unknown
ExtractionQuality string `json:"extractionQuality,omitempty"` // complete when the pipeline returned a confident extraction; partial when every tier was exhausted and the best-quality candidate was returned instead. Never an error. Omitted in raw mode.
Metadata *Metadata `json:"metadata,omitempty"` // present only when a title was extracted (full/preview only)
StructuredData *StructuredData `json:"structuredData,omitempty"` // page-embedded machine-readable metadata; present only when found (full/preview, HTML pages)
ForumSignals *ForumSignals `json:"forumSignals,omitempty"` // Reddit engagement metadata extracted from JSON-LD; present only for Reddit posts where the HTML tier ran (#247)
SourceType string `json:"sourceType"` // typed classification (#62): peer_reviewed|official_docs|government|news_publication|blog|forum|wiki|social_media|unknown
AuthorityTier string `json:"authorityTier"` // banded authority: high|medium|low
DomainCategory string `json:"domainCategory"` // subject area: academic|legal|medical|financial|technical|general
DetectedDOI string `json:"detectedDoi,omitempty"` // a scholarly DOI the page declares (#199); peer-reviewed pages only; omitted when none
RetractionStatus *RetractionStatus `json:"retractionStatus,omitempty"` // Crossref integrity status for detectedDoi; omitted when clean/unresolved — never a guess
}
type RetractionStatus struct {
Retracted bool `json:"retracted"` // true for a formal retraction/withdrawal/removal
Kind string `json:"kind"` // retraction | expression_of_concern | correction
Date string `json:"date,omitempty"` // notice date (YYYY-MM-DD) when supplied
NoticeDOI string `json:"noticeDoi,omitempty"` // DOI of the retraction/correction notice
Source string `json:"source,omitempty"` // provenance: retraction-watch | publisher
}
type Metadata struct {
Title string `json:"title"`
Author string `json:"author"`
}
type ForumSignals struct {
Platform string `json:"platform"` // Forum platform, e.g. "reddit"
Upvotes int `json:"upvotes"` // Vote count from JSON-LD interactionStatistic
Comments int `json:"comments"` // Comment count
DatePublished string `json:"datePublished,omitempty"` // ISO 8601 publish date when available
AuthorName string `json:"authorName,omitempty"` // Original poster name when available
CredibilityNote string `json:"credibilityNote,omitempty"` // Contextual note, e.g. vote-manipulation risk when Upvotes < 20
}
type StructuredData struct {
JSONLD []json.RawMessage `json:"jsonLd,omitempty"` // each <script type="application/ld+json"> block, verbatim
OpenGraph map[string]string `json:"openGraph,omitempty"` // og:* and article:* meta, keys keep their prefix
Citation map[string]string `json:"citation,omitempty"` // Highwire <meta name="citation_*"> tags
}
type Citation struct {
URL string `json:"url"`
AccessedDate string `json:"accessedDate"`
Metadata CitationMetadata `json:"metadata"`
Formatted CitationFormats `json:"formatted"`
}
type CitationMetadata struct {
Title string `json:"title"`
Author string `json:"author"`
Site string `json:"site"`
Date string `json:"date"`
}
type CitationFormats struct {
APA string `json:"apa"`
MLA string `json:"mla"`
}
On a cache hit, the result also carries a top-level _meta block with cache-freshness provenance (cached: true, ageSeconds, maxAgeSeconds, freshness) — see Cache Freshness Provenance. Freshly fetched scrapes have no _meta.
In raw mode the output additionally carries "raw": true, and contentType is the server's real Content-Type header (it may be empty). No metadata block is emitted.
Tables in content (#48). HTML <table> elements are rendered as GitHub-flavored markdown pipe tables inside content (header row + --- separator + data rows), preserving row/column structure instead of flattening cells into disconnected fragments. Pipe characters in cells are escaped and multi-line cells are collapsed to a single row. Layout, malformed, single-column, and nested tables degrade gracefully to plain text — never an error, never a panic.
Forum engagement signals (#247). For Reddit posts where the HTML extraction tier ran, the response carries a forumSignals object: platform ("reddit"), upvotes (vote count from JSON-LD interactionStatistic), comments (comment count), datePublished (ISO 8601), authorName (original poster), and credibilityNote (set when upvotes < 20, noting vote-manipulation risk). The field is omitted entirely for non-Reddit URLs, raw mode, and any non-HTML extraction tier (markdown, browser, document, YouTube, Twitter) — never null, only present-or-absent.
Structured data (#46). When the page embeds machine-readable metadata, the response carries a structuredData object alongside content: jsonLd (each <script type="application/ld+json"> block, kept verbatim — invalid JSON is skipped, never failing the scrape), openGraph (og:*/article:* meta, keys keep their prefix), and citation (Highwire citation_* meta — DOI, authors, journal). The whole object is omitted when no such markup is present, and each sub-field is omitted when empty. It is produced by the HTML-extraction tiers only (absent for raw mode, PDFs, YouTube, and markdown-tier results), is independently size-bounded so a pathological page cannot blow the response budget, and is untrusted external data under the same trust boundary as content.
Extraction provenance (extractedBy). When known, the response names the tier that produced the content: markdown, stealth, html, browser, or — for the paid Exa fallback — exa:cached / exa:crawled. It lets a caller see whether content came from a free local tier or the metered Exa /contents API (Tier 5, present only when EXA_API_KEY is set). Omitted when unknown (e.g. document/YouTube routes).
Typed source classification (#62). Every scrape response (full and raw) carries three categorical fields alongside the numeric content: sourceType (the kind of source — derived from Schema.org @type / Highwire citation_* meta when present, else a domain heuristic, else unknown), authorityTier (high/medium/low, a banding of the internal authority score), and domainCategory (academic/legal/medical/financial/technical/general, from a domain heuristic). They let the model hedge in natural language by source type. They are best-effort hints derived from untrusted page data — treat them as signals, not guarantees. (In raw mode, with no structured-data extraction, sourceType falls back to the host heuristic.)
Scholarly DOI + integrity status (#199). When a page classifies as peer_reviewed or sits on a known academic-journal host (the latter so detection still engages when an extraction tier strips the citation metadata, e.g. the cached-text fallback), the response surfaces detectedDoi — the DOI the page declares, read (in descending order of authority) from its Highwire citation_doi <head> metadata, then a DOI embedded in the request URL path itself (the publisher's canonical article identifier, e.g. nejm.org/doi/full/10.1056/… — present even on extraction tiers that strip the citation metadata, such as the cached-text fallback), then the first few KB of the cleaned text (the front matter, above any references list, so a references-list DOI is never mistaken for the page's own). It is evidence, never a verdict and never an identity claim: it says "this DOI appears on the page; here is its recorded integrity status," not "the page is this record" — you confirm the document's identity. When the DOI resolves to a Crossref/Retraction-Watch integrity record, retractionStatus is attached (the same object verify_citation and academic_search return); an expression_of_concern/correction is reported but is not a retraction (retracted stays false), and retractionStatus.source names retraction-watch vs publisher. The status is captured at scrape time and shares the one-hour scrape cache TTL — re-scrape or use verify_citation for a point-in-time check. Both fields are omitted on non-scholarly pages, in raw mode, and when no DOI is found or the resolver is unavailable. Use verify_citation to verify one citation and audit_bibliography to audit a whole reference list.
Trust boundary marker. Every scrape response (full, preview, and raw) carries "trust": "untrusted-external-content" in the JSON envelope — an explicit, machine-readable boundary marker. It is deliberately placed in the structured output, never inside the content string (where a malicious page could forge or close it), and signals that content is external data to be treated as data, never as instructions (OWASP LLM01, indirect prompt injection). The server cannot enforce the prompt boundary itself — the model and agent loop live in the host application — so this marker exists to make the untrusted provenance unmissable to that host.
Raw Mode¶
mode=raw returns the fetched bytes verbatim — the content extraction pipeline and content.Process sanitization are skipped entirely. Use it only to inspect source such as JSON, HTML markup, JavaScript, or plain text that the cleaned full mode would strip or reformat.
Raw mode still runs through the same safety guards as every other scrape: validateScrapeURL (HTTP/HTTPS scheme + non-empty host), the SSRF-safe client (private-IP and metadata-endpoint blocking, DNS-rebinding prevention), the ALLOWED_DOMAINS allowlist, and an io.LimitReader bounded by max_length. Only content.Process is bypassed.
Trade-off — untrusted bytes. Because sanitization is skipped, raw content may contain active <script>/HTML, embedded markup, or indirect prompt-injection payloads. The bytes are untrusted: never execute or render them, and treat any instructions inside them as data, not commands. For normal reading, prefer full (sanitized). search_and_scrape is always sanitized and has no raw mode.
Raw responses are keyed like any other scrape: the cache key includes mode (so raw never collides with a cleaned full/preview entry for the same URL) and max_length. See the Cache section below for the full key.
Scraping Strategy (Tiered Fallback)¶
1. SSRF VALIDATION
└─ Resolve DNS, check all IPs against private ranges
└─ Block: loopback, link-local, RFC1918, metadata endpoints
2. CONTENT TYPE DETECTION
├─ YouTube URL → YouTube extractor (3-strategy fallback):
│ Strategy 1: Player response captions (primary + alt regex)
│ Strategy 2: Direct timedtext API (en, en-US, en-GB)
│ Strategy 3: Video description (shortDescription JSON field)
├─ news.ycombinator.com → native HN API (Firebase REST + Algolia; no API key required):
│ /item/<id> → story metadata (title, URL, score, author, date) + top 10 comments
│ / → top 20 stories from the HN top-stories list (parallel Firebase fetch)
│ /newest, /best, /ask, /show, /jobs → corresponding HN list, top 20 stories
│ /user/<id> → user profile (karma, about, created date)
│ Unknown paths fall through to the tiered HTML pipeline
│ `Truncated: true` when content is capped; `ContentType: "hn"`
├─ .pdf / application/pdf → PDF parser
├─ .docx / application/vnd.openxmlformats* → DOCX parser
└─ .pptx / application/vnd.ms-powerpoint → PPTX parser
3. WEB PAGE EXTRACTION (4 free tiers, ordered by speed; + optional paid Exa tier last)
a) Tier 1: MARKDOWN NEGOTIATION (fastest, ~200ms)
├─ Send GET with Accept: text/markdown
├─ 5-second timeout
├─ Verify response is actually markdown (heuristic check)
└─ If content-type mismatch or too short → next tier
b) Tier 2: STEALTH HTTP CLIENT (fast, ~300ms)
├─ Browser-like TLS fingerprint (TLS 1.2+, HTTP/2)
├─ Full Chrome 131 headers (User-Agent, Sec-Ch-Ua, Sec-Fetch-*)
├─ Parse with goquery (article > [role=main] > main > body)
├─ Remove: script, style, nav, footer, aside, ads, popups
├─ SSRF protection via safe dialer when AllowPrivateIPs=false
└─ If below 100-char threshold → next tier
c) Tier 3: HTML EXTRACTION via goquery (standard, ~500ms)
├─ Fetch page with standard Accept header
├─ Parse with goquery
├─ Extract: article > main > body (priority order)
├─ Remove: script, style, nav, footer, aside, ads
├─ Minimum content: 100 bytes, 10% meaningful text ratio
└─ If below threshold → next tier
d) Tier 4: HEADLESS BROWSER via go-rod + stealth (slow, ~5s)
├─ Browser pool with lazy init + singleton pattern
├─ go-rod/stealth plugin (navigator spoofing, WebGL masking)
├─ Used for: Known SPA domains, JS-rendered content, bot challenges
├─ Wait for: page stability (2s for SPA domains, 500ms otherwise) OR 30s timeout
├─ Extract: rendered DOM via JavaScript evaluation
└─ Graceful cleanup via Pipeline.Close()
e) Tier 5: EXA /contents (PAID, opt-in, last resort) — only when EXA_API_KEY is set
├─ Neural extractor: POST https://api.exa.ai/contents (x-api-key auth)
├─ Runs ONLY after every free tier above failed to extract >100 bytes,
│ so the common path never incurs Exa cost
├─ Recovers bot-blocked / JS-heavy pages the local tiers cannot
└─ Records provenance into extractedBy: "exa:cached" (served from Exa's
cache) or "exa:crawled" (freshly fetched by Exa)
4. CONTENT PROCESSING
├─ Sanitize: strip hidden text, zero-width chars, dangerous patterns
├─ Truncate: at paragraph/sentence boundary if > max_length
├─ Estimate tokens: length / 4
└─ Extract citation: from <meta> tags, URL, response headers
Known SPA Domains (require headless browser)¶
- go.dev, pkg.go.dev
- patents.google.com, scholar.google.com, news.google.com
- trends.google.com, youtube.com
- linkedin.com, facebook.com, instagram.com
- medium.com, dev.to
Note: twitter.com and x.com are not in the SPA list — they use a dedicated FxTwitter API path, not the browser tier.
Cache¶
- Key: SHA-256 of (
url+mode+max_length) —max_lengthis part of the key so a larger request never serves a shorter cached body - TTL: 1 hour
Error Taxonomy (internal/scraper/errors.go)¶
All scrape errors are typed as ScrapeError{Kind, Message, Cause, URL, Tier}. The scrapeErrorResponse() function in internal/tools/scrape.go maps each kind to an actionable LLM-facing message:
| ErrorKind | Retryable | Trigger | LLM Message (verbatim shape from scrape.go) |
|---|---|---|---|
ErrValidation |
no (permanent) | Unsupported scheme, empty host, SSRF / private-IP / blocked-hostname denial, domain allowlist denial | "URL rejected for {url}: {detail}. Provide a valid public http(s) URL." |
ErrNetwork |
yes | DNS failure, timeout, connection refused, TLS | "Network error on {url}: {detail}. Check connectivity." |
ErrBlocked |
no (remote refusal) | HTTP 403, remote bot detection / JS-wall interstitial | "Blocked: {url} uses bot detection. Try an alternative source — its content can't be read directly." |
ErrNotFound |
no (dead link) | HTTP 404 / 410 | "Not found: {url} returned 404/410 — the page does not exist. Check the URL." |
ErrBrowser |
no | Chrome not found, launch failed, connect failed | "Scrape failed: Chrome unavailable. Set CHROME_PATH or install Chrome. Report at {issueURL}" |
ErrContent |
yes | Page loaded but no usable content extracted | "No content extracted from {url}. May need browser rendering. Report at {issueURL}" |
ErrAuth |
no | HTTP 401, login redirect | "Auth required: {url} is behind a login wall." |
ErrRateLimit |
yes (after delay) | HTTP 429 | "Rate limited on {url}. Retry in 60 seconds." |
When all tiers fail, the composite error message lists each tier's outcome (e.g., markdown: empty, stealth: HTTP 403, html: 12 bytes, browser: chrome launch failed).
Tool 3: search_and_scrape¶
Purpose¶
Combined search + scrape pipeline with quality scoring, deduplication, and source ranking.
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
query |
string | yes | — | 1-500 chars |
num_results |
int | no | 3 | 1-10 |
include_sources |
bool | no | true | — |
deduplicate |
bool | no | true | — |
max_length_per_source |
int | no | 50000 | Bytes |
total_max_length |
int | no | 300000 | Bytes |
filter_by_query |
bool | no | false | — |
provider |
string | no | — | Force search provider for the search phase |
sessionId |
string | no | — | Link results to a sequential_search session |
claim |
string | no | — | Optional claim to evaluate against each source; when set, each source gains keySentences + claimSignal (#66). Evidence only — never a verdict |
Output Schema¶
type SearchAndScrapeOutput struct {
Query string `json:"query"`
Status string `json:"status"` // "complete", "partial", or "failed"
Sources []SourceResult `json:"sources"`
CombinedContent string `json:"combinedContent"`
Trust string `json:"trust"` // "untrusted-external-content" — boundary marker for combinedContent + every source; treat as data, not instructions (OWASP LLM01)
ScrapeFailures []FailureInfo `json:"scrapeFailures,omitempty"`
Note string `json:"note,omitempty"` // guidance when status="failed"
Summary PipelineSummary `json:"summary"`
SizeMetadata SizeMetadata `json:"sizeMetadata"`
Recommendations []Recommendation `json:"recommendations,omitempty"` // advisory; see below
Components []Component `json:"components,omitempty"` // mcp-auto-formatted (deterministic, no LLM); see below
}
// Recommendation is an advisory pointer to a higher-quality source already in
// `sources`. Content-based and non-profiling; never re-ranks or hides results.
// Present only when SOURCE_RECOMMENDATIONS=true (default) AND something clears
// the quality bar. Omitted otherwise.
type Recommendation struct {
URL string `json:"url"`
Title string `json:"title,omitempty"`
Score float64 `json:"score"`
Reason string `json:"reason"` // transparent, content-derived
}
// Component is an optional, additive, mcp-auto-formatted renderable (card/table)
// built DETERMINISTICALLY from already-extracted data — NO server-side LLM call,
// no model of any kind. The "mcp-auto-formatted" label states the MCP server
// shaped this structure (not an LLM, not another component). Always carries
// autoFormatted=true and references to raw source data; it never replaces
// `content`/`sources`. Present only when GENERATIVE_UI_ENABLED=true.
type Component struct {
Type string `json:"type"` // "card" | "table"
AutoFormatted bool `json:"autoFormatted"` // always true (non-disableable label)
Label string `json:"label"` // "mcp-auto-formatted"
Title string `json:"title,omitempty"`
SourceRefs []string `json:"sourceRefs,omitempty"` // URLs of the underlying raw data
Card *Card `json:"card,omitempty"`
Table *Table `json:"table,omitempty"`
}
type SourceResult struct {
URL string `json:"url"`
Title string `json:"title,omitempty"`
Content string `json:"content"`
ContentType string `json:"contentType"`
Trust string `json:"trust"` // "untrusted-external-content" (see top-level Trust)
Scores *QualityScore `json:"scores,omitempty"`
SourceType string `json:"sourceType,omitempty"` // typed classification (#62): peer_reviewed|official_docs|government|news_publication|blog|forum|wiki|social_media|unknown
AuthorityTier string `json:"authorityTier,omitempty"` // high|medium|low
DomainCategory string `json:"domainCategory,omitempty"` // academic|legal|medical|financial|technical|general
ClaimSignal string `json:"claimSignal,omitempty"` // strongest claim-relevant sentence; present only when `claim` is set and matched (#66)
KeySentences []string `json:"keySentences,omitempty"` // top claim-relevant sentences in document order; present only with `claim`
}
type FailureInfo struct {
URL string `json:"url"`
Kind string `json:"kind,omitempty"` // error category (blocked, auth_required, etc.)
Reason string `json:"reason"`
Retryable bool `json:"retryable"`
SuggestedAction string `json:"suggestedAction,omitempty"` // recovery hint
}
type QualityScore struct {
Overall float64 `json:"overall"`
Relevance float64 `json:"relevance"`
Freshness float64 `json:"freshness"`
Authority float64 `json:"authority"`
ContentQuality float64 `json:"contentQuality"`
}
type PipelineSummary struct {
URLsSearched int `json:"urlsSearched"`
URLsScraped int `json:"urlsScraped"`
URLsFailed int `json:"urlsFailed"`
ProcessingTimeMs int `json:"processingTimeMs"`
}
Behavior¶
- Execute search (via configured provider)
- Scrape all result URLs in parallel (bounded concurrency: 5)
- If
deduplicate: paragraph-level djb2 hashing, drop blocks whose hash exactly matches one already seen (exact-match dedup, not fuzzy similarity) - Score and rank sources by quality (weighted: relevance 35%, freshness 20%, authority 25%, content 20%)
- If
filter_by_query: extract keywords, remove sources below relevance threshold - Combine content, truncate to
total_max_length - Return structured result with scores and metadata
- Optionally append
recommendations(advisory, content-based;SOURCE_RECOMMENDATIONS, default on) andcomponents(mcp-auto-formattedrenderables, deterministic — no LLM;GENERATIVE_UI_ENABLED, default off) — both derived purely from the quality scores already computed, with no extra scoring pass and no model call
Recommendations & components (additive)¶
recommendationssurface the highest-quality related sources from the current result set using the transparent quality signals (authority, relevance, freshness, content). They are advisory only —sourcesordering is never changed and the caller can ignore them. Strictly content-based: no user-behavior inputs, no profiling. Toggle withSOURCE_RECOMMENDATIONS(defaulttrue). Behavior-based/personalized ranking is explicitly out of scope.componentsare optional renderable structures (source cards, a quality-comparison table) assembled deterministically from data already extracted — there is no server-side LLM call and no model of any kind. Every component is labelledautoFormatted: true/"mcp-auto-formatted"(stating the MCP server shaped it, not an LLM) and references the raw source URLs, so nothing is hidden or unverifiable. Off by default (GENERATIVE_UI_ENABLED=false); when off, output is byte-for-byte unchanged. The rawcontent/sourcesare always present regardless.
Cache¶
- NOT cached as a whole (composed of cached sub-operations)
- Individual search and scrape results are cached per their own TTLs
Tool 4: image_search¶
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
query |
string | yes | — | 1-500 chars |
num_results |
int | no | 5 | 1-200 (Brave up to 200; Google up to 10) |
size |
string | no | — | huge, icon, large, medium, small, xlarge, xxlarge. Google/SearchAPI only — Brave ignores it. |
type |
string | no | — | clipart, face, lineart, stock, photo, animated. Google/SearchAPI only — Brave ignores it. |
color_type |
string | no | — | color, gray, mono, trans. Google/SearchAPI only — Brave ignores it. |
dominant_color |
string | no | — | black, blue, brown, gray, green, orange, pink, purple, red, teal, white, yellow. Google/SearchAPI only — Brave ignores it. |
file_type |
string | no | — | jpg, gif, png, bmp, svg, webp. Google/SearchAPI only — Brave ignores it. |
safe |
string | no | medium |
off, medium, high. On Brave images only off and strict apply (any non-off maps to strict). |
country |
string | no | — | ISO 3166-1 alpha-2 (e.g. us, gb). Honored by Brave and Google. |
language |
string | no | — | BCP 47 / 2-letter code (e.g. en, de). Honored by Brave (search_lang) and Google (lr). |
provider |
string | no | — | Force search provider |
Output Schema¶
type ImageSearchOutput struct {
Images []ImageResult `json:"images"`
Query string `json:"query"`
ResultCount int `json:"resultCount"`
Trust string `json:"trust"` // "untrusted-external-content"
}
type ImageResult struct {
Title string `json:"title"`
Link string `json:"link"`
ThumbnailLink string `json:"thumbnailLink,omitempty"`
DisplayLink string `json:"displayLink"`
ContextLink string `json:"contextLink,omitempty"`
Width int `json:"width,omitempty"`
Height int `json:"height,omitempty"`
FileSize string `json:"fileSize,omitempty"` // optional, provider-dependent; omitted when the provider does not report it
}
Provider notes¶
size,type,color_type,dominant_color, andfile_typeare Google/SearchAPI-only filters — they are not documented Brave image params, so the Brave adapter never sends them (Brave would silently drop them).country,language, andsafeare honored across providers. Thesizebucket is a hint the provider applies loosely — returned dimensions may not strictly match the requested bucket; usewidth/heightto filter precisely when exact sizing matters.fileSize,contextLink,width, andheightare optional and provider-dependent — each is emitted only when the configured provider reports it and is omitted (never fabricated) otherwise. No currently-configured provider populatesfileSize, so treat it as reserved/best-effort.
Cache¶
- Key: SHA-256 of (query + all filter params)
- TTL: 30 minutes
Tool 5: news_search¶
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
query |
string | yes | — | 1-500 chars |
num_results |
int | no | 5 | 1-50 (Brave up to 50; Google up to 10) |
time_range |
string | no | week |
hour, day, week, month, year |
sort_by |
string | no | relevance |
relevance, date. Google only — Brave news has no sort param and ignores it. |
news_source |
string | no | — | Domain filter (e.g. reuters.com). Google only — Brave news has no source filter and ignores it. |
country |
string | no | — | ISO 3166-1 alpha-2 (e.g. us, gb). Honored by Brave news. |
language |
string | no | — | BCP 47 / 2-letter code (e.g. en, de). Honored by Brave news (search_lang). |
safe |
string | no | — | SafeSearch level: off, moderate, strict. Honored by Brave news. |
provider |
string | no | — | Force search provider |
sessionId |
string | no | — | Link results to a sequential_search session |
Output Schema¶
type NewsSearchOutput struct {
Articles []NewsArticle `json:"articles"`
Query string `json:"query"`
ResultCount int `json:"resultCount"`
Hints *ZeroResultHints `json:"hints,omitempty"` // present ONLY on zero-result responses (see below)
Trust string `json:"trust"` // "untrusted-external-content"
}
type NewsArticle struct {
Title string `json:"title"`
URL string `json:"url"`
Source string `json:"source"`
PublishedAt string `json:"publishedAt,omitempty"` // optional, provider-dependent; ISO-8601 (RFC3339 UTC) when present
Snippet string `json:"snippet"`
}
On a zero-result response, hints carries the same ZeroResultHints object as web_search/academic_search/patent_search. The active freshness window (default week) and any news_source are surfaced in filtersApplied, since an over-narrow recency window is the most common reason news returns nothing; suggested alternative providers are limited to those configured and healthy. Omitted on any non-empty result set.
Behavior¶
- Route to configured search provider's news endpoint.
- Apply
time_rangeas date restriction. - If
news_sourcespecified, add as domain filter (Google only). - Sort by
sort_by:relevance(default) uses the provider's native ranking;daterequests newest-first ordering (Google only). - Return deduplicated articles.
Provider notes¶
sort_byandnews_sourceare Google-only controls — Brave's news API has no sort or single-source parameter, so the Brave adapter never sends them (the schema descriptions mark them provider-conditional rather than dropping the fields, since Google genuinely honors them).country,language, andsafeare honored by Brave news.publishedAtis optional and provider-dependent: populated when the provider exposes a publish timestamp (Google CSE via page metadata; Brave/Exa/Serper/SearchAPI/SearXNG/Tavily natively), omitted (not fabricated) when the provider supplies none — so treat it as best-effort. When present it is always normalized to ISO-8601 (RFC3339 UTC) regardless of the provider's raw format (RFC1123, relative ages like "3 days ago"/"2h", or bare dates), so values sort and compare consistently across providers; an unparseable timestamp is dropped rather than passed through.sort_by=datemaps to Google's date-sort control; exact ordering andtime_range=hourgranularity depend on the provider's index and may be approximate. News providers may also surface high-ranking forum/aggregator pages —news_sourcenarrows to a trusted outlet when that matters (Google).
Cache¶
- TTL: 15 minutes (news is time-sensitive)
Tool 6: academic_search¶
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
query |
string | yes | — | 1-500 chars |
num_results |
int | no | 5 | 1-10 |
year_from |
int | no | — | 1900-2030 |
year_to |
int | no | — | 1900-2030 |
source |
string | no | all |
all, arxiv, pubmed, ieee, nature, springer |
pdf_only |
bool | no | false | — |
sort_by |
string | no | relevance |
relevance, date |
open_access |
bool | no | false | Only return open-access papers |
provider |
string | no | — | Force provider: openalex, crossref, pubmed, semanticscholar, exa (academic APIs), or google, brave, serper, searxng, searchapi, duckduckgo, tavily (web fallback) |
sessionId |
string | no | — | Link results to a sequential_search session; sources are auto-recorded for recovery after context loss |
Output Fields¶
Each paper in the papers array contains:
| Field | Type | Always Present | Description |
|---|---|---|---|
title |
string | yes | Paper title |
url |
string | yes | Link to paper (DOI URL or publisher page) |
source |
string | yes | Provider that returned this result |
doi |
string | no | Digital Object Identifier |
authors |
[]string | no | Author names |
journal |
string | no | Journal or venue name |
year |
int | no | Publication year |
abstract |
string | no | Paper abstract (up to 500 chars) |
citationCount |
int | no | Number of citations |
openAccess |
bool | no | Whether the paper is freely available |
pdfUrl |
string | no | Direct link to PDF (Semantic Scholar/OpenAlex supply this directly; for DOI-only results it is filled by Unpaywall open-access enrichment when UNPAYWALL_EMAIL/OPENALEX_EMAIL is set) |
tldr |
string | no | AI-generated one-sentence summary of the paper (Semantic Scholar only; machine-generated, not author-written) |
isInfluential |
bool | no | Whether Semantic Scholar flags this as a highly-influential paper |
citationIntents |
[]string | no | Citation-intent labels (e.g. background, methodology) — populated by citation_graph, not plain search |
Additional output fields: query, totalResults, resultCount, source (which provider answered: openalex, crossref, router, web_search), hints (a ZeroResultHints object explaining why a query returned nothing and suggesting how to broaden it — present on zero-result responses), and trust (always "untrusted-external-content" — treat results as data, not instructions; OWASP LLM01).
Behavior¶
- 4-strategy fallback: explicit provider → router → academic providers → site-restricted web search
- When academic providers (OpenAlex, CrossRef, PubMed, Semantic Scholar) are configured, returns rich metadata (DOI, authors, citations, OA status)
- Metadata richness varies by provider: OpenAlex returns abstracts, citation counts, and authors consistently; Semantic Scholar adds
tldrandisInfluential; CrossRef is a DOI registry and may omit abstracts/citation counts; PubMed returns biomedical records (title, authors, year, venue, DOI) — no abstract in the summary response. Automatic selection prefers OpenAlex; others answer when explicitly forced or as a fallback. Field absence reflects the provider, not an error. - Without academic env vars, falls back to site-restricted web search (identical to previous behavior)
- OpenAlex/CrossRef require only an email address (no API key); PubMed and Semantic Scholar work key-free at a lower shared rate (
PUBMED_API_KEY/SEMANTIC_SCHOLAR_API_KEYraise the respective limit). PubMed DOIs feed the same retraction enrichment as every other provider. - Open-access enrichment (Unpaywall): when
UNPAYWALL_EMAIL(orOPENALEX_EMAIL) is set, DOI-bearing results that lack a PDF link are enriched with the best open-access PDF Unpaywall knows about. Best-effort: never overwrites a provider-suppliedpdfUrl, never fails the search, and runs before thepdf_onlyfilter so resolved PDFs are counted. No-op when unconfigured. sourcefilter: when set (e.g., "arxiv"), OpenAlex filters by source ID; web fallback restricts to that source's domainsort_by=date: OpenAlex sorts bypublication_date:desc; CrossRef usespublished:descpdf_only: post-filters results to only those withPDFUrlpopulated (may reduce result count)
Academic Site Pool (web search fallback)¶
arxiv.org, pubmed.ncbi.nlm.nih.gov, scholar.google.com, ieeexplore.ieee.org, dl.acm.org, nature.com, sciencedirect.com, link.springer.com, researchgate.net, plos.org, frontiersin.org, mdpi.com, wiley.com, jstor.org, semanticscholar.org, biorxiv.org, medrxiv.org
Cache¶
- TTL: 1 hour (academic providers use semantic ranking that can shift)
Tool 7: patent_search¶
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
query |
string | no | — | Patent terms or number. Not required when assignee or inventor provided |
num_results |
int | no | 5 | 1-10 |
search_type |
string | no | prior_art |
prior_art, specific, landscape |
patent_office |
string | no | all |
all, US, EP, WO, JP, CN, KR |
assignee |
string | no | — | Company name (auto-strips Inc/LLC/Ltd suffixes) |
inventor |
string | no | — | Inventor name |
cpc_code |
string | no | — | CPC classification (e.g., G06F) |
year_from |
int | no | — | Only patents filed in or after this year |
year_to |
int | no | — | Only patents filed in or before this year |
provider |
string | no | — | Force provider: searchapi, epo, lens, uspto, or web search providers |
sessionId |
string | no | — | Link results to a sequential_search session; sources are auto-recorded for recovery after context loss |
Output Fields¶
Each patent in the patents array contains:
| Field | Type | Always Present | Description |
|---|---|---|---|
title |
string | yes | Patent title |
number |
string | yes | Patent number (e.g., US10165245B2) |
url |
string | yes | Link to patent detail page |
abstract |
string | no | Patent abstract or snippet |
assignee |
string | no | Patent owner/assignee |
inventor |
string | no | Primary inventor name |
filed |
string | no | Filing date (YYYY-MM-DD) |
granted |
string | no | Grant date (YYYY-MM-DD) |
pdf |
string | no | Direct link to patent PDF |
status |
string | no | Application status (e.g., "Patented Case") — provider-dependent: USPTO reports it; EPO/Lens/SearchAPI/web-discovery typically omit it |
Additional output fields: query, searchType, resultCount, source (which provider answered), searchUrl, hints (a ZeroResultHints object explaining why a query returned nothing and suggesting how to broaden it — present on zero-result responses), and trust (always "untrusted-external-content" — treat results as data, not instructions; OWASP LLM01).
Behavior¶
- 4-strategy fallback: explicit provider → router → patent-only providers → web search discovery
- When an explicit provider is set: that provider is used exclusively. If it returns empty results (e.g., USPTO for non-US patents), empty results are returned — no silent fallback to web_discovery
- Unknown provider: returns error listing all supported providers (no duplicates)
- Strips HTML from API responses; extracts clean patent numbers from paths
- Normalizes assignee names (removes Inc/LLC/Corp/Ltd suffixes for matching)
- Region-aware routing:
patent_officefilters which providers are tried - Post-filter results by patent number prefix when
patent_officeis specified - Does not cache empty results (only caches when patents are found)
- USPTO uses simple full-text search (quoted phrases); Lens uses Elasticsearch bool queries with match_phrase
num_resultsis enforced for every provider, including a defensive cap on the USPTO path (its API may return more rows than requested)- Provider matching is token/substring-based:
inventor/assigneematches share a surname or company token rather than disambiguating entities, and a nonsense query may still fuzzy-match loosely-related patents instead of returning zero. Verify results against the returned bibliographic fields rather than assuming exact-entity matching.
Cache¶
- TTL: 24 hours (only for non-empty results)
Tool 8: sequential_search¶
Purpose¶
Multi-step research tracking with session persistence, branching, and knowledge gap identification.
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
searchStep |
string | yes | — | Description of this step |
stepNumber |
int | yes | — | Starts at 1 |
nextStepNeeded |
bool | yes | — | Whether more steps follow |
sessionId |
string | no | — | Session ID (required for steps 2+) |
researchGoal |
string | no | — | Set on step 1; defines the research objective |
reasoning |
string | no | — | Why this search direction was chosen |
confidence |
string | no | — | Confidence in this step: high, medium, or low |
rejectedApproaches |
[]string | no | — | Approaches considered but rejected |
sessionSummary |
string | no | — | Running summary (used for recovery) |
responseMode |
string | no | auto | Force full or summary output |
totalStepsEstimate |
int | no | — | Estimated total steps |
isRevision |
bool | no | false | Revising a previous step |
revisesStep |
int | no | — | Step being revised |
branchFromStep |
int | no | — | Branching point |
branchId |
string | no | — | Branch identifier |
knowledgeGap |
string | no | — | Gap identified |
depth |
string | no | quick |
Iteration-assist level: quick, standard, or thorough (see Iterative Depth below) |
Session Management¶
- Sessions created on first call (stepNumber=1)
- A
stepNumber > 1call with nosessionIdis rejected with guidance (pass the sessionId, recover withget_research_session, or restart at step 1) — it does not silently start a new session, so a lost sessionId never orphans the in-flight research trail - Session ID: UUID v4, returned in output
- TTL: 4 hours of inactivity (configurable via
SESSION_TTL), resets on every access - Max concurrent sessions: 50 per tenant (oldest evicted when exceeded)
- Max steps per session: 200 (configurable via
SESSION_MAX_STEPS) - Persistence: encrypted disk (AES-256-GCM), survives server restarts
- Cleanup: goroutine every 15 minutes removes expired sessions from memory + disk
- Per-tenant isolation: sessions keyed by
{tenantID}:{sessionID} - Recovery: use
get_research_sessiontool after context loss - Response modes: "full" for ≤8 steps, "summary" for >8 (override with
responseModeinput)
Output Schema¶
type SequentialSearchOutput struct {
SessionID string `json:"sessionId"`
ResponseMode string `json:"responseMode"` // "full" or "summary"
ResearchGoal string `json:"researchGoal"`
CurrentStep int `json:"currentStep"` // echoes the input stepNumber
TotalStepsEstimate int `json:"totalStepsEstimate"`
IsComplete bool `json:"isComplete"` // !nextStepNeeded
StartedAt string `json:"startedAt"`
CompletedAt string `json:"completedAt,omitempty"` // set only when complete
Warning string `json:"warning,omitempty"` // e.g. max-steps reached
Trust string `json:"trust"` // "untrusted-external-content" — echoed source metadata is external data
// "full" mode (default for <=8 steps):
Steps []StepIndexEntry `json:"steps,omitempty"` // one-liner index, full mode only
// "summary" mode (default for >8 steps):
Summary string `json:"summary,omitempty"` // summary mode only
StepIndex []StepIndexEntry `json:"stepIndex,omitempty"` // summary mode only
// Both modes:
LastSteps []ResearchStep `json:"lastSteps,omitempty"` // most recent full steps
Gaps []KnowledgeGap `json:"gaps,omitempty"`
Sources []ResearchSource `json:"sources,omitempty"`
}
type StepIndexEntry struct {
StepNumber int `json:"stepNumber"`
OneLiner string `json:"oneLiner"`
BranchID string `json:"branchId"`
Confidence string `json:"confidence"`
}
The key set depends on
responseMode: full mode emitssteps; summary mode emitssummary+stepIndexinstead. Both emitlastSteps,gaps, andsources. This tool does not emit a_metablock (no caching).
Iterative Depth (depth)¶
An optional iteration-assist level (#67). The server stays infrastructure, not synthesis — it never writes an answer, only richer metadata and (for thorough) raw results.
| Level | Behavior |
|---|---|
quick (default) |
Record the step and return. Byte-for-byte the prior behavior — no extra fields. |
standard |
Also analyze coverage of the sources gathered so far and suggest refinement queries. No auto-execution. Adds depth, coverage, and refinementQueries. |
thorough |
Same as standard, plus auto-runs up to 3 of the suggested refinement queries as web searches and attaches their merged, provenance-tagged results. Adds refinementResults (and refinementNote when the suggestion list was truncated). |
Extra output fields (present only for standard/thorough):
coverage—{ sourceCount, uniqueDomains, domainSpread, dominantDomain?, sourceTypes, gaps[] }. Descriptive coverage signals (domain spread, source-type balance, thin-coverage flags) computed from the session's recorded sources. Never an answer.refinementQueries— suggested follow-up search strings derived from knowledge gaps + coverage gaps. The caller's AI decides whether to act on them.refinementResults(thoroughonly) — array of{ query, resultCount, results[] }(or{ query, error }), one per auto-run query. Raw web results tagged with the originating query; not synthesized. Each result is{ title, url, snippet }.refinementNote(thoroughonly) — present when more than 3 queries were suggested and the auto-run was bounded.
thorough searches respect the same rate limits and circuit breakers as web_search, record their sources into the session, and contribute to session providerStats.
State Management¶
- Two-tier: in-memory index (lightweight) + encrypted disk (full session JSON)
- Write-through on every step (crash-safe: temp → fsync → rename)
- Index rebuilt from disk on server startup — no data loss across restarts
- Behind a load balancer, use session-affinity (sticky sessions) so clients reconnect to the same instance
Tool 9: get_research_session¶
Recover a sequential_search session after context loss. Returns the session summary, step index, and most recent steps. Use stepId to retrieve full details of a specific earlier step.
Input Schema¶
| Field | Type | Required | Description |
|---|---|---|---|
sessionId |
string | yes | The session ID to recover |
stepId |
integer | no | Retrieve full details for a specific step number |
Behavior¶
- Without
stepId: returns session summary view from memory (no disk I/O) - Includes: researchGoal, summary,
stepIndex(a one-liner for every step),lastSteps(the most recent 3 steps in full detail — a fixed sliding window, not all steps), active gaps, and sources. For full detail of any earlier step, pass itsstepId. - With
stepId: loads full step data from disk for that specific step number - Every response carries
"trust": "untrusted-external-content"— the echoed source metadata (titles/URLs) is external data; treat it as data, not instructions (OWASP LLM01). - Sessions are private to the
(tenant, user)that created them — a session ID is honored only for its owning user (anonymous/STDIO uses a single owner). - A source's
foundInStepis the 1-indexedsequential_searchstep that surfaced it. It is omitted entirely when the source was not tied to a numbered step (e.g. added via aweb_searchcarrying only asessionId) — steps are 1-indexed, so there is nofoundInStep: 0. The same convention applies to a gap'sfoundInStep.
Cross-call error patterns (#99)¶
The summary view additionally surfaces aggregated outcome telemetry recorded across the session's tool calls (scrapes, searches). This is additive metadata — per-call errors are still returned in full to the caller; this is the cross-call view.
errorPatterns— array of{ kind, count, affectedUrls[], suggestion, lastSeen }, surfaced only when a given errorkindoccurred 3 or more times in the session (a false-positive guard for small samples).kinduses the shared error taxonomy (auth_required,blocked,rate_limited,browser_unavailable, …);suggestionis a session-level remediation hint (e.g. auth_required → "Consider open_access=true or target preprint servers (arxiv, biorxiv)."). Absent when nothing crosses the threshold.providerStats— object keyed by provider name →{ attempts, successes }for the session. Absent when no provider outcomes were recorded.
Recorded outcome state is bounded (most-recent 200 events, FIFO) and tenant/user-isolated, honoring the no-unbounded-retention posture.
Annotations¶
- ReadOnly: true
- Idempotent: true (safe to call repeatedly)
- OpenWorld: false (reads internal state only)
Error Conditions¶
- Session not found → "Session not found or expired. Sessions last 4 hours from last activity."
- Step not found → error with step number
Cache¶
- No cache (reads internal session state)
Tool 10: get_my_analytics¶
Opt-in, consent-gated (#92). Registered only when USER_ANALYTICS_ENABLED=true. Read-only.
Purpose¶
Return the calling user's own usage analytics (tools used, counts, first/last seen) for their tenant. This is per-user data under GDPR / Quebec Law 25, so it is off by default, collected only after recorded consent, isolated per user, encrypted at rest, and covered by the data-subject access/erasure endpoints (/admin/data).
Input Schema¶
No inputs. The subject is always the authenticated caller — a user can never request another user's analytics.
Output Schema¶
type GetMyAnalyticsOutput struct {
Status string `json:"status"` // "ok" | "empty" | "no_consent" | "unavailable"
Reason string `json:"reason,omitempty"`
Analytics *UserAnalytics `json:"analytics,omitempty"`
}
type UserAnalytics struct {
TenantID string `json:"tenantId"`
UserID string `json:"userId"`
TotalCalls int64 `json:"totalCalls"`
ToolCounts map[string]int64 `json:"toolCounts"`
FirstSeen string `json:"firstSeen,omitempty"`
LastSeen string `json:"lastSeen,omitempty"`
RecentTools []string `json:"recentTools,omitempty"`
}
Behavior¶
- Requires an authenticated user (
status: "unavailable"for anonymous). - Requires recorded consent for the
analyticspurpose (status: "no_consent"otherwise — nothing is collected without it). - Returns the caller's own summary, or
status: "empty"if none recorded yet.
Cache¶
- No cache (reads per-user state directly).
Tool 11: memory_save¶
Opt-in, consent-gated (#88). Registered only when MEMORY_ENABLED=true. This is a write tool (ReadOnlyHint: false, DestructiveHint: false — it appends, never deletes).
Purpose¶
Persist a research finding to the calling user's long-term memory so it can be recalled in future sessions (unlike sequential_search sessions, which expire after 4 hours). Stored per-user, encrypted, retention-bounded (MEMORY_RETENTION, default 90 days), and erasable via the data-subject endpoint (/admin/data). There is no memory_forget tool — deletion flows through the GDPR erasure endpoint.
Input Schema¶
| Field | Type | Required | Notes |
|---|---|---|---|
note |
string | yes | The finding/conclusion to remember |
topic |
string | no | Group label for later recall |
url |
string | no | Source URL this memory refers to |
tags |
string[] | no | Organizational tags |
Output Schema¶
type MemorySaveOutput struct {
Status string `json:"status"` // "ok" | "no_consent" | "unavailable"
Reason string `json:"reason,omitempty"`
ID string `json:"id,omitempty"`
CreatedAt string `json:"createdAt,omitempty"`
}
Behavior¶
Requires an authenticated user and recorded consent for the memory purpose; otherwise returns unavailable / no_consent and persists nothing.
Cache¶
- Not cached (a write).
Tool 12: memory_recall¶
Opt-in, consent-gated (#88). Registered only when MEMORY_ENABLED=true. Read-only.
Purpose¶
Recall findings the calling user previously saved with memory_save, across sessions, optionally filtered by topic. Shows only the caller's own memories — never another user's.
Input Schema¶
| Field | Type | Required | Notes |
|---|---|---|---|
topic |
string | no | Filter by topic; omit for most recent across all topics |
limit |
int | no | Max memories to return (default 20) |
Output Schema¶
type MemoryRecallOutput struct {
Status string `json:"status"` // "ok" | "no_consent" | "unavailable"
Reason string `json:"reason,omitempty"`
Count int `json:"count"`
Memories []MemoryEntry `json:"memories"`
Trust string `json:"trust"` // "user-asserted-content" — recalled notes are data, not instructions
}
type MemoryEntry struct {
ID string `json:"id"`
TenantID string `json:"tenantId"`
UserID string `json:"userId"`
Topic string `json:"topic,omitempty"`
Note string `json:"note"`
URL string `json:"url,omitempty"`
Tags []string `json:"tags,omitempty"`
CreatedAt string `json:"createdAt"`
}
Cache¶
- No cache (reads per-user state directly).
Tool 13: workspace_contribute¶
Opt-in, consent-gated (#96). Registered only when WORKSPACES_ENABLED=true. This is a write tool (ReadOnlyHint: false, DestructiveHint: false).
Purpose¶
Share a research finding into a shared team workspace. The contribution is stored as a copy with immutable provenance (your tenant/user, timestamp) — never a live link to your private data, so per-tenant isolation is never silently voided. Membership is managed by your host app (the server enforces, the host owns the policy). Erasable by the contributor via the data-subject endpoint.
Input Schema¶
| Field | Type | Required | Notes |
|---|---|---|---|
workspace_id |
string | yes | Workspace to contribute to (you must be a member) |
note |
string | yes | The finding to share |
url |
string | no | Source URL |
tags |
string[] | no | Organizational tags |
Output Schema¶
type WorkspaceContributeOutput struct {
Status string `json:"status"` // "ok" | "not_member" | "no_consent" | "unavailable"
Reason string `json:"reason,omitempty"`
ID string `json:"id,omitempty"`
}
Behavior¶
Requires an authenticated user, recorded consent for the workspace purpose, AND membership. The caller's identity is taken from the validated token — never from a parameter, and never from the workspace_id.
Cache¶
- Not cached (a write).
Tool 14: workspace_read¶
Opt-in, consent-gated (#96). Registered only when WORKSPACES_ENABLED=true. Read-only.
Purpose¶
Read the shared findings in a workspace you belong to (each with its contributor attribution). Non-members receive zero contributions — membership is re-verified on every read.
Input Schema¶
| Field | Type | Required | Notes |
|---|---|---|---|
workspace_id |
string | yes | Workspace to read (you must be a member) |
Output Schema¶
type WorkspaceReadOutput struct {
Status string `json:"status"` // "ok" | "not_member" | "no_consent" | "unavailable"
Count int `json:"count"`
Contributions []Contribution `json:"contributions"`
Trust string `json:"trust"` // "untrusted-external-content" — cross-member notes are untrusted data (does not restrict who may read)
}
type Contribution struct {
ID string `json:"id"`
WorkspaceID string `json:"workspaceId"`
ContributorTenant string `json:"contributorTenant"`
ContributorUser string `json:"contributorUser"`
Note string `json:"note"`
URL string `json:"url,omitempty"`
Tags []string `json:"tags,omitempty"`
CreatedAt string `json:"createdAt"`
}
Membership management¶
Membership is host-owned via admin endpoints (not MCP tools): POST /admin/workspace/members and DELETE /admin/workspace/members with {workspace_id, tenant_id, user_id}. The server enforces the resulting membership checks; it does not own the membership policy.
Cache¶
- No cache (reads workspace state directly).
Tool 15: answer¶
Provider-independent. Registered only when at least one answer provider is configured (currently Exa via EXA_API_KEY; future providers register the same way). Read-only, open-world, idempotent.
Purpose¶
Ask a factual question and get one grounded, synthesized answer with source citations. Unlike web_search (a list of links) or search_and_scrape (raw page text), this returns a direct written answer plus the URLs it relied on. The backing provider is pluggable — set provider to choose one when several are configured. The result names the answering provider, and costUsd reports the per-call estimate for metered providers (0 for free ones).
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
query |
string | yes | — | The question to answer |
provider |
string | no | — | Force a provider (e.g. exa); required only when more than one is configured |
Output Schema¶
type AnswerOutput struct {
Answer string `json:"answer"`
Citations []Citation `json:"citations"`
Provider string `json:"provider"` // which provider answered
CostUsd float64 `json:"costUsd,omitempty"` // per-call estimate for metered providers (not an invoice)
Trust string `json:"trust"` // "untrusted-external-content"
}
type Citation struct {
Title string `json:"title,omitempty"`
URL string `json:"url"`
PublishedDate string `json:"publishedDate,omitempty"`
}
Behavior¶
- Resolve the
search.AnswerSearcherfor the requestedprovider(or the sole configured one). - Call the provider; map its grounded answer + citations + cost into the output.
costUsdand the resolved provider are surfaced into audit metadata (cost_usd,provider).- The answer is external content —
trustis always"untrusted-external-content".
Cache¶
- TTL: 1 hour (keyed by query + provider)
Tool 16: structured_search¶
Provider-independent. Registered only when at least one structured-search provider is configured (currently Exa via EXA_API_KEY). Read-only, open-world, idempotent.
Purpose¶
Search the web and extract structured data from each result. Supply a JSON schema to pull specific fields back as JSON per result, and/or a category to focus the search. The backing provider is pluggable (provider field). Valid category values and any schema limits are provider-specific and validated by the chosen provider — an unsupported value returns an error listing the valid options. The result names the provider, and costUsd reports the per-call estimate for metered providers.
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
query |
string | yes | — | What to search for (entity name for entity lookups) |
category |
string | no | — | Provider-specific result category (validated by the provider) |
num_results |
int | no | 5 | 1-10 |
schema |
object | no | — | JSON Schema for per-result field extraction; provider-specific limits apply |
provider |
string | no | — | Force a provider (e.g. exa); required only when more than one is configured |
Output Schema¶
type StructuredOutput struct {
Query string `json:"query"`
Category string `json:"category"`
ResultCount int `json:"resultCount"`
Results []StructuredItem `json:"results"`
Provider string `json:"provider"` // which provider answered
CostUsd float64 `json:"costUsd,omitempty"` // per-call estimate for metered providers
Trust string `json:"trust"` // "untrusted-external-content"
}
type StructuredItem struct {
Title string `json:"title,omitempty"`
URL string `json:"url"`
PublishedDate string `json:"publishedDate,omitempty"`
Author string `json:"author,omitempty"`
Summary json.RawMessage `json:"summary,omitempty"` // schema-conforming JSON (best-effort), or plain text summary
Highlights []string `json:"highlights,omitempty"` // verbatim source snippets — the authoritative payload
Entities json.RawMessage `json:"entities,omitempty"` // provider-specific structured entities, when available
}
Behavior¶
- Resolve the
search.StructuredSearcherfor the requestedprovider(or the sole configured one). - The provider validates its own constraints (category vocabulary, schema limits) before any paid call; a violation returns a validation tool-error, never a wasted upstream request.
- When
schemais set, each result'ssummaryis JSON conforming to it; otherwise it is a plain text summary. Schema extraction is best-effort and provider-side: the provider's extractor fills each field from the page, and a value it can't confidently resolve comes backnulleven when that value is visible inhighlights. Treathighlights(verbatim source snippets) as the authoritative payload andsummaryas a convenience — do not assume every schema field is populated. Providers may populate per-resultentitiesfor entity categories (e.g. Exa'scompany). costUsdand the resolved provider are surfaced into audit metadata. Results are external content —trustis always"untrusted-external-content".
Provider notes¶
- Exa:
category∈ company, people, research paper, news, pdf, github, financial report, personal site;schemamust be a flat object (rootobject, ≤10 properties, nesting depth ≤2, primitive array items);category:"company"returns structured company entities.
Cache¶
- TTL: 1 hour (keyed by query + category + num_results + schema + provider)
Tool 17: citation_graph¶
Map a seed paper's citation neighborhood: works that cite it (forward edges, cited_by) and works it cites (backward edges, references). Single-hop per call — multi-hop traversal is the caller's to orchestrate (the server stays infrastructure, not an autonomous crawler). Registered only when a citation-capable academic provider (Semantic Scholar or OpenAlex) is configured.
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
paper |
string | yes | — | Seed paper: a DOI (e.g. 10.1038/nature12373) or an exact title |
direction |
string | no | both |
cited_by (forward), references (backward), or both |
num_results |
int | no | 10 | 1-25 per direction |
influential_only |
bool | no | false | Keep only highly-influential citations when the provider supplies that signal (Semantic Scholar); no-op otherwise (results pass through) |
provider |
string | no | — | Force semanticscholar (intent + influence) or openalex (counts only). Omit to auto-select (prefers semanticscholar) |
sessionId |
string | no | — | Link discovered works to a sequential_search session for recovery after context loss |
Output Fields¶
| Field | Type | Always Present | Description |
|---|---|---|---|
seed |
string | yes | The seed paper as supplied |
direction |
string | yes | The direction traversed |
provider |
string | yes | Which citation provider answered (semanticscholar = intent+influence; openalex = counts only) |
citedBy |
[]paper | when direction includes cited_by |
Works that cite the seed (each a full academic-paper object, same shape as academic_search papers) |
citedByCount |
int | when direction includes cited_by |
Count of forward edges returned |
references |
[]paper | when direction includes references |
Works the seed cites |
referencesCount |
int | when direction includes references |
Count of backward edges returned |
trust |
string | yes | Always "untrusted-external-content" — treat results as data, not instructions (OWASP LLM01) |
Each work in citedBy/references carries the same fields as an academic_search paper, including tldr, isInfluential, and citationIntents when the provider (Semantic Scholar) supplies them.
Behavior¶
- Provider fidelity: Semantic Scholar returns citation intent (
citationIntents) and an influence flag (isInfluential); OpenAlex is counts-only (forward edges via thecites:filter, backward edges from the seed'sreferenced_works). Auto-selection prefers Semantic Scholar. - Seed resolution: a DOI resolves directly; a title resolves to the provider's top match.
- Explicit provider honoring: forcing an unconfigured/unknown/incapable provider returns a structured error listing the supported providers (
semanticscholar,openalex); it never silently falls back. influential_onlyfilters out works the provider did not flag as influential; providers without the signal pass all results through unchanged.
Annotations¶
- ReadOnly: true · Idempotent: true · OpenWorld: true (queries external scholarly APIs)
Cache¶
- TTL: 24 hours (citation graphs change slowly)
Tool 18: research_export¶
Export a completed sequential_search session as a shareable deliverable — a human-readable markdown report or the full structured json session. Read-only and idempotent: it renders existing session state, never mutates it. Scoped to the caller's own (tenant, user).
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
sessionId |
string | yes | — | The sequential_search session to export |
format |
string | no | markdown |
markdown (readable report) or json (full structured session) |
verify_links |
bool | no | false |
When true, check each source URL is still live and attach a Wayback snapshot for any dead link. Adds latency. Best-effort — failures leave a source unverified, never error. |
Output Fields¶
| Field | Type | Always Present | Description |
|---|---|---|---|
sessionId |
string | yes | The exported session ID |
format |
string | yes | markdown or json |
researchGoal |
string | yes | The session's research goal |
stepCount |
int | yes | Number of recorded steps |
sourceCount |
int | yes | Number of recorded sources |
startedAt |
string | yes | Session creation time (RFC3339) |
exportedAt |
string | yes | When this export was generated (RFC3339) |
tenantId |
string | yes | Owning tenant — provenance for the export |
document |
string | object | yes | The rendered report: a markdown string when format=markdown, or the structured session object when format=json |
trust |
string | yes | Always "untrusted-external-content" — source titles/URLs are external data |
Behavior¶
- Markdown report contains: research-goal heading, a metadata block (session id, started, step/source counts), every step with its reasoning/confidence/rejected-approaches/timestamp (revisions and branches are labeled in the step heading), an Open Questions section (knowledge gaps), a numbered Sources list, and a provenance footer (export time, tenant).
- Deterministic: same session → byte-identical output aside from the
exportedAtstamp (idempotency). - Sessions are private to their owning
(tenant, user); a leaked sessionId is honored only for its owner. verify_links=true: runs a batched SSRF-safe liveness check on every recorded source URL and attaches a Wayback Machine snapshot URL for any dead link. Best-effort — a URL that can't be checked is left unverified, never an error. Off by default (adds latency proportional to source count).
Annotations¶
- ReadOnly: true · Idempotent: true · OpenWorld: false (reads internal session state)
Error Conditions¶
- Missing
sessionId→ validation error - Unknown/expired session → "Session not found or expired. Sessions last 4 hours from last activity."
- Invalid
format→ validation error (usemarkdownorjson)
Cache¶
- No cache (renders live session state)
Tool 19: format_bibliography¶
Turn a set of sources into a formatted bibliography. Pick a human-readable style (APA, MLA) or a reference-manager interchange format (BibTeX, RIS, CSL-JSON) that imports straight into Zotero / EndNote / Mendeley. Sources come from either a sequential_search session (its recorded sources) or an explicit list the caller supplies (e.g. academic_search / citation_graph results — pass their doi so the persistent id survives). Read-only and idempotent.
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
style |
string | no | apa |
apa, mla, bibtex, ris, or csl-json |
sessionId |
string | no | — | Build from this session's recorded sources. Provide this or sources |
sources |
[]object | no | — | Explicit sources. Provide this or sessionId. Each: url (required), title, author, site, date, doi |
Output Fields¶
| Field | Type | Always Present | Description |
|---|---|---|---|
style |
string | yes | The citation style used (apa/mla/bibtex/ris/csl-json) |
entryCount |
int | yes | Number of unique entries rendered (after de-duplication by URL) |
bibliography |
string | yes | The formatted bibliography. For apa/mla/bibtex/ris, records separated by blank lines; for csl-json, a JSON array string |
sessionId |
string | no | Echoed when sources were drawn from a session |
trust |
string | yes | Always "untrusted-external-content" — source metadata is external data |
Behavior¶
- Sources are de-duplicated by URL (first occurrence wins) and ordered deterministically: APA/MLA alphabetically by the rendered line; BibTeX/RIS/CSL-JSON by (collision-free) cite key. Same inputs → byte-identical output (interchange formats omit the accessed-date stamp so they stay reproducible).
- BibTeX cite keys are
surname + year + first-title-word(e.g.vaswani2017attention), made collision-free within the list by appendinga/b/c…; BibTeX-significant characters in values are escaped. - RIS records use
TY - JOURwhen a DOI is present (the entry is almost certainly a journal article); others useTY - ELEC. OneAUline per author (split on;/and), withDOcarrying the bare DOI andURthe URL; values are stripped of line breaks so a title can't inject extra RIS tags. - CSL-JSON is a JSON array: entries with a DOI use
"type": "article-journal"; others use"type": "webpage". (id= cite key, authors as{"literal": …},issueddate-parts,container-title,DOI,URL); all values are JSON-escaped. - DOI (when supplied) is normalized to the bare
10.x/yform and emitted into bibtex (doi), ris (DO), and csl-json (DOI). It is not network-verified here — useverify_citationfor that. - An unrecognized style is rejected at the tool boundary (the lower-level formatter falls back to APA, but the tool validates first).
- Entries with no
urlare skipped. EithersessionIdor a non-emptysourceslist is required. - Session-sourced bibliographies are scoped to the caller's own
(tenant, user).
Annotations¶
- ReadOnly: true · Idempotent: true · OpenWorld: false (formats supplied/stored data)
Cache¶
- No cache (pure formatting of supplied/stored data)
Tool 20: filing_search¶
Search SEC EDGAR — the authoritative primary source for US public-company disclosures (10-K/10-Q/8-K/S-1/DEF 14A/…). Registered only when a filing provider is configured (edgar, which needs a contact email for SEC's required User-Agent).
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
query |
string | yes* | — | Company name, ticker, or CIK — or free text to full-text search all filings. *Required unless ticker is set |
form_type |
string | no | — | Restrict to a filing type (10-K, 10-Q, 8-K, S-1, DEF 14A, …) |
ticker |
string | no | — | Direct ticker lookup; takes precedence over query for entity resolution |
date_from |
string | no | — | Only filings on/after this date (YYYY-MM-DD) |
date_to |
string | no | — | Only filings on/before this date (YYYY-MM-DD) |
facts |
bool | no | false | Return structured XBRL company facts (revenue, net income, EPS, assets) instead of a filing list |
num_results |
int | no | 5 | 1-10 |
provider |
string | no | — | Force a filing provider: edgar |
sessionId |
string | no | — | Link results to a sequential_search session |
Output Fields¶
Each item in filings[]: company, cik, formType, filingDate, periodOfReport, accession, url (document link; pair with scrape_page), description, source. In facts=true mode each item is one XBRL fact: concept, unit, value (exactly as filed — no rounding). Plus query, resultCount, provider, hints (zero-result), and trust (untrusted-external-content).
Behavior¶
- Entity resolution: a ticker/CIK/known-company
queryresolves to a CIK and lists its recent filings from the submissions API; otherwise a full-text search runs across all filers (EFTS). facts=truereturns a curated set of headline XBRL concepts (revenue, net income, assets, EPS, …), most-recent value each, passed through verbatim.- Required
User-Agent: SEC blocks requests without a descriptive UA + contact email; the provider only registers whenEDGAR_CONTACT_EMAIL(orOPENALEX_EMAIL) is set. No request is ever made without it. - Ticker→CIK map is fetched once and cached for the process lifetime.
Annotations¶
- ReadOnly: true · Idempotent: true · OpenWorld: true (queries the live SEC API)
Cache¶
- TTL: 24 hours (only for non-empty results)
Tool 21: legal_search¶
Search US court opinions (federal + state) via CourtListener for case-law research and precedent tracing. Registered only when a case provider is configured (courtlistener, which works keyless at a lower rate).
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
query |
string | yes | — | Legal topic, case name (e.g. Miranda v. Arizona), or statutory reference |
jurisdiction |
string | no | — | Court id: scotus, ca9, ny, … |
date_from |
string | no | — | Only opinions decided on/after this date (YYYY-MM-DD) |
date_to |
string | no | — | Only opinions decided on/before this date (YYYY-MM-DD) |
num_results |
int | no | 10 | 1-20 |
provider |
string | no | — | Force a case-law provider: courtlistener |
sessionId |
string | no | — | Link results to a sequential_search session |
Output Fields¶
Each item in cases[]: caseName, citation (Bluebook), court, courtId, dateFiled, docketNumber, citationCount, url (opinion page; scrape_page for full text), source. Plus query, resultCount, provider, hints, and trust (untrusted-external-content).
Behavior¶
- Searches the CourtListener v4 opinions index;
jurisdictionmaps to thecourtfilter, dates tofiled_after/filed_before. - Auth: works keyless at ~100 req/day;
COURTLISTENER_API_TOKENraises the limit (~5000/day). The token is sent as anAuthorizationheader and never logged. - Anti-hallucination workflow: pair with the
legallens (web_searchwithlens: legal, an authority-weighted primary-source pack — seelenses/README.md) for context, and withverify_citationto confirm a cited case actually exists before relying on it.
Annotations¶
- ReadOnly: true · Idempotent: true · OpenWorld: true (queries the live CourtListener API)
Cache¶
- TTL: 24 hours (only for non-empty results)
Tool 22: econ_search¶
Look up macroeconomic and development data. FRED (Federal Reserve Economic Data) — 800K+ US time series (GDP, CPI, unemployment, rates); World Bank Open Data — global development indicators for 200+ economies; OECD (SDMX) — economic indicators for OECD economies (national accounts, prices, labour, trade); Eurostat — official European statistics. World Bank, OECD, and Eurostat are keyless, so econ_search is always registered; FRED adds the US macro series when FRED_API_KEY is set.
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
query |
string | yes* | — | Keyword to search series by (matches indicator name for World Bank). *Provide this OR series_id |
series_id |
string | yes* | — | A series ID to fetch observations: FRED (GDP, CPIAUCSL, UNRATE), a World Bank indicator (NY.GDP.MKTP.CD), an OECD dataflow ref (agency,dataflow,version — returned by a keyword search), or a Eurostat dataset code (une_rt_m). *Provide this OR query |
country |
string | no | WLD |
ISO code for multi-country providers (worldbank default WLD; oecd → REF_AREA, e.g. USA; eurostat → geo, e.g. DE). Ignored by fred |
date_from |
string | no | — | Only observations on/after this date (YYYY-MM-DD or YYYY) |
date_to |
string | no | — | Only observations on/before this date (YYYY-MM-DD or YYYY) |
frequency |
string | no | — | FRED only: resample d, w, m, q, a |
units |
string | no | — | FRED only: units transform (e.g. pch, pc1); omit for raw levels |
num_results |
int | no | 5 (search) / 10 (observations) | — |
provider |
string | no | — | Force an economic-data provider: fred, worldbank, oecd, or eurostat |
Output Fields¶
mode is series (keyword search) or observations (series_id lookup). In series mode each results[] item: seriesId, title, units, frequency, lastUpdated, notes. In observations mode: seriesId, date, value (exactly as returned — no rounding; missing observations carry no value), plus title and units for multi-dimensional providers (OECD/Eurostat) so interleaved subgroup series — youth vs total, male vs female — are tellable apart (a single FRED/World Bank series carries neither). Plus query, seriesId (echoed in observations mode), country (echoed for a World Bank lookup), resultCount, provider, hints, and trust (untrusted-external-content).
Behavior¶
series_idset → returns that series' observations; otherwise keyword-searches series. With nodate_fromthe window is the most-recentnum_results(latest first); with adate_fromit is the firstnum_resultson/after that date (anchored at the requested start, oldest first) so the filter is never silently dropped. FRED honorsfrequency/units; World Bank scopes bycountry(defaultWLD) and filters by year.- World Bank / OECD / Eurostat have no server-side keyword search, so keyword mode lists the provider's catalogue (WDI indicators / OECD dataflows / Eurostat datasets) once and filters by name client-side; for OECD/Eurostat the matched id is the
series_idto fetch observations with. Multi-word queries use AND-matching (all words must appear in the name — "quarterly GDP growth" matches titles containing each word even when not adjacent); single-word queries require a contiguous substring. - OECD addresses a series by a dataflow ref (
agency,dataflow,version) plus aREF_AREAcountry filter; observations are decoded from SDMX-JSON at the requested time granularity (monthly/quarterly/annual — the period is not truncated to the year). Eurostat addresses a dataset by code plus ageofilter; observations are decoded from the JSON-stat cube by recovering every dimension's coordinate, surfacing its status flag (provisional/estimated) asnotes. A dataset is multi-dimensional (sex/age/unit/adjustment/…), so eachtitlecarries the dimension labels that distinguish one series from another (e.g. "…— Females, Percentage of population in the labour force") andunitscarries the unit dimension; values pass through verbatim (no rounding). - Provider honoring: an explicit
provideris used exclusively; otherwise the first configured provider answers (FRED if keyed, else World Bank/OECD/Eurostat in order). An error/empty returns a structured zero-result with hints (no silent cross-provider fallback). - Auth: World Bank, OECD, and Eurostat are keyless (always available).
FRED_API_KEY(free at fred.stlouisfed.org) enables FRED; it is sent as a query param and never logged.
Annotations¶
- ReadOnly: true · Idempotent: true · OpenWorld: true (queries the live FRED / World Bank / OECD / Eurostat APIs)
Cache¶
- TTL: 6 hours (only for non-empty results)
Tool 23: verify_citation¶
Purpose¶
Verify a single citation before relying on it — confirm it exists, matches a real record, hasn't been retracted, and still resolves. Built to catch AI-fabricated or retracted citations before they ship (legal filings, papers, articles). Composes the retraction enrichment, the link verifier, and the academic searchers; adds no new provider.
Input Schema¶
| Field | Type | Required | Description |
|---|---|---|---|
citation |
string | yes | A DOI (10.1038/nature12373), a URL, or a free-text reference string. The tool auto-detects which. |
claim |
string | no | The assertion this citation is cited for. When set, the source (the live URL, or its Wayback snapshot when dead) is fetched and checked for whether it actually addresses the claim — surfacing evidence sentences and flagging mischaracterization. Coverage + evidence, never a support/refute verdict. Off unless provided; adds one fetch. |
Output Schema¶
input, inputType (doi|url|reference), exists (bool), matchedRecord (the academic record — for a DOI input it is only the work whose DOI exactly equals the input, never a near-neighbor, and is omitted when no exact record exists or exists:false), matchConfidence (high|medium|low|none), detectedDoi (for a URL input that resolves to a scholarly article: the DOI extracted from the page — citation_doi meta, the URL path, or references-safe front matter — which then drives the retraction + title-match checks just like a DOI input; omitted when no scholarly DOI was found), titleMatch (match|mismatch|not_checked: whether a title — text supplied alongside a DOI, or a scholarly page's own title for a URL input — matches the matched record's actual title; mismatch means ≥2 substantive title tokens are absent from the record — the caller may have cited the wrong paper; not_checked when there is no title text or only a bare DOI), retractionStatus (Crossref integrity status when retracted/corrected; omitted when clean), httpStatus + archivedUrl (for URL inputs — live status and a Wayback snapshot for dead links; exists:true with a 403/429/503 httpStatus means the server is reachable but refused the verifier — the resource exists, it is not a dead link), provenance (how each piece of evidence was obtained), and the trust marker. When a claim was supplied: claim (echo), claimSupport (addressed|partially_addressed|not_addressed|source_unavailable), claimEvidence (claim-relevant source sentences), claimSourceUrl (the URL actually fetched), and contrastSignal (true only — a negation/contrast cue, read the evidence yourself).
Behavior¶
- Evidence, never a verdict. The tool reports what it found (exists/matches/retracted/resolves); the caller decides whether to cite. It never synthesizes a true/false judgment.
- DOI input → existence + retraction via Crossref
works/{doi}; the matched record is fetched by exact-DOI entity lookup (theDOIResolvercapability, e.g. OpenAlex/works/doi:{doi}) somatchedRecordis always the cited work or nothing — a relevance DOI search returns near-neighbors, which are never shown as this DOI's record.matchedRecord/matchConfidenceare omitted when no exact record is found orexists:false. When the citation string also carries a title alongside the DOI,titleMatchcompares those tokens against the matched record's title:mismatchfires only when ≥2 substantive tokens supplied are absent from the record (the caller may have paired the wrong title with this DOI), whilenot_checkedmeans only a bare DOI was given. Zero false positives: a single coincidental token is never flagged mismatch. - URL input → liveness via the SSRF-safe link verifier; a Wayback
archivedUrlwhen the live link is dead. When the URL resolves to a scholarly article (classified peer-reviewed/academic), the page is fetched once and its DOI is extracted (citation_doimeta → URL path → references-safe front matter, the same authority order asscrape_page); a found DOI then drives the full DOI enrichment —detectedDoi,retractionStatus,matchedRecord/matchConfidence(exact-DOI lookup), andtitleMatchcomparing the page's own title against the matched record. A non-scholarly page stays liveness-only — a DOI-shaped string in prose is never surfaced. - Free-text input → best-match academic lookup with a transparent token-overlap
matchConfidence; retraction checked when the match carries a DOI. - Claim check (optional). When a
claimis given, the source is fetched (the live URL, a matched record's URL, or a Wayback snapshot) and measured for topical overlap with the same lexical, model-free coverage asaudit_bibliography(no model/embedding):claimSupportreports COVERAGE not stance,not_addressedis the mischaracterization signal (only when a source was actually read), andcontrastSignalflags a negation cue.source_unavailablewhen no fetchable source (e.g. a DOI/reference whose matched record carries no URL). Off — and zero added latency — unless a claim is supplied. - Degrades gracefully when a resolver is unconfigured (reports the gap in
provenance); never panics. - To check a whole reference list at once (a document, an explicit list, or a session), use
audit_bibliography— the corpus-level companion that runs these same checks over every entry.
Annotations¶
- ReadOnly: true · Idempotent: true · OpenWorld: true (queries live external sources)
Cache¶
- Not cached (a verification is a point-in-time liveness/integrity check).
Tool 24: verify_recommendation¶
Purpose¶
Audit an AI-generated recommendation list (a listicle, product ranking, or comparison) for anti-sloptimization signals. Given a list of recommendations with optional URLs and author bios, returns per-item evidence: self-promotion patterns (a brand ranking itself first), conflicts of interest (the author is employed by the recommended company), domain reputation, and link liveness. Built to catch GEO (Generative Engine Optimization) and brand-favoring recommendations so you can decide whether the list is trustworthy or gaming you. Compose with web_search + verify_citation to audit sources and claims.
Input Schema¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
recommendations |
[]object | yes | — | Array of recommendations (max 100). Each has: title (the recommendation), url (optional), author (optional), authorBio (optional author affiliation/bio). |
claim |
string | no | — | Optional context describing what the list claims (e.g. "best e-commerce platforms for small businesses"). When set, triggers corroboration searches across independent journalism and tech sources per recommendation. |
numCorroborationResults |
int | no | 5 | Results to fetch per lens per recommendation when claim is set. Max 10. |
Output Schema¶
itemCount (recommendations audited), recommendations[] — per item: title (echo), url (echo when provided), author (echo when provided), selfPromotionSignal (present when the recommendation text contains ranking patterns favoring the brand; rare for structured lists, more common on full-page audits), conflictOfInterest (present when the author has a detected financial stake — employment / funding / equity — in the recommended entity), domainReputation (domain reputation when the URL host is in the known sources dataset; omitted for unlisted hosts), linkLive (true when the URL resolves 2xx/3xx; false when dead), httpStatus (live HTTP status, 0 = unreachable/timeout), corroborationSearches[] (present when claim was given — one entry per corroboration lens with query, lens, resultCount, agreeCount, disagreeCount, silentCount, and topResults[]), flags (conflict_of_interest / dead_link / unknown_reputation / low_reputation; empty = no issues), reasons[] (human-readable explanations for any flags), and the trust marker. Top-level aggregateFlags[] (present only when claim is given): no_independent_corroboration fires when zero results across all lenses agreed with any recommendation.
Behavior¶
- Evidence, never a verdict. The tool reports what it found (conflicts/reputation/liveness/corroboration); the caller decides whether to trust the recommendation.
- Conflict of interest detection: scans author bio for employment, funding, or equity indicators (e.g. "at Shopify", "advisor to") that match entities mentioned in the recommendation text. Conservative — false negatives preferred to false positives.
- Self-promotion signal (when present): indicates a ranking list putting its own brand first (a brand blog recommending itself as #1, etc.). Uses lexical matching: the URL's domain token (e.g.
shopify.com→"shopify") must appear as the rank-1 item name. Does not detect corporate ownership (e.g.adobe.comranking"Marketo"#1, since Marketo is an Adobe acquisition but the names don't match). See the open enhancement issue for scope-expansion plans. - Domain reputation: when known, surfaces the source's reputation tier from the embedded allowlist (academic, news, official docs, etc.).
- Link liveness: batched SSRF-safe check of all provided URLs; present only when URLs are given.
- Corroboration search (#246): when
claimis provided and a recommendation has atitle, the tool issues one site-scoped search per lens (journalism— independent press, SEC/court/data.gov;tech— Ars Technica, TechCrunch, The Verge, Wired) against the default provider. Each result is scored byclaimSignal:"addressed"→agreeCount,"not_addressed"→disagreeCount, all others →silentCount. The aggregate flagno_independent_corroborationfires when no result across all lenses agreed with any recommendation. Fail-open: a missing lens or provider error leavescorroborationSearchesnil without failing the audit. - Scope: up to 100 recommendations per call; any excess returns an error.
Annotations¶
- ReadOnly: true · Idempotent: true · OpenWorld: true (queries live external sources for link liveness)
Cache¶
- Link liveness checks are cached for 1 hour; domain reputation lookups use the embedded dataset (no network).
Tool 25: clinical_search¶
Search ClinicalTrials.gov — the NIH registry of 400K+ clinical studies — for evidence-based-medicine and systematic-review research. ClinicalTrials.gov is keyless, so this tool is always registered. Discovery + primary-source retrieval only — not medical advice.
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
query |
string | yes* | — | Free-text across trial fields. *Provide at least one of query/condition/intervention/sponsor |
condition |
string | yes* | — | Disease/condition (e.g. covid-19) |
intervention |
string | yes* | — | Drug/device/treatment (e.g. remdesivir) |
sponsor |
string | yes* | — | Lead sponsor / funder |
status |
string | no | — | Recruitment status filter: RECRUITING, COMPLETED, TERMINATED, … |
num_results |
int | no | 10 | 1–100 |
provider |
string | no | — | Force a clinical-trials provider: clinicaltrials |
sessionId |
string | no | — | Record results as sources on a sequential_search session |
Output Fields¶
Each trials[] item: nctId, title, status, phases (array), conditions (array), interventions (array), sponsor, startDate, hasResults (bool — whether results are posted), url (study page; scrape_page for the full registration), source. Plus query, resultCount, provider, hints (when empty), and trust (untrusted-external-content).
Behavior¶
- Combine
query/condition/intervention/sponsor/statusto narrow the registry's structured facets; at least one is required. - Provider honoring: an explicit
provideris used exclusively; otherwise the first configured provider answers. An error/empty returns a structured zero-result with hints (no silent fallback). - A bad request surfaces as a structured upstream error (the API returns
text/plainerrors, decoded as a message snippet); a404/no-match is an empty result, never a panic. - Auth: keyless — ClinicalTrials.gov v2 needs no API key.
Annotations¶
- ReadOnly: true · Idempotent: true · OpenWorld: true (queries the live ClinicalTrials.gov API)
Cache¶
- TTL: 6 hours (only for non-empty results)
Tool 26: local_search¶
Search for physical places (restaurants, cafes, shops, services, points of interest) by local intent query. Backed by Brave's three-call local pipeline: web search with result_filter=locations to collect ephemeral location IDs, then local/pois for structured POI details, then local/descriptions for AI-generated descriptions (best-effort). Requires BRAVE_API_KEY; the tool is not registered when the key is absent. Location IDs are ephemeral — never persisted beyond the request lifecycle.
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
query |
string | yes | — | Local intent query (e.g. 'best coffee shops near downtown Seattle') |
near |
string | no | — | Free-text location bias (city, neighborhood, region). Used as the location anchor when no coordinates are given (Brave: sent as a location header, not appended to the query). Coordinates take precedence. |
latitude |
number | no | — | WGS-84 latitude (−90…90) of the search anchor. With longitude, takes precedence over near, anchors the place index, and distance-ranks results nearest-first. |
longitude |
number | no | — | WGS-84 longitude (−180…180). Must be paired with latitude to take effect. |
radius |
number | no | 0 | Distance filter in meters, applied only when latitude/longitude are set. Drops places farther than this from the anchor. 0 = no filter. Independent of units. |
country |
string | no | — | ISO 3166-1 alpha-2 country restriction (e.g. US, GB) |
units |
string | no | — | metric or imperial (display only) |
num_results |
int | no | 5 | 1–20 |
provider |
string | no | — | Force a local-search provider: brave |
sessionId |
string | no | — | Record results as sources on a sequential_search session |
Output Fields¶
Each places[] item: id (ephemeral), name, address, lat, lon, phone, website (use scrape_page for the full site), categories (array), rating, ratingCount, hours (array, e.g. 'Thursday: 06:59-17:00'), description (absent when unavailable), source. Plus query, resultCount, provider, hints (when empty), and trust (untrusted-external-content).
Behavior¶
- Provider honoring: an explicit
provideris used exclusively; an unknown or unconfigured provider returns a structured error (no silent fallback). - Location anchoring (Brave): when
latitude/longitudeare supplied they are sent asX-Loc-Lat/X-Loc-Longheaders on the step-1 locations call (header values are never logged) and take precedence overnear; otherwisenear/countrypopulate theX-Loc-City/X-Loc-Countrytext-fallback headers. The query is not suffixed with the location. Steps 2/3 (local/pois,local/descriptions) send no location headers, matching Brave's reference client. - Distance ranking: with an anchor coordinate present, returned POIs are sorted nearest-first by haversine distance;
radius(meters) additionally drops places beyond that distance. Coordinates are optional and validated at the boundary —latitudeandlongitudemust be given together and within range, andradiusmust be non-negative, else the call returns a structured validation error. - The descriptions step is best-effort: a failure there does not fail the whole call — results are returned without descriptions.
- Returns an empty
placesarray (withhints) when no location results match; never panics.
Annotations¶
- ReadOnly: true · Idempotent: true · OpenWorld: true (queries the live Brave Search API)
Cache¶
- TTL: 6 hours (only for non-empty results)
Tool 27: audit_bibliography¶
Purpose¶
The corpus-level companion to verify_citation: audit a whole bibliography at once. Read a CSL-JSON / RIS / BibTeX document (what format_bibliography exports), an explicit list of references, or a sequential_search session's sources, and run the same trust checks over every entry — does it exist, is it retracted, does its link still resolve, and (optionally, per entry) does the source actually address the claim it's cited for. Built to catch fabricated, retracted, or mischaracterized citations across a full reference list (legal filings, papers, systematic reviews) before they ship. Composes the retraction enrichment, the link verifier, the academic searchers, and claim-evidence extraction; adds no new provider.
Input Schema¶
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
bibliography |
string | yes* | — | A bibliography document. *Provide one of bibliography/entries/sessionId |
format |
string | no | auto |
auto (detect), csl-json, ris, or bibtex |
entries |
[]object | yes* | — | Explicit references (url, title, author, site, date, doi, claim). *One of bibliography/entries/sessionId |
sessionId |
string | yes* | — | Audit a sequential_search session's recorded sources. *One of bibliography/entries/sessionId |
Precedence when more than one is supplied: entries → bibliography → sessionId. A per-entry claim is honored only in the explicit entries mode (a document/session carries no per-entry claims).
Output Schema¶
source (where entries came from: entries / bibliography:<format> / session), entryCount, summary ({total, retracted, deadLink, notFound, unchecked, mischaracterized, ok}), and entries[] — per entry: index, title, doi, url, exists (bool), retractionStatus (when retracted/corrected), linkLive + httpStatus, archivedUrl (Wayback snapshot for a dead link), flags (retracted / dead_link / not_found / unchecked / mischaracterized; empty = clean), reason (a human-readable explanation for a flagged entry), and — when a claim was given — claim, claimSupport (addressed / partially_addressed / not_addressed / source_unavailable), claimEvidence (relevant source sentences), and claimSourceUrl (the URL actually fetched). Plus checkedAt (RFC 3339 point-in-time stamp), the trust marker, and — only when the per-call cap is exceeded — skipped + skippedNote.
Behavior¶
- Evidence, never a verdict (same contract as
verify_citation). It reports what it found per entry and a corpus summary; the caller decides what to fix. - One pass, bounded. All entry URLs are checked in a single batched, concurrency-bounded link pass; DOI existence+retraction (one Crossref call each), academic existence lookups, and the optional per-entry claim fetch all run concurrently (bounded). A DOI is authoritative for existence+retraction; without one, existence is confirmed by a best-match academic title lookup.
- Claim check (optional, #174). When an entry carries a
claim, the source page is fetched (the live URL, or its Wayback snapshot when the live link is dead) and measured for topical overlap via transparent term coverage.claimSupportreports coverage, not a stance:addressed(strong overlap — claim-relevant sentences returned inclaimEvidencefor you to judge direction),partially_addressed(some overlap — evidence shown but not flagged; ambiguous, you judge),not_addressed(the source addresses none of the claim → themischaracterizedflag), orsource_unavailable(no fetchable source). It never asserts "supports"/"refutes" — the extractor surfaces sentences, not direction — and the flag fires only on zero overlap of a fetched source, so a real-but-tangential source is never falsely accused (under-flagging is the safe direction). The check is lexical, not semantic (no model/embedding dependency): coverage is measured as the peak overlap within a sentence window (not across the whole page), so a narrow claim whose terms are merely scattered across a long, broad article scores low local coverage rather than a misleadingpartially_addressed. Borderline partial overlap is still shown as evidence for you to judge, not flagged. Because a source can share a claim's terms while contradicting it, acontrastSignal: trueis added when a claim-relevant sentence carries a negation/refutation cue (e.g. "no significant", "did not", "no evidence", "contradicts") — a neutral "read this sentence yourself" heads-up so anaddressedresult is never mistaken for confirmation. The cue list holds only explicit negation/refutation terms, never bare discourse connectives like "however"/"although"/"whereas" (which contrast two things without opposing the claim and would false-positive on supporting sources). Off unless a claim is given (no added latency otherwise). - Flagging (deliberately distinguishes evidence of a problem from absence of evidence):
retracted= the DOI/record is retracted (an expression-of-concern/correction is surfaced inretractionStatusbut not flagged retracted);dead_link= a URL was checked and the server did not respond — 4xx-gone or network failure (a WaybackarchivedUrlis attached when one exists); alinkLive:trueresult with a 403/429/503httpStatusmeans the server is reachable but blocked the verifier — the resource exists anddead_linkis not set;not_found= a DOI was looked up against Crossref and had no match — a possible fabrication;unchecked= the entry could not be corroborated by any check (no identifier, no live link) — absence of evidence, not evidence of absence (e.g. a book, a paywalled or offline source);mischaracterized= a claim was given and the fetched source does not address it. These are never conflated, and each carries areasonso a legitimate uncheckable source is never read as fake. - Capped at the first 200 entries per call; any overflow is reported in
skipped/skippedNote(never silently dropped). - Session audits are scoped to the caller's own
(tenant, user). Degrades gracefully when a resolver/scraper is unconfigured; never panics.
Annotations¶
- ReadOnly: true · Idempotent: true · OpenWorld: true (queries live Crossref, the open web, and the Internet Archive)
Cache¶
- Not cached (a point-in-time liveness/integrity audit).
Tool 28: archive_source¶
Purpose¶
Capture a fresh Internet Archive (Wayback Machine) snapshot of a URL via Save Page Now, so a source you intend to cite stays verifiable even if the page later changes or disappears. This is the trust suite's one write tool: the rest of the suite can tell you a link is dead and surface an existing snapshot (read-only); this one creates a new snapshot. It makes "stays honest" durable rather than point-in-time.
Input Schema¶
| Field | Type | Required | Description |
|---|---|---|---|
url |
string | yes | The URL to capture a fresh snapshot of. Must be a public http/https URL (private/loopback hosts are refused). |
Output Schema¶
requestedUrl (echo), snapshotUrl (the https://web.archive.org/web/<timestamp>/<url> snapshot; omitted when pending/unavailable), archivedAt (RFC 3339 — when a fresh capture was confirmed; present only on a fresh capture), captured (bool — true only for a fresh capture by this call), status (archived|existing|pending|unavailable), httpStatus (Save Page Now endpoint status), reason (why no fresh capture, for existing/pending/unavailable), pollUrl (Wayback wildcard URL to check manually once SPN's in-flight ingestion completes; present only when status:"pending" and no existing snapshot was found), source (web.archive.org Save Page Now), provenance, and the trust marker.
Behavior¶
- Write tool, non-destructive. It creates a public archive entry; it never deletes or mutates existing data. Annotated
ReadOnly:false, Destructive:false, Idempotent:true(Save Page Now dedups within its rate window). - Best-effort and honest. Save Page Now is rate-limited and slow — a fresh capture is not guaranteed. The tool retries with backoff within its ~25 s budget so a slow-but-successful first-time capture is confirmed in-call. When one can't be made, the tool falls back to the most recent existing snapshot and reports
captured:false/status:"existing"; when nothing is confirmed it reportsstatus:"pending"with apollUrlto check once in-flight ingestion completes. It never errors on a slow/throttled capture. - Evidence, never a verdict. It returns the snapshot artifact + provenance; it does not judge the source.
- SSRF-safe. The outbound request goes to the fixed
web.archive.orghost through the SSRF-safe client (every redirect hop IP-revalidated); the submitted URL is additionally validated and private/loopback hosts are refused before the request. - Optional credentials. Keyless Save Page Now works by default; setting
IA_ACCESS_KEY+IA_SECRET_KEYauthenticates the request for higher reliability (keys are never logged or echoed). status:"unavailable"when no link verifier is configured (graceful, not an error).- Use
verify_citationfirst to see whether a link is already dead or already archived.
Annotations¶
- ReadOnly: false (write) · Destructive: false · Idempotent: true · OpenWorld: false (advisory; the call does reach the Internet Archive)
Cache¶
- Not cached (each call is an explicit archive request).
Tool 29: brand_research¶
Purpose¶
Research a company's complete brand identity — colors, logos, typography, tone of voice, social handles, and W3C design tokens — from any domain or company name. Probes official brand portals and brand guideline pages; only returns high-confidence structured data found directly on those pages (empty fields = genuinely not found). When a brand portal is found, the fully rendered page text is stored as a resource (brand_portal_resource) that an AI agent can pass to read_resource to analyze colors, typography, and other details directly. When no brand portal is found, the suggestion field guides the AI agent to use scrape_page on the homepage instead.
Use brand_research when you need structured brand JSON. Use brand-guidelines (MCP Prompt) when you want LLM interpretation for a specific use case (landing page, email, video brief).
When to use vs. other tools¶
| Use case | Tool |
|---|---|
| Get raw brand data (colors, logos, fonts as JSON) | brand_research |
| Generate brand-compliant content for a specific use case | brand-guidelines prompt |
| Scrape the brand homepage | scrape_page |
| Search for brand guidelines PDF | search_and_scrape |
Input Schema¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
url |
string | no* | — | Domain or URL (e.g. kaltura.com, https://kaltura.com). Preferred when both are supplied. |
company_name |
string | no* | — | Company name used to resolve domain when url is omitted. |
depth |
string | no | standard |
quick (meta only, ~1–2s), standard (adds brand-page probe, ~3–6s), full (adds web search, ~8–15s). |
include_design_tokens |
bool | no | false | When true, include W3C DTCG design_tokens object for Style Dictionary / Tokens Studio / Figma Variables. |
sessionId |
string | no | — | Link to a sequential_search session. |
*At least one of url or company_name is required.
Output Schema¶
| Field | Type | Description |
|---|---|---|
identity |
object | name, domain, tagline, description, industry, founded, location |
colors |
object | primary, secondary, accent, background, text, surface, text_secondary (hex strings) + palette array with hex, name, role, brightness |
logos |
object | primary, dark, icon (each: url, format, width, height) + favicon, og_image |
typography |
object | heading, body, mono (each: family, weights, origin, origin_id) + google_fonts_url, scale |
tone_of_voice |
object | summary, attributes array, dos_and_donts object |
social |
object | twitter, linkedin, github, youtube, facebook, instagram (URLs) |
sources |
array | Which tiers contributed: name (homepage_meta/brand_page/web_search), url, fields |
guidelines_url |
string | First discovered brand guidelines page URL |
brand_portal_resource |
string | research://artifact/{id} URI — pass to read_resource to retrieve the full rendered brand portal text for AI analysis |
suggestion |
string | Guidance for the AI agent when no brand portal was found (recommends scrape_page on the homepage), OR when a portal URL was found but its content could not be extracted (recommends scrape_page on the portal URL) |
design_tokens |
object | W3C DTCG format ($value/$type per token) — only when include_design_tokens: true |
coverage |
object | colors, logos, typography — each full/partial/none; tone_of_voice — found/none |
cache_age |
integer | Seconds since cache was written. 0 = live fetch. Cache TTL: 24 hours. |
trust |
string | Always untrusted-external-content |
Behavior¶
- Three-tier extraction pipeline. Tiers (2) homepage structured-data + meta and (4) brand-page probe (concurrent HEAD requests, 26 patterns: 5 subdomains + 21 paths) run concurrently via goroutines. Tier (5) web search (depth=full only) runs sequentially after they complete. Higher tiers never overwrite lower-tier values.
- Authoritative-data-first. Only high-confidence data found explicitly on brand portal pages is returned. Empty fields mean genuinely not found — never inferred or guessed. No CSS heuristics.
- Brand-page probe runs all 26 candidates concurrently, picks the highest-priority match (dedicated brand subdomains beat path matches), rejects redirects to the homepage or a different host. For dynamic SPA brand portals (e.g. Corebook.io), the page is rendered via browser tier and navigation links are extracted from the rendered DOM to find color/typography sub-pages.
- Brand portal resource. When a brand portal is found, the full rendered page text (including any color sub-pages) is stored as an encrypted artifact (
research://artifact/{id}, 30-min TTL) and returned inbrand_portal_resource. Pass this URI toread_resourceso an AI agent can analyze the raw rendered content for colors, typography, and other details. - Fallback suggestion. When no brand portal is found,
suggestionis populated with guidance to usescrape_pageon the homepage for fully rendered content analysis. When a portal URL was found but its content could not be extracted (e.g. a gated SPA),suggestioninstead recommendsscrape_pageon the portal URL. - Tone of voice is only extracted from explicit brand guidelines pages — not inferrable from meta or CSS.
Annotations¶
- ReadOnly: true · Destructive: false · Idempotent: true · OpenWorld: true
Cache¶
- TTL: 24 hours. Key: SHA-256 of domain + depth.
cache_agefield shows seconds since last fetch.
Cross-Cutting Concerns¶
Timeouts¶
Scrape-tier timeouts are hardcoded in source; only HTTP server timeouts are env-configurable (see docs/DEPLOYMENT.md).
| Operation | Default | Max |
|---|---|---|
| Search API call | 10s | 30s |
| Markdown negotiation | 5s | 10s |
| Stealth scrape (http client) | 20s | — |
| HTML scrape (goquery) | 15s | 30s |
| Browser scrape (go-rod) | 30s | 60s |
| YouTube transcript | 30s | 60s |
| Document download | 30s | 60s |
| Total tool execution | 60s | 120s |
Content Size Limits¶
| Content | Limit |
|---|---|
scrape_page content — default max_length |
50 KB |
scrape_page content — hard cap (maxScrapeLength, all modes incl. raw) |
5 MB |
search_and_scrape per-source content — default |
50 KB |
search_and_scrape combined content — default total_max_length |
300 KB |
| Document download | 50 MB default |
| YouTube transcript | 100 KB |
Token Estimation¶
- Formula:
len(content) / 4(conservative, ~4 chars per token) - Size categories: small (<5K chars), medium (<20K), large (<50K), very_large (>=50K)
Cache Freshness Provenance¶
Cacheable tools attach a top-level MCP _meta block so a client can tell whether a result was served from cache and how stale it is. The fields (set in cachedResultWithMeta / freshResult, internal/tools/errors.go) are:
_metarides on the MCPCallToolResultenvelope — a sibling ofcontent, not a key inside the content JSON body. A client reads it from the result'smeta/_metafield, never by parsing the body string. The end-to-end roundtrip guard isTestWebSearchCacheMeta_PresentOnFreshAndCacheHit.
| Field | Type | Meaning |
|---|---|---|
cached |
bool | true if served from cache, false if freshly fetched |
ageSeconds |
int | Age of the cached entry in seconds (0 for fresh) |
maxAgeSeconds |
int | The entry's TTL in seconds |
freshness |
string | Human-readable freshness label (e.g. fresh) |
Which tools emit _meta:
| Tool | Fresh result | Cache hit |
|---|---|---|
web_search |
yes (cached: false) |
yes (cached: true) |
image_search, news_search, academic_search, patent_search, scrape_page |
no | yes (cached: true) |
search_and_scrape, sequential_search, get_research_session |
no (not cached as a unit) | n/a |
Routing Provenance (_meta.routing)¶
When multi-provider routing (SEARCH_ROUTING) is active, the search-family tools attach an operator-facing routing block to the same _meta channel (set in routingMeta / withRoutingMeta, internal/tools/errors.go; captured by the Router via search.RoutingTrace). It coexists with the cache block — withRoutingMeta merges into _meta without clobbering the freshness keys.
This is operator/debug data, not content. It is LLM-invisible (a sibling of content, never fed to the model) and never appears in the result body — the Router's job is to make providers interchangeable to the model. The drift guard TestRoutingMeta_PresentOnResultAndAbsentFromContent fails CI if any routing field leaks into LLM-facing content.
| Field | Type | Meaning |
|---|---|---|
provider_used |
string | The provider that served the result (omitted on a cache hit) |
providers_attempted |
[]string | Providers tried in priority order, up to the one that served |
fallback |
bool | true when the served provider was not the first attempted |
fallback_reason |
string | Coarse enum: circuit_open or primary_unavailable (omitted unless a fallback occurred). No raw breaker counts or upstream error text. |
cache_hit |
bool | true when served from cache (provider attribution is then omitted — the cached blob's provenance is not this call's routing) |
latency_ms |
int | Server-side end-to-end latency for the call |
The provider name is the disclosure boundary: no upstream URLs, credentials, or breaker internals appear. The block is omitted entirely when there is nothing to observe — a single-provider / no-routing deployment, or a non-routed capability. Routing applies to the Router-routed capabilities only (web / images / news / patents / academic); the synthesis (answer, structured_search), citation_graph, and structured-domain (filing_search, legal_search, econ_search, clinical_search) tools resolve a single provider directly and already name it in the result body's source/provider field — they have no fallback ladder to observe. The same routing summary is also recorded under audit.AuditEvent.Metadata["routing"].
For the aggregate, on-demand operator views (recent errors, live provider/breaker health) see the diagnostics:// MCP Resources and the HTTP-mode dashboard in docs/DEPLOYMENT.md.
MCP Resources¶
Read-only, on-demand views exposed as MCP Resources (not tools). Read with ReadResource(uri).
| URI | Name | Description |
|---|---|---|
stats://tools |
Tool Statistics | Per-tool call count, latency, and error rate |
stats://sessions |
Active Sessions | Count of live research sessions |
stats://rate-limits |
Rate Limit Status | Per-tenant and global quota config + remaining |
stats://providers |
Configured Providers | Every configured provider by name and capability type |
lenses://catalog |
Search Lens Catalog | All available search lenses — name, description, domain count, and whether a dedicated Custom Search Engine is configured. Pass a name to web_search, academic_search, news_search, or image_search as the lens parameter to restrict results to authoritative sources for that domain. |
diagnostics://errors/recent |
Recent Errors | Bounded, newest-first ring of recent tool errors (redacted, tenant-scoped) |
diagnostics://health |
Provider Health | Live circuit-breaker state per provider; empty when multi-provider routing is not enabled |
research://artifact/{id} |
Research Artifact | Large-payload store for scrape_page (raw mode), search_and_scrape, and research_export results served via resource_link |
Audit & Tenant Scope¶
Every tool call is logged through deps.Auditor.Log() as an audit.AuditEvent (internal/audit/logger.go) carrying tenant_id, user_id, request_id, tool_name, duration_ms, success, and an optional error_code (field names are the JSON tags on AuditEvent). Tenant and user identity are read from the request context (auth.TenantIDFromContext / auth.UserIDFromContext).
Privacy: the raw query text is attached to metadata.query only when AUDIT_INCLUDE_REQUEST_BODY is enabled (Auditor.IncludeRequestBody()); otherwise just metadata.query_length is recorded. All metadata string values and error strings pass through audit.MaskSecrets so credentials never persist. Cache keys and session keys are tenant-scoped, so one tenant cannot read another's cached or session data.
Unified Error Handling¶
All tools use a dual-format error response: a natural-language first line + a JSON block with machine-readable metadata:
Rate limited (google). Wait 60 seconds and retry, or try a different provider.
{"error":{"kind":"rate_limited","retryable":true,"retryAfterSeconds":60,"suggestedAction":"retry_after_delay","provider":"google"}}
Error kinds: rate_limited, auth_required, blocked, network, content_empty, browser_unavailable, config, upstream_unavailable. Each maps to a suggestedAction the LLM can branch on programmatically.
Full details: see docs/ERROR_HANDLING.md — covers the three-layer architecture, all error kinds and actions, and contributor patterns.
Tool Annotations (MCP Protocol)¶
Every tool declares annotations for client consumption (readOnlyAnnotations(idempotent, openWorld) for read tools, writeAnnotations(idempotent) for the three write tools (memory_save, workspace_contribute, archive_source) — all in internal/tools/registry.go). CI enforces tool↔doc consistency via TestAllToolsHaveAnnotations, TestToolsDocMatchesRegistry, TestOutputSchemaMatchesResponse, and TestToolDescriptionQuality (internal/tools/metadata_test.go) — including on docs-only PRs via the standalone docs-drift CI job. No tool is Destructive — deletion is the /admin/data erasure endpoint, never a tool flag (see docs/DEPLOYMENT.md).
| Tool | ReadOnly | Idempotent | OpenWorld |
|---|---|---|---|
| web_search | true | true | true |
| scrape_page | true | true | true |
| search_and_scrape | true | true | true |
| image_search | true | true | true |
| news_search | true | true | true |
| academic_search | true | true | true |
| patent_search | true | true | true |
| sequential_search | true | false | false |
| get_research_session | true | true | false |
| answer | true | true | true |
| structured_search | true | true | true |
| citation_graph | true | true | true |
| research_export | true | true | false |
| format_bibliography | true | true | false |
| audit_bibliography | true | true | true |
| verify_citation | true | true | true |
| verify_recommendation | true | true | true |
| archive_source | false (write) | true | false |
| filing_search | true | true | true |
| legal_search | true | true | true |
| econ_search | true | true | true |
| clinical_search | true | true | true |
| local_search | true | true | true |
| get_my_analytics | true | true | false |
| memory_save | false (write) | false | false |
| memory_recall | true | true | false |
| workspace_contribute | false (write) | false | false |
| workspace_read | true | true | false |
| brand_research | true | true | true |
Notes: sequential_search is non-idempotent because it writes session state to disk on every call. memory_save, workspace_contribute, and archive_source are the three write tools (ReadOnly:false). memory_save and workspace_contribute are non-idempotent (each call appends a new record); archive_source is idempotent (archiving the same URL twice is safe). OpenWorld:false marks tools that touch only local/server state (sessions, memory, analytics, workspaces, exports) rather than the open web. Destructive is uniformly false — no tool is annotated destructive.
Provider Resolution¶
When a provider field is set on any search tool:
1. If provider is in the SearchProviders/PatentProviders/AcademicProviders map → use it
2. If provider is known but not configured → error with env var hint
3. If provider is completely unknown → error listing all supported providers (via allSupportedProviders())
Source of truth for supported providers: search.SupportedProviders, search.SupportedPatentProviders, search.SupportedAcademicProviders in internal/search/provider.go.
Known Provider Behaviors (not bugs)¶
These are upstream behaviors we cannot control — they reflect how the underlying APIs work:
| Provider | Behavior | Impact |
|---|---|---|
| SearchAPI | May return fewer results than num_results requested |
Query has limited coverage in their index; not an error |
| Google (news) | freshness=hour may return articles 5-10 hours old |
Google's "last hour" filter is approximate, not strict |
| Google (images) | size=large may return images as small as 600x600 |
Google's size thresholds differ from typical expectations |
| USPTO | Full-text search only (no field-qualified queries) | API rejects field syntax; results rely on relevance ranking |
| OpenAlex | pdf_only may return 0 results for common topics |
Not all papers have PDF URLs indexed in their metadata |
| DuckDuckGo | Rate-limited aggressively from cloud/datacenter IPs | Works well from local/STDIO; may return 0 results from servers |
| DuckDuckGo | Images and News return empty results | HTML endpoint doesn't support these categories; Router falls through |
| HackerNews | web_search / news_search only (no Images); dateRange filter via Algolia numericFilters; num_results 1–100 (values outside that range reset to 10); no API key required (SEARCH_PROVIDER=hackernews or provider: hackernews per-call) |
Algolia HN search index only; not a general-web index |
These are not errors in web-researcher-mcp. The tool faithfully passes parameters to the upstream API and returns whatever the API provides.
MCP Prompts¶
MCP Prompts are LLM-facing instruction handlers — they orchestrate tool calls and synthesis for a specific use case. Unlike tools (which return structured JSON), prompts return a natural-language instruction that tells the LLM how to use tool outputs.
brand-guidelines¶
Research a company's brand identity and produce use-case-specific brand-compliant guidance. Calls brand_research, interprets the structured JSON, and produces actionable creative direction for the requested use case.
When to use vs. brand_research directly: Call brand_research when you want the raw structured JSON (colors as hex, logos as URLs, fonts as names). Invoke the brand-guidelines prompt when you want an LLM to interpret that JSON and produce formatted creative guidance for a specific task.
Arguments¶
| Argument | Required | Default | Description |
|---|---|---|---|
company |
yes | — | Company name or domain (e.g. kaltura.com or Kaltura) |
use_case |
no | full_guidelines |
landing_page, email, social_post, video_brief, design_tokens, full_guidelines |
depth |
no | standard |
Passed to brand_research: quick, standard, full |
Use cases¶
use_case |
Produces |
|---|---|
landing_page |
Color palette table, typography spec, sample headline + subhead, component guidance |
email |
Email color spec, font stack, sample subject line + preheader, HTML inline-style snippet |
social_post |
Visual identity notes, three sample captions (Twitter/X, LinkedIn, Instagram), hashtag suggestions |
video_brief |
Motion graphics color spec, logo bug placement, voiceover tone direction, music mood descriptor |
design_tokens |
W3C DTCG JSON code block, token mapping table, gap analysis |
full_guidelines |
Comprehensive brand doc: identity, color system, logo usage, typography, tone of voice, coverage summary |