Provider Reference

This document covers every search and data provider supported by web-researcher-mcp: what backs each one, which tools it enables, free-tier limits, and when to choose it. For setup instructions (API keys, env vars, routing config) see API_SETUP.md.


Web Search Providers

Index Classification

Understanding what backs each provider helps you reason about result overlap and independence.

| Provider | Index Type | What backs the results |
|---|---|---|
| DuckDuckGo | Bing-sourced | Microsoft Bing (+ Bing-syndicated sources) |
| Google PSE | Own index | Google's web index |
| Serper | Google-backed | Google's web index via API |
| SearchAPI.io | Google-backed | Google's web index via API |
| Brave | Own independent | Brave's own web crawler and index |
| Exa | Own independent | Neural/embedding-based web index |
| Tavily | Aggregator | Queries multiple existing engines at runtime; scrapes top results; applies AI re-ranking; no proprietary crawled index |
| SearXNG | Meta-search | Configurable: routes to whatever backends you point it at (Bing, Google, DuckDuckGo, and others) |
| HackerNews | Niche | HN Algolia index (Hacker News stories and submissions only) |

Practical implication: Google PSE, Serper, and SearchAPI.io draw from the same index — using more than one adds no coverage, only redundancy. Brave and Exa bring genuinely independent results. Tavily and SearXNG aggregate results from others rather than crawling themselves.
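
The overlap argument can be checked mechanically. The sketch below transcribes the table above into a lookup and counts the distinct indices behind a routing list. The function name `independent_indices` and the lowercase provider keys other than `brave`, `google`, `serper`, and `searchapi` are assumptions made for illustration; they are not taken from web-researcher-mcp itself.

```python
# Backing index per provider, transcribed from the Index Classification table.
# Aggregators (Tavily, SearXNG) have no index of their own, so they map to None.
# Keys such as "duckduckgo" or "exa" are assumed spellings, not confirmed config values.
BACKING_INDEX = {
    "duckduckgo": "bing",
    "google": "google",
    "serper": "google",
    "searchapi": "google",
    "brave": "brave",
    "exa": "exa",
    "tavily": None,
    "searxng": None,
    "hackernews": "hn-algolia",
}


def independent_indices(routing: str) -> set[str]:
    """Return the distinct backing indices covered by a comma-separated routing list."""
    providers = [p.strip().lower() for p in routing.split(",") if p.strip()]
    # Unknown providers and aggregators contribute no index of their own.
    return {BACKING_INDEX[p] for p in providers if BACKING_INDEX.get(p)}
```

Under these assumptions, `independent_indices("brave,google,serper")` contains only two entries, because Google PSE and Serper share one index.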


Capability Matrix

Which tools each web search provider enables. Where a provider does not support a capability, it returns an empty result (no error) for that capability; image-capable providers in SEARCH_ROUTING then handle the fallback automatically.

Provider web_search image_search news_search answer structured_search local_search Scrape fallback tier
DuckDuckGo
Google PSE
Serper
SearchAPI.io
Brave
Exa ✓ (paid, last-resort)
Tavily
SearXNG
HackerNews

Notes:
- answer and structured_search are provider-independent tools, but Exa is the only web provider that backs them with its native API. They remain unavailable if no Exa key is set.
- local_search is Brave-only: it requires BRAVE_API_KEY. No other web provider supports the three-call local pipeline (locations → POIs → descriptions).
- Brave also exposes an LLM context endpoint (/res/v1/llm/context) consumed by search_and_scrape as a fast path for RAG/grounding workflows. When Brave is the active provider, search_and_scrape tries the server-assembled context first; if that fails, it falls back to the standard search-then-scrape pipeline. Requires BRAVE_DATA_FOR_AI plan access.
- Exa's scrape fallback tier (/contents) fires only when all four free tiers (markdown → stealth → HTML → browser) have failed. It charges an Exa credit per call.
- Tavily's time-range filter is aggressive on web search. For recent content, news_search works better; web_search may return nothing for narrow windows.
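
The tier ordering described in the notes (free tiers first, paid tier only as a last resort) can be sketched as follows. This is a minimal illustration, not the project's implementation: `scrape_with_fallback`, the `Tier` alias, and the convention that a fetcher returns `None` on failure are all hypothetical.

```python
from typing import Callable, Optional

# Each tier is a (name, fetch function) pair. A fetch function returns the page
# text, or None when that tier fails. Both conventions are assumptions.
Tier = tuple[str, Callable[[str], Optional[str]]]


def scrape_with_fallback(url: str, free_tiers: list[Tier],
                         paid_tier: Optional[Tier] = None) -> tuple[str, str]:
    """Try free tiers in order; use the paid tier only if every free tier failed."""
    for name, fetch in free_tiers:
        try:
            text = fetch(url)
        except Exception:
            text = None  # treat an exception like any other tier failure
        if text:
            return name, text
    if paid_tier is not None:
        name, fetch = paid_tier  # reached only after all free tiers failed
        text = fetch(url)
        if text:
            return name, text
    raise RuntimeError(f"all scrape tiers failed for {url}")
```

Passing the four free tiers in the order markdown, stealth, HTML, browser, with an Exa-backed fetcher as `paid_tier`, reproduces the ordering from the notes: the paid call is never made while any free tier still succeeds.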


Free Tier and Pricing

| Provider | Free Tier | Paid |
|---|---|---|
| DuckDuckGo | Unlimited | Free |
| HackerNews | Unlimited | Free |
| SearXNG | Unlimited (self-hosted) | Free (self-hosted) |
| Google PSE | 100 queries/day | $5 / 1,000 queries |
| Brave | 2,000 queries/month | Paid plans |
| Serper | 2,500 queries (one-time) | Paid plans |
| SearchAPI.io | 100 searches/month | Paid plans |
| Exa | 1,000 requests/month | Per call beyond free tier |
| Tavily | Monthly dev credits | Paid plans |

Quick-Pick Guide

| If you need… | Use |
|---|---|
| Zero-config, no signup | DuckDuckGo (built-in fallback) or HackerNews (HN-only) |
| Broadest index coverage | Google PSE |
| High-volume + own index | Brave (2,000/month free, privacy-first) |
| Independent results alongside Google | Brave or Exa (different indices, no overlap) |
| Semantic / conceptual search | Exa |
| LLM-ready extracted content | Tavily |
| answer or structured_search tools | Exa (required) |
| Air-gapped or no vendor lock-in | SearXNG (self-hosted) |
| Tech/developer community signal | HackerNews |
| Maximum reliability | SEARCH_ROUTING=brave,google,serper (three separately operated services; note that Google PSE and Serper share Google's index, so this adds availability rather than coverage) |

Provider Notes

DuckDuckGo — The zero-config default. No API key, no registration, no rate limit to configure. Result depth is lower than keyed providers; image and news results are present but less comprehensive. Use as a fallback, not a primary.

Google PSE — The largest index. Best for broadest coverage, image search, and exact-phrase queries. Requires both an API key (via Google Cloud Console) and a Programmable Search Engine ID. Free tier of 100/day is low for sustained use.

Serper and SearchAPI.io — Google results without the PSE setup overhead. Serper is the simpler option; SearchAPI.io supports multiple engine backends beyond Google. Both draw from Google — no coverage difference between them or vs. Google PSE.

Brave — Own crawler, own index, privacy-first. Best all-purpose choice when you want index independence from Google/Bing and a generous free tier. Supports web, image, news, and Goggles-based custom result weighting. Also exposes local/map results via local_search (the only provider that does) and an LLM context endpoint used by search_and_scrape for faster grounding when you're on Brave's Data for AI plan.

Exa — Neural/semantic index. Results are ranked by embedding similarity, not just keyword match — better for conceptual or research queries. The only provider that backs answer (grounded synthesis with citations) and structured_search (schema-defined entity extraction). Also provides a paid /contents scrape tier as a last-resort fallback for scrape_page. Most expensive per-call but uniquely capable.

Tavily — Aggregates from multiple existing search engines at query time, then scrapes the top results and applies AI re-ranking. No proprietary index — similar in architecture to SearXNG, but hosted/commercial with an AI synthesis layer. Returns pre-extracted LLM-ready content. Closest comparison: SearXNG (open-source, self-hosted, no synthesis layer) or Exa (own index, deeper semantic capabilities). Best used as a routing member rather than the sole provider since it lacks image search.

SearXNG — Open-source, self-hosted, routes to configurable backends. Best for air-gapped environments, organizations requiring no external vendor dependency, or privacy-first deployments. Requires hosting and setup but carries no query limits or API costs.

HackerNews — Searches HN stories and submissions via the public HN Algolia API. No key or registration. Not general web — use only when you specifically want HN community signal, tech discussions, or submission history. scrape_page on any HN URL (item, user, list) reads natively through the HN Firebase API regardless of which SEARCH_PROVIDER is set.
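
To make the native HN read path concrete, here is a rough sketch of how an HN page URL could be mapped to the public Hacker News Firebase API. The Firebase endpoints used are the public ones; the function `hn_firebase_url` and the exact set of list pages handled are assumptions, and the real scrape_page logic may differ.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

FIREBASE_BASE = "https://hacker-news.firebaseio.com/v0"

# HN list pages and the Firebase story-list endpoints assumed to correspond to them.
LIST_ENDPOINTS = {
    "/": "topstories", "/news": "topstories", "/newest": "newstories",
    "/best": "beststories", "/ask": "askstories", "/show": "showstories",
    "/jobs": "jobstories",
}


def hn_firebase_url(url: str) -> Optional[str]:
    """Map an HN item, user, or list URL to a Firebase API URL, else return None."""
    parsed = urlparse(url)
    if parsed.hostname != "news.ycombinator.com":
        return None
    path = parsed.path or "/"
    query = parse_qs(parsed.query)
    if path in ("/item", "/user") and "id" in query:
        # /item?id=8863 -> /v0/item/8863.json ; /user?id=pg -> /v0/user/pg.json
        return f"{FIREBASE_BASE}{path}/{query['id'][0]}.json"
    if path in LIST_ENDPOINTS:
        return f"{FIREBASE_BASE}/{LIST_ENDPOINTS[path]}.json"
    return None
```

A `None` result here would mean "not an HN URL this sketch recognizes", at which point a caller would presumably use the ordinary scrape tiers.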


Academic Search Providers

Capability Matrix

Provider Search DOI Resolution Citation Graph OA PDF enrichment AI summaries Key Required
OpenAlex via Unpaywall No (email for polite pool)
CrossRef ✓ (authoritative) No (email for polite pool)
Semantic Scholar ✓ (rich edges) ✓ (tldr) No (key raises limits)
PubMed No (key raises limits)
Exa Yes (EXA_API_KEY)

Notes:
- CrossRef is the official DOI registration agency and the authoritative source for DOI metadata. Every DOI-registered work appears here.
- Semantic Scholar enriches results with AI-generated tldr summaries and citation intent/influence edges, which power citation_graph. OpenAlex also implements citation_graph support with citation-count edges as a fallback.
- Only OpenAlex implements the DOIResolver interface (exact-entity lookup via /works/doi:{doi}). CrossRef, Semantic Scholar, and PubMed do not.
- Exa routes academic queries using its research-paper category, which is useful when its neural index surfaces papers the bibliographic databases miss.
- Unpaywall OA enrichment runs as a post-processing step on any DOI-bearing result; it is not a separate provider to select.
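
The exact-entity lookup path mentioned above can be illustrated with a small URL builder. This is a sketch only: `normalize_doi` and `openalex_doi_url` are hypothetical helper names, and the DOI cleanup rules are an assumption about what a resolver would reasonably accept. The `mailto` query parameter is how OpenAlex identifies polite-pool requests.

```python
from urllib.parse import quote, urlencode

OPENALEX_BASE = "https://api.openalex.org"

# Common prefixes people paste in front of a bare DOI.
_DOI_PREFIXES = ("https://doi.org/", "http://doi.org/",
                 "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(raw: str) -> str:
    """Strip URL or 'doi:' prefixes and lowercase (DOIs are case-insensitive)."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.lower()


def openalex_doi_url(raw_doi: str, email: str = "") -> str:
    """Build the /works/doi:{doi} lookup URL, adding mailto for the polite pool."""
    doi = quote(normalize_doi(raw_doi), safe="/")
    url = f"{OPENALEX_BASE}/works/doi:{doi}"
    if email:
        url += "?" + urlencode({"mailto": email})
    return url
```

For example, `openalex_doi_url("https://doi.org/10.1038/NATURE12373")` yields a URL ending in `/works/doi:10.1038/nature12373`.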

Coverage

| Provider | Corpus | Focus |
|---|---|---|
| OpenAlex | 287M+ works | All academic disciplines; CC0 data |
| CrossRef | 140M+ DOI-registered works | Peer-reviewed literature; authoritative DOI metadata |
| Semantic Scholar | 200M+ papers | Broad; strong on CS, medicine, biology |
| PubMed | 35M+ citations | Biomedical and life science only |
| Exa | Neural web index | Research-paper category; surfaces papers outside bibliographic DBs |

Academic Routing

Without explicit routing, all configured academic providers are tried in order. The recommended starting config:

export SEARCH_ROUTING='{"academic":"openalex,crossref,semanticscholar","default":"brave,google"}'

If no academic providers are configured, academic_search automatically falls back to site-restricted web search.


Patent Search Providers

Jurisdiction Matrix

Provider US EP WO (PCT) Other Offices Key Required
EPO OPS ✓ (100M+ docs, all major offices) Yes (free registration)
The Lens ✓ (100+ jurisdictions) Yes (free, request access)
USPTO Yes (free)
SearchAPI.io ✓ (Google Patents via SerpAPI) Yes (SEARCHAPI_API_KEY)

Notes:
- EPO OPS and The Lens cover worldwide jurisdictions; USPTO covers US patents only.
- SearchAPI.io wraps Google Patents via SerpAPI, which is good for quick coverage when you already have a SearchAPI key.
- The Lens uniquely links patents to citing academic papers.
- Without any patent provider configured, patent_search falls back to site-restricted web discovery.
- The patent_office parameter enables intelligent routing: a search restricted to EP automatically skips USPTO.
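
The office-aware routing described in the last note might look roughly like this. The coverage map follows the notes above (EPO OPS and The Lens worldwide, USPTO US-only); treating SearchAPI.io as worldwide is an assumption, as are the names `PATENT_COVERAGE` and `route_patent_providers`.

```python
from typing import Optional

# Offices each provider can serve; "*" marks worldwide coverage.
PATENT_COVERAGE = {
    "epo": {"*"},
    "lens": {"*"},
    "searchapi": {"*"},  # assumption: Google Patents treated as worldwide
    "uspto": {"US"},
}


def route_patent_providers(routing: str,
                           patent_office: Optional[str] = None) -> list[str]:
    """Keep routing order, dropping providers that cannot serve the requested office."""
    providers = [p.strip() for p in routing.split(",") if p.strip()]
    if patent_office is None:
        return providers  # no restriction: every configured provider is eligible
    office = patent_office.upper()
    return [
        p for p in providers
        if "*" in PATENT_COVERAGE.get(p, set()) or office in PATENT_COVERAGE.get(p, set())
    ]
```

With the routing string `epo,lens,searchapi,uspto`, restricting to `EP` drops `uspto` and preserves the order of the rest.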

Patent Routing

export SEARCH_ROUTING='{"patents":"epo,lens,searchapi,uspto","default":"brave,google"}'

Structured-Domain Providers

These providers back dedicated tools and are independent of the web search providers above.

| Tool | Provider | Coverage | Key Required |
|---|---|---|---|
| filing_search | SEC EDGAR | US public-company filings (10-K, 10-Q, 8-K, XBRL company facts) | No (contact email required) |
| legal_search | CourtListener | US federal and state court opinions | No (token raises limit to ~5,000/day) |
| econ_search | World Bank | Global development indicators, 200+ economies | No |
| econ_search | OECD | OECD economy indicators via SDMX | No |
| econ_search | Eurostat | European official statistics | No |
| econ_search | FRED | 800K+ US macro series (GDP, CPI, unemployment, rates) | Yes (free) |
| clinical_search | ClinicalTrials.gov | 400K+ NIH-registered clinical trials | No |
| archive_source | Internet Archive SPN | Save Page Now capture | No (keys raise reliability/limits) |

Notes:
- World Bank, OECD, Eurostat, ClinicalTrials.gov, and CourtListener are always available; no configuration is required.
- SEC EDGAR and FRED activate on their respective env vars (EDGAR_CONTACT_EMAIL / FRED_API_KEY). EDGAR_CONTACT_EMAIL falls back to OPENALEX_EMAIL.
- archive_source is the only write tool in the suite: it triggers a live internet capture, not a cache lookup.
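
The activation rules in these notes reduce to a few environment lookups. The sketch below is illustrative only: the env var names come from the notes, while `edgar_contact_email` and `active_structured_providers` are hypothetical helpers, and the sketch covers only the providers the notes name.

```python
import os
from typing import Mapping, Optional

# Providers the notes describe as always available (no configuration needed).
ALWAYS_ON = ["World Bank", "OECD", "Eurostat", "ClinicalTrials.gov", "CourtListener"]


def edgar_contact_email(env: Mapping[str, str]) -> Optional[str]:
    """EDGAR_CONTACT_EMAIL wins; OPENALEX_EMAIL is the documented fallback."""
    return env.get("EDGAR_CONTACT_EMAIL") or env.get("OPENALEX_EMAIL") or None


def active_structured_providers(env: Mapping[str, str] = os.environ) -> list[str]:
    """List structured-domain providers that would be active for this environment."""
    active = list(ALWAYS_ON)
    if edgar_contact_email(env):
        active.append("SEC EDGAR")
    if env.get("FRED_API_KEY"):
        active.append("FRED")
    return active
```

Passing a plain dict instead of `os.environ` makes it easy to check a planned configuration before deploying it.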


Multi-Provider Routing

See docs/DEPLOYMENT.md for full routing configuration. The short version:

# Priority-ordered fallback — if Brave is down, routes to Google, then Serper
export SEARCH_ROUTING=brave,google,serper

# Per-operation routing
export SEARCH_ROUTING='{"web":"brave,google","news":"brave,serper","images":"google,brave","academic":"openalex,crossref","patents":"epo,lens,searchapi,uspto","default":"brave,google,searchapi"}'
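
The two SEARCH_ROUTING shapes shown above (a plain comma-separated list and a JSON object keyed by operation) can be normalized into one structure. The following is a minimal sketch under stated assumptions: `parse_search_routing` and `providers_for` are hypothetical names, and the rule that an operation without its own entry uses the `default` list is inferred from the key name rather than confirmed by the source.

```python
import json


def _split(csv: str) -> list[str]:
    """Split a comma-separated provider list, dropping blanks."""
    return [p.strip() for p in csv.split(",") if p.strip()]


def parse_search_routing(value: str) -> dict[str, list[str]]:
    """Normalize either routing form into {operation: [providers in priority order]}."""
    value = value.strip()
    if value.startswith("{"):
        # JSON form: one comma-separated provider list per operation.
        return {op: _split(csv) for op, csv in json.loads(value).items()}
    # Plain form: a single priority-ordered list used for everything.
    return {"default": _split(value)}


def providers_for(routing: dict[str, list[str]], operation: str) -> list[str]:
    """Providers for an operation, falling back to 'default' (assumed behavior)."""
    return routing.get(operation) or routing.get("default", [])
```

With the per-operation example above, `providers_for(routing, "news")` would return `["brave", "serper"]`, while an operation with no entry of its own would get the `default` list.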

Providers with repeated failures are automatically circuit-broken and skipped until they recover.
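
For readers unfamiliar with the pattern, a circuit breaker of this kind can be sketched in a few lines. The source does not state the failure threshold, the cooldown, or how recovery is detected, so every number and name below (`CircuitBreaker`, `failure_threshold`, `cooldown_seconds`) is an assumption chosen for illustration.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Skip a provider after repeated failures until a cooldown has elapsed."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # assumed value
        self.cooldown_seconds = cooldown_seconds    # assumed value
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if the provider may be tried right now."""
        if self._opened_at is None:
            return True
        # Once the cooldown has elapsed, requests are allowed again;
        # a further failure re-opens the breaker immediately.
        return self._clock() - self._opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A router would keep one breaker per provider, call `allow()` before each attempt, and move to the next provider in the routing list whenever it returns `False`.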