Provider Reference
This document covers every search and data provider supported by web-researcher-mcp: what backs each one, which tools it enables, free-tier limits, and when to choose it. For setup instructions (API keys, env vars, routing config) see API_SETUP.md.
Contents
- Web Search Providers
  - Index Classification
  - Capability Matrix
  - Free Tier and Pricing
  - Quick-Pick Guide
  - Provider Notes
- Academic Search Providers
  - Capability Matrix
  - Coverage
  - Academic Routing
- Patent Search Providers
  - Jurisdiction Matrix
  - Patent Routing
- Structured-Domain Providers
- Multi-Provider Routing
Web Search Providers
Index Classification
Understanding what backs each provider helps you reason about result overlap and independence.
| Provider | Index Type | What backs the results |
|---|---|---|
| DuckDuckGo | Bing-sourced | Microsoft Bing (+ Bing-syndicated sources) |
| Google PSE | Own index | Google's web index |
| Serper | Google-backed | Google's web index via API |
| SearchAPI.io | Google-backed | Google's web index via API |
| Brave | Own independent | Brave's own web crawler and index |
| Exa | Own independent | Neural/embedding-based web index |
| Tavily | Aggregator | Queries multiple existing engines at runtime; scrapes top results; applies AI re-ranking — no proprietary crawled index |
| SearXNG | Meta-search | Configurable — routes to whatever backends you point it at (Bing, Google, DuckDuckGo, and others) |
| HackerNews | Niche | HN Algolia index — Hacker News stories and submissions only |
Practical implication: Google PSE, Serper, and SearchAPI.io draw from the same index — using more than one adds no coverage, only redundancy. Brave and Exa bring genuinely independent results. Tavily and SearXNG aggregate results from others rather than crawling themselves.
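The overlap rule above can be checked mechanically. The following is a minimal sketch, not the server's own logic: it transcribes the "Index Type" column into a lookup and flags routing entries whose index is already covered earlier in the list. The provider ids `duckduckgo`, `exa`, `tavily`, `searxng`, and `hackernews` are assumed spellings; only `brave`, `google`, `serper`, and `searchapi` appear in this document's routing examples.

```python
# Index backing per provider, transcribed from the classification table.
# The key spellings are assumptions except where they appear in SEARCH_ROUTING examples.
INDEX_BACKING = {
    "duckduckgo": "bing",
    "google": "google",
    "serper": "google",
    "searchapi": "google",
    "brave": "brave",
    "exa": "exa",
    "tavily": "aggregator",
    "searxng": "meta-search",
    "hackernews": "hn-algolia",
}


def redundant_providers(routing):
    """Return providers whose backing index already appears earlier in the list."""
    seen = set()
    redundant = []
    for name in routing:
        backing = INDEX_BACKING[name]
        if backing in seen:
            redundant.append(name)
        seen.add(backing)
    return redundant


print(redundant_providers(["brave", "google", "serper"]))  # ['serper']
```

A redundant entry still has value as a failover target; it just adds no new coverage.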
Capability Matrix
The tools each web search provider enables. A "—" cell means the provider returns an empty result (no error) for that capability; image-capable providers in SEARCH_ROUTING handle the fallback automatically.
| Provider | web_search | image_search | news_search | answer | structured_search | local_search | Scrape fallback tier |
|---|---|---|---|---|---|---|---|
| DuckDuckGo | ✓ | ✓ | ✓ | — | — | — | — |
| Google PSE | ✓ | ✓ | ✓ | — | — | — | — |
| Serper | ✓ | ✓ | ✓ | — | — | — | — |
| SearchAPI.io | ✓ | ✓ | ✓ | — | — | — | — |
| Brave | ✓ | ✓ | ✓ | — | — | ✓ | — |
| Exa | ✓ | — | ✓ | ✓ | ✓ | — | ✓ (paid, last-resort) |
| Tavily | ✓ | — | ✓ | — | — | — | — |
| SearXNG | ✓ | ✓ | ✓ | — | — | — | — |
| HackerNews | ✓ | — | ✓ | — | — | — | — |
Notes:
- answer and structured_search are provider-independent tools, but Exa is the only web provider that backs them with its native API. They remain unavailable if no Exa key is set.
- local_search is Brave-only — it requires BRAVE_API_KEY. No other web provider supports the three-call local pipeline (locations → POIs → descriptions).
- Brave also exposes an LLM context endpoint (/res/v1/llm/context) that search_and_scrape uses as a fast path for RAG/grounding workflows. When Brave is the active provider, search_and_scrape tries the server-assembled context first; if that fails, it falls back to the standard search-then-scrape pipeline. Requires BRAVE_DATA_FOR_AI plan access.
- Exa's scrape fallback tier (/contents) fires only when all four free tiers (markdown → stealth → HTML → browser) have failed. It charges an Exa credit per call.
- Tavily's time-range filter is aggressive on web search — for recent content, news_search works better; web_search may return nothing for narrow windows.
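To see how the matrix plays out for a given routing list, the sketch below finds the first provider that supports a tool. It is a static lookup for illustration only. According to the text above, the actual server gets an empty result from an incapable provider and then falls through, so the mechanism differs even though the outcome should match. Provider ids other than `brave`, `google`, `serper`, and `searchapi` are assumed spellings.

```python
# Capabilities transcribed from the matrix (scrape fallback tier omitted).
BASIC = {"web_search", "image_search", "news_search"}
CAPABILITIES = {
    "duckduckgo": set(BASIC),
    "google": set(BASIC),
    "serper": set(BASIC),
    "searchapi": set(BASIC),
    "brave": BASIC | {"local_search"},
    "exa": {"web_search", "news_search", "answer", "structured_search"},
    "tavily": {"web_search", "news_search"},
    "searxng": set(BASIC),
    "hackernews": {"web_search", "news_search"},
}


def first_capable(routing, tool):
    """Return the first provider in routing order that supports the tool, or None."""
    for name in routing:
        if tool in CAPABILITIES.get(name, set()):
            return name
    return None


print(first_capable(["tavily", "brave"], "image_search"))  # brave
print(first_capable(["exa", "tavily"], "image_search"))    # None
```

A `None` result means the routing list has no provider for that tool, which is worth checking before relying on image_search or local_search.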
Free Tier and Pricing
| Provider | Free Tier | Paid |
|---|---|---|
| DuckDuckGo | Unlimited | Free |
| HackerNews | Unlimited | Free |
| SearXNG | Unlimited (self-hosted) | Free (self-hosted) |
| Google PSE | 100 queries/day | $5 / 1,000 queries |
| Brave | 2,000 queries/month | Paid plans |
| Serper | 2,500 queries (one-time) | Paid plans |
| SearchAPI.io | 100 searches/month | Paid plans |
| Exa | 1,000 requests/month | Per call beyond free tier |
| Tavily | Monthly dev credits | Paid plans |
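As a worked example of the Google PSE row, the sketch below estimates a monthly bill from a steady daily query volume. It assumes the 100 free queries reset each day without rolling over and that overage is billed linearly at $5 per 1,000 queries; check the provider's current pricing before relying on these numbers.

```python
def google_pse_monthly_cost(queries_per_day, days=30):
    """Estimate monthly overage cost in USD for a steady daily query volume."""
    free_per_day = 100   # free-tier quota from the table above
    usd_per_1000 = 5.0   # paid rate from the table above
    billable = max(0, queries_per_day - free_per_day) * days
    return billable * usd_per_1000 / 1000


print(google_pse_monthly_cost(500))  # 60.0  (400 billable/day * 30 days = 12,000 queries)
print(google_pse_monthly_cost(80))   # 0.0   (stays within the free tier)
```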
Quick-Pick Guide
| If you need… | Use |
|---|---|
| Zero-config, no signup | DuckDuckGo (built-in fallback) or HackerNews (HN-only) |
| Broadest index coverage | Google PSE |
| High-volume + own index | Brave (2,000/month free, privacy-first) |
| Independent results alongside Google | Brave or Exa (different indices, no overlap) |
| Semantic / conceptual search | Exa |
| LLM-ready extracted content | Tavily |
| answer or structured_search tools | Exa (required) |
| Air-gapped or no vendor lock-in | SearXNG (self-hosted) |
| Tech/developer community signal | HackerNews |
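| Maximum reliability | SEARCH_ROUTING=brave,google,serper (three separately operated providers; Google PSE and Serper share Google's index, so the third entry adds failover rather than coverage) |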
Provider Notes
DuckDuckGo — The zero-config default. No API key, no registration, no rate limit to configure. Result depth is lower than keyed providers; image and news results are present but less comprehensive. Use as a fallback, not a primary.
Google PSE — The largest index. Best for broadest coverage, image search, and exact-phrase queries. Requires both an API key (via Google Cloud Console) and a Programmable Search Engine ID. Free tier of 100/day is low for sustained use.
Serper and SearchAPI.io — Google results without the PSE setup overhead. Serper is the simpler option; SearchAPI.io supports multiple engine backends beyond Google. Both draw from Google — no coverage difference between them or vs. Google PSE.
Brave — Own crawler, own index, privacy-first. Best all-purpose choice when you want index independence from Google/Bing and a generous free tier. Supports web, image, news, and Goggles-based custom result weighting. It is also the only provider that exposes local/map results via local_search, and it offers an LLM context endpoint that search_and_scrape uses for faster grounding when you are on Brave's Data for AI plan.
Exa — Neural/semantic index. Results are ranked by embedding similarity, not just keyword match — better for conceptual or research queries. The only provider that backs answer (grounded synthesis with citations) and structured_search (schema-defined entity extraction). Also provides a paid /contents scrape tier as a last-resort fallback for scrape_page. Most expensive per-call but uniquely capable.
Tavily — Aggregates from multiple existing search engines at query time, then scrapes the top results and applies AI re-ranking. No proprietary index — similar in architecture to SearXNG, but hosted/commercial with an AI synthesis layer. Returns pre-extracted LLM-ready content. Closest comparison: SearXNG (open-source, self-hosted, no synthesis layer) or Exa (own index, deeper semantic capabilities). Best used as a routing member rather than the sole provider since it lacks image search.
SearXNG — Open-source, self-hosted, routes to configurable backends. Best for air-gapped environments, organizations requiring no external vendor dependency, or privacy-first deployments. Requires hosting and setup but carries no query limits or API costs.
HackerNews — Searches HN stories and submissions via the public HN Algolia API. No key or registration. Not general web — use only when you specifically want HN community signal, tech discussions, or submission history. scrape_page on any HN URL (item, user, list) reads natively through the HN Firebase API regardless of which SEARCH_PROVIDER is set.
Academic Search Providers
Capability Matrix
| Provider | Search | DOI Resolution | Citation Graph | OA PDF enrichment | AI summaries | Key Required |
|---|---|---|---|---|---|---|
| OpenAlex | ✓ | ✓ | ✓ | via Unpaywall | — | No (email for polite pool) |
| CrossRef | ✓ | ✓ (authoritative) | — | — | — | No (email for polite pool) |
| Semantic Scholar | ✓ | — | ✓ (rich edges) | — | ✓ (tldr) | No (key raises limits) |
| PubMed | ✓ | — | — | — | — | No (key raises limits) |
| Exa | ✓ | — | — | — | — | Yes (EXA_API_KEY) |
Notes:
- CrossRef is the official DOI registration agency — the authoritative source for DOI metadata. Every DOI-registered work appears here.
- Semantic Scholar enriches results with AI-generated tldr summaries and citation intent/influence edges, which power citation_graph. OpenAlex also implements citation_graph support with citation-count edges as a fallback.
- Only OpenAlex implements the DOIResolver interface (exact-entity lookup via /works/doi:{doi}). CrossRef, Semantic Scholar, and PubMed do not.
- Exa routes academic queries using its research-paper category — useful when its neural index surfaces papers the bibliographic databases miss.
- Unpaywall OA enrichment runs as a post-processing step on any DOI-bearing result — not a separate provider to select.
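The exact-entity lookup mentioned in the notes can be illustrated by building the request URL. This is a sketch under assumptions: the path `/works/doi:{doi}` comes from the note above, while the base URL `https://api.openalex.org`, the `mailto` query parameter for the polite pool, and the helper names are illustrative rather than taken from this project's code.

```python
from urllib.parse import quote, urlencode

OPENALEX_BASE = "https://api.openalex.org"  # assumed base URL


def normalize_doi(raw):
    """Strip common DOI prefixes and lowercase (DOIs are case-insensitive)."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.lower()


def openalex_doi_url(raw_doi, email=None):
    """Build the /works/doi:{doi} lookup URL, optionally with a contact email."""
    url = f"{OPENALEX_BASE}/works/doi:{quote(normalize_doi(raw_doi), safe='/')}"
    if email:
        url += "?" + urlencode({"mailto": email})
    return url


print(openalex_doi_url("https://doi.org/10.7717/PeerJ.4375"))
# https://api.openalex.org/works/doi:10.7717/peerj.4375
```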
Coverage
| Provider | Corpus | Focus |
|---|---|---|
| OpenAlex | 287M+ works | All academic disciplines; CC0 data |
| CrossRef | 140M+ DOI-registered works | Peer-reviewed literature; authoritative DOI metadata |
| Semantic Scholar | 200M+ papers | Broad; strong on CS, medicine, biology |
| PubMed | 35M+ citations | Biomedical and life science only |
| Exa | Neural web index | Research-paper category; surfaces papers outside bibliographic DBs |
Academic Routing
Without explicit routing, all configured academic providers are tried in order. The recommended starting config:
```bash
export SEARCH_ROUTING='{"academic":"openalex,crossref,semanticscholar","default":"brave,google"}'
```
If no academic providers are configured, academic_search automatically falls back to site-restricted web search.
Patent Search Providers
Jurisdiction Matrix
| Provider | US | EP | WO (PCT) | Other Offices | Key Required |
|---|---|---|---|---|---|
| EPO OPS | ✓ | ✓ | ✓ | ✓ (100M+ docs, all major offices) | Yes (free registration) |
| The Lens | ✓ | ✓ | ✓ | ✓ (100+ jurisdictions) | Yes (free, request access) |
| USPTO | ✓ | — | — | — | Yes (free) |
| SearchAPI.io | ✓ | ✓ | ✓ | ✓ (Google Patents via SerpAPI) | Yes (SEARCHAPI_API_KEY) |
Notes:
- EPO OPS and The Lens cover worldwide jurisdictions; USPTO covers US patents only.
- SearchAPI.io wraps Google Patents via SerpAPI — good for quick coverage when you already have a SearchAPI key.
- The Lens uniquely links patents to citing academic papers.
- Without any patent provider configured, patent_search falls back to site-restricted web discovery.
- The patent_office parameter enables intelligent routing — a search restricted to EP automatically skips USPTO.
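The jurisdiction-aware skipping described in the last note can be sketched as a filter over the routing chain. The coverage sets are transcribed from the matrix above; the function name and the handling of offices outside US/EP/WO (treated as "Other Offices") are assumptions, not the project's actual implementation.

```python
# Jurisdiction coverage transcribed from the matrix; "OTHER" stands for the
# "Other Offices" column.
COVERAGE = {
    "epo": {"US", "EP", "WO", "OTHER"},
    "lens": {"US", "EP", "WO", "OTHER"},
    "uspto": {"US"},
    "searchapi": {"US", "EP", "WO", "OTHER"},
}


def providers_for_office(routing, patent_office=None):
    """Keep only providers that cover the requested office, preserving order."""
    if patent_office is None:
        return list(routing)
    office = patent_office.upper()
    key = office if office in {"US", "EP", "WO"} else "OTHER"
    return [name for name in routing if key in COVERAGE.get(name, set())]


chain = ["epo", "lens", "searchapi", "uspto"]
print(providers_for_office(chain, "EP"))  # ['epo', 'lens', 'searchapi']
print(providers_for_office(chain, "US"))  # ['epo', 'lens', 'searchapi', 'uspto']
```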
Patent Routing
```bash
export SEARCH_ROUTING='{"patents":"epo,lens,searchapi,uspto","default":"brave,google"}'
```
Structured-Domain Providers
These providers back dedicated tools and are independent of the web search providers above.
| Tool | Provider | Coverage | Key Required |
|---|---|---|---|
| filing_search | SEC EDGAR | US public-company filings (10-K, 10-Q, 8-K, XBRL company facts) | No (contact email required) |
| legal_search | CourtListener | US federal and state court opinions | No (token raises limit to ~5,000/day) |
| econ_search | World Bank | Global development indicators, 200+ economies | No |
| econ_search | OECD | OECD economy indicators via SDMX | No |
| econ_search | Eurostat | European official statistics | No |
| econ_search | FRED | 800K+ US macro series (GDP, CPI, unemployment, rates) | Yes (free) |
| clinical_search | ClinicalTrials.gov | 400K+ NIH-registered clinical trials | No |
| archive_source | Internet Archive SPN | Save Page Now capture | No (keys raise reliability/limits) |
Notes:
- World Bank, OECD, Eurostat, ClinicalTrials.gov, and CourtListener are always available — no configuration required.
- SEC EDGAR and FRED activate on their respective env vars (EDGAR_CONTACT_EMAIL / FRED_API_KEY). EDGAR_CONTACT_EMAIL falls back to OPENALEX_EMAIL.
- archive_source is the only write tool in the suite — it triggers a live internet capture, not a cache lookup.
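The activation rules in these notes can be summarized in a few lines. The sketch below is illustrative only: the env var names come from the notes above, while the function names and the use of display names as identifiers are assumptions.

```python
def edgar_contact_email(env):
    """EDGAR_CONTACT_EMAIL wins; OPENALEX_EMAIL is the documented fallback."""
    return env.get("EDGAR_CONTACT_EMAIL") or env.get("OPENALEX_EMAIL")


def available_backends(env):
    """List structured-domain backends that would be active for this environment."""
    # Always available, per the notes: no configuration required.
    backends = ["World Bank", "OECD", "Eurostat", "ClinicalTrials.gov", "CourtListener"]
    if edgar_contact_email(env):
        backends.append("SEC EDGAR")
    if env.get("FRED_API_KEY"):
        backends.append("FRED")
    return backends


print(available_backends({"OPENALEX_EMAIL": "me@example.org"}))
# ['World Bank', 'OECD', 'Eurostat', 'ClinicalTrials.gov', 'CourtListener', 'SEC EDGAR']
```

Passing `dict(os.environ)` instead of a literal dict would check the current shell environment.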
Multi-Provider Routing
See docs/DEPLOYMENT.md for full routing configuration. The short version:
```bash
# Priority-ordered fallback: if Brave is down, routes to Google, then Serper
export SEARCH_ROUTING=brave,google,serper

# Per-operation routing
export SEARCH_ROUTING='{"web":"brave,google","news":"brave,serper","images":"google,brave","academic":"openalex,crossref","patents":"epo,lens,searchapi,uspto","default":"brave,google,searchapi"}'
```
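Both forms of SEARCH_ROUTING shown here can be read by one small parser. This is a minimal sketch, not the server's parser: it assumes a plain comma-separated value applies to every operation (stored under `default`) and that an operation missing from the JSON form falls back to the `default` chain.

```python
import json


def parse_search_routing(value):
    """Parse SEARCH_ROUTING into {operation: [providers]}."""
    value = value.strip()
    if value.startswith("{"):
        raw = json.loads(value)
        return {
            op: [p.strip() for p in chain.split(",") if p.strip()]
            for op, chain in raw.items()
        }
    # Plain list: assumed to act as the default chain for all operations.
    return {"default": [p.strip() for p in value.split(",") if p.strip()]}


def chain_for(routing, operation):
    """Return the provider chain for an operation, falling back to 'default'."""
    return routing.get(operation, routing.get("default", []))


routing = parse_search_routing('{"news":"brave,serper","default":"brave,google"}')
print(chain_for(routing, "news"))    # ['brave', 'serper']
print(chain_for(routing, "images"))  # ['brave', 'google']
```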
Providers with repeated failures are automatically circuit-broken and skipped until they recover.
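The circuit-breaking behavior can be pictured with a small per-provider breaker. The document does not specify thresholds or recovery rules, so the failure threshold, cooldown, and reset-on-retry behavior below are assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Skip a provider after repeated failures until a cooldown has elapsed."""

    def __init__(self, failure_threshold=3, cooldown_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.cooldown_seconds = cooldown_seconds    # assumed value
        self._clock = clock
        self._failures = {}
        self._opened_at = {}

    def allow(self, provider):
        """Return True if the provider may be tried right now."""
        opened = self._opened_at.get(provider)
        if opened is None:
            return True
        if self._clock() - opened >= self.cooldown_seconds:
            # Cooldown elapsed: close the circuit and start counting afresh.
            del self._opened_at[provider]
            self._failures[provider] = 0
            return True
        return False

    def record_failure(self, provider):
        self._failures[provider] = self._failures.get(provider, 0) + 1
        if self._failures[provider] >= self.failure_threshold:
            self._opened_at[provider] = self._clock()

    def record_success(self, provider):
        self._failures.pop(provider, None)
        self._opened_at.pop(provider, None)
```

A router would call `allow()` before each attempt and move on to the next provider in the chain whenever it returns `False`.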