GEO
January 10, 2025

How Does Claude/Anthropic Crawl Your Website?

Learn how Claude crawls, indexes, and cites your site—and configure robots.txt, sitemaps, and structure to control access and visibility.

Learn exactly how Claude crawls, indexes, and cites your site so you can earn attribution without losing control. The short version: Claude leans on the Brave Search index for discovery. It uses distinct Anthropic user agents for different purposes and honors robots.txt—so your configuration drives visibility, load, and training access.

Quick answer: How Claude discovers, crawls, and cites your content

Use this section to get discoverable in Claude quickly while staying compliant and in control. What matters most is allowing Claude-SearchBot for search visibility, deciding your training policy for ClaudeBot via robots.txt, and ensuring Brave can index your pages.

  • Discovery: Claude relies heavily on the Brave Search index, which finds your pages via standard crawling (e.g., BraveBot), links, and sitemaps.
  • Bots: Claude-SearchBot supports search visibility; ClaudeBot gathers data for model training; Claude-User fetches pages on demand for user queries.
  • Control: Use robots.txt to allow Claude-SearchBot, optionally disallow ClaudeBot, and set crawl-delay if needed; keep JS/CSS unblocked.
  • Citations: Claude prefers fresh, answerable, well-structured pages and will cite sources when web results are used.

Quick checklist

  • Allow Claude-SearchBot; decide whether to disallow ClaudeBot (training).
  • Confirm Brave can index: public access, 200 status, sitemaps, internal links.
  • Add concise answers, FAQs, and relevant schema (FAQ, HowTo, Product).
  • Monitor logs for Anthropic and Brave user agents; validate directives on each subdomain.

How Claude’s discovery pipeline works (bots, Brave index, and citations)

Map your controls to Claude’s discovery and citation steps so you can optimize the right levers. The headline: Brave inclusion plus correct robots.txt for Anthropic bots determines what gets found and cited.

Claude typically discovers content through Brave Search’s index. Brave builds this from its own crawling and external signals.

When a user asks a question, Claude may search and retrieve candidates. It then synthesizes an answer with citations to high-quality, fresh sources.

Your job is to be included in Brave and to format content so Claude can confidently select and cite it.

  • Brave inclusion depends on: crawlability, links, and sitemaps.
  • Claude selection depends on: answerability, clarity, and recency.

Validate by checking Brave’s results for key queries (including site: searches). Also watch server logs for bot activity.

If inclusion exists but citations lag, shift focus to answer-first formatting and freshness.

Takeaway: discovery flows through Brave; selection flows through clarity and freshness—optimize both.

Claude’s user agents and purposes: ClaudeBot vs Claude-SearchBot vs Claude-User

Configure the right access without breaking visibility or privacy. The key distinction: search visibility is handled by Claude-SearchBot, while ClaudeBot relates to training.

  • Claude-SearchBot: Supports web search visibility and result curation; respects robots.txt.
  • ClaudeBot: Used for model training/data collection; respects robots.txt; many sites opt out here.
  • Claude-User: Fetches pages at query time to assist a user session; respects robots.txt and site restrictions.

Action steps: explicitly allow Claude-SearchBot and decide your policy for ClaudeBot. Verify behavior with curl and logs.

Result: you enable AI search SEO while honoring privacy and training preferences.

When Claude searches the web vs uses internal knowledge

Set expectations for when citations appear so you can troubleshoot gaps effectively. The headline: Claude cites when it actively uses web results that add value, especially for timely or specific facts.

Claude may answer from internal knowledge for generic or well-established topics, often without citations. It triggers web search and retrieval when the query expects up-to-date info, specific figures, or niche sources.

Encourage citations by keeping pages current, clearly scoped, and rich with definitive answers or data.
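
One simple way to make that currency visible to both readers and crawlers is a dated "last updated" line on the page itself (the date below is illustrative and should track real content changes):

<p>Last updated: <time datetime="2025-01-10">January 10, 2025</time></p>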

Practical tip: add answer-first summaries and update timestamps to nudge retrieval. The takeaway: the more your page provides unique, fresh, and specific answers, the more likely Claude will search and cite it.

Robots.txt controls: How to allow search visibility and manage training access

Implement selective allow/deny so you get indexed for AI search without enabling model training. The critical idea: treat each Anthropic user agent explicitly and manage policies per subdomain.

At minimum, publish a clear robots.txt that distinguishes Claude-SearchBot and ClaudeBot. For multi-subdomain sites, place a robots.txt at each host (e.g., www, blog, docs) because robots directives are host-scoped.

Example (site-wide allow for search visibility, opt out of training):

User-agent: Claude-SearchBot
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Allow: /

# Optional throttle where supported
User-agent: Claude-SearchBot
Crawl-delay: 10

Verification: fetch robots with each UA and inspect responses in logs. Outcome: search visibility preserved while training is controlled.

Checklist

  • Separate policies for Claude-SearchBot and ClaudeBot.
  • Publish per-host robots.txt (www, app, docs, etc.).
  • Keep JS/CSS/image assets crawlable if needed for rendering.
  • Test with curl and confirm in logs.

Selective allow/deny patterns (site-wide and per subdomain)

Choose policies that reflect business and compliance needs while preserving public discovery. The priority is to protect staging or private hosts and keep public content accessible.

Common patterns:

  • Public marketing site: allow Claude-SearchBot; consider disallowing ClaudeBot; allow Claude-User.
  • Staging or internal: require auth; add Disallow: / for all bots; avoid relying on robots alone for sensitive content.
  • Docs subdomain: allow search visibility; disallow training if needed; ensure sitemaps are discoverable.

Per-subdomain example:

# https://www.example.com/robots.txt
User-agent: Claude-SearchBot
Allow: /
User-agent: ClaudeBot
Disallow: /

# https://docs.example.com/robots.txt
User-agent: Claude-SearchBot
Allow: /
Sitemap: https://docs.example.com/sitemap.xml

# https://staging.example.com/robots.txt
User-agent: *
Disallow: /

Takeaway: be explicit per host, and prefer authentication over robots for any private environment.

Crawl-delay and server-load considerations

Protect site performance during crawl spikes without sacrificing discoverability. The key: crawl-delay isn’t universal, so back it up with server/CDN controls.

Add crawl-delay directives for Claude-SearchBot if you observe load issues. Also lean on standard safeguards such as HTTP caching, CDN, and compression.

Balance robots with sitemaps so crawlers focus on fresh URLs instead of recrawling everything.

Operational tip: monitor 5xx rates and edge latency during crawl windows. Adjust delay or WAF rate limits if needed.
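
If you want an edge-level backstop for crawler bursts, here is a minimal sketch for nginx (the zone name and rate are illustrative; the map and limit_req_zone directives belong in the http block, and requests with an empty key are not rate-limited, so regular visitors are unaffected):

# Throttle known crawler user agents per client IP; everyone else passes through.
map $http_user_agent $crawler_key {
    default                                                ""; 
    "~*(claudebot|claude-searchbot|claude-user|bravebot)"  $binary_remote_addr;
}

limit_req_zone $crawler_key zone=crawlers:10m rate=1r/s;

server {
    location / {
        limit_req zone=crawlers burst=10;
        # ...rest of your normal site configuration
    }
}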

Bottom line: keep your site fast for users and predictable for bots.

Brave Search indexing: prerequisites, checks, and practical signals

Ensure Brave can find and index your pages so Claude can surface them. The bottom line: if Brave can’t index you, Claude is unlikely to show or cite you.

Success in Brave starts with crawlable, linked, and consistently available pages. Include sitemaps, clean internal linking, and avoid soft 404s or blocked assets that hinder rendering.

Validate discovery by looking for BraveBot in logs and by running site: queries on Brave Search.

If you’re absent, prioritize crawlability fixes, sitemap hygiene, and link discovery from reputable sources. The outcome: increased Brave coverage and improved Claude visibility.

Checklist

  • Verify 200 status and cache headers on key URLs (see the spot check after this list).
  • Submit sitemaps via robots.txt and ensure freshness.
  • Check site:example.com on search.brave.com.
  • Look for BraveBot and Anthropic UAs in logs.
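
To spot-check the status and cache-header item above from the command line (the URL is a placeholder):

curl -sI https://www.example.com/key-page | grep -Ei '^(HTTP|cache-control|etag|last-modified)'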

How to infer Brave inclusion and diagnose discovery

Confirm Brave presence even without a submission console so you can act quickly. The crucial signals are site: queries and crawler footprints in your logs.

  • Run: search.brave.com with queries like site:example.com and targeted keywords.
  • Inspect server logs for BraveBot hits and successful 200 responses on key pages.
  • Ensure robots.txt is accessible and that sitemaps are referenced:

Sitemap: https://www.example.com/sitemap.xml

If visibility is thin, check internal links and follow the next steps to improve discovery. Takeaway: straightforward checks rule out basic blockages fast.

What to do if you’re not appearing in Brave

Resolve crawl and quality blockers systematically to regain visibility. The main lever is making discovery easy and content obviously useful.

  • Fix robots.txt conflicts and unblock essential assets.
  • Strengthen internal links to important pages; add them to sitemaps with accurate lastmod.
  • Acquire reputable links; publish answer-first content that meets searcher intent.
  • Eliminate duplicate/thin pages; add canonical tags.

Result: stronger signals that help Brave index you—and help Claude find and cite you.

Sitemaps, internal linking, and indexability hygiene

Guide crawlers efficiently to your best content to improve freshness and coverage. The headline: accurate lastmod and sensible sitemap structure boost crawl responsiveness.

Publish a primary XML sitemap index that points to segmented sitemaps (e.g., blog, docs, products). Keep each under ~50k URLs and ~50MB uncompressed.
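
A minimal sketch of such an index (file names and dates are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
    <lastmod>2025-01-08</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-docs.xml</loc>
    <lastmod>2025-01-05</lastmod>
  </sitemap>
</sitemapindex>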

Keep internal linking shallow for priority pages. Ensure every important URL is reachable in three clicks or fewer.

Include media sitemaps for images/videos where relevant. Ensure canonical URLs match sitemap entries.

The takeaway: clean, current sitemaps amplify crawl budget and recrawl cadence.

Checklist

  • Sitemap index + segmented sitemaps with correct lastmod.
  • All priority URLs linked internally and in sitemaps.
  • Canonicals align with sitemap URLs.
  • Robots allows access to sitemaps and assets.

Lastmod discipline and large-site sitemap strategy

Make lastmod meaningful so crawlers know what truly changed. Update lastmod only when content materially changes—not on every deploy.

For large sites, split sitemaps by section or date (e.g., /sitemap-blog-2025-11.xml) to spotlight fresh clusters. Automate lastmod from content modification timestamps and keep stable URLs.
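
Inside each segment, lastmod should reflect the page's real modification time, for example (URL and date are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/blog/claude-crawling-guide</loc>
    <lastmod>2025-01-10</lastmod>
  </url>
</urlset>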

Verification: spot-check sitemap URLs to ensure lastmod recency matches actual changes. Outcome: faster inclusion and recrawl of what matters.

Canonical, noindex, and hreflang considerations

Eliminate duplication and language confusion that suppress citations. The priority is correct canonical consolidation and accurate hreflang across variants.

  • Use rel=canonical to consolidate duplicates and parameterized URLs.
  • Apply meta robots noindex to truly disposable pages, not valuable content.
  • Implement hreflang across language/region variants and ensure self-referencing canonicals on each version (see the sketch after this list).
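
A minimal head-markup sketch for an English/German pair (URLs are illustrative; each variant self-references its own canonical and lists all alternates, including x-default):

<!-- On https://www.example.com/en/pricing -->
<link rel="canonical" href="https://www.example.com/en/pricing" />
<link rel="alternate" hreflang="en" href="https://www.example.com/en/pricing" />
<link rel="alternate" hreflang="de" href="https://www.example.com/de/preise" />
<link rel="alternate" hreflang="x-default" href="https://www.example.com/en/pricing" />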

Result: the right version is indexed and cited, reducing dilution across similar pages.

Rendering and technical health: JS, assets, performance, and status codes

Make sure crawlers can fetch and understand the same content users see. The critical principle: if rendering breaks (blocked JS/CSS or heavy client-side rendering), indexing and citations suffer.

Prefer server-side rendering (with hydration where needed) for critical content. Alternatively, provide a pre-rendered snapshot for bots.

Keep JS, CSS, image, and font directories crawlable when they’re necessary for rendering and layout.

Monitor performance. Aim for fast TTFB, consistent 200s, and stable URLs.
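
A lightweight way to spot-check status and TTFB from the command line (the URL is a placeholder):

curl -o /dev/null -s -w 'status=%{http_code} ttfb=%{time_starttransfer}s total=%{time_total}s\n' https://www.example.com/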

Takeaway: technical stability is table stakes for Claude indexing and Brave inclusion.

Checklist

  • SSR or pre-render key templates; avoid render-blocking errors.
  • Do not disallow critical /js, /css, /images in robots.txt.
  • Keep 4xx/5xx low; fix redirect chains.
  • Use strong caching (ETag/Last-Modified) for efficient recrawl.

Diagnosing JS-rendered content and blocked assets

Verify that crawlers can access critical content without executing JS. The key step is to fetch HTML as text and check for content presence before rendering.

  • Use curl to fetch HTML and ensure core content is present:

curl -s https://www.example.com/page | head -n 50

  • Ensure robots.txt doesn’t block /static/, /assets/, /js/, /css/ paths needed for rendering.
  • If content is JS-only, enable SSR, pre-render, or dynamic rendering for bots.

Outcome: crawlers see the same meaningful content users do.

Error handling (4xx/5xx), caching, and change detection

Stabilize crawl paths and make changes easy to detect so recrawl stays efficient. The most impactful wins are eliminating soft 404s and using standard validators.

  • Return truthful status codes (200/404/410/301/302).
  • Add ETag or Last-Modified headers to key pages and static assets (a quick revalidation check follows this list).
  • Keep CDN/cache rules consistent; avoid frequent cache-busting that changes URLs.
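
To confirm validators are working, replay a request with the returned ETag; a 304 response means recrawls can revalidate cheaply (the URL and ETag value are placeholders):

curl -sI https://www.example.com/ | grep -i etag
curl -sI -H 'If-None-Match: "<etag-from-first-request>"' https://www.example.com/
# Expect a 304 Not Modified if the server honors conditional requests.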

Takeaway: predictable status and caching patterns improve crawl efficiency and recrawl cadence.

Content formatting that earns Claude citations

Structure your pages so Claude can confidently quote and link them. The big idea: answer-first, scannable formats with schema win for both AI and traditional SEO.

Open with a 1–2 sentence summary that directly answers the query. Then support with concise bullets or short paragraphs.

Use data points, version numbers, steps, and definitions that are easy to extract and cite.

Include FAQs, specs, and “how-to” blocks to match intent patterns. Takeaway: make it trivially easy for Claude to lift the right facts with attribution.

Checklist

  • Lead with a crisp answer or definition.
  • Add an FAQ section with 2–3 line answers.
  • Include concrete data points and examples.
  • Use schema for eligible content types.

Answer-first structure, FAQs, and data-rich blocks

Match how users (and AI) scan so your answers are unmissable. The priority is to deliver the answer upfront, then elaborate.

Create “TL;DR” intros, numbered steps, and compact lists for procedures. Embed supporting data (dates, thresholds, code snippets) so Claude can cite specifics.

Verification: ask SMEs if the first 2–3 sentences stand alone as the answer. Result: higher likelihood of selection and citation.

Schema that helps (FAQ, Product, HowTo) and when to use it

Use schema to clarify meaning and structure for machines. The best fits for AI search SEO include FAQPage for Q&A content, HowTo for step-by-steps, and Product for specs and offers.

Add JSON-LD with accurate, non-spammy fields. Keep structured data aligned with visible content.
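
A minimal FAQPage sketch in JSON-LD (the question and answer are placeholders; keep them identical to the text visible on the page):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "How do I block ClaudeBot but allow Claude-SearchBot?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "Add separate robots.txt groups: allow Claude-SearchBot and disallow ClaudeBot."
    }
  }]
}
</script>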

Revalidate whenever you update pages that drive AI citations. Takeaway: semantic clarity reduces ambiguity and boosts extractability.

Verification workflow: prove crawling, indexing, and citation

Run this end-to-end process to confirm your setup works as intended. The essentials: verify robots behavior, observe bot hits, confirm Brave inclusion, and watch for citations.

1) Robots.txt and access tests

  • Fetch robots per host:

curl -I https://www.example.com/robots.txt
curl -A "Claude-SearchBot" -I https://www.example.com/
curl -A "ClaudeBot" -I https://www.example.com/

  • Confirm directives match your policy and pages return 200 for allowed bots.

2) Server log monitoring for Anthropic visits

  • Grep for Anthropic UAs:

grep -E 'ClaudeBot|Claude-SearchBot|Claude-User' /var/log/nginx/access.log

  • Check response codes, crawl cadence, and robots fetches.

3) Checking Brave presence and spotting citations

  • Run site:example.com on search.brave.com and test topical queries.
  • Ask Claude about your niche and look for your domain in citations on relevant prompts.

Outcome: you’ll have evidence of crawling, indexing, and citation.

Robots.txt and access tests

Prove your allow/deny policies are respected across subdomains. The key is to test each host independently and confirm expected responses.

  • Validate each robots.txt:

curl -I https://docs.example.com/robots.txt

  • Test allowed vs disallowed UAs:

curl -A "Claude-SearchBot" -I https://docs.example.com/
curl -A "ClaudeBot" -I https://docs.example.com/

Ensure allowed pages return 200 and disallowed content isn’t crawled. Takeaway: catch misconfigurations before they impact visibility.

Server log monitoring for Anthropic visits

Observe real bot behavior to confirm compliance and performance. The must-do is to log user-agent and status code, and review after major changes.

  • Sample filter:

grep -E 'ClaudeBot|Claude-SearchBot|Claude-User|BraveBot' /var/log/nginx/access.log

  • Watch for robots.txt fetches, crawl bursts, and any 4xx/5xx patterns.

Note: User-agent strings can be spoofed. Pair UA checks with sensible rate patterns and allowlists where appropriate.

Outcome: confidence your site is being crawled as intended.

Checking Brave presence and spotting citations

Validate index presence and real-world impact to guide next steps. The focus is confirming inclusion in Brave and whether Claude cites you for relevant topics.

  • Use site: searches and targeted queries on Brave to gauge coverage.
  • Track citations by prompting Claude on topics you cover and scanning for your domain.

If gaps persist, revisit sitemap lastmod, internal links, and answer-first formatting. Takeaway: measurement guides your next optimization.

Troubleshooting: common issues and fixes

Unblock visibility quickly by separating accessibility from selection quality. The key is to fix crawl/index issues first, then improve content for citations.

  • Not in Brave: check robots, sitemaps, internal links, canonical conflicts; ensure 200s and unblocked assets.
  • Low crawl: add fresh content; update lastmod; strengthen internal links; reduce 5xx.
  • Indexed but not cited: add answer-first summaries, FAQs, and updated data; improve topical authority; refresh content cadence.

Result: a tighter pipeline from discovery to citation.

Blocked or throttled by robots.txt (or on subdomains)

Resolve directive conflicts that unintentionally block desired bots. The priority is correctness per host and precise UA targeting.

  • Search for broad wildcard rules (User-agent: * with Disallow: /) that end up blocking bots you want to allow (see the example after this list).
  • Add explicit allows for Claude-SearchBot where needed.
  • Ensure each subdomain has its own robots.txt and sitemap reference.
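
For example, under the Robots Exclusion Protocol a crawler follows only the most specific user-agent group that matches it, so the named group below overrides the wildcard for Claude-SearchBot:

User-agent: *
Disallow: /

User-agent: Claude-SearchBot
Allow: /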

Takeaway: precision beats broad rules—especially on multi-host architectures.

Indexed but not cited: strengthening selection factors

Close the gap between inclusion and citation with answer-forward formatting. The quickest win is to surface clear, current, and specific answers.

  • Add a “Quick answer” at the top with the key fact or steps.
  • Update dates, versions, and numbers; cite sources or standards.
  • Add schema (FAQ/HowTo/Product) and internal links to reinforce topical clusters.

Outcome: Claude is more likely to select and attribute your content.

Compliance, privacy, and safety considerations

Balance visibility with legal and privacy requirements across environments. The most important rule: don’t rely on robots.txt to protect sensitive content—use authentication.

  • Use HTTP auth, IP allowlists, or gated access for staging and private areas (a minimal server sketch follows this list).
  • Prefer robots over IP blocking for public sites to avoid collateral damage. If you must block IPs, test thoroughly to avoid impacting users and legitimate crawlers.
  • Respect CAPTCHAs and rate limits; monitor for unusual spikes and enforce fair-use policies.
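
A minimal sketch for an nginx-served staging host (the credentials file path and document root are illustrative):

server {
    server_name staging.example.com;

    # Require credentials for everything on this host; robots.txt alone is not protection.
    auth_basic "Staging";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        root /var/www/staging;
    }
}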

Takeaway: align controls with data sensitivity while keeping public content accessible for legitimate indexing.

Implementation checklist and resources

Lock in the essentials with a focused plan you can verify. The priority is selective access, clean indexability, and measurable monitoring.

  • Robots per subdomain: allow Claude-SearchBot; decide on ClaudeBot; keep assets unblocked.
  • Sitemaps: segmented, current lastmod, referenced in robots.txt.
  • Rendering: SSR/pre-render; verify content without JS; fix blocked assets.
  • Health: stable 200s, ETag/Last-Modified, minimal 4xx/5xx.
  • Brave checks: site: queries and logs for BraveBot.
  • Monitoring: grep Anthropic UAs; review after content or config changes.

Authoritative references

  • Anthropic’s documentation on web crawling, robots.txt support, and model training opt-outs.
  • Brave Search documentation on indexing and crawling behavior.
  • Sitemaps.org and Google Search Central guidelines for sitemaps, canonicals, and hreflang (standards that broadly apply).

FAQs

Use these quick answers to implement changes fast and resolve common doubts. The big picture: allow Claude-SearchBot for visibility, control ClaudeBot for training, and verify everything with logs and site: checks.

How do I block ClaudeBot but allow Claude-SearchBot?

  • In robots.txt, explicitly allow the search bot and disallow the training bot:

User-agent: Claude-SearchBot
Allow: /
User-agent: ClaudeBot
Disallow: /

Does Claude use Brave Search to index websites?

  • Claude relies heavily on the Brave Search index for web discovery and retrieval; ensure Brave can crawl and index your site to appear in results.

What is Claude-SearchBot and how does it visit my site?

  • Claude-SearchBot supports search visibility and result curation; it respects robots.txt and typically fetches robots before crawling allowed pages.

How to check if my site is indexed in Brave?

  • Run site:example.com on search.brave.com and look for coverage; also review logs for BraveBot hits and successful 200 responses.

How often does Claude crawl or refresh citations?

  • Recrawl cadence varies by change signals, lastmod, internal links, and popularity; updating content and lastmod can accelerate refresh.

What robots.txt rules does Anthropic respect?

  • Anthropic bots respect standard robots directives (Allow/Disallow) and commonly honor reasonable crawl-delay; configure per user agent.

Does Claude read JavaScript-rendered content?

  • If critical content requires JS, enable SSR or pre-render and keep JS/CSS unblocked; otherwise crawlers may miss key content.

How to structure content to earn Claude citations?

  • Lead with a concise answer, include FAQs and data points, and use appropriate schema so Claude can extract and attribute accurately.

How to set crawl-delay for Anthropic bots?

  • Add a per-agent directive in robots.txt, for example:

User-agent: Claude-SearchBot
Crawl-delay: 10

  • Use it alongside caching and CDN controls; note that not all bots honor crawl-delay.

How to verify Anthropic bot activity in server logs?

  • Search logs for the ClaudeBot, Claude-SearchBot, and Claude-User user agents and confirm 200s on allowed pages and robots.txt fetches; monitor patterns after changes.

How can I allow Claude-SearchBot for visibility while blocking ClaudeBot across multiple subdomains?

  • Publish robots.txt per subdomain with the same allow/deny rules and validate each host with curl and log reviews.

What are the risks of IP blocking vs robots.txt?

  • Robots is reversible and precise; IP blocking can break legitimate access and is harder to maintain—use only with careful testing.

How do canonical and hreflang impact the cited version?

  • Correct canonicals consolidate duplicates; accurate hreflang ensures the right language/region gets indexed and cited.

What’s the recommended sitemap structure and lastmod strategy for large sites?

  • Use a sitemap index pointing to segmented sitemaps (by section/date), keep lastmod accurate to content changes, and avoid inflating timestamps.

What’s the best practice for staging or protected environments?

  • Require authentication and disallow all in robots.txt; never rely solely on robots to protect sensitive content.

How do I detect and fix render-blocking errors that prevent understanding?

  • Fetch pages without JS to ensure core content is present, unblock assets in robots.txt, and enable SSR/pre-render for critical templates.
