If you’re unsure how to show up in ChatGPT search without letting your content be used for model training, this guide is for you.
OpenAI now separates its crawlers by purpose. Small config mistakes can block visibility or permit training you didn’t intend.
You’ll learn which bots do what, the exact controls (robots.txt plus network rules), how to verify access in your logs, and how to improve your chances of being cited.
OpenAI’s Crawlers at a Glance
You need to know which OpenAI agent does what before you set policy. OpenAI uses distinct user-agents (UAs) and published IP ranges for search discovery, model training, and live user fetches.
Below, you’ll see each bot’s purpose and how to identify it. We include references to OpenAI’s official docs for UA strings and IP lists.
OAI-SearchBot — used for ChatGPT search discovery/surfacing
This crawler discovers and refreshes pages eligible to surface and be cited in ChatGPT search. Its UA string includes “OAI-SearchBot.” Requests originate from OpenAI-owned IPs documented in OpenAI’s crawler documentation.
In practice, allow this bot if you want to appear in ChatGPT search.
- Purpose: discovery for ChatGPT search citations and snippets
- Identify: User-Agent contains “OAI-SearchBot”; IPs from OpenAI’s published ranges
- Docs: See OpenAI’s official crawler docs for the OAI-SearchBot UA and IP ranges
Takeaway: Allow OAI-SearchBot to participate in ChatGPT search indexing and citations.
GPTBot — used for model training (opt-out capable via robots.txt)
This crawler fetches content for potential model training and evaluation. If you don’t want content used for training, block GPTBot via robots.txt; per OpenAI’s docs, this is the formal opt-out.
Blocking GPTBot does not also block OAI-SearchBot; each bot only follows the directives addressed to it.
- Purpose: training/evaluation ingestion control
- Identify: User-Agent contains “GPTBot”; IPs from OpenAI’s published ranges
- Docs: OpenAI’s GPTBot page provides the authoritative UA and JSON IP ranges
Takeaway: Disallow GPTBot in robots.txt if you want to opt out of training while still allowing search.
ChatGPT-User — user-triggered fetches during live interactions
When a user or tool in ChatGPT opens your page (e.g., browse-with-search or a link click), OpenAI may fetch the URL using the “ChatGPT-User” UA. These are on-demand, session-driven fetches, not scheduled crawls.
Expect them to honor your robots policies and any standard authentication you configure.
- Purpose: ephemeral retrieval for a live ChatGPT session
- Identify: User-Agent contains “ChatGPT-User”; IPs from OpenAI-owned ranges
- Control: Honor your robots policies; use standard auth/headers for gated areas
Takeaway: Expect sporadic fetches tied to user actions; ensure your WAF/CDN doesn’t mistake them for abuse.
Bottom line: Set bot-specific controls using OpenAI’s official UA and IP data; don’t treat OAI-SearchBot, GPTBot, and ChatGPT-User as interchangeable.
How Discovery and Indexing Work for OpenAI
Many teams conflate “indexing for ChatGPT search” with “training.” They’re governed separately.
OpenAI uses common web discovery signals and a distinct “search” pipeline for surfacing. In this section, you’ll align your mental model and tune your discovery inputs.
OpenAI follows links (internal/external), sitemaps, and site architecture to discover content. Clear navigation and XML sitemaps increase crawl coverage and improve recrawl cadence.
In practice, keep public sitemaps current and linked in robots.txt to speed up discovery.
“Indexing” here means being eligible to be surfaced and cited in ChatGPT search results, not being stored for model training. Training access is handled by GPTBot and your robots policy.
Keep these workflows mentally separate so you can apply precise, bot-specific controls.
Takeaway: Use links and sitemaps for discovery; treat “search surfacing” and “training” as distinct systems with different bots.
Discovery inputs: links, sitemaps, and internal navigation
Discovery relies on crawlable links, sitemap URLs, and findable content in your IA. XML sitemaps (and a sitemap index if you have many files) help OpenAI crawlers locate new and updated URLs faster.
For content collections (blogs, docs), ensure index pages link directly to detail pages.
Example actions:
- Reference your primary sitemap in robots.txt (Sitemap: https://example.com/sitemap.xml); a quick check follows this list.
- Keep lastmod accurate for changed pages; remove stale URLs to avoid crawl waste.
- Avoid orphan pages by linking from hubs and related content.
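A quick way to confirm both pieces are in place (example.com stands in for your domain):
# Check that robots.txt advertises the sitemap
curl -s https://example.com/robots.txt | grep -i '^sitemap:'
# Check that the sitemap itself is reachable and returns 200
curl -I https://example.com/sitemap.xml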
Takeaway: Strong internal linking + clean sitemaps = better discovery and refresh.
Indexing vs surfacing vs training: how they differ
Indexing means a page is recorded as eligible for display in ChatGPT search. Surfacing is actually appearing and being cited in an answer, which depends on relevance, quality, and trust.
Training is allowing GPTBot to use your content to improve models and evaluations.
Example: You can allow OAI-SearchBot and block GPTBot to show up in ChatGPT search without contributing to training. Conversely, allowing GPTBot doesn’t guarantee citations in search.
Takeaway: Decide each outcome independently with the right bot-specific controls.
In short, optimize discovery with links and sitemaps, and configure training access separately—eligibility to surface does not require training consent.
Rendering and Accessibility: Why SSR Still Matters
If crawlers can’t see your content in raw HTML, they may miss it entirely. OpenAI has not documented full JavaScript execution for its crawlers.
Assume limited or no JS rendering. Prioritize server-side rendering (SSR) or pre-rendering for public content.
Here’s how to expose critical text reliably.
For JS-heavy frameworks (Next.js, Remix, Nuxt), turn on SSR or static pre-render for public pages. Ensure critical content appears in the initial HTML and isn’t hidden behind client-only hydration.
Treat the first-response HTML as the source of truth for bots.
Takeaway: Deliver indexable HTML on first response; don’t rely on client-side rendering for essential text.
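A quick way to approximate what a non-rendering crawler sees is to fetch the raw HTML and grep for a phrase that should be in your main content (the page URL and search phrase are illustrative):
# Fetch the page as a bot would and look for critical copy in the unrendered HTML
curl -s -A "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/bot)" https://example.com/page/ | grep -i "your key phrase"
# No match usually means the text only appears after client-side rendering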
Common JS pitfalls (hydration, lazy-load, infinite scroll) and fixes
Hydration-only content: If text appears only after hydration, bots may see empty shells. Fix with SSR or pre-rendering.
Lazy-loaded above-the-fold content: Thresholds can hide content from bots. Fix by server-rendering the first viewport and loading non-critical assets later.
Infinite scroll without paginated links: Bots can’t scroll. Fix by adding paginated URLs with rel=next/prev or a “View All” page.
Client-side routers that block anchor URLs: If deep links don’t resolve server-side, crawlers fail. Fix by handling routes at the server and returning canonical HTML.
Takeaway: Audit rendered HTML; if “View Source” lacks your core content, bots likely can’t read it.
Set Your Policy: Robots.txt Templates for OpenAI
This is where you define what OpenAI can do with your site. Because OpenAI now separates discovery/surfacing from training, the safest approach is explicit, bot-specific directives.
Below are copy-ready robots.txt policies for common goals. Per OpenAI’s docs, policy changes may take up to ~24 hours to propagate.
Allow ChatGPT search (OAI-SearchBot) but block training (GPTBot)
Use this if you want to appear and be cited in ChatGPT search while opting out of model training.
# Allow ChatGPT search discovery; block model training
User-agent: OAI-SearchBot
Allow: /
User-agent: GPTBot
Disallow: /
# Optional: allow user-triggered fetches for live sessions
User-agent: ChatGPT-User
Allow: /
# Your sitemaps for discovery
Sitemap: https://example.com/sitemap.xml
Tip: Keep disallows specific to GPTBot; do not add a blanket User-agent: * group with Disallow: / that would shut out everything else.
Block all OpenAI crawlers (not recommended unless required)
Use this only if policy or licensing requires a full block.
# Block all OpenAI crawlers
User-agent: OAI-SearchBot
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
Note: This prevents ChatGPT search citations and live fetches.
Allow all OpenAI crawlers (search + training)
Use this if you support both discovery/surfacing and model training.
# Allow OpenAI crawlers
User-agent: OAI-SearchBot
Allow: /
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
Sitemap: https://example.com/sitemap.xml
Expect treatment similar to other major crawlers: normal crawl behavior that respects your robots directives.
Propagation expectations (~24 hours) after robots.txt changes
OpenAI states robots.txt policy updates can take up to ~24 hours to be fully honored. During that window you may see legacy behavior in logs, so plan changes accordingly and re-check after a day.
Takeaway: Don’t assume instant effect—verify again after 24 hours.
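One way to confirm what is actually being served, now and again after the propagation window (the domain is a placeholder):
# Show the live robots.txt rules for the OpenAI bots
curl -s https://example.com/robots.txt | grep -i -A 2 -E 'OAI-SearchBot|GPTBot|ChatGPT-User'
# Re-run after ~24 hours and compare against fresh log entries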
Network-Level Controls: IP Allowlisting and Firewalls
Robots.txt won’t help if your CDN/WAF blocks requests. Combine UA checks with OpenAI’s published IP ranges to reduce spoofing and avoid false positives.
OpenAI publishes machine-readable JSON lists of crawler IP ranges. Build an allowlist that updates automatically.
Validate both the UA string and source IP to mitigate UA spoofing and keep your bot defenses intact.
Takeaway: Pair robots policy with network rules that trust OpenAI’s published IPs.
OpenAI IP JSON endpoints and how to sync allowlists
OpenAI provides machine-readable IP ranges for GPTBot and OAI-SearchBot in its docs. Fetch these JSON endpoints on a schedule (e.g., hourly) and update your CDN/WAF IP lists so changes don’t break access.
- Fetch: programmatically GET the official JSON endpoints from OpenAI’s docs for GPTBot and OAI-SearchBot IPs
- Parse: extract CIDR blocks; update your platform’s “IP list” object
- Validate: combine “IP in list” AND “User-Agent contains expected token” (a shell sketch follows this list)
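A rough sketch of that sync loop, assuming jq is installed; the endpoint URL and JSON field names below are placeholders, so substitute the real values from OpenAI’s documentation:
#!/usr/bin/env bash
# Hypothetical allowlist sync: fetch OpenAI crawler CIDRs and write them to a local file
set -euo pipefail

ENDPOINT="https://example.org/openai-crawler-ips.json"   # placeholder; use the official URL from OpenAI's docs
OUTFILE="/etc/allowlists/openai-cidrs.txt"

# Assumes a Googlebot-style schema such as {"prefixes":[{"ipv4Prefix":"192.0.2.0/24"}]}; adjust the jq filter to the real schema
curl -fsS "$ENDPOINT" | jq -r '.prefixes[].ipv4Prefix // empty' > "${OUTFILE}.tmp"

# Only swap in the new list if the fetch returned data
if [ -s "${OUTFILE}.tmp" ]; then
    mv "${OUTFILE}.tmp" "$OUTFILE"
    # Reload whatever consumes the list here (CDN API update, nginx reload, ipset restore, ...)
fi
Run it on a schedule (hourly works for most sites) so changes on OpenAI’s side never silently break access.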
Caution: Always rely on the official OpenAI documentation URLs for current ranges and endpoints.
Takeaway: Automate IP sync to prevent accidental blocks when OpenAI changes infrastructure.
CDN/WAF recipes (Cloudflare, AWS WAF, NGINX)
Cloudflare (recommended pattern):
- Create an IP List containing OpenAI CIDRs (via API or dashboard).
- Firewall rule to Allow if:
  ip.src in $OPENAI_IPS and (
    http.user_agent contains "OAI-SearchBot" or
    http.user_agent contains "GPTBot" or
    http.user_agent contains "ChatGPT-User"
  )
- Place this above generic bot or rate-limit rules; add a second rule to Challenge/Block if the UA matches but the IP is not in the list.
AWS WAF:
- Create an IPSet with OpenAI CIDRs (REGIONAL scope in the Region where your app runs, or CLOUDFRONT scope for CloudFront distributions).
- Rule 1 (Allow): IF (Source IP in IPSet) AND (UA matches OAI-SearchBot|GPTBot|ChatGPT-User).
- Rule 2 (Block/Count): IF UA matches but IP not in IPSet (spoofing guard).
- Attach the rules to the Web ACL associated with your ALB or CloudFront distribution.
NGINX (edge or origin):
- Use the geo or map module to flag OpenAI CIDRs (or ipset via firewall).
- Combine with UA matching in a server block:
    if ($http_user_agent ~* "(OAI-SearchBot|GPTBot|ChatGPT-User)") {
        set $is_openai_ua 1;
    }
    # Example using a CIDR include or a variable set by a real-ip/geo module
    if ($is_openai_ua = 1) {
        # Optionally rate-limit more gently or skip WAF locations
    }
- Prefer IP validation upstream (OS firewall/ipset) or via a reverse proxy that supports CIDR lists; a geo/map sketch follows.
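As a fuller sketch, the geo and map modules avoid scattered if logic; the CIDRs, paths, and server name below are placeholders, not OpenAI’s real ranges:
# http context: classify requests by UA token and by source IP
map $http_user_agent $openai_ua {
    default                                   0;
    "~*(OAI-SearchBot|GPTBot|ChatGPT-User)"   1;
}
geo $openai_ip {
    default            0;
    192.0.2.0/24       1;   # placeholder; load OpenAI's published CIDRs here
    198.51.100.0/24    1;   # placeholder
}
# OpenAI UA without an OpenAI IP = likely spoofing
map "$openai_ua$openai_ip" $openai_spoof {
    default   0;
    "10"      1;
}
server {
    listen 80;
    server_name example.com;
    location / {
        # Reject spoofed OpenAI UAs; genuine OpenAI traffic passes through normally
        if ($openai_spoof) {
            return 403;
        }
        root /var/www/html;
    }
}
Regenerating the geo entries from the synced CIDR file keeps the UA and IP checks in step.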
Takeaway: Always pair UA match with IP validation; log rule matches for auditing.
Verify and Monitor: Make Sure OpenAI Can See You
Don’t guess—prove access in logs and via targeted test requests. Verification prevents weeks of confusion after robots or WAF changes and confirms you’re using OpenAI’s official UA/IP signals.
Start with server logs to confirm real crawler hits. Confirm the source IP belongs to OpenAI’s published CIDRs.
Then run curl/wget with the right UAs to test page-level and robots access at both CDN and origin. Re-verify after ~24 hours when changing policy.
Takeaway: Logs + on-demand tests provide fast, reliable validation.
Server log patterns for OAI-SearchBot, GPTBot, and ChatGPT-User
Look for the UA token and confirm the source IP belongs to OpenAI’s ranges. Example combined log lines (Apache/Nginx):
- OAI-SearchBot:
203.0.113.45 - - [17/Nov/2024:12:34:56 +0000] "GET /guide/ HTTP/1.1" 200 18473 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/bot)"
- GPTBot:
203.0.113.46 - - [17/Nov/2024:12:35:02 +0000] "GET /blog/post/ HTTP/1.1" 200 32761 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
- ChatGPT-User:
203.0.113.52 - - [17/Nov/2024:12:36:14 +0000] "GET /pricing/ HTTP/1.1" 200 9210 "https://chat.openai.com/" "Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)"
Checklist:
- Status codes: 200/304 preferred; investigate 401/403/429.
- robots.txt fetches appear before crawling: “GET /robots.txt”.
- IP ownership: confirm against OpenAI’s published CIDRs.
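A quick filter for pulling these hits out of a combined access log (the log path is an assumption; adjust for your server):
# Show recent OpenAI bot requests: source IP, timestamp, request path, status
grep -E 'OAI-SearchBot|GPTBot|ChatGPT-User' /var/log/nginx/access.log | awk '{print $1, $4, $7, $9}' | tail -n 50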
Takeaway: UA token + OpenAI IP + expected crawl behavior = authentic hit.
Use curl/wget with user-agent strings to test access
Quickly simulate bot requests to ensure you’re not blocking at app or CDN layers.
Robots policy checks:
curl -i https://example.com/robots.txt
Page fetch as OAI-SearchBot:
curl -A "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/bot)" -I https://example.com/page/
curl -A "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/bot)" https://example.com/page/ | head -n 20
Page fetch as GPTBot:
curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -I https://example.com/page/
Expected outcomes:
- 200 for OAI-SearchBot on allowed pages.
- Crawl avoidance (not 403/404) if you’ve disallowed a bot via robots.txt; robots rules are advisory, so unexpected 403s point to a server or WAF rule rather than your robots policy. Make sure that behavior matches your intent.
- robots.txt reflects the allow/disallow lines you configured.
Takeaway: If curl works but you see no bot hits, check WAF IP filtering and wait for the ~24h policy propagation.
Boost Your Chance of Being Cited in ChatGPT
Citations favor clear, trustworthy, well-structured content. While OpenAI hasn’t published a full ranking algorithm, standard E-E-A-T principles and unambiguous structure help ChatGPT select sources.
Use this section to tune structure, schema, and credibility signals.
Focus on scannable sections, precise definitions near the top, and schema that clarifies entities and instructions. Link to your sources and include bylines/bios for credibility.
Keep a single canonical URL per topic to concentrate signals.
Takeaway: The same signals that help search also help LLMs select authoritative snippets.
Structured data that helps (Article, FAQ, HowTo) and author E-E-A-T
Use schema.org to make intent and structure machine-readable:
- Article/BlogPosting for editorial content with author, datePublished, headline.
- FAQPage for question-answer sections likely to be quoted.
- HowTo for procedural content with steps, tools, and outcomes.
Add author profiles with expertise, link external credentials, and reference primary sources. Mark paywalls with appropriate structured signals (isAccessibleForFree) if applicable.
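To make those signals machine-readable, a minimal Article JSON-LD block along these lines can sit in the page head (the headline, date, names, and URLs are placeholders):
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How OpenAI Crawlers Work",
  "datePublished": "2024-11-17",
  "author": {
    "@type": "Person",
    "name": "Jane Doe",
    "url": "https://example.com/authors/jane-doe"
  },
  "isAccessibleForFree": true
}
</script>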
Takeaway: Schema + author transparency improves eligibility and trust.
Content patterns: concise definitions, clear headings, canonical clarity
Lead with a one-sentence definition that maps to the query. Use H2/H3s that echo exact-match intents (e.g., “Do OpenAI crawlers execute JavaScript?”).
Ensure a single canonical URL per topic and coherent internal links.
Actions:
- Put the short answer first; elaborate below.
- Use consistent, descriptive headings and anchor links.
- Fix canonical/duplicate issues to consolidate authority.
Takeaway: Make it easy to extract a clean, quotable answer with a clear source URL.
Sitemaps and Recrawl Hygiene
Sitemaps remain a reliable nudge for discovery and recrawl. Maintain accurate lastmod fields and avoid dumping everything into one file if you exceed limits.
Use this section to segment, clean, and signal changes without noise.
For large sites, use a sitemap index that groups content by type or update frequency. Keep sitemaps reachable from robots.txt and optionally your footer to aid discovery by multiple crawlers.
Takeaway: Fresh, segmented sitemaps plus strong internal links accelerate rediscovery.
XML index and lastmod best practices
- Generate a sitemap index at /sitemap.xml that links to type-specific sitemaps (e.g., /sitemaps/blog.xml, /sitemaps/docs.xml).
- Keep each sitemap under 50,000 URLs and ~50 MB uncompressed.
- Update lastmod only when meaningful content changes; don’t churn timestamps.
- Remove 404/410’d URLs promptly to reduce crawl waste.
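A minimal sitemap index following this pattern (URLs and lastmod dates are placeholders):
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemaps/blog.xml</loc>
    <lastmod>2024-11-17</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/docs.xml</loc>
    <lastmod>2024-10-02</lastmod>
  </sitemap>
</sitemapindex>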
Takeaway: Cleanliness and accuracy beat volume; quality lastmod signals help re-crawl.
Troubleshooting Matrix
Use this section to diagnose common issues fast. Start with robots.txt, then logs, then network rules, and finally content and schema if access isn’t the problem.
No OpenAI hits in logs → check robots.txt, WAF, IP allowlist
- Confirm robots.txt is accessible and not disallowing OAI-SearchBot.
- Verify CDN/WAF isn’t blocking OpenAI IPs; create allow rules using OpenAI’s CIDRs.
- Ensure your origin isn’t geo-blocking OpenAI’s data centers.
- Wait ~24 hours after policy changes and recheck.
Takeaway: Most “no hits” cases are policy blocks or network filtering.
403/401 errors → bot identification and firewall rules
- Make sure auth-protected areas return 401 with WWW-Authenticate, not blanket 403s.
- If using bot challenges, exempt OpenAI by (UA AND IP) match; don’t trust UA alone.
- Inspect rate limits (429s) and raise budgets for OpenAI IPs during discovery windows.
- Test with curl using the OAI-SearchBot UA; compare CDN vs origin responses (see the --resolve example below).
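To compare CDN and origin behavior directly, curl’s --resolve flag sends the same bot-UA request to a specific origin IP (the origin IP below is a placeholder):
# Through the CDN (normal DNS resolution)
curl -I -A "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/bot)" https://example.com/page/
# Straight to the origin, bypassing the CDN
curl -I -A "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/bot)" --resolve example.com:443:203.0.113.10 https://example.com/page/
# Differing status codes point to a CDN/WAF rule rather than the application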
Takeaway: Balance protection with precise allow rules to avoid collateral blocks.
Pages not surfacing/cited → content and schema checks
- Verify OAI-SearchBot can fetch 200/304 and sees full HTML content (SSR/pre-render).
- Add or improve Article/FAQ/HowTo schema; include concise definitions near the top.
- Strengthen E-E-A-T: bylines, bios, outbound references, and clear sourcing.
- Fix canonicalization, pagination, and hreflang inconsistencies that dilute signals.
Takeaway: Eligibility requires both access and high-quality, unambiguous content.
Bottom line: Debug in order—robots, logs, network, then content—so you fix the right layer first.
FAQs
Here are quick answers to common questions that cause misconfigurations. Use these to validate assumptions and decide which controls to apply.
Do OpenAI crawlers execute JavaScript?
Assume limited or no JS execution; serve SSR or pre-rendered HTML for public pages. Critical content should be present in the initial response HTML, not loaded solely via hydration or infinite scroll.
When in doubt, view-source and confirm your text is there.
Does noindex prevent ChatGPT citations?
Meta robots noindex targets traditional web indices, not necessarily LLM surfacing. If you must avoid citations, explicitly disallow OAI-SearchBot in robots.txt.
If you only want to block training, disallow GPTBot but allow OAI-SearchBot. Default to explicit bot controls for clarity.
Can I allow ChatGPT search but block training?
Yes. Allow OAI-SearchBot and disallow GPTBot in robots.txt (templates below). This preserves eligibility for ChatGPT search citations while opting out of model training.
Takeaway: Prefer explicit, bot-specific directives over generic meta signals when controlling ChatGPT visibility and training.
Copy-Paste Templates
Use these ready-to-go snippets to implement policy and test access quickly. Customize paths and domains as needed, then verify behavior in logs after ~24 hours.
Robots.txt: allow search, block training
User-agent: OAI-SearchBot
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Allow: /
Sitemap: https://example.com/sitemap.xml
Optional: Add path-level controls if needed (e.g., Disallow: /private/). Note that a crawler with its own named group reads only that group, so repeat path rules inside each bot-specific block.
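For instance, to keep a private path out of ChatGPT search while leaving the rest of the policy unchanged:
User-agent: OAI-SearchBot
Allow: /
Disallow: /private/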
Curl commands and expected responses
Check robots.txt:
curl -i https://example.com/robots.txt
Test as OAI-SearchBot:
curl -I -A "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/bot)" https://example.com/
Test as GPTBot:
curl -I -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" https://example.com/
Expected:
- OAI-SearchBot: 200 on allowed pages; robots.txt shows Allow.
- GPTBot: a direct request may still return 200 because robots rules are advisory; what matters is that the Disallow line is present so compliant crawlers skip the content.
Takeaway: Implement, test, and confirm with logs so policy matches intent.
Summary and Next Steps
You now know how OpenAI crawlers work, how to allow OAI-SearchBot while blocking GPTBot, and how to verify everything with logs and curl. Pair robots policies with IP-based allow rules, deliver SSR/pre-rendered HTML, and add schema + E-E-A-T to improve citation odds.
Next steps:
- Implement your chosen robots.txt template; wait ~24 hours.
- Set CDN/WAF rules that validate both UA and OpenAI IPs.
- Verify in logs and with curl; fix any 401/403/429 issues.
- Improve sitemaps, lastmod hygiene, and structured data.
For authoritative details (UA strings, IP ranges, propagation behavior), consult OpenAI’s official crawler documentation for GPTBot, OAI-SearchBot, and ChatGPT-User.