Quick Answer: What is LLMs.txt?
LLMs.txt is a plain‑text/Markdown manifest at your site’s root (https://example.com/llms.txt). It lists high‑quality, LLM‑friendly content with short summaries so AI crawlers and tools can retrieve the right pages quickly.
Its purpose is to cut HTML noise and steer generative systems to authoritative, up‑to‑date sources on your domain.
Why LLMs.txt matters: context limits, HTML noise, and LLM-friendly Markdown
Large language models have finite context windows. Pushing a 200–500 KB HTML page (plus scripts and boilerplate) into a prompt wastes tokens and dilutes signal.
Markdown preserves headings and semantics with far fewer tokens and aligns with how LLMs infer structure. A single FAQ in Markdown may be 5–20 KB, while the rendered HTML can be 10–20× heavier.
This is why LLMs.txt lists concise, canonical sources with brief summaries and links to clean Markdown or text. You’re curating what should enter a model’s “short‑term memory” for better answers. The takeaway: pre‑selecting and compressing content improves retrieval precision and reduces hallucinations.
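A rough back-of-the-envelope comparison, assuming the common heuristic of ~4 characters per token (an approximation, not a tokenizer count):

def approx_tokens(size_kb: float) -> int:
    # ~4 characters per token is a rough heuristic, not an exact count
    return int(size_kb * 1024 / 4)

print(approx_tokens(300))  # ~76,800 tokens for a 300 KB rendered HTML page
print(approx_tokens(10))   # ~2,560 tokens for roughly the same content as 10 KB of Markdown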
The llms.txt format: required sections, optional fields, and a minimal example
LLMs.txt is a readable, line‑oriented text file designed for both humans and machines.
Most implementations follow a simple pattern: a small header (site, updated, policy), then a list of entries describing important pages with a short summary. Keep the structure consistent and predictable so tools can parse it reliably.
Common fields include:
- url (required): absolute URL to the canonical page (or its Markdown).
- title (required): clear page title.
- summary (required): 1–3 sentences describing coverage and scope.
- format (optional): markdown, text, or html (prefer markdown).
- updated (optional): ISO 8601 date for freshness.
- lang (optional): BCP‑47 code like en, es‑MX.
- tags/section/weight (optional): organization hints for tools.
Canonical /llms.txt example (copy-paste)
# llms.txt v1
site: https://example.com
updated: 2025-01-10
owner: docs@example.com
policy: Public pages listed here may be used for answer generation; do not include paywalled or private content.
- url: https://example.com/docs/getting-started
title: Getting started
format: markdown
lang: en
updated: 2024-12-11
summary: Install the CLI, authenticate, and deploy your first project in under 5 minutes.
- url: https://example.com/docs/pricing
title: Pricing and limits
format: markdown
lang: en
updated: 2024-11-20
summary: Transparent plans with rate limits, overage rules, and billing FAQs.
- url: https://example.com/blog/product-roadmap-2025
title: 2025 product roadmap
format: markdown
lang: en
updated: 2024-12-01
summary: High-level themes and timelines for Q1–Q4; subject to change.
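For illustration only, a tool could parse the entry layout shown above with a few lines of Python (the field names follow this example, not a formal spec):

import urllib.request

def parse_llms_txt(text: str) -> list[dict]:
    # Collect '- url:'-delimited entries; any 'key: value' line that follows
    # is attached to the current entry until the next '- url:' line.
    entries, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("- url:"):
            current = {"url": line.split(":", 1)[1].strip()}
            entries.append(current)
        elif current is not None and ":" in line and not line.startswith("#"):
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return entries

text = urllib.request.urlopen("https://example.com/llms.txt").read().decode("utf-8")
for entry in parse_llms_txt(text):
    print(entry["url"], "-", entry.get("summary", ""))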
/llms-full.txt: when to use the full-content variant
/llms-full.txt is an optional companion that embeds the full, authoritative Markdown for a small set of mission‑critical pages. Use it when you want guaranteed, lossless context without a separate crawl (e.g., a terms page, API quickstart, or outage procedure). Keep it small to avoid token bloat and stale copies.
Recommended usage:
- Include only a handful of evergreen, frequently cited docs.
- Update in lockstep with the source pages; add “updated” dates.
- Prefer headings and short sections for chunkable ingestion.
# llms-full.txt v1
site: https://example.com
updated: 2025-01-10
### url: https://example.com/docs/quickstart
### title: Quickstart (full text)
### lang: en
### updated: 2024-12-11
# Quickstart
Install the CLI:
1) brew install example/tap/example
2) example auth login
3) example deploy
Troubleshooting: See https://example.com/docs/troubleshooting
### url: https://example.com/legal/terms
### title: Terms of Service (excerpt)
### lang: en
### updated: 2024-10-01
# Terms (Summary)
This summary is non-binding. The canonical version is at the URL above.
...
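Consumers of /llms-full.txt typically split the embedded Markdown by headings before embedding it. A minimal sketch of that chunking step (real pipelines also respect the ### metadata lines and token budgets):

import re

def chunk_by_headings(markdown: str) -> list[str]:
    # Start a new chunk at every Markdown heading line
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks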
llms.txt vs robots.txt vs sitemap.xml
These files complement each other but serve different audiences and workflows. Confusion arises because all live at the root and influence crawlers, yet each optimizes a different phase of discovery and use.
- llms.txt: Curates LLM‑friendly content and summaries for answer generation and retrieval. It is advisory and focused on quality context.
- robots.txt: Controls crawler access (allow/disallow) to URLs. It governs permissions, not curation or summarization.
- sitemap.xml: Lists discoverable URLs and metadata (priority, lastmod) for search engines. It handles coverage and discovery, not summaries.
Use cases and limitations for each file
- Use llms.txt to spotlight canonical sources, minimize HTML noise, and provide short summaries that help AI choose the right page. Limitation: not universally enforced; different AI crawlers may treat it as a hint.
- Use robots.txt to block sensitive or low‑value paths (e.g., /admin/, faceted search). Limitation: public content may still be accessed via direct links or caches.
- Use sitemap.xml to ensure search engines see all public pages. Limitation: it does not indicate which content is “best” to load into prompts.
How LLMs use llms.txt during inference (and what it means for training)
During inference, agents and tools fetch llms.txt, read summaries, and select 1–N URLs to load into the prompt (or a retrieval pipeline) to answer a user question.
Some tools prefer Markdown links and will fetch the clean .md if available, then chunk the content by headings before embedding.
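A toy version of that selection step, using keyword overlap in place of the embedding-based ranking most real tools use (entries as parsed from /llms.txt):

def rank_entries(entries: list[dict], question: str, top_n: int = 3) -> list[dict]:
    # Score each entry by word overlap between the question and its title + summary
    q_words = set(question.lower().split())
    def score(entry: dict) -> int:
        text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
        return sum(1 for w in q_words if w in text)
    return sorted(entries, key=score, reverse=True)[:top_n]

# The top-ranked URLs are then fetched (preferably as Markdown) and loaded
# into the prompt or a retrieval index.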
For training, treat llms.txt as advisory. It may guide what third‑party systems crawl, but it is not a universal opt‑in/opt‑out mechanism for model pretraining.
If you have training or licensing requirements, publish them in a clear policy section and via legal pages. Coordinate with vendors as needed.
Create your llms.txt: four implementation paths
There’s no single “right” path—choose the option that matches your stack and maturity. Start with a minimal file, then automate as you scale.
Manual authoring (Markdown best practices and summaries)
Hand‑craft a first version in a text editor and save it to your site root as /llms.txt. Use plain Markdown headings, one entry per page, and keep summaries to 1–3 sentences with keywords users actually search for. Prefer canonical docs, FAQs, and pricing/policy pages.
Best practices:
- Use absolute URLs and include updated dates.
- Keep summaries factual and scannable; avoid marketing fluff.
- Link to clean Markdown versions of docs where possible.
Use a generator or validator
If you have many pages, use a generator to build entries from frontmatter or docs metadata. Pair it with a validator to catch empty summaries, missing URLs, or broken dates. Look for tools referenced by the spec and community directories (e.g., on llmstxt.org) and run them locally before publishing.
Practical flow:
- Crawl or read your docs source.
- Generate /llms.txt and, optionally, /llms-full.txt for key pages (see the sketch after this list).
- Validate structure and links, then publish.
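For example, a generator that reads YAML frontmatter from a docs folder might look roughly like this (the paths, frontmatter keys, and base URL are assumptions to adapt):

from pathlib import Path
import yaml  # pip install pyyaml

BASE_URL = "https://example.com"  # assumption
DOCS_DIR = Path("docs")           # assumption

def read_frontmatter(path: Path) -> dict:
    text = path.read_text(encoding="utf-8")
    parts = text.split("---", 2)
    if text.startswith("---") and len(parts) == 3:
        return yaml.safe_load(parts[1]) or {}
    return {}

lines = ["# llms.txt v1", f"site: {BASE_URL}"]
for md_file in sorted(DOCS_DIR.rglob("*.md")):
    fm = read_frontmatter(md_file)
    if not fm.get("title") or not fm.get("summary"):
        continue  # skip pages without usable metadata
    url = f"{BASE_URL}/{md_file.with_suffix('').as_posix()}"
    lines += [f"- url: {url}", f"title: {fm['title']}", f"summary: {fm['summary']}"]

Path("llms.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")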
CLI and docs-as-code (e.g., VitePress/Docusaurus plugins, CI integration)
Docs‑as‑code teams can install a CLI (e.g., community tools like llms_txt2ctx) or a static‑site plugin. VitePress/Docusaurus/Drupal integrations are emerging per the spec’s ecosystem pages.
Generate llms.txt during the build, commit it to the repo, and publish via CI.
In CI:
- Lint summaries (length, forbidden phrases).
- Check links (200 OK) and updated dates (not too old).
- Fail the build on structural errors (a minimal script is sketched below).
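A minimal check along those lines, as a sketch (the file path, timeout, and 180-day threshold are assumptions):

import re, sys, urllib.request
from datetime import date, timedelta

MAX_AGE = timedelta(days=180)  # assumption; tune per doc type
errors = []
text = open("llms.txt", encoding="utf-8").read()

for url in re.findall(r"^- url:\s*(\S+)", text, flags=re.M):
    try:
        urllib.request.urlopen(url, timeout=10)  # raises on 4xx/5xx or network errors
    except Exception as exc:
        errors.append(f"{url}: {exc}")

for d in re.findall(r"^updated:\s*(\d{4}-\d{2}-\d{2})", text, flags=re.M):
    if date.today() - date.fromisoformat(d) > MAX_AGE:
        errors.append(f"stale updated date: {d}")

if errors:
    print("\n".join(errors))
    sys.exit(1)  # fail the build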
CMS/WordPress patterns
On WordPress, either:
- Upload a static /llms.txt to the web root via SFTP or a file manager, or
- Register a custom route that renders llms.txt from selected posts/pages with summaries.
Keep a short allowlist (Docs, Pricing, Legal, Support) and exclude posts that are thin or time‑sensitive unless you can maintain them.
Validation and QA: schema, CI checks, and prompt tests
Validation ensures machines and humans can trust the file. Treat llms.txt like production content: broken links or vague summaries degrade AI answers.
Add checks to your pipeline and test with at least two LLMs before rollout.
A robust QA flow includes structural validation, link checks, and prompt‑based verification. For large sites, add per‑language checks and freshness SLAs for critical entries to maintain relevance over time.
Recommended validation schema and lint rules
If your generator can emit a JSON mirror (llms.json), validate it with a lightweight schema like:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "llms.json",
  "type": "object",
  "required": ["site", "entries"],
  "properties": {
    "site": { "type": "string", "format": "uri" },
    "updated": { "type": "string" },
    "entries": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["url", "title", "summary"],
        "properties": {
          "url": { "type": "string", "format": "uri" },
          "title": { "type": "string", "minLength": 3 },
          "summary": { "type": "string", "minLength": 30, "maxLength": 600 },
          "format": { "type": "string", "enum": ["markdown", "text", "html"] },
          "updated": { "type": "string" },
          "lang": { "type": "string" },
          "tags": { "type": "array", "items": { "type": "string" } }
        }
      }
    }
  }
}
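Applying it is a few lines with the jsonschema package (file names are assumptions):

import json
from jsonschema import validate, ValidationError  # pip install jsonschema

schema = json.load(open("llms.schema.json", encoding="utf-8"))  # the schema above
data = json.load(open("llms.json", encoding="utf-8"))           # generated mirror of llms.txt

try:
    validate(instance=data, schema=schema)
    print(f"OK: {len(data['entries'])} entries validated")
except ValidationError as err:
    raise SystemExit(f"llms.json invalid: {err.message}")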
Lint rules (suggested):
- Summary length: 30–120 words; avoid buzzwords and future‑dated claims.
- URLs must be absolute; no query‑only links or redirects.
- updated must be within 180 days for fast‑moving docs; flag older items.
Prompt-based verification in ChatGPT/Claude
After publishing, test retrieval and selection behavior with targeted prompts:
- “You can use only sources listed in https://example.com/llms.txt. Answer: How do I authenticate and deploy?”
- “Which three URLs from https://example.com/llms.txt are most relevant to pricing overages? Why?”
Compare answers before/after updates and track whether the model cites the intended pages. If it selects noisy or outdated sources, refine summaries and demote or remove weak pages.
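You can also script these checks against an API. A sketch assuming the OpenAI Python SDK (adapt the client, model, and expected URL to your provider and content):

import urllib.request
from openai import OpenAI  # assumption: OpenAI Python SDK; other providers work similarly

llms_txt = urllib.request.urlopen("https://example.com/llms.txt").read().decode("utf-8")
prompt = ("Using only the sources listed below, which three URLs are most relevant "
          "to pricing overages, and why?\n\n" + llms_txt)

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any capable chat model
    messages=[{"role": "user", "content": prompt}],
)
answer = resp.choices[0].message.content

# Check whether the intended canonical page is selected (URL is an assumption)
print("https://example.com/docs/pricing" in answer)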
Governance, security, and privacy: exclusions, approvals, and audits
Treat llms.txt as an externally visible “menu” of sources. Without governance, you risk leaking sensitive paths, promoting outdated docs, or implying licensing you don’t intend.
Define clear ownership, a review cadence, and an audit trail to keep the file trustworthy.
Security starts with an allowlist mindset. Only include public, non‑sensitive pages that you’re comfortable being used in LLM prompts as‑is. Align content with internal policies and ensure edits receive appropriate oversight.
Excluding sensitive or paywalled content
- Never list private, paywalled, or export‑restricted materials.
- Avoid internal dashboards, preview builds, staging URLs, and signed links.
- If needed, add an explicit exclusion note: “Private content, paywalled docs, and personal data are out of scope.”
- Align with your robots.txt and legal pages (e.g., AI usage policy, licensing).
Ownership, change control, and audit logs
- Owner: Assign a Docs/SEO lead with an engineering counterpart.
- Workflow: Pull requests with dual approvals (Docs + Legal/Sec for sensitive areas).
- Change log: Add a dated “updated” header and keep a HISTORY.md in the repo.
- Rollback: Keep previous versions and a quick re‑publish path if an entry proves risky.
Internationalization and large-site patterns
Multilingual and multi‑brand portfolios benefit most from curation, but structure matters. Keep language and brand boundaries explicit so tools don’t mix sources or prefer the wrong locale.
Language-specific sections and discovery
For a single domain with multiple languages, include lang per entry and, optionally, group by headings:
# Language: en
- url: https://example.com/en/docs/...
lang: en
summary: ...
# Language: es
- url: https://example.com/es/docs/...
lang: es
summary: ...
Also link to language‑specific Markdown when available. Ensure each language has its own authoritative versions of shared topics.
Subdomains, subdirectories, and multi-brand portfolios
- Subdomains (docs.brand.com): publish a separate /llms.txt per subdomain.
- Subdirectories (example.com/de/): you can centralize in the root /llms.txt or host a localized file at /de/llms.txt and link to it from the root.
- Multi‑brand: keep one file per brand domain to prevent cross‑brand content bleed.
Performance and limits: file size, summary length, and update cadence
Aim to keep /llms.txt below roughly 500 KB (ideally under 200 KB) and under ~300 entries for fast fetches and easier parsing. If you exceed that, split by section or language and link the alternates near the top.
For /llms-full.txt, keep total content below ~1–2 MB and limit to a handful of cornerstone docs.
Summary guidance:
- 30–120 words (roughly 200–700 characters) per entry is a practical sweet spot.
- Use clear nouns and task verbs (“install,” “authenticate,” “limits,” “pricing”).
- Refresh high‑traffic or high‑risk entries at least quarterly; hot paths monthly.
Discoverability: helping AI systems find your llms.txt
Discovery isn’t guaranteed, so add gentle hints across your site. Some AI crawlers look for common filenames at the root; others follow links or ecosystem directories mentioned by the spec.
Linking strategies (robots.txt hints, sitemap references, site footer)
- In robots.txt, add an informational comment and absolute URL:
# LLM sources: https://example.com/llms.txt
# Full context: https://example.com/llms-full.txt
- Link llms.txt from your docs footer or “For AI and developers” page.
- Optionally mention llms.txt in your About/Legal pages and submit your domain to community directories referenced by the official spec.
Checklist: publish, validate, govern, measure
- Define scope and owner; list 10–30 must‑include pages first.
- Draft /llms.txt with clean summaries; add updated dates and absolute URLs.
- (Optional) Create /llms-full.txt for 3–5 cornerstone docs.
- Validate structure, links, and summary length; add CI checks.
- Publish at the site root; add discovery hints and directory submissions.
- Review monthly; rotate stale entries out and update summaries.
- Run prompt tests in two LLMs and track answer quality.
Ongoing measurement and iteration
Measure proxy impact across channels:
- Compare AI assistant answers before/after updates; track source citation accuracy.
- Watch support deflection, doc search success, and time‑to‑answer.
- Monitor branded Q&A on the web; if off‑target sources appear, refine summaries to lift your canonical pages.
- Keep a standing backlog of candidates to add/remove, and revisit quarterly.
Resources and references
- Specification and ecosystem: llmstxt.org (format guidance, fixed path, community directories, and integrations including VitePress/Docusaurus/Drupal).
- Background and adoption: industry write‑ups referencing early adopters (e.g., Mintlify) and discussions by practitioners (e.g., on Medium and community forums).
- Tools: community CLIs (e.g., llms_txt2ctx), generators/validators linked from the spec’s ecosystem pages, and emerging CMS/framework plugins.
- Related standards: robots.txt (crawler access control), sitemap.xml (URL discovery); use them alongside llms.txt without conflating roles.
- Docs‑as‑code: integrate generation in your CI for frameworks like VitePress, Docusaurus, Sphinx, or nbdev to export clean .md mirrors.
Notes and cautions:
- LLM vendors vary in how they use llms.txt; treat it as strong guidance, not a legal control.
- For licensing and training usage, publish explicit policies and coordinate with vendors beyond llms.txt.