Discoverability
robots.txt
Instruction file for search and AI crawlers: which paths they may or may not crawl (RFC 9309).
What is robots.txt?
robots.txt is a text file at /robots.txt containing directives for search bots and AI crawlers: which pages they are allowed or disallowed to visit.
The Robots Exclusion Protocol has existed as a de facto convention since 1994 and was formalized in RFC 9309 (2022). All major search engines and most AI bots support it, although compliance is voluntary.
Why does a site need robots.txt?
Without robots.txt, AI bots (GPTBot, ClaudeBot, PerplexityBot) have no explicit crawl rules and typically treat the whole site as open. The file solves four problems:
- Explicitly allow AI bots to crawl public content
- Block crawling of sensitive paths (/admin/, /api/private/)
- Point to the sitemap via the Sitemap: directive
- Add a Content-Signal to declare consent for AI use
A correct robots.txt for GEO:
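User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/private/

# A bot that matches its own group ignores the * group (RFC 9309),
# so the Disallow rules are repeated for the named AI bots.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Allow: /
Disallow: /admin/
Disallow: /api/private/

Sitemap: https://example.com/sitemap.xml
Content-Signal: ai-train=yes, search=yes, ai-input=yes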
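To see why the order and grouping of User-agent blocks matters, the following is a minimal sketch of how a crawler might select rules under RFC 9309: a bot that finds a group with its own name uses only that group, and falls back to the * group otherwise. The function names (parseGroups, rulesFor, isAllowed) are hypothetical, agent matching is simplified to an exact case-insensitive comparison, and wildcards in paths are not handled.

```typescript
// Simplified illustration of RFC 9309 group selection; not a complete parser.
type Rule = { type: "allow" | "disallow"; path: string };
type Group = { agents: string[]; rules: Rule[] };

function parseGroups(text: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.split("#")[0].trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (key === "user-agent") {
      // Consecutive User-agent lines share one group.
      if (!current || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((key === "allow" || key === "disallow") && current) {
      current.rules.push({ type: key, path: value });
      lastWasAgent = false;
    }
    // Other records (Sitemap, Content-Signal) are ignored in this sketch.
  }
  return groups;
}

// Rules of the groups naming this agent, or of the * group if none do.
function rulesFor(groups: Group[], agent: string): Rule[] {
  const name = agent.toLowerCase();
  const named = groups.filter((g) => g.agents.indexOf(name) !== -1);
  const chosen = named.length > 0 ? named : groups.filter((g) => g.agents.indexOf("*") !== -1);
  return chosen.reduce<Rule[]>((all, g) => all.concat(g.rules), []);
}

// The longest matching path wins; on a tie, allow wins.
function isAllowed(rules: Rule[], path: string): boolean {
  let best: Rule | null = null;
  for (const rule of rules) {
    if (rule.path === "" || path.indexOf(rule.path) !== 0) continue;
    const longer = best === null || rule.path.length > best.path.length;
    const tieAllow = best !== null && rule.path.length === best.path.length && rule.type === "allow";
    if (longer || tieAllow) best = rule;
  }
  return best === null || best.type === "allow";
}

const example = ["User-agent: *", "Disallow: /admin/", "", "User-agent: GPTBot", "Allow: /"].join("\n");
const groups = parseGroups(example);
console.log(isAllowed(rulesFor(groups, "GPTBot"), "/admin/users"));   // true: the GPTBot group has no Disallow
console.log(isAllowed(rulesFor(groups, "OtherBot"), "/admin/users")); // false: falls back to the * group
```

In this sketch a Disallow placed only in the * group does not reach a bot that has its own group, which is why paths meant to be blocked for everyone need to appear in every group.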
How to configure robots.txt?
Create a public/robots.txt file (for static sites) or a /robots.txt endpoint (for dynamic sites).
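For the dynamic case, a framework-agnostic sketch might look like the following. It only builds the response; wiring it to a real router is left out. The names buildRobotsTxt, handleRobotsRequest, and RobotsResponse are hypothetical, and the blocked paths and bot list are taken from the example above as assumptions about a typical site.

```typescript
// Hypothetical response shape; adapt it to whatever your framework expects.
type RobotsResponse = { status: number; headers: Record<string, string>; body: string };

// Assumed sensitive paths, mirroring the example robots.txt.
const BLOCKED_PATHS = ["/admin/", "/api/private/"];

function buildRobotsTxt(siteUrl: string, aiBots: string[]): string {
  const groupBody = ["Allow: /", ...BLOCKED_PATHS.map((p) => `Disallow: ${p}`)];
  const lines: string[] = ["User-agent: *", ...groupBody, ""];
  for (const bot of aiBots) {
    // Each named group repeats the Disallow rules, because a bot that
    // matches its own group does not read the * group.
    lines.push(`User-agent: ${bot}`, ...groupBody, "");
  }
  lines.push(`Sitemap: ${siteUrl}/sitemap.xml`);
  return lines.join("\n") + "\n";
}

// What a GET /robots.txt handler would return: 200 with a plain-text body.
function handleRobotsRequest(siteUrl: string): RobotsResponse {
  return {
    status: 200,
    headers: { "Content-Type": "text/plain; charset=utf-8" },
    body: buildRobotsTxt(siteUrl, ["GPTBot", "ClaudeBot", "PerplexityBot"]),
  };
}

console.log(handleRobotsRequest("https://example.com").body);
```

Generating the file from one list of paths keeps the wildcard and named groups in sync when a new sensitive path is added.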
WordPress: the Yoast SEO and Rank Math plugins generate robots.txt automatically. Add AI-bot sections manually via the plugin's file editor.
Next.js: create app/robots.ts or public/robots.txt:
// app/robots.ts
import type { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: '*', allow: '/' },
      { userAgent: 'GPTBot', allow: '/' },
      { userAgent: 'ClaudeBot', allow: '/' },
    ],
    sitemap: 'https://example.com/sitemap.xml',
  };
}
Astro: place the file at public/robots.txt; it is served as a static file.
How do we check robots.txt?
The scanner performs GET /robots.txt and checks sequentially:
- HTTP 200 (the file exists and is accessible)
- Content-Type: text/plain (served as text, not HTML)
- Non-empty content (the file is not empty)
- Presence of User-agent: directives (at least one block, per RFC 9309)
- Format validity (no structural errors)
Graded result: 1.0 if a Sitemap: directive or at least one non-wildcard User-agent block is present; 0.5 if the file contains only User-agent: * and no Sitemap: directive. The status is fail on a non-200 HTTP response, an empty file, or missing directives.
The file is cached and passed to dependent checks: AI bot rules, Content Signals, Sitemap.
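The scoring rules described above can be approximated with a short function. This is a sketch under stated assumptions, not the scanner's actual code: the name scoreRobotsTxt, the "pass" status label, and the score of 0 on failure are hypothetical, and the Content-Type and format-validity checks are left out because the text does not say how they affect the result.

```typescript
// Hypothetical result shape; only "fail", 1.0 and 0.5 come from the text.
type CheckResult = { status: "pass" | "fail"; score: number };

function scoreRobotsTxt(httpStatus: number, body: string): CheckResult {
  const failed: CheckResult = { status: "fail", score: 0 }; // score 0 is an assumption
  // Fail on a non-200 response or an empty file.
  if (httpStatus !== 200 || body.trim() === "") return failed;

  const agents: string[] = [];
  let hasSitemap = false;
  for (const raw of body.split(/\r?\n/)) {
    const line = raw.split("#")[0].trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (key === "user-agent" && value !== "") agents.push(value);
    if (key === "sitemap" && value !== "") hasSitemap = true;
  }

  // Fail when no User-agent directive is present.
  if (agents.length === 0) return failed;

  // 1.0 with a Sitemap or any named (non-wildcard) agent, otherwise 0.5.
  const hasNamedAgent = agents.some((a) => a !== "*");
  return { status: "pass", score: hasSitemap || hasNamedAgent ? 1.0 : 0.5 };
}

console.log(scoreRobotsTxt(200, "User-agent: *\nAllow: /"));                 // { status: "pass", score: 0.5 }
console.log(scoreRobotsTxt(200, "User-agent: GPTBot\nAllow: /"));            // { status: "pass", score: 1 }
console.log(scoreRobotsTxt(404, "<html>Not found</html>"));                  // { status: "fail", score: 0 }
```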