Discoverability

robots.txt

Instruction file for search and AI crawlers: which paths they may or may not crawl (RFC 9309).

What is robots.txt?

robots.txt is a plain-text file served at /robots.txt that contains directives for search bots and AI crawlers: which paths they are allowed or disallowed to crawl.

The format dates back to 1994 and was standardized as RFC 9309 in 2022. Major search engines and most AI crawlers honor it, but compliance is voluntary: the file is a request to crawlers, not an access control.

Why does a site need robots.txt?

Without robots.txt, crawlers such as GPTBot, ClaudeBot, and PerplexityBot treat the whole site as open to crawling and get no explicit signal about your preferences. The file solves four problems:

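Explicitly allow AI bots to crawl public content
Keep crawlers out of sensitive paths (/admin/, /api/private/); protect those paths with authentication as well, since robots.txt is not access control
Point to the sitemap via the Sitemap: directive
Add a Content-Signal line to declare consent for AI use (a separate convention, not part of RFC 9309)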

A correct robots.txt for GEO:

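# Default group for crawlers that have no group of their own
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/private/

# A crawler that matches a named group ignores the * group,
# so the Disallow lines are repeated for each AI bot
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /api/private/

User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /api/private/

User-agent: PerplexityBot
Allow: /
Disallow: /admin/
Disallow: /api/private/

Sitemap: https://example.com/sitemap.xml

Content-Signal: ai-train=yes, search=yes, ai-input=yes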
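Under RFC 9309 a crawler obeys only the groups that name it and falls back to the * group when none does. Among matching rules the longest path wins, and Allow wins a tie. The sketch below illustrates that selection logic on a hypothetical sample file. It is a simplified illustration, not a complete parser: the names parseGroups and isAllowed are made up for this example, and wildcard patterns in paths are not handled.

```typescript
// Simplified sketch of RFC 9309 group selection (no * or $ path patterns).
interface Rule { allow: boolean; path: string }
interface Group { agents: string[]; rules: Rule[] }

function parseGroups(text: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep < 0) continue; // blank or unrecognised line
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (key === 'user-agent') {
      // Consecutive User-agent lines share one group.
      if (!current || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((key === 'allow' || key === 'disallow') && current) {
      // An empty value (e.g. "Disallow:") restricts nothing, so it is skipped.
      if (value !== '') current.rules.push({ allow: key === 'allow', path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

function isAllowed(groups: Group[], userAgent: string, path: string): boolean {
  const ua = userAgent.toLowerCase();
  // Groups naming the crawler take precedence; '*' is only a fallback.
  let matched = groups.filter((g) => g.agents.includes(ua));
  if (matched.length === 0) matched = groups.filter((g) => g.agents.includes('*'));
  let best: Rule | null = null;
  for (const group of matched) {
    for (const rule of group.rules) {
      if (!path.startsWith(rule.path)) continue;
      const longer = !best || rule.path.length > best.path.length;
      const tieAllow = !!best && rule.path.length === best.path.length && rule.allow;
      if (longer || tieAllow) best = rule;
    }
  }
  return best ? best.allow : true; // no matching rule means allowed
}

// Hypothetical sample: the GPTBot group does not repeat the Disallow line.
const demoGroups = parseGroups([
  'User-agent: *',
  'Allow: /',
  'Disallow: /admin/',
  '',
  'User-agent: GPTBot',
  'Allow: /',
].join('\n'));

console.log(isAllowed(demoGroups, 'SomeBot', '/admin/users')); // false
console.log(isAllowed(demoGroups, 'GPTBot', '/admin/users'));  // true
```

The second result is the reason to repeat Disallow lines inside every named group: a bot with its own group never reads the rules under User-agent: *.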

How to configure robots.txt?

Create a public/robots.txt file (for static sites) or a /robots.txt endpoint (for dynamic sites).
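For a dynamic site, the endpoint can build the response body from configuration. The following is a minimal, framework-agnostic sketch: the RobotsConfig shape and the functions renderRobotsTxt and handleRobotsRequest are hypothetical names chosen for illustration, and the returned object would still need to be wired into whatever server or framework handles the /robots.txt route.

```typescript
// Hypothetical config shape for a generated robots.txt.
interface RobotsGroup {
  userAgents: string[];
  allow?: string[];
  disallow?: string[];
}

interface RobotsConfig {
  groups: RobotsGroup[];
  sitemaps?: string[];
}

// Render one block per group, separated by blank lines, then Sitemap lines.
function renderRobotsTxt(config: RobotsConfig): string {
  const blocks = config.groups.map((group) =>
    [
      ...group.userAgents.map((ua) => `User-agent: ${ua}`),
      ...(group.allow ?? []).map((path) => `Allow: ${path}`),
      ...(group.disallow ?? []).map((path) => `Disallow: ${path}`),
    ].join('\n'),
  );
  const sitemapLines = (config.sitemaps ?? []).map((url) => `Sitemap: ${url}`);
  if (sitemapLines.length > 0) blocks.push(sitemapLines.join('\n'));
  return blocks.join('\n\n') + '\n';
}

// Describes what the /robots.txt endpoint should send back.
function handleRobotsRequest(config: RobotsConfig) {
  return {
    status: 200,
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
    body: renderRobotsTxt(config),
  };
}

const demoResponse = handleRobotsRequest({
  groups: [
    { userAgents: ['*'], allow: ['/'], disallow: ['/admin/', '/api/private/'] },
  ],
  sitemaps: ['https://example.com/sitemap.xml'],
});
console.log(demoResponse.body);
```

Serving the body with a text/plain content type matters here because an HTML error page or template returned at /robots.txt would not be read as a robots file.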

WordPress: WordPress serves a virtual robots.txt by default, and the Yoast SEO and Rank Math plugins include an editor for it. Add AI-bot sections manually via that editor.

Next.js: create app/robots.ts or public/robots.txt:

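// app/robots.ts
import type { MetadataRoute } from 'next';

// A bot that matches a named group ignores the '*' group,
// so the same disallow list is applied to every rule.
const disallow = ['/admin/', '/api/private/'];

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: '*', allow: '/', disallow },
      { userAgent: ['GPTBot', 'ClaudeBot', 'PerplexityBot'], allow: '/', disallow },
    ],
    sitemap: 'https://example.com/sitemap.xml',
  };
}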

Astro: place the file at public/robots.txt; it is served as a static file.

How do we check robots.txt?

The scanner performs GET /robots.txt and checks sequentially:

HTTP 200 — the file exists and is accessible
Content-Type: text/plain — served as text, not HTML
Non-empty content — the file is not empty
Presence of User-agent: directives — at least one block (RFC 9309)
Format validity — no structural errors

Graded result: the score is 1.0 if the file contains a Sitemap: directive or at least one non-wildcard User-agent block, and 0.5 if it contains only User-agent: * and no Sitemap. The check fails on a non-200 HTTP status, an empty file, or missing User-agent directives.
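The sequence and scoring described above can be sketched as a pure function over an already fetched response. This is an illustration under stated assumptions, not the scanner's actual code. The name checkRobotsTxt, the result shape, the score of 0 on failure, treating a wrong Content-Type as a failure, and the "field: value" validity rule are all assumptions.

```typescript
type CheckResult = { status: 'pass' | 'fail'; score: number; reason: string };

function checkRobotsTxt(
  httpStatus: number,
  contentType: string | null,
  body: string,
): CheckResult {
  // Assumption: a failed check scores 0.
  const fail = (reason: string): CheckResult => ({ status: 'fail', score: 0, reason });

  if (httpStatus !== 200) return fail(`HTTP ${httpStatus}`);
  // Assumption: a non-text Content-Type is treated as a failure.
  if (!contentType || !contentType.toLowerCase().startsWith('text/plain')) {
    return fail('not served as text/plain');
  }
  if (body.trim() === '') return fail('empty file');

  // Strip comments and blank lines, keeping only directive records.
  const records = body
    .split(/\r?\n/)
    .map((line) => line.replace(/#.*$/, '').trim())
    .filter((line) => line !== '');

  // Assumed validity rule: every record looks like "field: value".
  if (records.some((line) => !/^[A-Za-z-]+\s*:/.test(line))) {
    return fail('malformed line');
  }

  const agents = records
    .filter((line) => /^user-agent\s*:/i.test(line))
    .map((line) => line.slice(line.indexOf(':') + 1).trim());
  if (agents.length === 0) return fail('no User-agent directive');

  const hasSitemap = records.some((line) => /^sitemap\s*:/i.test(line));
  const hasNamedAgent = agents.some((agent) => agent !== '*');
  const score = hasSitemap || hasNamedAgent ? 1.0 : 0.5;
  return { status: 'pass', score, reason: 'ok' };
}

console.log(checkRobotsTxt(200, 'text/plain', 'User-agent: *\nAllow: /').score); // 0.5
console.log(checkRobotsTxt(200, 'text/plain', 'User-agent: GPTBot\nAllow: /').score); // 1
```

Keeping the function free of network calls makes each step easy to exercise with fixed inputs; the GET /robots.txt request itself would happen before this function is called.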

The file is cached and passed to dependent checks: AI bot rules, Content Signals, Sitemap.
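One way to share a single fetch between dependent checks is to cache the pending request per origin. The sketch below is hypothetical: createRobotsCache and the injected loader are illustrative names, and how the scanner actually caches the file is not specified here.

```typescript
// A loader fetches the file body for a URL; injected so it can be swapped out.
type Loader = (url: string) => Promise<string>;

function createRobotsCache(load: Loader) {
  const cache = new Map<string, Promise<string>>();
  return (origin: string): Promise<string> => {
    let pending = cache.get(origin);
    if (!pending) {
      // Store the promise itself so concurrent callers share one request.
      pending = load(`${origin}/robots.txt`);
      cache.set(origin, pending);
    }
    return pending;
  };
}

// Usage: dependent checks call the same getter and trigger only one load.
let demoLoads = 0;
const getRobotsTxt = createRobotsCache((_url) => {
  demoLoads += 1;
  return Promise.resolve('User-agent: *\nAllow: /\n');
});
const forBotRules = getRobotsTxt('https://example.com');
const forSitemap = getRobotsTxt('https://example.com');
console.log(forBotRules === forSitemap, demoLoads); // true 1
```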