
How to set up robots.txt for AI crawlers

Why robots.txt matters for AI agents, a minimal working example, right vs wrong, common mistakes, and how to verify.


What it is

robots.txt is a plain-text file at your site root (/robots.txt) that tells search and AI crawlers which sections they may fetch. Its format is standardised in RFC 9309. In the AI era it has a second job: to state explicitly how you treat the major AI bots (GPTBot, ClaudeBot, PerplexityBot, and others) — allow or block.

Why it matters for AI agents

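If robots.txt blocks an AI crawler, the crawler skips the disallowed paths. If the file is unreachable because of a server error (5xx), RFC 9309 tells crawlers to assume everything is disallowed. A missing file (404) is generally treated as “no restrictions”, but it leaves your policy implicit. No crawl means no chance of appearing in ChatGPT Search, Perplexity, Google AI Overviews, or YandexGPT answers. An explicit Allow for AI bots removes the ambiguity and is your entry ticket into GEO/AEO results.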

Minimal working example

# General rule
User-agent: *
Allow: /

# Explicitly allow the major AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Google-Extended is a control token, not a separate crawler
User-agent: Google-Extended
Allow: /

Sitemap: https://example.com/sitemap.xml
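To check that rules like these behave as intended, one option is Python's standard-library urllib.robotparser. The sketch below parses the example above and asks whether each bot may fetch a page. The helper name parse_rules, the bot list, and the test URL are illustrative assumptions, and a real crawler may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The minimal example from this section, inlined so the check runs offline.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

# Hypothetical list of tokens to check; extend it with the bots you care about.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def parse_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = parse_rules(ROBOTS_TXT)

# Map each bot token to whether it may fetch an arbitrary page on the site.
verdicts = {bot: parser.can_fetch(bot, "https://example.com/docs/page") for bot in AI_BOTS}

# site_maps() (Python 3.8+) returns the Sitemap: URLs, or None if there are none.
sitemaps = parser.site_maps()

for bot, allowed in verdicts.items():
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
print("Sitemaps:", sitemaps)
```

Swapping ROBOTS_TXT for the contents of your own file gives a quick local check before you deploy.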

Right vs wrong

Right | Wrong
Served at /robots.txt with 200 and Content-Type: text/plain | Returns 404, an HTML page, or a redirect
Explicit Allow: / for the AI bots you want | A blanket Disallow: / under User-agent: * that also cuts off AI
A Sitemap: directive with an absolute URL | No sitemap, so the crawler can’t discover structure
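How a crawler should react to the HTTP status of the robots.txt fetch is described in RFC 9309. The sketch below summarises that guidance as a small lookup. The function name and the outcome labels are hypothetical, and individual crawlers may be stricter or more lenient than the RFC.

```python
def robots_fetch_outcome(status: int) -> str:
    """Summarise RFC 9309 guidance for the status of a robots.txt fetch.

    The returned labels are illustrative, not part of any standard.
    """
    if 200 <= status < 300:
        return "parse"            # success: parse the file and follow its rules
    if 300 <= status < 400:
        return "follow-redirect"  # crawlers should follow at least five hops
    if 400 <= status < 500:
        return "allow-all"        # "unavailable": crawler may access any resource
    if 500 <= status < 600:
        return "disallow-all"     # "unreachable": crawler must assume full disallow
    return "undefined"


for code in (200, 301, 404, 503):
    print(code, robots_fetch_outcome(code))
```

The asymmetry is the practical point: a server error on /robots.txt can block crawling of the whole site, while a 404 merely leaves your policy unstated.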

Common mistakes

  • Confusing an empty Disallow: (which allows everything) with Disallow: / (which blocks everything).
  • A User-agent: * block with Disallow: / that accidentally blocks AI bots.
  • robots.txt served as HTML (an SPA returning index.html for every path) — not a valid robots file to a crawler.
  • A relative path in Sitemap: — it must be an absolute URL.
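  • Misspelled bot names, such as GTPBot instead of GPTBot. RFC 9309 matches user-agent tokens case-insensitively, so the spelling matters, not the capitalisation.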
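Several of these mistakes can be reproduced locally. The following is a minimal sketch using Python's urllib.robotparser; the helper may_fetch and the sample rule strings are assumptions made for illustration, and this parser is only one implementation, so treat its verdicts as indicative rather than as what every crawler does.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, agent: str, path: str = "/page") -> bool:
    """Return whether `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# An empty Disallow: value allows everything.
EMPTY_DISALLOW = "User-agent: *\nDisallow:\n"

# Disallow: / blocks everything, including AI bots that have no group of their own.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# A misspelled token (GTPBot) never matches GPTBot, so the bot falls back to "*".
MISSPELLED = "User-agent: *\nDisallow: /\n\nUser-agent: GTPBot\nAllow: /\n"

# Token matching is case-insensitive, so a lowercase token still matches GPTBot.
LOWERCASE = "User-agent: *\nDisallow: /\n\nUser-agent: gptbot\nAllow: /\n"

print("empty Disallow:", may_fetch(EMPTY_DISALLOW, "GPTBot"))
print("Disallow: /   :", may_fetch(BLOCK_ALL, "GPTBot"))
print("misspelled    :", may_fetch(MISSPELLED, "GPTBot"))
print("lowercase     :", may_fetch(LOWERCASE, "GPTBot"))
```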

How to verify

Run a free scan of your site — the robots.txt check is part of the suite and shows the status, the rules found, and any missing AI bots. Manually:

curl -sI https://example.com/robots.txt   # expect 200 + text/plain
curl -s  https://example.com/robots.txt   # inspect the contents
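The manual checks can also be scripted. Below is a rough sketch of a lint function that takes the status code, the Content-Type header, and the body you obtained from the curl calls and reports the problems covered in this guide. The function name lint_robots_response and its messages are hypothetical, and the HTML detection is a simple heuristic.

```python
from urllib.parse import urlparse


def lint_robots_response(status: int, content_type: str, body: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    if status != 200:
        problems.append(f"expected status 200, got {status}")

    if not content_type.lower().startswith("text/plain"):
        problems.append(f"expected Content-Type text/plain, got {content_type!r}")

    # Heuristic: an SPA fallback usually returns an HTML document.
    if body.lstrip().lower().startswith(("<!doctype", "<html")):
        problems.append("body looks like HTML, not a robots.txt file")

    sitemap_seen = False
    for raw_line in body.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            sitemap_seen = True
            parsed = urlparse(value.strip())
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                problems.append(f"Sitemap is not an absolute URL: {value.strip()}")

    if not sitemap_seen:
        problems.append("no Sitemap: directive found")

    return problems


# Example inputs standing in for real curl output.
good = lint_robots_response(
    200,
    "text/plain; charset=utf-8",
    "User-agent: *\nAllow: /\nSitemap: https://example.com/sitemap.xml\n",
)
spa_fallback = lint_robots_response(200, "text/html", "<!doctype html><html></html>")
relative = lint_robots_response(
    200, "text/plain", "User-agent: *\nAllow: /\nSitemap: /sitemap.xml\n"
)

for name, result in (("good", good), ("spa_fallback", spa_fallback), ("relative", relative)):
    print(name, result)
```

The function is kept free of network calls on purpose, so it can run in CI against a saved copy of the response.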
