How to set up robots.txt for AI crawlers
Why robots.txt matters for AI agents, a minimal working example, right vs wrong, common mistakes, and how to verify.
What it is
robots.txt is a plain-text file at your site root (/robots.txt) that tells
search and AI crawlers which sections they may fetch. Its format is standardised
in RFC 9309. In the AI era it has a second job: to state explicitly how you treat
the major AI bots (GPTBot, ClaudeBot, PerplexityBot, and others) — allow or block.
Why it matters for AI agents
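If a Disallow rule covers an AI crawler, a compliant crawler skips the blocked
paths. When robots.txt is missing or broken, behaviour varies: RFC 9309 lets a
crawler treat a 4xx response as “no restrictions” and requires it to treat a 5xx
response as a full disallow, and some bots read ambiguous rules conservatively.
Pages that are never crawled are unlikely to appear in ChatGPT Search,
Perplexity, Google AI Overviews, or YandexGPT answers. An explicit Allow for AI
bots states your intent unambiguously and is a sensible first step toward
GEO/AEO visibility.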
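To make the status-code handling concrete, here is a minimal sketch of how RFC 9309 maps the HTTP status of /robots.txt to a crawl policy. The function name and the policy labels are illustrative assumptions, not part of any crawler's API, and real AI bots may be stricter than the RFC requires.

```python
def robots_policy(status: int) -> str:
    """Map the HTTP status of /robots.txt to the policy RFC 9309 describes.

    The returned labels are made up for this sketch.
    """
    if 200 <= status < 300:
        return "parse-rules"      # file fetched: obey the rules it contains
    if 300 <= status < 400:
        return "follow-redirect"  # the RFC suggests following a few redirects
    if 400 <= status < 500:
        return "allow-all"        # "unavailable": crawler MAY fetch anything
    if 500 <= status < 600:
        return "disallow-all"     # "unreachable": crawler MUST assume full disallow
    return "unknown"


if __name__ == "__main__":
    for code in (200, 301, 404, 503):
        print(code, robots_policy(code))
```

The practical takeaway is that a server error on /robots.txt is the costliest failure, because a compliant crawler then treats the whole site as blocked.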
Minimal working example
# General rule
User-agent: *
Allow: /
# Explicitly allow the major AI crawlers
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
Sitemap: https://example.com/sitemap.xml
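One way to sanity-check a file like the one above before deploying it is to parse it locally with Python's standard urllib.robotparser. This is a sketch only: the helper name summarize and the AI_BOTS list are assumptions for illustration, and the standard-library parser approximates, but does not exactly reproduce, how each vendor's crawler interprets rules.

```python
from urllib.robotparser import RobotFileParser

# The example file from this article, embedded so the check runs offline.
EXAMPLE = """\
# General rule
User-agent: *
Allow: /
# Explicitly allow the major AI crawlers
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

# Bot names taken from the example above; extend as needed.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def summarize(robots_txt: str, bots=AI_BOTS, url: str = "https://example.com/"):
    """Report which bots may fetch `url` and which sitemaps are declared."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "allowed": {bot: parser.can_fetch(bot, url) for bot in bots},
        "sitemaps": parser.site_maps() or [],  # site_maps() needs Python 3.8+
    }


if __name__ == "__main__":
    print(summarize(EXAMPLE))
```

Any bot that shows up as not allowed, or an empty sitemap list, points to a rule worth re-reading before the file goes live.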
Right vs wrong
| Right | Wrong |
|---|---|
| Served at /robots.txt with 200 and Content-Type: text/plain | Returns 404, an HTML page, or a redirect |
| Explicit Allow: / for the AI bots you want | A blanket Disallow: / under User-agent: * that also cuts off AI |
| A Sitemap: directive with an absolute URL | No sitemap, so the crawler has to infer site structure from links alone |
Common mistakes
- An empty Disallow: means “allow everything”; it is easily confused with Disallow: /, which blocks everything.
- A User-agent: * block with Disallow: / that accidentally blocks AI bots.
- robots.txt served as HTML (an SPA returning index.html for every path), which a crawler cannot parse as a robots file.
- A relative path in the Sitemap: directive; it must be an absolute URL.
- Misspelled bot names: it’s GPTBot, not GTPBot. RFC 9309 makes matching case-insensitive, so GptBot should still match, but copying the vendor’s exact spelling is safest.
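Several of these mistakes can be reproduced offline. The sketch below uses Python's standard urllib.robotparser to contrast an empty Disallow: with Disallow: /, shows a misspelled bot name falling through to the general rule, and checks whether a Sitemap value is absolute. The helper names and sample rule strings are assumptions made for illustration, and the standard-library parser may differ in detail from a given vendor's crawler.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, bot: str, url: str = "https://example.com/page") -> bool:
    """Return True if `bot` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


def is_absolute_url(value: str) -> bool:
    """A Sitemap value should carry both a scheme and a host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical rule sets, one per mistake.
EMPTY_DISALLOW = "User-agent: *\nDisallow:\n"   # allows everything
BLOCK_ALL = "User-agent: *\nDisallow: /\n"      # blocks everything
MISSPELLED = (
    "User-agent: GTPBot\n"  # typo: this group never matches GPTBot
    "Allow: /\n"
    "\n"
    "User-agent: *\n"
    "Disallow: /\n"
)

if __name__ == "__main__":
    print("empty Disallow allows GPTBot:", can_fetch(EMPTY_DISALLOW, "GPTBot"))
    print("Disallow: / allows GPTBot:", can_fetch(BLOCK_ALL, "GPTBot"))
    print("misspelled group allows GPTBot:", can_fetch(MISSPELLED, "GPTBot"))
    print("relative sitemap is absolute:", is_absolute_url("/sitemap.xml"))
```

In the misspelled case the Allow group is simply ignored for GPTBot, so the bot falls back to the User-agent: * group and is blocked.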
How to verify
Run a free scan of your site — the robots.txt check is part of the suite and
shows the status, the rules found, and any missing AI bots. Manually:
curl -sI https://example.com/robots.txt # expect 200 + text/plain
curl -s https://example.com/robots.txt # inspect the contents
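The same manual check can be scripted. This is a minimal sketch using only the Python standard library; the function names, the User-Agent string, and the HTML heuristic are assumptions for illustration, and it is not the scanner mentioned above.

```python
import urllib.error
import urllib.request


def evaluate(status: int, content_type: str, body: str) -> list:
    """Return a list of problems found in a /robots.txt response."""
    problems = []
    if status != 200:
        problems.append(f"expected status 200, got {status}")
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"expected text/plain, got {content_type or 'no Content-Type'}")
    head = body.lstrip().lower()
    if head.startswith("<!doctype") or head.startswith("<html"):
        problems.append("body looks like HTML, not a robots file")
    return problems


def fetch_and_evaluate(site: str) -> list:
    """Fetch <site>/robots.txt and evaluate the response (needs network access)."""
    url = site.rstrip("/") + "/robots.txt"
    request = urllib.request.Request(url, headers={"User-Agent": "robots-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
            return evaluate(response.status, response.headers.get("Content-Type", ""), body)
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; report the status instead of crashing.
        return evaluate(err.code, err.headers.get("Content-Type", ""), "")


if __name__ == "__main__":
    print(fetch_and_evaluate("https://example.com") or "robots.txt looks fine")
```

Note that urlopen follows redirects silently, so a redirected robots.txt will not be flagged here; the curl -sI call above remains the quickest way to spot one.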