Content Signals: allow search but control AI training

What Content-Signal is, why search, ai-input, and ai-train are separate signals, a minimal example, right vs. wrong usage, common mistakes, and how to verify.

What it is

Content Signals are a machine-readable statement of how you prefer your content to be used: whether it may be indexed for search (search), used as context in AI answers (ai-input), and used to train models (ai-train). They are declared with a Content-Signal directive in robots.txt; the convention is still evolving, including as a header.

Why it matters for AI agents

robots.txt is a blunt “may/may not crawl” switch. Content Signals add nuanced consent: “index me for search and use me in answers, but don’t train models on me.” That lets you stay visible in AI results (generative engine optimization, GEO) without handing your content to training datasets. Compliance is voluntary: well-behaved AI operators respect the signals.
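
To make the per-use consent concrete, here is a minimal sketch of how a cooperative crawler might consult already-parsed signals before using a page. The purpose names and the is_permitted helper are hypothetical, chosen for illustration only. They are not part of any published API. A missing signal is returned as None, meaning the preference is unstated rather than granted or denied.

```python
from typing import Optional

# Hypothetical mapping from a crawler's purpose to the signal that governs it.
PURPOSE_TO_SIGNAL = {
    "search-index": "search",
    "answer-context": "ai-input",
    "model-training": "ai-train",
}


def is_permitted(signals: dict[str, bool], purpose: str) -> Optional[bool]:
    """Return True or False if the site stated a preference, None if it did not."""
    signal = PURPOSE_TO_SIGNAL[purpose]
    return signals.get(signal)


# Preferences matching "search and answers yes, training no".
prefs = {"search": True, "ai-input": True, "ai-train": False}

print(is_permitted(prefs, "answer-context"))             # True
print(is_permitted(prefs, "model-training"))             # False
print(is_permitted({"search": True}, "model-training"))  # None (unstated)
```

How a crawler treats None is its operator's policy decision; the sketch only keeps "unstated" distinct from "no".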

Minimal working example

In robots.txt:

User-agent: *
Allow: /

# Allow search and AI answers, disallow training
Content-Signal: search=yes, ai-input=yes, ai-train=no

Sitemap: https://example.com/sitemap.xml

Values are yes/no per signal: search, ai-input, ai-train.
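
As a rough illustration of how such a line can be read, the sketch below extracts the signal=value pairs from robots.txt text. The function name parse_content_signal is an assumption made for this example, and the parsing is deliberately simplified: it ignores User-agent grouping and merges every Content-Signal line it finds.

```python
def parse_content_signal(robots_txt: str) -> dict[str, str]:
    """Collect signal=value pairs from every Content-Signal line."""
    signals: dict[str, str] = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "content-signal":
            continue  # not a Content-Signal directive
        for pair in value.split(","):
            key, eq, val = pair.partition("=")
            if eq:
                signals[key.strip().lower()] = val.strip().lower()
    return signals


example = """\
User-agent: *
Allow: /

# Allow search and AI answers, disallow training
Content-Signal: search=yes, ai-input=yes, ai-train=no

Sitemap: https://example.com/sitemap.xml
"""

print(parse_content_signal(example))
# {'search': 'yes', 'ai-input': 'yes', 'ai-train': 'no'}
```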

Right vs wrong

Right                                  | Wrong
Explicit values per signal             | Just robots.txt, no usage nuance
Signals consistent with Allow/Disallow | Disallow: / + search=yes (conflict)
A deliberate ai-train choice           | A careless ai-train=no that cuts legitimate cases

Common mistakes

  • Conflict with robots rules: you disallowed crawling but set search=yes.
  • Typos in signal names (ai_train instead of ai-train).
  • Expecting hard protection: it’s a statement of intent, not access control, and bad-faith crawlers may ignore it. Robots rules only stop compliant crawlers from fetching; for a hard block, use a firewall.
  • No Content-Signal at all: without one, your preferences are unstated.

How to verify

An automated scan parses the Content-Signal line from robots.txt. To check manually:

curl -s https://example.com/robots.txt | grep -i 'content-signal'

Expect a Content-Signal: line with a deliberate yes/no value for each of search, ai-input, and ai-train.
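
If you would rather script the check, the following is a rough Python equivalent of the curl and grep pipeline, using only the standard library. The helper names are assumptions for this sketch. The demonstration runs on a literal string so that it works offline.

```python
from urllib.request import urlopen


def grep_content_signal(robots_txt: str) -> list[str]:
    """Case-insensitive line match, like grep -i 'content-signal'."""
    return [line for line in robots_txt.splitlines()
            if "content-signal" in line.lower()]


def fetch_robots(site: str) -> str:
    """Download /robots.txt from a site root such as https://example.com."""
    with urlopen(site.rstrip("/") + "/robots.txt", timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Offline demonstration. For a live check, call:
#   grep_content_signal(fetch_robots("https://example.com"))
sample = (
    "User-agent: *\n"
    "Allow: /\n"
    "Content-Signal: search=yes, ai-input=yes, ai-train=no\n"
)
print(grep_content_signal(sample))
# ['Content-Signal: search=yes, ai-input=yes, ai-train=no']
```

Like grep, this also matches comment lines that mention content-signal, so read the output rather than just counting lines.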

Sources