Content Signals: allow search but control AI training
What Content-Signal is, why separate search / ai-input / ai-train, a minimal example, right vs wrong, mistakes, and how to verify.
What it is
Content Signals are a machine-readable expression of your usage preferences
for your content: whether it may be indexed for search (search), used as context
in AI answers (ai-input), and used to train models (ai-train). They are declared
with a Content-Signal directive in robots.txt; the convention is still evolving,
including as an HTTP header.
Why it matters for AI agents
robots.txt is a blunt “may/may not crawl” switch. Content Signals add nuanced
consent: “index me for search and use me in answers, but don’t train models on
me.” That lets you stay visible in AI results (generative engine optimization, GEO)
without handing your content to training datasets. Well-behaved AI operators
respect these signals.
Minimal working example
In robots.txt:
User-agent: *
Allow: /
# Allow search and AI answers, disallow training
Content-Signal: search=yes, ai-input=yes, ai-train=no
Sitemap: https://example.com/sitemap.xml
Values are yes/no per signal: search, ai-input, ai-train.
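To make the syntax concrete, here is a minimal Python sketch of how a consumer might read such a line. The function name parse_content_signal is hypothetical, and the strict yes/no check is an assumption based on the values described above, not a reference implementation.

```python
def parse_content_signal(line: str) -> dict[str, bool]:
    """Turn one Content-Signal line into a {signal: allowed} mapping.

    Hypothetical helper for illustration; it assumes comma-separated
    name=value pairs whose values are 'yes' or 'no'.
    """
    field, _, value = line.partition(":")
    if field.strip().lower() != "content-signal":
        raise ValueError(f"not a Content-Signal line: {line!r}")

    signals: dict[str, bool] = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma
        name, sep, setting = pair.partition("=")
        setting = setting.strip().lower()
        if not sep or setting not in ("yes", "no"):
            raise ValueError(f"bad signal: {pair!r}")
        signals[name.strip().lower()] = setting == "yes"
    return signals


print(parse_content_signal("Content-Signal: search=yes, ai-input=yes, ai-train=no"))
# -> {'search': True, 'ai-input': True, 'ai-train': False}
```

The sketch rejects anything other than yes or no so that a malformed value fails loudly instead of being read as consent.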
Right vs wrong
| Right | Wrong |
|---|---|
| Explicit values per signal | Just robots.txt, no usage nuance |
| Signals consistent with Allow/Disallow | Disallow: / + search=yes (conflict) |
| A deliberate ai-train choice | A careless ai-train=no that cuts legitimate cases |
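The conflict row in the table can be checked mechanically. The following is a deliberately naive Python sketch: find_conflicts is a hypothetical name, and it assumes the whole file is a single rule group, ignoring per-User-agent grouping and Allow overrides.

```python
def find_conflicts(robots_txt: str) -> list[str]:
    """Flag signals set to 'yes' while the file also contains 'Disallow: /'.

    Naive illustration: treats the file as one group and only looks for a
    root-level Disallow.
    """
    blocks_everything = False
    enabled: list[str] = []

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if not sep:
            continue
        field, value = field.strip().lower(), value.strip()
        if field == "disallow" and value == "/":
            blocks_everything = True
        elif field == "content-signal":
            for pair in value.split(","):
                name, _, setting = pair.partition("=")
                if setting.strip().lower() == "yes":
                    enabled.append(name.strip().lower())

    if not blocks_everything:
        return []
    return [f"Disallow: / conflicts with {name}=yes" for name in enabled]


sample = "User-agent: *\nDisallow: /\nContent-Signal: search=yes, ai-train=no\n"
print(find_conflicts(sample))
# -> ['Disallow: / conflicts with search=yes']
```

A real checker would need to evaluate each User-agent group separately; this version only shows the shape of the check.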
Common mistakes
- Conflict with robots rules: you disallowed crawling but enabled search=yes.
- Typos in signal names (ai_train instead of ai-train).
- Expecting hard protection: it's a statement of intent, not access control, and bad-faith crawlers may ignore it. To actually block access, use a firewall or server-side rules; robots.txt rules are also only advisory.
- No Content-Signal at all: by default your preferences are unknown.
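Two of these mistakes, misspelled signal names and a missing directive, are easy to catch with a small lint pass. This Python sketch is illustrative only: lint_signals and the KNOWN_SIGNALS tuple are hypothetical names, and the list of valid signals is taken from the three named in this article.

```python
import difflib

# Signal names as listed in this article (assumed to be the complete set).
KNOWN_SIGNALS = ("search", "ai-input", "ai-train")


def lint_signals(robots_txt: str) -> list[str]:
    """Report unknown signal names and a missing Content-Signal line."""
    problems: list[str] = []
    seen_directive = False

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "content-signal":
            continue
        seen_directive = True
        for pair in value.split(","):
            name = pair.partition("=")[0].strip().lower()
            if name and name not in KNOWN_SIGNALS:
                guess = difflib.get_close_matches(name, KNOWN_SIGNALS, n=1)
                hint = f" (did you mean {guess[0]!r}?)" if guess else ""
                problems.append(f"unknown signal {name!r}{hint}")

    if not seen_directive:
        problems.append("no Content-Signal line found")
    return problems


print(lint_signals("Content-Signal: search=yes, ai_train=no"))
# -> ["unknown signal 'ai_train' (did you mean 'ai-train'?)"]
```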
How to verify
A scan parses Content-Signal from robots.txt. Manually:
curl -s https://example.com/robots.txt | grep -i 'content-signal'
Expect a Content-Signal: line with deliberate values for search, ai-input, and ai-train.
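If you would rather script the check than run curl by hand, a rough Python equivalent using only the standard library might look like this. Both function names are hypothetical. Unlike grep -i, which matches the text anywhere in a line, this sketch only matches lines that start with the directive.

```python
import urllib.request


def content_signal_lines(robots_txt: str) -> list[str]:
    """Return lines that start with 'Content-Signal:' (case-insensitive)."""
    return [
        line.strip()
        for line in robots_txt.splitlines()
        if line.strip().lower().startswith("content-signal:")
    ]


def fetch_robots(url: str, timeout: float = 10.0) -> str:
    """Download a robots.txt file and decode it as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Usage (performs a network request, so it is left commented out):
#   print(content_signal_lines(fetch_robots("https://example.com/robots.txt")))
```

An empty list means no directive was found, which corresponds to the "preferences unknown" case above.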