Bot Access Control

Content Signals

Declaration of consent for AI training and search in robots.txt via the Content-Signal directive (contentsignals.org).

What are Content Signals?

Content Signals is an extension to the robots.txt standard that lets site owners explicitly declare their content-use policy for AI systems. The standard is developed by contentsignals.org.

The directive is added to the robots.txt file as:

Content-Signal: ai-train=yes, search=yes, ai-input=yes

Three keys:

  • ai-train — whether you allow your content to be used for training AI models (yes / no)
  • search — whether you allow AI search engines to index and cite your content (yes / no)
  • ai-input — whether you allow your content to be fed as input to AI agents when they execute tasks (yes / no)
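
As an illustration of the three keys, here is a minimal Python sketch that renders the directive line from three yes/no choices. The class name `ContentSignal` and the method `to_directive` are hypothetical names introduced for this example; they are not part of the Content Signals spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentSignal:
    """The three yes/no keys described above (hypothetical helper class)."""

    ai_train: bool
    search: bool
    ai_input: bool

    def to_directive(self) -> str:
        """Render the robots.txt line, using the key order shown in this article."""

        def yn(flag: bool) -> str:
            return "yes" if flag else "no"

        return (
            f"Content-Signal: ai-train={yn(self.ai_train)}, "
            f"search={yn(self.search)}, ai-input={yn(self.ai_input)}"
        )


print(ContentSignal(ai_train=False, search=True, ai_input=True).to_directive())
# prints: Content-Signal: ai-train=no, search=yes, ai-input=yes
```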

The absence of a Content-Signal means ambiguity — each AI provider interprets that in its own way.

Why does a site need Content Signals?

Before Content Signals, site owners had only one tool to control AI bots — blocking them via User-agent: GPTBot / Disallow: /. That is binary: allow everything or block everything.

Content Signals adds granularity:

  • You can set search=yes (allow citation in AI search) together with ai-train=no (forbid use for model training)
  • Media companies that sell training licenses set ai-train=no as an explicit declaration of their position
  • Open-source and educational content often sets ai-train=yes, search=yes, ai-input=yes as support for the AI ecosystem

For GEO (generative engine optimization): search=yes signals to AI search engines (Perplexity, ChatGPT Search, Google AI Overviews) that your content may be cited. Without this signal, AI systems may treat your content more cautiously.

Legal aspect: an explicit ai-train=no declaration in robots.txt is becoming one element of copyright protection in the context of AI training. Some jurisdictions treat such a declaration as a legally meaningful statement of non-consent.

How to add Content Signals?

Add the directive to the end of your robots.txt:

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Content-Signal: ai-train=yes, search=yes, ai-input=yes
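
If you generate robots.txt from a script, the step above could be automated roughly as follows. This is a sketch under assumptions: the function name `add_content_signal` is hypothetical, and it simply appends the line to the end of the file when no Content-Signal line is present yet.

```python
def add_content_signal(robots_txt: str, directive: str) -> str:
    """Append `directive` to robots.txt text unless a Content-Signal line exists."""
    for line in robots_txt.splitlines():
        if line.strip().lower().startswith("content-signal:"):
            return robots_txt  # already declared; leave the file unchanged
    body = robots_txt.rstrip("\n")
    separator = "\n\n" if body else ""  # blank line between old content and directive
    return f"{body}{separator}{directive}\n"


robots = "User-agent: *\nAllow: /\n\nSitemap: https://example.com/sitemap.xml\n"
print(add_content_signal(robots, "Content-Signal: ai-train=yes, search=yes, ai-input=yes"))
```

Calling the function a second time on its own output returns the text unchanged, so it is safe to run on every deploy.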

Recommended values by site type:

| Site type | Recommendation |
| --- | --- |
| Open content, documentation, education | ai-train=yes, search=yes, ai-input=yes |
| Commercial content, news, media | ai-train=no, search=yes, ai-input=yes |
| Closed/premium content | ai-train=no, search=no, ai-input=no |
| SaaS, product without editorial content | ai-train=yes, search=yes, ai-input=yes |
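
The recommendations above can also be kept as a small lookup in code. In this sketch the preset names (`open`, `commercial`, `premium`, `saas`) and the function `directive_for` are hypothetical; only the values come from the recommendations.

```python
# Hypothetical preset names; the values mirror the recommendations above.
PRESETS = {
    "open": "ai-train=yes, search=yes, ai-input=yes",
    "commercial": "ai-train=no, search=yes, ai-input=yes",
    "premium": "ai-train=no, search=no, ai-input=no",
    "saas": "ai-train=yes, search=yes, ai-input=yes",
}


def directive_for(site_type: str) -> str:
    """Return the recommended Content-Signal line for a preset name."""
    try:
        return f"Content-Signal: {PRESETS[site_type]}"
    except KeyError:
        raise ValueError(f"unknown site type: {site_type!r}") from None


print(directive_for("commercial"))
# prints: Content-Signal: ai-train=no, search=yes, ai-input=yes
```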

Important: the Content-Signal directive is at the file level, not inside a User-agent block. It is a global policy declaration for the site.

Different policies per section: the Content Signals spec also supports per-path directives in its extended syntax, but a single global declaration is sufficient for most use cases.

How do we check Content Signals?

Our scanner depends on the robots_txt check — we first retrieve the robots.txt file, then parse it for the Content-Signal directive.

Algorithm:

  1. Fetch robots.txt (result cached from the robots_txt check)
  2. Find the directive Content-Signal: (case-insensitive search)
  3. Parse values — split by comma, extract keys ai-train, search, ai-input

Status pass — directive found with at least one key. Status fail — directive absent from robots.txt.
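
The algorithm and status rule above can be sketched in Python as follows. This is an illustration, not the scanner's actual code: the names `parse_content_signal` and `check_status` are hypothetical, the robots.txt text is passed in directly instead of being read from the cached robots_txt result, only the first Content-Signal line is read, and a directive with no recognized key is treated as fail (the text above does not define that case).

```python
import re

KNOWN_KEYS = ("ai-train", "search", "ai-input")
# Case-insensitive match of the directive name at the start of a line.
DIRECTIVE_RE = re.compile(
    r"^[ \t]*content-signal[ \t]*:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE
)


def parse_content_signal(robots_txt: str) -> dict[str, str]:
    """Return the known keys found in the first Content-Signal line."""
    match = DIRECTIVE_RE.search(robots_txt)
    if match is None:
        return {}
    values = match.group(1).split("#", 1)[0]  # drop a trailing comment, if any
    signals: dict[str, str] = {}
    for part in values.split(","):
        key, sep, value = part.partition("=")
        key, value = key.strip().lower(), value.strip().lower()
        if sep and key in KNOWN_KEYS:
            signals[key] = value
    return signals


def check_status(robots_txt: str) -> str:
    """Return 'pass' if a directive with at least one known key exists, else 'fail'."""
    return "pass" if parse_content_signal(robots_txt) else "fail"


sample = "User-agent: *\nAllow: /\n\nContent-Signal: ai-train=no, search=yes, ai-input=yes\n"
print(check_status(sample))
# prints: pass
```

Note that `check_status` never looks at the values themselves, which matches the rule that the scanner records only the presence of a declaration.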

We deliberately do not penalize ai-train=no or other specific values — that is the site’s policy, not an error. The scanner only records the presence of an explicit declaration, not its content.

Sources and specifications