Not all AI crawlers are equal. Some fetch content for search answers, some for model training, and some for both. Legible gives you a single policy layer to control each one independently — enforced in robots.txt, X-Robots-Tag headers, Content-Signal, and content delivery.
Why per-bot control matters
The AI crawler landscape includes systems with fundamentally different purposes. GPTBot and OAI-SearchBot fetch content for OpenAI's products. ClaudeBot does the same for Anthropic. Google-Extended is specifically about AI training, separate from regular Google Search indexing. CCBot powers Common Crawl datasets used by many AI labs.
A blanket 'allow all' or 'block all' policy rarely matches what publishers actually want. Most prefer something like: allow AI systems to read and cite content for live answers, but restrict training use. Or: allow some AI search engines but not others.
Without per-bot control, the only options are crude: block everyone in `robots.txt`, or allow everyone and hope they respect your informal preferences.
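The blunt "block everyone" option looks like this in robots.txt (a generic illustration, not Legible output) — it shuts out every crawler, AI or otherwise, with no room for nuance:

```
User-agent: *
Disallow: /
```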
Crawlers Legible recognizes
Legible detects and can set policy for the following AI crawlers. This list is updated as new crawlers emerge:
- GPTBot — OpenAI's general-purpose crawler for training and content retrieval.
- OAI-SearchBot — OpenAI's search-focused crawler for live answer generation.
- ChatGPT-User — OpenAI's user-initiated browsing agent.
- ClaudeBot — Anthropic's crawler for content retrieval and training.
- Google-Extended — Google's AI-specific training crawler (separate from Googlebot).
- PerplexityBot — Perplexity AI's answer-engine crawler.
- CCBot — Common Crawl's crawler; its datasets are widely used by AI labs as training data.
- Amazonbot — Amazon's crawler for Alexa and AI services.
- Applebot-Extended — Apple's AI-focused crawler for Apple Intelligence.
- Bytespider — ByteDance/TikTok's AI crawler.
- Cohere-ai — Cohere's crawler for model training.
- FacebookBot / meta-externalagent — Meta's crawlers for AI training and content retrieval.
How the crawler policy works
Legible's crawler policy is a single configuration that fans out into every technical enforcement layer. You define your intent — Legible handles the implementation across robots.txt, HTTP headers, content delivery, and page-level directives.
The policy has three components: a default posture for all bots, individual bot rules that override the default, and path overrides for specific sections of your site.
```json
{
  "version": 1,
  "preset": "ai_no_train",
  "defaultPolicy": {
    "search": "allow",
    "aiInput": "allow",
    "aiTrain": "deny",
    "serveMarkdown": true
  },
  "botRules": [
    { "bot": "GPTBot", "action": "allow" },
    { "bot": "CCBot", "action": "deny" }
  ],
  "robotsTagDirectives": {
    "perBot": {
      "Google-Extended": "noindex"
    }
  }
}
```
Where policy is enforced
A single crawler policy is enforced at every layer of the AI content stack:
- robots.txt: Generated dynamically with per-bot User-agent blocks. Bots set to 'deny' get `Disallow: /`; bots set to 'allow' get `Allow: /`.
- Bot denial: Denied bots receive a `403 Access denied by site crawler policy` response before any content is served.
- X-Robots-Tag headers: Per-bot indexing directives are emitted as HTTP response headers on HTML passthrough responses (e.g., `X-Robots-Tag: GPTBot: noindex`).
- Content-Signal header: Every Markdown response includes a `Content-Signal` header declaring your permissions for search, AI input, and training.
- Content delivery: Bots set to 'deny' or content paths with `serveMarkdown: false` do not receive Markdown — they see the origin HTML response instead.
- Meta tag injection: Optionally, per-bot `<meta>` tags can be injected directly into the HTML `<head>` for compatibility with on-page SEO tools.
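To make the robots.txt layer concrete, here is a rough sketch of how per-bot allow/deny rules could fan out into User-agent blocks (the function and rule shapes are assumptions for illustration, not Legible's actual code):

```python
def render_robots_txt(bot_rules: dict[str, str]) -> str:
    """Render per-bot allow/deny rules as robots.txt User-agent blocks."""
    blocks = []
    for bot, action in bot_rules.items():
        directive = "Allow: /" if action == "allow" else "Disallow: /"
        blocks.append(f"User-agent: {bot}\n{directive}")
    return "\n\n".join(blocks) + "\n"
```

For example, `render_robots_txt({"GPTBot": "allow", "CCBot": "deny"})` yields a `User-agent: GPTBot` block with `Allow: /` and a `User-agent: CCBot` block with `Disallow: /`.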
Path overrides
Sometimes different sections of your site need different policies. Your blog might be fully open to AI, while legal pages or gated content should be restricted.
Path overrides let you change any policy dimension for a URL prefix. The most specific matching prefix wins.
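Longest-prefix resolution can be sketched like this (a trailing `/*` in a pattern is treated as a prefix match; the function name and rule shapes are illustrative, not Legible's internals):

```python
def resolve_policy(path: str, overrides: list[dict], default: dict) -> dict:
    """Merge the longest matching path override onto the default policy."""
    best = None
    best_len = -1
    for rule in overrides:
        prefix = rule["pattern"].rstrip("*")  # "/blog/*" -> "/blog/"
        if path.startswith(prefix) and len(prefix) > best_len:
            best, best_len = rule, len(prefix)
    merged = dict(default)
    if best:
        merged.update(best["policy"])
    return merged
```

Because the override is merged onto the default, a path rule only needs to list the dimensions it changes; everything else falls through to the default posture.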
```json
{
  "pathOverrides": [
    {
      "pattern": "/blog/*",
      "policy": { "aiTrain": "allow", "serveMarkdown": true }
    },
    {
      "pattern": "/legal/*",
      "policy": { "aiInput": "deny", "aiTrain": "deny", "serveMarkdown": false }
    }
  ]
}
```
The GEO Readiness audit and crawler controls
Legible's GEO Readiness audit checks your site from the perspective of each recognized AI crawler. The audit reports which crawlers are blocked by robots.txt, which are blocked by page-level meta tags, and which have full access.
The robots.txt check carries 10 points in the GEO score — the more crawlers you allow, the higher the score. The page-level policy check carries 4 points and penalizes pages with `noindex` directives that suppress AI visibility.
Together, these checks give you a complete picture of your AI crawler accessibility across every enforcement layer.
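As a rough illustration of how those two checks might contribute to a score — the 10-point and 4-point weights come from the text above, but the proportional scaling is an assumption, not Legible's published formula:

```python
def crawler_checks_score(allowed_bots: int, total_bots: int,
                         clean_pages: int, total_pages: int) -> float:
    """Hypothetical scoring: robots.txt check worth up to 10 points,
    page-level policy check worth up to 4, each scaled by pass rate."""
    robots_pts = 10 * (allowed_bots / total_bots)
    page_pts = 4 * (clean_pages / total_pages)
    return robots_pts + page_pts
```

Under this sketch, a site allowing all 13 recognized crawlers with no suppressive `noindex` directives would earn the full 14 points from these two checks.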
