Legible sits at the intersection of technical SEO, content operations, and AI retrieval. That means customers often need a plain-English reference that explains what each file, tag, and header actually does.
This guide is designed as both a GEO onboarding glossary and a practical implementation reference for customer teams.
robots.txt
`robots.txt` is the long-standing crawl policy file at the root of a site. It tells crawlers which paths they may or may not request.
For Legible customers, it is still relevant because many AI crawlers start with the same crawl hygiene as search bots. It does not replace Legible's AI-specific signals, but it remains part of the baseline crawl contract.
Legible does not ask customers to reinvent `robots.txt`. Instead, the product layers AI-specific discovery and delivery on top of the site's existing crawl posture, so teams can keep traditional search controls while adding machine-readable endpoints and policies for AI systems.
- Use it to allow or disallow paths at a broad level.
- It does not describe the full AI intent or content structure by itself.
- It should work together with `llms.txt`, sitemaps, and response headers.
- Legible makes this easier by generating the AI-specific layer separately, rather than forcing customers to overload `robots.txt` with jobs it was never designed to do.
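For orientation, a minimal `robots.txt` sketch (the paths and the sitemap URL are placeholders, not a recommendation for any specific site):

```text
# Baseline crawl hygiene applied to all crawlers
User-agent: *
Disallow: /admin/
Disallow: /cart/

# Point crawlers at the traditional sitemap
Sitemap: https://example.com/sitemap.xml
```

Note how little this file can express: path-level allow/disallow and a sitemap pointer. Everything AI-specific lives in the layers described below.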
Page-Level Robots Meta Tags
Page-level robots meta tags live in the HTML `<head>` and let a site declare instructions such as `noindex` or `nofollow` for a specific page. They can also target specific crawlers — for example, `<meta name="GPTBot" content="noindex">` blocks OpenAI's crawler while leaving other systems unaffected.
This matters because a page can look otherwise AI-ready and still be suppressed by a page-level `noindex` signal. In Legible's audit, that is treated as a serious visibility problem because it can undermine discovery and citation even when your discovery files are present.
Legible detects both generic `<meta name="robots">` and bot-specific variants like `<meta name="GPTBot">`, `<meta name="Google-Extended">`, and `<meta name="ClaudeBot">`. The audit reports which bots are specifically blocked at page level, even when the generic robots meta allows indexing.
Legible can also generate and manage these directives for you. Through the crawler policy, you can configure per-bot indexing rules that Legible enforces via `X-Robots-Tag` response headers — and optionally inject `<meta>` tags directly into the HTML `<head>` for maximum compatibility with SEO audit tools.
- `noindex` is a serious warning sign on any page you want discoverable and citable.
- `nofollow` can reduce downstream discovery even if the page itself is readable.
- Bot-specific meta tags like `<meta name="GPTBot" content="noindex">` can selectively block AI crawlers while preserving search visibility.
- Legible detects these bot-specific blocks and reports them separately in the GEO Readiness audit.
- This signal is different from `robots.txt`: `robots.txt` works at crawl-path level, while robots meta works at page level.
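For illustration, a page `<head>` that allows general indexing while blocking one AI crawler at page level (the bot name follows the convention discussed above; which bots you target is site-specific):

```html
<head>
  <!-- Generic directive: page may be indexed, links followed -->
  <meta name="robots" content="index, follow">
  <!-- Bot-specific directive: block OpenAI's crawler for this page only -->
  <meta name="GPTBot" content="noindex">
</head>
```

This is the pattern Legible's audit looks for: the generic tag permits indexing, yet a bot-specific tag quietly suppresses one AI system.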
AI-Specific Crawler Directives For Each Bot
Some AI systems publish their own crawler identities and obey bot-specific rules or policies. Legible helps teams think in terms of per-bot control rather than one universal crawler behavior.
This matters because search indexing, answer generation, training, and summarization are distinct use cases, and different systems may interpret permissions differently.
In Legible, the goal is to let one content policy fan out into the different technical places crawlers may look: headers, machine-readable indexes, page metadata, and clean content delivery. That reduces the amount of manual per-bot reasoning a customer has to do on their own.
- Legible is designed so policy does not live in one fragile snippet or plugin setting.
- Customers define how content should be available, and Legible propagates that intent into the AI-readable delivery layer.
- This is especially useful when one team needs to allow inference access but restrict training use.
How Legible Handles Different AI Crawlers
Not every crawler behaves the same way. Some are focused on search-style indexing, some fetch content for live answer generation, and some may be associated with model improvement or training workflows. Legible treats those as distinct policy questions rather than assuming one crawler policy fits every AI system.
The practical product advantage is that customers do not have to maintain separate logic in `robots.txt`, CMS tags, CDN rules, and app templates every time they want to refine access. Legible provides a single content delivery layer that expresses policy consistently across the AI-readable surface.
- Legible separates discovery, retrieval, citation, and training concerns so policy can be more precise.
- Customers manage intent in one product layer instead of writing one-off crawler rules everywhere.
- The same content item can carry AI availability signals through headers, metadata, and indexes without duplicating manual work.
Example policy model in Legible:
- Allow AI systems to read content for live answers and citation
- Allow discovery of `llms.txt` and `ai-sitemap.json`
- Allow clean Markdown delivery for supported crawlers
- Restrict or opt out of training-oriented reuse where desired
- Keep human-facing HTML behavior unchanged
Per-Bot Policy Examples
The exact crawler landscape will keep changing, but the useful pattern is stable: different AI systems may consume content for different reasons, and Legible is designed to express those differences without requiring a custom implementation for each vendor.
The examples below are not a promise of vendor-specific enforcement semantics. They show the type of policy posture Legible helps customers express across the AI-readable delivery layer.
- `GPTBot` or similar discovery crawlers: allow discovery and retrieval for answer generation while still controlling training-related permissions.
- `ClaudeBot` or similar retrieval crawlers: allow clean Markdown access and citation-friendly delivery without requiring raw HTML extraction.
- `PerplexityBot` or similar answer-engine crawlers: expose structured discovery endpoints and lightweight Markdown so citation and retrieval are efficient.
- `Google-Extended` or similar training-related controls: express when content may be used for AI system improvement versus live user-facing responses.
- General search bots such as `Googlebot` and `Bingbot`: keep traditional crawlability intact while the Legible layer adds AI-specific discovery on top.
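As a rough sketch of the fan-out idea behind these examples, one per-bot policy record can be turned into bot-prefixed `X-Robots-Tag` header values. The helper below is a hypothetical illustration, not Legible's actual API:

```python
def x_robots_tag_values(rules: dict[str, str]) -> list[str]:
    """Fan one per-bot policy dict out into bot-prefixed
    X-Robots-Tag header values, e.g. 'GPTBot: noindex'."""
    return [f"{bot}: {directive}" for bot, directive in sorted(rules.items())]

# One policy record becomes one header value per crawler.
values = x_robots_tag_values({"GPTBot": "noindex", "ClaudeBot": "index"})
print(values)  # ['ClaudeBot: index', 'GPTBot: noindex']
```

The point of the sketch is the shape of the problem: customers state intent once per bot, and the delivery layer emits the matching directive wherever that bot looks.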
What Customers Would Otherwise Have To Do Manually
Without Legible, a team that wants nuanced crawler policy usually ends up juggling several weakly connected systems: `robots.txt`, custom headers at the CDN, CMS template conditionals, manually updated sitemaps, and scattered documentation about what each bot is supposed to be allowed to do.
Legible reduces that operational burden by making AI-specific delivery a product capability instead of a collection of one-off technical patches.
- Manually decide which crawlers may discover versus retrieve content.
- Manually wire policy into headers, HTML, and hosted files.
- Manually keep policy aligned when content, platforms, or site structure changes.
- Manually verify whether AI systems are seeing the intended representation.
llms.txt
`llms.txt` is a machine-readable discovery document for AI systems. It helps point models and agentic crawlers toward the important parts of a site in a cleaner, more targeted way than general web crawling alone.
Legible generates and hosts `llms.txt` so AI systems can find key pages, collections, and AI-readable variants without guessing from navigation chrome.
That means customers do not need to hand-author and maintain this file as the site changes. Legible can keep it aligned with the site's synced content and hosted delivery paths.
- Legible exposes `llms.txt` from the hosted endpoint in proxy-free mode.
- In proxy mode, Legible can serve AI discovery from the same domain context where the content lives.
- The product keeps the file in sync as content changes instead of treating it as a one-time setup artifact.
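The file itself follows a simple Markdown convention. A minimal illustrative shape (the site name, section names, and URLs are placeholders):

```text
# Example Site

> One-sentence description of what this site covers.

## Docs

- [Getting Started](https://example.com/docs/getting-started.md): setup guide
- [API Reference](https://example.com/docs/api.md): endpoint details
```

Each link can point directly at an AI-readable Markdown variant, which is what lets crawlers skip navigation chrome entirely.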
JSON-LD Structured Data
JSON-LD is a structured data format commonly used with Schema.org vocabularies. It helps machines understand what a page represents: an article, FAQ, organization, product, documentation page, and more.
Legible uses structured data as part of a broader machine-readability layer. It is not a replacement for clean Markdown, but it gives AI systems more context about entity type and page purpose.
In practice, Legible makes this easier by giving customers a system that combines structured data, AI indexes, and clean content output instead of relying on Schema markup alone to carry the whole AI visibility strategy.
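For reference, a small JSON-LD block using the Schema.org `Article` type (the headline, date, and organization are illustrative values):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How AI Crawlers Read Your Site",
  "datePublished": "2025-01-15",
  "author": { "@type": "Organization", "name": "Example Co" }
}
</script>
```

A block like this tells a machine "this page is an article, published on this date, by this organization" without any layout parsing.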
Canonical URLs, Title Tags, And Meta Descriptions
Canonical URLs, titles, and meta descriptions are still foundational signals for AI visibility because they help define what a page is, what it should be called, and which URL should be treated as the primary reference.
In Legible audits, these are treated as page-identity signals. If they are missing or weak, AI systems have to infer more than they should from surrounding layout and body copy.
Legible makes this easier by preserving canonical references in AI-readable output and by helping customers spot missing page metadata before it hurts discoverability or citation quality.
- Canonical tags reduce ambiguity when multiple URLs can represent similar content.
- Clear titles and descriptions strengthen page-level context for both search systems and AI systems.
- These signals work best together with structured data, headings, and clean Markdown delivery.
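Together, these page-identity signals look like this in the `<head>` (values are placeholders):

```html
<head>
  <title>Clean Markdown Delivery | Example Docs</title>
  <meta name="description"
        content="How clean Markdown delivery reduces token overhead for AI crawlers.">
  <!-- Canonical: the one URL that should be cited for this content -->
  <link rel="canonical" href="https://example.com/docs/markdown-delivery">
</head>
```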
Vary: Accept Headers
This header tells caches and clients that the response changes based on the `Accept` header. In Legible proxy mode, that is what enables one URL to serve HTML to a browser and Markdown to an AI client without those variants being mixed together in cache.
A common misunderstanding is that `Vary: Accept` tells a crawler to ask for Markdown. It does not. It only tells caches and intermediaries that the representation changes when the request's `Accept` header changes.
In practice, the crawler learns to request Markdown from other signals: `llms.txt`, `ai-sitemap.json`, explicit Markdown URLs, or HTML discovery tags such as `<link rel="alternate" type="text/markdown">`. Legible handles this in product by separating the two integration modes. In proxy mode, Legible can negotiate the response at the same URL and send `Vary: Accept` so caches treat HTML and Markdown as different variants. In proxy-free mode, Legible avoids pretending negotiation exists and instead serves Markdown from explicit hosted `.md` URLs.
- `Vary: Accept` is for cache correctness, not discovery.
- Legible uses `llms.txt`, `ai-sitemap.json`, and alternate Markdown links to advertise where Markdown exists.
- Legible removes the need for customers to hand-build content negotiation logic at the edge.
- The product also prevents a common failure mode: stale caches mixing Markdown and HTML responses.
- If a customer cannot support proxy mode, Legible falls back cleanly to separate AI-readable URLs instead of a brittle partial negotiation setup.
Discovery:
- `llms.txt` or `ai-sitemap.json` tells the crawler Markdown is available
- HTML can also expose `<link rel="alternate" type="text/markdown">`
Proxy mode:
- Browser requests HTML from https://example.com/page
- AI crawler requests the same URL with `Accept: text/markdown`
- Legible responds with `Vary: Accept` so caches keep those variants separate
Proxy-free mode:
- Browser keeps using https://example.com/page
- AI crawler is pointed to https://read.example.com/page.md
- No same-URL negotiation is required
Markdown vs HTML Delivery In Legible
Legible uses two delivery models depending on the customer's infrastructure. Proxy mode returns different representations from the same URL based on the request context. Proxy-free mode keeps the human page unchanged and exposes a clean Markdown version from a hosted endpoint.
The practical point: customers do not have to choose between changing their CMS and building custom AI infrastructure. Legible adapts the delivery method to the environment they already have.
- Proxy mode is best when Cloudflare or another reverse proxy can sit in front of the site.
- Proxy-free mode is best for hosted platforms that do not allow request interception.
- The dashboard makes the difference explicit so customers know whether they are getting same-URL negotiation or hosted AI endpoints.
From Manual Setup To Guided Deployment
Without a product layer, teams would need to figure out where to put discovery files, how to generate clean Markdown, how to handle path mapping, and how to keep everything updated as content changes.
Legible turns that into a guided setup. The product generates the right hosted endpoints, gives customers copy-ready discovery tags, verifies the hosted files, and checks whether the origin site is exposing those tags correctly.
- The dashboard shows hosted `llms.txt`, `ai-sitemap.json`, and sample Markdown URLs.
- Customers get exact head tags to paste into their CMS or template.
- Legible can distinguish a healthy hosted endpoint from a broken origin discovery setup.
Content-Signal Headers
`Content-Signal` is Legible's response-layer way of embedding AI usage permissions and related metadata into HTTP responses.
This matters because crawlers do not always rely only on HTML meta tags. Response headers let policy travel with the exact representation being requested.
Legible makes this easier by attaching those policies as part of content delivery rather than asking every customer to manually wire custom headers in their hosting stack.
- In proxy mode, Legible can add AI policy headers directly on the negotiated response.
- In hosted delivery, Legible can preserve those signals on the Markdown and AI index endpoints it serves.
- Customers get one product-managed delivery layer instead of custom CDN, server, and CMS policy logic.
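As a purely illustrative shape, a policy-carrying response could look like the following. The `Content-Signal` field names here are hypothetical placeholders, not a documented syntax:

```text
HTTP/1.1 200 OK
Content-Type: text/markdown; charset=utf-8
Content-Signal: ai-input=yes, ai-train=no
```

The value travels with the exact representation being served, so a crawler fetching the Markdown variant sees the policy without ever parsing HTML.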
ai-sitemap.json
`ai-sitemap.json` is an AI-focused sitemap that complements traditional XML sitemaps. Instead of being optimized for search engine discovery alone, it is designed to expose machine-friendly URLs and content details more directly.
Legible uses it to expose clean content coverage for AI clients that want something more structured and targeted than HTML crawling.
The product benefit is that customers do not need to maintain a second sitemap system by hand. Legible can generate and update this AI-oriented index from the synced content inventory.
- Legible keeps `ai-sitemap.json` aligned with available Markdown pages and content items.
- It gives AI clients a cleaner map than a generic XML sitemap alone.
- Teams can verify the endpoint directly in the dashboard instead of guessing whether it is valid.
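An illustrative shape for such an index (field names here are assumptions for the sake of example, not Legible's actual schema):

```json
{
  "version": "1.0",
  "updated": "2025-01-15T00:00:00Z",
  "pages": [
    {
      "url": "https://example.com/docs/getting-started",
      "markdown": "https://example.com/docs/getting-started.md",
      "title": "Getting Started",
      "lastModified": "2025-01-10"
    }
  ]
}
```

The value over an XML sitemap is the pairing: each human URL is listed next to its clean Markdown variant and a freshness date.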
Clean Markdown Delivery
Clean Markdown delivery means returning the core content without heavy navigation, app chrome, scripts, consent banners, and layout noise.
AI systems pay a token and retrieval cost for noisy HTML. Reducing that overhead improves the chance that your actual content gets processed and cited.
Legible makes this easy by generating the Markdown representation from the content pipeline rather than forcing every customer to build a custom extractor, cleaner, and serializer for their own CMS.
- Legible strips navigation, scripts, and layout noise from the AI-readable version.
- The product can include structured frontmatter and canonical references back to the human page.
- Customers get a consistent Markdown output format across proxy and proxy-free delivery paths.
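For illustration, the kind of output this produces — a Markdown page with frontmatter linking back to the canonical human URL (the exact frontmatter fields are an assumption about shape, not a spec):

```markdown
---
title: Getting Started
canonical: https://example.com/docs/getting-started
---

# Getting Started

Install the CLI, then run the init command to connect your site.
```

Every token in this representation is content; nothing is spent on navigation, scripts, or consent banners.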
JavaScript Rendering And AI Crawlers
Many AI crawlers do not execute a full client-side JavaScript application before trying to understand a page. If the important content only appears after hydration, the crawler may see a shell, not the substance.
That is why Legible's audit treats JavaScript rendering risk as a major signal. A page can have good metadata and still be effectively invisible to AI if the actual content is not available in the server-rendered response.
Legible solves this by generating a clean AI-readable representation from the content layer itself, rather than hoping every crawler can reconstruct a JavaScript-heavy page correctly.
- If AI systems miss the content entirely, better discovery files alone will not fix the problem.
- Server-rendered HTML is still valuable even when Legible also provides Markdown.
- Legible is especially useful for sites where the visible page experience depends heavily on client-side rendering.
AI Meta Tags
AI meta tags are HTML-level hints that help crawlers discover AI-readable endpoints or understand how a page should be interpreted.
In proxy-free mode, Legible relies on them heavily because the main site's raw HTML is what tells a crawler where the hosted Markdown and AI indexes live.
Legible makes this manageable by generating the tags for customers instead of requiring them to invent conventions or guess which tags belong globally versus per page.
- The dashboard separates global tags from page-template tags.
- Customers can copy the tags directly into Webflow, WordPress, Shopify, Squarespace, or Next.js setups.
- Legible validates whether those tags are visible in the raw source HTML.
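A sketch of what a discovery tag looks like in the raw page source (the `href` is a placeholder for your hosted Markdown endpoint):

```html
<head>
  <!-- Tells crawlers a clean Markdown variant of this page exists -->
  <link rel="alternate" type="text/markdown"
        href="https://read.example.com/docs/getting-started.md">
</head>
```

Because proxy-free mode has no same-URL negotiation, this tag in the raw HTML is the crawler's route to the AI-readable version.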
Freshness Headers: Last-Modified And ETag
`Last-Modified` and `ETag` are response headers that help crawlers and intermediaries understand whether content has changed since the last fetch.
In Legible's audit, these act as freshness signals. They do not replace good content, but they help AI systems re-check the right pages efficiently instead of re-fetching everything blindly.
Legible now automatically generates both headers on every AI-readable response. `Last-Modified` is derived from the content's actual publication or modification date (from CMS metadata or cache timestamps), and `ETag` is computed as a content-based hash so crawlers can make conditional requests.
When a crawler sends `If-None-Match` or `If-Modified-Since`, Legible can respond with a lightweight `304 Not Modified` instead of re-transmitting the full content. This reduces bandwidth for high-frequency AI crawlers and improves crawl efficiency for your site.
- Both `Last-Modified` and `ETag` are now generated automatically on all Legible-served Markdown responses.
- `Last-Modified` uses actual content dates from your CMS, not arbitrary timestamps.
- `ETag` is a deterministic content hash — the same content always produces the same tag.
- Conditional requests (304 Not Modified) reduce bandwidth for AI crawlers that re-fetch frequently.
- These headers work alongside `Cache-Control` rather than replacing it.
Cache Headers For AI
Cache headers tell clients and intermediaries how long a response can be reused. For AI workloads, good caching reduces repeated token-heavy fetches and keeps content retrieval efficient.
Legible balances freshness with efficiency so that AI systems get stable, lightweight content without needlessly re-fetching noisy pages.
This is easier with Legible because caching becomes part of the content delivery product rather than a custom CDN project each customer has to tune alone.
- Legible can keep AI endpoints cache-friendly while preserving content freshness.
- In proxy mode, cache behavior works together with `Vary: Accept` to keep representations separate.
- In proxy-free mode, hosted endpoints can be cached independently from the human site.
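Illustrative response headers for a negotiated proxy-mode Markdown response — the specific `max-age` values are placeholders, not Legible defaults:

```text
HTTP/1.1 200 OK
Content-Type: text/markdown; charset=utf-8
Cache-Control: public, max-age=300, stale-while-revalidate=600
Vary: Accept
ETag: "9f86d081884c7d65"
```

Note how the pieces cooperate: `Cache-Control` bounds reuse, `Vary: Accept` keeps the HTML and Markdown variants separate, and `ETag` enables cheap revalidation once `max-age` expires.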
Training Opt-Out Controls
Training controls distinguish between using content to answer a live query and using content to improve future model training. Customers often care about those differently.
Legible exposes these controls in metadata and headers so policy can be attached at page level rather than treated as one blanket site-wide assumption.
The product value is that teams can manage these choices in the Legible content layer instead of trying to express them inconsistently across plugins, templates, and hosting rules.
- Legible supports a model where inference and training are treated as different policy questions.
- Those controls can travel with the content in metadata and delivery headers.
- Customers do not need to hand-maintain different policy signals across multiple technical surfaces.
Semantic HTML Structure
Semantic HTML means headings, sections, lists, tables, and other structure are represented clearly in the source. Even when Legible provides Markdown, semantic HTML still matters because it affects extraction quality and structured understanding.
A well-structured source page is easier to transform into a reliable machine-readable representation.
Legible makes this easier operationally because the product can normalize and re-emit content into a cleaner representation, even when the source HTML is heavier than ideal.
Heading Hierarchy And H1s
A clear heading hierarchy helps both crawlers and retrieval systems understand what the page is mainly about and how the content is organized beneath that main topic.
In Legible's audit, the H1 check is intentionally simple: if there is no obvious H1, the page may be harder for machines to summarize and chunk accurately.
Legible can normalize a lot of noisy source markup, but strong source headings still improve extraction quality and downstream citation clarity.
- One clear H1 is usually better than multiple competing top-level headings.
- Subheadings help chunking and retrieval stay grounded in the right section.
- This signal works together with semantic containers like `<main>` and `<article>`.
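A minimal structure that satisfies these checks — one H1, subheadings beneath it, and semantic containers around the content:

```html
<main>
  <article>
    <h1>Clean Markdown Delivery</h1>
    <section>
      <h2>Why token overhead matters</h2>
      <p>AI systems pay a retrieval cost for noisy HTML...</p>
    </section>
  </article>
</main>
```

From source like this, an extractor can recover the page topic, its section boundaries, and a clean Markdown outline with almost no guesswork.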
Ongoing Sync And Updates
Legible is not a one-time conversion tool. It keeps AI-readable assets in sync as your site changes, so `llms.txt`, Markdown pages, sitemaps, and metadata stay current.
That ongoing sync matters because outdated AI endpoints lead to stale citations and incorrect answers.
This is where the product saves the most operational effort. Instead of customers rebuilding exports and indexes manually after every content change, Legible regenerates the AI-readable layer as content is synced and updated.
- New or changed content can automatically flow into Markdown, `llms.txt`, and `ai-sitemap.json`.
- The product keeps AI delivery from becoming a one-time launch project that drifts out of date.
- Customers get a durable AI visibility layer instead of another manual publishing checklist.
