Legible Schema Engine: Page-Aware JSON-LD With Async Review, AI Suggestions, And Observability

Why this matters

Legible's Schema Engine turns your existing site content into page-aware JSON-LD instead of asking your team to hand-author schema for every important page. It classifies each page, builds a structured graph, validates the result, and decides whether that graph is ready to publish.

The system is now much more than automatic JSON-LD. It includes async AI review layers for category, primary entity, facts, FAQ normalization, and relationships, plus observability, retries, and operator controls so the workflow is usable at production scale.

What customers get

The Schema Engine is designed to make structured data feel like a maintained product surface, not an SEO chore. Instead of shipping one generic JSON-LD snippet everywhere, Legible looks at each page in your content library and generates schema that matches the purpose of that page.

Page-aware JSON-LD, not one-size-fits-all structured data.
Automatic graph generation for pages like docs, blogs, FAQ pages, pricing pages, service pages, and company pages.
Confidence scoring so Legible can auto-publish easy cases and hold back uncertain ones.
A review workflow so teams can approve, block, or override schema before it goes live.
Async AI review layers that enrich the deterministic engine without slowing normal preview.
Site-level health, retries, and queue controls so teams can actually operate the system over time.
One schema system across both proxy and proxy-free delivery paths.

How the Schema Engine works

The engine starts with the same content layer Legible already uses for clean Markdown, `llms.txt`, and AI discovery. For each page, it combines URL and path signals, content type, titles, excerpts, dates, images, FAQ content, and site-level business details such as your organization name and elevator pitch.

From there, the engine classifies the page into a schema pack, builds a JSON-LD `@graph`, validates the nodes, scores how complete and trustworthy the result is, and decides whether that graph should publish automatically or wait for review.

Common schema packs include homepage, docs, blog, FAQ, pricing, service, company, legal, contact, product, and minimal.
The engine can attach reusable nodes like `WebSite`, `Organization`, `WebPage`, and `BreadcrumbList` around the primary entity for the page.
Primary entity selection can stay automatic or be overridden manually per content item.
Accepted facts, accepted FAQ pairs, and accepted relationships can enrich the graph when those additions are semantically safe.
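
The first three steps above (classify, build, validate) can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `PACK_RULES` prefixes, the `Page` shape, and the function names are hypothetical, and the real engine weighs many more signals than the URL path.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Hypothetical path-prefix rules; content type, titles, and other signals
# mentioned in the text are not modeled here.
PACK_RULES = [
    ("/docs/", "docs"),
    ("/blog/", "blog"),
    ("/faq", "faq"),
    ("/pricing", "pricing"),
    ("/services/", "service"),
]

@dataclass
class Page:
    url: str
    title: str = ""

def classify_pack(page: Page) -> str:
    """Pick a schema pack from the URL path, falling back to 'minimal'."""
    path = urlsplit(page.url).path
    if path in ("", "/"):
        return "homepage"
    for prefix, pack in PACK_RULES:
        if path.startswith(prefix):
            return pack
    return "minimal"

def build_graph(page: Page, pack: str, org_name: str) -> dict:
    """Wrap an optional primary entity in reusable site-level nodes."""
    parts = urlsplit(page.url)
    origin = f"{parts.scheme}://{parts.netloc}"
    graph = [
        {"@type": "WebSite", "@id": f"{origin}/#website", "url": origin, "name": org_name},
        {"@type": "Organization", "@id": f"{origin}/#organization", "name": org_name, "url": origin},
        {"@type": "WebPage", "url": page.url, "name": page.title},
    ]
    primary_type = {"docs": "TechArticle", "blog": "BlogPosting"}.get(pack)
    if primary_type:
        graph.append({"@type": primary_type, "headline": page.title})
    return {"@context": "https://schema.org", "@graph": graph}

def validate(graph: dict) -> list[str]:
    """Return human-readable warnings; an empty list means no issues found."""
    warnings = []
    for node in graph["@graph"]:
        if "@type" not in node:
            warnings.append("node missing @type")
        if node.get("@type") == "WebPage" and not node.get("name"):
            warnings.append("WebPage missing name")
    return warnings
```

Keeping each step a pure function of its inputs is one way to make the baseline reproducible: the same page always yields the same pack, graph, and warnings.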

What is deterministic vs AI-assisted

Legible keeps deterministic graph generation as the source of truth. The schema pack, baseline graph, validation, confidence scoring, and publish decision do not depend on a live model call succeeding in the request path.

AI is used as an advisory layer on top of that foundation. It suggests categories, primary entities, structured facts, FAQ normalization, and page-to-page relationships for human review. That gives customers the upside of richer schema without turning the system into a black box.

Deterministic: classification baseline, graph assembly, validation, confidence, publish decisions.
AI-assisted: second-opinion category suggestions, primary entity suggestions, candidate facts, FAQ normalization, relationship suggestions.
Accepted AI outputs can enrich the graph, but they never bypass the existing review rules to publish automatically.
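
One way to picture the split is a rule where an AI suggestion only changes the outcome after a reviewer accepts it. The `Suggestion` shape, the status names, and `effective_category` below are hypothetical, not Legible's actual data model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Suggestion:
    layer: str              # e.g. "category" or "primary_entity"
    value: str
    status: str = "pending"  # hypothetical states: pending | accepted | rejected

def effective_category(baseline: str, suggestion: Optional[Suggestion]) -> str:
    """The deterministic baseline wins unless a category suggestion was accepted."""
    if (
        suggestion is not None
        and suggestion.layer == "category"
        and suggestion.status == "accepted"
    ):
        return suggestion.value
    return baseline
```

A missing, pending, or rejected suggestion leaves the baseline untouched, so a failed model call cannot change what gets published.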

Schema types Legible can publish

Legible does not force every page into the same schema type. It selects the combination that best matches the page and only publishes a graph when the result clears the engine's confidence and review rules.

`WebSite` for the site-level entity.
`Organization` for the business identity behind the site.
`WebPage` as the common wrapper node for individual pages.
`BreadcrumbList` for page hierarchy and crawlable context.
`TechArticle`, `BlogPosting`, or `Article` for content-heavy pages.
`FAQPage` when the page actually contains visible question-and-answer content.
`SoftwareApplication` or `Service` when the page reads like a product or service offer.

Example: a documentation page graph

A documentation or technical guide page will usually publish a graph that includes the site identity, breadcrumb context, a `WebPage` node, and a more specific content entity such as `TechArticle`.

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "WebSite",
      "@id": "https://acme.com/#website",
      "url": "https://acme.com",
      "name": "Acme"
    },
    {
      "@type": "Organization",
      "@id": "https://acme.com/#organization",
      "name": "Acme",
      "url": "https://acme.com",
      "description": "Platform for AI-readable websites."
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        { "@type": "ListItem", "position": 1, "name": "Docs", "item": "https://acme.com/docs/" },
        { "@type": "ListItem", "position": 2, "name": "Proxy setup", "item": "https://acme.com/docs/proxy-setup/" }
      ]
    },
    {
      "@type": "WebPage",
      "url": "https://acme.com/docs/proxy-setup/",
      "name": "Proxy setup guide",
      "isPartOf": {
        "@id": "https://acme.com/docs/",
        "url": "https://acme.com/docs/",
        "name": "Documentation"
      }
    },
    {
      "@type": "TechArticle",
      "headline": "Proxy setup guide",
      "dateModified": "2026-04-02T10:00:00.000Z",
      "author": { "@type": "Organization", "name": "Acme" },
      "mentions": [
        {
          "@id": "https://acme.com/pricing/",
          "url": "https://acme.com/pricing/",
          "name": "Pricing",
          "@type": "SoftwareApplication"
        }
      ]
    }
  ]
}
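
As an illustration of how a node like the `BreadcrumbList` above could be derived from URL and path signals, here is a minimal sketch. It assumes breadcrumb names come straight from path slugs, which is a simplification; `breadcrumb_from_url` is a hypothetical helper, not a Legible API.

```python
from urllib.parse import urlsplit

def breadcrumb_from_url(url: str) -> dict:
    """Build a BreadcrumbList with one ListItem per path segment."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    segments = [segment for segment in parts.path.split("/") if segment]
    items = []
    for position, segment in enumerate(segments, start=1):
        trail = "/".join(segments[:position])
        items.append({
            "@type": "ListItem",
            "position": position,  # positions start at 1 and increase by one
            "name": segment.replace("-", " ").capitalize(),
            "item": f"{origin}/{trail}/",
        })
    return {"@type": "BreadcrumbList", "itemListElement": items}
```

Calling `breadcrumb_from_url("https://acme.com/docs/proxy-setup/")` yields the same two list items as the example graph: "Docs" at position 1 and "Proxy setup" at position 2.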

Example: FAQ and service pages

FAQ pages and commercial pages behave differently on purpose. If Legible can see real question-and-answer content, it can generate `FAQPage`. If the page reads like a commercial offer, it may choose `Service` or `SoftwareApplication` instead.

`FAQPage` only appears when visible FAQ content is present and FAQ schema is enabled for the site.
A pricing or product page may still publish `WebPage` plus `Service` or `SoftwareApplication` even if it is not a blog or docs page.
If the engine is uncertain about the primary entity, the page is more likely to land in `needs_review` or `blocked` instead of publishing automatically.

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "WebPage",
      "url": "https://acme.com/services/ai-content-setup/",
      "name": "AI content setup"
    },
    {
      "@type": "Service",
      "name": "AI content setup",
      "description": "Implementation service for making existing websites readable by AI crawlers.",
      "serviceType": "AI content implementation",
      "audience": { "@type": "Audience", "audienceType": "Marketing and web teams" },
      "provider": {
        "@type": "Organization",
        "name": "Acme"
      }
    }
  ]
}
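
The `FAQPage` gate described in this section (visible question-and-answer content plus a site-level switch) can be sketched as follows. The function name and the `faq_schema_enabled` flag are assumptions for illustration; the node shape follows the schema.org `FAQPage`, `Question`, and `Answer` types.

```python
from typing import Optional

def build_faq_page(pairs: list[tuple[str, str]], faq_schema_enabled: bool) -> Optional[dict]:
    """Return a FAQPage node, or None when either gate condition fails."""
    # Keep only pairs where both the question and the answer have visible text.
    visible = [(q.strip(), a.strip()) for q, a in pairs if q.strip() and a.strip()]
    if not faq_schema_enabled or not visible:
        return None
    return {
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in visible
        ],
    }
```

Returning `None` rather than an empty `FAQPage` keeps the graph free of nodes that claim FAQ content the page does not show.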

Async review layers that make the engine richer

The mature Schema Engine does not stop at deterministic graph generation. It can also generate reviewable AI suggestions in the background, so the baseline schema preview stays fast while richer suggestions appear when they are ready.

Category suggestion: a second opinion on the page category.
Primary entity suggestion: a second opinion on whether the page is best represented as `WebPage`, `Service`, `SoftwareApplication`, `TechArticle`, and so on.
Candidate facts: reviewable facts like `applicationCategory`, `featureList`, `serviceType`, `audience`, or `operatingSystem`.
FAQ normalization: reviewable question-and-answer pairs grounded in visible page content.
Relationship suggestions: reviewable `about`, `mentions`, and `isPartOf` links between same-site pages.
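
A rough sketch of the background pattern, using Python's `asyncio`: the deterministic preview exists before any layer runs, and a failure in one layer is recorded as an error without affecting the others. The layer names mirror the list above; `run_layer` is a stub standing in for a model call, and all names here are hypothetical.

```python
import asyncio

LAYERS = ["category", "primary_entity", "facts", "faq", "relationships"]

async def run_layer(name: str, page_url: str) -> dict:
    # Stand-in for a model call: yields control once, then returns a stub result.
    await asyncio.sleep(0)
    return {"layer": name, "status": "ready", "suggestions": []}

async def preview_then_enrich(page_url: str, runner=run_layer) -> tuple[dict, list[dict]]:
    # The deterministic preview is built first and never waits on a layer.
    preview = {"page": page_url, "graph_ready": True}
    tasks = [asyncio.create_task(runner(name, page_url)) for name in LAYERS]
    # return_exceptions=True keeps one failing layer from cancelling the rest.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    layers = []
    for name, result in zip(LAYERS, results):
        if isinstance(result, Exception):
            layers.append({"layer": name, "status": "error", "detail": repr(result)})
        else:
            layers.append(result)
    return preview, layers
```

In a real deployment the layers would likely run in a job queue rather than in the same coroutine as the preview; the sketch only shows the isolation between them.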

Confidence, review, and publish decisions

Every page gets a confidence score from 0 to 100. That score reflects how well the engine could classify the page, whether the graph validates cleanly, and whether important supporting fields like title, description, author, or modified date are present.
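
To make the three inputs concrete, here is a minimal scoring sketch. The weights (50/30/20), the per-warning penalty, and the field list are invented for illustration; the actual formula is not documented here.

```python
RECOMMENDED_FIELDS = ("title", "description", "author", "dateModified")

def missing_fields(fields: dict) -> list[str]:
    """List recommended fields that are absent or empty."""
    return [name for name in RECOMMENDED_FIELDS if not fields.get(name)]

def confidence_score(classification_certainty: float, validation_warnings: int, fields: dict) -> int:
    """Combine classification, validation, and completeness into a 0-100 score.

    All weights below are hypothetical.
    """
    certainty = min(max(classification_certainty, 0.0), 1.0)
    validity = max(0.0, 1.0 - 0.25 * validation_warnings)
    completeness = 1.0 - len(missing_fields(fields)) / len(RECOMMENDED_FIELDS)
    return round(50 * certainty + 30 * validity + 20 * completeness)
```

Returning the missing fields alongside the score gives reviewers an explanation for a weak score instead of only a number.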

Legible then converts that score into a review state so your team can focus on the pages that actually need human judgment.

High-confidence pages can publish automatically.
Medium-confidence pages are marked `needs_review` so a human can confirm the graph.
Low-confidence pages are blocked by default instead of risking incorrect schema on live pages.
Warnings and missing recommended fields explain why a page did not earn a stronger publish decision.
Manual publish modes let a reviewer force a graph on or keep it off for a specific page.
Accepted AI suggestions can enrich a page safely without changing the underlying deterministic review model.
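
The mapping from score to review state, plus the manual override, can be sketched as below. The thresholds (80 and 50), the `auto_publish` state name, and the manual mode names are assumptions; only `needs_review` and `blocked` appear in this document.

```python
AUTO_PUBLISH_MIN = 80  # hypothetical threshold
NEEDS_REVIEW_MIN = 50  # hypothetical threshold

def review_state(score: int) -> str:
    """Map a 0-100 confidence score to a review state."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= AUTO_PUBLISH_MIN:
        return "auto_publish"
    if score >= NEEDS_REVIEW_MIN:
        return "needs_review"
    return "blocked"

def should_publish(score: int, manual_mode: str = "auto") -> bool:
    """A reviewer's manual mode takes precedence over the score-based state."""
    if manual_mode == "approved":
        return True
    if manual_mode == "blocked":
        return False
    return review_state(score) == "auto_publish"
```

For example, `should_publish(30, "approved")` is `True` even though a score of 30 would otherwise be blocked, which is the "force a graph on" case.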

How teams review and control schema

The Schema tab in the content library is where teams can inspect the generated graph before it reaches production. This is especially useful for important pages like pricing, homepage variants, service pages, and high-traffic help center articles.

What started as a preview-and-approve workflow is now a broader review surface. Teams can work through AI second opinions, accept or reject candidate facts, approve FAQ pairs, and review relationship suggestions without losing the deterministic core of the engine.

Preview the generated `@graph` for a specific content item.
See confidence, warnings, and missing recommended fields before publishing.
Override the primary entity if the automatic choice is too generic.
Mark a page as approved, blocked, or still in auto mode.
Add internal review notes so SEO and engineering teams stay aligned.
Use the schema review queue to work through pages that need manual attention.
Resolve AI review items with explicit keep, apply, dismiss, accept, reject, and restore actions depending on the layer.

Observability, retries, and operator controls

The Schema Engine is now observable as a running system, not only as a renderer. Site-level health views show how many items are idle, pending, processing, ready, or errored. The dashboard also surfaces retry health, average generation times, and review workload across category, entity, facts, FAQ, and relationships.

That matters because a mature schema workflow needs more than generation. Teams need to know what is stuck, what needs review, and how to recover safely when background AI work fails.

Schema Engine health shows queue status, retry pressure, and review backlog.
Automatic retries use bounded backoff instead of retrying forever.
Manual actions let operators retry failed items or requeue stale items safely.
Preview remains fast and usable even when AI enrichment is still pending or in error.
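
A minimal sketch of bounded backoff and a status tally, under stated assumptions: the attempt limit, base delay, cap, and status labels are illustrative values, not Legible's configuration.

```python
from collections import Counter
from typing import Optional

STATUSES = ("idle", "pending", "processing", "ready", "error")

def next_retry_delay(
    attempts_made: int,
    max_attempts: int = 5,
    base_seconds: float = 30.0,
    cap_seconds: float = 900.0,
) -> Optional[float]:
    """Exponential backoff with a delay cap; None means retries are exhausted."""
    if attempts_made >= max_attempts:
        return None
    return min(cap_seconds, base_seconds * 2 ** attempts_made)

def health_summary(item_statuses: list[str]) -> dict[str, int]:
    """Count items per status, reporting zero for statuses with no items."""
    counts = Counter(item_statuses)
    return {status: counts.get(status, 0) for status in STATUSES}
```

With these defaults the delays run 30, 60, 120, 240, and 480 seconds, after which the item stays in error until an operator retries or requeues it manually.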

How accepted facts, FAQs, and relationships enrich the graph

Accepted fact candidates can make `SoftwareApplication` and `Service` nodes more useful by adding fields that are often missing from HTML alone. Accepted FAQ candidates can supplement deterministic FAQ parsing when visible Q&A content exists but is messy. Accepted relationships can safely add references like `isPartOf`, `about`, and `mentions` where they make semantic sense.

Legible keeps these enrichments conservative on purpose. Ambiguous suggestions stay in preview and evidence instead of being pushed into published JSON-LD just because an AI model proposed them.

Facts can enrich service and software nodes.
FAQ normalization supplements visible Q&A without inventing hidden answers.
Relationship wiring is intentionally conservative in v1.
`mainEntityOfPage` stays preview-only until the directional semantics are clear enough to wire safely.
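
The conservative merge rules above might look like the following sketch. The candidate shape (`property`, `value`, `status`) and the choice never to overwrite a deterministic value are assumptions made for illustration.

```python
ENRICHABLE_TYPES = {"Service", "SoftwareApplication"}
# mainEntityOfPage is deliberately absent: it stays preview-only.
PUBLISHABLE_RELATIONSHIPS = {"isPartOf", "about", "mentions"}

def apply_accepted_facts(node: dict, candidates: list[dict]) -> dict:
    """Return a copy of the node with accepted facts added to eligible types."""
    enriched = dict(node)
    if node.get("@type") not in ENRICHABLE_TYPES:
        return enriched
    for candidate in candidates:
        # Assumption: an accepted fact fills a gap but never overwrites a value.
        if candidate["status"] == "accepted" and candidate["property"] not in enriched:
            enriched[candidate["property"]] = candidate["value"]
    return enriched

def publishable_relationships(candidates: list[dict]) -> list[dict]:
    """Keep only accepted relationships whose property is safe to publish."""
    return [
        c for c in candidates
        if c["status"] == "accepted" and c["property"] in PUBLISHABLE_RELATIONSHIPS
    ]
```

Anything filtered out here would still be visible in preview and evidence; it just does not reach the published JSON-LD.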

How schema reaches live pages

Legible uses the same delivery split it already uses for other AI-readable outputs. In proxy mode, schema can be injected server-side so crawlers see it directly in the HTML. In proxy-free mode, the public schema endpoint powers the discovery snippet and the hosted fallback path.

That makes the Schema Engine one system even though the delivery path can differ by deployment model.

Proxy mode: server-side HTML injection is best when you want crawler-visible JSON-LD on the main domain.
Proxy-free mode: the hosted/public schema response gives the snippet a consistent source of truth.
Existing JSON-LD detection can be kept on to avoid duplicate schema when a CMS or SEO plugin already injects its own graph.
If a site needs guaranteed crawler visibility without JavaScript execution, proxy mode remains the stronger choice.
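
For the proxy path, server-side injection with existing JSON-LD detection could be sketched roughly as follows. This is a simplified illustration using regular expressions on the HTML string; `inject_schema` and its `skip_if_present` flag are hypothetical, and a production proxy would more likely use a streaming HTML rewriter.

```python
import json
import re

# Matches an opening <script> tag that declares JSON-LD content.
JSON_LD_SCRIPT = re.compile(r"""<script[^>]*type=["']application/ld\+json["']""", re.IGNORECASE)
HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)

def inject_schema(html: str, graph: dict, skip_if_present: bool = True) -> str:
    """Insert a JSON-LD script before </head>, unless the page already has one."""
    if skip_if_present and JSON_LD_SCRIPT.search(html):
        return html  # a CMS or SEO plugin already emits schema; avoid duplicates
    head_close = HEAD_CLOSE.search(html)
    if head_close is None:
        return html  # no <head> to inject into; leave the page untouched
    # Escape "</" so a string value cannot close the script element early.
    payload = json.dumps(graph).replace("</", "<\\/")
    tag = f'<script type="application/ld+json">{payload}</script>'
    return html[:head_close.start()] + tag + html[head_close.start():]
```

Because the tag is written into the HTML response itself, crawlers that do not execute JavaScript still see the graph, which is the advantage the proxy path has over the snippet.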

Best practices for better schema output

Give important pages clear titles and descriptions in the source content so the engine has better inputs.
Keep author and modified-date fields available on blog and documentation content.
Use real FAQ content items for FAQ material instead of hiding key answers inside script-driven collapsible widgets.
Fill in organization name, URL, logo, and elevator pitch in site settings so organization-level schema is richer.
Review high-impact pages first: homepage, pricing, product, service, docs hubs, and major FAQ pages.
Treat the review queue as editorial QA rather than a one-time setup task.
Use the health card to monitor error states, retry exhaustion, and review backlog before those issues pile up.

Troubleshooting

No schema is publishing: Check whether site-wide schema injection is enabled and whether the page is currently blocked or still awaiting review.
A page is stuck in `needs_review`: Look at missing recommended fields such as author, modified date, or page description and then re-check the preview.
The wrong schema type was chosen: Use the primary entity override on that content item, then save and verify the new preview.
`FAQPage` is missing: Make sure FAQ schema is enabled for the site and that the page actually contains visible question-and-answer content, or that the related FAQ items are included in Legible.
Duplicate schema appears on the page: Keep existing-schema detection enabled if your CMS or SEO plugin is already outputting JSON-LD.
AI suggestions are still loading: The async layers may still be pending. The deterministic preview should be available immediately while enrichment finishes in the background.
An item is in error: Use the retry controls or site-level health actions to requeue failed AI analysis safely.
Changes are not visible yet: Re-save the page settings, allow a short propagation window, and re-test the live page source rather than only the rendered DOM.
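
When re-testing the live page source, a small script can list which schema types are present in the raw HTML. The sketch below uses only Python's standard library and expects the page source as a string (for example, saved from "view source" or fetched with any HTTP client); `json_ld_types` is a hypothetical helper, not part of Legible.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD script element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of aborting the check

def json_ld_types(html_source: str) -> list:
    """Return the @type of every top-level node found in the page source."""
    parser = JsonLdExtractor()
    parser.feed(html_source)
    parser.close()
    types = []
    for block in parser.blocks:
        nodes = block.get("@graph", [block]) if isinstance(block, dict) else block
        types.extend(node.get("@type") for node in nodes if isinstance(node, dict))
    return types
```

An empty result on the raw source, combined with schema showing up in the browser's rendered DOM, suggests the graph is being added client-side rather than injected server-side.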