AI Search for the Code-Curious: robots.txt, llms.txt & Extractable Markup

This guide covers the files and markup that let AI engines crawl, understand and cite your site, written for the code-curious. We will allow the right AI crawlers in robots.txt, ship an llms.txt, add FAQPage JSON-LD, and structure HTML answer-first so a model can extract a self-contained passage. For the strategy, measurement and retrieval mechanics behind it, see our AI search strategy and AI search mastery guides.

Step 1: Allow AI crawlers in robots.txt

Accessibility is the precondition — if your robots.txt blocks the AI bots, no on-page work can compensate. These are the current user-agents worth allowing explicitly:

# robots.txt
User-agent: GPTBot          # OpenAI training
User-agent: OAI-SearchBot   # ChatGPT search
User-agent: ChatGPT-User    # ChatGPT browsing on user request
User-agent: ClaudeBot       # Anthropic
User-agent: PerplexityBot   # Perplexity
User-agent: Google-Extended # Google AI (Gemini / AI training)
Allow: /

# point crawlers at your sitemap
Sitemap: https://example.com/sitemap.xml

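Listing a bot under User-agent with Allow: / permits it; to block one, give it Disallow: / instead. A crawler that matches no group is allowed by default, so the explicit Allow matters most when a wildcard rule (User-agent: *) would otherwise block it. Google-Extended controls Google’s AI uses without affecting normal Search indexing. Decide deliberately which to allow, but if you want AI citations, the engines you are targeting must be allowed.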
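One way to sanity-check the file before deploying is Python's standard-library urllib.robotparser. The sketch below is illustrative only: the helper name blocked_ai_bots and the sample file are assumptions, and the standard-library parser may not treat every edge case exactly as each vendor's crawler does.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the robots.txt example above.
AI_BOTS = [
    "GPTBot",
    "OAI-SearchBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
]


def blocked_ai_bots(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the AI user-agents that this robots.txt text forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, url)]


# Hypothetical file: a blanket Disallow with a single exception for GPTBot.
SAMPLE = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

if __name__ == "__main__":
    # Lists every bot from AI_BOTS that the sample file blocks.
    print(blocked_ai_bots(SAMPLE))
```

An empty list means none of the listed bots is blocked for that URL; anything else names the bots still caught by a Disallow rule.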

Step 2: Ship an llms.txt

An llms.txt at your domain root is a proposed convention rather than a formal standard: a clean, link-first map of your most important content, intended to help LLM-based tools find the canonical version of what matters. It is Markdown, headed by your site name and a short summary, then curated links:

# /llms.txt

# Example Ltd
> Plumbing and heating services across Derbyshire.

## Key pages
- [Services](https://example.com/services): what we do
- [Boiler installation](https://example.com/boiler-installation): process and pricing
- [Service areas](https://example.com/areas): where we work
- [FAQ](https://example.com/faq): common questions answered

## About
- [About us](https://example.com/about): team, credentials, reviews
- [Contact](https://example.com/contact): phone, address, hours

Keep it curated — point at your best, most answer-dense pages, not everything. Audit it with the LLMs.txt Auditor.
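If the page list already lives in code or a CMS export, the file can be rendered rather than hand-edited. This is a minimal sketch that mirrors the layout of the example above; the function name render_llms_txt and the shape of the sections argument are assumptions made for illustration.

```python
def render_llms_txt(
    site_name: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render llms.txt text: H1 site name, blockquote summary, then H2 link lists.

    sections maps a heading to (link title, URL, short note) tuples.
    """
    lines = [f"# {site_name}", f"> {summary}"]
    for heading, links in sections.items():
        lines.append("")  # blank line before each section
        lines.append(f"## {heading}")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_llms_txt(
            "Example Ltd",
            "Plumbing and heating services across Derbyshire.",
            {
                "Key pages": [
                    ("Services", "https://example.com/services", "what we do"),
                    ("FAQ", "https://example.com/faq", "common questions answered"),
                ],
            },
        )
    )
```

Keeping the list in one data structure makes the curation decision explicit: a page only reaches llms.txt if someone adds it deliberately.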

Step 3: Ship FAQPage JSON-LD

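FAQ markup presents questions and answers in a structured shape that engines can extract from. Do not count on a Google rich result, though: since 2023 Google has shown FAQ rich results only for well-known, authoritative government and health sites. Mark up only genuine Q&A that is visible on the page: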

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How much does a boiler service cost?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A standard boiler service costs £80–120 and takes about an hour."
      }
    }
  ]
}
</script>

Each answer should be self-contained — it makes sense lifted out of the page, because that is exactly what an engine does with it.
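Writing this JSON by hand gets error-prone once there are more than a few questions. The following sketch builds the same structure from question and answer pairs using only the standard library; the function name faq_jsonld is an assumption, and the pairs should be the same text that is visible on the page.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build a FAQPage JSON-LD <script> tag from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    # ensure_ascii=False keeps characters such as the pound sign readable.
    body = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so answer text can never close the <script> element early;
    # "\/" is a valid JSON escape and decodes back to "/".
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    print(
        faq_jsonld(
            [
                (
                    "How much does a boiler service cost?",
                    "A standard boiler service costs £80–120 and takes about an hour.",
                )
            ]
        )
    )
```

Generating the markup from the same source as the visible Q&A is one way to keep the two from drifting apart.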

Step 4: Write answer-first HTML

Engines lift passages that stand alone, so put the answer first, under a heading phrased as the real question:

Before:

<h2>Our approach to boiler servicing</h2>
<p>For over twenty years we have prided ourselves on… (answer buried far below)</p>

After:

<h2>How much does a boiler service cost?</h2>
<p>A boiler service costs £80–120 and takes about an hour.
   Annual servicing keeps the manufacturer warranty valid.</p>

The heading is the question a person would type; the first sentence answers it directly with a concrete, quotable fact; supporting detail follows. This is the single highest-leverage change for being quoted, and it pairs one-to-one with each FAQ item.
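The pattern can be linted roughly across many pages. The sketch below is a heuristic, not a measure of how any engine scores content: it assumes a "question" heading is an h2 ending in a question mark and that a "concrete fact" shows up as at least one digit in the first sentence of the paragraph that follows. The names HeadingAnswerPairs and lint_answer_first are hypothetical.

```python
from html.parser import HTMLParser


class HeadingAnswerPairs(HTMLParser):
    """Collect each <h2> text together with the first <p> that follows it."""

    def __init__(self) -> None:
        super().__init__()
        self.pairs = []      # list of [heading text, first paragraph text]
        self._target = None  # "h2", "p", or None while outside both

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.pairs.append(["", ""])
            self._target = "h2"
        elif tag == "p" and self.pairs and not self.pairs[-1][1]:
            # Only the first paragraph after each heading is captured.
            self._target = "p"

    def handle_endtag(self, tag):
        if tag == self._target:
            self._target = None

    def handle_data(self, data):
        if self._target == "h2":
            self.pairs[-1][0] += data
        elif self._target == "p":
            self.pairs[-1][1] += data


def lint_answer_first(html: str) -> list[str]:
    """Return human-readable warnings for sections that are not answer-first."""
    parser = HeadingAnswerPairs()
    parser.feed(html)
    parser.close()
    problems = []
    for heading, paragraph in parser.pairs:
        heading = " ".join(heading.split())
        first_sentence = " ".join(paragraph.split()).split(". ")[0]
        if not heading.endswith("?"):
            problems.append(f"heading is not a question: {heading!r}")
        if not any(ch.isdigit() for ch in first_sentence):
            problems.append(f"no concrete figure in first sentence under: {heading!r}")
    return problems
```

Run against the two snippets above, the first would raise both warnings and the second none. Answers whose key fact is not numeric will trip the digit check, so treat the output as a prompt for review rather than a verdict.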

Step 5: Reinforce your entity with Organization schema

Help engines resolve who you are with consistent Organization markup and sameAs links, so scattered mentions connect to one trusted entity:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "@id": "https://example.com/#org",
  "name": "Example Ltd",
  "url": "https://example.com/",
  "logo": "https://example.com/logo.png",
  "sameAs": [
    "https://x.com/example",
    "https://www.linkedin.com/company/example",
    "https://www.wikidata.org/wiki/Q000000"
  ]
}
</script>

Generate it with the AI Schema Generator; the deeper entity work is in our schema mastery guide.
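Consistency is the part that tends to slip when Organization markup is pasted into several templates. As a rough sketch, the helper below compares the name, url and logo values across JSON-LD strings collected from different pages; the function name inconsistent_org_fields is an assumption, and it only handles blocks that are a single top-level JSON object.

```python
import json


def inconsistent_org_fields(
    jsonld_blocks: list[str],
    fields: tuple[str, ...] = ("name", "url", "logo"),
) -> dict[str, set[str]]:
    """Map each field to its distinct values where Organization blocks disagree."""
    seen = {field: set() for field in fields}
    for raw in jsonld_blocks:
        data = json.loads(raw)
        # Skip arrays and any block that is not Organization markup.
        if not isinstance(data, dict) or data.get("@type") != "Organization":
            continue
        for field in fields:
            if field in data:
                seen[field].add(str(data[field]))
    # A field is inconsistent when more than one distinct value was seen.
    return {field: values for field, values in seen.items() if len(values) > 1}
```

An empty result means the checked fields agree everywhere they appear. Even a trailing-slash difference in url is reported, which is usually what you want when the goal is one unambiguous entity.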

Verify and measure

Confirm the bots are not blocked and check your overall visibility with the AEO Checker, keep answers cleanly extractable with the Readability checker, and test directly: ask ChatGPT and Perplexity your customers’ real questions and see whether you are named, and whether the wording is yours.

A worked example

A site is invisible to AI. Checking robots.txt shows a blanket Disallow catching the AI bots; the content buries answers in long prose with no FAQ markup. The coder allows GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot and Google-Extended, adds an llms.txt pointing at the key pages, restructures the top customer questions answer-first under question headings, and ships matching FAQPage and Organization schema. Within weeks the site starts being cited — Perplexity first, where live search rewards the new structure fastest. The unlock was accessibility plus extractable structure, both shipped in code.

Common mistakes to avoid

Blocking AI crawlers in robots.txt (the silent killer). Burying the answer instead of stating it first. Omitting FAQ markup, or marking up Q&A that is not visible on the page. Faking reviews or ratings to look authoritative, which risks a manual action. Leaving the entity inconsistent, with no sameAs links. Treating AEO (answer engine optimisation) as a one-off rather than testing across engines and iterating.

Frequently asked questions

Which AI crawlers should I allow in robots.txt?

The ones you want citing you: typically GPTBot and OAI-SearchBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot, and Google-Extended for Google’s AI. List each under User-agent with Allow: /. Blocking them means no citations from those engines.

What is llms.txt and where does it go?

A Markdown file at your domain root that maps your most important content with curated links and short descriptions, helping LLM tools find your canonical pages. Keep it focused on your best, answer-dense pages.

How do I make content AI can quote?

Write answer-first: a heading phrased as the real question, the answer in the first sentence with a concrete fact, supporting detail after. Make each answer self-contained so it stands alone when lifted, and mirror it in FAQPage schema.

Does FAQPage schema help with AI citations?

It can: it presents Q&A in a structured form that engines can extract from. Google now limits FAQ rich results to well-known government and health sites, so treat any rich result as a bonus. Mark up only genuine answers that are visible on the page.

How does Organization schema help AI?

Consistent Organization markup with sameAs links helps engines resolve scattered mentions into one trusted entity, which supports being selected and cited. Keep name, URL and logo identical everywhere.

How do I check if AI can see my site?

Confirm your robots.txt does not disallow the AI bots, run the AEO Checker for overall visibility, and test directly by asking the engines your customers’ real questions to see if you are named.