This guide covers the files and markup that let AI engines crawl, understand and cite your site, written for the code-curious. We will allow the right AI crawlers in robots.txt, ship an llms.txt, add FAQPage JSON-LD, and structure the HTML answer-first so a model can extract a self-contained passage. For the strategy, measurement and retrieval mechanics behind it, see our AI search strategy and AI search mastery guides.
Accessibility is the precondition — if your robots.txt blocks the AI bots, no on-page work can compensate. These are the current user-agents worth allowing explicitly:
# robots.txt
User-agent: GPTBot           # OpenAI training
User-agent: OAI-SearchBot    # ChatGPT search
User-agent: ChatGPT-User     # ChatGPT browsing on user request
User-agent: ClaudeBot        # Anthropic
User-agent: PerplexityBot    # Perplexity
User-agent: Google-Extended  # Google AI (Gemini / AI training)
Allow: /

# point crawlers at your sitemap
Sitemap: https://example.com/sitemap.xml
Listing a bot under User-agent with Allow: / permits it explicitly, which matters most when a blanket Disallow under User-agent: * would otherwise catch it; to block one, give it Disallow: / instead. Google-Extended controls Google’s AI uses without affecting normal Search indexing. Decide deliberately which to allow, but if you want AI citations, the engines you are targeting must be allowed.
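One way to confirm the rules behave as intended is to parse the file locally before deploying it. The sketch below uses Python's standard-library urllib.robotparser. The helper name blocked_ai_bots and the two sample files are illustrative assumptions, and this parser's matching may differ in edge cases from how each real crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the robots.txt example in this guide.
AI_BOTS = [
    "GPTBot",
    "OAI-SearchBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
]


def blocked_ai_bots(robots_txt, url="https://example.com/"):
    """Return the AI user-agents that may NOT fetch `url` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, url)]


# Hypothetical sample: a blanket block that silently catches every AI bot.
BLANKET_BLOCK = "User-agent: *\nDisallow: /\n"

# The same file with an explicit group for the AI bots appended after a blank line.
WITH_AI_GROUP = (
    BLANKET_BLOCK
    + "\n"
    + "".join(f"User-agent: {bot}\n" for bot in AI_BOTS)
    + "Allow: /\n"
)

if __name__ == "__main__":
    print("Blocked with blanket rule:", blocked_ai_bots(BLANKET_BLOCK))
    print("Blocked with AI group:", blocked_ai_bots(WITH_AI_GROUP))
```

Running the same function against your real robots.txt text gives a quick list of which of these bots are still shut out.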
An llms.txt at your domain root is a clean, link-first map of your most important content, intended to help LLM-based tools find the canonical version of what matters. It is Markdown, headed by your site name and a short summary, then curated links:
# Example Ltd

> Plumbing and heating services across Derbyshire.

## Key pages

- [Services](https://example.com/services): what we do
- [Boiler installation](https://example.com/boiler-installation): process and pricing
- [Service areas](https://example.com/areas): where we work
- [FAQ](https://example.com/faq): common questions answered

## About

- [About us](https://example.com/about): team, credentials, reviews
- [Contact](https://example.com/contact): phone, address, hours
Keep it curated — point at your best, most answer-dense pages, not everything. Audit it with the LLMs.txt Auditor.
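If the list of key pages already lives in code or a CMS export, the file can be generated rather than hand-edited. This is a minimal sketch, assuming a hypothetical build_llms_txt helper and a plain dict of sections; it only mirrors the layout described above (site name, short summary, curated links) and does not validate the URLs.

```python
def build_llms_txt(name, summary, sections):
    """Render an llms.txt body: H1 site name, blockquote summary, link sections.

    `sections` maps a section heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {name}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}", ""]
        lines += [f"- [{title}]({url}): {note}" for title, url, note in links]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative data only, reusing the example site from this guide.
    print(
        build_llms_txt(
            "Example Ltd",
            "Plumbing and heating services across Derbyshire.",
            {
                "Key pages": [
                    ("Services", "https://example.com/services", "what we do"),
                    ("FAQ", "https://example.com/faq", "common questions answered"),
                ],
                "About": [
                    ("Contact", "https://example.com/contact", "phone, address, hours"),
                ],
            },
        )
    )
```

Keeping the input list short is the point: the generator should make curation easier, not encourage dumping every URL into the file.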
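FAQ markup presents questions and answers in a structured shape that engines can extract from. It can also qualify a page for Google’s FAQ rich results, although since 2023 Google shows those mainly for well-known government and health sites, so treat them as a bonus. Mark up only genuine Q&A that is visible on the page: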
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How much does a boiler service cost?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A standard boiler service costs £80–120 and takes about an hour."
      }
    }
  ]
}
</script>
Each answer should be self-contained: it should still make sense when lifted out of the page, because that is exactly what an engine does with it.
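Hand-writing JSON-LD for many questions invites typos, so the block can be generated from a list of question and answer pairs. The following is a sketch under assumptions: the function names faq_jsonld and faq_script_tag are hypothetical, and the pairs are expected to match the Q&A text visible on the page.

```python
import json


def faq_jsonld(pairs):
    """Build a FAQPage structure from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def faq_script_tag(pairs):
    """Serialize the FAQPage data inside a JSON-LD script element."""
    payload = json.dumps(faq_jsonld(pairs), ensure_ascii=False, indent=2)
    # "\/" is a valid JSON escape for "/", so this keeps the JSON intact while
    # preventing answer text containing "</script>" from closing the element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    print(
        faq_script_tag(
            [
                (
                    "How much does a boiler service cost?",
                    "A standard boiler service costs £80–120 and takes about an hour.",
                )
            ]
        )
    )
```

Feeding the markup and the page template from the same list of pairs is one way to keep the two from drifting apart.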
Engines lift passages that stand alone, so put the answer first, under a heading phrased as the real question:
Restructured, answer-first:
<h2>How much does a boiler service cost?</h2>
<p>A boiler service costs £80–120 and takes about an hour. Annual servicing keeps the manufacturer warranty valid.</p>
The heading is the question a person would type; the first sentence answers it directly with a concrete, quotable fact; supporting detail follows. This is the single highest-leverage change for being quoted, and it pairs one-to-one with each FAQ item.
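To check that pairing on a real page, the question headings and their first paragraphs can be pulled out of the HTML and compared with the FAQPage items. This sketch uses Python's standard-library html.parser. The class name AnswerFirstExtractor and the helper extract_pairs are hypothetical, and it assumes questions sit in h2 elements with the answer in the first p that follows.

```python
from html.parser import HTMLParser


class AnswerFirstExtractor(HTMLParser):
    """Collect (h2 text, text of the first following <p>) pairs."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # finished (question, answer) pairs
        self._tag = None       # "h2" or "p" while text is being captured
        self._buf = []         # text fragments of the element being captured
        self._question = None  # last heading still waiting for its first <p>

    def handle_starttag(self, tag, attrs):
        if tag == "h2" or (tag == "p" and self._question is not None):
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # ignore inline tags such as </strong> inside a capture
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if tag == "h2":
            self._question = text
        else:
            self.pairs.append((self._question, text))
            self._question = None
        self._tag = None


def extract_pairs(html):
    """Return the (question heading, first paragraph) pairs found in `html`."""
    parser = AnswerFirstExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


if __name__ == "__main__":
    sample = (
        "<h2>How much does a boiler service cost?</h2>\n"
        "<p>A boiler service costs £80–120 and takes about an hour.</p>"
    )
    for question, answer in extract_pairs(sample):
        print(question, "->", answer)
```

A heading with no matching pair, or a first paragraph that does not actually answer the question, is a sign the section is not yet answer-first.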
Help engines resolve who you are with consistent Organization markup and sameAs links, so scattered mentions connect to one trusted entity:
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "@id": "https://example.com/#org",
  "name": "Example Ltd",
  "url": "https://example.com/",
  "logo": "https://example.com/logo.png",
  "sameAs": [
    "https://x.com/example",
    "https://www.linkedin.com/company/example",
    "https://www.wikidata.org/wiki/Q000000"
  ]
}
Generate it with the AI Schema Generator; the deeper entity work is in our schema mastery guide.
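Consistency is easy to lose when Organization markup is pasted into several templates. As a rough sketch, assuming the Organization objects have already been collected from each page into Python dicts, a hypothetical inconsistent_fields helper can report where they disagree:

```python
def inconsistent_fields(orgs, fields=("name", "url", "logo")):
    """Return {field: sorted distinct values} for fields that differ across `orgs`.

    `orgs` is a list of dicts parsed from each page's Organization JSON-LD.
    A missing field is reported as the string "None".
    """
    diffs = {}
    for field in fields:
        values = {str(org.get(field)) for org in orgs}
        if len(values) > 1:
            diffs[field] = sorted(values)
    return diffs


if __name__ == "__main__":
    # Illustrative data: the second page uses a different spelling of the name.
    home = {
        "name": "Example Ltd",
        "url": "https://example.com/",
        "logo": "https://example.com/logo.png",
    }
    about = {
        "name": "Example Limited",
        "url": "https://example.com/",
        "logo": "https://example.com/logo.png",
    }
    print(inconsistent_fields([home, about]))
```

An empty result means the checked fields match on every page supplied; it says nothing about profiles outside your own site.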
Use the AEO Checker to confirm the bots are not blocked and to gauge your overall visibility, keep answers cleanly extractable with the Readability checker, and test directly: ask ChatGPT and Perplexity your customers’ real questions and see whether you are named, and whether the answer uses your wording.
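The direct test can be logged a little more systematically by pasting each engine's reply into a small script. This is only a sketch: citation_report is a hypothetical helper, it works on text you copy in by hand rather than calling any engine's API, and its matching is a crude normalized substring check.

```python
import re


def _norm(text):
    """Lowercase and keep only letters, digits and the pound sign, space-separated."""
    return " ".join(re.findall(r"[a-z0-9£]+", text.lower()))


def citation_report(response, brand, answers):
    """Report whether `brand` is named and which canonical answers appear verbatim.

    `response` is the engine's reply pasted in as plain text; `answers` are the
    answer sentences from your own pages.
    """
    haystack = f" {_norm(response)} "
    return {
        "named": f" {_norm(brand)} " in haystack,
        "quoted": [a for a in answers if f" {_norm(a)} " in haystack],
    }


if __name__ == "__main__":
    # Illustrative reply text, not a real engine response.
    reply = "According to Example Ltd, a boiler service costs £80–120 and takes about an hour."
    print(
        citation_report(
            reply,
            "Example Ltd",
            ["A boiler service costs £80–120 and takes about an hour."],
        )
    )
```

Recording the result per engine and per question over time shows whether changes to the page are actually reflected in what gets quoted.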
A site is invisible to AI. Checking robots.txt shows a blanket Disallow catching the AI bots; the content buries answers in long prose with no FAQ markup. The coder allows GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot and Google-Extended, adds an llms.txt pointing at the key pages, restructures the top customer questions answer-first under question headings, and ships matching FAQPage and Organization schema. Within weeks the site starts being cited — Perplexity first, where live search rewards the new structure fastest. The unlock was accessibility plus extractable structure, both shipped in code.
Blocking AI crawlers in robots.txt (the silent killer). Burying the answer instead of stating it first. No FAQ markup, or marking up Q&A not visible on the page. Faking reviews or ratings to look authoritative — a manual-action risk. An inconsistent entity with no sameAs. And treating AEO as one-off rather than testing across engines and iterating.
The AI crawlers to allow are the ones you want citing you: typically GPTBot and OAI-SearchBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot, and Google-Extended for Google’s AI. List each under User-agent with Allow: /. Blocking them means no citations from those engines.
An llms.txt is a Markdown file at your domain root that maps your most important content with curated links and short descriptions, helping LLM tools find your canonical pages. Keep it focused on your best, answer-dense pages.
Write answer-first: a heading phrased as the real question, the answer in the first sentence with a concrete fact, supporting detail after. Make each answer self-contained so it stands alone when lifted, and mirror it in FAQPage schema.
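FAQ schema is worth adding: it presents Q&A in a structured form engines can extract from. It can also qualify for Google FAQ rich results, though since 2023 Google shows those mainly for well-known government and health sites. Mark up only genuine answers that are visible on the page.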
Consistent Organization markup with sameAs links helps engines resolve scattered mentions into one trusted entity, which supports being selected and cited. Keep name, URL and logo identical everywhere.
Confirm your robots.txt does not disallow the AI bots, run the AEO Checker for overall visibility, and test directly by asking the engines your customers’ real questions to see if you are named.