Introduction
The way people find information is changing. Traditional search engines still matter, but AI-powered tools — ChatGPT browsing, Perplexity, Google AI Overviews, Claude’s web access — are becoming a significant discovery channel. If your content is well-structured and easy for these systems to parse, you’re more likely to be cited, referenced, and surfaced.
This isn’t about gaming algorithms. It’s about making your content machine-readable in the same way that good accessibility practices make it screen-reader-friendly. The principles overlap significantly: semantic HTML, clear structure, explicit metadata.
I’ve spent the last few months testing what actually works. This post collects those findings into a practical framework.
The AI Crawling Landscape
AI systems discover web content through several mechanisms:
Direct crawling. Tools like GPTBot, ClaudeBot, and PerplexityBot crawl websites similarly to Googlebot. They respect robots.txt, follow sitemaps, and index page content. If you allow these crawlers and provide a sitemap, your content will be indexed.
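As an illustration, a robots.txt that explicitly allows these crawlers and points to a sitemap might look like the following. The user-agent tokens shown are the ones these vendors currently document; verify them against each vendor's guidance before deploying, and the sitemap URL is a placeholder:

```
# Explicitly allow AI crawlers (verify tokens against vendor docs)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
```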
Search-augmented retrieval. When a user asks an AI assistant a question, the system may perform web searches behind the scenes and synthesise results. Your content’s ranking in traditional search engines still affects whether AI tools find it.
Cached knowledge. Large language models are trained on web content. While you can’t control what’s in the training data, well-structured content with clear authorship is more likely to be accurately represented.
The practical implication: optimising for AI discovery isn’t a separate discipline from web development. It’s an extension of good practices you should already be following.
Structured Data: The Foundation
JSON-LD is the recommended format for structured data: Google endorses it, it's clean to implement, and unlike Microdata or RDFa it doesn't pollute your HTML with extra attributes.
Three schema types matter most for content sites:
Person Schema
Establishes authorship and identity. Place this on your about page and optionally site-wide.
```json
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Your Name",
  "url": "https://yoursite.com",
  "sameAs": [
    "https://github.com/yourusername",
    "https://linkedin.com/in/yourusername"
  ]
}
```
This helps AI systems connect your content to your identity across platforms. When a user asks “what has [person] written about [topic],” this schema provides the link.
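To make the schema visible to crawlers, embed it in a `<script type="application/ld+json">` element, typically in the page `<head>`. A minimal sketch with placeholder values:

```html
<head>
  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Your Name",
    "url": "https://yoursite.com"
  }
  </script>
</head>
```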
BlogPosting Schema
Applied to every article. Includes headline, dates, author reference, description, and keywords.
```json
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "Your Post Title",
  "datePublished": "2026-02-10",
  "author": {
    "@type": "Person",
    "name": "Your Name"
  },
  "keywords": ["ai", "seo", "structured-data"]
}
```
The keywords field is particularly useful — it gives AI systems explicit topic signals without relying on content analysis alone.
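If you build pages from post metadata, the schema can be emitted programmatically rather than hand-written per post. A minimal Python sketch — the function name and field set are illustrative, not a standard API:

```python
import json
from datetime import date

def blog_posting_jsonld(title, published, author, keywords):
    """Build a BlogPosting JSON-LD object from post metadata."""
    return {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": title,
        "datePublished": published.isoformat(),  # ISO 8601, unambiguous for crawlers
        "author": {"@type": "Person", "name": author},
        "keywords": keywords,
    }

doc = blog_posting_jsonld(
    "Your Post Title", date(2026, 2, 10), "Your Name",
    ["ai", "seo", "structured-data"],
)
print(json.dumps(doc, indent=2))
```

Serialising with `json.dumps` guarantees valid JSON, which hand-edited templates occasionally get wrong.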
BreadcrumbList Schema
Provides navigation context. This helps AI systems understand where a page sits within your site’s hierarchy.
```json
{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    {"@type": "ListItem", "position": 1, "name": "Home", "item": "https://yoursite.com/"},
    {"@type": "ListItem", "position": 2, "name": "Deep Dives", "item": "https://yoursite.com/deep-dives/"},
    {"@type": "ListItem", "position": 3, "name": "This Post"}
  ]
}
```
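The `itemListElement` array is straightforward to generate from a breadcrumb trail. A Python sketch, assuming the final crumb (the current page) carries no URL — Google's breadcrumb documentation permits omitting `item` on the last entry:

```python
def breadcrumb_jsonld(trail):
    """Build a BreadcrumbList from (name, url) pairs.

    Pass url=None for the final crumb (the current page) to omit
    its "item" property.
    """
    items = []
    for position, (name, url) in enumerate(trail, start=1):
        item = {"@type": "ListItem", "position": position, "name": name}
        if url is not None:
            item["item"] = url
        items.append(item)
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": items,
    }

crumbs = breadcrumb_jsonld([
    ("Home", "https://yoursite.com/"),
    ("Deep Dives", "https://yoursite.com/deep-dives/"),
    ("This Post", None),
])
```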
Semantic HTML: More Than Accessibility
AI crawlers parse HTML structure to understand content hierarchy. The semantic elements matter:
- `<article>` wrapping each post tells crawlers "this is a self-contained piece of content"
- `<nav>` identifies navigation vs. content links
- `<main>` signals the primary content area
- `<time>` with `datetime` attributes provides unambiguous date information
- `<h1>` through `<h4>` in proper hierarchy gives crawlers an outline of your content
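Put together, a post template using these elements might look like this (headline and content are placeholders):

```html
<main>
  <article>
    <h1>Why Smaller Models Outperform on Constrained Hardware</h1>
    <time datetime="2026-02-10">10 February 2026</time>
    <p>Post content…</p>
  </article>
</main>
<nav aria-label="Site">
  <a href="/deep-dives/">Deep Dives</a>
</nav>
```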
This isn’t new advice, but it’s newly important. Traditional search engines have sophisticated content extraction that can work around messy HTML. AI crawlers are generally less forgiving — they’re parsing your content to understand it, not just to index keywords.
Content Strategy for AI Discoverability
Beyond the technical markup, how you write and structure content affects AI discoverability:
Lead with the answer. AI systems often extract the first paragraph or the content near relevant headings. If your key insight is buried in paragraph twelve, it’s less likely to be surfaced. This aligns with the inverted pyramid structure that journalists have used for decades.
Use descriptive headings. “Analysis” is less useful than “Why Smaller Models Outperform on Constrained Hardware.” Headings serve as content anchors that AI systems use for retrieval.
Write self-contained paragraphs. Each paragraph should make sense on its own, because AI tools may extract individual paragraphs as citations. Avoid paragraphs that only make sense in the context of the preceding text.
Include explicit summaries. Whether through a description meta tag, a key-takeaways section, or a TL;DR at the top, give AI systems a pre-written summary to work with. They’ll generate their own if you don’t, and yours will be more accurate.
Tag consistently. Taxonomies and tags provide explicit topic signals. Use a consistent, focused set of tags rather than a sprawling collection.
Technical Checklist
A practical summary of what to implement:
- Sitemap — ensure it’s accurate and submitted. AI crawlers use this as their primary discovery mechanism.
- robots.txt — allow AI crawlers explicitly. Check that you’re not blocking GPTBot, ClaudeBot, or PerplexityBot.
- JSON-LD schemas — Person, BlogPosting, BreadcrumbList at minimum.
- Meta descriptions — write them manually for every page. Don’t rely on auto-generation.
- Canonical URLs — prevent duplicate content confusion.
- Fast response times — AI crawlers, like all crawlers, prefer fast sites. Static site generators like Hugo are ideal here.
- Clean HTML — semantic elements, proper heading hierarchy, no JavaScript-rendered content for important text.
- RSS feeds — some AI systems use RSS for content discovery and freshness signals.
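To confirm the robots.txt item, you can sanity-check your rules against specific crawler user agents with Python's standard-library parser. The rules below are a made-up example:

```python
from urllib import robotparser

# Hypothetical robots.txt: allow GPTBot everywhere,
# block everyone else from /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/posts/hello/"))  # True
print(rp.can_fetch("SomeBot", "https://example.com/private/x"))    # False
```

Running this against your live file (via `rp.set_url(...)` and `rp.read()`) catches accidental blocks before a crawler does.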
What Doesn’t Matter (Yet)
Some things that are commonly discussed but don’t yet have clear evidence of impact:
- AI-specific meta tags — there’s no widely adopted standard for AI-specific metadata beyond what schema.org provides.
- Content length thresholds — there’s no evidence that AI systems prefer specific content lengths. Write as much as the topic needs.
- Publishing frequency — quality and structure matter more than volume.
Conclusion
Optimising for AI discovery is less about learning new tricks and more about doing the fundamentals well: semantic HTML, structured data, clean content architecture, and explicit metadata. These practices also improve accessibility, traditional SEO, and the reading experience for humans.
The landscape is evolving quickly. The specific crawlers and tools will change. But the principle won’t: make your content easy to understand, and the machines — whatever form they take — will understand it.
Build the foundation now. The details will sort themselves out.