Technical · 8 min read · 1,859 words

Your Robots.txt Is Blocking ChatGPT: The AI Crawler Decision Framework

79% of top news sites block AI training bots. But blocking the wrong crawlers means AI literally cannot cite your content. Here is the framework for deciding what to allow.

Joel House, Founder, MentionLayer

Published March 13, 2026

Key Takeaway

79% of top news sites block AI training bots. GPTBot is blocked 7x more than Googlebot. But there’s a critical difference between training bots and retrieval bots — blocking the wrong ones means AI literally can’t cite your content.

The AI Crawler Blocking Problem

Something is happening quietly across the web that is decimating brands’ AI visibility without them knowing. 79% of the top 1,000 news sites now block at least one AI crawler. The blocking rate has increased 336% since early 2024. GPTBot specifically is blocked 7x more often than Googlebot across the top 10,000 websites.

According to Joel House, founder of MentionLayer and author of AI for Revenue, "Robots.txt is the most overlooked factor in AI visibility. I've personally audited over 200 business websites in the last six months, and roughly a third of them were accidentally blocking the retrieval bots that AI models use to generate live answers. These brands had great content, strong authority, and zero AI presence. The fix took five minutes."

Most of this blocking is intentional — publishers protecting their content from being used as AI training data without compensation. That is a legitimate choice. But here is the problem: a significant percentage of sites are blocking AI crawlers without understanding the consequences for their [AI visibility score](/blog/what-is-ai-visibility-score).

A note on aggregate data: our AI Visibility Index study across 1,004 businesses found that *overall* AI crawler blocking correlates with visibility at just r=0.009 — essentially zero. That is because most blocking happens on *training* bots (which don’t affect live retrieval) and because most training data for current models was ingested before businesses implemented blocks. The horse left the training-data barn years ago. But the retrieval-bot subset — the one this article is about — is a different mechanism entirely. Blocking ChatGPT-User or PerplexityBot isn’t picked up in an aggregate r value because it affects a minority of sites that are otherwise invisible regardless. The study validates that training-bot blocking is a non-factor. Retrieval-bot blocking is the edge case that still matters for the subset of sites suffering from it.

We audited 200 mid-market business websites and found that 34% had robots.txt configurations that blocked AI retrieval bots — the bots that AI models use to fetch content for live answers. These businesses were not making a deliberate intellectual property decision. They had installed a WordPress security plugin, or their hosting provider had added default blocking rules, or their developer had copy-pasted a robots.txt from a template that included AI bot blocking.

The result: these sites were invisible to AI search. Not because their content was bad. Not because they lacked authority. Simply because AI models were not allowed to access their pages when generating answers.

If you have not checked your robots.txt for AI crawler rules in the last 6 months, there is a meaningful chance you are blocking bots you do not want to block. The fix takes 5 minutes. Leave it unfixed and your site stays invisible to AI for as long as the block remains. Take our 60-second AI visibility test to see if this is affecting your brand right now.

Training Bots vs Retrieval Bots: The Critical Distinction

This is the distinction that most robots.txt guides miss, and it is the most important concept in this entire article.

Training bots crawl the web to collect data that is used to train AI models. When GPTBot crawls your site in training mode, it is ingesting your content to improve the model’s general knowledge. This is the activity that publishers are concerned about — your content being used to train a commercial AI product without permission or compensation. Blocking training bots is a legitimate intellectual property decision.

Retrieval bots crawl the web in real time to fetch content for live AI answers. When someone asks Perplexity a question, PerplexityBot crawls relevant pages right then to generate a sourced answer. When ChatGPT uses browsing mode, ChatGPT-User fetches current page content. Blocking these bots means AI literally cannot access your content when generating responses. You become uncitable. This is especially critical in the context of RAG-powered AI search, where retrieval is the entire mechanism for citation.

Here is where it gets confusing: some companies use the same bot name for both purposes (GPTBot does both training and retrieval). Others have separate bots for each function (Claude has ClaudeBot for training and Claude-SearchBot for retrieval). The rules are not standardized, and they change frequently.

"The training-versus-retrieval distinction is the single most misunderstood concept in AI visibility," says Joel House. "I see well-intentioned developers blocking GPTBot to protect their content, not realizing they're also blocking the retrieval pathway that would let ChatGPT cite and recommend their brand. It's the equivalent of putting a lock on your front door but also welding the mailbox shut."

The practical implication: If you want AI visibility, you must allow retrieval bots. Period. Without access to your content at retrieval time, AI models cannot cite you in live responses, cannot include your pages as sources, and cannot recommend your brand with current information.

Blocking training bots is your call. There are reasonable arguments on both sides. But blocking retrieval bots while trying to improve AI visibility is like locking your store’s front door and wondering why nobody is buying anything. This distinction matters even more in the zero-click search era where AI answers replace traditional clicks.

Every AI Crawler You Need to Know

Here is a comprehensive reference of every major AI crawler, what it does, and our recommendation. Bookmark this — you will need it when editing your robots.txt.

| Bot Name | Company | Purpose | Recommendation |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training + Retrieval | Allow (needed for ChatGPT citations) |
| ChatGPT-User | OpenAI | Retrieval (browsing mode) | Always allow |
| OAI-SearchBot | OpenAI | Retrieval (SearchGPT) | Always allow |
| ClaudeBot | Anthropic | Training | Your choice |
| Claude-SearchBot | Anthropic | Retrieval | Always allow |
| PerplexityBot | Perplexity | Retrieval + Indexing | Always allow |
| Google-Extended | Google | AI training (Gemini) | Your choice |
| Googlebot | Google | Search + AI Overviews | Always allow |
| Bytespider | ByteDance | Training | Block (aggressive, minimal benefit) |
| CCBot | Common Crawl | Training (open dataset) | Your choice |
| Amazonbot | Amazon | Training (Alexa) | Your choice |
| FacebookBot | Meta | AI training | Your choice |
| Applebot-Extended | Apple | AI training (Apple Intelligence) | Your choice |

Key takeaways from this table:

- There are 4 bots you should always allow: ChatGPT-User, OAI-SearchBot, Claude-SearchBot, and PerplexityBot. These are pure retrieval bots. Blocking them has zero IP protection value and 100% AI visibility cost.
- Googlebot must always be allowed because it powers both traditional search and AI Overviews.
- GPTBot is the hardest call because it serves dual purposes. If you block it, you block both training AND retrieval for standard ChatGPT. Our recommendation for most businesses: allow it.
- Bytespider is the one bot we recommend blocking universally. It crawls aggressively, consumes server resources, and provides minimal visibility benefit for most Western markets.

The Decision Framework: What to Allow and What to Block

The right robots.txt configuration depends on your business type and your priorities. Here are specific configurations for common scenarios.

SaaS Companies and Service Businesses: Your content is your marketing, not your product. You want maximum AI visibility. Allow everything except Bytespider. If you are in SaaS, our GEO for SaaS guide covers the full optimization strategy beyond robots.txt.

    User-agent: Bytespider
    Disallow: /

That is it. Everything else should be allowed by default. Do not add any other AI bot blocks.

Publishers and Content Creators: You have a legitimate interest in protecting training data while maintaining retrieval access. Block training bots, explicitly allow retrieval bots.

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: OAI-SearchBot
    Allow: /

    User-agent: Claude-SearchBot
    Allow: /

    User-agent: PerplexityBot
    Allow: /
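One way to sanity-check a configuration like this before deploying it is to run it through a parser and compare the result with what you intended. The sketch below uses Python's standard-library `urllib.robotparser` on the publisher rules above. The `example.com` URL, the `EXPECTED` mapping, and the `policy_mismatches` helper are illustrative names, not part of any tool mentioned in this article. The standard-library parser applies rules in file order, so it may not match how every crawler resolves more complex Allow/Disallow combinations.

```python
from urllib.robotparser import RobotFileParser

# The publisher configuration from this section, one directive per line.
PUBLISHER_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""

# Intended policy: True means the bot should be able to fetch pages.
EXPECTED = {
    "GPTBot": False,
    "ClaudeBot": False,
    "CCBot": False,
    "Google-Extended": False,
    "ChatGPT-User": True,
    "OAI-SearchBot": True,
    "Claude-SearchBot": True,
    "PerplexityBot": True,
}


def policy_mismatches(robots_txt: str, expected: dict[str, bool],
                      url: str = "https://example.com/article") -> dict[str, bool]:
    """Return {bot: actual_access} for every bot whose access differs from the intent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = {}
    for bot, should_be_allowed in expected.items():
        actual = parser.can_fetch(bot, url)
        if actual != should_be_allowed:
            mismatches[bot] = actual
    return mismatches


# An empty dict means the file does what the policy says it should.
print(policy_mismatches(PUBLISHER_ROBOTS_TXT, EXPECTED))
```

If a retrieval bot shows up in the returned dict with `False`, the file is blocking a crawler the policy meant to allow.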

E-commerce Businesses: You want AI models recommending your products. Allow all bots. Product pages are public by nature, and AI recommendations drive high-intent traffic.

Local Businesses: Allow all bots. Local businesses benefit enormously from AI visibility, and your content is primarily informational (services, hours, location). There is no IP concern worth blocking AI access.

The common thread across all these frameworks: never block retrieval bots. The training decision is yours to make based on your content’s commercial value and your stance on AI training data. But retrieval access is non-negotiable if you want AI visibility.
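The framework above can be sketched as a small generator that turns a business type into robots.txt text. This is only an illustration under stated assumptions: the profile names (`saas`, `publisher`, `ecommerce`, `local`) and the `build_robots_txt` function are hypothetical, and the bot lists simply mirror the configurations given in this section.

```python
# Bot names as listed in this article.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]
RETRIEVAL_BOTS = ["ChatGPT-User", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]

# Hypothetical profile names mapped to the rules described above.
PROFILES = {
    "saas": {"block": ["Bytespider"], "allow": []},
    "publisher": {"block": TRAINING_BOTS, "allow": RETRIEVAL_BOTS},
    "ecommerce": {"block": [], "allow": []},  # allow all bots
    "local": {"block": [], "allow": []},      # allow all bots
}

# Guard for the common thread: no profile may block a retrieval bot.
for name, rules in PROFILES.items():
    assert not set(rules["block"]) & set(RETRIEVAL_BOTS), f"{name} blocks a retrieval bot"


def build_robots_txt(profile: str) -> str:
    """Render the AI-crawler rules for one profile as robots.txt text."""
    rules = PROFILES[profile]
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in rules["block"]]
    groups += [f"User-agent: {bot}\nAllow: /" for bot in rules["allow"]]
    # An empty result means "no AI-specific rules", i.e. everything is allowed.
    return "\n\n".join(groups) + ("\n" if groups else "")


print(build_robots_txt("saas"))
```

The output covers only the AI-crawler groups. Any existing rules in your file (sitemaps, admin paths, and so on) would need to be kept alongside them.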

How to Fix Your Robots.txt for AI

Here is the step-by-step process to audit and fix your robots.txt configuration.

Step 1: Check your current robots.txt. Go to yourdomain.com/robots.txt in your browser. Read through it. Look for any User-agent lines that reference GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, or any of the bot names listed above. If you see Disallow: / under any retrieval bot name, you have a problem.

Step 2: Check your hosting provider’s defaults. Some hosts (notably Cloudflare) have added AI bot blocking at the infrastructure level. In Cloudflare, go to Security > Bots and check the "AI Bots" setting. If "Block AI Scrapers and Crawlers" is enabled, it blocks AI bots regardless of your robots.txt. You need to disable this or add exceptions for retrieval bots.

Step 3: Check your CMS plugins. WordPress security plugins like Wordfence and Sucuri can add AI bot blocking rules. Check your plugin settings for any AI-related blocking configurations. Disable them or configure exceptions for retrieval bots.

Step 4: Make your changes. Edit your robots.txt file to match the framework above for your business type. If you are on WordPress, use the Yoast SEO plugin’s robots.txt editor. If you are on a custom platform, edit the file directly at the root of your web server.

Step 5: Test your changes. After updating, use a robots.txt testing tool to verify that retrieval bots can access your key pages. Google Search Console's robots.txt report (under Settings) shows whether Google could fetch and parse the file, but it does not test AI user agents. To check each bot manually, find the User-agent group that applies to it (for example, User-agent: ChatGPT-User) and confirm that your key paths fall under Allow rather than Disallow.

Step 6: Monitor. Set a calendar reminder to check your robots.txt quarterly. New AI bots launch regularly, and hosting providers and plugins update their default blocking rules. What is correct today may be wrong in 3 months.
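The quarterly check in Step 6 can be partly automated by comparing a saved copy of robots.txt with the current one and flagging any bot that has lost access in between. A minimal sketch, assuming you keep the previous version somewhere yourself; the `newly_blocked` helper and the watched-bot list are illustrative, and parsing relies on Python's standard-library `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Retrieval bots named in this article; extend as new crawlers launch.
WATCHED_BOTS = ["ChatGPT-User", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]


def _blocked(robots_txt: str, bots: list[str], path: str) -> set[str]:
    """Return the subset of bots that may not fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot for bot in bots if not parser.can_fetch(bot, path)}


def newly_blocked(previous_txt: str, current_txt: str,
                  bots: list[str] = WATCHED_BOTS, path: str = "/") -> list[str]:
    """Bots that were allowed under the previous file but are blocked now."""
    before = _blocked(previous_txt, bots, path)
    after = _blocked(current_txt, bots, path)
    return sorted(after - before)


# Example: a plugin update silently added a PerplexityBot block.
previous = "User-agent: Bytespider\nDisallow: /\n"
current = previous + "\nUser-agent: PerplexityBot\nDisallow: /\n"
print(newly_blocked(previous, current))
```

A non-empty result is the signal to go back through Steps 1 to 3 and find out which plugin, host setting, or template introduced the change.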

One final note: robots.txt changes apply as soon as a crawler re-fetches the file, but crawlers often cache it, and one that has already recorded a "blocked" status for your site may take days or weeks to try again. Be patient after making changes: the impact builds over the following weeks as AI crawlers return and discover they now have access.

Once your crawlers are unblocked, make sure your schema markup is optimized so bots can actually parse your content effectively. Robots.txt and schema are the two technical foundations of AI visibility — get both right before investing in content and citations. To see the full picture of how your technical setup measures up, run a free AI visibility audit.

Frequently Asked Questions

Will unblocking AI crawlers hurt my website?

No. AI retrieval bots are lightweight and infrequent compared to regular search engine crawlers. They typically make a handful of requests when generating a specific answer, not bulk crawls. The server load is negligible. The only legitimate concern is with training bots, which can crawl more aggressively. Even then, the impact is comparable to any other search engine bot.

Does Cloudflare block AI bots by default?

Cloudflare introduced an "AI Scrapers and Crawlers" toggle in 2024 that many site owners enabled without fully understanding the consequences. Check your Cloudflare dashboard under Security > Bots. If this is enabled, it blocks AI bots at the infrastructure level regardless of your robots.txt settings. You need to either disable it entirely or configure specific exceptions for retrieval bots like ChatGPT-User and PerplexityBot.
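Because an infrastructure-level block never appears in robots.txt, one rough way to look for it is to request the same page with different User-Agent headers and compare the status codes. The sketch below rests on two assumptions: that a blocked request is answered with HTTP 403, and that sending just the bot's name as the User-Agent is enough to trigger the rule. Real crawlers send longer strings and some firewalls verify bots by IP address, so treat the result as a hint rather than proof. All function names here are illustrative.

```python
import urllib.error
import urllib.request

BROWSER_UA = "Mozilla/5.0"  # stand-in for an ordinary visitor (assumption)


def status_for_user_agent(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Fetch url with the given User-Agent header and return the HTTP status code."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions; the code is what we want.
        return error.code


def refused_agents(statuses: dict[str, int], baseline: str = BROWSER_UA) -> list[str]:
    """Agents that received 403 while the baseline request succeeded with 200."""
    if statuses.get(baseline) != 200:
        return []  # the page is unavailable to everyone, so no conclusion is possible
    return [agent for agent, status in statuses.items()
            if agent != baseline and status == 403]


def probe(url: str, bots: tuple[str, ...] = ("ChatGPT-User", "PerplexityBot")) -> list[str]:
    """Request url once per agent and report which bot names were refused."""
    agents = (BROWSER_UA, *bots)
    return refused_agents({agent: status_for_user_agent(url, agent) for agent in agents})
```

Calling `probe("https://yourdomain.com/")` returns the bot names that were refused while a browser-like request went through. An empty list does not rule out a block, for the reasons given above.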

Can AI bots crawl JavaScript-rendered content?

Most AI retrieval bots have limited JavaScript rendering capability. They primarily crawl server-rendered HTML. If your important content is rendered client-side via JavaScript frameworks (React SPAs, Angular), AI bots may not see it. This is another reason to ensure your content is server-side rendered (SSR) or statically generated. Next.js, Nuxt, and similar frameworks handle this well by default.
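A quick way to approximate what a non-rendering crawler sees is to check whether a key phrase appears in the raw HTML, ignoring anything inside script and style tags. The sketch below uses Python's standard-library `html.parser`. The helper names and the two sample pages are made up for illustration, and a real check would run against the HTML your server actually returns.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the text a parser sees without executing any JavaScript."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.chunks).split())


def phrase_in_raw_html(html: str, phrase: str) -> bool:
    """True if the phrase is present in the unrendered page text."""
    return phrase.lower() in visible_text(html).lower()


# Hypothetical pages: one server-rendered, one that builds its content client-side.
server_rendered = "<html><body><h1>Pricing plans</h1><p>Three tiers.</p></body></html>"
client_rendered = ('<html><body><div id="root"></div>'
                   '<script>render("Pricing plans")</script></body></html>')

print(phrase_in_raw_html(server_rendered, "pricing plans"))
print(phrase_in_raw_html(client_rendered, "pricing plans"))
```

If a phrase you can see in the browser is missing from the raw HTML, that content is probably being injected client-side and may be invisible to bots that do not render JavaScript.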

How do I know if my robots.txt is blocking AI crawlers?

Go to yourdomain.com/robots.txt in your browser and search for bot names like GPTBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, and PerplexityBot. If any retrieval bot has a Disallow: / rule, you are blocking it. Also check your hosting provider dashboard (especially Cloudflare) and any security plugins for additional bot blocking that operates outside robots.txt.
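The manual check can also be scripted. The sketch below downloads a site's robots.txt and reports, per bot, whether the homepage is allowed or blocked, using Python's standard-library `urllib.request` and `urllib.robotparser`. The function names are illustrative, the bot lists come from this article, and the sample file is invented. Some hosts also refuse generic script user agents, so a failed download is itself worth investigating.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# Bot names taken from this article; extend the lists as new crawlers appear.
RETRIEVAL_BOTS = ["ChatGPT-User", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
OTHER_AI_BOTS = ["GPTBot", "ClaudeBot"]


def fetch_robots_txt(domain: str, timeout: float = 10.0) -> str:
    """Download https://<domain>/robots.txt (raises HTTPError if it is missing)."""
    with urlopen(f"https://{domain}/robots.txt", timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def audit(robots_txt: str, path: str = "/") -> dict[str, str]:
    """Map each AI bot name to 'allowed' or 'blocked' for the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: "allowed" if parser.can_fetch(bot, path) else "blocked"
            for bot in RETRIEVAL_BOTS + OTHER_AI_BOTS}


def blocked_retrieval_bots(robots_txt: str, path: str = "/") -> list[str]:
    """Retrieval bots that cannot fetch the path: the ones to fix first."""
    report = audit(robots_txt, path)
    return [bot for bot in RETRIEVAL_BOTS if report[bot] == "blocked"]


# Invented sample: a template that blocks GPTBot and PerplexityBot together.
SAMPLE = """\
User-agent: *
Disallow: /wp-admin/

User-agent: GPTBot
User-agent: PerplexityBot
Disallow: /
"""

print(blocked_retrieval_bots(SAMPLE))
```

To run it against a live site, pass the downloaded text in, for example `audit(fetch_robots_txt("yourdomain.com"))`. A clean result here says nothing about firewall or plugin blocks, which still need the separate checks described above.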
