What is the issue with AI indexing crawlers?
AI indexing crawlers are bots that crawl the content of your website and use the collected data to train Large Language Models (LLMs). There are several reasons why you might want to block these bots from accessing your website. For example:
- you might want to prevent AI models from being trained on your original content or data;
- you might want to prevent AI bots from consuming too many resources on your server.
How does this recipe work?
This recipe blocks AI bots from indexing your website. Once installed, it will enable two rules on your website:
- A robots.txt file that disallows the targeted AI bots from indexing your website. The robots.txt file is the standard way to tell bots which parts of your website they are allowed to access (a sample snippet is shown after this list). Unfortunately, some bots do not respect this file, so a second rule is needed to ensure that they are blocked.
- A 403 Forbidden rule that completely blocks these AI bots from accessing your website, based on their user-agent.
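For reference, robots.txt directives of this kind follow the standard User-agent / Disallow syntax. The snippet below is a minimal illustration covering a few of the bots listed later on this page, not the exact content generated by the recipe:

```
# Illustration only: the recipe covers the full list of bots documented below
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```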
This recipe is officially supported by the redirection.io team and will be updated regularly.
How to identify which AI bots are crawling my website?
We offer a "Log View" dedicated to analyzing which AI bot crawlers explore your website. On the logs screen of the manager, choose the "AI Bots crawlers" Log View to see all the requests performed by bots for AI-training purposes.
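If you want to see a sample entry in this view, you can simulate a request from an AI crawler yourself. Assuming your website is served at example.com (a hypothetical domain) and that the Log View matches on the user-agent string, a spoofed request like the following should then show up in the "AI Bots crawlers" Log View:

```
# Send a HEAD request that identifies itself as GPTBot (user-agent spoofed for testing)
curl -I -A "GPTBot" https://example.com/
```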
Which bots are blocked by this recipe?
This recipe blocks the following bots from crawling your website:
- AI2Bot: https://allenai.org/crawler
- anthropic-ai: https://www.anthropic.com/
- Amazonbot: https://developer.amazon.com/fr/amazonbot
- Applebot-Extended: https://support.apple.com/en-us/119829
- Bytespider: https://www.bytedance.com/en/
- CCBot: https://commoncrawl.org/ccbot
- ChatGPT-User: https://platform.openai.com/docs/bots
- ClaudeBot: https://www.anthropic.com/
- Claude-Web: https://www.anthropic.com/
- cohere-ai: https://docs.cohere.com/
- Diffbot: https://docs.diffbot.com/docs/getting-started-with-diffbot
- FacebookBot: https://developers.facebook.com/docs/sharing/bot/
- Google-CloudVertexBot: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#google-cloudvertexbot
- Google-Extended: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#google-extended
- GPTBot: https://platform.openai.com/docs/bots
- ICC-Crawler: https://ucri.nict.go.jp/en/icccrawler/
- ImagesiftBot: https://imagesift.com/about
- Meta-ExternalAgent: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
- Meta-ExternalFetcher: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
- OAI-SearchBot: https://platform.openai.com/docs/bots
- Omgili: https://webz.io/
- Omgilibot: https://webz.io/
- PerplexityBot: https://docs.perplexity.ai/guides/perplexitybot
- Timpibot: https://timpi.io/
- VelenPublicWebCrawler: https://velen.io/
- Webzio-Extended: https://webz.io/bot.html
- YouBot: https://about.you.com/es/youbot/
How to install this recipe on my website with redirection.io?
Installing this recipe on your website requires the following steps:
- Configure the path for which you wish to deny access to AI indexing robots: define the part of your website that you want to block from AI bots. This can be the whole website, or a specific part of it. To prevent AI bots from indexing your whole website, use the / path. To block only a specific part of your website, use a more specific path, such as /blog (see the sketch after this list).
- Define the current content of your robots.txt file: paste the content of your current robots.txt file. If you don't have a robots.txt file, you can leave this field empty.
- Click on "Install on My Website": execute the installation process by clicking the "Install on My Website" button. This creates the associated redirection.io rules in draft mode, so you can review them, change them if needed, and publish them to protect your website from AI bots.
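For example, if you scope the recipe to the /blog path, the resulting robots.txt directives would be expected to disallow only that section for each blocked bot. Here is a minimal sketch of such directives (GPTBot is used as one example among the bots listed above; the exact content generated by the recipe may differ):

```
User-agent: GPTBot
Disallow: /blog
```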
Please review the created rules. One of them overrides your robots.txt file to include AI crawler directives, while the other completely blocks requests from AI crawlers and responds with a 403 Forbidden status code. You may wish to use only one of these two rules.
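If you are curious about what the blocking rule does under the hood, here is a minimal sketch of an equivalent user-agent based 403 block, expressed as nginx configuration. This is only an illustration of the mechanism: redirection.io creates and manages this rule for you, and the user-agent list below is a small subset of the bots covered by the recipe.

```
# Illustration only: flag a few of the AI bot user-agents covered by the recipe
# (the real rule matches the full list of bots documented above)
map $http_user_agent $is_ai_bot {
    default          0;
    ~*GPTBot         1;
    ~*ClaudeBot      1;
    ~*CCBot          1;
    ~*PerplexityBot  1;
}

server {
    listen 80;
    server_name example.com;  # hypothetical domain

    location / {
        # Respond with 403 Forbidden to the matched AI crawlers
        if ($is_ai_bot) {
            return 403;
        }
        # ... your usual configuration ...
    }
}
```

Once the rules are published, sending a spoofed-user-agent request such as the curl example shown earlier should return a 403 status.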