How to block AI Crawler Bots using robots.txt file

Cynicus Rex@lemmy.ml · 1 year ago

How to block AI Crawler Bots using robots.txt file

asudox@lemmy.world · 1 year ago

Block? Nope, robots.txt does not block the bots. It’s just a text file that says: “Hey robot X, please do not crawl my website. Thanks :>”
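To make the "please" part concrete, here is a minimal sketch of what a well-behaved crawler does before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content and the `example.com` URL are placeholders, and `is_allowed` is a hypothetical helper name. Note that nothing on the server side enforces this: the check only happens if the bot's author chose to write it.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    This is the voluntary check a polite crawler performs on its own side.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A crawler that honours the file would skip the page when this is False.
    print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/page"))
    # An agent not named in the file is unaffected by the GPTBot rule.
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page"))
```

A bot that never calls anything like `is_allowed`, or ignores its result, sees no difference at all.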

Cynicus Rex@lemmy.ml · 1 year ago

Unfortunate indeed.

“Can AI bots ignore my robots.txt file? Well-established companies such as Google and OpenAI typically adhere to robots.txt protocols. But some poorly designed AI bots will ignore your robots.txt.”

breadsmasher@lemmy.world · 1 year ago

“typically adhere”, but they don’t have to follow it.

“poorly designed AI bots”

Is it a poor design if it’s explicitly a design choice to ignore it entirely to scrape as much data as possible? I’d argue it’s more that AI bots are designed to scrape everything regardless of robots.txt. That’s the intention. Asshole design vs poor design.

DigitalDilemma@lemmy.ml · 1 year ago

robots.txt does not work. I don’t think it ever has - it’s an honour system with no penalty for ignoring it.

I have a few low-traffic sites hosted at home, and when a crawler takes an interest it can totally flood my connection. I’m using Cloudflare and being incredibly aggressive with my filtering, but so many bots are ignoring robots.txt, as well as lying about who they are with humanesque UAs, that it’s having a real impact on my ability to provide the sites for humans.

Over the past year it’s got around ten times worse. I woke up this morning to find my connection at a crawl and, on checking the logs, AmazonBot has been hitting one site 12,000 times an hour, and that’s one of the more well-behaved bots. But there are thousands and thousands of them.
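For anyone wanting to do the same kind of log check, here is a rough sketch that counts requests per hour per user agent. It assumes the common "combined" access log format (as written by nginx or Apache by default). The regex and the function names `requests_per_hour` and `heavy_hitters` are my own, not from any tool. Since many bots fake their UA, grouping by IP instead may be more telling in practice.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: "combined" log format, e.g.
# 203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "Amazonbot/0.1"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def requests_per_hour(lines):
    """Count requests keyed by (hour, user agent); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        ts = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        hour = ts.strftime("%Y-%m-%d %H:00")
        counts[(hour, match.group("ua"))] += 1
    return counts


def heavy_hitters(counts, threshold):
    """Return ((hour, ua), count) pairs at or above threshold, busiest first."""
    hits = [(key, n) for key, n in counts.items() if n >= threshold]
    return sorted(hits, key=lambda item: -item[1])


if __name__ == "__main__":
    sample = [
        '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "Amazonbot/0.1"',
        '203.0.113.5 - - [10/Oct/2024:13:56:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Amazonbot/0.1"',
        '198.51.100.7 - - [10/Oct/2024:14:02:10 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    ]
    for key, n in heavy_hitters(requests_per_hour(sample), threshold=2):
        print(key, n)
```

On a real log you would pass an open file object as `lines` and pick a threshold that matches what your connection can take.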

fubarx@lemmy.ml · 1 year ago

Cloudflare just announced an AI Bot prevention system: https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/

spookedintownsville@lemmy.world · 1 year ago

When I changed my domain name I set this to on and then wondered why I couldn’t log into the Nextcloud desktop app.

5opn0o30@lemmy.world · 1 year ago

Wow. A lot of cynicism here. The AI bots are (currently) honoring robots.txt, so this is an easy way to say go away. Honeypot URLs can be a second line of defense, as can blocking published IP ranges. They’re no different than other bots that have existed for years.
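A minimal sketch of those two extra defenses, assuming you can run a check per request (in middleware, a reverse proxy hook, or similar). The `BotFilter` class, the honeypot path, and the IP ranges are all hypothetical: the ranges shown are reserved documentation ranges standing in for whatever a crawler operator actually publishes, and the honeypot path is one you would list under `Disallow` in robots.txt and never link for humans.

```python
import ipaddress

# Hypothetical trap URL: disallowed in robots.txt, invisible to human visitors.
HONEYPOT_PATHS = {"/trap/do-not-follow"}
# Placeholder ranges (reserved for documentation), not real crawler ranges.
PUBLISHED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


class BotFilter:
    """Block clients in known ranges, plus any client that has hit a honeypot."""

    def __init__(self, ranges, honeypot_paths):
        self.networks = [ipaddress.ip_network(r) for r in ranges]
        self.honeypots = set(honeypot_paths)
        self.flagged = set()  # addresses that requested a honeypot URL

    def should_block(self, ip: str, path: str) -> bool:
        addr = ipaddress.ip_address(ip)
        # First line: the address falls inside a published crawler range.
        if any(addr in net for net in self.networks):
            return True
        # Second line: only a client ignoring robots.txt would request the trap.
        if path in self.honeypots:
            self.flagged.add(addr)
        return addr in self.flagged


if __name__ == "__main__":
    bot_filter = BotFilter(PUBLISHED_RANGES, HONEYPOT_PATHS)
    print(bot_filter.should_block("192.0.2.10", "/"))
    print(bot_filter.should_block("198.51.100.7", "/trap/do-not-follow"))
```

The flagged set lives in memory here. A real setup would want it persisted and expired after a while, since IPs get reassigned.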

DigitalDilemma@lemmy.ml · edit-2 1 year ago

In my experience, the AI bots are absolutely not honoring robots.txt - and there are literally hundreds of unique ones. Everyone and their dog has unleashed AI/LLM harvesters over the past year without much thought to the impact on low-bandwidth sites.

Many of them aren’t even identifying themselves as AI bots, but faking human user-agents.

breadsmasher@lemmy.world · 1 year ago

It isn’t an enforceable solution. robots.txt and similar just say “please, bots, don’t index these pages”. That doesn’t mean any bots will respect it.

Cynicus Rex@lemmy.ml · 1 year ago

TL;DR:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Applebot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: YouBot
Disallow: /
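
Lists like this tend to get maintained by hand, so here is a small sketch for generating the file from a plain list of agent names, skipping blanks and case-insensitive duplicates. `build_robots_txt` is a hypothetical helper, and the agent names passed in below are just a subset for illustration.

```python
def build_robots_txt(agents, disallow="/"):
    """Build robots.txt text with one User-agent/Disallow group per unique agent.

    Names are compared case-insensitively; the first spelling seen is kept.
    """
    seen = set()
    groups = []
    for agent in agents:
        name = agent.strip()
        key = name.lower()
        if not key or key in seen:
            continue
        seen.add(key)
        groups.append(f"User-agent: {name}\nDisallow: {disallow}\n")
    # Blank line between groups keeps the file readable.
    return "\n".join(groups)


if __name__ == "__main__":
    print(build_robots_txt(["GPTBot", "Omgilibot", "omgilibot", "ClaudeBot"]))
```

Write the result to `robots.txt` at the root of the site. As noted throughout the thread, this only helps with bots that choose to read it.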
